fix: Jupyuterhub, workbench and model serving related tests (red-hat-data-services#971)

* fix minimal-cuda-test

* fix workbenches

* fix model serving

* Linter fixes

* PR Fixes

* PR fixes

* fix PR comment: Generic accelerator setter

* Add a rerun migration for accelerators in gpu deploy script

* PR fixes

* Update ods_ci/tasks/Resources/Provisioning/GPU/gpu_deploy.sh

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update ods_ci/tasks/Resources/Provisioning/GPU/gpu_deploy.sh

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update ods_ci/tasks/Resources/Provisioning/GPU/gpu_deploy.sh

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* PR fixes

* delete one pod instead of the complete dashboard replica set

* Pod deletion more clean and smart

* modify delete dashboard error message

* Rollback restart instead of pod deletion

* fix typo in variable in workbenches

* Delete unused variables in gpu_deploy script

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2 people authored and Shilpa Chugh committed Nov 27, 2023
1 parent 90390d9 commit 57f68b7
Showing 7 changed files with 121 additions and 61 deletions.
24 changes: 24 additions & 0 deletions ods_ci/tasks/Resources/Provisioning/GPU/gpu_deploy.sh
@@ -40,9 +40,33 @@ function wait_until_gpu_pods_are_running() {

}

function rerun_accelerator_migration() {
# The GPUs are added after the RHODS operator is installed, so they are not discovered automatically.
# To rerun the accelerator migration we need to:
# 1. Delete the migration configmap
# 2. Rollout restart the dashboard deployment, so the configmap is recreated and the migration runs again
# Context: https://github.com/opendatahub-io/odh-dashboard/issues/1938

echo "Deleting configmap migration-gpu-status"
if ! oc delete configmap migration-gpu-status -n redhat-ods-applications;
then
printf "ERROR: When trying to delete the migration-gpu-status configmap\n"
return 1
fi

echo "Rollout restart rhods-dashboard deployment"
if ! oc rollout restart deployment.apps/rhods-dashboard -n redhat-ods-applications;
then
printf "ERROR: When trying to rollout restart rhods-dashboard deployment\n"
return 1
fi

}

wait_until_gpu_pods_are_running
oc apply -f ${GPU_INSTALL_DIR}/nfd_deploy.yaml
oc get csv -n nvidia-gpu-operator $CSVNAME -ojsonpath={.metadata.annotations.alm-examples} | jq .[0] > clusterpolicy.json
oc apply -f clusterpolicy.json
rerun_accelerator_migration
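
Review note on the hunk above: `oc rollout restart` returns as soon as the restart is requested, so anything that runs right after the script may race the new dashboard pods. A minimal sketch of a variant that also waits for the rollout is shown here. The function name `rerun_accelerator_migration_and_wait`, the `ROLLOUT_TIMEOUT` variable and its `300s` default are assumptions for illustration and are not part of this commit. The configmap, deployment and namespace names are taken from the diff.

```shell
# Hypothetical variant of rerun_accelerator_migration (not part of the commit).
APPS_NS="redhat-ods-applications"
ROLLOUT_TIMEOUT="${ROLLOUT_TIMEOUT:-300s}"   # assumed default, tune as needed

rerun_accelerator_migration_and_wait() {
    # 1. Delete the migration configmap; tolerate it being already absent.
    if ! oc delete configmap migration-gpu-status -n "${APPS_NS}" --ignore-not-found; then
        printf "ERROR: could not delete the migration-gpu-status configmap\n" >&2
        return 1
    fi

    # 2. Restart the dashboard so the configmap is recreated and the migration reruns.
    if ! oc rollout restart deployment.apps/rhods-dashboard -n "${APPS_NS}"; then
        printf "ERROR: could not rollout restart the rhods-dashboard deployment\n" >&2
        return 1
    fi

    # 3. Block until the new pods are rolled out, or fail after the timeout.
    if ! oc rollout status deployment.apps/rhods-dashboard -n "${APPS_NS}" \
            --timeout="${ROLLOUT_TIMEOUT}"; then
        printf "ERROR: rhods-dashboard rollout did not finish in %s\n" "${ROLLOUT_TIMEOUT}" >&2
        return 1
    fi
}
```

In `gpu_deploy.sh` this would be called in place of `rerun_accelerator_migration`. Whether the extra wait is wanted depends on how soon the tests start after provisioning.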


55 changes: 38 additions & 17 deletions ods_ci/tests/Resources/Page/ODH/JupyterHub/JupyterHubSpawner.robot
@@ -18,7 +18,12 @@ ${KFNBC_SPAWNER_HEADER_XPATH} = //h1[.="Start a notebook server"]
${JUPYTERHUB_DROPDOWN_XPATH} = //button[@aria-label="Options menu"]
${KFNBC_CONTAINER_SIZE_TITLE} = //div[.="Deployment size"]/..//span[.="Container Size"]
${KFNBC_CONTAINER_SIZE_DROPDOWN_XPATH} = //label[@for="modal-notebook-container-size"]/../..//button[@aria-label="Options menu"]
${KFNBC_GPU_DROPDOWN_XPATH} = //button[contains(@aria-labelledby, "gpu-numbers")]
${KFNBC_ACCELERATOR_HEADER_XPATH} = //span[text()='Accelerator']
${KFNBC_ACCELERATOR_DROPDOWN_XPATH} = //label[@for='modal-notebook-accelerator']/ancestor::div[@class='pf-c-form__group']/descendant::button
${KFNBC_ACCELERATOR_INPUT_XPATH} = //input[@aria-label='Number of accelerators']
${KFNBC_ACCELERATOR_LESS_BUTTON_XPATH} = ${KFNBC_ACCELERATOR_INPUT_XPATH}/preceding-sibling::button
${KFNBC_ACCELERATOR_PLUS_BUTTON_XPATH} = ${KFNBC_ACCELERATOR_INPUT_XPATH}/following-sibling::button
${KFNBC_MAX_ACCELERATOR_WARNING_XPATH} = //div[@aria-label='Warning Alert']//h4[contains(text(), 'accelerator detected')]
${KFNBC_MODAL_HEADER_XPATH} = //div[@aria-label="Starting server modal"]
${KFNBC_MODAL_CANCEL_XPATH} = ${KFNBC_MODAL_HEADER_XPATH}//button[.="Cancel"]
${KFNBC_MODAL_CLOSE_XPATH} = ${KFNBC_MODAL_HEADER_XPATH}//button[.="Close"]
@@ -92,26 +97,41 @@ Select Container Size
Click Element xpath:${JUPYTERHUB_DROPDOWN_XPATH}\[1]
Click Element xpath://span[.="${container_size}"]/../..

Wait Until GPU Dropdown Exists
[Documentation] Verifies that the dropdown to select the no. of GPUs exists
Wait Until Page Contains Number of GPUs
Wait Until Accelerator Dropdown Exists
[Documentation] Verifies that the dropdown to select the Accelerator exists
Page Should Not Contain All GPUs are currently in use, try again later.
Wait Until Page Contains Element xpath:${KFNBC_GPU_DROPDOWN_XPATH}
... error=GPU selector is not present in JupyterHub Spawner
Wait Until Page Contains Element xpath:${KFNBC_ACCELERATOR_DROPDOWN_XPATH}
... error=Accelerator selector is not present in JupyterHub Spawner

Set GPU Accelerator
[Documentation] Set Accelerator type
[Arguments] ${accelerator_type}='Nvidia GPU'
Click Element xpath:${KFNBC_ACCELERATOR_DROPDOWN_XPATH}
Click Element xpath://div[@class and text()=${accelerator_type}]

Set Number Of Required Accelerators
[Documentation] Sets the Accelerators count based on the ${accelerators} argument
[Arguments] ${accelerators}
${acc_num} = Get Value xpath:${KFNBC_ACCELERATOR_INPUT_XPATH}
Log Actual num of Accelerators: ${acc_num}
IF ${acc_num} != ${accelerators}
Input Text ${KFNBC_ACCELERATOR_INPUT_XPATH} ${accelerators}
END

Set Number Of Required GPUs
[Documentation] Sets the gpu count based on the ${gpus} argument
[Arguments] ${gpus}
Click Element xpath:${KFNBC_GPU_DROPDOWN_XPATH}
Click Element xpath:${KFNBC_GPU_DROPDOWN_XPATH}/../..//button[.="${gpus}"]

Fetch Max Number Of GPUs In Spawner Page
[Documentation] Returns the maximum number of GPUs a user can request from the spawner
${gpu_visible} = Run Keyword And Return Status Wait Until GPU Dropdown Exists
${gpu_visible} = Run Keyword And Return Status Wait Until Accelerator Dropdown Exists
IF ${gpu_visible}==True
Click Element xpath:${KFNBC_GPU_DROPDOWN_XPATH}
${maxGPUs} = Get Text xpath://li[@class="pf-c-select__menu-wrapper"][last()]/button
${maxGPUs} = Convert To Integer ${maxGPUs}
Set GPU Accelerator
${max_operator_detected} = Run Keyword And Return Status Page Should Contain Element xpath=${KFNBC_MAX_ACCELERATOR_WARNING_XPATH}
WHILE not ${max_operator_detected}
Click Element xpath:${KFNBC_ACCELERATOR_PLUS_BUTTON_XPATH}
${max_operator_detected} = Run Keyword And Return Status Page Should Contain Element xpath=${KFNBC_MAX_ACCELERATOR_WARNING_XPATH}
${maxGPUs} = Get Value xpath:${KFNBC_ACCELERATOR_INPUT_XPATH}
${maxGPUs} = Convert To Integer ${maxGPUs}
${maxGPUs} = Set Variable ${maxGPUs-1}
END
ELSE
${maxGPUs} = Set Variable ${0}
END
@@ -262,9 +282,10 @@ Spawn Notebook With Arguments # robocop: disable
IF ${spawner_ready}==True
Select Notebook Image ${image} ${version}
Select Container Size ${size}
${gpu_visible} = Run Keyword And Return Status Wait Until GPU Dropdown Exists
${gpu_visible} = Run Keyword And Return Status Wait Until Accelerator Dropdown Exists
IF ${gpu_visible}==True and ${gpus}>0
Set Number Of Required GPUs ${gpus}
Set GPU Accelerator
Set Number Of Required Accelerators ${gpus}
ELSE IF ${gpu_visible}==False and ${gpus}>0
IF ${index} < ${retries}
Sleep 30s reason=Wait for GPU to free up
@@ -18,11 +18,11 @@ ${S3_BUCKET_DC_INPUT_XP}= xpath=//input[@aria-label="AWS field AWS_S3_BUCKET
${REPLICAS_PLUS_BTN_XP}= xpath=//div/button[@aria-label="Plus"]
${REPLICAS_MIN_BTN_XP}= xpath=//div/button[@aria-label="Minus"]
${SERVING_RUNTIME_NAME}= xpath=//input[@id="serving-runtime-name-input"]
${GPU_SECTION_TITLE}= xpath=//span[.="Model server GPUs"]
${GPU_SECTION_INPUT}= ${GPU_SECTION_TITLE}/../../..//input
${GPU_SECTION_PLUS}= ${GPU_SECTION_TITLE}/../../..//button[@aria-label="Plus"]
${GPU_SECTION_MINUS}= ${GPU_SECTION_TITLE}/../../..//button[@aria-label="Minus"]

${SERVING_ACCELERATOR_DROPDOWN_XPATH}= xpath=//label[@for='modal-notebook-accelerator']/ancestor::div[@class='pf-c-form__group']/descendant::button
${SERVING_ACCELERATOR_INPUT_XPATH}= xpath=//input[@aria-label='Number of accelerators']
${SERVING_ACCELERATOR_MINUS_BUTTON_XPATH}= xpath=${SERVING_ACCELERATOR_INPUT_XPATH}/preceding-sibling::button
${SERVING_ACCELERATOR_PLUS_BUTTON_XPATH}= xpath=${SERVING_ACCELERATOR_INPUT_XPATH}/following-sibling::button
${SERVING_MODEL_SERVERS_SIDE_MENU}= xpath=//span[text()='Models and model servers']

*** Keywords ***
Create Model Server
@@ -42,6 +42,7 @@ Create Model Server
Log GPU requested but not available
Fail
END
Set Accelerator
Set Number of GPU With Buttons ${no_gpus}
END
IF ${ext_route}==${TRUE}
@@ -88,13 +89,18 @@ Set Server Size

Verify GPU Selector Is Usable
[Documentation] Verifies that the GPU selector is present and enabled
Page Should Contain Element ${GPU_SECTION_TITLE}
Element Should Be Enabled ${GPU_SECTION_INPUT}
Page Should Contain Element ${SERVING_ACCELERATOR_DROPDOWN_XPATH}

Set Accelerator
[Documentation] Set Accelerator
[Arguments] ${accelerator}='Nvidia GPU'
Click Element ${SERVING_ACCELERATOR_DROPDOWN_XPATH}
Click Element xpath=//div[@class and text()=${accelerator}]

Set Number of GPU With Buttons
[Documentation] Select the number of GPUs to attach to the model server
[Arguments] ${no_gpus}
${current}= Get Element Attribute ${GPU_SECTION_INPUT} value
${current}= Get Element Attribute ${SERVING_ACCELERATOR_INPUT_XPATH} value
${difference}= Evaluate int(${no_gpus})-int(${current})
${op}= Set Variable plus
IF ${difference}<${0}
@@ -108,16 +114,16 @@ Set Number of GPU With Buttons
Click GPU Minus Button
END
END
${current}= Get Element Attribute ${GPU_SECTION_INPUT} value
${current}= Get Element Attribute ${SERVING_ACCELERATOR_INPUT_XPATH} value
Should Be Equal As Integers ${current} ${no_gpus}

Click GPU Plus Button
[Documentation] Click the plus button in the GPU selector
Click Element ${GPU_SECTION_PLUS}
Click Element ${SERVING_ACCELERATOR_PLUS_BUTTON_XPATH}

Click GPU Minus Button
[Documentation] Click the minus button in the GPU selector
Click Element ${GPU_SECTION_MINUS}
Click Element ${SERVING_ACCELERATOR_MINUS_BUTTON_XPATH}

Verify Displayed GPU Count
[Documentation] Verifies the number of GPUs displayed in the Model Server table
@@ -128,8 +134,10 @@ Verify Displayed GPU Count
IF ${expanded}==False
Click Element xpath://button[@aria-expanded="false"]/span[.="${server_name}"]
END
Page Should Contain Element xpath://span[.="${server_name}"]/../../../..//span[.="Number of GPUs"]
Page Should Contain Element xpath://span[.="${server_name}"]/../../../..//span[.="Number of GPUs"]/../../dd/div[.="${no_gpus}"]
Click Element ${SERVING_MODEL_SERVERS_SIDE_MENU}
Sleep 5s reason=Sometimes the number of current Accelerators takes a few seconds to update
${current_accs}= Get Text xpath://span[text()="${server_name}"]/../../../following-sibling::tr//td[@data-label]/div/dl/div[4]/dd/div
Should Match ${current_accs} ${no_gpus}

Set Model Server Runtime
[Documentation] Selects a given Runtime for the model server
@@ -14,8 +14,12 @@ ${WORKBENCH_NAME_INPUT_XP}= xpath=//input[@name="workbench-name"]
${WORKBENCH_DESCR_TXT_XP}= xpath=//textarea[@name="workbench-description"]
${WORKBENCH_IMAGE_MENU_BTN_XP}= xpath=//section[@id="notebook-image"]//div[@id="workbench-image-stream-selection"]/button # robocop: disable
${WORKBENCH_IMAGE_ITEM_BTN_XP}= xpath=//div[@id="workbench-image-stream-selection"]//li//div
# ${WORKBENCH_IMAGE_ITEM_SPAN_XP}= xpath=//ul[@id="workbench-image-stream-selection"]/li//span
${WORKBENCH_SIZE_MENU_BTN_XP}= xpath=//section[@id="deployment-size"]//button # Removing the attribute in case it changes like it did for the image dropdown
${WORKBENCH_SIZE_SIDE_MENU_BTN}= xpath=//nav[@aria-label="Jump to section"]//span[text()="Deployment size"]
${WORKBENCH_ACCELERATOR_DROPDOWN_XPATH}= xpath=//label[@for='modal-notebook-accelerator']/ancestor::div[@class='pf-c-form__group']/descendant::button
${WORKBENCH_ACCELERATOR_INPUT_XPATH}= xpath=//input[@aria-label='Number of accelerators']
${WORKBENCH_ACCELERATOR_LESS_BUTTON_XPATH}= xpath=${WORKBENCH_ACCELERATOR_INPUT_XPATH}/preceding-sibling::button
${WORKBENCH_ACCELERATOR_PLUS_BUTTON_XPATH}= xpath=${WORKBENCH_ACCELERATOR_INPUT_XPATH}/following-sibling::button
${WORKBENCH_SIZE_ITEM_BTN_XP}= xpath=//ul[@data-id="container-size-select"]/li/button
${WORKBENCH_GPU_MENU_BTN_XP}= xpath=//section[@id="deployment-size"]//button[contains(@aria-labelledby,"gpu-numbers")] # robocop: disable
${WORKBENCH_GPU_ITEM_BTN_XP}= xpath=//ul[@data-id="gpu-select"]/li/button
@@ -210,9 +214,10 @@ Select Workbench Image Version
Select Workbench Container Size
[Documentation] Selects the container size in the workbench creation page
[Arguments] ${size_name}=Small
Click Element //a[@href="#deployment-size"]
Wait Until Page Contains Element ${WORKBENCH_SIZE_SIDE_MENU_BTN}
Click Element ${WORKBENCH_SIZE_SIDE_MENU_BTN}
Wait Until Page Contains Element ${WORKBENCH_SIZE_MENU_BTN_XP}
Click Button ${WORKBENCH_SIZE_MENU_BTN_XP}
Click Element ${WORKBENCH_SIZE_MENU_BTN_XP}
Wait Until Page Contains Element ${WORKBENCH_SIZE_ITEM_BTN_XP}/span[text()="${size_name}"]
Click Element ${WORKBENCH_SIZE_ITEM_BTN_XP}/span[text()="${size_name}"]

@@ -341,7 +346,7 @@ Handle Stop Workbench Confirmation Modal
Run Keyword And Continue On Failure
... Page Should Contain Are you sure you want to stop the workbench? Any changes without saving will be erased.
Run Keyword And Continue On Failure Page Should Contain To save changes, access your
Run Keyword And Continue On Failure Page Should Contain Element xpath=//a[.="workbench"]
Run Keyword And Continue On Failure Page Should Contain Element xpath=//a[.="workbench"]
END
Run Keyword And Continue On Failure Page Should Contain Element xpath=//input[@id="dont-show-again"]
Run Keyword And Continue On Failure Click Element xpath=//input[@id="dont-show-again"]
@@ -460,11 +465,31 @@ Page Should Contain Event Log

Select Workbench Number Of GPUs
[Documentation] Selects the accelerator type and the number of GPUs in the workbench creation page
[Arguments] ${gpus}
Wait Until Page Contains Element ${WORKBENCH_GPU_MENU_BTN_XP}
Click Button ${WORKBENCH_GPU_MENU_BTN_XP}
Wait Until Page Contains Element ${WORKBENCH_GPU_ITEM_BTN_XP}/self::*[text()="${gpus}"]
Click Element ${WORKBENCH_GPU_ITEM_BTN_XP}/self::*[text()="${gpus}"]
[Arguments] ${gpus} ${gpu_type}='Nvidia GPU'
Wait Until Page Contains Element ${WORKBENCH_SIZE_SIDE_MENU_BTN}
Click Element ${WORKBENCH_SIZE_SIDE_MENU_BTN}
Wait Until Page Contains Element ${WORKBENCH_ACCELERATOR_DROPDOWN_XPATH}
Click Element ${WORKBENCH_ACCELERATOR_DROPDOWN_XPATH}
IF "${gpus}" == "0"
Click Element xpath=//a[text()='None']
ELSE
# Select Accelerator Technology
Wait Until Page Contains Element xpath=//div[@class and text()=${gpu_type}]
Click Element xpath=//div[@class and text()=${gpu_type}]
# Select number of GPU units
${actual_gpus}= Get Value ${WORKBENCH_ACCELERATOR_INPUT_XPATH}
${actual_gpus}= Convert To Integer ${actual_gpus}
${gpus}= Convert To Integer ${gpus}
WHILE ${actual_gpus} != ${gpus}
IF ${actual_gpus} < ${gpus}
Click Element ${WORKBENCH_ACCELERATOR_PLUS_BUTTON_XPATH}
ELSE
Click Element ${WORKBENCH_ACCELERATOR_LESS_BUTTON_XPATH}
END
${actual_gpus}= Get Value ${WORKBENCH_ACCELERATOR_INPUT_XPATH}
${actual_gpus}= Convert To Integer ${actual_gpus}
END
END

Edit GPU Number
[Documentation] Edit a workbench
@@ -498,12 +523,6 @@ Delete Workbench From CLI
END
END

GPU Dropdown Should Be Disabled
[Documentation] Checks if the GPU dropdown is not able editable
[Arguments] ${workbench_title}
Workbenches.Click Action From Actions Menu item_title=${workbench_title} item_type=workbench action=Edit
Wait Until Page Contains Element ${WORKBENCH_GPU_MENU_BTN_XP}
Element Should Be Disabled ${WORKBENCH_GPU_MENU_BTN_XP}

Get Workbench Pod
[Documentation] Retrieves info of a workbench pod: namespace, CR resource name and pod definition
@@ -59,20 +59,10 @@ Verify GPU Operator Deployment # robocop: disable
# Before GPU Node is added to the cluster
# NS
Verify Namespace Status label=kubernetes.io/metadata.name=redhat-nvidia-gpu-addon
# Node-Feature-Discovery Operator
Verify Operator Status label=operators.coreos.com/ose-nfd.redhat-nvidia-gpu-addon
... operator_name=ose-nfd.*
Verify Namespace Status label=kubernetes.io/metadata.name=nvidia-gpu-operator
# GPU Operator
Verify Operator Status label=operators.coreos.com/gpu-operator-certified.redhat-nvidia-gpu-addon
Verify Operator Status label=operators.coreos.com/gpu-operator-certified.nvidia-gpu-operator
... operator_name=gpu-operator-certified.v*
# nfd-controller-manager
Verify Deployment Status label=operators.coreos.com/ose-nfd.redhat-nvidia-gpu-addon
... dname=nfd-controller-manager
# nfd-master
Verify DaemonSet Status label=app=nfd-master dsname=nfd-master
# nfd-worker
Verify DaemonSet Status label=app=nfd-worker dsname=nfd-worker

# After GPU Node is added to the cluster
Verify DaemonSet Status label=app=gpu-feature-discovery dsname=gpu-feature-discovery
@@ -84,7 +74,6 @@ Verify GPU Operator Deployment # robocop: disable
# Verify DaemonSet Status label=app=nvidia-driver-daemonset-* dsname=nvidia-driver-daemonset-*
Verify DaemonSet Status label=app=nvidia-node-status-exporter dsname=nvidia-node-status-exporter
Verify DaemonSet Status label=app=nvidia-operator-validator dsname=nvidia-operator-validator
Verify CR Status crd=NodeFeatureDiscovery cr_name=ocp-gpu-addon

Verify That Prometheus Image Is A CPaaS Built Image
[Documentation] Verifies the images used for prometheus
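
Review note on the DaemonSet checks in the hunk above: the same state can be spot-checked from the CLI when debugging a failing run. The sketch below is one way to do it, assuming that "ready" means `desiredNumberScheduled` equals `numberReady` for every DaemonSet matching the label. What the `Verify DaemonSet Status` keyword checks internally is not shown in this diff, and the helper name `daemonset_is_ready` is hypothetical. The labels and the `nvidia-gpu-operator` namespace come from the test file.

```shell
# Hypothetical helper (not part of the commit).
# Usage: daemonset_is_ready <label-selector> <namespace>
# Returns 0 only if at least one DaemonSet matches and each one has all desired pods ready.
daemonset_is_ready() {
    ds_label="$1"
    ds_ns="$2"

    # One line per DaemonSet: "<desired> <ready>"
    ds_status="$(oc get daemonset -l "${ds_label}" -n "${ds_ns}" \
        -o jsonpath='{range .items[*]}{.status.desiredNumberScheduled}{" "}{.status.numberReady}{"\n"}{end}')" || return 1

    # No matching DaemonSet counts as not ready.
    [ -n "${ds_status}" ] || return 1

    # The loop runs in a subshell; "exit 1" there becomes the function's return status.
    echo "${ds_status}" | while read -r desired ready; do
        [ "${desired}" = "${ready}" ] || exit 1
    done
}
```

For example, `daemonset_is_ready app=gpu-feature-discovery nvidia-gpu-operator && echo ready` mirrors the first DaemonSet check that runs after the GPU node is added.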
@@ -116,8 +116,6 @@ Verify User Can Remove GPUs From Workbench
... pv_description=${EMPTY} pv_size=${PV_SIZE} gpus=1
Run Keyword And Continue On Failure Wait Until Workbench Is Started workbench_title=${WORKBENCH_TITLE_GPU}
Sleep 10s reason=There is some delay in updating the GPU availability in Dashboard
Run Keyword And Continue On Failure GPU Dropdown Should Be Disabled workbench_title=${WORKBENCH_TITLE_GPU}
Click Button ${GENERIC_CANCEL_BTN_XP}
Stop Workbench workbench_title=${WORKBENCH_TITLE_GPU}
Run Keyword And Continue On Failure Wait Until Workbench Is Stopped workbench_title=${WORKBENCH_TITLE_GPU}
Wait Until Keyword Succeeds 10 times 5s
3 changes: 2 additions & 1 deletion ods_ci/tests/Tests/500__jupyterhub/autoscaling-gpus.robot
@@ -58,7 +58,8 @@ Spawn Notebook And Trigger Autoscale
... of the GPU node.
Select Notebook Image ${NOTEBOOK_IMAGE}
Select Container Size Small
Set Number Of Required GPUs 1
Set GPU Accelerator
Set Number Of Required Accelerators 1
Spawn Notebook spawner_timeout=20 minutes expect_autoscaling=${True}
Run Keyword And Warn On Failure Wait Until Page Contains Log in with OpenShift timeout=15s
${oauth_prompt_visible} = Is OpenShift OAuth Login Prompt Visible