Update KFTO multi-node test names according to recent updates in orig… #2164

Open
abhijeet-dhumal wants to merge 2 commits into base: master

abhijeet-dhumal
Contributor

Update the KFTO multi-node test names to match recent changes to the original test names.

Related to: opendatahub-io/distributed-workloads#299

@@ -48,30 +48,30 @@
Run Training Operator KFTO Test TestPyTorchJobFailureWithROCm ${ROCM_TRAINING_IMAGE}

Run Training operator KFTO_MNIST multi-node CPU test with NVIDIA CUDA image
- [Documentation] Run Go KFTO_MNIST multi-node CPU test for Training operator using PyTorch job with NVIDIA CUDA image
+ [Documentation] Run Go KFTO_MNIST multi-node CPU test for Training operator using PyTorch job with NVIDIA CUDA image - It requires 2 cluster-nodes with 2 CPUs each

Robocop code scanning warning: Line is too long (170/120)

Run Training operator KFTO_MNIST multi-node test with NVIDIA CUDA image
- [Documentation] Run Go KFTO_MNIST multi-node test for Training operator using PyTorch job with NVIDIA CUDA image
+ [Documentation] Run Go KFTO_MNIST multi-node test for Training operator using PyTorch job with NVIDIA CUDA image - It requires 2 cluster-nodes with 1 GPUs each

Robocop code scanning warning: Line is too long (166/120)

Run Training operator KFTO_MNIST multi-node test with AMD ROCm image
- [Documentation] Run Go KFTO_MNIST multi-node test for Training operator using PyTorch job with AMD ROCm image
+ [Documentation] Run Go KFTO_MNIST multi-node test for Training operator using PyTorch job with AMD ROCm image - It requires 2 cluster-nodes with 1 GPUs each

Robocop code scanning warning: Line is too long (164/120)
github-actions bot (Contributor) commented Jan 9, 2025

Robot Results

| ✅ Passed | ❌ Failed | ⏭️ Skipped | Total | Pass % |
|-----------|-----------|------------|-------|--------|
| 594       | 0         | 0          | 594   | 100    |

ChughShilpa (Contributor) commented

What about the other 2 test scenarios,
TestPyTorchJobMnistMultiNodeMultiGpuWithCuda and TestPyTorchJobMnistMultiNodeMultiGpuWithROCm?
Will you add them in another PR?

abhijeet-dhumal (Contributor, Author) commented Jan 9, 2025

@ChughShilpa Actually, the remaining multi-node/multi-GPU tests require 2 cluster nodes with at least 2 GPUs each (a GPU instance like g4dn.12xlarge - A100 GPUs), and I'm not sure whether those will be available during QG tests.
Even with this prerequisite, is it OK to add these tests here?
cc: @sutaakar

sutaakar (Contributor) commented Jan 9, 2025

We can add the tests to ODS CI; we just can't run them as part of QG, only as part of our own jobs.

ChughShilpa (Contributor) commented

The g4dn.12xlarge instance is used in qe-jenkins, and we also have the Resources-2GPUS tag, which can be used for this requirement. The only thing is that we might need to inform the devtestops team about this.
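
To illustrate the tagging idea in the comment above, here is a minimal Robot Framework sketch. It assumes the multi-GPU ROCm test case would call the same `Run Training Operator KFTO Test` keyword with `${ROCM_TRAINING_IMAGE}` as the other tests in this PR do. Whether the suite already applies the `Resources-2GPUS` tag, and which other tags the test case needs, is not shown in the diff, so treat the `[Tags]` line as hypothetical.

```robotframework
*** Test Cases ***
Run Training operator KFTO_MNIST multi-node multi-gpu test with AMD ROCm image
    # Assumption: Resources-2GPUS is the tag mentioned in this thread. Any other tags
    # the suite uses are not visible in the diff and are left out of this sketch.
    [Tags]    Resources-2GPUS
    Run Training Operator KFTO Test    TestPyTorchJobMnistMultiNodeMultiGpuWithROCm    ${ROCM_TRAINING_IMAGE}
```

With a tag like this in place, a job that has suitable GPU nodes could select such tests through Robot Framework's `--include` option (for example `--include Resources-2GPUS`), while QG runs without those resources leave them out.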

sonarqubecloud bot commented Jan 9, 2025

Run Training Operator KFTO Test TestPyTorchJobMnistMultiNodeWithROCm ${ROCM_TRAINING_IMAGE}

Run Training operator KFTO_MNIST multi-node multi-gpu test with NVIDIA CUDA image
[Documentation] Run Go KFTO_MNIST multi-node multi-gpu test for Training operator using PyTorch job with NVIDIA CUDA image - It requires 2 cluster-nodes with 2 GPUs each

Robocop code scanning warning: Line is too long (176/120)
Run Training Operator KFTO Test TestPyTorchJobMnistMultiNodeMultiGpuWithCuda ${CUDA_TRAINING_IMAGE}

Run Training operator KFTO_MNIST multi-node multi-gpu test with AMD ROCm image
[Documentation] Run Go KFTO_MNIST multi-node multi-gpu test for Training operator using PyTorch job with AMD ROCm image - It requires 2 cluster-nodes with 2 GPUs each

Robocop code scanning warning: Line is too long (174/120)
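
One way to address the Robocop line-length warnings above is to split the `[Documentation]` value across continuation rows using Robot Framework's `...` marker. The following is a sketch only, reusing the test name, documentation text, and keyword call shown in this diff. Any `[Tags]` or other settings the real test case has are not visible here and are omitted. Note that Robot Framework joins continued documentation rows with newlines, so the rendered documentation may wrap slightly differently from the single-line version.

```robotframework
*** Test Cases ***
Run Training operator KFTO_MNIST multi-node multi-gpu test with NVIDIA CUDA image
    # Each row stays under the 120-character limit reported by Robocop.
    [Documentation]    Run Go KFTO_MNIST multi-node multi-gpu test for Training operator
    ...    using PyTorch job with NVIDIA CUDA image.
    ...    It requires 2 cluster-nodes with 2 GPUs each.
    Run Training Operator KFTO Test    TestPyTorchJobMnistMultiNodeMultiGpuWithCuda    ${CUDA_TRAINING_IMAGE}
```

The same pattern would apply to the other flagged `[Documentation]` lines in this PR.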
openshift-ci bot commented Jan 9, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: abhijeet-dhumal, sutaakar

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
