From 1c4c2beccad384bf6ceb486e478f3d08b150973f Mon Sep 17 00:00:00 2001
From: abhijeet-dhumal
Date: Thu, 9 Jan 2025 12:17:39 +0530
Subject: [PATCH 1/2] Update KFTO multi-node test names according to recent updates in original test names

---
 .../test-run-training-stack-tests.robot | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/ods_ci/tests/Tests/0600__distributed_workloads/0602__training/test-run-training-stack-tests.robot b/ods_ci/tests/Tests/0600__distributed_workloads/0602__training/test-run-training-stack-tests.robot
index c72735772..763aa01a1 100644
--- a/ods_ci/tests/Tests/0600__distributed_workloads/0602__training/test-run-training-stack-tests.robot
+++ b/ods_ci/tests/Tests/0600__distributed_workloads/0602__training/test-run-training-stack-tests.robot
@@ -48,30 +48,30 @@ Run Training operator KFTO error handling test with AMD ROCm image
     Run Training Operator KFTO Test    TestPyTorchJobFailureWithROCm    ${ROCM_TRAINING_IMAGE}
 
 Run Training operator KFTO_MNIST multi-node CPU test with NVIDIA CUDA image
-    [Documentation]    Run Go KFTO_MNIST multi-node CPU test for Training operator using PyTorch job with NVIDIA CUDA image
+    [Documentation]    Run Go KFTO_MNIST multi-node CPU test for Training operator using PyTorch job with NVIDIA CUDA image - It requires 2 cluster-nodes with 2 CPUs each
     [Tags]    RHOAIENG-16556
     ...    Sanity
     ...    DistributedWorkloads
     ...    Training
     ...    TrainingOperator
-    Run Training Operator KFTO Test    TestPyTorchJobMnistCpu    ${CUDA_TRAINING_IMAGE}
+    Run Training Operator KFTO Test    TestPyTorchJobMnistMultiNodeCpu    ${CUDA_TRAINING_IMAGE}
 
 Run Training operator KFTO_MNIST multi-node test with NVIDIA CUDA image
-    [Documentation]    Run Go KFTO_MNIST multi-node test for Training operator using PyTorch job with NVIDIA CUDA image
+    [Documentation]    Run Go KFTO_MNIST multi-node test for Training operator using PyTorch job with NVIDIA CUDA image - It requires 2 cluster-nodes with 1 GPUs each
     [Tags]    Resources-GPU    NVIDIA-GPUs
     ...    RHOAIENG-16556
     ...    Tier1
     ...    DistributedWorkloads
     ...    Training
     ...    TrainingOperator
-    Run Training Operator KFTO Test    TestPyTorchJobMnistWithCuda    ${CUDA_TRAINING_IMAGE}
+    Run Training Operator KFTO Test    TestPyTorchJobMnistMultiNodeWithCuda    ${CUDA_TRAINING_IMAGE}
 
 Run Training operator KFTO_MNIST multi-node test with AMD ROCm image
-    [Documentation]    Run Go KFTO_MNIST multi-node test for Training operator using PyTorch job with AMD ROCm image
+    [Documentation]    Run Go KFTO_MNIST multi-node test for Training operator using PyTorch job with AMD ROCm image - It requires 2 cluster-nodes with 1 GPUs each
     [Tags]    Resources-GPU    AMD-GPUs    ROCm
     ...    RHOAIENG-16556
     ...    Tier1
     ...    DistributedWorkloads
     ...    Training
     ...    TrainingOperator
-    Run Training Operator KFTO Test    TestPyTorchJobMnistWithROCm    ${ROCM_TRAINING_IMAGE}
+    Run Training Operator KFTO Test    TestPyTorchJobMnistMultiNodeWithROCm    ${ROCM_TRAINING_IMAGE}

From d8d75d4553019ec05f2ca15b8531eff413d20bf6 Mon Sep 17 00:00:00 2001
From: abhijeet-dhumal
Date: Thu, 9 Jan 2025 16:20:23 +0530
Subject: [PATCH 2/2] Add KFTO pytorch multi-node multi-gpu tests for GPUs with AMD ROCm and NVIDIA Cuda

---
 .../test-run-training-stack-tests.robot | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/ods_ci/tests/Tests/0600__distributed_workloads/0602__training/test-run-training-stack-tests.robot b/ods_ci/tests/Tests/0600__distributed_workloads/0602__training/test-run-training-stack-tests.robot
index 763aa01a1..48c04d8fd 100644
--- a/ods_ci/tests/Tests/0600__distributed_workloads/0602__training/test-run-training-stack-tests.robot
+++ b/ods_ci/tests/Tests/0600__distributed_workloads/0602__training/test-run-training-stack-tests.robot
@@ -75,3 +75,15 @@ Run Training operator KFTO_MNIST multi-node test with AMD ROCm image
     ...    Training
     ...    TrainingOperator
     Run Training Operator KFTO Test    TestPyTorchJobMnistMultiNodeWithROCm    ${ROCM_TRAINING_IMAGE}
+
+Run Training operator KFTO_MNIST multi-node multi-gpu test with NVIDIA CUDA image
+    [Documentation]    Run Go KFTO_MNIST multi-node multi-gpu test for Training operator using PyTorch job with NVIDIA CUDA image - It requires 2 cluster-nodes with 2 GPUs each
+    [Tags]    Kfto-MultiNodeMultiGpu
+    ...    Training
+    Run Training Operator KFTO Test    TestPyTorchJobMnistMultiNodeMultiGpuWithCuda    ${CUDA_TRAINING_IMAGE}
+
+Run Training operator KFTO_MNIST multi-node multi-gpu test with AMD ROCm image
+    [Documentation]    Run Go KFTO_MNIST multi-node multi-gpu test for Training operator using PyTorch job with AMD ROCm image - It requires 2 cluster-nodes with 2 GPUs each
+    [Tags]    Kfto-MultiNodeMultiGpu
+    ...    Training
+    Run Training Operator KFTO Test    TestPyTorchJobMnistMultiNodeMultiGpuWithROCm    ${ROCM_TRAINING_IMAGE}