FSDP sample fails with CUDA initialization error on HyperPod EKS #467

shimomut · 2024-10-28T23:29:34Z

When running the FSDP sample app on a HyperPod EKS cluster, I got this error:

[W CUDAFunctions.cpp:108] Warning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (function operator())
Traceback (most recent call last):
  File "/fsdp/train.py", line 281, in <module>
    main(args)
  File "/fsdp/train.py", line 144, in main
    dist.init_process_group()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 86, in wrapper
    func_return = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1184, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1339, in _new_process_group_helper
    backend_class = ProcessGroupNCCL(
ValueError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

Cluster spec:

ml.g5.8xlarge x 8
ml.g5.12xlarge x 2


sean-smith · 2024-10-28T23:43:59Z

Looks like the cluster is missing the NVIDIA GPU device plugin, or you didn't specify the number of devices in your YAML, like so:

resources:
  limits:
    nvidia.com/gpu: 1 # requesting 1 GPU
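
To tell these two cases apart from inside the pod, a small check can run before `dist.init_process_group()`. This is only a sketch: the helper name `gpu_visibility_report` is hypothetical, and it assumes the NVIDIA container runtime exposes the allocated devices through the `NVIDIA_VISIBLE_DEVICES` environment variable, which may not hold for every device plugin configuration.

```python
import os


def gpu_visibility_report(device_count, env=None):
    """Summarize what the container can see (hypothetical helper).

    device_count: value returned by torch.cuda.device_count().
    env: mapping to read NVIDIA_VISIBLE_DEVICES from (defaults to os.environ).
    """
    env = os.environ if env is None else env
    # Assumption: the runtime sets this variable when GPUs are allocated.
    visible = env.get("NVIDIA_VISIBLE_DEVICES", "<unset>")
    if device_count > 0:
        return f"ok: {device_count} GPU(s) visible (NVIDIA_VISIBLE_DEVICES={visible})"
    return (
        f"no GPUs visible (NVIDIA_VISIBLE_DEVICES={visible}); "
        "check the nvidia.com/gpu limit and the device plugin"
    )


# Usage inside the training script, before dist.init_process_group():
#   import torch
#   print(gpu_visibility_report(torch.cuda.device_count()))
```

If the variable is set but the count is still 0, the pod was probably allocated a GPU and the problem is more likely on the driver side, as the Error 803 warning in the log suggests.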

iankouls-aws · 2024-11-01T00:21:20Z

Reproduced the error with 2 x g5.8xlarge nodes. I believe the root cause is:
"system has unsupported display driver / cuda driver combination"

mhuguesaws · 2024-11-04T15:10:52Z

Related to #475

mhuguesaws · 2024-11-04T15:11:37Z

@shimomut please add link to test case. There are multiple FSDP test cases in the repo.

shimomut · 2024-11-04T17:36:46Z

This is the FSDP sample where I hit the issue:
https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/10.FSDP/README-EKS.md

Updated the original post as well.

mhuguesaws · 2024-11-04T17:57:27Z

Removing the CUDA compat package will likely resolve this.
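
One way to see whether a compat library directory is on the loader path inside the container is sketched below. The helper name `compat_entries` is hypothetical, and the example path `/usr/local/cuda/compat` is an assumption about where the image installs the package; the image may also register the directory through the `ldconfig` cache instead of `LD_LIBRARY_PATH`, which this check would not catch.

```python
import os


def compat_entries(ld_library_path):
    """Return loader-path entries that have a 'compat' path component.

    ld_library_path: a colon-separated string such as the value of
    LD_LIBRARY_PATH inside a Linux container.
    """
    entries = [p for p in ld_library_path.split(":") if p]
    return [p for p in entries if "compat" in p.rstrip("/").split("/")]


# Usage inside the container:
#   print(compat_entries(os.environ.get("LD_LIBRARY_PATH", "")))
```

An empty list does not prove the package is absent; it only means no such directory is on `LD_LIBRARY_PATH`.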
