FSDP sample fails with CUDA initialization error on HyperPod EKS #467

Open
shimomut opened this issue Oct 28, 2024 · 6 comments

@shimomut
Collaborator

shimomut commented Oct 28, 2024

When running the FSDP sample app on a HyperPod EKS cluster, I got the following error:

[W CUDAFunctions.cpp:108] Warning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (function operator())
Traceback (most recent call last):
  File "/fsdp/train.py", line 281, in <module>
    main(args)
  File "/fsdp/train.py", line 144, in main
    dist.init_process_group()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 86, in wrapper
    func_return = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1184, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1339, in _new_process_group_helper
    backend_class = ProcessGroupNCCL(
ValueError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

Cluster spec:

  • ml.g5.8xlarge x 8
  • ml.g5.12xlarge x 2

@sean-smith
Contributor

Looks like the NVIDIA GPU device plugin is missing, or you didn't specify the number of GPUs in your YAML, like so:

resources:
  limits:
    nvidia.com/gpu: 1 # requesting 1 GPU
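
Both possibilities can be checked from data that kubectl already provides. The following is a minimal sketch, not part of the sample: it assumes you have parsed the output of `kubectl get nodes -o json` and a pod spec into Python dicts yourself, and the function names are hypothetical.

```python
from typing import List

GPU_RESOURCE = "nvidia.com/gpu"


def nodes_without_gpus(node_list: dict) -> List[str]:
    """Return names of nodes that advertise no allocatable GPUs.

    `node_list` is assumed to be the parsed output of
    `kubectl get nodes -o json`. A GPU node showing up here suggests
    the device plugin is not running on it.
    """
    missing = []
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable") or {}
        # Extended resources are reported as strings, e.g. "1".
        if int(allocatable.get(GPU_RESOURCE, "0")) == 0:
            missing.append(node["metadata"]["name"])
    return missing


def containers_without_gpu_limit(pod_spec: dict) -> List[str]:
    """Return names of containers in a pod spec that set no GPU limit."""
    missing = []
    for container in pod_spec.get("containers", []):
        limits = (container.get("resources") or {}).get("limits") or {}
        if int(limits.get(GPU_RESOURCE, 0)) == 0:
            missing.append(container["name"])
    return missing
```

If `nodes_without_gpus` lists a g5 node, the device plugin is the more likely culprit; if only `containers_without_gpu_limit` reports something, the manifest is.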

@iankouls-aws
Contributor

Reproduced the error with 2 x g5.8xlarge nodes. I believe the root cause is:
"system has unsupported display driver / cuda driver combination"
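
To make this failure show up before the `ProcessGroupNCCL` ValueError, a preflight check could run ahead of `dist.init_process_group()`. This is only a sketch under assumptions: the helper names are hypothetical and not part of `train.py`, and it relies on `torch.cuda.is_available()` returning False when CUDA initialization fails, which matches the warning in the log above.

```python
def visible_gpu_count() -> int:
    """Number of CUDA devices PyTorch can see.

    Returns 0 if torch is not installed or CUDA cannot be initialized.
    """
    try:
        import torch
    except ImportError:
        return 0
    # is_available() is expected to return False (with a warning) when
    # the driver and CUDA libraries do not match.
    return torch.cuda.device_count() if torch.cuda.is_available() else 0


def require_gpus_for_backend(backend: str, gpu_count: int) -> None:
    """Raise a descriptive error if NCCL is requested but no GPU is usable."""
    if backend == "nccl" and gpu_count == 0:
        raise RuntimeError(
            "NCCL backend requested but no CUDA device is usable. "
            "Check the GPU resource limits, the NVIDIA device plugin, "
            "and the driver / CUDA library combination in the container."
        )
```

Calling `require_gpus_for_backend("nccl", visible_gpu_count())` at the top of `main` would not fix anything by itself; it only replaces the late ValueError with a message that points at the likely causes.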

@mhuguesaws
Contributor

Related to #475

@mhuguesaws
Contributor

@shimomut please add a link to the test case. There are multiple FSDP test cases in the repo.

@shimomut
Collaborator Author

shimomut commented Nov 4, 2024

This is the FSDP sample where I hit the issue:
https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/10.FSDP/README-EKS.md

Updated the original post as well.

@mhuguesaws
Contributor

Removing the CUDA compat package will likely resolve this.
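
A quick way to see whether a compat directory is being picked up inside the container is to inspect `LD_LIBRARY_PATH`. The sketch below rests on an assumption: that the compat libraries live in a directory named `compat` (for example under `/usr/local/cuda`), which is the usual layout but is not confirmed for this image. The function name is hypothetical, and the check will miss libraries registered through `ld.so.conf`.

```python
import posixpath
from typing import List


def cuda_compat_dirs(ld_library_path: str) -> List[str]:
    """Return LD_LIBRARY_PATH entries whose last component is 'compat'."""
    hits = []
    for entry in ld_library_path.split(":"):
        # normpath strips a trailing slash so basename sees 'compat'.
        if entry and posixpath.basename(posixpath.normpath(entry)) == "compat":
            hits.append(entry)
    return hits
```

Inside the training container this could be run as `cuda_compat_dirs(os.environ.get("LD_LIBRARY_PATH", ""))`; a non-empty result would be consistent with the compat package shadowing the host driver libraries.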
