FSDP sample fails with CUDA initialization error on HyperPod EKS #467
Comments
Looks like the NVIDIA GPU device plugin is missing, or you didn't specify the number of devices in your YAML like so:
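As a rough illustration of the second point (this is not the commenter's original snippet), the sketch below counts how many GPUs a pod spec asks for. It assumes the spec has already been parsed into a Python dict and that GPUs are requested through the `nvidia.com/gpu` resource name that the NVIDIA device plugin advertises. The helper `requested_gpus` and the `example_spec` dict are hypothetical names used only for this example.

```python
def requested_gpus(pod_spec: dict) -> int:
    """Sum the nvidia.com/gpu limits across all containers in a pod spec.

    Assumes `pod_spec` is the `spec` section of a pod template, already
    parsed from YAML into a dict. Returns 0 if no container requests a GPU.
    """
    total = 0
    for container in pod_spec.get("containers", []):
        resources = container.get("resources") or {}
        limits = resources.get("limits") or {}
        # Kubernetes accepts the quantity as an int or a string such as "1".
        total += int(limits.get("nvidia.com/gpu", 0))
    return total


# Hypothetical pod spec fragment: one container asking for one GPU.
example_spec = {
    "containers": [
        {
            "name": "pytorch",
            "resources": {"limits": {"nvidia.com/gpu": 1}},
        }
    ]
}

print(requested_gpus(example_spec))
```

A result of 0 for the training container would be consistent with the "didn't specify the number of devices" explanation, though it does not rule out the missing device plugin.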
Reproduced the error with 2 x g5.8xlarge nodes. I believe the root cause is:
Related to #475
@shimomut please add a link to the test case. There are multiple FSDP test cases in the repo.
This is the FSDP sample where I faced the issue: Updated the original post as well.
Removing the CUDA compat package will likely resolve this.
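To check whether a container is picking up compat libraries before removing anything, one option is to look for `compat` directories on the library search path. The sketch below is a diagnostic under stated assumptions, not a confirmed fix: it assumes a Linux container where the compat libraries are exposed through `LD_LIBRARY_PATH` (for example under a directory such as `/usr/local/cuda/compat`), and `find_compat_entries` is a hypothetical helper name.

```python
import os
from pathlib import PurePosixPath


def find_compat_entries(ld_library_path: str) -> list[str]:
    """Return the search-path entries that contain a `compat` directory.

    Assumes a Linux-style, colon-separated LD_LIBRARY_PATH value.
    Empty entries are skipped.
    """
    entries = [e for e in ld_library_path.split(":") if e]
    return [e for e in entries if "compat" in PurePosixPath(e).parts]


if __name__ == "__main__":
    # Inspect the current environment; prints nothing if no entry matches.
    for entry in find_compat_entries(os.environ.get("LD_LIBRARY_PATH", "")):
        print(entry)
```

Running this inside the training container lists any compat directories on the search path; an empty result suggests the compat libraries are not being loaded through `LD_LIBRARY_PATH`, although they could still be registered elsewhere (for example via the dynamic linker cache).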
When running the FSDP sample app on a HyperPod EKS cluster, I got this error.
Cluster spec: