Q: v0.9.0: 'std::invalid_argument' (core dump) #1186
Comments
Hi @sklages, Do you have
This doesn't appear to be a setting we account for - please try setting this to the integer CUDA ID of the device instead while we investigate fixing this.
@malton-ont - But is this something that has been changed in From the error message I would expect that I have used a wrong parameter .. Nvidia supports both, https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars in
Yes, some of the internal GPU monitoring code changed to use NVML, which doesn't respect
@malton-ont - thanks for the explanation! So maybe I can leave this as a feature request / improvement: support for GPU/MIG UUIDs for
Other developers appear to have had a similar issue, and I found this workaround:
I'm having the exact same issue, but the workaround doesn't seem to work?
echo DEVICES $CUDA_VISIBLE_DEVICES
ID=$(nvidia-smi --id=$UUID --query-gpu=index --format=csv,noheader)
(although we are at the limit of my understanding!)
I don't have any MIG devices set up to verify this I'm afraid - presumably the expected ID is different in the case of MIGs. You can try running
Thanks for the rapid reply! I got the UUID, but when I try to get the numerical ID, it comes back with nothing. nvidia-smi -L I'm just going to try a previous version of dorado, as 0.7 works (on a different cluster admittedly), and a priori 0.8.3 works (as stated by the OP). If I'm hijacking this issue please let me know so I can create a separate one. It just seemed like I'm experiencing the same problem.
Yeah - works fine in version 0.8.3, so is a bug in this version as stated... :)
Thanks for clarifying. I suspect this workaround just doesn't work for MIG devices, since they're "virtual" GPUs. We're working on a fix.
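For reference, the UUID-to-index workaround discussed in the comments above can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from dorado: `uuid_to_index` is a hypothetical helper name, and the sketch assumes `CUDA_VISIBLE_DEVICES` holds a single GPU UUID. It reads the `index, uuid` rows printed by `nvidia-smi --query-gpu=index,uuid --format=csv,noheader`. As far as I know, MIG UUIDs are not part of that listing, so the helper returns a non-zero status for them, which would be consistent with the empty result reported above.

```shell
# Hypothetical helper (not part of dorado or nvidia-smi):
# print the index of the GPU whose UUID equals $1.
# Expects "index, uuid" rows on stdin, one GPU per line.
# Exits non-zero when no row matches (e.g. for a MIG UUID).
uuid_to_index() {
    awk -F', *' -v want="$1" '
        $2 == want { print $1; found = 1; exit }
        END { exit !found }
    '
}

# Possible usage on a machine with an NVIDIA driver (left commented out,
# and assuming CUDA_VISIBLE_DEVICES currently holds a single GPU UUID):
#
#   if idx=$(nvidia-smi --query-gpu=index,uuid --format=csv,noheader \
#            | uuid_to_index "$CUDA_VISIBLE_DEVICES"); then
#       export CUDA_VISIBLE_DEVICES="$idx"
#   fi
```

Keeping the lookup separate from the `nvidia-smi` call makes it easy to see when no match was found, rather than silently exporting an empty value.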
Perfect - many thanks. I'll continue with 0.8.3 for now
Dorado 0.9.1 has just been released. This version should correctly handle @DntBScrdDv, it would be good if you could test this in your environment as we don't have a MIG device set up here.
Hi @malton-ont Thanks for the update - I'll see if I find time today to test and I'll let you know!
Just tested and appears to be working fine! Thanks again team for fixing this.
Great news, thanks @DntBScrdDv.
Just wanted to run version v0.9.0 (0.9.0+9dc15a85) and ran into an "invalid argument" which I cannot spot in my command .. :-)
error:
The same command works with v0.8.3 .. so something has obviously changed ..
Can you point me to what I am missing here?