Q: v0.9.0: 'std::invalid_argument' (core dump) #1186

Closed
sklages opened this issue Dec 18, 2024 · 17 comments
Labels
bug Something isn't working

Comments

@sklages

sklages commented Dec 18, 2024

Just wanted to run version v0.9.0 (0.9.0+9dc15a85) and ran into an "invalid argument" which I cannot spot in my command .. :-)

dorado basecaller \
  sup,5mCG_5hmCG \
  /path/to/pod5data \
  --device cuda:all \
  --batchsize 0 \
  --trim all \
  --kit-name SQK-NBD114-96 \
  --barcode-both-ends \
  --sample-sheet ../samplesheet.csv \
  > file.bam

error:

[2024-12-18 16:19:02.589] [info] 
Running: "basecaller" "sup,5mCG_5hmCG" "/path/to/pod5data" 
"--device" "cuda:all" "--batchsize" "0" "--trim" "all" 
"--kit-name" "SQK-NBD114-96" "--barcode-both-ends" 
"--sample-sheet" "../samplesheet.csv"

terminate called after throwing an instance of 'std::invalid_argument'
  what():  stoi

./call: line 3:  6319 Aborted (core dumped) 
/path/to/bin/dorado basecaller sup,5mCG_5hmCG /path/to/pod5data 
--device cuda:all --batchsize 0 --trim all --kit-name SQK-NBD114-96 
--barcode-both-ends --sample-sheet ../samplesheet.csv > file.bam

The same command works with v0.8.3 .. so something has obviously changed ..
Can you point me to what I am missing here?

@malton-ont
Collaborator

malton-ont commented Dec 18, 2024

Hi @sklages,

Do you have CUDA_VISIBLE_DEVICES set? If so, ensure that this is simply a comma-delimited set of integers - e.g. export CUDA_VISIBLE_DEVICES=0,1
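A quick way to check whether a given environment is affected is to look for non-integer entries in the variable. The following is a minimal Python sketch, not part of dorado; the function name `non_integer_entries` is hypothetical and chosen only for illustration.

```python
import os


def non_integer_entries(value: str) -> list:
    """Return the entries of a CUDA_VISIBLE_DEVICES-style string that are not plain integers."""
    # Split on commas and ignore empty pieces (e.g. from an unset or empty variable).
    entries = [e.strip() for e in value.split(",") if e.strip()]
    # Keep only entries that are not made up purely of ASCII digits.
    return [e for e in entries if not (e.isascii() and e.isdigit())]


if __name__ == "__main__":
    value = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    bad = non_integer_entries(value)
    if bad:
        print("Non-integer entries:", ", ".join(bad))
    else:
        print("All entries are integers (or the variable is unset).")
```

If the script reports any non-integer entries, such as a `GPU-...` or `MIG-...` UUID, the suggestion above applies.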

@sklages
Author

sklages commented Dec 19, 2024

@malton-ont

GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-c02c4ecb-2dbd-0567-a809-dcc00241a59e)

CUDA_VISIBLE_DEVICES is set using the UUID of the GPU: export CUDA_VISIBLE_DEVICES=GPU-c02c4ecb-2dbd-0567-a809-dcc00241a59e

@malton-ont
Collaborator

@sklages,

This doesn't appear to be a setting we account for - please try setting this to the integer CUDA ID of the device instead while we investigate fixing this.

@sklages
Author

sklages commented Dec 19, 2024

@malton-ont - CUDA_VISIBLE_DEVICES is controlled here by the cluster manager; I can override this on single-GPU/unpartitioned GPU nodes only, knowing that there is only one GPU (partition) available.

But is this something that has been changed in v0.9.0? Using UUID is working in all other versions.

From the error message I would expect that I have used a wrong parameter ..

NVIDIA supports both integer indices and UUIDs in CUDA_VISIBLE_DEVICES: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars

@malton-ont
Collaborator

malton-ont commented Dec 19, 2024

@sklages,

Yes, some of the internal gpu monitoring code changed to use NVML, which doesn't respect CUDA_VISIBLE_DEVICES, so we had to put in some code to deal with it ourselves and it looks like we missed this use case.

std::invalid_argument is an internal C++ exception type - it's occurring because we attempt to parse what we expect to be a numeric ID but we're actually being passed a UUID string, which is an invalid argument to the std::stoi function. It has nothing to do with the arguments being passed to dorado.
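The failure mode described here can be mimicked in a few lines of Python. This is only a loose analogue and not dorado's actual code: `int()` stands in for `std::stoi`, `ValueError` for `std::invalid_argument`, and the function names are hypothetical. The sketch also shows one possible way to tolerate UUID entries by keeping them as opaque strings.

```python
def parse_device_entry(entry: str):
    """Classify one CUDA_VISIBLE_DEVICES entry as an integer index or a UUID string."""
    try:
        # int() plays the role of std::stoi here: it raises on non-numeric text.
        return ("index", int(entry))
    except ValueError:
        # Non-numeric entries (e.g. "GPU-..." or "MIG-...") are kept as opaque identifiers.
        return ("uuid", entry)


def parse_visible_devices(value: str):
    """Parse a comma-separated CUDA_VISIBLE_DEVICES-style string into classified entries."""
    return [parse_device_entry(e.strip()) for e in value.split(",") if e.strip()]


if __name__ == "__main__":
    print(parse_visible_devices("0,1"))
    print(parse_visible_devices("GPU-c02c4ecb-2dbd-0567-a809-dcc00241a59e"))
```

Without the `except` branch, the second call would abort with an uncaught exception, which corresponds to the crash reported in this issue.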

@sklages
Author

sklages commented Dec 19, 2024

@malton-ont - thanks for the explanation!

So maybe I can leave this as a feature request / improvement: support for GPU/MIG UUIDs for CUDA_VISIBLE_DEVICES in dorado basecaller :-)

@malton-ont
Collaborator

Other developers appear to have had a similar issue, and I found this workaround:
microsoft/DeepSpeed#5278 (comment)

malton-ont added the bug label on Dec 19, 2024

@DntBScrdDv

DntBScrdDv commented Dec 23, 2024

I'm having the exact same issue, but the workaround doesn't seem to work?

echo DEVICES $CUDA_VISIBLE_DEVICES
DEVICES MIG-5fd3bc62-f356-5e69-a692-1bb3cf702658

ID=$(nvidia-smi --id=$UUID --query-gpu=index --format=csv,noheader)
echo ID $ID
ID No devices were found

(although we are at the limit of my understanding!)
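The lookup attempted above can be wrapped so that a failed query is reported as "no index" instead of silently producing an unusable value. This is a hedged Python sketch: the `nvidia-smi` flags are the ones quoted in the transcript, while the function names are hypothetical. Note that the transcript does not show how `UUID` was set before the command ran.

```python
import subprocess
from typing import List, Optional


def build_index_query(uuid: str) -> List[str]:
    """Build the nvidia-smi invocation quoted in this thread for one device UUID."""
    return ["nvidia-smi", f"--id={uuid}", "--query-gpu=index", "--format=csv,noheader"]


def parse_index(output: str) -> Optional[int]:
    """Return the integer index from the command output, or None if it is not a number."""
    text = output.strip()
    return int(text) if text.isascii() and text.isdigit() else None


def uuid_to_index(uuid: str) -> Optional[int]:
    """Run the lookup; returns None when nvidia-smi does not print an index.

    Raises FileNotFoundError if nvidia-smi is not installed.
    """
    result = subprocess.run(build_index_query(uuid), capture_output=True, text=True)
    return parse_index(result.stdout)
```

With the output shown above ("No devices were found"), `parse_index` returns `None`, so a calling script can fall back to another approach rather than exporting an invalid value.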

@malton-ont
Collaborator

@DntBScrdDv,

I don't have any MIG devices set up to verify this I'm afraid - presumably the expected ID is different in the case of MIGs.

You can try running nvidia-smi -L, nvidia-smi mig -lgi or nvidia-smi mig -lci to list the available devices and work out the appropriate UUID from there?
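To work out which UUID belongs to which device, the `nvidia-smi -L` listing can be parsed into records. The sketch below is an assumption-laden illustration in Python: the regular expressions only cover the two line shapes that appear in this thread (a `GPU n:` line and a `MIG ... Device n:` line), other driver versions may print different formats, and all names are hypothetical.

```python
import re

# Patterns for the two line shapes seen in this thread; other formats are ignored.
GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: (GPU-[0-9a-fA-F-]+)\)\s*$")
MIG_LINE = re.compile(r"^\s*MIG (\S+)\s+Device\s+(\d+): \(UUID: (MIG-[0-9a-fA-F-]+)\)\s*$")

SAMPLE = (
    "GPU 0: NVIDIA H100 NVL (UUID: GPU-8b4c0ca1-b445-30cd-512c-ec1131c5295d)\n"
    "MIG 3g.47gb Device 0: (UUID: MIG-5fd3bc62-f356-5e69-a692-1bb3cf702658)\n"
)


def parse_listing(text: str) -> list:
    """Parse `nvidia-smi -L` style text into a list of device records."""
    devices = []
    current_gpu = None  # index of the most recent physical GPU line
    for line in text.splitlines():
        m = GPU_LINE.match(line)
        if m:
            current_gpu = int(m.group(1))
            devices.append({"kind": "gpu", "gpu_index": current_gpu,
                            "name": m.group(2), "uuid": m.group(3)})
            continue
        m = MIG_LINE.match(line)
        if m:
            # A MIG line is attributed to the physical GPU listed before it.
            devices.append({"kind": "mig", "gpu_index": current_gpu,
                            "profile": m.group(1), "mig_index": int(m.group(2)),
                            "uuid": m.group(3)})
    return devices


if __name__ == "__main__":
    for device in parse_listing(SAMPLE):
        print(device)
```

The records make it easy to tell whether a UUID in CUDA_VISIBLE_DEVICES refers to a whole GPU or to a MIG instance; they do not by themselves yield an integer ID that selects a MIG instance.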

@DntBScrdDv

Thanks for the rapid reply!

I got the UUID, but when I try to get the numerical ID, it comes back with nothing.

nvidia-smi -L
GPU 0: NVIDIA H100 NVL (UUID: GPU-8b4c0ca1-b445-30cd-512c-ec1131c5295d)
MIG 3g.47gb Device 0: (UUID: MIG-5fd3bc62-f356-5e69-a692-1bb3cf702658)

I'm just going to try a previous version of dorado as 0.7 works (on a different cluster admittedly), and a priori 0.8.3 works (as stated by OP).

If I'm hijacking this issue please let me know so I can create a separate one. It just seemed like I'm experiencing the same problem.

@DntBScrdDv

Yeah - works fine in version 0.8.3, so is a bug in this version as stated... :)

@malton-ont
Collaborator

@DntBScrdDv,

Thanks for clarifying. I suspect this workaround just doesn't work for MIG devices, since they're "virtual" GPUs. We're working on a fix.

@DntBScrdDv

Perfect - many thanks. I'll continue with 0.8.3 for now

@malton-ont
Collaborator

Dorado 0.9.1 has just been released. This version should correctly handle CUDA_VISIBLE_DEVICES using either UUIDs or integers to define the available resources.

@DntBScrdDv, it would be good if you could test this in your environment as we don't have a MIG device set up here.

@DntBScrdDv

Hi @malton-ont

Thanks for the update - I'll see if I find time today to test and I'll let you know!

@DntBScrdDv

Just tested and appears to be working fine! Thanks again team for fixing this.

@malton-ont
Collaborator

Great news, thanks @DntBScrdDv.
