
Basecalling/demux+modbases: std::bad_alloc #1213

Open
sklages opened this issue Jan 9, 2025 · 6 comments
Labels: barcode (Issues related to barcoding), bug (Something isn't working)

Comments


sklages commented Jan 9, 2025

I have a strange issue with v0.8.3 when basecalling/demux in sup mode with modbases on a Nvidia A100/40G.

[2025-01-07 07:44:46.218] [info] Running:
"basecaller" "sup,5mCG_5hmCG" "/dev/shm/mxqd/mnt/job/53410651"
"--device" "cuda:all" "--batchsize" "0" "--trim" "all" "--verbose"
"--kit-name" "SQK-NBD114-96" "--sample-sheet" "/path/to/samplesheet.csv"
[2025-01-07 07:44:46.566] [debug] set models directory to: 
'/path/to/v0.8.3-Release/models' from 'DORADO_MODELS_DIRECTORY' environment variable
[2025-01-07 07:44:46.657] [info] > Creating basecall pipeline

<..>

[2025-01-09 11:59:27.922] [info] > Simplex reads basecalled: 152677113
[2025-01-09 11:59:27.922] [info] > Simplex reads filtered: 3048
[2025-01-09 11:59:27.922] [info] > Basecalled @ Samples/s: 7.592271e+06
[2025-01-09 11:59:27.922] [debug] > Including Padding @ Samples/s: 1.051e+07 (72.25%)
[2025-01-09 11:59:27.922] [info] > 154785331 reads demuxed @ classifications/s: 8.230199e+02
[2025-01-09 11:59:27.922] [debug] Barcode distribution :
[2025-01-09 11:59:27.922] [debug] SQK-NBD114-96_barcode70 : 48370444
[2025-01-09 11:59:27.922] [debug] SQK-NBD114-96_barcode71 : 37403828
[2025-01-09 11:59:27.922] [debug] SQK-NBD114-96_barcode72 : 32852492
[2025-01-09 11:59:27.922] [debug] SQK-NBD114-96_barcode73 : 29870372
[2025-01-09 11:59:27.922] [debug] unclassified : 6288195
[2025-01-09 11:59:27.941] [debug] Classified rate 95.93747%
[2025-01-09 12:00:07.015] [info] > Finished
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

.. directly after basecalling finished (after approx. 53 h) ..

  • free disk space 76TB
  • RAM usage was below 20G (system RAM is 384G, single user, single job)

This happened with two datasets, both short-insert libraries with many reads. I have never seen this with dorado before.

What caused dorado to crash immediately after basecalling finished?

Result files seem to be complete though, e.g.:

# dorado reports
[info] > 154785331 reads demuxed

# samtools reports on BAM file
154785331 + 0 primary

Any idea what is going wrong here, what is causing the std::bad_alloc?

@HalfPhoton
Collaborator

Hi @sklages,

Can you repro this with a small dataset e.g. ~10k reads?

Best regards,
Rich

@HalfPhoton added the bug and barcode labels on Jan 13, 2025

sklages commented Jan 14, 2025

@HalfPhoton - I tried two different 10K subsets and both succeeded. I will re-run the large dataset with both v0.8.3 and the current v0.9.0 to see whether I can actually reproduce the issue and whether the two versions behave differently.


sklages commented Jan 18, 2025

@HalfPhoton - With v0.8.3 the error is reproducible, directly after basecalling finishes, as before.

Version v0.9.0 quits very early:

<..>
[2025-01-17 11:09:51.039] [info] Running: "basecaller" "sup,5mCG_5hmCG" "/dev/shm/mxqd/mnt/job/53455719" "--device" "cuda:all" "--batchsize" "0" "  --trim" "all" "--verbose" "--kit-name" "SQK-NBD114-96" "--sample-sheet" "/path/to/samplesheet.csv"

<..>
  [2025-01-17 11:10:03.348] [debug] Largest batch size for cuda:0: 960, time per chunk 0.443559 ms
  [2025-01-17 11:10:03.348] [debug] Final batch size for cuda:0[0]: 480
  [2025-01-17 11:10:03.348] [debug] Final batch size for cuda:0[1]: 960
  [2025-01-17 11:10:03.348] [info] cuda:0 using chunk size 11520, batch size 480
  [2025-01-17 11:10:03.348] [debug] cuda:0 Model memory 32.12GB
  [2025-01-17 11:10:03.348] [debug] cuda:0 Decode memory 3.90GB
  [2025-01-17 11:10:04.448] [info] cuda:0 using chunk size 5760, batch size 960
  [2025-01-17 11:10:04.448] [debug] cuda:0 Model memory 32.12GB
  [2025-01-17 11:10:04.448] [debug] cuda:0 Decode memory 3.90GB
  [2025-01-17 11:10:05.060] [debug] BasecallerNode chunk size 11520
  [2025-01-17 11:10:05.060] [debug] BasecallerNode chunk size 5760
  [2025-01-17 11:10:05.084] [debug] Load reads from file /dev/shm/mxqd/mnt/job/53455719/f3c0a0d2ace8.pod5
  [2025-01-17 11:10:06.451] [debug] > Kits to evaluate: 1
  terminate called after throwing an instance of 'std::bad_alloc'
    what():  std::bad_alloc
    
<..>(core dumped) /path/to/v0.9.0-Release/bin/dorado basecaller $DORADO_BC_MODEL_PM $POD5_TMPDIR --device cuda:all --batchsize 0 --trim all $DORADO_DEBUG $DEMUX_PRM > $BAM_OUT

Any idea where to start looking for the problem? It may be dataset-specific or system-specific.

Both runs had a small memory footprint (less than 20G) and plenty of free disk space.

I will run a different dataset that worked before, to rule out the latter.


sklages commented Jan 18, 2025

I ran both versions on another (smaller) dataset; both finished successfully. So it seems to be somehow dataset-(size-)related.


MueFab commented Jan 20, 2025

[2025-01-19 18:57:20.000] [info] Running: "basecaller" "/models/dna_r10.4.1_e8.2_400bps_sup@v5.0.0" "." "--modified-bases-models" "/models/dna_r10.4.1_e8.2_400bps_sup@v5.0.0_5mCG_5hmCG@v3" "--device" "cuda:all"

[2025-01-19 18:57:20.148] [info] > Creating basecall pipeline
[2025-01-19 18:57:34.593] [info] cuda:0 using chunk size 11520, batch size 960
[2025-01-19 18:57:34.659] [info] cuda:1 using chunk size 11520, batch size 960
[2025-01-19 18:57:35.792] [info] cuda:0 using chunk size 5760, batch size 960
[2025-01-19 18:57:35.798] [info] cuda:1 using chunk size 5760, batch size 960
[2025-01-20 00:48:50.799] [error] Failed to get read 436 signal: Invalid: Input data failed to decompress using zstd: (18446744073709551552 Allocation error : not enough memory)
terminate called after throwing an instance of 'std::bad_alloc'

I’m encountering the same issue on my end. Certain datasets cause a crash with std::bad_alloc (not enough memory), despite having over 500GB of free RAM and plenty of disk space available. This behavior only occurs with some datasets. I’m still investigating whether there’s a pattern.

@malton-ont
Collaborator

@MueFab - that looks like a bug handling corrupt/invalid data. We're trying to allocate 18446744073709551552 bytes (~18500 PB!) which makes me think we've got a small negative number (that value is uint64_max - 63) being passed to something that isn't expecting it.

Is your dataset both small and something you're able to share with us?
