I am currently refining a basecalling model and am running into memory-management issues in the basecalling step that pre-processes the training data.
Specifically, I have a folder of pod5 files totalling ~24 GB, which I am passing to `bonito basecaller` in the following way:

`bonito basecaller dna_r9.4.1_e8_hac@v3.3 --save-ctc --min-accuracy-save-ctc 0.9 -v --alignment-threads 10 --device 'cuda' --reference ~/Documents/genomes/T7_V01146.1.fasta ./T7/pod5s/ > ./T7/bonito_mapped_hac_ctc/basecalls_ctc.bam`
However, after the initial basecalling, the process is killed because it exhausts the available RAM on my machine.
For now, I am subsetting the initial data, but this is not ideal because it discards potentially useful signal that could otherwise be used in the training step. It appears to me that `bonito train` only accepts a single `--directory`, so splitting the basecalling up by pod5 file or similar would not work either. Is there an alternative approach?
Thanks in advance for your input.
All the best,
Falko Noé
You could run bonito basecaller on separate directories, merge the resulting .npy files afterwards, and then train on the merged set. There's already a post with the code to merge .npy files.
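For illustration, here is a minimal sketch of what such a merge could look like. It is not the code from the post mentioned above. It assumes each basecalling run left `chunks.npy`, `references.npy` and `reference_lengths.npy` in its own directory, that all runs used the same chunk length, and that `references.npy` is a 2-D array zero-padded on the right. The function name `merge_ctc_dirs` and the directory names in the usage note are hypothetical.

```python
from pathlib import Path

import numpy as np


def merge_ctc_dirs(input_dirs, output_dir):
    """Concatenate per-batch training arrays into a single directory.

    Assumed layout (not verified against any particular bonito version):
    each input directory holds chunks.npy, references.npy and
    reference_lengths.npy, with one row per training chunk.
    """
    chunks, refs, lengths = [], [], []
    for d in map(Path, input_dirs):
        chunks.append(np.load(d / "chunks.npy"))
        refs.append(np.load(d / "references.npy"))
        lengths.append(np.load(d / "reference_lengths.npy"))

    # Each run may have padded its references to a different width,
    # so right-pad every batch with zeros up to the widest one.
    width = max(r.shape[1] for r in refs)
    refs = [np.pad(r, ((0, 0), (0, width - r.shape[1]))) for r in refs]

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    np.save(out / "chunks.npy", np.concatenate(chunks))
    np.save(out / "references.npy", np.concatenate(refs))
    np.save(out / "reference_lengths.npy", np.concatenate(lengths))
    return out
```

Calling something like `merge_ctc_dirs(["batch_0", "batch_1"], "merged")` would then give a single directory to pass to `bonito train --directory`. Note that this sketch loads every batch into memory at once, so the filtered training arrays still have to fit in RAM together.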