bonito basecall model refinement preprocessing memory issues #361

Open
CodingKaiser opened this issue Aug 28, 2023 · 1 comment

@CodingKaiser

Hello,

I am currently refining a basecalling model and am running into memory problems during the preprocessing basecalling step.

Specifically, I have a folder of pod5 files totalling ~24 GB that I pass to bonito basecaller as follows:
bonito basecaller dna_r9.4.1_e8_hac@v3.3 --save-ctc --min-accuracy-save-ctc 0.9 -v --alignment-threads 10 --device 'cuda' --reference ~/Documents/genomes/T7_V01146.1.fasta ./T7/pod5s/ > ./T7/bonito_mapped_hac_ctc/basecalls_ctc.bam

However, after the initial basecalling, the process is killed because it exhausts the available RAM on my machine.

> reading pod5
> outputting aligned bam
> loading model dna_r9.4.1_e8_hac@v3.3
> model basecaller params: {'batchsize': 512, 'chunksize': 10000, 'overlap': 500, 'quantize': None}
> loading reference
> calling: 1290710 reads [59:34, 361.04 reads/s]Killed

For now, I am subsetting the input data, but this is not ideal because it discards potentially useful signal from the training step. It appears that bonito train accepts only a single --directory, so splitting the basecalling by pod5 file or similar would not work either. Is there an alternative approach?

Thanks in advance for your input.

All the best,
Falko Noé

@andrewgalbraith21

Hello Falko,

You could run bonito basecaller on separate directories, merge the resulting .npy files afterwards, and then train on the merged directory. There's already a post with code to merge .npy files.
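
For illustration, a minimal sketch of such a merge in Python with NumPy. This is not the code from the post mentioned above. It assumes that each --save-ctc output directory contains chunks.npy, references.npy and reference_lengths.npy, that all runs used the same chunk size, and that references are zero-padded to the longest reference of each run; please check these assumptions against your own output. The function names are made up for this example.

```python
from pathlib import Path

import numpy as np

# Assumed file names of the training arrays written by --save-ctc.
CTC_FILES = ("chunks.npy", "references.npy", "reference_lengths.npy")


def pad_to_width(array, width):
    """Right-pad a 2-D array with zeros so that it has `width` columns."""
    if array.shape[1] == width:
        return array
    padded = np.zeros((array.shape[0], width), dtype=array.dtype)
    padded[:, : array.shape[1]] = array
    return padded


def merge_ctc_dirs(input_dirs, output_dir):
    """Concatenate the per-run arrays along the first axis and save them."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for name in CTC_FILES:
        arrays = [np.load(Path(d) / name) for d in input_dirs]
        if name == "references.npy":
            # Each run may be padded to a different maximum reference
            # length, so bring all runs to the widest one first
            # (assumption: 0 is the padding value).
            width = max(a.shape[1] for a in arrays)
            arrays = [pad_to_width(a, width) for a in arrays]
        np.save(output_dir / name, np.concatenate(arrays, axis=0))


# Example call with hypothetical directory names:
# merge_ctc_dirs(["./T7/ctc_part1", "./T7/ctc_part2"], "./T7/ctc_merged")
```

The merged directory can then be used as the single --directory for bonito train. Note that np.concatenate holds all arrays in memory at once, so for very large training sets the merge itself may need to be done in pieces.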
