nbases argument ignored? #2054

Open

hjarnek opened this issue Nov 15, 2024 · 5 comments

@hjarnek

hjarnek commented Nov 15, 2024

The nbases argument in learnErrors is set to 1e8 by default, i.e. 100 million (100,000,000) bases to use for error-rate learning. Still, I often see messages like this:

2304931075 total bases in 6237328 reads from 1 samples will be used for learning the error rates.

The nbases cutoff never seems to be applied; it's always some odd number of bases being used, well over the nbases limit. How is this actually supposed to work?
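For reference, a minimal sketch of the kind of call in question; the directory and file-name pattern below are hypothetical placeholders, not taken from this report:

```r
library(dada2)

# Hypothetical filtered-read paths (placeholders)
filtFs <- list.files("filtered", pattern = "_F_filt.fastq.gz", full.names = TRUE)

# nbases = 1e8 is the default mentioned above
errF <- learnErrors(filtFs, nbases = 1e8, multithread = TRUE)
```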

@benjjneb
Owner

nbases is only applied after a sample is read in. It will not break samples up. So if the first sample has 6237328 reads, each of which is 100+ nts, then nbases=1e8 will kick in and stop learnErrors after that first sample. But it won't cut samples into pieces.
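A quick way to see where that threshold will be crossed is to count the bases each filtered fastq contributes before running learnErrors. This is a sketch assuming the Bioconductor package ShortRead, with a placeholder directory and file-name pattern:

```r
library(ShortRead)

# Count reads and nucleotides per filtered fastq (directory/pattern are placeholders)
counts <- countFastq("filtered", pattern = "_F_filt.fastq.gz")

# Running total of bases: learnErrors keeps adding whole samples
# until this total first exceeds nbases, then stops.
cumsum(counts$nucleotides)
```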

@hjarnek
Author

hjarnek commented Nov 15, 2024

Ok, I see. But wouldn't it be desirable if it did? Break samples up, that is, and actually apply the limit it sets out to apply. What's the recommended way of dealing with really large samples otherwise? It seems like you would have to split the fastq files manually, just for the sake of error-rate learning?
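For what it's worth, one workaround that avoids splitting files by hand could be to randomly subsample the oversized fastq and learn the error rates from that subset. A minimal sketch, assuming the Bioconductor package ShortRead and placeholder file names:

```r
library(ShortRead)
library(dada2)

set.seed(100)  # reproducible subsample

# Draw ~1e6 reads (~1e8 bases at ~100 nt/read) from the oversized sample
sampler <- FastqSampler("bigsample_F_filt.fastq.gz", n = 1e6)
sub <- yield(sampler)
close(sampler)
writeFastq(sub, "bigsample_F_filt.sub.fastq.gz")

# Learn error rates from the subsample only
errF <- learnErrors("bigsample_F_filt.sub.fastq.gz", multithread = TRUE)
```

The resulting errF can then still be passed to dada() via its err argument when denoising the full, unsubsampled files.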

@benjjneb
Owner

> Ok, I see. But wouldn't it be desirable if it did? Break samples up, that is, and actually apply the limit it sets out to apply.

In some cases yes, in others no. If a single sample is larger than the nbases limit, it would be desirable to subsample that sample. If multiple samples are being used, it is preferred to keep full samples and go over the nbases limit. The first case has become more relevant with the rise of higher-throughput technologies like NovaSeq.

@hjarnek
Author

hjarnek commented Nov 26, 2024

Ok. It would be good to have such a limit, yes :)

I'm curious, though, why it's not desirable when multiple samples are read in. Does the error-learning algorithm take into account which sample each read belongs to? I thought it pooled all the reads it used to learn the error rates.

@benjjneb
Owner

benjjneb commented Dec 2, 2024

> I thought it pooled all the reads it used to learn the error rates.

The samples themselves are not pooled. The error rates learned from each sample (individually) are averaged across samples, but only in that sense is there "pooling" by default in learnErrors.
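To make that distinction concrete, here is a purely illustrative sketch of element-wise averaging of per-sample error matrices; the matrices and their dimensions are made up and this is not the actual dada2 internals:

```r
# Two hypothetical per-sample error-rate matrices with matching dimensions
err_sample1 <- matrix(runif(16 * 41), nrow = 16)
err_sample2 <- matrix(runif(16 * 41), nrow = 16)
per_sample <- list(err_sample1, err_sample2)

# "Averaged across samples": element-wise mean of the per-sample matrices
err_consensus <- Reduce(`+`, per_sample) / length(per_sample)
```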
