nbases argument ignored? #2054

Open

hjarnek opened this issue Nov 15, 2024 · 5 comments

@hjarnek

hjarnek commented Nov 15, 2024

The nbases argument in learnErrors is set to 1e8 by default, i.e. 100 million (100,000,000) bases to use for error-rate learning. Still, I often see messages like this:

2304931075 total bases in 6237328 reads from 1 samples will be used for learning the error rates.

The nbases cutoff never seems to be applied; it's always some odd number of bases being used, well over the nbases limit. How is this actually supposed to work?
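For reference, a minimal sketch of the kind of call in question; the directory and file-name pattern below are hypothetical placeholders, not taken from this report:

```r
library(dada2)

# Hypothetical filtered-read paths (placeholders)
filtFs <- list.files("filtered", pattern = "_F_filt.fastq.gz", full.names = TRUE)

# nbases = 1e8 is the default mentioned above
errF <- learnErrors(filtFs, nbases = 1e8, multithread = TRUE)
```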

@benjjneb
Owner

nbases is only applied after a sample is read in. It will not break samples up. So if the first sample has 6237328 reads, each of which is 100+ nts, then nbases=1e8 will kick in and stop learnErrors after that first sample. But it won't cut samples into pieces.
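A quick way to see where that threshold will be crossed is to count the bases each filtered fastq contributes before running learnErrors. This is a sketch assuming the Bioconductor package ShortRead, with a placeholder directory and file-name pattern:

```r
library(ShortRead)

# Count reads and nucleotides per filtered fastq (directory/pattern are placeholders)
counts <- countFastq("filtered", pattern = "_F_filt.fastq.gz")

# Running total of bases: learnErrors keeps adding whole samples
# until this total first exceeds nbases, then stops.
cumsum(counts$nucleotides)
```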

@hjarnek
Author

hjarnek commented Nov 15, 2024

Ok, I see. But wouldn't it be desirable if it did? Break samples up, that is, and actually apply the limit it sets out to apply. What's the recommended way of dealing with really large samples otherwise? It seems like you would have to split the fastq files manually, just for the sake of error-rate learning?
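For what it's worth, one workaround that avoids splitting files by hand could be to randomly subsample the oversized fastq and learn the error rates from that subset. A minimal sketch, assuming the Bioconductor package ShortRead and placeholder file names:

```r
library(ShortRead)
library(dada2)

set.seed(100)  # reproducible subsample

# Draw ~1e6 reads (~1e8 bases at ~100 nt/read) from the oversized sample
sampler <- FastqSampler("bigsample_F_filt.fastq.gz", n = 1e6)
sub <- yield(sampler)
close(sampler)
writeFastq(sub, "bigsample_F_filt.sub.fastq.gz")

# Learn error rates from the subsample only
errF <- learnErrors("bigsample_F_filt.sub.fastq.gz", multithread = TRUE)
```

The resulting errF can then still be passed to dada() via its err argument when denoising the full, unsubsampled files.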

@benjjneb
Owner

> Ok, I see. But wouldn't it be desirable if it did? Break samples up, that is, and actually apply the limit it sets out to apply.

In some cases yes, in others no. If a single sample is larger than the nbases limit, it would be desirable to subsample that sample. If multiple samples are being used, it is preferred to keep full samples and go over the nbases limit. The first case has become more relevant with the rise of higher-throughput technologies like NovaSeq.

@hjarnek
Author

hjarnek commented Nov 26, 2024

Ok. It would be good to have such a limit, yes :)

I'm curious, though, why it's not desirable when multiple samples are read in. Does the error-learning algorithm take into account which sample each read belongs to? I thought it pooled all the reads it used to learn the error rates.

@benjjneb
Owner

benjjneb commented Dec 2, 2024

> I thought it pooled all the reads it used to learn the error rates.

The samples themselves are not pooled. The error rates learned from each sample (individually) are averaged across samples, but only in that sense is there "pooling" by default in learnErrors.
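To make that distinction concrete, here is a purely illustrative sketch of element-wise averaging of per-sample error matrices; the matrices and their dimensions are made up and this is not the actual dada2 internals:

```r
# Two hypothetical per-sample error-rate matrices with matching dimensions
err_sample1 <- matrix(runif(16 * 41), nrow = 16)
err_sample2 <- matrix(runif(16 * 41), nrow = 16)
per_sample <- list(err_sample1, err_sample2)

# "Averaged across samples": element-wise mean of the per-sample matrices
err_consensus <- Reduce(`+`, per_sample) / length(per_sample)
```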
