phred score mis-detection by dada2 #773
Comments
UPD - another place where qualities should be specified is DADA2_FILTNTRIM. I am also unsure whether qualityType should be specified in DADA2_QUALITY - the main function
Thanks for reporting, and thanks for the update. The main developer is on vacation, so nothing will happen for a few weeks. We might come back to you with further questions when he's back.
And another small update - unfortunately I won't be able to finish the test run. The reverse reads in my dataset had very low qualities, so they did not pass DADA2_DENOISING (the 1_2.dada.rds file was not generated). As a result, the DADA2_STATS module failed with
Hi there,
that sounds like a good addition to me. However, test data (that could also be uploaded to our test data repo https://github.com/nf-core/test-datasets/tree/ampliseq) would be extremely important. DADA2_QUALITY is primarily for visualization and also helps to find length cutoffs for DADA2_FILTNTRIM, so it is not surprising that DADA2_QUALITY has no quality type setting.
Hi @d4straub - it seems that DADA2_QUALITY also auto-infers the Phred score, but in a different way - I'd be happy to provide some test data, but for now I only have forward reads of decent quality, while all reverse reads have low quality - I'll need to either generate fake Phred scores or wait for better data to appear. Please also keep in mind that my data is not compatible with downstream applications like QIIME, so it might be easier to replace a few Phred chars corresponding to base qualities between 41-55 in your existing test datasets
If only forward reads are ok, isn't that also sufficient for implementing phred score flexibility? I would think so.
Hi @d4straub, apologies for the delay. It seems that we might get proper paired end data soon. I'd prefer to provide that rather than just single end data - our usage of ampliseq relies heavily on the paired reads. Would this extra delay work for you? If not, I could introduce some higher base quality values, e.g. into https://github.com/nf-core/test-datasets/blob/ampliseq/testdata/1_S103_L001_R1_001.fastq.gz and its reverse counterpart - this should be an accurate enough representation of ElemBio data...
Hi @amakunin , real data would be much better, then we could use a small subset for continuous integration testing. Please post an update if you are able to share data.
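For reference, the idea of introducing higher base quality values into an existing Phred+33 FASTQ file could look roughly like the Python sketch below. The function names, the threshold of Q37, the boost of 10 and the cap of Q55 are all hypothetical choices for illustration; this is not part of ampliseq or DADA2, and synthetic qualities like these only mimic the value range of ElemBio data.

```python
def boost_quality(quality, threshold=37, boost=10, cap=55, offset=33):
    """Raise every base quality >= threshold by `boost`, capped at `cap`.

    Assumes the input string is Phred+33 encoded; all defaults are
    arbitrary illustration values, not taken from any real dataset.
    """
    out = []
    for ch in quality:
        q = ord(ch) - offset
        if q >= threshold:
            q = min(q + boost, cap)
        out.append(chr(q + offset))
    return "".join(out)


def boost_fastq(lines, **kwargs):
    """Apply boost_quality to every 4th line (the quality line) of a FASTQ.

    `lines` is a list of FASTQ lines without trailing newlines, and the
    file is assumed to use strict 4-line records.
    """
    return [
        boost_quality(line, **kwargs) if i % 4 == 3 else line
        for i, line in enumerate(lines)
    ]


# Made-up record: qualities Q40, Q37, Q2, Q41 in Phred+33
record = ["@read1", "ACGT", "+", "IF#J"]
print(boost_fastq(record))
```

Low-quality bases stay untouched, so the overall quality profile of the read is preserved while the top of the range moves above Q41.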
I had a more detailed look at the code. Previously
if that works as expected, only some test data is still missing to make it perfect.
Dear @amakunin , could you test the solution as described above? I would like to have that solved or abandoned.
Dear @d4straub - thanks for providing the update and pinging me, I'll aim to test this asap
Thanks so much @d4straub , the updated workflow works as expected and produces reasonable DADA2 outputs and QC plots with base qualities around 45-46. I've prepared test data of 1523 read pairs that do contain a lot of Q45 bases and do end up reconstructing around 35 consensus sequences - these are from multiplex amplicon sequencing. Should I submit these data as a pull request in the test data repo https://github.com/nf-core/test-datasets/tree/ampliseq or maybe just post here?
Thanks!
A PR would be great, but if that's too much then just post them here.
Description of the bug
I’ve been testing ampliseq on some Element Biosciences data using standard Illumina settings and discovered an interesting issue where DADA2_ERR fails with an error that looks like this
It turns out that the readFastq function called under the hood misinterprets ElemBio Phred scores as Phred+64 rather than Phred+33 - probably because the maximum score in ElemBio data is 55, while for Illumina it is 41.
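To illustrate why such a mis-detection is plausible, here is a minimal Python sketch of a naive range-based offset guess. The `guess_offset` heuristic and the `assumed_max_q` parameter are hypothetical and written only for this example; they are not the logic that readFastq or DADA2 actually use.

```python
def decode(quality, offset):
    """Convert a FASTQ quality string to integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def guess_offset(quality_strings, assumed_max_q=41):
    """Naive range-based guess of the ASCII offset (33 or 64).

    Hypothetical heuristic for illustration only; not ShortRead's code.
    Returns None when the observed characters fit both encodings.
    """
    codes = [ord(ch) for q in quality_strings for ch in q]
    lowest, highest = min(codes), max(codes)
    if lowest < 64:
        return 33  # characters below '@' cannot occur in Phred+64
    if highest > 33 + assumed_max_q:
        return 64  # too high for Phred+33 under the assumed maximum
    return None


# Made-up read containing only Q45 ('N') and Q55 ('X') in Phred+33
elembio_like = ["NNNXXXNN"]
print(guess_offset(elembio_like))                    # assumes Q <= 41
print(guess_offset(elembio_like, assumed_max_q=55))  # allows Q up to 55
print(decode("X", 33), decode("X", 64))
```

With the Illumina-style assumption the guess is 64, and with a maximum of 55 it becomes ambiguous (`None`); the same character `X` decodes to Q55 under Phred+33 but Q24 under Phred+64. Setting the quality type explicitly sidesteps this kind of guesswork.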
I was able to work around this by changing
qualityType = "Auto"
to
qualityType = "FastqQuality"
in the DADA2_ERR module config.
In addition, I needed to adjust the DADA2_DENOISING code to use the appropriate qualityType when reading fastq files:
I would appreciate it if qualityType could be set explicitly in the DADA2_ERR and DADA2_DENOISING modules - though I’m not sure if this should be:
Command used and terminal output
Relevant files
No response
System information
linux x86_64
nextflow =24.04.02
ampliseq =2.11.0
singularity =3.10.0
local executor