Error estimating error rates and plotting quality profiles #2045
My first guess is that the main difference between the two workflows -- cutadapt -- is contributing to this issue. If you use a test dataset, perhaps just one sample that throws this error, does running the ITS workflow without the cutadapt part work?
Hello @benjjneb, thank you for your suggestions. I have tried to run the
After running cutadapt with the
Nonetheless, I have realised that after the cutadapt step there are some very short sequences of about 10 bp in my samples.
Could this be the reason why I am not able to run
However, in the ITS workflow I do not have such a flag:
Should I combine the two approaches? Thank you very much for your help.
Those poly-G tails you are seeing are probably coming from a two-color Illumina sequencing instrument (e.g. NextSeq), where the lack of a signal is read as a G. This can leave many reads with long strings of Gs at the end: for some reason the instrument stopped seeing signal from that spot and misinterpreted it as G after G. If cutadapt is effectively removing those sequences, that can work. And then, yes, adding a minimum length cutoff should get rid of the super-short sequences that could cause problems with downstream steps. You can set that flag in cutadapt, or in the filtering step. Of note, the dada2 R package also has functionality for identifying and removing "low complexity" sequences, like those with these long poly-G tails.
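As a rough sketch of that advice (file paths and thresholds are hypothetical; check `?filterAndTrim` and `?plotComplexity` for your dada2 version, as the low-complexity options were added in later releases):

```r
library(dada2)

fn <- "LM130_R2.fastq.gz"  # hypothetical path to one sample's reverse reads

# Visualize per-read kmer complexity; poly-G reads show up as a
# low-complexity peak near zero
plotComplexity(fn)

# Filter, dropping reads below a minimum length and below a
# kmer-complexity threshold (both values here are illustrative only)
out <- filterAndTrim(fn, "filtered/LM130_R2_filt.fastq.gz",
                     minLen = 50, rm.lowcomplex = 8,
                     multithread = TRUE)
```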
Hello @benjjneb, thank you for your suggestions. Cutadapt successfully removed those low-complexity sequences. I have also plotted the complexity of one sample as per your suggestion with
I have also managed to run the error models for both forward and reverse sequences. Does the error model for the reverse sequences look OK?
Moreover, I have managed to build the sequence table after the ASV-inference step; however, it has a very large number of ASVs (over 100K). Is this normal?
I was thinking of dereplicating the sequences before the ASV inference. I have read in another post that
Thank you very much for your help!! Guillermo
This was all really helpful! I'm adapting the ITS pipeline for trnL markers and had the same problem plotting QC for post-cutadapt sequences. Removing the low-complexity sequences using the filterAndTrim flag was not sufficient to resolve this for me; I also had to use the cutadapt flags you suggested for minimum length.
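The combination described above can be sketched roughly as follows, with cutadapt invoked from R as in the ITS workflow (binary path, primers, and file names are all hypothetical; cutadapt's `-m`/`--minimum-length` option discards reads shorter than the given length after trimming):

```r
library(dada2)

cutadapt <- "/usr/local/bin/cutadapt"  # assumption: path to the cutadapt binary
FWD <- "TCCGTAGGTGAACCTGCGG"           # assumption: example forward primer
REV <- "GCTGCGTTCTTCATCGATGC"          # assumption: example reverse primer

# Trim primers and drop the ~10 bp fragments in one cutadapt pass
system2(cutadapt, args = c(
  "-g", FWD, "-G", REV,
  "-m", "50",                          # minimum read length after trimming
  "-o", "cut/S1_R1.fastq.gz", "-p", "cut/S1_R2.fastq.gz",
  "S1_R1.fastq.gz", "S1_R2.fastq.gz"))

# Then filter in dada2, additionally removing low-complexity reads
out <- filterAndTrim("cut/S1_R1.fastq.gz", "filt/S1_R1.fastq.gz",
                     "cut/S1_R2.fastq.gz", "filt/S1_R2.fastq.gz",
                     minLen = 50, rm.lowcomplex = 8,
                     multithread = TRUE)
```

Applying the length cutoff in both places is redundant but harmless; the cutadapt `-m` flag is what guarantees no ultra-short reads reach the quality-profile plotting step.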
It is preferable to not explicitly dereplicate sequences with
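As a sketch of skipping the explicit dereplication step (file paths hypothetical): recent dada2 versions accept fastq filenames directly in `learnErrors()` and `dada()`, which then dereplicate internally, so a separate dereplication call is not needed.

```r
library(dada2)

# Hypothetical filtered forward-read files
filtFs <- list.files("filtered", pattern = "_R1_filt.fastq.gz",
                     full.names = TRUE)

# Both functions take file paths and handle dereplication internally
errF <- learnErrors(filtFs, multithread = TRUE)
ddF  <- dada(filtFs, err = errF, multithread = TRUE)
```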
Hi @benjjneb,
I am following the ITS tutorial for an ITS2 dataset generated with Illumina V3 chemistry. Everything progressed correctly until I had to learn the error rates of my sequences. The error rates for the forward sequences were calculated correctly with:
errF <- learnErrors(filtFs, multithread = TRUE)
However, when I try to run the same line of code for the reverse reads I get the following error:
I have seen in another issue that this can be caused by "a posteriori" manual assignment of the quality scores. I have inspected some of the quality profiles of my reverse sequences and they look correct. For example:
LM130_R2.pdf
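For reference, a profile like the one attached can be produced with something like the following (file path hypothetical):

```r
library(dada2)

# Plot the per-cycle quality profile for one sample's reverse reads
plotQualityProfile("LM130_R2.fastq.gz")
```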
Nonetheless, I am also having trouble plotting the quality profile for some of the samples. It seems that some samples are somehow bad, as if some of the quality scores are missing. This is the error I am getting:
I have run dada2 twice on this dataset: once following the tutorial for ITS and once following the general workflow. When using the general workflow I did not have any problems with the `plotQualityProfile()` function nor with learning the errors. In the general workflow, I first used cutadapt to cut off the primer sequences as in issue #2038.
I have seen that a way to work around the `learnErrors` issue can be adding the flag `USE_QUALS=FALSE`. If I understood correctly, in this case the quality scores are not used to calculate the error model, and it is therefore less accurate.
How is it possible that `learnErrors` works with the forward reads but not with the reverse reads? Should I be worried about this issue? As I found that solution to overcome this problem, I continued down the pipeline.
How is it possible that I am finding these issues when following the ITS workflow but not the general one?
I would appreciate some guidance in this matter.
Thank you very much in advance.