Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error estimating error rates and plotting quality profiles #2045

Open
Guillermouceda opened this issue Oct 23, 2024 · 6 comments
Open

Error estimating error rates and plotting quality profiles #2045

Guillermouceda opened this issue Oct 23, 2024 · 6 comments

Comments

@Guillermouceda
Copy link

Hi @benjjneb,

I am following the ITS tutorial for ITS-2 dataset generated with Illumina V3 reaction. Everything progressed correctly until I had to learn the error rates in my seqs. The error rates for the forward sequences was calculated correctly with:

errF <- learnErrors(filtFs, multithread = TRUE)

However, when I try to run the same line of code for the reverse reads I get the following error:

> errR <- learnErrors(filtRs_filtered, multithread = TRUE)
402300210 total bases in 1337989 reads from 2 samples will be used for learning the error rates.
Error rates could not be estimated (this is usually because of very few reads).
Error in getErrors(err, enforce = TRUE) : Error matrix is NULL.

I have seen in another issue that this can be caused by "a posteriori" manual assignment of the quality scores. I have inspected some of the quality profiles of my reverse sequences and they look correct. For example:
LM130_R2.pdf

Nonetheless, I am also having issues to plot the quality profile for some of the samples. It seems that some samples are somehow bad, as if some of the quality scores are missing. This is the error I am getting:


> plotQualityProfile(cutRs[2])
Error: BiocParallel errors
  1 remote errors, element index: 1
  0 unevaluated and other errors
  first remote error:
Error in density.default(qscore): 'x' contains missing values
In addition: Warning message:
In serialize(data, node$con) :
  'package:stats' may not be available when loading

I have run dada2 twice on this dadaset. One following the tutorial for ITS and another one following the General Workflow. When using the general Workflow I did not have any problems with the plotQualityProfile() function nor with learning the errors. When using the General Workflow, I first used Cutadapt to cut off the primer sequences as in this issue:

[https://github.com//issues/2038]

I have seen that a way to work around the learnErrors issue can be adding the flag USE_QUALS=FALSE. If I understood correctly in this case the quality scores are not used to calculate the error model and therefore it is less accurate.

How is it possible that learnErrors works with the forward read but not with reverse reads?
Should I be worried about this issue? As i found that solution to overcome this problem I continued down the pipeline

How is possible that I am finding these issues when following the ITS workflow but not the general one?

I would appreciate some guidance in this matter.

Thank you very much in advance.

@benjjneb
Copy link
Owner

How is possible that I am finding these issues when following the ITS workflow but not the general one?

My first guess is that the main difference between the two -- cutadapt -- is contributing to this issue.

If you use a test dataset, perhaps just one sample that throws this error, does running the ITS workflow without the cutadapt part work?

@Guillermouceda
Copy link
Author

Hello @benjjneb,

Thank you for your suggestions.

I have tried to run the plotQualityProfile() function over the samples before the cutadapt step and it works. Nonetheless, the learnErrors function does not work with the samples before cutadapt (i.e., samples filtered for Ns). I have also been exploring a bit the dataset and there is something odd with the sequences. I have found that my sequences are weirdly looking before the cutadapt step. For example:

@VH01456:36:AAFGH3WM5:1:1101:38985:1000 1:N:0:GGACCAGTGG+TAGGCTCGCG
GGCGCTTCTCCGAGCCCACGAGACGGACCAGTGGATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
55CCCCCCCCCCCCCCC5CCCCCCCCCCCCCCCCCCCCCCCCCCC*C*C*CCC***CCCC5CCC*CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
@VH01456:36:AAFGH3WM5:1:1101:40840:1000 1:N:0:GGACCAGTGG+TAGGCTCGCG
GGCGCATCTCCGAGCCCACGAGACGGACCAGTGGATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
C55CC5CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC5C5*CCC*C5CCCCCC*CCCC*CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
@VH01456:36:AAFGH3WM5:1:1101:46218:1000 1:N:0:GGCCCAGTGG+TAGGCTCGCG
GGCGCTTCTCCGAGCCCACGAGACGGACCAGTGGATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
C5CCCCCCCCCCCCCCC*CCCCCCCCCCCCCCCCCCCCCC*CCC*5CCCCC*CC**CCCCCCCCC5*CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
@VH01456:36:AAFGH3WM5:1:1101:47543:1000 1:N:0:GGACCAGTGG+TAGGCTCGCG
GGCACATCTCCGAGCCCACGAGACGGACCAGTGGATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
C*CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC*CCCCCCCCC*CCCCC*C5CCC*CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
@VH01456:36:AAFGH3WM5:1:1101:47771:1000 1:N:0:GGACCAGTGG+TAGGCTCGCG
GTCACATCTCCGAGCCCACGAGACGGACCAGTGGATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
C*5CCCC5CCC5CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC5CCCCC55CCCCCCCCCCCC*CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC5CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC*CCCCCCCCCC

After running cutadapt with the --discard untrimmed flag. Cutadapt discards those weirdly looking sequences:

@VH01456:36:AAFGH3WM5:1:1101:24934:3007 1:N:0:ATGTAACGTT+CAGGAATCGT
TGCAGAATCCCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCATTAGGCCGAGGGCACGTCTGCCTGGGCGTCACACGTCGTTGCCCCCCCAACCCCTTAGGGGGTCGGACGGGACGGATGATGGCCTCCCGTGTGCCCCGTCACGCGGCTGGCATAAATGCCGAGTCCCCGGCGACCAACGCCACGACAATCGGTGGTTGTCAAACCTCGGTGTCCTGTCGTGCGCGGGCGTCTCCGGGGGGCTCTGACATACGCGTGTCGATCATTCGA
+
55CC*5555555C5555CC**C5CCC5C**C5555*55C5555**5555C*555555555*55555555555*5*55555*55555CCC*5C*C5*C5555CCCC*5555C5C5CC*C5C5*55CCC5CCCCC5*CCC5C55CCCC5C*C555CCCCCCCCCCC5C*CCCC*CCC5CCCCC**C5*5CC*C555CC55CCCCC5*C55CCCC5C**C5*C555555555C5*CC5555*5C5C5C*5*CCC*C*55C*55C*55*5*5555555555C5*5
@VH01456:36:AAFGH3WM5:1:1101:20825:4427 1:N:0:ATGTAACGTT+CAGCAATCGT
TGCAGAATCCCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCATTAGGCCGAGGGCACGTCTGCCTGGGCGTCACACGTCGTTGCCCCCCCAACCCCTTCGGGGGTCGGACGGGACGGATGATGGCCTCCCGTGTGCCCCGTCACGCGGCTGGCATAAATGCCGAGTCCCCGGCGACCAACGCCACGACAATCGGTGGTTCTCAAACCTCGGTGTCCTGTCGTGCGCGCGCGTCTCCGGGGCGCTCTCACATACGCGTGTCGATCATTCGA
+
CCCCCCCCCCCCCCCCCCCCCCCCCCC*CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC5CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC5CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC5CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
@VH01456:36:AAFGH3WM5:1:1101:39742:4824 1:N:0:ATGTAACGTT+CAGCAATCGT
TGCAGAATCCCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCATTAGGCCGAGGGCACGTCTGCCTGGGCGTCACACGTCGTTGCCCCCCCAACCCCTTAGGGGGTCGGACGGGACGGATGATGGCCTCCCGTGTGCCCCGTCACGCGGCTGGCATAAATGCCGAGTCCCCGGCGACCAACGCCACGACAATCGGTGGTTGTCAAACCTCGGTGTCCTGTCGTGCGCGCGCGTCTCCGGGGCGCTCTCACATACGCGTGTCGATCATTCGA
+
CCCCCCCCCCCCCCCCCCCCCCCCCCCC5CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC*C5CCCCCCCCCCCCCCCC5CCCCCCCCCCCCCCCCCCCCCCCCCC*
@VH01456:36:AAFGH3WM5:1:1101:72235:8573 1:N:0:ATGTAACGTT+CAGCAATCGT
TGCAGAATCCCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCATTAGGCCGAGGGCACGTCTGCCTGGGCGTCACACGTCGTTGCCCCCCCAACCCCTTAGGGGGTCGGACGGGACGGATGATGGCCTCCCGTGTGCCCCGTCACGCGGCTGGCATAAATGCCGAGTCCCCGGCGACCAACGCCACGACAATCGGTGGTTGTCAAACCTCGGTGTCCTGTCGTGCGCGCGCGTCTCCGGGGCGCTCTCACATACGCGTGTCGATCATTCGA
+
CCCCCCCCCCCCCCCCCCCCC*CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC5CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC*CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC*CCCCCCCCCCCCCCCC*CCCCCCCCCCCCCCCC**CCCCCC5*CC*
@VH01456:36:AAFGH3WM5:1:1101:63658:9538 1:N:0:ATGTAACGTT+CAGCAATCGT
TGCAGAATCCCGTGAACCATCGAGTTTTTGAACGCAAGTTGCGCCCGAAGCCATTCGGCCGAGGGCACGCCTGCCTGGGCGTCACGCATCGCGTCGCCTCCCCATCATATTTATCCTACCAGCAGGTACTCATGGTGAAGGGGGCGGATATTGGTTTCCCGTACTTGTTGCGGTTGGCCTAAAAAGGAGTCCCCTTCGGCGGACACACGATTAGTGGTGGTTGAACAGACCCTCGTCCTTATCTTGTGTCGTGAGCTGCTAGGGAAGCCCTCATCAAAGAC
+
CCCCCCCCCCCCCCCCCCCCCCCC*CCCCCCCCC5CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC*CCCCCCCCCCCCC*5CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC*CCCCCCCCCCCCCCCCCCCCCCCCC

Nonetheless, I have realised that after the cutadapt step there are some very short sequences of about 10 bp in my samples.

@VH01456:36:AAFGH3WM5:1:1101:35235:1189 2:N:0:CACAGCGGTC+CAATAGGAAT
GGCTGTTGGT
+
CCCCCCCCCC

Could be this the issue why i am not able to run plotQualityProfile() and the learnErrors function? The difference between the two cutadapt steps is that in the general workflow I included a flag to keep the sequences between 100 and 350 bp. Like this:

for i in *_R1.fastq.gz
do
FILENAME=basename $i _R1.fastq.gz;
echo $FILENAME;
cutadapt -g ^ATGCGATACTTGGTGTGAAT -G ^GACGCTTCTCCAGACTACAAT
-m 100 -M 350
--pair-filter=both -q 10
--discard-untrimmed
--cores=6
-o ${FILENAME}_R1_trimmed.fastq.gz -p ${FILENAME}_R2_trimmed.fastq.gz
${FILENAME}_R1.fastq.gz ${FILENAME}_R2.fastq.gz
done

However, in the ITS workflow I do not have such a flag:

for(i in seq_along(fnFs)) {
  system2(cutadapt, args = c(
    R1.flags, R2.flags,
    "-n", 2,                      # -n 2 required to remove FWD and REV from reads
    "--discard-untrimmed",        # Add --discard-untrimmed flag
    "-o", fnFs.cut[i], "-p", fnRs.cut[i],  # output files
    fnFs.filtN[i], fnRs.filtN[i]  # input files
  ))
}

Should I combine the two approaches?

Thank you very much for your help

@benjjneb
Copy link
Owner

benjjneb commented Nov 1, 2024

Those poly-G tails you are seeing are probably coming from a two-color Illumina sequencing instrument (e.g. NextSeq), where the lack of a signal is read as a G. This leads to sometimes many reads having long strings of Gs at the end, where due to some other reason the instrument stopped seeing signal from that spot and misinterpreted it as G after G.

If cutadapt is effectively removing those sequences, that can work. ANd then, yes, adding a minimum length cutoff should get rid of the super-low length sequences that could cause problems with downstream steps. You can set that flag in cutadapt, or in filterAndTrim(..., minLen=XXX).

Of note, the dada2 R package also has functionality for identifying and removing "low complexity" sequences, like those with these long polyG tails. plotComplexity will visualize the distribution of sequence complexity in a fastq file, and rm.lowcomplex is a flag in filterAndTrim that can remove sequences with low complexity blocks like those polyG tails.

@Guillermouceda
Copy link
Author

Hello @benjjneb,

Thank you for your suggestions. Cutadapt succesfully removed those low complexity sequences. I have also plot the complexity of one sample as per your suggestion with plotComplexity See the attached images.

I have also manage to run the error models for both forward and reverse sequences. Is the model model for the Rev sequences OK?

Moreover, I have manged to build a the sequence table after the ASV inference step, however, it has a very big number of ASVs (i.e., over 100K). Is this normal ?

> dim(seqtab)
[1]    162 121909

I was thinking of dereplicating the sequences before the ASV inference. I have read on other post that dada function already dereplicates the sequences automatically. However, I have also seen on the big data tutorial that the sequences are dereplicated and the the ASVs are inferred. Would it be correct to dereplicate my sequences right after the the learnErrors step and the infer the ASVs? or Should I directly go to chimera removal?

Thank you very much for your help!!

Guillermo

@jorondo1
Copy link

This was all really helpful ! I'm adapting the ITS pipeline for trnL markers and had the same problem with plotting qc for post-cutadapt sequences. Removing the low complexity sequences using the filterAndTrim flag was not sufficient to resolve this for me, I also had to use the cutadapt flags you suggested for mininmmum length.

@benjjneb
Copy link
Owner

Would it be correct to dereplicate my sequences right after the the learnErrors step and the infer the ASVs? or Should I directly go to chimera removal?

It is preferable to not explicitly dereplicate sequences with derepFastq, but to instead pass the filenames to learnErrors/dada. Dereplication still happens "on-the-fly" within those functions, in a manner which requires less memory usage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants