Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What do the filtN files do? #2072

Open
kdbchau opened this issue Jan 8, 2025 · 7 comments
Open

What do the filtN files do? #2072

kdbchau opened this issue Jan 8, 2025 · 7 comments

Comments

@kdbchau
Copy link

kdbchau commented Jan 8, 2025

I am confused about the pre-filtering step. After filtering the files for ambiguous sequences, these files are placed in the filtN folder. Can I assume that these are sequences removed from the original fastq files or are they just highlighted - as in - when running primerHits it is just looking at the filtered files to see where those ambiguous sequences are in the original files?

I am just confused because I run cutadapt in terminal on my fastq files but I don't see how it is using the filtered files in filtN.

@benjjneb
Copy link
Owner

benjjneb commented Jan 8, 2025

The fastq files in the filtN directory (assuming you are running the ITS tutorial workflow) are the sets of sequences from each sample that had no ambiguous nucleotides in them. So they are a subset of the original fastq files, retaining only reads with no Ns.

@kdbchau
Copy link
Author

kdbchau commented Jan 8, 2025

Oh in my case those files were much smaller....were those the ones I should have used for further analysis? I ended up using the original files and got pretty decent results from those. For example, one sample has 12154 seqs, but the filtered file has only 2...Almost all are like these.

@kdbchau
Copy link
Author

kdbchau commented Jan 8, 2025

I think I am not understanding something because the primerHits values I get, and when I compare to the number of sequences in my files don't quite match up.

library(ShortRead)
#packageVersion("ShortRead")
library(dada2)
#packageVersion("dada2")
library(Biostrings)
#packageVersion("Biostrings")


rm(list=ls())
path <- "~/../Desktop/PostDoc/ALL_BEES_ITS1/"
list.files(path)

# cutadapt is initalized as a virtualenv in the folder, using Miniforge; run following commands within the bash shell
# conda activate cutadapt
# cutadapt --version  # says version 4.9, and (cutadapt) is shown in terminal

myfiles <- sort(list.files(path, pattern=".fastq", full.names=TRUE)) # because these are single-end reads I think?

# Identify primers
FWD <- "CTTGGTCATTTAGAGGAAGTAA"
REV <- "GCTGCGTTCTTCATCGATGC"

# Verify presence and orientation of the primers on the reads

allOrients <- function(primer) {
  # Create all orientations of the input sequence
  require(Biostrings)
  dna <- DNAString(primer)  # The Biostrings works w/ DNAString objects rather than character vectors
  orients <- c(Forward = dna, Complement = Biostrings::complement(dna), Reverse = Biostrings::reverse(dna),
               RevComp = Biostrings::reverseComplement(dna))
  return(sapply(orients, toString))  # Convert back to character vector
}

FWD.orients <- allOrients(FWD); FWD.orients
REV.orients <- allOrients(REV); REV.orients

# Pre-filter; Remove ambiguous Ns in sequencing reads
myfiles.filtN <- file.path(path, "filtN", basename(myfiles)) # Put N-filtered files in filtN subdirectory (create this beforehand)
filterAndTrim(myfiles, myfiles.filtN, maxN = 0, multithread = TRUE)


# Count how many times primers appear in the reads, considering all possible primer orientations
primerHits <- function(primer, fn) {
  # Counts number of reads in which the primer is found
  nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
  return(sum(nhits > 0))
}
 #Example sample 100
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = myfiles.filtN[[100]]), 
      FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = myfiles.filtN[[100]]), 
      REV.ForwardReads = sapply(REV.orients, primerHits, fn = myfiles.filtN[[100]]), 
      REV.ReverseReads = sapply(REV.orients, primerHits, fn = myfiles.filtN[[100]]))

The above will return

                 Forward Complement Reverse RevComp
FWD.ForwardReads       0          0       0       0
FWD.ReverseReads       0          0       0       0
REV.ForwardReads       0          0       0   12169
REV.ReverseReads       0          0       0   12169

I continue with my script in R and then in terminal as follows:

# In some cases the FWD primer is matching to FWD.Forward/ReverseReads in Forward orientation
# the REV primer is matching to REV.Forward/ReverseReads in its RevComp
FWD <- FWD.orients[["Forward"]]; FWD
REV <- REV.orients[["RevComp"]]; REV

### IN TERMINAL because my R cannot find my cutadapt, I ran the following on the ORIGINAL fastq files:
for files in *.fastq; do /home/chauk/.local/bin/cutadapt -a CTTGGTCATTTAGAGGAAGTAA...GCATCGATGAAGAACGCAGC -n 2 -o output/${files%%.fastq}_cut.fastq $files; done

## Continue in R
path.cut <- "~/../Desktop/PostDoc/ALL_BEES_ITS1/output/"
myfiles.cut <- file.path(path.cut, paste0(strsplit(basename(myfiles),".fastq"),"_cut.fastq"))

# Sanity check
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = myfiles.cut[[100]]), 
      FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = myfiles.cut[[100]]), 
      REV.ForwardReads = sapply(REV.orients, primerHits, fn = myfiles.cut[[100]]), 
      REV.ReverseReads = sapply(REV.orients, primerHits, fn = myfiles.cut[[100]]))

Above returns

                 Forward Complement Reverse RevComp
FWD.ForwardReads       0          0       0       0
FWD.ReverseReads       0          0       0       0
REV.ForwardReads       0          0       0       0
REV.ReverseReads       0          0       0       0

If I check how many sequences are in the original file for sample 100, and compare it to the filtN file I get:

Original = 12,586 seqs
FiltN = 7 seqs

But in the first primerHits command, it was returning 121,169 RevComp primer hits. Does that mean the primer is just matching to several parts within the same sequence? And if most of my sequences do contain ambiguous reads it should be alright to continue using them? Otherwise I am getting samples going from 1000s of reads to 10 or less.

@benjjneb
Copy link
Owner

benjjneb commented Jan 9, 2025

A couple things:

You appear to have single-end data, not paired-end data, so there are no Forward reads/Reverse reads here. A fixed version of the rbind line for single-end data would be something like:

rbind(FWD.primer = sapply(FWD.orients, primerHits, fn = myfiles.cut[[100]]), 
      REV.primer = sapply(REV.orients, primerHits, fn = myfiles.cut[[100]])

But in the first primerHits command, it was returning 121,169 RevComp primer hits.

The value was 12,169, which is close to the input number of reads (12,586). This indicates that in almost all reads the reverse-complement orientation of the REV primer is being detected, which is expected if the sequenced amplicon is usually shorter than your read length. So that is fairy normal.

The main issue seems to be that almost all your reads contain N(s). While this can be worked around at the primer removal stage, the main dada2 denoising method does not allow for Ns. What does plotQualityProfile look like for an example sample? What are the first couple sequences from an example sample?

@kdbchau
Copy link
Author

kdbchau commented Jan 9, 2025

I do carry out the rest of my analysis after trimming the primers in qiime2 and I think everything seemed fine?

But here is one plot of my cut.fastq files (not using the filtered files).

image

Some additional samples:

image

@benjjneb
Copy link
Owner

benjjneb commented Jan 9, 2025

That looks fine. My guess is that the Ns are happening in your reads after the reverse primer is detected, and so are being truncated off the reads by cutadapt, in which case you are fine to move forward.

@kdbchau
Copy link
Author

kdbchau commented Jan 9, 2025

Amazing, thanks so much for your help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants