What do the filtN files do? #2072

kdbchau · 2025-01-08T16:49:10Z

I am confused about the pre-filtering step. After filtering the files for ambiguous sequences, these files are placed in the filtN folder. Can I assume that these are sequences removed from the original fastq files or are they just highlighted - as in - when running primerHits it is just looking at the filtered files to see where those ambiguous sequences are in the original files?

I am just confused because I run cutadapt in terminal on my fastq files but I don't see how it is using the filtered files in filtN.

benjjneb · 2025-01-08T17:55:46Z

The fastq files in the filtN directory (assuming you are running the ITS tutorial workflow) are the sets of sequences from each sample that had no ambiguous nucleotides in them. So they are a subset of the original fastq files, retaining only reads with no Ns.

kdbchau · 2025-01-08T18:01:02Z

Oh in my case those files were much smaller....were those the ones I should have used for further analysis? I ended up using the original files and got pretty decent results from those. For example, one sample has 12154 seqs, but the filtered file has only 2...Almost all are like these.

kdbchau · 2025-01-08T18:11:31Z

I think I am not understanding something because the primerHits values I get, and when I compare to the number of sequences in my files don't quite match up.

library(ShortRead)
#packageVersion("ShortRead")
library(dada2)
#packageVersion("dada2")
library(Biostrings)
#packageVersion("Biostrings")


rm(list=ls())
path <- "~/../Desktop/PostDoc/ALL_BEES_ITS1/"
list.files(path)

# cutadapt is initalized as a virtualenv in the folder, using Miniforge; run following commands within the bash shell
# conda activate cutadapt
# cutadapt --version  # says version 4.9, and (cutadapt) is shown in terminal

myfiles <- sort(list.files(path, pattern=".fastq", full.names=TRUE)) # because these are single-end reads I think?

# Identify primers
FWD <- "CTTGGTCATTTAGAGGAAGTAA"
REV <- "GCTGCGTTCTTCATCGATGC"

# Verify presence and orientation of the primers on the reads

allOrients <- function(primer) {
  # Create all orientations of the input sequence
  require(Biostrings)
  dna <- DNAString(primer)  # The Biostrings works w/ DNAString objects rather than character vectors
  orients <- c(Forward = dna, Complement = Biostrings::complement(dna), Reverse = Biostrings::reverse(dna),
               RevComp = Biostrings::reverseComplement(dna))
  return(sapply(orients, toString))  # Convert back to character vector
}

FWD.orients <- allOrients(FWD); FWD.orients
REV.orients <- allOrients(REV); REV.orients

# Pre-filter; Remove ambiguous Ns in sequencing reads
myfiles.filtN <- file.path(path, "filtN", basename(myfiles)) # Put N-filtered files in filtN subdirectory (create this beforehand)
filterAndTrim(myfiles, myfiles.filtN, maxN = 0, multithread = TRUE)


# Count how many times primers appear in the reads, considering all possible primer orientations
primerHits <- function(primer, fn) {
  # Counts number of reads in which the primer is found
  nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
  return(sum(nhits > 0))
}
 #Example sample 100
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = myfiles.filtN[[100]]), 
      FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = myfiles.filtN[[100]]), 
      REV.ForwardReads = sapply(REV.orients, primerHits, fn = myfiles.filtN[[100]]), 
      REV.ReverseReads = sapply(REV.orients, primerHits, fn = myfiles.filtN[[100]]))

The above will return

                 Forward Complement Reverse RevComp
FWD.ForwardReads       0          0       0       0
FWD.ReverseReads       0          0       0       0
REV.ForwardReads       0          0       0   12169
REV.ReverseReads       0          0       0   12169

I continue with my script in R and then in terminal as follows:

# In some cases the FWD primer is matching to FWD.Forward/ReverseReads in Forward orientation
# the REV primer is matching to REV.Forward/ReverseReads in its RevComp
FWD <- FWD.orients[["Forward"]]; FWD
REV <- REV.orients[["RevComp"]]; REV

### IN TERMINAL because my R cannot find my cutadapt, I ran the following on the ORIGINAL fastq files:
for files in *.fastq; do /home/chauk/.local/bin/cutadapt -a CTTGGTCATTTAGAGGAAGTAA...GCATCGATGAAGAACGCAGC -n 2 -o output/${files%%.fastq}_cut.fastq $files; done

## Continue in R
path.cut <- "~/../Desktop/PostDoc/ALL_BEES_ITS1/output/"
myfiles.cut <- file.path(path.cut, paste0(strsplit(basename(myfiles),".fastq"),"_cut.fastq"))

# Sanity check
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = myfiles.cut[[100]]), 
      FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = myfiles.cut[[100]]), 
      REV.ForwardReads = sapply(REV.orients, primerHits, fn = myfiles.cut[[100]]), 
      REV.ReverseReads = sapply(REV.orients, primerHits, fn = myfiles.cut[[100]]))

Above returns

                 Forward Complement Reverse RevComp
FWD.ForwardReads       0          0       0       0
FWD.ReverseReads       0          0       0       0
REV.ForwardReads       0          0       0       0
REV.ReverseReads       0          0       0       0

If I check how many sequences are in the original file for sample 100, and compare it to the filtN file I get:

Original = 12,586 seqs
FiltN = 7 seqs

But in the first primerHits command, it was returning 121,169 RevComp primer hits. Does that mean the primer is just matching to several parts within the same sequence? And if most of my sequences do contain ambiguous reads it should be alright to continue using them? Otherwise I am getting samples going from 1000s of reads to 10 or less.

benjjneb · 2025-01-09T17:10:47Z

A couple things:

You appear to have single-end data, not paired-end data, so there are no Forward reads/Reverse reads here. A fixed version of the rbind line for single-end data would be something like:

rbind(FWD.primer = sapply(FWD.orients, primerHits, fn = myfiles.cut[[100]]), 
      REV.primer = sapply(REV.orients, primerHits, fn = myfiles.cut[[100]])

But in the first primerHits command, it was returning 121,169 RevComp primer hits.

The value was 12,169, which is close to the input number of reads (12,586). This indicates that in almost all reads the reverse-complement orientation of the REV primer is being detected, which is expected if the sequenced amplicon is usually shorter than your read length. So that is fairy normal.

The main issue seems to be that almost all your reads contain N(s). While this can be worked around at the primer removal stage, the main dada2 denoising method does not allow for Ns. What does plotQualityProfile look like for an example sample? What are the first couple sequences from an example sample?

kdbchau · 2025-01-09T17:17:34Z

I do carry out the rest of my analysis after trimming the primers in qiime2 and I think everything seemed fine?

But here is one plot of my cut.fastq files (not using the filtered files).

Some additional samples:

benjjneb · 2025-01-09T17:22:38Z

That looks fine. My guess is that the Ns are happening in your reads after the reverse primer is detected, and so are being truncated off the reads by cutadapt, in which case you are fine to move forward.

kdbchau · 2025-01-09T17:23:02Z

Amazing, thanks so much for your help!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

What do the filtN files do? #2072

What do the filtN files do? #2072

kdbchau commented Jan 8, 2025

benjjneb commented Jan 8, 2025

kdbchau commented Jan 8, 2025

kdbchau commented Jan 8, 2025 •

edited

Loading

benjjneb commented Jan 9, 2025

kdbchau commented Jan 9, 2025

benjjneb commented Jan 9, 2025

kdbchau commented Jan 9, 2025

What do the filtN files do? #2072

What do the filtN files do? #2072

Comments

kdbchau commented Jan 8, 2025

benjjneb commented Jan 8, 2025

kdbchau commented Jan 8, 2025

kdbchau commented Jan 8, 2025 • edited Loading

benjjneb commented Jan 9, 2025

kdbchau commented Jan 9, 2025

benjjneb commented Jan 9, 2025

kdbchau commented Jan 9, 2025

kdbchau commented Jan 8, 2025 •

edited

Loading