Skip to content

Week 4 tests

Nelly-Wambui edited this page Jan 27, 2022 · 3 revisions

What we did:

  • Testing of the nf-core pipeline using different datasets(ITS data and 18S data)
  • Running Yosef's DADA2 pipeline
  • Running 16S-Accreditation DADA2 pipeline

Challenges

Obtaining test data

  • The metagenomic studies datasets we got did not have metadata

nf-core/ampliseq

16S Datasets with nf-core/ampliseq

Stingless bee microbiome data

Command

nextflow run nf-core/ampliseq -profile singularity --input "/node/cohort4/ckigen/data/honeybee/microbiome-data" --FW_primer "CCTACGGGNGGCWGCAG" --RV_primer "GACTACHVGGGTATCTAATCC" --metadata "../metadata.tsv" -resume

Runtime

28 min 49 s

Output

.
├── cutadapt
├── dada2
├── fastqc
├── multiqc
├── overall_summary.tsv
├── pipeline_info
└── qiime2

ITS Datasets with nf-core/ampliseq

Food metagenomic dataset

Command

 nextflow run nf-core/ampliseq -profile singularity --input "/node/cohort4/ckigen/ITS-metagenomics/dish-metagenomics/data/seq" --FW_primer "GGAAGTAAAAGTCGTAACAAGG" --RV_primer "GCTGCGTTCTTCATCGATGC" --illumina_pe_its --metadata "./metadata.tsv" -resume

Dummy Metadata

ID    treatment
SRR3003942      a
SRR3003943      b
SRR3003944      c
SRR3003945      d

Runtime

28 min 5 s

Output

.
├── cutadapt
├── dada2
├── fastqc
├── multiqc
├── overall_summary.tsv
├── pipeline_info
└── qiime2

Yosef's DADA2 pipeline

Error 1: Reading the sample metadata into R

  1. While running the nf-core data, we got this error in reading the sample metadata into R :
Error in `.rowNamesDF<-`(x, value = value) : 
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘a’, ‘b’ 
  1. After trying to fix it by adding row.names="samples", an idea we borrowed from the 16S-Accreditation DADA2 pipeline, we got this error:
Error in data[[rowvar]] : 
  attempt to select less than one element in get1index
  • Solution was adding "row.names=NULL"
Initially:
sdata <- read.csv("stingless_bee_sample_metadata.csv", sep = ',', header = TRUE)

Changed to:
sdata <- read.table("/node/cohort4/nelly/16s-rRNA-Project/metadata.tsv", sep = '\t', header = TRUE, row.names=NULL)

Error 2: Extracting the filtered taxonomy and feature tables for barplot plotting

  1. The parameters in making the taxonomy table were not agreeing to what we had in our phyloseq object:
Error in phyloseq_to_df(physeq1, addtax = T, addtot = F, addmaxrank = F) : 
  Error: Sample names in 'physeq' could not be automatically converted to the syntactically valid column names in data.frame (see 'make.names'). Consider renaming with 'sample_names'.
  • We changed the addtax = T to addtax = F and it worked
Initially:
tax_table <- phyloseq_to_df(physeq1, addtax = T, addtot = F, addmaxrank = F)

Changed to:
tax_table <- phyloseq_to_df(physeq1, addtax = F, addtot = F, addmaxrank = F)
  1. We encountered the same error in creating the silva_classified object and solved it in the same way
Initially:
silva_classified <- phyloseq_to_df(physeq2, addtax = T, addtot = F, addmaxrank = F)

Changed to:
silva_classified <- phyloseq_to_df(physeq2, addtax = F, addtot = F, addmaxrank = F)

Error 3: BLAST for all Microbiota

  • In merging the silva classified taxonomy with the blast classified ones, the silva classified data object was being merged with a data object that had not yet been created(merged_data)
  • We solved this by merging with the cumulation object instead, which had already been created:
Initially:
silva_blast_raw <- as.data.frame(bind_rows(silva_classified, merged_data))

Changed to:
silva_blast_raw <- as.data.frame(bind_rows(silva_classified, cumulation))

Error 4: Removing the first staxids and making the OTU column the first

  1. There was an object blast_out_1 mentioned that had not been created (might have been a typo)
  • Solution was changing it to blast_out, which had already been created:
Initially:
blast_results <- merge(blast_taxa, blast_out_1, by = "OTU", all = FALSE)

Changed to:
blast_results <- merge(blast_taxa, blast_out, by = "OTU", all = FALSE)
  1. On solving that, we got this error:
Error in fix.by(by.y, y) : 'by' must specify a uniquely valid column
  • We ran the objects being merged individually and realized that only the blast_taxa object had OTU
  • We solved the problem by naming the first column of the blast_out as OTU instead of qseqid:
Initially:
colnames <- c("qseqid",
              "sseqid",
              "evalue",
              "bitscore",
              "sgi",
              "sacc")

Changed to:
colnames <- c("OTU",
              "sseqid",
              "evalue",
              "bitscore",
              "sgi",
              "sacc")
  • As a result, the blast_taxa$OTU <- blast_out$qseqid could not run, but seeing that its intended purpose was not fulfilled, we commented it out

Error 5: Assigning to different genera/clusters

  • This was the last point we got to in running this pipeline and we encountered the following error:
Error: object 'Lactobacillus' not found
  • The problem hasn't been solved yet

MBBU/16S-Accreditation pipeline

Error 1: Creating a phyloseq object

  • We encountered the following errors:
Error in data[[rowvar]] : attempt to select less than one element
In addition: Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on '../demo.txt'
Error in validObject(.Object) : invalid class “phyloseq” object: Component sample names do not match. Try sample_names()
  • We solved them by specifying the sampleID names in the metadata to one variable and providing it as the row.names object
Initially:
samdata <- read.table("set5_meta.txt",header = TRUE,row.names = "sample")

Changed to:
sample <- c("Dog1", "Dog2", "Dog3", "Dog8", "Dog9", "Dog10", "Dog15", "Dog16", "Dog17", "Dog22", "Dog23", "Dog24", "Dog29")
samdata <- read.table("/node/cohort4/nelly/16s-rRNA-Project/practice.dataset1.meta.txt",header = TRUE,row.names = sample)

Error 2: Rooting the tree

  • In the cloned script, this part came before the part for creating a phyloseq object yet it required a phyloseq object
  • The solution we had for this was shifting its position to make it go after creating the phyloseq object

Error 3: Missing BV variable

  • We encountered this error in the following parts of the script:
#Taxonomy plots.
#Phylum
phylum_plot <-  plot_bar(phyloseq_object, x="sample", fill="Phylum") + facet_wrap(~BV, scales="free_x")

#Genus
Genus_plot <-  plot_bar(ps.top50, x="sample", fill="Genus") + facet_wrap(~BV, scales="free_x")

#visualising alpha diversity
top30plot <- plot_bar(ps.top30, x="Age", fill="Inflammation") + facet_wrap(~BV, scales="free_x")

  • Due to lack of knowledge of what should go into the facet_wrap, we were not able to solve that particular BV issue

Breakthroughs

nf-core pipeline

  • We successfully ran pipeline using:

    • Another set of 16S data(Stingless bee microbiome dataset)
    • ITS data(Food metagenomic dataset)
  • We also realized the pipeline has a website in which we launched the pipeline version that allowed us to input values as per our data

Yosef's DADA2 pipeline

  • We tested it using 2 sets of data (Stingless bee microbiome data and the nf-core data)
  • We were able to solve the most of challenges indicated above
  • Changes that should be made to be specific to one's data are in the following steps:
    • Loading the fastq files into R: input the path to the fastq files
    • Extracting the file names: should match the files format
    • Primer trimming: input primers specific to the sequences
    • Path to cutadapt: input the path to the available cutadapt version
    • Filtering and trimming data: change the parameters to fit the sequences being run
    • Taxonomic classification: indicate paths to silva_nr_v138_train_set.fa.gz and silva_species_assignment_v138.fa.gz
    • Reading the sample metadata into R: provide metadata file path(note separator in the file and update accordingly) and metadata column names
    • Create BIOM file: change path to the biom file
    • Extracting the filtered taxonomy and feature tables for barplot plotting: tax_table object and silva_classified object parameters might change for different datasets
    • BLAST for all Microbiota: change the M_blast_abundance object data.frame column parameters and M_to_blast_dada2_SB_sequences oject path
    • Running blast: provide paths to blastn, blast_db, and input objects
    • Removing .1-9 string to get the rightful accession numbers: provide paths to taxaId and blast_taxa objects
    • Merging the blast taxonomic classification to blast abundance table: provide path to merged_data_all_1 object

16S-Accreditation DADA2 pipeline

  • We tested the pipeline using 2 sets of data(Dog data and nf-core data)
  • We progressed from where we had got stuck in the last code-review(creating a phyloseq object)
  • Changes that should be made to be specific to one's data are in the following steps:
    • Setwd("path to working directory"): input the path to the fastq files
    • Reading sample names using r: should match the files format
    • One holding the file names of all the forward and reverse reads: should match the files format
    • Variables holding file names for the forward and reverse filtered reads: should match the files format
    • Looking at a specific samples quality profile: should be according to the number of samples
    • Plotting quality profiles for random samples: should be according to the number of samples
    • Trimming and filtering: change the parameters to fit the sequences being run
    • Checking the quality profile again: should be according to the number of samples
    • Plotting Quality Profiles For Random Samples After Filtering: should be according to the number of samples
    • Assigning Taxonomy: indicate paths to silva_nr_v138_train_set.fa.gz and silva_species_assignment_v138.fa.gz
    • Alternative training database in assigning taxonomy: indicate path to rdp_train_set_18.fa
    • Creating a phyloseq object: provide metadata file path(note separator in the file and update accordingly) and specify and metadata column names to the sample variable