-
Notifications
You must be signed in to change notification settings - Fork 4
Week 4 tests
Nelly-Wambui edited this page Jan 27, 2022
·
3 revisions
- Testing of the nf-core pipeline using different datasets(ITS data and 18S data)
- Running Yosef's DADA2 pipeline
- Running 16S-Accreditation DADA2 pipeline
- The metagenomic studies datasets we got did not have metadata
nextflow run nf-core/ampliseq -profile singularity --input "/node/cohort4/ckigen/data/honeybee/microbiome-data" --FW_primer "CCTACGGGNGGCWGCAG" --RV_primer "GACTACHVGGGTATCTAATCC" --metadata "../metadata.tsv" -resume
28 min 49 s
.
├── cutadapt
├── dada2
├── fastqc
├── multiqc
├── overall_summary.tsv
├── pipeline_info
└── qiime2
nextflow run nf-core/ampliseq -profile singularity --input "/node/cohort4/ckigen/ITS-metagenomics/dish-metagenomics/data/seq" --FW_primer "GGAAGTAAAAGTCGTAACAAGG" --RV_primer "GCTGCGTTCTTCATCGATGC" --illumina_pe_its --metadata "./metadata.tsv" -resume
ID treatment SRR3003942 a SRR3003943 b SRR3003944 c SRR3003945 d
28 min 5 s
. ├── cutadapt ├── dada2 ├── fastqc ├── multiqc ├── overall_summary.tsv ├── pipeline_info └── qiime2
- While running the nf-core data, we got this error in reading the sample metadata into R :
Error in `.rowNamesDF<-`(x, value = value) :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘a’, ‘b’
- After trying to fix it by adding
row.names="samples"
, an idea we borrowed from the 16S-Accreditation DADA2 pipeline, we got this error:
Error in data[[rowvar]] :
attempt to select less than one element in get1index
- Solution was adding
"row.names=NULL"
Initially:
sdata <- read.csv("stingless_bee_sample_metadata.csv", sep = ',', header = TRUE)
Changed to:
sdata <- read.table("/node/cohort4/nelly/16s-rRNA-Project/metadata.tsv", sep = '\t', header = TRUE, row.names=NULL)
- The parameters in making the taxonomy table were not agreeing to what we had in our phyloseq object:
Error in phyloseq_to_df(physeq1, addtax = T, addtot = F, addmaxrank = F) :
Error: Sample names in 'physeq' could not be automatically converted to the syntactically valid column names in data.frame (see 'make.names'). Consider renaming with 'sample_names'.
- We changed the
addtax = T
toaddtax = F
and it worked
Initially:
tax_table <- phyloseq_to_df(physeq1, addtax = T, addtot = F, addmaxrank = F)
Changed to:
tax_table <- phyloseq_to_df(physeq1, addtax = F, addtot = F, addmaxrank = F)
- We encountered the same error in creating the silva_classified object and solved it in the same way
Initially:
silva_classified <- phyloseq_to_df(physeq2, addtax = T, addtot = F, addmaxrank = F)
Changed to:
silva_classified <- phyloseq_to_df(physeq2, addtax = F, addtot = F, addmaxrank = F)
- In merging the silva classified taxonomy with the blast classified ones, the silva classified data object was being merged with a data object that had not yet been created(merged_data)
- We solved this by merging with the cumulation object instead, which had already been created:
Initially:
silva_blast_raw <- as.data.frame(bind_rows(silva_classified, merged_data))
Changed to:
silva_blast_raw <- as.data.frame(bind_rows(silva_classified, cumulation))
- There was an object
blast_out_1
mentioned that had not been created (might have been a typo)
- Solution was changing it to blast_out, which had already been created:
Initially:
blast_results <- merge(blast_taxa, blast_out_1, by = "OTU", all = FALSE)
Changed to:
blast_results <- merge(blast_taxa, blast_out, by = "OTU", all = FALSE)
- On solving that, we got this error:
Error in fix.by(by.y, y) : 'by' must specify a uniquely valid column
- We ran the objects being merged individually and realized that only the blast_taxa object had OTU
- We solved the problem by naming the first column of the blast_out as OTU instead of qseqid:
Initially:
colnames <- c("qseqid",
"sseqid",
"evalue",
"bitscore",
"sgi",
"sacc")
Changed to:
colnames <- c("OTU",
"sseqid",
"evalue",
"bitscore",
"sgi",
"sacc")
- As a result, the
blast_taxa$OTU <- blast_out$qseqid
could not run, but seeing that its intended purpose was not fulfilled, we commented it out
- This was the last point we got to in running this pipeline and we encountered the following error:
Error: object 'Lactobacillus' not found
- The problem hasn't been solved yet
- We encountered the following errors:
Error in data[[rowvar]] : attempt to select less than one element
In addition: Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on '../demo.txt'
Error in validObject(.Object) : invalid class “phyloseq” object: Component sample names do not match. Try sample_names()
- We solved them by specifying the sampleID names in the metadata to one variable and providing it as the row.names object
Initially:
samdata <- read.table("set5_meta.txt",header = TRUE,row.names = "sample")
Changed to:
sample <- c("Dog1", "Dog2", "Dog3", "Dog8", "Dog9", "Dog10", "Dog15", "Dog16", "Dog17", "Dog22", "Dog23", "Dog24", "Dog29")
samdata <- read.table("/node/cohort4/nelly/16s-rRNA-Project/practice.dataset1.meta.txt",header = TRUE,row.names = sample)
- In the cloned script, this part came before the part for creating a phyloseq object yet it required a phyloseq object
- The solution we had for this was shifting its position to make it go after creating the phyloseq object
- We encountered this error in the following parts of the script:
#Taxonomy plots.
#Phylum
phylum_plot <- plot_bar(phyloseq_object, x="sample", fill="Phylum") + facet_wrap(~BV, scales="free_x")
#Genus
Genus_plot <- plot_bar(ps.top50, x="sample", fill="Genus") + facet_wrap(~BV, scales="free_x")
#visualising alpha diversity
top30plot <- plot_bar(ps.top30, x="Age", fill="Inflammation") + facet_wrap(~BV, scales="free_x")
- Due to lack of knowledge of what should go into the facet_wrap, we were not able to solve that particular BV issue
-
We successfully ran pipeline using:
- Another set of 16S data(Stingless bee microbiome dataset)
- ITS data(Food metagenomic dataset)
-
We also realized the pipeline has a website in which we launched the pipeline version that allowed us to input values as per our data
- We tested it using 2 sets of data (Stingless bee microbiome data and the nf-core data)
- We were able to solve the most of challenges indicated above
- Changes that should be made to be specific to one's data are in the following steps:
- Loading the fastq files into R: input the path to the fastq files
- Extracting the file names: should match the files format
- Primer trimming: input primers specific to the sequences
- Path to cutadapt: input the path to the available cutadapt version
- Filtering and trimming data: change the parameters to fit the sequences being run
- Taxonomic classification: indicate paths to silva_nr_v138_train_set.fa.gz and silva_species_assignment_v138.fa.gz
- Reading the sample metadata into R: provide metadata file path(note separator in the file and update accordingly) and metadata column names
- Create BIOM file: change path to the biom file
- Extracting the filtered taxonomy and feature tables for barplot plotting: tax_table object and silva_classified object parameters might change for different datasets
- BLAST for all Microbiota: change the M_blast_abundance object data.frame column parameters and M_to_blast_dada2_SB_sequences oject path
- Running blast: provide paths to blastn, blast_db, and input objects
- Removing .1-9 string to get the rightful accession numbers: provide paths to taxaId and blast_taxa objects
- Merging the blast taxonomic classification to blast abundance table: provide path to merged_data_all_1 object
- We tested the pipeline using 2 sets of data(Dog data and nf-core data)
- We progressed from where we had got stuck in the last code-review(creating a phyloseq object)
- Changes that should be made to be specific to one's data are in the following steps:
- Setwd("path to working directory"): input the path to the fastq files
- Reading sample names using r: should match the files format
- One holding the file names of all the forward and reverse reads: should match the files format
- Variables holding file names for the forward and reverse filtered reads: should match the files format
- Looking at a specific samples quality profile: should be according to the number of samples
- Plotting quality profiles for random samples: should be according to the number of samples
- Trimming and filtering: change the parameters to fit the sequences being run
- Checking the quality profile again: should be according to the number of samples
- Plotting Quality Profiles For Random Samples After Filtering: should be according to the number of samples
- Assigning Taxonomy: indicate paths to silva_nr_v138_train_set.fa.gz and silva_species_assignment_v138.fa.gz
- Alternative training database in assigning taxonomy: indicate path to rdp_train_set_18.fa
- Creating a phyloseq object: provide metadata file path(note separator in the file and update accordingly) and specify and metadata column names to the sample variable