Resources - RENEE Documentation
On Biowulf, RENEE comes bundled with pre-built GENCODE [1] reference genomes.
As of RENEE v2.6.0, all hg19 and hg38 indices were built using the NCI Genomic Data Commons reference fasta, which contains the primary genome from Encode plus virus and decoy sequences. The hg38 fasta files were downloaded from the GDC with virus and decoy sequences already added, while these sequences were manually added to the hg19 fasta from Encode. See details here: https://github.com/CCBR/build-renee-refs
You can run renee run --help to view the most up-to-date list of genome annotations available in your installation of RENEE.
Note: Newer annotation versions may be added upon request and may already be available. Please contact Vishal Koparde for details.
However, building new reference genomes is easy!
If you do not have access to Biowulf or you are looking for a reference genome and/or annotation that is currently not available, it can be built with RENEE's build sub command. Given a genomic FASTA file (ref.fa) and a GTF file (genes.gtf), renee build will create all of the required reference files to run the RENEE pipeline. Once the build pipeline completes, you can supply the newly generated reference.json to the --genome option of renee run. For more information, please see the help pages for the run and build sub commands.
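As a rough sketch, a custom build might look like the following; the flag names and values shown here are assumptions based on a typical renee build invocation, so please confirm them with renee build --help.
# Sketch only: flag names and values below are assumptions; confirm with renee build --help
renee build --ref-fa ref.fa \
            --ref-name custom_genome \
            --ref-gtf genes.gtf \
            --gtf-ver v1 \
            --output /data/$USER/refs/custom_genome_v1 \
            --dry-run
# Drop --dry-run to submit the build; the resulting reference.json can then be passed to renee run --genome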
The continued growth and support of NIH's Biowulf cluster is dependent upon its demonstrable value to the NIH Intramural Research Program. If you publish research that involved significant use of Biowulf, please cite the cluster.
Suggested citation text:
This work utilized the computational resources of the NIH HPC Biowulf cluster. (http://hpc.nih.gov)
1. Harrow, J., et al. (2012). "GENCODE: the reference human genome annotation for The ENCODE Project." Genome Res 22(9): 1760-74.
2. Andrews, S. (2010). "FastQC: a quality control tool for high throughput sequence data."
3. Martin, M. (2011). "Cutadapt removes adapter sequences from high-throughput sequencing reads." EMBnet.journal 17(1): 10-12.
4. Wood, D. E. and Salzberg, S. L. (2014). "Kraken: ultrafast metagenomic sequence classification using exact alignments." Genome Biol 15(3): R46.
5. Ondov, B. D., et al. (2011). "Interactive metagenomic visualization in a Web browser." BMC Bioinformatics 12(1): 385.
6. Wingett, S. and Andrews, S. (2018). "FastQ Screen: A tool for multi-genome mapping and quality control." F1000Research 7(2): 1338.
7. Dobin, A., et al. (2013). "STAR: ultrafast universal RNA-seq aligner." Bioinformatics 29(1): 15-21.
8. Bushnell, B., Rood, J., and Singer, E. (2017). "BBMerge - Accurate paired shotgun read merging via overlap." PLoS One 12(10): e0185056.
9. Okonechnikov, K., et al. (2015). "Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data." Bioinformatics 32(2): 292-294.
10. The Picard toolkit. https://broadinstitute.github.io/picard/
11. Daley, T. and Smith, A. D. (2013). "Predicting the molecular complexity of sequencing libraries." Nat Methods 10(4): 325-7.
12. Li, H., et al. (2009). "The Sequence Alignment/Map format and SAMtools." Bioinformatics 25(16): 2078-2079.
13. Wang, L., et al. (2012). "RSeQC: quality control of RNA-seq experiments." Bioinformatics 28(16): 2184-2185.
14. Li, B. and Dewey, C. N. (2011). "RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome." BMC Bioinformatics 12: 323.
15. Uhrig, S., et al. (2021). "Accurate and efficient detection of gene fusions from RNA sequencing data." Genome Res 31(3): 448-460.
16. Ewels, P., et al. (2016). "MultiQC: summarize analysis results for multiple tools and samples in a single report." Bioinformatics 32(19): 3047-3048.
Getting started - RENEE Documentation
When processing RNA-sequencing data, there are often many steps that we must repeat. These are usually steps like removing adapter sequences, aligning reads against a reference genome, checking the quality of the data, and quantifying counts. RENEE is composed of several sub commands or convenience functions to automate these repetitive steps.
With RENEE, you can run your samples through our highly-reproducible pipeline, build resources for new reference genomes, and more!
This page contains information for building reference files and running the RENEE pipeline. For more information about each of the available sub commands, please see the usage section.
RENEE has two dependencies: singularity and snakemake. These dependencies can be installed by a sysadmin; however, snakemake is readily available through conda. Before running the pipeline or any of the commands below, please ensure singularity and snakemake are in your $PATH. Please follow the instructions below for getting started with the RENEE pipeline.
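For example, on Biowulf both dependencies become available after loading the ccbrpipeliner module; the quick check below is a sketch, and module names on other systems will differ.
# Make singularity and snakemake available (Biowulf); on other systems, install them via conda or a sysadmin
module load ccbrpipeliner
which singularity snakemake   # both commands should resolve to a path if your $PATH is set up correctly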
# Setup Step 1.) Please do not run RENEE on the head node!
--mode slurm \
--star-2-pass-basic \
--dry-run
An email notification will be sent out when the pipeline starts and ends.
Theory - RENEE Documentation
RNA-sequencing (RNA-seq) has a wide variety of applications; this transcriptome profiling method can be used to quantify gene and isoform expression, find changes in alternative splicing, detect gene-fusion events, call variants and much more.
It is also worth noting that RNA-seq can be coupled with other biochemical assays to analyze many other aspects of RNA biology, such as RNA–protein binding (CLIP-seq, RIP-seq), RNA structure (SHAPE-seq), or RNA–RNA interactions (CLASH-seq). These applications are, however, beyond the scope of this documentation, as we focus on a typical RNA-seq project (i.e., quantifying expression and detecting gene fusions). Our focus is to outline current standards and resources for the bioinformatics analysis of RNA-seq data. We do not aim to provide an exhaustive compilation of resources or software tools. Rather, we aim to provide a guideline and conceptual overview for RNA-seq data analysis based on our best-practices RNA-seq pipeline.
Here we review the typical major steps in RNA-seq data analysis: experimental design, quality control, read alignment, quantification of gene and transcript levels, and visualization.
Just like any other scientific experiment, a good RNA-seq experiment is hypothesis-driven. If you cannot describe the problem you are trying to address, throwing NGS at it is not a cure-all solution. Fishing for results is a waste of your time and is bad science. As such, designing a well-thought-out experiment around a testable question will maximize the likelihood of generating high-impact results.
The data that is generated will determine whether you have the potential to answer your biological question of interest. As a prerequisite, you need to think about how you will construct your libraries, the sequencing depth needed to address your question of interest, the number of replicates, and strategies to reduce/mitigate batch effects.
Ribosomal RNA (rRNA) can comprise up to 80% of the RNA in a cell. An important consideration is the RNA extraction protocol that will be used to remove this highly abundant rRNA. For eukaryotic cells, there are two major considerations: choosing whether to enrich for mRNA or whether to deplete rRNA.
Poly(A)-selection is a common method used to enrich for mRNA. This method generates the highest percentage of reads that will ultimately map to protein-coding genes, making it a common choice for most applications. That being said, poly(A)-selection requires your RNA to be of high quality with minimal degradation. Degraded samples processed with poly(A)-selection may show a 3' bias, which in effect may introduce downstream biases into your results.
The second method captures total RNA through the depletion of rRNA. This method allows you to examine both mRNA and other non-coding RNA species such as lncRNAs. Again, depending on the question you are trying to answer, this may be the right method for you. It should be noted that both methods, mRNA and total RNA, ideally require high-quality RNA (RIN > 8). But if your samples do contain slightly degraded RNA, you might be able to use the total RNA method over poly(A)-selection.
Sequencing depth or library size is another important design factor. As sequencing depth is increased, more transcripts will be detected (up until a saturation point), and their relative abundance will be quantified more accurately.
At the end of the day, the targeted sequencing depth depends on the aims of the experiment. Are you trying to quantify differences in gene expression, or are you trying to quantify differential isoform usage or alternative splicing events? The numbers quoted below are more or less tailored to quantifying differences in gene expression. If you are trying to quantify changes in alternative splicing or isoform regulation, you are going to need much higher coverage (~100M paired-end reads).
For mRNA libraries or libraries generated from a prep kit using poly-(A) selection, we recommend a minimum sequencing depth of 10-20M paired-end reads (or 20-40M reads). RNA must be of high quality or a 3' bias may be observed.
For total RNA libraries, we recommend a sequencing depth of 25-60M paired-end reads (or 50-120M reads). RNA must be of high quality.
Note: In the sections above and below, when I say paired-end reads, I am referring to read pairs generated from paired-end sequencing of a given cDNA fragment. You will sometimes see reads reported as pairs of reads or total reads.
We recommend 4 biological replicates per experimental condition or group. Having more replicates is good because, in the real world, problems arise. If you have a bad sample that cannot be used due to severe QC issues, you are still left with 3 biological replicates. This allows you to drop a bad sample without compromising statistical power downstream.
Batch effects represent unwanted sources of technical variation. Batch effects introduce non-biological variation into your data, which, if not accounted for, can influence the results. From library preparation through sequencing, there are a number of steps (RNA extraction, adapter ligation, lane loading, etc.) that might introduce biases into the resulting data.
As a general rule of thumb, the best way to reduce the introduction of batch effects is through uniform processing, meaning you need to ensure that differences in sample handling are minimal. Samples should be processed by the same lab technician and everything should be done in a uniform manner: do not extract your RNA at different times, and do not use different lots of reagents! If a large number of samples are being processed and everything cannot be done at the same time, process representative samples from each biological group at the same time. This will ensure that batches and your variable of interest do not become confounded. Also, keep note of which samples belong to each batch. This information will be needed for batch correction.
To reduce the possibility of introducing batch effects from sequencing, all samples should be multiplexed together on the same lane(s).
Sample           Group   Batch   Batch*
Treatment_rep_1  KO      1       1
Treatment_rep_2  KO      2       1
Treatment_rep_3  KO      1       1
Treatment_rep_4  KO      2       1
Control_rep_1    WT      1       2
Control_rep_2    WT      2       2
Control_rep_3    WT      1       2
Control_rep_4    WT      2       2

Batch  = properly balanced batches, easily corrected
Batch* = groups and batch totally confounded, cannot be corrected
That being said, some problems cannot be bioinformatically corrected. If your variable of interest is totally confounded with your batches, applying batch correction to fix the problem is not going to work and will lead to undesired results (i.e., the Batch* column). If batches must be introduced due to other constraining factors, please keep note of which samples belong to each batch, and please put some thought into how to properly balance samples across your batches.
Quality control (QC) is extremely important! As the old adage goes: garbage in, garbage out! If there is one thing to take away from this document, let it be that. Performing QC checks will help ensure that your results are reliable and reproducible.
It is worth noting that there is a large variety of open-source tools that can be used to assess the quality of your data, so there is no reason to re-invent the wheel. Please keep this in mind, but also be aware that there are many wheels, per se, and you will need to know which to use and when. In this next section, we will cover different quality-control checks that can be applied at different stages of your RNA-seq analysis. These recommendations are based on a few of the tools our best-practices RNA-seq pipeline employs.
Before drawing biological conclusions, it is important to perform quality control checks to ensure that there are no signs of sequencing error, biases in your data, or other sources of contamination. Modern high-throughput sequencers generate millions of reads per run, and in the real world, problems can arise.
The general idea is to assess the quality of your reads before and after adapter removal and to check for different sources of contamination before proceeding to alignment. Here are a few of the tools that we use and recommend.
To assess the sequencing quality of your data, we recommend running FastQC before and after adapter trimming. FastQC generates a set of basic statistics to identify problems that can arise during sequencing or library preparation. FastQC will summarize per base and per read QC metrics such as quality scores and GC content (ideally, this plot should have a normal distribution with no forms of bimodality). It will also summarize the distribution of sequence lengths and will report the presence of adapter sequences, which is one reason we run it after removing adapters.
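RENEE runs FastQC for you inside containers, but for illustration, a standalone invocation might look like the sketch below (file names are placeholders).
# Run FastQC on a pair of FastQ files (file names are placeholders)
mkdir -p fastqc_out
fastqc --threads 4 -o fastqc_out sample_R1.fastq.gz sample_R2.fastq.gz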
During the process of sample collection to library preparation, there is a risk of introducing unwanted sources of DNA. FastQ Screen compares your sequencing data to a set of different reference genomes to determine if there is contamination, allowing you to see whether the composition of your library matches what you expect. If your data has high levels of human, mouse, fungal, or bacterial contamination, FastQ Screen will tell you: it reports what percentage of your library aligns against each reference genome.
If there are high levels of microbial contamination, Kraken will provide an estimation of the taxonomic composition. Kraken can be used in conjunction with Krona to produce interactive reports.
Note: Due to high levels of homology between organisms, there may be a small portion of your reads that align to an unexpected reference genome. Again, this should be a minimal percentage of your reads.
Again, there are many tools available to assess the quality of your data post-alignment, and as stated before, there is no need to re-invent the wheel. Please see the table below for a generalized set of guidelines for different pre/post QC metrics.
Preseq can be used to estimate the complexity of a library for each of your samples. If the duplication rate is very high, the overall library complexity will be low. Low library complexity could signal an issue with library preparation or sample preparation (e.g., FFPE samples) where very little input RNA was over-amplified, or the sample may be degraded.
Picard has a particularly useful sub-command called CollectRnaSeqMetrics, which reports the number and percentage of reads that align to various regions, such as coding, intronic, UTR, intergenic, and ribosomal regions. This is particularly useful, as you would expect a library constructed with poly(A)-selection to have a high percentage of reads that map to coding regions. Picard CollectRnaSeqMetrics will also report the uniformity of coverage across all genes, which is useful for determining whether a sample has a 3' bias (observed in libraries containing degraded RNA).
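For reference, a standalone Picard invocation might look like the sketch below; the refFlat and ribosomal interval files are placeholders you would generate from your annotation.
# Collect RNA-seq metrics with Picard (input and annotation file names are placeholders)
java -jar picard.jar CollectRnaSeqMetrics \
    I=sample.sorted.bam \
    O=sample.RnaSeqMetrics.txt \
    REF_FLAT=refFlat.txt \
    RIBOSOMAL_INTERVALS=rRNA.interval_list \
    STRAND_SPECIFICITY=NONE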
RSeQC is another particularly useful package that is tailored for RNA-seq data. The package is made up of over 20 sub-modules that can be used to do things like calculate the average insert size between paired-end reads (which is useful for GEO upload), annotate the percentage of reads spanning known or novel splice junctions, convert a BAM file into a normalized BigWig file, and infer RNA quality.
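As an illustration, two commonly used RSeQC modules can be run as shown below (the BAM and BED file names are placeholders).
# Infer strandedness and estimate inner distance with RSeQC (file names are placeholders)
infer_experiment.py -i sample.sorted.bam -r genes.bed
inner_distance.py -i sample.sorted.bam -r genes.bed -o sample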
Here is a set of generalized guidelines for different QC metrics. Some of these metrics will vary from genome to genome depending on the quality of the assembly and annotation, but that has been taken into consideration for our set of supported reference genomes.
Starting from raw data (FastQ files), how do we get a raw counts matrix, or how do we get a list of differentially expressed genes? Before feeding your data into an R package for differential expression analysis, it needs to be processed to add biological context to it. In this section, we will talk about the data processing pipeline in more detail, focusing on primary and secondary analysis.
One of the first steps in this process is to remove any unwanted adapter sequences from your reads before alignment. Adapters are composed of synthetic sequences and should be removed prior to alignment. Adapter removal is especially important in certain protocols, such as miRNA-seq: when smaller fragments are sequenced, it is almost certain there will be some form of adapter contamination.
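For illustration, adapter removal with a tool like Cutadapt might look like the sketch below; the adapter sequences shown are the standard Illumina TruSeq prefixes and the file names are placeholders.
# Trim Illumina TruSeq adapters from paired-end reads (adapter sequences and file names are examples)
cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
         -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
         -q 20 --minimum-length 35 \
         -o sample_R1.trim.fastq.gz -p sample_R2.trim.fastq.gz \
         sample_R1.fastq.gz sample_R2.fastq.gz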
In the alignment step, we add biological context to the raw data. In this step, we align reads to the reference genome to find where the sequenced fragments originate.
Accurate alignment of the cDNA fragments (which are derived from RNA) is difficult. Alternative splicing introduces the problem of aligning to non-contiguous regions, and using traditional genomic alignment algorithms can produce inaccurate or low-quality alignments due to the combination of alternative splicing and genomic variation (substitutions, insertions, and deletions). This has led to the development of splice-aware aligners like STAR, which are designed to overcome these issues. STAR can also be run in a two-pass mode for enhanced detection of reads mapping to novel splice junctions.
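For illustration, a basic two-pass STAR alignment might look like the sketch below; the index path and file names are placeholders, and RENEE configures these settings for you.
# Splice-aware alignment with STAR in basic two-pass mode (paths and file names are placeholders)
STAR --runThreadN 8 \
     --genomeDir /path/to/star_index \
     --readFilesIn sample_R1.trim.fastq.gz sample_R2.trim.fastq.gz \
     --readFilesCommand zcat \
     --twopassMode Basic \
     --outSAMtype BAM SortedByCoordinate \
     --quantMode TranscriptomeSAM GeneCounts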
In the quantification step, the number of reads that mapped to a particular genomic feature (such as a gene or isoform) is counted. It is important to keep in mind that raw counts are biased by a number of factors such as library size, feature length, and other compositional biases. As such, it is important to normalize your data to remove these biases before summarizing differences between groups of samples.
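As a sketch of this step with RSEM, starting from STAR's transcriptome-coordinate BAM (the reference prefix and file names are placeholders):
# Gene/isoform quantification with RSEM from a transcriptome-coordinate BAM (paths are placeholders)
rsem-calculate-expression --paired-end --alignments --no-bam-output \
    sample.Aligned.toTranscriptome.out.bam \
    /path/to/rsem_reference \
    sample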
build - RENEE Documentation
The renee executable is composed of several inter-related sub commands. Please see renee -h for all available options.
This part of the documentation describes options and concepts for the renee build sub command in more detail. With minimal configuration, the build sub command enables you to build new reference files for the renee run pipeline.
Setting up the RENEE build pipeline is fast and easy! In its most basic form, renee build only has five required inputs.
The renee executable is composed of several inter-related sub commands. Please see renee -h for all available options.
This part of the documentation describes options and concepts for the renee cache sub command in more detail. With minimal configuration, the cache sub command enables you to cache remote resources for the RENEE pipeline. Caching remote resources allows the pipeline to run in an offline mode.
When run successfully, renee cache submits a SLURM job to the job scheduler and exits. squeue can then be used to track the progress of the caching.
The cache sub command creates a local cache on the filesystem for resources hosted on DockerHub or AWS S3. These resources are normally pulled onto the filesystem when the pipeline runs; however, due to network issues or DockerHub pull rate limits, it may make sense to pull the resources once so a shared cache can be created and re-used. It is worth noting that a singularity cache cannot normally be shared across users. Singularity strictly enforces that its cache is owned by the user. To get around this issue, the cache sub command can be used to create local SIFs on the filesystem from images on DockerHub.
Caching remote resources for the RENEE pipeline is fast and easy! In its most basic form, renee cache only has one required input.
The synopsis for each command shows its parameters and their usage. Optional parameters are shown in square brackets.
A user must provide a directory to cache remote Docker images via the --sif-cache argument. Once the cache pipeline has completed, the local SIF cache can be passed to the --sif-cache option of the renee build and renee run sub commands. This enables the build and run pipelines to run in an offline mode.
You can always use the -h option for information on a specific command.
Path where a local cache of SIFs will be stored. type: path
Any images defined in config/containers/images.json will be pulled into the local filesystem. The path provided to this option can be passed to the --sif-cache option of the renee build and renee run sub commands. This allows for running the build and run pipelines in an offline mode where no requests are made to external sources. This is useful for avoiding network issues or DockerHub pull rate limits. Please see renee build and run for more information.
# Step 0.) Grab an interactive node (do not run on head node)
srun -N 1 -n 1 --time=12:00:00 -p interactive --mem=8gb --cpus-per-task=4 --pty bash
# Step 2.) Cache remote resources locally
renee cache --sif-cache /data/$USER/cache
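Once populated, the same cache path can be handed to the run (or build) pipeline; the input, output, and genome values below are placeholders, so confirm the exact flags with renee run --help.
# Step 3.) (Optional) Re-use the cache so the pipeline makes no remote pulls (values are placeholders)
renee run --input /data/$USER/rawdata/*.R?.fastq.gz \
          --output /data/$USER/RNA_hg38 \
          --genome hg38_36 \
          --mode slurm \
          --sif-cache /data/$USER/cache \
          --dry-run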
Graphical Interface - RENEE Documentation
The RENEE pipeline can be executed using either the command line interface (CLI) or the graphical user interface (GUI). The GUI offers a more interactive way for the user to provide input and adjust parameter settings. This part of the documentation describes how to run renee using the GUI (with screenshots). See the Command Line tab to read more about the renee executable and running the RENEE pipeline using the CLI.
NOTE: Make sure to add --tunnel flag to the sinteractive command for correct display settings. See details here: https://hpc.nih.gov/docs/tunneling/
# Setup Step 2.) Please do not run RENEE on the head node!
###########################################################################
Running renee gui launches the RENEE window.
Note: Please wait until the window created! message appears on the terminal.
To enter the location of the input folder containing FASTQ files and the location where the results should be created, either simply type the absolute paths or use the Browse tab to choose the input and output directories.
Next, from the drop-down menu, select the reference genome (hg38/mm10).
After all the information is filled out, press Submit.
If the pipeline detects no errors and the run was submitted, a new window appears that has the output of a "dry-run" which summarizes each step of the pipeline.
Click OK
A dialogue box will pop up to confirm submitting the job to SLURM.
Click Yes
An email notification will be sent out when the pipeline starts and ends.
4. Special instructions regarding X11 Window System
The RENEE GUI natively uses the X11 Window System to run the RENEE pipeline and display the graphics on a personal desktop or laptop. The X11 Window System can be used to run a program on Biowulf and display the graphics on a desktop or laptop. However, X11 can be unreliable and fail with many graphics applications used on Biowulf. The HPC staff recommends NoMachine (NX) for users who need to run graphics applications.
Please see details here on how to install and connect to Biowulf on your local computer using NoMachine.
Once connected to Biowulf using NX, right-click to open a terminal connection and start an interactive session (with the --tunnel flag). Similar to the instructions above, load the ccbrpipeliner module and enter renee gui to launch the RENEE GUI.
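Putting those steps together, a typical session from a NoMachine terminal might look like the following sketch; the sinteractive resource options are illustrative.
# Launch the RENEE GUI from an interactive session with display tunneling (resource options are illustrative)
sinteractive --tunnel --mem=8g --cpus-per-task=4
module load ccbrpipeliner
renee gui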
Expected Output - RENEE Documentation
After a successful renee run execution for multisample paired-end data, the following files and folders are created in the output folder.
renee_output/
├── bams
├── config
├── config.json   # Contains the configuration and parameters used for this specific RENEE run
.
├── sampleN.R1.trim.fastq.gz
└── sampleN.R2.trim.fastq.gz
run - RENEE Documentation
The renee executable is composed of several inter-related sub commands. Please see renee -h for all available options.
This part of the documentation describes options and concepts for the renee run sub command in more detail. With minimal configuration, the run sub command enables you to start running the data processing and quality-control pipeline.
Setting up the RENEE pipeline is fast and easy! In its most basic form, renee run only has three required inputs.
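A minimal dry-run submission might look like the sketch below; the input glob, output path, and genome key are placeholders, so confirm the exact flags with renee run --help.
# Sketch of a renee run submission (paths and genome key are placeholders)
renee run --input /data/$USER/rawdata/*.R?.fastq.gz \
          --output /data/$USER/RNA_hg38 \
          --genome hg38_36 \
          --mode slurm \
          --star-2-pass-basic \
          --dry-run
# Remove --dry-run to submit the master job to SLURM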
The renee executable is composed of several inter-related sub commands. Please see renee -h for all available options.
This part of the documentation describes options and concepts for the renee unlock sub command in more detail. With minimal configuration, the unlock sub command enables you to unlock a pipeline output directory.
If the pipeline fails ungracefully, it may be required to unlock the working directory before proceeding again. Snakemake will inform a user when it may be necessary to unlock a working directory with an error message stating: Error: Directory cannot be locked.
Please verify that the pipeline is not running before running this command. If the pipeline is currently running, the workflow manager will report that the working directory is locked. This is the default behavior of snakemake, and it is normal. Do NOT run this command if the pipeline is still running! Please kill the master job and its child jobs prior to running this command.
Unlocking a RENEE pipeline output directory is fast and easy! In its most basic form, renee unlock only has one required input.
The synopsis for this command shows its parameters and their usage. Optional parameters are shown in square brackets.
A user must provide an output directory to unlock via --output argument. After running the unlock sub command, you can resume the build or run pipeline from where it left off by re-running it.
You can always use the -h option for information on a specific command.
Path to a previous run's output directory to unlock. This will remove a lock on the working directory. Please verify that the pipeline is not running before running this command. Example: --output /data/$USER/RNA_hg38
# Step 0.) Grab an interactive node (do not run on head node)
srun -N 1 -n 1 --time=12:00:00 -p interactive --mem=8gb --cpus-per-task=4 --pty bash
module purge
# Step 1.) Unlock a pipeline output directory
renee unlock --output /data/$USER/RNA_hg38