Merge pull request #14 from KoesGroup/develop

Develop
KoesGroup · Dec 18, 2018 · c7d8fdb
2 parents 4bb5d7d + f7a9e5a
commit c7d8fdb

Showing 36 changed files with 1,200,768 additions and 20 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,5 @@
+*~*
+results/
+logs/
+*.DS_Store
+.DS_Store
diff --git a/README.md b/README.md
@@ -1,48 +1,56 @@
 # RNA_seq_Snakemake
-A snakemake pipeline for the analysis of RNA-seq data with the use of hisat2 and DESeq2
+A snakemake pipeline for the analysis of RNA-seq data that makes use of [hisat2](https://ccb.jhu.edu/software/hisat2/index.shtml) and [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html).
 
 [![Snakemake](https://img.shields.io/badge/snakemake-≥5.2.0-brightgreen.svg)](https://snakemake.bitbucket.io)
 [![Miniconda](https://img.shields.io/badge/miniconda-blue.svg)](https://conda.io/miniconda)
 
 # Aim
-Map, count, normalize and get differential expressions of paired-end Illumina RNA-seq data.
+To align reads, count them, normalize the counts and compute differential gene expression between conditions using paired-end Illumina RNA-seq data.
 
 # Description
-Different branches are available with some different stratagies.
-The master contains a pipe line that solely makes use of an existing transcriptome that can be specified in the config file.
+This pipeline analyses raw RNA-seq data and produces a file containing normalized counts, differential expression and functions of transcripts. The raw fastq files are trimmed for adaptors and quality checked with trimmomatic. Next, the necessary genome fasta sequence and transcriptome references are downloaded and the trimmed reads are mapped against the genome using hisat2. Using stringtie and a reference annotation, a new annotation is created. This new annotation is used to obtain the raw counts and to run a local blast against a transcriptome fasta containing predicted functions. The counts are normalized and differential expression is calculated using DESeq2. These data are combined with the predicted functions to produce the final results table.
 
 
 # Content
-- Snakefile containing the targeted output and the rules to generate them from the input files.
-- config/ , folder containing the configuration files making the Snakefile adaptable to any input files, genome and parameter for the rules.
-- data/, folder containing samples.txt (sample descriptions) and subsetted paired-end fastq files used to test locally the pipeline. Generated using [Seqtk](https://github.com/lh3/seqtk): 
+- `Snakefile`: a master file that contains the desired outputs and the rules to generate them from the input files.
+- `config.yaml`: the configuration file that makes the Snakefile adaptable to any input files, genome and rule parameters.
+- `data/`: a folder containing samples.txt (sample descriptions) and subsetted paired-end fastq files used to test the pipeline locally. Generated using [Seqtk](https://github.com/lh3/seqtk):
 `seqtk sample -s100 {inputfile(can be gzipped)} 250000 > {output(always gunzipped)}`
-- envs/, folder containing the environments needed for the Snakefile to run. Need to make one specifically for MACS2 as MACS2 uses python 2.7 following the information found [here](https://groups.google.com/forum/#!searchin/snakemake/macs%7Csort:relevance/snakemake/60txGSq81zE/NzCUTdJ_AQAJ).
+This folder should contain the `fastq` files of the paired-end RNA-seq data you want to analyse.
+- `envs/`: a folder containing the environment files needed by the conda package manager. If run with the `--use-conda` option, Snakemake will install the necessary software and packages using these conda environment files.
+- `samples.tsv`: a file containing the names, the paths and the conditions of the samples used as input for the pipeline. **This file has to be adapted to your sample names before running the pipeline**.
 
 
 # Usage
 
-## Conda environment
-First, you need to install all softwares and packages needed with the [Conda package manager](https://conda.io/docs/using/envs.html).  
-1. Create a virtual environment named "chipseq" from the `environment.yaml` file with the following command:
-`conda env create --name RNA-Seq --file ~/envs/DCM.yaml`
-2. Then, activate this virtual environment with: `source activate RNA-Seq`    
-Now, all the basic softwares and packages versions in use are the one listed in the `DCM.yaml` file.
-The other environments (hisat2, subRead etc) will automatically be created and activated when requested by a rule.
+## Download or clone the Github repository
+You will need a local copy of the `Snakemake_hisat-DESeq` repository on your machine. You can either:
+- use git in the shell: `git clone git@github.com:KoesGroup/Snakemake_hisat-DESeq.git`
+- click on "Clone or download" and select `download`
+
+## Installing and activating a virtual environment
+First, you need to create an environment where `Snakemake` and the python `pandas` package will be installed. To do that, we will use the conda package manager.   
+1. Create a virtual environment named `rnaseq` using the `global_env.yaml` file with the following command: `conda env create --name rnaseq --file envs/global_env.yaml`
+2. Then, activate this virtual environment with: `source activate rnaseq`
+
+The Snakefile will then take care of installing and loading the packages and software required by each step of the pipeline.
 
 ## Configuration file
-The `~/configs/config.yaml` file specifies the sample data file, the genomic and transcriptomic reference fasta files to use, the parameters for the rules to use, etc. This file is used so the `Snakefile` does not need to be changed when locations or parameters need to be changed.
+Make sure you have changed the parameters in the `config.yaml` file, which specifies where to find the sample data file, which genomic and transcriptomic reference fasta files to use, and the parameters for certain rules.  
+This file is used so the `Snakefile` does not need to be changed when locations or parameters need to be changed.
 
 ## Snakemake execution
-The Snakemake pipeline/workflow management system reads a master file (often called `Snakefile`) to list the steps to be executed and defining their order.
-It has many rich features. Read more [here](https://snakemake.readthedocs.io/en/stable/).
+The Snakemake pipeline/workflow management system reads a master file (often called `Snakefile`) that lists the steps to be executed and defines their order. It has many rich features. Read more [here](https://snakemake.readthedocs.io/en/stable/).
 
 ## Dry run
-Use the command `snakemake --use-conda -np` to perform a dry run that prints out the rules and commands.
+From the folder containing the `Snakefile`, use the command `snakemake --use-conda -np` to perform a dry run that prints out the rules and commands.
 
 ## Real run
 Simply type `snakemake --use-conda` and provide the number of cores with `--cores`, for instance `--cores 10` for ten cores.  
 For cluster execution, please refer to the [Snakemake reference](https://snakemake.readthedocs.io/en/stable/executable.html#cluster-execution).
 
 # Main outputs
-the alignment files (*.bam), fastqc report files, raw countsfile (counts.txt), results file containing normalized differential expressions (results.tsv)
+- the RNA-Seq read alignment files __\*.bam__
+- the fastqc report files __\*.html__
+- the unscaled RNA-Seq read counts: __counts.txt__
+- the differential expression file __results.tsv__
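
The dry run and real run described in the README can also be launched from a script. The following is a minimal Python sketch, not part of this commit: the helper name `build_snakemake_command` is hypothetical, and it only assembles the `snakemake --use-conda -np` and `snakemake --use-conda --cores N` command lines quoted in the README.

```python
import shlex
from typing import List, Optional


def build_snakemake_command(cores: Optional[int] = None, dry_run: bool = False) -> List[str]:
    """Assemble a snakemake command line.

    Hypothetical helper for illustration only; it is not part of the repository.
    """
    cmd = ["snakemake", "--use-conda"]
    if dry_run:
        # -n performs a dry run, -p prints the shell commands of each rule
        cmd.append("-np")
    if cores is not None:
        if cores < 1:
            raise ValueError("cores must be a positive integer")
        cmd += ["--cores", str(cores)]
    return cmd


if __name__ == "__main__":
    # Print the two commands from the README instead of executing them.
    print(shlex.join(build_snakemake_command(dry_run=True)))
    print(shlex.join(build_snakemake_command(cores=10)))
```

To actually execute a command, the returned list could be passed to `subprocess.run(cmd, check=True)` from the folder containing the `Snakefile`, with the `rnaseq` environment activated.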