Snakemake pipeline for analysis of bisulfite-seq data
Snakemake pipeline for reproducible analysis of paired-end Illumina bisulfite-seq data. Mapping and methylation calling are done with BSseeker2. Identification of regions of low or no methylation is based on MethylSeekR. The pipeline contains a few added features to make it more suitable for the analysis of plant samples.
- Snakefile: contains the targeted outputs and the rules to generate them from the input files.
- data/: folder containing a subset of a couple of paired-end fastq files used to test the pipeline locally (tomato leaf bisulfite genomic sequence reads; SRR503393).
- genome/: folder containing a small fragment of chromosome 12 of the tomato genome, to be used for the local test.
- envs/: folder containing the environments needed for the Snakefile to run. To use Snakemake, you must first create and activate an environment containing snakemake (envs/global_env.yaml).
- samples.tsv: tab-separated values file containing information about the samples (sample names, species, tissue, ...) and the paths to the fastq files relative to the Snakefile. Change this file according to your samples.
First, create an environment for Snakemake with the Conda package manager.
- Create a virtual environment named "BSanalysis" from the global_env.yaml file with the following command: conda env create --name BSanalysis --file envs/global_env.yaml
- Then, activate this virtual environment with: conda activate BSanalysis
The Snakefile will then take care of installing and loading the packages and software required by each step of the pipeline.
The configs.yaml file specifies the sample list (samples.tsv), the genomic reference fasta file to use, the directories to use, etc. This file is then used to build parameters in the main Snakefile.
The Snakemake pipeline/workflow management system reads a master file (often called Snakefile) that lists the steps to be executed and defines their order.
It has many other features; see the Snakemake documentation for more information.
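To illustrate how an execution order can follow from the inputs and outputs each step declares, here is a minimal Python sketch. It is not Snakemake's actual implementation, and the rule and file names are hypothetical, chosen only to resemble a bisulfite-seq workflow.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical rules: name -> (input files, output files).
# These names are illustrative and are not taken from the pipeline's Snakefile.
RULES = {
    "find_regions": (["calls.txt"], ["regions.bed"]),
    "map": (["trimmed.fastq", "genome.fasta"], ["mapped.bam"]),
    "trim": (["reads.fastq"], ["trimmed.fastq"]),
    "call_methylation": (["mapped.bam"], ["calls.txt"]),
}


def execution_order(rules):
    """Order rules so that each one runs after the rules producing its inputs."""
    # Map every output file to the rule that creates it.
    producer = {out: name for name, (_, outs) in rules.items() for out in outs}
    # A rule depends on the producers of its inputs; files that no rule
    # produces (raw reads, the reference genome) are treated as given.
    graph = {
        name: {producer[f] for f in ins if f in producer}
        for name, (ins, _) in rules.items()
    }
    return list(TopologicalSorter(graph).static_order())


ORDER = execution_order(RULES)
print(ORDER)
```

The rules are deliberately listed out of order in the dictionary: the order of execution comes from the file dependencies, not from the order in which the steps are written.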
Samples are listed in the samples.tsv file and will be used by the Snakefile automatically. Edit the sample names and fastq paths to match your data.
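As a rough idea of how a tab-separated sample sheet can be turned into a lookup table, here is a short Python sketch using only the standard library. The column names (sample, species, tissue, fq1, fq2) and the file paths are assumptions made for illustration; check the header of the samples.tsv shipped with the pipeline for the real ones.

```python
import csv
import io

# Hypothetical sample sheet; column names and paths are assumptions,
# not the actual header used by this pipeline.
EXAMPLE_TSV = (
    "sample\tspecies\ttissue\tfq1\tfq2\n"
    "S1\ttomato\tleaf\tdata/S1_1.fastq\tdata/S1_2.fastq\n"
)


def read_samples(handle):
    """Return a dict mapping each sample name to its row of metadata."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["sample"]: row for row in reader}


# io.StringIO stands in for open("samples.tsv", newline="") in this sketch.
SAMPLES = read_samples(io.StringIO(EXAMPLE_TSV))
print(SAMPLES["S1"]["fq1"], SAMPLES["S1"]["fq2"])
```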
Use the command snakemake -np to perform a dry run that prints out the rules and commands.
To run the pipeline, type snakemake --use-conda and provide the number of cores, for instance --cores 10 for ten cores.
For cluster execution, please refer to the Snakemake reference.
Please note that --use-conda is required for the installation and loading of the dependencies used by the rules of the pipeline.
- bed files containing the unmethylated (UMR) and low-methylated (LMR) regions, separated into CG, CCG, CWG, CHG and CHH contexts.
- bed file containing "active" regions, i.e. regions in which the C's in both CG and CHG context are unmethylated.
- log files containing reports of the fastp, BSseeker2 and methylation calling steps.
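The idea behind the "active" regions, being unmethylated in both CG and CHG context, amounts to intersecting two sets of intervals. The Python sketch below shows one way to do that. It is an illustration under stated assumptions, not necessarily how the pipeline computes these regions: the coordinates are made up, and each input list is assumed to be sorted by chromosome and start, and free of internal overlaps.

```python
def intersect(a, b):
    """Intersect two sorted lists of (chrom, start, end) intervals.

    Coordinates are half-open, as in BED files. Each list must be sorted
    by (chrom, start) and must not contain overlapping intervals.
    """
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        chrom_a, start_a, end_a = a[i]
        chrom_b, start_b, end_b = b[j]
        if chrom_a != chrom_b:
            # Advance whichever list is behind in chromosome order.
            if chrom_a < chrom_b:
                i += 1
            else:
                j += 1
            continue
        start = max(start_a, start_b)
        end = min(end_a, end_b)
        if start < end:
            result.append((chrom_a, start, end))
        # Move past the interval that ends first.
        if end_a < end_b:
            i += 1
        else:
            j += 1
    return result


# Made-up unmethylated regions in CG and CHG context.
CG_UMR = [("chr12", 100, 500), ("chr12", 800, 1200)]
CHG_UMR = [("chr12", 300, 900)]

ACTIVE = intersect(CG_UMR, CHG_UMR)
print(ACTIVE)
```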
The settings as given are optimized for plant samples. They can be altered in the config.yaml.