- Run human_genomics_pipeline on a single machine like a laptop or single server/computer
- Table of contents
- 1. Fork the pipeline repo to a personal or lab account
- 2. Take the pipeline to the data on your local machine
- 3. Setup files and directories
- 4. Get prerequisite software/hardware
- 5. Create a local copy of the GATK resource bundle (either b37 or hg38)
- 6. Modify the configuration file
- 7. Modify the run scripts
- 8. Create and activate a conda environment with python and snakemake installed
- 9. Run the pipeline
- 10. Evaluate the pipeline run
- 11. Commit and push to your forked version of the github repo
- 12. Repeat step 11 each time you re-run the analysis with different parameters
- 13. Raise issues, create feature requests or create a pull request with the upstream repo to merge any useful changes to the pipeline (optional)
1. Fork the pipeline repo to a personal or lab account
See the GitHub documentation for help forking a repository.
2. Take the pipeline to the data on your local machine
Clone the forked human_genomics_pipeline repo into the same directory as the paired-end fastq data to be processed.
See the GitHub documentation for help cloning a repository.
3. Setup files and directories
Required folder structure and file naming convention:
.
|___fastq/
| |___sample1_1.fastq.gz
| |___sample1_2.fastq.gz
| |___sample2_1.fastq.gz
| |___sample2_2.fastq.gz
| |___sample3_1.fastq.gz
| |___sample3_2.fastq.gz
| |___sample4_1.fastq.gz
| |___sample4_2.fastq.gz
| |___sample5_1.fastq.gz
| |___sample5_2.fastq.gz
| |___sample6_1.fastq.gz
| |___sample6_2.fastq.gz
| |___ ...
|
|___human_genomics_pipeline/
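The naming convention above can be checked before a run. The following is a minimal sketch, not part of the pipeline: `check_fastq_pairs` is a hypothetical helper that assumes the fastq/ layout shown above and reports any `_1` file without a matching `_2` file (and vice versa).

```shell
# Hypothetical helper (not part of human_genomics_pipeline):
# report fastq files in a directory that lack their _1/_2 mate.
check_fastq_pairs() {
    local dir="$1" f mate status=0
    for f in "$dir"/*_1.fastq.gz "$dir"/*_2.fastq.gz; do
        [ -e "$f" ] || continue            # skip glob patterns that matched nothing
        case "$f" in
            *_1.fastq.gz) mate="${f%_1.fastq.gz}_2.fastq.gz" ;;
            *)            mate="${f%_2.fastq.gz}_1.fastq.gz" ;;
        esac
        if [ ! -e "$mate" ]; then
            echo "missing mate for: $f"
            status=1
        fi
    done
    return "$status"
}
```

For example, `check_fastq_pairs ./fastq` prints nothing and returns 0 when every sample has both files.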
If you're analysing cohorts of samples, you will need an additional directory with a pedigree file for each cohort/family, using the following folder structure and file naming convention:
.
|___fastq/
| |___sample1_1.fastq.gz
| |___sample1_2.fastq.gz
| |___sample2_1.fastq.gz
| |___sample2_2.fastq.gz
| |___sample3_1.fastq.gz
| |___sample3_2.fastq.gz
| |___sample4_1.fastq.gz
| |___sample4_2.fastq.gz
| |___sample5_1.fastq.gz
| |___sample5_2.fastq.gz
| |___sample6_1.fastq.gz
| |___sample6_2.fastq.gz
| |___ ...
|
|___pedigrees/
| |___proband1_pedigree.ped
| |___proband2_pedigree.ped
| |___ ...
|
|___human_genomics_pipeline/
Requirements:
- Input paired-end fastq files need to be identified with _1 and _2 (not _R1 and _R2)
- Currently, the filenames of the pedigree files need to be labelled with the name of the proband/individual affected with the disease phenotype in the cohort (we will be working towards removing this requirement)
- Singletons and cohorts need to be run in separate pipeline runs
- It is assumed that there is one proband/individual affected with the disease phenotype of interest in a given cohort (one individual with a value of 2 in the 6th column of a given pedigree file)
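The pedigree requirements above lend themselves to a quick pre-flight check. This is a sketch under stated assumptions, not pipeline code: `check_pedigree` is a hypothetical helper, and it assumes that the "name of the proband" used in the file name is the individual ID found in column 2 of the pedigree file.

```shell
# Hypothetical helper: check that a pedigree file has exactly one affected
# individual (value 2 in column 6) and that the file is named
# <individual ID>_pedigree.ped after that individual.
# Assumes the individual ID sits in column 2 of the pedigree file.
check_pedigree() {
    local ped="$1" affected
    # collect the IDs of all rows whose 6th column is 2
    affected="$(awk '$6 == 2 { print $2 }' "$ped")"
    if [ "$(printf '%s\n' "$affected" | grep -c .)" -ne 1 ]; then
        echo "expected exactly one affected individual in $ped"
        return 1
    fi
    if [ "$(basename "$ped")" != "${affected}_pedigree.ped" ]; then
        echo "expected file name ${affected}_pedigree.ped for $ped"
        return 1
    fi
}
```

It could be run over every file in the pedigrees directory, for example `for p in ./pedigrees/*.ped; do check_pedigree "$p"; done`.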
The provided test dataset can be used. Set up the test dataset before running the pipeline on it, choosing either a single sample analysis or a cohort analysis with the -a flag. For example:
cd ./human_genomics_pipeline
bash ./test/setup_test.sh -a cohort
4. Get prerequisite software/hardware
For GPU accelerated runs, you'll need NVIDIA GPUs and NVIDIA CLARA PARABRICKS and dependencies. Talk to your system administrator to see if the HPC has this hardware and software available.
Other software required to get setup and run the pipeline:
- Git (tested with version 2.7.4)
- Conda (tested with version 4.8.2)
- Mamba (tested with version 0.4.4) (note: mamba can be installed via conda with a single command)
- gsutil (tested with version 4.52)
- gunzip (tested with version 1.6)
Most of this software is commonly pre-installed on HPCs, likely available as modules that can be loaded. Talk to your system administrator if you need help with this.
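To see quickly which of the tools listed above are not yet available, something like the following sketch can help. `missing_tools` is a hypothetical helper; it only checks that a command is on the PATH, not which version is installed.

```shell
# Hypothetical helper: print the name of every listed command
# that is not found on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example call (prints nothing if all tools are found):
#   missing_tools git conda mamba gsutil gunzip
```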
5. Create a local copy of the GATK resource bundle (either b37 or hg38)
b37: download from the Google Cloud bucket
gsutil cp -r gs://gatk-legacy-bundles/b37 /where/to/download/
hg38: download from the Google Cloud bucket
gsutil cp -r gs://genomics-public-data/resources/broad/hg38 /where/to/download/
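After the download, it can be worth confirming that the files you plan to reference are present in the form you expect. The sketch below assumes (this README does not state it) that some bundle files may arrive gzip-compressed, which would explain why gunzip is listed among the prerequisites. `bundle_file_status` is a hypothetical helper.

```shell
# Hypothetical helper: report whether a reference file is ready to use.
# Prints "ok" if the file exists, "compressed" if only a .gz copy exists
# (decompress it with gunzip first), and "missing" otherwise.
bundle_file_status() {
    local path="$1"
    if [ -e "$path" ]; then
        echo "ok"
    elif [ -e "$path.gz" ]; then
        echo "compressed"
    else
        echo "missing"
    fi
}
```

Call it with the path you intend to put in the configuration file, for example `bundle_file_status /where/to/download/b37/human_g1k_v37_decoy.fasta` (the exact layout of the downloaded directory is an assumption).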
6. Modify the configuration file
Edit 'config.yaml' found within the config directory
Specify whether the data is to be analysed on its own ('Single') or as a part of a cohort of samples ('Cohort'). For example:
DATA: "Single"
Specify whether the pipeline should be GPU accelerated where possible (either 'Yes' or 'No', this requires NVIDIA GPUs and NVIDIA CLARA PARABRICKS)
GPU_ACCELERATED: "No"
Set the file path to the reference human genome file (b37 or hg38). For example:
REFGENOME: "/scratch/publicData/b37/human_g1k_v37_decoy.fasta"
Set the file path to your dbSNP database file (b37 or hg38). For example:
dbSNP: "/scratch/publicData/b37/dbsnp_138.b37.vcf"
Set the path to a temporary file directory. Make sure this location has a fair amount of storage space for large intermediate analysis files. For example:
TEMPDIR: "/scratch/tmp/"
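A typo in one of these paths usually only surfaces once the pipeline is running, so a small check up front can save time. The following is a sketch with hypothetical helpers (`config_value`, `check_config_paths`). It uses a simple text match rather than a YAML parser and assumes one `KEY: "value"` pair per line, with no spaces inside the value.

```shell
# Hypothetical helper: print the value of a `KEY: "value"` line.
# Plain text matching, not a YAML parser; surrounding quotes are removed.
config_value() {
    local key="$1" file="$2"
    awk -v key="$key" '$1 == (key ":") { v = $2; gsub(/"/, "", v); print v }' "$file"
}

# Hypothetical helper: confirm the three paths set above exist on disk.
check_config_paths() {
    local file="$1" key value status=0
    for key in REFGENOME dbSNP TEMPDIR; do
        value="$(config_value "$key" "$file")"
        if [ -z "$value" ] || [ ! -e "$value" ]; then
            echo "$key: path not found: $value"
            status=1
        fi
    done
    return "$status"
}
```

From the repository root this would be called as `check_config_paths config/config.yaml`.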
If analysing WES data, pass a design file (.bed) indicating the genomic regions that were sequenced (see here for more information on accessing design files). Also set the level of padding by passing the amount of padding in base pairs. If NOT analysing WES data, leave these fields blank. For example:
WES:
  # File path to the exome capture regions over which to operate
  INTERVALS: "/scratch/publicData/sure_select_human_all_exon_V7/S31285117_Padded.bed"
  # Padding (in bp) to add to each region
  PADDING: "100"
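Design files come from different vendors, so a basic sanity check of the .bed file before passing it to the pipeline may be useful. This sketch assumes a tab-separated file whose first three columns are chromosome, start and end. `check_bed` is a hypothetical helper and does not replace a proper format validator.

```shell
# Hypothetical helper: basic sanity check of a BED file of capture regions.
# Flags data lines with fewer than 3 tab-separated columns, non-integer
# coordinates, or start >= end. Lines starting with "track", "browser"
# or "#" are skipped.
check_bed() {
    awk -F'\t' '
        /^(track|browser|#)/ { next }
        NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 >= $3 + 0 {
            printf "line %d looks malformed: %s\n", NR, $0
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

The function returns a non-zero status if any line is flagged, so it can gate a run script.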
These settings allow you to configure the resources per rule/sample
Set the number of threads to use per sample/rule for multithreaded rules (rule trim_galore_pe and rule bwa_mem). Multithreading will significantly speed up these rules; however, the improvements in speed will diminish beyond 8 threads. If desired, a different number of threads can be set for these multithreaded rules by utilising the --set-threads flag in the run script (see step 7).
THREADS: 8
Set the maximum memory usage per rule/sample (eg. '40g' for 40 gigabytes, this should suffice for exomes)
MAXMEMORY: "40g"
Set the maximum number of GPUs to be used per rule/sample for GPU-accelerated runs (eg. 1 for 1 GPU)
GPU: 1
It is a good idea to consider the number of samples that you are processing. For example, if you set THREADS: 8 and set the maximum number of cores to be used by the pipeline in the run script to -j/--cores 32 (see step 7), a maximum of 4 samples (32 / 8) will be able to run at one time for these rules (if they are deployed at the same time), but each sample will complete faster. In contrast, if you set THREADS: 1 and -j/--cores 32, a maximum of 32 samples could be run at one time, but each sample will take longer to complete. The same trade-off needs to be considered when setting MAXMEMORY together with --resources mem_mb, and GPU together with --resources gpu.
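The interplay between per-job settings and run-wide limits can be estimated with a little integer arithmetic. The sketch below is illustrative only: `max_parallel_jobs` is a hypothetical helper, and it assumes that every job of a rule reserves the full THREADS and MAXMEMORY amounts, and that '40g' is counted as 40000 MB.

```shell
# Hypothetical helper: upper bound on how many per-sample jobs of one rule
# can run at once. All arguments are whole numbers.
#   $1 total cores (--cores)                  $2 threads per job (THREADS)
#   $3 total memory in MB (--resources mem_mb) $4 memory per job in MB
max_parallel_jobs() {
    local by_cores=$(( $1 / $2 ))
    local by_mem=$(( $3 / $4 ))
    # the scarcer resource sets the limit
    if [ "$by_cores" -lt "$by_mem" ]; then
        echo "$by_cores"
    else
        echo "$by_mem"
    fi
}
```

With the example values used in this guide, `max_parallel_jobs 32 8 150000 40000` prints 3: the cores alone would allow more jobs, but memory is the tighter limit under these assumptions.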
Specify whether the raw fastq reads should be trimmed (either 'Yes' or 'No'). For example:
TRIM: "Yes"
If trimming the raw fastq reads, set the trim galore adapter trimming parameters. Choose one of the common adapters such as Illumina universal, Nextera transposase or Illumina small RNA with --illumina, --nextera or --small_rna. Alternatively, pass adapter sequences to the -a and -a2 flags. If not set, trim galore will try to auto-detect the adapter based on the fastq reads.
TRIMMING:
  ADAPTERS: "--illumina"
Pass the resources to be used to recalibrate bases with gatk BaseRecalibrator (this passes the resource files to the --known-sites flag). These known polymorphic sites will be used to exclude regions around known polymorphisms from base recalibration. Note: you can include as many or as few resources as you like, but you'll need at least one recalibration resource file. For example:
RECALIBRATION:
  RESOURCES:
    - /scratch/publicData/b37/dbsnp_138.b37.vcf
    - /scratch/publicData/b37/Mills_and_1000G_gold_standard.indels.b37.vcf
    - /scratch/publicData/b37/1000G_phase1.indels.b37.vcf
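Before a long run, it may help to confirm that each recalibration resource exists and has an index file next to it. This is a sketch, not pipeline code: `check_known_sites` is a hypothetical helper, and the index names it looks for (.idx for plain VCF, .tbi for .vcf.gz) are the usual conventions, assumed here rather than documented by the pipeline.

```shell
# Hypothetical helper: for each known-sites file given, report a problem if
# the file is absent or has no index next to it. The index names checked
# (.idx for plain VCF, .tbi for .vcf.gz) are an assumption.
check_known_sites() {
    local vcf index status=0
    for vcf in "$@"; do
        case "$vcf" in
            *.gz) index="$vcf.tbi" ;;
            *)    index="$vcf.idx" ;;
        esac
        if [ ! -e "$vcf" ]; then
            echo "missing file: $vcf"
            status=1
        elif [ ! -e "$index" ]; then
            echo "missing index: $index"
            status=1
        fi
    done
    return "$status"
}
```

Pass it the same paths that are listed under RESOURCES in the configuration file.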
7. Modify the run scripts
Set the maximum number of cores to be used with the --cores flag and the maximum amount of memory to be used (in megabytes) with the --resources mem_mb= flag. If running GPU accelerated, also set the maximum number of GPUs to be used with the --resources gpu= flag. For example:
Dry run (dryrun.sh):
snakemake \
--dryrun \
--cores 32 \
--resources mem_mb=150000 \
--resources gpu=2 \
--use-conda \
--conda-frontend mamba \
--latency-wait 120 \
--configfile ../config/config.yaml
Full run (run.sh):
snakemake \
--cores 32 \
--resources mem_mb=150000 \
--resources gpu=2 \
--use-conda \
--latency-wait 120 \
--configfile ../config/config.yaml
See the snakemake documentation for additional run parameters.
8. Create and activate a conda environment with python and snakemake installed
cd ./workflow/
mamba env create -f pipeline_run_env.yml
conda activate pipeline_run_env
9. Run the pipeline
First carry out a dry run
bash dryrun.sh
If there are no issues, start a full run
bash run.sh
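The two steps above can be chained so that the full run only starts when the dry run exits cleanly. A minimal sketch, assuming both scripts signal failure with a non-zero exit status. `run_pipeline` is a hypothetical wrapper, not something shipped with the pipeline.

```shell
# Hypothetical wrapper: start the full run only if the dry run succeeds.
# By default both scripts are expected in the current directory (./workflow/).
run_pipeline() {
    local dry_script="${1:-dryrun.sh}" run_script="${2:-run.sh}"
    if bash "$dry_script"; then
        bash "$run_script"
    else
        echo "dry run failed; full run not started" >&2
        return 1
    fi
}
```

Called without arguments from the workflow directory, it runs dryrun.sh followed by run.sh.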
10. Evaluate the pipeline run
Generate an interactive html report
bash report.sh
11. Commit and push to your forked version of the github repo
To maintain reproducibility, commit and push:
- All configuration files
- All run scripts
- The final report
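One way to keep each analysis as a single, identifiable commit is to stage only the files listed above and commit them together. The following is a sketch: `commit_run` is a hypothetical helper, and the file paths in the example call are assumptions based on the directory names used earlier in this guide.

```shell
# Hypothetical helper: stage the given files and record them in one commit.
#   $1   commit message
#   $2.. files to include (configuration, run scripts, report)
commit_run() {
    local message="$1"
    shift
    git add -- "$@" && git commit --quiet -m "$message"
}

# Example call from the repository root, followed by a push to your fork:
#   commit_run "Cohort run, THREADS 8" config/config.yaml workflow/dryrun.sh workflow/run.sh
#   git push
```

Add the report file to the call as well; its name depends on what report.sh produces.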
13. Raise issues, create feature requests or create a pull request with the upstream repo to merge any useful changes to the pipeline (optional)
See the README for info on how to contribute back to the pipeline!