Skip to content
cfarkas edited this page Dec 4, 2022 · 23 revisions

Welcome to the variants2genes wiki!

What this pipeline do:

variants2genes is a pipeline that address (based on several well-known genomic tools) case associated variants (with correspondent genes) between two matched samples from RNA-seq/WES/WGS data. The pipeline generates genome-wide plots of coverage between the two samples as initial inspection, and detect alleles present in the case sample (but not substantially present in the control) by calling variants and intersecting the correspondent genes using intersecting the output from bcftools and strelka somatic variant calling: https://github.com/Illumina/strelka. This resource could be useful in the Normal/Tumor comparison analysis, haplotype analysis, characterization of substrains, among other scenarios. The pipeline requiere BASH/R enviroment and is available for several organism models such as Human, Mouse, Rat and Chicken.

Pipeline Outline:

1) Picard and GATK4 best practices preprocessing of BAM files : https://gatk.broadinstitute.org/hc/en-us/articles/360035531192-RNAseq-short-variant-discovery-SNPs-Indels-
2) Call variants in Control and Case samples using bcftools mpileup.
3) Filter these variants using vcflib and bedtools (mainly to correct coverage artifacts)
4) Call germline and somatic variants in both samples using Strelka2 variant caller.
5) Intersect (using --invert flag) strelka germline variants with bcftools filtered variants. 

- The output from these steps will output case-linked variants and correspondent genes. 
- Case-linked somatic variants and case-linked INDELs were be also reported.

Dependence installation details

  • In linux-based systems, with sudo privileges, users can install all dependences as follows:

Obtaining and installing R (>=3.2.0)

See https://cran.r-project.org/sources.html for R installation in Linux/Ubuntu and Windows. R version 3.2.3 comes with default in Ubuntu >=16.04 LTS, so first check your R version before installing a new one. Users with older Ubuntu distributions must upgrade R. A way accomplish this can be the following:

# Removing R from system
sudo apt-get remove r-base-core

# Editing sources.list 
sudo su
echo "deb https://cloud.r-project.org/bin/linux/ubuntu xenial-cran35/" >> /etc/apt/sources.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9

# Installing R (version 3.6.0)
sudo apt update; sudo apt install r-base
exit
R

The following R packages are required for the script usages:

dplyr >=0.7.4
gridExtra >=2.3
reshape2 >=1.4.2
ggplot2 >=2.2.1

These packages can be installed in R (macOS/ubuntu) by opening R and typying:

install.packages("dplyr")
install.packages("gridExtra")
install.packages("reshape2")
install.packages("ggplot2")

To check installation of these packages, open R and type:

library(dplyr)
library(gridExtra)
library(reshape2)
library(ggplot2)

Obtaining and installing vcflib:

Clone vcflib folder in current directory:

git clone --recursive git://github.com/vcflib/vcflib.git

# If this line not work, try:
git config --global url.https://github.com/.insteadOf git://github.com/
git clone --recursive git://github.com/vcflib/vcflib.git

#Enter vcflib directory and make
cd vcflib
make   # Needs CMake compiler, with sudo privileges do: sudo apt-get install cmake

#After make, binaries and scripts can be copied in /usr/local/bin with sudo. In vcflib/ directory:

sudo cp scripts/* /usr/local/bin/
sudo cp bin/* /usr/local/bin/

# To check vcflib scripts, type vcf in terminal followed by TAB and display all posibilities

Obtaining and Installing BEDTools

Complete instructions can be found in https://bedtools.readthedocs.io/en/latest/content/installation.html. Users with privileges can accomplish with sudo:

sudo apt-get install bedtools

Obtaining and installing up-to-date SAMtools, bcftools and htslib (latest version=1.15, March 24, 2022)

Old samtools version will not work. Users needs to install version up to date of these three packages. Users can first install htslib v1.15 and then samtools with bcftools v1.15, respectively. For downloading these packages, see https://github.com/samtools/samtools/releases/). The latter can be accomplish by downloading the three packages, decompressing it, and doing the following:

# Preeliminars
sudo apt-get install -y libbz2-dev 

# htslib
wget https://github.com/samtools/htslib/releases/download/1.15/htslib-1.15.tar.bz2
bzip2 -d htslib-1.15.tar.bz2
tar -xvf htslib-1.15.tar
cd htslib-1.15/
sudo ./configure --prefix=/usr/local/bin
sudo make
sudo make install
cd ..

# samtools
wget https://github.com/samtools/samtools/releases/download/1.15/samtools-1.15.tar.bz2
bzip2 -d samtools-1.15.tar.bz2
tar -xvf samtools-1.15.tar
cd samtools-1.15/
sudo ./configure --prefix=/usr/local/bin
sudo make
sudo make install
cd ..

# bcftools
wget https://github.com/samtools/bcftools/releases/download/1.15/bcftools-1.15.tar.bz2
bzip2 -d bcftools-1.15.tar.bz2
tar -xvf bcftools-1.15.tar
cd bcftools-1.15
sudo ./configure --prefix=/usr/local/bin
sudo make
sudo make install

Obtaining and installing Freebayes and vcflib:

Clone Freebayes folder in current directory:

git clone --recursive git://github.com/ekg/freebayes.git

# If this line not work, try:

git config --global url.https://github.com/.insteadOf git://github.com/
git clone --recursive git://github.com/ekg/freebayes.git

#Enter Freebayes directory and make:
cd freebayes
make
sudo make install
sudo cp scripts/* /usr/local/bin/

#To check installation, type in terminal:
freebayes
bamleftalign

## Installing vcflib (in freebayes folder):

cd vcflib
make   

#After make, binaries and scripts can be copied in /usr/local/bin with sudo. In vcflib/ directory:

sudo cp scripts/* /usr/local/bin/
sudo cp bin/* /usr/local/bin/

#To check vcflib scripts, type vcf in terminal followed by TAB and display all posibilities

Obtaining tabix (version >= 1.7, to decompress .vfc.gz files from bcftools output).

For detailed install, check here: http://wiki.wubrowse.org/How_to_install_tabix. User with root privileges can do:

sudo apt install tabix

Obtaining and installing Subread (for using featurecounts)

Complete instructions can be found in http://subread.sourceforge.net/. Users with privileges can accomplish with sudo:

sudo apt-get install subread

Obtaining SRA toolkit from ncbi (for downloading reads from GEO datasets).

wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.9.6/sratoolkit.2.9.6-ubuntu64.tar.gz
gunzip sratoolkit.2.9.6-ubuntu64.tar.gz
tar -xvf sratoolkit.2.9.6-ubuntu64.tar
sudo cp sratoolkit.2.9.6-ubuntu64/bin/fastq-dump /usr/local/bin/

Obtaining HISAT2 aligner (for aligning RNA-seq data).

for detailed install instructions check here: http://ccb.jhu.edu/software/hisat2/manual.shtml . The pre-compiled version for linux can be obtained from this webpage as follows:

wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/downloads/hisat2-2.0.4-Linux_x86_64.zip
unzip hisat2-2.0.4-Linux_x86_64.zip
sudo cp hisat2-2.0.4/hisat2* /usr/local/bin/

Obtaining Rust, Cargo Rust Package and Install varlociraptor

# Install cmake : https://cmake.org/install/
cd --
wget https://github.com/Kitware/CMake/releases/download/v3.23.0/cmake-3.23.0.tar.gz
gunzip cmake-3.23.0.tar.gz
tar -xvf cmake-3.23.0.tar
cd cmake-3.23.0/
./bootstrap
make
make install
sudo make install

# Install Rust and Cargo
cd --
curl https://sh.rustup.rs -sSf | sh

# To configure your current shell, run:
source $HOME/.cargo/env

# Install varlociraptor
cargo install varlociraptor

Example: Collect KO-linked variants from RNA-seq data:

  • As an example, we will analyze haplotypes from an RNA-seq data taken from SALL2 wild type and knockout mice, presenting germline variants linked to Chromosome 14, see: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-5504-9. With the pipeline, we will obtain these linked variants to knockout mice, not present in the wild-type counterpart. The correspondent illumina reads will be downloaded and aligned against mm10 genome (mus musculus version 10).
  • As outputs, the pipeline will take BAM file names until the .bam prefix is encountered
  • From scratch, inside variants2genes folder:

Obtaining BAM files to be used as inputs:

# Inside variants2genes folder
mkdir SALL2_WT_vs_KO && cd SALL2_WT_vs_KO

## (1) Download reference genome from UCSC and correspondent GTF file. Then, build HISAT2 index. 
../bin/genome-download mm10
hisat2-build mm10.fa mm10_hisat2

## (2) Download and align SALL2 Wild type and Knockout reads with HISAT2, using 25 threads.
prefetch -O ./ SRR8267474 && fastq-dump --gzip SRR8267474            # WT sample
hisat2 -x mm10_hisat2 -p 25 -U SRR8267474.fastq.gz | samtools view -bSh > WT.bam
prefetch -O ./ SRR8267458 && fastq-dump --gzip SRR8267458            # KO sample
hisat2 -x mm10_hisat2 -p 25 -U SRR8267458.fastq.gz | samtools view -bSh > KO.bam

## (3) sort and index bam samples using 25 threads
samtools sort -o WT.sorted.bam WT.bam -@ 25 && samtools index WT.sorted.bam -@ 25
samtools sort -o KO.sorted.bam KO.bam -@ 25 && samtools index KO.sorted.bam -@ 25

## (4) (optional): Use plot-variants to inspect genome-wide variants in every sample (check graph.pdf)
../bin/plot-variants -a WT.sorted.bam -b KO.sorted.bam -g mm10.fa -p ../R_scripts/bam_coverage_mouse.R

Running the pipeline:

## (5) Run variants2genes.sh pipeline to collect KO-linked variants and correspondent genes with variants (using 20 threads)
wget -O mm10_dbSNP.raw.vcf https://usegalaxy.org/datasets/bbd44e69cb8906b509fb7398dabbcd16/display?to_ext=vcf   # download mm10 known-snps sites
../bin/variants2genes -a WT.sorted.bam -b KO.sorted.bam -g mm10.fa -r mm10.gtf -s mm10_dbSNP.raw.vcf -t 20

Inside variants2genes_$DATE_OF_EXECUTION, check output_files sub-folder containing output files. From this example, two chr12 and 502 chr14 KO-linked germline variants were discovered.

Employing user-provided genome and/or GTF files:

Important: If users have their own genome and/or annotation file, their can use it in the pipeline, if desired. Their must edit STEP 1 and STEP 5. We will run the example using my_genome.fa and my_annotation.gtf instead of mm10.fa and mm10.gtf as follows:

Obtaining BAM files to be used as inputs:

## (1): Build HISAT2 index of my_genome.fa. 
hisat2-build my_genome.fa my_genome_hisat2

... Same Steps (2), (3) and (4) ...

Running the pipeline:

## (5) Run variants2genes.sh script to collect KO-linked variants and correspondent genes with variants (using 20 threads)
wget -O mm10_dbSNP.raw.vcf https://usegalaxy.org/datasets/bbd44e69cb8906b509fb7398dabbcd16/display?to_ext=vcf   # download mm10 known-snps sites
../bin/variants2genes -a WT.sorted.bam -b KO.sorted.bam -g my_genome.fa -r my_annotation.gtf -s mm10_dbSNP.raw.vcf -t 20