-
Notifications
You must be signed in to change notification settings - Fork 0
Home
Welcome to the variants2genes wiki!
variants2genes is a pipeline that address (based on several well-known genomic tools) case associated variants (with correspondent genes) between two matched samples from RNA-seq/WES/WGS data. The pipeline generates genome-wide plots of coverage between the two samples as initial inspection, and detect alleles present in the case sample (but not substantially present in the control) by calling variants and intersecting the correspondent genes using intersecting the output from bcftools and strelka somatic variant calling: https://github.com/Illumina/strelka. This resource could be useful in the Normal/Tumor comparison analysis, haplotype analysis, characterization of substrains, among other scenarios. The pipeline requiere BASH/R enviroment and is available for several organism models such as Human, Mouse, Rat and Chicken.
Pipeline Outline:
1) Picard and GATK4 best practices preprocessing of BAM files : https://gatk.broadinstitute.org/hc/en-us/articles/360035531192-RNAseq-short-variant-discovery-SNPs-Indels-
2) Call variants in Control and Case samples using bcftools mpileup.
3) Filter these variants using vcflib and bedtools (mainly to correct coverage artifacts)
4) Call germline and somatic variants in both samples using Strelka2 variant caller.
5) Intersect (using --invert flag) strelka germline variants with bcftools filtered variants.
- The output from these steps will output case-linked variants and correspondent genes.
- Case-linked somatic variants and case-linked INDELs were be also reported.
- In linux-based systems, with sudo privileges, users can install all dependences as follows:
See https://cran.r-project.org/sources.html
for R installation in Linux/Ubuntu and Windows. R version 3.2.3 comes with default in Ubuntu >=16.04 LTS, so first check your R version before installing a new one. Users with older Ubuntu distributions must upgrade R. A way accomplish this can be the following:
# Removing R from system
sudo apt-get remove r-base-core
# Editing sources.list
sudo su
echo "deb https://cloud.r-project.org/bin/linux/ubuntu xenial-cran35/" >> /etc/apt/sources.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
# Installing R (version 3.6.0)
sudo apt update; sudo apt install r-base
exit
R
The following R packages are required for the script usages:
dplyr >=0.7.4
gridExtra >=2.3
reshape2 >=1.4.2
ggplot2 >=2.2.1
These packages can be installed in R (macOS/ubuntu) by opening R and typying:
install.packages("dplyr")
install.packages("gridExtra")
install.packages("reshape2")
install.packages("ggplot2")
To check installation of these packages, open R and type:
library(dplyr)
library(gridExtra)
library(reshape2)
library(ggplot2)
Clone vcflib folder in current directory:
git clone --recursive git://github.com/vcflib/vcflib.git
# If this line not work, try:
git config --global url.https://github.com/.insteadOf git://github.com/
git clone --recursive git://github.com/vcflib/vcflib.git
#Enter vcflib directory and make
cd vcflib
make # Needs CMake compiler, with sudo privileges do: sudo apt-get install cmake
#After make, binaries and scripts can be copied in /usr/local/bin with sudo. In vcflib/ directory:
sudo cp scripts/* /usr/local/bin/
sudo cp bin/* /usr/local/bin/
# To check vcflib scripts, type vcf in terminal followed by TAB and display all posibilities
Complete instructions can be found in https://bedtools.readthedocs.io/en/latest/content/installation.html. Users with privileges can accomplish with sudo:
sudo apt-get install bedtools
Obtaining and installing up-to-date SAMtools, bcftools and htslib (latest version=1.15, March 24, 2022)
Old samtools version will not work. Users needs to install version up to date of these three packages. Users can first install htslib v1.15 and then samtools with bcftools v1.15, respectively. For downloading these packages, see https://github.com/samtools/samtools/releases/). The latter can be accomplish by downloading the three packages, decompressing it, and doing the following:
# Preeliminars
sudo apt-get install -y libbz2-dev
# htslib
wget https://github.com/samtools/htslib/releases/download/1.15/htslib-1.15.tar.bz2
bzip2 -d htslib-1.15.tar.bz2
tar -xvf htslib-1.15.tar
cd htslib-1.15/
sudo ./configure --prefix=/usr/local/bin
sudo make
sudo make install
cd ..
# samtools
wget https://github.com/samtools/samtools/releases/download/1.15/samtools-1.15.tar.bz2
bzip2 -d samtools-1.15.tar.bz2
tar -xvf samtools-1.15.tar
cd samtools-1.15/
sudo ./configure --prefix=/usr/local/bin
sudo make
sudo make install
cd ..
# bcftools
wget https://github.com/samtools/bcftools/releases/download/1.15/bcftools-1.15.tar.bz2
bzip2 -d bcftools-1.15.tar.bz2
tar -xvf bcftools-1.15.tar
cd bcftools-1.15
sudo ./configure --prefix=/usr/local/bin
sudo make
sudo make install
Clone Freebayes folder in current directory:
git clone --recursive git://github.com/ekg/freebayes.git
# If this line not work, try:
git config --global url.https://github.com/.insteadOf git://github.com/
git clone --recursive git://github.com/ekg/freebayes.git
#Enter Freebayes directory and make:
cd freebayes
make
sudo make install
sudo cp scripts/* /usr/local/bin/
#To check installation, type in terminal:
freebayes
bamleftalign
## Installing vcflib (in freebayes folder):
cd vcflib
make
#After make, binaries and scripts can be copied in /usr/local/bin with sudo. In vcflib/ directory:
sudo cp scripts/* /usr/local/bin/
sudo cp bin/* /usr/local/bin/
#To check vcflib scripts, type vcf in terminal followed by TAB and display all posibilities
For detailed install, check here: http://wiki.wubrowse.org/How_to_install_tabix. User with root privileges can do:
sudo apt install tabix
Complete instructions can be found in http://subread.sourceforge.net/. Users with privileges can accomplish with sudo:
sudo apt-get install subread
wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.9.6/sratoolkit.2.9.6-ubuntu64.tar.gz
gunzip sratoolkit.2.9.6-ubuntu64.tar.gz
tar -xvf sratoolkit.2.9.6-ubuntu64.tar
sudo cp sratoolkit.2.9.6-ubuntu64/bin/fastq-dump /usr/local/bin/
for detailed install instructions check here: http://ccb.jhu.edu/software/hisat2/manual.shtml . The pre-compiled version for linux can be obtained from this webpage as follows:
wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/downloads/hisat2-2.0.4-Linux_x86_64.zip
unzip hisat2-2.0.4-Linux_x86_64.zip
sudo cp hisat2-2.0.4/hisat2* /usr/local/bin/
# Install cmake : https://cmake.org/install/
cd --
wget https://github.com/Kitware/CMake/releases/download/v3.23.0/cmake-3.23.0.tar.gz
gunzip cmake-3.23.0.tar.gz
tar -xvf cmake-3.23.0.tar
cd cmake-3.23.0/
./bootstrap
make
make install
sudo make install
# Install Rust and Cargo
cd --
curl https://sh.rustup.rs -sSf | sh
# To configure your current shell, run:
source $HOME/.cargo/env
# Install varlociraptor
cargo install varlociraptor
- As an example, we will analyze haplotypes from an RNA-seq data taken from SALL2 wild type and knockout mice, presenting germline variants linked to Chromosome 14, see: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-5504-9. With the pipeline, we will obtain these linked variants to knockout mice, not present in the wild-type counterpart. The correspondent illumina reads will be downloaded and aligned against mm10 genome (mus musculus version 10).
- As outputs, the pipeline will take BAM file names until the
.bam
prefix is encountered - From scratch, inside
variants2genes
folder:
# Inside variants2genes folder
mkdir SALL2_WT_vs_KO && cd SALL2_WT_vs_KO
## (1) Download reference genome from UCSC and correspondent GTF file. Then, build HISAT2 index.
../bin/genome-download mm10
hisat2-build mm10.fa mm10_hisat2
## (2) Download and align SALL2 Wild type and Knockout reads with HISAT2, using 25 threads.
prefetch -O ./ SRR8267474 && fastq-dump --gzip SRR8267474 # WT sample
hisat2 -x mm10_hisat2 -p 25 -U SRR8267474.fastq.gz | samtools view -bSh > WT.bam
prefetch -O ./ SRR8267458 && fastq-dump --gzip SRR8267458 # KO sample
hisat2 -x mm10_hisat2 -p 25 -U SRR8267458.fastq.gz | samtools view -bSh > KO.bam
## (3) sort and index bam samples using 25 threads
samtools sort -o WT.sorted.bam WT.bam -@ 25 && samtools index WT.sorted.bam -@ 25
samtools sort -o KO.sorted.bam KO.bam -@ 25 && samtools index KO.sorted.bam -@ 25
## (4) (optional): Use plot-variants to inspect genome-wide variants in every sample (check graph.pdf)
../bin/plot-variants -a WT.sorted.bam -b KO.sorted.bam -g mm10.fa -p ../R_scripts/bam_coverage_mouse.R
## (5) Run variants2genes.sh pipeline to collect KO-linked variants and correspondent genes with variants (using 20 threads)
wget -O mm10_dbSNP.raw.vcf https://usegalaxy.org/datasets/bbd44e69cb8906b509fb7398dabbcd16/display?to_ext=vcf # download mm10 known-snps sites
../bin/variants2genes -a WT.sorted.bam -b KO.sorted.bam -g mm10.fa -r mm10.gtf -s mm10_dbSNP.raw.vcf -t 20
Inside variants2genes_$DATE_OF_EXECUTION
, check output_files
sub-folder containing output files. From this example, two chr12 and 502 chr14 KO-linked germline variants were discovered.
Important: If users have their own genome and/or annotation file, their can use it in the pipeline, if desired. Their must edit STEP 1 and STEP 5. We will run the example using my_genome.fa
and my_annotation.gtf
instead of mm10.fa
and mm10.gtf
as follows:
## (1): Build HISAT2 index of my_genome.fa.
hisat2-build my_genome.fa my_genome_hisat2
... Same Steps (2), (3) and (4) ...
## (5) Run variants2genes.sh script to collect KO-linked variants and correspondent genes with variants (using 20 threads)
wget -O mm10_dbSNP.raw.vcf https://usegalaxy.org/datasets/bbd44e69cb8906b509fb7398dabbcd16/display?to_ext=vcf # download mm10 known-snps sites
../bin/variants2genes -a WT.sorted.bam -b KO.sorted.bam -g my_genome.fa -r my_annotation.gtf -s mm10_dbSNP.raw.vcf -t 20