Software for inferring coalescence rates for unphased, low-coverage genomes.
- Required input: Two samples specified in bcf/bam format
- Requires mutation ages obtained from a genealogy (e.g., using Relate). Precomputed mutation ages available below.
- That's it!
Please send any comments, questions and bug reports to leo.speidel@outlook.com.
Paper describing the method: https://academic.oup.com/mbe/article/38/9/3497/6299394
Preprint: https://www.biorxiv.org/content/10.1101/2021.02.17.431573v1
A working example using chromosome 1 of two ancient human genomes (LBK and Loschbour; Lazaridis et al., Nature 2014) can be downloaded from here (512Mb).
The same example for all chromosomes is available here (7Gb).
Precompiled binaries for several platforms can be found under ./binaries.
Requirements*:
- C++11
- cmake
- ZLIB
- OpenSSL
- lzma
*On my cluster, I have to load modules CMake, GCC, and BCFtools to compile.
*On a Mac, I installed CMake and openssl@1.1 using homebrew and updated paths in lines 24-29 of CMakeLists.txt.
To compile from source use
cd build
cmake ..
make
Colate uses a modifed version of htslib, which has been copied to ./include/vcf/.
- Two sequences, in bcf or bam format, for which coalescence rates will be calculated
- A reference genome in fasta format (see below)
- Genome mask file (optional), to filter out unreliable regions.
- Mutation ages obtained from a genealogy (see below for precomputed mutation ages)
We provide a preprocessed file for the SGDP data (aligned to GRCh37) here: SGDP_mutages.tar (654Mb).
A human reference genome (GRCh37) in the required format (i.e., separated by chromosome): GRCh37.tgz (512Mb)
(not needed if using SGDP genealogy from above, you can simply download these files above)
Get mutations ages from a genealogy; this step will add fixed mutations to a *.mut file (see Relate documentation for file format), which is needed for Colate.
- ancestral genome in fasta format
- reference genome in fasta format
- anc/mut files of a Relate genealogy
- bcf of samples used to build Relate genealogy Link to human ancestral genomes can be found here.
chr=1
${PATH_TO_BINARY}/Colate --mode preprocess_mut \
--anc example_chr${chr}.anc.gz \
--mut example_chr${chr}.mut.gz \
--reference_bcf example_chr${chr}.bcf \
--ref_genome GRCh37_chr${chr}.fa.gz \
--anc_genome human_ancestor_${chr}.fa.gz \
--mask example_mask_chr${chr}.fa \
-o example_fixed_chr${chr}
#output is example_fixed_chr${chr}.mut
You can either directly run Colate on bcfs or bams, or precompute an input file, which is beneficial when running Colate repeatedly on the same samples.
bcfs:
- bcfs contain samples of interest (Please use e.g., bcftools view -S or -s to subset bcf files).
- chr.txt: Chromosome names, one per line (can be any strings)
- If chromosome names are "1", "2", etc, then input files are *_chr1.bcf, *_chr2.bcf etc.
- target_masks and ref_genome should be separated by chromosome, i.e. GRCh37_chr1.fa.gz, GRCh37_chr2.fa.gz etc.
- Target masks are given as fasta files, with "P" for passing and "N" for non-passing bases (one file per chromosome). These should always be specified when input is calculated from a BCF. The mask should indicate where the sample was callable, such that we know where the sample is homozygous REF.
- If --chr is not specified, full file names are required (e.g., --target example.bcf), however these should only contain a single chromosome.
mut="example_fixed" #name of .mut files obtained from step 1 (or downloaded)
${PATH_TO_BINARY}/Colate \
--mode make_tmp \
--mut ${mut} \
--target_bcf example_target \
--target_mask target_mask \
--ref_genome GRCh37 \
--chr chr.txt \
-o example_out
#output is example_out.colate.in
bam:
- bam should be sorted by chromosome name (same order as in chr.txt), and by position.
- ref_genome should be separated by chromosome, i.e. GRCh37_chr1.fa.gz, GRCh37_chr2.fa.gz etc.
- chr.txt: Chromosome names, one per line (can be any strings but consistent with file names)
mut="example_fixed" #name of .mut files obtained from step 1 (or downloaded)
${PATH_TO_BINARY}/Colate \
--mode make_tmp \
--mut ${mut} \
--target_bam example_target.bam \
--ref_genome GRCh37 \
--chr chr.txt \
-o example_out
#output is example_out.colate.in
- Mask files are not necessary if already specified earlier, unless you want to specifically mask out further regions of the genome.
#Optional:
#--target_mask, --reference_mask,
#--target_age, --reference_age (specified in years)
#--years_per_gen, --num_bootstrap
#--coal is used to specify arbitrary epoch boundaries and initial coalescence rates (file format .coal).
mut="example_fixed" #name of .mut files obtained from step 1 (or downloaded)
bins="3,7,0.2" #epochs in log10 years (format: start,end,stepsize)
${PATH_TO_BINARY}/Colate \
--mode mut \
--mut ${mut} \
--target_tmp example_target.colate.in \
--reference_tmp example_reference.colate.in \
--bins ${bins} \
--chr chr.txt \
--num_bootstrap 20 \
--target_age 1000 \
--reference_age 0 \
--years_per_gen 28 \
-o example_out
#output is example_out.coal, see bottom of page for file format.
- If chromosome names are "1", "2", etc, then input files are *_chr1.bcf, *_chr2.bcf etc.
- If --chr is not specified, full file names are required (e.g., --target example.bcf), however these should only contain a single chromosome.
- chr.txt: Chromosome names, one per line (can be any strings)
- Target and reference masks are given as fasta files, with "P" for passing and "N" for non-passing bases (one file per chromosome). These should always be specified when input is calculated from a BCF. The masks should indicate, where the sample was callable, such that we know where the sample is homozygous REF.
#--target_mask, --reference_mask,
#--target_age, --reference_age (specified in years)
#--years_per_gen, --num_bootstrap are optional
#--coal is used to specify arbitrary epoch boundaries and initial coalescence rates (file format .coal).
#age is specified in years.
#Assumption is any site not in the bcf is homozygous reference (unless masked out by a mask file).
mut="example_fixed" #name of .mut files obtained from step 1 (or downloaded)
bins="3,7,0.2" #epochs in log10 years (format: start,end,stepsize)
${PATH_TO_BINARY}/Colate \
--mode mut \
--mut ${mut} \
--target_bcf example_target \
--reference_bcf example_reference \
--ref_genome GRCh37_chr${chr}.fa.gz \
--target_mask target_mask \
--reference_mask reference_mask \
--bins ${bins} \
--chr chr.txt \
--num_bootstrap 20 \
--target_age 1000 \
--reference_age 0 \
--years_per_gen 28 \
-o example_out
#output is example_out.coal, see bottom of page for file format.
- bams should be sorted by chromosome name (same order as in chr.txt), and by position.
- ref_genome should be separated by chromosome, i.e. GRCh37_chr1.fa.gz, GRCh37_chr2.fa.gz etc.
- chr.txt: Chromosome names, one per line (can be any strings)
#--target_age, --reference_age (specified in years)
#--years_per_gen, --num_bootstrap are optional
#--coal is used to specify arbitrary epoch boundaries and initial coalescence rates (file format .coal).
mut="example_fixed" #name of .mut files obtained from step 1 (or downloaded)
bins="3,7,0.2" #epochs in log10 years (format: start,end,stepsize)
${PATH_TO_BINARY}/Colate \
--mode mut \
--mut ${mut} \
--target_bam example_target.bam \
--reference_bam example_reference.bam \
--ref_genome GRCh37 \
--bins ${bins} \
--chr chr.txt \
--num_bootstrap 20 \
--target_age 1000 \
--reference_age 0 \
--years_per_gen 28 \
-o example_out
#output is example_out.coal, see bottom of page for file format.
*coal files (see Relate documentation for file format). These can be loaded using the relater R package.
library(relater)
coal <- read.coal("example_out.coal") #group2 indexes bootstrap iterations
Summarize bootstrap iterations:
library(dplyr)
coal %>% group_by(epoch.start) %>%
summarize(mean = mean(haploid.coalescence.rate),
lower = quantile(haploid.coalescence.rate, prob = 0.025),
upper = quantile(haploid.coalescence.rate, prob = 0.975)) -> coal