NEWS:
🚀 Version 1.1, with several improvements in stability, speed and memory consumption, has been released.
An automated computational framework for detecting Saccharomyces paradoxus introgressions in Saccharomyces cerevisiae strains from paired-end Illumina sequencing.
v1.0 is described in Tellini et al. 2024, Nat. Ecol. Evol., for detecting S.par introgressions in S.cer strains.
v1.1 contains the following changes:
- minimap2 replaced bwa mem, almost halving the running time (see Heng Li 2018, Bioinformatics) while achieving comparable results. Example:

  sample: ERR3010122
  threads: 2
  Architecture: x86_64
  CPU: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz

  | script | Elapsed time (m:ss) | Maximum resident set size (GB) |
  |---|---|---|
  | bwa mem + samtools (v1) | 6:21 | 1.3 |
  | minimap2 + samtools (v1.1) | 3:36 | 1.3 |

- improved the reproducibility of the mapping by implementing the standard samtools workflow according to the samtools guidelines
- improved the robustness of the mapping by appending the name of the strain to a checkpoint (cps) file (./cps/cps.txt). Strains whose names are stored in ./cps/cps.txt will not be mapped again.
- introduced data.table, lapply and a custom function for large-file manipulation to reduce runtime and RAM load. Example:

  sample: ERR3010122
  threads: 2
  Architecture: x86_64
  CPU: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz

  | script | Elapsed time (m:ss) | Maximum resident set size (GB) |
  |---|---|---|
  | parser_marker.r (v1) | 0:17 | 0.8 |
  | parser_marker.r (v1.1) | 0:06 | 0.5 |
  | clrs.r (v1) | 0:49 | 1.9 |
  | clrs.r (v1.1) | 0:17 | 0.7 |

- introduced the variables nSamples and nThreads inside runner.sh. The first controls the number of samples to run in parallel and the second the number of threads per sample. nSamples guarantees a constant number of samples running in parallel: as soon as one sample finishes, another one starts. These variables affect the scripts minimap2.sh (which replaces bwa.sh), bcftools_markers.sh (which replaces samtools_marker.sh) and freec.sh.
- corrected an error that prevented the detection of the CNVs.
- added a new approach for merging markers in blocks:
  In v1 the markers are (1) genotyped, (2) filtered and (3) joined as long as they are consecutive and carry the same information. These steps are unchanged in v1.1, which adds a ranking step before them.
  In v1.1 the markers are (1) ranked, (2) genotyped, (3) filtered and (4) joined as long as they are consecutive in the ranking and carry the same information. v1 did not use the ranking. Inevitably, this results in a more fragmented signal, but it provides a more realistic and faithful representation of the introgression, reflecting regions where the genotyping was either discordant or failed. The ranking is also the strategy that allowed the speedup of clrs.r (the script that generates the blocks).
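The effect of joining markers only when they are consecutive in the ranking can be illustrated with a small sketch. It is not the code in clrs.r: the four-column input (rank, chromosome, position, genotype), the genotype labels and the function name `merge_blocks` are assumptions made for illustration, and "consecutive in the ranking" is read here as "rank differs by exactly one".

```shell
#!/bin/bash
# Sketch only: the column layout (rank, chr, pos, genotype) is an assumption,
# not the format used by clrs.r. Markers that failed filtering are simply absent.

merge_blocks() {
  # Reads tab-separated rows sorted by rank from stdin.
  # Prints one line per block: chr, first pos, last pos, genotype, marker count.
  awk -F'\t' -v OFS='\t' '
    function emit() { if (n > 0) print chr, first, last, gt, n }
    {
      # Extend the current block only if this marker is next in the ranking,
      # lies on the same chromosome and carries the same genotype.
      if (n > 0 && $1 == prev + 1 && $2 == chr && $4 == gt) {
        last = $3; n++
      } else {
        emit()
        chr = $2; first = $3; last = $3; gt = $4; n = 1
      }
      prev = $1
    }
    END { emit() }
  '
}

# Rank 3 is missing (e.g. its genotyping failed), so the S.par signal is split.
printf '%s\t%s\t%s\t%s\n' \
  1 chrI 100 Spar \
  2 chrI 250 Spar \
  4 chrI 900 Spar \
  5 chrI 950 Scer | merge_blocks
```

With this toy input the sketch prints three blocks: chrI 100-250 (Spar, 2 markers), chrI 900-900 (Spar, 1 marker) and chrI 950-950 (Scer, 1 marker). Ignoring the ranking, as in v1, the first three markers would have been joined into a single block.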
Download:
git clone --recursive https://github.com/nicolo-tellini/intropipeline.git
📂 Content:
.
├── rep
│ ├── Ann
│ └── Asm
├── runner.sh
├── scr
└── seq
5 directories, 1 file

- rep: repository with assemblies, annotations and the pre-computed marker table
- runner.sh: the script you edit and run
- scr: scripts
- seq: put the FASTQ files here
gzip -d ./rep/mrktab.gz
gzip -d ./rep/Asm/*gz
Move the FASTQ files into ./seq/.
Paired-end FASTQ files must be gzipped and suffixed with .R1.fastq.gz and .R2.fastq.gz.
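Before launching the pipeline it can help to confirm that every sample in ./seq has both mates with the expected suffixes. The following is a minimal sketch, not part of the pipeline; the function name `list_paired_samples` is hypothetical.

```shell
#!/bin/bash
# Sketch: print the sample names that have both mates in a directory and
# warn (on stderr) about R1 files whose R2 mate is missing.

list_paired_samples() {
  local dir="$1" r1 sample
  for r1 in "$dir"/*.R1.fastq.gz; do
    [ -e "$r1" ] || continue   # no match: the glob pattern stays literal
    sample="$(basename "$r1" .R1.fastq.gz)"
    if [ -e "$dir/$sample.R2.fastq.gz" ]; then
      echo "$sample"
    else
      echo "missing mate: $dir/$sample.R2.fastq.gz" >&2
    fi
  done
}

list_paired_samples ./seq
```

Files that do not follow the .R1.fastq.gz / .R2.fastq.gz convention are not listed at all, which makes naming mistakes easy to spot.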
- ./scr/bwa.sh uses 2 threads per sample (n.samples = 2).
- ./scr/samtools_markers.sh uses 1 thread per sample (n.samples = 4).
- ./scr/gem.sh uses 2 threads.
- ./scr/freec.sh uses 4 threads.

These values can be changed by editing the scripts.
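The scheduling described in the NEWS for nSamples and nThreads, together with the checkpoint file, can be sketched as follows. This is an illustration under stated assumptions, not the code in runner.sh or minimap2.sh: `run_sample` and `run_all` are hypothetical names, the per-sample work is a placeholder, and `wait -n` requires bash 4.3 or later.

```shell
#!/bin/bash
# Sketch only: run_sample is a hypothetical stand-in for the per-sample step.

nSamples=2                    # samples processed at the same time
nThreads=2                    # threads given to each sample
cps="${cps:-./cps/cps.txt}"   # checkpoint file: one finished strain per line

run_sample() {
  local strain="$1"
  # Placeholder for the real work (e.g. a mapper run with $nThreads threads).
  echo "processing $strain with $nThreads threads"
  sleep 0.1
  echo "$strain" >> "$cps"    # record the strain once it has completed
}

run_all() {
  local strain running=0
  mkdir -p "$(dirname "$cps")"
  touch "$cps"
  for strain in "$@"; do
    # Skip strains already listed in the checkpoint file.
    if grep -Fxq "$strain" "$cps"; then
      continue
    fi
    # Keep nSamples jobs alive: when the pool is full, wait for one job
    # to finish before starting the next sample.
    if [ "$running" -ge "$nSamples" ]; then
      wait -n
      running=$((running - 1))
    fi
    run_sample "$strain" &
    running=$((running + 1))
  done
  wait   # let the remaining jobs finish
}
```

Calling `run_all strainA strainB strainC` (placeholder names) would start two samples at once and launch the third as soon as one of them finishes; calling it a second time would skip all three because they are already in the checkpoint file.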
Edit runner.sh 📃
#!/bin/bash
#####################
### user settings ###
#####################
## S. paradoxus reference assembly
ref2Label="CBS432" ## choose the S.par assembly that you think best fits the origin of your samples
## short labels (used to name files)
ref2="EU" ## choose a short name for S.par
# STEP 1
fastqQC="yes" ## FastQC control (required) ("yes", "no" or "-"; "-" skips the step)
# STEP 2
shortReadMapping="yes" ## ("yes","no")
# STEP 3
mrkgeno="yes" ## ("yes","no")
# STEP 4
cnv="yes" ## ("yes","no")
# STEP 5
intro="yes" ## ("yes","no")
#####################
### settings' end ###
#####################
Run runner.sh 🏃
nohup bash runner.sh &
The results concerning the introgressions are stored in ./int.
Ex.
An Alpechin strain:
Blue-red plots provide an overview of potentially introgressed DNA across the genome. Interpreting the results requires integrating the different data the pipeline produces.
❗ Reminder: blocks are defined as consecutive markers bearing the same genomic info (Homo S.cer, Homo S.par, Het).
How are markers distributed inside the S.par block?
A couple of possible scenarios:
Case 1: abundant markers supporting the block
❗ Note: only a few of the markers in the figure above are represented in the cartoon.
Case 2: not so abundant markers supporting the block
❗ Note: you should not exclude the possibility that a large event is supported by a low number of markers, as in the example.
The number of markers supporting the blocks, the marker density and the info concerning the genotype are stored in int and int/AllSegments.
- FastQC
- minimap2
- samtools
- bcftools
- GEM v. 1.315 (beta). Note: the GEM version used for the analyses is 1.759 (no longer available).
- Control-FREEC v. 11.6. The makeGraph.R script was renamed makeplotcnv.R. A copy of all the scripts in FREEC/scripts/ is in scr; nevertheless, Control-FREEC has to be installed.
- sambamba v. 0.6.5: a copy is provided with the pipeline (no installation required)
- data.table
- ggplot2
- rtracklayer
- R.filesets
- GenomicRanges
- purrr
- dplyr
- R.utils
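A quick way to see which of the dependencies listed above are not yet installed is sketched below. The executable names (`fastqc`, `minimap2`, `samtools`, `bcftools`, `freec`, `Rscript`) are assumptions about a typical installation, GEM is left out because its executables vary between releases, and `missing_tools` is a hypothetical helper.

```shell
#!/bin/bash
# Sketch: report required command-line tools that are not on PATH.
# The executable names passed below are assumptions about a typical install.

missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

echo "Missing tools:"
missing_tools fastqc minimap2 samtools bcftools freec Rscript

# If R is available, report the R packages that cannot be loaded.
if command -v Rscript >/dev/null 2>&1; then
  Rscript -e 'pk <- c("data.table", "ggplot2", "rtracklayer", "R.filesets",
                      "GenomicRanges", "purrr", "dplyr", "R.utils")
              ok <- sapply(pk, requireNamespace, quietly = TRUE)
              cat("Missing R packages:", pk[!ok], "\n")'
fi
```

An empty list after each header means that nothing was found missing.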
Marker definition Methods
Please cite this paper when using intropipeline for your publications.
Ancient and recent origins of shared polymorphisms in yeast
Nicolò Tellini, Matteo De Chiara, Simone Mozzachiodi, Lorenzo Tattini, Chiara Vischioni, Elena S. Naumova, Jonas Warringer, Anders Bergström & Gianni Liti
Nature Ecology & Evolution, 2024, https://doi.org/10.1038/s41559-024-02352-5
@article{tellini2024ancient,
title={Ancient and recent origins of shared polymorphisms in yeast},
author={Tellini, Nicol{\`o} and De Chiara, Matteo and Mozzachiodi, Simone and Tattini, Lorenzo and Vischioni, Chiara and Naumova, Elena S and Warringer, Jonas and Bergstr{\"o}m, Anders and Liti, Gianni},
journal={Nature Ecology \& Evolution},
pages={1--16},
year={2024},
publisher={Nature Publishing Group UK London}
}
- v1.0 released in 2023
- v1.1 released in 2024