Table of Contents
- quota_Anchor
- Installation
- Usage
- Help information
- Example of synteny analysis between maize and sorghum
- Preparation of genome and annotation file
- Generate the longest protein sequence files
- Generate the chromosome length files from fai and gff file
- Generate the table files that will be used as the input file for synteny analysis
- Performing synteny analysis
- Generate the longest coding sequence file
- Calculate synonymous and non-synonymous substitution rates for syntenic pairs
- Homologous pairs and syntenic pairs visualization
- Maize gene/gene pairs classification
- Positioning wgd events relative to species divergent events based on ks
- FAQ
Here are the documents to conduct strand and WGD aware syntenic gene identification for a pair of genomes using the longest path algorithm implemented in AnchorWave. For more information about the algorithm, refer to the document
You can simply install the software via conda(AnchorWave has not yet released a new version v1.2.6, so you may need to manually download the development version of AnchorWave from GitHub and copy the binary file to the bin directory of your conda environment):
conda create -n quota_Anchor bioconda::quota_anchor
quota_Anchor -h
usage: quota_Anchor [-h] [-v] {longest_pep,longest_cds,get_chr_length,pre_col,col,ks,dotplot,circle,line,line_proali,class_gene,kde,kf,trios,correct} ...
Conduct strand and WGD aware syntenic gene identification for a pair of genomes using the longest path algorithm implemented in AnchorWave.
options:
-h, --help show this help message and exit
-v, --version show program's version number and exit
Gene collinearity analysis:
{longest_pep,longest_cds,get_chr_length,pre_col,col,ks,dotplot,circle,line,line_proali,class_gene,kde,kf,trios,correct}
longest_pep Call gffread to generate the protein sequence of the species based on the genome and gff files. The longest transcripts are then
extracted based on the gff file and the protein sequence file.
longest_cds Call gffread to generate the coding sequence of the species based on the genome and gff files. The longest cds are then extracted
based on the gff file and the coding sequence file.
get_chr_length Generate a length file containing chromosome length and total number of genes based on the fai file and gff file.
pre_col Generates the input file for synteny analysis (called a table file or blast file containing gene position information).
col Generate a collinearity file based on the table file.
ks Synonymous/non-synonymous substitution rates for syntenic gene pairs calculated in parallel.
dotplot Generate collinear gene pairs dotplot or homologous gene pairs dotplot.
circle Collinearity result visualization(circos).
line Collinearity result visualization(line style).
class_gene Genes or gene pairs are classified into whole genome duplication, tandem duplication, proximal duplication, transposed duplication,
and dispersed duplication. For gene classification, there is also a single gene category (singleton) which has no homologous gene.
kde Focal species all syntenic pairs ks / block ks median histogram and gaussian kde curve.
kf Ks fitting plot of the focal species whole genome duplication or ks fitting plot including the corrected ks peaks of species
divergence events.
trios Generate trios (consist of focal species, sister species, and outgroup species) and species pair files based on the binary
tree in newick format.
correct The peak ks of species divergence events were fitted and corrected to the evolutionary rate level of the focal species.
Here is an example to identify syntenic genes between maize and sorghum. The maize lineage has undergone a whole genome duplication (WGD) since its divergence with sorghum, but subsequent chromosomal fusions resulted in these species having the same chromosome number (n = 10). AnchorWave can allow up to two collinear paths for each sorghum anchor while one collinear path for each maize anchor.
The current working directory contains genome files in fasta format and genome annotation files in gff format.
wget https://download.maizegdb.org/Zm-B73-REFERENCE-NAM-5.0/Zm-B73-REFERENCE-NAM-5.0.fa.gz
wget https://download.maizegdb.org/Zm-B73-REFERENCE-NAM-5.0/Zm-B73-REFERENCE-NAM-5.0_Zm00001eb.1.gff3.gz
wget https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-57/fasta/sorghum_bicolor/dna/Sorghum_bicolor.Sorghum_bicolor_NCBIv3.dna.toplevel.fa.gz
wget https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-57/gff3/sorghum_bicolor/Sorghum_bicolor.Sorghum_bicolor_NCBIv3.57.gff3.gz
gunzip *gz
For convenience, rename the file as follows:
├── maize.fa
├── maize.gff3
├── sorghum.fa
└── sorghum.gff3
The process, implemented in the quota_Anchor longest_pep
module, consists of two main steps:
- Protein sequences are extracted from genomes and annotation files based on genetic code tables.
- For each gene, the longest protein sequence was identified and extracted to ensure the most complete characterization for further analysis.
quota_Anchor longest_pep -f sorghum.fa,maize.fa -g sorghum.gff3,maize.gff3 -p sb.p.fa,zm.p.fa -l sorghum.protein.fa,maize.protein.fa -t 2 --overwrite -merge merged.pep.fa
The length files are required as input for generating table files, which are subsequently used for synteny analysis and plotting.
quota_Anchor get_chr_length -f sorghum.fa.fai,maize.fa.fai -g sorghum.gff3,maize.gff3 -s 0-9:chr -o sorghum.length.txt,maize.length.txt --overwrite
- Use DIAMOND/BLASTP/BLASTN for sequence alignment.
- Combine the BLAST results and GFF file information into a single table file.
quota_Anchor pre_col -a diamond -rs sorghum.protein.fa -qs maize.protein.fa -db sorghum.database.diamond -mts 20 -e 1e-10 -b sorghum.maize.diamond -rg sorghum.gff3 -qg maize.gff3 -o sb_zm.table -bs 100 -al 0 -rl sorghum.length.txt -ql maize.length.txt --overwrite
-
Generate collinearity result and specify
-r -q
parameter.quota_Anchor col -i sb_zm.table -o sb_zm.collinearity -r 2 -q 1 -s 0 --overwrite
-
Get
all
collinearity result andremove
relative inversion gene pairs.quota_Anchor col -i sb_zm.table -o sb_zm.collinearity -s 1 -a 1 --overwrite
-
Get
all
collinearity result andretain
relative inversion gene pairs.quota_Anchor col -i sb_zm.table -o sb_zm.collinearity -s 0 -a 1 --overwrite
The process, implemented in the quota_Anchor longest_cds
module, consists of two main steps:
- Extract coding sequences from genome files and annotation files.
- Identify and extract the longest cds for each gene.
quota_Anchor longest_cds -f sorghum.fa,maize.fa -g sorghum.gff3,maize.gff3 -p sb.cds.fa,zm.cds.fa -l sorghum.cds.fa,maize.cds.fa -t 2 --overwrite -merge merged.cds.fa
quota_Anchor ks -i sb_zm.collinearity -a muscle -p merged.pep.fa -d merged.cds.fa -o sb_zm.ks -t 16 --overwrite
-
Homologous gene pairs visualization using identity as a legend.
quota_Anchor dotplot -i sb_zm.table -o sb_zm.table.identity.png -r sorghum.length.txt -q maize.length.txt -t order -r_label "Sorghum bicolor" -q_label "Zea mays" -w 1500 -e 1200 -use_identity --overwrite
-
Syntenic pairs visualization using identity as a legend.
quota_Anchor dotplot -i sb_zm.collinearity -o sb_zm.collinearity.identity.png -r sorghum.length.txt -q maize.length.txt -t order -r_label "Sorghum bicolor" -q_label "Zea mays" -w 1500 -e 1200 -use_identity --overwrite
-
Syntenic pairs visualization using ks value as a legend.
quota_Anchor dotplot -i sb_zm.collinearity -o sb_zm.collinearity.ks.png -r sorghum.length.txt -q maize.length.txt -t order -r_label "Sorghum bicolor" -q_label "Zea mays" -w 1500 -e 1200 -ks sb_zm.ks --overwrite
Inter-species
quota_Anchor circle -i sb_zm.collinearity -o sb_zm.circle.png -q maize.length.txt -r sorghum.length.txt -rn "Sorghum bicolor" -qn "Zea mays" -cf 9 -sf 9 -rm chr,Chr -fs 14,14 --overwrite
Intra-species
quota_Anchor circle -i sb_sb.collinearity -o sb_sb.circle.png --overwrite -r ../sorghum.length.txt -q ../sorghum.length.txt -rn "sorghum" -qn "sorghum"
-
Visualization of syntenic pairs of two species
quota_Anchor line -i sb_zm.collinearity -o sb_zm.line.png -l sorghum.length.txt,maize.length.txt -n "Sorghum bicolor,Zea mays" --overwrite
-
Multi-species syntenic pairs visualization
quota_Anchor line -i os_sb.collinearity,sb_sv.collinearity -o os_sb_sv.line.png -l os.length.txt,sb.length.txt,sv.length.txt -n "Oryza sativa, Sorghum bicolor,Zea mays" -rm "chr,Chr" -cf 7 -sf 10 -fs 14,14 --overwrite
This pipeline refers to DupGen_finder, with some modifications to suit our specific requirements. In short, the partitioning conditions in non-unique mode are more relaxed, whereas the conditions in unique mode are more stringent.
-
Synteny Analysis of intra-Maize
quota_Anchor pre_col -a diamond -rs maize.protein.fa -qs maize.protein.fa -db maize.database.diamond -mts 20 -e 1e-10 -b maize.maize.diamond -rg maize.gff3 -qg maize.gff3 -o zm_zm.table -bs 100 -al 0 -rl maize.length.txt -ql maize.length.txt --overwrite quota_Anchor col -i zm_zm.table -o zm_zm.collinearity -s 0 -m 500 -W 5 -E -0.005 -D 25 -a 1 --overwrite
-
Download Musa balbisiana data and rename the filename
wget https://api.ncbi.nlm.nih.gov/datasets/v2/genome/accession/GCA_004837865.1/download\?include_annotation_type\=GENOME_FASTA\&include_annotation_type\=GENOME_GFF
├── banana.B.fa └── banana.B.gff
-
Synteny Analysis between Banana.B and Maize
quota_Anchor longest_pep -f banana.B.fa -g banana.B.gff -p B.p.pep -l banana.B.pep -t 1 --overwrite quota_Anchor get_chr_length -f banana.B.fa.fai -g banana.B.gff -s CM01 -o banana.B.length.txt --overwrite quota_Anchor pre_col -a diamond -rs banana.B.pep -qs maize.protein.fa -db banana.B.database.diamond -mts 20 -e 1e-10 -b banana.B.maize.diamond -rg banana.B.gff -qg maize.gff3 -o bananaB_zm.table -bs 100 -al 0 -rl banana.B.length.txt -ql maize.length.txt --overwrite quota_Anchor col -i bananaB_zm.table -o bananaB_zm.collinearity -s 0 --overwrite -D 25 -a 1
-
Classifying maize genes/gene pairs Unique mode
quota_Anchor class_gene -b maize.maize.diamond -g maize.gff3 -q zm_zm.collinearity -qr bananaB_zm.collinearity -o maize_classify_dir -p maize -s 1 -d 10 --overwrite -u
Non-unique mode
quota_Anchor class_gene -b maize.maize.diamond -g maize.gff3 -q zm_zm.collinearity -qr bananaB_zm.collinearity -o maize_classify_dir -p maize -s 1 -d 10 --overwrite
This pipeline refers to ksrates, with some differences between two methods. In short, this pipeline uses the collinear gene pair ks value fitting results obtained based on the -r_value -q_value
parameters as the species divergent peak, while ksrates uses the RBH gene pair ks value fitting results as the species divergent peak. And there are also some slight differences in the fitting methods.
The following is the current directory tree and oryza sativa and setaria viridis data soucre.
wget https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-59/plants/fasta/oryza_sativa/dna/Oryza_sativa.IRGSP-1.0.dna_rm.toplevel.fa.gz
wget https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-59/plants/gff3/oryza_sativa/Oryza_sativa.IRGSP-1.0.59.gff3.gz
wget https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-59/plants/fasta/setaria_viridis/dna/Setaria_viridis.Setaria_viridis_v2.0.dna.toplevel.fa.gz
wget https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-59/plants/gff3/setaria_viridis/Setaria_viridis.Setaria_viridis_v2.0.59.gff3.gz
├── raw_data
│ ├── maize.fa
│ ├── maize.gff3
│ ├── oryza.fa
│ ├── oryza.gff3
│ ├── setaria.fa
│ ├── setaria.gff3
│ ├── sorghum.fa
│ └── sorghum.gff3
└── scripts
├── ks_pipeline.py
└── longest_pipeline.py
-
Generate longest protein and longest cds for each species in the input directory.
python ./scripts/longest_pipeline.py -i raw_data -o output_dir --overwrite
-
Get species chromosome length file. You may need to run
quota_Anchor get_chr_length
to understand the meaning of the-s
parameter.a)
find ./raw_data/*fai |awk '{printf "%s,", $1}' find ./raw_data/*gff3 |awk '{printf "%s,", $1}' find ./raw_data/*gff3 |awk '{printf "%s,", $1}'|sed s/gff3/length\.txt/g
quota_Anchor get_chr_length -f ./raw_data/maize.fa.fai,./raw_data/oryza.fa.fai,./raw_data/setaria.fa.fai,./raw_data/sorghum.fa.fai -g ./raw_data/maize.gff3,./raw_data/oryza.gff3,./raw_data/setaria.gff3,./raw_data/sorghum.gff3 -s 0-9,CHR,chr,Chr:0-9,CHR,chr,Chr:0-9,CHR,chr,Chr:0-9,CHR,chr,Chr -o ./raw_data/maize.length.txt,./raw_data/oryza.length.txt,./raw_data/setaria.length.txt,./raw_data/sorghum.length.txt --overwrite
b)
quota_Anchor get_chr_length -f "$(find ./raw_data/*fai |awk '{printf "%s,", $1}')" -g "$(find ./raw_data/*gff3 |awk '{printf "%s,", $1}')" -s 0-9,CHR,chr,Chr:0-9,CHR,chr,Chr:0-9,CHR,chr,Chr:0-9,CHR,chr,Chr -o "$(find ./raw_data/*gff3 |awk '{printf "%s,", $1}'|sed s/gff3/length\.txt/g)" --overwrite
-
Generate trios files and species pair files based on the binary tree in newick format.
quota_Anchor trios -n "(((maize, sorghum), setaria), oryza);" -k "maize" -ot ortholog_trios_maize.csv -op species_pairs.csv -t tree.txt --overwrite
Species_1 Species_2 q_value r_value get_all_collinear_pairs maize setaria 1 1 0 maize setaria 1 1 0 sorghum setaria 1 1 0 maize oryza 1 1 0 sorghum oryza 1 1 0 setaria oryza 1 1 0 -
Get synteny file and ks file for each species pair. Note:
- The
./scripts/ks_pipeline.py
script uses theSpecies_1
column ofspecies_pairs.csv
as the query and theSpecies_2
column ofspecies_pairs.csv
as the reference in the collinearity procedure. - The
./scripts/ks_pipeline.py
script adjusts the parameters of the collinearity procedure based on ther_value
q_value
andget_all_collinear_pairs
columns of the species pairs file. - You may need to understand the meaning of the
r_value
,q_value
andget_all_collinear_pairs
parameter viaquota_Anchor col
command or refer to document.
python ./scripts/ks_pipeline.py -i raw_data -o output_dir -s species_pairs.csv -a diamond -l raw_data --overwrite -plot_table
- The
-
Ks fitting and correction for each species divergent peak. Note:
- The
0
infind ./output_dir/02synteny/*0.ks |awk '{printf "%s,", $1}'
represents the value of theget_all_collinear_pairs
column of the species pair file. - The order of species pairs in the species pair file(specify by -s parameter, species_pairs.csv) must be consistent with the order of the ks file(specify by -k parameter)
find ./output_dir/02synteny/*0.ks |awk '{printf "%s,", $1}'
quota_Anchor correct -k "./output_dir/02synteny/maize_sorghum0.ks,./output_dir/02synteny/maize_setaria0.ks,./output_dir/02synteny/sorghum_setaria0.ks,./output_dir/02synteny/maize_oryza0.ks,./output_dir/02synteny/sorghum_oryza0.ks,./output_dir/02synteny/setaria_oryza0.ks" -s species_pairs.csv -t ortholog_trios_maize.csv -kr 0,2 -ot outfile_divergent_peaks.csv --overwrite
- The
-
Maize wgd ks peaks fitting
quota_Anchor pre_col -a diamond -rs ./output_dir/01longest/maize.longest.pep -qs ./output_dir/01longest/maize.longest.pep -db ./maize/maize.database.diamond -mts 20 -e 1e-10 -b ./maize/maize.maize.diamond -rg ./raw_data/maize.gff3 -qg ./raw_data/maize.gff3 -o ./maize/zm_zm.table -bs 100 -al 0 --overwrite quota_Anchor dotplot -i ./maize/zm_zm.table -o ./maize/zm.zm.png -r ./raw_data/maize.length.txt -q ./raw_data/maize.length.txt -r_label maize -q_label maize -use_identity --overwrite quota_Anchor col -i ./maize/zm_zm.table -o ./maize/zm_zm.collinearity -a 1 -m 1000 -W 1 -D 25 -I 5 -E -0.01 -f 0 --overwrite quota_Anchor dotplot -i ./maize/zm_zm.collinearity -o ./maize/zm.zm.collinearity.png -r ./raw_data/maize.length.txt -q ./raw_data/maize.length.txt -r_label maize -q_label maize -use_identity --overwrite quota_Anchor ks -i ./maize/zm_zm.collinearity -a mafft -p ./output_dir/01longest/maize.longest.pep -d ./output_dir/01longest/maize.longest.cds -o ./maize/zm.zm.ks -t 16 --overwrite quota_Anchor dotplot -i ./maize/zm_zm.collinearity -o ./maize/zm.zm.collinearity.ks.png -r ./raw_data/maize.length.txt -q ./raw_data/maize.length.txt -r_label maize -q_label maize --overwrite -ks ./maize/zm.zm.ks
You need to provide the number of components according to
zm.zm.table.png
or other methodsquota_Anchor kde -i ./maize/zm_zm.collinearity -r./raw_data/maize.length.txt -q ./raw_data/maize.length.txt -o ./maize/zm.zm.kde.png -k ./maize/zm.zm.ks --overwrite
quota_Anchor kf -i ./maize/zm_zm.collinearity -r./raw_data/maize.length.txt -q ./raw_data/maize.length.txt -o ./maize/zm.zm.kf.png -k ./maize/zm.zm.ks -f maize -components 2 -kr 0,2 --overwrite
-
The Gaussian mixture model was used to group wgd gene pairs ks, and kernel density estimation and Gaussian approximation fitting were performed on each component. The initial species divergent peak was obtained by taking the mode of 382 kernel density estimates, and the rate was corrected to the focal species level based on trios.
quota_Anchor kf -i ./maize/zm_zm.collinearity -r ./raw_data/maize.length.txt -q ./raw_data/maize.length.txt -o ./maize/zm.zm.png -k ./maize/zm.zm.ks -components 2 -f maize -correct_file outfile_divergent_peaks.csv -kr 0,2 --overwrite
The FAQ document is available at FAQ.md.