Skip to content

Strand and WGD aware syntenic gene identification

License

Notifications You must be signed in to change notification settings

xiaodli/quota_Anchor

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

quota_Anchor  install with biocondaLicense: MIT


Table of Contents

Here are the documents to conduct strand and WGD aware syntenic gene identification for a pair of genomes using the longest path algorithm implemented in AnchorWave. For more information about the algorithm, refer to the document

Installation

You can simply install the software via conda(AnchorWave has not yet released a new version v1.2.6, so you may need to manually download the development version of AnchorWave from GitHub and copy the binary file to the bin directory of your conda environment):

conda create -n quota_Anchor bioconda::quota_anchor

Usage

Help information

quota_Anchor -h
usage: quota_Anchor [-h] [-v] {longest_pep,longest_cds,get_chr_length,pre_col,col,ks,dotplot,circle,line,line_proali,class_gene,kde,kf,trios,correct} ...

Conduct strand and WGD aware syntenic gene identification for a pair of genomes using the longest path algorithm implemented in AnchorWave.

options:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit

Gene collinearity analysis:
  {longest_pep,longest_cds,get_chr_length,pre_col,col,ks,dotplot,circle,line,line_proali,class_gene,kde,kf,trios,correct}
    longest_pep         Call gffread to generate the protein sequence of the species based on the genome and gff files. The longest transcripts are then
                        extracted based on the gff file and the protein sequence file.
    longest_cds         Call gffread to generate the coding sequence of the species based on the genome and gff files. The longest cds are then extracted
                        based on the gff file and the coding sequence file.
    get_chr_length      Generate a length file containing chromosome length and total number of genes based on the fai file and gff file.
    pre_col             Generates the input file for synteny analysis (called a table file or blast file containing gene position information).
    col                 Generate a collinearity file based on the table file.
    ks                  Synonymous/non-synonymous substitution rates for syntenic gene pairs calculated in parallel.
    dotplot             Generate collinear gene pairs dotplot or homologous gene pairs dotplot.
    circle              Collinearity result visualization(circos).
    line                Collinearity result visualization(line style).
    class_gene          Genes or gene pairs are classified into whole genome duplication, tandem duplication, proximal duplication, transposed duplication,
                        and dispersed duplication. For gene classification, there is also a single gene category (singleton) which has no homologous gene.
    kde                 Focal species all syntenic pairs ks / block ks median histogram and gaussian kde curve.
    kf                  Ks fitting plot of the focal species whole genome duplication or ks fitting plot including the corrected ks peaks of species
                        divergence events.
    trios               Generate trios (consist of focal species, sister species, and outgroup species) and species pair files based on the binary
                        tree in newick format.
    correct             The peak ks of species divergence events were fitted and corrected to the evolutionary rate level of the focal species.

Example of synteny analysis between maize and sorghum

Here is an example to identify syntenic genes between maize and sorghum. The maize lineage has undergone a whole genome duplication (WGD) since its divergence with sorghum, but subsequent chromosomal fusions resulted in these species having the same chromosome number (n = 10). AnchorWave can allow up to two collinear paths for each sorghum anchor while one collinear path for each maize anchor.

Preparation of genome and annotation file

The current working directory contains genome files in fasta format and genome annotation files in gff format.

wget https://download.maizegdb.org/Zm-B73-REFERENCE-NAM-5.0/Zm-B73-REFERENCE-NAM-5.0.fa.gz
wget https://download.maizegdb.org/Zm-B73-REFERENCE-NAM-5.0/Zm-B73-REFERENCE-NAM-5.0_Zm00001eb.1.gff3.gz
wget https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-57/fasta/sorghum_bicolor/dna/Sorghum_bicolor.Sorghum_bicolor_NCBIv3.dna.toplevel.fa.gz
wget https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-57/gff3/sorghum_bicolor/Sorghum_bicolor.Sorghum_bicolor_NCBIv3.57.gff3.gz
gunzip *gz

For convenience, rename the file as follows:

├── maize.fa
├── maize.gff3
├── sorghum.fa
└── sorghum.gff3

Generate the longest protein sequence files

The process, implemented in the quota_Anchor longest_pep module, consists of two main steps:

  1. Protein sequences are extracted from genomes and annotation files based on genetic code tables.
  2. For each gene, the longest protein sequence was identified and extracted to ensure the most complete characterization for further analysis.
quota_Anchor longest_pep -f sorghum.fa,maize.fa -g sorghum.gff3,maize.gff3 -p sb.p.fa,zm.p.fa -l sorghum.protein.fa,maize.protein.fa -t 2 --overwrite -merge merged.pep.fa

Generate the chromosome length files from fai and gff file

The length files are required as input for generating table files, which are subsequently used for synteny analysis and plotting.

quota_Anchor get_chr_length -f sorghum.fa.fai,maize.fa.fai -g sorghum.gff3,maize.gff3 -s 0-9:chr -o sorghum.length.txt,maize.length.txt --overwrite

Generate the table files that will be used as the input file for synteny analysis

  1. Use DIAMOND/BLASTP/BLASTN for sequence alignment.
  2. Combine the BLAST results and GFF file information into a single table file.
quota_Anchor pre_col -a diamond -rs sorghum.protein.fa -qs maize.protein.fa -db sorghum.database.diamond -mts 20 -e 1e-10 -b sorghum.maize.diamond -rg sorghum.gff3 -qg maize.gff3 -o sb_zm.table -bs 100 -al 0 -rl sorghum.length.txt -ql maize.length.txt --overwrite

Performing synteny analysis

  1. Generate collinearity result and specify -r -q parameter.

    quota_Anchor col -i sb_zm.table -o sb_zm.collinearity -r 2 -q 1 -s 0 --overwrite 
  2. Get all collinearity result and remove relative inversion gene pairs.

    quota_Anchor col -i sb_zm.table -o sb_zm.collinearity -s 1 -a 1 --overwrite
  3. Get all collinearity result and retain relative inversion gene pairs.

    quota_Anchor col -i sb_zm.table -o sb_zm.collinearity -s 0 -a 1 --overwrite

Generate the longest coding sequence file

The process, implemented in the quota_Anchor longest_cds module, consists of two main steps:

  1. Extract coding sequences from genome files and annotation files.
  2. Identify and extract the longest cds for each gene.
quota_Anchor longest_cds -f sorghum.fa,maize.fa -g sorghum.gff3,maize.gff3 -p sb.cds.fa,zm.cds.fa -l sorghum.cds.fa,maize.cds.fa -t 2 --overwrite -merge merged.cds.fa

Calculate synonymous and non-synonymous substitution rates for syntenic pairs

quota_Anchor ks -i sb_zm.collinearity -a muscle -p merged.pep.fa -d merged.cds.fa  -o sb_zm.ks -t 16 --overwrite 

Homologous pairs and syntenic pairs visualization

Dotplot visualiztion

  1. Homologous gene pairs visualization using identity as a legend.

    quota_Anchor dotplot -i sb_zm.table  -o sb_zm.table.identity.png -r sorghum.length.txt -q maize.length.txt -t order -r_label "Sorghum bicolor" -q_label "Zea mays" -w 1500 -e 1200 -use_identity --overwrite 

    sb_zm.identity.table

  2. Syntenic pairs visualization using identity as a legend.

    quota_Anchor dotplot -i sb_zm.collinearity  -o sb_zm.collinearity.identity.png -r sorghum.length.txt -q maize.length.txt -t order -r_label "Sorghum bicolor" -q_label "Zea mays" -w 1500 -e 1200 -use_identity --overwrite

    sb_zm.identity.collinearity

  3. Syntenic pairs visualization using ks value as a legend.

    quota_Anchor dotplot -i sb_zm.collinearity  -o sb_zm.collinearity.ks.png -r sorghum.length.txt -q maize.length.txt -t order -r_label "Sorghum bicolor" -q_label "Zea mays" -w 1500 -e 1200 -ks sb_zm.ks --overwrite

    sb_zm.ks.collinearity.

Circos visualiztion

Inter-species

quota_Anchor circle -i sb_zm.collinearity -o sb_zm.circle.png -q maize.length.txt -r sorghum.length.txt -rn "Sorghum bicolor" -qn "Zea mays" -cf 9 -sf 9 -rm chr,Chr -fs 14,14 --overwrite

sb_zm.circle.collinearity

Intra-species

quota_Anchor circle -i sb_sb.collinearity -o sb_sb.circle.png --overwrite -r ../sorghum.length.txt -q ../sorghum.length.txt -rn "sorghum" -qn "sorghum" 

sb_sb.circle.collinearity

Chromosome line style visualization

  1. Visualization of syntenic pairs of two species

    quota_Anchor line -i sb_zm.collinearity -o sb_zm.line.png -l sorghum.length.txt,maize.length.txt -n "Sorghum bicolor,Zea mays" --overwrite  

    sb_zm.line.collinearity

  2. Multi-species syntenic pairs visualization

    quota_Anchor line -i os_sb.collinearity,sb_sv.collinearity -o os_sb_sv.line.png -l os.length.txt,sb.length.txt,sv.length.txt -n "Oryza sativa, Sorghum bicolor,Zea mays" -rm "chr,Chr" -cf 7 -sf 10 -fs 14,14 --overwrite

    os_sb_sv.line.collinearity

Maize gene/gene pairs classification

This pipeline refers to DupGen_finder, with some modifications to suit our specific requirements. In short, the partitioning conditions in non-unique mode are more relaxed, whereas the conditions in unique mode are more stringent.

  1. Synteny Analysis of intra-Maize

    quota_Anchor pre_col -a diamond -rs maize.protein.fa -qs maize.protein.fa -db maize.database.diamond -mts 20 -e 1e-10 -b maize.maize.diamond -rg maize.gff3 -qg maize.gff3 -o zm_zm.table -bs 100 -al 0 -rl maize.length.txt -ql maize.length.txt --overwrite
    quota_Anchor col -i zm_zm.table -o zm_zm.collinearity -s 0 -m 500 -W 5 -E -0.005 -D 25 -a 1 --overwrite
  2. Download Musa balbisiana data and rename the filename

    wget https://api.ncbi.nlm.nih.gov/datasets/v2/genome/accession/GCA_004837865.1/download\?include_annotation_type\=GENOME_FASTA\&include_annotation_type\=GENOME_GFF
    ├── banana.B.fa
    └── banana.B.gff
    
  3. Synteny Analysis between Banana.B and Maize

    quota_Anchor longest_pep -f banana.B.fa -g banana.B.gff -p B.p.pep -l banana.B.pep -t 1 --overwrite
    quota_Anchor get_chr_length -f banana.B.fa.fai -g banana.B.gff -s CM01 -o banana.B.length.txt --overwrite
    quota_Anchor pre_col -a diamond -rs banana.B.pep -qs maize.protein.fa -db banana.B.database.diamond -mts 20 -e 1e-10 -b banana.B.maize.diamond -rg banana.B.gff -qg maize.gff3 -o bananaB_zm.table -bs 100 -al 0 -rl banana.B.length.txt -ql maize.length.txt --overwrite
    quota_Anchor col -i bananaB_zm.table -o bananaB_zm.collinearity -s 0 --overwrite -D 25 -a 1 
  4. Classifying maize genes/gene pairs Unique mode

    quota_Anchor class_gene -b maize.maize.diamond -g maize.gff3 -q zm_zm.collinearity -qr bananaB_zm.collinearity -o maize_classify_dir -p maize -s 1 -d 10 --overwrite -u

    maize-unique.stats.gene.png

    maize-unique.stats.pair.png

    Non-unique mode

    quota_Anchor class_gene -b maize.maize.diamond -g maize.gff3 -q zm_zm.collinearity -qr bananaB_zm.collinearity -o maize_classify_dir -p maize -s 1 -d 10 --overwrite

    maize.stats.gene.png

    maize.stats.pair.png

Positioning wgd events relative to species divergent events based on ks

This pipeline refers to ksrates, with some differences between two methods. In short, this pipeline uses the collinear gene pair ks value fitting results obtained based on the -r_value -q_value parameters as the species divergent peak, while ksrates uses the RBH gene pair ks value fitting results as the species divergent peak. And there are also some slight differences in the fitting methods. The following is the current directory tree and oryza sativa and setaria viridis data soucre.

wget https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-59/plants/fasta/oryza_sativa/dna/Oryza_sativa.IRGSP-1.0.dna_rm.toplevel.fa.gz
wget https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-59/plants/gff3/oryza_sativa/Oryza_sativa.IRGSP-1.0.59.gff3.gz
wget https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-59/plants/fasta/setaria_viridis/dna/Setaria_viridis.Setaria_viridis_v2.0.dna.toplevel.fa.gz
wget https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-59/plants/gff3/setaria_viridis/Setaria_viridis.Setaria_viridis_v2.0.59.gff3.gz
├── raw_data
│   ├── maize.fa
│   ├── maize.gff3
│   ├── oryza.fa
│   ├── oryza.gff3
│   ├── setaria.fa
│   ├── setaria.gff3
│   ├── sorghum.fa
│   └── sorghum.gff3
└── scripts
    ├── ks_pipeline.py
    └── longest_pipeline.py
  1. Generate longest protein and longest cds for each species in the input directory.

    python ./scripts/longest_pipeline.py -i raw_data -o output_dir --overwrite
  2. Get species chromosome length file. You may need to run quota_Anchor get_chr_length to understand the meaning of the -s parameter.

    a)

    find ./raw_data/*fai |awk '{printf "%s,", $1}'
    find ./raw_data/*gff3 |awk '{printf "%s,", $1}'
    find ./raw_data/*gff3 |awk '{printf "%s,", $1}'|sed s/gff3/length\.txt/g   
    quota_Anchor get_chr_length -f ./raw_data/maize.fa.fai,./raw_data/oryza.fa.fai,./raw_data/setaria.fa.fai,./raw_data/sorghum.fa.fai -g ./raw_data/maize.gff3,./raw_data/oryza.gff3,./raw_data/setaria.gff3,./raw_data/sorghum.gff3 -s 0-9,CHR,chr,Chr:0-9,CHR,chr,Chr:0-9,CHR,chr,Chr:0-9,CHR,chr,Chr -o ./raw_data/maize.length.txt,./raw_data/oryza.length.txt,./raw_data/setaria.length.txt,./raw_data/sorghum.length.txt --overwrite

    b)

    quota_Anchor get_chr_length -f "$(find ./raw_data/*fai |awk '{printf "%s,", $1}')" -g "$(find ./raw_data/*gff3 |awk '{printf "%s,", $1}')" -s 0-9,CHR,chr,Chr:0-9,CHR,chr,Chr:0-9,CHR,chr,Chr:0-9,CHR,chr,Chr -o "$(find ./raw_data/*gff3 |awk '{printf "%s,", $1}'|sed s/gff3/length\.txt/g)" --overwrite
  3. Generate trios files and species pair files based on the binary tree in newick format.

    quota_Anchor trios -n "(((maize, sorghum), setaria), oryza);" -k "maize" -ot ortholog_trios_maize.csv -op species_pairs.csv -t tree.txt --overwrite
    Species_1 Species_2 q_value r_value get_all_collinear_pairs
    maize setaria 1 1 0
    maize setaria 1 1 0
    sorghum setaria 1 1 0
    maize oryza 1 1 0
    sorghum oryza 1 1 0
    setaria oryza 1 1 0
  4. Get synteny file and ks file for each species pair. Note:

    1. The ./scripts/ks_pipeline.py script uses the Species_1 column of species_pairs.csv as the query and the Species_2 column of species_pairs.csv as the reference in the collinearity procedure.
    2. The ./scripts/ks_pipeline.py script adjusts the parameters of the collinearity procedure based on the r_value q_value and get_all_collinear_pairs columns of the species pairs file.
    3. You may need to understand the meaning of the r_value, q_value and get_all_collinear_pairs parameter via quota_Anchor col command or refer to document.
    python ./scripts/ks_pipeline.py -i raw_data -o output_dir -s species_pairs.csv -a diamond -l raw_data --overwrite -plot_table       
  5. Ks fitting and correction for each species divergent peak. Note:

    1. The 0 in find ./output_dir/02synteny/*0.ks |awk '{printf "%s,", $1}' represents the value of the get_all_collinear_pairs column of the species pair file.
    2. The order of species pairs in the species pair file(specify by -s parameter, species_pairs.csv) must be consistent with the order of the ks file(specify by -k parameter)
    find ./output_dir/02synteny/*0.ks |awk '{printf "%s,", $1}'
     quota_Anchor correct -k "./output_dir/02synteny/maize_sorghum0.ks,./output_dir/02synteny/maize_setaria0.ks,./output_dir/02synteny/sorghum_setaria0.ks,./output_dir/02synteny/maize_oryza0.ks,./output_dir/02synteny/sorghum_oryza0.ks,./output_dir/02synteny/setaria_oryza0.ks" -s species_pairs.csv -t ortholog_trios_maize.csv -kr 0,2 -ot outfile_divergent_peaks.csv --overwrite
  6. Maize wgd ks peaks fitting

    quota_Anchor pre_col -a diamond -rs ./output_dir/01longest/maize.longest.pep -qs ./output_dir/01longest/maize.longest.pep -db ./maize/maize.database.diamond -mts 20 -e 1e-10 -b ./maize/maize.maize.diamond -rg ./raw_data/maize.gff3 -qg ./raw_data/maize.gff3 -o ./maize/zm_zm.table -bs 100 -al 0 --overwrite
    
    quota_Anchor dotplot -i ./maize/zm_zm.table -o ./maize/zm.zm.png -r ./raw_data/maize.length.txt -q ./raw_data/maize.length.txt -r_label maize -q_label maize -use_identity --overwrite
    
    quota_Anchor col -i ./maize/zm_zm.table -o ./maize/zm_zm.collinearity -a 1 -m 1000 -W 1 -D 25 -I 5 -E -0.01 -f 0 --overwrite
    
    quota_Anchor dotplot -i ./maize/zm_zm.collinearity -o ./maize/zm.zm.collinearity.png -r ./raw_data/maize.length.txt -q ./raw_data/maize.length.txt -r_label maize -q_label maize -use_identity --overwrite
    
    quota_Anchor ks -i ./maize/zm_zm.collinearity -a mafft -p ./output_dir/01longest/maize.longest.pep -d ./output_dir/01longest/maize.longest.cds -o ./maize/zm.zm.ks -t 16  --overwrite  
    
    quota_Anchor dotplot -i ./maize/zm_zm.collinearity -o ./maize/zm.zm.collinearity.ks.png -r ./raw_data/maize.length.txt -q ./raw_data/maize.length.txt -r_label maize -q_label maize --overwrite -ks ./maize/zm.zm.ks

    You need to provide the number of components according to zm.zm.table.png or other methods

    quota_Anchor kde -i ./maize/zm_zm.collinearity -r./raw_data/maize.length.txt -q ./raw_data/maize.length.txt -o ./maize/zm.zm.kde.png -k ./maize/zm.zm.ks --overwrite

    zm.zm.kde.png

    quota_Anchor kf -i ./maize/zm_zm.collinearity -r./raw_data/maize.length.txt -q ./raw_data/maize.length.txt -o ./maize/zm.zm.kf.png -k ./maize/zm.zm.ks -f maize -components 2 -kr 0,2 --overwrite

    zm.zm.kf.png

  7. The Gaussian mixture model was used to group wgd gene pairs ks, and kernel density estimation and Gaussian approximation fitting were performed on each component. The initial species divergent peak was obtained by taking the mode of 382 kernel density estimates, and the rate was corrected to the focal species level based on trios.

    quota_Anchor kf -i ./maize/zm_zm.collinearity -r ./raw_data/maize.length.txt -q ./raw_data/maize.length.txt -o ./maize/zm.zm.png -k ./maize/zm.zm.ks -components 2 -f maize -correct_file outfile_divergent_peaks.csv -kr 0,2 --overwrite

    zm.zm.kf.mix.png

FAQ

The FAQ document is available at FAQ.md.

About

Strand and WGD aware syntenic gene identification

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%