Skip to content

Latest commit

 

History

History
78 lines (72 loc) · 6.19 KB

README.md

File metadata and controls

78 lines (72 loc) · 6.19 KB

pfam-genomes

A compact pipeline that allows for searching selected domains (based on Pfam HMM models) in proteins encoded in genomic sequences.

1. Conda environment

To set up the environment properly you need to have Miniconda installed on linux-64 platform. The pipeline was tested on Ubuntu 22.04 using conda 23.7.4. Once you are ready with that, you need to create a conda environment pfam-gen described in envs/pfam-gen.txt:

conda create --name pfam-gen --file envs/pfam-gen.txt

Next, activate the environment:

conda activate pfam-gen

2. Test data

Using test_data.sh script, you may download 20 staphylococcal genomic sequences from NCBI GenBank to genomes directory, as well as Pfam-A.hmm file to test the pipeline on a set of 10 CAT (catalytic) and 5 CWT (cell-wall-targeting) domains (input/domains.tsv) that occurr in bacterial peptidoglycan hydrolases. Simply run the script in the pipeline directory:

./test_data.sh

3. Directory structure and pipeline files

The pipeline utilises the following directory structure:

your_pipeline_location/
├── Snakefile
├── config.yaml
├── input
│   ├── colors.tsv
│   ├── domains.tsv
│   ├── IUPACDNA.txt
│   └── TABLE11.txt
├── src
│   ├── build.sh
│   └── extractorfs.cpp
├── templates
│   ├── style.css
│   └── tmpl.html
├── scripts
│   ├── annot.py
│   ├── archchart.py
│   ├── extractfasta.py
│   ├── extractorfs
│   ├── extractpfm.sh
│   ├── filter.py
│   ├── lib
│   │   └── fasta.py
│   ├── preprocess.py
│   └── unique.py
├── log
└── output

In the working directory you can find the Snakefile describing the pipeline and config.yaml, in which paths to Pfam-A.hmm and the directory with genomic sequences as well as domain architectures of interes are provided. In input/ there are files describing nucleotide sequence alphabet and translation table to be used (IUPACDNA.txt, TABLE11.txt) as well as groups of domains and domains being sarched for next to colors these groups and domains are to be painted with in a visualisation of all possible architectures that are found in searched genomes (domains.tsv, colors.tsv). In src/ a source file extractorfs.cpp for scripts/extractorfs is placed. If build.sh is run from that directory, the application will be recompiled and saved as scripts/extractorfs. In templates/ CSS and HTML templates are located. These are used to generate a HTML visualisation of all possible architectures that are found in searched genomes, which next is converted to a PNG file. Necessary scripts and one compiled application are located in scripts/. Directories output/ and log/ will be created automatically once the pipeline is run. All diagnostic and error messages from tools and scripts used by the pipeline will be redirected to files in the log/ directory.

4. Pipeline architecture

The pipline described in the Snakefile encompasses the following stages:

  1. extractorfs -- using scripts/extractorfs, provided DNA alphabet (input/IUPACDNA.txt) and translation table (input/TABLE11.txt), from each genome extract all posisble open reading frames (ORFs) of lenght >= 300 nt.
  2. uniquetrans -- using scripts/unique.py, cluster quickly extracted protein sequences based on their 100% identity to prepare a non-redundant set for HMM searches.
  3. hmmfetch -- using hmmfetch tool, fetch domains, from Pfam-A.hmm file, that are listed in input/domains.tsv and their Pfam accession version numbers are provided in pfam_acc column of that file.
  4. hmmsearch -- using hmmsearch tool, search for retrived domains in the non-redundant protein sequence set.
  5. preprocess -- using scripts/preprocess.py, preprocess the raw hmmsearch results and save relevant columns.
  6. archchart -- using scripts/archchart.py, branch to generate charts in HTML format that describe all domain architectures found, also in regard to groups of domains (column group in input/domains.tsv).
  7. convertchart -- using wkhtmltoimage tool, continue the branch to convert charts in HTML format to PNG.
  8. filter -- using scripts/filter.py, continue the main branch and from preprocessed hits select those of independent E-value (i-Evalue) <= 0.001 and domain coverage >= 80% (0.8). Next from target protein sequences select those with domain architecure of interest (config.yaml).
  9. extracttrans -- using scripts/extractfasta, extract finally selected sequences from the non-redundant set.
  10. signalp -- using independently installed SignalP 5.0 tool available as signalp, detect N-terminal signal sequences in the final set of protein sequences for each domain architecture of interest. If the tool is not available, the step generates an empty result file and is skipped.
  11. annotdom -- using scripts/annot.py, prepare a GFF3 file with annotations for the finally selected protein sequences. Annotations are prepared based on hmmsearch filtered results and signalp results.

More detailed description on how the pipeline works you will find in comments in the Snakefile, config.yaml file and the script/source files.

5. Running the pipeline

Providing that you have pfam-gen environment properly set up and activated as well as at least the test data dowloaded, you can first look up the list of tasks to be done by running the following command from the pipeline directory, where Snakefile is located:

snakemake --dryrun --quiet

and then run the pipeline using as many cores as you wish:

snakemake --cores number_of_cores

Final data, non-redundant protein sequences and corresponding annotations for each domain architecture of interest, as well as HTML and PNG visualisatons of all domain architectures found, you will find in output/final directory.