Change name to SCARAP and update version to 0.3.1
SWittouck committed Dec 22, 2020
1 parent ad74c8e commit 7f85162
Showing 13 changed files with 38 additions and 38 deletions.
52 changes: 26 additions & 26 deletions README.md
@@ -1,6 +1,6 @@
# Progenomics: toolkit for prokaryotic comparative genomics
# SCARAP: toolkit for comparative genomics of prokaryotes

Progenomics is a toolkit for fast and scalable comparative genomics. It has been designed for prokaryotes but should work for eukaryotic genomes as well. Progenomics can handle large genome datasets on a range of taxonomic levels; it has been tested on datasets up until the order level. Its most useful features are fast pangenome inference, sensitive search of query sequences in a pangenome, rapid core genome inference using a heuristic strategy and the construction of concatenated core gene alignments ("supermatrices").
SCARAP is short for scalable and rapid pangenomes. Pangenome inference is SCARAP's main feature, but it also contains a number of other useful tools for comparative genomics of prokaryotes, such as: pangenome profile database construction and searching, rapid core genome inference, calculation of ANI/AAI-like metrics, genome clustering and dereplication and the construction of concatenated core gene alignments ("supermatrices"). SCARAP has been designed for prokaryotes but should work for eukaryotic genomes as well. It can handle large genome datasets on a range of taxonomic levels; it has been tested on datasets with prokaryotic genomes from the species to the order level.

## Dependencies

@@ -12,21 +12,21 @@ Essential dependencies:
* [ete3](http://etetoolkit.org/) version >= 3.1.1
* [scipy](https://www.scipy.org/) version >= 1.4.1
* [MAFFT](https://mafft.cbrc.jp/alignment/software/) version >= 7.407
* [MMseqs2](https://github.com/soedinglab/MMseqs2) version >= 12-113e3
* [MMseqs2](https://github.com/soedinglab/MMseqs2) release 11 or 12

Dependencies for the search module and core pipeline:
Dependencies for the build, search and core-pipeline modules:

* [HMMER](http://hmmer.org/) version >= 3.1b2

Dependencies when using [OrthoFinder](https://github.com/davidemms/OrthoFinder) for pangenome inference:

* [OrthoFinder](https://github.com/davidemms/OrthoFinder) version >= 2.1.2
* [blast](https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download) version >= 2.6.0
* [MCL](https://www.micans.org/mcl/index.html?sec_software) version >= 14-137
* [OrthoFinder](https://github.com/davidemms/OrthoFinder) version >= 2.1.2

## Usage

Progenomics is able to perform a number of specific tasks related to prokaryotic core and pangenomes (see also `progenomics -h`):
SCARAP is able to perform a number of specific tasks related to prokaryotic comparative genomics (see also `scarap -h`):

* `pan`: infer a pangenome from a set of faa files
* `build`: build a profile HMM database for a core/pangenome
@@ -47,14 +47,14 @@ A full core and pangenome pipeline are also implemented:
If you want to infer the pangenome of a set of genomes, you only need their faa files (fasta files with protein sequences) as input. If the faa files are stored in a folder `faas`, you can infer the pangenome using 16 threads by running:

ls faas/*.faa > faapaths.txt
progenomics pan faapaths.txt pan -t 16
scarap pan faapaths.txt pan -t 16

The pangenome will be stored in `pan/pangenome.tsv`.
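
The `ls faas/*.faa > faapaths.txt` step only writes one faa path per line. If you would rather prepare that file from a script, the following is a minimal Python sketch of the same step. The function name `write_faapaths` is hypothetical and not part of SCARAP.

```python
from pathlib import Path

def write_faapaths(faa_dir, out_file):
    """Write the paths of all .faa files in faa_dir to out_file, one per line."""
    # Sort the paths so the order is stable across runs.
    paths = sorted(str(p) for p in Path(faa_dir).glob("*.faa"))
    Path(out_file).write_text("".join(p + "\n" for p in paths))
    return paths
```

Calling `write_faapaths("faas", "faapaths.txt")` should leave you with a file that can be passed to `scarap pan` in the same way as the one produced by `ls`.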

The above example will use the builtin "FH" strategy to infer the pangenome; it is fast and scales more or less linearly with the number of input genomes. If you prefer to use OrthoFinder for pangenome inference, you can run:

ls faas/*.faa > faapaths.txt
progenomics pan faapaths.txt pan -d O-B -t 16
scarap pan faapaths.txt pan -d O-B -t 16

This will be a bit slower though.

@@ -67,12 +67,12 @@ Disclaimer: many aspects of this pipeline can still change, especially the way t
If you want to infer a pangenome of your genomes as well as build a pangenome database that you can later query with one or more genes of interest, you can run:

ls faas/*.faa > faapaths.txt
progenomics pan-pipeline faapaths.txt pan -t 16
scarap pan-pipeline faapaths.txt pan -t 16

If you then want to identify whether some genes of interest (let's say in a file called `querygenes.fasta`) are present in the pangenome database, you can run:

echo querygenes.fasta > querypath.txt
progenomics search querypath.txt pangenome/db hits
scarap search querypath.txt pangenome/db hits

This will produce a `hits` output folder with the file `hits.tsv`.

@@ -81,71 +81,71 @@ This will produce a `hits` output folder with the file `hits.tsv`.
The pangenome pipeline can also be performed by running individual tasks:

ls faas/*.faa > faapaths.txt
progenomics pan faapaths.txt pan
progenomics build faapaths.txt pan/pangenome.tsv db
progenomics search faapaths.txt db pan2 -s pan -p pan/pangenome.tsv
scarap pan faapaths.txt pan
scarap build faapaths.txt pan/pangenome.tsv db
scarap search faapaths.txt db pan2 -s pan -p pan/pangenome.tsv

The final `search` step is required because it trains an HMMER score cutoff for each profile HMM in the pangenome database and adds these cutoffs to the database. In addition, it produces an orthogroup assignment for each protein in the set of input genomes (`pan2/hits.tsv`). Importantly, these assignments are not always the same as the orthogroup assignments listed in `pan/pangenome.tsv`: they are produced by an HMMER search with orthogroup-specific cutoffs, whereas the original assignments are produced by the pangenome inference process. Comparing these two strategies of orthogroup assignment could be interesting.

### Inferring a core genome only

Let's say we want to infer the core genome for a set of genomes and we have one faa file (amino acid sequences of predicted genes) per genome in the folder `faas`. This can be done using the core pipeline of progenomics, which can be a lot faster than full pangenome inference.
Let's say we want to infer the core genome for a set of genomes and we have one faa file (amino acid sequences of predicted genes) per genome in the folder `faas`. This can be done using the core pipeline of SCARAP, which can be a lot faster than full pangenome inference.

**Quick version**

To get the core genome, we can simply run the following commands:

ls faas/*.faa > faapaths.txt
progenomics core-pipeline faapaths.txt core -t 16
scarap core-pipeline faapaths.txt core -t 16

This will create the output folder `core`, run progenomics with 16 threads and produce the file `coregenome.tsv`. This output tsv file contains the core orthogroups and has the columns gene, genome and orthogroup.
This will create the output folder `core`, run SCARAP with 16 threads and produce the file `coregenome.tsv`. This output tsv file contains the core orthogroups and has the columns gene, genome and orthogroup.
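
To inspect the result, a few lines of Python are enough. The sketch below assumes that `coregenome.tsv` is tab-separated, has no header row and lists its columns in the order gene, genome, orthogroup. Only the column names are stated above, so the rest is an assumption. The function name `genomes_per_orthogroup` is hypothetical.

```python
import csv
from collections import defaultdict

def genomes_per_orthogroup(tsv_path):
    """Map each orthogroup to the set of genomes in which it occurs."""
    groups = defaultdict(set)
    with open(tsv_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) != 3:
                continue  # skip empty or malformed lines
            gene, genome, orthogroup = row  # assumed column order
            groups[orthogroup].add(genome)
    return dict(groups)
```

The size of each set tells you in how many genomes a core orthogroup was found.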

If we now want to construct a supermatrix (concatenated alignment) of these core orthogroups, we could do it as follows:

progenomics supermatrix faapaths.txt core/coregenome.tsv supermatrix
scarap supermatrix faapaths.txt core/coregenome.tsv supermatrix

This will create a `supermatrix` output folder containing the file `supermatrix.fasta`.

And that's it! Three lines of code to get from the faa files to the supermatrix fasta file, ready to start constructing your phylogenetic tree.

**Detailed version**

If we want more fine-grained control, we could achieve the same result by running individual progenomics tasks. These individual tasks also give insight in how the core genome pipeline actually works.
If we want more fine-grained control, we could achieve the same result by running individual SCARAP tasks. These individual tasks also give insight in how the core genome pipeline actually works.

**Step 1:** infer the pangenome of a random subset of seed genomes (e.g. 30).

mkdir seeds cands
ls faas/*.faa > faapaths.txt
shuf -n 30 faapaths.txt > seeds/faapaths.txt
progenomics pan seeds/faapaths.txt seeds/pan
scarap pan seeds/faapaths.txt seeds/pan

**Step 2:** build a profile HMM database of "candidate core orthogroups" that are present in at least M seed genomes (e.g. 25).

progenomics build seeds/faapaths.txt seeds/pan/pangenome.tsv cands/db -m 25
scarap build seeds/faapaths.txt seeds/pan/pangenome.tsv cands/db -m 25

**Step 3:** identify the candidate core genes in the full set of genomes by searching all proteins of all genomes against the database of candidate core genes.

progenomics search faapaths.txt cands/db cands/core -y core
scarap search faapaths.txt cands/db cands/core -y core

**Step 4:** identify the core genes from the candidates by imposing a minimum percentage presence cutoff (e.g. 98%) in the full set of genomes.

progenomics checkgroups cands/core/coregenome.tsv cands/groups
scarap checkgroups cands/core/coregenome.tsv cands/groups
awk '{ if ($2 > 0.98) { print $1 } }' cands/groups/orthogroups.tsv \
> orthogroups.txt
progenomics filter cands/core/coregenome.tsv core -o orthogroups.txt
scarap filter cands/core/coregenome.tsv core -o orthogroups.txt

The output folder `core` will now contain the file `coregenome.tsv`.
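
The `awk` filter in step 4 can also be written in Python. This is a sketch only: it assumes, based on the `awk` command, that the first column of `orthogroups.tsv` holds the orthogroup name and the second column its presence fraction. The function name `select_core_orthogroups` is hypothetical. Unlike the `awk` one-liner, it skips rows whose second field is not a number (for example a header row, if the file has one).

```python
def select_core_orthogroups(lines, min_presence=0.98):
    """Return the orthogroups whose presence fraction exceeds min_presence."""
    selected = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip empty or incomplete lines
        try:
            presence = float(fields[1])
        except ValueError:
            continue  # skip non-numeric rows such as a header
        if presence > min_presence:
            selected.append(fields[0])
    return selected
```

Writing the returned names to `orthogroups.txt`, one per line, gives a file of the same form as the one passed to `scarap filter` with `-o`.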

## License

Progenomics is free software, licensed under [GPLv3](https://github.com/SWittouck/progenomics/blob/master/LICENSE).
SCARAP is free software, licensed under [GPLv3](https://github.com/SWittouck/scarap/blob/master/LICENSE).

## Feedback

All feedback and suggestions very welcome at stijn.wittouck[at]uantwerpen.be. You are of course also welcome to file [issues](https://github.com/SWittouck/progenomics/issues).
All feedback and suggestions very welcome at stijn.wittouck[at]uantwerpen.be. You are of course also welcome to file [issues](https://github.com/SWittouck/scarap/issues).

## Citation

When you use progenomics for your publication, please cite:
When you use SCARAP for your publication, please cite:

[Wittouck, S., Wuyts, S., Meehan, C. J., van Noort, V., & Lebeer, S. (2019). A Genome-Based Species Taxonomy of the Lactobacillus Genus Complex. mSystems, 4(5), e00264–19.](https://doi.org/10.1128/mSystems.00264-19)
2 changes: 1 addition & 1 deletion src/progenomics/README.md → src/scarap/README.md
@@ -4,7 +4,7 @@

Remark: all scripts import utils

* progenomics.py:
* scarap.py:
* commandline interface
* imports taskwrappers
* taskwrappers.py:
File renamed without changes.
File renamed without changes.
4 changes: 2 additions & 2 deletions src/progenomics/checkers.py → src/scarap/checkers.py
@@ -27,7 +27,7 @@ def check_mafft():
version = r.search(res.stderr.decode()).group()
logging.info(f"detected MAFFT v{version}")
if float(version) < 7.310:
logging.warning("progenomics has been tested with MAFFT v7.310 or newer")
logging.warning("SCARAP has been tested with MAFFT v7.310 or newer")

def check_mmseqs():
try:
@@ -41,7 +41,7 @@ def check_mmseqs():
release = releases_tested.get(version, "unknown")
logging.info(f"detected MMseqs2 version {version} (release {release})")
if not version in releases_tested.keys():
logging.warning("progenomics has only been tested with MMseqs2 "
logging.warning("SCARAP has only been tested with MMseqs2 "
"releases 11 and 12")

def check_infile(infile):
File renamed without changes.
File renamed without changes.
File renamed without changes.
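
A note on the `check_mafft` hunk in `checkers.py`: it compares the detected version with `float(version) < 7.310`. A float comparison and a component-wise comparison can disagree when the minor part has a different number of digits. Whether that matters depends on the tool's numbering scheme, which is not confirmed here. The sketch below shows the component-wise alternative; `version_tuple` and `is_older` are hypothetical helpers, not part of SCARAP.

```python
def version_tuple(version):
    """Split a dotted version string such as '7.310' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))

def is_older(detected, minimum):
    """Return True if the detected version precedes the minimum version."""
    return version_tuple(detected) < version_tuple(minimum)
```

For example, `is_older("7.5", "7.310")` is true under this scheme, while `float("7.5") < 7.310` is false.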
18 changes: 9 additions & 9 deletions src/progenomics/progenomics.py → src/scarap/scarap.py
@@ -1,10 +1,10 @@
#! /usr/bin/env python3

# This is the main script of progenomics; it only contains the commandline
# This is the main script of SCARAP; it only contains the commandline
# interface.

__author__ = "Stijn Wittouck"
__version__ = "0.3.0"
__version__ = "0.3.1"

import argparse
import logging
@@ -22,7 +22,7 @@ def print_help():
Stijn Wittouck (development)
Sarah Lebeer (supervision)
USAGE
progenomics [-h] <task> <task-specific arguments>
scarap [-h] <task> <task-specific arguments>
TASKS
pan --> infer a pangenome from a set of faa files
build --> build a profile HMM database for a core/pangenome
@@ -40,7 +40,7 @@ def print_help():
core-pipeline --> infer a core genome, build a profile HMM database and
train score cutoffs from a set of faa files
DOCUMENTATION
https://github.com/swittouck/progenomics\
https://github.com/swittouck/scarap\
'''

print(message.format(__version__))
@@ -49,7 +49,7 @@ def print_intro():

message = '''\
This is progenomics version {0}
This is SCARAP version {0}
'''

print(message.format(__version__))
@@ -250,19 +250,19 @@ def parse_arguments():
format = '[%(asctime)s] %(levelname)s: %(message)s',
datefmt = '%d/%m %H:%M:%S'
)
logging.info("welcome to progenomics")
logging.info("welcome to SCARAP")

if "cont" in args and args.cont and os.path.exists(args.outfolder):
logging.info("continuing in existing output folder")
check_outfile(os.path.join(args.outfolder, "progenomics.log"))
check_outfile(os.path.join(args.outfolder, "SCARAP.log"))
else:
logging.info("creating output folder and log file")
check_outdir(args.outfolder)
os.makedirs(args.outfolder, exist_ok = True)
logging.info(f"output folder '{args.outfolder}' created")

handler = logging.FileHandler(
filename = os.path.join(args.outfolder, "progenomics.log"),
filename = os.path.join(args.outfolder, "SCARAP.log"),
mode = 'w'
)
handler.setFormatter(logging.getLogger().handlers[0].formatter)
@@ -271,4 +271,4 @@ def parse_arguments():

args.func(args)

logging.info("progenomics out")
logging.info("SCARAP out")
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
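
For context on the renamed log file: the last hunks of `scarap.py` attach a `logging.FileHandler` for `SCARAP.log` in the output folder and give it the formatter of the first handler on the root logger. The following standalone sketch illustrates that pattern under simplified assumptions. The function name `add_file_logging` is hypothetical, and the continue/overwrite checks of the real script are left out.

```python
import logging
import os

def add_file_logging(outfolder, logname="SCARAP.log"):
    """Attach a file handler that reuses the first root handler's formatter."""
    root = logging.getLogger()
    if not root.handlers:
        # Console logging with the format strings shown in the diff.
        logging.basicConfig(
            level=logging.INFO,
            format="[%(asctime)s] %(levelname)s: %(message)s",
            datefmt="%d/%m %H:%M:%S",
        )
    os.makedirs(outfolder, exist_ok=True)
    handler = logging.FileHandler(
        filename=os.path.join(outfolder, logname),
        mode="w",
    )
    handler.setFormatter(root.handlers[0].formatter)
    root.addHandler(handler)
    return handler
```

After this call, messages sent through `logging.info` and friends go both to the console and to the log file in the output folder.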
