Change name to SCARAP and update version to 0.3.1

SWittouck · Dec 22, 2020 · 7f85162
1 parent ad74c8e

Showing 13 changed files with 38 additions and 38 deletions.
diff --git a/README.md b/README.md
@@ -1,6 +1,6 @@
-# Progenomics: toolkit for prokaryotic comparative genomics
+# SCARAP: toolkit for comparative genomics of prokaryotes
 
-Progenomics is a toolkit for fast and scalable comparative genomics. It has been designed for prokaryotes but should work for eukaryotic genomes as well. Progenomics can handle large genome datasets on a range of taxonomic levels; it has been tested on datasets up until the order level. Its most useful features are fast pangenome inference, sensitive search of query sequences in a pangenome, rapid core genome inference using a heuristic strategy and the construction of concatenated core gene alignments ("supermatrices"). 
+SCARAP is short for scalable and rapid pangenomes. Pangenome inference is SCARAP's main feature, but it also contains a number of other useful tools for comparative genomics of prokaryotes, such as: pangenome profile database construction and searching, rapid core genome inference, calculation of ANI/AAI-like metrics, genome clustering and dereplication and the construction of concatenated core gene alignments ("supermatrices"). SCARAP has been designed for prokaryotes but should work for eukaryotic genomes as well. It can handle large genome datasets on a range of taxonomic levels; it has been tested on datasets with prokaryotic genomes from the species to the order level. 
 
 ## Dependencies
 
@@ -12,21 +12,21 @@ Essential dependencies:
     * [ete3](http://etetoolkit.org/) version >= 3.1.1
     * [scipy](https://www.scipy.org/) version >= 1.4.1
 * [MAFFT](https://mafft.cbrc.jp/alignment/software/) version >= 7.407
-* [MMseqs2](https://github.com/soedinglab/MMseqs2) version >= 12-113e3
+* [MMseqs2](https://github.com/soedinglab/MMseqs2) release 11 or 12
 
-Dependencies for the search module and core pipeline:
+Dependencies for the build, search and core-pipeline modules: 
 
 * [HMMER](http://hmmer.org/) version >= 3.1b2
 
 Dependencies when using [OrthoFinder](https://github.com/davidemms/OrthoFinder) for pangenome inference:
 
+* [OrthoFinder](https://github.com/davidemms/OrthoFinder) version >= 2.1.2
 * [blast](https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download) version >= 2.6.0
 * [MCL](https://www.micans.org/mcl/index.html?sec_software) version >= 14-137
-* [OrthoFinder](https://github.com/davidemms/OrthoFinder) version >= 2.1.2
 
 ## Usage
 
-Progenomics is able to perform a number of specific tasks related to prokaryotic core and pangenomes (see also `progenomics -h`):
+SCARAP is able to perform a number of specific tasks related to prokaryotic comparative genomics (see also `scarap -h`):
 
 * `pan`: infer a pangenome from a set of faa files
 * `build`: build a profile HMM database for a core/pangenome
@@ -47,14 +47,14 @@ A full core and pangenome pipeline are also implemented:
 If you want to infer the pangenome of a set of genomes, you only need their faa files (fasta files with protein sequences) as input. If the faa files are stored in a folder `faas`, you can infer the pangenome using 16 threads by running: 
 
     ls faas/*.faa > faapaths.txt
-    progenomics pan faapaths.txt pan -t 16
+    scarap pan faapaths.txt pan -t 16
 
 The pangenome will be stored in `pan/pangenome.tsv`. 
 
 The above example will use the builtin "FH" strategy to infer the pangenome; it is fast and scales more or less linearly with the number of input genomes. If you prefer to use OrthoFinder for pangenome inference, you can run:
 
     ls faas/*.faa > faapaths.txt
-    progenomics pan faapaths.txt pan -d O-B -t 16
+    scarap pan faapaths.txt pan -d O-B -t 16
 
 This will be a bit slower though.
 
@@ -67,12 +67,12 @@ Disclaimer: many aspects of this pipeline can still change, especially the way t
 If you want to infer a pangenome of your genomes as well as build a pangenome database that you can later query with one or more genes of interest, you can run:
 
     ls faas/*.faa > faapaths.txt
-    progenomics pan-pipeline faapaths.txt pan -t 16
+    scarap pan-pipeline faapaths.txt pan -t 16
 
 If you then want to identify whether some genes of interest (let's say in a file called `querygenes.fasta`) are present in the pangenome database, you can run:
 
     echo querygenes.fasta > querypath.txt
-    progenomics search querypath.txt pangenome/db hits
+    scarap search querypath.txt pangenome/db hits
 
 This will produce a `hits` output folder with the file `hits.tsv`.
 
@@ -81,71 +81,71 @@ This will produce a `hits` output folder with the file `hits.tsv`.
 The pangenome pipeline can also be performed by running individual tasks:
 
     ls faas/*.faa > faapaths.txt
-    progenomics pan faapaths.txt pan
-    progenomics build faapaths.txt pan/pangenome.tsv db
-    progenomics search faapaths.txt db pan2 -s pan -p pan/pangenome.tsv
+    scarap pan faapaths.txt pan
+    scarap build faapaths.txt pan/pangenome.tsv db
+    scarap search faapaths.txt db pan2 -s pan -p pan/pangenome.tsv
 
 The final `search` step is required because it will train a hmmer score cutoff for each profile HMM in the pangenome database and add these cutoffs to the database. In addition, it produces an orthogroup assignment for each protein in the set of input genomes (`pan2/hits.tsv`). Importantly, these assignments are not always the same as the orthogroup assignments listed in `pan/pangenome.tsv` because they are produced by a hmmer search with orthogroup-specific cutoffs, while the original orthogroup assignments have been produced by the pangenome inference process. A comparison between these two strategies of orthogroup assignment could be interesting.
 
 ### Inferring a core genome only
 
-Let's say we want to infer the core genome for a set of genomes and we have one faa file (amino acid sequences of predicted genes) per genome in the folder `faas`. This can be done using the core pipeline of progenomics, which can be a lot faster than full pangenome inference. 
+Let's say we want to infer the core genome for a set of genomes and we have one faa file (amino acid sequences of predicted genes) per genome in the folder `faas`. This can be done using the core pipeline of SCARAP, which can be a lot faster than full pangenome inference. 
 
 **Quick version**
 
 To get the core genome, we can simply run the following commands:
 
     ls faas/*.faa > faapaths.txt
-    progenomics core-pipeline faapaths.txt core -t 16
+    scarap core-pipeline faapaths.txt core -t 16
 
-This will create the output folder `core`, run progenomics with 16 threads and produce the file `coregenome.tsv`. This output tsv file contains the core orthogroups and has the columns gene, genome and orthogroup.
+This will create the output folder `core`, run SCARAP with 16 threads and produce the file `coregenome.tsv`. This output tsv file contains the core orthogroups and has the columns gene, genome and orthogroup.
 
 If we now want to construct a supermatrix (concatenated alignment) of these core orthogroups, we could do it as follows:
 
-    progenomics supermatrix faapaths.txt core/coregenome.tsv supermatrix
+    scarap supermatrix faapaths.txt core/coregenome.tsv supermatrix
 
 This will create a `supermatrix` output folder, with in it a file supermatrix.fasta.
 
 And that's it! Three lines of code to get from the faa files to the supermatrix fasta file, ready to start constructing your phylogenetic tree.
 
 **Detailed version**
 
-If we want more fine-grained control, we could achieve the same result by running individual progenomics tasks. These individual tasks also give insight in how the core genome pipeline actually works.
+If we want more fine-grained control, we could achieve the same result by running individual SCARAP tasks. These individual tasks also give insight in how the core genome pipeline actually works.
 
 **Step 1:** infer the pangenome of a random subset of seed genomes (e.g. 30).
 
     mkdir seeds cands
     ls faas/*.faa > faapaths.txt
     shuf -n 30 faapaths.txt > seeds/faapaths.txt
-    progenomics pan seeds/faapaths.txt seeds/pan
+    scarap pan seeds/faapaths.txt seeds/pan
 
 **Step 2:** build a profile HMM database of "candidate core orthogroups" that are present in at least M seed genomes (e.g. 25).
 
-    progenomics build seeds/faapaths.txt seeds/pan/pangenome.tsv cands/db -m 25
+    scarap build seeds/faapaths.txt seeds/pan/pangenome.tsv cands/db -m 25
 
 **Step 3:** identify the candidate core genes in the full set of genomes by searching all proteins of all genomes against the database of candidate core genes.
 
-    progenomics search faapaths.txt cands/db cands/core -y core
+    scarap search faapaths.txt cands/db cands/core -y core
 
 **Step 4:** identify the core genes from the candidates by imposing a minimum percentage presence cutoff (e.g. 98%) in the full set of genomes.
 
-    progenomics checkgroups cands/core/coregenome.tsv cands/groups
+    scarap checkgroups cands/core/coregenome.tsv cands/groups
     awk '{ if ($2 > 0.98) { print $1 } }' cands/groups/orthogroups.tsv \
       > orthogroups.txt
-    progenomics filter cands/core/coregenome.tsv core -o orthogroups.txt
+    scarap filter cands/core/coregenome.tsv core -o orthogroups.txt
 
 The output folder `core` will now contain the file coregenome.tsv.
 
 ## License
 
-Progenomics is free software, licensed under [GPLv3](https://github.com/SWittouck/progenomics/blob/master/LICENSE).
+SCARAP is free software, licensed under [GPLv3](https://github.com/SWittouck/scarap/blob/master/LICENSE).
 
 ## Feedback
 
-All feedback and suggestions very welcome at stijn.wittouck[at]uantwerpen.be. You are of course also welcome to file [issues](https://github.com/SWittouck/progenomics/issues).
+All feedback and suggestions very welcome at stijn.wittouck[at]uantwerpen.be. You are of course also welcome to file [issues](https://github.com/SWittouck/scarap/issues).
 
 ## Citation
 
-When you use progenomics for your publication, please cite:
+When you use SCARAP for your publication, please cite:
 
 [Wittouck, S., Wuyts, S., Meehan, C. J., van Noort, V., & Lebeer, S. (2019). A Genome-Based Species Taxonomy of the Lactobacillus Genus Complex. mSystems, 4(5), e00264–19.](https://doi.org/10.1128/mSystems.00264-19)
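
Step 4 of the detailed core-genome walkthrough in the README above selects orthogroups with an `awk` one-liner. A minimal Python sketch of the same filter follows. It assumes, as the `awk` command implies, that `orthogroups.tsv` is whitespace-delimited with the orthogroup name in column 1 and a presence fraction in column 2. The function name `select_core_orthogroups` is hypothetical and not part of SCARAP. Unlike `awk`, the sketch skips rows whose second column is not a number (for example a header row).

```python
def select_core_orthogroups(lines, min_presence=0.98):
    """Return names of orthogroups whose presence fraction exceeds the cutoff.

    Assumed layout (taken from the awk one-liner, not from SCARAP's docs):
    column 1 = orthogroup name, column 2 = presence fraction.
    """
    selected = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or incomplete row
        try:
            presence = float(fields[1])
        except ValueError:
            continue  # non-numeric second column, e.g. a header row
        if presence > min_presence:  # strict ">" like the awk condition
            selected.append(fields[0])
    return selected


# Illustrative in-memory rows; real input would be the lines of
# cands/groups/orthogroups.tsv.
rows = ["og1\t1.00", "og2\t0.98", "og3\t0.995"]
print(select_core_orthogroups(rows))  # ['og1', 'og3']
```

Writing the returned names one per line to `orthogroups.txt` would give the file that `scarap filter ... -o orthogroups.txt` expects in the walkthrough.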
diff --git a/src/progenomics/README.md b/src/scarap/README.md
@@ -4,7 +4,7 @@
 
 Remark: all scripts import utils
 
-* progenomics.py:
+* scarap.py:
     * commandline interface
     * imports taskwrappers
 * taskwrappers.py:

diff --git a/src/progenomics/__init__.py b/src/scarap/__init__.py
diff --git a/src/progenomics/callers.py b/src/scarap/callers.py
diff --git a/src/progenomics/checkers.py b/src/scarap/checkers.py
@@ -27,7 +27,7 @@ def check_mafft():
     version = r.search(res.stderr.decode()).group()
     logging.info(f"detected MAFFT v{version}")
     if float(version) < 7.310:
-        logging.warning("progenomics has been tested with MAFFT v7.310 or newer")
+        logging.warning("SCARAP has been tested with MAFFT v7.310 or newer")
 
 def check_mmseqs():
     try:
@@ -41,7 +41,7 @@ def check_mmseqs():
     release = releases_tested.get(version, "unknown")
     logging.info(f"detected MMseqs2 version {version} (release {release})")
     if not version in releases_tested.keys():
-        logging.warning("progenomics has only been tested with MMseqs2 "
+        logging.warning("SCARAP has only been tested with MMseqs2 "
             "releases 11 and 12")
 
 def check_infile(infile):
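
The `check_mafft` hunk above compares versions with `float(version) < 7.310`. That works for strings with a single dot, but `float` raises `ValueError` on a string with two dots such as `7.310.1`. A minimal sketch of a tuple-based comparison follows. The helper names `parse_version` and `is_older` are hypothetical and not part of SCARAP. The sketch assumes every dot-separated component is an integer that is compared numerically, so `7.4` would sort before `7.310`, which differs from the float comparison.

```python
import re


def parse_version(text):
    """Return the first dotted version number found in `text` as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def is_older(detected, minimum):
    """True when the detected version sorts strictly before the minimum."""
    return parse_version(detected) < parse_version(minimum)


# Example: a detected version compared against the tested minimum.
if is_older("7.245", "7.310"):
    print("detected version is older than the tested minimum")
```

Tuples compare element by element, so versions with a different number of components can still be ordered without extra code.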

diff --git a/src/progenomics/computers.py b/src/scarap/computers.py
diff --git a/src/progenomics/pan.py b/src/scarap/pan.py
diff --git a/src/progenomics/readerswriters.py b/src/scarap/readerswriters.py
diff --git a/src/progenomics/progenomics.py b/src/scarap/scarap.py
@@ -1,10 +1,10 @@
 #! /usr/bin/env python3
 
-# This is the main script of progenomics; it only contains the commandline
+# This is the main script of SCARAP; it only contains the commandline
 # interface.
 
 __author__ = "Stijn Wittouck"
-__version__ = "0.3.0"
+__version__ = "0.3.1"
 
 import argparse
 import logging
@@ -22,7 +22,7 @@ def print_help():
     Stijn Wittouck (development)
     Sarah Lebeer (supervision)
 USAGE
-    progenomics [-h] <task> <task-specific arguments>
+    scarap [-h] <task> <task-specific arguments>
 TASKS
     pan           --> infer a pangenome from a set of faa files
     build         --> build a profile HMM database for a core/pangenome
@@ -40,7 +40,7 @@ def print_help():
     core-pipeline --> infer a core genome, build a profile HMM database and
                       train score cutoffs from a set of faa files
 DOCUMENTATION
-    https://github.com/swittouck/progenomics\
+    https://github.com/swittouck/scarap\
 '''
 
     print(message.format(__version__))
@@ -49,7 +49,7 @@ def print_intro():
 
     message = '''\
 
-This is progenomics version {0}
+This is SCARAP version {0}
 '''
 
     print(message.format(__version__))
@@ -250,19 +250,19 @@ def parse_arguments():
         format = '[%(asctime)s] %(levelname)s: %(message)s',
         datefmt = '%d/%m %H:%M:%S'
     )
-    logging.info("welcome to progenomics")
+    logging.info("welcome to SCARAP")
 
     if "cont" in args and args.cont and os.path.exists(args.outfolder):
         logging.info("continuing in existing output folder")
-        check_outfile(os.path.join(args.outfolder, "progenomics.log"))
+        check_outfile(os.path.join(args.outfolder, "SCARAP.log"))
     else:
         logging.info("creating output folder and log file")
         check_outdir(args.outfolder)
         os.makedirs(args.outfolder, exist_ok = True)
         logging.info(f"output folder '{args.outfolder}' created")
 
     handler = logging.FileHandler(
-        filename = os.path.join(args.outfolder, "progenomics.log"),
+        filename = os.path.join(args.outfolder, "SCARAP.log"),
         mode = 'w'
     )
     handler.setFormatter(logging.getLogger().handlers[0].formatter)
@@ -271,4 +271,4 @@ def parse_arguments():
 
     args.func(args)
 
-    logging.info("progenomics out")
+    logging.info("SCARAP out")
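
The `parse_arguments` hunk above attaches a `FileHandler` for `SCARAP.log` to the root logger and reuses the formatter of the first existing handler. A minimal standalone sketch of that pattern follows. The function name `attach_file_log` is hypothetical, and the fallback formatter (built from the format strings visible in the hunk) is an addition for the case where no console handler has been configured yet.

```python
import logging
import os

LOG_FORMAT = "[%(asctime)s] %(levelname)s: %(message)s"
DATE_FORMAT = "%d/%m %H:%M:%S"


def attach_file_log(outfolder, filename="SCARAP.log"):
    """Mirror root-logger messages into <outfolder>/<filename>."""
    os.makedirs(outfolder, exist_ok=True)
    root = logging.getLogger()
    # Reuse an existing handler's formatter so console and file output look
    # identical; fall back to a fresh formatter if none is configured.
    formatter = next(
        (h.formatter for h in root.handlers if h.formatter is not None),
        logging.Formatter(LOG_FORMAT, datefmt=DATE_FORMAT),
    )
    handler = logging.FileHandler(os.path.join(outfolder, filename), mode="w")
    handler.setFormatter(formatter)
    root.addHandler(handler)
    return handler
```

Because the handler is opened with `mode="w"`, an existing log file in the output folder is overwritten, which matches the `mode = 'w'` argument in the hunk.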
diff --git a/src/progenomics/tasks_composite.py b/src/scarap/tasks_composite.py
diff --git a/src/progenomics/tasks_core.py b/src/scarap/tasks_core.py
diff --git a/src/progenomics/taskwrappers.py b/src/scarap/taskwrappers.py
diff --git a/src/progenomics/utils.py b/src/scarap/utils.py