Releases: SWittouck/SCARAP
SCARAP v1.0.0
After having been in development for a long time, SCARAP has now reached the first version that I consider mature! To celebrate this milestone, I have put this version of SCARAP on PyPI: it can now be installed with pip install scarap 🎉
A special thank you to @TheOafidian for supplying some code structure updates, bug fixes, a test suite via GitHub Actions and a very nice logo!
Features:
- The pan module delimits orthogroups in a binary splitting process of initial gene clusters. Previously, each splitting step involved the selection of 50 representative sequences, which were aligned and hierarchically clustered. Three parameters were added to the module to give the user control over this process:
  - --max-align (default: 512): Only sequence clusters with more sequences than this number will go through representative sequence selection; otherwise, all sequences will be considered representatives.
  - --max-reps (default: 40): The maximum number of representatives to use for splitting large clusters. Initially, this number of representatives will be selected.
  - --min-reps (default: 32): The minimum number of representatives to use for splitting large clusters. Subclusters inherit representative sequences from their parent cluster if they are available. If the number of inherited representatives is larger than MIN-REPS, they will be re-used.
- The values of the --max-align and --min-reps parameters were optimized, leading to a speedup of the pan and core modules in many situations.
- The search module received a slight speedup (and therefore indirectly the core module as well).
- If supplied fasta filenames are not unique, the problematic names will now be listed in the log.
Bug fixes:
- The presence of the ">" character in a fasta header no longer leads to an error.
- In the build module, the core filter is now applied before score cutoff training instead of the core prefilter (as intended).
- MAFFT would sometimes mistakenly identify amino acid sequences as nucleotide sequences; this was fixed.
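To illustrate the first bug fix in this list, here is a minimal sketch of a fasta reader that tolerates ">" inside a header. It is not SCARAP's parser; the function name read_fasta is hypothetical. The idea is simply to treat ">" as a record marker only at the start of a line and to strip only that first character.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            # Drop only the leading ">" so later ">" characters survive.
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)
```

A header such as ">gene1 A->B transporter" is then returned as "gene1 A->B transporter", with the sequence lines that follow joined into one string.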
SCARAP v0.4.0
Features:
- The build and search modules have been improved.
  - Both modules now use MMseqs2 profile searches instead of HMMER profile searches. This results in a big speed increase and removes the HMMER dependency from SCARAP.
  - The build module can now build a core genome database; it takes the arguments --core-prefilter, --core-filter and --max-core-genes.
  - The build module now always trains score cutoffs on all orthogroups that were supplied (and selected).
- The core-pipeline module has been improved and has been renamed to core.
  - Of course, the core module benefits from the improvements of the build and search modules.
  - The final core genes are now selected on the seed genomes instead of on all genomes. This means that the core module is now scalable to larger datasets. To compensate for this, the default number of seed genomes has been increased to 100. Instead of a "seedfilter" and "allfilter", a user now specifies a "prefilter" and "filter" with the arguments --core-prefilter and --core-filter (just like with the build module).
  - The prefilter and filter are now based on single-copy occurrence instead of overall occurrence.
  - It is now possible to specify a maximum number of core genes to extract with the parameter --max-core-genes.
- The clust module has been renamed to sample.
  - The argument --max_clusters has been renamed to --max-genomes.
- The supermatrix module was renamed to concat and now allows for core gene selection with the arguments --core-filter and --max-core-genes.
- The pan-pipeline module has been removed.
  - Its functionality can now be achieved by combining pan and build.
- Some argument names were improved.
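The single-copy occurrence filter and the cap on the number of core genes mentioned above can be sketched as follows. This is an illustration under assumptions, not SCARAP's implementation: the function name select_core_genes, the input layout, the default threshold of 0.95 and the rule of keeping the highest-occurrence orthogroups when capping are all hypothetical.

```python
from collections import Counter


def select_core_genes(genes, n_genomes, core_filter=0.95, max_core_genes=0):
    """Select core orthogroups by single-copy occurrence (illustrative).

    genes: iterable of (genome, orthogroup) pairs, one pair per gene.
    """
    # Number of copies of each orthogroup in each genome.
    copies = Counter(genes)
    # Number of genomes in which an orthogroup has exactly one copy.
    single_copy = Counter(og for (_, og), n in copies.items() if n == 1)
    occurrence = {og: k / n_genomes for og, k in single_copy.items()}
    # Keep orthogroups that reach the threshold, best occurrence first.
    core = sorted(
        (og for og, occ in occurrence.items() if occ >= core_filter),
        key=lambda og: (-occurrence[og], og),
    )
    # ASSUMPTION: a cap of 0 means "no limit".
    return core[:max_core_genes] if max_core_genes > 0 else core
```

An orthogroup that is present in every genome but duplicated in some of them scores lower here than it would under overall occurrence, which is the point of the change.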
Bug fixes:
- Tabs in fasta headers no longer cause problems.
- Fixed a bug in the core module where a Python error was thrown when a folder with faa files was used as input.
- Fixed a bug in the sample module: the last line of seeds.txt did not contain a newline character.
SCARAP v0.3.2
This is a small update with some user-friendliness improvements and bug fixes. It may be worth it to update to this version, because SCARAP v0.3.1 couldn't deal with MMseqs2 release 13 (February 24th, 2021).
Features:
- A folder with fasta files is now allowed as input (as an alternative to a file with paths to individual fasta files).
- An error is no longer thrown when the detected MMseqs2 version is unknown, and MMseqs2 release 13 is now recognized.
- SCARAP now checks whether gene names are unique across fasta files.
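A check for gene names that are shared between fasta files could look roughly like this. It is a hedged sketch rather than SCARAP's own check; the function name find_duplicate_genes and the dictionary input are hypothetical.

```python
from collections import defaultdict


def find_duplicate_genes(genomes):
    """Map each repeated gene name to the fasta files that contain it.

    genomes: mapping of fasta file name -> list of gene names in that file.
    """
    seen = defaultdict(list)
    for fasta, names in genomes.items():
        for name in names:
            seen[name].append(fasta)
    # Report only the names that were seen more than once.
    return {name: files for name, files in seen.items() if len(files) > 1}
```

An empty result means that all gene names are unique; otherwise the returned mapping shows where each clash occurs.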
Bug fixes:
- Fasta headers without spaces no longer result in an error.
- Having zero splittable superfamilies no longer results in an error.
- Gene identifiers with "|" are now dealt with correctly.
SCARAP v0.3.1
With my PhD defense nearing quickly, I'm wrapping up some enhancements to Progenomics and releasing them in a new version: v0.3.1. The main improvements are that I added a module that can cluster huge genome datasets in linear time and that I fixed some rare but annoying bugs in the pan module. In addition, the toolkit also receives a brand-new name: SCARAP (short for scalable and rapid pangenomes). The main reasons for this name change are that the name Progenomics was a bit too generic and that I wanted to avoid confusion with the progenomes database of the EMBL. SCARAP can of course do more than infer pangenomes, but that is one of its main features. Also, I like the sound of SCARAP, and so did my colleagues at the Lebeer Lab. SCARAP fulfills most of the criteria for a good software name: it is short, unique and easy to remember (or so I hope).
Features:
- A clust module has been added that can cluster genome datasets in linear time, given a desired number of clusters and/or an ANI/AAI-like cutoff.
- A fetch module has been added that extracts sequences of a pangenome into a fasta file per orthogroup.
- An "S" pangenome inference strategy has been added that runs the superclustering step only, without cluster splitting.
- The pan module has been made compatible with MMseqs2 releases 11 and 12 (but is now no longer compatible with releases before that).
- All modules are now able to read gzipped fasta files.
- The output of the checkgroups module now also contains the total occurrence of the orthogroups (next to their single-copy occurrence).
- Invalid commandline arguments are now corrected when possible (e.g. 50 to 0.50 for a percentage).
- Logging has been improved for many modules.
- Version checks were added for MMseqs2 and MAFFT.
- Large temporary files are now removed for the pan and supermatrix modules.
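The correction of invalid arguments listed above (e.g. 50 to 0.50 for a percentage) can be illustrated with a small sketch. The function name correct_fraction is hypothetical, and the exact rules SCARAP applies may differ; here, values between 1 and 100 are assumed to be percentages.

```python
def correct_fraction(value):
    """Return `value` as a fraction in [0, 1], rescaling percentages."""
    value = float(value)
    if 0 <= value <= 1:
        return value
    # ASSUMPTION: a value in (1, 100] was meant as a percentage.
    if 1 < value <= 100:
        return value / 100
    raise ValueError(f"cannot interpret {value} as a fraction or percentage")
```

For example, correct_fraction(50) returns 0.5, while a value that is already a fraction, such as 0.95, is returned unchanged.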
Bug fixes:
- Some rare but important bugs of the pan module were fixed:
  - Error in the cluster splitting process when all sequences of a cluster had the same representative
  - Error when the first characters of an amino acid sequence made it look like a DNA sequence
  - Occasional false negative split of high-copy families
- Some other, small bugs
Progenomics v0.3.0
Progenomics now has its own builtin pangenome strategies!
- You can still set the pangenome inference strategy with the -d commandline option.
- The new default strategy, called "FH", is very fast in comparison to existing tools such as OrthoFinder or SonicParanoid. In addition, it scales more or less linearly with the number of input genomes.
- A variant of the FH strategy, called "H", is not scalable but is even faster on relatively small datasets (~60 prokaryotic genomes or fewer).
- You can still use OrthoFinder for pangenome inference by setting the strategy to "O-B" (for OrthoFinder with BLAST) or "O-D" (for OrthoFinder with DIAMOND).
Progenomics v0.2.0
This is a major overhaul of the entire toolkit:
- The interface has been rewritten, with simpler tasks and shorter and more intuitive task names
- The following tasks are now available: pan, build, search, checkgenomes, checkgroups, filter and supermatrix
- The following pipelines are now available: pan-pipeline and core-pipeline
- The R dependencies have been removed
- Dependencies are now checked before running
- Logging has been improved
- All tasks now assume unzipped fasta files
We're still not in major version 1 because task names can still change, as well as some of the functionality (for example, the way in which hmmer score cutoffs are trained).
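A dependency check of the kind introduced in this release can be sketched with the standard library. This is only an illustration: the function name missing_tools is hypothetical, and the executable names in the usage note are examples, not a statement of which tools this version requires.

```python
import shutil


def missing_tools(tools):
    """Return the executables from `tools` that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

A caller could run missing_tools(["mafft", "hmmsearch"]) before starting a task and stop with a clear message if the returned list is not empty.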
Progenomics v0.1.0
This is the first public release of progenomics! It can handle the following tasks:
- construct_profile_db
- prepare_candidate_scgs
- select_scgs
- select_genomes
- construct_scg_matrix
- construct_supermatrix
- calculate_scnis
- nucleotide_supermatrix_from_scg_matrix