Releases: SWittouck/SCARAP

SCARAP v1.0.0

15 Nov 15:58

After having been in development for a long time, SCARAP has now reached the first version that I consider mature! To celebrate this milestone, I have put this version of SCARAP on PyPI: it can now be installed with pip install scarap 🎉

A special thank you to @TheOafidian for supplying some code structure updates, bug fixes, a test suite via GitHub Actions and a very nice logo!

Features:

  • The pan module delimits orthogroups in a binary splitting process of initial gene clusters. Previously, each splitting step involved the selection of 50 representative sequences, which were aligned and hierarchically clustered. Three parameters were added to the module to give the user control over this process:
    • --max-align (default: 512): Only sequence clusters with more sequences than this number will go through representative sequence selection; otherwise, all sequences will be considered representatives.
    • --max-reps (default: 40): The maximum number of representatives to use for splitting large clusters. Initially, this number of representatives will be selected.
    • --min-reps (default: 32): The minimum number of representatives to use for splitting large clusters. Subclusters inherit representative sequences from their parent cluster if they are available. If the number of inherited representatives is larger than --min-reps, they will be re-used.
  • The values of the --max-align and --min-reps parameters were optimized, leading to a speedup of the pan and core modules in many situations.
  • The search module received a slight speedup (and therefore indirectly the core module as well).
  • If supplied fasta filenames are not unique, the problematic names will now be listed in the log.
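
The interplay of the three new pan parameters can be sketched as follows. This is only an illustration of the description above, not SCARAP's actual code: the function name choose_representatives, its arguments and the random selection of new representatives are assumptions made for the example, and the behaviour when too few representatives are inherited is a guess.

```python
import random


def choose_representatives(seqs, inherited, max_align=512, max_reps=40,
                           min_reps=32, seed=0):
    """Hypothetical sketch of the representative selection rule."""
    seqs = list(seqs)
    # Small clusters skip selection: every sequence is a representative.
    if len(seqs) <= max_align:
        return seqs
    # Representatives inherited from the parent cluster that fall in this cluster.
    members = set(seqs)
    available = [s for s in inherited if s in members]
    # Re-use the inherited representatives if there are more than min_reps.
    if len(available) > min_reps:
        return available
    # Assumption: otherwise select max_reps representatives from scratch
    # (random here; SCARAP's real selection procedure is not shown).
    return random.Random(seed).sample(seqs, max_reps)
```

With the defaults, a cluster of 600 sequences that inherits 35 representatives would re-use those 35, while the same cluster inheriting only 10 would get 40 newly selected representatives.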

Bug fixes:

  • The presence of the ">" character in a fasta header no longer leads to an error.
  • In the build module, the core filter is now applied before score cutoff training instead of the core prefilter (as intended).
  • MAFFT would sometimes mistakenly identify amino acid sequences as nucleotide sequences; this was fixed.

SCARAP v0.4.0

15 Mar 11:00

Features:

  • The build and search modules have been improved.
    • Both modules now use MMseqs2 profile searches instead of HMMER profile searches. This results in a big speed increase and removes the HMMER dependency from SCARAP.
    • The build module can now build a core genome database; it takes the arguments --core-prefilter, --core-filter and --max-core-genes.
    • The build module now always trains score cutoffs on all orthogroups that were supplied (and selected).
  • The core-pipeline module has been improved and has been renamed to core.
    • Of course, the core module benefits from the improvements of the build and search modules.
    • The final core genes are now selected on the seed genomes instead of on all genomes. This means that the core module is now scalable to larger datasets. To compensate for this, the default number of seed genomes has been increased to 100. Instead of a "seedfilter" and "allfilter", a user now specifies a "prefilter" and "filter" with the arguments --core-prefilter and --core-filter (just like with the build module).
    • The prefilter and filter are now based on single-copy occurrence instead of overall occurrence.
    • It is now possible to specify a maximum number of core genes to extract with the parameter --max-core-genes.
  • The clust module has been renamed to sample.
    • The argument --max_clusters has been renamed to --max-genomes.
  • The supermatrix module was renamed to concat and now allows for core gene selection with the arguments --core-filter and --max-core-genes.
  • The pan-pipeline module has been removed.
    • Its functionality can now be achieved by combining pan and build.
  • Some argument names were improved.
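
As a rough illustration of what a filter on single-copy occurrence combined with a maximum number of core genes could look like, here is a minimal sketch. It is not taken from the SCARAP code base: the function names, the input layout (one genome identifier per gene in an orthogroup) and the ranking used to apply the maximum are all assumptions made for the example.

```python
from collections import Counter


def single_copy_occurrence(gene_genomes, n_genomes):
    """Fraction of genomes in which the orthogroup has exactly one gene.

    gene_genomes: list with one genome identifier per gene in the orthogroup.
    """
    copies = Counter(gene_genomes)
    single = sum(1 for n in copies.values() if n == 1)
    return single / n_genomes


def select_core(orthogroups, n_genomes, core_filter, max_core_genes=None):
    """Hypothetical core gene selection on single-copy occurrence."""
    scores = {og: single_copy_occurrence(genomes, n_genomes)
              for og, genomes in orthogroups.items()}
    kept = [og for og, score in scores.items() if score >= core_filter]
    # Assumption: when a maximum is given, keep the orthogroups with the
    # highest single-copy occurrence (ties broken by name).
    kept.sort(key=lambda og: (-scores[og], og))
    if max_core_genes is not None:
        kept = kept[:max_core_genes]
    return kept
```

In this sketch an orthogroup with two copies in one genome is penalized for that genome, which it would not be under a filter on overall occurrence.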

Bug fixes:

  • Tabs in fasta headers no longer give problems.
  • Fixed a bug in the core module where a python error was thrown when a folder with faa files was used as input.
  • Fixed a bug in the sample module: the last line of seeds.txt did not contain a newline character.

SCARAP v0.3.2

19 Dec 08:18

This is a small update with some user-friendliness improvements and bug fixes. It may be worth it to update to this version, because SCARAP v0.3.1 couldn't deal with MMseqs2 release 13 (February 24th, 2021).

Features:

  • A folder with fasta files is now allowed as input (as an alternative to a file with paths to individual fasta files).
  • An error is no longer thrown when the detected MMseqs2 version is unknown, and MMseqs2 release 13 is now recognized.
  • SCARAP now checks whether gene names are unique across fasta files.

Bug fixes:

  • Fasta headers without spaces no longer result in an error.
  • Having zero splittable superfamilies no longer results in an error.
  • Gene identifiers with "|" are now dealt with correctly.
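
To make the header-related items above more concrete, the sketch below shows one way to extract gene identifiers from fasta headers and to check that they are unique across files. It assumes that the gene identifier is the first whitespace-delimited word of the header (a common convention, not confirmed for SCARAP), and the function names gene_ids and duplicated_genes are hypothetical.

```python
from collections import defaultdict


def gene_ids(fasta_text):
    """Return the gene identifiers found in the headers of a fasta string."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Split on the first run of whitespace; a header without spaces
            # yields a single field, and "|" characters are left untouched.
            fields = line[1:].split(None, 1)
            ids.append(fields[0] if fields else "")
    return ids


def duplicated_genes(fastas):
    """Map each gene identifier that occurs more than once to its files.

    fastas: dict of filename -> fasta content (as a string).
    """
    seen = defaultdict(list)
    for filename, text in fastas.items():
        for gene in gene_ids(text):
            seen[gene].append(filename)
    return {gene: files for gene, files in seen.items() if len(files) > 1}
```

For example, duplicated_genes({"a.faa": ">g1\nM\n>g2\nM", "b.faa": ">g2 x\nM"}) reports that g2 occurs in both a.faa and b.faa.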

SCARAP v0.3.1

22 Dec 16:57

With my PhD defense nearing quickly, I'm wrapping up some enhancements to Progenomics and releasing them in a new version: v0.3.1. The main improvements are that I added a module that can cluster huge genome datasets in linear time and that I fixed some rare but annoying bugs in the pan module. In addition, the toolkit also receives a brand-new name: SCARAP (short for scalable and rapid pangenomes). The main reasons for this name change are that the name Progenomics was a bit too generic and that it could be confused with the progenomes database of the EMBL. SCARAP can of course do more than infer pangenomes, but that is one of its main features. Also, I like the sound of SCARAP, and so did my colleagues at the Lebeer Lab. SCARAP fulfills most of the criteria for a good software name: it is short, unique and easy to remember (or so I hope).

Features:

  • A clust module has been added that can cluster genome datasets in linear time, given a desired number of clusters and/or an ANI/AAI-like cutoff.
  • A fetch module has been added that extracts sequences of a pangenome into a fasta file per orthogroup.
  • An "S" pangenome inference strategy has been added that runs the superclustering step only, without cluster splitting.
  • The pan module has been made compatible with MMseqs2 releases 11 and 12 (but is now no longer compatible with releases before that).
  • All modules are now able to read gzipped fasta files.
  • The output of the checkgroups module now also contains the total occurrence of the orthogroups (next to their single-copy occurrence).
  • Invalid command-line arguments are now corrected when possible (e.g. 50 to 0.50 for a percentage).
  • Logging has been improved for many modules.
  • Version checks were added for MMseqs2 and MAFFT.
  • Large temporary files are now removed for the pan and supermatrix modules.
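
The kind of argument correction mentioned above (50 becoming 0.50 for a percentage) could look roughly like the sketch below. The function name correct_fraction and the exact ranges it accepts are assumptions made for the example; SCARAP's own rules may differ.

```python
def correct_fraction(value):
    """Hypothetical correction of a percentage given as 50 instead of 0.50."""
    if 0 <= value <= 1:
        # Already a valid fraction: leave it alone.
        return value
    if 1 < value <= 100:
        # Assumption: a value between 1 and 100 was meant as a percentage.
        return value / 100
    # Anything else cannot be corrected sensibly.
    raise ValueError(f"cannot interpret {value!r} as a fraction or percentage")
```

Values that cannot be interpreted either way, such as 150 or -5, are rejected in this sketch instead of being silently changed.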

Bug fixes:

  • Some rare but important bugs of the pan module were fixed:
    • Error in the cluster splitting process when all sequences of a cluster had the same representative
    • Error when the first characters of an amino acid sequence made it look like a DNA sequence
    • Occasional false negative split of high-copy families
  • Some other, small bugs

Progenomics v0.3.0

29 Oct 07:40

Progenomics now has its own builtin pangenome strategies!

  • You can still set the pangenome inference strategy with the -d command-line option.
  • The new default strategy, called "FH", is very fast in comparison to existing tools such as OrthoFinder or SonicParanoid. In addition, it scales more or less linearly with the number of input genomes.
  • A variant of the FH strategy, called "H", is not scalable but is even faster on relatively small datasets (about 60 prokaryotic genomes or fewer).
  • You can still use OrthoFinder for pangenome inference by setting the strategy to "O-B" (for OrthoFinder with BLAST) or "O-D" (for OrthoFinder with DIAMOND).

Progenomics v0.2.0

27 Dec 10:15

This is a major overhaul of the entire toolkit:

  • The interface has been rewritten, with simpler tasks and shorter and more intuitive task names
  • The following tasks are now available: pan, build, search, checkgenomes, checkgroups, filter and supermatrix
  • The following pipelines are now available: pan-pipeline and core-pipeline
  • The R dependencies have been removed
  • Dependencies are now checked before running
  • Logging has been improved
  • All tasks now assume unzipped fasta files

We're still not at major version 1 because task names can still change, as well as some of the functionality (for example, the way in which HMMER score cutoffs are trained).

Progenomics v0.1.0

22 Aug 12:03

This is the first public release of Progenomics! It can handle the following tasks:

  • construct_profile_db
  • prepare_candidate_scgs
  • select_scgs
  • select_genomes
  • construct_scg_matrix
  • construct_supermatrix
  • calculate_scnis
  • nucleotide_supermatrix_from_scg_matrix