- Fix a bug that causes one-column rows to be reduced to scalars (e07636d). This bug occurs with pure isolates when all reads can be unambiguously assigned to a single species (#5).
- Add `-g` to control the minimum number of unique marker genes (default: 1) required for a species to report its genome copies. Increasing `-g` (1 -> 2) lowers recall (detection limit: 0.125 -> 0.25) but improves precision.
- Fix a bug introduced in v0.2.2 that caused ties to be resolved improperly. Results should be identical to v0.2.1.
- Avoid direct float comparison (315b795).
- Revert filtering criteria to those of v0.1.6. The old criteria turn out to help rescue certain chimeric reads in very rare settings.
- Reduce peak memory usage.
- [breaking] Add decoy protein sequences (RefSeq fungi, protozoa, viral, plant, and human GRCh38/hg38) which effectively trap non-prokaryotic reads and prevent them from inflating total prokaryotic genome copy estimates if the pre-filtering module (default with Kraken2) is not enabled. Pre-filtering is no longer necessary even if samples are contaminated with human DNA or other common eukaryotes/viruses, unless the mean genome size of prokaryotes needs to be estimated. See 8918168 for more details. This function requires a database released on or after 2024-06-28.
- Simplify filtering criteria for alignments.
- Prevent `extract_sequence` from loading all marker-containing reads into memory.
- Change `-F` to `--frameshift` and `max_iteration` to `max_iterations` for consistency.
- Switch from figshare to Zenodo for better database versioning.
- Fix a bug that caused `tqdm` to be disabled (3bbd087).
- Use `tqdm` for logging.
- Reduce peak memory usage by parsing PAF files on the fly.
- Output both gap-compressed and gap-uncompressed (BLAST-like) identity.
- Refine output format.
- Change alignment filtering criteria: make `AS` cutoff more stringent, drop `MS`. See 7cc6dbd for details.
- Fix a bug that caused total genome copies to be calculated incorrectly with diamond>=2.1.9.
- Add gap-compressed ANI to output.
- Add options to control EM early stop.
- Use `scipy.sparse` to reduce peak memory usage and computational time.
- Change default terminal condition of EM (`max_iteration`: 100 -> 1000; `epsilon`: 1e-5 -> 1e-10).
- Fix a bug that prevented chimeric reads from being aggregated.
- Output a JSON file to indicate the lineage of processed reads.
- Make databases indexed by default.
- Use only Kraken's predictions to remove non-prokaryotic reads.
- Change `--db_kraken` to `--db-kraken` for consistency.
- Change `sp[0-9]+` to `sp[0-9]+` for consistency.
- Change `copies` to `copy` in output files for consistency.
- Prevent numpy from using all logical cores.
- First release.
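
The entry "Avoid direct float comparison" can be illustrated with a minimal sketch. The helper name `is_tie` and the tolerance values are assumptions chosen for illustration; the actual change is in commit 315b795 and may differ.

```python
import math


def is_tie(a: float, b: float, rel_tol: float = 1e-9, abs_tol: float = 1e-12) -> bool:
    """Treat two scores as tied when they agree within a tolerance.

    Hypothetical helper: the name and the tolerances are illustrative only.
    """
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)


x = 0.1 + 0.2
print(x == 0.3)        # direct comparison fails because of rounding error
print(is_tie(x, 0.3))  # tolerance-based comparison treats them as equal
```

Comparing with a tolerance keeps tie detection stable when two scores that are mathematically equal are computed along different code paths.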
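
For the entry "Reduce peak memory usage by parsing PAF files on the fly", the sketch below shows one way to stream a PAF file instead of loading it whole. The names `PafRecord` and `iter_paf` and the best-hit rule are assumptions for illustration, not the tool's code; only the 12 mandatory PAF columns are read.

```python
import io
from typing import Dict, Iterator, NamedTuple, TextIO


class PafRecord(NamedTuple):
    """The 12 mandatory PAF columns (optional tags are ignored here)."""
    qname: str
    qlen: int
    qstart: int
    qend: int
    strand: str
    tname: str
    tlen: int
    tstart: int
    tend: int
    nmatch: int
    alen: int
    mapq: int


def iter_paf(handle: TextIO) -> Iterator[PafRecord]:
    """Yield one record per line so only the current line is held in memory."""
    for line in handle:
        if not line.strip():
            continue
        f = line.rstrip("\n").split("\t")
        yield PafRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4], f[5],
                        int(f[6]), int(f[7]), int(f[8]), int(f[9]),
                        int(f[10]), int(f[11]))


# Toy input; a real run would pass an open file handle instead.
example = (
    "read1\t1000\t10\t990\t+\tgeneA\t1200\t5\t985\t950\t980\t60\n"
    "read1\t1000\t10\t500\t+\tgeneB\t1100\t0\t490\t400\t490\t20\n"
    "read2\t800\t0\t800\t-\tgeneA\t1200\t100\t900\t780\t800\t60\n"
)

# Keep only the best hit per read (most matching residues), an assumed rule.
best: Dict[str, PafRecord] = {}
for rec in iter_paf(io.StringIO(example)):
    if rec.qname not in best or rec.nmatch > best[rec.qname].nmatch:
        best[rec.qname] = rec
```

With this pattern, memory grows with the number of reads retained rather than with the number of alignments in the file.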
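
The entry "Output both gap-compressed and gap-uncompressed (BLAST-like) identity" refers to two common definitions that differ only in how gaps are counted. The sketch below assumes an extended CIGAR string with `=`, `X`, `I`, and `D` operators; the function name `alignment_identities` is hypothetical, and the tool may derive these values from other alignment fields.

```python
import re
from typing import Tuple


def alignment_identities(cigar: str) -> Tuple[float, float]:
    """Return (blast_like, gap_compressed) identity from an extended CIGAR.

    BLAST-like: every gap base counts as a difference.
    Gap-compressed: each run of gap bases counts as a single difference.
    """
    matches = mismatches = gap_bases = gap_opens = 0
    for length, op in re.findall(r"(\d+)([=XID])", cigar):
        n = int(length)
        if op == "=":
            matches += n
        elif op == "X":
            mismatches += n
        else:  # "I" or "D": one gap opening spanning n bases
            gap_bases += n
            gap_opens += 1
    blast_like = matches / (matches + mismatches + gap_bases)
    gap_compressed = matches / (matches + mismatches + gap_opens)
    return blast_like, gap_compressed


# 99 matches, 1 mismatch, and one 10-base deletion.
blast_like, gap_compressed = alignment_identities("50=1X30=10D19=")
print(round(blast_like, 4), round(gap_compressed, 4))
```

In this example the BLAST-like identity is 99/110 and the gap-compressed identity is 99/101, so a single long gap lowers the former much more than the latter.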
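
The entries on EM early stopping mention two controls, `max_iterations` and `epsilon`. The sketch below shows how such a pair typically terminates an EM loop for mixture proportions. It is a generic illustration under stated assumptions, not the tool's implementation: the function name `em_abundance`, the read-by-species likelihood matrix, and the convergence measure (largest absolute change in any proportion) are all assumed.

```python
from typing import List, Tuple


def em_abundance(likelihoods: List[List[float]],
                 max_iterations: int = 1000,
                 epsilon: float = 1e-10) -> Tuple[List[float], int]:
    """Estimate species proportions from a read-by-species likelihood matrix.

    Assumes every read (row) has at least one positive likelihood.
    Stops when no proportion changes by more than `epsilon`, or after
    `max_iterations` iterations, whichever comes first.
    """
    n_reads = len(likelihoods)
    n_species = len(likelihoods[0])
    weights = [1.0 / n_species] * n_species
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        # E-step: split each read among species in proportion to weight * likelihood.
        totals = [0.0] * n_species
        for row in likelihoods:
            joint = [w * l for w, l in zip(weights, row)]
            norm = sum(joint)
            for k, value in enumerate(joint):
                totals[k] += value / norm
        # M-step: new proportions are the average responsibilities.
        new_weights = [t / n_reads for t in totals]
        delta = max(abs(n - o) for n, o in zip(new_weights, weights))
        weights = new_weights
        if delta < epsilon:
            break
    return weights, iteration


# Two reads unique to species A, one unique to B, one shared by both.
reads = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
weights, n_iter = em_abundance(reads)
```

For this toy input the proportions converge to 2/3 and 1/3. A smaller `epsilon` requires more iterations before stopping, which is why a tighter tolerance is usually paired with a higher iteration cap.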