RepeatGenome uses known repeat sequences to assist in sequence alignment.
It is reasonably stable and future drastic changes are unlikely.
It implements a slightly modified version of Kraken, a metagenomics algorithm. For some constant k, Kraken associates each k-long subsequence ("k-mer") of a set of reference genomes with its least-common ancestor (LCA) in the taxonomy tree of the reference genomes used. For example, a certain k-mer may exist only in a given genus. The resulting library is then used to estimate which subtree of the repeat taxonomy tree each of a set of sequencing reads came from. This is done by querying the reads' k-mers against the library. For a full description, see the journal article.
RepeatGenome adapts this technique to estimate where in a single chosen species' genome a sequencing read originated. This is done using repeat sequences, which are prolific and of low complexity. Like taxonomic classifications, repeat sequence types are organized in a heirarchical tree. For example, the tree of repeat types may contain a subtree with the root node "Satellite", whose child nodes would be different types of Satellite DNA. These heirarchies are supplied by projects like Repbase. This allows us to associate k-mers with subtrees in a repeat classification tree like Kraken does with a taxonomy tree.
We use RepeatMasker to determine what parts of a reference genome are repeat sequences. Therefore, RepeatGenome's input comprises a RepeatMasker output file and the associated reference genome in FASTA format.
This is an admittedly brief description of the program logic. We may publish a more thorough coverage such as a white paper eventually. However, if you want to fully understand the code, it would be best to read the Kraken article, the apparently relevant parts of the RepeatMasker documentation, and the source code's function comments. The RepeatGenome source code is heavily commented, and the result of the godoc documentation generater will soon be posted publicly.
Integers are usually explicitly declared as having type int64 or uint64, but due to the size of the minimizer maps and of genomes in general, RepeatGenome probably won't run on 32-bit systems. You may be able to make it work for small m and k values and small reference genomes, but there are no guarantees.
The requirements for file input are currently somewhat strict. If you find them limiting, let me know and I can quickly add more flexible options.
The main argument to run-rg
is the genome name. This should be the
abbreviation of the reference genome being used - for example, most of
our tests were run on the most recent Drosophila melanogaster
(fruitfly) genome, dm3.
In the below file and directory names, <genome-name>
is used to
denote the above-mentioned genome name.
There is expected to be a subdirectory of the current working
directory bearing the genome name. This directory should contain a
RepeatMasker output file with the .fa.out
suffix.
A second directory <genome-name>-fasta
should contain all reference
sequence files in FASTA format.
Additionally, if any reads are to be processed, a subdirectory
<genome-name>-reads
of the current working directory is expected to
contain all reads in SAM format, with the .sam
suffix.
###Command Line Options
-force_gen
: Force generation of the Kraken database. By default, the database will be read from file if it already exists.-write_stats
: Write various tab-delimited and JSON files representing peripheral Kraken and repeat data.-no_write_lib
: Do not write the database to file.-verify_class
: Run classification a second time, with SAM-formatted reads, to find the percent of reads that were classified correctly. SAM-formatted reads must be available.-debug
: Run and print various debugging tests checking sanity and data integrity.-cpuprof
: Write cpu profile to file<genomeName>.cpuprof
using the runtime/pprof library.-memprof
: Write memory profile to<genomeName>.memprof
using the runtime/pprof library.-lca_classify
: Use the LCA of all recognized k-mers' classes as a read's classification. The default is to use the class of the first recognized k-mer.
Below are the directories and contents output by run-rg
.
<genome-name>-lib/
: Reusable library data associated with this genome.
<genome-name>.kraken
: The Kraken library for this genome. The storage format is relatively simple using varints - refer to the source code of RepeatGenome.WriteKraken() and RepeatGenome.ReadKraken() for a full specification.
<genome-name>-stats/
: Contains any statistics data written in the course of processing.
<genome-name>.classtree.json
: The classification tree JSON data optionally written.
Because the sequence alignment API is not yet defined, read-classification data is not yet written to file. The current code is therefore primarily a research implementation. If there's a certain format in which you'd like it stored, let me know.