repeatgenome

Introduction

RepeatGenome uses known repeat sequences to assist in sequence alignment.

It is reasonably stable and future drastic changes are unlikely.

It implements a slightly modified version of Kraken, a metagenomics algorithm. For some constant k, Kraken associates each k-long subsequence ("k-mer") of a set of reference genomes with its least-common ancestor (LCA) in the taxonomy tree of the reference genomes used. For example, a certain k-mer may exist only in a given genus. The resulting library is then used to estimate which subtree of the repeat taxonomy tree each of a set of sequencing reads came from. This is done by querying the reads' k-mers against the library. For a full description, see the journal article.

RepeatGenome adapts this technique to estimate where in a single chosen species' genome a sequencing read originated. This is done using repeat sequences, which are prolific and of low complexity. Like taxonomic classifications, repeat sequence types are organized in a heirarchical tree. For example, the tree of repeat types may contain a subtree with the root node "Satellite", whose child nodes would be different types of Satellite DNA. These heirarchies are supplied by projects like Repbase. This allows us to associate k-mers with subtrees in a repeat classification tree like Kraken does with a taxonomy tree.

We use RepeatMasker to determine what parts of a reference genome are repeat sequences. Therefore, RepeatGenome's input comprises a RepeatMasker output file and the associated reference genome in FASTA format.

This is an admittedly brief description of the program logic. We may publish a more thorough coverage such as a white paper eventually. However, if you want to fully understand the code, it would be best to read the Kraken article, the apparently relevant parts of the RepeatMasker documentation, and the source code's function comments. The RepeatGenome source code is heavily commented, and the result of the godoc documentation generater will soon be posted publicly.

Integers are usually explicitly declared as having type int64 or uint64, but due to the size of the minimizer maps and of genomes in general, RepeatGenome probably won't run on 32-bit systems. You may be able to make it work for small m and k values and small reference genomes, but there are no guarantees.

Input

The requirements for file input are currently somewhat strict. If you find them limiting, let me know and I can quickly add more flexible options.

The main argument to run-rg is the genome name. This should be the abbreviation of the reference genome being used - for example, most of our tests were run on the most recent Drosophila melanogaster (fruitfly) genome, dm3.

In the below file and directory names, <genome-name> is used to denote the above-mentioned genome name.

There is expected to be a subdirectory of the current working directory bearing the genome name. This directory should contain a RepeatMasker output file with the .fa.out suffix.

A second directory <genome-name>-fasta should contain all reference sequence files in FASTA format.

Additionally, if any reads are to be processed, a subdirectory <genome-name>-reads of the current working directory is expected to contain all reads in SAM format, with the .sam suffix.

###Command Line Options

-force_gen: Force generation of the Kraken database. By default, the database will be read from file if it already exists.
-write_stats: Write various tab-delimited and JSON files representing peripheral Kraken and repeat data.
-no_write_lib: Do not write the database to file.
-verify_class: Run classification a second time, with SAM-formatted reads, to find the percent of reads that were classified correctly. SAM-formatted reads must be available.
-debug: Run and print various debugging tests checking sanity and data integrity.
-cpuprof: Write cpu profile to file <genomeName>.cpuprof using the runtime/pprof library.
-memprof: Write memory profile to <genomeName>.memprof using the runtime/pprof library.
-lca_classify: Use the LCA of all recognized k-mers' classes as a read's classification. The default is to use the class of the first recognized k-mer.

Output

Below are the directories and contents output by run-rg.

<genome-name>-lib/: Reusable library data associated with this genome.

<genome-name>.kraken: The Kraken library for this genome. The storage format is relatively simple using varints - refer to the source code of RepeatGenome.WriteKraken() and RepeatGenome.ReadKraken() for a full specification.

<genome-name>-stats/: Contains any statistics data written in the course of processing.

<genome-name>.classtree.json: The classification tree JSON data optionally written.

Because the sequence alignment API is not yet defined, read-classification data is not yet written to file. The current code is therefore primarily a research implementation. If there's a certain format in which you'd like it stored, let me know.

Name		Name	Last commit message	Last commit date
Latest commit History 125 Commits
bioutils		bioutils
repeatgenome		repeatgenome
LICENSE		LICENSE
NOTES.md		NOTES.md
README.md		README.md
TODO.md		TODO.md
run-rg.go		run-rg.go
v0.1-sample.stdout		v0.1-sample.stdout

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

repeatgenome

Introduction

Input

Output

About

Releases

Packages

Languages

License

mmcco/jh-bio

Folders and files

Latest commit

History

Repository files navigation

repeatgenome

Introduction

Input

Output

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages