dbgAssembler is a simple genome/sequence assembler that uses a de Bruijn graph based approach. It takes a one-line sequence FASTA file as input, generates the k-mers of the sequence, and reassembles the sequence by finding an Eulerian path in the de Bruijn graph.
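The approach described above can be sketched in a few lines of Python. This is a minimal illustration of the general technique, not the actual dbgAssembler code: the function names (build_kmers, build_graph, eulerian_path, assemble) are hypothetical, the path search uses Hierholzer's algorithm as one common choice, and the sketch assumes that an Eulerian path exists in the graph.

```python
from collections import defaultdict


def build_kmers(sequence, k):
    """Return every overlapping k-mer of the sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_graph(kmers):
    """Build a de Bruijn graph: each k-mer is an edge from its
    (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Find an Eulerian path with Hierholzer's algorithm.
    Assumption: the graph actually contains such a path."""
    in_degree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            in_degree[node] += 1

    # Start at the node with one more outgoing than incoming edge,
    # or at an arbitrary node if the graph is balanced.
    start = next(iter(graph))
    for node, successors in graph.items():
        if len(successors) - in_degree[node] == 1:
            start = node
            break

    # Work on a copy so the caller's graph is left untouched.
    edges = {node: list(successors) for node, successors in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if edges.get(node):
            stack.append(edges[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is finished
    return path[::-1]


def assemble(sequence, k):
    """Break the sequence into k-mers and reassemble it from the graph."""
    path = eulerian_path(build_graph(build_kmers(sequence, k)))
    # The first node contributes k-1 bases, every later node one base.
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    print(assemble("ATGGCGTGCA", 4))
```

When every (k-1)-mer in the input is unique, as in the toy sequence used here, only one Eulerian path exists and the original sequence is recovered. With repeated (k-1)-mers, several paths can be valid, and any of them uses the same set of k-mers.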
- Python v3.9.10
Test data for evaluating the assembler
- Human coronavirus - RefSeq Assembly Accession: GCF_009858895.2
- Human papillomavirus - RefSeq Assembly Accession: GCF_001274345.1
- Human adenovirus - NCBI Reference Sequence: NC_012959.1
- Human gammaherpesvirus 4 - NCBI Reference Sequence: NC_007605.1
The script can be installed with the following commands:
git clone https://github.com/Nilsson-D/dbgAssembler.git
cd dbgAssembler
python setup.py install
After installation, the dbgAssembler script is located in the Assembler folder. Its help message can be displayed with:
Assembler/dbgAssembler.py -h
usage: dbgAssembler.py -i <input_file> -k <kmer_size> [optional] -o <output_file> [optional]
Type -h/--help for the help message
This program takes an one-line fasta file (DNA) as input and breaks the
sequence into kmers of size k. Then reassembles the string using a de Bruijn
graph based approach
optional arguments:
  -h, --help         show this help message and exit
  -i <input file>    path to fasta file
  -k <kmer size>     kmer size (default: 31, max: 251)
  -o <output file>   name of output file, default:
                     dbgAssembler_run{current_date}.fna
  -d <directory>     name of output directory to create, default:
                     dbgAssembler_{current_date}
  -n <y/n>           if y, allow Ns in sequence
Running with default parameters
dbgAssembler.py -i <input_file>
Running with k-mer size 15
dbgAssembler.py -i <input_file> -k 15
A test run can be performed with the script test_run_viruses in the test folder. Keep in mind that the test script runs the assembler with several k-mer sizes, and that each run for gammaherpesvirus 4 takes about 12 minutes (see the example evaluation below).
test/test_run_viruses
To test the script with a single k-mer size, run the normal command on one of the test files in the test_data folder:
dbgAssembler.py -i test_data/<input.fna> -k 15
The output is a directory containing two files:
- a fasta file for the assembly
- a log file with information about the input file and the k-mer size
Here are some examples of how accurate the assemblies are. The assemblies were aligned against the reference genomes using dnadiff from MUMmer (v3.23). The assembler has problems with small k-mer sizes: even though a complete alignment is obtained, the assembly had to be broken up into contigs, which is not optimal (Fig 1). A larger k-mer size is needed for larger genomes. This is also clear when evaluating the variants between the assembly and the reference (Fig 2). However, dbgAssembler did not manage to completely assemble the gammaherpesvirus. Lastly, the execution time increases quickly with sequence length (Fig 3), so larger genomes will be time-consuming.
Fig 1. The percentage of bases aligned to the reference genome and the number of contigs needed to align the assembly to the reference genome, as evaluated by MUMmer's dnadiff for different k-mer sizes.
Fig 2. The number of variants detected when aligning the assembly to the reference genome, as evaluated by MUMmer's dnadiff for different k-mer sizes.
Fig 3. The execution time for each genome run using different k-mer sizes.
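The sensitivity to k-mer size seen in the evaluation is typical of de Bruijn graph assembly in general: the shorter the k-mers, the more likely it is that the same (k-1)-mer occurs at several positions in the genome, which creates branching nodes and more than one valid Eulerian path. Whether this fully explains the results above is an assumption, not something the evaluation measured. The sketch below uses a hypothetical helper, repeated_nodes, and a toy sequence to show how the number of repeated (k-1)-mers drops as k grows.

```python
from collections import Counter


def repeated_nodes(sequence, k):
    """Count the distinct (k-1)-mers that occur more than once.
    Each of these becomes an ambiguous node in the de Bruijn graph."""
    size = k - 1
    counts = Counter(
        sequence[i:i + size] for i in range(len(sequence) - size + 1)
    )
    return sum(1 for count in counts.values() if count > 1)


if __name__ == "__main__":
    toy = "ACGTACGA"  # toy sequence, not taken from the test data
    for k in (3, 4, 5):
        print(k, repeated_nodes(toy, k))
```

Running the same helper on one of the FASTA sequences in test_data for a range of k values would give a rough idea of which k-mer sizes leave the graph free of repeated nodes.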
Daniel Nilsson