This repository contains a Genetic Algorithm (GA) and Simulated Annealing (SA) based approach to DNA compression that leverages the Minimum Description Length (MDL) principle. The method compresses genomic sequences by identifying the k-mers (patterns) that yield the best compression. The project addresses the challenges posed by the exponential growth of genomic data and aims to make that data easier to store and manage.
- Genetic Algorithm Optimization: Utilizes a genetic algorithm to optimize k-mer selection, enhancing compression efficiency.
- Simulated Annealing Optimization: Uses simulated annealing to optimize k-mer selection, enhancing compression efficiency.
- Minimum Description Length Principle: Applies the MDL principle to identify the most compact representation of genomic sequences.
- Flexible Input: Supports various genomic datasets in text and FASTA format.
- Performance Benchmarking: Evaluates compression ratio and running time against state-of-the-art methods.
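
To make the interplay between MDL scoring and k-mer selection concrete, here is a minimal Java sketch. It is not the repository's implementation: the class `MdlKmerSketch`, its methods, the bit costs (8 bits per dictionary entry length, 2 bits per base, 1 flag bit per token), the greedy parse, and the cooling schedule are all assumptions chosen for illustration. It scores a k-mer dictionary with a two-part description length and uses simulated annealing to toggle k-mers in and out of the dictionary.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical illustration only; names and cost model are assumptions.
class MdlKmerSketch {

    // Two-part MDL cost in bits: L(model) + L(data | model).
    // Assumed scheme: each dictionary entry costs 8 bits (length field) plus
    // 2 bits per base; each token costs a 1-bit flag plus either 2 bits
    // (literal base) or ceil(log2(|dict|)) bits (dictionary reference).
    static long descriptionLength(String seq, Set<String> dict) {
        long modelBits = 0;
        int maxLen = 0;
        for (String kmer : dict) {
            modelBits += 8 + 2L * kmer.length();
            maxLen = Math.max(maxLen, kmer.length());
        }
        int refBits = dict.size() <= 1
                ? 0
                : 32 - Integer.numberOfLeadingZeros(dict.size() - 1);
        long dataBits = 0;
        int i = 0;
        while (i < seq.length()) {
            int matched = 0;
            // Greedy parse: prefer the longest dictionary k-mer at position i.
            for (int len = Math.min(maxLen, seq.length() - i); len >= 1; len--) {
                if (dict.contains(seq.substring(i, i + len))) {
                    matched = len;
                    break;
                }
            }
            if (matched > 0) {
                dataBits += 1 + refBits;
                i += matched;
            } else {
                dataBits += 1 + 2;
                i++;
            }
        }
        return modelBits + dataBits;
    }

    // All distinct substrings of length k, in order of first occurrence.
    static List<String> candidateKmers(String seq, int k) {
        Set<String> seen = new LinkedHashSet<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            seen.add(seq.substring(i, i + k));
        }
        return new ArrayList<>(seen);
    }

    // Simulated annealing over k-mer subsets; returns the best subset visited.
    static Set<String> anneal(String seq, List<String> candidates, int steps, long seed) {
        Set<String> current = new HashSet<>();
        Set<String> best = new HashSet<>();
        if (candidates.isEmpty()) {
            return best;
        }
        Random rng = new Random(seed);
        long currentCost = descriptionLength(seq, current);
        long bestCost = currentCost;
        double temperature = 10.0; // assumed starting temperature
        for (int s = 0; s < steps; s++) {
            String kmer = candidates.get(rng.nextInt(candidates.size()));
            Set<String> next = new HashSet<>(current);
            if (!next.remove(kmer)) {
                next.add(kmer); // toggle membership of one k-mer
            }
            long delta = descriptionLength(seq, next) - currentCost;
            // Always accept improvements; accept worse moves with prob. exp(-delta/T).
            if (delta <= 0 || rng.nextDouble() < Math.exp(-delta / temperature)) {
                current = next;
                currentCost += delta;
                if (currentCost < bestCost) {
                    best = new HashSet<>(current);
                    bestCost = currentCost;
                }
            }
            temperature = Math.max(0.01, temperature * 0.99); // geometric cooling
        }
        return best;
    }

    public static void main(String[] args) {
        String seq = "ACGTACGTACGTACGTTTGACGTACGTACGT";
        Set<String> chosen = anneal(seq, candidateKmers(seq, 4), 2000, 42L);
        System.out.println("baseline bits: " + descriptionLength(seq, new HashSet<>()));
        System.out.println("selected k-mers: " + chosen);
        System.out.println("MDL bits: " + descriptionLength(seq, chosen));
    }
}
```

Because the search starts from the empty dictionary and keeps the best subset it has visited, the returned dictionary never scores worse than encoding every base as a literal. A genetic algorithm could reuse the same `descriptionLength` function as its fitness measure, evolving a population of k-mer subsets instead of perturbing a single one.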
- Java Development Kit (JDK) 8 or higher
- Maven (for building the project)
git clone https://github.com/MuhammadzohaibNawaz/HMG.git
cd HMG
mvn clean install
To run the compression algorithm, modify the DNAClassification class according to your input files and parameters.
- Set Dataset: Modify the DS variable to select your dataset.
- Adjust Parameters: Change parameters such as generations and topSubsequences (which is given as input when running the code) to suit your requirements.
- Run the Program: Execute the main method to start the compression process.
java -cp target/HMG-1.0-SNAPSHOT.jar dna.HMG
Contributions are welcome! Please open an issue or submit a pull request for any enhancements or bug fixes.
The development of this compression algorithm was inspired by the need for efficient genomic data storage.