This project analyzes X chromosome data to study gene configurations associated with color blindness. The goal is to align sequencing reads to the reference sequence, map them to the red and green gene exons, and determine the most likely configuration responsible for color blindness.
The assignment involves:
- Aligning 3 million reads to the X chromosome reference sequence using the Burrows-Wheeler Transform (BWT) with up to two mismatches.
- Counting reads mapping to exons of the red and green genes:
  - Unambiguous mapping: Count as 1.
  - Ambiguous mapping: Count as 0.5 for each gene.
- Calculating probabilities for different gene configurations and identifying the most probable configuration responsible for color blindness.
The following files are used in the analysis:
- chrX.fa: The X chromosome's reference sequence in FASTA format.
- reads: Contains sequencing reads for alignment to the reference sequence.
- chrX_map.txt: Provides exon positions for the red and green genes.
- chrX_last_col.txt: The last column of the Burrows-Wheeler Transform (BWT) of the X chromosome reference sequence.
- Reference Sequence: The reference sequence is processed from the FASTA file, concatenating all sequence lines into a single continuous string.
- Reads: Each sequencing read is preprocessed by replacing every unknown base 'N' with 'A', so that reads contain only A, C, G, and T during alignment.
- Exon Locations: Exon positions for the red and green genes are loaded and organized for mapping purposes.
- BWT Data: The BWT last column is read along with rank and count information for efficient read alignment.
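The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `load_fasta_sequence`, `clean_read`, and `build_bwt_tables` are hypothetical, and the full rank table shown here is an assumption chosen for readability. For a whole chromosome, a real implementation would more likely store ranks only at sampled checkpoints to save memory.

```python
from collections import Counter


def load_fasta_sequence(lines):
    """Concatenate all sequence lines of a FASTA file, skipping '>' headers."""
    return "".join(line.strip() for line in lines if not line.startswith(">"))


def clean_read(read):
    """Replace every unknown base 'N' with 'A', as described above."""
    return read.strip().replace("N", "A")


def build_bwt_tables(last_col):
    """Build count and rank tables from a BWT last column.

    first_occurrence[c] is the index of the first row starting with c
    in the sorted first column. ranks[c][i] is the number of times c
    appears in last_col[:i].
    """
    counts = Counter(last_col)
    first_occurrence = {}
    total = 0
    for symbol in sorted(counts):
        first_occurrence[symbol] = total
        total += counts[symbol]

    # Full rank table (assumption: fine for small inputs, too large for chrX).
    ranks = {symbol: [0] * (len(last_col) + 1) for symbol in counts}
    for i, ch in enumerate(last_col):
        for symbol, table in ranks.items():
            table[i + 1] = table[i] + (1 if symbol == ch else 0)
    return first_occurrence, ranks
```

With real files, `load_fasta_sequence` would be called on an open file handle (for example `open("chrX.fa")`), since iterating a file yields its lines.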
Reads are aligned to the reference sequence using a sliding window approach, allowing up to two mismatches per alignment. The Burrows-Wheeler Transform data is utilized to optimize this process.
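One common way to allow up to two mismatches is seed-and-verify: split the read into three pieces, note that at least one piece must match exactly if the whole read has at most two mismatches, then check each candidate position by counting mismatches. The sketch below illustrates that idea under stated assumptions. The exact-match lookup uses plain `str.find` as a stand-in for the BWT search, and the function names are hypothetical rather than taken from the project.

```python
def hamming_within(a, b, limit):
    """Return True if equal-length strings a and b differ in at most `limit` positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > limit:
                return False
    return True


def exact_occurrences(text, pattern):
    """Yield every start index of pattern in text (stand-in for a BWT lookup)."""
    start = text.find(pattern)
    while start != -1:
        yield start
        start = text.find(pattern, start + 1)


def find_with_mismatches(reference, read, max_mismatches=2):
    """Return sorted start positions where read aligns with <= max_mismatches."""
    pieces = max_mismatches + 1
    size = len(read) // pieces
    if size == 0:
        return []  # read too short to split into seeds
    hits = set()
    for k in range(pieces):
        offset = k * size
        # The last piece absorbs any leftover bases.
        end = offset + size if k < pieces - 1 else len(read)
        seed = read[offset:end]
        for pos in exact_occurrences(reference, seed):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            if hamming_within(window, read, max_mismatches):
                hits.add(start)
    return sorted(hits)
```

Swapping `exact_occurrences` for a BWT backward search would keep the rest of the logic unchanged, which is the main reason for separating the seed lookup from the verification step.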
Aligned reads are mapped to the exons of the red and green genes. Counts are calculated:
- 1 for unambiguous mapping: Read maps uniquely to one gene.
- 0.5 for ambiguous mapping: Read maps to both genes.
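The counting rule above could be implemented along these lines. This is a sketch under assumptions: exon coordinates are taken to be inclusive `(start, end)` pairs, a read counts as mapping to a gene if it overlaps any of that gene's exons, and the names `genes_hit` and `add_read_to_counts` are hypothetical. The coordinates in the usage note are placeholders, not values from `chrX_map.txt`.

```python
def genes_hit(positions, read_length, exons):
    """Return the set of genes whose exons overlap any aligned position.

    positions: start positions where the read aligned.
    exons: dict mapping gene name to a list of inclusive (start, end) pairs.
    """
    hit = set()
    for gene, intervals in exons.items():
        for start, end in intervals:
            overlaps = any(
                pos <= end and pos + read_length - 1 >= start
                for pos in positions
            )
            if overlaps:
                hit.add(gene)
                break
    return hit


def add_read_to_counts(counts, hit):
    """Add 1 for a read hitting one gene, 0.5 per gene for a read hitting both."""
    if len(hit) == 1:
        counts[next(iter(hit))] += 1.0
    elif len(hit) == 2:
        for gene in hit:
            counts[gene] += 0.5
    return counts
```

For example, with placeholder exons `{"red": [(100, 199)], "green": [(500, 599)]}`, a read aligning only inside the red interval adds 1 to `counts["red"]`, while a read aligning inside both intervals adds 0.5 to each gene.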
For each gene configuration, the probabilities are calculated based on the total counts of reads mapping to the red and green genes.
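The exact probability model is not spelled out here, so the following is only one plausible sketch. It assumes each configuration predicts an expected fraction of reads coming from the red gene, scores the observed counts with a binomial-style log-likelihood (which also works for half-integer counts), and normalizes across configurations. The function name, the configuration names, and the fractions in the usage note are all hypothetical placeholders.

```python
import math


def configuration_probabilities(red_count, green_count, red_fraction_by_config):
    """Return a normalized probability for each configuration.

    red_fraction_by_config maps a configuration name to the assumed
    expected fraction of reads from the red gene, strictly between 0 and 1.
    """
    log_likelihood = {
        name: red_count * math.log(p) + green_count * math.log(1.0 - p)
        for name, p in red_fraction_by_config.items()
    }
    # Subtract the maximum before exponentiating to avoid underflow.
    peak = max(log_likelihood.values())
    weights = {name: math.exp(v - peak) for name, v in log_likelihood.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}
```

Calling `configuration_probabilities(50, 50, {"ConfigA": 0.5, "ConfigB": 0.25})` would favor `ConfigA`, because an even split of reads is better explained by an expected red fraction of 0.5 than by 0.25.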
The analysis outputs:
- Total counts of reads mapping to the red and green genes.
- Probabilities for each gene configuration.
- The most likely configuration responsible for color blindness, identified as Config3 in this study.
The main functions in the code are:
- process_reference_file: Loads and processes the X chromosome reference sequence.
- process_reads_file: Preprocesses sequencing reads to handle missing nucleotides.
- process_exon_locations: Loads and extracts exon locations for the red and green genes.
- read_bwt_last_column: Reads the BWT last column for efficient alignment.
- find_matches_with_mismatches: Identifies potential alignment positions allowing up to two mismatches.
- count_gene_mappings: Counts reads mapping to the red and green gene exons.
- calculate_probabilities: Computes configuration probabilities based on gene counts.
- align_reads: Manages alignment and mapping for all reads.
Of the configurations considered, Config3 has the highest probability given the observed red and green read counts, which makes it the most likely cause of color blindness in this data set.
If you have any questions or suggestions, please feel free to reach out to me at nvarjunmani07@gmail.com.