MAP-SEA

Mapping And Prediction of Shared Elements in Alignments

Scripts for mapping short elements stored in .BED files (as intervals, e.g. G-quadruplexes) to .MAF (alignment) files.

Overview

This repository provides two Python scripts for mapping and analyzing short genomic elements (such as G-quadruplexes) in multi-species alignments. The tools are designed to work with the MAF (Multiple Alignment Format) file format and accept multiple BED files containing the sequence intervals to map. The first script maps these intervals to their corresponding locations in a MAF alignment, allowing users to observe where each element appears across different sequences in the alignment.

The second script refines the initial mapping by accounting for slight positional variations (flanks or “wobbles”) in similar elements across species. This script helps remove redundant entries, merge similar elements, and optimize the data for downstream analyses. Together, these tools support comparative genomics studies focused on the evolution and distribution of short genomic elements across species.

Features

Alignment Mapping: Map multiple BED intervals to positions within a single MAF file.
Customizable Parameters: Configure analysis settings to suit different species and overlaps.
Redundancy Filtering: Combine elements that meet defined similarity thresholds.
Mutation Analysis: Track mutations in shared elements across species.
Optimized Performance: Designed for large genomic datasets with support for multicore parallelization.

Usage

Command-line Usage

Below are example commands to demonstrate the main functionality of this repository:

Alignment Mapping:
Use case: mapsea.py [-h] -m MAFINPUT -b BEDFOLDER -o OUTFILE -t TEMPDIR -r INTERSECTRATIO -d SPECIESDICT [-f MAPFILE] [-c CORES]

options:
-h, --help  show this help message and exit
-m MAFINPUT, --mafInput MAFINPUT
			Input gzipped maf file path (e.g., 'demo_hs1.chr1.2023v2.processed.maf.gz')
-b BEDFOLDER, --bedFolder BEDFOLDER
			Folder name to access the bed files (without / at the end); the folder should have a tree structure, 
			with each species having their bed files under a directory separated by chromosomes (e.g., 'pqsfinderOutput')
-o OUTFILE, --outFile OUTFILE
			Output file path (e.g., 'hs1.chr1.2023v2.processed.analysed.quadron.dat')
-t TEMPDIR, --tempDir TEMPDIR
			Temporary directory path (e.g., 'tmp/hsa1')
-r INTERSECTRATIO, --intersectRatio INTERSECTRATIO
			Ratio of what fraction of elements overlap with alignment block (e.g., 1.0)
-d SPECIESDICT, --speciesDict SPECIESDICT
			Species dictionary file path, a json file only (e.g. 'speciesDict.json')
-f MAPFILE, --mapFile MAPFILE
			Homolog chromosome map file path
-c CORES, --cores CORES
			The number of cores available for this job (e.g. 10)

Merging Similar G4s:
Use case: refiner.py [-h] -d DATINPUT -f FLANK [-c CORES] [-o OUTFILE] [-m]

options:
-h, --help            show this help message and exit
-d DATINPUT, --datInput DATINPUT
		              Input dat file output by mapsea (e.g., 'chr22_bonobo_vs_chr22_borang.dat')
-f FLANK, --flank FLANK
                      The flank allowed for the element to be termed 'shared' (e.g. 3)
-c CORES, --cores CORES
                      The number of cores available for this job (e.g. 10)
-o OUTFILE, --outFile OUTFILE
                      Output file path (e.g., 'hs1.chr1.2023v2.processed.analysed.quadron.df')
-m, --mutInfo         Include mutation information in the output file

Example Use Cases

Alignment Mapping:

python3 mapsea.py -m ../test/maffiles/chr22_bonobo_vs_chr22_borang.maf.gz -b ../test/bedfiles -o ../output/chr22_bonobo_vs_chr22_borang.dat -r 1.0 -t ../test/tmp -d ../test/refer_dict.json
Merging Similar G4s: python3 refiner.py -d ../output/chr22_bonobo_vs_chr22_borang.dat -f 3 -o ../output/chr22_bonobo_vs_chr22_borang.rmredundant.df -m

Notes

Some things to keep in mind while running these scripts:

Script Dependency: Ensure that mapsea.py is run with the -f 1.0 flag if you intend to use refiner.py, as this is required for the latter's compatibility.
BED File Naming and Format: BED files must be named in the format chrN.bed (e.g., chr1.bed, chr2.bed). The BED files should have the following columns:

Column Description

col 1 chrom

col 2 start

col 3 end

col 4 score

col 5 length

col 6 strand
MAF File Entry Format: Each entry in the MAF file should follow the standard format speciesName.chrN (e.g., human.chr1).

Citation

If you use this tool in your research, please cite the following paper:

Mohanty, S. K., Chiaromonte, F., & Makova, K. (2024). "Evolutionary Dynamics of G-Quadruplexes in Human and Other Great Ape Telomere-to-Telomere Genomes. bioRxiv, 2024-11. doi: https://doi.org/10.1101/2024.11.05.621973

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
output		output
src		src
test		test
CITATION.cff		CITATION.cff
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MAP-SEA

Mapping And Prediction of Shared Elements in Alignments

Table of Contents

Overview

Features

Usage

Command-line Usage

Example Use Cases

Notes

Citation

About

Releases

Packages

Languages

Column	Description
col 1	chrom
col 2	start
col 3	end
col 4	score
col 5	length
col 6	strand

License

makovalab-psu/mapsea

Folders and files

Latest commit

History

Repository files navigation

MAP-SEA

Mapping And Prediction of Shared Elements in Alignments

Table of Contents

Overview

Features

Usage

Command-line Usage

Example Use Cases

Notes

Citation

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages