An adaptation of this Google Colab Notebook for processing Hi-C data files, running the LAMMPS Molecular Dynamics simulation and creating structure data. It has been rewritten to work as a command-line script and python module.
- Set up your Python virtual environment
virtualenv venv/
source venv/bin/activate
pip3 install -r requirements.txt
-
Install LAMMPS. You may be able to install it with your distro's package manager, or you can download and build it from source.
-
Install the module
cd src/
python3 setup.py install
python3 -m hic2structure --verbose HIC_FILE
This will load contact records from the given Hi-C file and run a LAMMPS simulation on in it in a temporary directory, creating an output structure data file and a copy of the LAMMPS log in the directory ./out
.
You can change many settings (which resolution to load from the Hi-C file, which chromosome, how many timesteps to run the simulation, etc.) with command-line arguments. For example:
python3 -m hic2structure --verbose --resolution 200000 --threshold 2.5 --chromosome 22 HIC_FILE
You can run python3 -m hic2structure --help
to see all the available options and and their default values.
If you need finer control over things, you can import hic2structure
into a script. Most functions exported by the module and its submodules revolve around a "settings" dictionary with the same sort of parameters as above.
For example, the below script will get a contact map from a Hi-C file, output it, then run LAMMPS and output the new, simulated, contacts.
#!/usr/bin/env python3
from gettext import find
from pathlib import Path
from hic2structure.types import Settings
from hic2structure.hic import HIC
from hic2structure.lammps import run_lammps
from hic2structure.out import write_contact_set, contact_records_to_set
from hic2structure.contacts import find_contacts
settings: Settings = {
'resolution': 200000,
'threshold': 2.5,
'chromosome': '22',
'bond_coeff': 50,
'timesteps': 100000
}
hic = HIC( Path('my_input.hic') )
real_contacts = hic.get_contact_records(settings)
write_contacts(Path('real_contactmap.tsv'), real_contacts)
lammps_data = run_lammps( contact_records_to_set(real_contacts), settings)
# the run_lammps result is indexed by timestep
last_timestep = lammps_data[100000]
simulated_contacts = find_contacts(last_timestep, settings)
write_contact_set(Path('simulated_contactmap.tsv'), simulated_contacts)
When contact records are read from the Hi-C file, all the records with a value below the threshold (set with the --count-threshold
argument) are excluded and then the values for the remaining contacts are discarded. Effectively, this makes a "binary" contact map where each pair of coordinates either contacts or doesn't, with no values inbetween. In this case, only the coordinates with a count greater than the threshold are included.
These records are used as input to the LAMMPS simulation which then comes up with a 3D structure. The coordinates in 3D space for each bead is what's output in the structure.csv
file in the output. When using the hic2structure
module, calling the find_contacts
function on a timestep of the LAMMPS results will return a similarly "binary" contact map, consisting of all the pairs of beads whose distance from eachother (in 3D space) is less than the provided threshold (set the distance_threshold
field in the settings dict).