A4Bio/MeToken

MeToken: Uniform Micro-Environment Token Boosts Post-Translational Modification Prediction

This repository contains the open-source implementation of the paper "MeToken: Uniform Micro-Environment Token Boosts Post-Translational Modification Prediction." The MeToken model leverages both sequence and structural information to accurately predict post-translational modification (PTM) types at specific sites on proteins. By tokenizing the micro-environment of each amino acid, MeToken captures the complex factors influencing PTMs, addressing limitations of sequence-only models and improving prediction performance, especially for rare PTM types.

Introduction

Post-translational modifications (PTMs) are crucial for regulating protein function and interactions. Accurately predicting PTM sites and their types helps researchers understand biological processes and disease mechanisms. Traditional computational approaches focus mainly on sequence motifs for PTM prediction and often neglect the role of protein structure.

MeToken addresses these limitations by integrating both sequence and structural information into unified tokens that represent the micro-environment of each amino acid. The model leverages a large-scale sequence-structure PTM dataset and uses uniform sub-codebooks to handle the long-tail distribution of PTM types, ensuring robust performance even for rare PTMs.
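
To make the idea of a micro-environment token more concrete, here is a minimal sketch in plain Python. It is not the MeToken implementation: the function names `micro_environment` and `nearest_code`, the use of C-alpha coordinates, the k-nearest-neighbour rule, and the Euclidean nearest-code lookup are all assumptions chosen for illustration. It only shows the general pattern of gathering a residue's spatial neighbours and mapping an embedding to the closest entry of a codebook.

```python
import math


def micro_environment(ca_coords, center, k=3):
    """Return indices of the k residues closest to `center` in 3D space.

    ca_coords: list of (x, y, z) C-alpha coordinates, one per residue.
    Assumption for illustration: neighbours are picked by Euclidean distance.
    """
    dists = [
        (math.dist(ca_coords[center], xyz), j)
        for j, xyz in enumerate(ca_coords)
        if j != center
    ]
    return [j for _, j in sorted(dists)[:k]]


def nearest_code(embedding, codebook):
    """Return the index of the codebook vector closest to `embedding`.

    This is a generic vector-quantization lookup, not the paper's
    uniform sub-codebook scheme.
    """
    return min(range(len(codebook)), key=lambda c: math.dist(embedding, codebook[c]))


# Toy data: four residues on a line, and a two-entry codebook.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
neighbours = micro_environment(coords, center=0, k=2)
token = nearest_code((0.9, 0.1), [(0.0, 0.0), (1.0, 0.0)])
```

In a real model the embedding passed to the lookup would be learned from both the residue types and the geometry of the neighbourhood; the toy vectors here only stand in for that step.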

Features

  • 🚀 Integration of Sequence and Structure: MeToken tokenizes the local micro-environment of amino acids, combining sequence motifs and 3D structural information.
  • ⚡ Support for Multiple PTM Types: The model is designed to predict a wide range of PTM types, including rare modifications.

Installation

  1. Clone the repository:

    git clone https://github.com/A4Bio/MeToken.git
    cd MeToken
  2. Install dependencies:

    conda env create -f environment.yml
    conda activate metoken
  3. Download the pretrained model:

    We provide a pretrained model for MeToken. Download it here and place it in the pretrained_models directory.

Usage

Inference

To perform PTM prediction on a single PDB file, follow these steps:

  1. Run the inference script:

    python inference.py --pdb_file_path examples/Q16613.pdb --predict_indices 31 79 114

    • --pdb_file_path: Path to the input PDB file (e.g., examples/Q16613.pdb).
    • --predict_indices: A list of residue indices for which PTM prediction should be made.
  2. Optional arguments:

    • --checkpoint_path: Specify the path to the model checkpoint (default is pretrained_model/checkpoint.ckpt).
    • --output_json_path: Path to save prediction results in JSON format (default is output/predict.json).
    • --output_hdf5_path: Path to save prediction results in HDF5 format (default is output/predict.hdf5).
  3. Example output: The script will print predictions for the specified positions:

    PTM type at position 31 is phosphorylation.
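
If you want to call the script from Python, for example to loop over several structures, a thin wrapper along these lines may help. This is a sketch, not part of the repository: `build_inference_command` and `run_inference` are hypothetical helper names, and only the flags documented above are used. The layout of the JSON file is not described here, so the wrapper returns whatever `json.load` yields without assuming a schema.

```python
import json
import subprocess
import sys
from pathlib import Path


def build_inference_command(pdb_file_path, predict_indices,
                            checkpoint_path=None, output_json_path=None):
    """Assemble the argument list for inference.py from the documented flags."""
    cmd = [
        sys.executable, "inference.py",
        "--pdb_file_path", str(pdb_file_path),
        "--predict_indices", *[str(i) for i in predict_indices],
    ]
    # Optional flags are only added when given, so the script's defaults apply.
    if checkpoint_path is not None:
        cmd += ["--checkpoint_path", str(checkpoint_path)]
    if output_json_path is not None:
        cmd += ["--output_json_path", str(output_json_path)]
    return cmd


def run_inference(pdb_file_path, predict_indices,
                  output_json_path="output/predict.json", repo_root="."):
    """Run inference.py from the repository root and load its JSON output.

    Hypothetical helper: the JSON content is returned as-is because its
    structure is not documented in this README.
    """
    cmd = build_inference_command(
        pdb_file_path, predict_indices, output_json_path=output_json_path
    )
    subprocess.run(cmd, cwd=repo_root, check=True)
    with open(Path(repo_root) / output_json_path) as fh:
        return json.load(fh)
```

Calling `run_inference("examples/Q16613.pdb", [31, 79, 114])` from the repository root would then be equivalent to the command line shown above, with the results also returned to the caller.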

Testing

You can evaluate the model using predefined test datasets.

  1. Set the test dataset path in args within quick_test.ipynb. Available test sets:

    • ./data_test/large_scale_dataset/
    • ./data_test/generalization/PTMint_dataset/
    • ./data_test/generalization/qPTM_dataset/
  2. Run the test notebook:

    jupyter notebook quick_test.ipynb

This will provide performance metrics and model evaluation results.
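
On a machine without a browser, the notebook can also be executed headlessly with `jupyter nbconvert`. The sketch below assumes nbconvert is installed in the environment and that the dataset path has already been set inside the notebook. The helper name `build_notebook_command` and the output file name `quick_test_executed.ipynb` are illustrative choices, not part of the repository.

```python
import subprocess


def build_notebook_command(notebook="quick_test.ipynb",
                           output="quick_test_executed.ipynb"):
    """Build a `jupyter nbconvert` command that executes a notebook.

    The executed copy, including cell outputs, is written to `output`.
    """
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute", notebook,
        "--output", output,
    ]


def run_notebook(notebook="quick_test.ipynb",
                 output="quick_test_executed.ipynb"):
    """Execute the notebook; raises CalledProcessError if any cell fails."""
    subprocess.run(build_notebook_command(notebook, output), check=True)
```

Calling `run_notebook()` from the repository root leaves the original notebook untouched and stores the metrics in the executed copy.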

References

For a complete description of the method, see:

@article{tan2024metoken,
  title={MeToken: Uniform Micro-environment Token Boosts Post-Translational Modification Prediction},
  author={Tan, Cheng and Cao, Zhenxiao and Gao, Zhangyang and Wu, Lirong and Li, Siyuan and Huang, Yufei and Xia, Jun and Hu, Bozhen and Li, Stan Z},
  journal={arXiv preprint arXiv:2411.01856},
  year={2024}
}

Contact

Please submit any bug reports, feature requests, or general usage feedback as a GitHub issue or discussion.

License

This project is licensed under the MIT License. See the LICENSE file for more details.