- AlphaFold Swiss-Prot: A database of protein structures predicted by the AlphaFold model, containing 3D structure information for approximately 542K proteins. You can download this dataset from its website.
- EC: A public dataset used to predict a protein's enzyme commission (EC) numbers, which describe the biochemical reactions it catalyzes. You can download this dataset from its website.
- GO-BP: A dataset of the biological process (BP) terms of proteins, each representing a specific objective that the organism is genetically programmed to achieve. You can download this dataset from its website.
- GO-MF: A dataset of the molecular function (MF) terms of proteins, which correspond to activities that can be performed by individual gene products. You can download this dataset from its website.
- GO-CC: A dataset of the cellular component (CC) terms of proteins, referring to the locations, relative to cellular structures, where a gene product performs a function. You can download this dataset from its website.
- Glycoprotein Dataset (self-built dataset used in the Case Study): You can find this dataset in the `datasets` folder.
To reproduce our results, install the following pip dependencies:
numpy==1.22.4
pandas==1.4.3
rdkit-pypi==2022.3.5
torch==1.12.0
torch-cluster==1.6.0
torch-geometric==2.0.4
torch-scatter==2.0.9
torch-sparse==0.6.14
torch-spline-conv==1.2.1
torchdrug==0.2.0
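
Before running anything, it can help to confirm that the environment matches the pins above. The following is a minimal sketch, not part of this repository: the helper name `check_pins` is hypothetical, and it assumes that local build tags such as `+cu113` on the torch packages can be ignored when comparing versions.

```python
from importlib import metadata

# Pinned versions copied from the dependency list above.
PINNED = {
    "numpy": "1.22.4",
    "pandas": "1.4.3",
    "rdkit-pypi": "2022.3.5",
    "torch": "1.12.0",
    "torch-cluster": "1.6.0",
    "torch-geometric": "2.0.4",
    "torch-scatter": "2.0.9",
    "torch-sparse": "0.6.14",
    "torch-spline-conv": "1.2.1",
    "torchdrug": "0.2.0",
}


def check_pins(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for every missing or mismatched pin."""
    problems = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Assumption: ignore local build tags such as "+cu113" when comparing.
        if found is None or found.split("+")[0] != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every pinned package is installed at the expected version.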
- Preprocess the pre-training dataset.
  - After downloading the full AlphaFold Swiss-Prot dataset, use `src\alphafold.py` to transform the original protein PDB files into NetworkX files.
- Pretrain the protein structure model and sequence model.
  - Run the file `src\pretrain_model.py` to pretrain the two models simultaneously.
- Evaluate the pretrained structure model on benchmark datasets.
  - Transform the downloaded dataset to the NetworkX format.
  - Run the file `src\pipeline.py` to evaluate the model on downstream tasks.
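
To make the PDB-to-NetworkX transformation in the steps above more concrete, here is a rough sketch of one common way to build a residue graph. It is not the code in `src\alphafold.py`: the function names are hypothetical, and the choices of one node per C-alpha atom and an 8 Å contact cutoff are assumptions made only for illustration. Alternate locations, insertion codes, and multi-model files are ignored.

```python
import math


def parse_ca_atoms(pdb_text):
    """Return [(chain, residue_number, residue_name, (x, y, z)), ...] for C-alpha atoms."""
    residues = []
    for line in pdb_text.splitlines():
        # PDB is a fixed-column format: atom name in columns 13-16,
        # residue name in 18-20, chain in 22, residue number in 23-26,
        # and x, y, z coordinates in columns 31-54.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            residues.append((line[21], int(line[22:26]), line[17:20].strip(), xyz))
    return residues


def contact_edges(residues, cutoff=8.0):
    """Return index pairs (i, j), i < j, whose C-alpha atoms lie within `cutoff` angstroms."""
    edges = []
    for i in range(len(residues)):
        for j in range(i + 1, len(residues)):
            if math.dist(residues[i][3], residues[j][3]) <= cutoff:
                edges.append((i, j))
    return edges


def to_networkx(residues, edges):
    """Assemble the residue graph; networkx is imported lazily so parsing works without it."""
    import networkx as nx

    graph = nx.Graph()
    for idx, (chain, number, name, xyz) in enumerate(residues):
        graph.add_node(idx, chain=chain, residue_number=number, residue_name=name, coord=xyz)
    graph.add_edges_from(edges)
    return graph
```

Given the text of one PDB file, `to_networkx(residues, contact_edges(residues))` with `residues = parse_ca_atoms(text)` yields a graph whose nodes carry residue attributes. The node features, edge types, and file format that the repository's scripts expect may differ, so treat this only as an orientation aid.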
If you find this work helpful for your research, please cite the following paper:
@inproceedings{ma2024scop,
author={Ma, Runze and He, Chengxin and Zheng, Huiru and Wang, Xinye and Wang, Haiying and Zhang, Yidan and Duan, Lei},
booktitle={2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)},
title={SCOP: A Sequence-Structure Contrast-Aware Framework for Protein Function Prediction},
year={2024},
pages={79-84},
doi={10.1109/BIBM62325.2024.10822541}
}