This repository, ML_QSAR_model
, contains Python scripts for developing machine learning-based QSAR (Quantitative Structure-Activity Relationship) classification models.
-
1a_generate_ecfp.py
This script generates Extended-Connectivity Fingerprints (ECFP4) with a diameter of 4 bonds using RDKit, an open-source cheminformatics toolkit. -
1b_generate_ecfp.py
A memory-efficient version of1a_generate_ecfp.py
that can handle multiple SDF files. -
2_splitting_with_structural_similarity.py
This script uses RDKit for splitting the dataset, ensuring stratified sampling based on both structural features and activity labels. -
3_rf_hyperparameter_tuning.py
UtilizesGridSearchCV
from Scikit-learn to systematically evaluate a predefined grid of hyperparameter values for Random Forest classifiers. The optimal hyperparameter combination is identified through cross-validation. -
4_svm_hyperparameter_tuning.py
Similar to the previous script, but for Support Vector Machine (SVM) classifiers. It also usesGridSearchCV
to find the best hyperparameters. -
5_model.py
This script runs the model using optimized and default parameters, plots the average ROC AUC over cross-validation iterations, computes various evaluation metrics, and organizes the results into directories.
To run these scripts, ensure that:
- The required dependencies, including RDKit and Scikit-learn, are installed.
- The correct data is being read from the appropriate directories.
Properly organizing your data and verifying the input paths will ensure the scripts run smoothly.