The Drug Response Prediction 2022 project of the Computational Biology and Artificial Intelligence (COMBINE) Laboratory, McGill University.
In the console, type the following command.
git clone https://github.com/AntonioShen/MTDRP.git
Using CONDA to manage dependency packages is preferred. In the console, make sure the current working directory is the project root (/MTDRP/), then run the following command to create a new environment and install all required packages.
conda create --name env --file ./requirements.txt
Activate the newly created CONDA environment (named env in the command above).
conda activate env
Download DRP2022_preprocessed.zip (not yet disclosed; it will be made available in the future), unzip it, and merge the extracted folder into ./data/DRP2022_preprocessed.
The dataset consists of multiple .csv files. This step extracts numerical values from them and creates dataset objects (instances of a subclass of torch.utils.data.Dataset) for convenient training and testing.
First choose a particular set of folds (for cross-validation) and, optionally, an additional data preprocessing rule. In the example below (see 2.2.3), the first fold (index 0) of cl_fold and zero-mean standardization are used to create the PyTorch datasets.
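As a rough illustration of what selecting a fold means, the sketch below splits rows into training and test sets by a fold column. The row layout, the split_by_fold helper, and the convention that the selected fold becomes the test set are assumptions made for illustration; the project's get_fold may behave differently.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def split_by_fold(
    rows: List[Row], fold_column: str, fold_index: int
) -> Tuple[List[Row], List[Row]]:
    """Hypothetical helper: rows whose fold id equals fold_index form the
    test set; all remaining rows form the training set."""
    train = [row for row in rows if row[fold_column] != fold_index]
    test = [row for row in rows if row[fold_column] == fold_index]
    return train, test


# Toy rows with made-up values; the real label file's columns are not shown here.
rows = [
    {"cell_line": "A", "drug": "d1", "cl_fold": 0},
    {"cell_line": "B", "drug": "d1", "cl_fold": 1},
    {"cell_line": "C", "drug": "d2", "cl_fold": 2},
    {"cell_line": "A", "drug": "d2", "cl_fold": 0},
    {"cell_line": "B", "drug": "d3", "cl_fold": 1},
]

train_rows, test_rows = split_by_fold(rows, "cl_fold", 0)
print(len(train_rows), len(test_rows))
```

Repeating the call with each fold index in turn would give one train/test pair per cross-validation round.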
A new second-stage preprocessing method can be defined in ./datahandlers/custom_preprocess_rules.py (see 2.2.2). Min-max normalization and zero-mean standardization rules are provided by default.
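To make the arithmetic behind these two rules concrete, here is a minimal pure-Python sketch on a single feature. It assumes that statistics are computed on the training values and reused for the test values, and that the population standard deviation is used; the project's implementation may differ on both points, and the helper names are illustrative only.

```python
from statistics import mean, pstdev
from typing import List


def min_max_normalize(values: List[float], lo: float, hi: float) -> List[float]:
    """Map values linearly so that lo maps to 0 and hi maps to 1."""
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values: List[float], mu: float, sigma: float) -> List[float]:
    """Subtract mu and divide by sigma."""
    return [(v - mu) / sigma for v in values]


# Toy values for one feature (made up for illustration).
train_values = [1.0, 2.0, 3.0]
test_values = [4.0]

# Assumption: statistics come from the training split only.
lo, hi = min(train_values), max(train_values)
mu, sigma = mean(train_values), pstdev(train_values)

train_minmax = min_max_normalize(train_values, lo, hi)
test_minmax = min_max_normalize(test_values, lo, hi)  # may fall outside [0, 1]
train_std = standardize(train_values, mu, sigma)
test_std = standardize(test_values, mu, sigma)
```

Under these assumptions the standardized training values have zero mean, while test values are only approximately centered.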
Every preprocessing method should be packaged as a class that inherits from datahandlers.dataset_handler.PreprocessRule and implements its preprocess() interface, which returns a list containing two torch.Tensor objects for training and testing, respectively.
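A custom rule might look like the sketch below. The real PreprocessRule base class is not shown in this document, so a stand-in is defined here, and the preprocess(train, test) signature is an assumption; check datahandlers/dataset_handler.py for the actual interface. MaxAbsScaling is a hypothetical rule chosen only as an example.

```python
from typing import List

import torch


class PreprocessRule:
    """Stand-in for datahandlers.dataset_handler.PreprocessRule.
    The argument list of preprocess() is assumed, not taken from the project."""

    def preprocess(self, train: torch.Tensor, test: torch.Tensor) -> List[torch.Tensor]:
        raise NotImplementedError


class MaxAbsScaling(PreprocessRule):
    """Hypothetical rule: divide each feature by its largest absolute training value."""

    def __init__(self, eps: float = 1e-8) -> None:
        self.eps = eps  # lower bound on the scale, avoids division by zero

    def preprocess(self, train: torch.Tensor, test: torch.Tensor) -> List[torch.Tensor]:
        # Per-feature scale computed from the training data only.
        scale = train.abs().amax(dim=0, keepdim=True).clamp_min(self.eps)
        return [train / scale, test / scale]


# Toy tensors: rows are samples, columns are features.
train = torch.tensor([[1.0, -4.0], [2.0, 2.0]])
test = torch.tensor([[4.0, 1.0]])
train_scaled, test_scaled = MaxAbsScaling().preprocess(train, test)
print(train_scaled, test_scaled)
```

If the real interface matches this assumption, an instance of such a class could be passed wherever Standardization() is used in the example that follows.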
In the Python console.
>>> from datahandlers.dataset_handler import DRPGeneralDataset
>>> from datahandlers.custom_preprocess_rules import Standardization
>>> GDSC = DRPGeneralDataset()
>>> GDSC.load_from_csv('GDSC',
...                    'data/DRP2022_preprocessed/sanger/sanger_broad_ccl_log2tpm.csv',
...                    'data/DRP2022_preprocessed/drug_features/gdsc_drug_descriptors.csv',
...                    'data/DRP2022_preprocessed/drug_response/gdsc_tuple_labels_folds.csv')
>>> train, test = GDSC.get_fold('cl_fold', 0, preprocess=Standardization(), save=True)
>>> print(len(train), len(test))
259386 66319
In the above example, passing save=True saves all tensor files (.pt) under ./tensors/Standardization/GDSC/cl_fold0/. Doing so is recommended, because the training command takes this directory as its source_path argument.
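To inspect saved .pt files, something like the following sketch could be used. It assumes each file holds a single tensor; the load_tensor_dir helper and the file names in the demonstration are made up, since the names actually written by save=True are not listed here.

```python
import tempfile
from pathlib import Path
from typing import Dict

import torch


def load_tensor_dir(directory: str) -> Dict[str, torch.Tensor]:
    """Load every .pt file in a directory, keyed by file name without extension."""
    return {
        path.stem: torch.load(path)
        for path in sorted(Path(directory).glob("*.pt"))
    }


# Demonstration on a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    torch.save(torch.zeros(3, 2), Path(tmp) / "example_train.pt")
    torch.save(torch.ones(1, 2), Path(tmp) / "example_test.pt")
    tensors = load_tensor_dir(tmp)

shapes = {name: tuple(tensor.shape) for name, tensor in tensors.items()}
print(shapes)
```

For the real data, the directory argument would be ./tensors/Standardization/GDSC/cl_fold0/.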
In the console, type the following command with arguments source_path, batch_size, epochs and lr (the learning rate).
python train.py --source_path ./tensors/Standardization/GDSC/cl_fold0/ --batch_size 20 --epochs 100 --lr 1e-4
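For readers unfamiliar with this style of command line, the sketch below shows how four such flags could be parsed with argparse. It is not the project's train.py, whose parser is not shown here; the default values and help text are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the four flags used above."""
    parser = argparse.ArgumentParser(description="Sketch of a training CLI")
    parser.add_argument("--source_path", type=str, required=True,
                        help="directory holding the saved .pt tensors")
    parser.add_argument("--batch_size", type=int, default=20)  # assumed default
    parser.add_argument("--epochs", type=int, default=100)     # assumed default
    parser.add_argument("--lr", type=float, default=1e-4,      # assumed default
                        help="learning rate")
    return parser


# Parse the same arguments as the command above, passed as a list
# so the sketch runs without a real command line.
args = build_parser().parse_args([
    "--source_path", "./tensors/Standardization/GDSC/cl_fold0/",
    "--batch_size", "20",
    "--epochs", "100",
    "--lr", "1e-4",
])
print(args.source_path, args.batch_size, args.epochs, args.lr)
```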