Report 10

Predicting drugs that can cross the blood-brain-barrier

George Iniatis

2329642

Proposal

Motivation

The brain is protected by a selectively permeable boundary, the blood-brain barrier, that prevents many pathogens from getting in. However, it can also stop many useful drugs from entering the brain. This matters most when trying to deliver critical therapeutics, such as chemotherapy, to brain tumours. Accurate prediction of whether a drug will easily cross the blood-brain barrier is therefore a valuable tool for developing and testing new drugs for various diseases.

Aims

This project aims to gather publicly available data on drugs that are known to cross into the brain and drugs that cannot, and to combine them into a new dataset. Using this newly created dataset, the project then aims to build a machine learning system that uses a drug’s chemical structure to predict whether it can pass the blood-brain barrier.

Progress

Read five academic papers discussing how other researchers have approached the same problem using a variety of strategies and methods. These are summarised in the Research Journal
Created the dataset. The whole process is described in detail in the Dataset Creation Journal
Created a small Datalore notebook (the JetBrains equivalent of Jupyter Notebooks) to do some data exploration, produce some plots and build my models. Link

Built the dataset
Decide to use a very small subset of chemical descriptors
Add the different sources of data

Problems and risks

Problems

Dataset Problems (Copied directly from the Dataset Creation Journal)

Some SMILES include special characters (/,) that, even when URL-encoded, alter the SMILES itself.
- Solved by using POST requests to the PubChem API, as suggested by the documentation
Complexity issues. Algorithms were taking too long to run
- Solved by refactoring and reformatting the code and by using binary searches where possible
Discovered a bug in one of the functions, get_pubchem_cid_and_smiles_using_name
- Only affected the entries associated with the Gao et al. datasets
- Some drug names can have multiple SMILES associated with them, and the bug caused the function to retrieve only the first SMILES available
- Fixed through some code refactoring. The titles of the compounds retrieved from PubChem are now also returned and compared with the drug name being searched for. If a match is found, that specific compound's CID and SMILES are returned
- Had to repopulate the whole dataset, which took a couple of hours
While performing automated Google searches, a 429: Too Many Requests error would appear after roughly 100 search results had been retrieved. This was possibly caused by a website detecting that a bot was being used to scrape data
- Multiple queries were used to gather as much data as possible until that error was thrown
- In the end we decided against using Google searches directly and instead used the APIs offered by PubMed and Springer for a more targeted approach
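
Two of the fixes above can be sketched in Python. This is a minimal illustration under stated assumptions, not the code from the Dataset Creation Journal: the endpoint URL, the helper names build_smiles_lookup_request and pick_matching_compound, the shape of the candidate records, the case-insensitive comparison and the None return on no match are all assumptions made for the example.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed PubChem PUG REST endpoint for looking up CIDs from a SMILES string.
PUBCHEM_SMILES_URL = (
    "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/cids/JSON"
)


def build_smiles_lookup_request(smiles):
    """Build (but do not send) a POST request with the SMILES in the body.

    Sending the SMILES as form data keeps characters such as "/" and "\\"
    out of the URL path, where they would change the meaning of the request.
    """
    body = urlencode({"smiles": smiles}).encode("ascii")
    return Request(
        PUBCHEM_SMILES_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def pick_matching_compound(drug_name, candidates):
    """Return (cid, smiles) for the candidate whose title matches drug_name.

    `candidates` is a hypothetical list of dicts with "cid", "title" and
    "smiles" keys, one per compound returned for the name search.
    The comparison ignores case and surrounding whitespace (an assumption).
    Returns None when no title matches (also an assumption).
    """
    wanted = drug_name.strip().lower()
    for compound in candidates:
        if compound["title"].strip().lower() == wanted:
            return compound["cid"], compound["smiles"]
    return None
```

The first helper only constructs the request, so it can be inspected before anything is sent over the network. The second replaces "take the first result" with "take the result whose title equals the searched name", which is the behaviour described for the fixed function.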

General Problem

First time working with machine learning in such depth
- The Machine Learning course definitely helped
- Watched multiple scikit-learn and ML tutorials to get up to speed

Risks

A large percentage of the dataset consists of previous datasets created by other researchers. Therefore, our current dataset is only as good as those it builds upon

Some data validation required manual checks, which could have introduced errors into the dataset

Plan

[Time plan, in roughly weekly to monthly blocks, up until submission week]
