Report 10
The brain is protected by a selectively permeable boundary, the blood-brain barrier, which keeps many pathogens out. However, it also stops many useful drugs from entering the brain. This matters especially when delivering critical therapeutics, such as chemotherapy, to brain tumours. Accurately predicting whether a drug will cross the blood-brain barrier is therefore valuable when developing and testing new drugs for various diseases.
This project aims to gather publicly available data on drugs that are known to cross into the brain and drugs that cannot, and to combine them into a new dataset. This dataset is then used to build a machine learning system that predicts, from a drug's chemical structure, whether it can pass the blood-brain barrier.
- Read 5 academic papers discussing how other people have approached the same problem using a variety of strategies and methods. They are summarised in the Research Journal
- Created the dataset. The whole process is described in detail in the Dataset Creation Journal
- Created a small Datalore notebook (the JetBrains equivalent of Jupyter Notebooks) to do some data exploration, produce some plots and build my models. Link
- Built the dataset
- Decide to use a very small subset of chemical descriptors
- Add the different sources of data
Dataset Problems (Copied directly from the Dataset Creation Journal)
- Some SMILES include special characters (/,) that even when URL encoded alter the SMILES itself.
- Solved using POST requests to the PubChem API as suggested by the documentation
- Complexity issues: some algorithms were taking too long to run
- Solved by refactoring and reformatting the code and by using binary searches where possible
- Discovered a bug with one of the functions, get_pubchem_cid_and_smiles_using_name
- Only affected the entries associated with the Gao et al. datasets
- Some drug names have multiple SMILES associated with them, and the bug caused the function to retrieve only the first SMILES available
- Fixed through some code refactoring. The titles of the compounds retrieved from PubChem are now also returned and compared with the drug name being searched for. If a match is found, that specific compound's CID and SMILES are returned
- Had to repopulate the whole dataset, which took a couple of hours
- While performing automated Google searches, a 429: Too Many Requests error would appear after roughly 100 search results had been retrieved. This was possibly caused by the site detecting that a bot was being used to scrape data
- Multiple queries were used to gather as much data as possible until that error was thrown
- In the end we decided against using Google searches directly and made use of the APIs offered by PubMed and Springer for a more targeted approach
- First time working with Machine Learning in such depth
- The Machine Learning course definitely helped
- Watched multiple scikit-learn and ML tutorials to get up to speed
- A large percentage of the dataset consists of previous datasets created by other researchers. Therefore our current dataset is only as good as those it builds upon
- Some data validation required a human check, which could have introduced errors into the dataset
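The SMILES and name-matching fixes described in the list above can be sketched as follows. This is a minimal illustration and not the project's actual code. It assumes the PUG REST SMILES-to-CID endpoint with the SMILES sent in a form-encoded POST body. The helper names (`build_smiles_request`, `fetch_cids`, `pick_by_title`), the candidate dictionary keys and the case-insensitive title comparison are all hypothetical choices made for this example.

```python
import json
import urllib.parse
import urllib.request

# PUG REST endpoint that resolves a SMILES string to PubChem CIDs.
PUG_SMILES_TO_CIDS = (
    "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/cids/JSON"
)


def build_smiles_request(smiles: str) -> urllib.request.Request:
    """Build a POST request that carries the SMILES in the body, not the URL.

    Keeping the SMILES out of the URL path means characters such as "/"
    cannot be misread as path separators.
    """
    body = urllib.parse.urlencode({"smiles": smiles}).encode("ascii")
    return urllib.request.Request(
        PUG_SMILES_TO_CIDS,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def fetch_cids(smiles: str, timeout: float = 30.0) -> list:
    """Send the request and return the CIDs PubChem reports (needs network)."""
    request = build_smiles_request(smiles)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload.get("IdentifierList", {}).get("CID", [])


def pick_by_title(drug_name: str, candidates):
    """Return (cid, smiles) for the candidate whose title matches drug_name.

    `candidates` is an iterable of dicts with the hypothetical keys
    "cid", "title" and "smiles". Matching ignores case and surrounding
    whitespace (an assumption). Returns None when no title matches,
    instead of silently falling back to the first candidate.
    """
    wanted = drug_name.strip().casefold()
    for candidate in candidates:
        if candidate["title"].strip().casefold() == wanted:
            return candidate["cid"], candidate["smiles"]
    return None
```

Calling `build_smiles_request("C/C=C/C")` produces a request whose URL contains no part of the SMILES, so the double-bond markers survive intact. `pick_by_title` returns `None` on a miss so that unmatched names can be flagged for review rather than paired with the wrong structure.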
[Time plan, in roughly weekly to monthly blocks, up until submission week]