Report 4

George Iniatis edited this page Oct 28, 2021 · 8 revisions

Worked on the dataset
- Discovered a minor bug in the get_pubchem_cid_and_smiles_using_name function that could introduce errors into the dataset
- The bug only affected entries from the Gao et al. datasets
- Some drug names have multiple SMILES associated with them, and the bug caused the function to retrieve only the first one available
- To remove this uncertainty from the dataset, I decided to remove drugs with multiple SMILES
- Rather than just fixing the bug and removing the affected entries, I repopulated the whole dataset. This was easy given my existing functions, but it took a couple of hours
- Retrieved indicators from the SIDER database
- Produced one-hot encodings for both side effects and indicators
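
The ambiguity filter and the encoding step described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the project's actual code: the helper names `keep_unambiguous` and `multi_hot`, the shape of the input dictionaries, and the placeholder drug names and labels are all hypothetical.

```python
def keep_unambiguous(name_to_smiles):
    """Keep only drug names that map to exactly one distinct SMILES string.

    `name_to_smiles` is a hypothetical mapping: drug name -> list of SMILES
    strings returned by a lookup. Names with several distinct SMILES are dropped.
    """
    kept = {}
    for name, smiles_list in name_to_smiles.items():
        distinct = set(smiles_list)
        if len(distinct) == 1:
            kept[name] = next(iter(distinct))
    return kept


def multi_hot(drug_to_labels):
    """Encode each drug's labels as a 0/1 vector over a sorted vocabulary.

    `drug_to_labels` is a hypothetical mapping: drug name -> iterable of labels
    (for example side effects). Returns the vocabulary and the encoded rows.
    """
    vocabulary = sorted({label for labels in drug_to_labels.values() for label in labels})
    index = {label: i for i, label in enumerate(vocabulary)}
    encoded = {}
    for drug, labels in drug_to_labels.items():
        row = [0] * len(vocabulary)
        for label in labels:
            row[index[label]] = 1
        encoded[drug] = row
    return vocabulary, encoded


if __name__ == "__main__":
    # Placeholder data for illustration only, not taken from the real dataset.
    lookups = {"drug_a": ["CCO"], "drug_b": ["CCN", "CCC"], "drug_c": ["CCO", "CCO"]}
    print(keep_unambiguous(lookups))

    side_effects = {"drug_a": ["nausea", "headache"], "drug_c": ["nausea"]}
    print(multi_hot(side_effects))
```

Dropping ambiguous names is stricter than picking one SMILES per name, but it keeps every remaining entry traceable to a single structure.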
Performed Automated Google Searches
- Drug-specific Google searches proved ineffective: they returned irrelevant results
- Running a variety of general queries to gather as much data as possible proved ineffective as well: the results were too noisy
- Decided to perform site-targeted queries to gather as much reliable data as possible without the need for manual verification
- For each query, roughly 100 URLs were retrieved and regular expressions were used to extract matches
- Google seemed to detect that a bot was being used after roughly 100 URLs had been retrieved
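
The site-targeted query and the regular-expression matching step can be sketched roughly as below. This is a hedged illustration rather than the code actually used: the helper names `build_site_query` and `extract_matches`, the example site, and the "X may cause Y" pattern are assumptions made for the example, and the page text is passed in as a plain string so that nothing here depends on a particular search or scraping library.

```python
import re


def build_site_query(site, phrase):
    """Build a query restricted to one site using Google's `site:` operator."""
    return f'site:{site} "{phrase}"'


def extract_matches(text, pattern):
    """Return every (drug, effect) pair that the pattern finds in the text."""
    return pattern.findall(text)


# Hypothetical pattern: two capture groups around the phrase "may cause".
MAY_CAUSE = re.compile(r"(\w+) may cause (\w+)")

if __name__ == "__main__":
    # example.org is a placeholder site, not one used in the project.
    print(build_site_query("example.org", "may cause"))

    # Placeholder page text for illustration only.
    page_text = "Drug_a may cause nausea. Drug_b may cause headache."
    print(extract_matches(page_text, MAY_CAUSE))
```

Restricting each query to a trusted site shifts the reliability check from the individual match to the choice of site, which is what removes the need for manual verification.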
Questions/Topics to discuss
- Next steps. ML tutorials and material?