
Help with Lead Optimization Task #179

Open

Danny32968 opened this issue Jan 3, 2025 · 5 comments

Comments

@Danny32968

Hello,

Please excuse this elementary question - I am an inexperienced student on a learning journey! I am trying to compare existing methods for lead optimization: specifically, how well different models can optimize a set of 700 DrugBank molecules using QED, MW and TPSA as objectives, in a fair comparison.

I'm unsure what configurations/tools I should use from REINVENT (since there are so many) to best accomplish this task. Is sampling using Mol2Mol sufficient?

Thank you in advance!

@halx
Contributor

halx commented Jan 6, 2025

Hi,

many thanks for your interest in REINVENT and welcome to the community!

When you sample from the Mol2Mol prior model you will preferentially receive high-probability (as measured by NLL) SMILES sequences. You would then need to filter the resulting molecules for your objectives, and you may have to sample a very large number of compounds.
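A minimal sketch of such a post-hoc filter with RDKit and pandas (the file name, column name, and thresholds below are placeholders, not REINVENT defaults - check them against your own sampling CSV):

```python
# Filter Mol2Mol sampling output for QED, MW and TPSA windows (illustrative thresholds).
import pandas as pd
from rdkit import Chem
from rdkit.Chem import QED, Descriptors

df = pd.read_csv("sampling.csv")          # placeholder path to the sampler's CSV output
keep = []
for smi in df["SMILES"]:                  # column name assumed; check your CSV header
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        continue
    if (QED.qed(mol) >= 0.7
            and Descriptors.MolWt(mol) <= 500
            and Descriptors.TPSA(mol) <= 90):
        keep.append(smi)
print(f"{len(keep)} molecules pass the filters")
```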

Typically, though, you would use Reinforcement Learning (RL, also called staged learning in REINVENT because you can run multiple successive RL runs) with your objectives expressed as scoring components. In this way you directly train a model to produce compounds that satisfy your objectives with high probability. Input examples can be found in config/toml in the repository.

Note that QED is itself a composite score built from several physicochemical properties, including MW and TPSA, so your objectives overlap.
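For reference, RDKit exposes the individual descriptors that enter QED, which makes the overlap with MW and TPSA easy to see (a small illustration with an arbitrary example molecule, not a REINVENT feature):

```python
from rdkit import Chem
from rdkit.Chem import QED

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, just as an example
props = QED.properties(mol)   # named tuple: MW, ALOGP, HBA, HBD, PSA, ROTB, AROM, ALERTS
print(props.MW, props.PSA)    # MW and TPSA are among the QED inputs
print(QED.qed(mol))           # the combined QED score
```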

Many thanks,
Hannes.

@Danny32968
Author

Hi Hannes,

Thanks so much for your reply!

  1. Just to make sure I'm on the right track, this is what I am thinking of doing. Do you think it is sufficient?

  - Optimize my leads one by one with a batch_size of 64, min_steps of 10, max_steps of 60, no max_score, and "plateau" as the termination criterion. The final molecule generated in the final step would be used as the "optimized version" of the lead molecule.

  2. I would also like to encourage similarity between the lead and the optimized molecules. Issue #107 suggests including a similarity component as one of the reward objectives. My question is: how will the model behave with a high-similarity prior vs. a lower-similarity prior? In my case, which prior would be most suitable? Maybe mol2mol_similarity, since it was trained on a larger dataset?

Thanks again!

@halx
Contributor

halx commented Jan 7, 2025

Hi,

if your leads bind to the same target, you may want to co-optimize them. When you run RL you will generate batch_size molecules in each step. "Good" molecules, as per your objectives, may be generated at any step, i.e. you would look at all the molecules in the CSV file and filter out the ones you like. RL optimizes the model according to your scoring components, which means the likelihood of generating "good" molecules increases over time. With a single RL stage it may be most practical to set max_steps to a sensible value and max_score to 1.0, as you will be generating batch_size x max_steps molecules. The termination criterion is of no relevance here as you are carrying out only a single stage.
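A rough sketch of that post-run filtering step (the CSV path, score threshold, and column names are assumptions on my part - take the exact headers from the summary CSV your run actually writes):

```python
import pandas as pd

df = pd.read_csv("staged_learning_1.csv")        # placeholder name for the RL summary CSV
# Keep the best-scoring unique molecules from all steps, not only the final step.
good = (df[df["Score"] >= 0.8]                   # "Score" column name assumed
          .sort_values("Score", ascending=False)
          .drop_duplicates(subset="SMILES"))     # "SMILES" column name assumed
good.to_csv("good_molecules.csv", index=False)
```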

I think the newest prior is pubchem_ecfp4_with_count_with_rank_reinvent4_dict_voc.prior. You can use TanimotoSimilarity, but keep in mind that it is a hard constraint, i.e. the generated molecules must match the pattern. If you want to enforce a certain scaffold, it may be more sensible to use a Libinvent prior (but the ones in the repository are only based on ChEMBL).
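If you want to check lead-to-candidate similarity yourself after a run, here is a small RDKit sketch (the SMILES are placeholders and the radius/bit settings are arbitrary choices of mine; this is separate from whatever REINVENT computes internally for its scoring component):

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

lead = Chem.MolFromSmiles("c1ccccc1CCN")                  # placeholder lead
cand = Chem.MolFromSmiles("c1ccc(Cl)cc1CCN")              # placeholder generated molecule
fp_lead = AllChem.GetMorganFingerprintAsBitVect(lead, 2, nBits=2048)
fp_cand = AllChem.GetMorganFingerprintAsBitVect(cand, 2, nBits=2048)
print(DataStructs.TanimotoSimilarity(fp_lead, fp_cand))   # 0..1, higher = more similar
```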

Cheers,
Hannes.

@Danny32968
Author

Hi Hannes,

Once again, thanks so much for your help!

Final question - am I correct in saying that the pretraining method for REINVENT is from "Exhaustive local chemical space exploration using a transformer model" by Tibo et al., and the RL method for molecular optimization is from "LibINVENT: reaction-based generative scaffold decoration for in silico library design" by Fialková et al.?

@halx
Contributor

halx commented Jan 10, 2025

The RL algorithm was first published in "Molecular de-novo design through deep reinforcement learning". The other paper you mention is, I believe, what is implemented in the Mol2Mol prior pubchem_ecfp4_with_count_with_rank_reinvent4_dict_voc.prior. Contact the authors directly to be sure.
