Abstract. A central task in computational drug discovery is to construct models from known active molecules to find further promising molecules for subsequent screening. However, typically only very few active molecules are known. Therefore, few-shot learning methods have the potential to improve the effectiveness of this critical phase of the drug discovery process. We introduce a new method for few-shot drug discovery. Its main idea is to enrich a molecule representation with knowledge about known context or reference molecules. Our novel concept for molecule representation enrichment is to associate molecules from both the support set and the query set with a large set of reference (context) molecules through a modern Hopfield network. Intuitively, this enrichment step is analogous to a human expert who would associate a given molecule with familiar molecules whose properties are known. The enrichment step reinforces and amplifies the covariance structure of the data, while simultaneously removing spurious correlations arising from the decoration of molecules. Our approach is compared with other few-shot methods for drug discovery on the FS-Mol benchmark dataset. On FS-Mol, our approach outperforms all compared methods and therefore sets a new state of the art for few-shot learning in drug discovery. An ablation study shows that the enrichment step of our method is key to improving the predictive quality. In a domain shift experiment, we further demonstrate the robustness of our method. Code is available at https://github.com/ml-jku/MHNfs.

Few-shot learning is a subfield of machine learning that aims to provide predictive models for scenarios in which only little data is available.

MHNfs is a few-shot learning model designed specifically for drug discovery applications. It uses the input prompt so that the provided knowledge, i.e., the known active and inactive molecules, serves as context for predicting the activity of the newly requested molecules. More precisely, the provided active and inactive molecules are associated with a large set of general molecules, called context molecules, to enrich the provided information and to remove spurious correlations arising from the decoration of molecules. This is analogous to a large language model that uses not only the information in the current prompt as context but also has access to much more information, e.g., a prompting history.
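The association step can be illustrated with the generic retrieval rule of a modern Hopfield network: a molecule embedding is compared with every context embedding, the similarities are turned into softmax weights, and the result is the weighted average of the context embeddings. The following is only a minimal sketch of that generic rule, not the MHNfs implementation; the function names, the plain-list embeddings, and the parameter `beta` are assumptions made for illustration, and learned projections or normalization layers are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hopfield_retrieve(query, context, beta=1.0):
    """Return the softmax-weighted average of the context vectors.

    query:   embedding of one support-set or query-set molecule (hypothetical)
    context: embeddings of the context molecules, one list per molecule
    beta:    inverse temperature; larger values focus on the closest matches
    """
    # Dot-product similarity between the query and every context molecule.
    scores = [beta * sum(q * c for q, c in zip(query, row)) for row in context]
    weights = softmax(scores)
    dim = len(context[0])
    # Weighted average of the context embeddings, dimension by dimension.
    return [sum(w * row[j] for w, row in zip(weights, context)) for j in range(dim)]
```

With a large `beta` the retrieved vector is dominated by the most similar context molecule; with `beta=0` it is the plain mean of all context embeddings.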

The predictive model (MHNfs) used in this application was designed and trained specifically for low-data scenarios. The model predicts whether a molecule is active or inactive. The predicted activity is a continuous value between 0 and 1 and, similar to a probability, the higher (lower) the value, the more confident the model is that the molecule is active (inactive). The model was trained on the FS-Mol dataset, which includes 5120 tasks (roughly 5000 tasks were used for training, the rest for evaluation). The training tasks are listed here: https://github.com/microsoft/FS-Mol/tree/main/datasets/targets.

To use this application, you need to provide 3 different sets of molecules:

- active molecules: set of known active molecules,
- inactive molecules: set of known inactive molecules, and
- molecules to predict: set of molecules you want to predict.

These three sets can be provided via the sidebar. The sidebar also includes two buttons, predict and reset, to run the prediction pipeline and to reset it.

- Molecules have to be provided in SMILES format
- For each input, at most 20 molecules can be provided
- You can provide the molecules via the text boxes or via CSV upload
  - Text box
    - Replace the placeholder input by typing your molecules directly into the text box
    - Separate the molecules with commas
  - CSV upload
    - The CSV file should include a "smiles" column (the header may be written in lower or upper case, i.e., "smiles" or "SMILES")
    - All other columns will be ignored
    - Examples are provided here: assets/example_csv/
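The input rules above can be sketched in a few lines of Python. This is a hedged illustration, not the application's actual parsing code: the function names and the constant `MAX_MOLECULES` are hypothetical, and raising an error when more than 20 molecules are given is an assumption (the application may handle that case differently).

```python
import csv
import io

MAX_MOLECULES = 20  # per-input limit stated in the rules above


def _check_limit(smiles):
    """Reject inputs with more than MAX_MOLECULES entries (assumed behavior)."""
    if len(smiles) > MAX_MOLECULES:
        raise ValueError(f"at most {MAX_MOLECULES} molecules are allowed per input")
    return smiles


def parse_text_box(text):
    """Split a comma-separated string into a list of SMILES strings."""
    smiles = [part.strip() for part in text.split(",") if part.strip()]
    return _check_limit(smiles)


def parse_csv(csv_text):
    """Read the "smiles"/"SMILES" column from CSV text; other columns are ignored."""
    reader = csv.DictReader(io.StringIO(csv_text))
    names = reader.fieldnames or []
    column = next((n for n in names if n in ("smiles", "SMILES")), None)
    if column is None:
        raise ValueError('the CSV file needs a "smiles" or "SMILES" column')
    smiles = [row[column].strip() for row in reader if row[column] and row[column].strip()]
    return _check_limit(smiles)
```

For example, `parse_text_box("CCO, c1ccccc1")` yields the two SMILES strings `CCO` and `c1ccccc1`. Neither function checks that the strings are chemically valid SMILES.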

Like all machine learning models, MHNfs varies in performance. In general, the model works well if the task is reasonably close to the tasks that were used to train it. The model performance for very different tasks is unclear and might be poor.

MHNfs was trained on the FS-Mol dataset, which includes 5120 tasks (roughly 5000 tasks were used for training, the rest for evaluation). The training tasks are listed here: https://github.com/microsoft/FS-Mol/tree/main/datasets/targets.

Since the predictive model has seen many kinase-related tasks during training, it is expected to perform well on kinase targets in general. For this example, we use data for the target CHEMBL5914. Notably, this specific kinase was not seen during training. More precisely, we use the available inhibition data, where molecules with an inhibition value greater (smaller) than 50 % are considered active (inactive).
From the available data, we have selected 4 "known" active molecules, 8 "known" inactive molecules, and 11 molecules to predict.
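The labeling rule for this example (inhibition above 50 % means active, below 50 % means inactive) can be written as a small helper. This is a sketch under assumptions: the function name and the `(smiles, inhibition_percent)` record layout are hypothetical, and values exactly at the threshold are left unlabeled because the text does not say how they are treated.

```python
def label_by_inhibition(records, threshold=50.0):
    """Split (smiles, inhibition_percent) pairs into actives and inactives."""
    actives, inactives = [], []
    for smiles, inhibition in records:
        if inhibition > threshold:
            actives.append(smiles)
        elif inhibition < threshold:
            inactives.append(smiles)
        # Values exactly equal to the threshold are skipped (assumption).
    return actives, inactives
```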
Molecules to predict:

FC(F)(F)c1ccc(Cl)cc1CN1CCNc2ncc(-c3ccnc(N4CCNCC4)c3)cc21,
CS(=O)(=O)c1ccc(-n2nc(-c3cnc4[nH]ccc4c3)c3c(N)ncnc32)cc1,
O=C(Nc1ccccc1Cl)c1cnc2ccc(C3CCNCC3)cn12.O=C(O)C(=O)O,
CC(C)n1cnc2c(Nc3cccc(Cl)c3)nc(N[C@@H]3CCCC[C@@H]3N)nc21,
Nc1ncc(-c2ccc(NS(=O)(=O)C3CC3)cc2F)cc1-c1ccc2c(c1)CCNC2=O,
CCN1CCN(Cc2ccc(NC(=O)c3ccc(C)c(C#Cc4cccnc4)c3)cc2C(F)(F)F)CC1,
CN1CCN(c2ccc(-c3cnc4c(c3)N(Cc3cc(Cl)ccc3C(F)(F)F)CCN4)cn2)CC1,
CC(C)n1nc(-c2cnc(N)c(OC(F)(F)F)c2)cc1[C@H]1[C@@H]2CN(C3COC3)C[C@@H]21,
Nc1ncc(-c2cc([C@H]3[C@@H]4CN(C5COC5)C[C@@H]43)n(CC3CC3)n2)cc1C(F)(F)F,
Cc1ccc(NC(=O)C2(C(=O)Nc3ccc(Nc4ncc(F)c(-c5cc(F)c6nc(C)n(C(C)C)c6c5)n4)cc3)CC2)cc1,
C[C@@H](Oc1cc(-c2cnn(C3CCNCC3)c2)cnc1N)c1c(Cl)ccc(F)c1Cl

Known active molecules:

CC(=O)N1CCN(c2cc(-c3cnc4c(c3)N(Cc3cc(Cl)ccc3C(F)(F)F)CCN4)ccn2)CC1,
CS(=O)(=O)c1cccc(Nc2nccc(-c3sc(N4CCOCC4)nc3-c3cccc(NS(=O)(=O)c4c(F)cccc4F)c3)n2)c1,
COc1cnccc1Nc1nc(-c2nn(Cc3c(F)cc(OCCO)cc3F)c3ccccc23)ncc1OC,
CN(C)[C@@H]1CC[C@@]2(C)[C@@H](CC[C@@H]3[C@@H]2CC[C@]2(C)C(c4cccc5cnccc45)=CC[C@@H]32)C1

Known inactive molecules:

c1cc(-c2c[nH]c3cnccc23)ccn1,
COc1ccc2c3ccnc(C(F)(F)F)c3n(CCCCN)c2c1,
CNS(=O)(=O)c1ccc(N(C)C)c(Nc2ncnc3cc(OC)c(OC)cc23)c1,
CN(C1CC1)S(=O)(=O)c1ccc(-c2cnc(N)c(-c3ccc4c(c3)CCNC4=O)c2)c(F)c1,
CCN1CCN(Cc2ccc(NC(=O)c3ccc(C)c(C#Cc4cnc5[nH]ccc5c4)c3)cc2C(F)(F)F)CC1,
CC(C)n1cc(-c2cc(-c3ccc(CN4CCOCC4)cc3)cnc2N)nn1,
CC(C)(O)[C@H](F)CN1Cc2cc(NC(=O)c3cnn4cccnc34)c(N3CCOCC3)cc2C1=O,
[2H]C([2H])([2H])C1(C([2H])([2H])[2H])Cn2nc(-c3ccc(F)cn3)c(-c3ccnc4[nH]ncc34)c2CO1

Predictions:

For this example, we use data for the auxiliary transport protein target CHEMBL5738. More precisely, we use the available Ki data, where molecules with a pChEMBL value greater (smaller) than 5 are considered active (inactive).
From the available data, we have selected 4 "known" active molecules, 3 "known" inactive molecules, and 10 molecules to predict.
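The pChEMBL value is the negative base-10 logarithm of the molar activity value, so a threshold of 5 corresponds to a Ki of 10 µM (10,000 nM). The conversion can be sketched as follows; the function names and the choice of nanomolar input units are assumptions made for illustration.

```python
import math


def pchembl_from_ki_nm(ki_nm):
    """Convert a Ki value given in nanomolar to -log10(Ki in molar)."""
    if ki_nm <= 0:
        raise ValueError("Ki must be positive")
    # 1 nM = 1e-9 M, hence -log10(ki_nm * 1e-9) = 9 - log10(ki_nm).
    return 9.0 - math.log10(ki_nm)


def is_active(ki_nm, threshold=5.0):
    """Apply the labeling rule of this example: active if the value exceeds the threshold."""
    return pchembl_from_ki_nm(ki_nm) > threshold
```

A molecule with a Ki of 100 nM has a value of 7 and is labeled active, while a Ki of 1 mM gives a value of 3 and is labeled inactive.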
Molecules to predict:

CC(C(=O)O)c1ccc(-c2ccccc2)c(F)c1,
O=S(=O)(O)Oc1cccc2cccc(Nc3ccccc3)c12,
CCCCCCCC/C=C\CCCCCCCC(=O)O,
C[C@]12C=CC(=O)C=C1CC[C@@H]1[C@@H]2[C@@H](O)C[C@@]2(C)[C@H]1CC[C@]2(O)C(=O)CO,
CCOC(=O)C(C)(C)Oc1ccc(Cl)cc1,
Cc1ccc(Cl)c(Nc2ccccc2C(=O)O)c1Cl,
O=C(O)Cc1ccccc1Nc1c(Cl)cccc1Cl,
CC(C)(Oc1ccc(CCNC(=O)c2ccc(Cl)cc2)cc1)C(=O)O,
O=C(c1ccccc1)c1ccc2n1CCC2C(=O)O,
CC(C)OC(=O)C(C)(C)Oc1ccc(C(=O)c2ccc(Cl)cc2)cc1

Known active molecules:

CC(C(=O)O)c1ccc(N2Cc3ccccc3C2=O)cc1,
CN1C(=O)CN=C(c2ccccc2)c2cc(Cl)ccc21,
CC(C)(Oc1ccc(C(=O)c2ccc(Cl)cc2)cc1)C(=O)O,
CC(=O)[C@H]1CC[C@H]2[C@@H]3CCC4=CC(=O)CC[C@]4(C)[C@H]3CC[C@]12C

Known inactive molecules:

CC(C)Cc1ccc(C(C)C(=O)O)cc1,
O=C1Nc2ccc(Cl)cc2C(c2ccccc2Cl)=NC1O,
C[C@@H]1C[C@H]2[C@@H]3CCC4=CC(=O)C=C[C@]4(C)[C@@]3(F)[C@@H](O)C[C@]2(C)[C@@]1(O)C(=O)CO

Predictions: