"""
This file includes all the constant content shown in the app
"""
# --------------------------------------------------------------------------------------
summary_text = ('''
This application allows you to make **activity predictions** for
**biological targets** for which you have only **limited knowledge** in
terms of known active and inactive molecules.
**Provide** via the sidebar:\n
- some active molecules,
- some inactive molecules, and
- molecules you want to predict.
Hit **Predict** and explore the predictions!
For more **information** about the **model** and **how to provide the
molecules**, please visit the **Additional Information** tab.
If you encounter any problems, we would be glad if you could report them
to us: **[email protected]**.
''')
mhnfs_text = ('''
<div style="text-align: justify">
<b>MHNfs</b> is a few-shot drug discovery model which consists of a <b>context
module</b>, a <b>cross-attention module</b>, and a <b>similarity module</b>
as described here: <a href="https://openreview.net/pdf?id=XrMWUuEevr"
target="_blank">https://openreview.net/pdf?id=XrMWUuEevr</a>.
</div>
<br>
<div style="text-align: justify">
<b>Abstract</b>. A central task in computational drug discovery is to construct
models from known active molecules to find further promising molecules for
subsequent screening. However, typically only very few active molecules are
known. Therefore, few-shot learning methods have the potential to improve the
effectiveness of this critical phase of the drug discovery process. We introduce
a new method for few-shot drug discovery. Its main idea is to enrich a molecule
representation by knowledge about known context or reference molecules. Our
novel concept for molecule representation enrichment is to associate molecules
from both the support set and the query set with a large set of reference
(context) molecules through a modern Hopfield network. Intuitively, this
enrichment step is analogous to a human expert who would associate a given
molecule with familiar molecules whose properties are known. The enrichment step
reinforces and amplifies the covariance structure of the data, while
simultaneously removing spurious correlations arising from the decoration of
molecules. Our approach is compared with other few-shot methods for drug
discovery on the FS-Mol benchmark dataset. On FS-Mol, our approach outperforms
all compared methods and therefore sets a new state of the art for few-shot
learning in drug discovery. An ablation study shows that the enrichment step of
our method is the key to improve the predictive quality. In a domain shift
experiment, we further demonstrate the robustness of our method. Code is
available at <a href="https://github.com/ml-jku/MHNfs"
target="_blank">https://github.com/ml-jku/MHNfs</a>.
</div>
<br>
<br>
''')
citation_text = '''
###
@inproceedings{
schimunek2023contextenriched,
title={Context-enriched molecule representations improve few-shot drug discovery},
author={Johannes Schimunek and Philipp Seidl and Lukas Friedrich and Daniel Kuhn and Friedrich Rippmann and Sepp Hochreiter and Günter
Klambauer},
booktitle={The Eleventh International Conference on Learning Representations},
year={2023},
url={https://openreview.net/forum?id=XrMWUuEevr}
}
'''
few_shot_learning_text = (
'''
<div style="text-align: justify">
<b>Few-shot learning</b> is a machine learning sub-field which aims to provide
predictive models for scenarios in which only little data is available.<br>
<br>
<b>MHNfs</b> is a few-shot learning model which is specifically designed for drug
discovery applications. It is built to use the input so that the available
knowledge, i.e., the known active and inactive molecules, functions as context to
predict the activity of the requested molecules. More precisely, the provided
active and inactive molecules are associated with a large set of general
molecules, called context molecules, to enrich the provided information and to
remove spurious correlations arising from the decoration of molecules. This is
analogous to a large language model which would not only use the information in
the current prompt as context but would also have access to far more
information, e.g., a prompting history.
</div>
''')
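# --------------------------------------------------------------------------------------
# Illustrative sketch, not part of the MHNfs code base: the association step described
# in few_shot_learning_text can be pictured as one retrieval step of a modern Hopfield
# network, i.e. softmax attention of a molecule embedding over the context embeddings.
# The function name, the plain-list embeddings, and the default beta are assumptions
# made for this example only; the real model uses learned embeddings and more layers.
import math


def hopfield_enrich(query, context, beta=1.0):
    """Return the softmax-weighted average of the context vectors for one query."""
    # Similarity of the query to every context molecule (scaled dot product).
    scores = [beta * sum(q * c for q, c in zip(query, ctx)) for ctx in context]
    # Numerically stable softmax over the context set.
    max_score = max(scores)
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Retrieved pattern: convex combination of the context embeddings.
    dim = len(context[0])
    return [sum(w * ctx[i] for w, ctx in zip(weights, context)) for i in range(dim)]


# With beta=0 every context molecule gets the same weight; a larger beta concentrates
# the weight on the context molecules most similar to the query.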
under_the_hood_text = ('''
<div style="text-align: justify">
The predictive model (MHNfs) used in this application was specifically designed and
trained for low-data scenarios. The model predicts whether a molecule is active or
inactive. The predicted activity value is a continuous value between 0 and 1 and,
similar to a probability, the higher (lower) the value, the more confident the model
is that the molecule is active (inactive).
The model was trained on the FS-Mol dataset, which includes 5120 tasks (roughly
5000 tasks were used for training, the rest for evaluation).
The training tasks are listed here:
<a href="https://github.com/microsoft/FS-Mol/tree/main/datasets/targets"
target="_blank">https://github.com/microsoft/FS-Mol/tree/main/datasets/targets</a>.
</div>
''')
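# --------------------------------------------------------------------------------------
# Illustrative sketch, not the app's actual post-processing: since the predicted
# activity value described in under_the_hood_text lies between 0 and 1, one simple way
# to read it is to compare it with a cut-off. The helper name and the 0.5 default are
# assumptions; the source does not prescribe a decision threshold.
def interpret_prediction(score, threshold=0.5):
    """Map a predicted activity value in [0, 1] to 'active' or 'inactive'."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"expected a value between 0 and 1, got {score}")
    return "active" if score >= threshold else "inactive"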
usage_text = ('''
<div style="text-align: justify">
To use this application, you need to provide <b>3 different sets of molecules</b>:
<ol>
<li><b>active</b> molecules: set of known active molecules,</li>
<li><b>inactive</b> molecules: set of known inactive molecules, and</li>
<li>molecules to <b>predict</b>: set of molecules you want to predict.</li>
</ol>
These three sets can be provided via the <b>sidebar</b>. The sidebar also includes two
buttons, <b>predict</b> and <b>reset</b>, to run the prediction pipeline and to
reset it.
</div>
''')
data_text = ('''
<div style="text-align: justify">
<ul>
<li> Molecules have to be provided in SMILES format</li>
<li> For each input, the maximum number of molecules which can be provided is
restricted to 20 </li>
<li> You can provide the molecules via the text boxes or via CSV upload
<ul>
<li> Text box
<ul>
<li> Replace the pseudo input by directly typing your molecules
into the text box </li>
<li> Separate the molecules with commas </li>
</ul>
</li>
<li> CSV upload
<ul>
<li> The CSV file should include a "smiles" column (both upper
and lower case "SMILES" are accepted) </li>
<li> All other columns will be ignored </li>
<li> Examples are provided here:
<div style="background-color: #efefef">
assets/example_csv/
</div>
</li>
</ul>
</li>
</ul>
</li>
</ul>
</div>
''')
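# --------------------------------------------------------------------------------------
# Illustrative sketch of the input rules listed in data_text; the app's real parsing
# code may differ. The helper names and the MAX_MOLECULES constant are hypothetical.
# Only the standard library is used, and the SMILES strings are not chemically
# validated here.
import csv
import io

MAX_MOLECULES = 20  # per-input limit stated in data_text


def _check_limit(smiles):
    """Raise if more molecules were provided than one input accepts."""
    if len(smiles) > MAX_MOLECULES:
        raise ValueError(
            f"at most {MAX_MOLECULES} molecules are allowed, got {len(smiles)}"
        )
    return smiles


def parse_text_box(text):
    """Split comma-separated SMILES from a text box and enforce the size limit."""
    return _check_limit([s.strip() for s in text.split(",") if s.strip()])


def parse_csv(content):
    """Read the smiles/SMILES column from CSV text; all other columns are ignored."""
    reader = csv.DictReader(io.StringIO(content))
    names = reader.fieldnames or []
    column = next((n for n in names if n in ("smiles", "SMILES")), None)
    if column is None:
        raise ValueError("the CSV file needs a 'smiles' or 'SMILES' column")
    values = (row[column] for row in reader)
    return _check_limit([v.strip() for v in values if v and v.strip()])


# Example: parse_text_box("CCO, c1ccccc1") returns ["CCO", "c1ccccc1"].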
trust_text = ('''
<div style="text-align: justify">
As with all machine learning models, the performance of MHNfs varies. Generally,
the model works well if the task is similar to tasks which were used to train the
model. The model performance for very different tasks is unclear and might be
poor.<br>
<br>
MHNfs was trained on the FS-Mol dataset, which includes 5120 tasks (roughly
5000 tasks were used for training, the rest for evaluation). The training tasks are
listed here: <a href="https://github.com/microsoft/FS-Mol/tree/main/datasets/targets"
target="_blank">https://github.com/microsoft/FS-Mol/tree/main/datasets/targets</a>.
</div>
''')
example_trustworthy_text = ('''
<div style="text-align: justify">
Since the predictive model has seen many kinase-related tasks during training,
the model is expected to generally perform well on kinase targets. For this example,
we use data for the target
<a href="https://www.ebi.ac.uk/chembl/target_report_card/CHEMBL5914/"
target="_blank">CHEMBL5914</a>. Notably, this specific kinase has not been seen
during training. Specifically, we use the available inhibition data, where molecules
with an inhibition value greater (smaller) than 50 % are considered active
(inactive).<br>
From the available data, we have selected 4 "known" active molecules,
8 "known" inactive molecules, and 11 molecules to predict.<br>
<b>Molecules to predict</b>:
<div style="background-color: #efefef">
FC(F)(F)c1ccc(Cl)cc1CN1CCNc2ncc(-c3ccnc(N4CCNCC4)c3)cc21,<br>
CS(=O)(=O)c1ccc(-n2nc(-c3cnc4[nH]ccc4c3)c3c(N)ncnc32)cc1,<br>
O=C(Nc1ccccc1Cl)c1cnc2ccc(C3CCNCC3)cn12.O=C(O)C(=O)O,<br>
CC(C)n1cnc2c(Nc3cccc(Cl)c3)nc(N[C@@H]3CCCC[C@@H]3N)nc21,<br>
Nc1ncc(-c2ccc(NS(=O)(=O)C3CC3)cc2F)cc1-c1ccc2c(c1)CCNC2=O,<br>
CCN1CCN(Cc2ccc(NC(=O)c3ccc(C)c(C#Cc4cccnc4)c3)cc2C(F)(F)F)CC1,<br>
CN1CCN(c2ccc(-c3cnc4c(c3)N(Cc3cc(Cl)ccc3C(F)(F)F)CCN4)cn2)CC1,<br>
CC(C)n1nc(-c2cnc(N)c(OC(F)(F)F)c2)cc1[C@H]1[C@@H]2CN(C3COC3)C[C@@H]21,<br>
Nc1ncc(-c2cc([C@H]3[C@@H]4CN(C5COC5)C[C@@H]43)n(CC3CC3)n2)cc1C(F)(F)F,<br>
Cc1ccc(NC(=O)C2(C(=O)Nc3ccc(Nc4ncc(F)c(-c5cc(F)c6nc(C)n(C(C)C)c6c5)n4)cc3)CC2)cc1,<br>
C[C@@H](Oc1cc(-c2cnn(C3CCNCC3)c2)cnc1N)c1c(Cl)ccc(F)c1Cl
</div><br>
<b>Known active molecules</b>:
<div style="background-color: #efefef">
CC(=O)N1CCN(c2cc(-c3cnc4c(c3)N(Cc3cc(Cl)ccc3C(F)(F)F)CCN4)ccn2)CC1,<br>
CS(=O)(=O)c1cccc(Nc2nccc(-c3sc(N4CCOCC4)nc3-c3cccc(NS(=O)(=O)c4c(F)cccc4F)c3)n2)c1,<br>
COc1cnccc1Nc1nc(-c2nn(Cc3c(F)cc(OCCO)cc3F)c3ccccc23)ncc1OC,<br>
CN(C)[C@@H]1CC[C@@]2(C)[C@@H](CC[C@@H]3[C@@H]2CC[C@]2(C)C(c4cccc5cnccc45)=CC[C@@H]32)C1<br>
</div><br>
<b>Known inactive molecules</b>:
<div style="background-color: #efefef">
c1cc(-c2c[nH]c3cnccc23)ccn1,<br>
COc1ccc2c3ccnc(C(F)(F)F)c3n(CCCCN)c2c1,<br>
CNS(=O)(=O)c1ccc(N(C)C)c(Nc2ncnc3cc(OC)c(OC)cc23)c1,<br>
CN(C1CC1)S(=O)(=O)c1ccc(-c2cnc(N)c(-c3ccc4c(c3)CCNC4=O)c2)c(F)c1,<br>
CCN1CCN(Cc2ccc(NC(=O)c3ccc(C)c(C#Cc4cnc5[nH]ccc5c4)c3)cc2C(F)(F)F)CC1,<br>
CC(C)n1cc(-c2cc(-c3ccc(CN4CCOCC4)cc3)cnc2N)nn1,<br>
CC(C)(O)[C@H](F)CN1Cc2cc(NC(=O)c3cnn4cccnc34)c(N3CCOCC3)cc2C1=O,<br>
[2H]C([2H])([2H])C1(C([2H])([2H])[2H])Cn2nc(-c3ccc(F)cn3)c(-c3ccnc4[nH]ncc34)c2CO1<br>
</div><br>
<b>Predictions</b>:<br>
</div>
''')
example_nottrustworthy_text = ('''
<div style="text-align: justify">
For this example, we use data for the auxiliary transport protein target
<a href="https://www.ebi.ac.uk/chembl/target_report_card/CHEMBL5738/"
target="_blank">CHEMBL5738</a>. Specifically, we use the available Ki data, where
molecules with a pChEMBL value greater (smaller) than 5 are considered
active (inactive).<br>
From the available data, we have selected 4 "known" active molecules,
3 "known" inactive molecules, and 10 molecules to predict.<br>
<b>Molecules to predict</b>:
<div style="background-color: #efefef">
CC(C(=O)O)c1ccc(-c2ccccc2)c(F)c1,<br>
O=S(=O)(O)Oc1cccc2cccc(Nc3ccccc3)c12,<br>
CCCCCCCC/C=C\CCCCCCCC(=O)O,<br>
C[C@]12C=CC(=O)C=C1CC[C@@H]1[C@@H]2[C@@H](O)C[C@@]2(C)[C@H]1CC[C@]2(O)C(=O)CO,<br>
CCOC(=O)C(C)(C)Oc1ccc(Cl)cc1,<br>
Cc1ccc(Cl)c(Nc2ccccc2C(=O)O)c1Cl,<br>
O=C(O)Cc1ccccc1Nc1c(Cl)cccc1Cl,<br>
CC(C)(Oc1ccc(CCNC(=O)c2ccc(Cl)cc2)cc1)C(=O)O,<br>
O=C(c1ccccc1)c1ccc2n1CCC2C(=O)O,<br>
CC(C)OC(=O)C(C)(C)Oc1ccc(C(=O)c2ccc(Cl)cc2)cc1<br>
</div><br>
<b>Known active molecules</b>:
<div style="background-color: #efefef">
CC(C(=O)O)c1ccc(N2Cc3ccccc3C2=O)cc1,<br>
CN1C(=O)CN=C(c2ccccc2)c2cc(Cl)ccc21,<br>
CC(C)(Oc1ccc(C(=O)c2ccc(Cl)cc2)cc1)C(=O)O,<br>
CC(=O)[C@H]1CC[C@H]2[C@@H]3CCC4=CC(=O)CC[C@]4(C)[C@H]3CC[C@]12C
</div><br>
<b>Known inactive molecules</b>:
<div style="background-color: #efefef">
CC(C)Cc1ccc(C(C)C(=O)O)cc1,<br>
O=C1Nc2ccc(Cl)cc2C(c2ccccc2Cl)=NC1O,<br>
C[C@@H]1C[C@H]2[C@@H]3CCC4=CC(=O)C=C[C@]4(C)[C@@]3(F)[C@@H](O)C[C@]2(C)[C@@]1(O)C(=O)CO
</div><br>
<b>Predictions</b>:<br>
</div>
''')
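# --------------------------------------------------------------------------------------
# Illustrative sketch of the labelling rule used in the two examples above: a measured
# value greater than the cut-off counts as active, a smaller one as inactive (50 %
# inhibition for CHEMBL5914, a pChEMBL value of 5 for CHEMBL5738). The helper name is
# hypothetical, and returning None for a value equal to the cut-off is an assumption,
# since the example texts do not cover that case.
def label_by_cutoff(value, cutoff):
    """Return 'active', 'inactive', or None when the value equals the cut-off."""
    if value > cutoff:
        return "active"
    if value < cutoff:
        return "inactive"
    return None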