ibm/biomed.omics.bl.sm.ma-ted-458m.protein_solubility

Safetensors

PyTorch

biomed-multi-alignment


SagiPolaczek

moshe-raboh committed 10 days ago

Commit f2cab23 (1 parent: ff2db0b)

Update README.md (#1)


- Update README.md (a4fc3ad4af5f0518595cd6bd6e371eb48d1047f7)

Co-authored-by: Moshe Raboh <[email protected]>

Files changed (1)

README.md +88 -4


@@ -1,8 +1,92 @@
 ---
 tags:
-- model_hub_mixin
 ---
-This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
-- Library: [More Information Needed]
-- Docs: [More Information Needed]

 ---
 tags:
+- protein
+- ibm
+- mammal
+- pytorch
+- transformers
+library_name: biomed
+license: apache-2.0
 ---
+Protein solubility is a critical factor in both pharmaceutical research and production processes, as it can significantly impact the quality and function of a protein.
+This model is an example of fine-tuning `ibm/biomed.omics.bl.sm.ma-ted-400m` for protein solubility prediction (binary classification) based solely on the amino acid sequence.
+The benchmark is defined in: https://academic.oup.com/bioinformatics/article/34/15/2605/4938490
+Data retrieved from: https://zenodo.org/records/1162886
+## Model Summary
+- **Developers:** IBM Research
+- **GitHub Repository:** https://github.com/BiomedSciAI/biomed-multi-alignment
+- **Paper:** TBD
+- **Release Date:** Oct 28th, 2024
+- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
+## Usage
+Using `ibm/biomed.omics.bl.sm.ma-ted-400m` requires installing [https://github.com/BiomedSciAI/biomed-multi-alignment](https://github.com/BiomedSciAI/biomed-multi-alignment):
+```
+pip install git+https://github.com/BiomedSciAI/biomed-multi-alignment.git
+```
+A simple example for a task already supported by `ibm/biomed.omics.bl.sm.ma-ted-400m`:
+```python
+from fuse.data.tokenizers.modular_tokenizer.op import ModularTokenizerOp
+from mammal.examples.protein_solubility.task import ProteinSolubilityTask
+from mammal.keys import CLS_PRED, SCORES
+from mammal.model import Mammal
+# Load Model
+model = Mammal.from_pretrained("ibm/biomed.omics.bl.sm.ma-ted-400m.protein_solubility")
+# Switch to inference mode
+model.eval()
+# Load Tokenizer
+tokenizer_op = ModularTokenizerOp.from_pretrained("ibm/biomed.omics.bl.sm.ma-ted-400m.protein_solubility")
+# Placeholder amino acid sequence for illustration; replace with the protein of interest
+protein_seq = "MKTAYIAKQRQISFVKSHFSRQ"
+# convert to MAMMAL style
+sample_dict = {"protein_seq": protein_seq}
+sample_dict = ProteinSolubilityTask.data_preprocessing(
+    sample_dict=sample_dict,
+    protein_sequence_key="protein_seq",
+    tokenizer_op=tokenizer_op,
+    device=model.device,
+)
+# running in generate mode
+batch_dict = model.generate(
+    [sample_dict],
+    output_scores=True,
+    return_dict_in_generate=True,
+    max_new_tokens=5,
+)
+# Post-process the model's output
+ans = ProteinSolubilityTask.process_model_output(
+    tokenizer_op=tokenizer_op,
+    decoder_output=batch_dict[CLS_PRED][0],
+    decoder_output_scores=batch_dict[SCORES][0],
+)
+# Print prediction
+print(f"{ans=}")
+```
+For more advanced usage, see the detailed examples at https://github.com/BiomedSciAI/biomed-multi-alignment
+## Citation
+If you find our work useful, please consider giving the repo a star and citing our paper:
+```
+@article{TBD,
+  title={TBD},
+  author={IBM Research Team},
+  journal={arXiv preprint arXiv:TBD},
+  year={2024}
+}
+```
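
The usage example in the README change above predicts solubility from the amino acid sequence alone, so the input string is the only thing a caller controls. The following is a minimal sketch of normalizing and checking that string before it is stored under the `"protein_seq"` key. The helper `clean_protein_sequence` and the constant `STANDARD_AMINO_ACIDS` are hypothetical names, not part of the `mammal` package, and the restriction to the 20 standard one-letter residue codes is an assumption; the README does not state how the tokenizer treats other symbols.

```python
# Hypothetical helper for illustration; not part of the mammal package.
# Assumption: inputs use the 20 standard one-letter amino acid codes.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def clean_protein_sequence(raw: str) -> str:
    """Strip whitespace, upper-case the sequence, and reject unexpected symbols."""
    seq = "".join(raw.split()).upper()
    if not seq:
        raise ValueError("empty protein sequence")
    invalid = sorted(set(seq) - STANDARD_AMINO_ACIDS)
    if invalid:
        raise ValueError(f"unexpected residue symbols: {invalid}")
    return seq


# The cleaned string can then be used as the value of "protein_seq".
protein_seq = clean_protein_sequence("mktay iakqr\nQISFV")
print(protein_seq)  # MKTAYIAKQRQISFV
```

Rejecting unknown symbols up front is a design choice for this sketch; a caller who needs ambiguous codes such as `X` would have to relax the check after confirming how the tokenizer handles them.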