SagiPolaczek moshe-raboh committed on
Commit
f2cab23
1 Parent(s): ff2db0b

Update README.md (#1)

- Update README.md (a4fc3ad4af5f0518595cd6bd6e371eb48d1047f7)


Co-authored-by: Moshe Raboh <[email protected]>

Files changed (1)
  1. README.md +88 -4
README.md CHANGED
@@ -1,8 +1,92 @@
  ---
  tags:
- - model_hub_mixin
  ---
 
- This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
- - Library: [More Information Needed]
- - Docs: [More Information Needed]
  ---
  tags:
+ - protein
+ - ibm
+ - mammal
+ - pytorch
+ - transformers
+ library_name: biomed
+ license: apache-2.0
  ---
 
+ Protein solubility is a critical factor in both pharmaceutical research and production processes, as it can significantly impact the quality and function of a protein.
+ This is an example of fine-tuning `ibm/biomed.omics.bl.sm.ma-ted-400m` for protein solubility prediction (binary classification) based solely on the amino acid sequence.
+
+ The benchmark is defined in: https://academic.oup.com/bioinformatics/article/34/15/2605/4938490
+ Data retrieved from: https://zenodo.org/records/1162886
+
+
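Since the task is framed as binary classification, one common way to score predictions against benchmark labels is plain accuracy. The sketch below is only an illustration: the function name `binary_accuracy` and the label encoding (1 = soluble, 0 = insoluble) are assumptions, not part of the model's API or of the benchmark's official evaluation code.

```python
from typing import Sequence


def binary_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the reference labels.

    Assumed encoding (hypothetical): 1 = soluble, 0 = insoluble.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(int(t == p) for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels and predictions, for illustration only.
labels = [1, 0, 1, 1, 0]
preds = [1, 0, 0, 1, 1]
print(binary_accuracy(labels, preds))  # 3 of 5 match -> 0.6
```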
+ ## Model Summary
+
+ - **Developers:** IBM Research
+ - **GitHub Repository:** https://github.com/BiomedSciAI/biomed-multi-alignment
+ - **Paper:** TBD
+ - **Release Date:** Oct 28th, 2024
+ - **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
+
+ ## Usage
+
+ Using `ibm/biomed.omics.bl.sm.ma-ted-400m` requires installing [https://github.com/BiomedSciAI/biomed-multi-alignment](https://github.com/BiomedSciAI/biomed-multi-alignment):
+
+ ```
+ pip install git+https://github.com/BiomedSciAI/biomed-multi-alignment.git
+ ```
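
After installing, it can help to confirm that the packages the usage example imports are visible to the interpreter. This is a minimal sketch using only the standard library. The helper name `is_installed` is hypothetical, and the package names `mammal` and `fuse` are taken from the import lines of the usage example.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the import system can locate a top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Top-level packages imported by the usage example in this README.
for name in ("mammal", "fuse"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either package is reported as missing, re-running the `pip install` command in the active environment is the first thing to try.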
+
+ A simple example for a task already supported by `ibm/biomed.omics.bl.sm.ma-ted-400m`:
+ ```python
+ from fuse.data.tokenizers.modular_tokenizer.op import ModularTokenizerOp
+
+ from mammal.examples.protein_solubility.task import ProteinSolubilityTask
+ from mammal.keys import CLS_PRED, SCORES
+ from mammal.model import Mammal
+
+ # Example input: an arbitrary illustrative sequence, replace it with your own
+ protein_seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
+
+ # Load Model
+ model = Mammal.from_pretrained("ibm/biomed.omics.bl.sm.ma-ted-400m.protein_solubility")
+ # Switch to inference mode
+ model.eval()
+
+ # Load Tokenizer
+ tokenizer_op = ModularTokenizerOp.from_pretrained("ibm/biomed.omics.bl.sm.ma-ted-400m.protein_solubility")
+
+ # Convert to MAMMAL style
+ sample_dict = {"protein_seq": protein_seq}
+ sample_dict = ProteinSolubilityTask.data_preprocessing(
+     sample_dict=sample_dict,
+     protein_sequence_key="protein_seq",
+     tokenizer_op=tokenizer_op,
+     device=model.device,
+ )
+
+ # Run in generate mode
+ batch_dict = model.generate(
+     [sample_dict],
+     output_scores=True,
+     return_dict_in_generate=True,
+     max_new_tokens=5,
+ )
+
+ # Post-process the model's output
+ ans = ProteinSolubilityTask.process_model_output(
+     tokenizer_op=tokenizer_op,
+     decoder_output=batch_dict[CLS_PRED][0],
+     decoder_output_scores=batch_dict[SCORES][0],
+ )
+
+ # Print prediction
+ print(f"{ans=}")
+ ```
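
The example above expects `protein_seq` to hold a plain amino acid string. Sequences copied from FASTA files or spreadsheets often carry whitespace or lower-case letters, so a small normalization step before preprocessing may be useful. The sketch below is one possible approach, not part of the MAMMAL API: the helper `clean_protein_sequence` is hypothetical, and restricting input to the 20 standard one-letter codes is an assumption that may be stricter than the tokenizer requires.

```python
# One-letter codes of the 20 standard amino acids.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def clean_protein_sequence(raw: str) -> str:
    """Remove whitespace, upper-case, and reject non-standard residue letters."""
    seq = "".join(raw.split()).upper()
    if not seq:
        raise ValueError("empty protein sequence")
    invalid = sorted(set(seq) - STANDARD_AMINO_ACIDS)
    if invalid:
        raise ValueError(f"unexpected residue letters: {''.join(invalid)}")
    return seq


# Made-up input with mixed case and line breaks, for illustration only.
protein_seq = clean_protein_sequence("mktay iakqr\nqisfv")
print(protein_seq)  # MKTAYIAKQRQISFV
```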
+
+ For more advanced usage, see our detailed example at https://github.com/BiomedSciAI/biomed-multi-alignment
+
+
+ ## Citation
+
+ If you found our work useful, please consider giving the repo a star and citing our paper:
+ ```
+ @article{TBD,
+   title={TBD},
+   author={IBM Research Team},
+   journal={arXiv preprint arXiv:TBD},
+   year={2024}
+ }
+ ```