---
tags:
- protein
- ibm
- mammal
- pytorch
- transformers
library_name: biomed
license: apache-2.0
---
Protein solubility is a critical factor in both pharmaceutical research and production processes, as it can significantly impact the quality and function of a protein.
This is an example of fine-tuning `ibm/biomed.omics.bl.sm.ma-ted-400m` for protein solubility prediction (binary classification) based solely on the amino acid sequence.
The benchmark is defined in: https://academic.oup.com/bioinformatics/article/34/15/2605/4938490
The data was retrieved from: https://zenodo.org/records/1162886
## Model Summary
- **Developers:** IBM Research
- **GitHub Repository:** https://github.com/BiomedSciAI/biomed-multi-alignment
- **Paper:** TBD
- **Release Date:** Oct 28th, 2024
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
## Usage
Using `ibm/biomed.omics.bl.sm.ma-ted-400m` requires installing [https://github.com/BiomedSciAI/biomed-multi-alignment](https://github.com/BiomedSciAI/biomed-multi-alignment):
```
pip install git+https://github.com/BiomedSciAI/biomed-multi-alignment.git
```
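To confirm that the installation succeeded before loading any weights, a quick check along the following lines may help. It is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper, and the names `mammal` and `fuse` are taken from the imports in the usage example, on the assumption that the install provides both.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# `mammal` and `fuse` are the top-level packages imported by the usage example.
missing = missing_modules(["mammal", "fuse"])
print("missing:", missing if missing else "none")
```

If either name is reported as missing, re-running the `pip install` command in the active environment is the first thing to try.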
A simple example for a task already supported by `ibm/biomed.omics.bl.sm.ma-ted-400m`:
```python
from fuse.data.tokenizers.modular_tokenizer.op import ModularTokenizerOp
from mammal.examples.protein_solubility.task import ProteinSolubilityTask
from mammal.keys import CLS_PRED, SCORES
from mammal.model import Mammal
# Load Model
model = Mammal.from_pretrained("ibm/biomed.omics.bl.sm.ma-ted-400m.protein_solubility")
# Load Tokenizer
tokenizer_op = ModularTokenizerOp.from_pretrained("ibm/biomed.omics.bl.sm.ma-ted-400m.protein_solubility")
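# Input: an amino acid sequence in one-letter code.
# The sequence below is an arbitrary placeholder; replace it with your own.
protein_seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVV"
# Convert to MAMMAL style
sample_dict = {"protein_seq": protein_seq}
sample_dict = ProteinSolubilityTask.data_preprocessing(
    sample_dict=sample_dict,
    protein_sequence_key="protein_seq",
    tokenizer_op=tokenizer_op,
    device=model.device,
)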
# Run in generate mode
batch_dict = model.generate(
    [sample_dict],
    output_scores=True,
    return_dict_in_generate=True,
    max_new_tokens=5,
)
# Post-process the model's output
ans = ProteinSolubilityTask.process_model_output(
    tokenizer_op=tokenizer_op,
    decoder_output=batch_dict[CLS_PRED][0],
    decoder_output_scores=batch_dict[SCORES][0],
)
# Print prediction
print(f"{ans=}")
```
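Because the prediction relies solely on the amino acid sequence, it can be worth normalizing the input string before assigning it to `protein_seq`. The sketch below is one possible approach and is not part of the `mammal` package: `clean_protein_sequence` is a hypothetical helper, and restricting the input to the 20 standard one-letter residue codes is an assumption (the tokenizer may accept additional symbols).

```python
# Hypothetical helper, not part of the mammal package.
# Assumption: only the 20 standard one-letter amino acid codes are accepted.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def clean_protein_sequence(raw: str) -> str:
    """Remove whitespace, uppercase, and reject non-standard residue letters."""
    seq = "".join(raw.split()).upper()
    if not seq:
        raise ValueError("empty protein sequence")
    invalid = sorted(set(seq) - STANDARD_AMINO_ACIDS)
    if invalid:
        raise ValueError(f"unexpected residue letters: {invalid}")
    return seq


# Whitespace and line breaks (e.g. from a copied FASTA body) are dropped.
protein_seq = clean_protein_sequence("mkt ayi\nAKQR")
print(protein_seq)  # MKTAYIAKQR
```

Failing early on malformed input keeps such errors out of the tokenization and generation steps, where they would be harder to trace.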
For more advanced usage, see our detailed example at https://github.com/BiomedSciAI/biomed-multi-alignment
## Citation
If you found our work useful, please consider giving the repo a star and citing our paper:
```
@article{TBD,
title={TBD},
author={IBM Research Team},
journal={arXiv preprint arXiv:TBD},
year={2024}
}
```