---
tags:
- protein
- ibm
- mammal
- pytorch
- transformers
library_name: biomed
license: apache-2.0
---

Protein solubility is a critical factor in both pharmaceutical research and production processes, as it can significantly impact the quality and function of a protein.

This is an example of finetuning `ibm/biomed.omics.bl.sm.ma-ted-400m` for protein solubility prediction (binary classification) based solely on the amino acid sequence.

The benchmark is defined in: https://academic.oup.com/bioinformatics/article/34/15/2605/4938490

The data was retrieved from: https://zenodo.org/records/1162886

## Model Summary

- **Developers:** IBM Research
- **GitHub Repository:** https://github.com/BiomedSciAI/biomed-multi-alignment
- **Paper:** TBD
- **Release Date:** Oct 28th, 2024
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Usage

Using `ibm/biomed.omics.bl.sm.ma-ted-400m` requires installing [https://github.com/BiomedSciAI/biomed-multi-alignment](https://github.com/BiomedSciAI/biomed-multi-alignment):

```
pip install git+https://github.com/BiomedSciAI/biomed-multi-alignment.git
```

A simple example for a task already supported by `ibm/biomed.omics.bl.sm.ma-ted-400m`:

```python
from fuse.data.tokenizers.modular_tokenizer.op import ModularTokenizerOp

from mammal.examples.protein_solubility.task import ProteinSolubilityTask
from mammal.keys import CLS_PRED, SCORES
from mammal.model import Mammal

# Illustrative input only: replace with the amino acid sequence to classify
protein_seq = "NLMKRCTRGFRKLGKCTTLEEEKCKTLYPRGQCTCSDSKMNTHSCDCKSC"

# Load the finetuned model and switch it to evaluation mode
model = Mammal.from_pretrained("ibm/biomed.omics.bl.sm.ma-ted-400m.protein_solubility")
model.eval()

# Load the matching tokenizer
tokenizer_op = ModularTokenizerOp.from_pretrained("ibm/biomed.omics.bl.sm.ma-ted-400m.protein_solubility")

# Convert the raw sequence to a MAMMAL-style sample
sample_dict = {"protein_seq": protein_seq}
sample_dict = ProteinSolubilityTask.data_preprocessing(
    sample_dict=sample_dict,
    protein_sequence_key="protein_seq",
    tokenizer_op=tokenizer_op,
    device=model.device,
)

# Run the model in generate mode
batch_dict = model.generate(
    [sample_dict],
    output_scores=True,
    return_dict_in_generate=True,
    max_new_tokens=5,
)

# Post-process the model's output
ans = ProteinSolubilityTask.process_model_output(
    tokenizer_op=tokenizer_op,
    decoder_output=batch_dict[CLS_PRED][0],
    decoder_output_scores=batch_dict[SCORES][0],
)

# Print the prediction
print(f"{ans=}")
```
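The example above expects `protein_seq` to be a plain string of one-letter amino acid codes. Below is a minimal sketch of a pre-check that could run before building `sample_dict`. It assumes that only the 20 standard residue codes should be accepted; whether the tokenizer also handles other codes is not stated here. `clean_protein_sequence` is a hypothetical helper and is not part of the `mammal` package.

```python
# Hypothetical helper (not part of the mammal package): normalize and
# validate a protein sequence before it is passed to data_preprocessing.
# Assumption: only the 20 standard one-letter amino acid codes are allowed.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def clean_protein_sequence(raw: str) -> str:
    """Return the sequence upper-cased with all whitespace removed.

    Raises ValueError if the result is empty or contains characters
    outside the 20 standard one-letter amino acid codes.
    """
    seq = "".join(raw.split()).upper()
    if not seq:
        raise ValueError("protein sequence is empty")
    unexpected = sorted(set(seq) - STANDARD_AMINO_ACIDS)
    if unexpected:
        raise ValueError(f"unexpected residue codes: {unexpected}")
    return seq


protein_seq = clean_protein_sequence("nlmkrctrgf rklgkcttle\n")
print(protein_seq)  # NLMKRCTRGFRKLGKCTTLE
```

Failing early on malformed input keeps errors close to their source instead of surfacing later as a hard-to-interpret prediction.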

For more advanced usage, see our detailed example at: https://github.com/BiomedSciAI/biomed-multi-alignment
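
When predictions are compared against labeled sequences, two common summary metrics for a binary classifier are accuracy and the Matthews correlation coefficient (MCC). The sketch below is a plain-Python illustration, not the benchmark's official evaluation code. It assumes labels are already encoded as integers, with `1` for soluble and `0` for insoluble; that encoding and the name `binary_metrics` are assumptions made for this example, and mapping the model output to such labels is left to the reader.

```python
import math
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy and MCC for 0/1 labels (assumed: 1 = soluble)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one label is required")

    # Confusion-matrix counts
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)

    # MCC is undefined when any marginal count is zero; report 0.0 then
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"accuracy": accuracy, "mcc": mcc}


# Toy labels for illustration only, not benchmark data
metrics = binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1])
print(metrics)
```

MCC is often reported alongside accuracy because it accounts for all four confusion-matrix cells and is less sensitive to class imbalance.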

## Citation

If you found our work useful, please consider giving the repo a star and citing our paper:

```
@article{TBD,
  title={TBD},
  author={IBM Research Team},
  journal={arXiv preprint arXiv:TBD},
  year={2024}
}
```