---
tags:
- protein
- ibm
- mammal
- pytorch
- transformers
library_name: biomed
license: apache-2.0
---

Protein solubility is a critical factor in both pharmaceutical research and production processes, as it can significantly impact the quality and function of a protein.

This is an example of fine-tuning `ibm/biomed.omics.bl.sm.ma-ted-400m` for protein solubility prediction (binary classification) based solely on the amino acid sequence.

The benchmark is defined in: https://academic.oup.com/bioinformatics/article/34/15/2605/4938490

The data was retrieved from: https://zenodo.org/records/1162886

## Model Summary

- **Developers:** IBM Research
- **GitHub Repository:** https://github.com/BiomedSciAI/biomed-multi-alignment
- **Paper:** TBD
- **Release Date:** Oct 28th, 2024
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Usage

Using `ibm/biomed.omics.bl.sm.ma-ted-400m` requires installing [https://github.com/BiomedSciAI/biomed-multi-alignment](https://github.com/BiomedSciAI/biomed-multi-alignment)

```
pip install git+https://github.com/BiomedSciAI/biomed-multi-alignment.git
```

A simple example for a task already supported by `ibm/biomed.omics.bl.sm.ma-ted-400m`:

```python
from fuse.data.tokenizers.modular_tokenizer.op import ModularTokenizerOp

from mammal.examples.protein_solubility.task import ProteinSolubilityTask
from mammal.keys import CLS_PRED, SCORES
from mammal.model import Mammal

# Placeholder input: an arbitrary illustrative sequence, replace it with the
# amino acid sequence (one-letter codes) you want to score
protein_seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"

# Load Model
model = Mammal.from_pretrained("ibm/biomed.omics.bl.sm.ma-ted-400m.protein_solubility")

# Load Tokenizer
tokenizer_op = ModularTokenizerOp.from_pretrained("ibm/biomed.omics.bl.sm.ma-ted-400m.protein_solubility")

# Convert the raw sequence to a MAMMAL-style sample
sample_dict = {"protein_seq": protein_seq}
sample_dict = ProteinSolubilityTask.data_preprocessing(
    sample_dict=sample_dict,
    protein_sequence_key="protein_seq",
    tokenizer_op=tokenizer_op,
    device=model.device,
)

# Run the model in generate mode
batch_dict = model.generate(
    [sample_dict],
    output_scores=True,
    return_dict_in_generate=True,
    max_new_tokens=5,
)

# Post-process the model's output
ans = ProteinSolubilityTask.process_model_output(
    tokenizer_op=tokenizer_op,
    decoder_output=batch_dict[CLS_PRED][0],
    decoder_output_scores=batch_dict[SCORES][0],
)

# Print prediction
print(f"{ans=}")
```

For more advanced usage, see our detailed example at `https://github.com/BiomedSciAI/biomed-multi-alignment`

## Citation

If you found our work useful, please consider starring the repo and citing our paper:

```
@article{TBD,
  title={TBD},
  author={IBM Research Team},
  journal={arXiv preprint arXiv:TBD},
  year={2024}
}
```
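
## Appendix: checking input sequences

The usage example predicts solubility from the raw amino acid sequence alone, so a malformed string (stray whitespace, lower-case letters, digits) may be worth catching before it reaches `ProteinSolubilityTask.data_preprocessing`. The following is a minimal sketch of such a check. The helper `clean_protein_sequence` is hypothetical and not part of the `mammal` package, and restricting input to the 20 canonical one-letter codes is an assumption: the model's tokenizer may well accept other symbols.

```python
# Hypothetical helper for illustration only; it is not part of the mammal package.
# Assumption: only the 20 canonical amino acid one-letter codes are allowed.
CANONICAL_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def clean_protein_sequence(raw: str) -> str:
    """Normalize a one-letter amino acid string and reject unexpected symbols."""
    # Drop all whitespace (for example line breaks from FASTA-style wrapping)
    # and convert to upper case.
    seq = "".join(raw.split()).upper()
    if not seq:
        raise ValueError("empty protein sequence")
    # Collect every symbol that is not one of the 20 canonical codes.
    unexpected = sorted(set(seq) - CANONICAL_AMINO_ACIDS)
    if unexpected:
        raise ValueError(f"unexpected residue symbols: {unexpected}")
    return seq


protein_seq = clean_protein_sequence("mkt ay\nIAKQR")
print(protein_seq)  # MKTAYIAKQR
```

The cleaned string can then be used as the `protein_seq` value in the sample dictionary of the usage example.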