ESM-1b

ESM-1b (paper, repository) is a transformer protein language model trained on protein sequence data without label supervision. It is pretrained on UniRef50 with an unsupervised masked language modeling (MLM) objective: the model learns to predict masked amino acids from the surrounding sequence context. This pretraining objective allows ESM-1b to learn generally useful features that transfer to downstream prediction tasks. ESM-1b has been evaluated on a variety of tasks related to protein structure and function, including remote homology detection, secondary structure prediction, contact prediction, and prediction of the effect of mutations on function, producing state-of-the-art results.

Model description

The ESM-1b model is based on the RoBERTa architecture and training procedure and is trained on the UniRef50 2018_03 database of protein sequences. Pretraining uses the raw protein sequences only and is purely unsupervised: no labels related to structure or function are given during training.

The model is trained with the masked language modeling objective. Masking follows the procedure of Devlin et al. 2019: 15% of the amino acids in the input are randomly masked, and the pass-through and random-token noise are included. One architectural difference from RoBERTa is that ESM-1b uses pre-activation layer normalization.
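The difference between the two normalization orderings can be sketched in a few lines. This is a minimal illustration, not the ESM-1b implementation: layer_norm omits the learned scale and shift, and toy_sublayer is a hypothetical stand-in for an attention or feed-forward sublayer.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance (no learned scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    """Post-LN ordering (BERT/RoBERTa style): normalize after the residual addition."""
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    """Pre-activation ordering: normalize the sublayer input, keep the residual path untouched."""
    return x + sublayer(layer_norm(x))

def toy_sublayer(h):
    """Hypothetical stand-in for attention or a feed-forward network."""
    return 0.5 * h

x = np.array([[1.0, 2.0, 3.0, 4.0]])
pre_out = pre_ln_block(x, toy_sublayer)
post_out = post_ln_block(x, toy_sublayer)
```

In the pre-activation form the input x reaches the block output through an unnormalized identity path, which is the property usually cited as making deep transformers easier to train.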

The learned representations can be used as features for downstream tasks. For example, if you have a dataset of protein activity measurements, you can fit a regression model on the features output by ESM-1b to predict the activity of new sequences. The model can also be fine-tuned.
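A regression on top of fixed features might look like the sketch below. Everything in it is a stand-in: the random vectors play the role of per-protein ESM-1b features, the activity values are synthetic, and the closed-form ridge regression is just one possible choice of regression model.

```python
import numpy as np

def add_bias(features):
    """Append a constant column so the model can learn an intercept."""
    return np.hstack([features, np.ones((features.shape[0], 1))])

def fit_ridge(features, targets, l2=1.0):
    """Closed-form ridge regression: solve (X^T X + l2*I) w = X^T y."""
    X = add_bias(features)
    reg = l2 * np.eye(X.shape[1])
    reg[-1, -1] = 0.0  # do not penalize the intercept
    return np.linalg.solve(X.T @ X + reg, X.T @ targets)

def predict(weights, features):
    return add_bias(features) @ weights

# Synthetic stand-in data: 50 "proteins" with 8-dimensional "features".
rng = np.random.default_rng(0)
train_features = rng.normal(size=(50, 8))
true_weights = rng.normal(size=8)
train_activity = train_features @ true_weights + 0.01 * rng.normal(size=50)

weights = fit_ridge(train_features, train_activity, l2=1e-3)
new_features = rng.normal(size=(5, 8))
predicted_activity = predict(weights, new_features)
```

With real data, train_features would hold one vector per protein derived from the model's hidden states, and train_activity the measured values.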

ESM-1b can infer information about the structure and function of proteins without further supervision, i.e. it is capable of zero-shot transfer to structure and function prediction. Rao et al. 2020 found that the attention heads of ESM-1b directly correspond to contacts in the 3D structure of the protein. Meier et al. 2021 found that ESM-1b can be used to score the effect of sequence variations on protein function.
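One common way to turn masked-token probabilities into a variant score is a log-odds ratio between the mutant and wild-type residues at the masked position. The sketch below illustrates that idea only. The probabilities are the top-5 values copied from the fill-mask output in the "How to use" section, and treating E as the wild-type residue is an assumption made purely for the example.

```python
import math

def log_odds_score(probs, wild_type, mutant):
    """Log-odds of the mutant vs. the wild-type residue at one masked position."""
    return math.log(probs[mutant]) - math.log(probs[wild_type])

# Top-5 probabilities at the masked position of 'QERLKSIVRILE<mask>SLGYNIVAT'.
masked_position_probs = {
    "E": 0.0933581069111824,
    "K": 0.09198431670665741,
    "S": 0.06775771081447601,
    "L": 0.0661069005727768,
    "R": 0.06330915540456772,
}

# Assumption for illustration only: E is the wild-type residue at this position.
score_e_to_k = log_odds_score(masked_position_probs, "E", "K")
score_e_to_r = log_odds_score(masked_position_probs, "E", "R")
```

A score near zero means the model finds the substitution about as plausible as the wild type; more negative scores mark substitutions the model considers less likely.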

Intended uses & limitations

The model can be used for feature extraction, fine-tuned on downstream tasks, or used directly to make inferences about the structure and function of protein sequences.

How to use

You can use this model with a pipeline for masked language modeling:

>>> from transformers import EsmForMaskedLM, EsmTokenizer, pipeline
>>> tokenizer = EsmTokenizer.from_pretrained("facebook/esm-1b", do_lower_case=False)
>>> model = EsmForMaskedLM.from_pretrained("facebook/esm-1b")
>>> unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
>>> unmasker('QERLKSIVRILE<mask>SLGYNIVAT')

[{'sequence': 'Q E R L K S I V R I L E E S L G Y N I V A T',
  'score': 0.0933581069111824,
  'token': 9,
  'token_str': 'E'},
 {'sequence': 'Q E R L K S I V R I L E K S L G Y N I V A T',
  'score': 0.09198431670665741,
  'token': 15,
  'token_str': 'K'},
 {'sequence': 'Q E R L K S I V R I L E S S L G Y N I V A T',
  'score': 0.06775771081447601,
  'token': 8,
  'token_str': 'S'},
 {'sequence': 'Q E R L K S I V R I L E L S L G Y N I V A T',
  'score': 0.0661069005727768,
  'token': 4,
  'token_str': 'L'},
 {'sequence': 'Q E R L K S I V R I L E R S L G Y N I V A T',
  'score': 0.06330915540456772,
  'token': 10,
  'token_str': 'R'}]

Here is how to use this model to get the features of a given protein sequence in PyTorch:

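from transformers import EsmForMaskedLM, EsmTokenizer

tokenizer = EsmTokenizer.from_pretrained("facebook/esm-1b", do_lower_case=False)
model = EsmForMaskedLM.from_pretrained("facebook/esm-1b")
sequence_example = "QERLKSIVRILE"
encoded_input = tokenizer(sequence_example, return_tensors="pt")
# Request hidden states; by default the masked-LM model returns only logits
output = model(**encoded_input, output_hidden_states=True)
# Last-layer per-token representations, shape (batch, tokens, hidden size)
features = output.hidden_states[-1]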
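Per-token representations are often reduced to a single vector per protein before being passed to a downstream model. Averaging over residue positions is one simple option, sketched here with NumPy on a stand-in array rather than real model output. Which positions count as special tokens is an assumption in this example.

```python
import numpy as np

def mean_pool(hidden_states, residue_mask):
    """Average per-token vectors over the positions where residue_mask is 1."""
    mask = np.asarray(residue_mask, dtype=float)[:, None]
    return (hidden_states * mask).sum(axis=0) / mask.sum()

# Stand-in for the last hidden layer of one sequence: 5 token positions, width 4.
hidden_states = np.arange(20, dtype=float).reshape(5, 4)
# Assumption: position 0 is <cls> and is excluded from the average.
residue_mask = [0, 1, 1, 1, 1]
protein_vector = mean_pool(hidden_states, residue_mask)
```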

Training data

The ESM-1b model was pretrained on UniRef50 2018_03, a dataset of approximately 30 million protein sequences.

Training procedure

Preprocessing

The protein sequences are uppercased and tokenized with amino acids separated by a single space, using a vocabulary of size 21. The model inputs then take the form:

<cls> Protein Sequence A

During training, sequences longer than 1023 tokens (without CLS) are randomly cropped to a length of 1023.
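The cropping step can be sketched as follows. The text above only says that long sequences are randomly cropped, so choosing a contiguous window with a uniformly random start is an assumption of this sketch, and random_crop is a hypothetical helper name.

```python
import random

MAX_TOKENS = 1023  # crop length stated above (excluding <cls>)

def random_crop(sequence, max_len=MAX_TOKENS, rng=random):
    """Return the sequence unchanged if it fits, else a random contiguous window of max_len."""
    if len(sequence) <= max_len:
        return sequence
    start = rng.randint(0, len(sequence) - max_len)  # randint is inclusive on both ends
    return sequence[start:start + max_len]

rng = random.Random(0)
long_sequence = "ACDEFGHIKLMNPQRSTVWY" * 100  # synthetic 2000-residue sequence
cropped = random_crop(long_sequence, rng=rng)
short = random_crop("QERLKSIVRILE", rng=rng)
```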

The details of the masking procedure for each sequence follow Devlin et al. 2019:

15% of the amino acids are masked.
In 80% of the cases, the masked amino acids are replaced by <mask>.
In 10% of the cases, the masked amino acids are replaced by a random amino acid different from the one they replace.
In the 10% remaining cases, the masked amino acids are left as is.
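The masking rules above can be sketched in plain Python. This is an illustration under stated assumptions, not the training code: each position is selected independently with probability 0.15 (rather than selecting exactly 15% of positions), and random replacements are drawn from the 20 standard amino acids.

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues
MASK_TOKEN = "<mask>"

def corrupt(tokens, rng, select_prob=0.15):
    """Return (corrupted tokens, indices of the positions the model must predict)."""
    corrupted = list(tokens)
    targets = []
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position not selected, so it is not a prediction target
        targets.append(i)
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN  # 80%: replace with <mask>
        elif roll < 0.9:
            # 10%: replace with a random residue different from the original
            corrupted[i] = rng.choice([a for a in AMINO_ACIDS if a != token])
        # remaining 10%: leave the original residue in place
    return corrupted, targets

rng = random.Random(0)
tokens = list("QERLKSIVRILESLGYNIVAT")
corrupted, targets = corrupt(tokens, rng)
```

The loss is computed only at the returned target indices, including those where the residue was left unchanged.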

Pretraining

The model was trained on 128 NVIDIA V100 GPUs for 500K updates with a sequence length of 1024 (131,072 tokens per batch). The optimizer is Adam (betas=[0.9, 0.999]) with a learning rate of 1e-4, no weight decay, a learning rate warmup of 16k steps, and inverse square root decay of the learning rate afterwards.
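The learning rate schedule and batch arithmetic can be sketched as below. The warmup shape is not specified above, so a linear ramp from zero is assumed here, and learning_rate is a hypothetical helper rather than the function used in training.

```python
PEAK_LR = 1e-4
WARMUP_STEPS = 16_000

def learning_rate(step, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS):
    """Linear warmup to peak_lr (assumed), then decay proportional to 1/sqrt(step)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5

# 131,072 tokens per batch at sequence length 1024
sequences_per_batch = 131_072 // 1024

lr_mid_warmup = learning_rate(8_000)   # halfway through warmup
lr_at_peak = learning_rate(16_000)     # end of warmup
lr_after_4x = learning_rate(64_000)    # four times the warmup length
```

Under these assumptions the rate halves each time the step count quadruples after warmup, and each batch holds 128 full-length sequences.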