A library to analyze and explore protein sequences using BERT models


pip install berteome

Getting started

berteome uses BERT masked language models to predict the amino acid probabilities at every residue in a protein sequence.

The main berteome library can be imported as follows:

from berteome import berteome

The modelLoader class can be used to show what models are supported by berteome.

berteome_models = berteome.modelLoader()

All of these models are distributed through Hugging Face, and berteome makes extensive use of its API.

Load model

To load the prot_bert model, run the following:

bert_tokenizer, bert_model = berteome_models.load_model("Rostlab/prot_bert")

The language models utilized by berteome were trained using a masked token approach. In this approach, a random amino acid is masked in a protein and the model is trained to predict what that amino acid should be. The models are trained on an incredibly large number of protein sequences, to the point that they begin to learn the language of protein sequence space as we currently know it. For instance, a model can start to learn which residues are unlikely to exist at a given point in a protein. Using these models, you can place a mask at any residue in the protein, and the model will generate a probability score for every amino acid that could go there.
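As a rough illustration of the masking step, the sketch below builds one masked input per residue. It assumes the space-separated residue format and the `[MASK]` token used by BERT-style tokenizers such as prot_bert. The helper name `masked_inputs` is hypothetical and is not part of berteome.

```python
def masked_inputs(sequence, mask_token="[MASK]"):
    """Return one space-separated input string per residue, with that residue masked.

    Assumption: the tokenizer expects residues separated by spaces and uses
    `mask_token` as its mask symbol.
    """
    residues = list(sequence)
    inputs = []
    for i in range(len(residues)):
        tokens = residues.copy()
        tokens[i] = mask_token  # hide residue i so the model has to predict it
        inputs.append(" ".join(tokens))
    return inputs


for line in masked_inputs("MENDEL"):
    print(line)
```

Each of these strings would then be passed through the tokenizer and model, and the probabilities at the masked position read off.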

berteome lets you investigate these predictions for a given protein sequence by masking every residue in turn and predicting the probabilities of all possible amino acids at that position. The result is an easy-to-use pandas dataframe. To build this dataframe for a very simple peptide sequence (MENDEL), do the following:

mendel_berteome = berteome.modelPredDF("MENDEL", bert_tokenizer, bert_model)

This dataframe is where the true berteomic magic begins. Each row corresponds to one residue in the input protein sequence.

Here is a breakdown of some of the columns in the dataframe.

  • wt represents the actual amino acid at the given position
  • wtIndex is a one-based index of the residue, which makes plotting easier. It may not stick around forever, though.
  • wtScore is a very interesting and important value. For a given protein, one would hope that the model predicts the masked residue to be the same as the wild type in the sequence. This column gives the probability that the model assigned to the wild-type residue at that position.
  • n_effective is a measure of site-specific variability that gives a proxy for how many amino acids could occupy that site. It is defined as $N_{eff}(i) = \exp\left(-\sum_{j} p_{ji} \ln p_{ji}\right)$, where $p_{ji}$ is the predicted probability of amino acid $j$ at position $i$
  • topAA is the top scoring amino acid at a given position in the protein
  • topAAscore is the score of the top scoring amino acid at a given position in the protein

The remaining columns are simply the probabilities of each possible amino acid generated by the model when placing a mask at every residue in the input protein.
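The n_effective definition above is the exponential of the Shannon entropy of the per-position probabilities. A minimal sketch of that formula follows. The function below is illustrative only and is not berteome's own implementation.

```python
import math


def n_effective(probs):
    """Exponential of the Shannon entropy of a probability distribution.

    Zero probabilities are skipped, following the convention 0 * ln(0) = 0.
    """
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return math.exp(entropy)


uniform = [1 / 20] * 20         # every amino acid equally likely
certain = [1.0] + [0.0] * 19    # a single amino acid gets all the probability

print(n_effective(uniform))
print(n_effective(certain))
```

A uniform distribution over 20 amino acids gives a value of approximately 20, while a position where the model is certain gives 1.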

Score sequence

The average scores for the wild-type sequence and for the top-scoring sequence are computed with the scoreSeq() function and are available as attributes of the dataframe:

print(mendel_berteome.wtSeq, mendel_berteome.wtSeqScore)
MENDEL 0.06513695385878104
print(mendel_berteome.topAASeq, mendel_berteome.topAASeqScore)
ELELLE 0.127035315825644

To score another protein sequence of the same length as the input, pass it to scoreSeq().
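To make the idea of an average sequence score concrete, here is a small sketch that assumes the score is the arithmetic mean of the per-position probabilities of the residues in the sequence. The function name, the input layout, and the numbers are all hypothetical. The sketch does not reproduce scoreSeq() itself.

```python
def score_sequence(sequence, position_probs):
    """Mean probability of the residues in `sequence`.

    `position_probs` holds one {amino_acid: probability} table per position.
    Assumption: the sequence score is the plain arithmetic mean.
    """
    if len(sequence) != len(position_probs):
        raise ValueError("sequence and probability tables must have the same length")
    scores = [probs.get(aa, 0.0) for aa, probs in zip(sequence, position_probs)]
    return sum(scores) / len(scores)


# Made-up probabilities for a three-residue peptide (illustrative only)
toy_probs = [
    {"M": 0.5, "E": 0.25, "L": 0.25},
    {"M": 0.25, "E": 0.5, "L": 0.25},
    {"M": 0.25, "E": 0.25, "L": 0.5},
]

print(score_sequence("MEL", toy_probs))
print(score_sequence("LLL", toy_probs))
```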


Amino acid correlation

To investigate how correlated the predictions for the different amino acids are with each other in a given berteome dataframe, use aa_correlation() to generate a correlation dataframe.
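One way to think about such a correlation table is as the pairwise Pearson correlation between amino acid probability columns across positions. The sketch below assumes Pearson correlation, which may differ from what aa_correlation() actually computes. It uses made-up numbers and hypothetical names.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (both need non-zero variance)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up per-position probabilities for three amino acids (illustrative only)
columns = {
    "D": [0.10, 0.30, 0.20, 0.40],
    "E": [0.20, 0.60, 0.40, 0.80],  # rises and falls together with D
    "L": [0.40, 0.20, 0.30, 0.10],  # moves opposite to D
}

# Nested dict playing the role of a correlation dataframe
corr = {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}
print(corr["D"]["E"])
print(corr["D"]["L"])
```

In this toy data D and E are perfectly correlated and D and L are perfectly anti-correlated, up to floating-point rounding.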

Most probable variants

berteome can also be used to generate single residue substitution variants for the top k amino acids for a given residue in a protein. To generate the top 3 mutational variants for MENDEL the generate submodule can be loaded and used as follows:

from berteome import generate
generate.top_k_variants(mendel_berteome, 3)

This returns a dataframe with L x k possible single amino acid variants.

  • sub is the substitution id that indicates which residue was substituted with what amino acid, following the pattern {residue_number}sub{substituted_amino_acid}
  • seq is the new variant sequence
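The L x k enumeration can be sketched in plain Python. The helper below is hypothetical and works on a list of per-position probability tables instead of a berteome dataframe. Whether berteome excludes the wild-type residue from the top k is not stated here, so this sketch keeps it.

```python
def top_k_single_variants(sequence, position_probs, k):
    """List single-residue substitutions for the k most probable amino acids per position.

    Assumption: the wild-type residue is not filtered out, so a "variant" can
    be identical to the input sequence.
    """
    variants = []
    for i, probs in enumerate(position_probs):
        # Amino acids at this position, most probable first
        ranked = sorted(probs, key=probs.get, reverse=True)[:k]
        for aa in ranked:
            seq = sequence[:i] + aa + sequence[i + 1:]
            variants.append({"sub": f"{i + 1}sub{aa}", "seq": seq})
    return variants


# Made-up probabilities for a two-residue peptide (illustrative only)
toy_probs = [
    {"M": 0.6, "L": 0.3, "V": 0.1},
    {"E": 0.2, "D": 0.5, "K": 0.3},
]

for variant in top_k_single_variants("ME", toy_probs, 2):
    print(variant["sub"], variant["seq"])
```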

Random sequences

If you’d like to take the amino acid probabilities at each residue position to randomly generate proteins from the probability dataframe provided by berteome, you can use n_random_seqs

generate.n_random_seqs(mendel_berteome, 10)
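Conceptually, this kind of generation draws one amino acid per position, weighted by the predicted probabilities. The following is a minimal sketch of that idea using only the standard library. The function name, the `seed` parameter, and the toy numbers are assumptions and do not describe the internals of n_random_seqs.

```python
import random


def sample_sequences(position_probs, n, seed=None):
    """Draw n sequences, sampling each position independently from its probabilities."""
    rng = random.Random(seed)  # a local generator keeps runs reproducible
    seqs = []
    for _ in range(n):
        residues = [
            rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
            for probs in position_probs
        ]
        seqs.append("".join(residues))
    return seqs


# Made-up probabilities for a three-residue peptide (illustrative only)
toy_probs = [
    {"M": 0.7, "L": 0.3},
    {"E": 0.5, "D": 0.5},
    {"N": 0.9, "Q": 0.1},
]

for seq in sample_sequences(toy_probs, 5, seed=0):
    print(seq)
```

Because positions are sampled independently, this ignores any dependencies between residues.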

Plotting helpers are available in the berteome_plot submodule:

from berteome import berteome_plot

If you would like to visualize how wtScore varies across the sequence, do the following:

[Figure: wtScore at each residue position]

Additionally, you can plot n_effective to visualize sites that the model infers as having a lower likelihood of possible substitutions.

[Figure: n_effective at each residue position]

berteome also provides a method for visually inspecting the correlations between the amino acid predictions.

[Figure: clustered heatmap of the amino acid prediction correlations]

If you would like to view the berteome predictions in the form of a seqlogo, that can also be accomplished! Doing so may require installing a few additional dependencies, along the lines of:

!apt install ghostscript
!apt-get install -y pdf2svg


To build the library, run the following:

nbdev_export

Then, pip install it in a development environment:

pip install -e '.[dev]'

I do quite a bit of work on a Chromebook, which allows for working on GitHub through Codespaces and also on Google Colab. To install a particular commit hash of berteome, you can do the following:

!pip uninstall berteome
