datduong/EncodeGeneOntology

Encode Gene Ontology terms using their definitions or positions on the GO tree.

We apply the following methods to embed GO terms:

  • Definition encoder

    1. BiLSTM
    2. ELMo
    3. Transformer, based on the BERT strategy
  • Position encoder

    1. GCN
    2. Onto2vec

The key objective is to capture the relatedness of GO terms by encoding related terms into similar vectors.

Consider the example below. We would expect child-parent terms to have similar vector embeddings, whereas two unrelated terms should have different embeddings. Moreover, child-parent terms occupy the same neighborhood of the GO tree, so their position embeddings should also be similar.

[Figure: GO term example showing child-parent terms with similar embeddings and an unrelated term]
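To make this intuition concrete, here is a minimal sketch (not from this repository) that compares embedding vectors with cosine similarity. The GO IDs and the 4-dimensional vectors are placeholders; the trained models here produce higher-dimensional vectors.

```python
import torch
import torch.nn.functional as F

# Hypothetical embeddings for illustration only.
embedding = {
    "GO:0000001": torch.tensor([0.9, 0.1, 0.3, 0.2]),    # child term
    "GO:0000002": torch.tensor([0.8, 0.2, 0.3, 0.1]),    # its parent
    "GO:0000009": torch.tensor([-0.5, 0.7, -0.2, 0.4]),  # unrelated term
}

def cosine(a, b):
    # Cosine similarity between two 1-D vectors.
    return F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).item()

print(cosine(embedding["GO:0000001"], embedding["GO:0000002"]))  # high (related)
print(cosine(embedding["GO:0000001"], embedding["GO:0000009"]))  # low (unrelated)
```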

Libraries needed

pytorch, pytorch-pretrained-bert, pytorch-geometric
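A typical install might look like the following; the PyPI package names (`torch`, `pytorch-pretrained-bert`, `torch-geometric`) are our assumption, and `torch-geometric` may require extra compiled dependencies (e.g. `torch-scatter`, `torch-sparse`), so check its own install instructions.

```bash
pip install torch pytorch-pretrained-bert torch-geometric
```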

How to use the Definition and Position encoders?

We embed the definition or position of a term. The key idea is that child-parent terms often have similar definitions or positions in the GO tree, so we can embed them into comparable vectors.

All models are already trained and ready to use. You can download the embeddings here. There are several types of embeddings; you can try any of them. For example, download these files if you want to use the BiLSTM embeddings for Tasks 1 and 2 discussed in our paper.

You can also use our trained models to produce vectors for any GO definitions; see the example script here. You will have to prepare the go.obo definition input in the format shown here.

Alternatively, you can train your own embedding by following the same example script. You only need to prepare your train/dev/test datasets in the same format shown here.
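The exact layout of the downloadable embedding files is not described on this page, so the loader below is only a sketch. It assumes a word2vec-style text file (one GO ID per line followed by its vector components); adjust the parsing to match the files you actually download. The file name is hypothetical.

```python
import numpy as np

def load_go_embeddings(path):
    """Load GO term vectors from a whitespace-separated text file.

    Assumed (hypothetical) layout per line: GO:xxxxxxx v1 v2 ... vd
    """
    vectors = {}
    with open(path) as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) < 2:
                continue
            vectors[parts[0]] = np.array([float(x) for x in parts[1:]])
    return vectors

go_vec = load_go_embeddings("bilstm_go_embeddings.txt")  # hypothetical file name
print(len(go_vec), "terms loaded")
```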

Applications for Definition and Position encoders

Compare functions of proteins

Almost every protein is annotated with a set of GO terms; see, for example, the UniProt database. Once you can express each GO term as a vector, then for any 2 proteins you can compare the sets of terms annotating them. We used the Best-Match Average metric to compare 2 sets; however, there are other options to explore. Our example comparing 2 proteins is here.
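For reference, here is a small sketch of the Best-Match Average between two annotation sets, using cosine similarity between GO vectors (with `go_vec` as in the loading sketch above); the repository's own script linked above is the authoritative version.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two 1-D numpy vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_match_average(terms_a, terms_b, go_vec):
    """Best-Match Average similarity between two sets of GO terms.

    For each term in one set, take its best (maximum) similarity to the
    other set; do this in both directions and average the two means.
    """
    sims = np.array([[cosine(go_vec[a], go_vec[b]) for b in terms_b]
                     for a in terms_a])
    return 0.5 * (sims.max(axis=1).mean() + sims.max(axis=0).mean())

# Hypothetical annotation sets for two proteins:
protein1 = ["GO:0000001", "GO:0000002"]
protein2 = ["GO:0000002", "GO:0000009"]
# score = best_match_average(protein1, protein2, go_vec)
```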

Predict GO labels based on protein sequences

We can use the UniProt database to train a model that predicts GO labels for an unknown protein sequence. In our paper, we demonstrate that GO embeddings can be used to predict GO labels not included in the training data (zero-shot learning). There are two advantages. First, many machine learning methods exclude rare labels, because these methods often have problems when the training data contains very rare labels. GO embeddings allow us to adopt the zero-shot learning philosophy, where we train models on the labels in the training data but test them on new, unseen labels. Second, as the GO database is constantly being updated with new terms, we do not need to train a brand-new model with each update.
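One common way to exploit label embeddings for zero-shot prediction, sketched below, is to score each GO term by a dot product between a learned protein representation and the fixed GO vector: because the GO vectors are inputs rather than learned output weights, terms unseen during training can still be scored. This is a minimal illustration of the idea, not the exact architecture from our paper; the class name and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class ZeroShotGOPredictor(nn.Module):
    """Score GO terms for a protein via dot products with fixed GO vectors."""

    def __init__(self, protein_dim, go_dim):
        super().__init__()
        # Project protein features into the GO embedding space.
        self.project = nn.Linear(protein_dim, go_dim)

    def forward(self, protein_feat, go_matrix):
        # protein_feat: (batch, protein_dim); go_matrix: (num_terms, go_dim)
        h = self.project(protein_feat)   # (batch, go_dim)
        logits = h @ go_matrix.t()       # (batch, num_terms)
        return torch.sigmoid(logits)     # per-term probabilities

# Hypothetical usage: 1280-d protein features, 300-d GO vectors, 500 terms.
model = ZeroShotGOPredictor(1280, 300)
scores = model(torch.randn(2, 1280), torch.randn(500, 300))
```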
