datduong/EncodeGeneOntology

Encode Gene Ontology terms using their definitions or positions on the GO tree.

We apply the following methods to embed GO terms:

  • Definition encoder

    1. BiLSTM
    2. ELMo
    3. Transformer, based on the BERT strategy
  • Position encoder

    1. GCN
    2. Onto2vec

The key objective is to capture the relatedness of GO terms by encoding related terms into similar vectors.

Consider the example below. We would expect child-parent terms to have similar vector embeddings, whereas two unrelated terms should have different embeddings. Moreover, child-parent terms occupy the same neighborhood of the GO tree, so their position embeddings should also be similar.

[Figure: GO term example showing child-parent terms with similar embeddings and an unrelated term]
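To make this intuition concrete, here is a minimal sketch (not from this repository) that compares embedding vectors with cosine similarity. The GO IDs and the 4-dimensional vectors are placeholders; the trained models here produce higher-dimensional vectors.

```python
import torch
import torch.nn.functional as F

# Hypothetical embeddings for illustration only.
embedding = {
    "GO:0000001": torch.tensor([0.9, 0.1, 0.3, 0.2]),    # child term
    "GO:0000002": torch.tensor([0.8, 0.2, 0.3, 0.1]),    # its parent
    "GO:0000009": torch.tensor([-0.5, 0.7, -0.2, 0.4]),  # unrelated term
}

def cosine(a, b):
    # Cosine similarity between two 1-D vectors.
    return F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).item()

print(cosine(embedding["GO:0000001"], embedding["GO:0000002"]))  # high (related)
print(cosine(embedding["GO:0000001"], embedding["GO:0000009"]))  # low (unrelated)
```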

Libraries needed

pytorch, pytorch-pretrained-bert, pytorch-geometric
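A typical install might look like the following; the PyPI package names (`torch`, `pytorch-pretrained-bert`, `torch-geometric`) are our assumption, and `torch-geometric` may require extra compiled dependencies (e.g. `torch-scatter`, `torch-sparse`), so check its own install instructions.

```bash
pip install torch pytorch-pretrained-bert torch-geometric
```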

How to use the Definition and Position encoders?

We embed the definition or position of a term. The key idea is that child-parent terms often have similar definitions or positions in the GO tree, so we can embed them into comparable vectors.

All models are already trained and ready to use. You can download the embeddings here. There are several types of embeddings; you can try any of them. For example, download these files if you want to use the BiLSTM embeddings for Tasks 1 and 2 discussed in our paper.

You can also use our trained models to produce vectors for any GO definitions; see the example script here. You will have to prepare the go.obo definition input in the format shown here.

Alternatively, you can train your own embedding by following the same example script. You only need to prepare your train/dev/test datasets in the same format shown here.
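The exact layout of the downloadable embedding files is not described on this page, so the loader below is only a sketch. It assumes a word2vec-style text file (one GO ID per line followed by its vector components); adjust the parsing to match the files you actually download. The file name is hypothetical.

```python
import numpy as np

def load_go_embeddings(path):
    """Load GO term vectors from a whitespace-separated text file.

    Assumed (hypothetical) layout per line: GO:xxxxxxx v1 v2 ... vd
    """
    vectors = {}
    with open(path) as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) < 2:
                continue
            vectors[parts[0]] = np.array([float(x) for x in parts[1:]])
    return vectors

go_vec = load_go_embeddings("bilstm_go_embeddings.txt")  # hypothetical file name
print(len(go_vec), "terms loaded")
```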

Applications for Definition and Position encoders

Compare functions of proteins

Almost every protein is annotated with a set of GO terms; see, for example, the UniProt database. Once you can express each GO term as a vector, then for any 2 proteins you can compare the sets of terms annotating them. We used the Best-Match Average metric to compare 2 sets; however, there are other options to explore. Our example comparing 2 proteins is here.
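For reference, here is a small sketch of the Best-Match Average between two annotation sets, using cosine similarity between GO vectors (with `go_vec` as in the loading sketch above); the repository's own script linked above is the authoritative version.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two 1-D numpy vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_match_average(terms_a, terms_b, go_vec):
    """Best-Match Average similarity between two sets of GO terms.

    For each term in one set, take its best (maximum) similarity to the
    other set; do this in both directions and average the two means.
    """
    sims = np.array([[cosine(go_vec[a], go_vec[b]) for b in terms_b]
                     for a in terms_a])
    return 0.5 * (sims.max(axis=1).mean() + sims.max(axis=0).mean())

# Hypothetical annotation sets for two proteins:
protein1 = ["GO:0000001", "GO:0000002"]
protein2 = ["GO:0000002", "GO:0000009"]
# score = best_match_average(protein1, protein2, go_vec)
```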

Predict GO labels based on protein sequences

We can use the UniProt database to train a model that predicts GO labels for an unknown protein sequence. In our paper, we demonstrate that GO embeddings can be used to predict GO labels not included in the training data (zero-shot learning). There are two advantages. First, many machine learning methods exclude rare labels, because these methods often have problems when the training data contains very rare labels. GO embeddings allow us to adopt the zero-shot learning philosophy, where we train models on the labels in the training data but test them on new, unseen labels. Second, as the GO database is constantly being updated with new terms, we do not need to train a brand-new model with each update.
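One common way to exploit label embeddings for zero-shot prediction, sketched below, is to score each GO term by a dot product between a learned protein representation and the fixed GO vector: because the GO vectors are inputs rather than learned output weights, terms unseen during training can still be scored. This is a minimal illustration of the idea, not the exact architecture from our paper; the class name and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class ZeroShotGOPredictor(nn.Module):
    """Score GO terms for a protein via dot products with fixed GO vectors."""

    def __init__(self, protein_dim, go_dim):
        super().__init__()
        # Project protein features into the GO embedding space.
        self.project = nn.Linear(protein_dim, go_dim)

    def forward(self, protein_feat, go_matrix):
        # protein_feat: (batch, protein_dim); go_matrix: (num_terms, go_dim)
        h = self.project(protein_feat)   # (batch, go_dim)
        logits = h @ go_matrix.t()       # (batch, num_terms)
        return torch.sigmoid(logits)     # per-term probabilities

# Hypothetical usage: 1280-d protein features, 300-d GO vectors, 500 terms.
model = ZeroShotGOPredictor(1280, 300)
scores = model(torch.randn(2, 1280), torch.randn(500, 300))
```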
