bio-ontology-research-group/Genomic_context

Context-aware protein function prediction in bacterial genomes

We developed a protein function prediction method that relies solely on genomic context, applying a transformer model to bacterial genomes. This repository contains the scripts we used to train the BERT model, along with the scripts for function prediction and for evaluating the contextual approach.

Dependencies

  • The code was developed and tested using Python 3.9.13
  • Clone the repository
git clone https://github.com/bio-ontology-research-group/Genomic_context.git
  • Create a conda environment
conda create --name genomic_context python=3.9.13
  • Activate your environment
conda activate genomic_context
  • Install dependencies
pip install -r requirements.txt
  • The training data used in this study are deposited in the Zenodo database under accession code _____ (link). The data include NLP-formatted genomes, cluster-representative protein sequences, and MMseqs2 clustering results.
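
To make the relationship between the clustering results and the NLP-formatted genomes more concrete, here is a minimal sketch of how a genome could be turned into a "sentence" of cluster representatives. It assumes the standard two-column MMseqs2 cluster TSV (representative, then member, tab-separated). The function names, the `[UNK]` placeholder, and the toy protein IDs are hypothetical and are not taken from this repository; the actual format of the deposited files may differ.

```python
import csv
import io
from typing import Dict, Iterable, List


def load_cluster_map(tsv_lines: Iterable[str]) -> Dict[str, str]:
    """Map each member protein ID to its cluster representative.

    Assumes the two-column MMseqs2 cluster TSV layout:
    representative<TAB>member
    """
    member_to_rep: Dict[str, str] = {}
    for row in csv.reader(tsv_lines, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        rep, member = row[0], row[1]
        member_to_rep[member] = rep
    return member_to_rep


def genome_to_sentence(
    proteins_in_order: List[str],
    member_to_rep: Dict[str, str],
    unknown: str = "[UNK]",  # hypothetical placeholder for unclustered proteins
) -> str:
    """Replace each protein, in genomic order, with its cluster representative."""
    return " ".join(member_to_rep.get(p, unknown) for p in proteins_in_order)


# Toy data for illustration only (not real identifiers).
example_tsv = io.StringIO("repA\trepA\nrepA\tprot2\nrepB\tprot3\n")
cluster_map = load_cluster_map(example_tsv)
sentence = genome_to_sentence(["prot2", "prot3", "prot9"], cluster_map)
print(sentence)  # repA repB [UNK]
```

Each protein is treated as a word and each genome (or contig) as a sentence, so that a language model sees gene order rather than sequence content.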

Repo guide

  • BERT_word2vec_benchmark - contains scripts to run the BERT and word2vec evaluations. The genome corpus for evaluation can be obtained via the following link. The pre-trained BERT model has been exported to the HF Hub
  • Defense_InterPro's - contains TSV files with InterPro IDs annotating bacterial defense systems. Data were obtained from the InterPro website
  • Secretion_InterPro's - contains a TSV file with InterPro IDs annotating bacterial secretion systems. Data were obtained from the InterPro website
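
The InterPro TSV files in the two directories above could be read along the following lines. This is only a sketch: the column layout of the files is not documented here, so instead of assuming a header name it scans every line for InterPro accessions (the prefix `IPR` followed by six digits). The function names and the sample rows are hypothetical.

```python
import re
from pathlib import Path
from typing import Iterable, Set

# InterPro accessions have the form "IPR" followed by six digits.
IPR_PATTERN = re.compile(r"\bIPR\d{6}\b")


def collect_interpro_ids(lines: Iterable[str]) -> Set[str]:
    """Return every InterPro accession found in the given lines."""
    ids: Set[str] = set()
    for line in lines:
        ids.update(IPR_PATTERN.findall(line))
    return ids


def collect_from_directory(directory: str) -> Set[str]:
    """Collect InterPro accessions from all *.tsv files in a directory."""
    ids: Set[str] = set()
    for path in sorted(Path(directory).glob("*.tsv")):
        with path.open(encoding="utf-8") as handle:
            ids |= collect_interpro_ids(handle)
    return ids


# Made-up rows for illustration only; real files may have other columns.
sample_rows = [
    "Accession\tName",
    "IPR000123\tHypothetical example entry",
    "IPR999999\tAnother hypothetical entry",
]
sample_ids = collect_interpro_ids(sample_rows)
print(sorted(sample_ids))
```

With the repository cloned, a call such as `collect_from_directory("Defense_InterPro's")` would gather the accessions for the defense systems, assuming the files carry a `.tsv` extension.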

Citations

If you find this work useful in your research, please cite our paper:

About

Protein function prediction (GO classes) using genomic context
