Context-aware protein function prediction in bacterial genomes

We developed a novel protein function prediction method that relies only on genomic context, applying a transformer model to bacterial genomic context. This repository contains the scripts we used to train the BERT model, along with the scripts we used for function prediction and for evaluating the contextual approach.

Dependencies

  • The code was developed and tested using Python 3.9.13
  • Clone the repository
git clone https://github.com/bio-ontology-research-group/Genomic_context.git
  • Create a conda environment
conda create --name genomic_context python=3.9.13
  • Activate your environment
conda activate genomic_context
  • Install dependencies
pip install -r requirements.txt
  • The training data used in this study are deposited in the Zenodo database under accession code _____ (link). The data include NLP-formatted genomes, cluster-representative protein sequences, and MMseqs2 clustering results.
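
As a rough illustration of how the clustering results and the NLP-formatted genomes may relate, the sketch below maps proteins to their cluster representatives and writes a genome as a "sentence" of cluster tokens. It assumes the clustering results use the two-column TSV that MMseqs2 writes for clusters (representative ID, then member ID); the sentence layout, the helper names `load_cluster_map` and `genome_to_sentence`, and all IDs are hypothetical and are not taken from the deposited data.

```python
import csv
import io


def load_cluster_map(handle):
    """Map each member protein ID to its cluster representative.

    Assumes a two-column TSV: representative ID, then member ID,
    one pair per line (the layout MMseqs2 uses for cluster TSVs).
    """
    member_to_rep = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        rep, member = row[0], row[1]
        member_to_rep[member] = rep
    return member_to_rep


def genome_to_sentence(ordered_protein_ids, member_to_rep):
    """Join cluster tokens in genomic order into a space-separated string.

    Proteins missing from the map are kept under their own ID
    (an assumption made for this sketch).
    """
    return " ".join(member_to_rep.get(p, p) for p in ordered_protein_ids)


# Hypothetical clustering table and gene order, for illustration only.
clusters_tsv = io.StringIO("repA\trepA\nrepA\tprot2\nrepB\tprot3\n")
cluster_map = load_cluster_map(clusters_tsv)
sentence = genome_to_sentence(["prot2", "prot3", "repA"], cluster_map)
print(sentence)
```

To use this with a real file, pass an open file handle instead of the `io.StringIO` object, and check the actual column order of the deposited files first.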

Repo guide

  • BERT_word2vec_benchmark - contains scripts to run the BERT and word2vec evaluations. The genome corpus for evaluation can be obtained via the following link. The pre-trained BERT model has been exported to the HF Hub
  • Defense_InterPro's - contains TSV files with the InterPro IDs that annotate bacterial defense systems. The data were obtained from the InterPro website
  • Secretion_InterPro's - contains a TSV file with the InterPro IDs that annotate bacterial secretion systems. The data were obtained from the InterPro website
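
The InterPro TSV files can be read with a few lines of Python. The sketch below collects every field that looks like an InterPro accession (`IPR` followed by six digits) instead of relying on a column name, because the column layout of the files in this repository is not described here. The function name `read_interpro_ids` and the example rows are hypothetical.

```python
import csv
import io
import re

# InterPro accessions have the form IPR followed by six digits.
INTERPRO_ID = re.compile(r"^IPR\d{6}$")


def read_interpro_ids(handle):
    """Return the set of InterPro accessions found in a TSV file.

    Scans every field, so it does not depend on which column
    holds the accession (the layout is an assumption here).
    """
    ids = set()
    for row in csv.reader(handle, delimiter="\t"):
        for field in row:
            value = field.strip()
            if INTERPRO_ID.match(value):
                ids.add(value)
    return ids


# Hypothetical file content, for illustration only.
example = io.StringIO(
    "Accession\tName\n"
    "IPR000001\tPlaceholder entry A\n"
    "IPR000002\tPlaceholder entry B\n"
)
defense_ids = read_interpro_ids(example)
print(sorted(defense_ids))
```

For a file on disk, open it with `open(path, newline="")` and pass the handle to `read_interpro_ids`.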

Citations

If you find this work useful, please cite our paper: