Austin Richardson edited this page Jan 22, 2015 · 1 revision

Two graphs are constructed in genome-explorer: the feature-feature (protein) relationships graph and the genome-genome relationships graph.

The feature-feature relationships graph is generated by running an all-vs-all BLAST (in practice, USEARCH) search and adding a row to a join table whenever a sequence match is found.

The join table looks like this:

feature_relationship:
  - id
  - feature_id
  - related_feature_id
  - identity

Where identity is the sequence similarity determined by USEARCH.
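As a rough illustration of how search hits could become join-table rows, here is a minimal sketch. It assumes the hits arrive as tab-separated lines whose first three columns are the query label, the target label, and the percent identity (the layout of USEARCH's BLAST-style tabular output); the page does not say which output format genome-explorer actually reads. The `FeatureRelationship` type, the `parse_hits` function, and the choice to skip self-hits are all hypothetical.

```python
from typing import Iterable, NamedTuple


class FeatureRelationship(NamedTuple):
    """One row of the feature_relationship join table (id omitted)."""
    feature_id: str
    related_feature_id: str
    identity: float


def parse_hits(lines: Iterable[str]) -> list[FeatureRelationship]:
    """Turn tab-separated hit lines into feature_relationship rows.

    Assumed column order: query label, target label, percent identity.
    Any further columns are ignored.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        query, target, identity = fields[0], fields[1], float(fields[2])
        if query == target:
            # Assumption: a feature matching itself is not stored.
            continue
        rows.append(FeatureRelationship(query, target, identity))
    return rows
```

In a real pipeline the labels would presumably be mapped to database feature ids before the rows are written.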

This is used to create the genome-genome relationships graph:

genome_relationship:
  - id
  - genome_id
  - related_genome_id
  - n_shared_features

Where n_shared_features is the number of shared features. The genome-genome relationships graph is constructed by iterating over the feature relationships and counting the number of shared features between each pair of genomes.
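The counting step can be sketched roughly as follows. This is not the genome-explorer code: the `feature_to_genome` lookup and the function name are hypothetical, and the sketch assumes that each feature relationship linking two different genomes adds one to that ordered pair's n_shared_features. The page does not spell out the exact counting rule (for example, whether within-genome matches are counted), so those choices are assumptions.

```python
from collections import Counter


def build_genome_relationships(feature_relationships, feature_to_genome):
    """Count feature relationships between each ordered pair of genomes.

    feature_relationships: iterable of
        (feature_id, related_feature_id, identity) tuples.
    feature_to_genome: dict mapping a feature id to its genome id
        (hypothetical lookup, not described in the source).
    Returns a Counter keyed by (genome_id, related_genome_id).
    """
    counts = Counter()
    for feature_id, related_feature_id, _identity in feature_relationships:
        genome_id = feature_to_genome[feature_id]
        related_genome_id = feature_to_genome[related_feature_id]
        if genome_id == related_genome_id:
            # Assumption: matches within one genome are not counted.
            continue
        counts[(genome_id, related_genome_id)] += 1
    return counts
```

Because the pairs are ordered, a feature relationship that is stored without its inverse leaves the two directions of a genome relationship with different counts.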

Both graphs are weighted and bi-directional: there is an inverse relationship for every relationship, or at least there should be (at the moment, this is not verified).
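Since the inverse relationships are not currently verified, a check along these lines could be run over either table. It is only a sketch with a hypothetical function name; it treats each relationship as an (id, related id) pair and reports the inverse pairs that are absent.

```python
def missing_inverses(pairs):
    """Return the inverse pairs that are absent from a relationship table.

    pairs: iterable of (id, related_id) tuples, e.g.
        (feature_id, related_feature_id) or (genome_id, related_genome_id).
    The result is sorted so that it is stable across runs.
    """
    seen = set(pairs)
    return sorted((b, a) for (a, b) in seen if (b, a) not in seen)
```

An empty result means every relationship has its inverse. The check ignores the weights, so it would not detect a pair whose two directions carry different identity or n_shared_features values.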

Iterative/Streaming Graph Updating

Both graphs can be updated without reconstructing them from scratch. When one or more genomes are added, their proteins are appended to a copy of proteins.fasta stored on the big worker node. This node then runs the all-versus-all USEARCH and creates only the feature relationships that do not already exist. New genome relationships are then built from only those new feature relationships. Because writing the protein sequences and constructing the graphs take much longer than the sequence-sequence search, a partial update is much faster than a full rebuild, even though it repeats the protein-protein homology search every time.
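The partial update can be sketched as below. This is an in-memory illustration under several assumptions, not the actual implementation: `known_pairs` stands in for the existing feature_relationship table, `genome_counts` for the genome_relationship table, `feature_to_genome` is a hypothetical lookup, and within-genome matches are skipped. Only hits that are not already stored produce new rows, and only those rows change the genome counts.

```python
from collections import Counter


def update_graphs(known_pairs, genome_counts, hits, feature_to_genome):
    """Apply the hits of a fresh all-vs-all search as a partial update.

    known_pairs: set of (feature_id, related_feature_id) already stored;
        updated in place.
    genome_counts: Counter keyed by (genome_id, related_genome_id);
        updated in place.
    hits: iterable of (feature_id, related_feature_id, identity) tuples
        from the latest search, covering old and new proteins alike.
    Returns the list of newly created feature relationship rows.
    """
    new_rows = []
    for feature_id, related_feature_id, identity in hits:
        pair = (feature_id, related_feature_id)
        if pair in known_pairs:
            continue  # already in the graph, nothing to do
        known_pairs.add(pair)
        new_rows.append((feature_id, related_feature_id, identity))
        genome_id = feature_to_genome[feature_id]
        related_genome_id = feature_to_genome[related_feature_id]
        if genome_id != related_genome_id:
            # Only new feature relationships touch the genome graph.
            genome_counts[(genome_id, related_genome_id)] += 1
    return new_rows
```

The search still covers every protein, but the comparatively expensive graph-writing work is limited to the rows returned by the function.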
