Graph Construction
Two graphs are constructed in genome-explorer: the feature-feature (protein) relationships graph and the genome-genome relationships graph.
The feature-feature relationships graph is generated by running an all-vs-all BLAST-style search (in practice, USEARCH) and adding a row to a join table whenever a sequence match is found.
The join table looks like this:
feature_relationship:
- id
- feature_id
- related_feature_id
- identity
Where identity is the sequence similarity determined by USEARCH.
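As a rough illustration of how search hits could become join-table rows, the sketch below parses tab-separated hits. It assumes the search output uses the BLAST tabular layout (query, target, percent identity in the first three columns) and that self-hits are skipped; the function name `parse_hits`, the `min_identity` parameter, and the omission of the `id` column (assumed to be assigned by the database) are all hypothetical and not taken from genome-explorer.

```python
import csv
import io


def parse_hits(handle, min_identity=0.0):
    """Turn tabular search hits into feature_relationship-style rows.

    Assumption: columns 1-3 are query id, target id, percent identity.
    """
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) < 3:
            continue  # ignore blank or truncated lines
        query, target, identity = fields[0], fields[1], float(fields[2])
        if query == target or identity < min_identity:
            continue  # assumption: self-hits are not stored
        rows.append(
            {
                "feature_id": query,
                "related_feature_id": target,
                "identity": identity,
            }
        )
    return rows


# Tiny made-up example: two reciprocal hits and one self-hit.
example = "p1\tp2\t97.5\np2\tp1\t97.5\np1\tp1\t100.0\n"
rows = parse_hits(io.StringIO(example))
```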
This is used to create the genome-genome relationships graph:
genome_relationship:
- id
- genome_id
- related_genome_id
- n_shared_features
Where n_shared_features is the number of shared features. The genome-genome relationships graph is constructed by iterating over the feature relationships and counting the number of shared features between each pair of genomes.
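The counting step can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes each feature relationship contributes one count to its ordered genome pair, that relationships within a single genome are ignored, and that a `feature_to_genome` lookup (a hypothetical name) is available.

```python
from collections import Counter


def build_genome_relationships(feature_relationships, feature_to_genome):
    """Count feature relationships per ordered (genome, related genome) pair."""
    counts = Counter()
    for rel in feature_relationships:
        genome = feature_to_genome[rel["feature_id"]]
        related = feature_to_genome[rel["related_feature_id"]]
        if genome == related:
            continue  # assumption: within-genome matches are not counted
        counts[(genome, related)] += 1
    return [
        {"genome_id": g, "related_genome_id": r, "n_shared_features": n}
        for (g, r), n in sorted(counts.items())
    ]


# Made-up data: two genomes sharing two protein matches in each direction.
feature_to_genome = {"p1": "gA", "p2": "gB", "p3": "gA", "p4": "gB"}
feature_relationships = [
    {"feature_id": "p1", "related_feature_id": "p2"},
    {"feature_id": "p2", "related_feature_id": "p1"},
    {"feature_id": "p3", "related_feature_id": "p4"},
    {"feature_id": "p4", "related_feature_id": "p3"},
]
genome_relationships = build_genome_relationships(
    feature_relationships, feature_to_genome
)
```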
Both graphs are weighted and bi-directional: there is an inverse relationship for every relationship, or at least there should be (at the moment, this is not verified).
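Since the inverse relationships are not currently verified, a check along the following lines could be used. It is only a sketch: the function name `missing_inverses` and the in-memory list of rows are assumptions, and a real check would more likely run as a query against the join tables.

```python
def missing_inverses(relationships, source_key, target_key):
    """Return the (source, target) pairs that would have to exist
    for every relationship to have an inverse, but do not."""
    pairs = {(rel[source_key], rel[target_key]) for rel in relationships}
    return sorted((b, a) for (a, b) in pairs if (b, a) not in pairs)


# Made-up rows: p1 <-> p2 is symmetric, p1 -> p3 lacks its inverse.
feature_rows = [
    {"feature_id": "p1", "related_feature_id": "p2"},
    {"feature_id": "p2", "related_feature_id": "p1"},
    {"feature_id": "p1", "related_feature_id": "p3"},
]
gaps = missing_inverses(feature_rows, "feature_id", "related_feature_id")
```

The same function can be applied to genome relationships by passing "genome_id" and "related_genome_id" as the keys.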
Both graphs can be updated without reconstructing the entire graph. This is achieved by creating only the new feature relationships after performing an all-versus-all protein search, then creating only the new genome relationships. When one or more genomes are added, their proteins are appended to a copy of proteins.fasta stored on the big worker node. This node then performs the all-versus-all USEARCH and creates only the new feature relationships. New genome relationships are then built using only those new feature relationships. Because writing the protein sequences and constructing the graphs take much longer than the sequence search itself, a partial update is much faster than a full rebuild, even though the protein-protein homology search has to be rerun every time.
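The partial update can be sketched as a filter followed by an incremental count. The code below is an illustration under assumptions: the names `new_feature_relationships` and `add_genome_counts` are hypothetical, existing relationships are represented as a set of (feature, related feature) pairs, and genome counts follow the same one-count-per-relationship convention assumed elsewhere on this page.

```python
def new_feature_relationships(hits, existing_pairs):
    """Keep only hits whose (feature, related feature) pair is not stored yet."""
    seen = set(existing_pairs)
    fresh = []
    for hit in hits:
        pair = (hit["feature_id"], hit["related_feature_id"])
        if pair in seen:
            continue  # already in the join table, nothing to create
        seen.add(pair)
        fresh.append(hit)
    return fresh


def add_genome_counts(counts, fresh, feature_to_genome):
    """Update genome-pair counts using only the new feature relationships."""
    for rel in fresh:
        genome = feature_to_genome[rel["feature_id"]]
        related = feature_to_genome[rel["related_feature_id"]]
        if genome == related:
            continue  # assumption: within-genome matches are not counted
        counts[(genome, related)] = counts.get((genome, related), 0) + 1
    return counts


# Made-up state: gA and gB are already in the graph, gC (protein p5) is new.
existing_pairs = {("p1", "p2"), ("p2", "p1")}
counts = {("gA", "gB"): 1, ("gB", "gA"): 1}
feature_to_genome = {"p1": "gA", "p2": "gB", "p5": "gC"}

# The all-versus-all search reports old and new hits alike.
hits = [
    {"feature_id": "p1", "related_feature_id": "p2", "identity": 97.5},
    {"feature_id": "p2", "related_feature_id": "p1", "identity": 97.5},
    {"feature_id": "p5", "related_feature_id": "p1", "identity": 88.0},
    {"feature_id": "p1", "related_feature_id": "p5", "identity": 88.0},
]
fresh = new_feature_relationships(hits, existing_pairs)
counts = add_genome_counts(counts, fresh, feature_to_genome)
```

In this sketch the full search still runs over every protein, but only the two hits involving p5 lead to new rows and new genome pairs.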