Refactor dumps pipeline to remove bottleneck caused by connectomics data #24

dosumis opened this issue Jan 25, 2022 · 1 comment

dosumis commented Jan 25, 2022

To be documented:

  • What content needs to be added where in dumps to preserve current VFB functionality

Q. What axioms need to be present for automated classification of individuals?
A. (I think) KB content + ontologies is currently sufficient. We don't classify by neuron:neuron connectivity (@Clare72 is this true?), and classification by neuron:region connectivity is currently opaque to data-driven recording of connectivity (although that could change).

Q. What axioms need to be present for SPARQL generated neo labels?
A: All connectivity - but this step does not require reasoning (see the SPARQL sketch at the end of this comment).

Q. What axioms need to be present for reasoning generated neo labels?
A: KB + ontologies (I think)

Q. What A-box axioms need to be loaded into ELK to drive reasoning?
A. Currently only neuron-region connectivity is needed in addition to KB content + ontologies (although check API)

  • TODO Document how long each step takes.
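For reference, here is a minimal sketch of what the SPARQL-generated neo label step mentioned above could look like, written as a SPARQL Update against the triplestore. The `vfbex:` prefix, the connectivity property, and the label property are placeholders, not the IRIs the pipeline actually uses.

```sparql
# Sketch only: flag every individual that has at least one neuron:region
# connectivity edge so the annotation can be turned into a neo4j label at
# load time. All IRIs below are placeholders.
PREFIX vfbex: <http://example.org/vfb-placeholder/>

INSERT { ?neuron vfbex:neo_label "has_region_connectivity" }
WHERE  { ?neuron vfbex:connected_to_region ?region }
```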

dosumis commented Jan 25, 2022

Short term solutions:

Nico's suggestion - use named graphs to exclude connectomics data from the reasoning step

this will still require connectomics to be loaded into and dumped from the triplestore, which is slow(ish). A faster solution would be to load connectomics OWL files later in the pipeline. However, this approach would require quite a bit of re-engineering of the Makefile. This is because the SPARQL-based neo: label addition runs against the triplestore and is used in a patsubst to structure the Makefile and direct content to be loaded into the various endpoints.
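For illustration, a minimal sketch of the named-graph exclusion, assuming connectomics triples were loaded into a dedicated named graph (the graph IRI below is made up): the reasoning input is built from everything except that graph.

```sparql
# Sketch only: dump all named graphs except a dedicated connectomics graph
# as the input for the reasoning step. The graph IRI is a placeholder.
CONSTRUCT { ?s ?p ?o }
WHERE {
  GRAPH ?g { ?s ?p ?o }
  FILTER (?g != <http://example.org/vfb-placeholder/graph/connectomics>)
}
```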

Proposal for a clean quick-ish fix.

Dump named graphs separately in initial step.

graph 1: everything except connectomics
graph 2: neuron:neuron connectomics only
graph 3: neuron:region connectomics only
label_graphs: preferred_roots, deprecation_label, has_image, has_neuron_connectivity, has_region_connectivity

Reasoning can be done with graph 1 alone

Neo4j needs all 3 graphs + label graphs, as in the current build
OWLERY needs graph 1 + graph 3; (pre-)reasoning is not needed. It needs to retain the filter step that removes annotation axioms.
SOLR needs graph 1 + label graphs, as in the current build.

For this approach to work, we need to be able to distinguish neuron:neuron connectivity from neuron:region connectivity, which would require edits to the code here: https://github.com/VirtualFlyBrain/VFB_connectomics_import (needs @admclachlan) - this may take some time.
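To illustrate the distinction the import code would need to support, here is a hedged sketch of how the graph 3 content (neuron:region connectivity) could be selected once the target's type is recoverable from the data; every IRI here is a placeholder for whatever VFB_connectomics_import actually writes, and neuron:neuron edges would simply be the complement.

```sparql
# Sketch only: keep connectivity edges whose target is typed as a
# (subclass of a) region/neuropil class - these would go to graph 3;
# the remaining edges (neuron:neuron) would go to graph 2.
# All IRIs below are placeholders.
PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
PREFIX vfbex: <http://example.org/vfb-placeholder/>

CONSTRUCT { ?neuron vfbex:connected_to ?target }
WHERE {
  ?neuron vfbex:connected_to ?target .
  ?target rdf:type/rdfs:subClassOf* vfbex:synaptic_neuropil_domain .
}
```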

Super quick and dirty fix to get the pipeline running again:

Remove loading of connectomics data into the triplestore
Merge in all connectomics OWL files at the dumps steps - Owlery and Neo4j get all connectomics.
Add an additional script to add connectomics flag neo4j:labels directly in PDB & side-load these to SOLR.

We will do this. @Robbie1977 will edit the Makefile -> PR for us to review.

Experiment worth doing

Change the reasoning step and Owlery to use the Whelk reasoner.

hkir-dev added a commit that referenced this issue Jan 31, 2022