figure crossreference
slobentanzer committed Dec 9, 2023
1 parent 39f37d0 commit e40a833
Showing 1 changed file with 6 additions and 6 deletions.
12 changes: 6 additions & 6 deletions content/25.sup.note.5.md
@@ -14,7 +14,7 @@ We have created an example on how to share a UniProt adapter between resources a

We have written such an adapter for UniProt data, using software infrastructure provided by the OmniPath backend PyPath (for downloading and locally caching the data).
The adapter provides the data as well as convenient access points and an overview of the available property fields using Python Enum classes, offering automatic suggestion and autocomplete functionality.
-Using these methods, selecting specific content from the entirety of UniProt data and integrating this content with other resources is greatly facilitated (Figure S1), since the alternative would be, in many cases, to use a manual script to access the UniProt API and rely on manual harmonisation with other datasets.
+Using these methods, selecting specific content from the entirety of UniProt data and integrating this content with other resources is greatly facilitated (Figure @fig:S1), since the alternative would be, in many cases, to use a manual script to access the UniProt API and rely on manual harmonisation with other datasets.

Similarly, we have added adapters for protein-protein interactions from the popular sources IntAct 7, BioGRID [@doi:10.1002/pro.3978], and STRING [@doi:10.1093/nar/gkaa1074], as well as other resources.
For an up-to-date overview of the BioCypher pipelines and adapters, please visit the [Components board](https://github.com/orgs/biocypher/projects/3) and the [meta-graph](https://meta.biocypher.org).
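
The Enum-based field selection described for the UniProt adapter can be pictured with a minimal sketch. The class name `ProteinField`, the helper `select_fields`, and the toy record are hypothetical and chosen for illustration only; they are not the adapter's actual API or real UniProt data.

```python
from enum import Enum


class ProteinField(Enum):
    """Hypothetical property fields (not the real adapter's names)."""

    LENGTH = "length"
    MASS = "mass"
    ORGANISM = "organism"


def select_fields(record: dict, fields: list[ProteinField]) -> dict:
    """Keep only the requested properties of a single record."""
    return {f.value: record[f.value] for f in fields if f.value in record}


# Toy record with made-up values, standing in for one downloaded entry.
record = {"length": 393, "mass": 43653, "organism": "9606", "sequence": "MEEPQ"}

selected = select_fields(record, [ProteinField.LENGTH, ProteinField.ORGANISM])
print(selected)
```

Because the available fields are Enum members, an editor can list and autocomplete them, and a misspelled field name fails at attribute lookup rather than silently returning nothing.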
@@ -48,9 +48,9 @@ The code for this project can be found at https://github.com/oncodash/oncodashkb

We make use of the ontology manipulation facilities provided by BioCypher to extend the broad but basic Biolink ontology in certain branches where it is useful to have more granular information about the data that enters the KG.
For example, the exact type of genetic variants are of high importance in the molecular tumour board process, but Biolink only provides a generic “sequence variant” class in its schema.
-Therefore, we extended the ontology tree at this node with the very granular corresponding subtree of the Sequence Ontology (SO, [@doi:10.1186/gb-2005-6-5-r44]), yielding a hybrid ontology with the generality of Biolink and the accuracy of a specialised ontology of sequence variants (Figure S2).
+Therefore, we extended the ontology tree at this node with the very granular corresponding subtree of the Sequence Ontology (SO, [@doi:10.1186/gb-2005-6-5-r44]), yielding a hybrid ontology with the generality of Biolink and the accuracy of a specialised ontology of sequence variants (Figure @fig:S2).
Building on the mechanism provided by BioCypher, this hybridisation can be performed by providing only the minimal input of the sequence ontology URL and the nodes that should be the point of merging (“sequence variant” in Biolink and “sequence_variant” in SO).
-The same process is used with the Disease Ontology [@doi:10.1093/nar/gkab1063] and OncoTree [@doi:10.1200/CCI.20.00108] (see Figure S2).
+The same process is used with the Disease Ontology [@doi:10.1093/nar/gkab1063] and OncoTree [@doi:10.1200/CCI.20.00108] (see Figure @fig:S2).
We use Biolink v3.2.1 and the most recent version of Disease Ontology (as provided by the OBO Foundry at http://purl.obolibrary.org/obo/so.owl).
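
The merging of two ontologies at a pair of join nodes can be sketched conceptually as follows. This is not BioCypher's implementation: the function names `descendants` and `graft` are hypothetical, each ontology is reduced to a plain child-to-parent dictionary, and the toy hierarchies only mimic the shape of Biolink and SO rather than reproducing their actual terms.

```python
def descendants(parents: dict, root: str) -> set:
    """Collect every node below `root` in a child -> parent mapping."""
    found = set()
    frontier = [root]
    while frontier:
        node = frontier.pop()
        for child, parent in parents.items():
            if parent == node and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def graft(head: dict, tail: dict, head_join: str, tail_join: str) -> dict:
    """Attach the tail subtree below `tail_join` under `head_join` in the head."""
    hybrid = dict(head)
    for child in descendants(tail, tail_join):
        parent = tail[child]
        # Children of the tail join node are re-parented onto the head join node.
        hybrid[child] = head_join if parent == tail_join else parent
    return hybrid


# Toy hierarchies (illustrative only, not the real ontologies).
head = {"sequence variant": "biological entity", "biological entity": "entity"}
tail = {
    "sequence_variant": "so_root",
    "structural_variant": "sequence_variant",
    "insertion": "structural_variant",
    "region": "so_root",
}

hybrid = graft(head, tail, "sequence variant", "sequence_variant")
print(hybrid)
```

Only the subtree below the tail join node is carried over, so unrelated branches of the specialised ontology (here the toy `region` node) do not enter the hybrid.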

![
@@ -89,7 +89,7 @@ After BioCypher adaptation, the KG (covering all information used by Barrio-Hern
This lossless reduction is possible due to 1) the semantic abstraction and 2) the removal of information in the original graph that is not relevant to the task.
Compared to the original file of the database dump (zipped, 1.1 GB), the BioCypher output is ~20-fold smaller (zipped, 63 MB), which greatly facilitates sharing and accessibility (e.g.
by simplifying online access via Jupyter notebooks).
-The Cypher query for an interaction has been reduced from 13 query lines, 15 nodes, and 25 edges to 2 query lines, 3 nodes, and 2 edges (Figure S3).
+The Cypher query for an interaction has been reduced from 13 query lines, 15 nodes, and 25 edges to 2 query lines, 3 nodes, and 2 edges (Figure @fig:S3).
This change comes with a reduction in complexity, which may be beneficial for the experience of interacting with the KG.
If the Cypher query is programmatically generated, this does not play a role for the user.
However, in that case, the complexity is shifted upstream to the code that generates the query.
@@ -168,7 +168,7 @@ For instance, we extended the metapath to connect the subjects’ protein readou
Importantly, due to the gigantic size of the CKG, it was fundamental to use a CKG BioCypher adapter to extract the pertinent subgraphs containing only the required knowledge (e.g., patient-protein data and pathways).
Indeed, selecting the desired KG entities from the complete adapter required negligible time (demonstrated at https://github.com/biocypher/clinical-knowledge-graph).
Finally, the protein- and pathway-based patient descriptors were obtained by running the Bioteque embedding pipeline (https://gitlabsbnb.irbbarcelona.org/bioteque/).
-The two resulting patient embedding spaces and their corresponding cluster similarity are provided in Figure S4.
+The two resulting patient embedding spaces and their corresponding cluster similarity are provided in Figure @fig:S4.

![
**Bioteque-based patient embeddings.**
@@ -237,7 +237,7 @@ Embedded in the Medical Informatics Initiative (MII) Germany, MeDaX builds on th
We envision extending the existing MIRACOLIX toolbox [@doi:10.3414/ME17-02-0025] with the MeDaX pipeline to set up local KGs, combining complex heterogeneous data from multiple resources: in addition to biomedical data available only at the DICs due to patient privacy, we include the MII core data set [@{https://www.medizininformatik-initiative.de/sites/default/files/2018-07/2018-03_mdi_Der%20Kerndatensatz%20der%20Medizininformatik-Initiative%20Ein%20Schritt%20zur%20Sekund%C3%A4rnutzung%20von%20Versorgungsdaten%20auf%20nationaler%20Ebene.pdf}], local population studies [@doi:10.1007/BF01324255;@doi:10.1186/1479-5876-12-144], biomedical ontologies [@doi:10.1093/nar/gkp440], and public information portals [@doi:10.1186/s12911-020-01374-w].
BioCypher’s ontology mapping process facilitates future integration of additional data sources (see also the case study “Data integration”).

-We enable federated learning pipelines by supplying build instructions for each local database in the form of the schema configuration that can be publicly and centrally maintained, since it contains no sensitive data (Figure S5).
+We enable federated learning pipelines by supplying build instructions for each local database in the form of the schema configuration that can be publicly and centrally maintained, since it contains no sensitive data (Figure @fig:S5).
At each training location, a task-specific KG is created from public data (e.g., with the Clinical Knowledge Graph as baseline), using the subsetting facilities described in the case study “Subgraph extraction”.
Afterwards, the sensitive patient data (e.g., germ-line genetic variants) are integrated into this KG at each location, using the BioCypher schema configuration to specify the type of data involved (e.g., clinical measurements, genetic profiling).
This ensures that, regardless of how the sensitive data are represented at each location, the machine learning algorithm works with the exact same structure of KG, preventing accidental or malicious data leakage in the federated learning step.
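
The idea that a shared, non-sensitive schema yields the same KG structure at every site can be illustrated with a small sketch. Everything here is an assumption made for illustration: `SHARED_SCHEMA` is a plain dictionary and not BioCypher's schema configuration format, the helpers `to_kg_nodes` and `structure` are hypothetical, and the records are invented. The sketch shows structural alignment only and makes no claim about privacy guarantees.

```python
# Hypothetical shared schema: node type -> properties allowed into the KG.
SHARED_SCHEMA = {
    "clinical measurement": ["name", "value", "unit"],
    "genetic variant": ["gene", "change"],
}


def to_kg_nodes(records: list, schema: dict) -> list:
    """Map local records to (label, properties) pairs dictated by the schema."""
    nodes = []
    for rec in records:
        allowed = schema[rec["type"]]
        # Fields outside the schema (e.g. site-specific columns) are dropped.
        props = {key: rec.get(key) for key in allowed}
        nodes.append((rec["type"], props))
    return nodes


def structure(nodes: list) -> set:
    """Summarise a KG by its node labels and property names, ignoring values."""
    return {(label, tuple(sorted(props))) for label, props in nodes}


# Two sites holding differently shaped local data (invented values).
site_a = [
    {"type": "clinical measurement", "name": "hb", "value": 13.5, "unit": "g/dL", "ward": "A1"},
    {"type": "genetic variant", "gene": "G1", "change": "c.1A>T"},
]
site_b = [
    {"type": "genetic variant", "gene": "G2", "change": "c.2C>G", "local_id": 7},
    {"type": "clinical measurement", "name": "hb", "value": 12.1, "unit": "g/dL"},
]

kg_a = to_kg_nodes(site_a, SHARED_SCHEMA)
kg_b = to_kg_nodes(site_b, SHARED_SCHEMA)
print(structure(kg_a) == structure(kg_b))
```

Since only the schema travels between sites, the values stay local while the downstream learning step can rely on identical node types and property names everywhere.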