
Commit

add links
slobentanzer committed Nov 22, 2023
1 parent b6a052a commit c55b92f
Showing 3 changed files with 11 additions and 11 deletions.
6 changes: 3 additions & 3 deletions content/22.sup.note.2.md
@@ -16,10 +16,10 @@ In addition, the ontological information projected onto each KG entity allows fo
Since sharing the databases themselves is often prohibited by their large size, BioCypher facilitates the creation of task-specific subsets of databases to be shared alongside analyses.
Extensive automation reduces development time and file sizes, while additionally making the shared dataset independent of database software versions (see case studies “Network expansion”, “Subgraph extraction”, and “Embedding”).

4) Reusability and accessibility: Our template repository for a BioCypher pipeline with adapters, including a Docker Compose setup, is available [on GitHub](https://github.com/biocypher/project-template).
To enable learning by example, we curate existing pipelines as well as all adapters they use in a [GitHub project](https://github.com/orgs/biocypher/projects/3) that is tied to the BioCypher repository.
With these data, using the GitHub API and a pipeline based on our template, we build a BioCypher “meta-graph” for the simple browsing and analysis of existing BioCypher workflows (https://meta.biocypher.org/).
To inform the structure of this meta-graph, we have reactivated and now maintain the [Biomedical Resource Ontology](https://bioportal.bioontology.org/ontologies/BRO/?p=summary) (BRO [@doi:10.1016/j.jbi.2010.10.003]), which allows the categorisation of pipelines and adapters (now [on GitHub](https://github.com/biocypher/biomedical-resource-ontology)).
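The meta-graph pipeline described above collects repository metadata via the GitHub API and feeds it into a BioCypher build. A minimal sketch of that idea is shown below; the field names (`name`, `html_url`, `topics`) follow the GitHub REST API's repository objects, the `(id, label, properties)` tuple shape follows BioCypher's adapter convention, and the `"adapter"`/`"pipeline"` labelling rule is a hypothetical simplification, not the actual meta-graph logic.

```python
# Hypothetical sketch: turn GitHub repository metadata into node tuples
# for a BioCypher-style "meta-graph" of pipelines and adapters.
def repos_to_nodes(repos):
    """Yield (id, label, properties) tuples, one per repository."""
    for repo in repos:
        # Illustrative rule: repos tagged with the "adapter" topic become
        # adapter nodes, everything else a pipeline node.
        label = "adapter" if "adapter" in repo.get("topics", []) else "pipeline"
        yield (
            repo["name"],
            label,
            {"url": repo["html_url"], "topics": repo.get("topics", [])},
        )
```

In a real pipeline the `repos` list would come from a paginated call to the `/orgs/{org}/repos` endpoint rather than being passed in directly.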

While data FAIRness is a necessary part of open science communication, it is not sufficient for the adoption and sustainability of a software project such as BioCypher.
As such, we also implement measures based on the TRUST principles, to increase usability, accessibility, and extensibility of our framework.
2 changes: 1 addition & 1 deletion content/23.sup.note.3.md
@@ -3,7 +3,7 @@
We build on recent technological and conceptual developments in biomedical ontologies that greatly facilitate the harmonisation of biomedical knowledge and advocate a philosophy of reuse of open-source software.
For instance, we integrate a comprehensive “high-level” biomedical ontology, the Biolink model, which can be replaced or extended by more domain-specific ontologies as needed, and an extensive catalogue and resolver for biomedical identifier resources, the Bioregistry.
Both projects, like BioCypher, are open-source and community-driven.
The ontologies serve as a framework for the representation of biomedical concepts; by supporting the Web Ontology Language (OWL), BioCypher allows integration and manipulation of most ontologies, including [those generated by Large Language Models](https://github.com/monarch-initiative/ontogpt).
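The role of a shared high-level model in bridging domain-specific ontologies can be illustrated with a toy is-a hierarchy: two terms from divergent sources are harmonised by walking up to their most specific common ancestor. This is an illustrative sketch, not BioCypher's actual ontology machinery, and the terms are invented.

```python
# Toy is-a hierarchy standing in for a high-level ontology; child -> parent.
PARENTS = {
    "cardiomyopathy": "heart disease",
    "myocardial infarction": "heart disease",
    "heart disease": "disease",
    "disease": None,
}

def ancestors(term):
    """Return the term and all its ancestors, most specific first."""
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENTS.get(term)
    return chain

def common_ancestor(a, b):
    """Most specific ancestor shared by two terms, or None."""
    seen = set(ancestors(a))
    return next((t for t in ancestors(b) if t in seen), None)
```

Mapping data points from two disease vocabularies onto such a shared hierarchy is what allows them to be queried together before any record-level reconciliation.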

Separating the ontology framework from the modelled data allows implementation of reasoning applications at the ontology level, for instance the ad-hoc harmonisation of multiple disease ontologies before mapping the data points.
For instance, with a group of users who are knowledgeable in ontology, a way to harmonise the divergent or incomplete ontologies can be developed, e.g.
14 changes: 7 additions & 7 deletions content/25.sup.note.5.md
@@ -17,10 +17,10 @@ The adapter provides the data as well as convenient access points and an overvie
Using these methods, selecting specific content from the entirety of UniProt data and integrating this content with other resources is greatly facilitated (Figure S1), since the alternative would be, in many cases, to use a manual script to access the UniProt API and rely on manual harmonisation with other datasets.

Similarly, we have added adapters for protein-protein interactions from the popular sources IntAct, BioGRID [@doi:10.1002/pro.3978], and STRING [@doi:10.1093/nar/gkaa1074], as well as other resources.
For an up-to-date overview of the BioCypher pipelines and adapters, please visit the [Components board](https://github.com/orgs/biocypher/projects/3) and the [meta-graph](https://meta.biocypher.org).
By using the UniProt accession of proteins in the KG and BioCypher functionality, the sources are seamlessly integrated into the final KG despite their differences in original data representation.
As with UniProt data, access to interaction data is facilitated by provision of Enum classes for the various fields in the original data.
The adapters and a script demonstrating their usage are available [on GitHub](https://github.com/HUBioDataLab/CROssBAR-BioCypher-Migration).
The project uses Biolink version 3.2.1.
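The Enum-based field access mentioned above can be sketched as follows; the member names are illustrative placeholders, not the actual field Enums shipped with the UniProt or interaction adapters (those live in the adapter code on GitHub).

```python
from enum import Enum

# Hypothetical field Enum: members name the columns a user may select,
# values hold the raw field keys in the source data.
class ProteinField(Enum):
    ACCESSION = "accession"
    ORGANISM = "organism_id"
    SEQUENCE = "sequence"

def select_fields(record, fields):
    """Pull only the requested fields from a raw source record."""
    return {f.value: record[f.value] for f in fields}
```

Selecting fields through an Enum rather than free-text strings lets editors and type checkers catch typos, which is the convenience the adapters aim for.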

Figure S1: Modularity of knowledge input.
@@ -79,7 +79,7 @@ used this graph database to inform their method of network expansion [@doi:10.11
The database runs on Neo4j, containing about 9 million nodes and 43 million edges.
It focuses on interactions between biomedical agents such as proteins, DNA/RNA, and small molecules.
Returning one particular interaction from the graph requires a Cypher query of ~13 lines which returns ~15 nodes with ~25 edges (variable depending on the amount of information on each interaction).
A procedure to collect information about these interactions from the graph is provided with the original manuscript [@doi:10.1101/2021.07.19.452924], containing [Cypher query code of almost 400 lines](http://ftp.ebi.ac.uk/pub/databases/intact/various/ot_graphdb/current/apoc_procedures_ot_data.txt).
Still, this extensive query only covers 11 of the 37 source labels, 10 of the 43 target labels, and 24 of the 76 relationship labels that are used in the graph database, offering a large margin for optimisation in creating a task-specific KG.

After BioCypher adaptation, the KG (covering all information used by Barrio-Hernandez et al.) has been reduced to ~700k nodes and 2.6 million edges, a more than ten-fold reduction, without loss of information with regard to this specific task.
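The size reduction comes from keeping only the node and relationship labels a task actually uses. A minimal sketch of that label-based subsetting (with invented labels and a deliberately simple graph representation, not the actual migration code):

```python
# Keep only edges whose labels the task needs, plus their incident nodes.
def subset_graph(nodes, edges, keep_labels):
    """nodes: {id: label}; edges: [(src, tgt, label)] -> reduced copies."""
    kept_edges = [e for e in edges if e[2] in keep_labels]
    kept_ids = {n for src, tgt, _ in kept_edges for n in (src, tgt)}
    kept_nodes = {i: lbl for i, lbl in nodes.items() if i in kept_ids}
    return kept_nodes, kept_edges
```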
@@ -138,8 +138,8 @@ Of note, the creation from BioCypher files using the admin import command is Neo
in the “Network expansion” case study is a Neo4j v3 dump, which is no longer supported by the current Neo4j Desktop application.
Finally, after the subsetting procedure, the reduced KG (including 5M nodes and 50M edges) in BioCypher format has a compressed size of 333 MB.

Since a complete [CKG adapter](https://github.com/biocypher/clinical-knowledge-graph) already existed, the subsetting required minimal effort; i.e., the only required step was to remove unwanted contents from the complete schema configuration.
The code for this task can be found in the same [repository](https://github.com/biocypher/clinical-knowledge-graph/blob/main/scripts/subset_ckg_script.py).
This project uses Biolink v3.2.1.
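Because BioCypher's schema configuration is a YAML mapping of entity types, "removing unwanted contents" amounts to dropping entries from that mapping. The sketch below stands in a plain dict for the YAML; the entity-type names are illustrative and not the actual CKG schema, though `represented_as` is a key BioCypher schema configurations use.

```python
# Dict stand-in for a BioCypher schema_config.yaml; dropping an entry
# excludes that entity type from the built KG.
FULL_SCHEMA = {
    "protein": {"represented_as": "node"},
    "publication": {"represented_as": "node"},
    "protein protein interaction": {"represented_as": "edge"},
}

def subset_schema(schema, keep):
    """Keep only the wanted entity types."""
    return {name: cfg for name, cfg in schema.items() if name in keep}
```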

### Embedding
@@ -284,7 +284,7 @@ As biomedical data become larger, integrated analysis pipelines become more expa
For numerous projects in systems biomedicine to succeed, a flexible way of maintaining and analysing large sets of knowledge is necessary.
This is done most effectively by separating data storage and analysis (such that each component can be individually scaled), while using distributed computing infrastructure to perform both tasks in close vicinity, such as computing clusters.
We have recently published an open-source software, called Sherlock, to perform this type of data management for biomedical projects [@doi:10.12688/f1000research.52791.3].
However, this pipeline in some ways depends on manual maintenance, for instance in its [data transformation from primary resource to internal format](https://github.com/earlham-sherlock/earlham-sherlock.github.io/tree/master/loaders).

Using BioCypher, we facilitate the maintenance of Sherlock’s input sources by reusing existing adapters and converting the manual scripts to additional adapters for unrepresented resources.
Combined with the unambiguous BioCypher schema configuration, this will make Sherlock’s input side automatable and greatly decrease maintenance effort, unlocking its full potential in managing complex bioinformatics projects and their resources.
@@ -313,7 +313,7 @@ All these databases use different identifiers for their metabolite, proteins or
Using BioCypher, we systematically and reproducibly integrate the knowledge from these databases, facilitating the creation and maintenance of a comprehensive metabolite-receptor interaction database (https://github.com/biocypher/metalinks).

The effectiveness of this approach is exemplified by examining metabolite-mediated CCC in the kidney.
By employing a [few concise lines of Cypher](https://github.com/biocypher/metalinks/blob/main/cypher_query.txt), metabolites and proteins can be filtered to focus on those active in the kidney or present in urine.
Likewise, metabolite-receptor interactions are filtered using confidence levels.
Applying these contextualization parameters reduces the overall size of the dataset by decreasing the number of metabolites from approximately 1400 to a more manageable 394 (derived from enzyme sets), and metabolite-receptor interactions from ~100 000 to 3864, featuring 807 unique receptors and 261 unique metabolites.
The resulting table can either be used in Python directly via BioCypher’s support of Pandas data frames, or exported to CSV from Neo4j, and seamlessly integrated into downstream analysis tools performing CCC, such as LIANA [@doi:10.1038/s41467-022-30755-0].
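Since the exported table is a pandas data frame, the contextualisation step above can be expressed as boolean filtering. This is a sketch under assumed column names (`tissue`, `in_urine`, `confidence` are illustrative, not the actual Metalinks schema), not the repository's real query.

```python
import pandas as pd

def contextualise(df, tissue="kidney", min_confidence=0.7):
    """Keep rows active in the given tissue or present in urine,
    at or above a confidence cutoff."""
    in_context = (df["tissue"] == tissue) | df["in_urine"]
    return df[in_context & (df["confidence"] >= min_confidence)]
```

The same filter could equally be run as a Cypher `WHERE` clause in Neo4j before export; doing it in pandas keeps the whole analysis in Python.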
