first round of supplement references
slobentanzer committed Nov 21, 2023
1 parent b547ee0 commit d534a73
Showing 6 changed files with 73 additions and 73 deletions.
6 changes: 3 additions & 3 deletions content/20.sup.methods.md
@@ -7,11 +7,11 @@ It also manages the mapping of data inputs to ontologies with the help of an ont
This modular architecture facilitates extension of all modules according to the community’s needs.

The resulting knowledge graphs (KGs) can be described as “instance-based” realisations of biomedical concepts: using the concept definition from the ontology, each entity in the graph becomes an instance of this concept.
We recommend the use of a generic “high-level” ontology such as the Biolink model 1, a comprehensive and generic biomedical ontology; where needed, this ontology can be exchanged with or extended by more specific and task-directed ontologies, for instance from the OBO Foundry 2.
We recommend the use of a generic “high-level” ontology such as the Biolink model [@doi:10.1111/cts.13302], a comprehensive and generic biomedical ontology; where needed, this ontology can be exchanged with or extended by more specific and task-directed ontologies, for instance from the OBO Foundry [@doi:10.1038/nbt1346].
The versions of all used ontologies should be specified by each pipeline, which can most effectively be realised by specifying a persistent URL (PURL) for the versioned ontology file (most commonly in OWL format) in the BioCypher configuration.
Identifier namespaces are collected from the community-curated and frequently updated Bioregistry service 3, which is important for ensuring continued compatibility among the created KGs.
Identifier namespaces are collected from the community-curated and frequently updated Bioregistry service [@doi:10.1038/s41597-022-01807-3], which is important for ensuring continued compatibility among the created KGs.
Bioregistry also supplies convenient methods for parsing identifier Compact URIs (CURIEs), which are the preferred method of unambiguously specifying identities of KG entities.
For identifier mapping, where required, the corresponding facilities of pypath 4 are used and extended.
For identifier mapping, where required, the corresponding facilities of pypath [@doi:10.1038/nmeth.4077] are used and extended.
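A CURIE pairs a registered namespace prefix with a local identifier, which is what makes it unambiguous across resources. The following is a minimal, purely illustrative parser sketching that structure; the actual Bioregistry package additionally resolves prefix synonyms, casing, and redundant-prefix patterns, so this is not its API.

```python
# Minimal sketch of CURIE structure (prefix:identifier); illustrative only.
# The real Bioregistry service also normalises prefix synonyms and casing.
def parse_curie(curie: str) -> tuple[str, str]:
    """Split a Compact URI into a (prefix, local identifier) pair."""
    prefix, _, identifier = curie.partition(":")
    if not prefix or not identifier:
        raise ValueError(f"not a valid CURIE: {curie!r}")
    # Prefixes are conventionally compared case-insensitively.
    return prefix.lower(), identifier

prefix, local_id = parse_curie("UniProtKB:P00533")
# prefix is "uniprotkb", local_id is "P00533"
```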

The preferred way of entering data into a BioCypher graph attaches scientific provenance to each entry, allowing the aggregation of data with respect to their sources (for instance, the publication an interaction was derived from) and thus avoiding problems such as duplicate counting of the same primary data from different secondary curations.
For author attribution, the preferred way of entering data into BioCypher also includes the exact provenance of each entry.
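One way to picture provenance-attached entries is as a stream of node tuples carrying the source metadata alongside the identifier and ontology class. This is a hedged sketch, not the definitive BioCypher input API; the property names `source`, `licence`, and `version` are illustrative assumptions.

```python
# Sketch of provenance-aware KG entries as (id, label, properties) tuples.
# Field names ("source", "licence", "version") are illustrative assumptions.
def protein_nodes(records):
    for rec in records:
        yield (
            f"uniprot:{rec['accession']}",  # CURIE identifier of the entity
            "protein",                      # ontology class the instance maps to
            {
                "source": rec["resource"],  # secondary curation it came from
                "licence": rec["licence"],  # licence of that resource
                "version": rec["version"],  # resource version, for reproducibility
            },
        )

records = [{"accession": "P00533", "resource": "IntAct",
            "licence": "CC-BY-4.0", "version": "2023-11"}]
nodes = list(protein_nodes(records))
```

Keeping the source on every entry is what allows aggregation by resource and deduplication of the same primary data arriving via different secondary curations.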
12 changes: 6 additions & 6 deletions content/21.sup.note.1.md
@@ -2,12 +2,12 @@

We here give some background and references on the problem of standardising biomedical knowledge representation.
Biomedical knowledge, although increasingly abundant, is fragmented across hundreds of resources.
For instance, a clinical researcher may use protein information from UniProtKB 5, genetic variants from COSMIC 6, protein interactions from IntAct 7, and information on clinical trials from ClinicalTrials.gov 8.
For instance, a clinical researcher may use protein information from UniProtKB [@doi:10.1093/nar/gku989], genetic variants from COSMIC [@doi:10.1093/nar/gku1075], protein interactions from IntAct [@doi:10.1093/nar/gkh052], and information on clinical trials from ClinicalTrials.gov [@doi:10.1001/jama.297.19.2112].


Finding the most suitable KG for a specific task is challenging and time-consuming; they are published in isolation and there is no registry 9,10.
Few available KG solutions perfectly fit the task the individual researcher wants to perform, but creating custom KGs is only possible for those that can afford years of development time by an individual 11,12 or even entire teams 13.
Finding the most suitable KG for a specific task is challenging and time-consuming; they are published in isolation and there is no registry [@doi:10.1093/bib/bbac404],[@doi:10.1146/annurev-biodatasci-010820-091627].
Few available KG solutions perfectly fit the task the individual researcher wants to perform, but creating custom KGs is only possible for those that can afford years of development time by an individual [@doi:10.1016/j.celrep.2019.09.017],[@doi:10.1038/s41467-022-33026-0] or even entire teams [@doi:10.1101/2021.10.28.466262].
Smaller or non-bioinformatics labs need to choose from publicly available KGs, limiting customisation and the use of non-public data.
There exist frameworks to build certain kinds of KG from scratch 14,15, but these are difficult to use for researchers outside of the ontology subfield and often have a rigid underlying data model 10,16.
Even task-specific knowledge graphs sometimes need to be built locally by the user due to licensing or maintenance reasons, which requires significant technical expertise 17.
Modifying an existing, comprehensive KG for a specific purpose is a non-trivial and often manual process prone to lack of reproducibility 18.
There exist frameworks to build certain kinds of KG from scratch [@doi:10.1101/2020.04.30.071407],[@doi:10.1101/631812], but these are difficult to use for researchers outside of the ontology subfield and often have a rigid underlying data model [@doi:10.1146/annurev-biodatasci-010820-091627],[@doi:10.1101/2020.08.17.254839].
Even task-specific knowledge graphs sometimes need to be built locally by the user due to licensing or maintenance reasons, which requires significant technical expertise [@doi:10.1038/s41467-022-28348-y].
Modifying an existing, comprehensive KG for a specific purpose is a non-trivial and often manual process prone to lack of reproducibility [@doi:10.1101/2022.11.29.518441].
2 changes: 1 addition & 1 deletion content/22.sup.note.2.md
@@ -21,7 +21,7 @@ Extensive automation reduces development time and file sizes, while additionally
4) Reusability and accessibility: Our template repository for a BioCypher pipeline with adapters, including a Docker Compose setup, is available on GitHub.
To enable learning by example, we curate existing pipelines as well as all adapters they use in a GitHub project that is tied to the BioCypher repository.
With these data, using the GitHub API and a pipeline based on our template, we build a BioCypher “meta-graph” for the simple browsing and analysis of existing BioCypher workflows (https://meta.biocypher.org/).
To inform the structure of this meta-graph, we have reactivated and now maintain the Biomedical Resource Ontology (BRO 19), which allows the categorisation of pipelines and adapters (now on GitHub).
To inform the structure of this meta-graph, we have reactivated and now maintain the Biomedical Resource Ontology (BRO [@doi:10.1016/j.jbi.2010.10.003]), which allows the categorisation of pipelines and adapters (now on GitHub).


While data FAIRness is a necessary part of open science communication, it is not sufficient for the adoption and sustainability of a software project such as BioCypher.
26 changes: 13 additions & 13 deletions content/24.sup.note.4.md
@@ -1,45 +1,45 @@
## Supplementary Note 4 - Prior Art

There have been numerous attempts at standardising knowledge graphs and making biomedical data stores more interoperable 9,10.
There have been numerous attempts at standardising knowledge graphs and making biomedical data stores more interoperable [@doi:10.1093/bib/bbac404],[@doi:10.1146/annurev-biodatasci-010820-091627].
They can be divided into three broad classes representing increasing levels of abstraction of the KG build process:

1) Centrally maintained databases include task-oriented data collections such as OmniPath 4 or the CKG 20.
1) Centrally maintained databases include task-oriented data collections such as OmniPath [@doi:10.1038/nmeth.4077] or the CKG [@doi:10.1038/s41587-021-01145-6].
They are the least flexible form of knowledge representation, usually bound to a specific research purpose, and are highly dependent on their primary maintainers for continuous functioning.
BioCypher reduces the development and maintenance overhead that usually goes along with such a resource, making a task-specific KG feasible for smaller and less bioinformatics-focused groups.
These databases usually do not conform to any standard in their knowledge representation, hindering their integration.
In contrast, with BioCypher, we migrate OmniPath, CKG, and other popular databases onto an interoperable KG framework.


2) Explicit standard formats or modelling languages include the Biolink model 1, BEL 21, GO-CAM 22, SBML 23, BioPAX 24, and PSI-MI 25.
There are many more, each a solution to a very specific problem, as reviewed elsewhere 21,26; some are part of the COMBINE standard ecosystem 27.
2) Explicit standard formats or modelling languages include the Biolink model [@doi:10.1111/cts.13302], BEL [@doi:10.1016/j.drudis.2013.12.011], GO-CAM [@doi:10.1038/s41588-019-0500-1], SBML [@doi:10.15252/msb.20199110], BioPAX [@doi:10.1038/nbt.1666], and PSI-MI [@doi:10.1038/nbt926].
There are many more, each a solution to a very specific problem, as reviewed elsewhere [@doi:10.1016/j.drudis.2013.12.011],[@doi:10.1093/bioinformatics/bti718]; some are part of the COMBINE standard ecosystem [@doi:10.3389/fbioe.2015.00019].
Their main shortcoming is the rigidity that follows from their data model definitions: to represent data in one of these languages, the user needs to fully adopt it.
If the task exceeds the scope of the language, the user needs to either look for alternatives, or introduce new features into the language, which can be a lengthy process.
In addition, some features may be incompatible, and thus, one centrally maintained language definition is fundamentally limited.
With BioCypher, each of the above languages can be adopted as the basis for a particular knowledge graph; in fact, we use the Biolink model as a basic ontology.
Inside our framework, these languages can be freely and transparently exchanged, modified, extended, and hybridised, as we show in several of our case studies (e.g., “Tumour board” extends Biolink with Sequence Ontology and Disease Ontology).

3) KG frameworks provide a means to build KGs, similar to the idea of BioCypher 14–16,28.
3) KG frameworks provide a means to build KGs, similar to the idea of BioCypher [@doi:10.1101/2020.04.30.071407];[@doi:10.1101/631812];[@doi:10.1101/2020.08.17.254839];[@doi:10.1186/s12859-022-04932-3].
However, most tie themselves tightly to a particular standard format or modelling language ecosystem, thereby inheriting many of the limitations described above.
The Knowledge Graph Hub provides a data loader pipeline, KGX allows conversion of KGs between different technical formats, and RTX-KG2 builds a fixed semantically standardised KG; all three adhere to the Biolink model 16,28.
Bio2BEL is an extensive framework to transform primary databases into BEL 15.
The Knowledge Graph Hub provides a data loader pipeline, KGX allows conversion of KGs between different technical formats, and RTX-KG2 builds a fixed semantically standardised KG; all three adhere to the Biolink model [@doi:10.1101/2020.08.17.254839],[@doi:10.1186/s12859-022-04932-3].
Bio2BEL is an extensive framework to transform primary databases into BEL [@doi:10.1101/631812].
PheKnowLator is the only tool that is conceptually similar to BioCypher in that it allows the creation of knowledge graphs under different data models [@doi:10.1101/2020.04.30.071407].
However, it appears to be aimed at knowledge representation experts, requiring considerable bioinformatics and ontology expertise.
While being fully customisable, it does not feature flexible recombination of modular components.


The strategy of subgraph extraction to yield smaller, user-specific KGs has been implemented previously, for instance by CROssBAR (v1), ROBOKOP, and the BioThings Explorer 29–31.
However, these rely on single (and thus enormous) harmonised KGs for extracting the subgraphs as opposed to BioCypher’s modular approach 32.
The strategy of subgraph extraction to yield smaller, user-specific KGs has been implemented previously, for instance by CROssBAR (v1), ROBOKOP, and the BioThings Explorer [@doi:10.1093/nar/gkab543];[@doi:10.1093/bioinformatics/btz604];[@doi:10.1186/s12859-018-2041-5].
However, these rely on single (and thus enormous) harmonised KGs for extracting the subgraphs as opposed to BioCypher’s modular approach [@doi:10.1111/cts.12592].
While the “top-down” approach of first building a massive KG and then extracting subgraphs from it is a valid means to arrive at a particular knowledge representation, the effort involved is detrimental to efficiency and democratisation of the process.
A secondary consequence of this large primary effort is that alternative representations of the initial KG will probably not be attempted, hindering flexible knowledge representation.
In contrast, the “bottom-up” approach we follow in BioCypher emphasises modular recombination and flexible representation with small effort overheads.
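The top-down/bottom-up contrast can be made concrete with a toy example: subgraph extraction first builds (or downloads) the full harmonised edge list and then filters it to a task-specific node set, whereas the bottom-up approach would only ever ingest the needed part. All identifiers below are made up for illustration.

```python
# Toy "top-down" subgraph extraction: filter a large harmonised KG
# (here, a flat edge list) down to a task-specific node set.
full_kg = [
    ("drug:D1", "targets", "protein:P1"),
    ("protein:P1", "interacts_with", "protein:P2"),
    ("protein:P2", "associated_with", "disease:X"),
    ("gene:G9", "encodes", "protein:P9"),
]

task_nodes = {"drug:D1", "protein:P1", "protein:P2"}

# Keep only edges whose endpoints both belong to the task node set.
subgraph = [(s, r, t) for s, r, t in full_kg
            if s in task_nodes and t in task_nodes]
```

The filtering itself is cheap; the cost sits in assembling and maintaining `full_kg`, which is exactly the effort a modular, bottom-up build avoids.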

Ontology mapping has been leveraged for data integration by consortia such as the Monarch Initiative (which is the parent organisation of the MONDO Disease Ontology and the Biolink model, among others) as well as single projects, such as KaBOB 33,34.
Ontology mapping has been leveraged for data integration by consortia such as the Monarch Initiative (which is the parent organisation of the MONDO Disease Ontology and the Biolink model, among others) as well as single projects, such as KaBOB [@doi:10.1534/genetics.116.188870],[@doi:10.1186/s12859-015-0559-3].
While conceptually related to BioCypher in the use of ontology and biomedical data, these are massive efforts that are not amenable to replication by the average research group.
We aim to close this gap by providing an agile and modular framework that facilitates the reuse of the valuable resources generated by those projects.

There exist alternatives to workflows that involve KGs.
While the premise of our manuscript is that KGs are an important part of sustainable and trustworthy machine learning in the biomedical sciences, “zero domain knowledge” approaches such as UniHPF 35 can do without prior knowledge in their inference process.
Whether methods that forego knowledge representation entirely can be as good or better than methods that use knowledge representation is still a matter of discussion 36–42.
While the premise of our manuscript is that KGs are an important part of sustainable and trustworthy machine learning in the biomedical sciences, “zero domain knowledge” approaches such as UniHPF [@doi:10.48550/arXiv.2211.08082] can do without prior knowledge in their inference process.
Whether methods that forego knowledge representation entirely can be as good or better than methods that use knowledge representation is still a matter of discussion [@doi:10.1038/s41551-022-00942-x];[@doi:10.1101/2022.05.01.489928];[@doi:10.1101/2022.12.07.22283238];[@doi:10.48550/arxiv.2210.09338];[@doi:10.1016/j.artint.2021.103627];[@doi:10.48550/arXiv.2205.15952];[@doi:10.1093/bioinformatics/btac001].
One aspect that is apparent from modern developments in large language models is that prior knowledge-free models appear to be very data hungry; while billion-parameter models are very impressive in their text and image processing capabilities, we have nowhere near enough data in molecular biomedicine to train a GPT-like model, even if we had the funds to train it.
In addition, even in prior knowledge-free deep models, a semantically enriched knowledge graph can still play a role and be useful as an in-process component 43.
In addition, even in prior knowledge-free deep models, a semantically enriched knowledge graph can still play a role and be useful as an in-process component [@doi:10.1609/aaai.v36i10.21286].
To address these and other performance-related questions, we want to facilitate the creation of benchmarks and standard datasets through the modular nature of our framework.