Skip to content

Latest commit

 

History

History
206 lines (171 loc) · 26.4 KB

04.network_applications.md

File metadata and controls

206 lines (171 loc) · 26.4 KB

Applying Knowledge Graphs to Biomedical Challenges

Knowledge graphs can help researchers tackle many biomedical problems such as finding new treatments for existing drugs [@doi:10.7554/eLife.26726], aiding efforts to diagnose patients [@arxiv:1611.07012] and identifying associations between diseases and biomolecules [@doi:10.1155/2017/2498957]. In many cases, solutions rely on representing knowledge graphs in a low dimensional space, which is a process called representational learning. The goal of this process is to retain and encode the local and/or global structure of a knowledge graph that is relevant to the problem while transforming the graph into a representation that can be readily used with machine learning methods to build predictors. In the following sections we review methods that construct a low dimensional space (Unifying Representational Learning Techniques) and discuss applications that use this space to solve biomedical problems (Unifying Applications).

Unifying Representational Learning Techniques

Mapping high dimensional data into a low dimensional space greatly improves modeling performance in fields such as natural language processing [@arxiv:1301.3781; @arxiv:1310.4546] and image analysis [@arxiv:1512.03385]. The success of these approaches served as rationale for a sharper focus on representing knowledge graphs in a low dimensional space [@arxiv:1709.05584]. Methods of this class are designed to capture the essence of a knowledge graph in the form of dense vectors [@doi:10.1007/BF02288367; @doi:10.1162/089976603321780317]. These vectors are often assigned to nodes in a graph [@arxiv:1607.00653], but edges can be assigned as well [@raw:bordestranse]. Techniques that construct a low dimensional space often require information on how nodes are connected with one another [@raw:gongreplearn; @doi:10.1109/TKDE.2016.2606098; @raw:hanreplearn; @doi:10.1145/2806416.2806512], while other approaches can work directly with the edges themselves [@arxiv:1610.04073]. Once this space has been constructed, machine learning techniques can utilize the space for downstream analyses such as classification or clustering. We group techniques that construct this space into the following three categories: matrix factorization, translational distance models, and neural network models (Figure {@fig:unifying_techniques_overview}).

 Pipeline for representing knowledge graphs in a low dimensional space. Starting with a knowledge graph, this space can be generated using one of the following options: Matrix Factorization (a), Translational Models (b) or Neural Network Models (c). The output of this pipeline is an embedding space that clusters similar node types together. {#fig:unifying_techniques_overview}

Matrix Factorization

Matrix factorization is a class of techniques that use linear algebra to map high dimensional data into a low dimensional space. This projection is accomplished by decomposing a matrix into a set of small rectangular matrices (Figure {@fig:unifying_techniques_overview} (a)). Notable methods for matrix decomposition include Isomap [@doi:10.1126/science.290.5500.2319], Laplacian eigenmaps [@doi:10.1162/089976603321780317] and Principal Component Analysis (PCA) [@doi:10/bm8dnf]/Singular Vector Decomposition (SVD) [@doi:10.1007/BF02288367]. These methods were designed to be used on many different types of data; however, we discuss their use in the context of representing knowledge graphs in a low dimensional space and focus particularly on SVD and laplacian eigenmaps.

SVD [@doi:10.1007/BF02288367] is an algorithm that uses matrix factorization to portray knowledge graphs in a low dimensional space. The input for this algorithm is an adjacency matrix ($A$), which is a square matrix where rows and columns represent nodes and each entry is a binary representation of the presence of an edge between two nodes. $A$ is constructed based on the knowledge graph's structure itself and collapses all edges between two nodes into one unique entity. Following construction, $A$ is decomposed into the following three parts: a square matrix $Σ$ and a set of two small rectangular matrices $U$ and $V^{T}$. Values within $Σ$ are called singular values, which are akin to eigenvalues [@doi:10.1007/BF02288367]. Each row in $U$ and each column in $V^{T}$ represents nodes within a low dimensional space [@doi:10.1007/BF02288367; @doi:10/bm8dnf]. In practice, $U$ is usually used to represent nodes in a knowledge graph and can be used as input for machine learning classifiers to perform tasks such as link prediction or node classification [@doi:10.1093/bioinformatics/btz718]; however, $V^{T}$ has also been used [@doi:10.1007/BF02288367; @raw:Jiezhon2018]. Typically, matrix factorization algorithms such as SVD are used for recommendation systems via collaborative filtering [@doi:10.1155/2009/421425]; however, this technique can also provide a standalone baseline for other relational learning approaches [@doi:10.1093/bioinformatics/btz718].

Laplacian eigenmaps assume there is low dimensional structure in a high dimensional space and preserves this structure when projecting data into a low dimensional space [@doi:10.1162/089976603321780317]. The first step of this technique is to preserve the low dimensional structure by representing data in the form of a graph where nodes are datapoints and edges are the distance between two points. Knowledge graphs already provide this representation, so no additional processing is necessary at this stage. The second step of this technique is to obtain both an adjacency matrix ($A$) and a degree matrix ($D$) from the graph representation. A degree matrix is a diagonal matrix where each entry represents the number of edges connected to a node. The adjacency and degree matrices are converted into a laplacian matrix ($L$), which is a matrix that shares the same properties as the adjacency matrix. The laplacian matrix is generated by subtracting the adjacency matrix from the degree matrix ($L=D-A$) and, once constructed, the algorithm uses linear algebra to calculate the laplacian's eigenvalues and eigenvectors ($Lx = \lambda Dx$). The generated eigenvectors represent the knowledge graph's nodes represented in a low dimensional space [@doi:10.1162/089976603321780317]. Other efforts have used variants of this algorithm to construct low dimensional representations of knowledge graphs [@raw:gongreplearn; @doi:10.1109/TKDE.2016.2606098; @arxiv:1905.09763]. Typically, eigenmaps work well when knowledge graphs have a sparse number of edges between nodes but struggle when presented with denser networks [@doi:10.1093/bioinformatics/btz718; @doi:10.1371/journal.pcbi.1005621; @arxiv:1905.09763]. An open area of exploration is to adapt these methods to accommodate knowledge graphs that have a large number of edges.

Matrix factorization is a powerful technique that represents high dimensional data in a low dimensional space. The representation of a knowledge graph in this reduced space does not meet our definition of a knowledge graph; however, this representation supports many use cases including similarity-based (e.g., cosine similarity [@arxiv:1910.09129]) and machine learning applications. Common matrix factorization approaches involve using SVD, Laplacian eigenmaps or variants of the two to decompose matrices into smaller rectangular forms. Regarding knowledge graphs, the adjacency matrix ($A$) is the typical matrix that gets decomposed, but the laplacian matrix ($L=D-A$) can be used as well. Despite reported success, the dependence on matrices creates an issue of scalability as matrices of large networks may reach memory limitations. Furthermore, the approaches we discussed consider all edge types as equivalent. These limitations could be mitigated by new approaches designed to accommodate multiple node and edge types separately.

Translational Distance Models

Translational distance models treat edges in a knowledge graph as linear transformations. For example, one such algorithm, TransE [@raw:bordestranse], treats every node-edge pair as a triplet with head nodes represented as $\textbf{h}$, edges represented as $\textbf{r}$, and tail nodes represented as $\textbf{t}$. These representations are combined into an equation that mimics the iconic word vectors translations ($\textbf{king} - \textbf{man} + \textbf{woman} \approx \textbf{queen}$) from the word2vec model [@arxiv:1310.4546]. The described equation is shown as follows: $\textbf{h} + \textbf{r} \approx \textbf{t}$. Starting at the head node ($\textbf{h}$), one adds the edge vector ($\textbf{r}$) and the result should be the tail node ($\textbf{t}$). TransE optimizes vectors for $\textbf{h}$, $\textbf{r}$, $\textbf{t}$, while guaranteeing the global equation ($\textbf{h} + \textbf{r} \approx \textbf{t}$) is satisfied [@raw:bordestranse]. A caveat to the TransE approach is that it forces relationships to have a one to one mapping, which may not be appropriate for all relationship types.

Wang et al. attempted to resolve the one to one mapping issue by developing the TransH model [@raw:wangtransH]. TransH treats relations as hyperplanes rather than a regular vector and projects the head ($\textbf{h}$) and tail ($\textbf{t}$) nodes onto a hyperplane. Following this projection, a distance vector ($\textbf{d}{r}$) is calculated between the projected head and tail nodes. Finally, each vector is optimized while preserving the global equation: $\textbf{h} + \textbf{d}{r} \approx \textbf{t}$ [@raw:wangtransH]. Other efforts have built off of the TransE and TransH models [@raw:lintransR, @arxiv:1610.04073; @arxiv:1909.00672]. In the future, it may be beneficial for these models to incorporate other types of information such as edge confidence scores, textual information, or edge type information when optimizing these distance models.

Neural Networks

Neural networks are a class of machine learning models inspired by the concept of biological neural networks [@pmid:11084225]. These networks are reputable for making non-linear transformations of high dimensional data to solve classification and regression problems [@pmid:11084225]. In the context of knowledge graphs, the most commonly used structures are based on word2vec [@arxiv:1310.4546; @arxiv:1301.3781]. The word2vec term applies to a set of conceptually related approaches that are widely used in the natural language processing field. The goal of word2vec is to project words onto a low dimensional space that preserves their semantic meaning. Strategies for training word2vec models use one of two neural network architectures: skip-gram and continuous bag of words (CBOW). Both models are feed-forward neural networks, but CBOW models are trained to predict a word given its context while skip-gram models are trained to predict the context given a word [@arxiv:1310.4546; @arxiv:1301.3781]. Once training is completed, words will be associated with dense vectors that downstream models, such as feed forward networks or recurrent networks, can use for input.

Deepwalk is an early method that represents knowledge graphs in a low dimensional space [@doi:10.1145/2623330.2623732]. The first step of this method is to perform a random walk along a knowledge graph. During the random walk, every generated sequence of nodes is recorded and treated as a sentence in word2vec [@arxiv:1310.4546; @arxiv:1301.3781]. After every node has been processed, a skip-gram model is trained to predict the context of each node thereby constructing a low dimensional representation of a knowledge graph [@doi:10.1145/2623330.2623732]. A limitation for deepwalk is that the random walk cannot be controlled, so every node has an equal chance to be reached. Grover and Leskovec demonstrated that this limitation can hurt performance when classifying edges between nodes and developed node2vec as a result [@arxiv:1607.00653]. Node2vec operates in the same fashion as deepwalk; however, this algorithm specifies a parameter that lets the random walk be biased when traversing nodes [@arxiv:1607.00653]. A caveat to both deepwalk and node2vec is that they ignore information such as edge type and node type. Various approaches have evolved to fix this limitation by incorporating node, edge and even path types when representing knowledge graphs in a low dimensional space [@doi:10.1145/3097983.3098061; @doi:10.1145/3097983.3098036; @arxiv:1809.02269; @arxiv:1808.05611]. An emerging area of work is to develop approaches that capture both the local and global structure of a graph when constructing this low dimensional space.

Though word2vec is the most common framework used to represent graphs, neural networks are sometimes designed to use the adjacency matrix as input [@arxiv:1310.4546; @arxiv:1301.3781]. These approaches use models called autoencoders [@doi:10.1109/DSAA.2018.00034; @arxiv:1611.07308; @arxiv:1802.04407]. Autoencoders are designed to map input into a low dimensional space and then back to a reconstruction of the same input [@doi:10.1016/j.neunet.2014.09.003; @raw:Baldi2011]. It is possible to layer on additional objectives by modifying the loss function to take into account criteria above and beyond reconstruction loss [@arxiv:1312.6114; @arxiv:1802.03480]. In the context of knowledge graphs, the generated space correlates nodes with dense vectors that capture a graph's connectivity structure [@doi:10.1109/DSAA.2018.00034; @arxiv:1611.07308; @arxiv:1802.04407]. Despite the high potential of autoencoders, this method relies on an adjacency matrix for input which can run into scalability issues as a knowledge graph asymptotically increases in size [@doi:10.1109/TKDE.2019.2951398]. Plus, Khosla et al. discovered that approaches akin to node2vec outperformed algorithms using autoencoders when undergoing link prediction and node classification [@doi:10.1109/TKDE.2019.2951398].

Overall, the performance of neural network models largely depends upon the structure of nodes and edges within a knowledge graph [@doi:10.1109/TKDE.2019.2951398]. Furthermore, when these approaches are used only nodes are explicitly represented by these vectors. This means a represented knowledge graph no longer meets our definition of a knowledge graph; however, this representation can make it more suitable for many biomedical applications. Future areas of exploration should include hybrid models that use both node2vec and autoencoders to construct complementary low dimensional representations of knowledge graphs.

Unifying Applications

Knowledge graphs have been applied to many biomedical challenges ranging from identifying proteins' functions [@doi:10.1186/s12859-018-2163-9] to prioritizing cancer genes [@doi:10.1093/bioinformatics/bty148] to recommending safer drugs for patients [@arxiv:1710.05980; @doi:10.1609/aaai.v33i01.33011126] (Figure {@fig:unifying_applications}). In this section we review how knowledge graphs are applied in biomedical settings and put particular emphasis on an emerging set of techniques that represent knowledge graphs in a low dimensional space.

 Overview of various biomedical applications that make use of knowledge graphs. Categories consist of: (a) Multi-Omic applications, (b) Pharmaceutical Applications and (c) Clinical Applications. {#fig:unifying_applications}

Multi-Omic Applications

Multi-omic applications employ knowledge graphs to study the genome, how genes are expressed in the transcriptome, and how the products of those transcripts interact in the proteome. These graphs are used to establish connections between -omic entities as well as diseases. Tasks in this context include gene-symptom prioritization [@doi:10.1093/jamia/ocy117], protein-protein interaction prediction [@doi:10.1089/cmb.2012.0273; @doi:10.1186/1752-0509-9-S1-S9] and detecting miRNA-disease associations [@doi:10.1155/2017/2498957]. We focus specifically on multi-omic applications that represent knowledge graphs in a low dimensional space to make connections.

Recommendation systems make use of knowledge graphs to establish links between RNA with disease and proteins with other proteins. Shen et al. used an algorithm called collaborative filtering to establish an association between miRNA and diseases [@doi:10.1155/2017/2498957]. The authors constructed a miRNA-Disease network using the Human MicroRNA Disease database (HMDD) [@doi:10.1093/nar/gky1010] and generated an adjacency matrix with the rows representing miRNA and the columns representing diseases. This matrix was decomposed into small rectangular matrices using SVD, then these small matrices were used to calculate similarity scores between miRNAs and diseases. High scores implied a high likelihood that a given miRNA had an association with a given disease [@doi:10.1155/2017/2498957]. Other approaches built off of Shen et al.'s work by incorporating novel ways to perform matrix factorization [@doi:10.3390/genes10020080; @doi:10.1186/s12859-019-2956-5;@doi:10.1186/s12859-019-3260-0] or by integrating machine learning models in conjunction with matrix factorization [@doi:10.1109/TCBB.2019.2937774]. These approaches achieved high area under the receiver operating curve (AUROC), but new discoveries have been hard to validate as experiments in this space are costly and time consuming at best [@doi:10.1155/2017/2498957]. Apart from miRNA, collaborative filtering has been used to predict protein-protein interactions [@doi:10.1109/BIBM.2010.5706537; @doi:10.1089/cmb.2012.0273; @doi:10.1186/1752-0509-9-S1-S9]. Although extensive validation of newly generated candidates may be impractical, it would be helpful to see future efforts in this space include a blinded literature search for prioritized and randomly selected candidates as part of the standard evaluation pipeline.

Applications of neural network models have mainly used the node2vec model [@arxiv:1607.00653] or variants of it. Yang et al. used node2vec to create a recommendation system to infer associations between genes and disease symptoms [@doi:10.1093/jamia/ocy117]. The authors constructed a gene-disease symptom knowledge graph by combining two bipartite graphs: genes with diseases and diseases with disease symptoms. The generated graph was embedded via node2vec and similarity scores were calculated for every gene-symptom pair in the graph. High scores implied a high likelihood of an association [@doi:10.1093/jamia/ocy117]. This approach outperformed methods that didn't use a knowledge graph; however, validation was difficult as it involved manual curation of the literature [@doi:10.1093/jamia/ocy117]. Similar approaches used variants of node2vec to predict gene-disease associations [@doi:10.1093/bioinformatics/bty559; doi:10.3389/fgene.2019.00226 @doi:10.1186/s12920-019-0627-z; @doi:10.1109/BIBM47256.2019.8983134] analyze RNA-seq data [@doi:10.1093/nar/gkx750] and infer novel protein information [@doi:10.1093/bioinformatics/btx275; @doi:10.1007/978-3-030-00825-3_16; @doi:10.1016/j.artmed.2019.04.001; @doi:10.1186/s12859-018-2163-9].

Knowledge graphs benefited the multi-omics field as a resource for generating novel discoveries. Most approaches to date use matrix factorization and node2vec to project knowledge graph into a low dimensional space, while translational models (Figure {@fig:unifying_techniques_overview} (b)) may be an untapped resource that could aid future efforts. Another area of exploration could be incorporating multiple sources of information such as compounds, anatomic locations or genetic pathways to improve the specificity of findings (i.e., to predict that a protein-protein interaction happens in a specific cell type or tissue).

Pharmaceutical Applications

There are a multitude of examples where knowledge graphs have been applied to identify new properties of drugs. Tasks in this field involve predicting drugs interacting with other drugs [@doi:10.1016/j.websem.2017.06.002], identifying molecular targets a drug might interact with [@doi:10.1155/2015/275045] and identifying new disease treatments for previously established drugs [@doi:10.7554/eLife.26726.001]. In this section we concentrate on applications that apply these graphs to discover new properties of drugs and focus on approaches that use these graphs in a low-dimensional space.

Similar to multi-omic applications, recommendation systems have utilized knowledge graphs to infer novel links between drugs and diseases. Dai et al. used collaborative filtering to infer drug-disease associations [@doi:10.1155/2015/275045]. The authors constructed a drug-disease network by integrating two bipartite networks: a drug-gene interaction network and a disease-gene interaction network. They integrated both networks under the assumption that drugs associated with a disease interact with the same gene of interest. Following construction, the authors generated an adjacency matrix where rows represent drugs and columns represent diseases. This matrix was decomposed into two small rectangular matrices and these matrices were used to calculate similarity scores between all drugs and all diseases. High values implied a high chance of an association [@doi:10.1155/2015/275045]. Related approaches used this technique to infer drug-target interactions [@doi:10.1109/TCBB.2016.2530062; @doi:10.1109/BIOCAS.2018.8584817; @doi:10.1093/bioinformatics/btaa010] and drug-disease treatments [@doi:10.1109/TCBB.2018.2830384; @doi:10.1371/journal.pcbi.1004760; @doi:10.1038/srep40376; @doi:10.1021/ci500340n; @doi:10.1186/s12859-018-2220-4]. In spite of reported success, these approaches are limited to the drugs and diseases contained in the graph. Combining these approaches with representations of chemical structures might make it possible to one day make predictions about novel compounds.

Applications that use neural network models have used node2vec [@doi:10.1093/bioinformatics/btx160; @doi:10.1101/539643] and autoencoders [@arxiv:1804.10850v1; @doi:10.1093/bioinformatics/bty294] approaches to represent knowledge graphs in a low dimensional space. Zong et al. used a node2vec-like model to predict drug-target associations [@doi:10.1093/bioinformatics/btx160]. The authors constructed a disease-target-disease network using drug centered databases: Drugbank [@doi:10.1093/nar/gkm958] and Diseasome [@doi:10.1073/pnas.0701361104]. Next, the authors applied a random walk to the graph and trained a skip-gram model to generate a low dimensional representation of the graph. Lastly, the authors constructed a similarity metric that used this space to rank how similar drugs are to their targets [@doi:10.1093/bioinformatics/btx160]. A limitation to this approach is that their graph is missing information such as pharmacological class or drug chemical structure that could improve prediction performance. Overall, neural networks provide a robust set of techniques that have been shown to outperform most linear approaches in this context [@doi:10.1186/s12859-019-3284-5; @doi:10.1145/3307339.3342161].

Applications that discover new properties of drugs have benefited from using knowledge graphs as a resource. Most methods to date use matrix factorization and neural network models to produce a low-dimensional representation. Due to the success of neural networks [@doi:10.1186/s12859-019-3284-5; @doi:10.1145/3307339.3342161] much of the field's focus has shifted to these techniques; however, a possible improvement is to use an ensemble of neural network models and linear methods to improve performance. Another potential avenue for future work would be to incorporate entity-specific hierarchical information or similarity information to improve detection power. For drugs, this could include pharmaceutical classes or chemical structure similarities.

Clinical applications

Clinical applications that use knowledge graphs are in early stages of development, but the long-term goal is to use analyses of these graphs to aid patient care. Typically, graphs for these applications are constructed from electronic health records (EHR): nodes represent patients, drugs and diseases while edges represent relationships such as a patient being prescribed a treatment or a patient being diagnosed with a disease [@pmid:26306276; @doi:10.1145/2110363.2110415; @arxiv:1707.05340; @doi:10.1186/s12859-015-0549-5]. Tasks within this field range from improving patient diagnoses [@doi:10.1109/TKDE.2016.2605687;@arxiv:1709.06908] to recommending safer drugs for patients [@arxiv:1710.05980; @arxiv:1709.06908]. We briefly discuss efforts that use knowledge graphs to accomplish such tasks.

Early work in this field applied translational models (Figure {@fig:unifying_techniques_overview} (b)) to knowledge graphs with the goal of recommending safe drugs. Wang et al. used a variant of the TransH [@raw:wangtransH] model to create such a system for patients [@arxiv:1710.05980]. They constructed a disease-patient-drug network by integrating a patient-disease bipartite network with a patient-drug bipartite network. Every node in the newly constructed graph was embedded while satisfying the following equation: $\textbf{h} - \textbf{r} \approx \textbf{t}$. Following the embedding step, the authors formulated their own similarity metric that selected drug combinations with a low number of interactions [@arxiv:1710.05980]. Researchers in [@arxiv:1909.00672] applied a similar variant of the TransH model to a medical knowledge graph and evaluated their model for link prediction rather than patient recommendation.

In contrast with most applications where node2vec and autoencoder models have become established, this field have focused on using graph attention models [@arxiv:1706.03762]. These models mimic machine translation models [@arxiv:1409.0473] and aim to simultaneously represent knowledge graphs in a low dimensional space and perform the task at hand. Choi et al. used a graph attention model to predict patient diagnoses [@arxiv:1611.07012]. The authors constructed a directed graph using medical concepts from patient EHR data. This directed graph was fed into a graph attention network and then used to predict a patient's likelihood of heart failure [@arxiv:1611.07012]. Other approaches have used graph attention models to perform clinical tasks such as drug safety recommendations [@doi:10.1609/aaai.v33i01.33011126] and patient diagnoses [@arxiv:1906.04716].

Knowledge graphs have shown promising results when used for clinical applications; however, there is still room for improvement. Most approaches have run into the common problem of missing data within EHR [@arxiv:1611.07012; @arxiv:1710.05980; @doi:10.1609/aaai.v33i01.33011126]. Future directions for the field consist of designing algorithms that can fill in this missing data gap or construct models that can take missing data into account.