# 0. Index

- Introduction
- Graph and GraphDB
- GNN
- GraphDB tasks, classical algorithms, and deep model based approaches
- Remaining challenges in applying deep model based algorithms to GraphDB
- Conclusion
- Reference

# 1. Introduction

This post aims to introduce the integration of graph database (GraphDB) and deep model based Machine Learning (ML) on a graph. While this has not caught a wide audience yet, we believe this can be one of the successful applications of Machine Learning. Like improvement in ML opened the promising field of optimizing Relational Database System (RDBS) using ML, we expect the improvement of Machine Learning on a graph can shed a new light on the optimization of GraphDB.

## π Integration of database and machine learning in RDBS

Here we show the example how ML can be helpful in enhancing the optimization of RDBS.

The order of "JOIN" operations in SQL can significantly change the overall performance. Therefore, determining the order of "JOIN" operations is one of the important tasks to optimize the RDBS operations. In 2018, UC Berkeley Rise Lab introduced optimizing the order of "JOIN" queries using Reinforcement Learning.

Classical algorithms solve this problem by dynamic programming (DP). For instance, assume we have three tables in the database, which are S (Salary), E (Employee), and T (Taxes), and want to perform the query below.

This query requires RDBS to join three tables, {E}, {S}, and {T}. To solve the optimal cost of "JOIN

" operation for two tables, we can formulate the below recursive equation:

$$\text{Best}(T_1,T_2)=\text{Best}(T_1)+\text{Best}(T_2)+J(T_1,T_2)$$

Therefore, we first save $\text{Best}(E), \text{ Best}(S), \text{ Best}(T)$, which is just time scanning each table, and then define $\text{Best}({E,S}),\text{ Best}({S,T}),\text{ Best}({T,E})$ using above recursive equation. Then, we finally define $\text{Best}({E,S,T})$ by findning minimum from$\text{Best}({E,S},T), \text{ Best}({S,T},E), \text{ Best}({T,E},S)$.

This problem can be solved using DP. When we have sufficient memory, DP is a powerful way to find optimal in a recursive equation. However, when it is hard to save all possible operations in a table, Reinforcement Learning (RL) can be one of the alternatives for DP, by constructing a neural network that can approximate operation tables. We omit the detailed approaches for executing "JOIN" operation using RL, and link the paper and blog posts instead.

This shows that ML can be one of the promising approaches to optimize an operation of a database. We believe MLs and deep models are not only applicable to RDBS, but also in the field of GraphDB, with the recent progress in GraphDBs and GNNs.

## π Content of the article

The content of this post is organized like the following. First, we explain what a graph is, and what a GraphDB is with a few well-known GraphDB libraries. Second, we introduce the field of Graph Neural Network (GNN), one of the ways to efficiently perform a deep model on a graph, and a couple of well-known libraries for GNN. Third, we summarize tasks in GraphDB, introduce how classical algorithms solve those problems, and analyze how deep models on a graph can be efficient for those tasks. Finally, we observe what are the future research challenges in this field.

The aim of this article is to overview previous studies and possible research directions of utilizing deep moels and GNNs for GraphDB, instead of explaining each algorithm in detail. Therefore, instead of delving into the specifics of algorithms and theories, we will point out key ideas and refer to well-written references for the details.

# 2. Graph and GraphDB

## π Graph

A graph is an object that aims to illustrate the relationships among entities. In a mathematical language, a graph can be defined as a tuple G = (V,E), where V is a vertex set and E is an edge set. But in various domains more specific definitions, rather than this general one, are more useful. Based on how to define the vertices and the edges, there are many types of graph representations, one of which is a simple undirected unweighted graph. We can have more complicated graph representations by appending weights, directions, node properties, and edge properties as needed.

## πGraphDB

A relational database can represent a graph using a table schema. But graphs usually do not fit well into such a representation, and therefore classical relational databases often fail to process graph data in an efficient manner. Graph Database (GraphDB) aims to enhance the performance of storing, processing, and managing graph datasets, by implementing a graph-specific database.

### Graph Storing methods in GraphDB

A GraphDB needs to store nodes, edges, and properties in a efficient manner.

Nodes and properties are usually stored in a vector, and edges are usually stored in an adjacency matrix or adjacency list, with each relationship as an element of the list or matrix.

While a matrix representation has advantages in indexing and referencing, a list representation has the benefits of saving memory spaces. Many GraphDBs prefer to choose the list representation due to the limit of the memory capacity, but each has a different implementation.

### Libraries

According to the https://db-engines.com/en/ranking/graph+dbms, Neo4j and JanusGraph are the GraphDBs that are widely used in April, 2021. We briefly introduce how these t GraphDBs differ.

Neo4j is a Java and Scala-based GraphDB, and it is currently the most widely used GraphDB. Neo4j stores graph data using the variation of an adjacency list. In Neo4j, each node has a pointer to the first relationship. Then, a doubly-linked list of relationships follows after this first relationship. This enables Neo4j to store multiple relationships.

Like there is a query language SQL for RDBS, Neo4j also has its own query language, called Cypher. We can interact with Neo4j Database using Cypher. Cypher has an advantage in code readability, while it suffers the lack of optimization possibilities.

JanusGraph is an open-source Java-based GraphDB. JanusGRaph stores data using adjacency lists. It maintains the adjacency list of each vertex in sort order with the order being defined by the sort key and sort order the edge labels. The sort order enables efficient retrievals of subsets of the adjacency list. The below shows a brief structure of an adjacency list in JanusGraph. Each property contains what property it is with the key. And each edge contains what relationship it is and what is adjacent vertices.

*JanusGraph Internal. Reference: https://docs.janusgraph.org/advanced-topics/data-model/*

This adjacency list is stored by Bigtable data model, which is a wide column storing table. Each vertex is a key, and each adjacency list is a value. Since JanusGraph uses a Bigtable, Janusgraph has a freedom in storage backends if the storage backend supports the JanusGraph. The possible candidates of storage backends for JanusGraph includes Apache Cassandra, Apache HBase, and Oracle Berkeley DB Java Edition.

JanusGraph recommends Gremlin query language for interactions. Gremlin has the advantage of very compact syntax and higher optimization level compared to Cypher, but it lacks code readability compared to Cypher.

*Comparison between query languages. Reference: Querying a graph database β language selection and performance considerations*

# 3. GNN

A graph neural network (GNN) takes graph data as an input and implement Neural Network architectures in a graph-specific way.

## π Representation learning and GNN

GNN is closely related to representation learning, whose goal is representing raw input data in a more "meaningful" way, that is, encoding the input data to some particular embedding that enables to extract information more easily.

When we use NN architectures, we regard the space of vectors before passing the final layer as a "representation". The final layer works as aggregating the representation to solve the given task. This point of view is applicable to all NN-based architectures. The field of obtaining meaningful representation for the task, especially using ML, is called "Representation Learning".

To use ML-based approaches, GNN tries to learn the "good representation" of the graph, and use this representation to solve the given problem. Therefore, the performance of a GNN is closely related to the quality of the representation. Typical representation learning tries to embed data into Euclidean space, but it does not have to be a Euclidean or even a metric space.

## π GNN architectures

### Implementing Neural Network Operations in Graph Domain

Inspired from recent progresses in Computer Vision (CV) and Natural Language Processing (NLP), there were attempts to integrate the architectures that successfully worked in those domains, such as Convolutional Neural Network (CNN) and Attention Network, with graph-theoretical approaches. With some attempts, It revealed that a "message-passing" scheme in a graph theory fits well with the NN architectures. As a result, many architectures are developed based on the message-passing scheme.

In the following sections, we first introduce the message-passing scheme, and review some widely-known GNN architectures, Graph Convolutional Network (GCN), Graph Attention Network (GAT), and GraphSAGE, which are based on the message-passing. There are more algorithms developed recently, for example, Graph Neural Tangent Kernel and Graph Wavelet Convolutional Network, but we omit those works for this article.

### Message-Passing Scheme: A general framework of GNNs

One of the fundamental approaches to interpret the various graph algorithms is Message-Passing, a theoretical framework that views each graph node containing a message and analyzes how these messages are passed through their neighborhood. This implies that we can calculate backward from the neighbors to retrieve the original node. In other words, aggregating the message from neighbor nodes can reconstruct the message of the original node. This consists of 3 steps.

- Computation of neighbor's messages
- First, we need to compute the message between our original node and neighbor nodes.
- For a node $u$ and one of its neighbor nodes $v$, we denote current representations with $i$th update of $u$ and $v$ as $h^i_u, h^i_v$, and the current relationship between these two representations as $r^i_{h_u, h_v}$.
- For the initial step ( $i = 0)$, we set $h^0_u = u, h^0_v = v$ and $r^0_{h_u, h_v} =$ edge $e_{u,v}$.
- Then, by pre-defined message computation function $\text{MSG}$, we compute a message between $h^i_u$ and $h^i_v$ as $m_{h^i_u, h^i_v} = \text{MSG}(h^i_u, h^i_v, r^i_{h_u,h_v})$.

- Aggregation of messages
- Now, we aggregate all messages of neight nodes computed in 1.
- With a pre-defined aggregation function $\text{AGGR}$, we aggregate messages of $\mathcal{N}(u)$, a set of neighbor nodes of $u$.
- With a slight abuse of notation, an aggregated message for $h^i_u$ would be
- This aggregated message can be interpreted as information of the node $h^i_u$ reconstructed by $\mathcal{N}(u)$.

- Updating the representation with the aggregated messages
- Lastly, we update a representation $h^{i+1}_{u}$ using pre-defined update function
- This means that we update representation based on current representation and neighborhood messages obtained in current representation.

The underlying idea of message-passing is intuitive but powerful. It gives us a theoretical justification for capturing structural information of the graph via iterative methods.

### Graph Convolutional Network (GCN)

GCN is a most widely known way to implement a CNN in graph domain. It has its theoretical background in graph spectral theory. We omit the theoretical backgrounds of GCN and refer this well-written blog post (Korean) for detailed theoretical explanations. Instead, we propose general steps of GCN, and explain key ideas.

While the original GCN paper did not denote their procedures in this message passing viewpoint, it is easy to check if we set the $\text{MSG}$ as a linear function using weights and degree normalization, $\text{AGGR}$ as a sum, and $\text{UPDATE}$ as an activation function, then procedure 1~3 becomes a GCN layer. We describe the procedure in details for $i$th iteration.

- Computation of the messages
- We set a linear $\text{MSG}$ function as follows: $m_{h^i_u, h^i_v}=\frac{1}{\sqrt{deg(u)deg(v)}}W^i \cdot h^i_{v}$, where $W^i$ is a linear matrix. In this case, we consider the relationship $r^i_{h^i_u, h^i_v}=w_{u,v} \text{ for } \forall {i}$. The $W$is a parameter of Neural Network and will be trained.

- Aggregation of the messages
- By defining $\text{AGGR}$ as a sum of all neighbors with adding the own node to its neighbor, we can calculate the aggregation $M_{h^i_u}=\sum_{v \in \mathcal{N}(u) \cup {u}} m_{h^i_u, h^i_v}$ .

- Updating the representation with the aggregated messages
- We define the $\text{UPDATE}$ function as the activation function $\sigma (\cdot)$. Then, the final architecture would be

We can easily check that this message-passing scheme corresponds to expressing the original formulation of GCN as follows,

$$H^{i+1}= \sigma(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^iW^i),$$

where $\hat{A} = A + I$ where $A$ is an adjacency matrix, and $\hat{D}$ is a diagonal node degree of $\hat{A}$. This enables us to interpret the GCN network as the following message-passing scheme: each GCN layer aggregates 1st hops of current nodes, therefore $i$th GCN layer is an aggregation of message till $i$th-hop information. This interpretation works well for experimental results: unlike general NNs that usually perform better with deeper layers (if we ignore overfitting), GCN tends to perform worse when we append more layers. Appending too many layers means aggregating too many hops, and it can lead to diluting the information from nearby hops.

### Graph Attention Network (GAT)

GAT is an architecture that is inspired from both GCN and Attention network, which is a widely used architecture in NLP. GAT also follows the message-passing scheme. Meanwhile, GAT tries to be more flexible than GCN by making some part of its architecture more implicit. Like we did in GCN, we will review GAT from the viewpoint of the message-passing scheme.

- Computation of the messages
- GAT also takes the same formula with GCN, by setting $\text{MSG}$ as a linear function. However, while GCN has explicit normalizing constants defined by degrees, GAT makes this coefficient implicit, so that we can have more flexibility.
- First, we define attention function $a:\mathbb{R}^{F'} \times \mathbb{R}^{F'} \rightarrow \mathbb{R}$ by NN, where $F'$ is a hyperparameter (pre-defined intermediate layer's dimension). The original paper proposes the attention function as a concatenation of two nodes and followed by a single-layer feedforward NN, written by $a(x, y) = \sigma(\textbf{a}^T[x;y])$. $[;]$ denotes the concatenation operation.
- Then, we compute $e_{uv}= a(Wh^i_u, Wh^i_v)$, and , where $\alpha_{uv}$ will act as a normalizing coefficient.
- Then, the message is computed by $m_{h^i_u, h^i_v} = \alpha_{uv}W^i \cdot h^i_v$
- The fundamental difference between GCN and GAT lies in the coefficient. GCN has fixed coefficient structures for all nodes, whereas GAT can have different coefficient structures node by node by defining $\alpha_{uv}$ implicitly. It means that we can consider different weights, or attentions, of messages, and therefore able to handle more flexible structures.

- Aggregation of the messages
- The aggregation part of GAT is same with GCN, using a sum of the messages. It can be formulated as $M_{h^i_u}=\sum_{v \in \mathcal{N}(u)}\alpha_{uv}W^ih^i_v$.

- Updating the representation with the aggregated messages
- The update part is also same, if we use a single-head attention. It can be written as
- To stabilize the learning process, GAT can use a multi-head attention, i.e. multiple attention networks. To be specific, we construct attentions,and accordingly Then, we make an update by concatenating messages from multiple attentions.

In this case, until final layers, and at the last layer

As mentioned above, GAT can contain more information about the structures of the graph implicitly. Moreover, GAT has an advantage that calculating $\alpha$'s does not require an entire graph structure. Note that GCN's coefficients are constructed by degree information. It implies that we need to have degree information of all nodes "before" we actually conduct the algorithm. On the other hand, GAT's $\alpha$'s are derived from the network, so that the algorithm does not require us to have the information of the entire graph at the time of the computation.

This means GAT can learn the representation of the graph that is "inductive". In other words, GAT can learn the representation that can handle even "unobserved" nodes. If there is an unobserved node, GCN may not work as it intended as its degree information is likely to be incorrect. Meanwhile, GAT works without the degree information. and we expect attention networks to implicitly capture the information of unobserved nodes even if we are not able to observe such nodes. Simply put, even when we cannot observe a full graph we can use GAT to construct an ML model for the graph.

### GraphSage

As mentioned in the GAT section, GCN is not capable of inductive learning, as it requires all degree information. GraphSAGE did not use this normalizing coefficient and modifies the aggregation function from "Aggregation of neighbors" to "Aggregation of samples of neighbors" (step 2 in message-passing). The sampling is to reduce the time complexity and the memory complexity, but it can also serve as a new viewpoint that unobserved nodes are just nodes that are not sampled.

## π Libraries

Many GNNs use matrix operations for computations, so any libraries that support efficient matrix computations and gradient descent algorithms can conduct GNNs. Meanwhile, two well-known Python-based libraries exist, which are Pytorch-Geometric (PyG) and Deep-Graph-Library (DGL).

While both libraries provide similar features, a major difference between two is in their data representations. PyG represents data in Pytorch tensors. Therefore, it is more compatible with Pytorch than DGL. Meanwhile, DGL's data interface follows NetworkX, which is a widely used Python graph library. Therefore, DGL has an advantage in processing raw graph data using NetworkX. The choice of the library depends on whether you want a compatibility for raw data or the training process.

# 4. GraphDB tasks, classical algorithms, and deep model based approaches

All databases try to reduce time and memory complexity and at the same time perform various querying algorithms accurately, and GraphDB is not an exception. Time-complexity for GraphDB is essential when algorithms require more than $O(E)$. The number of edges in a graph can be at maximum $V^2$, and therefore many graph algorithms suffer from time complexity when $O(E)$ or more. Also, many classical algorithms are based on linear models, which prohibits algorithms to encode rich data-driven information. One way to overcome these obstacles is using deep ML models for the graph domain. This may increase performances, enable algorithms to encode data-driven information, or both.

Among deep models on a graph, GNNs are especially powerful when detecting the overall structures of the graph. Message passing enables GNNs to aggregate neighborhood information into the node without fully exploring exponential search space. (need reference) Therefore, GNNs may shed light on graph algorithms that are related to the global structures of the graph. There are also other deep models that do not use GNNs and instead use their own approaches to efficiently encode the information of the graph. We observe deep models based on both GNNs and not comprehensively.

GraphDBs implement many graph algorithms. The list of algorithms needed for GraphDB is mentioned in https://neo4j.com/developer/graph-data-science/graph-algorithms/. In the listed algorithms, this article focuses on the tasks of node similarity, link prediction, centrality, and community detection. While there exist many classical algorithms that work well for the tasks, they are based on heuristics. Replacing those algorithms with data-driven ones is an active field of research, and using a deep model is one possible approach. In this section, we introduce baseline classical algorithms and deep model based approaches, including approaches using GNN, for each algorithm.

## π Node similarity

Node similarity algorithms are one of the most used algorithms in GraphDBs. For example, when using a node similarity algorithm for a recommendation system, we first set a threshold $t$. Then, for a given node $n$, recommended nodes are defined as follow: $\text{Recommending Nodes for } n= \cup_{\text{Sim}(n,m)>t}{m}$. In this case, the similarity metric between two nodes determines the quality and the performance of the recommendation. Therefore, it is important to define a "good" similarity metric.

### Classical algorithms

Here we review two criteria to define the node similarity. The first is the proximity between the nodes. This is more intuitive, based on the idea that similar nodes tend to gather as a community. This concept is also noted as network homophily. The proximity is usually measured by the number of hops, instead of the distance itself. The reason hops are more widely used than the distance is that hops determine the similarity of the node information. When we think of a message passed by the node, the more hops the message percolate, the more diluted the message. Therefore, nodes with fewer hops, not the closer distances, share similar messages, or information.

The second is to define a similarity based on node roles. For this criterion, we need to define when two nodes have same role, or in other words, when two nodes are equivalent. Widely known equivalence relations between two nodes are structural equivalence, automorphic equivalence, and regular equivalence.

**Structural equivalence**regards two nodes as equivalent if they share same neighborhood.**Automorphic equivalence**considers two nodes the same when interchanging two nodes does not change the graph structure.- Lastly, when two nodes have the same neighborhood structure, while not necessarily share the same neighbor, they are called
**regular equivalent**.

The relationship between these equivalence can be described as $\text{Structural Equivalence} \subset \text{Automorphism Equivalence} \subset \text{Regular Equivalence}$.

Once we defined the equivalence relation, we can measure "similarity" based on how much they are equivalent. Among those equivalences, structural equivalence is most widely used. In this article, we review one that is the most widely used similarity metric that is based on structural equivalence, the Jaccard similarity, which is based on the Jaccard distance.

Jaccard distance is a distance between two sets. For the two sets $A$ and $B$, Jaccard Distance between $A$ and $B$ is defined as $d_\text{Jaccard}(A,B)= 1-\frac{|A \cap B|}{|A \cup B|}$. We can inversely define Jaccard Similarity, by Intuitively, it compares how many set elements coincide.

Many GraphDBs define similarity of two nodes $v$ and $w$ using Jaccard Distance between neighborhoods of those nodes, $\mathcal{N}(v)$ and $\mathcal{N}(w)$. In formula, $\text{Sim}(v, w)= \text{Sim}_{Jaccard}(\mathcal{N}(v),\mathcal{N}(w))$. It means that similarity of two nodes are based on how similar their neighborhoods are.

Theoretically, Jaccard similarity's time complexity for n sets is quadratic complexity, $O(n^2)$. This complexity can be reduced using heuristics. For example, Neo4j reduces the complexity by ignoring disconnected nodes.

Beside Jaccard similarity, there are other similarity metrics to use.

- Algorithms that are based on structural equivalence are cosine similarity, euclidean similarity, and Degree Pearson Correlation Coefficient.
- RoleSim is an algorithm that tries to capture the similarity based on automorphism equivalence.
- For regular equivalence, there is an algorithm that resembles Katz centrality algorithm. We refer to this presentation for various similarity measures for each equivalence relationship.

While these classical methods are intuitive, they can only measure the similarity based on pre-defined equivalence relations. For example, if you choose to use the Jaccard Similarity, it means you ignore other types of equivalences. Moreover, these algorithms may ignore hidden edges that are not observed yet, by using deterministic algorithms.

For a detailed discussion of defining node similarity, we refer https://web.eecs.umich.edu/~dkoutra/tut/sdm14.html and https://en.wikipedia.org/wiki/Similarity_(network_science)

### Graph Machine Learning based: similarity in embedding space using Node2Vec - a graph embedding algorithms

Creating an embedding is one of the ways to measure the similarity metric. Once the embedding of the graph is trained, we can use the metric of the embedding space as a similarity measure. In this case, the quality of the similarity metric would mainly depend on how the embedding space is trained.

If we have a specific task for calculating node similarity, for example, recommendations, then we can train the embedding that maximizes our task performance. Then, metric in the embedding space naturally become node similarity, that best fits with the task.

Otherwise, if we do not have a specific task but just want to embed the space and calculate the similarity, we build the embedding space that preserve the features of the graph. There are many ways to retrieve features of the graph. Here we review Node2Vec, the random walk based algorithms. For other algorithms, refer http://snap.stanford.edu/proj/embeddings-www/files/nrltutorial-part1-embeddings.pdf.

Node2Vec is an embedding algorithm that is inspired by the skip-gram architecture in Natural Language Processing (NLP). Node2Vec learns features of the graph in two steps. The first step is to convert graph data into "sentence-like" data by sampling strategy, and the second step is to conduct skip-gram architecture. For the details of the Node2Vec algorithm, we refer to the original paper and this blog post.

The first step is to convert graph data into "sentence-like" data, or in a language of graph theory a directed subgraph, by choosing a sampling algorithm that is based on a random walk, which provides a probabilistic way to interpolate Breath-First-Search (BFS) and Depth-First-Search (DFS). BFS focuses more on node-proximity-based similarity, while DFS focuses on node-role similarity, or more precisely, structural equivalence.

A random walk visits the neighborhood of the node based on each edge's probability. Then, the way to determine the probability of selecting each edge becomes the crucial part. Node2Vec proposes to use 2 hyperparameters to determine the probability of the edge.

- $p$, the return parameter, controls revisit probability, and therefore can control how much we will focus on local information.
- On the other hand, $q$, the In-Out parameter, controls the probability of exploration to new nodes. By $q$, we can determine how much we will focus on global information.

In this way, Node2Vec accommodates both structural equivalence similarity metric and proximity-based similarity metric.

After the sampling algorithm is fixed, then for the $\text{node }u$ and $\text{sampling strategy } S,$ we define $\mathcal{N}_{S}(u)$ the neighborhood of $u$ generated by $S$. Once the neighborhood is defined, we can extend the architecture to a skip-gram architecture that is used in Natural Language Processing (NLP).

Let our embedding function be $f: V \rightarrow \mathbb{R}^d$. The objective function of Node2Vec is defined as a log-probability of reconstruction of neighborhood structure given feature embedding. In a mathematical form, it is written as the follow: $\max_{f} \sum_{u \in V}\log \mathbb{P}(\mathcal{N}_{S}(u)|f(u))$.

Under the assumption of

- conditional independence between neighborhood nodes, and
- symmetry in feature space between a node and its neighborhood, the objective function becomes tractable, written by $\max_{f} [-\log(\sum_{v \in V} exp(f(u) \cdot f(v))) + \sum_{n_i \in \mathcal{N}_{S}}f(n_i) \cdot f(u)]$. The former part is hard to compute, so they are computed by negative sampling. We omit the detail and refer for Node2Vec papers for this part.

Therefore, overall procedures can be summarized as follow: first, you perform the random walk. Then, based on a random walk, you compute the log-likelihood based on skip-gram architecture. The second part is usually conducted by Word2Vec algorithm. Then, you use Stochasrtic Gradient Descent (SGD) to train the parameters. Below is the algorithm denoted in the Node2Vec paper, "node2vec: Scalable Feature Learning for Networks".

*From node2vec: Scalable Feature Learning for Networks*

## π Link prediction

Link prediction is also one of the key tasks in recommendation and graph completion. The task is to determine the closeness of a pair of nodes. This task is closely linked to the node similarity task.

### Classical algorithm

Link prediction tasks using classical algorithms are basically the same as defining which similarity metric to use. Most link prediction methods are measuring similarity based on Structural Equivalence. Theoretically, link prediction scores are the same with similarity scores, and we can use any similarity metrics, such as Jaccard Similarity, for predictions. However in practice, link prediction tasks use less rigorous but more simple scores, since we do not need the accurate similarity scores.

Therefore, Link prediction algorithms use heuristics to calculate the score that is similar to similarity metrics, with efficient computations. Existing link prediction algorithms can be categorized based on the maximum hop of neighbors needed to calculate the score. Common Neighborhood (CN) and Preferential Attachment (PA) are the first-order heuristics. CA score is defined as $\text{CA}(u, v) = |\mathcal{N}(u) \cap \mathcal{N}(v)|$, and PA score is defined as $\text{PA}(u,v)=|\mathcal{N}(u)| \cdot |\mathcal{N}(v)|$. While they are the simplest way, they omit the normalization by the size or the type of neighbors as Jaccard did. To efficiently normalize without large computations, there are Resource Allocation (RA) and Adamic-Adar (AA), which is second-order heuristics. RA is calculated by $\text{RA}(u,v)=\sum_{w \in \mathcal{N}(u) \cap \mathcal{N}(v)} \frac{1}{|\mathcal{N}(w)|}$, and AA is calculated by $\text{AA}(u,v)=\sum_{w \in \mathcal{N}(u) \cap \mathcal{N}(v)} \frac{1}{\log |\mathcal{N}(w)|}$. By summing $\frac{1}{|\mathcal{N}(w)|}$ or $\frac{1}{\log |\mathcal{N}(w)|}$, if sharing neighbor is universally connected, then $|\mathcal{N}(w)|$ be higher, and the score would be lower. Therefore, by this we can normalize the score by valuing the common neighbors that are uniquely connected each other more than that are universally connected with another nodes.

There are also more algorithms for link prediction tasks, including higher-order heuristics. We refer to Appendix A in SEAL paper (SEAL algorithm will be discussed in the below) for other algorithms.

### Deep model & GNN based algorithms: WLNM and SEAL

Instead of defining the similarity metric and directly apply it for link prediction tasks, Deep model based algorithms try to capture structural and inherent information of the graph. We introduce two well-known algorithms in this section, WLNM (Weisfeiler-Lehman Neural Machine) and SEAL (Subgraphs, Embeddings and Attributes). WL is deep model that uses MLP, while SEAL is GNN-based model, as it uses GCN.

- WLNM

WLNM is inspired by the Weisfeiler-Lehman (WL) graph isomorphism test algorithm. WL tries to label nodes and figure out whether two graphs are labeled the same. To this end, WL test proceeds like a message passing scheme: 1) we initialize nodes with the same label, then 2) we aggregate neighbor nodes to update the current label. 3) This continues until node labels converge. The difference is in the initialization, the terminal conditions, the aggregation function and the update function. Naturally, WL algorithms consider structural features of the graph as the message passing scheme does.

Likewise, the WLNM algorithm captures structural information of the graph and uses this information to predict the link. WLNM consists of three steps. We illustrate these steps in the case of predicting a link between two nodes $u$ and $v$.

- Extracting sub-graphs
- For a pre-defined parameter $k$, we define $k$-enclosing subgraph: Starting from $V_k={u,v}$, append neighborhood nodes, $\mathcal{N}(u),\mathcal{N}(v), \mathcal{N}^2(u), \mathcal{N}^2(v),...$ , of the set until $|V_k|\geq k$.

- Subgraph pattern encoding (Or, Node labeling and encoding)
- Now, for the obtained subgraph in 1, we perform the modified WL algorithm, called a Palette-WL algorithm, to obtain node labeling (or coloring in a notation of the original paper). If there is a tie, the algorithm uses a Nauty algorithm, a graph canonization tool, to break the ties. We omit the detail of Nauty algorithm.
- While classical WL algorithm can also perform the same task, Palette-WL modifies an aggregation function using hashing to label nodes fastly. We omit this part since it is not our interest.
- After obtaining the sorted labels, or colors, the algorithm builds the adjacency matrix for the subgraph. Then, we set each column of the adjacency matrix as input vectors.

- Neural Network Training
- Since input (column of the enclosing subgraph's adjacency matrix) can be regarded as labeled data (1 if two nodes have connection, 0 otherwise), we can now view a problem as a classification task. Therefore, we train the Neural Network using the labeled data with Fully connected Multi-Layer Perception (MLP), classify the node, and conclude that output probability is the probability of link between these two nodes.
- Note that WLNM uses fully connected MLP. While WLNM used a NN to predict the link, it is not, strictly speaking, using a GNN. WLNM just uses vectors, the column of adjacency matrix, as inputs, instead of using these "graphs" as NN inputs. It discards the information the graph contains, especially node features.

While the WLNM is more capable of incorporating various graph properties than similarity-based heuristics when predicting the link, the role of hyperparameters, especially the number of neighbors of the enclosing subgraph $k$, may be crucial in the algorithm. Therefore, it depends on pre-defined variables as heuristics do. Moreover, the NN architecture used in WLNM discards information of the graph using MLP, as mentioned above.

- SEAL

SEAL is a variant of WLNM, in the sense that it also consists of the above 3 steps. However, SEAL modifies some details in those steps to better capture the node features and latent information. The major difference between SEAL and WLNM is that SEAL uses GCN instead of MLP, which enables it to preserve the information required for link prediction.

- Extracting sub-graphs
- Like WLNM does, SEAL also tries to capture the enclosing subgraph of the $u$ and $v$. However, instead of using $k$-enclosing subgraph proposed in WLNM, SEAL proposes to use $k$-hops enclosing subgraph, which is the union of 1 ~ $k$-hop neighbors of $u$ and $v$.

- Subgraph pattern encoding (Or, Node labeling and encoding)
- Instead of using the WL labeling algorithm, SEAL proposes Double-Radius Node Labeling (DRNL). It is a label based on how far each node is from our target nodes $u$ and $v$. For node $n \in \text{Enclosing Subgraph}$, $n$ is labeled with higher number as the sum of distances from $u$ and $v$, $d(n, x) + d(n, y)$, goes larger.
- Unlike WLNM which only considers the enclosing subgraph, SEAL tries to incorporate node features and latent information as well. Therefore, SEAL constructs enclosing subgraph's node feature matrix $X$ by the following procedures.
- First, we make one-hot encoding of the labels obtained above.
- Then, SEAL embeds the original graph (not the enclosing subgraph) and uses this embedding to consider latent information. When generating the embedding, the algorithm uses the trick called "negative injection", which is to add false negative edges. By this we can avoid GNN from overlooking non-existing edges. The embedding algorithm can be chosen from existing methods, such as Node2Vec or Graph Autoencoder.
- After obtaining the one-hot encoding of the labels and the graph embedding, we concatenate each node's one-hot encoding and embedding, and regard it as a "node feature vectors". We construct an enclosing subgraph's node feature matrix $X$ based on this node feature vectors.

- Neural Network Training
- After we obtained $X$, the node feature matrix of the enclosing subgraph, and $A$, the adjacency matrix of the enclosing subgraph, we can compute GNN using GCN network, which can both consider node features and the adjacency matrix. By this, we can have more richer information besides WL labels, such as latent information and adjacency matrix.

## π Centrality

Centrality algorithms measure the importance of distinct nodes in the graph. Centrality algorithms are crucial in many graph tasks, such as search engines and recommendation systems.

### Classical algorithms: PageRank and Betweenenss Centrality

Like node similarity tasks and link prediction tasks, classical centrality algorithms first need to define what centrality is. Among various definitions for centrality, two concepts are the most widely used: **walk-structure-based centrality**, which considers the neighbors of the given nodes, and **flow-based centrality**, which considers the alignment of the given node in the network flows.

We review two widely used classical centrality algorithms, PageRank, which is based on walk-structure-based centrality, and Betweenness Centrality, which is based on flow-based centrality. There are more definitions for the centrality and derived algorithms. We refer to https://en.wikipedia.org/wiki/Centrality for the details.

PageRank algorithm is probably the most widely used centrality algorithm. Its idea is that "a page is only as important as the pages that link to it". In this sense, a centrality of one single node is defined as an aggregation of neighbor's centralities, like a message-passing scheme.

To be specific, PageRank calculates the node centrality score using a recursive equation. The Page rank score of node $u$, $\text{PR}(u)$, is defined by the following recursive system equation.

$$\text{PR}(u) = (1-d) + d\sum_{v \in \mathcal{N}(u)}\frac{\text{PR}(v)}{C(v)}$$

- $d$ is called a "damping factor" and is a hyperparameter that determines the weight of the neighborhood information. Or, it can be regarded as the weight of an arbitrary random node in the graph. It is because if $d= 0$, then all $\text{PR}(\cdot)$ would have the same value, meaning centrality scores are same, and no nodes are overwhelming. This means we are dealing with a fully connected graph, where all nodes have equal contributions. In this sense, we can regard $(1-d)$ term as the probability that all nodes have equal contribution to the $u$ so that we just randomly selected one. Heuristically a damping factor is set to 0.85. The reason why we consider those damping factor will be discussed below.
- In the above equation, the aggregation function is defined as a sum. $C(\cdot)$ is an out-degree of the node.
- The recursive equation is a linear system. However, it is hard to directly solve the linear system if the graph becomes larger. Therefore, the algorithm uses a stochastic method by starting from a randomly initialized distribution of $\text{PR}(nodes)$ and view the coefficient matrix as a Markov transition matrix. Then the problem becomes finding a stationary distribution of $\text{PR}(nodes)$. We can write it as $\pi = P\pi$, where $\pi$ is a targeting stationary distribution and $P$ is a Markov transition matrix. Then, this problem boils down to finding the eigenvector of $P$ with corresponding eigenvalue 1. This approach enables us to use methods such as a power method.

Sometimes, there is a node, or a loop of nodes, that has no exit. This is why we consider the damping factor. Without the damping factor, the stationary distribution may result in concentration to nodes in the loop. However, as discussed above, damping factor enables us to randomly select arbitrary nodes, possibly outside the loop.

PageRank usually displays good performances. However, sometimes it fails to capture the complex information of the graph, since its aggregation function is linear.

Betweenness centrality algorithm is based on the idea that central nodes work as bridges of messages between nodes, which is the philosophy of the flow-based centrality. Betweenness centrality algorithm defines centrality by considering how much a node influences the overall flows. To measure the influence, we note that messages flow via the shortest paths between nodes. Therefore, nodes that more frequently end up being on the shortest paths between other nodes will have higher centrality scores.

The original betweenness algorithm computes the centrality score of the node $u$ using the following equation,

$$\text{BC}(u)=\sum_{v,w \neq u}\frac{\sigma_{v,w}(u)}{\sigma_{v,w}},\\text{Score}(u)=\frac{\text{BC}(u)-\text{min}(\text{BC})}{max(\text{BC})-min(\text{BC})},$$

where $\sigma_{v,w}$ denotes the number of the shortest paths from $v$ to $w$, and $\sigma_{v,w}(u)$ denotes the number of the shortest paths from $v$ to $w$ which pass $u$.

The above equation can be interpreted as the following. $\text{BC}$ value is just the proportion of paths that passes $u$. The score is just a normalization of the $\text{BC}$ value by 0~1. Despite its intuitive meaning, it requires intensive computations. To compute the value, we need to calculate the shortest paths for every node. This results in $O(|V|^3)$ time-complexity. For unweighted graphs, Brandes' approximate algorithm reduces time-complexity to $O(|V||E|)$. The point is that we reformulate the above formula by recursive structures. For the detail, we refer https://www.cl.cam.ac.uk/teaching/1617/MLRD/handbook/brandes.pdf.

### Deep model & GNN based algorithms: Improving time-performances of centrality algorithms

Some of the classical centrality algorithms like Betweenness centrality algorithm require heavy computations to exactly compute the score. However, in many centrality tasks, we do not need exact scores as measuring centrality is usually not the main goal but we use those centrality measures to solve other tasks. Therefore, it is often efficient to sacrifice some of the accuracies and approximate the score. Deep model and GNNs can be one of the tools to approximate the value with efficient computations. In this section, we review Network Centrality Approximation using Graph Embedding (NCA-GE) model, the algorithm that aims to measure the centrality when we have some nodes' centrality scores and need to calculate others' scores.

NCA-GE model is just a simple GNN based approximation. It uses the out-degree matrix as a node feature matrix, and implement a GNN to create the embedding of the graph. Then, we just perform the training with the embedding vectors as input and centrality scores as output. The original papers propose GCN or Structure2Vec as embedding algorithms and $\mathcal{L}^2-$regularized $\text{MSE}$ as a loss function, but there are freedoms in choosing embedding algorithms and loss functions.

While not dramatic, mean execution times of NCA-GE Model did show time improvement compared with Brandes' Betweenness Centrality algorithms. Also the paper verifies that NCA-GE model does not lose much accuracy, either. This simple model shows the possibility that GNNs can provide a good approximation for the graph algorithms that require heavy computation.

## π Community detection / clustering

Community Detection is a task that tries to find clustered structures of the graph. It differs from the node classification in the sense that the community detection not only classifies nodes but also considers the relationship between communities.

### Classical Algorithm: Modularity and Louvain Community Detection

To conduct a community detection, we first need to answer "what is a good community clustering?". There are many criteria, but "modularity" is the most fundamental one.

- Modularity

Modularity is a measure of "how well modulized the graph is". Suppose we ideally clustered nodes properly into some distinct communities and constructed a subgraph of communities. Then, links inside each community subgraph, or self-loop, would be dense, and links between distinct communities would be sparse. In other words, the distribution of edges between communities is not uniform and skewed to self-loops. Therefore we can obtain how modulized the graph is by comparing the current distribution of edges and the random distribution of edges between communities. The modularity $Q$ is actually computed using the below equation.

*Reference: https://en.wikipedia.org/wiki/Louvain_method*

The above equation can be interpreted as the gap between actual edges between clusters and expected edges between clusters for a random graph with the same number of links.

https://en.wikipedia.org/wiki/Louvain_method

- Louvain Community Detection

https://mons1220.tistory.com/129

Since we now have a criterion for a community detection task, we can a conduct community detection task by optimizing this criterion. However, optimizing modularity is an NP-Hard task. Louvain Community Detection is a heuristic that aims to optimize modularity.

Louvain algorithm is a bottom-up greedy algorithm. It means that it starts from one node, and enlarges the cluster by adding a relevant node. Louvain algorithm first assigns each node a different cluster, and then uses an iterative greedy method, repeating 2 steps: 1. local moving of nodes, and 2. Aggregation of the network.

- Local moving of nodes
- For each node, we change its cluster into the nearby cluster. Then, if modularity increases, we accept this change. Otherwise, we cancel this change. We conduct this process for every node, until modularity reaches to local optima.

- Aggregation of the network
- For the resulting clusters in 1, we consider each cluster as a node, and edges inside the cluster as a self-loop. Then, we obtain a new network, each node being the clusters. Then, with this aggregated network, we move on to 1 again, and then 2, and on and on until there is no change in step 1.

Louvain algorithm is a powerful algorithm due to its speed and easy implementation. Meanwhile, it requires a lot of memory space. Also, while the order of nodes to perform step 1 does not change the final clusters, it does change the speed of the algorithm. For more details, we refer to the original paper.

For other criteria and other algorithms, we refer to https://towardsdatascience.com/community-detection-algorithms-9bd8951e7dae

### Deep model based Community Detection: ModNet

While Louvain algorithm is a powerful algorithm, there are attempts to solve modularity optimization problem using deep models. Recently, ModNet showed strong performance. We briefly introduce this method.

The problem with directly applying modularity as a loss function for ML is that this loss is not differentiable. Therefore, the paper uses a soft modularity $M$, which is differentiable, as a loss function.

$$M=\frac{-tr(UBU^{T})}{\sum|A_{ij}|}$$

Here, $A$ is an adjacency matrix, $U \in \mathbb{R}^{N \times C}$ is a matrix of probabilities of nodes attribution to clusters ($N$ is a number of nodes and $C$ is a hyperparameter denoting the number of communities), and $B=A-\frac{dd^{T}}{\sum|A_{ij}|}$, where $d$ is a node degree vector.

The reason the above $M$ is "soft" modularity is that it is derived from the natural relaxation of the original modularity formula. We can modify the above modularity formula as follows:

$$Q=\frac{1}{2m}\sum_{ij}[A_{ij}-\frac{k_ik_j}{2m}]\delta(c_i,c_j) = \frac{1}{2m}\sum_{ij}\sum_{l}[A_{ij}-\frac{k_ik_j}{2m}]\delta(c_i,c_l)\delta(c_j,c_l) $$

This is just a decomposition of the Kronecker delta into possible cases, $l$. Then, we can generalize $\delta(c_i,c_l)$ and $\delta(c_j,c_l)$ as a probability of containing in certain communities, $p_{il}$ and $p_{jl}$. This relaxation results in the formula below, with $\textbf{p}_{i \cdot}$ denotes a probability vector.

It is known that the above two formula for soft modularity coincide. We omit this detail.

Now, we can use this differentiable modularity loss as a loss function for the unsupervised community detection task. However, this soft modularity might have multiple local optima. Therefore, the paper suggests to append the regularizer $R= \sum_{i}^{C}(\sum_{j}^{N}U_{ij}-\frac{1}{C})^2$. They claim this regularizer leads to the solution with similar community size from Symmetric Stochastic Block Model (SSBM), a graph generative model which tends to create a uniform distribution of communities.

Lastly, the model divides the cases into two cases, based on theories in the SSBM model. An associative case is when we can well-perform the community detection, where the number of communities is sufficiently large and each community is distinguishable. A disassociative case is an opposite. Whether each community is distinguishable is determined by the value of Signal-To-Noise Ratio (SNR). We omit the explanation regarding SNR, since it requires a background in SSBM. We can distinguish whether the case is associative or not before we conduct the model, since the number of communities is a hyperparameter and the parameters for SNR are obtained from the raw input graph.

First, The final loss function would be $\mathcal{L}_a = M + \lambda R$ if the case is associative, and $\mathcal{L}_b = -M + \lambda R$. It is because the disassociative case often struggles with negative modularity.

After determining the loss function, we can use deep models to optimize the loss function and solve the community detection task. The choice of network is widely open, including GNN architectures like GCN. However, the most well-performing algorithm was regarding graph data like a NLP data, and use embedding and attention network like NLP task. (Although they did not tried to directly use GAT architecture.) The resulting performance overwhelmed most of both classical and other deep model algorithms, including even supervised methods.

# 5. Remaining challenges in applying deep model based algorithms to GraphDB

As discussed above, deep models on a graph can handle graph data with efficient implementation and capture its information in a data-driven way instead of pre-determined rules. However, several obstacles still exist when using deep models in GraphDBs.

## π Memory efficiency

To use a deep models in a GraphDB, it should keep 1) a neural network for graph input or 2) an embedding for graph data. Either way, we need extra memory spaces to hold those data. 1) requires both memory complexity for NN and time-complexity to actually process the data through NN. 2) can save time cost for generating the embedding, but requires more memory space than 1).

## π Dynamic graph embedding

GraphDBs should be able to handle frequent insertions and deletions of nodes and edges. This implies that our embedding or NN needs to handle dynamic graphs, hopefully in real-time. Many current deep models on a graph mainly focus on static graphs instead of a dynamic graph since learning representations for a dynamic graph is not an easy task. Still, there are some attempts to implement deep models in dynamic graphs.

A dynamic graph can be represented as an ordered list or an asynchronous stream of timed events, such as additions or deletions of nodes and edges. Representing a dynamic graph for a GraphDB is even more complicated since, in addition to the previous streams of timed events, we have also real-time streams. Here we will consider the dynamic graphs with the stream of timed events. Handling real-time streams is an active topic of research.

Here we review DyRep, the representation learning algorithm for dynamic graphs. We review the brief key ideas of the papers and omit details.

For other algorithms, we refer http://web.cs.ucla.edu/~patricia.xiao/files/Paper_Reading_Group_20201006.pdf.

The complicated methods to embed dynamic information shows that there are still many challenges to implementing dynamic graph embedding, even if it is not real-time. We believe researches in this area would play a pivotal role when implementing deep models in a GraphDB.

### Dynamic Graph Embedding using a deep model: DyRep

DyRep proposes a method that can handle additions of nodes for an undirected unweighted graph. Note that it does not handle the deletion. DyRep is a complicated algorithm, which shows how hard it is to encode the information of a dynamic graph.

First, We define two types of "event" that change the graph:

- a communication that is temporal, and
- an association that is permanent.

Note that they are not exclusive. For example, when you think of a friendship graph, meeting a new person once is a communication. After the meeting, you might become a permanent friend with the person, which is an association. Or, you might decide not to be friend with the person, which does not lead to the association. An event is defined as a tuple $e = (u,v,t,k)$ where $u,v$ are nodes, $t$ is a time, and $k \in {0,1}$ is an event type indicator, where 0 stands for association and 1 stands for communication.

Instead of training node representation $z_v=f(v)$, the model focuses on learning the representation function of time $z_v(t)=f(v(t))$. When embedding a dynamic graph, the model needs to capture the effect of an event. The algorithm analyzes the effect of the event by decomposing to three messages: Self-propagation, Exogeneuous Drive, and Localized Embedding Propagation. We illustrate this part with the case when event $p = (u_p,v_p,t_p,k_p)$ occurs. We denote the $\bar{t}$ a time point right before the current time $t$, and $\bar{t^v_p}$ as a time point of the last event on $v$ before $t_p$.

- Self-Propagation is how the previous embedding of the given node evolves as the new event occurs. It is written by the linear transform of the $v$'s representation at the last time the node got a direct event. Note that it is not the last event of the "graph", but of the "given node". we can write the message from this as
- Exogenous Drive is how much exogenous effects occur during the exogenous time, the time from the last time $v$ directly effected to the current event time. Exogenous time would be and the message is measured by linear transform of exogenous time,
- Localized Embedding Propagation is the part we calculate the direct effect of the event, from $u$. First, we aggregate the neighborhood information of $u$ and propagate it to $v$. We write it as Then, we can compute the message by linear transform:
- How to aggregate the neighborhood information and construct $h$ is a key problem. The authors propose "Temporal Point Process based Attention Mechanism", a modified version of GAT, to do this. The idea is to use both an adjacency matrix, which shows associations, and a stochastic strength matrix, which shows temporal relations that can potentially influence associations. For the time $t$, they are denoted by $A(t)$ and $S(t)$.
- $A(t)$ and $S(t)$ also need to be evolved dynamically, as an event occurs. The update procedures for the event $p=(u,v,t,k)$ is described below:
- $A(t)$ is updated if and only if $k=0$ (association). We update $A(t)$ by $A_{uv}(t)=A_{vu}(t)=1$ in this case, and do not update otherwise.
- Updating $S(t)$ requires branches.
- If the event is a communication ($k=1$), and there was an association previously ($A_{uv}(\bar{t}) = 1$)
- We define $b_u$, the background attention of the graph for the given node $u$. It is a weight and initialized by the uniform distribution of $\mathcal{N}(u)$. If the event occurs, we update $b_u$ by $b_u + \lambda_{uv}(t)$, which is done by adding intensity function $\lambda_{uv}(t)$ that will be obtained below. We plug in new $b_u$into $S_{uv}(t)$, and do the same update for $v$.

- If the event is association ($k=0$)
- We use same $b_u$ as above and follow the same updates. But in this case, there was a change in a neighborhood of $u$, so we need to adjust the background attention by $b_{n} = b_{n} - \bar{b_{n}}$, where $n \in \mathcal{N}(u)-{v}$ is a neighbors of $u$ besides $v$. Then, as we did in the above, we plug-in this value to $S(t)$.

- Otherwise
- In this case, we do not need to update $S(t)$.

- If the event is a communication ($k=1$), and there was an association previously ($A_{uv}(\bar{t}) = 1$)

- Then with $S(t)$, The model make the attention coefficient of $h$ as $q_{ui}(t) = \frac{exp(S_{ui}(\bar{t}))}{\sum_{j \in \mathcal{N}(u)}{exp(S_{uj}(\bar{t}))}}$. We then construct

- Now, we aggregate these three messages to calculate the total message of the given event. The paper uses "sum" as a $\text{AGGR}$ function. $M_u(t_p)=m^{rec}_v(t_p) + m^t_v(t_p) + m^{struct}_v(t_p)$
- Lastly, we update the representation using the $\text{UPDATE}$ function. The paper uses the activation function as a update function. $z_v(t_p) = \sigma(M_u(t_p))$.

After we obtained $z$ by above procedures, the model computes a conditional intensity function given the event $p=(u,v,t,k)$ by $f$ and $g$ is defined as $g_{k}^{u,v}(t) = w_{k} \cdot [z_{u}(\bar{t});z_{v}(\bar{t})]$ and $f_{k}(x)= \psi_{k} \log(1+exp(\frac{x}{\psi_{k}}))$. $w_k$ and $\psi_k$ are parameters to learn, and $[;]$ denotes the vector concatenation operation.

Lastly, after we obtained the intensity function, we minimize the negative log-likelihood in the below. $\Lambda(\tau) = \sum_{u=1}^{n} \sum_{v=1}^{n} \sum_{k \in {0,1}} \lambda^{u,v}_k(\tau)$ denotes the total survival probability for events that do not happen.

$$\mathcal{L}=-\sum_{p=1}^{P}\log \lambda_p(t) + \int_{0}^{T}\Lambda(\tau)d\tau$$

Instead of computing the integral, we can use a minibatch for the computation.

## π Multi-task Learning

Multi-task learning is also one of the topics that need to be considered when implementing deep models in GraphDB.

It is common for GraphDBs to take queries that perform multiple tasks sequentially. For example, a query "predict the link between A and the central node of the cluster that A belongs" requires performing community detection, centrality, and link prediction in a row. Therefore it is desirable for a representation of the graph in GraphDB to be able to perform multiple tasks.

Finding an embedding that can handle multiple tasks is not restricted to the graph domain, and was examined in various fields. The most common method is first implementing one single NN for some parts, so that we extract the common features, and then branching a different NN for the rest of the parts per task. This is the easiest implementation but is also easy to sacrifice accuracy since the network is not parameterized to solve a specific task. This implementation is called "hard-parameter sharing". Or, we can implement different NNs for each task, but making them similar. This is "Soft-parameter sharing". This implementation has close links with transfer learning.

We focus on hard-parameter sharing for this article. For hard-parameter sharing methods, a loss function that can integrate various tasks needs to be well-designed, which requires the relationship between tasks to be verified, and we need a guarantee that those tasks share information so that they are able to be handled in a single embedding.

The good news is that multitask learning for GraphDB is relatively easier than other multitask learnings, because of the following two facts.

- First, DBs have a fixed number of tasks, so that we do not have to find various relationships between the tasks.
- Second, recalling the various algorithms we reviewed above, we can observe that many algorithms have their theoretical backgrounds on the "message-passing" scheme. This implies that the tasks in GraphDB may share some properties in common.

Therefore, it is a favorable environment to conduct multitask learning. In fact, there is a work that empirically showed that multitask learning may not ruin the performance significantly. While there are not many works that delve into multi-task learning in the graph domain, we believe it is a field that is both required and promising to pursue.

# 6. Conclusion

So far, we observed Graph, GraphDB, GNN architectures, classical and deep model-based algorithms for graph problems, and remaining challenges. While we discussed various topics in this post, we believe there are still many problems that can be enhanced, are remained unsolved, and are not found yet in the domain of Graph, GraphDB, GNNs, and ML on a graph.

# References

Intro

GraphDB

- https://arxiv.org/pdf/1910.09017.pdf
- https://db-engines.com/en/ranking/graph+dbms
- Florian Holzschuherand RenΓ© Pein, Querying a graph database β language selection and performance considerations (2016), Journal of Computer and System Science.
- https://neo4j.com/graph-databases-book/
- https://www.quora.com/How-is-Neo4j-stored-on-a-disk-Is-it-some-kind-of-adjacent-matrix-stored-in-a-relational-database
- https://neo4j.com/
- https://docs.janusgraph.org/

GCN

GAT

- https://arxiv.org/pdf/1710.10903.pdf
- https://petar-v.com/GAT/
- https://harryjo97.github.io/paper review/Graph-Attention-Networks/

GraphSAGE

Libraries

Graph Algorithms

- https://drops.dagstuhl.de/opus/volltexte/2020/11927/pdf/LIPIcs-ICDT-2020-3.pd
- https://neo4j.com/developer/graph-data-science/graph-algorithms/

Node similarity

- https://neo4j.com/docs/graph-data-science/current/algorithms/node-similarity/
- https://www.kernix.com/article/an-efficient-recommender-system-based-on-graph-database/
- https://faculty.nps.edu/rgera/MA4404/Winter2018/15-nodeSimilarity.pdf
- Axiomatic Ranking of Network Role Similarity (RoleSim) https://arxiv.org/pdf/1102.3937.pdf
- https://en.wikipedia.org/wiki/Similarity_(network_science)

Node2Vec

- node2vec: Scalable Feature Learning for Networks
- https://snap-stanford.github.io/cs224w-notes/machine-learning-with-networks/node-representation-learning
- https://snap.stanford.edu/node2vec/
- http://snap.stanford.edu/proj/embeddings-www/files/nrltutorial-part1-embeddings.pdf
- https://towardsdatascience.com/node2vec-embeddings-for-graph-data-32a866340fef

Link Prediction

- WLNM https://muhanzhang.github.io/papers/KDD_2017.pdf
- Classical algorithms, SEAL https://arxiv.org/pdf/1802.09691.pdf
- Overview https://arxiv.org/pdf/2008.08879.pdf

Centrality

- https://en.wikipedia.org/wiki/Centrality#PageRank_centrality
- https://towardsdatascience.com/pagerank-algorithm-fully-explained-dc794184b4af
- https://www.cl.cam.ac.uk/teaching/1617/MLRD/handbook/brandes.pdf
- https://en.wikipedia.org/wiki/Betweenness_centrality
- https://arxiv.org/pdf/2006.16392.pdf

Community Detection

- Modularity and Louvain algorithm
- Fast unfolding of communities in large networks, https://arxiv.org/abs/0803.0476
- lists of algorithms https://towardsdatascience.com/community-detection-algorithms-9bd8951e7dae
- louvain algorithm https://mons1220.tistory.com/129
- modularity https://mons1220.tistory.com/93
- https://en.wikipedia.org/wiki/Louvain_method

- ModNet
- original paper https://rlgm.github.io/papers/37.pdf
- Soft modularity https://hal.archives-ouvertes.fr/hal-02005331/document
- SSBM https://jmlr.csail.mit.edu/papers/volume18/16-480/16-480.pdf
- Implementation https://github.com/Ivanopolo/modnet/

Remaining challenges

- Dynamic Graph Embedding
- Multi-Task Learning