Deal: Distributed End-to-End GNN Inference for All Nodes

Shiyang Chen¹¹1This work is done during his internship at Amazon Rutgers University shiyang.chen@rutgers.edu , Xiang Song Amazon xiangsx@amazon.com , Vasiloudis Theodore Amazon thvasilo@amazon.com and Hang Liu Rutgers University hl1097@scarletmail.rutgers.edu

Abstract.

Graph Neural Networks (GNNs) are a new research frontier with various applications and successes. The end-to-end inference for all nodes, is common for GNN embedding models, which are widely adopted in applications like recommendation and advertising. While sharing opportunities arise in GNN tasks (i.e., inference for a few nodes and training), the potential for sharing in full graph end-to-end inference is largely underutilized because traditional efforts fail to fully extract sharing benefits due to overwhelming overheads or excessive memory usage.

This paper introduces Deal, a distributed GNN inference system that is dedicated to end-to-end inference for all nodes for graphs with multi-billion edges. First, we unveil and exploit an untapped sharing opportunity during sampling, and maximize the benefits from sharing during subsequent GNN computation. Second, we introduce memory-saving and communication-efficient distributed primitives for lightweight 1-D graph and feature tensor collaborative partitioning-based distributed inference. Third, we introduce partitioned, pipelined communication and fusing feature preparation with the first GNN primitive for end-to-end inference. With Deal, the end-to-end inference time on real-world benchmark datasets is reduced up to $7.70\times$ and the graph construction time is reduced up to $21.05\times$ , compared to the state-of-the-art.

^†^†copyright: none

1. Introduction

Graph learning is emerging as a cutting-edge area in machine learning research (zheng2022distributed, ; wang2023efficient, ; wu2022graph, ; wu2023certified, ; chen2023bitgnn, ; zhu2021graph, ; jiang2020graph, ; polisetty2023gsplit, ; min2022graph, ; adiletta2023characterizing, ; yao2024fedgcn, ; besta2023high, ; chen2023tango, ). Researchers from a wide range of areas, i.e., computer vision (han2022vision, ), natural language processing (lin2021bertgcn, ; orogat2023maestro, ; colas2022gap, ), cybersecurity (he2022illuminati, ; shaham2020enhancing, ), chemistry (wang2023motif, ), material science (banerjee20013, ), bioinformatics (zhang2021graph, ) are exploring the possibility of leveraging GNNs to tackle their problems (hoang2023protecting, ; cuza2022spatio, ; peng2021graph, ; liu2022aspect, ; hope2021gddr, ; choma2018graph, ; dutta2023power, ; besta2022motif, ). The recent successes in such tasks (e.g., chip design (zhang2019circuit, ), computational fluid dynamics (lino2022multi, ), fraud detection (lu2022bright, ), knowledge discovery (suarez2023templet, ; luzuriaga2021merging, ; rossi2022explaining, ; andreou2023using, ; jin2023graph, ; kannan2022exaflops, ; zheng2020dglke, ) and, etc.) are further solidifying the importance of GNNs.

Graph learning often follows an ego network-based computation graph (See Figure 1). Starting from a root node, the ego network of this node contains the full or a sampled subset of its in-neighbors. Moving to the next layer, we do the same for each selected in-neighbor. This process continues until we reach $k$ -hop in-neighbors for a $k$ -layer GNN. Since each node brings in multiple in-neighbors, each ego network is a “tree” with the target node at the top and more and more nodes from the top to the bottom. This “tree” shape nature lets different ego networks share many nodes. Of note, P³ (gandhi2021p3, ), Betty (yang2023betty, ), FlexGraph (wang2021flexgraph, ) term this ego network as the GNN computation graph, multi-level bipartite graph, and hierarchical dependency graph, respectively.

Problem definition. A common task for GNNs is computing the embeddings of all nodes in an unseen graph. For example, fraud detection in e-commerce marketplaces views the millions of transactions in the past period as a graph (lu2022bright, ). In addition, particle physics experiments produce millions of high-dimensional measurements each second in the sensor array, which are represented as a graph because of their heterogeneity and sparsity in space (moreno2020jedi, ). We term this problem “end-to-end inference for all nodes”. This daily update is required to reflect the latest changes in the graph. Computing the embeddings entails applying a trained inductive GNN model to all graph nodes. Specifically, an end-to-end inference process involves (i) constructing the Compressed Sparse Row (CSR) from edge lists, (ii) partitioning the graph because the massive amount of activity records (edges formed) and active entities often result in graphs with billions of nodes and tens of billions of edges, and (iii) computing the GNN embedding for each node.

While sharing opportunities arise during GNN computation because the GNN computation graph of each node is an ego network, we observe that this new “end-to-end inference for all node” maximizes such a sharing opportunity when compared with the inference of a subset of nodes (see DGI (yin2023dgi, )) or training (HAG (jia2020redundancy, ) and P³ (gandhi2021p3, )). For training, a GNN model can only be applied to a batch of nodes. Subsequently, the GNN model will be updated for the next batch, which limits the sharing in a batch.

Traditional GNN inference endeavors fail to fully extract the benefits of sharing for two reasons: (i) SALIENT++ (kaler2023communication, ) caches the node features and reuses them across ego networks. However, the sharing is limited by the cache hit ratio. To increase the hit ratio, one needs to either increase the cache size or employ complicated caching policies, both of which introduce overwhelming system overheads (lin2020pagraph, ; yang2019aligraph, ; liu2023bgl, ). (ii) An alternative is to merge ego networks, automatically allowing nodes in the merged computation graph to enjoy the sharing benefits. However, this method has prohibitively high memory demands. DGI (yin2023dgi, ) and P³ (gandhi2021p3, ) thus only permits a subset of ego networks to work together. Such a design can only exploit the sharing benefits within each subset, leaving cross-subset sharing wasted. Therefore, P³ and DGI can only utilize 33% and 60% sharing in a 3-layer model.

This paper introduces Deal, the first GNN inference system that is dedicated to end-to-end all-node inference on billion-edge graphs and is distributed to maximize the sharing benefits. Particularly, Deal encompasses three contributions:

First, we unveil and exploit an untapped sharing opportunity during sampling, and maximize the benefits from sharing during subsequent GNN computation. During sampling, we take a fundamentally different approach, which completely eliminates the pointer-chasing problem faced by, to the best of our knowledge, all existing sampling approaches (i.e., ego network-centric sampling). Specifically, for $k$ -layer GNN inference of all nodes, we sample 1-layer ego network $k$ times for each node. Subsequently, we collect the ego networks of the same layer across all nodes together to formulate a 1-hop graph. This offers us $k$ 1-hop ego networks for $k$ layers. During subsequent GNN computation, we feed the feature tensors of nodes and edges through these $k$ 1-hop graphs to arrive at the final embedding for all nodes (see Figure 4). This method automatically enjoys all sharing benefits during GNN computation.

Second, we employ a lightweight 1-D graph and feature collaborative partition to partition the graph and introduce memory-saving and communication-efficient distributed primitives for distributed end-to-end inference on the partitioned graph. Specifically, we choose 1-D graph and feature collaborative partitioning for two reasons:

(i) 1-hop graphs, along with the node features, are too big to fit in a single machine; (ii) end-to-end inference only contains one forward iteration, which cannot afford advanced node assignment algorithms whose overhead would outweigh the benefits (liu2023bgl, ; md2021distgnn, ). Further, communication during GNN requires excessive memory to store the received data, especially for 1-D partitioning, where most of the edges are across partitions. Therefore, we customize our distributed memory-efficient primitives, including GEneral Matrix Multiply (GEMM), SParse-dense Matrix Multiply (SPMM), and Sampled Dense-Dense Matrix Multiply (SDDMM). Compared to recent endeavors (kurt2023communication, ; tripathy2020reducing, ), primitives in Deal reduce memory usage during communication and retain communication efficiency.

Third, we introduce three system implementation optimizations specifically for end-to-end inference (Section 3.5). First, we partition the sparse tensors into subgroups. The computation and communication are performed subgroup by subgroup to reduce the peak memory consumption. Second, we judiciously schedule the subgroup computation and communications to pipeline the computations and communications. We overlap the communication and computation in each subgroup and schedule the communication to reduce the bubbles. These optimizations can improve the SPMM and SDDMM performance beyond $3.50\times$ . Third, we fuse the first layer of GNN inference with graph construction to reduce feature tensor exchange overhead by avoiding redistributing the feature tensor based on the partition results. We accelerate graph pre-processing by up to 21.05 $\times$ and reduce its wall-clock time ratio in end-to-end inference from 85% to 29% for various graph datasets.

2. Background

2.1. GNN inference: an ego network centric computation

Refer to caption — Figure 1. An example of inference computation workflow of Deal for a 2-layer GCN.

For a target node, GNN inference works on an ego network of this target node, which is extracted from the graph. The ego network consists of layers containing the neighbor nodes for one or more hops. The GNN computation progresses from layer 0 of the ego network towards this target node. One can rely on either sampling or the full graph to build this ego network.

Ego network defined GNN computation. Node 3’s ego network defines the subsequent GNN computation. While Deal supports a variety of GNN models, for brevity, we use a well-known model, Graph Convolutional Networks (GCN) (kipf2016semi, ), to explain this process. Usually, the computation is formalized as a series of linear primitives to take advantage of intra-ego network parallelism. The first step is deriving the ${\bf H}^{\prime}$ through General Matrix-Matrix Multiplication (GEMM) from layers 0 to 1. Subsequently, we use Sparse-Dense Matrix Multiplication (SPMM) to aggregate features, arriving at ${\bf H}^{1}$ . This process is repeated from layers 1 to 2 to calculate ${\bf H}^{2}$ . ${\bf H}^{2}$ is the final embedding for target node 3.

2.2. GNN partitioning methods

GNN projects mainly adopt three partitioning approaches during computation, i.e., 1-D and 2-D graph partitioning and feature partitioning. Of the three, 1-D and 2-D focus on partitioning the graph, while the last one distributes the feature tensor. As graphs and feature tensors continue to grow, partitioning is necessary for tackling large-scale GNNs. Particularly, feature tensors could grow significantly bigger than graphs. 1-D partition splits nodes based on node IDs and assigns a contiguous ID range to a partition. 2-D partition assigns each edge to one partition. Feature partition duplicates the graph and partitions the feature matrix into columns.

The choice of graph partitioning method significantly impacts the computation and communication of GNN. With 1-D partitioning, communication is required when accessing neighbors’ features across machines during SPMM and SDDMM. In 2-D partitioning, each machine computes partial results of SPMM and communicates with machines holding the same row tiles. Feature partitioning turns GEMM into an outer product calculation, necessitating an expensive all-to-all reduction during GEMM, which can be the bottleneck for large-scale GNN training and inference.

3. Deal design & implementation

3.1. Observations

Using the graph from Figure 1, Figure 2 illustrates the four stages of end-to-end GNN inference. First, the input graph is stored as an edge list on disk, and the graph generation entails reading the edge list and converting it to the graph data structure. Second, the partition algorithms are applied to the graph, dividing it into two partitions. The partitioned sub-graphs are stored in a shared file system accessible to all machines. Third, every machine reads one graph partition in CSR format to memory. Lastly, machines perform the distributed GNN inference to compute the representation of all 8 nodes in the graph used for subsequent tasks.

Observation #1. End-to-end inference requires a lightweight, co-designed topology-and-feature partitioning method. Figure 3a presents the breakdown of the end-to-end inference time across three datasets, where the graph is generated and 1-D partitioned into four parts for inference. The results show that 86% of the end-to-end time is spent on pre-processing, identifying it as a major efficiency bottleneck. The reason is that GNN inference requires just a single epoch of forward computation to compute the representations for all nodes. This is different from training that performs forward and backward computation multiple times to obtain a converged model. Therefore, advanced “time-consuming” graph pre-processing algorithms (such as partitioning and reordering) might not be a good fit because the time saved during inference by advanced pre-processing steps (e.g., advanced partitioning like METIS (karypis1997parmetis, )) could be shorter than the time spent in pre-processing.

Neither graph partition nor feature partition alone would meet the requirements. (i) On the one hand, graph partition alone would incur excessive memory consumption. Specifically, during ego network computation, when the number of layers in the ego network grows, most nodes become cross-partition nodes. One would need a space to hold these features as the intermediate results for computation. Figure 3b shows the memory consumption of GNN inference with four partitions. For example, when the ogbn-products (Bhatia16, ) graph is partitioned into four parts, one partition receives messages from 80% of the total nodes in the graph, leading to more than 380 GB memory footprint on one machine. (ii) On the other hand, feature partition alone will introduce excessive all-to-all communications during GEMM and SDDMM computation, which are illustrated in Figures 7 and 10.

Deal design. We introduce node feature partitioning, in addition to 1-D graph partition, to mitigate excessive memory consumption and communication costs (Section 3.3). First off, this design is lightweight. Second, it will divide a big primitive into several smaller ones, like one GEMM on big matrices, into GEMM on several smaller ones (Section 3.4). This strategy not only bounds the communication required for each group compared to the entire set of primitives but also significantly lowers peak memory usage.

Observation #2. All-node inference presents new sharing opportunities and challenges. While computing the GNN inferences for different target nodes together, offering sharing opportunities is not new (see, SALIENT++ (kaler2022accelerating, ), DGI (yin2023dgi, ) and P³ (gandhi2021p3, )), the amount of sharing for all-node inferences is unprecedented, which renders new research opportunities and challenges.

The key insight is that only batching enough inferences will offer considerable sharing opportunities, which also means significant memory space consumption. Figure 5 illustrates the sharing opportunities for a 3-layer GNN while increasing the batch sizes for sparse (obgn-products) and dense (social-spammer) graphs. Particularly for sparse graphs like ogbn-products, full sharing is only achieved with a single batch due to low connectivity, which causes partial batches to include nodes across batches redundantly. For dense graphs like social-spammer, high connectivity allows each batch to cover a distinct part of the graph. While seems promising, it actually indicates increasing batch size will consume enormous memory space. For instance, a machine with 256 GB memory can only accommodate the batch size up to 6% of the nodes (149K) for ogbn-products, 0.12% (103K) for social-spammer.

Deal design. We propose processing all-node inference in a single batch to extract the sharing benefits fully. First, we implement GNN operations as distributed linear algebra primitives and further optimize them to be communication efficient (Section 3.4). Second, we strategically partition the GNN primitives to reduce the extra communication from group-by-group execution. Additionally, we implement a pipelining strategy for communicating these groups, which helps in overlapping communication and computation. This reduces the waiting time for data transfer and ensures better resource utilization (Section 3.5).

3.2. Deal Workflow Overview

A natural way of taking advantage of the sharing in observation #1 would take - option in Figure 4. That is, we first obtain all the multi-hop ego networks. Subsequently, we break them into 1-hop ego networks. Finally, we remove the duplicated 1-hop ego networks. For instance, the 1-hop ego network of 5 is shared across the ego networks for nodes 2 and 7. Therefore, we only store that 1-hop ego network once.

Deal goes further by completely avoiding building the multi-hop ego networks. In fact, we compute the embeddings for all target nodes without recovering the multi-hop ego network. Of note, Deal can also work for complete graph-based embedding updates (i.e., each one-hop ego network contains the entire neighborhood).

As shown in of Figure 4, we directly sample 1-hop ego networks for all nodes. For each layer, we will collectively store all of these 1-hop ego networks as a graph. For instance, the layer 0 graph is stored as $\bf G_{0}$ . Similarly, layer 1 as $\bf G_{1}$ . We sample two 1-hop ego networks for every node for two GNN layers, as shown in Figure 4(a). The 8-node graph with a 2-layer GNN leads to 16 ego networks, which can be combined to form the multi-hop ego network. For example, the 2-hop ego network of node 2 comprises the 1-hop ego network of node 2 at layer 2, nodes 0 and 5 at layer 1. In Figure 4, the aggregation of 1-hop ego network in every row is unique, while the sampling in each column accesses the neighbors of the same node. Meanwhile, in each row, ego networks may share the neighbors. Therefore, we can maximize the sharing within ego networks by sampling column-wise and the sharing between ego networks by computing row-wise. Of note, if we work on the complete graph, we will use the complete graph $\bf G$ as $\bf G_{0}$ and $\bf G_{1}$ for the subsequent computations.

We sample the 1-hop ego network in one column together to leverage the sharing in sampling. In particular, sampling from a distribution requires a data structure representing the distribution. Building and accessing this data structure leads to the major overhead of sampling. For example, when sampling multiple neighbors without replacement, a tree is built where each branch indicates the sampling space after the certain neighbor is picked, and the sampling is to traverse the tree randomly. Therefore, when sampling the same target node for different GNN layers, the same data structure can be reused, saving construction costs. The sampled 1-hop ego networks are stored as an edge list for the computation.

Computation sharing is achieved in two ways: First, the node projection GEMM of the same node is shared. For instance, in layer 1, neighbor 0 of 1, 2, 3, and 5 are shared ( ). Second, the aggregation SPMM of the same node is shared. As shown in step , node 5’s aggregation is shared for target nodes 2 and 7. We notice that certain nodes might not appear in any multi-hop ego networks due to the neighbor sample, but we still sample and compute its 1-hop network to simplify the implementation. We notice that nearly all nodes will appear on each layer of the ego networks because the number of dependency nodes increases exponentially at each layer.

3.3. Topology and feature co-designed partition

Deal partitions both the graph topology and node features to curb the memory consumption and the time spent on partitioning. Specifically, we adopt 1-D graph partition such that each machine obtains all the in-neighbors of a disjoint equal range of nodes. Further, we distribute the features of each partition across multiple machines. Figure 6 explains how the same toy example would be distributed across four machines. Specifically, we partition nodes range 0 - 3 and 4 - 7 into two partitions. Machines 0 and 1 both host one copy of the edge list of Partition 0. Machines 2 and 3 both host one copy of the edge list of Partition 1. In the meantime, we partition the features of each node between machines hosting the same partition. Therefore, machine 0 will be responsible for the first two feature dimensions of nodes 0 - 3, and machine 1 for the second two dimensions of nodes 0 - 3, similar to the other two machines.

Our partitioning approach is more lightweight and communication efficient than both traditional 2-D-based graph partition and feature partition: (i) 2-D partition, splits the adjacency matrix into tiles in both row and column directions. Therefore, during the SPMM primitive, each machine computes the partial results and needs to communicate to other machines with the tiles in the same row. Our partitioning stores the full rows on a machine to avoid such distributed aggregation requirements. (ii) Feature partition, distributes the features of nodes across machines so that each machine stores the entire column of the features. As a result, GEMM computation becomes an outer product calculation. This will result in an all-to-all reduction during GEMM computation, which could be very expensive. In our approach, only the machines with the same rows of feature tensors need to communicate, reducing the total communication size. More details are presented in Section 3.4.

3.4. Deal distributed GNN primitives

GEMM multiplies the partitioned feature matrix $\bf H^{(0)}$ with the weight matrix $\bf W_{0}$ . We let each machine own a part of the feature matrix and the full weight matrix because $\bf W_{0}$ is significantly smaller than $\bf H^{(0)}$ in GNN.

SOTA GEMM. Figure 7(a) illustrates the design in existing SOTA all reduce-based GEMM, i.e., CAGNET (tripathy2020reducing, ). At step , every machine multiplies its local tile with the associated rows in the weight matrix. In the example, machine 0 multiplies with the first 2 rows, and machine 1 multiplies with the second 2 rows. Each machine derives a $4\times 4$ matrix. Subsequently, at step , machines sharing the rows aggregate the columns from each other to compute the resultant columns.

CAGNET’s GEMM faces two drawbacks: excessive communication cost and memory consumption. (i) In Figure 7(a), each machine has to receive partial results from all the other machines for the tile this machine is responsible for. (ii) For space consumption, CAGNET creates the intermediate result of size $4\times 4$ on each machine before aggregation. Assuming $\bf H^{(0)}$ has $N$ rows and $D$ columns, and $H^{(0)}$ is partitioned into $P\times M$ partitions, that means we have $PM$ machines. Therefore, each machine works on $\frac{ND}{PM}$ entries. During GEMM, each machine receives $\frac{ND}{PM}(M-1)$ in entries for the feature tensor, and the memory footprint increases from $\frac{ND}{M}$ to $\frac{ND}{P}$ .

Our GEMM. Figure 7(b) introduces our design, significantly reducing the memory costs and the communication overhead CAGNET faces. The key idea is to avoid creating large intermediate results on each machine, as well as fully leverage the benefits of the duplicated $\bf W_{0}$ matrix. In the example, at step , machine 0 partitions its $4\times 2$ tile of $\bf H^{(0)}$ into two $2\times 2$ tiles. Then, it keeps the first tile and sends/receives the remaining tiles to/from the other machines. After that, machine 0 owns a $2$ tile for the first two rows and uses that to multiply with $\bf W_{0}$ to arrive at the first two rows of H’ ( ). The final step ( ) performs the same communication pattern as the first step so that machine 0 could, again, maintain the $4\times 2$ tile of the feature matrix (H’).

We implement a ring-based all-to-all communication to pipeline the computation. Using step as an example, for the first entry, four machines form a logical ring: machines $0\rightarrow 1\rightarrow...\rightarrow(M-1)\rightarrow 0$ . This process continues until we arrive at the row-wise distributed ${\bf H^{(0)}}$ . In this example, we only have 2 machines per sharing each row of ${\bf H^{(0)}}$ . So it is simply a Ping-Pong exchange. The communication of step is similar. Since we break the all-to-all communication into $M-1$ stages, we can overlap the communication with the computation. For example, machine 0 multiplies $\bigl{[}$ $\begin{smallmatrix}0&2\\ 1&3\end{smallmatrix}$ $\bigr{]}$ with $\bf W_{0}$ while receiving $\bigl{[}$ $\begin{smallmatrix}4&6\\ 5&7\end{smallmatrix}$ $\bigr{]}$ from machine 1. This will further reduce the size of the intermediate result.

Table 1. Memory and communication costs of GEMM.

Method	Memory	Communication
SOTA	$\frac{ND}{P}$	$\frac{ND}{PM}(M-1)$
Ours	$\frac{ND}{PM^{2}}$	$2\frac{ND}{PM^{2}}(M-1)$

We reduce memory by $M^{2}\times$ and communication costs by $\frac{M}{2}\times$ compared with SOTA as shown in Table 1. At step , one machine splits its partition into $M$ blocks with the size of each block as $\frac{ND}{PM^{2}}$ . Each machine will send $M-1$ blocks to the $M-1$ rest of machines. Therefore, the communication size of one machine is $2\frac{ND}{PM^{2}}(M-1)$ because it happens in and .

SPMM multiplies the node embedding matrix $\bf H^{{}^{\prime}}$ with the edge features $\bf E$ based on the graph connectivity ${\bf G_{0}}$ . Formally, $\bf H^{(l)}[][{\text{i}}]=multiply_{G}(E[\text{i}][],\bf H^{\prime}[][\text{i}])$ . Figure 8 uses ${\bf G_{0}}$ as an example, it multiplies the $i$ -th row of E with $i$ -th column of $\bf H^{\prime}$ following Sparse-Matrix Vector Multiplication (SpMV) fashion. During multiplication, E is shaped into ${\bf G_{0}}$ for proper matrix multiplication. As shown in , the first row of $\bf E$ , i.e., {0.3, 0.1, …, 0.3} is loaded into corresponding edge locations of ${\bf G_{0}}$ to multiply with $\bf H^{\prime}$ .

Our SPMM. Deal’s SPMM communicates the $\bf H^{\prime}$ matrix to realize distributed SPMM. As shown in Figure 8, machines 0 and 1 hold the top half of ${\bf G_{0}}$ while 2 and 3 are the bottom. The $\bf H^{\prime}$ matrix follows the partitioning in GEMM. Each machine holds the edge features of the edges (non-zeros) within its ${\bf G_{0}}$ part, aligning with the feature partition of $\bf H^{\prime}$ . Using machine 0 as an example, it is responsible for computing the blue tile of $\bf H^{\prime}$ . As shown in the blue dotted box and dashed arrows, during SPMM, machine 0 sends the non-zeros column IDs (5,6,7) to machine 2 ( ), and machine 2 returns rows 5-7 of $\bf H^{\prime}$ ( ), i.e., $\bigl{[}$ $\begin{smallmatrix}14&16\\ 6&8\\ 8&10\end{smallmatrix}$ $\bigr{]}$ . After that, machine 0 computes the resultant tile $\bf H^{(1)}$ with its local ${\bf G_{0}}$ , $\bf E$ , and $\bf H^{\prime}$ .

Exchange $\bf G_{0}$ . An alternative approach to fulfilling the communication is to exchange the sparse graph ${\bf G_{0}}$ . Using machine 0 as an example, it first multiplies columns 0-3 of ${\bf G_{0}}$ with its ${\bf H^{\prime}}$ tile as a partial result. It then sends its ${\bf G_{0}}$ tile and the associated edge features to machine 1. Machine 1 performs multiplication with its local ${\bf H^{\prime}}$ tile and returns the resultant partial product to machine 0 to aggregate as the final $\bf H^{(1)}$ . Although this method reduces the initial communication volume by transmitting only the graph structure and edge features, the second communication phase involves transferring partial results, whose size is comparable to that of the $\bf H^{\prime}$ tile. Consequently, the overall communication cost exceeds that of our SPMM approach.

SOTA 2D-based SPMM (tripathy2020reducing, ; peng2022sancus, ; selvitopi2021distributed, ; kurt2023communication, ). As depicted in Figure 9, the sparse matrix ( ${\bf G_{0}}$ ) and the feature matrix ( $\bf H{{}^{\prime}}$ and $\bf E$ ) are partitioned in 2D and distributed across four machines. During SPMM, machine 0 first receives the $\bf H{{}^{\prime}}$ tile from machine 1. Machine 0 then performs local SPMM with its local edge features $\bf E$ to compute partial results of $\bf H^{(l)}$ . After that, machine 0 receives the partial result of columns 0 and 1 from machine 1 to derive the final results. Both Deal and 2D-based SPMM receive a $4\times 2$ tile of $\bf H{{}^{\prime}}$ from machine 3 and machine 1, respectively. Since both approaches can send the non-zero index first to reduce the actual transferred features, they initially have similar communication costs. However, 2D-based SPMM needs to send its partial $4\times 2$ results for columns 2 and 3 to machine 1, which is not required in Deal’s approach.

Table 2. Memory and communication costs of SPMM. (Note memory consumption is the same as communication cost)

Method	Memory	Communication
Ours	-	$\frac{ZN(P-1)}{P^{2}}+\frac{N(P-1)}{P^{2}}\frac{D}{M}$
Exchange ${\bf G_{0}}$	-	$\frac{ZN(P-1)}{P^{2}}\frac{D}{M}+\frac{ND}{PM}$
2D-based SPMM	-	$\frac{N(P-1)}{P^{2}}\frac{D}{M}+\frac{ND(M-1)}{PM}$

The communication size is determined by two messages. Consider the distributed SPMM multiplying an $N\times N$ $\bf G_{0}$ with an $N\times D$ $\bf H^{\prime}$ , which has $P$ parts for rows and $M$ parts for columns (same as $\bf H^{(0)}$ in GEMM). Assuming that each column has $Z$ non-zeros on average, every machine receives $\frac{ZN(P-1)}{P^{2}}$ non-zeros from other machines ( ), which contains $\frac{N(P-1)}{P^{2}}$ unique columns. Further, since each machine receives the $\frac{D}{M}$ features for every non-zero column in $\bf H^{\prime}$ , the communication size is $\frac{N(P-1)}{P^{2}}\frac{D}{M}$ ( ). Similarly, for exchanging graphs, the $\bf G_{0}$ leads to $\frac{ZN(P-1)}{P^{2}}\frac{D}{M}$ communication and the partial result leads to $\frac{ND(M-1)}{PM}$ communication. For 2D-based SPMM, the extra aggregation leads to $\frac{ND(M-1)}{PM}$ communication. Compared with exchanging $\bf G_{0}$ , both our first term (graph) and the second term (features) are smaller. Further, compared with 2D-based SPMM, the second term of 2D-based SPMM is much larger than ours. Together, our design is more communication efficient.

SDDMM primitive uses the source features matrix $\bf H^{(l-1)}_{src}$ and the destination feature matrix $\bf{H^{(l-1)}_{dest}}$ to derive the edge attention based on the adjacency matrix ${\bf G_{0}}$ . Formally, $\textbf{attn}={\bf G_{0}}\odot(\bf{H^{(l-1)}_{dest}}\cdot({\bf H}^{(l-1)}_{src% })^{T})$ . As shown in Figure 10, the highlighted nonzero $(1,6)$ computation in ${\bf G_{0}}$ is the dot-product between row 1 in $\bf{H^{(l-1)}_{dest}}$ and column 6 in $\bf(H^{(l-1)}_{src})^{T}$ . Practically, only the positions with non-zeros associated in ${\bf G_{0}}$ are computed, and the result sparsity is identical to ${\bf G_{0}}$ . In distributed SDDMM, the computation of any nonzero entries would involve feature matrices from multiple machines. For instance, computing entry $(1,6)$ requires data from four machines (highlighted in red dashed boxes in $\bf{H^{(l-1)}_{dest}}$ and $\bf(H^{(l-1)}_{src})^{T}$ ).

We propose an output-oriented task scheduling with corresponding communication patterns. Specifically, we assign machines storing the non-zeros in ${\bf G_{0}}$ to compute the corresponding $\bf attn$ results. This strategy ensures that the results for each non-zero element are co-located with the sparse matrix after the SDDMM operation. When multiple machines store the same portion of the sparse matrix, we consider two approaches: (i) duplicating the computation across machines or (ii) distributing the computation of non-zeros among machines and subsequently exchanging the results.

Approach (i). Using machine 0 as an example, approach (i) requires all four features of nodes 5-7 in $\bf(H^{(l-1)}_{src})^{T}$ from machines 2 and 3, which is 12 values $\bigl{[}\begin{smallmatrix}\hbox{\pagecolor[HTML]{ACE1AF}\footnotesize 2.3\ \ % 0.0\ \ 0.6}\\ \hbox{\pagecolor[HTML]{ACE1AF}\footnotesize 0.4\ \ 1.2\ \ 0.8}\\ \hbox{\pagecolor[HTML]{FFE4E1}\footnotesize 0.6\ \ 0.5\ \ 1.2}\\ \hbox{\pagecolor[HTML]{FFE4E1}\footnotesize 0.3\ \ 0.7\ \ 3.2}\end{smallmatrix% }\bigr{]}$ . We also need all 8 values of $\bf H^{(l-1)}_{dest}$ , i.e., $\bigl{[}$ $\begin{smallmatrix}6&8\\ 8&10\\ 10&12\\ 12&14\end{smallmatrix}$ $\bigr{]}$ from machine 1, and 6 values of $\bf(H^{(l-1)}_{src})^{T}$ , i.e., $\bigl{[}$ $\begin{smallmatrix}2.5&1.5&1.2\\ 3.9&1.8&2.4\end{smallmatrix}$ $\bigr{]}$ from machine 1. The communication size is 26.

Approach (ii), we let machine 0 compute the non-zeros in $\bf attn$ rows 0-1, and machine 1 in rows 2-3. Therefore, machine 0 receives $\bigl{[}\begin{smallmatrix}\hbox{\pagecolor[HTML]{ACE1AF}\footnotesize 0.0}\\ \hbox{\pagecolor[HTML]{ACE1AF}\footnotesize 1.2}\\ \hbox{\pagecolor[HTML]{FFE4E1}\footnotesize 0.5}\\ \hbox{\pagecolor[HTML]{FFE4E1}\footnotesize 0.7}\end{smallmatrix}\bigr{]}$ , $\bigl{[}$ $\begin{smallmatrix}2.5\\ 3.9\end{smallmatrix}$ $\bigr{]}$ , and $\bigl{[}$ $\begin{smallmatrix}1.2\\ 2.4\end{smallmatrix}$ $\bigr{]}$ from $\bf(H^{(l-1)}_{src})^{T}$ , and $\bigl{[}$ $\begin{smallmatrix}6&8\\ 8&10\end{smallmatrix}$ $\bigr{]}$ in $\bf{H^{(l-1)}_{dest}}$ . After the computation, machine 0 receives rows 2 - 3 in attn from machine 1. In total, machine 0 receives 17 values. In the meantime, machine 1 receives $\bigl{[}\begin{smallmatrix}\hbox{\pagecolor[HTML]{ACE1AF}\footnotesize 2.3}\\ \hbox{\pagecolor[HTML]{ACE1AF}\footnotesize 0.4}\\ \hbox{\pagecolor[HTML]{FFE4E1}\footnotesize 0.6}\\ \hbox{\pagecolor[HTML]{FFE4E1}\footnotesize 0.3}\end{smallmatrix}\bigr{]}$ , $\bigl{[}\begin{smallmatrix}\hbox{\pagecolor[HTML]{ACE1AF}\footnotesize 0.6}\\ \hbox{\pagecolor[HTML]{ACE1AF}\footnotesize 0.8}\\ \hbox{\pagecolor[HTML]{FFE4E1}\footnotesize 1.2}\\ \hbox{\pagecolor[HTML]{FFE4E1}\footnotesize 3.2}\end{smallmatrix}\bigr{]}$ , and $\bigl{[}$ $\begin{smallmatrix}2.8&2.0\\ 3.2&2.2\end{smallmatrix}$ $\bigr{]}$ in $({\bf H}^{(l-1)}_{src})^{T}$ , and $\bigl{[}$ $\begin{smallmatrix}8&10\\ 10&12\end{smallmatrix}$ $\bigr{]}$ in ${\bf H}^{(l-1)}_{dest}$ . For results, it receives rows 0 - 1 from machine 0. The total # of received values is 19. Further, the two machines communicate in parallel. Therefore, approach (ii) leads to fewer communications in this example.

Table 3. Memory and communication costs of SDDMM. (Note memory consumption is the same as communication cost)

Method	Memory	Communication
Approach (i)	-	$(M+MP-2)\frac{ND}{MP}$
Approach (ii)	-	$(M+MP-2)\frac{ND}{M^{2}P}$ + $\frac{NZ(M-1)}{PM}$

We choose approach (ii) for reduced communication size. Similar to SPMM assumptions, we assume $\bf H^{(l-1)}_{dest}$ and $\bf H^{(l-1)}_{src}$ are $N\times D$ dense with $P$ partitions in rows and $M$ partitions in columns, and $\bf{G_{0}}$ is $N\times N$ with $Z$ non-zeros per column on average. So # of machines = $MP$ . For approach (i), each machine needs to access $M-1$ and $MP-1$ machines from $\bf H^{(l-1)}_{dest}$ and $\bf H^{(l-1)}_{src}$ , respectively, so the total communication is $(M+MP-2)\frac{ND}{MP}$ . For approach (ii), each machine computes $\frac{N}{MP}$ rows instead of $\frac{N}{P}$ rows in approach (i). Therefore, the total communication is reduced to $(M+MP-2)\frac{ND}{M^{2}P}$ . Further, approach (ii) requires communicating the $\frac{M-1}{M}$ ratio of the nonzeros in $\bf attn$ , which leads to $\frac{NZ(M-1)}{PM}$ communications. In total, approach (ii) performs $(M+MP-2)\frac{ND}{M^{2}P}$ + $\frac{NZ(M-1)}{PM}$ communications. When $M$ increase, the communication size of the input in Approach (ii) is reduced faster than that of Approach (i), which supports our choice of Approach (ii).

3.5. Deal system optimizations

Partitioned communication. Figure 11 exemplifies our partitioning strategy, which consists of two steps. First, we assign the non-zeros from ${\bf G_{0}}$ that multiply with local node features of $\bf H^{(l-1)}$ into one group. For example, the non-zeros in the top-left tile in ${\bf G_{0}}$ are local for machine 0 and machine1. Second, we partition the other non-zeros based on their column IDs. In particular, we sort the column ID array in CSR and assign non-zeros in adjacent columns into groups because they fall within a small range in the sorted column ID array. Using machine 0 as an example, we put the non-zeros in columns 5 and 6 of ${\bf G_{0}}$ into group 1 and the remaining into group 2. As a result, the computation of ${\bf G_{0}}$ is partitioned into 6 groups, where machines 0 & 1 compute groups 0-2, and machines 2 & 3 compute groups 3-6. In each group, we communicate the needed features and multiply them with the edge features of the non-zeros in the group.

When non-zeros of the same row in ${\bf G_{0}}$ are partitioned into different groups, we cache the results of each row and perform inter-group accumulation. In the example, we cache the partial results of rows 0-3 derived by the group 0. Then, for group 1, we accumulate the results of $(4,1)$ and $(4,3)$ in ${\bf G_{0}}$ to the cache of rows 1 and 3, respectively. The SDDMM computation is similar. We focus on one group of non-zeros at a time and perform the computations.

Pipeline optimization. We organize the computation and communication of Figure 11 in the pipeline to hide the communication time. Figure 12(a) shows an example of SPMM with four groups in rows 4-7 of ${\bf G_{0}}$ , where each group is associated with two communications for column IDs and features except group 6. We can schedule the groups in the pipeline so that the SPMM computation of the group overlaps with the communication of receiving features. For example, we first finish the communication for the column IDs of group 4. After that, we can start receiving the features for group 4 and sending the features to other machines while performing the SPMM computation of group 3. However, communicating the features depends on the results of ID communication, so we cannot start the SPMM before it completes, leading to the bubble between two SPMM computations.

We propose two reordering optimizations to reduce the communication cost in our pipelined strategy: (i) At the start of the primitive, we first communicate the IDs for groups 3 and 4 as shown in Figure 12(b). As a result, the SPMM computation of group 3 can overlap the communication of column IDs for group 5 and the features for group 4. The communication for the features of the next group and the IDs of the group after the next group do not need synchronization. (ii) We can schedule the local SPMM (group 6) at the beginning to cover the pipeline fill time, as shown in Figure 12(c).

Fusing feature preparation with the first GNN primitive. During GNN inference, we need to load the node features from the files. Of note, the feature files are not sorted based on IDs. Since Deal partitions the features for scalability purposes, one can either let each machine scan all the feature files to obtain its own features or let each machine load part of the features and communicate for the correct feature distributions. When there are $M$ machines and $N$ nodes, the former approach incurs $O(M\cdot N)$ traffic on the file system, and the latter reduces the file system traffic by $M$ times and leads to $O(\frac{(M-1)N}{M})$ network traffic. Because the network has a larger aggregated bandwidth than the file system (efs, ), we opt to let each machine load part of the features and then redistribute them.

Deal goes further to avoid an extra redistribution cost as follows: (i) We build a table recording the location of each node feature on every machine. In the example of Figure 13, machine 0 (blue) loads the features of nodes 1, 6, 3, and 7. (ii) We let the machines that are supposed to hold a particular feature tile compute that tile in $\bf H^{(1)}$ , so the residence of the first layer output, $\bf H^{(1)}$ , aligns with the partition results. For example, for the sparse primitives in the first GNN layer, machines receive $\bf H^{(0)}[6]$ from machines 0 and 1 and receive $\bf H^{(1)}[6]$ from 2 and 3.

4. Evaluation

4.1. Experimental setup

Table 4. Real-world graph datasets.

Dataset	ogbn-products	social-spammer	ogbn-papers100M
Nodes	2.4 M	5.6 M	111 M
Edges	123 M	858 M	1.6 B

Datasets. We conduct experiments on three real-world datasets as shown in Table 4. The ogbn-papers100M (wang2020microsoft, ) is a citation graph whose nodes represent papers and edges are the citations. ogbn-products (Bhatia16, ) depicts a product co-purchasing network, where nodes represent products sold on Amazon, and edges between two products indicate that they are purchased together. The social-spammer (fakhraei2015collective, ) dataset depicts a multi-relation social network with legitimate users and spammers. Besides, we use synthetic datasets to evaluate scalability, generated using RMAT (chakrabarti2004r, ), with the edge probabilities as $\{0.57,0.19,0.19,0.05\}$ , and the average degree as 20.

Models. We test the inference of 3-layer GCN and GAT. The hidden dimension of node features is set the same as the input feature dimension, which is 100 for ogbn-products and 128 for other datasets. The GAT model has 4 heads. We sample 50 neighbors for every GNN layer.

Baseline systems. We compare with the baseline system for Deal’s GNN computation and graph construction. In particular, we implement the GNN computation of DGI and SALIENT++ in DistDGL (zheng2020distdgl, ). Note that we don’t use P³ as a baseline because it is not open-sourced. For graph construction, Deal is compared with DistDGL’s built-in pipeline (dglpart, ).

System implementation. We implement Deal on top of DGL and PyTorch. Specifically, we leverage DGL for graph operations and PyTorch’s distributed package for communication. The experiments are run with PyTorch 2.0 and DGL 1.1. The system is deployed on up to 16 AWS R5.16xlarge EC2 instances with Intel Xeon Platinum 8175 and 768 GB memory. Instances are connected via 25 Gbps Ethernet.

4.2. Deal vs State-of-the-art

Deal v.s SOTA. Figure 14 shows the speedup of Deal over DGI and SALIENT++ across three datasets and two models. For GCN, Deal achieves $4.64\times$ , $2.28\times$ , and $3.25\times$ speedup over DGI, and $4.36\times$ , $1.82\times$ , and $3.26\times$ speedups over SALIENT++, respectively for the three datasets. For GAT, over the three datasets, respectively, Deal enjoys $7.70\times$ , $2.93\times$ , and $3.90\times$ speedups against DGI, and $3.07\times$ , $1.32\times$ , and $2.32\times$ speedup against SALIENT++. Regarding the trends on datasets, as the graph grows larger (ogbn-papers100M) and sparser (ogbn-products), DGI suffers from decreased sharing ratios as such graphs are harder for DGI to obtain common neighbors, and SALIENT++ experiences increased overhead from building and maintaining its cache.

Regarding the model trends, Deal achieves higher speedup on GAT when compared with DGI because GAT contains more primitives, benefiting more from better exploited sharing. In contrast, Deal exhibits higher speedups for GCN when compared against SALIENT++, because its GCN computation is dominated by feature communication, which Deal significantly improves. Deal keeps similar speedups when increasing the number of machines. The reason is that our sampling would offer better speedups, but it results in more communication. These two effects are roughly comparable, thus leading to maintained speedups.

Table 5. The sharing ratio of different approaches.

	ogbn-products	social-spammer	ogbn-papers100M
DGI (yin2023dgi, )	60.1%	87.0%	63.9%
P³ (gandhi2021p3, )	33.3%	46.1%	28.6%
SALIENT++ (kaler2023communication, )	66.4%	77.9%	70.3%

Sharing ratio. Table 5 shows the sharing ratio of different approaches. Across the three datasets, DGI, P³, and SALIENT++ achieve an average sharing ratio of $70.3\%$ , $36\%$ , and $71.5\%$ , respectively. Although P³ can leverage all sharing in the outermost hop, the outermost hop alone only contributes limited sharings, so P³ has the lowest overall sharing ratio. When comparing the different datasets, the three approaches $51.0\%$ , $67.8\%$ , and $50.4\%$ , respectively. Notably, there is an inverse relationship between the achieved sharing ratio and the speedup of Deal. SALIENT++ has a higher sharing ratio than DGI, but its cache maintenance overhead slows it.

Table 6. The test accuracy on ogbn-products.

Model	Full neighbor	SALIENT++	Ours
GCN	76.9% ( $\pm$ 0.29%)	76.9% ( $\pm$ 0.46%)	76.9% ( $\pm$ 0.43%)
GAT	79.4% ( $\pm$ 0.12%)	79.3% ( $\pm$ 0.63%)	79.2% ( $\pm$ 0.82%)

Accuracy study. We evaluate the accuracy of Deal on the ogbn-products in Table 6. Deal reuses the same sampled 1-hop ego networks for different nodes, which is slightly different from the conventional mini-batch inference (kaler2023communication, ; zheng2020distdgl, ). However, our results show that Deal achieves similar or the same accuracy as sampling-based method (i.e., SALIENT++). Particularly, this study compares the layer-by-layer inference in Deal with the full neighbor inference and the mini-batch inference in SALIENT++. We trained two 3-layer GCN and GAT models with sampling fanout as 10. Deal achieves the same accuracy for GCN and similar accuracy for GAT when compared to SALIENT++ and full neighbor-based approach.

4.3. Scalability

We evaluate the scalability of Deal using synthetic datasets. We use processed edges per second per machine to represent the system efficiency. Figure 15(a) shows the weak scaling of the GNN computation We run graphs of different scales on different cluster sizes. For example, we run a graph with 1B edges on 2 machines and a graph with 8B edges on 16 machines. When scaled to 16 machines, Deal retains $48.2\%$ and $47.9\%$ efficiency compared with using 2 machines for GCN and GAT, respectively.

Figure 15(b), (c), and (d) shows the strong scalability from 2 machines to 16 machines on real-world datasets. When scaled to 16 machines, Deal retains $2.28\times$ , $4.98\times$ , and $3.98\times$ for GCN, and $2.38\times$ , $5.32\times$ , and $4.83\times$ for GAT. Compared with GCN, GAT has better scalability because it has more GEMM primitives. When graphs grow larger, the scalability of Deal is better because the fixed overhead such as communication latency becomes insignificant.

4.4. Distributed primitive evaluation

4.4.1. GEMM

Figure 16 evaluates the distributed GEMM algorithm (Deal vs. CAGNET) on ogbn-products for two sizes of hidden dimensions. As GEMM performance is graph-structure independent, we restrict our results to one dataset. Deal’s distributed GEMM approach demonstrates substantial scalability, with average speedups of 1.97 $\times$ and 2.97 $\times$ when using 4 and 8 machines, respectively, compared to the 2-machine baseline. While Deal experiences noticeable overhead of adjusting the memory layout to accommodate the communication library for 2 and 3 machines, this overhead becomes trivial when the machine number is large. In contrast, CAGNET’s GEMM exhibits poorer scalability due to increased communication overhead with more machines. Overall, benefiting from reduced communication, our method significantly outperforms CAGNET, achieving average speedups of 1.52 $\times$ and 1.47 $\times$ across different machine counts. The speedup increases with more machines used.

4.4.2. SPMM

Figure 17 shows the performance of SPMM, evaluating (i) exchanging graph structure as the baseline and (ii) exchanging features in Deal. When comparing the two options, Deal achieves $4.30\times$ , $5.28\times$ , and $5.29\times$ speedups for three datasets, respectively. The speedup contains two parts, i.e., communication and SPMM computation. For communication, the reduced communication of Deal enjoys $4.15\times$ , $5.30\times$ , and $4.86\times$ speedups over the baseline, respectively. For SPMM computation, Deal delivers $6.14\times$ , $7.21\times$ , and $8.78\times$ speedups over the baseline, respectively.

Moreover, the scalability of these approaches diverges significantly. As the number of machines increases from 2 to 8, baseline shows decreased performance, becoming $2.27\times$ , $1.52\times$ , and $2.49\times$ slower, respectively. In contrast, Deal achieves $1.21\times$ , $1.08\times$ , and $1.52\times$ speedup, showcasing its superior scalability over the baseline. The reason is that the size of the sparse matrix sparse matrix tile does not reduce linearly as the number of partitions increases. Therefore, the baseline experiences a larger communication size to exchange the spare matrix tile when # partitions increases.

4.4.3. SDDMM

Figure 18 evaluates the SDDMM under various partitioning configurations. In SDDMM, machines communicate with other machines storing the different graph partitions and feature partitions. Therefore, the total communication is a combined effect of graph partition and feature partitions. We used fixed eight machines and varied the number of graph and feature partitions to assess their impacts on communication and computation times. The two approaches examined are (i) duplicating computation across partitions (baseline) and (ii) splitting non-zeros among partitions (Deal). As we increase feature partitions from one to eight, Deal demonstrates speedups of $1.65\times$ , $1.38\times$ , $1.15\times$ , and $1.00\times$ . Notably, both approaches are equivalent with a single feature partition (hence $1.00\times$ for the last case). Regarding communication efficiency, as shown as light green bars and deep green bars for baseline and Deal respectively, Deal yields speedups of $1.32\times$ , $1.53\times$ , and $1.15\times$ , respectively, across the partition configurations. Moreover, the exploitation of computation parallelism under Deal results in speedups of $1.54\times$ , $1.78\times$ , and $1.24\times$ , respectively, when the number of graph partitions increases. Dataset comparison reveals that denser graphs, such as those from the social-spammer dataset, benefit more from computational speedup, while larger and sparser graphs, like ogbn-papers100M, see reduced speedup primarily due to the communication overhead in aggregating edge features computed across machines.

4.5. Study system implementation optimizations

Partitioned communication and pipeline optimization. Figure 19 depicts the speedup achieved by the sparse primitives of Deal through two optimizations: partitioned communication and pipelining. Across datasets and machine counts, the partitioned communication yields the average speedups of $2.90\times$ , $3.09\times$ , and $2.15\times$ for SPMM, and $1.89\times$ , $2.09\times$ , and $1.57\times$ for SDDMM, respectively. Denser graphs with more non-zeros per column benefit more due to efficient communication merging, leading to the highest speedup for the dense social-spammer dataset and the lowest gain on the sparser ogbn-papers100M. The speedup decreases with more machines as the ratio of redundant communication is reduced. Compared with SPMM, the speedup of SDDMM is smaller because we assign the non-zeros to different machines row-wise, reducing the number of non-zeros in each group. Subsequently, applying pipelining further boosts performance on average by $1.50\times$ , $1.65\times$ , and $1.47\times$ for SPMM, and $1.82\times$ , $2.15\times$ , and $1.90\times$ for SDDMM. The dense graphs enjoy more speedup due to reduced communication overhead. Likewise, the SDDMM achieves higher speedup because of more communication operations per group. Cumulatively, our optimizations achieve overall speedups of $4.41\times$ , $4.74\times$ , and $3.61\times$ on SPMM, and $3.72\times$ , $4.24\times$ , and $3.50\times$ on SDDMM by combining partitioned communication and pipelining across the respective datasets, underscoring the compounded benefit.

Graph construction. Figure 20 illustrates the speedup of our graph construction over DistDGL. Deal achieves $7.92\times$ , $21.05\times$ , and $11.99\times$ speedup on average across the evaluated datasets, respectively. Deal exhibits higher speedup on graphs with more edges because DistDGL can only process the edge list using one machine while Deal fully distributes the construction. Furthermore, leveraging 4 machines in Deal results in $2.54\times$ , $3.11\times$ , and $3.21\times$ speedup when compared to a single machine for the respective datasets. This scaling efficiency is particularly pronounced for larger graphs.

Feature preparation. Figure 21 evaluates the fusing feature preparation with the first GNN primitive across three distinct datasets. Notably, when compared to the baseline scan-through loading method, the feature redistribution design achieves a speedup of 1.20 $\times$ , 1.26 $\times$ , and 1.39 $\times$ on average for these datasets, respectively, across varying machine counts. Furthermore, Deal’s communication-free method yields additional 1.15 $\times$ , 1.15 $\times$ , and 1.14 $\times$ speedups, respectively. As we scale up the number of machines, the baseline time remains unchanged because the file system is the bottleneck. In contrast, when the machine count increases from 2 to 8, the redistribution approach achieves a speedup of 3.27 $\times$ over the baseline using 2 machines. Deal further achieves 3.88 $\times$ over 2-machines baseline, underscoring the benefits of communication reduction.

5. Related work

GNN research has proliferated recently. GNN applications (hamidi2023learning, ; sun2020benchmarking, ; suzuki2023clustered, ), algorithms (zhu2021graph, ; chen2024view, ; khan2023interpretability, ; jiang2022fast, ; tao2022cross, ; wang2023efficient, ; wu2023certified, ; FENG2023119617, ), models (kipf2016semi, ; hamilton2017inductive, ; layne2023temporal, ), systems (chen2023tango, ; zheng2020distdgl, ; huang2024wisegraph, ; chen2023compressgraph, ), hardware (FINGERS, ; yu2023race, ; lyu2022efficient, ), among many others (zheng2022distributed, ; zheng2020dglke, ; calvanese2023conceptually, ; Graph2020Amirali, ). We refer the readers to a handful of GNN surveys for a more comprehensive landscape of GNN research (abadal2021computing, ). This work mainly focuses on two related subjects: distributed GNN and GNN inference acceleration.

Distributed GNN computation can be categorized through two avenues: (i) ego network-centric distribution and (ii) full graph-centric distribution. Below, we discuss them separately.

(i) Ego network-centric distribution treats the ego network as a first-class citizen, and distribution is achieved centering around each ego network entity. PyG (fey2019fast, ), DGL (wang2019dgl, ), AliGraph (yang2019aligraph, ), AGL (zhang2020agl, ), DistDGL (zheng2020distdgl, ), P³ (gandhi2021p3, ), FlexGraph (wang2021flexgraph, ), Betty (yang2023betty, ), SALIENT++ (kaler2022accelerating, ), and PaGraph (lin2020pagraph, ) belong to this category. We discuss some representative projects: P³ (gandhi2021p3, ) introduces the hybrid parallelism to address the redundancy. All machines first exchange the ego networks and collectively compute the features of all nodes in the first layers. Then, the results are communicated, and every machine continues for the rest of the layers of each ego network. FlexGraph (wang2021flexgraph, ) dynamically migrates the ego networks among the machines to balance the workload and minimize the communication cost. Betty (yang2023betty, ) partitions the multi-layer bipartite graph built by a batch of ego networks. The goal is that each ego network owns one graph partition, and the inter-partition communication is reduced to mitigate redundancy. SALIENT++ (kaler2022accelerating, ) and PaGraph (bai2021efficient, ) focus on caching features of hub nodes, which are often included in multiple ego networks, to eliminate the need for repeated communication.

Deal is fundamentally distinct from the aforementioned paradigm. Particularly, Deal breaks all ego networks into 1-hop samples and computes all samples of the same layer together with distributed primitives. This process continues layer-by-layer to arrive at the final embeddings for all nodes. This concerted effort eliminates redundancy and offers rich pipelining opportunities.

(ii) Full graph-centric distribution simply partitions the graph for distributed GNN. NeutronStar (wang2022neutronstar, ), Sancus (peng2022sancus, ), NeuGraph (ma2019neugraph, ) DistGNN (md2021distgnn, ), DGCL (cai2021dgcl, ), and Dorylus (thorpe2021dorylus, ) fall in this category. Particularly, DGCL (cai2021dgcl, ) introduces a novel communication scheduling approach that considers both network topology and GNN computation dependencies to reduce communication costs. NeutronStar (wang2022neutronstar, ) opts for an adaptive solution between recomputation and caching to reduce communication costs. The benefits are pronounced when nodes have few dependencies and larger hidden layer sizes, where computation overhead is less than the communication overhead. NeuGraph (ma2019neugraph, ) distributes tiles of the adjacency matrix across multiple GPUs. Each GPU calculates partial results and then employs an all-reduce operation for complete results aggregation. Besides, DistGNN (md2021distgnn, ) and Sancus (peng2022sancus, ) delay the communication of sparse primitives and proceed with partial aggregated results. This approach, while effective, can lead to accuracy losses.

Deal is different as follows: Deal partitions all the participating tensors during GNN distribution, including the sparse graph tensor, and node and edge feature tensors. We prioritize the feature tensors due to their superior size. In contrast, the aforementioned projects only focus on graph partitions.

Distributed GNN primitives. Existing work on optimizing distributed primitives for GNNs focuses on reducing communication overhead (koanantakool2016communication, ; tripathy2020reducing, ; kurt2023communication, ; selvitopi2021distributed, ; zhang2023unfairness, ). (selvitopi2021distributed, ) and (koanantakool2016communication, ) proposes novel partition and associated communication algorithms to optimize individual primitives. MGG (wang2023mgg, ) leverages the sparsity to overlap the communication and computation within a GPU kernel. Techniques like CAGNET (tripathy2020reducing, ) optimize work distribution based on the GNN computation flow, and RDM (kurt2023communication, ) redistributes matrices to accommodate different primitives. However, these efforts lack support for diverse GNN models (velivckovic2018graph, ; hamilton2017inductive, ) and are coupled with specific model structures, limiting their applicability.

GNN inference acceleration. A separate line of research has focused on optimizing GNN inference from various perspectives (he2022coldguess, ; hosseini2023exploring, ; lyu2022efficient, ; abadal2021computing, ; tan2023quiver, ; yang2024gmorph, ). Work by (zhou2021accelerating, ) taps into model pruning to diminish the hidden dimension of node representations. Similarly, (auten2020hardware, ) devises hardware architectures tailored to efficiently manage the irregular data movement inherent to GNNs. HAG (jia2020hag, ) is proposed to reduce the redundancy computation in neighbor aggregation by combining the common neighbors of different nodes. HAG can reduce the total aggregation operations, but searching for the neighbor combination is time-consuming.

6. Conclusion

Deal introduces distributed end-to-end GNN inference at scale for all nodes. Particularly, Deal makes three major contributions. First, Deal introduces a lightweight partitioning strategy for end-to-end inference. Second, Deal designs the distributed GNN primitives to address partitioned graphs and features communication and memory consumption issues. Third, Deal implements partitioning and scheduling mechanisms to reduce communication costs further and enable pipelining-based optimizations. With Deal, the end-to-end inference time on real-world benchmark datasets is reduced up to $7.70\times$ and the graph construction time is reduced up to $21.05\times$ , compared to the state-of-the-art.

References

[1] Da Zheng, Xiang Song, Chengru Yang, Dominique LaSalle, and George Karypis. Distributed hybrid cpu and gpu training for graph neural networks on billion-scale heterogeneous graphs. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 4582–4591, 2022.
[2] Hewen Wang, Renchi Yang, Keke Huang, and Xiaokui Xiao. Efficient and effective edge-wise graph representation learning. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2326–2336, 2023.
[3] Shiwen Wu, Fei Sun, Wentao Zhang, Xu Xie, and Bin Cui. Graph neural networks in recommender systems: a survey. ACM Computing Surveys, 55(5):1–37, 2022.
[4] Kun Wu, Jie Shen, Yue Ning, Ting Wang, and Wendy Hui Wang. Certified edge unlearning for graph neural networks. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2606–2617, 2023.
[5] Jou-An Chen, Hsin-Hsuan Sung, Xipeng Shen, Sutanay Choudhury, and Ang Li. Bitgnn: Unleashing the performance potential of binary graph neural networks on gpus. In Proceedings of the 37th International Conference on Supercomputing, pages 264–276, 2023.
[6] Jiong Zhu, Ryan A Rossi, Anup Rao, Tung Mai, Nedim Lipka, Nesreen K Ahmed, and Danai Koutra. Graph neural networks with heterophily. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 11168–11176, 2021.
[7] Shengli Jiang and Prasanna Balaprakash. Graph neural network architecture search for molecular property prediction. In 2020 IEEE International conference on big data (big data), pages 1346–1353. IEEE, 2020.
[8] Sandeep Polisetty, Juelin Liu, Kobi Falus, Yi Ren Fung, Seung-Hwan Lim, Hui Guan, and Marco Serafini. Gsplit: Scaling graph neural network training on large graphs via split-parallelism. arXiv preprint arXiv:2303.13775, 2023.
[9] Seung Won Min, Kun Wu, Mert Hidayetoglu, Jinjun Xiong, Xiang Song, and Wen-mei Hwu. Graph neural network training and data tiering. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 3555–3565, 2022.
[10] Matthew Joseph Adiletta, Jesmin Jahan Tithi, Emmanouil-Ioannis Farsarakis, Gerasimos Gerogiannis, Robert Adolf, Robert Benke, Sidharth Kashyap, Samuel Hsia, Kartik Lakhotia, Fabrizio Petrini, et al. Characterizing the scalability of graph convolutional networks on intel® piuma. In 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 168–177. IEEE, 2023.
[11] Yuhang Yao, Weizhao Jin, Srivatsan Ravi, and Carlee Joe-Wong. Fedgcn: Convergence-communication tradeoffs in federated training of graph convolutional networks. Advances in Neural Information Processing Systems, 36, 2024.
[12] Maciej Besta, Pawel Renc, Robert Gerstenberger, Paolo Sylos Labini, Alexandros Ziogas, Tiancheng Chen, Lukas Gianinazzi, Florian Scheidl, Kalman Szenes, Armon Carigiet, et al. High-performance and programmable attentional graph neural networks with global tensor formulations. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16, 2023.
[13] Shiyang Chen, Da Zheng, Caiwen Ding, Chengying Huan, Yuede Ji, and Hang Liu. Tango: re-thinking quantization for graph neural network training on gpus. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–14, 2023.
[14] Kai Han, Yunhe Wang, Jianyuan Guo, Yehui Tang, and Enhua Wu. Vision gnn: An image is worth graph of nodes. Advances in Neural Information Processing Systems, 35:8291–8303, 2022.
[15] Yuxiao Lin, Yuxian Meng, Xiaofei Sun, Qinghong Han, Kun Kuang, Jiwei Li, and Fei Wu. Bertgcn: Transductive text classification by combining gnn and bert. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1456–1462, 2021.
[16] Abdelghny Orogat and Ahmed El-Roby. Maestro: Automatic generation of comprehensive benchmarks for question answering over knowledge graphs. Proceedings of the ACM on Management of Data, 1(2):1–24, 2023.
[17] Anthony Colas, Mehrdad Alvandipour, and Daisy Zhe Wang. Gap: A graph-aware language model framework for knowledge graph-to-text generation. arXiv preprint arXiv:2204.06674, 2022.
[18] Haoyu He, Yuede Ji, and H Howie Huang. Illuminati: Towards explaining graph neural networks for cybersecurity analysis. In 2022 IEEE 7th European Symposium on Security and Privacy (EuroS&P), pages 74–89. IEEE, 2022.
[19] Sina Shaham, Gabriel Ghinita, and Cyrus Shahabi. Enhancing the performance of spatial queries on encrypted data through graph embedding. In Data and Applications Security and Privacy XXXIV: 34th Annual IFIP WG 11.3 Conference, DBSec 2020, Regensburg, Germany, June 25–26, 2020, Proceedings 34, pages 289–309. Springer, 2020.
[20] Yifei Wang, Shiyang Chen, Guobin Chen, Ethan Shurberg, Hang Liu, and Pengyu Hong. Motif-based graph representation learning with application to chemical molecules. In Informatics, volume 10, page 8. MDPI, 2023.
[21] Kaustav Banerjee, Shukri J Souri, Pawan Kapur, and Krishna C Saraswat. 3-D ICs: A Novel Chip Design for Improving Deep-submicrometer Interconnect Performance and Systems-on-Chip Integration. Proceedings of the IEEE, 89(5):602–633, 2001.
[22] Xiao-Meng Zhang, Li Liang, Lin Liu, and Ming-Jing Tang. Graph neural networks and their current applications in bioinformatics. Frontiers in genetics, 12:690049, 2021.
[23] Anh-Tu Hoang, Barbara Carminati, and Elena Ferrari. Protecting privacy in knowledge graphs with personalized anonymization. IEEE Transactions on Dependable and Secure Computing, 2023.
[24] Carlos Enrique Muniz Cuza, Nguyen Ho, Eleni Tzirita Zacharatou, Torben Bach Pedersen, and Bin Yang. Spatio-temporal graph convolutional network for stochastic traffic speed imputation. In Proceedings of the 30th International Conference on Advances in Geographic Information Systems, pages 1–12, 2022.
[25] Yun Peng, Byron Choi, and Jianliang Xu. Graph learning for combinatorial optimization: a survey of state-of-the-art. Data Science and Engineering, 6(2):119–141, 2021.
[26] Qidong Liu, Cheng Long, Jie Zhang, Mingliang Xu, and Dacheng Tao. Aspect-aware graph attention network for heterogeneous information networks. IEEE Transactions on Neural Networks and Learning Systems, 2022.
[27] Oliver Hope and Eiko Yoneki. Gddr: Gnn-based data-driven routing. In 2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS), pages 517–527. IEEE, 2021.
[28] Nicholas Choma, Federico Monti, Lisa Gerhardt, Tomasz Palczewski, Zahra Ronaghi, Prabhat Prabhat, Wahid Bhimji, Michael M Bronstein, Spencer R Klein, and Joan Bruna. Graph neural networks for icecube signal classification. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 386–391. IEEE, 2018.
[29] Akash Dutta, Jee Choi, and Ali Jannesari. Power constrained autotuning using graph neural networks. In 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 535–545. IEEE, 2023.
[30] Maciej Besta, Raphael Grob, Cesare Miglioli, Nicola Bernold, Grzegorz Kwasniewski, Gabriel Gjini, Raghavendra Kanakagiri, Saleh Ashkboos, Lukas Gianinazzi, Nikoli Dryden, et al. Motif prediction with graph neural networks. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 35–45, 2022.
[31] Guo Zhang, Hao He, and Dina Katabi. Circuit-gnn: Graph neural networks for distributed circuit design. In International conference on machine learning, pages 7364–7373. PMLR, 2019.
[32] Mario Lino, Stathi Fotiadis, Anil A Bharath, and Chris D Cantwell. Multi-scale rotation-equivariant graph neural networks for unsteady eulerian fluid dynamics. Physics of Fluids, 34(8), 2022.
[33] Mingxuan Lu, Zhichao Han, Susie Xi Rao, Zitao Zhang, Yang Zhao, Yinan Shan, Ramesh Raghunathan, Ce Zhang, and Jiawei Jiang. BRIGHT - Graph Neural Networks in Real-Time Fraud Detection. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 3342–3351, 2022.
[34] Francisca Suárez and Aidan Hogan. Templet: A collaborative system for knowledge graph question answering over wikidata. In Companion Proceedings of the ACM Web Conference 2023, pages 152–155, 2023.
[35] Jhomara Luzuriaga, Emir Munoz, Henry Rosales-Mendez, and Aidan Hogan. Merging web tables for relation extraction with knowledge graphs. IEEE Transactions on Knowledge and Data Engineering, 2021.
[36] Andrea Rossi, Donatella Firmani, Paolo Merialdo, and Tommaso Teofili. Explaining link prediction systems based on knowledge graph embeddings. In Proceedings of the 2022 international conference on management of data, pages 2062–2075, 2022.
[37] Andreas S Andreou, Donatella Firmani, Jerin George Mathew, Massimo Mecella, and Michalis Pingos. Using knowledge graphs for record linkage: Challenges and opportunities. In International Conference on Advanced Information Systems Engineering, pages 145–151. Springer, 2023.
[38] Hongwei Jin, Krishnan Raghavan, George Papadimitriou, Cong Wang, Anirban Mandal, Mariam Kiran, Ewa Deelman, and Prasanna Balaprakash. Graph neural networks for detecting anomalies in scientific workflows. The International Journal of High Performance Computing Applications, 37(3-4):394–411, 2023.
[39] Ramakrishnan Kannan, Piyush Sao, Hao Lu, Jakub Kurzak, Gundolf Schenk, Yongmei Shi, Seung-Hwan Lim, Sharat Israni, Vijay Thakkar, Guojing Cong, et al. Exaflops biomedical knowledge graph analytics. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–11. IEEE, 2022.
[40] Da Zheng, Xiang Song, Chao Ma, Zeyuan Tan, Zihao Ye, Jin Dong, Hao Xiong, Zheng Zhang, and George Karypis. Dgl-ke: Training knowledge graph embeddings at scale. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 739–748, 2020.
[41] Swapnil Gandhi and Anand Padmanabha Iyer. P3: Distributed deep graph learning at scale. In 15th USENIX Symposium on Operating Systems Design and Implementation OSDI 21), pages 551–568, 2021.
[42] Shuangyan Yang, Minjia Zhang, Wenqian Dong, and Dong Li. Betty: Enabling large-scale gnn training with batch-level graph partitioning. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 103–117, 2023.
[43] Lei Wang, Qiang Yin, Chao Tian, Jianbang Yang, Rong Chen, Wenyuan Yu, Zihang Yao, and Jingren Zhou. Flexgraph: a flexible and efficient distributed framework for gnn training. In Proceedings of the Sixteenth European Conference on Computer Systems, pages 67–82, 2021.
[44] Eric A Moreno, Olmo Cerri, Javier M Duarte, Harvey B Newman, Thong Q Nguyen, Avikar Periwal, Maurizio Pierini, Aidana Serikova, Maria Spiropulu, and Jean-Roch Vlimant. Jedi-net: a jet identification algorithm based on interaction networks. The European Physical Journal C, 80:1–15, 2020.
[45] Peiqi Yin, Xiao Yan, Jinjing Zhou, Qiang Fu, Zhenkun Cai, James Cheng, Bo Tang, and Minjie Wang. Dgi: An easy and efficient framework for gnn model evaluation. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5439–5450, 2023.
[46] Zhihao Jia, Sina Lin, Rex Ying, Jiaxuan You, Jure Leskovec, and Alex Aiken. Redundancy-free computation for graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 997–1005, 2020.
[47] Tim Kaler, Alexandros Iliopoulos, Philip Murzynowski, Tao Schardl, Charles E Leiserson, and Jie Chen. Communication-efficient graph neural networks with probabilistic neighborhood expansion analysis and caching. Proceedings of Machine Learning and Systems, 5, 2023.
[48] Zhiqi Lin, Cheng Li, Youshan Miao, Yunxin Liu, and Yinlong Xu. Pagraph: Scaling gnn training on large graphs via computation-aware caching. In Proceedings of the 11th ACM Symposium on Cloud Computing, pages 401–415, 2020.
[49] Hongxia Yang. Aligraph: A comprehensive graph neural network platform. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages 3165–3166, 2019.
[50] Tianfeng Liu, Yangrui Chen, Dan Li, Chuan Wu, Yibo Zhu, Jun He, Yanghua Peng, Hongzheng Chen, Hongzhi Chen, and Chuanxiong Guo. BGL: GPU-EfficientGNN training by optimizing graph data I/O and preprocessing. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), pages 103–118, 2023.
[51] Vasimuddin Md, Sanchit Misra, Guixiang Ma, Ramanarayan Mohanty, Evangelos Georganas, Alexander Heinecke, Dhiraj Kalamkar, Nesreen K Ahmed, and Sasikanth Avancha. Distgnn: Scalable distributed training for large-scale graph neural networks. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–14, 2021.
[52] Süreyya Emre Kurt, Jinghua Yan, Aravind Sukumaran-Rajam, Prashant Pandey, and P Sadayappan. Communication optimization for distributed execution of graph neural networks. In 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 512–523. IEEE, 2023.
[53] Alok Tripathy, Katherine Yelick, and Aydın Buluç. Reducing communication in graph neural network training. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–14. IEEE, 2020.
[54] Thomas N Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations, 2016.
[55] George Karypis, Kirk Schloegel, and Vipin Kumar. Parmetis: Parallel graph partitioning and sparse matrix ordering library. 1997.
[56] K. Bhatia, K. Dahiya, H. Jain, P. Kar, A. Mittal, Y. Prabhu, and M. Varma. The Extreme Classification Repository: Multi-label Datasets and code, 2016.
[57] Tim Kaler, Nickolas Stathas, Anne Ouyang, Alexandros-Stavros Iliopoulos, Tao Schardl, Charles E Leiserson, and Jie Chen. Accelerating training and inference of graph neural networks with fast sampling and pipelining. Proceedings of Machine Learning and Systems, 4:172–189, 2022.
[58] Jingshu Peng, Zhao Chen, Yingxia Shao, Yanyan Shen, Lei Chen, and Jiannong Cao. Sancus: sta le n ess-aware c omm u nication-avoiding full-graph decentralized training in large-scale graph neural networks. Proceedings of the VLDB Endowment, 15(9):1937–1950, 2022.
[59] Oguz Selvitopi, Benjamin Brock, Israt Nisa, Alok Tripathy, Katherine Yelick, and Aydın Buluç. Distributed-memory parallel algorithms for sparse times tall-skinny-dense matrix multiplication. In Proceedings of the ACM International Conference on Supercomputing, pages 431–442, 2021.
[60] Amazon EFS performance, 2024.
[61] Kuansan Wang, Zhihong Shen, Chiyuan Huang, Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia. Microsoft Academic Graph: When experts are not enough. Quantitative Science Studies, 1(1):396–413, 2020.
[62] Shobeir Fakhraei, James Foulds, Madhusudana Shashanka, and Lise Getoor. Collective spammer detection in evolving multi-relational social networks. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1769–1778, 2015.
[63] Deepayan Chakrabarti, Yiping Zhan, and Christos Faloutsos. R-MAT: A recursive model for graph mining. In Proceedings of the 2004 SIAM International Conference on Data Mining, pages 442–446. SIAM, 2004.
[64] Da Zheng, Chao Ma, Minjie Wang, Jinjing Zhou, Qidong Su, Xiang Song, Quan Gan, Zheng Zhang, and George Karypis. DistDGL: distributed graph neural network training for billion-scale graphs. In 2020 IEEE/ACM 10th Workshop on Irregular Applications: Architectures and Algorithms (IA3), pages 36–44. IEEE, 2020.
[65] Deep Graph Library Tutorials and Documentation, 2024.
[66] Radin Hamidi Rad, Hoang Nguyen, Feras Al-Obeidat, Ebrahim Bagheri, Mehdi Kargar, Divesh Srivastava, Jaroslaw Szlichta, and Fattane Zarrinkalam. Learning heterogeneous subgraph representations for team discovery. Information Retrieval Journal, 26(1):8, 2023.
[67] Zequn Sun, Qingheng Zhang, Wei Hu, Chengming Wang, Muhao Chen, Farahnaz Akrami, and Chengkai Li. A benchmarking study of embedding-based entity alignment for knowledge graphs. Proceedings of the VLDB Endowment, 13(12):2326–2340, 2020.
[68] Yuto Suzuki and Farnoush Banaei-Kashani. Clustered federated learning for heterogeneous feature spaces using siamese graph convolutional neural network distance prediction. In Federated Learning Systems (FLSys) Workshop@ MLSys 2023, 2023.
[69] Tingyang Chen, Dazhuo Qiu, Yinghui Wu, Arijit Khan, Xiangyu Ke, and Yunjun Gao. View-based explanations for graph neural networks. arXiv preprint arXiv:2401.02086, 2024.
[70] Arijit Khan and Ehsan B Mobaraki. Interpretability methods for graph neural networks. In 2023 IEEE 10th International Conference on Data Science and Advanced Analytics (DSAA), pages 1–4. IEEE, 2023.
[71] Shunhua Jiang, Yunze Man, Zhao Song, Zheng Yu, and Danyang Zhuo. Fast graph neural tangent kernel via kronecker sketching. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 7033–7041, 2022.
[72] Yufei Tao, Hao Wu, and Shiyuan Deng. Cross-space active learning on graph convolutional networks. In International Conference on Machine Learning, pages 21133–21145. PMLR, 2022.
[73] Guosheng Feng, Hongzhi Wang, and Chunnan Wang. Search for deep graph neural networks. Information Sciences, 649:119617, 2023.
[74] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in neural information processing systems, 30, 2017.
[75] Janet Layne, Justin Carpenter, Edoardo Serra, and Francesco Gullo. Temporal sir-gn: Efficient and effective structural representation learning for temporal graphs. Proceedings of the VLDB Endowment, 16(9):2075–2089, 2023.
[76] Kezhao Huang, Jidong Zhai, Liyan Zheng, Haojie Wang, Yuyang Jin, Qihao Zhang, Runqing Zhang, Zhen Zheng, Youngmin Yi, and Xipeng Shen. Wisegraph: Optimizing gnn with joint workload partition of graph and operations. In Proceedings of the Nineteenth European Conference on Computer Systems, pages 1–17, 2024.
[77] Zheng Chen, Feng Zhang, JiaWei Guan, Jidong Zhai, Xipeng Shen, Huanchen Zhang, Wentong Shu, and Xiaoyong Du. Compressgraph: Efficient parallel graph analytics with rule-based compression. Proceedings of the ACM on Management of Data, 1(1):1–31, 2023.
[78] Qihang Chen, Boyu Tian, and Mingyu Gao. FINGERS: Exploiting Fine-Grained Parallelism in Graph Mining Accelerators. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2022, page 43–55, New York, NY, USA, 2022.
[79] Hui Yu, Yu Zhang, Jin Zhao, Yujian Liao, Zhiying Huang, Donghao He, Lin Gu, Hai Jin, Xiaofei Liao, Haikun Liu, et al. Race: An efficient redundancy-aware accelerator for dynamic graph neural network. ACM Transactions on Architecture and Code Optimization, 20(4):1–26, 2023.
[80] Bo Lyu, Maher Hamdi, Yin Yang, Yuting Cao, Zheng Yan, Ke Li, Shiping Wen, and Tingwen Huang. Efficient spectral graph convolutional network deployment on memristive crossbars. IEEE Transactions on Emerging Topics in Computational Intelligence, 7(2):415–425, 2022.
[81] Diego Calvanese, Avigdor Gal, Davide Lanti, Marco Montali, Alessandro Mosca, and Roee Shraga. Conceptually-grounded mapping patterns for virtual knowledge graphs. Data & Knowledge Engineering, 145:102157, 2023.
[82] Amirali Abdolrashidi, Anna Darling Goldie, Azalia Mirhoseini, Daniel Wong, Hanxiao Liu, James Pierce Laudon, Mangpo Phothilimthana, Peter Chao Ma, Qiumin Xu, Shen Wang, Sudip Roy, and Yanqi Zhou, editors. Graph Transformer: A Generalized Method for Computation Graph Optimizations, 2020.
[83] Sergi Abadal, Akshay Jain, Robert Guirado, Jorge López-Alonso, and Eduard Alarcón. Computing graph neural networks: A survey from algorithms to accelerators. ACM Computing Surveys (CSUR), 54(9):1–38, 2021.
[84] Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with pytorch geometric. arXiv preprint arXiv:1903.02428, 2019.
[85] Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, Tianjun Xiao, Tong He, George Karypis, Jinyang Li, and Zheng Zhang. Deep Graph Library: A Graph-Centric, Highly-Performant Package for Graph Neural Networks. arXiv preprint arXiv:1909.01315, 2019.
[86] Dalong Zhang, Xin Huang, Ziqi Liu, Jun Zhou, Zhiyang Hu, Xianzheng Song, Zhibang Ge, Lin Wang, Zhiqiang Zhang, and Yuan Qi. Agl: a scalable system for industrial-purpose graph machine learning. Proceedings of the VLDB Endowment, 13(12):3125–3137, 2020.
[87] Youhui Bai, Cheng Li, Zhiqi Lin, Yufei Wu, Youshan Miao, Yunxin Liu, and Yinlong Xu. Efficient data loader for fast sampling-based gnn training on large graphs. IEEE Transactions on Parallel and Distributed Systems, 32(10):2541–2556, 2021.
[88] Qiange Wang, Yanfeng Zhang, Hao Wang, Chaoyi Chen, Xiaodong Zhang, and Ge Yu. Neutronstar: distributed gnn training with hybrid dependency management. In Proceedings of the 2022 International Conference on Management of Data, pages 1301–1315, 2022.
[89] Lingxiao Ma, Zhi Yang, Youshan Miao, Jilong Xue, Ming Wu, Lidong Zhou, and Yafei Dai. NeuGraph: Parallel deep neural network computation on large graphs. In 2019 USENIX Annual Technical Conference (USENIX ATC 19), pages 443–458, 2019.
[90] Zhenkun Cai, Xiao Yan, Yidi Wu, Kaihao Ma, James Cheng, and Fan Yu. Dgcl: an efficient communication library for distributed gnn training. In Proceedings of the Sixteenth European Conference on Computer Systems, pages 130–144, 2021.
[91] John Thorpe, Yifan Qiao, Jonathan Eyolfson, Shen Teng, Guanzhou Hu, Zhihao Jia, Jinliang Wei, Keval Vora, Ravi Netravali, Miryung Kim, et al. Dorylus: Affordable, scalable, and accurate GNN training with distributed CPU servers and serverless threads. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), pages 495–514, 2021.
[92] Penporn Koanantakool, Ariful Azad, Aydin Buluç, Dmitriy Morozov, Sang-Yun Oh, Leonid Oliker, and Katherine Yelick. Communication-avoiding parallel sparse-dense matrix-matrix multiplication. In 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 842–853. IEEE, 2016.
[93] Hao Zhang, Malith Jayaweera, Bin Ren, Yanzhi Wang, and Sucheta Soundarajan. Unfairness in distributed graph frameworks. In 2023 IEEE International Conference on Data Mining (ICDM), pages 1529–1534. IEEE, 2023.
[94] Yuke Wang, Boyuan Feng, Zheng Wang, Tong Geng, Kevin Barker, Ang Li, and Yufei Ding. MGG: Accelerating graph neural networks with Fine-GrainedIntra-KernelCommunication-Computation pipelining on Multi-GPU platforms. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), pages 779–795, 2023.
[95] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.
[96] Bo He, Xiang Song, Vincent Gao, and Christos Faloutsos. Coldguess: A general and effective relational graph convolutional network to tackle cold start cases. arXiv preprint arXiv:2205.12318, 2022.
[97] Ryien Hosseini, Filippo Simini, Venkatram Vishwanath, Ramakrishnan Sivakumar, Sanjif Shanmugavelu, Zhengyu Chen, Lev Zlotnik, Mingran Wang, Philip Colangelo, Andrew Deng, et al. Exploring the use of dataflow architectures for graph neural network workloads. In International Conference on High Performance Computing, pages 648–661. Springer, 2023.
[98] Zeyuan Tan, Xiulong Yuan, Congjie He, Man-Kit Sit, Guo Li, Xiaoze Liu, Baole Ai, Kai Zeng, Peter Pietzuch, and Luo Mai. Quiver: Supporting gpus for low-latency, high-throughput gnn serving with workload awareness. arXiv preprint arXiv:2305.10863, 2023.
[99] Qizheng Yang, Tianyi Yang, Mingcan Xiang, Lijun Zhang, Haoliang Wang, Marco Serafini, and Hui Guan. Gmorph: Accelerating multi-dnn inference via model fusion. In Proceedings of the Nineteenth European Conference on Computer Systems, pages 505–523, 2024.
[100] Hongkuan Zhou, Ajitesh Srivastava, Hanqing Zeng, Rajgopal Kannan, and Viktor Prasanna. Accelerating large scale real-time gnn inference using channel pruning. Proceedings of the VLDB Endowment, 14(9):1597–1605, 2021.
[101] Adam Auten, Matthew Tomei, and Rakesh Kumar. Hardware acceleration of graph neural networks. In 2020 57th ACM/IEEE Design Automation Conference (DAC), pages 1–6. IEEE, 2020.
[102] Zhihao Jia, Sina Lin, Rex Ying, Jiaxuan You, Jure Leskovec, and Alex Aiken. Redundancy-free computation for graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20, page 997–1005, New York, NY, USA, 2020. Association for Computing Machinery.