Gunrock: A Fast and Programmable Multi-GPU Graph Processing Library
Yangzihao Wang and Yuechao Pan, with Andrew Davidson, Yuduo Wu, Carl Yang, Leyuan Wang, Andy Riffel and John D. Owens
University of California, Davis
○ Graphs are everywhere: road & social networks, the web, etc.
○ Processing them needs memory bandwidth, computing power, and GOOD software
○ Real-world graphs have billions of edges
○ Graph size limits performance and scalability
○ GPUs are everywhere too: data centers, desktops, mobiles, etc.
○ High memory bandwidth (288 GBps) and computing power (4.3 Tflops)
○ But limited memory: 12 GB per NVIDIA K40
○ And harder to optimize
Gunrock @ GTC 2016, Apr. 6, 2016 | 2
a group of vertices or edges ≝ a frontier ⇒ operations = manipulations of frontiers
multiple elements can be processed in parallel, without ordering constraints
A generic graph algorithm: loop until convergence — take a group of V or E, do something to produce a resulting group of V or E, do something to that to produce another resulting group, and so on.
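The frontier abstraction above can be sketched in plain Python (a hypothetical illustration, not Gunrock's API): BFS expressed as the generic loop — take a frontier, do something, and feed the resulting frontier back in until convergence.

```python
def bfs_frontier(adj, source):
    """BFS as repeated frontier expansion (illustrative sketch).

    adj: dict mapping vertex -> list of neighbors (assumed input format).
    Returns a dict of vertex -> depth label.
    """
    labels = {source: 0}
    frontier = [source]           # the current group of vertices
    while frontier:               # loop until convergence (empty frontier)
        next_frontier = []
        for v in frontier:        # elements are independent -> parallel on a GPU
            for u in adj.get(v, []):
                if u not in labels:
                    labels[u] = labels[v] + 1
                    next_frontier.append(u)
        frontier = next_frontier  # the resulting group becomes the next input
    return labels
```

On a GPU each iteration of the outer loop becomes one parallel kernel launch over the frontier.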
[Figure: assignment of threads t0…tn across Blocks 0–255 and Warps 0–31 for the three strategies]
○ Block-cooperative Advance of large neighbor lists;
○ Warp-cooperative Advance of medium neighbor lists;
○ Per-thread Advance of small neighbor lists.
Load-Balanced Partitioning [3]; per-thread fine-grained, per-warp and per-CTA coarse-grained [4]
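The three mapping strategies above amount to bucketing frontier vertices by neighbor-list size. A hypothetical Python sketch (the cut-offs `WARP_SIZE` and `BLOCK_SIZE` are illustrative assumptions, not Gunrock's actual thresholds):

```python
WARP_SIZE, BLOCK_SIZE = 32, 256  # assumed cut-offs for illustration

def bucket_by_degree(frontier, degrees):
    """Split a frontier into three piles by neighbor-list size."""
    per_thread, per_warp, per_block = [], [], []
    for v in frontier:
        d = degrees[v]
        if d <= WARP_SIZE:
            per_thread.append(v)   # small list: one thread expands it
        elif d <= BLOCK_SIZE:
            per_warp.append(v)     # medium list: a warp cooperates
        else:
            per_block.append(v)    # large list: a whole block (CTA) cooperates
    return per_thread, per_warp, per_block
```

Each pile can then be launched with the matching granularity, so no thread sits idle while a neighbor of a high-degree vertex is expanded serially.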
[Figure: an Advance step with idempotence enabled vs. disabled, on an input frontier with labeled (label = 0, 1) and unlabeled (label = ?) vertices]
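The idempotence trade-off above can be sketched as follows (hypothetical Python, not Gunrock code): when the per-vertex update is idempotent — such as setting a BFS depth label — duplicate entries in the output frontier are harmless, so the advance can skip the deduplication pass.

```python
def advance(frontier, adj, visited, dedup=True):
    """Expand a frontier; dedup=False models idempotence enabled."""
    out = []
    for v in frontier:
        for u in adj.get(v, []):
            if u not in visited:
                out.append(u)        # u may be appended once per parent
    for u in out:
        visited.add(u)
    if dedup:                        # idempotence disabled: must remove duplicates
        out = list(dict.fromkeys(out))
    return out
```

With idempotence enabled, the duplicates merely repeat a harmless write on the next iteration, trading a little redundant work for a cheaper filter step.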
[Figure: push-based vs. pull-based Advance — push expands from the input frontier (labels 1, 2) along outgoing edges to the output frontier (V11, V12, V13); pull lets each unvisited vertex (label = ?) scan its incoming edges for a frontier parent]
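The two advance directions can be sketched in Python (a hypothetical illustration; the function names are made up): push iterates the frontier's outgoing edges, while pull lets every unvisited vertex scan its incoming edges for a frontier parent, which wins when the frontier is large [5].

```python
def advance_push(frontier, out_adj, visited):
    """Push: expand the frontier's outgoing edges."""
    nxt = set()
    for v in frontier:
        for u in out_adj.get(v, []):
            if u not in visited:
                nxt.add(u)
    return nxt

def advance_pull(frontier, in_adj, visited, vertices):
    """Pull: each unvisited vertex checks its incoming edges."""
    nxt = set()
    for u in vertices:
        if u in visited:
            continue
        # stop at the first parent found in the frontier
        if any(p in frontier for p in in_adj.get(u, [])):
            nxt.add(u)
    return nxt
```

Both directions produce the same next frontier; a direction-optimizing traversal switches between them based on frontier size.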
[Figure: Priority Queue — the temp output queue (e.g. distances 1.3, 4.5, 1.8, 9.4, 7.2, 8.6) is split by threshold th = 2.0 into a high-priority pile and a low-priority pile; Scan + Compact forms the next input]
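The priority split above can be sketched in Python (hypothetical helper; the threshold value is illustrative):

```python
def split_by_priority(queue, dist, th=2.0):
    """Split a temp output queue into high/low piles by a threshold."""
    high = [v for v in queue if dist[v] < th]   # processed next iteration
    low  = [v for v in queue if dist[v] >= th]  # deferred
    return high, low
```

Processing the near (high-priority) pile first reduces wasted relaxations in SSSP-style algorithms, similar in spirit to delta-stepping.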
[Figure: single-GPU data flow — iterate until convergence: input frontier → output frontier, with associative data (label, parent, etc.) on the GPU]
[Figure: the same loop duplicated on GPU 0 and GPU 1, each with its own input/output frontiers and associative data (label, parent, etc.)]
Dream: just duplicate the single-GPU implementation. Reality: it won't work, but good try!
[Figure: multi-GPU data flow — iterate until all GPUs converge; the Partition step gives each GPU a sub-graph and its associative data (label, parent, etc.); each GPU splits its output into a local part and a remote part, sends the remote part to its peers, and merges its local input frontier with the remote input frontiers it receives]
gunrock_executable input_graph --device=0,1,2,3 other_parameters
≥1 CPU threads, each using multiple GPU streams, to control the individual GPUs
Check space requirements before every possible overflow
6x–337x speedup on average over all primitives compared to BGL and PowerGraph; 5x slower on CC compared to a hardwired implementation; outperforms both CuSha and MapGraph.
* Primitives (except DOBFS) get good speedups (averaged over 16 datasets of various types): BFS 2.74x, SSSP 2.92x, CC 2.39x, BC 2.22x, PR 4.03x using 6 GPUs
* Peak DOBFS performance: 514 GTEPS with rmat_n20_512
* Gunrock can process a graph with 3.6B edges (full friendster graph, undirected): DOBFS in 339 ms, 10.7 GTEPS using 4 K40s; 50 PR iterations on the directed version (2.6B edges) took ~51 seconds
* Strong scaling: rmat_n24_32
* Weak-edge scaling: rmat_n19_(256 × #GPUs)
* Weak-vertex scaling: rmat_(2^19 × #GPUs vertices)_256
Mostly linear, except for DOBFS strong scaling
* graph format: name (|V|, |E|, directed (D) or undirected (UD))
* ref. hw. format: #GPUs per node × GPU model × #nodes
* Gunrock outperforms or comes close to small GPU clusters using 4–64 GPUs, on both real and generated graphs
* A few times faster than Enterprise (Liu et al., SC15), a dedicated multi-GPU DOBFS implementation
The Gunrock team
Onu Technology and the Royal Caliber team — Erich Elsen, Vishal Vaidyananthan, Oded Green and others — for their discussion on library development and dataset-generating code
All code contributors to the Gunrock library
NVIDIA, for hardware support, GPU cluster access, and all other support and discussions
The Gunrock project is funded by:
* DARPA XDATA program under AFRL Contract FA8750-13-C-0002
* NSF awards CCF-1017399 and OCI-1032859
* DARPA STTR award D14PC00023
[1] Y. Wang, A. Davidson, Y. Pan, Y. Wu, A. Riffel, and J. D. Owens. "Gunrock: A high-performance graph processing library on the GPU". CoRR, abs/1501.05387v4, Oct. 2015, http://arxiv.org/abs/1501.05387; to appear at PPoPP 2016.
[2] Y. Pan, Y. Wang, Y. Wu, C. Yang, and J. D. Owens. "Multi-GPU Graph Analytics". CoRR, abs/1504.04804v1, Apr. 2015.
[3] A. Davidson, S. Baxter, M. Garland, and J. D. Owens. "Work-efficient parallel GPU methods for single-source shortest paths". In Proceedings of the 28th IEEE International Parallel and Distributed Processing Symposium, pages 349–359, May 2014.
[4] D. Merrill, M. Garland, and A. Grimshaw. "Scalable GPU graph traversal". In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '12, pages 117–128, Feb. 2012.
[5] S. Beamer, K. Asanović, and D. Patterson. "Direction-optimizing breadth-first search". In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '12, pages 12:1–12:10, Nov. 2012.
[6] J. G. Siek, L.-Q. Lee, and A. Lumsdaine. The Boost Graph Library: User Guide and Reference Manual. Addison-Wesley, Dec. 2001.
[7] F. Khorasani, K. Vora, R. Gupta, and L. N. Bhuyan. "CuSha: Vertex-centric graph processing on GPUs". In Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing, HPDC '14.
[8] J. Shun and G. E. Blelloch. "Ligra: A lightweight graph processing framework for shared memory". In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '13.
[9] Z. Fu, M. Personick, and B. Thompson. "MapGraph: A high level API for fast development of high performance graph analytics on GPUs". In Proceedings of the Workshop on GRAph Data management Experiences and Systems, GRADES '14.
[10] J. Zhong and B. He. "Medusa: Simplified graph processing on GPUs". IEEE Transactions on Parallel and Distributed Systems, 25(6):1543–1552, June 2014.
[11] Z. Fu, H. K. Dasari, B. Bebee, M. Berzins, and B. Thompson. "Parallel breadth first search on GPU clusters". In IEEE International Conference on Big Data, pages 110–118, Oct. 2014.
from ctypes import *

### load gunrock shared library - libgunrock
gunrock = cdll.LoadLibrary('../../build/lib/libgunrock.so')

### read in input CSR arrays from files
row_list = [int(x.strip()) for x in open('toy_graph/row.txt')]
col_list = [int(x.strip()) for x in open('toy_graph/col.txt')]

### convert CSR graph inputs for gunrock input
row = pointer((c_int * len(row_list))(*row_list))
col = pointer((c_int * len(col_list))(*col_list))
nodes = len(row_list) - 1
edges = len(col_list)

### output array
labels = pointer((c_int * nodes)())

### call gunrock function on device
gunrock.bfs(labels, nodes, edges, row, col, 0)

### sample results
print ' bfs labels (depth):',
for idx in range(nodes):
    print labels[0][idx],
[Figure sequence: step-by-step BFS example — starting from three vertices with label 1, an Advance from vertices v10, v11, v12 sets six neighbors' labels from +∞ to 2; a further Advance sets the remaining two vertices (including v13) from +∞ to 3]
[Figure: single- and multi-GPU data flow. Input graph → Partitioner → partition table → Sub-graph builder → sub-graphs. On each GPU (GPU0, GPU1): sub-queue kernels produce an output sub-frontier; Separate splits it into a local frontier and a remote frontier; the remote data package is packaged and pushed to the peer GPU, where the received data package is unpackaged into a remote input frontier; Merge combines it with the local input frontier into a merged frontier; full-queue kernels produce the output frontier; repeat until all GPUs have converged. Legend distinguishes user-provided parameters, sub-queue vs. full-queue kernels, and single- vs. multi-GPU data flow]
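The per-iteration flow above can be sketched as a serial simulation in Python (hypothetical code; the dict layout and names are assumptions made for illustration):

```python
def multi_gpu_step(gpus, inboxes):
    """One iteration of the multi-GPU loop over a list of simulated GPUs.

    Each gpu dict holds 'frontier' (local input), 'advance' (its kernels),
    and 'owner' (vertex -> owning GPU id). inboxes hold received remote data.
    Returns True when every GPU has converged (no frontier activity).
    """
    done = True
    for gid, gpu in enumerate(gpus):
        merged = gpu['frontier'] + inboxes[gid]   # merge local + remote input
        inboxes[gid] = []
        out = gpu['advance'](merged)              # sub-queue kernels
        local = [v for v in out if gpu['owner'][v] == gid]
        remote = [v for v in out if gpu['owner'][v] != gid]
        for v in remote:                          # package + push to peer
            inboxes[gpu['owner'][v]].append(v)
        gpu['frontier'] = local
        if local or remote:
            done = False
    return done
```

A real implementation overlaps the package/push/unpackage steps with computation using GPU streams; this sketch only shows the data movement.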
[Figure: partitioning example — the original graph (|V| = 13, |E| = 44) is split into two sub-graphs: GPU 0 (|V| = 11, |E| = 23) and GPU 1 (|V| = 12, |E| = 21); each GPU renumbers its local vertices and keeps local replicas of remote vertices, with a table mapping local V-ids to remote V-ids]
Gunrock @ GTC 2016, Apr. 6, 2016 | 36
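The sub-graph construction above can be sketched in Python (a hypothetical helper, not Gunrock's partitioner): given a partition table mapping each vertex to a GPU, each GPU keeps its own vertices plus local replicas of any remote neighbor, renumbered into a contiguous local id space.

```python
def build_subgraph(adj, partition, gid):
    """Build one GPU's sub-graph with local replicas of remote neighbors.

    adj: dict vertex -> neighbor list; partition: vertex -> GPU id.
    Returns (sub_adj in local ids, original-id -> local-id table).
    """
    local = [v for v in adj if partition[v] == gid]
    local_id = {v: i for i, v in enumerate(local)}   # local vertices first
    for v in local:
        for u in adj[v]:
            if u not in local_id:                    # remote neighbor:
                local_id[u] = len(local_id)          # append a local replica
    sub_adj = {local_id[v]: [local_id[u] for u in adj[v]] for v in local}
    return sub_adj, local_id
```

Replicas explain why the per-GPU |V| counts in the figure sum to more than the original graph's 13 vertices.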