Parallel Triangle Counting and K-Truss Identification Using Graph-Centric Methods
Chad Voegele, Yi-Shan Lu, Sreepathi Pai, Keshav Pingali The University of Texas at Austin 09/13/2017
Graph-Centric vs. Matrix-Centric Abstractions
[Figure: graph-centric view, with legend entries for the active node and for its neighborhood, read/written by the update; alongside it, a matrix-centric view shown as a matrix product.]
Shared-Memory Galois [1] (C++ Library)
IrGL [2] (Compiler)
[1] D. Nguyen, A. Lenharth, and K. Pingali. "A lightweight infrastructure for graph analytics," in SOSP 2013.
[2] S. Pai and K. Pingali. "A compiler for throughput optimization of graph algorithms on GPUs," in OOPSLA 2016.
K-Truss rounds:
K-Truss begins
Enumerate triangles
Count number of triangles for edges
Remove edges with insufficient support
Do all edges have sufficient support? No: start another round. Yes: K-Truss done.

Graph-centric methods: operator for edges (operator for e1, operator for e2, operator for e3, ..., operator for en), with a barrier between rounds.

Matrix-centric methods: matrix operation for each step (triangle enumeration, counting the number of triangles for edges, removing selected edges), a reduction to check for edges with insufficient support, three barriers within a round, and a barrier between rounds.
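The round structure of K-Truss (apply an operator to each edge, then synchronize at a barrier before the next round) can be sketched as follows. This is a minimal sequential Python sketch, not the Galois or IrGL code from the talk; the function name `k_truss` and the adjacency-set representation are assumptions made for illustration.

```python
def k_truss(edges, k):
    """Return the sorted edge list of the k-truss of an undirected graph.

    `edges` is an iterable of (u, v) pairs with comparable node ids.
    """
    # Adjacency sets; each undirected edge is stored in both directions.
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    while True:
        # One round: apply the per-edge operator to every remaining edge.
        to_remove = []
        for u in adj:
            for v in adj[u]:
                if u < v:
                    # Support = number of triangles containing edge (u, v).
                    support = len(adj[u] & adj[v])
                    if support < k - 2:
                        to_remove.append((u, v))

        # Stop when all edges have sufficient support.
        if not to_remove:
            break

        # Barrier between rounds: removals only take effect here.
        for u, v in to_remove:
            adj[u].discard(v)
            adj[v].discard(u)

    return sorted((u, v) for u in adj for v in adj[u] if u < v)
```

In this sketch the removals are buffered until the end of the round, which mimics a strict barrier; a parallel implementation would run the per-edge operator concurrently instead of in a loop.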
Graph as Compressed Sparse Row (CSR), with EdgeRange and EdgeDst arrays
Sorted edge lists speed up edge list intersection from O(deg(u)*deg(v)) to O(deg(u)+deg(v)).
Sorted edge lists also let edges be located by binary search when removing edges.
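To make the CSR layout and the two uses of sorted edge lists concrete, here is a small Python sketch. The helper names (`build_csr`, `neighbor_span`, `count_common`, `find_edge`, `count_triangles`) are hypothetical, and the convention that `edge_range[u]` holds the end offset of node u's edge list is an assumption; the actual data structures in Galois and IrGL may differ.

```python
from bisect import bisect_left


def build_csr(num_nodes, edges):
    """Build (edge_range, edge_dst) for an undirected graph with sorted edge lists."""
    nbrs = [[] for _ in range(num_nodes)]
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    edge_range, edge_dst = [], []
    for lst in nbrs:
        edge_dst.extend(sorted(lst))       # sorted edge list of this node
        edge_range.append(len(edge_dst))   # end offset of this node's edges
    return edge_range, edge_dst


def neighbor_span(edge_range, u):
    """Return the [begin, end) slice of edge_dst holding u's neighbors."""
    begin = edge_range[u - 1] if u > 0 else 0
    return begin, edge_range[u]


def count_common(edge_range, edge_dst, u, v):
    """Merge-style intersection of two sorted lists: O(deg(u) + deg(v))."""
    i, i_end = neighbor_span(edge_range, u)
    j, j_end = neighbor_span(edge_range, v)
    common = 0
    while i < i_end and j < j_end:
        a, b = edge_dst[i], edge_dst[j]
        if a == b:
            common += 1
            i += 1
            j += 1
        elif a < b:
            i += 1
        else:
            j += 1
    return common


def find_edge(edge_range, edge_dst, u, v):
    """Binary search for edge (u, v); return its index in edge_dst or -1."""
    begin, end = neighbor_span(edge_range, u)
    pos = bisect_left(edge_dst, v, begin, end)
    return pos if pos < end and edge_dst[pos] == v else -1


def count_triangles(edge_range, edge_dst):
    """Sum per-edge triangle counts; each triangle is seen once per edge."""
    total = 0
    for u in range(len(edge_range)):
        begin, end = neighbor_span(edge_range, u)
        for idx in range(begin, end):
            v = edge_dst[idx]
            if u < v:
                total += count_common(edge_range, edge_dst, u, v)
    return total // 3
```

The merge loop advances one of two cursors at every step, which is where the O(deg(u)+deg(v)) bound comes from; `find_edge` is the kind of lookup needed to mark the reverse copy of an edge when it is removed.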
Early termination: stop counting triangles for an edge once its support reaches k - 2.
Edge removals may be visible in the current round, reducing the number of rounds.
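Both optimizations can be illustrated with a short sequential Python sketch. The names `has_support` and `k_truss_in_place` are hypothetical, and the list-of-sorted-lists graph representation is an assumption chosen for brevity rather than the CSR arrays used in the talk.

```python
def has_support(nbrs, u, v, need):
    """Return True as soon as edge (u, v) is known to lie in `need` triangles."""
    if need <= 0:
        return True
    a, b = nbrs[u], nbrs[v]  # sorted neighbor lists
    i = j = found = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            found += 1
            if found >= need:
                return True  # early termination: support reached k - 2
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return False


def k_truss_in_place(nbrs, k):
    """Shrink `nbrs` to its k-truss; return the number of rounds executed.

    Removals are applied immediately, so edges processed later in the
    same round already see the smaller neighborhoods.
    """
    rounds = 0
    changed = True
    while changed:
        changed = False
        rounds += 1
        for u in range(len(nbrs)):
            for v in list(nbrs[u]):  # iterate over a copy; nbrs[u] is mutated
                if u < v and not has_support(nbrs, u, v, k - 2):
                    nbrs[u].remove(v)  # list.remove keeps the list sorted
                    nbrs[v].remove(u)
                    changed = True
    return rounds
```

Because removing an edge can only lower the support of other edges, applying removals eagerly yields the same final truss as deferring them to a barrier; it can simply get there in fewer rounds.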
[Figure: graph-centric storage as CSR arrays (EdgeRange, EdgeDst, EdgeData with one entry per edge, node data with one entry per node) next to matrix-centric storage shown as a matrix product.]
Graph-centric methods: load graphs and update node/edge data in the graphs. Memory is fixed after graphs are loaded.
Matrix-centric methods: construct matrices (adjacency matrix, incidence matrix, product matrix) at runtime. This needs runtime memory management.
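For contrast, a matrix-centric way to compute edge support can be sketched with the three matrices named above. This Python sketch assumes the product matrix is the incidence matrix times the adjacency matrix, and that an entry equal to 2 marks a node adjacent to both endpoints of an edge; that reading, the dense list-of-lists storage, and the name `edge_support_matrix` are assumptions, not a reproduction of the reference implementation.

```python
def edge_support_matrix(num_nodes, edges):
    """Compute per-edge triangle support via dense matrices built at runtime."""
    num_edges = len(edges)

    # Adjacency matrix A (num_nodes x num_nodes), symmetric, no self loops.
    A = [[0] * num_nodes for _ in range(num_nodes)]
    for u, v in edges:
        A[u][v] = 1
        A[v][u] = 1

    # Incidence matrix E (num_edges x num_nodes): row e marks both endpoints.
    E = [[0] * num_nodes for _ in range(num_edges)]
    for e, (u, v) in enumerate(edges):
        E[e][u] = 1
        E[e][v] = 1

    # Product matrix R = E * A, allocated at runtime.
    R = [
        [sum(E[e][w] * A[w][x] for w in range(num_nodes)) for x in range(num_nodes)]
        for e in range(num_edges)
    ]

    # R[e][x] == 2 exactly when x is adjacent to both endpoints of edge e,
    # i.e. when x closes a triangle over that edge.
    return [sum(1 for value in row if value == 2) for row in R]
```

Each call allocates A, E, and R afresh, which illustrates why this style needs runtime memory management, whereas the graph-centric version only updates data attached to an already loaded graph.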
Platform
Baseline from IEEE HPEC static graph challenge [3]
Parameter
[3] S. Samsi et al. "Static graph challenge: subgraph isomorphism," in IEEE HPEC, 2017.
[4] J. Leskovec and A. Krevl. SNAP datasets: Stanford large network dataset collection. Retrieved from http://snap.Stanford.edu/data, June 2014.
[Table: input graphs from [4], including the largest and the smallest; values not recoverable.]
[Figure: runtime chart, lower is better; labels recovered from the plot: 4800, timeout.]
Speedup over Julia
Variant   Geo Mean
Julia         1.00
Cpu-01      428.87
Cpu-24      623.62
Gpu       2,213.14
End-to-end runtime after the graph is loaded and before the results are printed.
Maximum speedup of cpu-24 over cpu-01: 14.30X (~117M edges)
[Figure: runtime chart, lower is better; labels recovered from the plot: 4800, timeout.]
Speedup over MiniTri
Variant   Geo Mean
MiniTri       1.00
Cpu-01      163.23
Cpu-24      380.57
Gpu       1,760.47
End-to-end runtime after the graph is loaded and before the results are printed.
Maximum speedup of cpu-24 over cpu-01: 17.22X (~15.7M edges)
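The "Geo Mean" column summarizes per-graph speedups with a geometric mean. A minimal Python sketch of that arithmetic follows; the function name `geo_mean_speedup` and the pairing of baseline and variant runtimes per graph are assumptions, and the sketch does not model how timed-out runs were treated, since the slides do not say.

```python
import math


def geo_mean_speedup(baseline_times, variant_times):
    """Geometric mean of per-graph speedups (baseline time / variant time)."""
    if len(baseline_times) != len(variant_times) or not baseline_times:
        raise ValueError("need one positive runtime per graph for both systems")
    ratios = [b / v for b, v in zip(baseline_times, variant_times)]
    # Average the logarithms, then exponentiate; this avoids overflow from
    # multiplying many large ratios together.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```

A geometric mean is the usual choice for ratios because a 2x speedup on one graph and a 2x slowdown on another cancel out to 1.00, which an arithmetic mean would not report.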
[Figure: memory usage chart, lower is better.]
% over Julia
Variant   Geo Mean
Julia       100.00
Cpu-01        0.54
Cpu-24       11.05
Gpu           1.09
Measurement
Julia: @time
CPU: Galois' internal allocator
GPU: cudaMemGetInfo
Total CPU memory: 192GB
Total GPU memory: 8GB
[Figure: memory usage chart, lower is better.]
% over MiniTri
Variant   Geo Mean
MiniTri     100.00
Cpu-01       94.31
Cpu-24      791.64
Gpu          50.14
Measurement
CPU: Galois' internal allocator
GPU: cudaMemGetInfo
Total CPU memory: 192GB
Total GPU memory: 8GB
[Figure: chart, lower is better.]
% over Julia
Variant   Geo Mean
Julia       100.00
Cpu-01        2.27
Cpu-24        2.03
Gpu           0.48
Measurement
Julia: Intel RAPL counters
CPU: Intel RAPL counters
GPU: nvprof
[Figure: chart, lower is better.]
% over MiniTri
Variant   Geo Mean
MiniTri     100.00
Cpu-01       12.95
Cpu-24       12.07
Gpu           2.55
Measurement
CPU: Intel RAPL counters
GPU: nvprof
improvements over matrix-centric IEEE HPEC static graph challenge reference implementations.
Questions? Comments?