Making Pull-Based Graph Processing Performant
Samuel Grossman¹, Heiner Litz², and Christos Kozyrakis¹
¹Stanford University ²University of California, Santa Cruz
PPoPP 2018 · February 27, 2018
Graph Processing
Problem modelled as vertices and the connections between them (edges)
[Figure: example graph with vertices A through L; each iteration computes updated values A' through L']
Repeat until convergence
Push: group by source vertex
Pull: group by destination vertex
Hybrid: dynamically select push or pull for each iteration
foreach vertex v in graph.vertices
    foreach edge e in v.(in|out)edges
        // process the edge
        ...
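The two-level loop above can be sketched in Python for both directions. This is a minimal illustration, not code from Ligra or Grazelle: the adjacency-list layout, the function names, and the "sum the neighbors' values" operation are all assumptions chosen for the example.

```python
def push_step(out_edges, values):
    """Push: outer loop over source vertices; each write lands on a destination."""
    result = [0.0] * len(values)
    for v, neighbors in enumerate(out_edges):   # edges grouped by source vertex
        for dst in neighbors:                   # inner loop over v's out-edges
            result[dst] += values[v]            # scattered write to another vertex
    return result


def pull_step(in_edges, values):
    """Pull: outer loop over destination vertices; each vertex writes only itself."""
    result = [0.0] * len(values)
    for v, neighbors in enumerate(in_edges):    # edges grouped by destination vertex
        total = 0.0
        for src in neighbors:                   # inner loop over v's in-edges
            total += values[src]                # scattered read, local accumulation
        result[v] = total                       # single write per vertex
    return result


def transpose(out_edges):
    """Build in-edge lists from out-edge lists (helper for this example only)."""
    in_edges = [[] for _ in out_edges]
    for v, neighbors in enumerate(out_edges):
        for dst in neighbors:
            in_edges[dst].append(v)
    return in_edges


# Hypothetical 4-vertex graph: both directions compute the same result.
out_edges = [[1, 2], [2], [0], [0, 2]]
values = [1.0, 2.0, 3.0, 4.0]
in_edges = transpose(out_edges)
```

In the push version, several source vertices may update the same destination, so a parallel outer loop needs synchronized writes. In the pull version, each destination is written by one outer-loop iteration, which is why conflicts only appear once the inner loop is parallelized too.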
[Figure: Running Ligra on twitter-2010 graph. Speedup (axis 2 to 10) for PageRank and Breadth-First Search under four configurations: Push (Outer Loop), Push (Both Loops), Push (Both) + Pull (Outer), and Push (Both) + Pull (Both). Annotations: Serial Inner Loop, Parallel Inner Loop]
Contribution #1: “Scheduler Awareness”
A technique that can be applied to the inner loop of a pull engine to parallelize it without introducing conflicts.
vectorization in existing work
Contribution #2: “Vector-Sparse”
A low-level modification to a data structure commonly used to represent graphs, intended to enhance vectorization.
embodies both of our contributions
some cases
https://github.com/stanford-mast/Grazelle-PPoPP18
Contribution #1
[Figure: vertices (outer loop) and their edges (inner loop) divided into Chunk A, Chunk B, and Chunk C, shown with the vertex data that the chunks update]
per-chunk merge buffer.
[Figure: the same Chunk A, Chunk B, and Chunk C, now shown with Merge Buffers alongside the vertex data]
shared state without synchronization.
scheduling granularity
balance and probability of write conflicts
increased merge operation overhead
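The per-chunk merge buffer idea above can be sketched as follows. This is a simplified illustration under stated assumptions, not Grazelle's implementation: the edge list is assumed to be sorted by destination, chunks are fixed-size slices of that list, the operation is a plain sum, and `process_chunk` and `pull_sum` are hypothetical names. Each chunk writes directly to a vertex only when that vertex's edges end inside the chunk; the last vertex of every chunk, which may continue into the next chunk, goes to that chunk's merge buffer and is folded in afterwards.

```python
from concurrent.futures import ThreadPoolExecutor


def process_chunk(lo, hi, dst, src, values, out):
    """Sum source values over edges [lo, hi), which are sorted by destination.

    Returns (vertex, partial_sum) for the chunk's last destination vertex;
    this pair plays the role of the chunk's merge buffer entry.
    """
    current, acc = dst[lo], 0.0
    for i in range(lo, hi):
        if dst[i] != current:
            # The edges of `current` end inside this chunk, so no other
            # chunk writes this slot directly: no synchronization needed.
            out[current] = acc
            current, acc = dst[i], 0.0
        acc += values[src[i]]
    # The last vertex may continue in the next chunk: do not touch `out`.
    return current, acc


def pull_sum(dst, src, values, chunk_size, workers=4):
    out = [0.0] * len(values)
    bounds = [(lo, min(lo + chunk_size, len(dst)))
              for lo in range(0, len(dst), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        merge_buffers = list(pool.map(
            lambda b: process_chunk(b[0], b[1], dst, src, values, out), bounds))
    # Merge pass: runs after all chunks finish, one entry per chunk.
    for vertex, partial in merge_buffers:
        out[vertex] += partial
    return out


# Hypothetical graph: edge i goes from src[i] to dst[i].
dst = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3]
src = [1, 2, 3, 0, 2, 3, 4, 0, 1, 4, 2, 4]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
```

The sketch also shows the trade-off listed above: the merge pass does one step per chunk, so smaller chunks give the scheduler finer granularity at the cost of more merge work.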
dimacs-usa (low, even degree distribution) uk-2007 (extremely skewed)
[Figure: two panels plotted against Chunk Size, with a vertical axis from 0.0 to 1.2. Scheduler Un-Aware: chunk sizes 100 to 10,000. Scheduler-Aware: chunk sizes 1,000 to 100,000. Annotations: 10× Different, 50×, 3.3×, 1.2×]
[Figure: Key Insights. Two panels plotted against # Physical Cores (14, 28, 42, 56) for dimacs-usa and uk-2007, Scheduler Un-Aware versus Scheduler Aware. Annotations: scaling enabled by Scheduler Awareness; scaling improved by Scheduler Awareness]
Contribution #2
[Figure: a graph stored as a Vertex Index array and an Edges array. Edges, indices [0] to [11]: 23 10 50 4 0 53 62 1 78 50 23 4. Vertex Index, indices [0] to [3], with the entries 3, 7, and 10 legible]
[Figure: the Edges array annotated with the edges that belong to Vertex 0, Vertex 1, and Vertex 2, first in its original form (indices [0] to [11]) and then with the indices extended to [13]]
Padding
Padding + “valid” bits
Padding + “valid” bits + top-level vertex spread-encoding
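The three ingredients named above (padding, “valid” bits, and spreading the top-level vertex across the vector) can be illustrated with a small encoder. The bit layout here is a hypothetical one picked for the example, not necessarily the format Grazelle uses: four 64-bit lanes per vector, bit 63 as the valid bit, bits 48 to 59 holding a 12-bit piece of the top-level vertex ID, and bits 0 to 47 holding the neighbor ID. The Edges values come from the slide; the Vertex Index endpoints 0 and 12 are assumed.

```python
LANES = 4                        # assumed vector width: four 64-bit lanes
VALID_BIT = 1 << 63              # assumed position of the per-lane valid bit
PIECE_BITS = 12                  # assumed: 48-bit vertex ID spread over 4 lanes
PIECE_MASK = (1 << PIECE_BITS) - 1
NEIGHBOR_MASK = (1 << 48) - 1    # assumed: low 48 bits hold the neighbor ID


def encode(vertex_index, edges):
    """Turn index + edge arrays into vectors that each belong to one vertex."""
    vectors = []
    for v in range(len(vertex_index) - 1):
        neighbors = edges[vertex_index[v]:vertex_index[v + 1]]
        for start in range(0, len(neighbors), LANES):
            group = neighbors[start:start + LANES]
            words = []
            for lane in range(LANES):
                # Every lane carries one piece of the top-level vertex ID.
                word = ((v >> (PIECE_BITS * lane)) & PIECE_MASK) << 48
                if lane < len(group):
                    word |= VALID_BIT | group[lane]   # real edge
                # else: padding lane, valid bit stays 0
                words.append(word)
            vectors.append(words)
    return vectors


def decode(words):
    """Recover (top-level vertex, valid neighbors) from one vector."""
    vertex, neighbors = 0, []
    for lane, word in enumerate(words):
        vertex |= ((word >> 48) & PIECE_MASK) << (PIECE_BITS * lane)
        if word & VALID_BIT:
            neighbors.append(word & NEIGHBOR_MASK)
    return vertex, neighbors


def packing_efficiency(vectors):
    """Fraction of lanes that hold a real edge rather than padding."""
    valid = sum(1 for words in vectors for w in words if w & VALID_BIT)
    return valid / (len(vectors) * LANES)


edges = [23, 10, 50, 4, 0, 53, 62, 1, 78, 50, 23, 4]
vertex_index = [0, 3, 7, 10, 12]   # 3, 7, 10 from the slide; 0 and 12 assumed
vectors = encode(vertex_index, edges)
```

Because every vector names its own top-level vertex and marks its own padding, a vector can be processed without consulting a separate index array, at the cost of the padded lanes counted by `packing_efficiency`.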
[Figure: Packing Efficiency. Average efficiency (0% to 100%) on dimacs-usa, livejournal, twitter-2010, friendster, and uk-2007: generally ≥ 75%]
[Figure: Performance Impact. Speedup (0.0 to 3.0) for PageRank, CC, and BFS on the same five graphs: 1.5× to 2.5×]
Putting it all together
GraphMat, and X-Stream
Components, Breadth-First Search
Intel Xeon E7-4850 v3 processors
[Figure: Execution Time (ms), logarithmic scale from 1E+0 to 1E+5, on dimacs-usa, livejournal, twitter-2010, friendster, and uk-2007 for Grazelle-Pull, Grazelle-Push, Ligra-Pull, Ligra-Push, Polymer, GraphMat, and X-Stream. Annotations: 3.6×, 2.3×, 2.3×, 1.4×, 15.2×]
[Figure: Execution Time (ms), logarithmic scale from 1E+0 to 1E+6, on dimacs-usa, livejournal, twitter-2010, friendster, and uk-2007 for Grazelle, Ligra, Ligra-Dense, Polymer, GraphMat, and X-Stream. Annotations: 4.9×, 1.5×, 1.6×, 21.1×]
[Figure: Execution Time (ms), logarithmic scale from 1E+0 to 1E+6, on dimacs-usa, livejournal, twitter-2010, friendster, and uk-2007 for Grazelle, Ligra, Ligra-Dense, Polymer, GraphMat, and X-Stream]
parallelization for pull-based graph processing
in some cases by over 10×
https://github.com/stanford-mast/Grazelle-PPoPP18
Questions?