Graph Partitioning for Scalable Distributed Graph Computations
Aydın Buluç Kamesh Madduri
ABuluc@lbl.gov madduri@cse.psu.edu
10th DIMACS Implementation Challenge, Graph Partitioning and Graph Clustering February 13-14, 2012 Atlanta, GA
[Figure: level-synchronous BFS on a 9-vertex example graph. Starting from a source vertex, all vertices in the current frontier are visited in parallel; the frontier can be treated as a set of "super vertices" that is expanded together.]
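To make the level-synchronous structure concrete, here is a minimal serial sketch (not from the talk) of frontier-based BFS on the 7-vertex example graph used later in the deck; the key point is that the inner loop over the current frontier has no dependences, so those vertices can be visited in parallel.

```cpp
// Minimal sketch: level-synchronous BFS over an adjacency list.
// All vertices in `frontier` are independent, so a parallel implementation
// can visit them concurrently within one level.
#include <cstdio>
#include <vector>

int main() {
    // 7-vertex example graph (deduplicated from the edge list shown later in the deck).
    std::vector<std::vector<int>> adj = {
        {1, 3}, {0, 4, 6}, {3, 5, 6}, {0, 2, 6}, {1, 5, 6}, {2, 4}, {1, 2, 3, 4}};
    int source = 0;
    std::vector<int> level(adj.size(), -1);
    std::vector<int> frontier = {source};
    level[source] = 0;

    for (int depth = 1; !frontier.empty(); ++depth) {
        std::vector<int> next;
        for (int u : frontier) {          // each u here could be handled in parallel
            for (int v : adj[u]) {
                if (level[v] == -1) {     // first visit: assign BFS level
                    level[v] = depth;
                    next.push_back(v);
                }
            }
        }
        frontier.swap(next);
    }
    for (size_t v = 0; v < level.size(); ++v)
        std::printf("vertex %zu: level %d\n", v, level[v]);
    return 0;
}
```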
[Figure: 2D partitioning of the adjacency matrix for a 9-vertex example on 9 processors (3x3 processor grid); the nonzeros are flattened into per-processor local sparse matrices, which form the per-processor local graph representation.]
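A minimal sketch of the 2D idea, assuming a plain block distribution (the exact layout in the figure may differ): each nonzero (i, j) of the adjacency matrix is assigned to one processor of a pr x pc grid, so every processor ends up with a local sparse submatrix. The 9-edge ring below is illustrative only.

```cpp
// Minimal sketch (assumption: block 2D distribution, n divisible by grid dimensions):
// nonzero (i, j) is owned by processor (i / (n/pr), j / (n/pc)) of a pr x pc grid,
// so each processor stores a local sparse submatrix.
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    const int n = 9, pr = 3, pc = 3;              // 9 vertices, 3x3 processor grid
    std::vector<std::pair<int, int>> edges = {    // illustrative nonzeros only
        {0, 1}, {1, 2}, {2, 3}, {3, 4}, {4, 5}, {5, 6}, {6, 7}, {7, 8}, {8, 0}};

    // local[row][col] holds the nonzeros owned by that processor
    std::vector<std::vector<std::vector<std::pair<int, int>>>> local(
        pr, std::vector<std::vector<std::pair<int, int>>>(pc));
    for (auto [i, j] : edges)
        local[i / (n / pr)][j / (n / pc)].push_back({i, j});

    for (int r = 0; r < pr; ++r)
        for (int c = 0; c < pc; ++c)
            std::printf("P(%d,%d) owns %zu nonzeros\n", r, c, local[r][c].size());
    return 0;
}
```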
[Figure: a 7-vertex example graph (vertices 0-6) with its distributed edge list: [0,1] [0,3] [0,3] [1,0] [1,4] [1,6] [2,3] [2,5] [2,5] [2,6] [3,0] [3,0] [3,2] [3,6] [4,1] [4,5] [4,6] [5,2] [5,2] [5,4] [6,1] [6,2] [6,3] [6,4].]
Consider an undirected graph with n vertices and m edges. Each processor "owns" n/p vertices and stores their adjacencies (~2m/p edges per processor, assuming balanced partitions).
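A minimal sketch of this 1D layout, assuming a block distribution of vertices (the `owner` helper is hypothetical, not from the talk): each edge of the example graph is stored by the owners of both endpoints, giving roughly 2m/p adjacency entries per processor when partitions are balanced.

```cpp
// Minimal sketch of 1D vertex ownership with a block distribution.
#include <cstdio>
#include <utility>
#include <vector>

// Hypothetical helper: owner of vertex v under a block distribution (block size ceil(n/p)).
int owner(int v, int n, int p) { return v / ((n + p - 1) / p); }

int main() {
    const int n = 7, p = 4;                        // the 7-vertex example graph, 4 processors
    std::vector<std::pair<int, int>> edges = {     // undirected edges from the slide (deduplicated)
        {0, 1}, {0, 3}, {1, 4}, {1, 6}, {2, 3}, {2, 5}, {2, 6},
        {3, 6}, {4, 5}, {4, 6}};

    std::vector<std::vector<std::pair<int, int>>> local(p);
    for (auto [u, v] : edges) {                    // each endpoint's owner stores the edge,
        local[owner(u, n, p)].push_back({u, v});   // so ~2m/p adjacency entries per processor
        local[owner(v, n, p)].push_back({v, u});
    }
    for (int q = 0; q < p; ++q)
        std::printf("P%d stores %zu adjacency entries\n", q, local[q].size());
    return 0;
}
```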
[Figure sequence: one BFS iteration on the example graph. The current frontier is vertices 1 (owned by partition Blue) and 6 (owned by partition Green). The frontier's adjacency entries [1,0] [1,4] [1,6] [6,1] [6,2] [6,3] [6,4] are routed to their owner processors P0-P3; some processors receive no work in this iteration. After the exchange, the frontier for the next iteration is {2, 3, 4}.]
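A single-process sketch of this iteration (not the authors' code): each simulated processor scans the adjacencies of the frontier vertices it owns, buckets the discovered neighbors by their owners, and the per-destination buckets stand in for the all-to-all exchange. With a block owner function (an assumption; the slide's Blue/Green partitions may differ) and vertices 0, 1, 6 already visited, it reproduces the next frontier {2, 3, 4}.

```cpp
// Minimal single-process sketch of one level-synchronous BFS iteration
// under 1D vertex partitioning, simulating p processors in one program.
#include <cstdio>
#include <vector>

int owner(int v, int n, int p) { return v / ((n + p - 1) / p); }  // assumed block distribution

int main() {
    const int n = 7, p = 4;   // 7 vertices from the running example, 4 processors
    std::vector<std::vector<int>> adj = {
        {1, 3}, {0, 4, 6}, {3, 5, 6}, {0, 2, 6}, {1, 5, 6}, {2, 4}, {1, 2, 3, 4}};
    std::vector<bool> visited(n, false);
    visited[0] = visited[1] = visited[6] = true;  // assume 0 was visited earlier; frontier is {1, 6}
    std::vector<int> frontier = {1, 6};           // current frontier, as in the slides

    // Step 1: owners of frontier vertices bucket discovered neighbors by destination owner.
    std::vector<std::vector<int>> sendbuf(p);     // sendbuf[q] = vertices destined to processor q
    for (int u : frontier)
        for (int v : adj[u])
            sendbuf[owner(v, n, p)].push_back(v);

    // Step 2: all-to-all exchange (simulated); owners mark newly visited vertices.
    std::vector<int> next_frontier;
    for (int q = 0; q < p; ++q)                   // "processor q processes its received bucket"
        for (int v : sendbuf[q])
            if (!visited[v]) { visited[v] = true; next_frontier.push_back(v); }

    std::printf("next frontier:");
    for (int v : next_frontier) std::printf(" %d", v);
    std::printf("\n");                            // prints: next frontier: 2 3 4
    return 0;
}
```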
Performance model parameters: $L_{n/p}$, the local latency on a working set of size $n/p$; $\beta_L$, the inverse local RAM bandwidth; and $\beta_{N,\mathrm{a2a}(p)}$, the all-to-all remote bandwidth with $p$ participating processors.
The per-iteration cost on a $p_r \times p_c$ processor grid combines local terms ($L_{n/p_r}$, $L_{n/p_c}$, $\beta_L$) with network terms: $\beta_{N,\mathrm{a2a}(p_c)}$ for the all-to-all with $p_c$ participating processors and $\beta_{N,\mathrm{gather}(p_r)}$ for the gather with $p_r$ participating processors; the communicated volume scales with $\mathrm{edgecut}(p)/p$ and $(1 - 1/p)\,n/p$.
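As a rough illustration only, not the exact expression from this talk or the SC'11 paper, the parameters above can be combined in a generic alpha-beta style per-iteration estimate; the way terms are grouped into local computation, fold, and expand below is an assumption made for exposition.

```latex
% Hedged illustration only: a generic alpha-beta style per-iteration cost
% using the parameters defined above (not the talk's exact formula).
T_{\text{iter}}(p_r \times p_c) \;\approx\;
    \underbrace{L_{n/p}\,\frac{n}{p} + \beta_L\,\frac{m}{p}}_{\text{local computation}}
  + \underbrace{\beta_{N,\mathrm{a2a}(p_c)}\,\frac{\mathrm{edgecut}(p)}{p}}_{\text{fold}}
  + \underbrace{\beta_{N,\mathrm{gather}(p_r)}\,\Bigl(1-\frac{1}{p}\Bigr)\frac{n}{p}}_{\text{expand}}
```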
[Figure: local pruning prior to the all-to-all step, shown for processor P0 on the example graph. P0's outgoing buffer holds duplicate (parent, target) entries ([0,3] [0,3] [1,3] [0,4] [1,4] [0,6] [1,6] [1,6]); after local pruning, only one entry per target vertex (e.g., [0,3] [0,4] [1,6]) is sent.]
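A minimal sketch of this pruning step (illustrative, not the authors' code): duplicate (parent, target) entries are filtered with a per-vertex bitmap so that each target is sent at most once; which parent survives is arbitrary here.

```cpp
// Minimal sketch: local pruning of duplicate target vertices before the all-to-all.
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    const int n = 7;
    // Outgoing (parent, target) entries gathered from local edges, as on the slide;
    // several parents reach the same targets (vertices 3, 4 and 6 below).
    std::vector<std::pair<int, int>> outgoing = {
        {0, 3}, {0, 3}, {1, 3}, {0, 4}, {1, 4}, {0, 6}, {1, 6}, {1, 6}};

    std::vector<bool> seen(n, false);              // one bit per global target vertex
    std::vector<std::pair<int, int>> pruned;
    for (auto e : outgoing)
        if (!seen[e.second]) { seen[e.second] = true; pruned.push_back(e); }

    std::printf("before pruning: %zu entries, after: %zu entries\n",
                outgoing.size(), pruned.size());
    return 0;
}
```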
Buluç & Madduri, "Parallel breadth-first search on distributed-memory systems," Proc. SC'11, 2011. Execution time is dominated by work performed in a few parallel phases.
– Total aggregate communication volume (a way to compute this is sketched below)
– Sum of max. communication volume during each BFS iteration
– Intra-node computational work balance
– Communication volume reduction with 2D partitioning
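A small illustrative sketch (not the evaluation harness behind these results) of computing two of these metrics for a given 1D vertex-to-partition assignment on the example graph: the number of cut edges as a proxy for aggregate communication volume, and the max/avg per-partition work ratio. The partition vector below is made up for illustration.

```cpp
// Minimal sketch: cut-edge count and max/avg work ratio for a 1D partition.
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    const int p = 2;
    std::vector<int> part = {0, 0, 1, 0, 1, 1, 1};            // illustrative partition of vertices 0..6
    std::vector<std::pair<int, int>> edges = {                // undirected edges of the example graph
        {0, 1}, {0, 3}, {1, 4}, {1, 6}, {2, 3}, {2, 5}, {2, 6},
        {3, 6}, {4, 5}, {4, 6}};

    long cut = 0;
    std::vector<long> work(p, 0);
    for (auto [u, v] : edges) {
        if (part[u] != part[v]) ++cut;                        // edge crosses partitions
        ++work[part[u]]; ++work[part[v]];                     // each endpoint's owner scans it
    }
    long max_w = *std::max_element(work.begin(), work.end());
    double avg_w = 2.0 * edges.size() / p;
    std::printf("cut edges: %ld (%.1f%% of m), max/avg work: %.2f\n",
                cut, 100.0 * cut / edges.size(), max_w / avg_w);
    return 0;
}
```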
[Chart: communication volume as a percentage of m ("% compared to m") for the Natural, Random, PaToH, PaToH checkerboard, and Metis partitionings.]
[Chart: ratio over total volume, for the Natural, Random, and PaToH partitionings.]
[Chart: communication volume ratio compared to 1D, for the Natural, Random, and PaToH partitionings.]
[Chart: max/avg. ratio, for the Natural, Random, and PaToH partitionings.]
[Charts: BFS time (ms), broken down into Computation, Fold, and Expand phases, and the Fold/Expand communication times, for the partitioning strategies Random-1D, Random-2D, Metis-1D, and PaToH-1D.]
* Timeline of 4 processes shown in figures. The PaToH-partitioned graph suffers from severe load imbalance in the computational phases.
– Points to the need for dynamic load balancing