Paragon: Parallel Architecture-Aware Graph Partition Refinement Algorithm
Angen Zheng, Alexandros Labrinidis, Patrick Pisciuneri, Panos K. Chrysanthis, and Peyman Givi University of Pittsburgh
Importance of Graph Partitioning
★ Vertex
○ a unique identifier
○ a modifiable, user-defined value
★ Edge
○ a modifiable, user-defined value
○ a target vertex identifier
★ Vertex-Centric UDF
○ Change vertex/edge state
○ Send messages to neighbors
○ Receive messages from neighbors
○ Mutate the graph topology
○ Deactivate at the end of the superstep
○ Reactivate upon external messages
Goals: balanced load distribution and minimized communication cost!
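The vertex-centric model above can be sketched as a tiny synchronous engine. This is a minimal illustration, not the actual Pregel/Giraph API; the names `Vertex`, `compute`, and `run_supersteps` are invented for this sketch. The UDF propagates the maximum value in the graph and shows state change, messaging, deactivation, and reactivation.

```python
# Minimal sketch of a vertex-centric (Pregel-style) model; names are
# illustrative, not an actual framework API.

class Vertex:
    def __init__(self, vid, value, out_edges):
        self.vid = vid              # a unique identifier
        self.value = value          # a modifiable, user-defined value
        self.out_edges = out_edges  # target vertex identifiers
        self.active = True

def compute(vertex, messages, outbox, superstep):
    """Vertex-centric UDF: adopt and forward the largest value seen."""
    old = vertex.value
    vertex.value = max([vertex.value] + messages)  # change vertex state
    if superstep == 0 or vertex.value > old:
        for target in vertex.out_edges:            # send msgs to neighbors
            outbox.append((target, vertex.value))
    else:
        vertex.active = False  # deactivate at the end of the superstep

def run_supersteps(vertices):
    """Run synchronous supersteps until every vertex is inactive."""
    inbox = {vid: [] for vid in vertices}
    superstep = 0
    while True:
        outbox = []
        for vid, v in vertices.items():
            messages = inbox[vid]
            if v.active or messages:
                v.active = True  # reactivated by external msgs
                compute(v, messages, outbox, superstep)
        inbox = {vid: [] for vid in vertices}
        for target, value in outbox:
            inbox[target].append(value)
        if not outbox and not any(v.active for v in vertices.values()):
            break
        superstep += 1
    return {vid: v.value for vid, v in vertices.items()}
```

On a three-vertex chain, every vertex converges to the global maximum after a few supersteps and the computation halts once all vertices have deactivated.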
Introduction ✔ Heterogeneity State of the Art PARAGON Contention Experiments
% of audience asleep
(Figure: Node#1 and Node#2 connected by a network (Ethernet, IPoIB).)
(Figure: communication volumes between compute nodes.)

       N1   N2   N3
  N1    -    1    6
  N2    1    -    1
  N3    6    1    -

The placement is reordered from (N1, N2, N3) to (N3, N1, N2) so that the heavily communicating pair (volume 6) lands on a cheap link.
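The idea that placement should follow the communication pattern can be sketched as a mapping-cost computation. The volumes below mirror the slide's matrix (1, 1, 6); the partition names, node layout, and link costs are illustrative assumptions, not measured values.

```python
# Sketch: scoring a partition-to-node mapping, assuming a pairwise model
# where total cost = sum of comm_volume(i, j) * link_cost(node(i), node(j)).

comm = {  # communication volumes, mirroring the slide's matrix
    ("P1", "P2"): 1,
    ("P2", "P3"): 1,
    ("P1", "P3"): 6,
}

# Hypothetical architecture: N1 and N2 share a socket (cheap link),
# N3 sits across the network (expensive link).
link_cost = {
    frozenset(["N1", "N2"]): 1,
    frozenset(["N2", "N3"]): 5,
    frozenset(["N1", "N3"]): 5,
}

def mapping_cost(mapping):
    """Total communication cost of assigning partitions to nodes."""
    return sum(vol * link_cost[frozenset([mapping[a], mapping[b]])]
               for (a, b), vol in comm.items())

naive = {"P1": "N1", "P2": "N2", "P3": "N3"}
aware = {"P1": "N1", "P2": "N3", "P3": "N2"}  # heavy pair P1-P3 on cheap link

print(mapping_cost(naive))  # 1*1 + 1*5 + 6*5 = 36
print(mapping_cost(aware))  # 1*5 + 1*5 + 6*1 = 16
```

An architecture-oblivious assignment pays the expensive link for the heaviest pair; remapping cuts the total cost from 36 to 16 under these assumed costs.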
Introduction ✔ Heterogeneity ✔ State of the Art PARAGON Contention Experiments
Balanced Graph (Re)Partitioning
★ Partitioners (static graphs)
○ Offline methods (high quality, poor scalability): Metis, ICA3PP'08, SoCC'12, TKDE'15, BigData'15
○ Online methods (moderate quality, high scalability): DG/LDG, Fennel
★ Repartitioners (dynamic graphs)
○ Offline methods (high quality, poor scalability): Parmetis; Aragon (heterogeneity-aware)
○ Online methods (moderate~high quality, high scalability): CatchW, xdgp, Mizan, LogGP, Hermes; Paragon (heterogeneity-aware)
[1]. Angen Zheng, Alexandros Labrinidis, and Panos K. Chrysanthis. Architecture-Aware Graph Repartitioning for Data-Intensive Scientific
(Figure: graph G partitioned into P1~P9, mapped onto nodes N1~N9.)
Heterogeneity-Aware Refinement (Aragon)
○ More details in the paper
○ N5 can hold the entire graph in memory; Aragon prefers to work in offline mode
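Heterogeneity-aware refinement weighs each cut edge by the cost of the link between the nodes hosting its endpoints. The gain function below is a hedged sketch of that idea; the names and the exact formula are illustrative assumptions, not the function from the Aragon paper.

```python
# Sketch of an architecture-aware migration gain: gain(v, src -> dst) is the
# reduction in link-cost-weighted communication volume if vertex v moves
# from partition src to partition dst. Names and formula are illustrative.

def migration_gain(v, src, dst, neighbors, part_of, node_of, link_cost):
    """Weighted-communication reduction if v moves src -> dst."""
    gain = 0
    for u, w in neighbors[v]:
        p = part_of[u]
        # Cost of edge (v, u) with v in src vs. after moving v to dst.
        before = w * link_cost[node_of[src]][node_of[p]]
        after = w * link_cost[node_of[dst]][node_of[p]]
        gain += before - after
    return gain
```

With a uniform link cost this degenerates to the classic (architecture-oblivious) edge-cut gain; with measured link costs it prefers moves that pull heavy edges onto cheap links.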
Introduction ✔ Heterogeneity ✔ State of the Art ✔ PARAGON Contention Experiments
(Figures: Paragon overview. Partitions P1~P9 are mapped onto nodes N1~N9; partition pairs such as (P1, P3), (P4, P6), and (P5, P7) are selected for refinement.)
Only send boundary vertices!
(Figure: each selected partition pair, e.g. (P1, P3), (P4, P6), (P5, P7), is shipped to an Aragon instance running on one of the nodes N1~N9.)
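Shipping only boundary vertices keeps the refinement traffic small, since interior vertices cannot change the cut between a pair. A minimal sketch, assuming an adjacency-list graph and a vertex-to-partition map (names are illustrative):

```python
# Sketch: collect the boundary vertices of a partition pair before shipping
# them to a refiner. A vertex is "boundary" for the pair (pa, pb) if it lives
# in one of the two partitions and has a neighbor in the other.

def boundary_vertices(adj, part_of, pa, pb):
    """Vertices of pa/pb with at least one neighbor across the pair's cut."""
    boundary = set()
    for v, neighbors in adj.items():
        if part_of[v] not in (pa, pb):
            continue
        for u in neighbors:
            if part_of[u] in (pa, pb) and part_of[u] != part_of[v]:
                boundary.add(v)
                break
    return boundary
```

Only these vertices (and their incident edges) need to travel to the node running the Aragon instance for that pair.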
# of Groups
○ Degree of Parallelism
(Figure: partition pairs grouped and refined by parallel Aragon instances across nodes N1~N9.)
# of Groups
○ Degree of Parallelism
○ Parallelism vs Quality
(Figure: group counts of 36, 16, 9, and 6; partition pairs are swapped and refined by parallel Aragon instances on nodes such as N2, N9, and N8.)
Repeat k times to increase the # of partition pairs being refined!
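Repeating the pairing step with a rotated schedule covers more partition pairs while keeping each round's pairs disjoint (so they can be refined in parallel). The classic round-robin (circle) method sketched below is one way to do this; it is illustrative, not Paragon's exact pairing scheme.

```python
# Sketch of scheduling disjoint partition pairs across k rounds using the
# round-robin (circle) method: each round refines non-overlapping pairs in
# parallel, and repeating rounds covers more pairs overall.

def round_robin_pairs(parts, k):
    """Yield k rounds, each a list of disjoint partition pairs."""
    parts = list(parts)
    if len(parts) % 2:
        parts.append(None)  # bye slot for an odd partition count
    n = len(parts)
    for _ in range(k):
        pairs = []
        for i in range(n // 2):
            a, b = parts[i], parts[n - 1 - i]
            if a is not None and b is not None:
                pairs.append((a, b))
        # Rotate every element except the first (circle method).
        parts = [parts[0]] + [parts[-1]] + parts[1:-1]
        yield pairs
```

For four partitions, three rounds already cover all six possible pairs, with each round refining two pairs concurrently.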
Introduction ✔ Heterogeneity ✔ State of the Art ✔ PARAGON ✔ Contention Experiments
(Figure: Node#1 and Node#2 connected by an RDMA-enabled network.)
[2]. C. Binnig, U. Çetintemel, A. Crotty, A. Galakatos, T. Kraska, E. Zamanian, and S. B. Zdonik. The End of Slow Networks: It's Time for a
★ Dual-socket Xeon E5v2 server with
○ DDR3-1600
○ 2 FDR 4x NICs per socket
★ InfiniBand: 1.7GB/s~37.5GB/s
★ DDR3: 6.25GB/s~16.6GB/s
Revisit the impact of the memory subsystem carefully!
(Figure: intra-node communication. The sending core writes its send buffer into a shared buffer; the receiving core copies the data into its receive buffer.)
★ Intra-socket: the send, shared, and receive buffers are all cached, leaving multiple copies of the same data in the LLC, contending for the LLC and the memory controller (MC).
★ Inter-socket: the send/shared buffers are cached on one socket and the receive/shared buffers on the other, leaving multiple copies of the same data in the LLCs, contending for the LLC, MC, and QPI.
(Figure: inter-node communication. The sending core on Node#1 hands its send buffer to the IB HCA, which delivers the data into the receive buffer on Node#2. RDMA is typical of small HPC clusters; TCP/IP of cloud/large clusters.)
Introduction ✔ Heterogeneity ✔ State of the Art ✔ PARAGON ✔ Contention ✔ Experiments
Aragon
★ com-lj: |V| = 4M, |E| = 69M
★ 40 partitions: two 20-core machines
★ Initial partitioner: DG (deterministic greedy)
★ # of shuffle times: 0
★ com-lj: |V| = 4M, |E| = 69M
★ 40 partitions: two 20-core machines
★ Initial partitioner: DG (deterministic greedy)
★ Deg. of parallelism: 8
With # of shuffle refinement times > 10:
○ Paragon had lower refinement overhead: 8~10s vs 33s (Paragon vs Aragon)
○ Paragon produced better decompositions: 0~2.6% (Paragon vs Aragon)
★ Dataset: 12 datasets from various areas
★ # of parts: 40 (two 20-core machines)
★ Initial partitioner: HP/DG/LDG
★ Deg. of parallelism: 8
★ # of refinement times: 8
HP: Hashing Partitioning; DG: Deterministic Greedy Partitioning; LDG: Linear Deterministic Greedy Partitioning

Improvement:
        Max   Avg.
HP      58%   43%
DG      29%   17%
LDG     53%   36%
Bottleneck: memory (μ=1) vs. network (μ=0)
(Figure: Paragon vs. xdgp vs. Mizan under both bottlenecks.)
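The μ knob above suggests interpolating between a memory-bound (μ=1) and a network-bound (μ=0) regime. The cost function below is an illustrative assumption of how such a knob could weight local vs. remote traffic; it is not the exact model from the Paragon paper.

```python
# Sketch of a bottleneck-interpolating communication cost, assuming mu = 1
# means the memory subsystem dominates and mu = 0 means the network
# dominates. Formula and default costs are illustrative assumptions.

def comm_cost(intra_node_volume, inter_node_volume, mu,
              mem_cost=1.0, net_cost=10.0):
    """Blend memory-bound and network-bound traffic costs by mu."""
    return (mu * mem_cost * intra_node_volume +
            (1 - mu) * net_cost * inter_node_volume)
```

At μ=1 only intra-node (memory) traffic counts; at μ=0 only inter-node (network) traffic counts, matching the two bottleneck settings in the experiment.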
uniParagon
Initial Partitioner: DG
CatchW LogGP Hermes
★ as-skitter: |V| = 1.6M, |E| = 22M
★ 60 partitions: three 20-core machines
★ Deg. of parallelism: 8
★ # of shuffle refinement times: 8
(Chart: improvements of 5.9x, 6.7x, 5.9x, and 2.7x.)
★ as-skitter: |V| = 1.6M, |E| = 22M
★ 60 partitions: three 20-core machines
★ # of shuffle refinement times: 8

Reduction:
             Intra-Socket  Inter-Socket
DG           62%           55%
METIS        53%           55%
PARMETIS     15%           17%
uniPARAGON   62%           39%

(Chart: 2.5x, 1.5x, 50%, and 38%.)
★ as-skitter: |V| = 1.6M, |E| = 22M
★ 48 partitions: three 16-core machines
★ # of shuffle refinement times: 8
★ friendster: |V| = 124M, |E| = 3.6B
★ 60 partitions: three 20-core machines
★ Deg. of parallelism: 10
★ # of shuffle refinement times: 10
(Charts: 1.65x on 60 cores; 1.36x.)
Acknowledgments:
Funding: