Argo: Architecture-Aware Graph Partitioning
Angen Zheng, Alexandros Labrinidis, Panos K. Chrysanthis, and Jack Lange
Department of Computer Science, University of Pittsburgh
http://db.cs.pitt.edu/group/ http://www.prognosticlab.org/
1
2
3
✓ Dual-socket Xeon E5v2 server with
○ DDR3-1600
○ 2 FDR 4x NICs per socket
✓ Infiniband: 1.7GB/s~37.5GB/s
✓ DDR3: 6.25GB/s~16.6GB/s
4
5
6
Graph Partitioners: METIS and LDG
Graph Workloads: BFS, SSSP, and PageRank
Graph Dataset: Orkut (|V|=3M, |E|=234M)
Number of Partitions: 16 (one partition per core)
7
SSSP Execution Time (s)
m:s:c    METIS    LDG
1:2:8    633      2,632
2:2:4    654      2,565
4:2:2    521      631
8:2:1    222      280
m: # of machines used; s: # of sockets used per machine; c: # of cores used per socket
9x (LDG: 2,632 s at 1:2:8 vs. 280 s at 8:2:1)
✓ Denser configurations had longer execution time.
○ Contention on the memory subsystems impacted performance.
○ Network may not always be the bottleneck.
8
SSSP LLC Misses (in Millions)
m:s:c    METIS     LDG
1:2:8    10,292    44,117
2:2:4    10,626    44,689
4:2:2    2,541     1,061
8:2:1    96        187
235x (LDG: 44,117M at 1:2:8 vs. 187M at 8:2:1)
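The 9x and 235x annotations appear to compare the densest configuration (1:2:8) with the sparsest (8:2:1) in the LDG column. A quick check of that reading, using only numbers copied from the tables (the variable and function names are illustrative):

```python
# Values copied from the SSSP tables (LDG column only).
ldg_time_s = {"1:2:8": 2632, "8:2:1": 280}
ldg_llc_misses_m = {"1:2:8": 44117, "8:2:1": 187}


def ratio(table, dense="1:2:8", sparse="8:2:1"):
    """Return how many times larger the dense-configuration value is."""
    return table[dense] / table[sparse]


time_ratio = ratio(ldg_time_s)
llc_ratio = ratio(ldg_llc_misses_m)
print(f"execution time: {time_ratio:.1f}x, LLC misses: {llc_ratio:.1f}x")
```

Both ratios round down to the figures on the slide, which supports reading the annotations as LDG at 1:2:8 relative to LDG at 8:2:1.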
9
The distribution of edge-cut matters.
✓ METIS had lower execution time and LLC misses than LDG.
○ Edge-cut matters.
○ Higher edge-cut --> higher comm --> higher contention
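To make "distribution of edge-cut" concrete, the sketch below counts cut edges by the hardware boundary they cross. It assumes one partition per core and that consecutive partition ids fill a socket, then a machine; that layout and all names here are assumptions for illustration, not something the slides specify.

```python
from collections import Counter


def core_location(partition_id, sockets_per_machine, cores_per_socket):
    """Map a partition id to (machine, socket), one partition per core."""
    cores_per_machine = sockets_per_machine * cores_per_socket
    machine = partition_id // cores_per_machine
    socket = (partition_id % cores_per_machine) // cores_per_socket
    return machine, socket


def edge_cut_distribution(edges, part_of, sockets_per_machine, cores_per_socket):
    """Count cut edges by the hardware boundary they cross."""
    counts = Counter()
    for u, v in edges:
        pu, pv = part_of[u], part_of[v]
        if pu == pv:
            continue  # both endpoints in the same partition: not a cut edge
        mu, su = core_location(pu, sockets_per_machine, cores_per_socket)
        mv, sv = core_location(pv, sockets_per_machine, cores_per_socket)
        if mu != mv:
            counts["inter-node"] += 1
        elif su != sv:
            counts["inter-socket"] += 1
        else:
            counts["intra-socket"] += 1
    return counts
```

Two partitionings with the same total edge-cut can differ widely in these three counts, which is the point the slide makes.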
12
✓ Intra-Node and Inter-Node Data Communication
○ Have different performance impacts on the memory subsystems
13
14
[Figure labels: Sending Core, Send Buffer, Shared Buffer, Receive Buffer, Receiving Core]
15
Cached Send/Shared/Receive Buffer
16
Cached Send/Shared Buffer Cached Receive/Shared Buffer
17
18
[Figure labels]
Node#1: Sending Core, Send Buffer, IB HCA
Node#2: Receiving Core, Receive Buffer, IB HCA
19
20
21
Streaming Graph Partitioning Model [I. Stanton, KDD’12]
[Figure labels: Vertex Stream, Partitioner]
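For reference, a minimal sketch of a streaming partitioner in the style of LDG (linear deterministic greedy) from the cited KDD'12 work: each vertex arrives once with its neighbor list and is placed in the partition holding most of its already-placed neighbors, discounted by how full that partition is. The function name, the input format, and the tie-breaking rule are assumptions made for this example.

```python
def ldg_partition(vertex_stream, num_partitions, capacity):
    """Greedy streaming partitioner in the style of LDG.

    vertex_stream yields (vertex, neighbors) pairs; each vertex is placed once.
    """
    assignment = {}               # vertex -> partition id
    sizes = [0] * num_partitions  # current load of each partition
    for v, neighbors in vertex_stream:
        # Count the already-placed neighbors of v in each partition.
        neighbor_counts = [0] * num_partitions
        for u in neighbors:
            if u in assignment:
                neighbor_counts[assignment[u]] += 1

        def score(i):
            # Neighbors in partition i, discounted by how full it is.
            return neighbor_counts[i] * (1.0 - sizes[i] / capacity)

        # Ties go to the least-loaded partition (an assumed tie-break).
        best = max(range(num_partitions), key=lambda i: (score(i), -sizes[i]))
        assignment[v] = best
        sizes[best] += 1
    return assignment
```

On two triangles joined by a single edge, streamed in vertex order with a capacity of 3, this places each triangle in its own partition.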
22
Place vertex v in the partition Pi that maximizes:
[Equation image not preserved. Annotated terms: Weighted Edge-cut; Penalize the placement based on the load of Pi]
✓ Weighted by the relative network comm cost, Argo will
○ avoid edge-cut across nodes (inter-node data comm).
Great for cases where the network is the bottleneck.
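The objective itself did not survive extraction, so the following is only a sketch of one way to combine the two annotated terms, not Argo's actual formula. It assumes a hypothetical `comm_cost[i][j]` matrix of relative communication costs between partitions (0 on the diagonal, at least one positive entry) and weights each placed neighbor by how cheap the link to its partition is.

```python
def place_vertex(neighbors, assignment, sizes, capacity, comm_cost):
    """Pick a partition for one vertex using cost-weighted neighbor counts.

    comm_cost[i][j] is the relative cost of communication between
    partitions i and j (0 on the diagonal). Hypothetical scoring, not
    necessarily Argo's exact objective.
    """
    k = len(sizes)
    max_cost = max(max(row) for row in comm_cost)  # assumed to be > 0
    neighbor_counts = [0] * k
    for u in neighbors:
        if u in assignment:
            neighbor_counts[assignment[u]] += 1

    def score(i):
        # A neighbor in partition j is worth more the cheaper the i<->j link.
        affinity = sum(
            neighbor_counts[j] * (max_cost - comm_cost[i][j]) / max_cost
            for j in range(k)
        )
        # Penalize the placement based on the load of partition i.
        return affinity * (1.0 - sizes[i] / capacity)

    return max(range(k), key=lambda i: (score(i), -sizes[i]))
```

With a uniform cost between all distinct partitions, the affinity term collapses to the plain neighbor count, so this sketch reduces to the LDG-style score. With inter-node links costlier than intra-node ones, a vertex whose neighbors sit in a full partition is steered to another partition on the same node.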
23
[Equation image not preserved. Annotated terms: Refined Intra-Node Network Comm Cost; Original Intra-Node Network Comm Cost; Maximal Inter-Node Network Comm Cost; Degree of Contention (𝞵 ∈ [0, 1])]
✓ Weighted by the refined relative network comm cost, Argo will
○ avoid edge-cut across cores of the same node (intra-node data comm).
Bottleneck: Network (𝞵=0) to Memory (𝞵=1)
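One form consistent with the annotated terms and with the two endpoints (𝞵=0 when the network is the bottleneck, 𝞵=1 when memory is) is a linear interpolation between the original intra-node cost and the maximal inter-node cost. The slide's equation is not preserved, so treat the linear form below as an assumption.

```python
def refined_intra_node_cost(original_intra, max_inter, mu):
    """Interpolate the intra-node cost toward the maximal inter-node cost.

    mu is the degree of contention in [0, 1]. The linear form is an
    assumption chosen to match the endpoints named on the slide.
    """
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    return original_intra + mu * (max_inter - original_intra)
```

At 𝞵=0 the intra-node cost is left unchanged, so only inter-node edge-cut is discouraged. At 𝞵=1 an intra-node cut edge costs as much as the most expensive inter-node one, so the partitioner avoids both.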
24
25
Three Classic Graph Workloads
Three Real-World Large Graphs
Dataset       |V|     |E|
Orkut         3M      234M
Friendster    124M    3.6B
Twitter       52M     3.9B
26
Cluster Configuration
# of Nodes: 32
Network Topology: FDR Infiniband (Single Switch)
Network Bandwidth: 56Gbps
Compute Node Configuration
# of Sockets: 2 Intel Haswell (10 cores / socket)
L3 Cache: 25MB
27
METIS: the most well-known multi-level partitioner.
LDG: the most well-known streaming partitioner.
ARGO-H: Argo, assuming the network is the bottleneck.
ARGO: Argo, assuming memory is the bottleneck.
28
★ Orkut: |V| = 3M, |E| = 234M
★ 60 Partitions: three 20-core machines
[Figure: SSSP execution time vs. Message Grouping Size (multiple msgs sent by a single SSSP process to the same destination are grouped into one msg). Bar annotations: 2x 2x 3x 1x 4x 2x 5x 1x 2x 1.4x 3x 1x]
✓ ARGO had the lowest SSSP execution time.
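Message grouping, as described on this slide, can be sketched as a per-destination buffer that is sent once it holds a full group. The class name, the `send` callback, and the flush policy are hypothetical and are not taken from the SSSP implementation used in the experiments.

```python
from collections import defaultdict


class MessageGrouper:
    """Buffer per-destination messages and emit them in groups."""

    def __init__(self, group_size, send):
        self.group_size = group_size      # messages per grouped send
        self.send = send                  # callable(dest, list_of_msgs)
        self.buffers = defaultdict(list)  # dest -> pending messages

    def post(self, dest, msg):
        """Queue one message; send the group once it is full."""
        buf = self.buffers[dest]
        buf.append(msg)
        if len(buf) >= self.group_size:
            self.flush(dest)

    def flush(self, dest):
        """Send whatever is pending for one destination."""
        buf = self.buffers.pop(dest, [])
        if buf:
            self.send(dest, buf)

    def flush_all(self):
        """Send all partially filled groups, e.g. at the end of a superstep."""
        for dest in list(self.buffers):
            self.flush(dest)
```

A larger group size means fewer, larger transfers per destination, which is the knob varied along the Message Grouping Size axis.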
29
★ Orkut: |V| = 3M, |E| = 234M
★ 60 Partitions: three 20-core machines
[Figure: LLC misses vs. Message Grouping Size. Bar annotations: 50x 38x 9x 4x 3x 6x 1x 1x 1x 9x 1.2x 12x]
✓ ARGO had the lowest LLC Misses.
30
★ Orkut: |V| = 3M, |E| = 234M
★ 60 Partitions: three 20-core machines
[Figure: communication volume. Surviving labels: 64; Intra-Socket: METIS 69%, LDG 49%, ARGO-H 70%]
✓ ARGO had the lowest intra-node communication volume.
✓ Distribution of the edge-cut also matters.
31
★ Twitter: |V| = 52M, |E| = 3.9B
★ 80 Partitions: four 20-core machines
★ Message Grouping Size: 512
✓ ARGO had the lowest SSSP execution time.
✓ Up to 6x improvement against ARGO-H.
✓ Improvement became larger as the graph size increased.
32
★ Twitter: |V| = 52M, |E| = 3.9B
★ 80~200 Partitions: four to ten 20-core machines
★ Message Grouping Size: 512
✓ ARGO always outperformed LDG and ARGO-H.
✓ Up to 11x improvement against ARGO-H.
33
* 160 = 13h * 180 = 6h
★ Twitter: |V| = 52M, |E| = 3.9B
★ 80~200 Partitions: four to ten 20-core machines
★ Message Grouping Size: 512
✓ Hours of CPU time saved.
34
★ Twitter: |V| = 52M, |E| = 3.9B
★ 80~200 Partitions: four to ten 20-core machines
[Figure panels plotted against # of Partitions; surviving panel title: Partitioning Time as a Percentage of the CPU Time Saved (SSSP Execution)]
✓ ARGO is indeed slower than LDG.
✓ The overhead was negligible in comparison to the CPU time saved.
✓ Graph analytics usually have much longer execution time.
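The metric in the panel title can be written out as simple arithmetic. The sketch below assumes that CPU time saved is the per-run wall-clock saving multiplied by the number of partitions (one partition per core); the slides do not state this definition, and the function and parameter names are hypothetical.

```python
def cpu_time_saved_hours(baseline_s, improved_s, num_partitions):
    """Wall-clock saving per run, summed over all cores, in hours."""
    return (baseline_s - improved_s) * num_partitions / 3600.0


def overhead_percentage(partitioning_s, baseline_s, improved_s, num_partitions):
    """Partitioning time as a percentage of the CPU time saved."""
    saved_s = (baseline_s - improved_s) * num_partitions
    if saved_s <= 0:
        raise ValueError("no CPU time was saved")
    return 100.0 * partitioning_s / saved_s
```

Under this definition the saving scales with the number of cores in use, which is why a modest per-run speedup can add up to hours of CPU time.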
35
due to excess intra-node data comm.
Acknowledgments:
Peyman Givi, Patrick Pisciuneri
Funding:
NSF CBET-1609120
NSF CBET-1250171
BigData’16 Student Travel Award
36