Argo: Architecture-Aware Graph Partitioning

SLIDE 1

Argo: Architecture-Aware Graph Partitioning

Angen Zheng Alexandros Labrinidis, Panos K. Chrysanthis, and Jack Lange

Department of Computer Science, University of Pittsburgh

http://db.cs.pitt.edu/group/ http://www.prognosticlab.org/

SLIDE 2

Big Graphs Are Everywhere [SIGMOD’16 Tutorial]

SLIDE 3

A Balanced Partitioning = Even Load Distribution
Minimal Edge-Cut = Minimal Data Communication

(Figure: a graph partitioned across three nodes, N1, N2, N3.)

Assumption: Network is the bottleneck.

SLIDE 4

The End of Slow Networks: Network is now as fast as DRAM [C. Binnig, VLDB'15]

✓ Dual-socket Xeon E5v2 server with
  ○ DDR3-1600
  ○ 2 FDR 4x NICs per socket
✓ InfiniBand: 1.7 GB/s ~ 37.5 GB/s
✓ DDR3: 6.25 GB/s ~ 16.6 GB/s

SLIDE 5

The End of Slow Networks: Does edge-cut still matter?

SLIDE 6

Roadmap

• Introduction
• Does edge-cut still matter?
• Why does edge-cut still matter?
• Argo
• Evaluation
• Conclusions

SLIDE 7

The End of Slow Networks: Does edge-cut still matter?

Graph Partitioners: METIS and LDG
Graph Workloads: BFS, SSSP, and PageRank
Graph Dataset: Orkut (|V| = 3M, |E| = 234M)
Number of Partitions: 16 (one partition per core)

SLIDE 8

The End of Slow Networks: Does edge-cut still matter?

SSSP Execution Time (s)
m:s:c    METIS     LDG
1:2:8      633   2,632
2:2:4      654   2,565
4:2:2      521     631
8:2:1      222     280

(up to 9x difference)
m: # of machines used
s: # of sockets used per machine
c: # of cores used per socket

✓ Denser configurations had longer execution times.
✓ Contention on the memory subsystem impacted performance.
✓ The network may not always be the bottleneck.

SLIDE 9

The End of Slow Networks: Does edge-cut still matter?

SSSP Execution Time (s)
m:s:c    METIS     LDG
1:2:8      633   2,632
2:2:4      654   2,565
4:2:2      521     631
8:2:1      222     280

SSSP LLC Misses (in millions)
m:s:c    METIS     LDG
1:2:8   10,292  44,117
2:2:4   10,626  44,689
4:2:2    2,541   1,061
8:2:1       96     187

(up to 9x difference in execution time; up to 235x in LLC misses)

✓ Denser configurations had longer execution times.
✓ Contention on the memory subsystem impacted performance.
✓ The network may not always be the bottleneck.

SLIDE 11

The distribution of edge-cut matters.

SLIDE 12

The End of Slow Networks: Does edge-cut still matter?

✓ METIS had lower execution time and fewer LLC misses than LDG (up to 9x and 235x, respectively).

Edge-cut matters: higher edge-cut → higher comm → higher contention.

SLIDE 13

Yes! Both edge-cut and its distribution matter!

✓ Intra-node and inter-node data communication
  ○ have different performance impacts on the memory subsystems of modern multicore machines.

The End of Slow Networks: Does edge-cut still matter?

SLIDE 14

Roadmap

• Introduction
• Does edge-cut still matter?
• Why does edge-cut still matter?
• Argo
• Evaluation
• Conclusions

SLIDE 15

Intra-Node Data Comm: Shared Memory

(Figure: the sending core copies its send buffer into a shared buffer; the receiving core copies the shared buffer into its receive buffer.)

  • 1. Load (send buffer)
  • 2a. Load / 2b. Write (shared buffer)
  • 3. Load (shared buffer)
  • 4a. Load / 4b. Write (receive buffer)

Extra memory copy.
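The double-copy pattern above can be sketched as a toy model (illustrative only, not Argo's implementation): the payload crosses the memory subsystem twice, once into the shared buffer and once out of it, touching cache lines both times.

```python
# Toy model of shared-memory message passing between two cores on one
# node (not Argo's code): the payload is copied twice, send buffer ->
# shared buffer -> receive buffer, so it traverses the memory
# subsystem twice and can evict useful cache lines on each pass.

def send(shared_buf, send_buf):
    # Steps 1, 2a, 2b on the slide: load the send buffer, then
    # load+write the shared buffer.
    shared_buf[:len(send_buf)] = send_buf      # copy #1

def receive(shared_buf, recv_buf, n):
    # Steps 3, 4a, 4b: load the shared buffer, then load+write the
    # receive buffer.
    recv_buf[:n] = shared_buf[:n]              # copy #2

send_buf = bytearray(b"graph-update")
shared_buf = bytearray(64)
recv_buf = bytearray(64)

send(shared_buf, send_buf)
receive(shared_buf, recv_buf, len(send_buf))
```

An RDMA transfer (Slide 19) skips the intermediate shared-buffer copy entirely, which is why it avoids the cache pollution this model exhibits.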

SLIDE 16

Intra-Node Data Comm: Shared Memory

Cached send/shared/receive buffers:
LLC and memory bandwidth contention; cache pollution.

SLIDE 17

Intra-Node Data Comm: Shared Memory

Cached send/shared buffer (sender side); cached receive/shared buffer (receiver side):
LLC and memory bandwidth contention; cache pollution.

SLIDE 18

Excess intra-node data communication may hurt performance.

SLIDE 19

Inter-Node Data Comm: RDMA Read/Write

(Figure: the sending core on Node #1 transfers its send buffer through the IB HCA directly into the receive buffer of the receiving core on Node #2.)

No extra memory copy and no cache pollution.

SLIDE 20

Offloading excess intra-node data comm across nodes may achieve better performance.

SLIDE 21

Roadmap

• Introduction
• Does edge-cut still matter?
• Why does edge-cut still matter?
• Argo
• Evaluation
• Conclusions

SLIDE 22

Argo: Graph Partitioning Model

(Figure: vertices arrive one at a time as a stream; the partitioner assigns each to one of the partitions.)

Streaming Graph Partitioning Model [I. Stanton, KDD’12]
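The streaming model cited above places each arriving vertex greedily, in one pass. A minimal sketch of the LDG-style heuristic (my paraphrase of that line of work, not the paper's code): place v in the partition holding most of its already-placed neighbors, damped by a linear load penalty so partitions stay balanced.

```python
# Minimal sketch of LDG-style streaming partitioning (paraphrased,
# not the paper's code): each vertex is placed once, as it streams in.

def ldg_place(v, neighbors, assignment, loads, capacity, k):
    """Score each partition by already-placed neighbors of v, damped
    by a linear load penalty; ties go to the lighter partition."""
    best_p, best_score = 0, -1.0
    for p in range(k):
        gain = sum(1 for n in neighbors if assignment.get(n) == p)
        score = gain * (1.0 - loads[p] / capacity)
        if score > best_score or (score == best_score
                                  and loads[p] < loads[best_p]):
            best_p, best_score = p, score
    assignment[v] = best_p
    loads[best_p] += 1
    return best_p

# Stream a 4-vertex graph into 2 partitions (capacity includes slack).
edges = {0: [], 1: [0], 2: [0, 1], 3: [2]}
assignment, loads = {}, [0, 0]
for v, nbrs in edges.items():
    ldg_place(v, nbrs, assignment, loads, capacity=3, k=2)
```

Here the tightly connected vertices 0-2 end up together, and once their partition is full the load penalty pushes vertex 3 to the other partition.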

SLIDE 23

Argo: Architecture-Aware Vertex Placement

Place vertex v in the partition Pi that maximizes:

(Objective shown as a figure: a weighted edge-cut score, penalized based on the load of Pi.)

✓ Weighted by the relative network comm cost, Argo will
  ○ avoid edge-cut across nodes (inter-node data comm).

Great for cases where the network is the bottleneck.

SLIDE 24

Argo: Architecture-Aware Vertex Placement

(Formula shown as a figure; its labels: Refined Intra-Node Network Comm Cost, Original Intra-Node Network Comm Cost, Maximal Inter-Node Network Comm Cost, Degree of Contention 𝞵 ∈ [0, 1].)

✓ Weighted by the refined relative network comm cost, Argo will
  ○ avoid edge-cut across cores of the same node (intra-node data comm).

Bottleneck: network when 𝞵 = 0; memory when 𝞵 = 1.
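Slides 23-24 give the placement objective only as a figure, so the following is a speculative sketch of the idea rather than Argo's actual formula: edges to already-placed neighbors are weighted by a relative communication cost, and the contention knob 𝞵 inflates the intra-node cost toward the maximal inter-node cost. The concrete cost values, the linear interpolation, and the load penalty are all my assumptions.

```python
# Speculative sketch of Argo-style architecture-aware placement (the
# deck shows the real objective only as a figure). Assumed relative
# costs: 0 within a partition, INTRA across cores of one node, INTER
# across nodes; mu in [0, 1] interpolates INTRA toward INTER as
# memory contention grows.

INTRA, INTER = 1.0, 4.0   # assumed relative network comm costs

def refined_cost(p, q, parts_per_node, mu):
    """Relative cost of an edge whose endpoints sit in partitions p, q."""
    if p == q:
        return 0.0
    if p // parts_per_node == q // parts_per_node:   # same node
        return INTRA + mu * (INTER - INTRA)          # assumed interpolation
    return INTER

def argo_place(v, neighbors, assignment, loads, capacity, k,
               parts_per_node, mu):
    """Greedy streaming placement: reward partitions that make v's
    edges cheap, damped by a linear load penalty; ties go to the
    lighter partition."""
    best_p, best_score = 0, -1.0
    for p in range(k):
        gain = sum(INTER - refined_cost(p, assignment[n], parts_per_node, mu)
                   for n in neighbors if n in assignment)
        score = gain * (1.0 - loads[p] / capacity)
        if score > best_score or (score == best_score
                                  and loads[p] < loads[best_p]):
            best_p, best_score = p, score
    assignment[v] = best_p
    loads[best_p] += 1
    return best_p

# With room available, a vertex joins its neighbor's partition.
assignment, loads = {0: 0}, [1, 0, 0, 0]
argo_place(1, [0], assignment, loads, capacity=4, k=4,
           parts_per_node=2, mu=1.0)
```

With 𝞵 = 0 a displaced vertex prefers another partition on the same node (intra-node comm is cheap); with 𝞵 = 1 same-node and cross-node partitions cost the same, so load alone decides, which spreads the excess communication across nodes.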

SLIDE 25

Roadmap

• Introduction
• Does edge-cut still matter?
• Why does edge-cut still matter?
• Argo
• Evaluation
• Conclusions

SLIDE 26

• Three Classic Graph Workloads
  • Breadth First Search (BFS)
  • Single Source Shortest Path (SSSP)
  • PageRank

• Three Real-World Large Graphs

Evaluation: Workloads & Datasets

Dataset      |V|     |E|
Orkut         3M    234M
Friendster  124M    3.6B
Twitter      52M    3.9B

SLIDE 27

Evaluation: Platform

Cluster Configuration
# of Nodes: 32
Network Topology: FDR InfiniBand (single switch)
Network Bandwidth: 56 Gbps

Compute Node Configuration
# of Sockets: 2 (Intel Haswell, 10 cores per socket)
L3 Cache: 25 MB

SLIDE 28

• METIS: the most well-known multi-level partitioner.
• LDG: the most well-known streaming partitioner.
• ARGO-H: assumes the network is the bottleneck.
  • weights edge-cut by the original network comm costs.
• ARGO: assumes memory is the bottleneck.
  • weights edge-cut by the refined network comm costs.

Evaluation: Partitioners

SLIDE 29

Evaluation: SSSP Exec. Time on Orkut dataset

(Chart: SSSP execution time vs. message grouping size. Message grouping: multiple messages from a single SSSP process to the same destination are combined into one message.)

★ Orkut: |V| = 3M, |E| = 234M
★ 60 Partitions: three 20-core machines

✓ ARGO had the lowest SSSP execution time.
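Message grouping as described above can be sketched as follows; the class name, buffer policy, and group size are illustrative, not Argo's actual parameters.

```python
# Illustrative message grouping (not Argo's code): messages bound for
# the same destination partition are buffered and flushed as a single
# combined message once the configured group size is reached.

class GroupingSender:
    def __init__(self, group_size, transport):
        self.group_size = group_size
        self.transport = transport     # callable(dest, list_of_msgs)
        self.pending = {}              # dest -> buffered messages

    def send(self, dest, msg):
        buf = self.pending.setdefault(dest, [])
        buf.append(msg)
        if len(buf) >= self.group_size:
            self.flush(dest)

    def flush(self, dest):
        # Emit all buffered messages for dest as one network message.
        buf = self.pending.pop(dest, [])
        if buf:
            self.transport(dest, buf)

sent = []
sender = GroupingSender(group_size=3,
                        transport=lambda d, b: sent.append((d, list(b))))
for dist in [5, 7, 9]:
    sender.send(dest=1, msg=("relax", dist))
# Three logical messages left the node as a single grouped message.
```

Grouping trades per-message latency for fewer, larger transfers, which matters when the per-message overhead (not bandwidth) dominates.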

SLIDE 30

Evaluation: SSSP LLC Misses on Orkut dataset

(Chart: LLC misses vs. message grouping size.)

★ Orkut: |V| = 3M, |E| = 234M
★ 60 Partitions: three 20-core machines

✓ ARGO had the lowest LLC misses.

SLIDE 31

Evaluation: SSSP Comm Vol. on Orkut dataset

★ Orkut: |V| = 3M, |E| = 234M
★ 60 Partitions: three 20-core machines

(Chart: communication volume breakdown; intra-socket share: METIS 69%, LDG 49%, ARGO-H 70%.)

✓ ARGO had the lowest intra-node communication volume.
✓ Distribution of the edge-cut also matters.

SLIDE 32

Evaluation: SSSP Exec. Time vs Graph Size

★ Twitter: |V| = 52M, |E| = 3.9B
★ 80 Partitions: four 20-core machines
★ Message Grouping Size: 512

✓ ARGO had the lowest SSSP execution time.
✓ Up to 6x improvement against ARGO-H.
✓ Improvement became larger as the graph size increased.

SLIDE 33

Evaluation: SSSP Exec. Time vs # of Partitions

★ Twitter: |V| = 52M, |E| = 3.9B
★ 80~200 Partitions: four to ten 20-core machines
★ Message Grouping Size: 512

✓ ARGO always outperformed LDG and ARGO-H.
✓ Up to 11x improvement against ARGO-H.

SLIDE 34

Evaluation: SSSP Exec. Time vs # of Partitions

(Chart annotations: 160 partitions = 13h, 180 partitions = 6h.)

★ Twitter: |V| = 52M, |E| = 3.9B
★ 80~200 Partitions: four to ten 20-core machines
★ Message Grouping Size: 512

✓ Hours of CPU time saved.

SLIDE 35

Evaluation: Partitioning Overhead

(Chart: partitioning time as a percentage of the CPU time saved in SSSP execution, vs. # of partitions.)

★ Twitter: |V| = 52M, |E| = 3.9B
★ 80~200 Partitions: four to ten 20-core machines

✓ ARGO is indeed slower than LDG.
✓ The overhead was negligible in comparison to the CPU time saved.
✓ Graph analytics usually have much longer execution times.

SLIDE 36

Conclusions

• Findings
  • The network is not always the bottleneck.
  • Contention on memory subsystems may significantly impact performance, due to excess intra-node data comm.
  • Both edge-cut and its distribution matter.

• ARGO
  • Avoids contention by offloading excess intra-node data comm across nodes.
  • Achieves up to 11x improvement on real-world workloads.
  • Scales well in terms of both graph size and number of partitions.

Acknowledgments:
• Peyman Givi
• Patrick Pisciuneri

Funding:
• NSF CBET-1609120
• NSF CBET-1250171
• BigData'16 Student Travel Award

Thanks!