Planar: Parallel Lightweight Architecture-Aware Adaptive Graph Repartitioning (PowerPoint PPT Presentation)


SLIDE 1

Planar: Parallel Lightweight Architecture-Aware Adaptive Graph Repartitioning

Angen Zheng, Alexandros Labrinidis, and Panos K. Chrysanthis

University of Pittsburgh

SLIDE 2

Graph Partitioning

 Applications of Graph Partitioning

  • Scientific Simulations
  • Distributed Graph Computation (Pregel, Hama, Giraph)
  • VLSI Design
  • Task Scheduling
  • Linear Programming

SLIDE 3

A Balanced Partitioning = Even Load Distribution

Balanced: each partition holds roughly the same load.

[Figure: example graph split evenly across nodes N1, N2, N3]

SLIDE 4

Minimal Edge-Cut = Minimal Data Comm

Minimizing Edge-Cut: [Figure: example graph across N1, N2, N3 with few edges crossing partitions]
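Both notions are easy to pin down concretely. A minimal sketch (toy graph and vertex-to-partition mapping assumed, not from the slides) that computes the load per partition and the edge-cut:

```python
# Toy example (assumed data): part maps each vertex to its partition.
from collections import Counter

edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
part = {"a": "N1", "b": "N1", "c": "N2", "d": "N2"}

load = Counter(part.values())                         # vertices per partition
edge_cut = sum(part[u] != part[v] for u, v in edges)  # cross-partition edges

print(dict(load))  # {'N1': 2, 'N2': 2}: balanced load
print(edge_cut)    # 2: each cut edge implies data comm between N1 and N2
```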

SLIDE 5

Minimal Edge-Cut = Minimal Data Comm, but Minimal Data Comm ≠ Minimal Comm Cost

Group neighboring vertices as close as possible


The partitioner has to be Architecture-Aware

Figure 1. Pair-Wise Network Bandwidth (J. Xue, BigData'15). STD DEV: 416.82Mb/s, 358.34Mb/s, 269.71Mb/s.

SLIDE 6

Overview of the State-of-the-Art

Balanced Graph (Re)Partitioning

Partitioners (static graphs):
  • Offline methods: high quality, poor scalability
  • Online methods: moderate quality, high scalability

Repartitioners (dynamic graphs):
  • Offline methods: high quality, poor scalability
  • Online methods: moderate~high quality, high scalability

[Figure annotation: Architecture-Aware]

SLIDE 7

Roadmap


Introduction Planar Evaluation Conclusions

SLIDE 8

Planar: Problem Statement

Given G=(V, E) and an initial partitioning P, compute a new partitioning that is:

  • Minimizing Communication: the network cost of cross-partition edges
  • Minimizing Migration: the network cost of moving vertices to their new partitions
  • Balancing Load: partitions receive roughly equal load
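A plausible way to write these objectives (notation assumed here, not necessarily the authors' exact formulation): c(i, j) is the network cost between the nodes hosting partitions i and j, P0 the initial partitioning, |v| the size and w(v) the load of vertex v, k the number of partitions, ρ the allowed imbalance.

```latex
% Plausible formulation; notation assumed, not the authors' exact one.
\begin{align*}
\text{Minimize communication:} \quad & \sum_{(u,v)\in E,\; P(u)\neq P(v)} c\big(P(u),P(v)\big) \\
\text{Minimize migration:}     \quad & \sum_{v\in V,\; P(v)\neq P_0(v)} |v|\cdot c\big(P_0(v),P(v)\big) \\
\text{Balance load:}           \quad & \max_{i}\ \sum_{v:\;P(v)=i} w(v) \;\le\; (1+\rho)\,\frac{\sum_{v\in V} w(v)}{k}
\end{align*}
```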

SLIDE 9

Planar: Overview

★ Phase-1: Logical Vertex Migration (Migration Planning: what vertices to move? where to move?)
  ○ Phase-1a: Minimizing Comm Cost
  ○ Phase-1b: Ensuring Balanced Partitions
★ Phase-2: Physical Vertex Migration: perform the migration plan
★ Phase-3: Convergence Check: still beneficial?

[Figure: Planar runs between consecutive application supersteps Sk, Sk+1, Sk+2, ...]

SLIDE 10

Phase-1a: Minimizing Comm Cost

[Figure: example graph across nodes N1, N2, N3 with heterogeneous pair-wise network costs: N1-N2 = 6, N1-N3 = 1, N2-N3 = 1]

SLIDE 11

Phase-1a: Minimizing Comm Cost

★ Run Planar on each partition in parallel
  ○ For each boundary vertex of my partition:
    ■ make a migration decision on my own
    ■ probabilistic vertex migration

[Figure: example graph across N1, N2, N3 with network costs 6, 6, 1, 1]

SLIDE 12

Phase-1a: Minimizing Comm Cost

★ Run Planar on each partition in parallel
  ○ For each boundary vertex of my partition:
    ■ make a migration decision on my own
    ■ probabilistic vertex migration

[Figure: example graph across N1, N2, N3 with network costs 6, 6, 1, 1]

SLIDE 13

Phase-1a@N1: Use vertex a as an example

g(a, N1, N1) = 0

Max Gain: 0; Optimal Dest: N1

[Figure: vertex a on N1 with one neighbor on N1, two on N2, and one on N3]

SLIDE 14

Phase-1a@N1: Move vertex a to N2?

  • old_comm(a, N1) = 2 * 6 + 1 * 1 = 13
  • new_comm(a, N2) = 1 * 6 + 1 * 1 = 7
  • mig(a, N1, N2) = 1 * 6 = 6
  • g(a, N1, N2) = 13 - 7 - 6 = 0

Max Gain: 0; Optimal Dest: N1

[Figure: moving vertex a from N1 to N2]

SLIDE 15

Phase-1a@N1: Move vertex a to N3?

  • old_comm(a, N1) = 2 * 6 + 1 * 1 = 13
  • new_comm(a, N3) = 1 * 1 + 2 * 1 = 3
  • mig(a, N1, N3) = 1 * 1 = 1
  • g(a, N1, N3) = 13 - 3 - 1 = 9

Max Gain: 9; Optimal Dest: N3

[Figure: moving vertex a from N1 to N3]
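The gain bookkeeping above is mechanical enough to spell out. A minimal sketch (representation and names assumed, not the authors' code) that reproduces the vertex-a numbers:

```python
# g(v, src, dst) = old_comm(v, src) - new_comm(v, dst) - mig(v, src, dst)
# Pair-wise network costs from the running example.
net_cost = {("N1", "N2"): 6, ("N1", "N3"): 1, ("N2", "N3"): 1}

def cost(a, b):
    """Cost of one unit of communication between compute nodes a and b."""
    return 0 if a == b else net_cost.get((a, b)) or net_cost[(b, a)]

def gain(neighbor_parts, v_size, src, dst):
    old_comm = sum(cost(src, p) for p in neighbor_parts)
    new_comm = sum(cost(dst, p) for p in neighbor_parts)
    mig = v_size * cost(src, dst)   # cost of shipping v itself
    return old_comm - new_comm - mig

# Vertex a lives on N1 with one neighbor on N1, two on N2, one on N3.
neighbors_of_a = ["N1", "N2", "N2", "N3"]
print(gain(neighbors_of_a, 1, "N1", "N1"))  # 0: staying put
print(gain(neighbors_of_a, 1, "N1", "N2"))  # 13 - 7 - 6 = 0
print(gain(neighbors_of_a, 1, "N1", "N3"))  # 13 - 3 - 1 = 9
```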

SLIDE 16

Phase-1a: Probabilistic Vertex Migration

Migrate with a probability proportional to the gain:

Partition  Boundary Vtx  Migration Dest  Gain  Max Gain  Migration Probability
N1         a             N3              9     9         9/9
N2         b             N3              2     3         2/3
N2         d             N3              3     3         3/3
N3         e             N3
N3         g             N3

[Figure: example graph across N1, N2, N3 with network costs 6, 6, 1, 1]
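A short sketch of this step (structure and names assumed; the plans come from the table above):

```python
import random

# Each boundary vertex moves to its best destination with probability
# gain / max_gain of its partition, so not every vertex with a positive
# gain migrates in the same superstep.
plans = {  # (partition, vertex) -> (best destination, gain)
    ("N1", "a"): ("N3", 9),
    ("N2", "b"): ("N3", 2),
    ("N2", "d"): ("N3", 3),
}

max_gain = {}
for (part, _v), (_dest, g) in plans.items():
    max_gain[part] = max(max_gain.get(part, 0), g)

migrations = {}
for (part, v), (dest, g) in plans.items():
    if random.random() < g / max_gain[part]:  # probability proportional to gain
        migrations[v] = dest

print(migrations)  # 'a' and 'd' always move (9/9, 3/3); 'b' with probability 2/3
```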

SLIDE 17

Phase-1b: Balancing Partitions

 Quota-Based Vertex Migration

Q1: How much work should each overloaded partition migrate to each underloaded partition?

  ■ Potential gain computation (similar to the Phase-1a vertex gain computation)
  ■ Iteratively allocate quota, starting from the partition pair having the largest gain (see the sketch below)

Q2: What vertices to migrate?

  ■ Phase-1a vertex migration, but limited by the quota.
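A minimal sketch of the quota allocation (the loads and pair gains are made-up illustrative numbers, not from the slides):

```python
# Greedy quota allocation: overloaded partitions hand surplus load to
# underloaded ones, starting from the pair with the largest potential gain.
surplus = {"N1": 30}             # load above the balanced target (assumed)
deficit = {"N2": 10, "N3": 25}   # room below the balanced target (assumed)
pair_gain = {("N1", "N3"): 9, ("N1", "N2"): 2}  # potential gain per pair

quota = {}
for (src, dst), _gain in sorted(pair_gain.items(), key=lambda kv: -kv[1]):
    amount = min(surplus.get(src, 0), deficit.get(dst, 0))
    if amount > 0:
        quota[(src, dst)] = amount
        surplus[src] -= amount
        deficit[dst] -= amount

print(quota)  # {('N1', 'N3'): 25, ('N1', 'N2'): 5}
```

Phase-1a's vertex migration then runs as before, but a source partition stops sending to a destination once that pair's quota is exhausted.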

SLIDE 18

Planar: Physical Vertex Migration

★ Phase-1: Logical Vertex Migration (Migration Planning: what vertices to move? where to move?)
  ○ Phase-1a: Minimizing Comm Cost
  ○ Phase-1b: Ensuring Balanced Partitions
★ Phase-2: Physical Vertex Migration: perform the migration plan
★ Phase-3: Convergence Check: still beneficial?

[Figure: Planar runs between consecutive application supersteps Sk, Sk+1, Sk+2, ...]

SLIDE 19

Planar: Convergence Check

★ Phase-1: Logical Vertex Migration (Migration Planning: what vertices to move? where to move?)
  ○ Phase-1a: Minimizing Comm Cost
  ○ Phase-1b: Ensuring Balanced Partitions
★ Phase-2: Physical Vertex Migration: perform the migration plan
★ Phase-3: Convergence Check: still beneficial?

[Figure: Planar runs between consecutive application supersteps Sk, Sk+1, Sk+2, ...]

SLIDE 20

Phase-3: Convergence

★ Converge
  ○ improvement achieved per adaptation superstep < 𝜀
  ○ after 𝜐 consecutive adaptation supersteps
★ A new repartitioning epoch starts once there are enough changes (structure/load)

𝜀 = 1% and 𝜐 = 10 (via Sensitivity Analysis)

[Figure: repartitioning epochs across supersteps Sk, Sk+1, Sk+2, ...]
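Spelled out as a check, with the slide's parameter values (a minimal sketch; the function shape and bookkeeping are assumptions):

```python
EPSILON = 0.01   # per-superstep improvement threshold (1%)
UPSILON = 10     # consecutive low-improvement supersteps before converging

def converged(improvements):
    """improvements: fractional gain per adaptation superstep, newest last."""
    recent = improvements[-UPSILON:]
    return len(recent) == UPSILON and all(i < EPSILON for i in recent)

print(converged([0.05] * 5 + [0.005] * 10))  # True: ten low-gain steps in a row
print(converged([0.005] * 9))                # False: not enough steps yet
```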

SLIDE 21

Evaluation


 Microbenchmarks

  • Convergence Study (Param Selection)
  • Partitioning Quality

 Real-World Workloads

  • Breadth First Search (BFS)
  • Single Source Shortest Path (SSSP)

 Scalability Test

  • Scalability vs Graph Size
  • Scalability vs # of Partitions
  • Scalability vs Graph Size and # of Partitions
SLIDE 22

Partitioning Quality: Setup

Dataset: 12 datasets from various areas
# of Parts: 40 (two 20-core machines)
Initial Partitioners: HP (Hashing Partitioning), DG (Deterministic Greedy), LDG (Linear Deterministic Greedy)

SLIDE 23

Partitioning Quality: Datasets

Dataset          |V|          |E|            Description
wave             156,317      2,118,662      FEM
auto             448,695      6,629,222      FEM
333SP            3,712,815    22,217,266     FEM
CA-CondMat       108,300      373,756        Collaboration Network
DBLP             317,080      1,049,866      Collaboration Network
Email-Enron      36,692       183,831        Collaboration Network
as-skitter       1,696,415    22,190,596     Internet Topology
Amazon           334,863      925,872        Product Network
USA-roadNet      23,947,347   58,333,344     Road Network
roadNet-PA       1,090,919    6,167,592      Road Network
YouTube          3,223,589    24,447,548     Social Network
Com-LiveJournal  4,036,537    69,362,378     Social Network
Friendster       124,836,180  3,612,134,270  Social Network

SLIDE 24

Partitioning Quality:

Planar achieved up to 68% improvement

Improv.  Max  Avg.
HP       68%  53%
DG       46%  24%
LDG      69%  48%

SLIDE 25

Evaluation


 Microbenchmarks

  • Convergence Study (Param Selection)
  • Partitioning Quality

 Real-World Workloads

  • Breadth First Search (BFS)
  • Single Source Shortest Path (SSSP)

 Scalability Test

  • Scalability vs Graph Size
  • Scalability vs # of Partitions
  • Scalability vs Graph Size and # of Partitions
SLIDE 26

Real-World Workload: Setup

Cluster Configuration:

                     PittMPICluster (FDR InfiniBand)   Gordon (QDR InfiniBand)
# of Nodes           32                                1024
Network Topology     Single switch (32 nodes/switch)   4x4x4 3D torus of switches (16 nodes/switch)
Network Bandwidth    56Gbps                            8Gbps

Node Configuration:

                     PittMPICluster (Intel Haswell)    Gordon (Intel Sandy Bridge)
# of Sockets         2 (10 cores/socket)               2 (8 cores/socket)
L3 Cache             25MB                              20MB
Memory Bandwidth     65GB/s                            85GB/s

SLIDE 27

Planar: Avoiding Resource Contention on the Memory Subsystems of Multicore Machines

Degree of Contention (A. Zheng, EDBT'16):

    λ = Intra-Node Network Comm Cost / Maximal Inter-Node Network Comm Cost

System Bottleneck:
  PittMPICluster: Memory (λ=1)
  Gordon: Network (λ=0)

SLIDE 28

Real-World Workload: Baselines

Balanced Graph (Re)Partitioning

Partitioners (static graphs):
  • Offline methods: high quality, poor scalability
  • Online methods: moderate quality, high scalability

Repartitioners (dynamic graphs):
  • Offline methods: high quality, poor scalability
  • Online methods: moderate~high quality, high scalability

uniPlanar; Initial Partitioner: DG

SLIDE 29

BFS Exec. Time on PittMPICluster (λ=1):

Planar achieved up to 9x speedups

★ as-skitter: |V| = 1.6M, |E| = 22M
★ 60 partitions: three 20-core machines

[Figure: BFS execution times; speedups of 9x, 7.5x, 5.8x, 4.1x, 1.48x, 1.37x, and 1x across the compared partitioners]

SLIDE 30

BFS Comm Volume on PittMPICluster (λ=1):

Planar had the lowest intra-node comm volume

★ as-skitter: |V| = 1.6M, |E| = 22M
★ 60 partitions: three 20-core machines

Reduction   Intra-Socket  Inter-Socket
DG          51%           38%
METIS       51%           36%
PARMETIS    47%           34%
uniPLANAR   44%           28%
ARAGON      4.3%          0.8%
PARAGON     5.2%          2.6%

SLIDE 31

BFS Exec. Time on Gordon (λ=0):

Planar achieved up to 3.2x speedups

★ as-skitter: |V| = 1.6M, |E| = 22M
★ 48 partitions: three 16-core machines

[Figure: BFS execution times; speedups of 3.2x, 1.05x, 1.16x, 1.21x, and 1x across the compared partitioners]

SLIDE 32

BFS Comm. Volume on Gordon (λ=0):

Planar had the lowest inter-node comm volume

★ as-skitter: |V| = 1.6M, |E| = 22M
★ 48 partitions: three 16-core machines

[Figure: inter-node comm volume reductions of 51%, 25%, 11%, and 0.1% across the compared partitioners]

SLIDE 33

Conclusions

 PLANAR

  • Architecture-Aware Adaptive Graph Repartitioner
    • Communication Heterogeneity
    • Shared Resource Contention
  • Up to 9x speedups on real-world workloads
  • Scaled up to a graph with 3.6B edges

Acknowledgments:

  • Peyman Givi
  • Patrick Pisciuneri
  • Mark Silvis

Funding:

  • NSF OIA-1028162
  • NSF CBET-1250171

SLIDE 34

Thank You!

Email: anz28@cs.pitt.edu
Homepage: http://people.cs.pitt.edu/~anz28/
ADMT: http://db.cs.pitt.edu/group/

SLIDE 35

Backup Slides

SLIDE 36

Phase-3: Convergence (Param Selection)

𝜀 = 1% and 𝜐 = 10

Initial Partitioner: DG (Deterministic Greedy)
# of Parts: 40 (two 20-core nodes)

SLIDE 37

Scalability vs Graph Size: BFS Exec. Time

★ friendster: |V| = 124M, |E| = 3.6B
★ 60 partitions: three 20-core machines

Speedup (60 cores)
DG         1.55x
uniPLANAR  1.32x
PARAGON    1.08x

SLIDE 38

Scalability vs Graph Size: Repart. Time


★ friendster: |V| = 124M, |E| = 3.6B
★ 60 partitions: three 20-core machines

SLIDE 39

Scalability vs # of Partitions: BFS Exec. Time

Speedup (120 cores)
DG         2.9x
uniPLANAR  1.30x
PARAGON    1.15x

★ friendster: |V| = 124M, |E| = 3.6B
★ 60~120 partitions: three to six 20-core machines

SLIDE 40

Scalability vs # of Partitions: Repart. Time


★ friendster: |V| = 124M, |E| = 3.6B
★ 60~120 partitions: three to six 20-core machines

SLIDE 41

Intra-Node Shared Resource Contention

Send Buffer → Sending Core → Shared Buffer → Receiving Core → Receive Buffer

  • 1. Load: the sending core reads the message from the send buffer
  • 2a. Load / 2b. Write: the message is copied into the shared buffer
  • 3. Load: the receiving core reads the message from the shared buffer
  • 4a. Load / 4b. Write: the message is copied into the receive buffer

SLIDE 42

Intra-Node Shared Resource Contention

Cached Send/Shared/Receive Buffer: multiple copies of the same data in the LLC, contending for the LLC and MC

SLIDE 43

Intra-Node Shared Resource Contention

Cached Send/Shared Buffer and Cached Receive/Shared Buffer: multiple copies of the same data in the LLC, contending for the LLC, MC, and QPI

SLIDE 44

Inter-Node Comm Cost ? Intra-Node Comm Cost

[Figure: Node#1 and Node#2 connected by an RDMA-enabled network]

SLIDE 45

Inter-Node Comm Cost ≅ Intra-Node Comm Cost

★ InfiniBand: 1.7GB/s ~ 37.5GB/s
★ DDR3: 6.25GB/s ~ 16.6GB/s

★ Dual-socket Xeon E5v2 server with
  ○ DDR3-1600
  ○ 2 FDR 4x NICs per socket

Revisit the Impact of the Memory Subsystem Carefully!

[1] C. Binnig et al. The End of Slow Networks: It's Time for a Redesign. CoRR, 2015.

SLIDE 46

Planar: Avoiding Contention

[Figure: Node#1 (sending core, send buffer, IB HCA) transfers data over the network to Node#2 (receiving core, receive buffer, IB HCA)]