SLIDE 1

Paragon: Parallel Architecture-Aware Graph Partition Refinement Algorithm

Angen Zheng, Alexandros Labrinidis, Patrick Pisciuneri, Panos K. Chrysanthis, and Peyman Givi University of Pittsburgh

SLIDE 2

Importance of Graph Partitioning

 Applications of Graph Partitioning

  • Scientific Simulations
  • Distributed Graph Computation (Pregel, Hama, Giraph)
  • VLSI Design
  • Task Scheduling
  • Linear Programming

SLIDE 3

Target Workloads

★ Vertex
  ○ a unique identifier
  ○ a modifiable, user-defined value

★ Edge
  ○ a modifiable, user-defined value
  ○ a target vertex identifier

★ Vertex-Centric UDF
  ○ Change vertex/edge state
  ○ Send msgs to neighbours
  ○ Receive msgs from neighbours
  ○ Mutate the graph topology
  ○ Deactivate at end of the superstep
  ○ Reactivate by external msgs

Goals: balanced load distribution and minimizing the comm cost!
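The vertex-centric model described above can be sketched as a toy synchronous engine. This is an illustrative sketch only (all names are made up, not from Pregel/Giraph/Hama source): vertices run a UDF each superstep, deactivate at the end of it, and are reactivated by incoming messages.

```python
# Toy Pregel-style synchronous engine (illustrative only; hypothetical names).
class Vertex:
    def __init__(self, vid, value, out_edges):
        self.id = vid                 # a unique identifier
        self.value = value            # a modifiable, user-defined value
        self.out_edges = out_edges    # target vertex identifiers
        self.active = True

def run(vertices, compute, max_supersteps=30):
    """Run the vertex-centric UDF `compute` in synchronous supersteps."""
    inbox = {vid: [] for vid in vertices}
    for _ in range(max_supersteps):
        outbox = {vid: [] for vid in vertices}
        progressed = False
        for vid, v in vertices.items():
            msgs = inbox[vid]
            if msgs:
                v.active = True       # reactivated by external msgs
            if v.active:
                compute(v, msgs, outbox)
                progressed = True
        if not progressed:            # everyone halted, no msgs in flight
            break
        inbox = outbox

# Example UDF: propagate the minimum value (toy connected-components labeling).
def min_label(v, msgs, outbox):
    new = min([v.value] + msgs)
    if new < v.value or not msgs:     # first superstep: always announce
        v.value = new
        for t in v.out_edges:         # send msg to neighbours
            outbox[t].append(v.value)
    v.active = False                  # deactivate at end of the superstep
```

On a path graph a-b-c with values 3, 2, 1 (edges listed in both directions), all vertices converge to value 1 after a few supersteps.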

SLIDE 4

A Balanced Partitioning = Even Load Distribution


SLIDE 5

Minimal Edge-Cut = Minimal Data Comm

Minimal Data Comm ≠ Minimal Comm Cost

SLIDE 6

Roadmap

Introduction ✔ Heterogeneity State of the Art PARAGON Contention Experiments

(Chart axes: "% of audience asleep" vs. "# of slides")

SLIDE 7

Nonuniform Inter-Node Network Comm Cost

Comm costs vary significantly as the locations of the communicating nodes change!

SLIDE 8

Nonuniform Intra-Node Network Comm Cost

Cores sharing more cache levels communicate faster!

SLIDE 9

Inter-Node Comm Cost > Intra-Node Comm Cost

(Figure: Node#1 and Node#2 connected via the network, e.g. Ethernet or IPoIB)

SLIDE 10

Minimal Edge-Cut = Minimal Data Comm ≠ Minimal Comm Cost

Relative network comm cost matrix:

        N1   N2   N3
    N1   –    1    6
    N2   1    –    1
    N3   6    1    –

  • 3 edge-cut
  • 3 unit data comm
  • 8 unit comm cost (8 = 1 × 6 + 2 × 1)

SLIDE 11

Minimal Edge-Cut = Minimal Data Comm ≠ Minimal Comm Cost

Relative network comm cost matrix:

        N1   N2   N3
    N1   –    1    6
    N2   1    –    1
    N3   6    1    –

  • 4 edge-cut
  • 4 unit data comm
  • 4 unit comm cost (4 = 1 × 1 + 3 × 1)

Group neighbouring vertices as close as possible!
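The arithmetic on these two slides can be checked with a short script. The 6-cycle graph below is made up purely for illustration (it is not the graph in the slides); the cost matrix is the one shown above, and the two placements reproduce the 3-cut/cost-8 vs. 4-cut/cost-4 contrast:

```python
# Relative network comm cost matrix from the slides (symmetric).
COST = {
    frozenset({"N1", "N2"}): 1,
    frozenset({"N2", "N3"}): 1,
    frozenset({"N1", "N3"}): 6,
}

def comm_stats(edges, placement):
    """Return (edge-cut, comm cost) for a vertex-to-node placement."""
    cut = [(u, v) for u, v in edges if placement[u] != placement[v]]
    cost = sum(COST[frozenset({placement[u], placement[v]})] for u, v in cut)
    return len(cut), cost

# A made-up 6-cycle used only for illustration.
EDGES = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f"), ("f", "a")]

# Fewer cut edges, but one of them crosses the expensive N1-N3 link.
p_min_cut = {"a": "N1", "b": "N1", "c": "N3", "d": "N3", "e": "N2", "f": "N2"}
# More cut edges, but every one of them lands on a cheap link.
p_min_cost = {"a": "N1", "b": "N2", "c": "N3", "d": "N2", "e": "N2", "f": "N1"}

print(comm_stats(EDGES, p_min_cut))   # 3 cut edges, comm cost 8
print(comm_stats(EDGES, p_min_cost))  # 4 cut edges, comm cost 4
```

A smaller edge-cut can thus cost more than a larger one once link heterogeneity is taken into account, which is exactly the point of the slide.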

SLIDE 12

Roadmap

Introduction ✔ Heterogeneity ✔ State of the Art PARAGON Contention Experiments

(Chart axes: "% of audience asleep" vs. "# of slides")

SLIDE 13

Overview of the State-of-the-Art

Balanced Graph (Re)Partitioning

Partitioners (static graphs):
  • Offline Methods (High Quality, Poor Scalability): Metis
  • Online Methods (Moderate Quality, High Scalability): ICA3PP’08, SoCC’12, TKDE’15, BigData’15, DG/LDG, Fennel

Repartitioners (dynamic graphs):
  • Offline Methods (High Quality, Poor Scalability): Parmetis; heterogeneity-aware: Aragon
  • Online Methods (Moderate~High Quality, High Scalability): CatchW, xdgp, Mizan, LogGP, Hermes; heterogeneity-aware: Paragon

SLIDE 14

Our Prior Work: Aragon

 A sequential architecture-aware graph partition refinement algorithm.

  • Input:
    ○ A partitioned graph
    ○ The relative network comm cost matrix
  • Output:
    ○ A partitioning with improved mapping of the comm pattern to the underlying hardware topology.

[1] Angen Zheng, Alexandros Labrinidis, and Panos K. Chrysanthis. Architecture-Aware Graph Repartitioning for Data-Intensive Scientific Computing. BigGraphs, 2014.

SLIDE 15

Our Prior Work: Aragon

(Figure: graph G partitioned into P1~P9, mapped onto nodes N1~N9; Aragon performs heterogeneity-aware refinement, more details in the paper)

Aragon:
  ○ Assumes one node (N5) can hold the entire graph in memory
  ○ Prefers to work in offline mode

SLIDE 16

Roadmap

Introduction ✔ Heterogeneity ✔ State of the Art ✔ PARAGON Contention Experiments

(Chart axes: "% of audience asleep" vs. "# of slides")

SLIDE 17

Paragon

 Overview:
  • Parallel Architecture-Aware Graph Partition Refinement Algorithm

 Goal:
  • Group neighbouring vertices as close as possible

Paragon vs Aragon:
  ○ lower overhead
  ○ scales to much larger graphs

SLIDE 18

Paragon: Partition Grouping

(Figure: partitions P1~P9, hosted on nodes N1~N9, clustered into partition groups)

SLIDE 19

Paragon: Group Server Selection

(Figure: one group server is selected per partition group; here nodes N2, N9, and N8)

SLIDE 20

Paragon: Sending “Partition” to Group Servers

(Figure: each node sends its partition to its group server, N2, N9, or N8)

Only send boundary vertices
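The boundary-vertex filter on this slide is easy to state precisely. A hedged sketch (hypothetical helper, not Paragon's actual code): a vertex is a boundary vertex of a partition if it has at least one neighbour in another partition, and only those vertices can move during refinement, so only they need to be shipped to the group server.

```python
def boundary_vertices(edges, placement, part):
    """Vertices of `part` that have at least one neighbour in another partition.

    Only these vertices can affect the edge-cut, so only they need to be
    sent to the group server for refinement.
    """
    boundary = set()
    for u, v in edges:
        if placement[u] == part and placement[v] != part:
            boundary.add(u)
        if placement[v] == part and placement[u] != part:
            boundary.add(v)
    return boundary

# Toy example: a path a-b-c-d split across two partitions.
edges = [("a", "b"), ("b", "c"), ("c", "d")]
placement = {"a": "P1", "b": "P1", "c": "P2", "d": "P2"}
print(boundary_vertices(edges, placement, "P1"))  # {'b'}
```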

SLIDE 21

Paragon: Parallel Refinement

(Figure: each group server, N2, N9, N8, runs Aragon refinement on its group in parallel)

# of Groups
  ○ Degree of Parallelism

SLIDE 22

Paragon: Parallel Refinement

(Figure: parallel Aragon refinement with varying numbers of groups: 36, 16, 9, 6)

# of Groups
  ○ Degree of Parallelism
  ○ Parallelism vs Quality

SLIDE 23

Paragon: Shuffle Refinement

(Figure: group servers N2, N9, N8 refine their groups with Aragon in parallel, then swap partitions across groups)

Repeat k times

To increase the # of partition pairs being refined!
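Putting SLIDES 18-23 together, the refinement driver might look like the sketch below. This is a conceptual sketch under stated assumptions, not Paragon's implementation: `refine_pair` is a no-op stub standing in for Aragon's pairwise refinement, the round-robin grouping and random shuffle are illustrative choices, and all names are hypothetical.

```python
import itertools
import random

def refine_pair(parts, a, b, cost_matrix):
    # Stand-in for Aragon's pairwise, architecture-aware refinement of
    # partitions a and b; a real implementation migrates boundary vertices.
    pass

def paragon_refine(partitions, num_groups, cost_matrix, k=2, seed=0):
    """Group partitions, refine each group's pairs (conceptually one group
    server per group, running in parallel), then shuffle partitions across
    groups and repeat k times so that more partition pairs get refined."""
    rng = random.Random(seed)
    parts = list(partitions)
    for _ in range(k):
        # Round-robin grouping: one group per group server.
        groups = [parts[i::num_groups] for i in range(num_groups)]
        for group in groups:
            for a, b in itertools.combinations(group, 2):
                refine_pair(parts, a, b, cost_matrix)
        rng.shuffle(parts)            # swap partitions across groups
    return parts
```

The number of groups trades parallelism against quality: more groups mean more parallel group servers but fewer partition pairs refined per round, which the shuffle rounds compensate for.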

SLIDE 24

Roadmap

Introduction ✔ Heterogeneity ✔ State of the Art ✔ PARAGON ✔ Contention Experiments

(Chart axes: "% of audience asleep" vs. "# of slides")

SLIDE 25

Inter-Node Comm Cost ? Intra-Node Comm Cost

(Figure: Node#1 and Node#2 connected via an RDMA-enabled network)

SLIDE 26

Inter-Node Comm Cost ≅ Intra-Node Comm Cost

[2] C. Binnig, U. Çetintemel, A. Crotty, A. Galakatos, T. Kraska, E. Zamanian, and S. B. Zdonik. The End of Slow Networks: It’s Time for a Redesign. CoRR, 2015.

★ Dual-socket Xeon E5v2 server with
  ○ DDR3-1600
  ○ 2 FDR 4x NICs per socket

★ Infiniband: 1.7GB/s~37.5GB/s
★ DDR3: 6.25GB/s~16.6GB/s

Revisit the Impact of the Memory Subsystem Carefully!

SLIDE 27

Intra-Node Shared Resource Contention

Shared-memory transfer path: Send Buffer → Sending Core → Shared Buffer → Receiving Core → Receive Buffer

  • 1. Load
  • 2a. Load
  • 2b. Write
  • 3. Load
  • 4a. Load
  • 4b. Write

SLIDE 28

Intra-Node Shared Resource Contention

Cached Send/Shared/Receive Buffer: multiple copies of the same data in the LLC, contending for the LLC and MC.

SLIDE 29

Intra-Node Shared Resource Contention

Cached Send/Shared Buffer; Cached Receive/Shared Buffer: multiple copies of the same data in the LLC, contending for the LLC, MC, and QPI.

SLIDE 30

Paragon: Avoiding Contention

Intra-Node Network Comm Cost: Maximal Inter-Node Network Comm Cost (Small HPC Clusters); Degree of Contention (Cloud/Large Clusters)

SLIDE 31

Paragon: Avoiding Contention

(Figure: RDMA transfer from Node#1's sending core and send buffer through the IB HCAs to Node#2's receive buffer and receiving core)

SLIDE 32

Roadmap

Introduction ✔ Heterogeneity ✔ State of the Art ✔ PARAGON ✔ Contention ✔ Experiments

(Chart axes: "% of audience asleep" vs. "# of slides")

SLIDE 33

Evaluation

 MicroBenchmarks

  • Degree of Refinement Parallelism
  • Varying Shuffle Refinement Times
  • Varying Initial Partitioners

 Real-World Workloads

  • Breadth First Search (BFS)
  • Single Source Shortest Path (SSSP)

 Billion-Edge Graph Scaling

SLIDE 34

Evaluation

 MicroBenchmarks

  • Degree of Refinement Parallelism
  • Varying Shuffle Refinement Times
  • Varying Initial Partitioners

 Real-World Workloads

  • Breadth First Search (BFS)
  • Single Source Shortest Path (SSSP)

 Billion-Edge Graph Scaling

SLIDE 35

Degree of Refinement Parallelism: Refinement Time

★ com-lj: |V|=4M, |E|=69M
★ 40 partitions: two 20-core machines
★ Initial Partitioner: DG (deterministic greedy)
★ # of Shuffle Times: 0

(Chart: refinement time vs. degree of parallelism, compared against Aragon)

SLIDE 36

Degree of Refinement Parallelism: Partitioning Quality

★ com-lj: |V|=4M, |E|=69M
★ 40 partitions: two 20-core machines
★ Initial Partitioner: DG (deterministic greedy)
★ # of Shuffle Times: 0

SLIDE 37

Evaluation

 MicroBenchmarks

  • Degree of Refinement Parallelism
  • Varying Shuffle Refinement Times
  • Varying Initial Partitioners

 Real-World Workloads

  • Breadth First Search (BFS)
  • Single Source Shortest Path (SSSP)

 Billion-Edge Graph Scaling

SLIDE 38

Varying Shuffle Refinement Times

★ com-lj: |V|=4M, |E|=69M
★ 40 partitions: two 20-core machines
★ Initial Partitioner: DG (deterministic greedy)
★ Deg. of Parallelism: 8

With # of shuffle refinement times > 10:
  ○ Paragon had lower refinement overhead: 8~10s vs 33s (Paragon vs Aragon)
  ○ Paragon produced better decompositions: 0~2.6% (Paragon vs Aragon)

SLIDE 39

Evaluation

 MicroBenchmarks

  • Degree of Refinement Parallelism
  • Varying Shuffle Refinement Times
  • Varying Initial Partitioners

 Real-World Workloads

  • Breadth First Search (BFS)
  • Single Source Shortest Path (SSSP)

 Billion-Edge Graph Scaling

SLIDE 40

Varying Initial Partitioners

Dataset: 12 datasets from various areas
# of Parts: 40 (two 20-core machines)
Initial Partitioner: HP/DG/LDG
Deg. of Parallelism: 8
# of Refinement Times: 8

HP: Hash Partitioning
DG: Deterministic Greedy Partitioning
LDG: Linear Deterministic Greedy Partitioning

SLIDE 41

Impact of Varying Initial Partitioners: Partitioning Quality

Partitioning quality improvement:

    Initial Partitioner   Max Improv.   Avg. Improv.
    HP                    58%           43%
    DG                    29%           17%
    LDG                   53%           36%

SLIDE 42

Evaluation

 MicroBenchmarks

  • Degree of Refinement Parallelism
  • Varying Shuffle Refinement Times
  • Varying Initial Partitioners

 Real-World Workloads

  • Breadth First Search (BFS)
  • Single Source Shortest Path (SSSP)

 Billion-Edge Graph Scaling

SLIDE 43

BFS: Platform

  • PittMPICluster: 32 nodes connected via a single switch with 56Gbps FDR Infiniband (bottleneck: memory, 𝛍=1)
  • Gordon Supercomputer: 4x4x4 3D torus of switches connected via QDR Infiniband, with 16 compute nodes attached to each switch (with 8Gbps link bandwidth) (bottleneck: network, 𝛍=0)

SLIDE 44

BFS: Competitors

  • Initial Partitioner: DG
  • Competitors: DG, Metis, Parmetis, and uniParagon

(The state-of-the-art taxonomy from SLIDE 13, repeated with the competitors highlighted)

SLIDE 45

BFS on PittMPICluster (𝛍=1): Exec. Time

★ as-skitter: |V|=1.6M, |E|=22M
★ 60 partitions: three 20-core machines
★ deg. of parallelism: 8
★ # of shuffle refinement times: 8

(Chart annotations: 5.9x, 6.7x, 5.9x, 2.7x speedups)

SLIDE 46

BFS on PittMPICluster (𝛍=1): Comm Vol

★ as-skitter: |V|=1.6M, |E|=22M
★ 60 partitions: three 20-core machines
★ deg. of parallelism: 8
★ # of shuffle refinement times: 8

Communication volume reduction:

                  Intra-Socket   Inter-Socket
    DG            62%            55%
    METIS         53%            55%
    PARMETIS      15%            17%
    uniPARAGON    62%            39%

SLIDE 47

BFS on Gordon (𝛍=0)

★ as-skitter: |V|=1.6M, |E|=22M
★ 48 partitions: three 16-core machines
★ deg. of parallelism: 8
★ # of shuffle refinement times: 8

(Chart annotations: 2.5x, 1.5x, 50%, 38%)

SLIDE 48

Evaluation

 MicroBenchmarks

  • Degree of Refinement Parallelism
  • Varying Shuffle Refinement Times
  • Varying Initial Partitioners

 Real-World Workloads

  • Breadth First Search (BFS)
  • Single Source Shortest Path (SSSP)

 Billion-Edge Graph Scaling

SLIDE 49

Billion-Edge Graph Scaling: BFS Exec. Time

★ friendster: |V|=124M, |E|=3.6B
★ 60 partitions: three 20-core machines
★ deg. of parallelism: 10
★ # of shuffle refinement times: 10

(Chart annotation: 1.65x at 60 cores)

SLIDE 50

Billion-Edge Graph Scaling: Refine Time

★ friendster: |V|=124M, |E|=3.6B
★ 60 partitions: three 20-core machines
★ deg. of parallelism: 10
★ # of shuffle refinement times: 10

(Chart annotation: 1.36x)

SLIDE 51

Conclusions

 Introduced PARAGON

  • Parallel Architecture-Aware Graph Partition Refinement Algorithm

 PARAGON addresses:

  • Scalability
  • Communication Heterogeneity
  • Shared Resource Contention

 Extensive experimental study:

  • Achieved up to 6.7x speedups on real-world workloads
  • Scaled up to 3.6 billion edges


Acknowledgments:

  • Jack Lange
  • Albert DeFusco
  • Kim Wong
  • Mark Silvis

Funding:

  • NSF OIA-1028162
  • NSF CBET-1250171

SLIDE 52

Thank You!

Email: angen.zheng@gmail.com or anz28@cs.pitt.edu
Homepage: http://people.cs.pitt.edu/~anz28/
ADMT: http://db.cs.pitt.edu/group/