  1. Paragon: Parallel Architecture-Aware Graph Partition Refinement Algorithm
     Angen Zheng, Alexandros Labrinidis, Patrick Pisciuneri, Panos K. Chrysanthis, and Peyman Givi, University of Pittsburgh

  2. Importance of Graph Partitioning
      Applications of graph partitioning:
       o Scientific simulations
       o Distributed graph computation (Pregel, Hama, Giraph)
       o VLSI design
       o Task scheduling
       o Linear programming

  3. Target Workloads
     ★ Vertex
       ○ a unique identifier
       ○ a modifiable, user-defined value
     ★ Edge
       ○ a modifiable, user-defined value
       ○ a target vertex identifier
     ★ Vertex-centric UDF
       ○ change vertex/edge state
       ○ send messages to neighbors
       ○ receive messages from neighbors
       ○ mutate the graph topology
       ○ deactivate at the end of the superstep
       ○ reactivate upon external messages
     Goals: minimize the communication cost and keep the load distribution balanced.
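A minimal sketch of this vertex-centric model, with single-source shortest paths (one of the workloads evaluated later) as the UDF; the SSSPVertex class and the compute/send signatures are illustrative assumptions, not the API of Pregel, Hama, or Giraph:

```python
import math

class SSSPVertex:
    """Minimal Pregel-style vertex: an id, a mutable value, and out-edges."""
    def __init__(self, vid, out_edges, is_source=False):
        self.id = vid                                  # unique identifier
        self.value = 0.0 if is_source else math.inf    # user-defined value
        self.out_edges = out_edges                     # {target id: edge weight}
        self.active = True

    def compute(self, messages, send):
        """User-defined function run once per superstep.

        messages: list of candidate distances received from neighbors
        send(target, msg): queues a message for the next superstep
        """
        best = min(messages, default=math.inf)
        if best < self.value or (not messages and self.value == 0.0):
            self.value = min(best, self.value)
            for target, weight in self.out_edges.items():
                send(target, self.value + weight)      # message to each neighbor
        self.active = False   # vote to halt; an incoming message reactivates it
```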

  4. A Balanced Partitioning = Even Load Distribution
     (Figure: a graph split evenly across three compute nodes N1, N2, N3.)

  5. Minimal Edge-Cut = Minimal Data Comm
     (Figure: the same graph on N1, N2, N3 with the fewest edges crossing partitions.)
     Minimal data comm ≠ minimal comm cost.

  6. Roadmap: Introduction ✔, Heterogeneity, State of the Art, Paragon, Contention, Experiments
     (Figure: a tongue-in-cheek plot of % of audience asleep vs. # of slides.)

  7. Nonuniform Inter-Node Network Comm Cost: communication costs vary considerably with where the communicating nodes sit in the network topology.

  8. Nonuniform Intra-Node Network Comm Cost: cores sharing more cache levels communicate faster.

  9. Inter-Node Comm Cost > Intra-Node Comm Cost
     (Figure: Node#1 and Node#2 connected by a network such as Ethernet or IPoIB.)

  10. Minimal Edge-Cut = Minimal Data Comm ≠ Minimal Comm Cost
      Relative comm cost matrix:
            N1  N2  N3
        N1   -   1   6
        N2   1   -   1
        N3   6   1   -
      • 3 edge-cut
      • 3 units of data comm
      • 8 units of comm cost (8 = 1 * 6 + 2 * 1)

  11. Minimal Edge-Cut = Minimal Data Comm ≠ Minimal Comm Cost
      (Same graph and cost matrix as above, with the vertices regrouped across N1, N2, N3.)
      • 4 edge-cut
      • 4 units of data comm
      • 4 units of comm cost (4 = 1 * 1 + 3 * 1)
      Group neighboring vertices as close as possible!
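The slides' arithmetic can be checked with a small script. The relative cost matrix below is the one shown above; the 6-cycle graph and the two vertex placements are made-up stand-ins for the figure's graph, chosen so that they reproduce the slides' numbers (3 cut edges with cost 8 vs. 4 cut edges with cost 4):

```python
# Edge-cut vs. data volume vs. comm cost, as on slides 10-11.
COST = {("N1", "N2"): 1, ("N2", "N3"): 1, ("N1", "N3"): 6}  # relative comm costs

def cost(a, b):
    """Symmetric lookup of the relative cost between two compute nodes."""
    return COST.get((a, b)) or COST[(b, a)]

# A 6-cycle graph: a-b-c-d-f-e-a (illustrative stand-in for the figure's graph).
EDGES = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "f"), ("f", "e"), ("e", "a")]

def metrics(placement):
    """Return (edge_cut, data_comm, comm_cost) for a vertex -> node placement."""
    cut = [(u, v) for u, v in EDGES if placement[u] != placement[v]]
    edge_cut = len(cut)
    data_comm = edge_cut                      # one unit of data per cut edge
    comm_cost = sum(cost(placement[u], placement[v]) for u, v in cut)
    return edge_cut, data_comm, comm_cost

# Slide 10: fewer cut edges, but one of them crosses the expensive N1-N3 link.
p1 = {"a": "N1", "b": "N1", "c": "N2", "d": "N2", "e": "N3", "f": "N3"}
# Slide 11: one more cut edge, but every cut edge stays on a cheap link.
p2 = {"a": "N1", "b": "N1", "c": "N2", "e": "N2", "d": "N3", "f": "N3"}

print(metrics(p1))  # (3, 3, 8) -> 8 = 1*6 + 2*1
print(metrics(p2))  # (4, 4, 4) -> 4 = 4*1
```

The second placement cuts more edges but keeps every cut edge on a cheap link, which is exactly why the refinement targets comm cost rather than edge-cut.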

  12. Roadmap: Introduction ✔, Heterogeneity ✔, State of the Art, Paragon, Contention, Experiments

  13. Overview of the State of the Art: Balanced Graph (Re)Partitioning
      Partitioners (static graphs):
       o Offline methods (high quality, poor scalability): Metis; heterogeneity-aware: ICA3PP’08, Aragon
       o Online methods (moderate quality, high scalability): DG/LDG, Fennel
      Repartitioners (dynamic graphs):
       o Offline methods (high quality, poor scalability): ParMetis; heterogeneity-aware: Paragon
       o Online methods (moderate~high quality, high scalability): Hermes, CatchW, xdgp, Mizan, LogGP, and the SoCC’12, TKDE’15, and BigData’15 systems

  14. Our Prior Work: Aragon
      A sequential architecture-aware graph partition refinement algorithm [1].
      o Input:
        • a partitioned graph
        • the relative network comm cost matrix
      o Output:
        • a partitioning with an improved mapping of the communication pattern to the underlying hardware topology
      [1] Angen Zheng, Alexandros Labrinidis, and Panos K. Chrysanthis. Architecture-Aware Graph Repartitioning for Data-Intensive Scientific Computing. BigGraphs, 2014.
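As a flavor of what such refinement optimizes, here is a hedged sketch of an architecture-aware move gain; this is not Aragon's actual algorithm, only an illustration that the benefit of moving a vertex is weighted by the relative cost matrix rather than by the plain edge-cut:

```python
# Illustrative architecture-aware gain: moving vertex v from partition p to q
# is attractive when it lowers the comm cost weighted by the relative network
# cost matrix C (C[i][j] = cost between the nodes hosting partitions i and j),
# not merely the number of cut edges. Data structures are assumptions.

def move_gain(v, p, q, part, adj, C):
    """Reduction in weighted comm cost if v moves from partition p to q.

    part: dict vertex -> current partition
    adj:  dict vertex -> {neighbor: edge weight}
    C:    relative comm-cost matrix indexed by partition (C[i][i] == 0)
    """
    gain = 0
    for u, w in adj[v].items():
        gain += w * (C[p][part[u]] - C[q][part[u]])
    return gain
```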

  15. Our Prior Work: Aragon
      (Figure: partitions P1–P9, hosted on nodes N1–N9, are all gathered on one node, N5, which runs the heterogeneity-aware refinement; more details in the paper.)
      Aragon assumes N5 can hold the entire graph in memory and prefers to work in offline mode.

  16. Roadmap: Introduction ✔, Heterogeneity ✔, State of the Art ✔, Paragon, Contention, Experiments

  17. Paragon
      Overview:
       o a parallel architecture-aware graph partition refinement algorithm
      Goal:
       o group neighboring vertices as close as possible
      Paragon vs. Aragon:
       ○ lower overhead
       ○ scales to much larger graphs

  18. Paragon: Partition Grouping
      (Figure: the nine partitions P1–P9, hosted on nodes N1–N9, are clustered into groups: {P1, P2, P3}, {P4, P6, P9}, {P5, P7, P8}.)
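A toy sketch of the grouping step; Paragon's real grouping policy may differ (the figure above uses a different assignment), this simply shows the partitions being split into fixed-size groups that can be refined independently:

```python
def group_partitions(partitions, group_size):
    """Split the partition list into consecutive groups of group_size."""
    return [partitions[i:i + group_size]
            for i in range(0, len(partitions), group_size)]

groups = group_partitions([f"P{i}" for i in range(1, 10)], 3)
# [['P1', 'P2', 'P3'], ['P4', 'P5', 'P6'], ['P7', 'P8', 'P9']]
```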

  19. Paragon: Group Server Selection
      (Figure: one node is chosen as the group server for each group: N2 for {P1, P2, P3}, N9 for {P4, P6, P9}, N8 for {P5, P7, P8}.)

  20. Paragon: Sending “Partition” to Group Servers
      (Figure: each partition ships its portion of the graph to its group server.)
      Only the boundary vertices are sent.
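A small sketch of how the boundary vertices of a partition could be identified; the data structures are illustrative assumptions:

```python
def boundary_vertices(vertices, adj, part, p):
    """Vertices of partition p that have at least one neighbor outside p.

    adj:  dict vertex -> iterable of neighbors
    part: dict vertex -> partition id
    """
    return {v for v in vertices
            if part[v] == p and any(part[u] != p for u in adj[v])}
```

Shipping only these vertices keeps the subgraph each group server holds much smaller than the full graph.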

  21. Paragon: Parallel Refinement
      (Figure: each group server runs Aragon on its own group's partitions concurrently.)
      The number of groups determines the degree of parallelism.

  22. Paragon: Parallel Refinement
      (Figure: same setup as above, annotated with the number of partition pairs refined per round: 36, 16, 9, 6 as the number of groups grows.)
      The number of groups determines the degree of parallelism; more parallelism means fewer partition pairs are refined together, so there is a parallelism vs. quality trade-off.
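The trade-off can be made concrete with a little arithmetic. The sketch below counts how many partition pairs can be refined in one round when the nine partitions are split as evenly as possible; assuming the slide's 36 / 16 / 9 / 6 figures correspond to 1, 2, 3, and 4 groups, it reproduces them exactly:

```python
from math import comb

def pairs_refined(num_partitions, num_groups):
    """Partition pairs refined in one round with an as-even-as-possible split."""
    base, extra = divmod(num_partitions, num_groups)
    sizes = [base + 1] * extra + [base] * (num_groups - extra)
    return sum(comb(s, 2) for s in sizes)

for g in (1, 2, 3, 4):
    print(g, pairs_refined(9, g))   # 36, 16, 9, 6
```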

  23. Paragon: Shuffle Refinement
      (Figure: after each parallel refinement round, the group servers N2, N9, and N8 swap partitions so that each refines a different mix of P1–P9.)
      Repeat k times to increase the number of partition pairs being refined.
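A rough sketch of the shuffle-refinement loop; refine_group stands in for running Aragon on one group at its group server, and the random shuffle policy is an illustrative assumption rather than Paragon's actual swap strategy:

```python
import random

def shuffle_refine(partitions, k, group_size, refine_group):
    """Refine groups, reshuffle partitions across groups, and repeat k times."""
    for _ in range(k):
        groups = [partitions[i:i + group_size]
                  for i in range(0, len(partitions), group_size)]
        for g in groups:            # in Paragon these run in parallel,
            refine_group(g)         # one group per group server
        random.shuffle(partitions)  # swap partitions among groups
    return partitions
```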

  24. Roadmap: Introduction ✔, Heterogeneity ✔, State of the Art ✔, Paragon ✔, Contention, Experiments

  25. Inter-Node Comm Cost ? Intra-Node Comm Cost
      (Figure: Node#1 and Node#2 connected by an RDMA-enabled network.)

  26. Inter-Node Comm Cost ≅ Intra-Node Comm Cost
      ★ Dual-socket Xeon E5v2 server with DDR3-1600 and two FDR 4x InfiniBand NICs per socket
      ★ InfiniBand: 1.7 GB/s ~ 37.5 GB/s
      ★ DDR3: 6.25 GB/s ~ 16.6 GB/s
      Revisit the impact of the memory subsystem carefully!
      [2] C. Binnig, U. Çetintemel, A. Crotty, A. Galakatos, T. Kraska, E. Zamanian, and S. B. Zdonik. The End of Slow Networks: It's Time for a Redesign. CoRR, 2015.

  27. Intra-Node Shared Resource Contention
      (Figure: intra-node message passing copies data from the sending core's send buffer into a shared buffer (steps 1, 2a, 2b) and from the shared buffer into the receiving core's receive buffer (steps 3, 4a, 4b), each step a load or a write.)

  28. Intra-Node Shared Resource Contention
      With the send, shared, and receive buffers all cached, multiple copies of the same data sit in the LLC, contending for the LLC and the memory controller (MC).

  29. Intra-Node Shared Resource Contention
      When the communicating cores are on different sockets, the cached send/shared buffer and the cached receive/shared buffer again put multiple copies of the same data in the LLC, now contending for the LLC, the MC, and QPI.

  30. Paragon: Avoiding Contention
      Paragon models the degree of contention as:
       o the intra-node network comm cost (small HPC clusters), or
       o the maximal inter-node network comm cost (cloud / large clusters).
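One way to picture this heuristic (an assumption for illustration, not Paragon's actual construction) is as an adjustment of the relative comm-cost matrix handed to the refiner: intra-node entries keep their measured value on small HPC clusters, but are raised to the maximal inter-node cost on cloud or large clusters, so the refiner does not pack too much communication inside one node:

```python
def contention_aware_costs(measured, same_node, mode="hpc"):
    """Return an adjusted relative comm-cost matrix.

    measured:  measured[i][j] is the relative comm cost between partitions i, j
    same_node: same_node(i, j) is True if partitions i and j share a node
    mode:      "hpc" keeps measured intra-node costs, "cloud" raises them
               to the maximal inter-node cost
    """
    n = len(measured)
    max_inter = max(measured[i][j] for i in range(n) for j in range(n)
                    if not same_node(i, j))
    adjusted = [row[:] for row in measured]
    for i in range(n):
        for j in range(n):
            if i != j and same_node(i, j):
                adjusted[i][j] = measured[i][j] if mode == "hpc" else max_inter
    return adjusted
```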

  31. Paragon: Avoiding Contention
      (Figure: the sending core on Node#1 and the receiving core on Node#2 communicate through their send/receive buffers and the nodes' InfiniBand HCAs.)

  32. Roadmap: Introduction ✔, Heterogeneity ✔, State of the Art ✔, Paragon ✔, Contention ✔, Experiments

  33. Evaluation
      MicroBenchmarks
       o Degree of Refinement Parallelism
       o Varying Shuffle Refinement Times
       o Varying Initial Partitioners
      Real-World Workloads
       o Breadth-First Search (BFS)
       o Single-Source Shortest Path (SSSP)
      Billion-Edge Graph Scaling

  34. Evaluation (outline repeated; next: Degree of Refinement Parallelism)

  35. Degree of Refinement Parallelism: Refinement Time
      ★ com-lj: |V| = 4M, |E| = 69M
      ★ 40 partitions: two 20-core machines
      ★ Initial partitioner: DG (deterministic greedy)
      ★ # of shuffle times: 0
      (Figure: refinement time vs. degree of parallelism, with Aragon as the baseline.)

  36. Degree of Refinement Parallelism: Partitioning Quality
      ★ com-lj: |V| = 4M, |E| = 69M
      ★ 40 partitions: two 20-core machines
      ★ Initial partitioner: DG (deterministic greedy)
      ★ # of shuffle times: 0
      (Figure: partitioning quality vs. degree of parallelism.)

  37. Evaluation (outline repeated; next: Varying Shuffle Refinement Times)

  38. Varying Shuffle Refinement Times
      ★ com-lj: |V| = 4M, |E| = 69M
      ★ 40 partitions: two 20-core machines
      ★ Initial partitioner: DG (deterministic greedy)
      ★ Degree of parallelism: 8
      With more than 10 shuffle refinement rounds:
       ○ Paragon had lower refinement overhead
         ■ 8~10 s vs. 33 s (Paragon vs. Aragon)
       ○ Paragon produced decompositions that were as good or better
         ■ 0~2.6% better than Aragon's

  39. Evaluation (outline repeated; next: Varying Initial Partitioners)

  40. Varying Initial Partitioners
      Dataset:                 12 datasets from various areas
      # of parts:              40 (two 20-core machines)
      Initial partitioner:     HP / DG / LDG
      Degree of parallelism:   8
      # of refinement times:   8
      HP: hashing partitioning; DG: deterministic greedy partitioning; LDG: linear deterministic greedy partitioning.
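For context, LDG (linear deterministic greedy) is a streaming heuristic that places each arriving vertex with the partition holding most of its already-placed neighbors, discounted by how full that partition is. The sketch below follows the usual formulation of LDG; the variable names and the capacity choice are illustrative assumptions:

```python
def ldg(stream, neighbors, k):
    """Linear deterministic greedy streaming partitioner (sketch).

    stream:    iterable of vertices in arrival order
    neighbors: dict vertex -> iterable of neighbor vertices
    k:         number of partitions
    """
    parts = [set() for _ in range(k)]
    assignment = {}
    capacity = max(1, len(neighbors) / k)   # balanced target size per partition
    for v in stream:
        def score(i):
            placed = sum(1 for u in neighbors[v] if assignment.get(u) == i)
            return placed * (1 - len(parts[i]) / capacity)
        best = max(range(k), key=score)      # ties broken by lowest index
        parts[best].add(v)
        assignment[v] = best
    return assignment
```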

  41. Impact of Varying Initial Partitioners: Partitioning Quality
      Improv.   Max    Avg.
      HP        58%    43%
      DG        29%    17%
      LDG       53%    36%

  42. Evaluation (outline repeated)
