Argo: Architecture-Aware Graph Partitioning


  1. Argo: Architecture-Aware Graph Partitioning. Angen Zheng, Alexandros Labrinidis, Panos K. Chrysanthis, and Jack Lange. Department of Computer Science, University of Pittsburgh. http://db.cs.pitt.edu/group/ http://www.prognosticlab.org/

  2. Big Graphs Are Everywhere [SIGMOD’16 Tutorial]

  3. A Balanced Partitioning = Even Load Distribution; Minimal Edge-Cut = Minimal Data Comm.
     [Figure: a graph partitioned across three nodes, N1, N2, and N3]
     Assumption: the network is the bottleneck.
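For reference, the two goals on this slide can be written out as follows. This is the standard k-way partitioning formalization with our own notation (ε as the balance slack); it is not an equation shown in the deck:

    % Edge-cut: the number of edges crossing partitions P_1, ..., P_k
    \mathrm{cut}(P_1,\dots,P_k) \;=\; \bigl|\{\,(u,v)\in E : \mathrm{part}(u)\neq \mathrm{part}(v)\,\}\bigr|
    % Balance: no partition exceeds the average size by more than a factor (1 + \epsilon)
    \max_i |P_i| \;\le\; (1+\epsilon)\,\frac{|V|}{k}

Minimizing the first quantity subject to the second constraint is what both METIS-style multi-level partitioners and streaming partitioners approximate.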

  4. The End of Slow Networks: Network is now as fast as DRAM [C. Binnig, VLDB’15]
     ✓ Dual-socket Xeon E5v2 server with DDR3-1600 and 2 FDR 4x NICs per socket
     ✓ InfiniBand: 1.7GB/s~37.5GB/s
     ✓ DDR3: 6.25GB/s~16.6GB/s

  5. The End of Slow Networks: Does edge-cut still matter?

  6. Roadmap  Introduction  Does edge-cut still matter?  Why does edge-cut still matter?  Argo  Evaluation  Conclusions

  7. The End of Slow Networks: Does edge-cut still matter?
     Graph Partitioners:    METIS and LDG
     Graph Workloads:       BFS, SSSP, and PageRank
     Graph Dataset:         Orkut (|V|=3M, |E|=234M)
     Number of Partitions:  16 (one partition per core)

  8. The End of Slow Networks: Does edge-cut still matter?
     SSSP Execution Time (s)
     m:s:c   METIS   LDG
     1:2:8     633   2,632
     2:2:4     654   2,565
     4:2:2     521     631
     8:2:1     222     280
     (m = # of machines used, s = # of sockets used per machine, c = # of cores
     used per socket; the 9x annotation marks the spread between the slowest and
     fastest LDG configurations)
     ✓ Denser configurations had longer execution times.
       ○ Contention on the memory subsystems impacted performance.
       ○ The network may not always be the bottleneck.

  9. The End of Slow Networks: Does edge-cut still matter?
     SSSP Execution Time (s)        SSSP LLC Misses (in Millions)
     m:s:c   METIS   LDG            m:s:c   METIS    LDG
     1:2:8     633   2,632          1:2:8   10,292   44,117
     2:2:4     654   2,565          2:2:4   10,626   44,689
     4:2:2     521     631          4:2:2    2,541    1,061
     8:2:1     222     280          8:2:1       96      187
     (the 9x and 235x annotations mark the spread between the densest and
     sparsest LDG configurations)
     ✓ Denser configurations had longer execution times.
       ○ Contention on the memory subsystems impacted performance.
       ○ The network may not always be the bottleneck.

  11. The End of Slow Networks: Does edge-cut still matter? (same tables as slide 9)
      ✓ Denser configurations had longer execution times.
        ○ Contention on the memory subsystems impacted performance.
        ○ The distribution of edge-cut matters.
        ○ The network may not always be the bottleneck.

  12. The End of Slow Networks: Does edge-cut still matter? (same tables as slide 9)
      ✓ METIS had lower execution times and fewer LLC misses than LDG.
        ○ Edge-cut matters: higher edge-cut → higher comm → higher contention.

  13. The End of Slow Networks: Does edge-cut still matter?
      Yes! Both edge-cut and its distribution matter!
      ✓ Intra-node and inter-node data communication have different performance
        impacts on the memory subsystems of modern multicore machines.

  14. Roadmap  Introduction  Does edge-cut still matter?  Why does edge-cut still matter?  Argo  Evaluation  Conclusions

  15. Intra-Node Data Comm: Shared Memory
      [Figure: the sending core loads data from its send buffer and writes it into
      a shared buffer; the receiving core then loads the data from the shared
      buffer and writes it into its receive buffer]
      ⇒ Extra memory copy.
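To make the extra copy concrete, here is a minimal Python sketch of the two-copy path. The buffer names mirror the figure, but the code is illustrative, not taken from any system in the paper:

    # Hedged sketch of the shared-memory message path described above.
    send_buf = bytearray(b"message from the sending core")
    shared_buf = bytearray(len(send_buf))   # region visible to both cores
    recv_buf = bytearray(len(send_buf))

    # Sending core: load from send_buf, write into shared_buf (copy #1).
    shared_buf[:] = send_buf

    # Receiving core: load from shared_buf, write into recv_buf (copy #2).
    recv_buf[:] = shared_buf

    # Every byte crosses the cache hierarchy twice -- the cache pollution
    # and bandwidth contention the next two slides describe.
    assert recv_buf == send_buf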

  16. Intra-Node Data Comm: Shared Memory
      The send, shared, and receive buffers are all cached along the way:
      ✓ Cache pollution
      ✓ LLC and memory bandwidth contention

  17. Intra-Node Data Comm: Shared Memory
      The sending core caches the send and shared buffers; the receiving core
      caches the receive and shared buffers:
      ✓ Cache pollution
      ✓ LLC and memory bandwidth contention

  18. Excess intra-node data communication may hurt performance.

  19. Inter-Node Data Comm: RDMA Read/Write
      [Figure: the send buffer of the sending core on Node #1 is transferred by
      the InfiniBand HCAs directly into the receive buffer of the receiving core
      on Node #2]
      ⇒ No extra memory copy and no cache pollution.

  20. Offloading excess intra-node data comm across nodes may achieve better performance.

  21. Roadmap  Introduction  Does edge-cut still matter?  Why does edge-cut still matter?  Argo  Evaluation  Conclusions

  22. Argo: Graph Partitioning Model
      [Figure: vertices arrive one at a time from a vertex stream; the partitioner
      assigns each vertex to a partition as it arrives]
      Streaming Graph Partitioning Model [I. Stanton, KDD’12]
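As context, a minimal sketch of the streaming model the slide cites, using the classic LDG heuristic [I. Stanton, KDD’12] rather than Argo’s weighted variant (which the next slide introduces). Function and variable names are ours:

    from collections import defaultdict

    def stream_partition(vertex_stream, num_parts, capacity):
        """Greedy one-pass (LDG-style) streaming partitioner.

        vertex_stream yields (vertex, neighbors) pairs; each vertex is placed
        permanently on arrival, using only the partitions' current state.
        """
        part_of = {}                  # vertex -> partition id
        sizes = defaultdict(int)      # partition id -> current load

        for v, neighbors in vertex_stream:
            def score(p):
                # Neighbors already in p, discounted by p's load (LDG rule).
                local = sum(1 for u in neighbors if part_of.get(u) == p)
                return local * (1.0 - sizes[p] / capacity)

            # Prefer the higher score; break ties toward the lighter partition.
            best = max(range(num_parts), key=lambda p: (score(p), -sizes[p]))
            part_of[v] = best
            sizes[best] += 1
        return part_of

    # Toy usage: a 4-cycle split into 2 partitions of capacity 2.
    stream = [(0, [1, 3]), (1, [0, 2]), (2, [1, 3]), (3, [2, 0])]
    print(stream_partition(stream, num_parts=2, capacity=2))

The one-pass, state-only-so-far discipline is what makes the model cheap enough to run while the graph is being loaded.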

  23. Argo: Architecture-Aware Vertex Placement
      Place each vertex v in the partition Pi that maximizes the weighted edge-cut
      (edges between v and vertices already in Pi), penalized by the load of Pi.
      ✓ With edges weighted by the relative network comm cost, Argo avoids
        edge-cut across nodes (inter-node data comm).
        ○ Great for cases where the network is the bottleneck.
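The slide’s formula is not legible in this transcript. A hedged reconstruction, assuming it follows the LDG form with Argo’s communication weights; the symbols N(v), w(u,v), and C are our notation:

    % Place v in the partition P_i maximizing
    \operatorname*{arg\,max}_{P_i}\;
      \underbrace{\sum_{u \in N(v) \cap P_i} w(u,v)}_{\text{weighted edge-cut toward } P_i}
      \;\times\;
      \underbrace{\left(1 - \frac{|P_i|}{C}\right)}_{\text{load penalty}}

Here N(v) is v’s neighbor set, w(u,v) the relative network comm cost between the cores hosting u and v, and C the partition capacity.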

  24. Argo: Architecture-Aware Vertex Placement
      Degree of Contention (𝞵 ∈ [0, 1]):
      ✓ 𝞵 = 0: the network is the bottleneck → use the original intra-node
        network comm cost.
      ✓ 𝞵 = 1: memory is the bottleneck → the refined intra-node cost rises to
        the maximal inter-node network comm cost.
      ✓ With edges weighted by the refined relative network comm cost, Argo also
        avoids edge-cut across cores of the same node (intra-node data comm).
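One consistent reading of the table on this slide, written as an interpolation in our own notation (the paper may parameterize it differently):

    c_{\text{intra}}^{\text{refined}}
      \;=\; (1-\mu)\; c_{\text{intra}}^{\text{orig}}
      \;+\; \mu\; c_{\text{inter}}^{\max},
    \qquad \mu \in [0,1]

At μ = 0 this recovers the original intra-node cost (network-bound case); at μ = 1 intra-node comm is priced like the most expensive inter-node comm (memory-bound case), pushing the partitioner to offload excess intra-node comm across nodes.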

  25. Roadmap  Introduction  Does edge-cut still matter?  Why does edge-cut still matter?  Argo  Evaluation  Conclusions

  26. Evaluation: Workloads & Datasets
       Three Classic Graph Workloads
        o Breadth First Search (BFS)
        o Single Source Shortest Path (SSSP)
        o PageRank
       Three Real-World Large Graphs
        Dataset      |V|    |E|
        Orkut        3M     234M
        Friendster   124M   3.6B
        Twitter      52M    3.9B

  27. Evaluation: Platform
      Cluster Configuration
        # of Nodes:         32
        Network Topology:   FDR InfiniBand (single switch)
        Network Bandwidth:  56Gbps
      Compute Node Configuration
        # of Sockets:       2 (Intel Haswell, 10 cores/socket)
        L3 Cache:           25MB

  28. Evaluation: Partitioners
       METIS: the most well-known multi-level partitioner.
       LDG: the most well-known streaming partitioner.
       ARGO-H: assumes the network is the bottleneck.
        o Weights edge-cut by the original network comm costs.
       ARGO: assumes memory is the bottleneck.
        o Weights edge-cut by the refined network comm costs.
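To make the ARGO-H / ARGO distinction concrete, a hedged Python sketch of the two edge-weight functions, reusing the μ-interpolation reading from slide 24. The core-to-node mapping and cost constants are illustrative, not measured values from the paper:

    def make_comm_cost(cores_per_node, intra_cost, inter_cost, mu=0.0):
        """Relative comm cost between the cores hosting two vertices.

        mu = 0 keeps the original costs (ARGO-H: network-bound);
        mu = 1 raises intra-node cost to the inter-node level (ARGO:
        memory-bound), discouraging intra-node edge-cut as well.
        """
        def cost(core_u, core_v):
            if core_u == core_v:
                return 0.0                 # same core: no comm
            if core_u // cores_per_node == core_v // cores_per_node:
                # intra-node: interpolate toward the maximal inter-node cost
                return (1 - mu) * intra_cost + mu * inter_cost
            return inter_cost              # inter-node
        return cost

    argo_h = make_comm_cost(cores_per_node=20, intra_cost=1.0, inter_cost=4.0, mu=0.0)
    argo   = make_comm_cost(cores_per_node=20, intra_cost=1.0, inter_cost=4.0, mu=1.0)
    print(argo_h(0, 1), argo(0, 1))  # two cores on the same node: 1.0 vs 4.0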

  29. Evaluation: SSSP Exec. Time on the Orkut Dataset
      ★ Orkut: |V| = 3M, |E| = 234M
      ★ 60 Partitions: three 20-core machines
      [Bar chart: SSSP execution time per partitioner across message grouping
      sizes; relative times range from 5x down to 1x]
      ✓ ARGO had the lowest SSSP execution time.
      (Message grouping: multiple messages sent by a single SSSP process to the
      same destination are grouped into one message.)
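For intuition, a hedged sketch of the message-grouping optimization this chart’s x-axis varies; the names and the flush policy are ours, simplified from whatever the evaluated system does:

    from collections import defaultdict

    def send_grouped(messages, group_size, send):
        """Group messages headed to the same destination into batches.

        messages:   iterable of (destination, payload) pairs from one process
        group_size: max payloads per combined message (the slide's x-axis)
        send:       callable taking (destination, list_of_payloads)
        """
        pending = defaultdict(list)
        for dest, payload in messages:
            pending[dest].append(payload)
            if len(pending[dest]) == group_size:   # batch full: flush it
                send(dest, pending.pop(dest))
        for dest, batch in pending.items():        # flush the remainders
            send(dest, batch)

    # Toy usage: updates to two destination partitions, grouped in pairs.
    send_grouped([(0, "a"), (1, "b"), (0, "c"), (0, "d")],
                 group_size=2,
                 send=lambda d, b: print(f"to {d}: {b}"))

Fewer, larger messages amortize per-message overhead, which is why execution time on the chart falls as the grouping size grows.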

  30. Evaluation: SSSP LLC Misses on the Orkut Dataset
      ★ Orkut: |V| = 3M, |E| = 234M
      ★ 60 Partitions: three 20-core machines
      [Bar chart: SSSP LLC misses per partitioner across message grouping sizes;
      relative counts range from 50x down to 1x]
      ✓ ARGO had the lowest LLC misses.

  31. Evaluation: SSSP Comm Vol. on the Orkut Dataset
      ★ Orkut: |V| = 3M, |E| = 234M
      ★ 60 Partitions: three 20-core machines
      [Chart: communication volume breakdown; intra-node share: METIS 69%,
      LDG 49%, ARGO-H 70%]
      ✓ ARGO had the lowest intra-node communication volume.
      ✓ The distribution of the edge-cut also matters.

  32. Evaluation: SSSP Exec. Time vs. Graph Size
      ★ Twitter: |V| = 52M, |E| = 3.9B
      ★ 80 Partitions: four 20-core machines
      ★ Message Grouping Size: 512
      ✓ ARGO had the lowest SSSP execution time.
      ✓ Up to 6x improvement over ARGO-H.
      ✓ The improvement grew as the graph size increased.

  33. Evaluation: SSSP Exec. Time vs. # of Partitions
      ★ Twitter: |V| = 52M, |E| = 3.9B
      ★ 80~200 Partitions: four to ten 20-core machines
      ★ Message Grouping Size: 512
      ✓ ARGO always outperformed LDG and ARGO-H.
      ✓ Up to 11x improvement over ARGO-H.

  34. Evaluation: SSSP Exec. Time vs. # of Partitions
      ★ Twitter: |V| = 52M, |E| = 3.9B
      ★ 80~200 Partitions: four to ten 20-core machines
      ★ Message Grouping Size: 512
      [Chart annotations: 13h of CPU time saved at 160 partitions, 6h at 180]
      ✓ Hours of CPU time saved.

  35. Evaluation: Partitioning Overhead
      ★ Twitter: |V| = 52M, |E| = 3.9B
      ★ 80~200 Partitions: four to ten 20-core machines
      [Charts vs. # of partitions: partitioning time, and partitioning time as a
      percentage of the CPU time saved on SSSP execution]
      ✓ ARGO is indeed slower than LDG.
      ✓ The overhead was negligible in comparison to the CPU time saved.
      ✓ Graph analytics usually run far longer than partitioning takes.

  36. Conclusions
       Findings
        o The network is not always the bottleneck.
        o Contention on the memory subsystems may impact performance a lot, due
          to excess intra-node data comm.
        o Both edge-cut and its distribution matter.
       Argo
        o Avoids contention by offloading excess intra-node data comm across nodes.
        o Achieves up to 11x improvement on real-world workloads.
        o Scales well in terms of both graph size and number of partitions.
      Thanks!
      Acknowledgments: Peyman Givi, Patrick Pisciuneri
      Funding: NSF CBET-1609120, NSF CBET-1250171, BigData’16 Student Travel Award
