Making Pull-Based Graph Processing Performant

Samuel Grossman (Stanford University), Heiner Litz (University of California, Santa Cruz), and Christos Kozyrakis (Stanford University). PPoPP 2018, February 27, 2018.


  1. Making Pull-Based Graph Processing Performant. Samuel Grossman (Stanford University), Heiner Litz (University of California, Santa Cruz), and Christos Kozyrakis (Stanford University). PPoPP 2018, February 27, 2018

  2. Graph Processing • Problem modelled as objects (vertices) and connections between them (edges) • Examples: Internet (pages and hyperlinks); social network (people and friendships); roads and intersections; products and ratings

  3. Graph Processing [Figure: example graph with vertices labelled A through L]

  4. Graph Processing [Figure: the example graph with updated vertex values A’ through L’] Repeat until convergence

  5. Graph Processing • Push: group by source vertex • Pull: group by destination vertex • Hybrid: dynamically select push or pull for each iteration

  6. Graph Processing
       foreach vertex v in graph.vertices
         foreach edge e in v.(in|out)edges
           // process the edge ...
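A minimal sketch of how the nested loop specializes into push and pull, assuming a toy computation in which every vertex sums the values of its in-neighbours. The function names, the tiny edge list, and the sum itself are illustrative assumptions, not code from Grazelle or Ligra.

```python
from collections import defaultdict

def group_edges(edges):
    """Group (src, dst) pairs by source (for push) and by destination (for pull)."""
    out_edges, in_edges = defaultdict(list), defaultdict(list)
    for src, dst in edges:
        out_edges[src].append(dst)
        in_edges[dst].append(src)
    return out_edges, in_edges

def push_step(num_vertices, out_edges, values):
    """Push: outer loop over source vertices; each edge writes to its destination."""
    result = [0] * num_vertices
    for v in range(num_vertices):
        for dst in out_edges[v]:
            result[dst] += values[v]  # scattered writes to other vertices
    return result

def pull_step(num_vertices, in_edges, values):
    """Pull: outer loop over destination vertices; each edge reads from its source."""
    result = [0] * num_vertices
    for v in range(num_vertices):
        acc = 0
        for src in in_edges[v]:
            acc += values[src]        # scattered reads, one write per vertex
        result[v] = acc
    return result

edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
out_edges, in_edges = group_edges(edges)
values = [1, 10, 100]
pushed = push_step(3, out_edges, values)
pulled = pull_step(3, in_edges, values)
```

Both directions compute the same result; they differ in which side of the edge is read and which is written, which is what matters once the loops run in parallel.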

  7. Parallelizing Graph Processing • Outer loop parallelization: between cores, assign entire vertices to threads • Inner loop parallelization: between cores, subdivide the edges within each vertex; within one core, vectorize the loop
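The difference between the two strategies shows up in how evenly the edge work is spread across threads. The sketch below only counts edges per thread; the round-robin vertex assignment, the function names, and the degree list are assumptions chosen for illustration, and real frameworks use their own schedulers.

```python
def outer_loop_partition(degrees, num_threads):
    """Outer-loop only: whole vertices are assigned to threads (round-robin here)."""
    load = [0] * num_threads
    for v, degree in enumerate(degrees):
        load[v % num_threads] += degree
    return load

def inner_loop_partition(degrees, num_threads):
    """Inner-loop too: the flat edge array is cut into near-equal contiguous ranges,
    so the edges of a single vertex may be spread over several threads."""
    total = sum(degrees)
    base, extra = divmod(total, num_threads)
    return [base + (1 if t < extra else 0) for t in range(num_threads)]

# One very high-degree vertex followed by low-degree vertices (a skewed graph).
degrees = [1000, 2, 3, 1, 2, 4, 1, 3]
outer = outer_loop_partition(degrees, 4)
inner = inner_loop_partition(degrees, 4)
```

With only the outer loop parallelized, the thread that owns the high-degree vertex does almost all of the work; subdividing edges removes that imbalance but lets several threads update the same vertex.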

  8. Parallelizing Graph Processing [Chart: speedup (axis 0 to 10) on PageRank and Breadth-First Search for four configurations: Push (Outer Loop), Push (Both Loops), Push (Both) + Pull (Outer), and Push (Both) + Pull (Both). Running Ligra on the twitter-2010 graph.]

  11. Pull’s Performance Challenge • Serial inner loop: one thread per vertex; updates are thread-private • Parallel inner loop: multiple threads per vertex; updates will conflict • Contribution #1: “Scheduler Awareness”, a technique that can be applied to the inner loop of a pull engine to parallelize it without introducing conflicts.

  12. Pull’s Performance Opportunity • Further gains possible using SIMD vectorization • Improve parallelism of the computation • Improve memory bandwidth utilization • Data structure layout issues impede effective vectorization in existing work • Contribution #2: “Vector-Sparse”, a low-level modification to a data structure commonly used to represent graphs, intended to enhance vectorization.

  13. Grazelle • A hybrid graph processing framework that embodies both of our contributions • Outperforms the state-of-the-art by over 10× in some cases • Available for download at https://github.com/stanford-mast/Grazelle-PPoPP18

  14. Scheduler Awareness (Contribution #1)

  15. Serial Inner Loop [Figure: a vertex data array; vertices 0, 1, and 2 (outer loop) with 8, 3, and 5 edges respectively (inner loop); the work is divided into Chunks A, B, and C]

  17. Scheduler Un-Awareness [Figure: a vertex data array; vertices 0, 1, and 2 (outer loop) with 8, 3, and 5 edges respectively (inner loop); the work is divided into Chunks A, B, and C]

  19. Scheduler Awareness [Figure: a vertex data array; vertices 0, 1, and 2 (outer loop) with 8, 3, and 5 edges respectively (inner loop); the work is divided into Chunks A, B, and C]

  20. Scheduler Awareness [Figure: vertex data, vertices 0 to 2 with their edges, and Chunks A, B, and C] 1. Writes at the end of a chunk are redirected to a private per-chunk merge buffer.

  21. Scheduler Awareness [Figure: vertex data and merge buffers, vertices 0 to 2 with their edges, and Chunks A, B, and C] 1. Writes at the end of a chunk are redirected to a private per-chunk merge buffer.

  22. Scheduler Awareness [Figure: vertex data and merge buffers, vertices 0 to 2 with their edges, and Chunks A, B, and C] 1. Writes at the end of a chunk are redirected to a private per-chunk merge buffer. 2. Writes in the middle of a chunk can be committed to shared state without synchronization.
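The two rules can be sketched as follows. This is one reading of the slides rather than Grazelle's implementation: it assumes a flat edge array sorted by destination vertex, fixed-size chunks of edges, a sum as the per-vertex aggregation, and a merge step that runs after every chunk has finished. The chunks run one after another here; in a real engine each chunk would be handed to a thread. All names are hypothetical.

```python
def process_chunk(dst, contrib, start, end, shared):
    """Pull-style sum over edges [start, end), which are sorted by destination.

    A vertex whose edges end inside this chunk is written straight to shared
    state. The chunk's last vertex may continue in the next chunk, so its
    partial sum is returned for the per-chunk merge buffer instead.
    """
    current, acc = dst[start], 0
    for e in range(start, end):
        if dst[e] != current:
            shared[current] = acc  # rule 2: no other chunk writes this entry directly
            current, acc = dst[e], 0
        acc += contrib[e]
    return current, acc            # rule 1: redirected to the merge buffer

def scheduler_aware_sum(dst, contrib, num_vertices, chunk_size):
    shared = [0] * num_vertices
    merge_buffers = [
        process_chunk(dst, contrib, s, min(s + chunk_size, len(dst)), shared)
        for s in range(0, len(dst), chunk_size)
    ]
    # Merge step: fold each chunk's buffered partial sum into shared state.
    for vertex, partial in merge_buffers:
        shared[vertex] += partial
    return shared

# The example from the slides: vertices 0, 1, 2 with 8, 3, and 5 in-edges.
dst = [0] * 8 + [1] * 3 + [2] * 5
contrib = list(range(1, 17))  # made-up per-edge contributions
totals = scheduler_aware_sum(dst, contrib, num_vertices=3, chunk_size=6)
```

The result does not depend on the chunk size, but the number of merge-buffer entries does: one per chunk, which is why finer scheduling granularity increases merge overhead.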

  24. Analyzing Scheduler Awareness • Performance impact depends primarily on the scheduling granularity • Scheduler Un-Awareness: trade-off between load balance and probability of write conflicts • Scheduler Awareness: finer granularity leads to increased merge operation overhead

  25. PageRank: Performance vs. Scheduling Granularity [Charts: relative execution time vs. chunk size for Scheduler Un-Aware and Scheduler-Aware, on dimacs-usa (low, even degree distribution) and uk-2007 (extremely skewed); chunk-size axes span 100 to 10,000 and 1,000 to 100,000. Callouts: 50×, 1.2×, 3.3×, and “10× Different”.]

  26. PageRank: Performance vs. Number of Cores [Charts: relative performance vs. number of physical cores (0 to 56) for Scheduler Un-Aware and Scheduler-Aware, on dimacs-usa (low, even degree distribution) and uk-2007 (extremely skewed). Annotations: “Scaling enabled by Scheduler Awareness” and “Scaling improved by Scheduler Awareness”.] Key insights: • Huge improvement for extremely skewed graphs • Still beneficial for evenly-distributed low-degree graphs

  27. Vector-Sparse (Contribution #2)

  28. Compressed-Sparse • Vertex Index ([0] to [3]): 0, 3, 7, 10 • Edges ([0] to [11]): 23, 10, 50, 4, 0, 53, 62, 1, 78, 50, 23, 4
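A short sketch of how the two arrays on this slide are read together, assuming the usual compressed-sparse convention that each vertex index entry is the offset of that vertex's first edge and that the last vertex's edges run to the end of the edge array. The `neighbors` helper is a hypothetical name.

```python
vertex_index = [0, 3, 7, 10]
edges = [23, 10, 50, 4, 0, 53, 62, 1, 78, 50, 23, 4]

def neighbors(v):
    """Return the slice of the edge array that belongs to vertex v."""
    start = vertex_index[v]
    # The next vertex's offset marks the end; the last vertex runs to the end.
    end = vertex_index[v + 1] if v + 1 < len(vertex_index) else len(edges)
    return edges[start:end]

adjacency = [neighbors(v) for v in range(len(vertex_index))]
```

Vertex 0 owns three edges and vertex 1 owns four, so a vertex's edge list neither starts at a multiple of the vector width nor has a length that is one.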

  29. Vectorizing Compressed-Sparse • Edges ([0] to [11]): 23, 10, 50, 4, 0, 53, 62, 1, 78, 50, 23, 4 [Figure: the edge ranges of Vertex 0 and Vertex 1 are marked on the array]

  30. Vectorizing Compressed-Sparse • Edges ([0] to [11]): 23, 10, 50, 4, 0, 53, 62, 1, 78, 50, 23, 4 [Figure: the edge ranges of Vertex 0 and Vertex 1 are marked on the array, with a × between them]

  33. Vector-Sparse: padding • Edges ([0] to [13]): 23, 10, 50, ×, 4, 0, 53, 62, 1, 78, 50, ×, 23, 4 (× marks a padding slot) [Figure: the edge ranges of Vertex 0, Vertex 1, and Vertex 2 are marked on the array]

  34. Vector-Sparse: padding + “valid” bits • Edges ([0] to [13]): 23, 10, 50, ×, 4, 0, 53, 62, 1, 78, 50, ×, 23, 4 (× marks a padding slot) • Valid bits: 1 1 1 0 1 1 1 1 1 1 1 0 1 1 [Figure: the edge ranges of Vertex 0, Vertex 1, and Vertex 2 are marked on the array]

  35. Vector-Sparse: padding + “valid” bits + top-level vertex spread-encoding • Edges ([0] to [13]): 23, 10, 50, ×, 4, 0, 53, 62, 1, 78, 50, ×, 23, 4 (× marks a padding slot) • Top-level vertex: 0, 1, 2 • Valid bits: 1 1 1 0 1 1 1 1 1 1 1 0 1 1 [Figure: the edge ranges of Vertex 0, Vertex 1, and Vertex 2 are marked on the array]
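The padded layout can be sketched as a conversion from the compressed-sparse arrays shown earlier, assuming a vector width of four lanes (which matches the valid bits on the slides). The tuple layout, the `-1` padding value, and both function names are assumptions for illustration; the slides do not show how Grazelle packs the top-level vertex ID and valid bits into each vector. The packing-efficiency helper assumes that term means the fraction of lanes holding real edges.

```python
def to_vector_sparse(vertex_index, edges, width=4, pad=-1):
    """Split each vertex's edge list into fixed-width vectors, padding the last one.

    Each vector is (top_level_vertex, lanes, valid_bits); no vector mixes
    edges that belong to different vertices.
    """
    vectors = []
    for v, start in enumerate(vertex_index):
        end = vertex_index[v + 1] if v + 1 < len(vertex_index) else len(edges)
        for i in range(start, end, width):
            lanes = edges[i:min(i + width, end)]
            fill = width - len(lanes)
            valid = [1] * len(lanes) + [0] * fill
            vectors.append((v, lanes + [pad] * fill, valid))
    return vectors

def packing_efficiency(vectors):
    """Fraction of lanes that hold a real edge rather than padding."""
    valid = sum(sum(bits) for _, _, bits in vectors)
    total = sum(len(bits) for _, _, bits in vectors)
    return valid / total

vertex_index = [0, 3, 7, 10]
edges = [23, 10, 50, 4, 0, 53, 62, 1, 78, 50, 23, 4]
vectors = to_vector_sparse(vertex_index, edges)
flat_valid = [bit for _, _, bits in vectors for bit in bits]
efficiency = packing_efficiency(vectors)
```

Every vector now starts on a vector-width boundary and carries its own destination vertex, at the cost of the padded lanes counted by the efficiency figure.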

  36. Analyzing Vector-Sparse • Packing efficiency: generally ≥ 75% • Performance impact: 1.5× to 2.5× speedup [Charts: packing efficiency (0% to 100%) and speedup (0.0 to 3.0) for PageRank, CC, BFS, and their average, on dimacs-usa, livejournal, twitter-2010, friendster, and uk-2007]

  37. Performance Comparison: putting it all together

  38. Evaluation Scope • Grazelle is compared with Ligra, Polymer, GraphMat, and X-Stream • Three applications: PageRank, Connected Components, Breadth-First Search • Running on a machine equipped with four Intel Xeon E7-4850 v3 processors • 14 physical cores / 28 logical cores per socket

  39. PageRank: Peak Processing Throughput [Chart: execution time in ms, logarithmic scale (1E+0 to 1E+5), on dimacs-usa, livejournal, twitter-2010, friendster, and uk-2007 for Grazelle-Pull, Grazelle-Push, Ligra-Pull, Ligra-Push, Polymer, GraphMat, and X-Stream. Callouts: 15.2×, 1.4×, 2.3×, 2.3×, 3.6×, and a × marker.]

  40. Connected Components: Dynamic Control Flow [Chart: execution time in ms, logarithmic scale (1E+0 to 1E+6), on dimacs-usa, livejournal, twitter-2010, friendster, and uk-2007 for Grazelle, Ligra, Ligra-Dense, Polymer, GraphMat, and X-Stream. Callouts: 4.9×, 21.1×, 1.5×, 1.6×, and a × marker.]

  41. Breadth-First Search: Compatibility of Optimizations [Chart: execution time in ms, logarithmic scale (1E+0 to 1E+6), on dimacs-usa, livejournal, twitter-2010, friendster, and uk-2007 for Grazelle, Ligra, Ligra-Dense, Polymer, GraphMat, and X-Stream. Includes a × marker.]
