

slide-1
SLIDE 1

Making Pull-Based Graph Processing Performant

Samuel Grossman1, Heiner Litz2, and Christos Kozyrakis1

1Stanford University 2University of California, Santa Cruz

PPoPP 2018 · February 27, 2018

slide-2
SLIDE 2

Graph Processing

  • Problem modelled as objects (vertices) and connections between them (edges)

  • Examples:
  • Internet (pages and hyperlinks)
  • Social network (people and friendships)
  • Roads and intersections
  • Products and ratings

2

slide-3
SLIDE 3

Graph Processing

[Diagram: an example graph; vertices labelled A through L connected by edges]

3

slide-4
SLIDE 4

Graph Processing

[Diagram: updated vertex values A’ through L’ computed over the graph]

Repeat until convergence

4

slide-5
SLIDE 5

Graph Processing

Push: group by source vertex
Pull: group by destination vertex

5

Hybrid: dynamically select push or pull for each iteration
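The two directions can be sketched in C over a tiny hypothetical edge list (not code from any framework discussed here): push iterates edges grouped by source and scatters writes to destinations, while pull iterates edges grouped by destination and gathers reads from sources. Both compute the same result; they differ in who owns the write.

```c
#include <stddef.h>

/* Illustrative sketch: one accumulation step over the same 5-edge
 * graph, expressed in push form (edges sorted by source) and pull
 * form (edges sorted by destination). */

#define NV 4
#define NE 5

static const int push_src[NE] = {0, 0, 1, 2, 3};
static const int push_dst[NE] = {1, 2, 2, 3, 0};
static const int pull_dst[NE] = {0, 1, 2, 2, 3};
static const int pull_src[NE] = {3, 0, 0, 1, 2};

void step_push(const double *in, double *out) {
    for (int v = 0; v < NV; v++) out[v] = 0.0;
    for (int e = 0; e < NE; e++)
        out[push_dst[e]] += in[push_src[e]]; /* scatter: writes conflict under parallelism */
}

void step_pull(const double *in, double *out) {
    for (int v = 0; v < NV; v++) out[v] = 0.0;
    for (int e = 0; e < NE; e++)
        out[pull_dst[e]] += in[pull_src[e]]; /* gather: each destination accumulates in turn */
}
```

Grouping by destination is what makes pull attractive for parallelism, since all writes to a vertex are adjacent in the edge list.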

slide-6
SLIDE 6

Graph Processing

foreach vertex v in graph.vertices
    foreach edge e in v.(in|out)edges
        // process the edge
        ...

6
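Concretely, the doubly nested loop might look like this in C for a pull-direction traversal (the vertex layout and names are illustrative, not the talk's actual data structures):

```c
#include <stddef.h>

typedef struct {
    int degree;        /* number of in-edges */
    const int *edges;  /* source vertex id of each in-edge */
} Vertex;

/* Outer loop walks vertices; inner loop walks each vertex's edges.
 * "Processing the edge" here just accumulates the source's value. */
void process_all(const Vertex *vs, int nv, const double *val, double *out) {
    for (int v = 0; v < nv; v++) {
        double acc = 0.0;
        for (int e = 0; e < vs[v].degree; e++)
            acc += val[vs[v].edges[e]];
        out[v] = acc;
    }
}
```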

slide-7
SLIDE 7

Parallelizing Graph Processing

  • Outer loop parallelization
  • Between cores: assign entire vertices to threads
  • Inner loop parallelization
  • Between cores: subdivide the edges within each vertex
  • Within one core: vectorize the loop

7
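As a sketch of the two levels (assuming OpenMP; the schedule and chunk size are illustrative choices, and the pragmas degrade to ordinary serial loops when OpenMP is disabled): the outer loop is split across cores, and the inner loop is a candidate for vectorization.

```c
/* Pull-style gather over a CSR-like layout:
 *   index[v]..index[v+1] delimits vertex v's edges. */
void gather_parallel(const int *index, const int *edges,
                     const double *val, double *out, int nv) {
    /* Outer loop parallelization: entire vertices assigned to threads. */
    #pragma omp parallel for schedule(dynamic, 64)
    for (int v = 0; v < nv; v++) {
        double acc = 0.0;
        /* Inner loop: hint that the compiler may vectorize the gather. */
        #pragma omp simd reduction(+:acc)
        for (int e = index[v]; e < index[v + 1]; e++)
            acc += val[edges[e]];
        out[v] = acc;
    }
}
```

Subdividing the inner loop between cores (rather than within one core) is the step that introduces write conflicts, which the next slides address.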

slide-8
SLIDE 8

Parallelizing Graph Processing

8

Running Ligra on the twitter-2010 graph
[Chart: speedup (2–10×) for PageRank and Breadth-First Search under Push (Outer Loop), Push (Both Loops), Push (Both) + Pull (Outer), and Push (Both) + Pull (Both)]


slide-11
SLIDE 11

Pull’s Performance Challenge

Serial Inner Loop

  • One thread per vertex
  • Updates are thread-private

Parallel Inner Loop

  • Multiple threads per vertex
  • Updates will conflict

11

Contribution #1, “Scheduler Awareness”: a technique that can be applied to the inner loop of a pull engine to parallelize it without introducing conflicts.

slide-12
SLIDE 12

Pull’s Performance Opportunity

  • Further gains possible using SIMD vectorization
  • Improve parallelism of the computation
  • Improve memory bandwidth utilization
  • Data structure layout issues impede effective vectorization in existing work

12

Contribution #2, “Vector-Sparse”: a low-level modification to a data structure commonly used to represent graphs, intended to enhance vectorization.

slide-13
SLIDE 13

Grazelle

  • A hybrid graph processing framework that embodies both of our contributions
  • Outperforms the state of the art by over 10× in some cases
  • Available for download at https://github.com/stanford-mast/Grazelle-PPoPP18

13

slide-14
SLIDE 14

Scheduler Awareness

Contribution #1

14

slide-15
SLIDE 15

Serial Inner Loop

15

[Diagram: the edges of three vertices laid out contiguously and split into Chunks A, B, and C, with chunk boundaries falling inside a vertex's edge list; legend: Vertex (Outer Loop), Edge (Inner Loop), Vertex Data]


slide-17
SLIDE 17

Scheduler Un-Awareness

17

[Diagram: the same chunked layout; because a vertex's edges straddle a chunk boundary, two chunks can write that vertex's data concurrently]



slide-20
SLIDE 20

Scheduler Awareness

20


  • 1. Writes at the end of a chunk are redirected to a private per-chunk merge buffer.

slide-21
SLIDE 21

Scheduler Awareness

21

[Diagram: the chunked layout extended with a private Merge Buffer per chunk; legend: Vertex (Outer Loop), Edge (Inner Loop), Vertex Data, Merge Buffers]

  • 1. Writes at the end of a chunk are redirected to a private per-chunk merge buffer.

slide-22
SLIDE 22

Scheduler Awareness

22


  • 1. Writes at the end of a chunk are redirected to a private per-chunk merge buffer.
  • 2. Writes in the middle of a chunk can be committed to shared state without synchronization.

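A serial C model of the two rules can make them concrete (a sketch under stated assumptions: edges sorted by destination, fixed-size chunks, and at most 64 chunks; real implementations run the chunk loop across threads). Each chunk commits interior vertices directly, since no other chunk touches them, and buffers only its final partial sum for a serial merge afterward.

```c
#include <stddef.h>
#include <string.h>

typedef struct { int dst; int src; } Edge;

void pull_scheduler_aware(const Edge *edges, int ne, int chunk_size,
                          const double *val, double *out, int nv) {
    memset(out, 0, (size_t)nv * sizeof *out);
    int nchunks = (ne + chunk_size - 1) / chunk_size;
    double merge_val[64];  /* per-chunk merge buffers (assume <= 64 chunks) */
    int    merge_dst[64];

    /* Each chunk could run on its own thread with no synchronization. */
    for (int c = 0; c < nchunks; c++) {
        int lo = c * chunk_size;
        int hi = (lo + chunk_size < ne) ? lo + chunk_size : ne;
        double acc = 0.0;
        for (int e = lo; e < hi; e++) {
            acc += val[edges[e].src];
            int last_of_vertex = (e + 1 == ne) || (edges[e + 1].dst != edges[e].dst);
            if (last_of_vertex && e + 1 < hi) {
                out[edges[e].dst] += acc;  /* middle of chunk: direct commit */
                acc = 0.0;
            }
        }
        merge_dst[c] = edges[hi - 1].dst;  /* end of chunk: buffer the write */
        merge_val[c] = acc;
    }
    for (int c = 0; c < nchunks; c++)  /* serial merge step */
        out[merge_dst[c]] += merge_val[c];
}
```

Because edges are grouped by destination, only the vertex straddling a chunk boundary is ever shared, so buffering just the chunk-final write is enough to eliminate all conflicts.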

slide-24
SLIDE 24

Analyzing Scheduler Awareness

  • Performance impact depends primarily on the scheduling granularity
  • Scheduler Un-Awareness: trade-off between load balance and probability of write conflicts
  • Scheduler Awareness: finer granularity leads to increased merge operation overhead

24

slide-25
SLIDE 25

PageRank: Performance vs. Scheduling Granularity

dimacs-usa (low, even degree distribution) · uk-2007 (extremely skewed)

25

[Charts: relative execution time (0.0–1.2) vs. chunk size, 100–10,000 for dimacs-usa and 1,000–100,000 for uk-2007, comparing Scheduler Un-Aware and Scheduler-Aware runs; annotations: 10× different, 50×, 3.3×, 1.2×]

slide-26
SLIDE 26

PageRank: Performance vs. Number of Cores

dimacs-usa (low, even degree distribution) · uk-2007 (extremely skewed)

26

[Charts: relative performance vs. number of physical cores (14, 28, 42, 56), Scheduler Un-Aware vs. Scheduler-Aware; annotations: scaling enabled by Scheduler Awareness, scaling improved by Scheduler Awareness]

Key Insights

  • Huge improvement for extremely skewed graphs
  • Still beneficial for evenly-distributed low-degree graphs
slide-27
SLIDE 27

Vector-Sparse

Contribution #2

27

slide-28
SLIDE 28

Compressed-Sparse

[Diagram: Compressed-Sparse layout — a 12-entry Edges array (23 10 50 4 0 53 62 1 78 50 23 4) indexed [0]–[11], plus a 4-entry Vertex Index containing the offsets 3, 7, 10]

28
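In C the layout amounts to two flat arrays. This sketch mirrors the slide's example values; reading the index as per-vertex end offsets, and the final offset of 12 (the total edge count), are assumptions about how the slide's index is meant to be read.

```c
/* Compressed-Sparse sketch: 4 vertices, 12 edges in one flat array. */
#define NV 4
static const int vertex_end[NV] = {3, 7, 10, 12};  /* end offset per vertex */
static const int edge_src[12] = {23, 10, 50, 4, 0, 53, 62, 1, 78, 50, 23, 4};

/* A vertex's edges occupy edge_src[end[v-1] .. end[v]). */
int degree(int v) {
    int start = (v == 0) ? 0 : vertex_end[v - 1];
    return vertex_end[v] - start;
}
```

The flat edge array is what the next slides try, and initially fail, to vectorize: vertex boundaries land anywhere, not at vector-width multiples.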

slide-29
SLIDE 29

Vectorizing Compressed-Sparse

29

[Diagram: the 12-entry edge array with the boundary between Vertex 0 and Vertex 1 falling mid-array, so fixed-width vector lanes straddle vertex boundaries]


slide-33
SLIDE 33

Vector-Sparse

[Diagram: Vector-Sparse layout — the same edges padded so each vertex's edges (Vertex 0, Vertex 1, Vertex 2) fill whole vector-width groups, now spanning indices [0] through [13]]

33

Padding

slide-34
SLIDE 34

Vector-Sparse

[Diagram: the padded layout with a per-edge “valid” bit (1 = real edge, 0 = padding)]

34

Padding + “valid” bits

slide-35
SLIDE 35

Vector-Sparse

[Diagram: the padded layout with per-edge “valid” bits plus a top-level spread-encoding of each group's destination vertex]

35

Padding + “valid” bits + top-level vertex spread-encoding
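The idea can be modeled in scalar C (a sketch: 4-wide groups to match a hypothetical SIMD width, and all field names are illustrative). Edges are packed into fixed-width groups padded to the vector length, a per-lane valid bit masks out padding, and each group carries its destination vertex so a gather proceeds one whole group at a time.

```c
#define LANES 4

typedef struct {
    int dst;                    /* destination vertex for this group */
    int src[LANES];             /* source vertex per lane */
    unsigned char valid[LANES]; /* 1 = real edge, 0 = padding */
} EdgeGroup;

/* Scalar model of the vectorized inner loop: every lane is processed
 * unconditionally, and invalid lanes contribute zero, exactly as a
 * SIMD mask would ensure. */
void pull_groups(const EdgeGroup *groups, int ngroups,
                 const double *val, double *out) {
    for (int g = 0; g < ngroups; g++) {
        double acc = 0.0;
        for (int l = 0; l < LANES; l++)
            acc += groups[g].valid[l] ? val[groups[g].src[l]] : 0.0;
        out[groups[g].dst] += acc;
    }
}
```

The payoff is that the inner loop has no data-dependent control flow: a real implementation replaces the lane loop with one masked vector gather and add.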

slide-36
SLIDE 36

Analyzing Vector-Sparse

[Charts: per-graph speedup (PageRank, CC, BFS; 0.0–3.0×) and average packing efficiency (0–100%) over dimacs-usa, livejournal, twitter-2010, friendster, and uk-2007]

Packing Efficiency: generally ≥ 75%
Performance Impact: 1.5× to 2.5×

36

slide-37
SLIDE 37

Performance Comparison

Putting it all together

37

slide-38
SLIDE 38

Evaluation Scope

  • Grazelle is compared with Ligra, Polymer, GraphMat, and X-Stream
  • Three applications: PageRank, Connected Components, Breadth-First Search
  • Running on a machine equipped with four Intel Xeon E7-4850 v3 processors
  • 14 physical cores / 28 logical cores per socket

38

slide-39
SLIDE 39

PageRank: Peak Processing Throughput

[Chart (logarithmic axis): execution time in ms over dimacs-usa, livejournal, twitter-2010, friendster, and uk-2007 for Grazelle-Pull, Grazelle-Push, Ligra-Pull, Ligra-Push, Polymer, GraphMat, and X-Stream; annotations: 3.6×, 2.3×, 2.3×, 1.4×, 15.2×]

39

slide-40
SLIDE 40

Connected Components: Dynamic Control Flow

[Chart (logarithmic axis): execution time in ms over dimacs-usa, livejournal, twitter-2010, friendster, and uk-2007 for Grazelle, Ligra, Ligra-Dense, Polymer, GraphMat, and X-Stream; annotations: 4.9×, 1.5×, 1.6×, 21.1×]

40

slide-41
SLIDE 41

Breadth-First Search: Compatibility of Optimizations

[Chart (logarithmic axis): execution time in ms over dimacs-usa, livejournal, twitter-2010, friendster, and uk-2007 for Grazelle, Ligra, Ligra-Dense, Polymer, GraphMat, and X-Stream]

41

slide-42
SLIDE 42

Conclusion

  • Two contributions to improve inner loop parallelization for pull-based graph processing
  • Scheduler Awareness: eliminate write conflicts
  • Vector-Sparse: enable SIMD vectorization
  • Grazelle significantly outperforms the state of the art, in some cases by over 10×
  • Grazelle is available for download at https://github.com/stanford-mast/Grazelle-PPoPP18

42

slide-43
SLIDE 43

Thank You

Questions?

43