Parallel Depth First on GPU, M. Naumov, A. Vrielink and M. Garland

SLIDE 1

Parallel Depth First on GPU

M. Naumov, A. Vrielink and M. Garland, GTC 2017

SLIDE 2

AGENDA

  • Introduction
  • Directed Trees
  • Directed Acyclic Graphs (DAGs)
      • Path- and SSSP-based variants
      • Optimizations
  • Performance Experiments

SLIDE 3

What is DFS?

[Figure: example graph with nodes a, b, c, d, e, f, g, i, j]

Node: a,b,c,d,e,f,g,i,j
Parent:
Discovery:
Finish:

SLIDE 4

What is DFS?

[Figure: example graph with nodes a, b, c, d, e, f, g, i, j]

Node: a,b,c,d,e,f,g,i,j
Parent: /,a
Discovery: a,b
Finish:

SLIDE 5

What is DFS?

[Figure: example graph with nodes a, b, c, d, e, f, g, i, j]

Node: a,b,c,d,e,f,g,i,j
Parent: /,a,-,-,b (- marks a node not yet visited)
Discovery: a,b,e
Finish: e

SLIDE 6

What is DFS?

[Figure: example graph with nodes a, b, c, d, e, f, g, i, j]

Node: a,b,c,d,e,f,g,i,j
Parent: /,a,-,-,b,b
Discovery: a,b,e,f
Finish: e

SLIDE 7

What is DFS?

[Figure: example graph with nodes a, b, c, d, e, f, g, i, j]

Node: a,b,c,d,e,f,g,i,j
Parent: /,a,-,-,b,b,-,f
Discovery: a,b,e,f,i
Finish: e,i

SLIDE 8

What is DFS?

[Figure: example graph with nodes a, b, c, d, e, f, g, i, j]

Node: a,b,c,d,e,f,g,i,j
Parent: /,a,-,-,b,b,-,f,f
Discovery: a,b,e,f,i,j
Finish: e,i,j

SLIDE 9

What is DFS?

[Figure: example graph with nodes a, b, c, d, e, f, g, i, j]

Node: a,b,c,d,e,f,g,i,j
Parent: /,a,a,a,b,b,d,f,f
Discovery: a,b,e,f,i,j,c,d,g
Finish: e,i,j,f,b,c,g,d,a
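The three rows built up on these slides can be reproduced with a short sequential DFS. The sketch below is only an illustration: the edge list is an assumption read off the Parent row (a has children b, c, d; b has e, f; d has g; f has i, j), children are assumed to be visited in the listed order, and the name `dfs` is hypothetical.

```python
def dfs(children, root):
    """Sequential DFS returning the parent map and the discovery/finish orders."""
    parent = {root: None}
    discovery, finish = [], []

    def visit(u):
        discovery.append(u)               # u is discovered on entry
        for v in children.get(u, []):
            if v not in parent:           # skip nodes that were already reached
                parent[v] = u
                visit(v)
        finish.append(u)                  # u finishes once all children are done

    visit(root)
    return parent, discovery, finish


# Edges assumed from the Parent row of the slide.
graph = {"a": ["b", "c", "d"], "b": ["e", "f"], "d": ["g"], "f": ["i", "j"]}
parent, discovery, finish = dfs(graph, "a")
print("Discovery:", ",".join(discovery))
print("Finish:   ", ",".join(finish))
```

The recursion is what makes DFS hard to parallelize: each step depends on the one before it, which is the problem the rest of the deck addresses.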

SLIDE 10

Previous Work on DFS

  • Planar Graphs: Time O(log^2 n), Processors O(n)
  • Directed Acyclic Graphs (DAGs): Time O(log^2 n), Processors O(n^ω / log n), where ω < 2.373 is the matrix multiplication exponent
  • Directed Graphs with Cycles: Time O(√n log^11 n), Processors O(n^3)
  • Lexicographic DFS

SLIDE 11

Previous Work on DFS

  • Planar Graphs: Time O(log^2 n), Processors O(n)
  • Directed Acyclic Graphs (DAGs): Time O(log^2 n), Processors O(n^ω / log n), where ω < 2.373 is the matrix multiplication exponent
  • Directed Graphs with Cycles: Time O(√n log^11 n), Processors O(n^3)
  • Lexicographic DFS

topological sort, bi-connectivity and planarity testing

SLIDE 12

DIRECTED TREES

SLIDE 13

Directed Tree

Phase 2: Bottom-Up Traversal

[Figure: tree with nodes a, b, c, d, e, f, g, i, j. Arrays shown: [0] [0] [0] [0] [0]]

SLIDE 14

Directed Tree

Phase 2: Bottom-Up Traversal

[Figure: tree with nodes a, b, c, d, e, f, g, i, j. Arrays shown: [0] [0] [0] [0] [0] [0,1] [0,1,1]]

SLIDE 15

Directed Tree

Phase 2: Bottom-Up Traversal

[Figure: tree with nodes a, b, c, d, e, f, g, i, j. Arrays shown: [0] [0] [0] [0] [0] [0,1,2] [0,1]. Annotation: prefix sum]

SLIDE 16

Directed Tree

Phase 2: Bottom-Up Traversal

[Figure: tree with nodes a, b, c, d, e, f, g, i, j. Arrays shown: [0] [0] [0] [0] [0] [0,1] [0,1,3] [0,1,2]]

SLIDE 17

Directed Tree

Phase 2: Bottom-Up Traversal

[Figure: tree with nodes a, b, c, d, e, f, g, i, j. Arrays shown: [0] [0] [0] [0] [0] [0,1] [0,1,4] [0,1,2]. Annotation: prefix sum]

SLIDE 18

Directed Tree

Phase 2: Bottom-Up Traversal

[Figure: tree with nodes a, b, c, d, e, f, g, i, j. Arrays shown: [0] [0] [0] [0] [0] [0,1] [0,5,1,2] [0,1,4] [0,1,2]]

SLIDE 19

Directed Tree

Phase 2: Bottom-Up Traversal

[Figure: tree with nodes a, b, c, d, e, f, g, i, j. Arrays shown: [0] [0] [0] [0] [0] [0,1] [0,5,6,8] [0,1,4] [0,1,2]]

SLIDE 20

Directed Tree

[Figure: tree with nodes a, b, c, d, e, f, g, i, j. Arrays shown: [0] [0] [0] [0] [0] [0,1] [0,5,6,8] [0,1,4] [0,1,2]]

This phase is done, next phase is about to start …

SLIDE 21

Directed Tree

Phase 3: Top-down Traversal

[Figure: tree with nodes a, b, c, d, e, f, g, i, j. Arrays shown: [0] [0] [0] [0] [0] [0,1] [0,5,6,8] [0,1,4] [0,1,2]]

Offsets shown in the figure:
  • offset 0
SLIDE 22

Directed Tree

Phase 3: Top-down Traversal

[Figure: tree with nodes a, b, c, d, e, f, g, i, j. Arrays shown: [0] [0] [0] [0] [0] [0,1] [0,5,6,8] [0,1,4] [0,1,2]]

Offsets shown in the figure:
  • offset 0
  • offset 1
SLIDE 23

Directed Tree

Phase 3: Top-down Traversal

[Figure: tree with nodes a, b, c, d, e, f, g, i, j. Arrays shown: [0] [0] [0] [0] [0] [0,1] [0,5,6,8] [0,1,4] [0,1,2]]

Offsets shown in the figure:
  • offset 0
  • offset 6
  • offset 1
SLIDE 24

Directed Tree

Phase 3: Top-down Traversal

[Figure: tree with nodes a, b, c, d, e, f, g, i, j. Arrays shown: [0] [0] [0] [0] [0] [0,1] [0,5,6,8] [0,1,4] [0,1,2]]

discovery = offset + depth

Examples shown in the figure: discovery 0+2, discovery 1+3, discovery 6+1

SLIDE 25

Directed Tree

Phase 3: Top-down Traversal

[Figure: tree with nodes a, b, c, d, e, f, g, i, j. Arrays shown: [0] [0] [0] [0] [0] [0,1] [0,5,6,8] [0,1,4] [0,1,2]]

finish = offset + sub-tree size

Examples shown in the figure: finish 0+0, finish 1+0, finish 6+1
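The two tree phases can be sketched sequentially as follows. This is a hedged illustration, not the authors' GPU code: the tree edges are assumed from the earlier Parent row, the function name `tree_dfs_times` is hypothetical, and "sub-tree size" is taken to mean the number of descendants (the node itself excluded), which is the reading that matches the slide's "finish 0+0" for a leaf. Times are 0-based.

```python
from itertools import accumulate


def tree_dfs_times(children, root):
    # Level order (parents before children); also records the depth of each node.
    depth = {root: 0}
    order = [root]
    for u in order:                       # the list grows while we walk it
        for v in children.get(u, []):
            depth[v] = depth[u] + 1
            order.append(v)

    # Bottom-up phase: per-node prefix sum over the sizes of the child sub-trees.
    descendants, prefix = {}, {}
    for u in reversed(order):             # children are handled before parents
        sizes = [descendants[v] + 1 for v in children.get(u, [])]
        prefix[u] = [0] + list(accumulate(sizes))
        descendants[u] = prefix[u][-1]

    # Top-down phase: a child's offset is its parent's offset plus the
    # prefix-sum entry at the child's position.
    offset = {root: 0}
    for u in order:
        for k, v in enumerate(children.get(u, [])):
            offset[v] = offset[u] + prefix[u][k]

    discovery = {u: offset[u] + depth[u] for u in order}
    finish = {u: offset[u] + descendants[u] for u in order}
    return prefix, offset, discovery, finish


# Tree edges assumed from the Parent row of the "What is DFS?" slides.
tree = {"a": ["b", "c", "d"], "b": ["e", "f"], "d": ["g"], "f": ["i", "j"]}
prefix, offset, discovery, finish = tree_dfs_times(tree, "a")
print(prefix["a"], prefix["b"], prefix["f"], prefix["d"])
print(sorted(discovery, key=discovery.get))
print(sorted(finish, key=finish.get))
```

Within one tree level no node depends on another, which is presumably what makes each level a candidate for one parallel step on the GPU; the sketch walks the levels one node at a time only for readability.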

SLIDE 26

DIRECTED ACYCLIC GRAPHS

PATH-BASED VARIANT

SLIDE 27

Path-Based (for DAGs)

Phase 1

[Figure: DAG with nodes a, b, c, d, e, f, g, i, j. Two paths collide at node f: left [a,b,f], right [a,d,f]]

SLIDE 28

Path-Based (for DAGs)

Phase 1

[Figure: DAG with nodes a, b, c, d, e, f, g, i, j. Two paths collide at node f: left [a,b,f], right [a,d,f]]

Collision resolution:

  • wait until all paths to a node are traversed
  • align path sequences: left [a,b,f], right [a,d,f]
  • compare left-to-right and choose the lexicographically smallest
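The collision rule above can be sketched as a top-down sweep in which every node keeps the smallest path that reaches it. Several things here are assumptions made for illustration: the DAG is the earlier tree plus an edge d→f (read off the [a,d,f] path), paths are stored as tuples of child positions rather than node labels so that the comparison follows adjacency order, every node is assumed reachable from the root, and `path_based_parents` is a hypothetical name.

```python
def path_based_parents(children, root):
    # Count incoming edges so a node is expanded only after all of its
    # incoming paths have arrived ("wait until all paths are traversed").
    pending = {}
    for u, vs in children.items():
        for v in vs:
            pending[v] = pending.get(v, 0) + 1

    best = {root: ()}                     # smallest path found so far per node
    parent = {root: None}
    ready = [root]
    while ready:
        u = ready.pop()
        for k, v in enumerate(children.get(u, [])):
            candidate = best[u] + (k,)    # extend the path by the child position
            if v not in best or candidate < best[v]:   # lexicographic compare
                best[v] = candidate
                parent[v] = u
            pending[v] -= 1
            if pending[v] == 0:           # all paths to v have been compared
                ready.append(v)
    return parent, best


# Assumed DAG: the example tree plus the extra edge d -> f.
dag = {"a": ["b", "c", "d"], "b": ["e", "f"], "d": ["f", "g"], "f": ["i", "j"]}
parent, best = path_based_parents(dag, "a")
print(parent["f"], best["f"])
```

In this example f keeps the path through b, so the resulting parent map agrees with the sequential DFS shown earlier.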

SLIDE 29

Path-Based (for DAGs)

[Figure: DAG with nodes a, b, c, d, e, f, g, i, j]

This phase is done

SLIDE 30

OPTIMIZATIONS

SLIDE 31

Path Pruning

[Figure: DAG with nodes a, b, c, d, e, f. Two paths reach f: [a,b,e,f] and [a,c,d,f]]

SLIDE 32

Path Pruning

[Figure: DAG with nodes a, b, c, d, e, f. Two paths reach f: [a,b,e,f] and [a,c,d,f]]

When two paths reach the same node
  • There exists a parent “a” where the paths split into [a,b,…] and [a,c,…]

SLIDE 33

Path Pruning

[Figure: DAG with nodes a, b, c, d, e, f. Two paths reach f: [a,b,e,f] and [a,c,d,f]]

When two paths reach the same node
  • There exists a parent “a” where the paths split into [a,b,…] and [a,c,…]
  • It is the comparison between “b” and “c” that allows us to distinguish between the paths

SLIDE 34

Path Pruning

[Figure: DAG with nodes a, b, c, d, e, f. Two paths reach f: [a,b,e,f] and [a,c,d,f]]

When two paths reach the same node
  • There exists a parent “a” where the paths split into [a,b,…] and [a,c,…]
  • It is the comparison between “b” and “c” that allows us to distinguish between the paths
  • A parent node with a single outgoing edge will never be a decision point

SLIDE 35

Path Pruning

[Figure: DAG with nodes a, b, c, d, e, f. The two paths reaching f are pruned to [a,b,f] and [a,c,f]]

When two paths reach the same node
  • There exists a parent “a” where the paths split into [a,b,…] and [a,c,…]
  • It is the comparison between “b” and “c” that allows us to distinguish between the paths
  • A parent node with a single outgoing edge will never be a decision point
  • No need to store nodes with such parents
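A minimal sketch of the pruning rule, under stated assumptions: the six-node DAG has edges a→b, a→c, b→e, c→d, e→f, d→f (read off the two paths), the helper name `prune` is hypothetical, and the final node of a path is always kept, since the slide's pruned paths [a,b,f] and [a,c,f] still end in f.

```python
def prune(path, children):
    """Drop interior nodes whose parent has only one outgoing edge."""
    kept = [path[0]]
    # Pair every interior node with its predecessor on the path.
    for prev, node in zip(path, path[1:-1]):
        if len(children.get(prev, [])) > 1:   # prev is a real decision point
            kept.append(node)
    if len(path) > 1:
        kept.append(path[-1])                 # the endpoint is always stored
    return kept


# Edges assumed from the two paths shown on the slide.
dag = {"a": ["b", "c"], "b": ["e"], "c": ["d"], "e": ["f"], "d": ["f"]}
left = ["a", "b", "e", "f"]
right = ["a", "c", "d", "f"]
print(prune(left, dag), prune(right, dag))
```

In this example the pruned paths compare the same way as the full ones, because the comparison is decided at the split below a and the dropped nodes e and d come after it.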

SLIDE 36

Path Pruning

SLIDE 37

Phase Composition

SLIDE 38

SSSP-BASED VARIANT

SLIDE 39

SSSP-based (for DAGs)

Phase 1: Bottom-Up Traversal

Run the algorithm for Directed Trees, but
  • Propagate # of nodes to all the parents
  • Start prefix sum with 1 (instead of 0)

[Figure: DAG with nodes a, b, c, d, e, f, g, i, j. Arrays shown: [1] [1] [1] [1] [1]]

SLIDE 40

SSSP-based (for DAGs)

Phase 1: Bottom-Up Traversal

Run the algorithm for Directed Trees, but
  • Propagate # of nodes to all the parents
  • Start prefix sum with 1 (instead of 0)

[Figure: DAG with nodes a, b, c, d, e, f, g, i, j. Arrays shown: [1] [1] [1] [1] [1] [1,1] [1,1,1]]

SLIDE 41

SSSP-based (for DAGs)

Phase 1: Bottom-Up Traversal

Run the algorithm for Directed Trees, but
  • Propagate # of nodes to all the parents
  • Start prefix sum with 1 (instead of 0)

[Figure: DAG with nodes a, b, c, d, e, f, g, i, j. Arrays shown: [1] [1] [1] [1] [1] [1,2] [1,2,3]. Annotation: prefix sum]

SLIDE 42

SSSP-based (for DAGs)

Phase 1: Bottom-Up Traversal

Run the algorithm for Directed Trees, but
  • Propagate # of nodes to all the parents
  • Start prefix sum with 1 (instead of 0)

[Figure: DAG with nodes a, b, c, d, e, f, g, i, j. Arrays shown: [1] [1] [1] [1] [1] [1,2] [1,1,3] [1,2,3]]

SLIDE 43

SSSP-based (for DAGs)

Phase 1: Bottom-Up Traversal

Run the algorithm for Directed Trees, but
  • Propagate # of nodes to all the parents
  • Start prefix sum with 1 (instead of 0)

[Figure: DAG with nodes a, b, c, d, e, f, g, i, j. Arrays shown: [1] [1] [1] [1] [1] [1,2] [1,2,4] [1,2,3]. Annotation: prefix sum]

SLIDE 44

SSSP-based (for DAGs)

Phase 1: Bottom-Up Traversal

Run the algorithm for Directed Trees, but
  • Propagate # of nodes to all the parents
  • Start prefix sum with 1 (instead of 0)

[Figure: DAG with nodes a, b, c, d, e, f, g, i, j. Arrays shown: [1] [1] [1] [1] [1] [1,2] [1,5,1,2,1] [1,2,4] [1,2,3]]

SLIDE 45

SSSP-based (for DAGs)

Phase 1: Bottom-Up Traversal

Run the algorithm for Directed Trees, but
  • Propagate # of nodes to all the parents
  • Start prefix sum with 1 (instead of 0)

[Figure: DAG with nodes a, b, c, d, e, f, g, i, j. Arrays shown: [1] [1] [1] [1] [1] [1,2] [1,6,7,9,10] [1,2,4] [1,2,3]]

SLIDE 46

SSSP-based (for DAGs)

Assign # of nodes as the edge weight

[Figure: DAG with nodes a, b, c, d, e, f, g, i, j. Edge weights shown: 1 2 1 6 7 9 1 1 2]

This phase is done, next phase is about to start …

SLIDE 47

SSSP-based (for DAGs)

Phase 2: Top-down traversal

[Figure: DAG with nodes a, b, c, d, e, f, g, i, j. Edge weights shown: 1 2 1 6 7 9 1 1 2. Annotation: 1+2+2=5 < 9]

SLIDE 48

SSSP-based (for DAGs)

Phase 2: Top-down traversal

[Figure: DAG with nodes a, b, c, d, e, f, g, i, j. Edge weights shown: 1 2 1 6 7 9 1 1 2. Annotation: 1+2+2=5 < 9]

Shortest Path is the DFS path
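Both SSSP phases can be sketched as below. The graph is an assumption: the earlier tree plus an extra edge a→j listed last among a's children, inferred from the five-entry array at the root and from the comparison 1+2+2=5 < 9. The function name `sssp_dfs_distances` is hypothetical, and the shortest paths are found by relaxing edges in topological order, which is one standard way to solve SSSP on a DAG and not necessarily what the authors run on the GPU.

```python
from itertools import accumulate


def sssp_dfs_distances(children, root):
    # Topological order (Kahn's algorithm); assumes every node is reachable.
    remaining = {}
    for u, vs in children.items():
        for v in vs:
            remaining[v] = remaining.get(v, 0) + 1
    topo, ready = [], [root]
    while ready:
        u = ready.pop()
        topo.append(u)
        for v in children.get(u, []):
            remaining[v] -= 1
            if remaining[v] == 0:
                ready.append(v)

    # Phase 1 (bottom-up): node counts flow to ALL parents; the prefix sum
    # starts at 1 and its entries become the weights of the outgoing edges.
    count, weight = {}, {}
    for u in reversed(topo):
        kids = children.get(u, [])
        sums = list(accumulate([1] + [count[v] for v in kids]))
        count[u] = sums[-1]
        for k, v in enumerate(kids):
            weight[(u, v)] = sums[k]

    # Phase 2 (top-down): shortest paths from the root.
    dist, parent = {root: 0}, {root: None}
    for u in topo:
        for v in children.get(u, []):
            d = dist[u] + weight[(u, v)]
            if v not in dist or d < dist[v]:
                dist[v], parent[v] = d, u
    return weight, dist, parent


# Assumed DAG: the example tree plus the extra edge a -> j.
dag = {"a": ["b", "c", "d", "j"], "b": ["e", "f"], "d": ["g"], "f": ["i", "j"]}
weight, dist, parent = sssp_dfs_distances(dag, "a")
print(weight[("a", "j")], dist["j"], parent["j"])
```

Under these assumptions the direct edge a→j gets weight 9 while the route through b and f costs 5, so j ends up with parent f, as in the sequential DFS.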

SLIDE 49

SSSP-based (for DAGs)

[Figure: DAG with nodes a, b, c, d, e, f, g, i, j]

Phase 2: This phase is done

SLIDE 50

OPTIMIZATIONS

SLIDE 51

Discovery time

Phase 3a: Sorting

  • The length of the shortest path defines an ordering of nodes

[Figure: DAG with nodes a, b, c, d, e, f, g, i, j. Shortest-path lengths shown: 8 1 6 7 2 4 5 3]

SLIDE 52

Discovery time

Phase 3a: Sorting

  • The length of the shortest path defines an ordering of nodes
  • We can sort them to obtain discovery time

[Figure: DAG with nodes a, b, c, d, e, f, g, i, j. Shortest-path lengths shown: 8 1 6 7 2 4 5 3]

SLIDE 53

Discovery time

  • The length of the shortest path defines an ordering of nodes
  • We can sort them to obtain discovery time

[Figure: DAG with nodes a, b, c, d, e, f, g, i, j. Shortest-path lengths shown: 8 1 6 7 2 4 5 3]

Discovery: a,b,e,f,i,j,c,d,g

Phase 3a: This phase is done (Phase 3b will find the finish time)
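The sorting step is small enough to show directly. In this sketch the assignment of the eight lengths to nodes is an assumption derived from the Discovery row (the node discovered k-th after the root gets length k), the root is given length 0, and `discovery_from_distances` is a hypothetical helper name.

```python
def discovery_from_distances(dist):
    """Sort nodes by shortest-path length; the rank is the discovery time."""
    order = sorted(dist, key=dist.get)
    return {node: t for t, node in enumerate(order)}, order


# Lengths assumed from the slide: 1..8 matched to the Discovery row.
dist = {"g": 8, "b": 1, "c": 6, "d": 7, "e": 2, "i": 4, "j": 5, "f": 3, "a": 0}
discovery_time, order = discovery_from_distances(dist)
print(",".join(order))
```

A GPU implementation would presumably use a parallel sort such as a radix sort for this step; `sorted` stands in for it here.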

SLIDE 54

Phase composition

SLIDE 55

EXPERIMENTS

SLIDE 56

Data

#   Graph               n          m          Application
1   coPapersDBLP        540487     15251812   Citations
2   auto                448696     3350678    Numeric Sim.
3   hugebubbles-000…    18318144   30144175   Numeric Sim.
4   delaunay_n24        16777217   52556391   Random Tri.
5   il2010              451555     1166978    Census Data
6   fl2010              484482     1270757    Census Data
7   ca2010              710146     1880571    Census Data
8   tx2010              914232     2403504    Census Data
9   great-britain_osm   7733823    8523976    Road Network
10  germany_osm         11548846   12793527   Road Network
11  road_central        14081817   21414269   Road Network
12  road_usa            23947348   35246600   Road Network

When necessary, DAGs are created from general graphs by dropping back edges
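The slide does not say how back edges are identified. One common approach, shown here purely as an assumed illustration, is a DFS with three node states in which any edge pointing to a node still on the current DFS stack is a back edge and is skipped. The name `drop_back_edges` is hypothetical, and the traversal is iterative to avoid recursion limits on large graphs.

```python
def drop_back_edges(adj):
    """Return a copy of the directed graph without DFS back edges (a DAG)."""
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited, on the stack, finished
    color = {u: WHITE for u in adj}
    for vs in adj.values():
        for v in vs:
            color.setdefault(v, WHITE)    # nodes that only appear as targets
    dag = {u: [] for u in color}

    for start in color:
        if color[start] != WHITE:
            continue
        color[start] = GRAY
        stack = [(start, iter(adj.get(start, [])))]
        while stack:
            u, edges = stack[-1]
            for v in edges:
                if color[v] == GRAY:      # v is an ancestor of u: back edge
                    continue
                dag[u].append(v)          # tree, forward and cross edges stay
                if color[v] == WHITE:
                    color[v] = GRAY
                    stack.append((v, iter(adj.get(v, []))))
                    break                 # descend into v first
            else:
                color[u] = BLACK          # all edges of u handled
                stack.pop()
    return dag


cyclic = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
print(drop_back_edges(cyclic))
```

Which edges count as back edges depends on the start node and the adjacency order, so different traversals of the same graph can produce different DAGs.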

SLIDE 57

Performance

Results obtained with Nvidia Pascal TitanX GPU, Intel Core i7-3930K @3.2GHz CPU, Ubuntu 14.04 LTS OS, CUDA Toolkit 8.0

[Figure: bar chart of Speedup (Parallel vs. Seq. DFS), axis ticks from 0.5 to 4; legend: Path-based, SSSP-based, 3 BFS; value labels: 6x, 5x]

SLIDE 58

CONCLUSIONS

SLIDE 59

Conclusions

  • Parallel DFS for DAGs
      • Work-efficient O(m+n)
      • The algorithm takes O(z log n) steps, where z is the maximum depth of a node
  • Performance
      • Depends highly on the connectivity/sparsity pattern
      • Can achieve up to 6x speedup (but slowdown possible)
  • Details
      • M. Naumov, A. Vrielink and M. Garland, “Parallel Depth-First Search for Directed Acyclic Graphs”, Technical Report, NVR-2017-001, 2017 https://research.nvidia.com/publication/parallel-depth-first-search-directed-acyclic-graphs

SLIDE 60

Thank you

https://research.nvidia.com/publication/parallel-depth-first-search-directed-acyclic-graphs