SLIDE 1

General Transformations for GPU Execution of Tree Traversals

Michael Goldfarb*, Youngjoon Jo**, Milind Kulkarni School of Electrical and Computer Engineering

* Now at Qualcomm; ** Now at Google

Thursday, November 21, 13

SLIDE 2

GPU execution of irregular programs

• GPUs offer the promise of massive, energy-efficient parallelism
• Much success in mapping regular applications to GPUs
  • Regular memory accesses, predictable computation
• Much less success in mapping irregular applications
  • Pointer-based data structures
  • Unpredictable, input-dependent computation and memory accesses

SLIDE 3

Tree traversal algorithms

• Many irregular algorithms are built around tree traversal
  • Barnes-Hut
  • Nearest-neighbor
  • 2-point correlation
• Numerous papers describe how to map tree traversal algorithms to GPUs

SLIDE 4

Point correlation

• Data mining algorithm
• Goal: given a set of N points in k dimensions and a point p, find all points within a radius r of p
• Naïve approach: compare all N points with p
• Better approach: build a kd-tree over the points, traverse the tree for point p, prune subtrees that are far from p
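The naïve approach can be sketched directly (a minimal illustration; the points and radius below are made-up example data):

```python
import math

def naive_correlation(points, p, r):
    """Naive 2-point correlation: compare every point against p and
    count those within radius r (O(N) work per query point)."""
    return sum(1 for q in points if math.dist(p, q) < r)

# hypothetical example data
points = [(0.0, 0.0), (1.0, 1.0), (3.0, 3.0), (0.5, 0.2)]
print(naive_correlation(points, (0.0, 0.0), 2.0))  # 3 points lie within radius 2
```

The kd-tree approach below avoids most of these comparisons by pruning whole subtrees at once.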


SLIDE 7

Point correlation

[Figure: a kd-tree built over the example points, with internal nodes A, B, C, F, G, H, K and leaves D, E, I, J]

SLIDE 13

Point correlation

[Figure: traversal of the kd-tree for a point p, descending node by node and pruning subtrees that lie farther than r from p]

SLIDE 21

Point correlation

KDCell root = /* build kdtree */;
Set<Point> ps;
double radius;
foreach Point p in ps {
  recurse(p, root, radius);
}
...
void recurse(Point p, KDCell node, double r) {
  if (tooFar(p, node, r)) return;
  if (node.isLeaf() && (dist(node.point, p) < r))
    p.correlated++;
  else {
    recurse(p, node.left, r);
    recurse(p, node.right, r);
  }
}
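A runnable sketch of this traversal, reduced to one dimension for brevity (the KDCell layout and the bounding-interval tooFar test are illustrative assumptions, not the paper's exact structures):

```python
class KDCell:
    """A simplified 1-D stand-in for the slide's KDCell: leaves hold a
    point; every cell records the bounding interval of its subtree."""
    def __init__(self, point=None, left=None, right=None):
        self.point, self.left, self.right = point, left, right
        pts = self._points()
        self.lo, self.hi = min(pts), max(pts)

    def _points(self):
        if self.point is not None:
            return [self.point]
        return self.left._points() + self.right._points()

    def is_leaf(self):
        return self.point is not None

def too_far(p, node, r):
    # distance from p to the cell's bounding interval is at least r -> prune
    return max(node.lo - p, p - node.hi, 0.0) >= r

def recurse(p, node, r):
    if too_far(p, node, r):
        return 0
    if node.is_leaf():
        return 1 if abs(node.point - p) < r else 0
    return recurse(p, node.left, r) + recurse(p, node.right, r)

tree = KDCell(left=KDCell(point=1.0),
              right=KDCell(left=KDCell(point=4.0), right=KDCell(point=9.0)))
print(recurse(2.0, tree, 3.0))  # 2: points 1.0 and 4.0 are within radius 3
```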

SLIDE 22

Basic pattern

TreeNode root;                           // tree structure
Set<Point> ps;
foreach Point p in ps {                  // repeated traversal
  recurse(p, root, ...);
}
...
recurse(Point p, TreeNode node, ...) {   // recursive traversal
  if (truncate?(p, node, ...)) { ... }
  recurse(p, node.child1, ...);
  recurse(p, node.child2, ...);
  ...
}

Lots of parallelism!

SLIDE 27

What’s the problem?

• GPUs add high overhead for recursion
• GPUs work best when memory accesses are regular and strided, but irregular algorithms have unpredictable memory accesses
• Status quo: ad hoc solutions
  • New algorithm? New GPU techniques!

Want generally applicable techniques for mapping irregular applications to GPUs

SLIDE 29

Contributions

• Two general techniques for mapping tree traversals to GPUs
  • Autoropes: eliminates recursion overhead
  • Lockstepping: promotes memory coalescing
• Compiler pass to automatically apply the techniques to recursive tree-traversal code
• Significant GPU speedups on 5 tree-traversal algorithms

SLIDE 30

Naïve GPU implementation

• Warp-based SIMT (single-instruction, multiple-thread) execution
  • 32 points put in a single warp
  • Warp traverses tree
• All points in a warp must execute the same instruction
  • If points diverge, some points sit idle while other threads execute
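A toy cost model illustrates why divergence hurts (a hypothetical simulation, not the paper's measurement): whenever threads in a warp want to visit different nodes at the same step, the paths serialize and the warp pays for each distinct node.

```python
def warp_time(traversals):
    """Toy SIMT cost model: traversals is one node sequence per thread.
    At each step, all distinct nodes wanted by the threads execute one
    after another; threads not wanting the current node sit idle."""
    steps = 0
    for depth in range(max(len(t) for t in traversals)):
        wanted = {t[depth] for t in traversals if depth < len(t)}
        steps += len(wanted)   # divergent paths serialize
    return steps

# two threads agree on A, B, then diverge at the leaves
print(warp_time([["A", "B", "D"], ["A", "B", "E"]]))  # 1 + 1 + 2 = 4 steps
```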

SLIDE 31

Naïve GPU implementation

[Figure: points assigned to a warp all start at the root A; their traversals then split, some descending into B's subtree (C, D, E, F) and others into G's (H, I, J, K)]

SLIDE 35

Naïve GPU implementation

[Figure: step-by-step warp execution; while some threads traverse one subtree, the threads headed for the other subtree sit idle, so the warp serializes both paths]

SLIDE 49

Lots of accesses to tree

• Many accesses just moving up the tree in order to later move down again
• Lots of function stack manipulation
• Trees are very large, cannot be stored in GPU’s fast memory
• Want to minimize accesses to tree

SLIDE 50

How to avoid extra accesses to tree?

• Typical technique: ropes
  • Pointers in each tree node that let a traversal jump to the next part of the tree
  • Effectively linearizes traversal

[Figure: kd-tree with rope pointers linking each node to its successor in the traversal]
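Manually installing ropes amounts to giving every node a pointer to the node the traversal should visit next if the node's own subtree is skipped, i.e. its successor in depth-first order. A sketch for plain binary nodes (the Node class and field names are hypothetical):

```python
class Node:
    def __init__(self, name, left=None, right=None):
        self.name, self.left, self.right = name, left, right
        self.rope = None   # next node in DFS order if this subtree is skipped

def install_ropes(node, successor=None):
    """A child's rope is its sibling; the last child (and the root)
    inherits the parent's successor."""
    if node is None:
        return
    node.rope = successor
    install_ropes(node.left, node.right if node.right else successor)
    install_ropes(node.right, successor)

#     A
#   B   G
root = Node("A", Node("B"), Node("G"))
install_ropes(root)
print(root.left.rope.name)   # G: after (or instead of) B's subtree, visit G
```

A traversal then never walks back up: on pruning a node it follows the node's rope, otherwise it descends to the first child.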

SLIDE 54

How to avoid extra accesses to tree?

• Installing ropes into a tree requires complex, application-specific preprocessing
• Using ropes correctly during execution requires complex, application-specific logic

SLIDE 55

Autoropes

• General technique for “linearizing” tree traversal for arbitrary traversal algorithms
• Achieves generality, simplicity and space-efficiency at the cost of some overhead
• Key insight: recursive tree algorithms are just depth-first traversals of a tree; can transform into an iterative algorithm
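The key insight can be shown generically: a recursive preorder traversal and an iterative loop over an explicit stack visit the same nodes in the same order (a minimal sketch; the tuple-based tree encoding is just for illustration):

```python
def recursive_preorder(node, out):
    if node is None:
        return
    out.append(node[0])                # visit
    recursive_preorder(node[1], out)   # left subtree
    recursive_preorder(node[2], out)   # right subtree

def iterative_preorder(root):
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        if node is None:
            continue
        out.append(node[0])
        stack.append(node[2])   # push right first ...
        stack.append(node[1])   # ... so left is popped (visited) first
    return out

tree = ("A", ("B", None, None), ("G", ("C", None, None), None))
a, b = [], iterative_preorder(tree)
recursive_preorder(tree, a)
print(a == b)  # True: both yield ['A', 'B', 'G', 'C']
```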

SLIDE 56

Autoropes

[Figure: as the traversal visits each node of the kd-tree, ropes to the not-yet-visited children are pushed onto a rope stack; popping the stack continues the traversal without revisiting interior nodes]

SLIDE 63

Autoropes

• Ropes stored on a rope stack instead of in the tree
  • No application-specific code to use ropes
• Ropes instantiated dynamically
  • No preprocessing required
• Same access patterns as with manual ropes
• Extra pushes and pops on the rope stack add some overhead

SLIDE 64

Autoropes

void recurse(Point p, KDCell node, double r) {
  if (tooFar(p, node, r)) return;
  if (node.isLeaf() && (dist(node.point, p) < r))
    p.correlated++;
  else {
    recurse(p, node.left, r);
    recurse(p, node.right, r);
  }
}

A simple compiler transformation converts the recursive code into iterative autoropes code.

SLIDE 65

Autoropes

ropeStack.push(root);
while (!ropeStack.isEmpty()) {
  node = ropeStack.pop();
  if (tooFar(p, node, r)) continue;
  if (node.isLeaf() && (dist(node.point, p) < r))
    p.correlated++;
  else {
    ropeStack.push(node.right);
    ropeStack.push(node.left);
  }
}

See paper for details of how to transform more complex code.
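In the same simplified one-dimensional setting used earlier, the transformed loop can be sketched as follows (the tuple node layout and bounding-interval test are assumptions, not the paper's generated code):

```python
def too_far(p, lo, hi, r):
    # distance from p to the cell's bounding interval [lo, hi]
    return max(lo - p, p - hi, 0.0) >= r

def correlate(p, root, r):
    """Iterative autoropes-style traversal. Nodes are tuples
    (lo, hi, point, left, right); point is None for interior nodes."""
    count, rope_stack = 0, [root]
    while rope_stack:
        lo, hi, point, left, right = rope_stack.pop()
        if too_far(p, lo, hi, r):
            continue                  # formerly: return
        if point is not None:
            count += abs(point - p) < r
        else:
            rope_stack.append(right)  # push right first ...
            rope_stack.append(left)   # ... so left is visited first
    return count

leaf = lambda x: (x, x, x, None, None)
tree = (1.0, 9.0, None, leaf(1.0), (4.0, 9.0, None, leaf(4.0), leaf(9.0)))
print(correlate(2.0, tree, 3.0))  # 2: points 1.0 and 4.0 are within radius 3
```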

SLIDE 66

Unintended consequence

• Recursive calls naturally lead to thread divergence
  • If some threads make recursive calls, other threads wait until the calls return
• Does not happen for iterative code
  • All threads reconverge at the beginning of the loop

ropeStack.push(root);
while (!ropeStack.isEmpty()) {
  node = ropeStack.pop();
  if (tooFar(p, node, r)) continue;  // formerly return
  if (...)
    p.correlated++;
  else {
    ropeStack.push(node.right);      // formerly recursion
    ropeStack.push(node.left);
  }
}

SLIDE 68

Autoropes on GPU

[Figure: warp execution with autoropes; every thread runs the same loop body in each iteration, but each thread's rope stack leads it to a different node]

Threads no longer diverge in execution, but they do diverge in the tree!

SLIDE 76

Thread divergence vs. memory coalescing

• Memory accesses on GPU are only well behaved if accesses by all threads in a warp can be coalesced
  • Same memory or strided access
• Bad memory behavior of autoropes outweighs lack of thread divergence
• Goal: benefits of autoropes while maintaining memory coalescing

SLIDE 77

Lockstepping

• Essentially, force the GPU to keep threads from diverging in the tree
• If any thread in a warp wants to visit a node’s children, all threads in the warp visit the children
• Threads that are “dragged along” are programmatically masked out
• Warp execution takes longer (proportional to the union of the threads’ traversals, rather than the longest traversal), but improved memory performance makes up for it
• Automatically implemented during the autoropes compiler pass
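Lockstepping can be simulated in a few lines (a hypothetical illustration of the idea, not GPU code): the warp follows one shared traversal; at every node each thread evaluates its own truncation test, the warp descends if any thread is still active, and inactive threads are masked out.

```python
def lockstep_correlate(warp_points, root, r):
    """warp_points: the points assigned to one warp. Nodes are tuples
    (lo, hi, point, left, right); point is None for interior nodes."""
    counts = [0] * len(warp_points)
    stack = [root]                 # one shared stack for the whole warp
    while stack:
        lo, hi, point, left, right = stack.pop()
        # per-thread truncation test; pruned threads are masked out
        active = [max(lo - p, p - hi, 0.0) < r for p in warp_points]
        if not any(active):
            continue               # the whole warp prunes this subtree
        if point is not None:
            for i, p in enumerate(warp_points):
                if active[i] and abs(point - p) < r:
                    counts[i] += 1
        else:
            stack.append(right)    # every thread is dragged along
            stack.append(left)
    return counts

leaf = lambda x: (x, x, x, None, None)
tree = (1.0, 9.0, None, leaf(1.0), (4.0, 9.0, None, leaf(4.0), leaf(9.0)))
print(lockstep_correlate([2.0, 8.5], tree, 3.0))  # [2, 1]
```

Since all threads read the same node each iteration, the tree accesses coalesce, at the cost of visiting the union of the threads' traversals.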

SLIDE 78

Dynamic lockstepping

• Some algorithms allow different traversal orders
  • Some points visit left child before right, and others visit right before left
• Optimization reduces traversal size
• Inherently bad memory access patterns

SLIDE 79

Dynamic lockstepping

[Figure: points in a warp prefer different traversal orders through the kd-tree; following each point's own order would break coalescing]

SLIDE 84

Dynamic lockstepping

• Dynamic lockstepping allows all points in a warp to “vote” on which traversal order to use
• Maintains memory coalescing
• Some points do more work than in the original algorithm
  • Tradeoff can still be worth it!
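The vote can be sketched as a majority poll among the warp's threads over which child to visit first (a hypothetical illustration of the idea; names and tie-breaking are made up):

```python
def vote_order(preferences):
    """Each thread prefers 'left' or 'right' first (e.g. the child closer
    to its point); the warp follows the majority so every thread traverses
    in the same order and memory accesses stay coalesced. Ties go left."""
    left_votes = sum(1 for pref in preferences if pref == "left")
    return "left" if left_votes * 2 >= len(preferences) else "right"

print(vote_order(["left", "right", "left"]))   # left
print(vote_order(["right", "right", "left"]))  # right
```

Threads outvoted here do extra work relative to their own best order, which is the tradeoff the slide describes.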

SLIDE 85

Engineering details

• Transform point data from array-of-structures format to structure-of-arrays
  • Use analysis from [PACT 2013] to prove safety and transform automatically
• Copy tree data to GPU in linearized fashion
• Lay out fields of tree and point according to use (more commonly accessed fields placed in shared memory)
• Interleave rope stacks for points in a warp to allow strided access
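The array-of-structures to structure-of-arrays transformation can be sketched as follows (the field names are hypothetical):

```python
def aos_to_soa(points):
    """points: list of dicts (array of structures). Returns one array per
    field (structure of arrays), so thread i reading field f touches
    soa[f][i] -- consecutive threads hit consecutive addresses."""
    soa = {}
    for field in points[0]:
        soa[field] = [pt[field] for pt in points]
    return soa

aos = [{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0}]
print(aos_to_soa(aos))  # {'x': [1.0, 3.0], 'y': [2.0, 4.0]}
```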

SLIDE 86

Results

• Two platforms
  • GPU platform: NVIDIA Tesla C2070 (6 GB global memory, 14 SMs)
  • CPU platform: 32-core, 2.3 GHz Opteron
• Five benchmarks
  • Barnes-Hut, point correlation, nearest neighbor, k-nearest neighbor, vantage point trees
• Multiple inputs per benchmark
  • Used sorted and unsorted points

SLIDE 87

High-level takeaways

• Autoropes + lockstep always faster than the simple recursive GPU implementation (up to 14x faster)
• For most benchmarks/inputs, the best GPU implementation is faster than the CPU implementation up to 16 threads
• Speedups comparable to hand-written implementations

SLIDE 88

Barnes-Hut

[Results chart]

SLIDE 89

Point correlation

[Results chart]

SLIDE 90

Nearest-neighbor

[Results chart]

SLIDE 91

Conclusions

• Mapping irregular applications to GPUs is very difficult
• Developed two general techniques, autoropes and lockstepping, that achieve significant speedup on GPU
  • vs. baseline GPU code and CPU implementations
• Automatic approaches competitive with previous hand-written implementations

SLIDE 92

General Transformations for GPU Execution of Tree Traversals

Michael Goldfarb*, Youngjoon Jo**, Milind Kulkarni School of Electrical and Computer Engineering

* Now at Qualcomm; ** Now at Google
