General Transformations for GPU Execution of Tree Traversals
Michael Goldfarb*, Youngjoon Jo**, Milind Kulkarni
School of Electrical and Computer Engineering
* Now at Qualcomm; ** Now at Google
Thursday, November 21, 13
parallelism
memory accesses
Given a set of points in k dimensions and a point p, find all points within a radius r of p
Naive approach: compare all N points with p
Tree-based approach: build a tree over the points, traverse the tree for point p, prune subtrees that are far from p
[Figure: animation of an example tree over the points, with nodes A through K; A is the root, and B and G are its children.]
KDCell root = /* build kdtree */;
Set<Point> ps;
double radius;
foreach Point p in ps {
  recurse(p, root, radius);
}
...
void recurse(Point p, KDCell node, double r) {
  if (tooFar(p, node, r)) return;   // prune subtrees that are far from p
  if (node.isLeaf()) {              // a leaf never recurses further
    if (dist(node.point, p) < r) p.correlated++;
  } else {
    recurse(p, node.left, r);
    recurse(p, node.right, r);
  }
}
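The point-correlation traversal above can be made concrete with a small runnable sketch. This is an illustration under stated assumptions, not the authors' implementation: the bounding-box fields `lo` and `hi`, the median-split `build` function, and the box-distance test inside `too_far` are hypothetical choices, and `recurse` returns a count instead of incrementing `p.correlated`. A brute-force count is included as the "compare all N points" baseline.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


@dataclass
class KDCell:
    lo: Point                        # bounding-box minimum corner (assumed field)
    hi: Point                        # bounding-box maximum corner (assumed field)
    point: Optional[Point] = None    # set only on leaves
    left: Optional["KDCell"] = None
    right: Optional["KDCell"] = None

    def is_leaf(self) -> bool:
        return self.point is not None


def build(points: List[Point], depth: int = 0) -> KDCell:
    """Build a kd-tree with one point per leaf, splitting at the median."""
    k = len(points[0])
    lo = tuple(min(p[d] for p in points) for d in range(k))
    hi = tuple(max(p[d] for p in points) for d in range(k))
    if len(points) == 1:
        return KDCell(lo, hi, point=points[0])
    pts = sorted(points, key=lambda p: p[depth % k])
    mid = len(pts) // 2
    return KDCell(lo, hi,
                  left=build(pts[:mid], depth + 1),
                  right=build(pts[mid:], depth + 1))


def too_far(p: Point, node: KDCell, r: float) -> bool:
    """True if every point in the node's bounding box is at distance >= r from p."""
    gap = [max(l - x, 0.0, x - h) for x, l, h in zip(p, node.lo, node.hi)]
    return math.hypot(*gap) >= r


def recurse(p: Point, node: KDCell, r: float) -> int:
    """Count tree points strictly within radius r of p, pruning far subtrees."""
    if too_far(p, node, r):
        return 0
    if node.is_leaf():
        return 1 if math.dist(node.point, p) < r else 0
    return recurse(p, node.left, r) + recurse(p, node.right, r)


def brute_force(p: Point, points: List[Point], r: float) -> int:
    """Baseline: compare all N points with p."""
    return sum(1 for q in points if math.dist(q, p) < r)
```

Each query point gets its own independent traversal of the same tree, which is the source of the parallelism the slides go on to exploit.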
TreeNode root;
Set<Point> ps;
foreach Point p in ps {
  recurse(p, root, ...);
}
...
recurse(Point p, TreeNode node, ...) {
  if (truncate?(p, node, ...)) { ... }
  recurse(p, node.child1, ...);
  recurse(p, node.child2, ...);
  ...
}
Common structure: a tree structure, a repeated traversal (one per point), and a recursive traversal
Lots of parallelism!
Want generally applicable techniques for mapping irregular applications to GPUs
traversals to GPUs
to recursive tree-traversal code
algorithms
thread) execution
instruction
threads execute
[Figure: animation of two traversals of the example tree. One covers nodes A, B, G, C, F, D, E; the other covers nodes A, B, G, H, K, I, J.]
Ropes: pointers in a tree node that let a traversal jump to the next part of the traversal
Installing ropes into the tree requires complex, application-specific preprocessing
Using ropes during execution requires complex, application-specific logic
[Figure: rope-stack traversal of the example tree, restricted to nodes A, B, G, C, F, D, E. Nodes are visited in the order A, B, C, D, E, F, G. Rope stack contents at each visit, top first: (empty); G; F G; E F G; F G; G; (empty).]
void recurse(Point p, KDCell node, double r) {
  if (tooFar(p, node, r)) return;
  if (node.isLeaf()) {
    if (dist(node.point, p) < r) p.correlated++;
  } else {
    recurse(p, node.left, r);
    recurse(p, node.right, r);
  }
}
ropeStack.push(root);
while (!ropeStack.isEmpty()) {
  node = ropeStack.pop();
  if (tooFar(p, node, r)) continue;
  if (node.isLeaf()) {
    if (dist(node.point, p) < r) p.correlated++;
  } else {
    ropeStack.push(node.right);   // pushed first so that left is popped first
    ropeStack.push(node.left);
  }
}

See the paper for details of how to transform more complex code.
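The recursion-to-rope-stack rewrite above can be checked on a small tree. The sketch below is a hedged illustration: the `Node` class, the `truncate` callback, and the tree shape (modelled on the A through G subtree in the slides' rope-stack animation) are assumptions made for the example. It shows that popping from an explicit stack, with the right child pushed before the left, reproduces the visit order of the recursive version.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    name: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def recursive_order(node: Optional[Node],
                    truncate: Callable[[Node], bool],
                    out: List[str]) -> List[str]:
    """Reference version: plain recursive pre-order traversal with truncation."""
    if node is None or truncate(node):
        return out
    out.append(node.name)
    recursive_order(node.left, truncate, out)
    recursive_order(node.right, truncate, out)
    return out


def rope_stack_order(root: Node, truncate: Callable[[Node], bool]) -> List[str]:
    """Same traversal, driven by an explicit stack instead of recursion."""
    out: List[str] = []
    rope_stack: List[Optional[Node]] = [root]
    while rope_stack:
        node = rope_stack.pop()
        if node is None or truncate(node):
            continue                      # formerly `return`
        out.append(node.name)
        rope_stack.append(node.right)     # formerly the second recursive call
        rope_stack.append(node.left)      # formerly the first recursive call
    return out


# Hypothetical tree: A -> (B -> (C -> (D, E), F), G)
tree = Node("A",
            Node("B", Node("C", Node("D"), Node("E")), Node("F")),
            Node("G"))

print(rope_stack_order(tree, lambda n: False))
# prints ['A', 'B', 'C', 'D', 'E', 'F', 'G']
```

Because the stack is built while traversing, nothing has to be installed in the tree beforehand, which is the contrast with hand-installed ropes drawn earlier in the talk.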
lead to thread divergence
recursive calls, other threads wait until calls return
iterative code
at beginning of loop
ropeStack.push(root);
while (!ropeStack.isEmpty()) {
  node = ropeStack.pop();
  if (tooFar(p, node, r)) continue;   // formerly return
  if (...) p.correlated++;
  else {
    ropeStack.push(node.right);       // formerly recursion
    ropeStack.push(node.left);        // formerly recursion
  }
}
[Figure: animation of a warp's threads executing the iterative traversal on the example tree.]
Threads no longer diverge in execution, but they do diverge in the tree!
all threads in a warp visit the child
masked out
threads’ traversals, rather than longest traversal), but improved memory performance makes up for it
pass
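The lockstep idea in the points above (the whole warp visits a child whenever any thread needs it, and threads that would have truncated are masked out) can be simulated sequentially. The following is a minimal sketch under stated assumptions, not the authors' GPU code: the interval fields `lo` and `hi`, the one-dimensional `truncate` test, and the tree labels are hypothetical, and the "warp" is just a tuple of thread indices sharing one rope stack.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    name: str
    lo: float                      # assumed: node covers the interval [lo, hi)
    hi: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def truncate(p: float, node: Node) -> bool:
    """Hypothetical truncation test: p lies outside the node's interval."""
    return not (node.lo <= p < node.hi)


def lockstep_traverse(root: Node, points: List[float],
                      trunc: Callable[[float, Node], bool]
                      ) -> Tuple[List[str], List[List[str]]]:
    """One shared rope stack per warp; each entry carries a mask of active threads."""
    warp_order: List[str] = []                       # nodes the warp visits
    per_thread: List[List[str]] = [[] for _ in points]
    rope_stack = [(root, tuple(range(len(points))))]
    while rope_stack:
        node, mask = rope_stack.pop()
        if node is None:
            continue
        warp_order.append(node.name)
        active = []
        for t in mask:                               # masked-out threads skip this node
            per_thread[t].append(node.name)
            if not trunc(points[t], node):
                active.append(t)
        if not active:
            continue                                 # every thread truncated here
        rope_stack.append((node.right, tuple(active)))
        rope_stack.append((node.left, tuple(active)))
    return warp_order, per_thread


tree = Node("A", 0, 8,
            Node("B", 0, 4, Node("C", 0, 2), Node("F", 2, 4)),
            Node("G", 4, 8, Node("H", 4, 6), Node("K", 6, 8)))

warp_order, per_thread = lockstep_traverse(tree, [1.0, 5.0], truncate)
print(warp_order)    # prints ['A', 'B', 'C', 'F', 'G', 'H', 'K']
print(per_thread)    # prints [['A', 'B', 'C', 'F', 'G'], ['A', 'B', 'G', 'H', 'K']]
```

In this example each thread alone touches five nodes, while the warp as a whole touches seven: the warp's work is the union of its threads' traversals rather than the longest one, which is the cost the slide says improved memory performance makes up for.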
[Figure: animation of a lockstep traversal of the example tree.]
to structure of arrays
transform automatically
commonly-accessed fields placed in shared memory)
strided access
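The array-of-structures to structure-of-arrays change mentioned above can be illustrated on a tiny tree. This is a hand-written sketch, whereas the slides describe the transformation as automatic; the `Node` class, the `to_soa` helper, and the use of `-1` as a "no child" index are assumptions made for the example. Each field becomes its own contiguous array indexed by node id, so that threads reading the same field of neighboring nodes touch adjacent memory.

```python
from array import array
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:                       # array-of-structures style: one object per node
    value: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def to_soa(root: Node) -> Tuple[array, array, array]:
    """Flatten a linked tree into parallel arrays; ids follow pre-order, -1 = no child."""
    value, left, right = array("d"), array("i"), array("i")

    def visit(node: Optional[Node]) -> int:
        if node is None:
            return -1
        i = len(value)
        value.append(node.value)
        left.append(-1)
        right.append(-1)
        left[i] = visit(node.left)
        right[i] = visit(node.right)
        return i

    visit(root)
    return value, left, right


def soa_sum(value: array, left: array, right: array) -> float:
    """Traverse the flattened tree with an explicit stack of node ids."""
    total, stack = 0.0, [0]
    while stack:
        i = stack.pop()
        if i < 0:
            continue
        total += value[i]
        stack.append(right[i])
        stack.append(left[i])
    return total


tree = Node(1.0, Node(2.0), Node(3.0, Node(4.0)))
value, left, right = to_soa(tree)
print(list(value), list(left), list(right))
# prints [1.0, 2.0, 3.0, 4.0] [1, -1, 3, -1] [2, -1, -1, -1]
```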
memory, 14 SMs)
Nearest neighbor, Vantage point trees
difficult
and lockstepping, that can achieve significant speedup on GPU
implementations
hand-written implementations