
General Transformations for GPU Execution of Tree Traversals



  1. General Transformations for GPU Execution of Tree Traversals
     Michael Goldfarb*, Youngjoon Jo**, Milind Kulkarni
     School of Electrical and Computer Engineering
     * Now at Qualcomm; ** Now at Google
     Thursday, November 21, 13

  2. GPU execution of irregular programs
     • GPUs offer promise of massive, energy-efficient parallelism
     • Much success in mapping regular applications to GPUs
       • Regular memory accesses, predictable computation
     • Much less success in mapping irregular applications
       • Pointer-based data structures
       • Unpredictable, input-dependent computation and memory accesses

  3. Tree traversal algorithms
     • Many irregular algorithms are built around tree traversal
       • Barnes-Hut
       • Nearest-neighbor
       • 2-point correlation
     • Numerous papers describing how to map tree traversal algorithms to GPUs

  4. Point correlation
     • Data mining algorithm
     • Goal: given a set of N points in k dimensions and a point p, find all points within a radius r of p
     • Naïve approach: compare all N points with p
     • Better approach: build kd-tree over points, traverse tree for point p, prune subtrees that are far from p
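
The naïve approach on this slide can be sketched in a few lines of C++. This is only an illustration: the names `Point`, `squaredDistance`, and `countWithinRadiusNaive`, and the choice of k = 2, are assumptions made for the example and do not come from the slides.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration; K = 2 is an arbitrary choice of k.
constexpr std::size_t K = 2;
using Point = std::array<double, K>;

// Squared Euclidean distance; comparing squares avoids a sqrt per pair.
double squaredDistance(const Point& a, const Point& b) {
    double sum = 0.0;
    for (std::size_t d = 0; d < K; ++d) {
        const double diff = a[d] - b[d];
        sum += diff * diff;
    }
    return sum;
}

// Naive point correlation: compare p with all N points, O(N) work per query.
std::size_t countWithinRadiusNaive(const std::vector<Point>& points,
                                   const Point& p, double r) {
    std::size_t count = 0;
    for (const Point& q : points) {
        if (squaredDistance(p, q) < r * r) ++count;
    }
    return count;
}
```

Every query touches every point, which is the cost the kd-tree approach tries to avoid by pruning whole subtrees.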


  7-20. Point correlation
     [Figure sequence: a tree with nodes labeled A through K, added a few at a time and then stepped through one slide at a time. The diagrams are not recoverable from the transcript.]

  21. Point correlation

      KDCell root = /* build kdtree */;
      Set<Point> ps;
      double radius;
      foreach Point p in ps {
        recurse(p, root, radius);
      }
      ...
      void recurse(Point p, KDCell node, double r) {
        if (tooFar(p, node, r)) return;   // prune subtrees that are far from p
        if (node.isLeaf()) {
          // a leaf has no children, so the traversal never recurses past it
          if (dist(node.point, p) < r) p.correlated++;
        } else {
          recurse(p, node.left, r);
          recurse(p, node.right, r);
        }
      }
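
For readers who want to run the traversal, here is a minimal C++ sketch of the slide's pseudocode. Several details are assumptions, since the slide leaves them out: each cell stores an axis-aligned bounding box, `tooFar` tests the distance from p to that box, the tree is built by median splits, and `recurse` returns a count instead of incrementing `p.correlated`. The helper name `buildKdTree` is hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <memory>
#include <vector>

constexpr std::size_t K = 2;  // assumed dimensionality
using Point = std::array<double, K>;

struct KDCell {
    Point lo{}, hi{};  // bounding box of all points in this subtree (assumption)
    Point point{};     // stored point, meaningful only for leaves
    std::unique_ptr<KDCell> left, right;
    bool isLeaf() const { return !left && !right; }
};

// Builds a kd-tree over pts[first, last) by median splits; requires first < last.
std::unique_ptr<KDCell> buildKdTree(std::vector<Point>& pts, std::size_t first,
                                    std::size_t last, std::size_t depth = 0) {
    auto node = std::make_unique<KDCell>();
    node->lo = node->hi = pts[first];
    for (std::size_t i = first; i < last; ++i)
        for (std::size_t d = 0; d < K; ++d) {
            node->lo[d] = std::min(node->lo[d], pts[i][d]);
            node->hi[d] = std::max(node->hi[d], pts[i][d]);
        }
    if (last - first == 1) { node->point = pts[first]; return node; }
    const std::size_t axis = depth % K;
    const std::size_t mid = first + (last - first) / 2;
    std::nth_element(pts.begin() + first, pts.begin() + mid, pts.begin() + last,
                     [axis](const Point& a, const Point& b) { return a[axis] < b[axis]; });
    node->left = buildKdTree(pts, first, mid, depth + 1);
    node->right = buildKdTree(pts, mid, last, depth + 1);
    return node;
}

double squaredDistance(const Point& a, const Point& b) {
    double sum = 0.0;
    for (std::size_t d = 0; d < K; ++d) sum += (a[d] - b[d]) * (a[d] - b[d]);
    return sum;
}

// True when no point inside the cell's box can lie within radius r of p.
bool tooFar(const Point& p, const KDCell& node, double r) {
    Point nearest{};
    for (std::size_t d = 0; d < K; ++d)
        nearest[d] = std::max(node.lo[d], std::min(p[d], node.hi[d]));
    return squaredDistance(p, nearest) >= r * r;
}

// Counts the points within radius r of p, pruning subtrees that are too far.
std::size_t recurse(const Point& p, const KDCell& node, double r) {
    if (tooFar(p, node, r)) return 0;
    if (node.isLeaf()) return squaredDistance(node.point, p) < r * r ? 1 : 0;
    return recurse(p, *node.left, r) + recurse(p, *node.right, r);
}
```

A call such as `recurse(p, *buildKdTree(pts, 0, pts.size()), r)` should agree with a brute-force scan over `pts`; the tree only changes how many nodes are visited.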

  22-26. Basic pattern

      TreeNode root;                     // tree structure
      Set<Point> ps;
      foreach Point p in ps {            // repeated traversal
        recurse(p, root, ...);
      }
      ...
      recurse(Point p, TreeNode node, ...) {
        if (truncate?(p, node, ...)) { ... }
        recurse(p, node.child1, ...);    // recursive traversal
        recurse(p, node.child2, ...);
        ...
      }

      Callout: Lots of parallelism!

  27-28. What’s the problem?
     • GPUs add high overhead for recursion
     • GPUs work best when memory accesses are regular and strided, but irregular algorithms have unpredictable memory accesses
     • Status quo: ad hoc solutions
       • New algorithm? New GPU techniques!
     Callout: Want generally applicable techniques for mapping irregular applications to GPUs

  29. Contributions
     • Two general techniques for mapping tree traversals to GPUs
       • Autoropes: eliminates recursion overhead
       • Lockstepping: promotes memory coalescing
     • Compiler pass to automatically apply techniques to recursive tree-traversal code
     • Significant GPU speedups on 5 tree-traversal algorithms
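
To make "eliminates recursion overhead" concrete, the sketch below replaces a recursive traversal with a loop over an explicit stack of pending nodes. It illustrates the general idea only; this section does not describe how autoropes is implemented, so nothing here should be read as the authors' transformation. The tree is 1-D for brevity, and the names `Node`, `buildTree`, `truncated`, `countRecursive`, and `countWithStack` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Array-based tree over 1-D points (an assumption made to keep the sketch short).
struct Node {
    double lo, hi;    // range of values covered by this subtree
    int left, right;  // child indices into the node array, -1 for a leaf
};

// Builds a balanced tree over sorted xs[first, last); returns the subtree root index.
int buildTree(const std::vector<double>& xs, std::size_t first, std::size_t last,
              std::vector<Node>& nodes) {
    const int self = static_cast<int>(nodes.size());
    nodes.push_back(Node{xs[first], xs[last - 1], -1, -1});
    if (last - first > 1) {
        const std::size_t mid = first + (last - first) / 2;
        const int l = buildTree(xs, first, mid, nodes);
        const int r = buildTree(xs, mid, last, nodes);
        nodes[self].left = l;
        nodes[self].right = r;
    }
    return self;
}

// True when no value in [lo, hi] lies strictly within distance r of p.
bool truncated(const Node& n, double p, double r) {
    return p + r <= n.lo || p - r >= n.hi;
}

// Reference version: the recursive pattern from the earlier slides.
int countRecursive(const std::vector<Node>& nodes, int n, double p, double r) {
    if (truncated(nodes[n], p, r)) return 0;
    if (nodes[n].left < 0) return 1;  // leaf with lo == hi, so it is within r of p
    return countRecursive(nodes, nodes[n].left, p, r) +
           countRecursive(nodes, nodes[n].right, p, r);
}

// Same traversal with an explicit stack instead of function calls.
int countWithStack(const std::vector<Node>& nodes, int root, double p, double r) {
    std::vector<int> pending;  // on a GPU this would likely be a small per-thread array
    pending.push_back(root);
    int count = 0;
    while (!pending.empty()) {
        const int n = pending.back();
        pending.pop_back();
        if (truncated(nodes[n], p, r)) continue;
        if (nodes[n].left < 0) { ++count; continue; }
        pending.push_back(nodes[n].right);  // pushed first, visited second
        pending.push_back(nodes[n].left);   // on top, so visited next, as in the recursion
    }
    return count;
}
```

Pushing the right child before the left keeps the visit order identical to the recursive version, so both functions should return the same count for any query.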

  30. Naïve GPU implementation
     • Warp-based SIMT (single-instruction, multiple-thread) execution
     • 32 points put in a single warp
     • Warp traverses tree
     • All points in warp must execute same instruction
     • If points diverge, some points sit idle while other threads execute

  31-48. Naïve GPU implementation
     [Figure sequence: the tree with nodes labeled A through K, stepped through one slide at a time; several node labels appear twice in the transcript (for example A A, G G, B B). The diagrams are not recoverable from the transcript.]

  49. Lots of accesses to tree
     • Many accesses just moving up the tree in order to later move down again
     • Lots of function stack manipulation
     • Trees are very large, cannot be stored in GPU’s fast memory
     • Want to minimize accesses to tree

  50-52. How to avoid extra accesses to tree?
     • Typical technique: ropes
       • Pointers in each tree node that let a traversal jump to the next part of the tree
       • Effectively linearizes traversal
     [Figure: the tree with nodes labeled A through K; the diagram is not recoverable from the transcript.]
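
One common way to realize such pointers is to give every node a "rope" to the node that follows its whole subtree in preorder, so a traversal needs neither recursion nor a stack. The sketch below assumes that form; the slides do not say how the ropes are installed, and the names `RopeNode`, `buildWithRopes`, and `countWithRopes` are hypothetical. The tree is 1-D for brevity.

```cpp
#include <cstddef>
#include <vector>

// Nodes are stored in preorder. rope is the index of the node to visit once
// this node's whole subtree is done; nodes.size() marks the end of the traversal.
struct RopeNode {
    double lo, hi;  // range of values covered by this subtree
    int left;       // index of the left child, -1 for a leaf
    int rope;       // where to jump when the subtree is skipped or finished
};

// Builds a balanced tree over sorted xs[first, last) in preorder and installs ropes.
int buildWithRopes(const std::vector<double>& xs, std::size_t first, std::size_t last,
                   std::vector<RopeNode>& nodes) {
    const int self = static_cast<int>(nodes.size());
    nodes.push_back(RopeNode{xs[first], xs[last - 1], -1, -1});
    if (last - first > 1) {
        const std::size_t mid = first + (last - first) / 2;
        const int l = buildWithRopes(xs, first, mid, nodes);
        buildWithRopes(xs, mid, last, nodes);
        nodes[self].left = l;
    }
    // Everything appended so far belongs to this subtree, so the next index
    // to be appended is exactly the node that follows the subtree in preorder.
    nodes[self].rope = static_cast<int>(nodes.size());
    return self;
}

// Stackless traversal: either descend to the left child or follow the rope.
int countWithRopes(const std::vector<RopeNode>& nodes, double p, double r) {
    const int end = static_cast<int>(nodes.size());
    int count = 0;
    int n = 0;  // the root is the first node in preorder
    while (n != end) {
        const RopeNode& node = nodes[n];
        const bool truncated = p + r <= node.lo || p - r >= node.hi;
        if (truncated) {
            n = node.rope;      // skip the whole subtree in one jump
        } else if (node.left < 0) {
            ++count;            // leaf with lo == hi, so it is within r of p
            n = node.rope;
        } else {
            n = node.left;      // descend; the right child is reached through a rope
        }
    }
    return count;
}
```

The traversal only ever moves forward through the node array, which is one sense in which ropes linearize it: no node is revisited just to climb back up the tree.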
