SLIDE 1

General Transformations for GPU Execution of Tree Traversals

Michael Goldfarb*, Youngjoon Jo**, Milind Kulkarni School of Electrical and Computer Engineering

* Now at Qualcomm; ** Now at Google

Thursday, November 21, 13

SLIDE 2

GPU execution of irregular programs

• GPUs offer the promise of massive, energy-efficient parallelism
• Much success in mapping regular applications to GPUs
  • Regular memory accesses, predictable computation
• Much less success in mapping irregular applications
  • Pointer-based data structures
  • Unpredictable, input-dependent computation and memory accesses

SLIDE 3

Tree traversal algorithms

• Many irregular algorithms are built around tree traversal
  • Barnes-Hut
  • Nearest-neighbor
  • 2-point correlation
• Numerous papers describe how to map tree traversal algorithms to GPUs

SLIDE 4

Point correlation

• Data mining algorithm
• Goal: given a set of N points in k dimensions and a point p, find all points within a radius r of p
• Naïve approach: compare all N points with p
• Better approach: build a kd-tree over the points, traverse the tree for point p, prune subtrees that are far from p
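The naïve approach can be sketched directly (a minimal illustration; the points and radius below are made-up example data):

```python
import math

def naive_correlation(points, p, r):
    """Naive 2-point correlation: compare every point against p and
    count those within radius r (O(N) work per query point)."""
    return sum(1 for q in points if math.dist(p, q) < r)

# hypothetical example data
points = [(0.0, 0.0), (1.0, 1.0), (3.0, 3.0), (0.5, 0.2)]
print(naive_correlation(points, (0.0, 0.0), 2.0))  # 3 points lie within radius 2
```

The kd-tree approach below avoids most of these comparisons by pruning whole subtrees at once.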


SLIDE 7

Point correlation

[Figure: a kd-tree built over the example points, with internal nodes A, B, C, F, G, H, K and leaves D, E, I, J]

SLIDE 13

Point correlation

[Figure: traversal of the kd-tree for a point p, descending node by node and pruning subtrees that lie farther than r from p]

SLIDE 21

Point correlation

KDCell root = /* build kdtree */;
Set<Point> ps;
double radius;
foreach Point p in ps {
  recurse(p, root, radius);
}
...
void recurse(Point p, KDCell node, double r) {
  if (tooFar(p, node, r)) return;
  if (node.isLeaf() && (dist(node.point, p) < r))
    p.correlated++;
  else {
    recurse(p, node.left, r);
    recurse(p, node.right, r);
  }
}
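A runnable sketch of this traversal, reduced to one dimension for brevity (the KDCell layout and the bounding-interval tooFar test are illustrative assumptions, not the paper's exact structures):

```python
class KDCell:
    """A simplified 1-D stand-in for the slide's KDCell: leaves hold a
    point; every cell records the bounding interval of its subtree."""
    def __init__(self, point=None, left=None, right=None):
        self.point, self.left, self.right = point, left, right
        pts = self._points()
        self.lo, self.hi = min(pts), max(pts)

    def _points(self):
        if self.point is not None:
            return [self.point]
        return self.left._points() + self.right._points()

    def is_leaf(self):
        return self.point is not None

def too_far(p, node, r):
    # distance from p to the cell's bounding interval is at least r -> prune
    return max(node.lo - p, p - node.hi, 0.0) >= r

def recurse(p, node, r):
    if too_far(p, node, r):
        return 0
    if node.is_leaf():
        return 1 if abs(node.point - p) < r else 0
    return recurse(p, node.left, r) + recurse(p, node.right, r)

tree = KDCell(left=KDCell(point=1.0),
              right=KDCell(left=KDCell(point=4.0), right=KDCell(point=9.0)))
print(recurse(2.0, tree, 3.0))  # 2: points 1.0 and 4.0 are within radius 3
```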

SLIDE 22

Basic pattern

TreeNode root;                           // tree structure
Set<Point> ps;
foreach Point p in ps {                  // repeated traversal
  recurse(p, root, ...);
}
...
recurse(Point p, TreeNode node, ...) {   // recursive traversal
  if (truncate?(p, node, ...)) { ... }
  recurse(p, node.child1, ...);
  recurse(p, node.child2, ...);
  ...
}

Lots of parallelism!

SLIDE 27

What’s the problem?

• GPUs add high overhead for recursion
• GPUs work best when memory accesses are regular and strided, but irregular algorithms have unpredictable memory accesses
• Status quo: ad hoc solutions
  • New algorithm? New GPU techniques!

Want generally applicable techniques for mapping irregular applications to GPUs

SLIDE 29

Contributions

• Two general techniques for mapping tree traversals to GPUs
  • Autoropes: eliminates recursion overhead
  • Lockstepping: promotes memory coalescing
• Compiler pass to automatically apply the techniques to recursive tree-traversal code
• Significant GPU speedups on 5 tree-traversal algorithms

SLIDE 30

Naïve GPU implementation

• Warp-based SIMT (single-instruction, multiple-thread) execution
  • 32 points put in a single warp
  • Warp traverses tree
• All points in a warp must execute the same instruction
  • If points diverge, some points sit idle while other threads execute
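A toy cost model illustrates why divergence hurts (a hypothetical simulation, not the paper's measurement): whenever threads in a warp want to visit different nodes at the same step, the paths serialize and the warp pays for each distinct node.

```python
def warp_time(traversals):
    """Toy SIMT cost model: traversals is one node sequence per thread.
    At each step, all distinct nodes wanted by the threads execute one
    after another; threads not wanting the current node sit idle."""
    steps = 0
    for depth in range(max(len(t) for t in traversals)):
        wanted = {t[depth] for t in traversals if depth < len(t)}
        steps += len(wanted)   # divergent paths serialize
    return steps

# two threads agree on A, B, then diverge at the leaves
print(warp_time([["A", "B", "D"], ["A", "B", "E"]]))  # 1 + 1 + 2 = 4 steps
```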

SLIDE 31

Naïve GPU implementation

[Figure: points assigned to a warp all start at the root A; their traversals then split, some descending into B's subtree (C, D, E, F) and others into G's (H, I, J, K)]

SLIDE 35

Naïve GPU implementation

[Figure: step-by-step warp execution; while some threads traverse one subtree, the threads headed for the other subtree sit idle, so the warp serializes both paths]

SLIDE 49

Lots of accesses to tree

• Many accesses just moving up the tree in order to later move down again
• Lots of function stack manipulation
• Trees are very large, cannot be stored in GPU’s fast memory
• Want to minimize accesses to tree

SLIDE 50

How to avoid extra accesses to tree?

• Typical technique: ropes
  • Pointers in each tree node that let a traversal jump to the next part of the tree
  • Effectively linearizes traversal

[Figure: kd-tree with rope pointers linking each node to its successor in the traversal]
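Manually installing ropes amounts to giving every node a pointer to the node the traversal should visit next if the node's own subtree is skipped, i.e. its successor in depth-first order. A sketch for plain binary nodes (the Node class and field names are hypothetical):

```python
class Node:
    def __init__(self, name, left=None, right=None):
        self.name, self.left, self.right = name, left, right
        self.rope = None   # next node in DFS order if this subtree is skipped

def install_ropes(node, successor=None):
    """A child's rope is its sibling; the last child (and the root)
    inherits the parent's successor."""
    if node is None:
        return
    node.rope = successor
    install_ropes(node.left, node.right if node.right else successor)
    install_ropes(node.right, successor)

#     A
#   B   G
root = Node("A", Node("B"), Node("G"))
install_ropes(root)
print(root.left.rope.name)   # G: after (or instead of) B's subtree, visit G
```

A traversal then never walks back up: on pruning a node it follows the node's rope, otherwise it descends to the first child.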

SLIDE 54

How to avoid extra accesses to tree?

• Installing ropes into a tree requires complex, application-specific preprocessing
• Using ropes correctly during execution requires complex, application-specific logic

SLIDE 55

Autoropes

• General technique for “linearizing” tree traversal for arbitrary traversal algorithms
• Achieves generality, simplicity and space-efficiency at the cost of some overhead
• Key insight: recursive tree algorithms are just depth-first traversals of a tree; can transform into an iterative algorithm
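The key insight can be shown generically: a recursive preorder traversal and an iterative loop over an explicit stack visit the same nodes in the same order (a minimal sketch; the tuple-based tree encoding is just for illustration):

```python
def recursive_preorder(node, out):
    if node is None:
        return
    out.append(node[0])                # visit
    recursive_preorder(node[1], out)   # left subtree
    recursive_preorder(node[2], out)   # right subtree

def iterative_preorder(root):
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        if node is None:
            continue
        out.append(node[0])
        stack.append(node[2])   # push right first ...
        stack.append(node[1])   # ... so left is popped (visited) first
    return out

tree = ("A", ("B", None, None), ("G", ("C", None, None), None))
a, b = [], iterative_preorder(tree)
recursive_preorder(tree, a)
print(a == b)  # True: both yield ['A', 'B', 'G', 'C']
```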

SLIDE 56

Autoropes

[Figure: as the traversal visits each node of the kd-tree, ropes to the not-yet-visited children are pushed onto a rope stack; popping the stack continues the traversal without revisiting interior nodes]

SLIDE 63

Autoropes

• Ropes stored on a rope stack instead of in the tree
  • No application-specific code to use ropes
• Ropes instantiated dynamically
  • No preprocessing required
• Same access patterns as with manual ropes
• Extra pushes and pops on the rope stack add some overhead

SLIDE 64

Autoropes

void recurse(Point p, KDCell node, double r) {
  if (tooFar(p, node, r)) return;
  if (node.isLeaf() && (dist(node.point, p) < r))
    p.correlated++;
  else {
    recurse(p, node.left, r);
    recurse(p, node.right, r);
  }
}

A simple compiler transformation converts the recursive code into iterative autoropes code.

SLIDE 65

Autoropes

ropeStack.push(root);
while (!ropeStack.isEmpty()) {
  node = ropeStack.pop();
  if (tooFar(p, node, r)) continue;
  if (node.isLeaf() && (dist(node.point, p) < r))
    p.correlated++;
  else {
    ropeStack.push(node.right);
    ropeStack.push(node.left);
  }
}

See paper for details of how to transform more complex code.
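In the same simplified one-dimensional setting used earlier, the transformed loop can be sketched as follows (the tuple node layout and bounding-interval test are assumptions, not the paper's generated code):

```python
def too_far(p, lo, hi, r):
    # distance from p to the cell's bounding interval [lo, hi]
    return max(lo - p, p - hi, 0.0) >= r

def correlate(p, root, r):
    """Iterative autoropes-style traversal. Nodes are tuples
    (lo, hi, point, left, right); point is None for interior nodes."""
    count, rope_stack = 0, [root]
    while rope_stack:
        lo, hi, point, left, right = rope_stack.pop()
        if too_far(p, lo, hi, r):
            continue                  # formerly: return
        if point is not None:
            count += abs(point - p) < r
        else:
            rope_stack.append(right)  # push right first ...
            rope_stack.append(left)   # ... so left is visited first
    return count

leaf = lambda x: (x, x, x, None, None)
tree = (1.0, 9.0, None, leaf(1.0), (4.0, 9.0, None, leaf(4.0), leaf(9.0)))
print(correlate(2.0, tree, 3.0))  # 2: points 1.0 and 4.0 are within radius 3
```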

SLIDE 66

Unintended consequence

• Recursive calls naturally lead to thread divergence
  • If some threads make recursive calls, other threads wait until the calls return
• Does not happen for iterative code
  • All threads reconverge at the beginning of the loop

ropeStack.push(root);
while (!ropeStack.isEmpty()) {
  node = ropeStack.pop();
  if (tooFar(p, node, r)) continue;  // formerly return
  if (...)
    p.correlated++;
  else {
    ropeStack.push(node.right);      // formerly recursion
    ropeStack.push(node.left);
  }
}

SLIDE 68

Autoropes on GPU

[Figure: warp execution with autoropes; every thread runs the same loop body in each iteration, but each thread's rope stack leads it to a different node]

Threads no longer diverge in execution, but they do diverge in the tree!

SLIDE 76

Thread divergence vs. memory coalescing

• Memory accesses on GPU are only well behaved if accesses by all threads in a warp can be coalesced
  • Same memory or strided access
• Bad memory behavior of autoropes outweighs lack of thread divergence
• Goal: benefits of autoropes while maintaining memory coalescing

SLIDE 77

Lockstepping

• Essentially, force the GPU to keep threads from diverging in the tree
• If any thread in a warp wants to visit a node’s children, all threads in the warp visit the children
• Threads that are “dragged along” are programmatically masked out
• Warp execution takes longer (proportional to the union of the threads’ traversals, rather than the longest traversal), but improved memory performance makes up for it
• Automatically implemented during the autoropes compiler pass
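Lockstepping can be simulated in a few lines (a hypothetical illustration of the idea, not GPU code): the warp follows one shared traversal; at every node each thread evaluates its own truncation test, the warp descends if any thread is still active, and inactive threads are masked out.

```python
def lockstep_correlate(warp_points, root, r):
    """warp_points: the points assigned to one warp. Nodes are tuples
    (lo, hi, point, left, right); point is None for interior nodes."""
    counts = [0] * len(warp_points)
    stack = [root]                 # one shared stack for the whole warp
    while stack:
        lo, hi, point, left, right = stack.pop()
        # per-thread truncation test; pruned threads are masked out
        active = [max(lo - p, p - hi, 0.0) < r for p in warp_points]
        if not any(active):
            continue               # the whole warp prunes this subtree
        if point is not None:
            for i, p in enumerate(warp_points):
                if active[i] and abs(point - p) < r:
                    counts[i] += 1
        else:
            stack.append(right)    # every thread is dragged along
            stack.append(left)
    return counts

leaf = lambda x: (x, x, x, None, None)
tree = (1.0, 9.0, None, leaf(1.0), (4.0, 9.0, None, leaf(4.0), leaf(9.0)))
print(lockstep_correlate([2.0, 8.5], tree, 3.0))  # [2, 1]
```

Since all threads read the same node each iteration, the tree accesses coalesce, at the cost of visiting the union of the threads' traversals.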

SLIDE 78

Dynamic lockstepping

• Some algorithms allow different traversal orders
  • Some points visit left child before right, and others visit right before left
• Optimization reduces traversal size
• Inherently bad memory access patterns

SLIDE 79

Dynamic lockstepping

[Figure: points in a warp prefer different traversal orders through the kd-tree; following each point's own order would break coalescing]

SLIDE 84

Dynamic lockstepping

• Dynamic lockstepping allows all points in a warp to “vote” on which traversal order to use
• Maintains memory coalescing
• Some points do more work than in the original algorithm
  • Tradeoff can still be worth it!
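The vote can be sketched as a majority poll among the warp's threads over which child to visit first (a hypothetical illustration of the idea; names and tie-breaking are made up):

```python
def vote_order(preferences):
    """Each thread prefers 'left' or 'right' first (e.g. the child closer
    to its point); the warp follows the majority so every thread traverses
    in the same order and memory accesses stay coalesced. Ties go left."""
    left_votes = sum(1 for pref in preferences if pref == "left")
    return "left" if left_votes * 2 >= len(preferences) else "right"

print(vote_order(["left", "right", "left"]))   # left
print(vote_order(["right", "right", "left"]))  # right
```

Threads outvoted here do extra work relative to their own best order, which is the tradeoff the slide describes.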

SLIDE 85

Engineering details

• Transform point data from array-of-structures format to structure-of-arrays
  • Use analysis from [PACT 2013] to prove safety and transform automatically
• Copy tree data to GPU in linearized fashion
• Lay out fields of tree and point according to use (more commonly accessed fields placed in shared memory)
• Interleave rope stacks for points in a warp to allow strided access
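The array-of-structures to structure-of-arrays transformation can be sketched as follows (the field names are hypothetical):

```python
def aos_to_soa(points):
    """points: list of dicts (array of structures). Returns one array per
    field (structure of arrays), so thread i reading field f touches
    soa[f][i] -- consecutive threads hit consecutive addresses."""
    soa = {}
    for field in points[0]:
        soa[field] = [pt[field] for pt in points]
    return soa

aos = [{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0}]
print(aos_to_soa(aos))  # {'x': [1.0, 3.0], 'y': [2.0, 4.0]}
```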

SLIDE 86

Results

• Two platforms
  • GPU platform: NVIDIA Tesla C2070 (6 GB global memory, 14 SMs)
  • CPU platform: 32-core, 2.3 GHz Opteron
• Five benchmarks
  • Barnes-Hut, point correlation, nearest neighbor, k-nearest neighbor, vantage point trees
• Multiple inputs per benchmark
  • Used sorted and unsorted points

SLIDE 87

High-level takeaways

• Autoropes + lockstep always faster than the simple recursive GPU implementation (up to 14x faster)
• For most benchmarks/inputs, the best GPU implementation is faster than the CPU implementation up to 16 threads
• Speedups comparable to hand-written implementations

SLIDE 88

Barnes-Hut

[Results chart]

SLIDE 89

Point correlation

[Results chart]

SLIDE 90

Nearest-neighbor

[Results chart]

SLIDE 91

Conclusions

• Mapping irregular applications to GPUs is very difficult
• Developed two general techniques, autoropes and lockstepping, that achieve significant speedup on GPU
  • vs. baseline GPU code and CPU implementations
• Automatic approaches competitive with previous hand-written implementations

SLIDE 92

General Transformations for GPU Execution of Tree Traversals

Michael Goldfarb*, Youngjoon Jo**, Milind Kulkarni School of Electrical and Computer Engineering

* Now at Qualcomm; ** Now at Google
