
Efficient GPU-only Tree Walks in ChaNGa Jianqiao Liu, Milind Kulkarni Purdue University

gpus!
• GPUs are an important component of modern supercomputers and are increasingly important for reaching peak performance
• Blue Waters (2007) had 1 GPU (K20) for every 16 CPU cores
• Summit (2018) has 1 GPU (Volta) for every 7 CPU cores
• ChaNGa, unsurprisingly, leverages GPUs for maximum performance
• But can we do better?

barnes-hut refresher
• Accelerate n-body codes by subdividing space into an octree
• Compute forces on the red bodies by traversing the tree
• Approximate the contribution from the purple bodies by using summary information at the blue node
[Figure: octree illustration with red bodies, purple bodies, and a blue node. Source: https://en.wikipedia.org/wiki/Octree]
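The refresher above can be made concrete with a minimal sketch in Python. It is not ChaNGa's implementation: the `Node` class, the softening length `eps`, G = 1, and the opening test (cell width divided by distance compared against `theta`) are all assumptions chosen for illustration, and bodies are assumed to have distinct positions.

```python
import math
import random

class Node:
    """Cubic octree cell with the summary data used for far-field approximation."""
    def __init__(self, center, half):
        self.center, self.half = center, half
        self.mass, self.com = 0.0, (0.0, 0.0, 0.0)
        self.body = None       # (position, mass) if this leaf holds one body
        self.children = None   # eight sub-cells once the cell is subdivided

def child_for(node, pos):
    """Return the sub-cell containing pos, subdividing the cell on first use."""
    if node.children is None:
        h = node.half / 2
        node.children = [
            Node(tuple(c + (h if (k >> i) & 1 else -h)
                       for i, c in enumerate(node.center)), h)
            for k in range(8)]
    k = sum(1 << i for i in range(3) if pos[i] >= node.center[i])
    return node.children[k]

def insert(node, pos, mass):
    """Insert one body and update mass / center of mass along the path."""
    if node.body is not None:                  # occupied leaf: push old body down
        old, node.body = node.body, None
        insert(child_for(node, old[0]), *old)
    if node.children is not None:
        insert(child_for(node, pos), pos, mass)
    else:
        node.body = (pos, mass)
    total = node.mass + mass
    node.com = tuple((c * node.mass + p * mass) / total
                     for c, p in zip(node.com, pos))
    node.mass = total

def accel(node, pos, theta, eps=1e-3):
    """Softened acceleration at pos (G = 1) from all bodies inside node."""
    if node.mass == 0.0 or (node.body is not None and node.body[0] == pos):
        return (0.0, 0.0, 0.0)                 # empty cell, or the body itself
    src = node.body[0] if node.body is not None else node.com
    d = [s - p for s, p in zip(src, pos)]
    r2 = sum(x * x for x in d)
    if node.children is not None and 2 * node.half >= theta * math.sqrt(r2):
        parts = [accel(ch, pos, theta, eps) for ch in node.children]
        return tuple(sum(axis) for axis in zip(*parts))   # cell too close: open it
    w = node.mass / (r2 + eps * eps) ** 1.5    # single body or accepted far cell
    return tuple(w * x for x in d)

def direct(bodies, i, eps=1e-3):
    """Reference O(n) direct sum for body i."""
    out = [0.0, 0.0, 0.0]
    for j, (p, m) in enumerate(bodies):
        if j != i:
            d = [a - b for a, b in zip(p, bodies[i][0])]
            w = m / (sum(x * x for x in d) + eps * eps) ** 1.5
            for k in range(3):
                out[k] += w * d[k]
    return tuple(out)

random.seed(1)
bodies = [(tuple(random.random() for _ in range(3)), 1.0) for _ in range(200)]
root = Node((0.5, 0.5, 0.5), 0.5)
for pos, mass in bodies:
    insert(root, pos, mass)
exact = direct(bodies, 0)
approx = accel(root, bodies[0][0], theta=0.6)
print("direct:", exact)
print("tree  :", approx)
```

With `theta=0.0` every cell is opened and the walk reduces to the direct sum; larger values trade accuracy for fewer interactions.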

dual-tree approach
• Classical Barnes-Hut is a single-tree approach: for each leaf node, traverse the tree
  → O(n log n) force computation, O(n log n) traversals
• Can also adopt a dual-tree approach: for each interior node, traverse the tree
  → O(n log n) force computation, O(n) traversals
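The structural difference between the two approaches can be sketched on a toy 1D tree. This is only an illustration, not ChaNGa's algorithm: the `Cell` class, the `BUCKET` size, and the `far` acceptance test are hypothetical, and the code does not attempt to reproduce the asymptotic counts quoted above. It shows that the single-tree version starts one walk per leaf, while the dual-tree version runs one simultaneous walk over pairs of cells and can record cell-cell interactions.

```python
import random

BUCKET = 4  # hypothetical bucket (leaf) size for this sketch

class Cell:
    """Node of a binary tree over sorted 1D positions."""
    def __init__(self, pts):
        self.pts, self.lo, self.hi = pts, pts[0], pts[-1]
        self.kids = []
        if len(pts) > BUCKET:
            mid = len(pts) // 2
            self.kids = [Cell(pts[:mid]), Cell(pts[mid:])]

def leaves(cell):
    """All leaf cells below cell, left to right."""
    return [cell] if not cell.kids else [l for k in cell.kids for l in leaves(k)]

def far(t, s, theta):
    """Hypothetical acceptance test: both cells are small relative to their gap."""
    gap = max(s.lo - t.hi, t.lo - s.hi)
    return gap > 0 and max(t.hi - t.lo, s.hi - s.lo) < theta * gap

def single_walk(leaf, s, theta, out):
    """Walk the source tree for one target leaf."""
    if far(leaf, s, theta) or not s.kids:
        out.append((leaf, s))          # far cell or near leaf: record interaction
    else:
        for k in s.kids:
            single_walk(leaf, k, theta, out)

def dual_walk(t, s, theta, out):
    """Walk target and source trees together, splitting the larger cell."""
    if far(t, s, theta) or not (t.kids or s.kids):
        out.append((t, s))             # cell-cell interaction, or leaf-leaf pair
    elif t.kids and (not s.kids or t.hi - t.lo >= s.hi - s.lo):
        for k in t.kids:
            dual_walk(k, s, theta, out)
    else:
        for k in s.kids:
            dual_walk(t, k, theta, out)

random.seed(2)
root = Cell(sorted(random.random() for _ in range(512)))

single = []
for leaf in leaves(root):              # one independent walk per leaf
    single_walk(leaf, root, 0.6, single)

dual = []
dual_walk(root, root, 0.6, dual)       # one walk over pairs of cells

print(len(leaves(root)), "single-tree walks,", len(single), "interactions")
print("1 dual-tree walk,", len(dual), "interactions")
```

Each recorded pair stands for all body-body pairs between the two cells, so both lists cover every target/source pair exactly once. The independent per-leaf walks are what a GPU can run in parallel.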

moving to gpus
• Key challenge for Barnes-Hut (and other tree traversals): significant irregularity, so it does not map well to GPUs
• Existing approach in ChaNGa: the CPU computes interaction lists and sends them to the GPU for computation
• Goal: put the whole computation on the GPU
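The interaction-list split can be sketched as two phases. The following Python is a toy 1D model under stated assumptions, not ChaNGa's code: `build_interaction_list` stands in for the irregular CPU tree walk, `evaluate` stands in for the regular GPU kernel, and the `Cell` class, acceptance test, and softening `eps` are hypothetical.

```python
import random

class Cell:
    """Node of a binary tree over sorted 1D (position, mass) bodies."""
    def __init__(self, bodies, bucket_size=4):
        self.bodies = bodies
        self.lo, self.hi = bodies[0][0], bodies[-1][0]
        self.mass = sum(m for _, m in bodies)
        self.com = sum(x * m for x, m in bodies) / self.mass
        self.kids = []
        if len(bodies) > bucket_size:
            mid = len(bodies) // 2
            self.kids = [Cell(bodies[:mid], bucket_size),
                         Cell(bodies[mid:], bucket_size)]

def leaves(cell):
    """All buckets (leaf cells) below cell, left to right."""
    return [cell] if not cell.kids else [l for k in cell.kids for l in leaves(k)]

def build_interaction_list(bucket, cell, theta, out):
    """Phase 1 (irregular tree walk): flatten the sources for one bucket."""
    gap = max(cell.lo - bucket.hi, bucket.lo - cell.hi)
    if gap > 0 and cell.hi - cell.lo < theta * gap:
        out.append((cell.com, cell.mass))    # far cell becomes one pseudo-body
    elif not cell.kids:
        out.extend(cell.bodies)              # near leaf contributes its bodies
    else:
        for k in cell.kids:
            build_interaction_list(bucket, k, theta, out)
    return out

def evaluate(bucket, sources, eps=1e-3):
    """Phase 2 (regular loops): softened acceleration for each body in the bucket."""
    result = []
    for x, _ in bucket.bodies:
        a = 0.0
        for sx, sm in sources:
            d = sx - x
            a += sm * d / (d * d + eps * eps) ** 1.5   # zero when d == 0
        result.append(a)
    return result

random.seed(4)
bodies = sorted((random.random(), 1.0) for _ in range(256))
root = Cell(bodies)
buckets = leaves(root)
lists = [build_interaction_list(b, root, 0.6, []) for b in buckets]
acc = [a for b, src in zip(buckets, lists) for a in evaluate(b, src)]
print("buckets:", len(buckets), "longest list:", max(len(s) for s in lists))
```

In this model the lists are the data that would have to cross from CPU to GPU for every bucket, which is the traffic the GPU-only walk aims to avoid.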

return to single tree
• Putting dual-tree computation on GPUs is challenging
• Asymptotic complexity wins come from sacrificing parallelism during traversal to do cell-cell interactions, but GPUs need parallelism to keep them busy
• Instead, return to single-tree computation for local tree walks
• Adopt many existing effective implementation tricks [Burtscher and Pingali; Goldfarb et al.; Liu et al.]
• Tweak the open criterion (traversal conditions) to work better for single-tree traversals
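One common pattern for GPU-friendly tree walks is to store the tree in flat arrays and replace recursion with an explicit stack, with one open/accept decision per bucket so that all bodies in a bucket follow the same path. The Python below is a minimal sketch of that pattern on a 1D tree. It is an assumption-laden illustration, not the implementation from the cited works or from ChaNGa: the array names, the bucket size of 8, and the acceptance test are hypothetical, and the slides do not specify the actual open criterion.

```python
import random

FIELDS = ("lo", "hi", "mass", "com", "start", "end", "left", "right")

def build(bodies, start, end, bucket_size, t):
    """Append the node covering bodies[start:end] to the arrays in t; return its index."""
    i = len(t["mass"])
    span = bodies[start:end]
    mass = sum(m for _, m in span)
    t["lo"].append(span[0][0])
    t["hi"].append(span[-1][0])
    t["mass"].append(mass)
    t["com"].append(sum(x * m for x, m in span) / mass)
    t["start"].append(start)
    t["end"].append(end)
    t["left"].append(-1)                     # -1 marks a leaf (bucket)
    t["right"].append(-1)
    if end - start > bucket_size:
        mid = (start + end) // 2
        t["left"][i] = build(bodies, start, mid, bucket_size, t)
        t["right"][i] = build(bodies, mid, end, bucket_size, t)
    return i

def walk_bucket(t, bodies, b, theta, eps=1e-3):
    """Explicit-stack walk for target bucket b; returns one acceleration per body."""
    targets = bodies[t["start"][b]:t["end"][b]]
    acc = [0.0] * len(targets)
    stack = [0]                              # start at the root node
    while stack:
        n = stack.pop()
        gap = max(t["lo"][n] - t["hi"][b], t["lo"][b] - t["hi"][n])
        if gap > 0 and t["hi"][n] - t["lo"][n] < theta * gap:
            sources = [(t["com"][n], t["mass"][n])]        # accept far cell
        elif t["left"][n] < 0:
            sources = bodies[t["start"][n]:t["end"][n]]    # near leaf: direct
        else:
            stack.append(t["right"][n])                    # open the cell
            stack.append(t["left"][n])
            continue
        for k, (x, _) in enumerate(targets):
            for sx, sm in sources:
                d = sx - x
                acc[k] += sm * d / (d * d + eps * eps) ** 1.5
    return acc

def direct(bodies, i, eps=1e-3):
    """Reference direct sum for body i (the self term is zero)."""
    x = bodies[i][0]
    return sum(m * (sx - x) / ((sx - x) ** 2 + eps * eps) ** 1.5 for sx, m in bodies)

random.seed(3)
bodies = sorted((random.random(), 1.0) for _ in range(256))
tree = {k: [] for k in FIELDS}
build(bodies, 0, len(bodies), 8, tree)
buckets = [n for n in range(len(tree["mass"])) if tree["left"][n] < 0]
acc = [a for b in buckets for a in walk_bucket(tree, bodies, b, theta=0.6)]
print("buckets:", len(buckets), "bodies:", len(acc))
```

On a GPU, each call to `walk_bucket` would correspond to independent work for one bucket; here the calls simply run in sequence.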

full single-tree walk on gpu
[Figure: two CPU/GPU execution timelines. In the first, the CPU constructs interaction lists four times and then does remote work. In the second, the CPU does only remote work. GPU legend: data transfer, remote compute, initialization, local compute.]
✓ Less CPU/GPU communication
✓ No latency while waiting for the CPU to compute interaction lists
✓ Frees up the CPU to do other computations (e.g., remote tree walks)
✘ Loses asymptotic complexity (back to O(n log n) traversals), but OK for local tree walks

results
P100 speed test (runtimes in seconds). "Orig." is original ChaNGa, "New" is new ChaNGa, and 32/64 is bucket_size. Each speedup compares new ChaNGa with original ChaNGa at the same bucket_size.

| Configuration | Dataset | Orig. 32 runtime (s) | Orig. 64 runtime (s) | New 32 runtime (s) | New 32 speedup | New 64 runtime (s) | New 64 speedup | Average speedup |
|---|---|---|---|---|---|---|---|---|
| 1 node, 1 process per node | lambs, 3M, theta=0.6 | 9.58 | 5.10 | 1.06 | 9.01x | 0.85 | 6.01x | 8.25x |
| | lambb, 80M, theta=0.6 | 359.67 | 189.29 | 31.85 | 11.29x | 26.01 | 7.28x | |
| | dwf1, 5M, theta=0.7 | 16.89 | 9.16 | 1.71 | 9.86x | 1.40 | 6.54x | |
| | dwf1.6144, 50M, theta=0.7 | 194.84 | 103.93 | 19.69 | 9.90x | 16.95 | 6.13x | |
| 1 node, 4 processes per node | lambs, 3M, theta=0.6 | 3.08 | 1.66 | 1.22 | 2.53x | 0.89 | 1.88x | 2.13x |
| | lambb, 80M, theta=0.6 | 101.22 | 54.38 | 29.55 | 3.43x | 23.18 | 2.35x | |
| | dwf1, 5M, theta=0.7 | 6.26 | 3.42 | 3.15 | 1.99x | 1.95 | 1.76x | |
| | dwf1.6144, 50M, theta=0.7 | 67.52 | 37.07 | 40.73 | 1.66x | 25.20 | 1.47x | |
| 1 node, 8 processes per node | lambs, 3M, theta=0.6 | 1.89 | 1.07 | 1.05 | 1.80x | 0.77 | 1.38x | 1.55x |
| | lambb, 80M, theta=0.6 | 55.16 | 30.94 | 24.07 | 2.29x | 19.83 | 1.56x | |
| | dwf1, 5M, theta=0.7 | 3.49 | 1.90 | 2.40 | 1.45x | 1.55 | 1.22x | |
| | dwf1.6144, 50M, theta=0.7 | 38.40 | 20.71 | 26.75 | 1.44x | 16.32 | 1.27x | |
| 8 nodes, 1 process per node | lambs, 3M, theta=0.6 | 1.92 | 1.04 | 1.07 | 1.80x | 0.78 | 1.33x | 1.80x |
| | lambb, 80M, theta=0.6 | 49.49 | 27.47 | 15.41 | 3.21x | 10.41 | 2.64x | |
| | dwf1, 5M, theta=0.7 | 3.51 | 1.90 | 2.37 | 1.48x | 1.55 | 1.22x | |
| | dwf1.6144, 50M, theta=0.7 | 39.10 | 20.67 | 27.36 | 1.43x | 16.56 | 1.25x | |
| 8 nodes, 4 processes per node | lambs, 3M, theta=0.6 | 1.50 | 0.88 | 0.90 | 1.67x | 0.67 | 1.31x | 1.53x |
| | lambb, 80M, theta=0.6 | 41.11 | 22.13 | 16.94 | 2.43x | 13.36 | 1.66x | |
| | dwf1, 5M, theta=0.7 | 2.27 | 1.37 | 1.68 | 1.35x | 1.20 | 1.14x | |
| | dwf1.6144, 50M, theta=0.7 | 22.93 | 12.46 | 14.92 | 1.54x | 10.49 | 1.19x | |
| 8 nodes, 8 processes per node | lambs, 3M, theta=0.6 | 0.80 | 0.57 | 0.57 | 1.39x | 0.45 | 1.27x | 1.40x |
| | lambb, 80M, theta=0.6 | 21.55 | 11.70 | 10.15 | 2.12x | 7.58 | 1.54x | |
| | dwf1, 5M, theta=0.7 | 1.28 | 0.82 | 1.05 | 1.22x | 0.74 | 1.10x | |
| | dwf1.6144, 50M, theta=0.7 | 11.80 | 6.50 | 8.66 | 1.36x | 5.43 | 1.20x | |
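As a quick consistency check on the reported figures, the speedups for the "1 node, 1 process per node" configuration can be recomputed from the runtimes. The slides do not say how the average is formed; the sketch below assumes it is the arithmetic mean of the eight per-dataset speedups, each taken at matching bucket_size. Because the published runtimes are rounded, recomputed values can differ from the reported ones in the last digit.

```python
# Runtimes in seconds from the "1 node, 1 process per node" rows:
# (original bs=32, original bs=64, new bs=32, new bs=64)
runs = {
    "lambs": (9.58, 5.10, 1.06, 0.85),
    "lambb": (359.67, 189.29, 31.85, 26.01),
    "dwf1": (16.89, 9.16, 1.71, 1.40),
    "dwf1.6144": (194.84, 103.93, 19.69, 16.95),
}

speedups = []
for name, (orig32, orig64, new32, new64) in runs.items():
    s32, s64 = orig32 / new32, orig64 / new64   # same bucket_size on both sides
    speedups += [s32, s64]
    print(f"{name}: {s32:.2f}x (bucket_size 32), {s64:.2f}x (bucket_size 64)")

# Assumed definition: plain mean over all eight speedups in the configuration.
average = sum(speedups) / len(speedups)
print(f"average: {average:.2f}x (reported: 8.25x)")
```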

summary
• GPUs are ill-suited for dual-tree walks, so ChaNGa didn’t use the GPU for tree walks
• Switch the local tree walk to a classical single-tree walk and put it on the GPU
• Lose in asymptotic complexity, but massive win in parallelism
• Work is in the ChaNGa main branch as of August 2018
