

SLIDE 1

Dynamic Data Structures for the GPU

John Owens Child Family Professor of Engineering & Entrepreneurship Department of Electrical & Computer Engineering UC Davis Joint work with Martin Farach-Colton

SLIDE 2

CUDA Programming Model (SPMD + SIMD)

  • Flow: Copy data to “device” (GPU); run kernels; copy results back
  • A kernel is executed as a grid of thread blocks
  • One thread block maps to one GPU “core” (SM)
  • A thread block is a batch of threads that can cooperate with each other by:
    • Efficiently sharing data through shared memory
    • Synchronizing their execution
  • Two threads from two different blocks cannot cooperate
  • Blocks are independent

[Figure: the host launches Kernel 1 and Kernel 2 on the device; each kernel runs as a grid of thread blocks, and each block is an array of threads]
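A minimal sketch of this model in CUDA (an illustrative kernel, not from the talk): each block stages its tile of the input in shared memory, synchronizes, and reverses it; blocks never communicate with one another.

```cuda
#include <cstdio>

// Each block reverses its own 256-element tile using shared memory.
__global__ void reverseTiles(const int* in, int* out, int n) {
    __shared__ int tile[256];                 // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();                          // cooperate within the block
    int j = blockIdx.x * blockDim.x + (blockDim.x - 1 - threadIdx.x);
    if (j < n) out[j] = tile[threadIdx.x];
}

int main() {
    const int n = 1024;
    int *in, *out;
    cudaMallocManaged(&in, n * sizeof(int));  // managed memory stands in for
    cudaMallocManaged(&out, n * sizeof(int)); // explicit host<->device copies
    for (int i = 0; i < n; ++i) in[i] = i;
    reverseTiles<<<n / 256, 256>>>(in, out, n);  // grid of thread blocks
    cudaDeviceSynchronize();
    printf("out[0] = %d\n", out[0]);          // 255: tile 0 reversed
    cudaFree(in); cudaFree(out);
}
```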

SLIDE 3

Computation/Memory Hierarchy

Level      | Computation                                  | Memory
Global     | Kernels                                      | DRAM (12 GB)
Per-block  | Blocks (MIMD within a kernel) (~15)          | L2 cache (1.57 MB)
Per-warp   | Warps (MIMD within a block)                  | Shared/L1 cache (48 kB/SM × 15 SMs = 720 kB)
Per-thread | Threads (32-wide SIMD within a warp) (≥30k)  | Registers (64k/SM × 4 B/register = 262 kB/SM × 15 SMs = 3.93 MB)
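These figures are for one particular GPU; the same quantities can be queried at runtime with the standard CUDA runtime API (a small sketch):

```cuda
#include <cstdio>

int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);           // properties of device 0
    printf("DRAM:      %zu MB\n", p.totalGlobalMem >> 20);
    printf("SMs:       %d\n",     p.multiProcessorCount);
    printf("L2:        %d kB\n",  p.l2CacheSize >> 10);
    printf("Shared/SM: %zu kB\n", p.sharedMemPerMultiprocessor >> 10);
    printf("Regs/SM:   %d (x 4 B each)\n", p.regsPerMultiprocessor);
}
```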

SLIDE 4

Memory: What does/doesn’t matter

  • Matters:
    • Use the fastest level of the memory hierarchy (e.g., thread coarsening)
    • Coalesced memory accesses (threads in a warp should access neighboring locations in memory)
  • Doesn’t matter (to date):
    • Cache & cache-obliviousness: 1.57 MB L2 / 30k threads = 51 B/thread

[Figure: multisplit over 2–6 buckets — initial key distribution, warp-level reordering, block-level reordering, multisplit result]
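A toy illustration of the coalescing rule (kernel names are mine): in the first kernel a warp reads 32 adjacent words in a handful of transactions, while the strided version spreads the same warp across 32 separate memory segments for the same amount of useful data.

```cuda
// Coalesced: thread i reads a[i]; a warp covers one contiguous span.
__global__ void copyCoalesced(const float* a, float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] = a[i];
}

// Strided: consecutive threads are `stride` floats apart, so each warp
// touches many separate segments instead of one.
__global__ void copyStrided(const float* a, float* b, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) b[i] = a[i];
}
```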

SLIDE 5

NVIDIA OptiX & the BVH

Tero Karras. Maximizing parallelism in the construction of BVHs, octrees, and k-d trees. In High-Performance Graphics, HPG ’12, pages 33–37, June 2012.

SLIDE 6

The problem

  • Many data structures are built on the CPU and used on the GPU
  • Very few data structures can be built on the GPU:
    • Sorted array
    • (Cuckoo) hash table
    • Several application-specific DS (e.g., BVH tree)
  • No data structures can be updated on the GPU
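As an aside on why (cuckoo) hash tables are GPU-buildable: a lookup is at most two independent probes, as in this minimal sketch (the hash constants are illustrative).

```cuda
// Each key has exactly two candidate slots, so a lookup is at most two
// independent global reads (hash mixing constants here are illustrative).
__device__ bool cuckooContains(const int* table, unsigned capacity, int key) {
    unsigned h1 = (unsigned)key * 2654435761u % capacity;
    unsigned h2 = ((unsigned)key ^ 0x9e3779b9u) * 40503u % capacity;
    return table[h1] == key || table[h2] == key;
}
```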
SLIDE 7

Scale of updates

  • Update 1–few items
    • Fall back to the serial case: slow, probably don’t care
  • Update a very large number of items
    • Rebuild the whole data structure from scratch
  • Middle ground: our goal
  • Question: When do you do this in practice?
SLIDE 8

Approach

  • Pick data structures useful in the serial case, and try to find parallelizations?
  • Pick what look like parallel-friendly data structures with parallel-friendly updates?

SLIDE 9

If you think of other/interesting data structure candidates, I’m all ears! If you think “But surely he’s already considered X and rejected it”, you’re probably wrong!

SLIDE 10

Cache-oblivious lookup array

  • Supports dictionary and range queries
  • log n sorted levels, each level 2x the size of the last
  • Insert into a filled level results in a merge, possibly cascaded (sketched below); operations are coarse (threads cooperate)

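A host-side sketch of the cascaded merge, assuming sorted integer keys and Thrust for the merges (the names here are illustrative, not the paper's implementation):

```cuda
#include <thrust/device_vector.h>
#include <thrust/merge.h>
#include <thrust/sort.h>
#include <utility>
#include <vector>

struct Cola {
    // levels[i] is either empty or holds exactly (batch << i) sorted keys
    std::vector<thrust::device_vector<int>> levels;
    size_t batch;  // minimum insert granularity (level-0 capacity)

    explicit Cola(size_t batchSize) : batch(batchSize) {}

    void insert(thrust::device_vector<int> keys) {  // |keys| == batch
        thrust::sort(keys.begin(), keys.end());
        for (size_t i = 0; ; ++i) {
            if (i == levels.size()) levels.emplace_back();
            if (levels[i].empty()) {            // free slot: stop cascading
                levels[i] = std::move(keys);
                return;
            }
            // occupied: merge into a run twice as long, carry to level i+1
            thrust::device_vector<int> merged(keys.size() * 2);
            thrust::merge(levels[i].begin(), levels[i].end(),
                          keys.begin(), keys.end(), merged.begin());
            levels[i].clear();
            keys = std::move(merged);
        }
    }
};
```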

SLIDE 11

COLA results/questions

  • Insertions/lookups for point queries:
    • 600M/52M for COLA
    • 140M/326M for hash table
  • Deletes using tombstones
  • Semantics for parallel insert/delete operations?
  • Minimum batch size?
  • Atom size for searching?
  • Fractional cascading?

Saman Ashkiani, Shengren Li, Martin Farach-Colton, Nina Amenta, and John D. Owens. GPU COLA: A dynamic dictionary data structure for the GPU. February 2016. Unpublished.

SLIDE 12

Hash-array mapped trie (HAMT)

  • Hash maps in Clojure
  • S-nodes (key-value pairs)
  • C-nodes (branching nodes)
  • Operations are fine (threads operate independently)
  • Has a concurrent (CPU) implementation
  • Requires fine-grained memory allocation
    • Custom memory allocators?

[Diagram: HAMT — a root C-node (bitmap 0110) with subtrie and key slots; a child C-node (bitmap 0101) and S-nodes holding keys 0010, 0001, and 1001]
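A sketch of the fine-grained C-node lookup, assuming 32-way branching (5 hash bits per level); the field layout and names are illustrative:

```cuda
// A C-node's bitmap records which of the 32 branches exist; the children
// are stored compactly, so a child's slot is the popcount of the set bits
// below its branch index.
__device__ int childSlot(unsigned bitmap, unsigned hash, int level) {
    unsigned idx = (hash >> (5 * level)) & 31;  // 5 hash bits pick a branch
    if (!(bitmap & (1u << idx))) return -1;     // branch absent
    return __popc(bitmap & ((1u << idx) - 1));  // rank = compact slot index
}
```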

SLIDE 13

Relaxed Radix Balanced (RRB) Trees

  • Clojure and Scala’s Vector
  • ~Relaxed unsorted B-tree
  • Index/update/iterations cheap
  • concat/insert-at/split in O(log n)
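Why indexing stays cheap: balanced nodes are addressed by pure radix arithmetic, and relaxed nodes fall back to a small per-node cumulative-size table. A host-side sketch of one descent step, with assumed field names:

```cuda
// One step of RRB-tree indexing, assuming 32-way nodes.
struct RrbNode {
    RrbNode* child[32];
    int      sizes[32];   // cumulative element counts; valid only if relaxed
    bool     relaxed;
};

RrbNode* descend(const RrbNode* n, int& i, int shift) {
    int slot = (i >> shift) & 31;            // exact for balanced nodes
    if (n->relaxed) {
        while (n->sizes[slot] <= i) ++slot;  // radix guess may undershoot
        if (slot > 0) i -= n->sizes[slot - 1];
    }
    return n->child[slot];
}
```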
SLIDE 14

Packed memory array (PMA)

  • Differs from the RRB tree:
    • Stores ordered elements (a set, not a list)
    • Tree is implicit
  • Maintains gaps between elements
  • Insertions require rebalancing (sketched below)
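A host-side sketch of an insert, with a sentinel value marking the gaps; the density-threshold rebalancing that bounds shift cost in a real PMA is omitted:

```cuda
#include <climits>
#include <vector>

constexpr int EMPTY = INT_MAX;  // sentinel marking a gap

// Keep the array sorted-with-gaps: find the insertion point, shift the
// run of occupied slots right into the nearest gap, place the key.
void pmaInsert(std::vector<int>& a, int key) {
    // first occupied slot holding an element >= key
    size_t pos = 0;
    while (pos < a.size() && (a[pos] == EMPTY || a[pos] < key)) ++pos;
    // nearest gap at or to the right of pos
    size_t gap = pos;
    while (gap < a.size() && a[gap] != EMPTY) ++gap;
    if (gap == a.size()) a.push_back(EMPTY);  // stand-in for rebalance/grow
    for (size_t i = gap; i > pos; --i) a[i] = a[i - 1];
    a[pos] = key;
}
```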
SLIDE 15

Cross-cutting issues

  • Useful models for GPU memory hierarchy
  • Independent threads vs. cooperative threads?
  • Memory allocation (& impact on hardware)
  • Persistent data structures
  • Integration into higher-level programming environments
  • Use cases!