 
Dynamic Data Structures for the GPU
John Owens
Child Family Professor of Engineering & Entrepreneurship
Department of Electrical & Computer Engineering, UC Davis
Joint work with Martin Farach-Colton

CUDA Programming Model (SPMD + SIMD)
• Flow: Copy data to “device” (GPU); run kernels; copy results back
• A kernel is executed as a grid of thread blocks
• One thread block maps to one GPU “core” (SM)
• A thread block is a batch of threads that can cooperate with each other by:
  • Efficiently sharing data through shared memory
  • Synchronizing their execution
• Two threads from two different blocks cannot cooperate
• Blocks are independent
[Figure: the host launches Kernel 1 on Grid 1, a 3 × 2 array of blocks, and Kernel 2 on Grid 2; Block (1, 1) is expanded into a 5 × 3 array of threads.]

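The grid/block/thread indexing above can be illustrated with a serial stand-in. This is a minimal host-side C++ sketch, not CUDA code: the function name `scale_on_emulated_grid` and its parameters are hypothetical, and the nested loops only imitate the way each (block, thread) pair derives a global index.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial emulation of a 1-D kernel launch: every (block, thread) pair runs
// the same body on the element selected by its global index.
std::vector<float> scale_on_emulated_grid(const std::vector<float>& in,
                                          float factor,
                                          std::size_t block_dim) {
    assert(block_dim > 0);
    std::vector<float> out(in.size());
    // Enough blocks to cover the input, rounding up.
    const std::size_t grid_dim = (in.size() + block_dim - 1) / block_dim;
    for (std::size_t block_idx = 0; block_idx < grid_dim; ++block_idx) {
        for (std::size_t thread_idx = 0; thread_idx < block_dim; ++thread_idx) {
            const std::size_t i = block_idx * block_dim + thread_idx;
            if (i < in.size()) {  // the last block may be only partly used
                out[i] = in[i] * factor;
            }
        }
    }
    return out;
}
```

On a GPU the two loops disappear: the hardware schedules the blocks independently, which is why the body may not rely on any ordering between blocks.
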
Computation/Memory Hierarchy
Level        Computation                                     Memory
Global       Kernels                                         DRAM (12 GB)
Per-block    Blocks (MIMD within a kernel) (~15)             L2 cache (1.57 MB)
Per-warp     Warps (MIMD within a block)                     Shared/L1 cache (48 kB/SM × 15 SMs = 720 kB)
Per-thread   Threads (32-wide SIMD within a warp) (≥ 30k)    Registers (64k/SM × 4 B/register = 262 kB/SM × 15 SMs = 3.93 MB)

Memory: What does/doesn’t matter
• Matters:
  • Use fastest level of memory hierarchy (e.g., thread coarsening)
  • Coalesced memory accesses (threads in a warp should access neighboring locations in memory)
• Doesn’t matter (to date):
  • Cache & cache-obliviousness
  • 1.57 MB L2 / 30k threads = 51 B/thread
[Figure: multisplit stages (initial key distribution, warp-level reordering, block-level reordering, multisplit result); axis labels: Buckets, positions 0–256.]

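To make the coalescing point concrete, the following C++ sketch counts how many memory segments a single warp touches for a given access stride. The 128-byte segment size and 4-byte element size are assumptions chosen for illustration, and `segments_touched` is a hypothetical helper, not part of any GPU API.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Count the distinct memory segments touched when thread t of a 32-thread
// warp reads the element at index t * stride.
// Assumed for illustration: 4-byte elements and 128-byte segments.
std::size_t segments_touched(std::size_t stride) {
    assert(stride > 0);
    constexpr std::size_t kWarpSize = 32;
    constexpr std::size_t kElemBytes = 4;
    constexpr std::size_t kSegmentBytes = 128;
    std::set<std::size_t> segments;
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        const std::size_t byte_address = t * stride * kElemBytes;
        segments.insert(byte_address / kSegmentBytes);
    }
    return segments.size();
}
```

Under these assumptions a stride of 1 keeps the whole warp inside one segment, while a stride of 32 sends every thread to a different segment.
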
NVIDIA OptiX & the BVH
Tero Karras. Maximizing parallelism in the construction of BVHs, octrees, and k-d trees. In High-Performance Graphics, HPG ’12, pages 33–37, June 2012.

The problem
• Many data structures are built on the CPU and used on the GPU
• Very few data structures can be built on the GPU
  • Sorted array
  • (Cuckoo) hash table
  • Several application-specific DS (e.g., BVH tree)
• No data structures can be updated on the GPU

Scale of updates
• Update 1–few items
  • Fall back to serial case, slow, probably don’t care
• Update very large number of items
  • Rebuild whole data structure from scratch
• Middle ground: our goal
  • Question: When do you do this in practice?

Approach
• Pick data structures useful in serial case, try to find parallelizations?
• Pick what look like parallel-friendly data structures with parallel-friendly updates?

If you think of other/interesting data structure candidates, I’m all ears! If you think “But surely he’s already considered X and rejected it”, you’re probably wrong!
Cache-oblivious lookup array (COLA)
• Supports dictionary and range queries
• log n sorted levels, each level 2x the size of the last
• Insert into a filled level results in a merge, possibly cascaded
• Operations are coarse-grained (threads cooperate)
[Figure: panels a), b), c).]

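The level structure and cascading merge can be sketched serially. This is a minimal single-key C++ illustration under stated assumptions, not the GPU implementation from the talk: the struct name `Cola` and its members are hypothetical, keys are plain `int`s, and there is no batching or thread cooperation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

// Serial COLA sketch: level k is either empty or holds exactly 2^k sorted keys.
struct Cola {
    std::vector<std::vector<int>> levels;

    void insert(int key) {
        std::vector<int> carry{key};  // a sorted run of size 2^0
        for (std::size_t k = 0;; ++k) {
            assert(carry.size() == (std::size_t{1} << k));
            if (k == levels.size()) levels.emplace_back();
            if (levels[k].empty()) {  // free level: store the run and stop
                levels[k] = std::move(carry);
                return;
            }
            // Filled level: merge it with the carried run and cascade upward.
            std::vector<int> merged;
            merged.reserve(carry.size() * 2);
            std::merge(levels[k].begin(), levels[k].end(),
                       carry.begin(), carry.end(),
                       std::back_inserter(merged));
            levels[k].clear();
            carry = std::move(merged);
        }
    }

    // Point query: binary-search every non-empty level.
    bool contains(int key) const {
        for (const auto& level : levels) {
            if (std::binary_search(level.begin(), level.end(), key)) return true;
        }
        return false;
    }
};
```

After n insertions the occupied levels mirror the binary representation of n, so five insertions leave one key in level 0 and four keys in level 2.
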
COLA results/questions
• Insertions/lookups for point queries
  • 600M/52M for COLA
  • 140M/326M for hash table
• Deletes using tombstones
• Semantics for parallel insert/delete operations?
• Minimum batch size?
• Atom size for searching?
• Fractional cascading?
Saman Ashkiani, Shengren Li, Martin Farach-Colton, Nina Amenta, and John D. Owens. GPU COLA: A dynamic dictionary data structure for the GPU. February 2016. Unpublished.

Hash-array mapped trie (HAMT)
• Hash maps in Clojure
• S-nodes (key-value pairs)
• C-nodes (branching nodes)
• Operations are fine-grained (threads operate independently)
• Has concurrent (CPU) implementation
• Requires fine-grained memory allocation
  • Custom memory allocators?
[Figure: example HAMT with a root C-node, C-nodes carrying bitmaps (0110, 0101), subtries, and S-nodes holding keys (0010, 1001, 0001).]

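The C-node bitmaps are what let a HAMT store only the children that exist. The C++ sketch below shows the usual bitmap-plus-popcount indexing; the 32-way branching (5 hash bits per level) is an assumption for illustration, and the helper names `slot_at_level`, `has_child`, and `child_index` are hypothetical.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Assumed for illustration: 32-way branching, so each trie level consumes
// 5 bits of a 32-bit hash.
inline unsigned slot_at_level(std::uint32_t hash, unsigned level) {
    assert(level < 7);  // 5 * level must stay below 32
    return (hash >> (5 * level)) & 31u;
}

// A set bit in the C-node bitmap means the slot has a child.
inline bool has_child(std::uint32_t bitmap, unsigned slot) {
    assert(slot < 32);
    return ((bitmap >> slot) & 1u) != 0;
}

// Children are stored packed, so a slot's position in the child array is the
// number of set bits below it.
inline unsigned child_index(std::uint32_t bitmap, unsigned slot) {
    assert(slot < 32);
    const std::uint32_t below = bitmap & ((std::uint32_t{1} << slot) - 1u);
    return static_cast<unsigned>(std::bitset<32>(below).count());
}
```

Because the child array is packed, inserting a child changes the array length, which is one reason the structure leans on fine-grained memory allocation.
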
Relaxed Radix Balanced (RRB) Trees
• Clojure and Scala’s Vector
• ~Relaxed unsorted B-tree
• Index/update/iterations cheap
• concat/insert-at/split in O(log n)

Packed memory array (PMA)
• Differs from RRB tree:
  • Stores ordered elements (set not list)
  • Tree is implicit
  • Maintains gaps between elements
  • Insertions require rebalancing

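The role of the gaps can be shown with a deliberately simplified serial sketch in C++. It is an illustration under stated assumptions, not a full PMA: instead of rebalancing hierarchical windows of the implicit tree, it respreads the whole array once more than half of the slots would be occupied. The names `Pma`, `rebalance`, `insert`, and `keys` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Simplified PMA sketch: a sorted set of ints stored with gaps.
struct Pma {
    std::vector<std::optional<int>> slots = std::vector<std::optional<int>>(4);
    std::size_t count = 0;

    // All stored keys, in order.
    std::vector<int> keys() const {
        std::vector<int> out;
        for (const auto& s : slots) if (s) out.push_back(*s);
        return out;
    }

    // Spread the stored keys evenly over `capacity` slots, preserving order.
    void rebalance(std::size_t capacity) {
        assert(capacity >= count);
        const std::vector<int> ks = keys();
        slots.assign(capacity, std::nullopt);
        for (std::size_t i = 0; i < ks.size(); ++i)
            slots[i * capacity / ks.size()] = ks[i];
    }

    void insert(int key) {
        // Simplification: global rebalance keeps density at or below 1/2.
        if ((count + 1) * 2 > slots.size()) rebalance(slots.size() * 2);
        // First occupied slot holding a key >= the new key.
        std::size_t pos = 0;
        while (pos < slots.size() && (!slots[pos] || *slots[pos] < key)) ++pos;
        if (pos < slots.size() && *slots[pos] == key) return;  // set semantics
        // Nearest gap to the left of pos, if any.
        std::size_t g = pos;
        while (g > 0 && slots[g - 1]) --g;
        if (g > 0) {
            // slots[g - 1] is empty: shift slots[g .. pos - 1] one step left.
            for (std::size_t i = g - 1; i + 1 < pos; ++i) slots[i] = slots[i + 1];
            slots[pos - 1] = key;
        } else {
            // No gap on the left: use the first gap at or after pos.
            std::size_t h = pos;
            while (h < slots.size() && slots[h]) ++h;
            assert(h < slots.size());  // guaranteed by the density bound
            for (std::size_t i = h; i > pos; --i) slots[i] = slots[i - 1];
            slots[pos] = key;
        }
        ++count;
    }
};
```

Each insertion moves only the keys between the insertion point and the nearest gap; the occasional respread is the rebalancing cost the slide refers to.
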
Cross-cutting issues
• Useful models for GPU memory hierarchy
• Independent threads vs. cooperative threads?
• Memory allocation (& impact on hardware)
• Persistent data structures
• Integration into higher-level programming environments
• Use cases!