

SLIDE 1

Dynamic Data Structures for the GPU

John Owens Child Family Professor of Engineering & Entrepreneurship Department of Electrical & Computer Engineering UC Davis Joint work with Martin Farach-Colton

SLIDE 2

CUDA Programming Model (SPMD + SIMD)

  • Flow: Copy data to “device” (GPU); run kernels; copy results back
  • A kernel is executed as a grid of thread blocks
  • One thread block maps to one GPU “core” (SM)
  • A thread block is a batch of threads that can cooperate with each other by:
    • Efficiently sharing data through shared memory
    • Synchronizing their execution
  • Two threads from two different blocks cannot cooperate
  • Blocks are independent

[Figure: the host launches Kernel 1 and Kernel 2 on the device; each kernel runs as a grid of thread blocks, and each block is an array of threads]
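A minimal sketch of this model in CUDA (an illustrative kernel, not from the talk): each block stages its tile of the input in shared memory, synchronizes, and reverses it; blocks never communicate with one another.

```cuda
#include <cstdio>

// Each block reverses its own 256-element tile using shared memory.
__global__ void reverseTiles(const int* in, int* out, int n) {
    __shared__ int tile[256];                 // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();                          // cooperate within the block
    int j = blockIdx.x * blockDim.x + (blockDim.x - 1 - threadIdx.x);
    if (j < n) out[j] = tile[threadIdx.x];
}

int main() {
    const int n = 1024;
    int *in, *out;
    cudaMallocManaged(&in, n * sizeof(int));  // managed memory stands in for
    cudaMallocManaged(&out, n * sizeof(int)); // explicit host<->device copies
    for (int i = 0; i < n; ++i) in[i] = i;
    reverseTiles<<<n / 256, 256>>>(in, out, n);  // grid of thread blocks
    cudaDeviceSynchronize();
    printf("out[0] = %d\n", out[0]);          // 255: tile 0 reversed
    cudaFree(in); cudaFree(out);
}
```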

SLIDE 3

Computation/Memory Hierarchy

Level      | Computation                                  | Memory
Global     | Kernels                                      | DRAM (12 GB)
Per-block  | Blocks (MIMD within a kernel) (~15)          | L2 cache (1.57 MB)
Per-warp   | Warps (MIMD within a block)                  | Shared/L1 cache (48 kB/SM × 15 SMs = 720 kB)
Per-thread | Threads (32-wide SIMD within a warp) (≥30k)  | Registers (64k/SM × 4 B/register = 262 kB/SM × 15 SMs = 3.93 MB)
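These figures are for one particular GPU; the same quantities can be queried at runtime with the standard CUDA runtime API (a small sketch):

```cuda
#include <cstdio>

int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);           // properties of device 0
    printf("DRAM:      %zu MB\n", p.totalGlobalMem >> 20);
    printf("SMs:       %d\n",     p.multiProcessorCount);
    printf("L2:        %d kB\n",  p.l2CacheSize >> 10);
    printf("Shared/SM: %zu kB\n", p.sharedMemPerMultiprocessor >> 10);
    printf("Regs/SM:   %d (x 4 B each)\n", p.regsPerMultiprocessor);
}
```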

SLIDE 4

Memory: What does/doesn’t matter

  • Matters:
    • Use the fastest level of the memory hierarchy (e.g., thread coarsening)
    • Coalesced memory accesses (threads in a warp should access neighboring locations in memory)
  • Doesn’t matter (to date):
    • Cache & cache-obliviousness: 1.57 MB L2 / 30k threads = 51 B/thread

[Figure: multisplit over 2–6 buckets — initial key distribution, warp-level reordering, block-level reordering, multisplit result]
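A toy illustration of the coalescing rule (kernel names are mine): in the first kernel a warp reads 32 adjacent words in a handful of transactions, while the strided version spreads the same warp across 32 separate memory segments for the same amount of useful data.

```cuda
// Coalesced: thread i reads a[i]; a warp covers one contiguous span.
__global__ void copyCoalesced(const float* a, float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] = a[i];
}

// Strided: consecutive threads are `stride` floats apart, so each warp
// touches many separate segments instead of one.
__global__ void copyStrided(const float* a, float* b, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) b[i] = a[i];
}
```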

SLIDE 5

NVIDIA OptiX & the BVH

Tero Karras. Maximizing parallelism in the construction of BVHs, octrees, and k-d trees. In High-Performance Graphics, HPG ’12, pages 33–37, June 2012.

SLIDE 6

The problem

  • Many data structures are built on the CPU and used on the GPU
  • Very few data structures can be built on the GPU:
    • Sorted array
    • (Cuckoo) hash table
    • Several application-specific DS (e.g., BVH tree)
  • No data structures can be updated on the GPU
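As an aside on why (cuckoo) hash tables are GPU-buildable: a lookup is at most two independent probes, as in this minimal sketch (the hash constants are illustrative).

```cuda
// Each key has exactly two candidate slots, so a lookup is at most two
// independent global reads (hash mixing constants here are illustrative).
__device__ bool cuckooContains(const int* table, unsigned capacity, int key) {
    unsigned h1 = (unsigned)key * 2654435761u % capacity;
    unsigned h2 = ((unsigned)key ^ 0x9e3779b9u) * 40503u % capacity;
    return table[h1] == key || table[h2] == key;
}
```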
SLIDE 7

Scale of updates

  • Update 1–few items
    • Fall back to the serial case: slow, probably don’t care
  • Update a very large number of items
    • Rebuild the whole data structure from scratch
  • Middle ground: our goal
  • Question: When do you do this in practice?
SLIDE 8

Approach

  • Pick data structures useful in the serial case, and try to find parallelizations?
  • Pick what look like parallel-friendly data structures with parallel-friendly updates?

SLIDE 9

If you think of other/interesting data structure candidates, I’m all ears! If you think “But surely he’s already considered X and rejected it”, you’re probably wrong!

SLIDE 10

Cache-oblivious lookup array

  • Supports dictionary and range queries
  • log n sorted levels, each level 2x the size of the last
  • Insert into a filled level results in a merge, possibly cascaded (sketched below); operations are coarse (threads cooperate)

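A host-side sketch of the cascaded merge, assuming sorted integer keys and Thrust for the merges (the names here are illustrative, not the paper's implementation):

```cuda
#include <thrust/device_vector.h>
#include <thrust/merge.h>
#include <thrust/sort.h>
#include <utility>
#include <vector>

struct Cola {
    // levels[i] is either empty or holds exactly (batch << i) sorted keys
    std::vector<thrust::device_vector<int>> levels;
    size_t batch;  // minimum insert granularity (level-0 capacity)

    explicit Cola(size_t batchSize) : batch(batchSize) {}

    void insert(thrust::device_vector<int> keys) {  // |keys| == batch
        thrust::sort(keys.begin(), keys.end());
        for (size_t i = 0; ; ++i) {
            if (i == levels.size()) levels.emplace_back();
            if (levels[i].empty()) {            // free slot: stop cascading
                levels[i] = std::move(keys);
                return;
            }
            // occupied: merge into a run twice as long, carry to level i+1
            thrust::device_vector<int> merged(keys.size() * 2);
            thrust::merge(levels[i].begin(), levels[i].end(),
                          keys.begin(), keys.end(), merged.begin());
            levels[i].clear();
            keys = std::move(merged);
        }
    }
};
```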

SLIDE 11

COLA results/questions

  • Insertions/lookups for point queries:
    • 600M/52M for COLA
    • 140M/326M for hash table
  • Deletes using tombstones
  • Semantics for parallel insert/delete operations?
  • Minimum batch size?
  • Atom size for searching?
  • Fractional cascading?

Saman Ashkiani, Shengren Li, Martin Farach-Colton, Nina Amenta, and John D. Owens. GPU COLA: A dynamic dictionary data structure for the GPU. February 2016. Unpublished.

SLIDE 12

Hash-array mapped trie (HAMT)

  • Hash maps in Clojure
  • S-nodes (key-value pairs)
  • C-nodes (branching nodes)
  • Operations are fine (threads operate independently)
  • Has a concurrent (CPU) implementation
  • Requires fine-grained memory allocation
    • Custom memory allocators?

[Diagram: HAMT — a root C-node (bitmap 0110) with subtrie and key slots; a child C-node (bitmap 0101) and S-nodes holding keys 0010, 0001, and 1001]
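A sketch of the fine-grained C-node lookup, assuming 32-way branching (5 hash bits per level); the field layout and names are illustrative:

```cuda
// A C-node's bitmap records which of the 32 branches exist; the children
// are stored compactly, so a child's slot is the popcount of the set bits
// below its branch index.
__device__ int childSlot(unsigned bitmap, unsigned hash, int level) {
    unsigned idx = (hash >> (5 * level)) & 31;  // 5 hash bits pick a branch
    if (!(bitmap & (1u << idx))) return -1;     // branch absent
    return __popc(bitmap & ((1u << idx) - 1));  // rank = compact slot index
}
```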

SLIDE 13

Relaxed Radix Balanced (RRB) Trees

  • Clojure and Scala’s Vector
  • ~Relaxed unsorted B-tree
  • Index/update/iterations cheap
  • concat/insert-at/split in O(log n)
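Why indexing stays cheap: balanced nodes are addressed by pure radix arithmetic, and relaxed nodes fall back to a small per-node cumulative-size table. A host-side sketch of one descent step, with assumed field names:

```cuda
// One step of RRB-tree indexing, assuming 32-way nodes.
struct RrbNode {
    RrbNode* child[32];
    int      sizes[32];   // cumulative element counts; valid only if relaxed
    bool     relaxed;
};

RrbNode* descend(const RrbNode* n, int& i, int shift) {
    int slot = (i >> shift) & 31;            // exact for balanced nodes
    if (n->relaxed) {
        while (n->sizes[slot] <= i) ++slot;  // radix guess may undershoot
        if (slot > 0) i -= n->sizes[slot - 1];
    }
    return n->child[slot];
}
```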
SLIDE 14

Packed memory array (PMA)

  • Differs from the RRB tree:
    • Stores ordered elements (a set, not a list)
    • Tree is implicit
  • Maintains gaps between elements
  • Insertions require rebalancing (sketched below)
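A host-side sketch of an insert, with a sentinel value marking the gaps; the density-threshold rebalancing that bounds shift cost in a real PMA is omitted:

```cuda
#include <climits>
#include <vector>

constexpr int EMPTY = INT_MAX;  // sentinel marking a gap

// Keep the array sorted-with-gaps: find the insertion point, shift the
// run of occupied slots right into the nearest gap, place the key.
void pmaInsert(std::vector<int>& a, int key) {
    // first occupied slot holding an element >= key
    size_t pos = 0;
    while (pos < a.size() && (a[pos] == EMPTY || a[pos] < key)) ++pos;
    // nearest gap at or to the right of pos
    size_t gap = pos;
    while (gap < a.size() && a[gap] != EMPTY) ++gap;
    if (gap == a.size()) a.push_back(EMPTY);  // stand-in for rebalance/grow
    for (size_t i = gap; i > pos; --i) a[i] = a[i - 1];
    a[pos] = key;
}
```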
SLIDE 15

Cross-cutting issues

  • Useful models for GPU memory hierarchy
  • Independent threads vs. cooperative threads?
  • Memory allocation (& impact on hardware)
  • Persistent data structures
  • Integration into higher-level programming environments
  • Use cases!