

SLIDE 1

CUB: A pattern of “collective” software design, abstraction, and reuse for kernel-level programming

DUANE MERRILL, PH.D.
NVIDIA RESEARCH

SLIDE 2

What is CUB?

1. A design model for collective kernel-level primitives
   How to make reusable software components for SIMT groups (warps, blocks, etc.)

2. A library of collective primitives
   Block-reduce, block-sort, block-histogram, warp-scan, warp-reduce, etc.

3. A library of global primitives (built from collectives)
   Device-reduce, device-sort, device-scan, etc.
   Demonstrate collective composition, performance, and performance-portability
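For a taste of the collective style, here is a minimal sketch following the WarpReduce usage documented for CUB (the 128-thread block / four-warp sizing is our assumption, not the deck's):

__global__ void WarpSumKernel(...)
{
    // Specialize cub::WarpReduce for int
    typedef cub::WarpReduce<int> WarpReduceT;

    // Allocate shared storage for the block's 4 warps (128 threads assumed)
    __shared__ typename WarpReduceT::TempStorage temp_storage[4];

    int thread_data = ...;   // one input item per thread

    // Each warp's lane 0 receives that warp's sum
    int warp_id  = threadIdx.x / 32;
    int warp_sum = WarpReduceT(temp_storage[warp_id]).Sum(thread_data);
}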

SLIDE 3

Outline

1. Software reuse
2. SIMT collectives: the “missing” CUDA abstraction layer
3. The soul of collective component design
4. Using CUB’s collective primitives
5. Making your own collective primitives
6. Other Very Useful Things in CUB
7. Final thoughts

SLIDE 4

Software reuse

Abstraction & composability are fundamental design principles

Reduce redundant programmer effort
  Save time, energy, money
  Reduce buggy software

Encapsulate complexity
  Empower productivity-oriented programmers
  Insulate them from the underlying hardware (five NVIDIA GPU architectures between 2008 and 2014)

Software reuse empowers a durable programming model

SLIDE 5

Software reuse

Abstraction & composability are fundamental design principles

Reduce redundant programmer effort
  Save time, energy, money
  Reduce buggy software

Encapsulate complexity
  Empower productivity-oriented programmers
  Insulate them from the changing capabilities of the underlying hardware (NVIDIA has produced nine different CUDA GPU architectures since 2008!)

Software reuse empowers a durable programming model

SLIDE 6

Outline

1. Software reuse
2. SIMT collectives: the “missing” CUDA abstraction layer
3. The soul of collective component design
4. Using CUB’s collective primitives
5. Making your own collective primitives
6. Other Very Useful Things in CUB
7. Final thoughts

SLIDE 7

Parallel programming is hard…

SLIDE 8

Parallel decomposition and grain sizing
Synchronization
Deadlock, livelock, and data races
Plurality of state
Plurality of flow control (divergence, etc.)
Bookkeeping control structures
Memory access conflicts, coalescing, etc.
Occupancy constraints from SMEM, RF, etc.
Algorithm selection and instruction scheduling
Special hardware functionality, instructions, etc.

No, cooperative parallel programming is hard…


SLIDE 10

CUDA today

[Figure: application → CUDA kernel function stub → threadblocks]

SLIDE 11

Software abstraction in CUDA

PROBLEM: virtually every CUDA kernel written today is cobbled together from scratch

This is a tunability, portability, and maintenance concern

[Figure: application → CUDA kernel function stub → kernel threadblock]

SLIDE 12

Software abstraction in CUDA

Collective software components reduce development cost and hide complexity, bugs, etc.

[Figure: application → kernel function stub (scalar interface) → collective functions such as BlockLoad, BlockSort, BlockStore (collective interface), per threadblock]

SLIDE 13

What do these applications have in common?

[Figure: parallel sparse graph traversal, parallel radix sort, parallel BWT compression, parallel SpMV]

SLIDE 14

What do these applications have in common?

Block-wide prefix-scan:
  Scan for enqueueing (sparse graph traversal)
  Scan for partitioning (radix sort)
  Scan for solving recurrences, e.g. move-to-front (BWT compression)
  Scan for segmented reduction (SpMV)

[Figure: parallel sparse graph traversal, parallel radix sort, parallel BWT compression, parallel SpMV]
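To make the common thread concrete, a minimal host-side sketch (ours, not from the deck) of “scan for partitioning”: an exclusive prefix sum over selection flags yields each selected item's scatter offset.

// flags[i] = 1 if item i is selected; exclusive sum -> output slot
int flags[8] = {1, 0, 1, 1, 0, 0, 1, 0};
int offsets[8];
int total = 0;
for (int i = 0; i < 8; ++i)
{
    offsets[i] = total;   // exclusive prefix sum of the flags
    total += flags[i];
}
// Selected item i scatters to out[offsets[i]]; 'total' items are kept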

SLIDE 15

Examples of parallel scan data flow

16 threads contributing 4 items each

[Figure: two block-wide scan data-flow diagrams across threads t0–t15]

Brent-Kung hybrid: work-efficient (~130 binary ops, depth 15)
Kogge-Stone hybrid: depth-efficient (~170 binary ops, depth 12)
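For intuition, the Kogge-Stone data flow maps directly onto warp shuffles; a minimal sketch for one 32-thread warp (our illustration, using the pre-Volta __shfl_up, which newer CUDA spells __shfl_up_sync with a member mask):

__device__ int WarpInclusiveSum(int x)
{
    int lane = threadIdx.x & 31;
    for (int offset = 1; offset < 32; offset *= 2)
    {
        int y = __shfl_up(x, offset);   // value from the lane 'offset' positions left
        if (lane >= offset)
            x += y;                     // lanes with no left neighbor keep their sum
    }
    return x;   // 5 steps: depth-efficient, but not work-efficient
}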

SLIDE 16

CUDA today

Kernel programming is complicated

[Figure: application → CUDA kernel function stub → threadblocks]


SLIDE 18

Outline

1. Software reuse
2. SIMT collectives: the “missing” CUDA abstraction layer
3. The soul of collective component design
4. Using CUB’s collective primitives
5. Making your own collective primitives
6. Other Very Useful Things in CUB
7. Final thoughts

SLIDE 19

Collective composition

CUB primitives are easily nested & sequenced

[Figure: application → CUDA stub → threadblocks, each invoking a BlockSort collective]

SLIDE 20

Collective composition

CUB primitives are easily nested & sequenced: BlockSort composes BlockRadixRank and BlockExchange

[Figure: the BlockSort collective expanded into BlockRadixRank and BlockExchange]

SLIDE 21

Collective composition

CUB primitives are easily nested & sequenced: BlockRadixRank is itself built from BlockScan

[Figure: BlockSort → BlockRadixRank → BlockScan, alongside BlockExchange]

SLIDE 22

Collective composition

CUB primitives are easily nested & sequenced: BlockScan is in turn built from WarpScan

[Figure: BlockSort → BlockRadixRank → BlockScan → WarpScan, alongside BlockExchange]

SLIDE 23

Tunable composition

Flexible grain size (the “shape” of the composition remains the same)

[Figure: the same BlockSort composition instantiated at different parallel widths]


SLIDE 25

Tunable composition

Algorithmic-variant selection

[Figure: the same BlockSort composition with different algorithmic variants selected for its constituent primitives]


SLIDE 29

CUB: device-wide performance-portability

vs. Thrust and NPP across the last 4 major NVIDIA arch families (Tesla, Fermi, Kepler, Maxwell)

                        Tesla C1060   Tesla C2050   Tesla K20c
Global radix sort (billions of 32b keys/sec)
  CUB                      0.50          1.05          1.40
  Thrust v1.7.1            0.51          0.71          0.66
Global prefix scan (billions of 32b items/sec)
  CUB                      8             14            21
  Thrust v1.7.1            4             6             6
Global histogram (billions of 8b items/sec)
  CUB                      2.7           16.2          19.3
  NPP                      2             2             –
Global partition-if (billions of 32b inputs/sec)
  CUB                      4.2           8.6           16.4
  Thrust v1.7.1            1.7           2.2           2.4

SLIDE 30

Outline

1. Software reuse
2. SIMT collectives: the “missing” CUDA abstraction layer
3. The soul of collective component design
4. Using CUB’s collective primitives
5. Making your own collective primitives
6. Other Very Useful Things in CUB
7. Final thoughts

SLIDE 31

CUB collective usage

__global__ void ExampleKernel(...)
{
}

SLIDE 32

CUB collective usage

1. Collective specialization

__global__ void ExampleKernel(...)
{
    // Specialize cub::BlockScan for 128 threads
    typedef cub::BlockScan<int, 128> BlockScanT;
}

SLIDE 33

CUB collective usage

3 parameter fields (specialization, construction, function call) + resource reflection

1. Collective specialization
2. Reflected shared resource type

__global__ void ExampleKernel(...)
{
    // Specialize cub::BlockScan for 128 threads
    typedef cub::BlockScan<int, 128> BlockScanT;

    // Allocate temporary storage in shared memory
    __shared__ typename BlockScanT::TempStorage scan_storage;
}

SLIDE 34

CUB collective usage

1. Collective specialization
2. Reflected shared resource type

__global__ void ExampleKernel(...)
{
    // Specialize cub::BlockScan for 128 threads
    typedef cub::BlockScan<int, 128> BlockScanT;

    // Allocate temporary storage in shared memory
    __shared__ typename BlockScanT::TempStorage scan_storage;

    // Obtain a tile of 512 items blocked across 128 threads
    int items[4];
    ...
}

SLIDE 35

CUB collective usage

1. Collective specialization
2. Reflected shared resource type
3. Collective construction
4. Collective function call

__global__ void ExampleKernel(...)
{
    // Specialize cub::BlockScan for 128 threads
    typedef cub::BlockScan<int, 128> BlockScanT;

    // Allocate temporary storage in shared memory
    __shared__ typename BlockScanT::TempStorage scan_storage;

    // Obtain a tile of 512 items blocked across 128 threads
    int items[4];
    ...

    // Compute a block-wide prefix sum
    BlockScanT(scan_storage).ExclusiveSum(items, items);
    ...
}


SLIDE 37

Sequencing CUB primitives

// A kernel for computing tiled prefix sums
__global__ void ExampleKernel(int* d_in, int* d_out)
{
}

SLIDE 38

Sequencing CUB primitives

1. Specialize the collective primitive types

// A kernel for computing tiled prefix sums
__global__ void ExampleKernel(int* d_in, int* d_out)
{
    // Specialize for 128 threads owning 4 integers each
    typedef cub::BlockLoad<int*, 128, 4>  BlockLoadT;
    typedef cub::BlockScan<int, 128>      BlockScanT;
    typedef cub::BlockStore<int*, 128, 4> BlockStoreT;
}

SLIDE 39

Sequencing CUB primitives

1. Specialize the collective primitive types
2. Allocate shared memory with a union of TempStorage structured-layout types

(adding to the kernel from the previous step:)

    // Allocate temporary storage in shared memory
    __shared__ union {
        typename BlockLoadT::TempStorage  load;
        typename BlockScanT::TempStorage  scan;
        typename BlockStoreT::TempStorage store;
    } temp_storage;

SLIDE 40

Sequencing CUB primitives

1. Specialize the collective primitive types
2. Allocate shared memory with a union of TempStorage structured-layout types
3. Block-wide load

(adding:)

    // Cooperatively load a tile of 512 items across 128 threads
    int items[4];
    BlockLoadT(temp_storage.load).Load(d_in, items);

SLIDE 41

Sequencing CUB primitives

1. Specialize the collective primitive types
2. Allocate shared memory with a union of TempStorage structured-layout types
3. Block-wide load
4. Barrier

(adding:)

    __syncthreads();    // Barrier for smem reuse

SLIDE 42

Sequencing CUB primitives

1. Specialize the collective primitive types
2. Allocate shared memory with a union of TempStorage structured-layout types
3. Block-wide load
4. Barrier
5. Block-wide scan

(adding:)

    // Compute a block-wide exclusive prefix sum
    BlockScanT(temp_storage.scan).ExclusiveSum(items, items);

SLIDE 43

Sequencing CUB primitives

1. Specialize the collective primitive types
2. Allocate shared memory with a union of TempStorage structured-layout types
3. Block-wide load
4. Barrier
5. Block-wide scan
6. Barrier

(adding:)

    __syncthreads();    // Barrier for smem reuse

SLIDE 44

Sequencing CUB primitives

1. Specialize the collective primitive types
2. Allocate shared memory with a union of TempStorage structured-layout types
3. Block-wide load
4. Barrier
5. Block-wide scan
6. Barrier
7. Block-wide store

The complete kernel:

// A kernel for computing tiled prefix sums
__global__ void ExampleKernel(int* d_in, int* d_out)
{
    // Specialize for 128 threads owning 4 integers each
    typedef cub::BlockLoad<int*, 128, 4>  BlockLoadT;
    typedef cub::BlockScan<int, 128>      BlockScanT;
    typedef cub::BlockStore<int*, 128, 4> BlockStoreT;

    // Allocate temporary storage in shared memory
    __shared__ union {
        typename BlockLoadT::TempStorage  load;
        typename BlockScanT::TempStorage  scan;
        typename BlockStoreT::TempStorage store;
    } temp_storage;

    // Cooperatively load a tile of 512 items across 128 threads
    int items[4];
    BlockLoadT(temp_storage.load).Load(d_in, items);

    __syncthreads();    // Barrier for smem reuse

    // Compute a block-wide exclusive prefix sum
    BlockScanT(temp_storage.scan).ExclusiveSum(items, items);

    __syncthreads();    // Barrier for smem reuse

    // Cooperatively store a tile of 512 items across 128 threads
    BlockStoreT(temp_storage.store).Store(d_out, items);
}

SLIDE 45

Tuning with CUB primitives

int* d_in;    // = ...
int* d_out;   // = ...

// Invoke kernel (GF110 Fermi)
ExampleKernel<<<1, 128>>>(d_in, d_out);

(kernel as on the previous slide)

SLIDE 46

Tuning with CUB primitives

// Invoke kernel (GF110 Fermi)
ExampleKernel<<<1, 128>>>(d_in, d_out);

(the kernel's data type becomes a template parameter:)

template <typename T>
__global__ void ExampleKernel(T* d_in, T* d_out)
{
    // Specialize for 128 threads owning 4 Ts each
    typedef cub::BlockLoad<T*, 128, 4>  BlockLoadT;
    typedef cub::BlockScan<T, 128>      BlockScanT;
    typedef cub::BlockStore<T*, 128, 4> BlockStoreT;
    ...
    T items[4];
    ...
}

SLIDE 47

Tuning with CUB primitives

// Invoke kernel (GF110 Fermi)
ExampleKernel<128><<<1, 128>>>(d_in, d_out);

(the block size becomes a template parameter:)

template <int BLOCK_THREADS, typename T>
__global__ void ExampleKernel(T* d_in, T* d_out)
{
    // Specialize for BLOCK_THREADS threads owning 4 items each
    typedef cub::BlockLoad<T*, BLOCK_THREADS, 4>  BlockLoadT;
    typedef cub::BlockScan<T, BLOCK_THREADS>      BlockScanT;
    typedef cub::BlockStore<T*, BLOCK_THREADS, 4> BlockStoreT;
    ...
}

SLIDE 48

Tuning with CUB primitives

// Invoke kernel (GF110 Fermi)
ExampleKernel<128, 4><<<1, 128>>>(d_in, d_out);

(the items-per-thread grain size becomes a template parameter:)

template <int BLOCK_THREADS, int ITEMS_PER_THREAD, typename T>
__global__ void ExampleKernel(T* d_in, T* d_out)
{
    // Specialize for BLOCK_THREADS threads owning ITEMS_PER_THREAD items each
    typedef cub::BlockLoad<T*, BLOCK_THREADS, ITEMS_PER_THREAD>  BlockLoadT;
    typedef cub::BlockScan<T, BLOCK_THREADS>                     BlockScanT;
    typedef cub::BlockStore<T*, BLOCK_THREADS, ITEMS_PER_THREAD> BlockStoreT;
    ...
    T items[ITEMS_PER_THREAD];
    ...
}

SLIDE 49

Tuning with CUB primitives

// Invoke kernel (GF110 Fermi)
ExampleKernel<128, 4, BLOCK_LOAD_WARP_TRANSPOSE><<<1, 128>>>(d_in, d_out);

(the load algorithm becomes a template parameter:)

template <int BLOCK_THREADS, int ITEMS_PER_THREAD, BlockLoadAlgorithm LOAD_ALGO, typename T>
__global__ void ExampleKernel(T* d_in, T* d_out)
{
    typedef cub::BlockLoad<T*, BLOCK_THREADS, ITEMS_PER_THREAD, LOAD_ALGO> BlockLoadT;
    ...
}

SLIDE 50

Tuning with CUB primitives

// Invoke kernel (GF110 Fermi)
ExampleKernel<128, 4, BLOCK_LOAD_WARP_TRANSPOSE, BLOCK_SCAN_RAKING><<<1, 128>>>(d_in, d_out);

(the scan algorithm becomes a template parameter:)

template <int BLOCK_THREADS, int ITEMS_PER_THREAD, BlockLoadAlgorithm LOAD_ALGO,
          BlockScanAlgorithm SCAN_ALGO, typename T>
__global__ void ExampleKernel(T* d_in, T* d_out)
{
    typedef cub::BlockScan<T, BLOCK_THREADS, SCAN_ALGO> BlockScanT;
    ...
}

SLIDE 51

Tuning with CUB primitives

// Invoke kernel (GF110 Fermi)
ExampleKernel<128, 4, BLOCK_LOAD_WARP_TRANSPOSE, LOAD_DEFAULT, BLOCK_SCAN_RAKING><<<1, 128>>>(d_in, d_out);

(the cache-load modifier becomes a template parameter, applied through a cache-modified iterator:)

template <int BLOCK_THREADS, int ITEMS_PER_THREAD, BlockLoadAlgorithm LOAD_ALGO,
          CacheLoadModifier LOAD_MODIFIER, BlockScanAlgorithm SCAN_ALGO, typename T>
__global__ void ExampleKernel(T* d_in, T* d_out)
{
    ...
    typedef cub::CacheModifiedInputIterator<LOAD_MODIFIER, T> InputItr;
    BlockLoadT(temp_storage.load).Load(InputItr(d_in), items);
    ...
}

SLIDE 52

Tuning with CUB primitives

// Invoke kernel (GK110 Kepler)
ExampleKernel<128, 21, BLOCK_LOAD_DIRECT, LOAD_LDG, BLOCK_SCAN_WARP_SCANS><<<1, 128>>>(d_in, d_out);

The fully tunable kernel:

template <
    int                BLOCK_THREADS,
    int                ITEMS_PER_THREAD,
    BlockLoadAlgorithm LOAD_ALGO,
    CacheLoadModifier  LOAD_MODIFIER,
    BlockScanAlgorithm SCAN_ALGO,
    typename           T>
__global__ void ExampleKernel(T* d_in, T* d_out)
{
    // Specialize for BLOCK_THREADS threads owning ITEMS_PER_THREAD items each
    typedef cub::BlockLoad<T*, BLOCK_THREADS, ITEMS_PER_THREAD, LOAD_ALGO> BlockLoadT;
    typedef cub::BlockScan<T, BLOCK_THREADS, SCAN_ALGO>                    BlockScanT;
    typedef cub::BlockStore<T*, BLOCK_THREADS, ITEMS_PER_THREAD>           BlockStoreT;

    // Allocate temporary storage in shared memory
    __shared__ union {
        typename BlockLoadT::TempStorage  load;
        typename BlockScanT::TempStorage  scan;
        typename BlockStoreT::TempStorage store;
    } temp_storage;

    // Cooperatively load a tile of items through a cache-modified iterator
    T items[ITEMS_PER_THREAD];
    typedef cub::CacheModifiedInputIterator<LOAD_MODIFIER, T> InputItr;
    BlockLoadT(temp_storage.load).Load(InputItr(d_in), items);

    __syncthreads();    // Barrier for smem reuse

    // Compute a block-wide exclusive prefix sum
    BlockScanT(temp_storage.scan).ExclusiveSum(items, items);

    __syncthreads();    // Barrier for smem reuse

    // Cooperatively store a tile of items
    BlockStoreT(temp_storage.store).Store(d_out, items);
}
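CUB itself binds tuning configurations like these to architectures through compile-time policy types; the following sketch mirrors that idiom, but the names (TilePolicy, Sm20Policy, Sm35Policy) are ours, not CUB's:

// Bundle one tuning configuration as a compile-time policy
template <
    int                     _BLOCK_THREADS,
    int                     _ITEMS_PER_THREAD,
    cub::BlockLoadAlgorithm _LOAD_ALGO,
    cub::CacheLoadModifier  _LOAD_MODIFIER,
    cub::BlockScanAlgorithm _SCAN_ALGO>
struct TilePolicy
{
    enum { BLOCK_THREADS = _BLOCK_THREADS, ITEMS_PER_THREAD = _ITEMS_PER_THREAD };
    static const cub::BlockLoadAlgorithm LOAD_ALGO     = _LOAD_ALGO;
    static const cub::CacheLoadModifier  LOAD_MODIFIER = _LOAD_MODIFIER;
    static const cub::BlockScanAlgorithm SCAN_ALGO     = _SCAN_ALGO;
};

// The two configurations from the preceding slides:
typedef TilePolicy<128,  4, cub::BLOCK_LOAD_WARP_TRANSPOSE,
                   cub::LOAD_DEFAULT, cub::BLOCK_SCAN_RAKING>     Sm20Policy;  // GF110 Fermi
typedef TilePolicy<128, 21, cub::BLOCK_LOAD_DIRECT,
                   cub::LOAD_LDG,     cub::BLOCK_SCAN_WARP_SCANS> Sm35Policy;  // GK110 Kepler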

SLIDE 53

Outline

1. Software reuse
2. SIMT collectives: the “missing” CUDA abstraction layer
3. The soul of collective component design
4. Using CUB’s collective primitives
5. Making your own collective primitives
6. Other Very Useful Things in CUB
7. Final thoughts

SLIDE 54

Block-wide prefix sum (simplified)

// Simple collective primitive for block-wide prefix sum
template <typename T, int BLOCK_THREADS>
class BlockScan
{
};

[Figure: inclusive prefix sum of inputs x0…x7 producing x0:x0, x0:x1, …, x0:x7]

SLIDE 55

Block-wide prefix sum (simplified)

// Simple collective primitive for block-wide prefix sum
template <typename T, int BLOCK_THREADS>
class BlockScan
{
public:
    // Type of shared memory needed by BlockScan
    typedef T TempStorage[BLOCK_THREADS];
};

SLIDE 56

Block-wide prefix sum (simplified)

// Simple collective primitive for block-wide prefix sum
template <typename T, int BLOCK_THREADS>
class BlockScan
{
public:
    // Type of shared memory needed by BlockScan
    typedef T TempStorage[BLOCK_THREADS];

    // Per-thread data (reference to shared storage)
    TempStorage &temp_storage;

    // Constructor
    BlockScan(TempStorage &storage) : temp_storage(storage) {}
};

SLIDE 57

Block-wide prefix sum (simplified)

// Simple collective primitive for block-wide prefix sum
template <typename T, int BLOCK_THREADS>
class BlockScan
{
public:
    // Type of shared memory needed by BlockScan
    typedef T TempStorage[BLOCK_THREADS];

    // Per-thread data (reference to shared storage)
    TempStorage &temp_storage;

    // Constructor
    BlockScan(TempStorage &storage) : temp_storage(storage) {}

    // Inclusive prefix sum operation (each thread contributes its own data item)
    T InclusiveSum(T thread_data)
    {
        int tid = threadIdx.x;

        #pragma unroll
        for (int i = 1; i < BLOCK_THREADS; i *= 2)
        {
            temp_storage[tid] = thread_data;
            __syncthreads();

            if (tid >= i)
                thread_data += temp_storage[tid - i];
            __syncthreads();
        }
        return thread_data;
    }
};
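A hypothetical kernel exercising this simplified primitive (our sketch, following the usage pattern from the earlier slides):

__global__ void ScanKernel(int* d_in, int* d_out)
{
    typedef BlockScan<int, 128> BlockScanT;
    __shared__ typename BlockScanT::TempStorage temp_storage;

    int thread_data = d_in[threadIdx.x];
    thread_data = BlockScanT(temp_storage).InclusiveSum(thread_data);
    d_out[threadIdx.x] = thread_data;
}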

SLIDE 58

Block-wide reduce-by-key (simplified)

[Figure: one tile's keys (a a b b c c c c), values (eight 1.0s), head-flags, prev-keys, scanned values, and scanned flags]

// Reduce-by-segment scan data type
struct ValueOffsetPair
{
    ValueT value;
    int    offset;

    // Sum operation
    ValueOffsetPair operator+(ValueOffsetPair &other)
    {
        ValueOffsetPair retval;
        retval.offset = offset + other.offset;
        retval.value  = (other.offset) ?
            other.value :            // head flag: start a new segment
            value + other.value;     // same segment: accumulate
        return retval;
    }
};
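A standalone trace of this operator (our sketch; it assumes ValueT = float) shows how a head flag in the right-hand operand resets the running sum while offset counts segment heads:

#include <cstdio>

struct ValueOffsetPair
{
    float value;
    int   offset;

    ValueOffsetPair operator+(const ValueOffsetPair &other) const
    {
        ValueOffsetPair retval;
        retval.offset = offset + other.offset;
        retval.value  = other.offset ? other.value : value + other.value;
        return retval;
    }
};

int main()
{
    // values of 1.0 each; offset = 1 flags a segment head: [a a | b b]
    ValueOffsetPair items[4] = {{1.0f, 1}, {1.0f, 0}, {1.0f, 1}, {1.0f, 0}};
    ValueOffsetPair running  = items[0];
    for (int i = 1; i < 4; ++i)
    {
        running = running + items[i];
        printf("after item %d: value=%.1f offset=%d\n", i, running.value, running.offset);
    }
    // Prints (2.0,1), then (1.0,2) -- the sum resets at the new head -- then (2.0,2)
    return 0;
}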

SLIDE 59

Block-wide reduce-by-key (simplified)

// Block-wide reduce-by-key
template <typename KeyT, typename ValueT, int BLOCK_THREADS, int ITEMS_PER_THREAD>
struct BlockReduceByKey
{
    // Parameterized BlockDiscontinuity type for keys
    typedef BlockDiscontinuity<KeyT, BLOCK_THREADS> BlockDiscontinuityT;

    // Parameterized BlockScan type
    typedef BlockScan<ValueOffsetPair, BLOCK_THREADS> BlockScanT;

    // Temporary storage type
    union TempStorage
    {
        typename BlockDiscontinuityT::TempStorage discontinuity;
        typename BlockScanT::TempStorage          scan;
    };

    // Reduce segments using the addition operator.
    // Returns the "carry-out" of the last segment
    ValueT Sum(
        TempStorage& temp_storage,                 // shared storage reference
        KeyT   keys[ITEMS_PER_THREAD],             // [in|out] keys
        ValueT values[ITEMS_PER_THREAD],           // [in|out] values
        int    segment_indices[ITEMS_PER_THREAD])  // [out] segment indices (-1 if invalid)
    {
        ...
    }
};

SLIDE 60

Block-wide reduce-by-key (simplified)

// Reduce segments using the addition operator.
// Returns the "carry-out" of the last segment
ValueT Sum(
    TempStorage& temp_storage,
    KeyT   keys[ITEMS_PER_THREAD],
    ValueT values[ITEMS_PER_THREAD],
    int    segment_indices[ITEMS_PER_THREAD])
{
    KeyT            prev_keys[ITEMS_PER_THREAD];
    ValueOffsetPair scan_items[ITEMS_PER_THREAD];

    // Set head segment_flags
    BlockDiscontinuityT(temp_storage.discontinuity).FlagHeads(
        segment_indices, keys, prev_keys);

    __syncthreads();

    // Unset the flag for the first item
    if (threadIdx.x == 0)
        segment_indices[0] = 0;

    // Zip values and segment_flags
    for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)
    {
        scan_items[ITEM].offset = segment_indices[ITEM];
        scan_items[ITEM].value  = values[ITEM];
    }

    // Exclusive scan of values and segment_flags
    ValueOffsetPair tile_aggregate;
    BlockScanT(temp_storage.scan).ExclusiveSum(
        scan_items, scan_items, tile_aggregate);

    // Unzip values and segment indices
    for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)
    {
        segment_indices[ITEM] = segment_indices[ITEM] ?
            scan_items[ITEM].offset : -1;
        keys[ITEM]   = prev_keys[ITEM];
        values[ITEM] = scan_items[ITEM].value;
    }

    // Return "carry-out"
    return tile_aggregate.value;
}
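A hypothetical kernel fragment using this component (our sketch; sized at 128 threads x 4 items to match the earlier examples):

__global__ void ReduceByKeyKernel(...)
{
    typedef BlockReduceByKey<int, float, 128, 4> BlockReduceByKeyT;
    __shared__ typename BlockReduceByKeyT::TempStorage temp_storage;

    int   keys[4];              // e.g., loaded with BlockLoad
    float values[4];
    int   segment_indices[4];
    ...

    float carry_out = BlockReduceByKeyT().Sum(temp_storage, keys, values, segment_indices);
    // carry_out is the partial sum of the tile's last (possibly continuing) segment
}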

SLIDE 61

Outline

1. Software reuse
2. SIMT collectives: the “missing” CUDA abstraction layer
3. The soul of collective component design
4. Using CUB’s collective primitives
5. Making your own collective primitives
6. Other Very Useful Things in CUB
7. Final thoughts

SLIDE 62

Cache-modified input iterators

#include <cub/cub.cuh>

// Standard layout type
struct Foo
{
    double x;
    char   y;
};

__global__ void Kernel(Foo* d_in, Foo* d_out)
{
    // In host or device code: create cache-modified wrappers
    cub::CacheModifiedInputIterator<cub::LOAD_LDG, Foo>  ldg_itr(d_in);
    cub::CacheModifiedOutputIterator<cub::STORE_WT, Foo> volatile_itr(d_out);

    volatile_itr[threadIdx.x] = ldg_itr[threadIdx.x];
}

code for sm_35
Function : _Z6KernelPdS_
    MOV R1, c[0x0][0x44];
    S2R R0, SR_TID.X;
    ISCADD R2, R0, c[0x0][0x140], 0x3;
    LDG.64 R4, [R2];
    LDG.64 R2, [R6];
    ISCADD R0, R0, c[0x0][0x144], 0x4;
    TEXDEPBAR 0x1;
    ST.WT.64 [R0], R4;
    TEXDEPBAR 0x0;
    ST.WT.64 [R0+0x8], R2;
    EXIT;

Cache load modifiers:
    LOAD_DEFAULT,   ///< Default (no modifier)
    LOAD_CA,        ///< Cache at all levels
    LOAD_CG,        ///< Cache at global level
    LOAD_CS,        ///< Cache streaming (likely to be accessed once)
    LOAD_CV,        ///< Cache as volatile (including cached system lines)
    LOAD_LDG,       ///< Cache as texture
    LOAD_VOLATILE,  ///< Volatile (any memory space)

SLIDE 63

Texture obj (and ref) input iterators

#include <cub/cub.cuh>

// Standard layout type
struct Foo
{
    int    y;
    double x;
};

template <typename InputIteratorT, typename OutputIteratorT>
__global__ void Kernel(InputIteratorT d_in, OutputIteratorT d_out)
{
    d_out[threadIdx.x] = d_in[threadIdx.x];
}

// Create a texture object input iterator
Foo* d_foo;
cub::TexObjInputIterator<Foo> d_foo_tex;
d_foo_tex.BindTexture(d_foo);

Kernel<<<1, 32>>>(d_foo_tex, d_foo);

d_foo_tex.UnbindTexture();

code for sm_35
Function : _Z6KernelIN3cub19TexObjInputIteratorI3FooiEEPS2_EvT_T0_
    MOV R1, c[0x0][0x44];
    S2R R0, SR_TID.X;
    IADD R2, R0, c[0x0][0x144];
    SHF.L R2, RZ, 0x1, R2;
    IADD R3, R2, 0x1;
    TLD.LZ.T R2, R2, 0x52, 1D, 0x1;
    TLD.LZ.P R4, R3, 0x52, 1D, 0x3;
    ISCADD R0, R0, c[0x0][0x150], 0x4;
    TEXDEPBAR 0x1;
    ST [R0], R2;
    TEXDEPBAR 0x0;
    ST.64 [R0+0x8], R4;
    EXIT;

SLIDE 64

Collective primitives

WarpReduce              reduction & segmented reduction
WarpScan
BlockDiscontinuity
BlockExchange
BlockHistogram
BlockLoad & BlockStore
BlockRadixSort
BlockReduce
BlockScan
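For flavor, one entry from this list in use (a minimal sketch following CUB's documented BlockRadixSort pattern; the 128-thread x 4-key tile is our assumption):

__global__ void SortKernel(int* d_keys)
{
    // Specialize for a tile of 128 threads owning 4 integer keys each
    typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSortT;
    __shared__ typename BlockRadixSortT::TempStorage temp_storage;

    int thread_keys[4];
    ...   // load 4 keys per thread (e.g., with cub::BlockLoad)

    BlockRadixSortT(temp_storage).Sort(thread_keys);   // block-wide ascending sort

    ...   // store the sorted tile
}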

SLIDE 65

Device-wide (global) primitives

(Usable with CDP, streams, and your own memory allocator)

DeviceHistogram          histogram-even, histogram-range
DevicePartition          partition-if, partition-flagged
DeviceRadixSort          ascending / descending
DeviceReduce             reduction, arg-min, arg-max, reduce-by-key
DeviceRunLengthEncode    RLE, non-trivial segments
DeviceScan               inclusive / exclusive
DeviceSelect             select-flagged, select-if, keep-unique
DeviceSpmv
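These all follow CUB's double-invocation convention for temporary storage; a minimal sketch for DeviceReduce (error handling omitted):

#include <cub/cub.cuh>

void SumExample(int* d_in, int* d_out, int num_items)
{
    void*  d_temp_storage     = NULL;
    size_t temp_storage_bytes = 0;

    // First call: d_temp_storage is NULL, so only temp_storage_bytes is written
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

    cudaMalloc(&d_temp_storage, temp_storage_bytes);   // or your own allocator

    // Second call: actually run the reduction
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

    cudaFree(d_temp_storage);
}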

SLIDE 66

NEW: performance-resilient histogram

Simple intra-thread RLE provides a uniform performance response regardless of input sample distribution

[Chart: avg elapsed time (µs, 50-350) for a 1-channel histogram of 1920x1080 uchar1 pixels on GeForce GTX 980: RLE CUB vs. SMEM atomic vs. GMEM atomic]

// RLE pixel counts within the thread's pixels
int accumulator = 1;
for (int PIXEL = 0; PIXEL < PIXELS_PER_THREAD - 1; ++PIXEL)
{
    if (bins[PIXEL] == bins[PIXEL + 1])
    {
        accumulator++;
    }
    else
    {
        atomicAdd(privatized_histogram + bins[PIXEL], accumulator);
        accumulator = 1;
    }
}

// Commit the thread's final run
atomicAdd(privatized_histogram + bins[PIXELS_PER_THREAD - 1], accumulator);

SLIDE 67

NEW: CSR SpMV

Merge-based parallel decomposition for load balance

[Figure: merge-path grid over the CSR row-offsets (2 2 4 8) and the natural-number indices of the non-zeros; all non-zero values 1.0; result vector (2.0, 0.0, 2.0, 4.0)]

[Chart: fp32 SpMV gigaflops (5-40) on Tesla K40, CUB vs. cuSPARSE]
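The decomposition gives each thread an equal share of the merged sequence of row-offsets and non-zero indices; the split point on each cross-diagonal is found with the standard merge-path binary search. A sketch (our formulation and names, not CUB's):

// Find the merge-path split on a given diagonal: x row-offsets and
// (diagonal - x) non-zeros are consumed before the split. List B is the
// identity sequence 0, 1, 2, ..., so b[j] == j.
__host__ __device__ int MergePathSearch(
    int        diagonal,
    const int* row_offsets,    // CSR row-offsets (list A)
    int        num_rows,
    int        num_nonzeros)
{
    int x_min = (diagonal > num_nonzeros) ? diagonal - num_nonzeros : 0;
    int x_max = (diagonal < num_rows) ? diagonal : num_rows;

    while (x_min < x_max)
    {
        int x = (x_min + x_max) / 2;
        if (row_offsets[x] <= diagonal - x - 1)
            x_min = x + 1;     // consume another row ending
        else
            x_max = x;         // consume another non-zero
    }
    return x_min;              // split = (x_min, diagonal - x_min)
}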

SLIDE 68

Outline

1. Software reuse
2. SIMT collectives: the “missing” CUDA abstraction layer
3. The soul of collective component design
4. Using CUB’s collective primitives
5. Making your own collective primitives
6. Other Very Useful Things in CUB
7. Final thoughts

SLIDE 69

Benefits of using CUB primitives

Simplicity of composition
  Kernels are simply sequences of primitives

High performance
  CUB uses the best-known algorithms, abstractions, and strategies

Performance portability
  CUB is specialized for the target hardware (e.g., memory conflict rules, special instructions, etc.)

Simplicity of tuning
  CUB adapts to various grain sizes (threads per block, items per thread, etc.)
  CUB provides alternative algorithms

Robustness and durability
  CUB supports arbitrary data types and block sizes

SLIDE 70

Questions?

Please visit the CUB project on GitHub

http://nvlabs.github.com/cub

Duane Merrill (dumerrill@nvidia.com)

SLIDE 71

THANK YOU
