

SLIDE 1

April 4-7, 2016 | Silicon Valley

SHANKARA RAO THEJASWI NANDITALE, NVIDIA CHRISTOPH ANGERER, NVIDIA

DEEP DIVE INTO DYNAMIC PARALLELISM

SLIDE 2

OVERVIEW AND INTRODUCTION

SLIDE 3

WHAT IS DYNAMIC PARALLELISM?

The ability to launch new kernels from the GPU

  • Dynamically - based on run-time data
  • Simultaneously - from multiple threads at once
  • Independently - each thread can launch a different grid
  • Introduced with CUDA 5.0 and compute capability 3.5 and up

[Figure] Fermi: only the CPU can generate GPU work. Kepler: the GPU can generate work for itself.

SLIDE 4


DYNAMIC PARALLELISM

SLIDE 5

AN EASY-TO-PARALLELIZE PROGRAM

    for i = 1 to N
        for j = 1 to M
            convolution(i, j)
        next j
    next i
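The regular N x M nest maps directly to one CUDA thread per (i, j) element. A minimal sketch of that mapping (the kernel name, the empty convolution body, and the 16x16 block shape are illustrative choices, not from the slides):

```cuda
// One thread per (i, j) element; convolution() stands in for the real stencil.
__device__ void convolution(int i, int j) { /* ... per-element work ... */ }

__global__ void convolveAll(int N, int M)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N && j < M)
        convolution(i, j);
}

// Host launch: a 2D grid covering the full N x M domain.
// dim3 blk(16, 16);
// dim3 grd((M + 15) / 16, (N + 15) / 16);
// convolveAll<<<grd, blk>>>(N, M);
```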

SLIDE 6

A DIFFICULT-TO-PARALLELIZE PROGRAM

    for i = 1 to N
        for j = 1 to x[i]
            convolution(i, j)
        next j
    next i

SLIDE 7

A DIFFICULT-TO-PARALLELIZE PROGRAM

    for i = 1 to N
        for j = 1 to x[i]
            convolution(i, j)
        next j
    next i

[Figure] Two bad static mappings of this loop nest onto an N x max(x[i]) thread grid:
Bad alternative #1: Idle Threads
Bad alternative #2: Tail Effect

SLIDE 8

DYNAMIC PARALLELISM

Serial Program:

    for i = 1 to N
        for j = 1 to x[i]
            convolution(i, j)
        next j
    next i

CUDA Program with Dynamic Parallelism:

    __global__ void convolution(int x[])
    {
        for j = 1 to x[blockIdx]
            kernel<<< ... >>>(blockIdx, j)
    }

    void main() {
        setup(x);
        convolution<<< N, 1 >>>(x);   // one block per outer iteration
    }
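The slide's fragment, expanded into a compilable sketch (childKernel and the x setup are placeholder names; device-side launches require compute capability 3.5+ and nvcc with -rdc=true -lcudadevrt):

```cuda
// Placeholder per-element kernel, launched from the device.
__global__ void childKernel(int i, int j) { /* ... work for element (i, j) ... */ }

// One parent block per row i; each launches x[i] children based on run-time data.
__global__ void convolution(const int *x)
{
    for (int j = 0; j < x[blockIdx.x]; ++j)
        childKernel<<<1, 1>>>(blockIdx.x, j);
}

int main()
{
    const int N = 4;
    int hx[N] = {1, 2, 3, 4};          // per-row work counts (placeholder data)
    int *dx;
    cudaMalloc(&dx, N * sizeof(int));
    cudaMemcpy(dx, hx, N * sizeof(int), cudaMemcpyHostToDevice);
    convolution<<<N, 1>>>(dx);         // one block per row
    cudaDeviceSynchronize();           // waits for parents and all their children
    cudaFree(dx);
    return 0;
}
```

Compile with something like: nvcc -arch=sm_35 -rdc=true example.cu -lcudadevrt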

SLIDE 9

EXPERIMENT

* Device/SDK = K40m/v7.5
* K40m-CPU = E5-2690

[Chart] Time (ms, lower is better) vs. matrix size (512-16384) for the dynpar, idleThreads, and tailEffect implementations.

SLIDE 10

LAUNCH EXAMPLE

[Diagram] Grid A (block A0) is running on an SM. A0 calls B<<<1,1>>>(), which compiles to cudaLaunchDevice( B, 1, 1 );

SLIDE 11

LAUNCH EXAMPLE

[Diagram] Step 1: Allocate a Task data structure for B.

SLIDE 12

LAUNCH EXAMPLE

[Diagram] Step 2: Fill out the Task data structure for B.

SLIDE 13

LAUNCH EXAMPLE

[Diagram] Step 3: Track Task B in block A0's tracking structure.

SLIDE 14

LAUNCH EXAMPLE

[Diagram] Step 4: Launch Task B to the GPU; Grid B is handed to the Grid Scheduler.

SLIDE 15

LAUNCH EXAMPLE

[Diagram] With Grid A and Grid B both resident (blocks A0 and B0), A0 calls C<<<1,1>>>(), i.e. cudaLaunchDevice( C, 1, 1 );

SLIDE 16

LAUNCH EXAMPLE

[Diagram] Allocate, fill out, and track Task C in block A0.

SLIDE 17

LAUNCH EXAMPLE

[Diagram] Task C is not yet runnable; C is tracked to run after B.

SLIDE 18

LAUNCH EXAMPLE

[Diagram] Task B completes. SKED runs the Scheduler kernel.

SLIDE 19

LAUNCH EXAMPLE

[Diagram] The Scheduler searches the tracking structures for work.

SLIDE 20

LAUNCH EXAMPLE

[Diagram] The Scheduler completes B and identifies C as ready-to-run.

SLIDE 21

LAUNCH EXAMPLE

[Diagram] The Scheduler frees B's task structure for re-use and launches C<<<1,1>>>() to the Grid Scheduler.

SLIDE 22

LAUNCH EXAMPLE

[Diagram] Grid C (block C0) now executes alongside Grid A.
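The A-launches-B-then-C sequence traced above, as a minimal sketch (kernel bodies are placeholders): both launches go into the block's implicit NULL stream, which is why the runtime tracks C to run only after B completes.

```cuda
__global__ void B() { /* ... */ }
__global__ void C() { /* ... */ }

__global__ void A()
{
    // Both child grids are launched into block A0's implicit NULL stream,
    // so the device runtime orders them: C is tracked to run after B.
    B<<<1, 1>>>();
    C<<<1, 1>>>();
}
```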

SLIDE 23

BASIC RULES

Programming Model

  • Essentially the same as CUDA
  • Launch is per-thread and asynchronous
  • Sync is per-block
  • CUDA primitives are per-block (cannot pass streams/events to children)
  • cudaDeviceSynchronize() != __syncthreads()
  • Events allow inter-stream dependencies
  • Streams are shared within a block
  • The implicit NULL stream results in ordering within a block; use named streams for concurrency

[Diagram] Timeline: a CPU thread launches Grid A (parent); a Grid A thread launches Grid B (child); Grid B completes before Grid A completes.

CUDA API available on the device: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#api-reference
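A sketch of the per-thread launch and per-block sync rules above (childKernel and the counts array are hypothetical names):

```cuda
// Placeholder child kernel.
__global__ void childKernel(int id) { /* ... */ }

__global__ void parent(const int *counts)
{
    // Launch is per-thread: every thread of this block may launch its own grid.
    childKernel<<<counts[threadIdx.x], 64>>>(threadIdx.x);

    // Sync is per-block: this waits for the children launched by ALL threads
    // of the calling block, not just this thread's child.
    cudaDeviceSynchronize();
}
```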

SLIDE 24

MEMORY CONSISTENCY RULES

Memory Model

  • Launch implies membar (child sees parent state at time of launch)
  • Sync implies invalidate (parent sees child writes after sync)
  • Texture changes by the child are visible to the parent after sync (i.e. sync == tex cache invalidate)
  • Constants are immutable
  • Local & shared memory are private: they cannot be passed as child kernel args

[Diagram] The same timeline: once Grid B completes and the parent syncs, the parent and child views of memory are fully consistent.
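A minimal sketch of the launch-implies-membar and sync-implies-invalidate rules (the data array and kernel names are illustrative):

```cuda
__device__ int data[64];

__global__ void child()
{
    // Launch implies membar: the child sees the parent's writes made
    // before the launch.
    data[threadIdx.x] *= 2;
}

__global__ void parent()
{
    if (threadIdx.x == 0) {
        for (int i = 0; i < 64; ++i)
            data[i] = i;              // state visible to the child at launch
        child<<<1, 64>>>();
        cudaDeviceSynchronize();      // sync implies invalidate:
                                      // the parent now sees the child's writes
        // data[i] == 2*i is guaranteed to be visible from here on
    }
}
```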

SLIDE 25

EXPERIMENTS

SLIDE 26

DIRECTED BENCHMARKS

Kernels written to measure specific aspects of dynamic parallelism:
  • Launch throughput
  • Launch latency
As a function of different configurations:
  • SDK versions
  • Varying clocks

SLIDE 27

RESULTS – LAUNCH THROUGHPUT

SLIDE 28

LAUNCH THROUGHPUT

* Device/SDK/mem-clk,gpu-clk = K40m/v7.5/875
* K40m-CPU = E5-2690
* Host launches are with 32 streams

[Chart] Grids/sec vs. number of child kernels launched (32-65536): K40m device-side launches vs. K40m-CPU host-side launches.

SLIDE 29

LAUNCH THROUGHPUT

Observations

  • About an order of magnitude higher than launching from the host
  • Dynamic parallelism is very useful when there are a lot of child kernels
  • Two major limiters of launch throughput: the Pending Launch Count and the Grid Scheduler Limit

SLIDE 30

PENDING LAUNCH COUNT

* Device/SDK/mem-clk,gpu-clk = K40/v7.5/3004,875
* Different curves represent different pending launch count limits

[Chart] Grids/sec vs. number of child kernels launched (32-65536), for pending launch count limits of 1024, 4096, 16384, and 32768.

SLIDE 31

PENDING LAUNCH COUNT

Observations

  • A pre-allocated buffer in global memory stores launched kernels before they run
  • Default value: 2048 pending kernels
  • Overflowing the buffer forces a resize on the fly, causing a substantial reduction in launch throughput!
  • Know the number of pending child kernels!

SLIDE 32

PENDING LAUNCH COUNT

CUDA APIs

Setting the limit:

    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, yourLimit);

Querying the limit:

    cudaDeviceGetLimit(&yourLimit, cudaLimitDevRuntimePendingLaunchCount);
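Typical host-side usage, assuming the number of pending children is known up front (the 16384 figure is an arbitrary example, not from the slides):

```cuda
#include <cstdio>

int main()
{
    // Default is 2048 pending launches; overflowing it triggers a costly
    // on-the-fly resize of the global-memory launch buffer.
    size_t limit = 0;
    cudaDeviceGetLimit(&limit, cudaLimitDevRuntimePendingLaunchCount);
    printf("pending launch count limit: %zu\n", limit);

    // If a kernel may have up to 16384 children in flight, raise the
    // limit before launching it.
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 16384);
    return 0;
}
```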

SLIDE 33

GRID SCHEDULER LIMIT

* Device/SDK/mem-clk,gpu-clk = K40/v7.5/3004,875
* Different curves represent the total number of child kernels launched

[Chart] Grids/sec vs. number of device streams (8-256); curves for 512 to 16384 total child kernels.

SLIDE 34

GRID SCHEDULER LIMIT

Ability of grid scheduler to track the number of concurrent kernels The limit is currently 32 If this limit is crossed, upto 50% loss in launch throughput

Observations

SLIDE 35

RESULTS – LAUNCH LATENCY

SLIDE 36

LAUNCH LATENCY

* Device/SDK/mem-clk,gpu-clk = K40m/v7.5/3004,875
* K40m-CPU = E5-2690
* Host launches are with 32 streams

[Chart] Launch latency, initial and subsequent (ns, up to ~30000): K40m device-side vs. K40m-CPU host-side.

SLIDE 37

LAUNCH LATENCY

Observations

  • Initial and subsequent latencies are about 2-3x slower than launching from the host
  • Dynamic parallelism may currently not be a good choice when there are only a few child kernels, or when kernel launches are serial
  • We are working towards improving this**

** Characterization and Analysis of Dynamic Parallelism in Unstructured GPU Applications, Jin Wang and Sudhakar Yalamanchili, 2014 IEEE International Symposium on Workload Characterization (IISWC).

SLIDE 38

LAUNCH LATENCY - STREAMS

* Device/SDK/mem-clk,gpu-clk = K40m/v7.5/3004,875

[Charts] Launch latency (ns) vs. number of streams (2-16): with host streams (up to ~350000 ns) and with device streams (up to ~18000 ns).

SLIDE 39

LAUNCH LATENCY - STREAMS

Observations

  • Host streams affect device-side launch latency
  • Prefer device streams for dynamic parallelism
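A sketch of a fire-and-forget child launch into a named device stream (childKernel is a placeholder; device-side streams must be created with cudaStreamNonBlocking):

```cuda
// Placeholder child kernel.
__global__ void childKernel() { /* ... */ }

__global__ void parent()
{
    // Device-side streams require the cudaStreamNonBlocking flag.
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

    // Launching into a named stream avoids the implicit NULL-stream
    // serialization within the block.
    childKernel<<<1, 32, 0, s>>>();

    // Releases the handle; work already queued into the stream still runs.
    cudaStreamDestroy(s);
}
```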

SLIDE 40

RESULTS – DEVICE SYNCHRONIZE

SLIDE 41

DEVICE SYNCHRONIZE

cudaDeviceSynchronize is costly. Avoid it when possible, as in the example below:

    __global__ void parent()
    {
        doSomeInitialization();
        childKernel<<<grid,blk>>>();
        cudaDeviceSynchronize();   // Unnecessary: an implicit join is
                                   // enforced by the programming model!
    }

SLIDE 42

DEVICE SYNCHRONIZE - COST

* Device/SDK = K40/v7.5

[Chart] Time (ms) vs. amount of work per thread (2-32; the higher the number, the more the work), sync vs. nosync.

SLIDE 43

DEVICE SYNCHRONIZE DEPTH

  • The deepest recursion level down to which cudaDeviceSynchronize works
  • Controlled by the CUDA limit cudaLimitDevRuntimeSyncDepth
  • Default is level 2
  • Raising it costs extra global memory, reserved for storing parent blocks
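Host-side sketch of raising the sync depth (the value 4 is an arbitrary example):

```cuda
int main()
{
    // Allow cudaDeviceSynchronize() from kernels nested up to 4 levels deep.
    // Must be set from the host before the top-level launch; each extra level
    // reserves additional global memory for saved parent-block state.
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 4);
    return 0;
}
```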

SLIDE 44

DEVICE SYNCHRONIZE DEPTH

Memory Usage

[Chart] Memory reserved (MB, roughly 100-800) vs. device synchronize depth (2-5).

SLIDE 45

DEVICE SYNCHRONIZE DEPTH

  • cudaDeviceSynchronize fails silently beyond the set SyncDepth
  • Use cudaGetLastError on the device to inspect the error

Error Handling

[Diagram] Kernels nested at depths 1 through 5 with SyncDepth=2: synchronization from kernels deeper than depth 2 fails.

SLIDE 46

DYNAMIC PARALLELISM - LIMITS

SLIDE 47

DYNAMIC PARALLELISM

Limits

  • Recursion depth is currently limited to 24
  • Maximum size of the formal parameters of a child kernel is 4096 B; violation causes a compile-time error
  • Runtime exceptions in a child kernel are only visible from the host side

SLIDE 48

ERROR HANDLING

Runtime exceptions in child kernels are visible only from the host side.

  • Use nvcc's -lineinfo together with cuda-memcheck to locate the error

    __global__ void child(float* arr) { arr[0] = 1.0f; }

    __global__ void parent()
    {
        child<<<1,1>>>(NULL);
        cudaDeviceSynchronize();
        printf("%d\n", cudaGetLastError());   // Control never reaches here!
    }

    // Host side:
    parent<<<1,1>>>();
    cudaError_t err = cudaDeviceSynchronize();   // Error caught here

SLIDE 49

SUCCESS STORIES

SLIDE 50

FMM

Fast Multipole Method

  • Solving the N-body problem
  • Computational complexity O(n)
  • Tree-based approach

Image source: http://www.bu.edu/pasi/courses/12-steps-to-having-a-fast-multipole-method-on-gpus/

SLIDE 51

FMM (2)

  • Dynamic 1: launch child grids for neighbors and children
  • Dynamic 2: launch child grids for children only
  • Dynamic 3: launch child grids for children only; start only p2 kernel threads; use shared GPU memory

Performance (lower is better)

From: FMM goes GPU - A smooth trip or bumpy ride?, B. Kohnke, I. Kabadshow, MPI BPC Göttingen & Jülich Supercomputing Centre, GTC 2015

SLIDE 52

PANDA

anti-Proton ANnihilation at DArmstadt

  • State-of-the-art hadron particle physics experiment
SLIDE 53

PANDA (2)

Performance and Reasons for Improvements

  • Avoiding extra PCI-e data transfers for launch configuration data dependencies
  • Higher launch throughput
  • Reducing false dependencies between kernel launches: waiting on a stream prevents enqueuing of work into other streams

Source: A CUDA Dynamic Parallelism Case Study: PANDA, Andrew Adinetz

http://devblogs.nvidia.com/parallelforall/a-cuda-dynamic-parallelism-case-study-panda/

SLIDE 54

SUMMARY

SLIDE 55

WHEN TO USE CUDA DYNAMIC PARALLELISM

Three Good Reasons

  • Algorithmic: "Dynamically Formed Pockets of Structured Parallelism"*
      • Unbalanced load (e.g., vertex expansion in graphs, compressed sparse row)
      • Tree traversal (fat and shallow computation trees)
      • Adaptive Mesh Refinement
  • Performance:
      • Improve launch throughput
      • Reduce PCIe traffic and false dependencies
  • Maintenance:
      • Simplified, more natural program flow

*) from: Characterization and Analysis of Dynamic Parallelism in Unstructured GPU Applications, J. Wang and S. Yalamanchili, IISWC 2014
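The "unbalanced load" case above can be sketched as CSR vertex expansion: each parent thread launches a child grid sized to its own row's degree instead of padding every row to the longest one (all names here are hypothetical, for illustration only):

```cuda
// Hypothetical CSR neighbor expansion: rowPtr/colIdx form a standard CSR graph.
__global__ void expandNeighbors(const int *colIdx, int row, int start, int degree)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < degree) {
        int neighbor = colIdx[start + k];
        // ... visit edge (row, neighbor) ...
        (void)neighbor;
    }
}

__global__ void expandFrontier(const int *rowPtr, const int *colIdx, int numRows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < numRows) {
        int start  = rowPtr[row];
        int degree = rowPtr[row + 1] - start;
        if (degree > 0)   // child grid sized to this row's actual degree
            expandNeighbors<<<(degree + 127) / 128, 128>>>(colIdx, row, start, degree);
    }
}
```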

SLIDE 56

REFERENCES

  • CUDA C Programming Guide, http://docs.nvidia.com/cuda/cuda-c-programming-guide/#cuda-dynamic-parallelism
  • Adaptive Parallel Computation with CUDA Dynamic Parallelism, https://devblogs.nvidia.com/parallelforall/introduction-cuda-dynamic-parallelism/
  • FMM goes GPU, B. Kohnke and I. Kabadshow, GTC 2015, https://shar.es/1Y38Vf
SLIDE 57

April 4-7, 2016 | Silicon Valley

THANK YOU

JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join