 
              April 4-7, 2016 | Silicon Valley DEEP DIVE INTO DYNAMIC PARALLELISM SHANKARA RAO THEJASWI NANDITALE, NVIDIA CHRISTOPH ANGERER, NVIDIA 1
OVERVIEW AND INTRODUCTION 2
WHAT IS DYNAMIC PARALLELISM? The ability to launch new kernels from the GPU Dynamically - based on run-time data  Simultaneously - from multiple threads at once  Independently - each thread can launch a different grid  Introduced with CUDA 5.0 and compute capability 3.5 and up  CPU GPU CPU GPU Fermi: Only CPU can generate GPU work Kepler: GPU can generate work for itself 3
DYNAMIC PARALLELISM CPU GPU CPU GPU 4
AN EASY TO PARALLELIZE PROGRAM M for i = 1 to N for j = 1 to M N convolution(i, j) next j next i 5
A DIFFICULT TO PARALLELIZE PROGRAM for i = 1 to N for j = 1 to x[i] convolution(i, j) next j next i 6
A DIFFICULT TO PARALLELIZE PROGRAM max(x[i]) N for i = 1 to N Bad alternative #1: Idle Threads for j = 1 to x[i] convolution(i, j) next j next i N Bad alternative #2: Tail Effect 7
DYNAMIC PARALLELISM Serial Program for i = 1 to N for j = 1 to x[i] convolution(i, j) next j next i CUDA Program __global__ void convolution(int x[]) { for j = 1 to x[blockIdx] N kernel<<< ... >>>(blockIdx, j) } void main() { With Dynamic Parallelism setup(x); convolution<<< N, 1 >>>(x); } 8
EXPERIMENT dynpar idleThreads tailEffect 300 Time (ms) lower is better 250 200 150 100 50 0 512 1024 2048 4096 8192 16384 Matrix Size * Device/SDK = K40m/v7.5 9 * K40m-CPU = E5-2690
LAUNCH EXAMPLE Task Tracking Structures Grid Scheduler Grid A A0 Tracking Structure SM SM SM SM A0 B<<<1,1>>>() cudaLaunchDevice( B, 1, 1 ); 10
LAUNCH EXAMPLE Task Tracking Structures Grid Scheduler Grid A A0 Tracking Structure SM SM SM SM A0 B<<<1,1>>>() Allocate Task data structure 11
LAUNCH EXAMPLE Task Tracking Structures B Grid Scheduler Grid A A0 Tracking Structure SM SM SM SM A0 B<<<1,1>>>() Fill out Task data structure 12
LAUNCH EXAMPLE Task Tracking Structures Grid Scheduler Grid A A0 Tracking Structure B SM SM SM SM A0 B<<<1,1>>>() Track Task B in Block A0 13
LAUNCH EXAMPLE Task Tracking Structures B<<<1,1>>>() Grid Scheduler Grid A A0 Tracking Structure B SM SM SM SM A0 Launch Task B to GPU 14
LAUNCH EXAMPLE Task Tracking Structures Grid Scheduler Grid A, Grid B A0 Tracking Structure B SM SM SM SM A0 B0 C<<<1,1>>>() cudaLaunchDevice( C, 1, 1 ); 15
LAUNCH EXAMPLE Task Tracking Structures Grid Scheduler Grid A, Grid B A0 Tracking Structure B C SM SM SM SM A0 B0 C<<<1,1>>>() Allocate, fill out, and track Task C in block A0 16
LAUNCH EXAMPLE Task Tracking Structures Grid Scheduler Grid A, Grid B A0 Tracking Structure B C SM SM SM SM A0 B0 Task C is not yet runnable. Track C to run after B. 17
LAUNCH EXAMPLE Task Tracking Structures Task B completes. Task B completes. Scheduler kernel runs. SKED runs Scheduler. Grid Scheduler Grid A, Scheduler A0 Tracking Structure B C SM SM SM SM A0 18
LAUNCH EXAMPLE Task Tracking Structures Grid Scheduler Grid A, Scheduler A0 Tracking Structure B C SM SM SM SM Sched A0 Scheduler searches for work. 19
LAUNCH EXAMPLE Task Tracking Structures Grid Scheduler Grid A, Scheduler A0 Tracking Structure B C SM SM SM SM Sched A0 Scheduler completes B, and Identifies C as ready-to-run. 20
LAUNCH EXAMPLE Task Tracking Structures C<<<1,1>>>() Grid Scheduler Grid A, Scheduler A0 Tracking Structure C SM SM SM SM Sched A0 Scheduler frees B for re-use, and launches C to the Grid Scheduler. 21
LAUNCH EXAMPLE Task Tracking Structures Grid Scheduler Grid A, Grid C A0 Tracking Structure C SM SM SM SM A0 C0 Task C now executes. 22
BASIC RULES Programming Model Essentially the same as CUDA Time CPU Thread Grid A Launch Launch is per-thread and asynchronous Grid A Complete Grid A Threads Sync is per-block Grid A - Parent CUDA primitives are per-block Grid B Launch Grid B Complete (cannot pass streams/events to children) cudaDeviceSynchronize() != __syncthreads() Grid B Grid B - Child Threads Events allow inter-stream dependencies Streams are shared within a block Implicit NULL stream results in ordering within a block; use named streams CUDA API available on the device: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#api-reference 23 23
MEMORY CONSISTENCY RULES Memory Model Launch implies membar Time (child sees parent state at time of launch) CPU Thread Grid A Launch Grid A Complete Sync implies invalidate Grid A Threads (parent sees child writes after sync) Grid A - Parent Texture changes by child are Grid B Launch Grid B Complete visible to parent after sync (i.e. sync == tex cache invalidate) Grid B Grid B - Child Threads Constants are immutable Local & shared memory are private: cannot be passed as child kernel args Fully consistent 25 25
EXPERIMENTS 26
DIRECTED BENCHMARKS Kernels written to measure specific aspects of dynamic parallelism Launch throughput Launch latency As a function of different configurations SDK Versions Varying Clocks 27
RESULTS – LAUNCH THROUGHPUT 28
LAUNCH THROUGHPUT K40m K40m-CPU 1800000 1600000 1400000 1200000 1000000 800000 600000 400000 Grids/sec 200000 0 32 128 512 1024 2048 4096 8192 16384 32768 65536 Num Child kernels launched * Device/SDK/mem-clk,gpu-clk = K40m/v7.5/875 * K40m-CPU = E5-2690 29 * Host launches are with 32 streams
LAUNCH THROUGHPUT Observations About an order of magnitude higher than from host Dynamic parallelism is very useful when there are a lot of child kernels Two major limiters of launch throughput Pending Launch Count Grid Scheduler Limit 30
PENDING LAUNCH COUNT 1024 4096 16384 32768 1800000 1600000 1400000 1200000 1000000 800000 600000 400000 Grids/sec 200000 0 32 128 512 1024 2048 4096 8192 16384 32768 65536 Num Child kernels launched * Device/SDK/mem-clk,gpu-clk = K40/v7.5/3004,875 * Different curves represent different pending launch count limits 31 31
PENDING LAUNCH COUNT Observations Pre-allocated buffer in Global Memory to store kernels before their launch Default value – 2048 kernels Buffer overflow implies resize performed on-the-go Substantial reduction in launch throughput! Know the number of pending child kernels! 32
PENDING LAUNCH COUNT CUDA API’S Setting Limit cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, yourLimit ); Querying Limit cudaDeviceGetLimit( &yourLimit , cudaLimitDevRuntimePendingLaunchCount); 4/4/2016 33
GRID SCHEDULER LIMIT 512 1024 2048 4096 8192 16384 3000000 2500000 2000000 1500000 1000000 Grids/sec 500000 0 8 16 32 64 128 256 Num device streams * Device/SDK/mem-clk,gpu-clk = K40/v7.5/3004,875 * Different curves represent the total number of child kernels launched 34 34
GRID SCHEDULER LIMIT Observations Ability of grid scheduler to track the number of concurrent kernels The limit is currently 32 If this limit is crossed, upto 50% loss in launch throughput 35
RESULTS – LAUNCH LATENCY 36
LAUNCH LATENCY Initial Subsequent 30000 25000 20000 15000 10000 Time (ns) 5000 0 K40m K40m-CPU * Device/SDK/mem-clk,gpu-clk = K40m/v7.5/3004,875 * K40m-CPU = E5-2690 37 * Host launches are with 32 streams
LAUNCH LATENCY Observations Initial and subsequent latencies are about 2-3x slower than that of host Dynamic Parallelism may not be a good choice currently when: A few child kernels Serial kernel launches We are working towards improving this** ** Characterization and Analysis of Dynamic Parallelism in Unstructured GPU Applications, Jin Wang and Sudhakar Yalamanchili, 2014 IEEE International Symposium on Workload Characterization (IISWC). 38
LAUNCH LATENCY - STREAMS 350000 18000 16000 300000 14000 250000 12000 200000 10000 8000 150000 6000 100000 4000 Time (ns) Time (ns) 50000 2000 0 0 2 4 8 16 2 4 8 16 Host streams Device streams * Device/SDK/mem-clk,gpu-clk = K40m/v7.5/3004,875 39
LAUNCH LATENCY - STREAMS Observations Host streams affect device-side launch latency Prefer device streams for dynamic parallelism 40
RESULTS – DEVICE SYNCHRONIZE 41
DEVICE SYNCHRONIZE cudaDeviceSynchronize is costly Avoid it when possible, example below __global__ void parent() { doSomeInitialization(); childKernel<<<grid,blk>>>(); Unnecessary. Implicit join enforced by the programming cudaDeviceSynchronize(); } model! 42
DEVICE SYNCHRONIZE - COST sync nosync 7 6 5 4 3 2 Time (ms) 1 0 2 4 8 16 32 Amount of work per thread (higher the number, more the work) * Device/SDK = K40/v7.5 43 43
DEVICE SYNCHRONIZE DEPTH Deepest recursion level until where cudaDeviceSynchronize works CUDA limit cudaLimitDevRuntimeSyncDepth controls it Default is level 2 At the cost of extra global memory reserved for storing parent blocks 44
Recommend
More recommend