

SLIDE 1

April 4-7, 2016 | Silicon Valley

SHANKARA RAO THEJASWI NANDITALE, NVIDIA CHRISTOPH ANGERER, NVIDIA

DEEP DIVE INTO DYNAMIC PARALLELISM

SLIDE 2

OVERVIEW AND INTRODUCTION

SLIDE 3

WHAT IS DYNAMIC PARALLELISM?

The ability to launch new kernels from the GPU

  • Dynamically - based on run-time data
  • Simultaneously - from multiple threads at once
  • Independently - each thread can launch a different grid
  • Introduced with CUDA 5.0 and compute capability 3.5 and up

[Figure] Fermi: only the CPU can generate GPU work. Kepler: the GPU can generate work for itself.

SLIDE 4


DYNAMIC PARALLELISM

SLIDE 5

AN EASY-TO-PARALLELIZE PROGRAM

    for i = 1 to N
        for j = 1 to M
            convolution(i, j)
        next j
    next i
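The regular N x M nest maps directly to one CUDA thread per (i, j) element. A minimal sketch of that mapping (the kernel name, the empty convolution body, and the 16x16 block shape are illustrative choices, not from the slides):

```cuda
// One thread per (i, j) element; convolution() stands in for the real stencil.
__device__ void convolution(int i, int j) { /* ... per-element work ... */ }

__global__ void convolveAll(int N, int M)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N && j < M)
        convolution(i, j);
}

// Host launch: a 2D grid covering the full N x M domain.
// dim3 blk(16, 16);
// dim3 grd((M + 15) / 16, (N + 15) / 16);
// convolveAll<<<grd, blk>>>(N, M);
```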

SLIDE 6

A DIFFICULT-TO-PARALLELIZE PROGRAM

    for i = 1 to N
        for j = 1 to x[i]
            convolution(i, j)
        next j
    next i

SLIDE 7

A DIFFICULT-TO-PARALLELIZE PROGRAM

    for i = 1 to N
        for j = 1 to x[i]
            convolution(i, j)
        next j
    next i

[Figure] Two bad static mappings of this loop nest onto an N x max(x[i]) thread grid:
Bad alternative #1: Idle Threads
Bad alternative #2: Tail Effect

SLIDE 8

DYNAMIC PARALLELISM

Serial Program:

    for i = 1 to N
        for j = 1 to x[i]
            convolution(i, j)
        next j
    next i

CUDA Program with Dynamic Parallelism:

    __global__ void convolution(int x[])
    {
        for j = 1 to x[blockIdx]
            kernel<<< ... >>>(blockIdx, j)
    }

    void main() {
        setup(x);
        convolution<<< N, 1 >>>(x);   // one block per outer iteration
    }
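The slide's fragment, expanded into a compilable sketch (childKernel and the x setup are placeholder names; device-side launches require compute capability 3.5+ and nvcc with -rdc=true -lcudadevrt):

```cuda
// Placeholder per-element kernel, launched from the device.
__global__ void childKernel(int i, int j) { /* ... work for element (i, j) ... */ }

// One parent block per row i; each launches x[i] children based on run-time data.
__global__ void convolution(const int *x)
{
    for (int j = 0; j < x[blockIdx.x]; ++j)
        childKernel<<<1, 1>>>(blockIdx.x, j);
}

int main()
{
    const int N = 4;
    int hx[N] = {1, 2, 3, 4};          // per-row work counts (placeholder data)
    int *dx;
    cudaMalloc(&dx, N * sizeof(int));
    cudaMemcpy(dx, hx, N * sizeof(int), cudaMemcpyHostToDevice);
    convolution<<<N, 1>>>(dx);         // one block per row
    cudaDeviceSynchronize();           // waits for parents and all their children
    cudaFree(dx);
    return 0;
}
```

Compile with something like: nvcc -arch=sm_35 -rdc=true example.cu -lcudadevrt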

SLIDE 9

EXPERIMENT

* Device/SDK = K40m/v7.5
* K40m-CPU = E5-2690

[Chart] Time (ms, lower is better) vs. matrix size (512-16384) for the dynpar, idleThreads, and tailEffect implementations.

SLIDE 10

LAUNCH EXAMPLE

[Diagram] Grid A (block A0) is running on an SM. A0 calls B<<<1,1>>>(), which compiles to cudaLaunchDevice( B, 1, 1 );

SLIDE 11

LAUNCH EXAMPLE

[Diagram] Step 1: Allocate a Task data structure for B.

SLIDE 12

LAUNCH EXAMPLE

[Diagram] Step 2: Fill out the Task data structure for B.

SLIDE 13

LAUNCH EXAMPLE

[Diagram] Step 3: Track Task B in block A0's tracking structure.

SLIDE 14

LAUNCH EXAMPLE

[Diagram] Step 4: Launch Task B to the GPU; Grid B is handed to the Grid Scheduler.

SLIDE 15

LAUNCH EXAMPLE

[Diagram] With Grid A and Grid B both resident (blocks A0 and B0), A0 calls C<<<1,1>>>(), i.e. cudaLaunchDevice( C, 1, 1 );

SLIDE 16

LAUNCH EXAMPLE

[Diagram] Allocate, fill out, and track Task C in block A0.

SLIDE 17

LAUNCH EXAMPLE

[Diagram] Task C is not yet runnable; C is tracked to run after B.

SLIDE 18

LAUNCH EXAMPLE

[Diagram] Task B completes. SKED runs the Scheduler kernel.

SLIDE 19

LAUNCH EXAMPLE

[Diagram] The Scheduler searches the tracking structures for work.

SLIDE 20

LAUNCH EXAMPLE

[Diagram] The Scheduler completes B and identifies C as ready-to-run.

SLIDE 21

LAUNCH EXAMPLE

[Diagram] The Scheduler frees B's task structure for re-use and launches C<<<1,1>>>() to the Grid Scheduler.

SLIDE 22

LAUNCH EXAMPLE

[Diagram] Grid C (block C0) now executes alongside Grid A.
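The A-launches-B-then-C sequence traced above, as a minimal sketch (kernel bodies are placeholders): both launches go into the block's implicit NULL stream, which is why the runtime tracks C to run only after B completes.

```cuda
__global__ void B() { /* ... */ }
__global__ void C() { /* ... */ }

__global__ void A()
{
    // Both child grids are launched into block A0's implicit NULL stream,
    // so the device runtime orders them: C is tracked to run after B.
    B<<<1, 1>>>();
    C<<<1, 1>>>();
}
```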

SLIDE 23

BASIC RULES

Programming Model

  • Essentially the same as CUDA
  • Launch is per-thread and asynchronous
  • Sync is per-block
  • CUDA primitives are per-block (cannot pass streams/events to children)
  • cudaDeviceSynchronize() != __syncthreads()
  • Events allow inter-stream dependencies
  • Streams are shared within a block
  • The implicit NULL stream results in ordering within a block; use named streams for concurrency

[Diagram] Timeline: a CPU thread launches Grid A (parent); a Grid A thread launches Grid B (child); Grid B completes before Grid A completes.

CUDA API available on the device: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#api-reference
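A sketch of the per-thread launch and per-block sync rules above (childKernel and the counts array are hypothetical names):

```cuda
// Placeholder child kernel.
__global__ void childKernel(int id) { /* ... */ }

__global__ void parent(const int *counts)
{
    // Launch is per-thread: every thread of this block may launch its own grid.
    childKernel<<<counts[threadIdx.x], 64>>>(threadIdx.x);

    // Sync is per-block: this waits for the children launched by ALL threads
    // of the calling block, not just this thread's child.
    cudaDeviceSynchronize();
}
```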

SLIDE 24

MEMORY CONSISTENCY RULES

Memory Model

  • Launch implies membar (child sees parent state at time of launch)
  • Sync implies invalidate (parent sees child writes after sync)
  • Texture changes by the child are visible to the parent after sync (i.e. sync == tex cache invalidate)
  • Constants are immutable
  • Local & shared memory are private: they cannot be passed as child kernel args

[Diagram] The same timeline: once Grid B completes and the parent syncs, the parent and child views of memory are fully consistent.
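A minimal sketch of the launch-implies-membar and sync-implies-invalidate rules (the data array and kernel names are illustrative):

```cuda
__device__ int data[64];

__global__ void child()
{
    // Launch implies membar: the child sees the parent's writes made
    // before the launch.
    data[threadIdx.x] *= 2;
}

__global__ void parent()
{
    if (threadIdx.x == 0) {
        for (int i = 0; i < 64; ++i)
            data[i] = i;              // state visible to the child at launch
        child<<<1, 64>>>();
        cudaDeviceSynchronize();      // sync implies invalidate:
                                      // the parent now sees the child's writes
        // data[i] == 2*i is guaranteed to be visible from here on
    }
}
```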

SLIDE 25

EXPERIMENTS

SLIDE 26

DIRECTED BENCHMARKS

Kernels written to measure specific aspects of dynamic parallelism:
  • Launch throughput
  • Launch latency
As a function of different configurations:
  • SDK versions
  • Varying clocks

SLIDE 27

RESULTS – LAUNCH THROUGHPUT

SLIDE 28

LAUNCH THROUGHPUT

* Device/SDK/mem-clk,gpu-clk = K40m/v7.5/875
* K40m-CPU = E5-2690
* Host launches are with 32 streams

[Chart] Grids/sec vs. number of child kernels launched (32-65536): K40m device-side launches vs. K40m-CPU host-side launches.

SLIDE 29

LAUNCH THROUGHPUT

Observations

  • About an order of magnitude higher than launching from the host
  • Dynamic parallelism is very useful when there are a lot of child kernels
  • Two major limiters of launch throughput: the Pending Launch Count and the Grid Scheduler Limit

SLIDE 30

PENDING LAUNCH COUNT

* Device/SDK/mem-clk,gpu-clk = K40/v7.5/3004,875
* Different curves represent different pending launch count limits

[Chart] Grids/sec vs. number of child kernels launched (32-65536), for pending launch count limits of 1024, 4096, 16384, and 32768.

SLIDE 31

PENDING LAUNCH COUNT

Observations

  • A pre-allocated buffer in global memory stores launched kernels before they run
  • Default value: 2048 pending kernels
  • Overflowing the buffer forces a resize on the fly, causing a substantial reduction in launch throughput!
  • Know the number of pending child kernels!

SLIDE 32

PENDING LAUNCH COUNT

CUDA APIs

Setting the limit:

    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, yourLimit);

Querying the limit:

    cudaDeviceGetLimit(&yourLimit, cudaLimitDevRuntimePendingLaunchCount);
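Typical host-side usage, assuming the number of pending children is known up front (the 16384 figure is an arbitrary example, not from the slides):

```cuda
#include <cstdio>

int main()
{
    // Default is 2048 pending launches; overflowing it triggers a costly
    // on-the-fly resize of the global-memory launch buffer.
    size_t limit = 0;
    cudaDeviceGetLimit(&limit, cudaLimitDevRuntimePendingLaunchCount);
    printf("pending launch count limit: %zu\n", limit);

    // If a kernel may have up to 16384 children in flight, raise the
    // limit before launching it.
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 16384);
    return 0;
}
```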

SLIDE 33

GRID SCHEDULER LIMIT

* Device/SDK/mem-clk,gpu-clk = K40/v7.5/3004,875
* Different curves represent the total number of child kernels launched

[Chart] Grids/sec vs. number of device streams (8-256); curves for 512 to 16384 total child kernels.

SLIDE 34

GRID SCHEDULER LIMIT

Ability of grid scheduler to track the number of concurrent kernels The limit is currently 32 If this limit is crossed, upto 50% loss in launch throughput

Observations

SLIDE 35

RESULTS – LAUNCH LATENCY

SLIDE 36

LAUNCH LATENCY

* Device/SDK/mem-clk,gpu-clk = K40m/v7.5/3004,875
* K40m-CPU = E5-2690
* Host launches are with 32 streams

[Chart] Launch latency, initial and subsequent (ns, up to ~30000): K40m device-side vs. K40m-CPU host-side.

SLIDE 37

LAUNCH LATENCY

Observations

  • Initial and subsequent latencies are about 2-3x slower than launching from the host
  • Dynamic parallelism may currently not be a good choice when there are only a few child kernels, or when kernel launches are serial
  • We are working towards improving this**

** Characterization and Analysis of Dynamic Parallelism in Unstructured GPU Applications, Jin Wang and Sudhakar Yalamanchili, 2014 IEEE International Symposium on Workload Characterization (IISWC).

SLIDE 38

LAUNCH LATENCY - STREAMS

* Device/SDK/mem-clk,gpu-clk = K40m/v7.5/3004,875

[Charts] Launch latency (ns) vs. number of streams (2-16): with host streams (up to ~350000 ns) and with device streams (up to ~18000 ns).

SLIDE 39

LAUNCH LATENCY - STREAMS

Observations

  • Host streams affect device-side launch latency
  • Prefer device streams for dynamic parallelism
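A sketch of a fire-and-forget child launch into a named device stream (childKernel is a placeholder; device-side streams must be created with cudaStreamNonBlocking):

```cuda
// Placeholder child kernel.
__global__ void childKernel() { /* ... */ }

__global__ void parent()
{
    // Device-side streams require the cudaStreamNonBlocking flag.
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

    // Launching into a named stream avoids the implicit NULL-stream
    // serialization within the block.
    childKernel<<<1, 32, 0, s>>>();

    // Releases the handle; work already queued into the stream still runs.
    cudaStreamDestroy(s);
}
```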

SLIDE 40

RESULTS – DEVICE SYNCHRONIZE

SLIDE 41

DEVICE SYNCHRONIZE

cudaDeviceSynchronize is costly. Avoid it when possible, as in the example below:

    __global__ void parent()
    {
        doSomeInitialization();
        childKernel<<<grid,blk>>>();
        cudaDeviceSynchronize();   // Unnecessary: an implicit join is
                                   // enforced by the programming model!
    }

SLIDE 42

DEVICE SYNCHRONIZE - COST

* Device/SDK = K40/v7.5

[Chart] Time (ms) vs. amount of work per thread (2-32; the higher the number, the more the work), sync vs. nosync.

SLIDE 43

DEVICE SYNCHRONIZE DEPTH

  • The deepest recursion level down to which cudaDeviceSynchronize works
  • Controlled by the CUDA limit cudaLimitDevRuntimeSyncDepth
  • Default is level 2
  • Raising it costs extra global memory, reserved for storing parent blocks
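Host-side sketch of raising the sync depth (the value 4 is an arbitrary example):

```cuda
int main()
{
    // Allow cudaDeviceSynchronize() from kernels nested up to 4 levels deep.
    // Must be set from the host before the top-level launch; each extra level
    // reserves additional global memory for saved parent-block state.
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 4);
    return 0;
}
```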

SLIDE 44

DEVICE SYNCHRONIZE DEPTH

Memory Usage

[Chart] Memory reserved (MB, roughly 100-800) vs. device synchronize depth (2-5).

SLIDE 45

DEVICE SYNCHRONIZE DEPTH

  • cudaDeviceSynchronize fails silently beyond the set SyncDepth
  • Use cudaGetLastError on the device to inspect the error

Error Handling

[Diagram] Kernels nested at depths 1 through 5 with SyncDepth=2: synchronization from kernels deeper than depth 2 fails.

SLIDE 46

DYNAMIC PARALLELISM - LIMITS

SLIDE 47

DYNAMIC PARALLELISM

Limits

  • Recursion depth is currently limited to 24
  • Maximum size of the formal parameters of a child kernel is 4096 B; violation causes a compile-time error
  • Runtime exceptions in a child kernel are only visible from the host side

SLIDE 48

ERROR HANDLING

Runtime exceptions in child kernels are visible only from the host side.

  • Use nvcc's -lineinfo together with cuda-memcheck to locate the error

    __global__ void child(float* arr) { arr[0] = 1.0f; }

    __global__ void parent()
    {
        child<<<1,1>>>(NULL);
        cudaDeviceSynchronize();
        printf("%d\n", cudaGetLastError());   // Control never reaches here!
    }

    // Host side:
    parent<<<1,1>>>();
    cudaError_t err = cudaDeviceSynchronize();   // Error caught here

SLIDE 49

SUCCESS STORIES

SLIDE 50

FMM

Fast Multipole Method

  • Solving the N-body problem
  • Computational complexity O(n)
  • Tree-based approach

Image source: http://www.bu.edu/pasi/courses/12-steps-to-having-a-fast-multipole-method-on-gpus/

SLIDE 51

FMM (2)

  • Dynamic 1: launch child grids for neighbors and children
  • Dynamic 2: launch child grids for children only
  • Dynamic 3: launch child grids for children only; start only p2 kernel threads; use shared GPU memory

Performance (lower is better)

From: FMM goes GPU - A smooth trip or bumpy ride?, B. Kohnke, I. Kabadshow, MPI BPC Göttingen & Jülich Supercomputing Centre, GTC 2015

SLIDE 52

PANDA

anti-Proton ANnihilation at DArmstadt

  • State-of-the-art hadron particle physics experiment
SLIDE 53

PANDA (2)

Performance and Reasons for Improvements

  • Avoiding extra PCI-e data transfers for launch configuration data dependencies
  • Higher launch throughput
  • Reducing false dependencies between kernel launches: waiting on a stream prevents enqueuing of work into other streams

Source: A CUDA Dynamic Parallelism Case Study: PANDA, Andrew Adinetz

http://devblogs.nvidia.com/parallelforall/a-cuda-dynamic-parallelism-case-study-panda/

SLIDE 54

SUMMARY

SLIDE 55

WHEN TO USE CUDA DYNAMIC PARALLELISM

Three Good Reasons

  • Algorithmic: "Dynamically Formed Pockets of Structured Parallelism"*
      • Unbalanced load (e.g., vertex expansion in graphs, compressed sparse row)
      • Tree traversal (fat and shallow computation trees)
      • Adaptive Mesh Refinement
  • Performance:
      • Improve launch throughput
      • Reduce PCIe traffic and false dependencies
  • Maintenance:
      • Simplified, more natural program flow

*) from: Characterization and Analysis of Dynamic Parallelism in Unstructured GPU Applications, J. Wang and S. Yalamanchili, IISWC 2014
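The "unbalanced load" case above can be sketched as CSR vertex expansion: each parent thread launches a child grid sized to its own row's degree instead of padding every row to the longest one (all names here are hypothetical, for illustration only):

```cuda
// Hypothetical CSR neighbor expansion: rowPtr/colIdx form a standard CSR graph.
__global__ void expandNeighbors(const int *colIdx, int row, int start, int degree)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < degree) {
        int neighbor = colIdx[start + k];
        // ... visit edge (row, neighbor) ...
        (void)neighbor;
    }
}

__global__ void expandFrontier(const int *rowPtr, const int *colIdx, int numRows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < numRows) {
        int start  = rowPtr[row];
        int degree = rowPtr[row + 1] - start;
        if (degree > 0)   // child grid sized to this row's actual degree
            expandNeighbors<<<(degree + 127) / 128, 128>>>(colIdx, row, start, degree);
    }
}
```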

SLIDE 56

REFERENCES

  • CUDA C Programming Guide, http://docs.nvidia.com/cuda/cuda-c-programming-guide/#cuda-dynamic-parallelism
  • Adaptive Parallel Computation with CUDA Dynamic Parallelism, https://devblogs.nvidia.com/parallelforall/introduction-cuda-dynamic-parallelism/
  • FMM goes GPU, B. Kohnke and I. Kabadshow, GTC 2015, https://shar.es/1Y38Vf
SLIDE 57

April 4-7, 2016 | Silicon Valley

THANK YOU

JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join