ACHIEVING DETERMINISTIC EXECUTION TIMES IN CUDA APPLICATIONS
Aayush Rajoria, Ashok Kelur
20th March 2019
CONTENTS
- CUDA Everywhere
- Deterministic Execution Times
- Automotive Trade-offs
- Application Execution Flow
- Factors Affecting Runtime Determinism
CUDA EVERYWHERE
- HPC
- Data center
- Automotive / Embedded
DETERMINISTIC EXECUTION TIMES
- Automotive use-cases and deterministic execution.
- How to write CUDA applications that are deterministic in nature.
AUTOMOTIVE TRADE-OFFS
- Determinism over ease of programming
  - Example: cudaMalloc over cudaMallocManaged (CUDA unified memory)
- Determinism over GPU utilization
  - Example: a single context over MPS
AUTOMOTIVE TRADE-OFFS
- Different trade-offs need to be considered for every CUDA functionality.
- Some CUDA functionality might be more deterministic than others.
- Trade-offs could be different for different phases of an application's lifecycle.
- One simple application lifecycle is shown below.

[Diagram: application lifecycle, INIT phase -> INNER LOOP (the runtime phase where determinism matters) -> DEINIT phase]
APPLICATION EXECUTION FLOW

// Init phase
Initialize camera;
Do all memory allocation;
Set up all the dependencies;

// Runtime phase
while (...) {
    Inference_Kernel<<< ..., stream1 >>>();
    Decision_Kernel<<< ..., stream1 >>>();
}

// Deinit phase
Free memory;
Free all the system resources;
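A minimal CUDA C++ sketch of this flow, assuming hypothetical kernels and a single fixed-size buffer (camera setup omitted, error checking trimmed for brevity):

#include <cuda_runtime.h>

__global__ void Inference_Kernel(float* buf) { /* placeholder */ }
__global__ void Decision_Kernel(float* buf)  { /* placeholder */ }

int main() {
    const int N = 1 << 20;                      // placeholder buffer size

    // Init phase: create streams and do ALL allocation up front.
    cudaStream_t stream1;
    cudaStreamCreate(&stream1);
    float *dBuf = nullptr, *hBuf = nullptr;
    cudaMalloc(&dBuf, N * sizeof(float));       // device memory
    cudaMallocHost(&hBuf, N * sizeof(float));   // pinned host memory

    // Runtime phase: work submission only, no allocation or teardown.
    for (int frame = 0; frame < 1000; ++frame) {
        cudaMemcpyAsync(dBuf, hBuf, N * sizeof(float),
                        cudaMemcpyHostToDevice, stream1);
        Inference_Kernel<<<N / 256, 256, 0, stream1>>>(dBuf);
        Decision_Kernel<<<N / 256, 256, 0, stream1>>>(dBuf);
        cudaStreamSynchronize(stream1);
    }

    // Deinit phase: free memory and all system resources.
    cudaFree(dBuf);
    cudaFreeHost(hBuf);
    cudaStreamDestroy(stream1);
    return 0;
}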
FACTORS AFFECTING DETERMINISM OF THE RUNTIME PHASE
- GPU work submission
- GPU work scheduling
- Other factors
GPU WORK SUBMISSION
CUDA DRIVER WORK SUBMISSION IMPROVEMENTS
- GPU work submission APIs are the most frequently used APIs in the runtime phase.
- The CUDA driver has made various improvements over the past few years to make GPU work submission times more deterministic.
CUDA Version    Avg submit time (us)    Standard deviation (us)
CUDA 8.x        22.8                    16.3
CUDA 9.x        7.17                    5.1
CUDA 10.x       3.52                    1.55

Source: NVIDIA internal micro-benchmark run on Drive platforms on QNX.
GPU WORK SUBMISSION
SUGGESTIONS FOR APPLICATIONS
- Solving the problem at hand with fewer GPU work submissions is always more deterministic than using more submissions.
- The number of GPU work submissions can be reduced by:
  - Kernel fusion
  - CUDA graphs
GPU WORK SUBMISSION
Kernel Fusion

Without kernel fusion (three submissions):

colorConversion_YUV_RGB<<< ... >>>();
imageHistogram<<< ... >>>();
edgeDetection<<< ... >>>();

With kernel fusion (one submission):

__device__ void colorConversion_YUV_RGB();
__device__ void imageHistogram();
__device__ void edgeDetection();

__global__ void fusedKernel() {
    colorConversion_YUV_RGB();
    imageHistogram();
    edgeDetection();
}

fusedKernel<<< ... >>>();
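A minimal runnable sketch of the same idea, with trivial per-element stages standing in for the real color-conversion/histogram/edge-detection work (names and math are placeholders):

__device__ float stageA(float x) { return x * 2.0f; }   // e.g. color conversion
__device__ float stageB(float x) { return x + 1.0f; }   // e.g. filtering

// One launch does the work of two: the intermediate value stays in a
// register instead of a round trip through global memory.
__global__ void fusedPipeline(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = stageB(stageA(in[i]));
}

Besides reducing the submission count, fusion keeps intermediate results on-chip, at the possible cost of higher register pressure per kernel.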
GPU WORK SUBMISSION
CUDA graphs
- CUDA graphs help batch multiple kernels, memcpys, and memsets into an optimal number of GPU work submissions.
- CUDA graphs allow an application to describe GPU work and its dependencies ahead of time. This lets the CUDA driver do all resource allocation ahead of time.
GPU WORK SUBMISSION
Three-stage execution model

[Diagram: execution flow for deterministic applications. INIT PHASE: define the graph (kernels A, B, X, C, D, E, Y with dependencies across streams s1, s2, s3) and instantiate it. RUNTIME PHASE: execute the instantiated graph. DEINIT PHASE: destroy it with cudaGraphDestroy(). A sketch follows.]
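A minimal sketch of the three-stage model using stream capture (CUDA 10.1-style capture API; kernelA/kernelB, stream1, dBuf, and launch dimensions are placeholders carried over from the earlier sketch):

cudaGraph_t graph;
cudaGraphExec_t graphExec;

// INIT PHASE: define + instantiate.
cudaStreamBeginCapture(stream1, cudaStreamCaptureModeGlobal);
kernelA<<<256, 256, 0, stream1>>>(dBuf);
kernelB<<<256, 256, 0, stream1>>>(dBuf);
cudaStreamEndCapture(stream1, &graph);
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

// RUNTIME PHASE: one submission replays the whole graph.
for (int frame = 0; frame < 1000; ++frame) {
    cudaGraphLaunch(graphExec, stream1);
    cudaStreamSynchronize(stream1);
}

// DEINIT PHASE: destroy.
cudaGraphExecDestroy(graphExec);
cudaGraphDestroy(graph);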
EXECUTION OPTIMIZATIONS
Latency & overhead reductions
- Launch latencies: a pre-defined graph allows launching any number of kernels in one single operation.

[Diagram: CPU and GPU timelines. Without graphs, the CPU issues Launch A through Launch E one at a time while kernels A-E run; with graphs, the CPU builds the graph once and launches it once, then sits idle while A-E execute back to back on the GPU.]

Source: NVIDIA internal benchmarks run on Drive platforms on QNX.
HOST ENQUEUE TIME COMPARISON
Batching GPU work using CUDA graphs.

Neural Network    Enqueue time without graphs (ms)    Enqueue time with graphs (ms)
ResNet50 INT8     0.61                                0.13
ResNet152 INT8    1.49                                0.24
MobileNet INT8    0.31                                0.07

Source: NVIDIA benchmarks run on Drive platforms on QNX with CUDA 10.1.
GPU WORK SCHEDULING
GPU Context switches
- Tasks in two GPU contexts can preempt each other, which can affect the determinism of the application.
- It is advised not to create multiple CUDA contexts on the same device in the same process.
- If the application has multiple contexts in the same process, the dependency between them can be established with:
  - cudaStreamWaitEvent() (sketched below)
- If the application has multiple contexts in different processes, the dependency between them can be established with:
  - EGLStreams
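A minimal sketch of expressing such a dependency with an event (producerKernel/consumerKernel, the two streams, and dBuf are hypothetical names; across processes the equivalent role is played by EGLStreams):

cudaEvent_t producerDone;
cudaEventCreateWithFlags(&producerDone, cudaEventDisableTiming);

producerKernel<<<256, 256, 0, streamCtx1>>>(dBuf);   // first context's stream
cudaEventRecord(producerDone, streamCtx1);

// The consumer waits on the event, so the GPU does not have to preempt
// the producer in order to time-slice between the two contexts.
cudaStreamWaitEvent(streamCtx2, producerDone, 0);
consumerKernel<<<256, 256, 0, streamCtx2>>>(dBuf);   // second context's stream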
GPU WORK SCHEDULING
GPU Context switches

[Diagram: CPU and GPU timelines for two contexts. Without an explicit dependency, TASK1 (CTX1) and TASK2 (CTX2) time-slice on the GPU and the context save-restore time pushes TASK1 past its expected deadline. With an inserted dependency, TASK1 runs to completion before TASK2 starts, saving the context-switch time and meeting the deadline.]
WORK SCHEDULING
CPU thread scheduling
If the CPU thread submitting the GPU work gets switched out, the launch overhead increases. Potential solutions (a sketch follows):
- Pin the CPU thread to a core and raise the priority of the thread submitting CUDA work.
- Use a custom scheduler that guarantees the CPU thread stays active on a CPU core while submitting CUDA kernels.
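A minimal sketch of the first option for Linux with glibc (the talk targets QNX, where the APIs differ; core number and priority are placeholders, and SCHED_FIFO requires sufficient privileges):

#define _GNU_SOURCE   // for pthread_setaffinity_np / CPU_SET on glibc
#include <pthread.h>
#include <sched.h>

void pinAndBoostSubmitThread(pthread_t t, int core) {
    cpu_set_t cpus;
    CPU_ZERO(&cpus);
    CPU_SET(core, &cpus);
    pthread_setaffinity_np(t, sizeof(cpus), &cpus);  // pin to one core

    struct sched_param prio;
    prio.sched_priority = 50;                        // placeholder priority
    pthread_setschedparam(t, SCHED_FIFO, &prio);     // real-time FIFO class
}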
WORK SCHEDULING
CPU thread scheduling

[Diagram: CPU and GPU timelines. When the submitting thread (Thread 1) is switched out for other CPU work between launches, the GPU goes idle between kernels and the work finishes after the expected deadline; when Thread 1 stays scheduled through Launch A-Launch E, kernels A-E run back to back and the actual finish matches the expected finish.]
WORK SUBMISSION ON NULL/DEFAULT STREAM
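Work submitted on the legacy NULL/default stream implicitly synchronizes with the other blocking streams in the same context, so its timing couples to unrelated work; a minimal sketch of avoiding it with explicit streams (kernels, buffers, and launch dimensions are placeholders):

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// Independent work on explicit streams can overlap and does not pick up
// implicit dependencies from the NULL/default stream.
kernelA<<<256, 256, 0, s1>>>(dBufA);
kernelB<<<256, 256, 0, s2>>>(dBufB);

// By contrast, a launch on the default stream would serialize with both
// s1 and s2 under legacy default-stream semantics:
// kernelC<<<256, 256>>>(dBufC);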
OTHER FACTORS
CUDA STREAM CALLBACKS
- cudaStreamAddCallback runs a CPU function on a helper thread in stream order.
- Do not use cudaStreamAddCallback / cuStreamAddCallback:
  - It involves GPU interrupt latency.
  - The application has no control over the thread that executes the callback.
- Potential solution (sketched below):
  - Use explicit CPU synchronization to schedule the dependent CPU work.
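A minimal sketch of the suggested replacement: record an event after the GPU work and have an application-owned thread wait on it before running the dependent CPU function (someKernel, stream1, dBuf, and hostWork are placeholders):

cudaEvent_t kernelDone;
cudaEventCreateWithFlags(&kernelDone, cudaEventDisableTiming);

someKernel<<<256, 256, 0, stream1>>>(dBuf);
cudaEventRecord(kernelDone, stream1);

// On a CPU thread whose priority and affinity the application controls:
cudaEventSynchronize(kernelDone);  // returns once someKernel has finished
hostWork();                        // dependent CPU work, still in stream order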
PINNED MEMORY
- Pinned memory is page-locked host memory.
- All CPU memory used by deterministic applications should be pinned (cudaMallocHost, cudaHostAlloc). This is a trade-off between pinned memory usage and determinism.
- Without pinned memory, asynchronous DMA transfers cannot be done, because pageable memory must first be copied into a staging buffer (see the sketch below).
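A minimal sketch of the difference (dBuf, N, and stream1 are placeholders from the earlier sketch):

// Pageable source: the copy below goes through a staging buffer and is
// not truly asynchronous.
float* hPageable = (float*)malloc(N * sizeof(float));
cudaMemcpyAsync(dBuf, hPageable, N * sizeof(float),
                cudaMemcpyHostToDevice, stream1);

// Pinned source: a true asynchronous DMA that overlaps with CPU work.
float* hPinned = nullptr;
cudaMallocHost(&hPinned, N * sizeof(float));
cudaMemcpyAsync(dBuf, hPinned, N * sizeof(float),
                cudaMemcpyHostToDevice, stream1);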
LOCAL MEMORY RESIZES
- Use CU_CTX_LMEM_RESIZE_TO_MAX to avoid local memory resizes during kernel launches, which can result in dynamic allocation. This is a trade-off between resource utilization and determinism.
- In the init phase, run all kernels in the program at least once. This ensures that enough local memory has been allocated for the kernel with the highest local memory requirement.
- All calls to cuCtxSetLimit() for CU_LIMIT_STACK_SIZE should be made in the init phase, because changing the stack size also triggers local memory reallocation (see the sketch below).
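A minimal driver-API sketch, assuming the application creates its own context (the device ordinal and stack size are placeholders):

#include <cuda.h>

CUdevice dev;
CUcontext ctx;
cuInit(0);
cuDeviceGet(&dev, 0);

// Init phase: never shrink local memory once it has grown.
cuCtxCreate(&ctx, CU_CTX_LMEM_RESIZE_TO_MAX, dev);

// Init phase: fix the stack size before any runtime-phase launches.
cuCtxSetLimit(CU_LIMIT_STACK_SIZE, 4 * 1024);

// Init phase: warm-up launch of every kernel in the program, so local
// memory is sized for the most demanding one before the runtime phase.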
UNIFIED MEMORY
- Avoid using CUDA unified memory (created with cudaMallocManaged or cuMemAllocManaged). On the current generation of hardware, managed memory results in dynamic behavior and resource allocations/deallocations.
- This is a trade-off between ease of programming and determinism.
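A minimal contrast sketch (N, hPinned, and stream1 are placeholders from the earlier sketches):

// Avoid in deterministic applications: pages can migrate on demand
// during the runtime phase.
float* managed = nullptr;
cudaMallocManaged(&managed, N * sizeof(float));

// Prefer: residency is fixed and transfers are explicit and predictable.
float* device = nullptr;
cudaMalloc(&device, N * sizeof(float));
cudaMemcpyAsync(device, hPinned, N * sizeof(float),
                cudaMemcpyHostToDevice, stream1);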
DEVICE SIDE ALLOCATIONS
- Do not use new, delete, malloc, or free in CUDA kernels. Deterministic applications should allocate memory in the init phase and free it in the deinit phase (see the sketch below).
- This trades resource utilization for determinism, and also ease of programming for determinism.
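A minimal sketch: pass a preallocated scratch buffer into the kernel instead of calling malloc/new inside it (sizes and layout are placeholders; stream1 as before):

const int perThread  = 32;          // placeholder scratch per thread
const int numThreads = 256 * 64;    // placeholder total threads

__global__ void worker(float* scratch, int perThread) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float* mine = scratch + i * perThread;  // each thread uses its own slice
    mine[0] = 0.0f;                         // ... instead of device malloc/free
}

// Init phase: one allocation sized for the worst case.
float* scratch = nullptr;
cudaMalloc(&scratch, numThreads * perThread * sizeof(float));

// Runtime phase:
worker<<<numThreads / 256, 256, 0, stream1>>>(scratch, perThread);

// Deinit phase:
cudaFree(scratch);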
CONTACT US
- Aayush Rajoria – arajoria@nvidia.com
- Ashok Kelur – akelur@nvidia.com