

SLIDE 1

ACHIEVING DETERMINISTIC EXECUTION TIMES IN CUDA APPLICATIONS

Aayush Rajoria, Ashok Kelur
20th March 2019

SLIDE 2

CONTENTS

  • CUDA everywhere
  • Deterministic execution times
  • Automotive trade-offs
  • Application execution flow
  • Factors affecting runtime determinism
SLIDE 3

CUDA EVERYWHERE

CUDA spans HPC, data center, and automotive/embedded platforms.

SLIDE 4

DETERMINISTIC EXECUTION TIMES

  • Automotive use cases and deterministic execution.
  • How to write CUDA applications that are deterministic in nature.
SLIDE 5

AUTOMOTIVE TRADE-OFFS

  • Determinism over ease of programming
    • Example: cudaMalloc over cudaMallocManaged (CUDA unified memory)
  • Determinism over GPU utilization
    • Example: single context over MPS (Multi-Process Service)
SLIDE 6

AUTOMOTIVE TRADE-OFFS

  • Different trade-offs need to be considered for every CUDA functionality.
  • Some CUDA functionality might be more deterministic than others.
  • Trade-offs can differ across the phases of an application's lifecycle.
  • One simple application lifecycle is shown below:

[Lifecycle diagram: INIT phase → inner loop (where determinism is required) → DEINIT phase]
SLIDE 7

APPLICATION EXECUTION FLOW

// Init phase
Initialize camera;
Do all memory allocations;
Set up all dependencies;

// Runtime phase
while (...) {
    Inference_Kernel<<< ..., stream1 >>>();
    Decision_Kernel<<< ..., stream1 >>>();
}

// Deinit phase
Free memory;
Free all system resources;

SLIDE 8

FACTORS AFFECTING DETERMINISM OF THE RUNTIME PHASE

SLIDE 9

FACTORS AFFECTING DETERMINISM OF THE RUNTIME PHASE

  • GPU work submission
  • GPU work scheduling
  • Other factors
SLIDE 10

GPU WORK SUBMISSION

SLIDE 11

GPU WORK SUBMISSION

CUDA DRIVER WORK SUBMISSION IMPROVEMENTS

  • GPU work submission APIs are the most frequently used APIs in the runtime phase.
  • The CUDA driver has made various improvements over the past few years to make GPU work submission times deterministic:

CUDA Version | Avg submit time (us) | Standard deviation (us)
CUDA 8.x     | 22.8                 | 16.3
CUDA 9.x     | 7.17                 | 5.1
CUDA 10.x    | 3.52                 | 1.55

Source: Nvidia internal micro-benchmark run on Drive platforms on QNX

SLIDE 12

GPU WORK SUBMISSION

SUGGESTIONS FOR APPLICATIONS

  • Using fewer GPU work submissions to solve the problem at hand is always more deterministic than using more GPU work submissions.
  • The number of GPU work submissions can be reduced by:
    • Kernel fusion
    • CUDA graphs

SLIDE 13

GPU WORK SUBMISSION

Kernel Fusion

colorConversion_YUV_RGB<<< >>>();
imageHistogram<<< >>>();
edgeDetection<<< >>>();

With Kernel Fusion

__device__ void colorConversion_YUV_RGB();
__device__ void imageHistogram();
__device__ void edgeDetection();

__global__ void fusedKernel() {
    colorConversion_YUV_RGB();
    imageHistogram();
    edgeDetection();
}

fusedKernel<<< >>>();

SLIDE 14

GPU WORK SUBMISSION

CUDA graphs

  • CUDA graphs help batch multiple kernels, memcpys, and memsets into an optimal number of GPU work submissions.
  • CUDA graphs allow an application to describe GPU work and its dependencies ahead of time. This lets the CUDA driver do all resource allocation ahead of time (a minimal sketch follows).
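Below is a minimal sketch of this pattern using stream capture; kernelA, kernelB, the launch configuration (grid, block), and the non-default stream are hypothetical. The graph is captured and instantiated once in the init phase, launched repeatedly in the runtime phase, and destroyed in the deinit phase.

  cudaGraph_t graph;
  cudaGraphExec_t graphExec;

  // Init phase: record the work once and build an executable graph.
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  kernelA<<<grid, block, 0, stream>>>();   // hypothetical kernels
  kernelB<<<grid, block, 0, stream>>>();
  cudaStreamEndCapture(stream, &graph);
  cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

  // Runtime phase: the whole batch goes out in a single submission.
  while (running) {
      cudaGraphLaunch(graphExec, stream);
      cudaStreamSynchronize(stream);
  }

  // Deinit phase
  cudaGraphExecDestroy(graphExec);
  cudaGraphDestroy(graph);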

SLIDE 15

GPU WORK SUBMISSION

Three-Stage execution model

[Diagram: Define + Instantiate the graph (nodes A, B, X, C, D, E, Y → End) in the INIT PHASE; Execute it repeatedly on streams s1, s2, s3 in the RUNTIME PHASE; Destroy it with cudaGraphDestroy() in the DEINIT PHASE.]

Execution flow for deterministic applications.

SLIDE 16

EXECUTION OPTIMIZATIONS

Launch latencies:

▪ A pre-defined graph allows any number of kernels to be launched in one single operation.

Latency & Overhead Reductions

[Timeline diagram: launching kernels A-E individually keeps the CPU busy with five launch calls while the GPU waits between kernels; building the graph once and launching it in a single call leaves the CPU idle sooner and keeps the GPU timeline packed.]

Source: Nvidia internal benchmarks run on Drive platforms on QNX

SLIDE 17

HOST ENQUEUE TIME COMPARISON

Batching GPU work using CUDA graphs.

Neural Network   | Enqueue time without graphs (ms) | Enqueue time with graphs (ms)
ResNet50 INT8    | 0.61                             | 0.13
ResNet152 INT8   | 1.49                             | 0.24
MobileNet INT8   | 0.31                             | 0.07

Source: Nvidia benchmarks run on Drive platforms on QNX with CUDA 10.1

SLIDE 18

GPU WORK SCHEDULING

SLIDE 19

GPU WORK SCHEDULING

GPU Context switches

  • Tasks in two GPU contexts can preempt each other, which can affect the determinism of the application.
  • It is advised not to create multiple CUDA contexts on the same device in the same process.
  • If the application has multiple contexts in the same process, the dependency between them can be established with:
    • cudaStreamWaitEvent() (sketched below)
  • If the application has multiple contexts in different processes, the dependency between them can be established with:
    • EGLStreams
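A minimal sketch of the event-based dependency; producerKernel, consumerKernel, streamA, streamB, and the launch configuration are hypothetical. The same pattern applies when the two streams belong to different contexts.

  cudaEvent_t done;
  cudaEventCreateWithFlags(&done, cudaEventDisableTiming);  // timing disabled: cheaper

  producerKernel<<<grid, block, 0, streamA>>>();  // work in the first context
  cudaEventRecord(done, streamA);

  cudaStreamWaitEvent(streamB, done, 0);          // GPU-side wait, CPU never blocks
  consumerKernel<<<grid, block, 0, streamB>>>();  // runs only after the producer finishes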
SLIDE 20

GPU WORK SCHEDULING

GPU Context switches

[Timeline diagram: without an explicit dependency, Task1 (CTX1) and Task2 (CTX2) launched from the same CPU thread preempt each other, paying context save-restore time and missing Task1's expected deadline; inserting an explicit dependency between the launches runs the tasks back to back, saving that time and meeting the deadline.]

SLIDE 21

WORK SCHEDULING

CPU thread scheduling

If the CPU thread that schedules the GPU work gets switched out, the launch overhead can increase. Potential solutions (the first is sketched below):

  • Pin the CPU thread submitting CUDA work to a core and increase its thread priority.
  • Use a custom scheduler that guarantees the CPU thread is active on a CPU core while submitting CUDA kernels.
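A minimal sketch of the first option for Linux-style POSIX systems (QNX exposes comparable controls through its own scheduling APIs); core is a hypothetical parameter, and the submitting thread is assumed to call this on itself.

  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>

  // Pin the calling thread to one core and raise it to a real-time
  // priority so it is not switched out while submitting CUDA work.
  void pin_and_prioritize(int core) {
      cpu_set_t set;
      CPU_ZERO(&set);
      CPU_SET(core, &set);
      pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);

      struct sched_param param;
      param.sched_priority = sched_get_priority_max(SCHED_FIFO);
      pthread_setschedparam(pthread_self(), SCHED_FIFO, &param);
  }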

SLIDE 22

WORK SCHEDULING

CPU thread scheduling

[Timeline diagram: when Thread 1 is switched out for other CPU work between launches, the GPU sits idle between kernels A-E and the actual finish slips past the expected finish; keeping Thread 1 resident on a core keeps the launches back to back and removes the GPU idle gaps.]

SLIDE 23

WORK SUBMISSION ON NULL/DEFAULT STREAM

  • Work submitted on the NULL/default stream synchronizes with the other blocking streams on the device, which can serialize otherwise independent work; prefer explicitly created streams.

SLIDE 24

OTHER FACTORS

SLIDE 25

CUDA STREAM CALLBACKS

  • cudaStreamAddCallback runs a CPU function on a helper thread in stream order.
  • Do not use cudaStreamAddCallback / cuStreamAddCallback:
    • It involves GPU interrupt latency.
    • The application has no control over the thread that executes the callback.
  • Potential solution:
    • Use explicit CPU synchronization to schedule the dependent CPU work (sketched below).
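A minimal sketch of the explicit alternative; workKernel, hostWork, the stream, and the launch configuration are hypothetical. The application blocks on an event from a thread it owns instead of letting the driver invoke a callback on its helper thread.

  cudaEvent_t done;
  cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

  workKernel<<<grid, block, 0, stream>>>();  // hypothetical GPU work
  cudaEventRecord(done, stream);

  cudaEventSynchronize(done);  // explicit CPU synchronization on our own thread
  hostWork();                  // dependent CPU work, still in stream order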
SLIDE 26

PINNED MEMORY

  • Pinned memory is page-locked host memory.
  • All CPU memory used by deterministic applications should be pinned (cudaMallocHost, cudaHostAlloc). There is a trade-off between pinned memory usage and determinism.
  • Without pinned memory, asynchronous DMA transfers cannot be done, because pageable memory must first be copied to a staging buffer (see the sketch below).
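A minimal sketch, with a hypothetical buffer size and stream: when the host buffer is pinned, cudaMemcpyAsync can DMA directly instead of staging through driver-internal memory.

  size_t bytes = 1 << 20;        // hypothetical size
  float *hBuf, *dBuf;

  cudaMallocHost(&hBuf, bytes);  // page-locked host memory (allocate in init phase)
  cudaMalloc(&dBuf, bytes);

  // True asynchronous DMA; with pageable memory the driver would first
  // copy through a staging buffer, adding non-deterministic latency.
  cudaMemcpyAsync(dBuf, hBuf, bytes, cudaMemcpyHostToDevice, stream);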

SLIDE 27

LOCAL MEMORY RESIZES

  • Use CU_CTX_LMEM_RESIZE_TO_MAX to avoid local memory resizes during kernel launches, which can result in dynamic allocation. There is a trade-off between resource utilization and determinism.
  • In the init phase, run every kernel in the program at least once. This ensures that enough local memory has been allocated for the kernel with the highest local memory requirement.
  • All calls to cuCtxSetLimit() with CU_LIMIT_STACK_SIZE should be made in the init phase, since changing the stack size also causes local memory reallocation (the init phase calls are sketched below).
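A minimal driver API sketch of these init phase settings; the stack size value is a hypothetical placeholder.

  #include <cuda.h>

  CUdevice dev;
  CUcontext ctx;

  cuInit(0);
  cuDeviceGet(&dev, 0);

  // Size local memory to the maximum up front instead of resizing at launch.
  cuCtxCreate(&ctx, CU_CTX_LMEM_RESIZE_TO_MAX, dev);

  // Set the stack size once in the init phase; changing it later
  // triggers local memory reallocation.
  cuCtxSetLimit(CU_LIMIT_STACK_SIZE, 4096 /* hypothetical bytes */);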

SLIDE 28

UNIFIED MEMORY

  • Avoid using CUDA unified memory (created with cudaMallocManaged or cuMemAllocManaged). On the current generation of hardware, managed memory results in dynamic behavior and resource allocations/deallocations.
  • Trade-off between ease of programming and determinism (a sketch of the explicit alternative follows).
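A minimal sketch of the trade-off, assuming a hypothetical size, stream, and a pinned host buffer hPinned from the init phase: explicit allocation and copies replace on-demand page migration.

  size_t bytes = 1 << 20;  // hypothetical size
  float* dBuf;

  // Avoid: cudaMallocManaged(&dBuf, bytes);  // pages migrate on demand at runtime

  cudaMalloc(&dBuf, bytes);  // explicit allocation, done once in the init phase
  cudaMemcpyAsync(dBuf, hPinned, bytes,
                  cudaMemcpyHostToDevice, stream);  // explicit, deterministic copy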
SLIDE 29

DEVICE SIDE ALLOCATIONS

  • Do not use new, delete, malloc, or free in CUDA kernels. Deterministic applications should allocate memory in the init phase and free/delete it in the deinit phase (see the sketch below).
  • Trade-off between resource utilization and determinism, and also between ease of programming and determinism.
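A minimal sketch with hypothetical kernel names: instead of allocating on the device-side heap per launch, pass in scratch memory that was allocated once with cudaMalloc during the init phase.

  // Avoid: device-side heap allocation inside the kernel.
  __global__ void badKernel() {
      int* tmp = (int*)malloc(256 * sizeof(int));  // dynamic and non-deterministic
      // ... use tmp ...
      free(tmp);
  }

  // Prefer: a scratch buffer allocated in the init phase and passed in.
  __global__ void goodKernel(int* scratch) {
      // ... use scratch ...
  }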

SLIDE 30

REFERENCES

  • CUDA - New Features and Beyond, Stephen Jones, GTC Europe 2018:
    http://on-demand.gputechconf.com/gtc-eu/2018/video/e8128/
  • Image sources: Google Images
SLIDE 31

CONTACT US

  • Aayush Rajoria – arajoria@nvidia.com
  • Ashok Kelur – akelur@nvidia.com