  1. Experiments for Time-Predictable Execution of GPU Kernels. Flavio Kreiliger, Joel Matějka, Michal Sojka and Zdeněk Hanzálek. OSPERT 2019, July 9, 2019, Stuttgart, Germany.

  2. Motivation/Approach: NVIDIA Tegra X2
  ▶ CPUs: 4× ARM Cortex-A57, 2× Denver (ARM/NVIDIA)
  ▶ GPU: 256 CUDA cores in 2 streaming multiprocessors (SMs)

  3. Outline
  ▶ Motivation/Approach
  ▶ Experiments and results
  ▶ Future work

  4-8. Motivation/Approach: NVIDIA Tegra X2 block diagram
  [Figure, built up over five slides: the diagram highlights in turn the CPUs, the GPU, the USB/SATA and video & display blocks, and the shared memory (MEM).]

  9. Motivation/Approach: GPU execution times under CPU interference
  Tegra X2, CPUs performing sequential memory accesses.
  [Chart: relative execution time (80-200 %) of CUDA UVM, CUDA kernel, CUDA memset and CUDA memcpy workloads, run alone and with 1-5 interfering CPU cores (Alone, Interf 1-5).]
  Source: Capodieci et al., "Detailed characterization of platforms", Deliverable D2.2, H2020 project HERCULES, 2017.
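
The kind of measurement behind this chart can be sketched as follows. This is a minimal, illustrative setup and not the HERCULES benchmark: a plain vector-copy kernel stands in for the measured CUDA workloads, and all names (copyKernel, interferingCores, N) are ours.

```cuda
// Minimal sketch of a CPU-interference measurement (illustrative only):
// CPU threads perform sequential memory accesses while a memory-bound
// CUDA kernel is timed with CUDA events.
#include <atomic>
#include <cstdio>
#include <cstring>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

__global__ void copyKernel(const float *in, float *out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];              // memory-bound: sensitive to DRAM contention
}

int main() {
    const size_t N = 1 << 24;               // 64 MiB of floats, well beyond the caches
    float *in, *out;
    cudaMalloc(&in,  N * sizeof(float));
    cudaMalloc(&out, N * sizeof(float));

    std::atomic<bool> stop{false};
    const int interferingCores = 4;         // corresponds to "Interf 4" in the chart
    std::vector<std::thread> hogs;
    for (int c = 0; c < interferingCores; ++c)
        hogs.emplace_back([&stop] {
            std::vector<char> buf(64u << 20);
            while (!stop)                   // sequential writes keep the DRAM busy
                std::memset(buf.data(), 0xAB, buf.size());
        });

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    copyKernel<<<(int)((N + 255) / 256), 256>>>(in, out, N);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("kernel time under interference: %.3f ms\n", ms);

    stop = true;
    for (auto &h : hogs) h.join();
    cudaFree(in); cudaFree(out);
    return 0;
}
```

Timing the same kernel once alone and once per interference level yields the relative execution times plotted in the chart.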

  10-11. Motivation/Approach: Safety-critical applications
  ▶ Future applications will need to combine safety and high performance, e.g. autonomous driving
  ▶ Typically, only some parts of the system are safety-critical
  ▶ Goal: isolate critical parts from non-critical ones
  ▶ A failure in a non-critical component should not propagate to a critical one
  ▶ ISO 26262: freedom from interference

  12-16. Motivation/Approach: Interference on TX2
  Four interference cases are distinguished: 1. CPU-to-GPU, 2. GPU-to-CPU, 3. CPU-to-CPU, 4. GPU-to-GPU.
  [Chart 1, CPU-to-GPU: relative execution time (80-200 %) of CUDA UVM, CUDA kernel, CUDA memset and CUDA memcpy, alone and with 1-5 interfering CPU cores.]
  [Chart 2, CPU-to-CPU: CPU latency (0-60 ns) for sequential reads under sequential interference, plotted over working-set sizes (WSS) from 1 KiB to 16 MiB, alone and with 1-3 interfering cores; the cache limit is marked.]
  Source: Capodieci et al., "Detailed characterization of platforms", Deliverable D2.2, H2020 project HERCULES, 2017.
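
The CPU-to-CPU latency curve can be approximated with a working-set sweep like the one below. This is our illustration of the measurement idea (dependent loads over an increasing working set), not the HERCULES benchmark code; it is plain host code and can live in the same .cu sources.

```cuda
// Illustrative working-set sweep (host-only code): per-access latency of
// dependent sequential loads, swept from 1 KiB to 16 MiB as in the chart.
#include <chrono>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    for (size_t wss = 1024; wss <= (16u << 20); wss *= 2) {
        const size_t n = wss / sizeof(size_t);
        std::vector<size_t> chain(n);
        std::iota(chain.begin(), chain.end(), (size_t)1);  // element i points to i+1
        chain[n - 1] = 0;                                  // wrap around to the start

        const size_t iters = 1u << 24;
        size_t idx = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (size_t i = 0; i < iters; ++i)
            idx = chain[idx];              // dependent load: the next address needs this result
        auto t1 = std::chrono::steady_clock::now();

        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
        printf("WSS %8zu B: %5.1f ns/access (chk %zu)\n", wss, ns, idx);  // print idx to keep the loop
    }
    return 0;
}
```

Running additional copies of the inner loop pinned to other cores would play the role of the "Interf 1-3" configurations in the chart.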

  17. Motivation/Approach » PREM
  ▶ Possible solution (a part of) to CPU-to-CPU interference: the PRedictable Execution Model (PREM)
  ▶ Tasks prefetch batches of data to CPU-local memory (cache/scratchpad) and synchronize on access to main memory
  ▶ Well applicable to image processing and neural networks; GPUs are better suited for these number-crunching applications
  [Diagram: PREM schedule with prefetch (P), compute (C) and write-back (W) phases on CPU1 and CPU2, serialized at the memory controller (MC); see the kernel sketch below.]
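
To make the phase structure concrete on a GPU, a kernel can be organized so that all DRAM traffic is batched into explicit phases, with shared memory playing the role of the local scratchpad. This is a minimal sketch of the idea; the 1-D stencil and the TILE size are illustrative, not taken from the paper.

```cuda
// Minimal sketch of a PREM-style phase structure inside a CUDA kernel:
// prefetch to shared memory, compute locally, then write back.
// Launch with <<<ceil(n / TILE), TILE>>>; the stencil itself is illustrative.
#define TILE 256

__global__ void premStencil(const float *in, float *out, int n) {
    __shared__ float tile[TILE + 2];                 // local copy incl. one-element halo
    int g = blockIdx.x * TILE + threadIdx.x;         // global index handled by this thread

    // --- memory (prefetch) phase: all reads from DRAM happen here ---
    tile[threadIdx.x + 1] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x == 0) {
        tile[0]        = (g > 0)        ? in[g - 1]    : 0.0f;   // left halo
        tile[TILE + 1] = (g + TILE < n) ? in[g + TILE] : 0.0f;   // right halo
    }
    __syncthreads();                                 // phase boundary

    // --- compute phase: touches only shared memory, no DRAM traffic ---
    float v = 0.25f * tile[threadIdx.x] + 0.5f * tile[threadIdx.x + 1]
            + 0.25f * tile[threadIdx.x + 2];
    __syncthreads();                                 // keep compute and write-back from overlapping

    // --- write-back phase: all writes to DRAM happen here ---
    if (g < n) out[g] = v;
}
```

Batching DRAM accesses this way is what allows memory phases of different compute units to be scheduled against each other; the hard part, addressed on the following slides, is arbitrating who may run its memory phase when.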

  18. Motivation/Approach » PREM: Problems with PREM on GPUs
  ▶ Memory bandwidth is almost always a bottleneck
  ▶ Compute phases are shorter due to high parallelism
  ▶ Mutual exclusion for memory access kills parallelism
  ▶ Costly synchronization (≈ 2 µs)
    ▶ between CPU and GPU, or
    ▶ between multiple SMs in the GPU
  [Diagram: PREM phase schedule (P/C/W on CPU1, CPU2, MC), as on the previous slide.]
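
The quoted synchronization cost can be measured with a flag ping-pong between host and device. The sketch below assumes spinning on zero-copy pinned host memory as the synchronization mechanism; the flag names and round count are illustrative, not taken from the paper.

```cuda
// Rough sketch of measuring a CPU<->GPU synchronization round trip by
// spinning on zero-copy pinned host memory (illustrative, not the paper's code).
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void pongKernel(volatile int *ping, volatile int *pong, int rounds) {
    for (int r = 1; r <= rounds; ++r) {
        while (*ping < r) ;            // wait for the CPU's "ping"
        *pong = r;                     // answer immediately
        __threadfence_system();        // make the answer visible to the host
    }
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);
    int *ping, *pong, *d_ping, *d_pong;
    cudaHostAlloc((void **)&ping, sizeof(int), cudaHostAllocMapped);
    cudaHostAlloc((void **)&pong, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_ping, ping, 0);
    cudaHostGetDevicePointer((void **)&d_pong, pong, 0);
    *ping = 0; *pong = 0;

    const int rounds = 1000;
    pongKernel<<<1, 1>>>(d_ping, d_pong, rounds);

    volatile int *vping = ping, *vpong = pong;
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 1; r <= rounds; ++r) {
        *vping = r;                    // "ping"
        while (*vpong < r) ;           // spin until the GPU answers
    }
    auto t1 = std::chrono::steady_clock::now();
    cudaDeviceSynchronize();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / rounds;
    printf("CPU<->GPU round trip: %.2f us\n", us);
    return 0;
}
```

Round trips in the region of the 2 µs quoted above mean that a phase has to be fairly long before paying this price at every phase boundary is worthwhile.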

  19-20. Motivation/Approach » PREM: PREM on GPU, early approach: GPUguard (ETHZ)
  [Sequence diagram: the CPU creates the GPUguard interface and exchanges shared memory (SHM) with the hypervisor (HV) and the GPU; after setup and offload, the GPU kernel checks in, requests memory and compute phases (budgeted by M-WCET and C-WCET), spins on SHM until each phase is granted, and checks out; the CPU waits in cudaDeviceSynchronize and finally retrieves the GPUguard statistics.]
  ▶ Low performance due to excessive synchronization between CPU and GPU (see the sketch below)
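
For illustration, the check-in / check-out pattern from the diagram might look roughly as follows on the GPU side; the mailbox layout, field names and single-block launch are our assumptions, not ETHZ's GPUguard implementation.

```cuda
// Illustrative check-in / check-out arbitration in the spirit of the GPUguard
// diagram (not ETHZ's code): the kernel requests each phase through a mailbox
// in zero-copy pinned host memory and spins until the CPU/HV arbiter grants it.
// For simplicity the sketch assumes a single-block launch (<<<1, 256>>>).

struct GuardMailbox {               // shared between GPU and the arbiter (SHM)
    volatile int request;           // set by GPU: 1 = memory phase, 2 = compute phase
    volatile int grant;             // set by arbiter: copies `request` when the phase may start
    volatile int done;              // incremented by GPU on check-out
};

__device__ void checkIn(GuardMailbox *mb, int phase) {
    if (threadIdx.x == 0) {
        mb->request = phase;
        __threadfence_system();      // publish the request to the host
        while (mb->grant != phase) ; // spin on SHM until the phase is granted
    }
    __syncthreads();                 // release the whole block into the phase
}

__device__ void checkOut(GuardMailbox *mb) {
    __syncthreads();                 // all threads finished the phase
    if (threadIdx.x == 0) { mb->done += 1; __threadfence_system(); }
}

__global__ void guardedKernel(GuardMailbox *mb, const float *in, float *out, int n) {
    __shared__ float tile[256];
    int g = blockIdx.x * blockDim.x + threadIdx.x;

    checkIn(mb, 1);                                   // memory phase
    tile[threadIdx.x] = (g < n) ? in[g] : 0.0f;       // DRAM reads only while granted
    checkOut(mb);

    checkIn(mb, 2);                                   // compute phase
    float v = 2.0f * tile[threadIdx.x];               // local data only
    checkOut(mb);

    checkIn(mb, 1);                                   // memory phase for write-back
    if (g < n) out[g] = v;
    checkOut(mb);
}
```

Every phase transition costs a CPU-GPU round trip over the shared mailbox, which is exactly the excessive synchronization the slide points out.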

  21-22. Motivation/Approach » Time-Triggered scheduling: Another approach
  Pros:
  ▶ Low synchronization overhead
  ▶ Applies not only to the GPU but can span the whole chip
  ▶ GPU jobs are often offloaded in batches (e.g. one video frame)
    ▶ the whole batch can be scheduled
    ▶ all parameters are known at least at offload time
    ▶ the processing pipeline is static (safety)
  Cons:
  ▶ Cannot handle dynamic workload
  ▶ Over-provisioning due to uncertain execution time (reduced by our approach)
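
One way to realize time-triggered execution on the GPU itself is to release statically scheduled phases inside a kernel against the GPU's nanosecond timer. The sketch below reads the %globaltimer register via inline PTX; the schedule table, slot structure and placeholder work are illustrative assumptions, not the paper's implementation.

```cuda
// Illustrative time-triggered phase release inside a CUDA kernel, using the
// GPU's nanosecond %globaltimer as the time base (schedule and work are ours).
#include <cstdint>

__device__ uint64_t globalTimerNs() {
    uint64_t t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
    return t;
}

struct Slot { uint64_t offset_ns; int job; };       // one statically scheduled slot

__global__ void timeTriggeredKernel(const Slot *schedule, int slots, float *buf, int n) {
    __shared__ uint64_t t0;
    if (threadIdx.x == 0) t0 = globalTimerNs();     // time base: this block's start
    __syncthreads();

    for (int s = 0; s < slots; ++s) {
        if (threadIdx.x == 0)
            while (globalTimerNs() < t0 + schedule[s].offset_ns) ;  // busy-wait for the release time
        __syncthreads();                            // the whole block enters the slot together

        // Placeholder work; a real pipeline would run the memory or compute
        // phase of job schedule[s].job in this slot.
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            buf[i] += (float)schedule[s].job;

        __syncthreads();                            // finish before the next release
    }
}
```

Because the release times are fixed offsets known at offload time, no CPU interaction is needed while the batch runs, which is where the low synchronization overhead comes from.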
