SLIDE 1

Experiments for Time-Predictable Execution of GPU Kernels

Flavio Kreiliger, Joel Matějka, Michal Sojka and Zdeněk Hanzálek

OSPERT 2019, July 9, 2019, Stuttgart, Germany


SLIDE 2

Motivation/Approach

NVIDIA Tegra X2

▶ CPUs: 4× ARM Cortex-A57, 2× Denver (ARM/NVIDIA)
▶ GPU: 256 CUDA cores in 2 streaming multiprocessors (SMs)


SLIDE 3

Motivation/Approach

Outline

▶ Motivation/Approach
▶ Experiments and results
▶ Future work


SLIDE 4

Motivation/Approach

NVIDIA Tegra X2 block diagram

[Block diagram: CPUs, GPU, video & display engines, USB and SATA controllers all attached to a shared memory subsystem (MEM).]

SLIDE 9

Motivation/Approach

GPU execution times under CPU interference

Tegra X2, CPUs performing sequential memory accesses

[Bar chart: relative execution time (80–200%) of CUDA UVM, CUDA kernel, CUDA memset and CUDA memcpy, alone and with 1–5 interfering CPU cores.]

Source: Capodieci et al., Detailed characterization of platforms, Deliverable D2.2, H2020 project HERCULES, 2017.


SLIDE 10

Motivation/Approach

Safety-critical applications

E.g. autonomous driving

▶ Future applications will need to combine safety and high performance
▶ Typically, only some parts of the system are safety-critical
▶ Goal: isolate critical parts from non-critical ones
  ▶ A failure in a non-critical component should not propagate to a critical one
▶ ISO 26262: freedom from interference

SLIDE 12

Motivation/Approach

Interference on TX2

  • 1. CPU-to-GPU
  • 2. GPU-to-CPU
  • 3. CPU-to-CPU
  • 4. GPU-to-GPU

[Bar chart: relative execution time (80–200%) of CUDA UVM, CUDA kernel, CUDA memset and CUDA memcpy, alone and with 1–5 interfering CPU cores.]

[Line chart: CPU sequential-read latency (10–60 ns) vs. working-set size (1 KiB–16 MiB) under sequential interference from 0–3 other cores; latency rises sharply past the cache limit.]

Source: Capodieci et al., Detailed characterization of platforms, Deliverable D2.2, H2020 project HERCULES, 2017.

SLIDE 17

Motivation/Approach » PREM

CPU-to-CPU interference

▶ Possible (partial) solution: PRedictable Execution Model (PREM) – see the sketch after this slide
▶ Tasks prefetch batches of data to CPU-local memory (cache/scratchpad) and synchronize on access to main memory
▶ Well suited to number-crunching applications:
  ▶ Image processing
  ▶ Neural networks
▶ GPUs are better suited for these

[Schedule diagram: prefetch (P), compute (C) and writeback (W) phases of two CPUs interleaved so that only one of them accesses the memory controller (MC) at a time.]

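To make the phase structure concrete, here is a minimal host-side sketch of a PREM task (our illustration, not the paper's code): a global lock stands in for the system-wide memory arbitration that admits only one core to main memory at a time, and a local buffer stands in for the cache/scratchpad.

    #include <mutex>
    #include <vector>

    // Stand-in for system-wide memory-phase arbitration.
    std::mutex memory_phase_lock;

    void prem_task(const float *input, float *output, size_t n)
    {
        std::vector<float> local(n);   // stands in for cache/scratchpad

        {   // Prefetch phase: the only reads from main memory.
            std::lock_guard<std::mutex> guard(memory_phase_lock);
            for (size_t i = 0; i < n; ++i) local[i] = input[i];
        }

        // Compute phase: touches only the prefetched, cache-resident data,
        // so it neither suffers nor causes main-memory interference.
        for (size_t i = 0; i < n; ++i) local[i] = local[i] * local[i] + 1.0f;

        {   // Writeback phase: the only writes to main memory.
            std::lock_guard<std::mutex> guard(memory_phase_lock);
            for (size_t i = 0; i < n; ++i) output[i] = local[i];
        }
    }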

SLIDE 18

Motivation/Approach » PREM

Problems with PREM on GPUs

▶ Memory bandwidth is almost always a bottleneck
▶ Compute phases are shorter due to high parallelism
▶ Mutual exclusion for memory access kills performance

[Schedule diagram: PREM phases (P/C/W) of two CPUs serialized on the memory controller (MC).]

▶ Costly synchronization (≈ 2 µs)
  ▶ between CPU and GPU, or
  ▶ between multiple SMs in the GPU


SLIDE 19

Motivation/Approach » PREM

PREM on GPU: early approach – GPUguard (ETHZ)

[Sequence diagram: CPU, GPUguard kernel module (K), hypervisor (HV), shared memory (SHM) and GPU. After init and offload, the GPU checks in and then requests each memory phase (M) and compute phase (C) through SHM, with both sides spinning on SHM; M-WCET and C-WCET budgets and a period T are set up via the GG interface, and the CPU retrieves GPUguard statistics after cudaDeviceSynchronize.]

▶ Low performance due to excessive synchronization between CPU and GPU

SLIDE 21

Motivation/Approach » Time-Triggered scheduling

Another approach: Time-Triggered scheduling

▶ GPU jobs are often offloaded in batches (e.g. one video frame)
  ▶ the whole batch can be scheduled
  ▶ all parameters are known at least at offload time
  ▶ the processing pipeline is static (safety)

Pros:
▶ Low synchronization overhead
▶ Applies not only to the GPU but can span the whole chip

Cons:
▶ Cannot handle dynamic workload
▶ Over-provisioning due to uncertain execution time
  ▶ Reduced by our approach

SLIDE 27

Experiments and results

Outline

▶ Motivation/Approach
▶ Experiments and results
▶ Future work


SLIDE 28

Experiments and results

Overview

Interference | Approach                  | When
CPU-CPU      | PREM and TT scheduling    | past
GPU-GPU      | “PREM” and TT scheduling  | started in this paper
CPU-GPU      | TT scheduling + ?         | future

Experiments:

  • 1. Synchronization overhead
  • 2. Inter-kernel interference (2D convolution)
  • 3. Detailed interference characterization (2D convolution)

SLIDE 32

Experiments and results » Synchronization

Synchronization between GPU jobs/kernels

▶ Within one CUDA block (one SM of the GPU) – built-in
▶ Across multiple CUDA blocks (SMs):
  ▶ Spinlock-like in pinned (non-cached) memory: 2 µs
  ▶ Time-based (globaltimer register) – see the sketch after this slide:
    ▶ Default timer resolution is not sufficient: 1 µs
    ▶ nvprof reconfigures the resolution to about 160 ns

[Plot: globaltimer readings over ~70 iterations; with the default configuration the timer advances in 1 µs steps, after running nvprof in steps of about 160 ns.]
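A minimal sketch of the time-based mechanism, assuming inline PTX access to the %globaltimer register (our illustration; the kernel and parameter names are not from the paper): each block spins until a common absolute release time, avoiding the ~2 µs spinlock path through pinned memory.

    #include <cstdint>

    // Read the GPU-wide nanosecond clock (PTX register %globaltimer).
    __device__ uint64_t global_timer_ns()
    {
        uint64_t t;
        asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
        return t;
    }

    // Busy-wait until an absolute release time; blocks that share the same
    // release time begin their next phase (nearly) simultaneously.
    __device__ void wait_until(uint64_t release_ns)
    {
        while (global_timer_ns() < release_ns)
            ;   // spin
    }

    // Toy kernel: record how far past the release time each block started.
    __global__ void timed_start(uint64_t release_ns, uint64_t *lateness_ns)
    {
        wait_until(release_ns);
        if (threadIdx.x == 0)
            lateness_ns[blockIdx.x] = global_timer_ns() - release_ns;
    }

Since the default timer only advances in ~1 µs steps, this wait is only as precise as the ~160 ns resolution obtained after running nvprof.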

SLIDE 36

Experiments and results » Intra-GPU interference

2D convolution benchmark

▶ From the Polybench-ACC benchmark suite
▶ Original (legacy) implementation
▶ Tiled implementation with prefetching – see the sketch after this slide
  ▶ Tiles placed in shared memory
  ▶ Tile size ⇒ 4 kernels can run simultaneously

[Diagram: the dataset is split into tiles; block 0 (SM0) and block 1 (SM1) of the GPU kernel each prefetch a tile plus the convolution mask into shared memory before computing.]
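The tiled implementation can be sketched as a PREM-style kernel with explicit prefetch, compute and writeback phases separated by __syncthreads(). The sketch below is our reconstruction for a 3×3 mask; the tile size, names and halo handling are assumptions, not the benchmark's actual code.

    #define TILE 14   // assumed tile edge; block is (TILE+2)x(TILE+2) = 16x16 threads
    #define HALO 1    // a 3x3 mask needs a 1-pixel halo around each tile

    __constant__ float mask[3][3];

    __global__ void conv2d_tiled(const float *in, float *out, int width, int height)
    {
        __shared__ float tile[TILE + 2*HALO][TILE + 2*HALO];

        int gx = blockIdx.x * TILE + threadIdx.x - HALO;   // global coordinates,
        int gy = blockIdx.y * TILE + threadIdx.y - HALO;   // shifted by the halo

        // Prefetch phase: every thread copies one element (zero-padded at the
        // image border) into shared memory; the only global reads in the kernel.
        float v = 0.0f;
        if (gx >= 0 && gx < width && gy >= 0 && gy < height)
            v = in[gy * width + gx];
        tile[threadIdx.y][threadIdx.x] = v;
        __syncthreads();                                   // end of memory phase

        // Compute phase: interior threads convolve entirely from shared memory.
        bool interior = threadIdx.x >= HALO && threadIdx.x < TILE + HALO &&
                        threadIdx.y >= HALO && threadIdx.y < TILE + HALO;
        float acc = 0.0f;
        if (interior)
            for (int i = -HALO; i <= HALO; ++i)
                for (int j = -HALO; j <= HALO; ++j)
                    acc += mask[i + HALO][j + HALO]
                         * tile[threadIdx.y + i][threadIdx.x + j];
        __syncthreads();                                   // end of compute phase

        // Writeback phase: the only global writes in the kernel.
        if (interior && gx < width && gy < height)
            out[gy * width + gx] = acc;
    }

Launched with dim3 block(TILE + 2*HALO, TILE + 2*HALO), a block this small leaves enough threads and shared memory per SM for four such kernels to run simultaneously, as the slide requires.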

SLIDE 39

Experiments and results » Intra-GPU interference

Tiled 2D convolution schedule

▶ 4 kernels, 2 streaming multiprocessors
▶ prefetch, compute, writeback phases + spinning
▶ different kernels started with different offsets (a host-side sketch follows below)

[Gantt chart: number of active threads (512–2048) over time (32.5–50 µs) on SM 0 and SM 1 for kernels K0–K3 (blocks B0/B1), started with staggered offsets.]

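One plausible way to produce such a schedule, sketched under stated assumptions: sample the GPU clock once, then launch the four kernels in separate streams with staggered absolute release times. Here conv2d_tiled_timed is a hypothetical variant of the kernel above that first calls wait_until on its release time, and TILE/HALO/uint64_t come from the earlier sketches; the paper itself runs one block per SM and schedules individual phases, whereas this sketch only staggers whole-kernel starts.

    __global__ void read_epoch(uint64_t *epoch)   // sample %globaltimer once
    {
        asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(*epoch));
    }

    void launch_schedule(const float *in[4], float *out[4], int w, int h,
                         uint64_t offset_ns)      // e.g. ~1300 ns, per the results
    {
        uint64_t *epoch;
        cudaMallocManaged(&epoch, sizeof(*epoch));
        read_epoch<<<1, 1>>>(epoch);
        cudaDeviceSynchronize();

        dim3 block(TILE + 2*HALO, TILE + 2*HALO);
        dim3 grid((w + TILE - 1) / TILE, (h + TILE - 1) / TILE);

        cudaStream_t s[4];
        for (int k = 0; k < 4; ++k) {
            cudaStreamCreate(&s[k]);
            // Common epoch + launch slack + per-kernel offset staggers the
            // prefetch phases of the four kernels against each other.
            uint64_t release = *epoch + 100000 + (uint64_t)k * offset_ns;
            conv2d_tiled_timed<<<grid, block, 0, s[k]>>>(in[k], out[k], w, h, release);
        }
        for (int k = 0; k < 4; ++k) {
            cudaStreamSynchronize(s[k]);
            cudaStreamDestroy(s[k]);
        }
    }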

SLIDE 40

Experiments and results » Intra-GPU interference

Results: execution time + jitter

[Bar chart: average scenario execution time (2–6 ms) and jitter for five scenarios – Legacy: 1 kernel (jitter 1.84%), Legacy: 4 kernels (6.47%), Tiled: 4 kernels, no scheduler (1.47%), Tiled: 4 kernels, 1300 ns offset (0.15%), Tiled: 4 kernels, 1400 ns offset (0.04%).]

[Line chart: average scenario execution time (2.5–3.5 ms, with the single-kernel baseline) and jitter (0–1.5%) as a function of tile offset (500–1700 ns).]

SLIDE 42

Experiments and results » Interference characterization

Interference between prefetch and compute phases

[Bar chart: prefetch and compute phase execution times (up to ~4000 ns) and their jitter relative to the average phase execution time for increasing tile offsets (0–3 µs); prefetch jitter falls from roughly 82% with no offset to about 40% at the largest offset, compute jitter from roughly 25% to about 13%.]

▶ Less overlap of prefetch phases ⇒ shorter execution time and smaller jitter
▶ Compute phases interfere with each other (shared memory bank conflicts) ⇒ prevents straightforward application of PREM

SLIDE 45

Experiments and results » Interference characterization

Interference between writeback phases

[Bar chart: writeback phase execution time (500–1500 ns) and jitter relative to the average phase execution time for increasing tile offsets (0–1 µs); jitter falls from roughly 157–190% at the smallest offsets to about 54% at the largest.]

▶ Less overlap of writeback phases ⇒ shorter execution time and smaller jitter

SLIDE 47

Experiments and results » Interference characterization

Conclusion

▶ Time-triggered scheduling on the TX2 GPU is possible
▶ The GPU globaltimer register has sufficient resolution (160 ns) after running nvprof
▶ Even very simple scheduling (adding offsets) shows potential to reduce execution-time jitter

SLIDE 50

Future work

Future work: Interference-aware scheduling of complex GPU workloads

[Heat maps, traditional memory model: for ten elementary kernels (sqrt norm, sqrt mag, conj, sum ch, mul, div, add, mul c, add c, elemmul) run against 0–3 instances of each other kernel – average execution time in µs (roughly 3–110 µs, growing with the number of co-runners) and jitter relative to average execution time (roughly 0.3–21%).]