SLIDE 1

Experiments for Time-Predictable Execution of GPU Kernels

Flavio Kreiliger, Joel Matějka, Michal Sojka and Zdeněk Hanzálek

OSPERT 2019, July 9, 2019, Stuttgart, Germany


SLIDE 2

Motivation/Approach

NVIDIA Tegra X2

▶ CPUs: 4× ARM Cortex-A57, 2× Denver (ARM/NVIDIA)
▶ GPU: 256 CUDA cores in 2 streaming multiprocessors (SMs)


SLIDE 3

Motivation/Approach

Outline

▶ Motivation/Approach
▶ Experiments and results
▶ Future work


SLIDE 4

Motivation/Approach

NVIDIA Tegra X2 block diagram

[Block diagram: CPUs, GPU, video & display engines, USB and SATA controllers all attached to a shared memory subsystem (MEM).]

SLIDE 9

Motivation/Approach

GPU execution times under CPU interference

Tegra X2, CPUs performing sequential memory accesses

[Bar chart: relative execution time (80–200%) of CUDA UVM, CUDA kernel, CUDA memset and CUDA memcpy, alone and with 1–5 interfering CPU cores.]

Source: Capodieci et al., Detailed characterization of platforms, Deliverable D2.2, H2020 project HERCULES, 2017.


SLIDE 10

Motivation/Approach

Safety-critical applications

E.g. autonomous driving

▶ Future applications will need to combine safety and high performance
▶ Typically, only some parts of the system are safety-critical
▶ Goal: isolate critical parts from non-critical ones
  ▶ A failure in a non-critical component should not propagate to a critical one
▶ ISO 26262: freedom from interference

SLIDE 12

Motivation/Approach

Interference on TX2

  • 1. CPU-to-GPU
  • 2. GPU-to-CPU
  • 3. CPU-to-CPU
  • 4. GPU-to-GPU

[Bar chart: relative execution time (80–200%) of CUDA UVM, CUDA kernel, CUDA memset and CUDA memcpy, alone and with 1–5 interfering CPU cores.]

[Line chart: CPU sequential-read latency (10–60 ns) vs. working-set size (1 KiB–16 MiB) under sequential interference from 0–3 other cores; latency rises sharply past the cache limit.]

Source: Capodieci et al., Detailed characterization of platforms, Deliverable D2.2, H2020 project HERCULES, 2017.

SLIDE 17

Motivation/Approach » PREM

CPU-to-CPU interference

▶ Possible (partial) solution: PRedictable Execution Model (PREM) – see the sketch after this slide
▶ Tasks prefetch batches of data to CPU-local memory (cache/scratchpad) and synchronize on access to main memory
▶ Well suited to number-crunching applications:
  ▶ Image processing
  ▶ Neural networks
▶ GPUs are better suited for these

[Schedule diagram: prefetch (P), compute (C) and writeback (W) phases of two CPUs interleaved so that only one of them accesses the memory controller (MC) at a time.]

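To make the phase structure concrete, here is a minimal host-side sketch of a PREM task (our illustration, not the paper's code): a global lock stands in for the system-wide memory arbitration that admits only one core to main memory at a time, and a local buffer stands in for the cache/scratchpad.

    #include <mutex>
    #include <vector>

    // Stand-in for system-wide memory-phase arbitration.
    std::mutex memory_phase_lock;

    void prem_task(const float *input, float *output, size_t n)
    {
        std::vector<float> local(n);   // stands in for cache/scratchpad

        {   // Prefetch phase: the only reads from main memory.
            std::lock_guard<std::mutex> guard(memory_phase_lock);
            for (size_t i = 0; i < n; ++i) local[i] = input[i];
        }

        // Compute phase: touches only the prefetched, cache-resident data,
        // so it neither suffers nor causes main-memory interference.
        for (size_t i = 0; i < n; ++i) local[i] = local[i] * local[i] + 1.0f;

        {   // Writeback phase: the only writes to main memory.
            std::lock_guard<std::mutex> guard(memory_phase_lock);
            for (size_t i = 0; i < n; ++i) output[i] = local[i];
        }
    }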

SLIDE 18

Motivation/Approach » PREM

Problems with PREM on GPUs

▶ Memory bandwidth is almost always a bottleneck
▶ Compute phases are shorter due to high parallelism
▶ Mutual exclusion for memory access kills performance

[Schedule diagram: PREM phases (P/C/W) of two CPUs serialized on the memory controller (MC).]

▶ Costly synchronization (≈ 2 µs)
  ▶ between CPU and GPU, or
  ▶ between multiple SMs in the GPU


SLIDE 19

Motivation/Approach » PREM

PREM on GPU: early approach – GPUguard (ETHZ)

[Sequence diagram: CPU, GPUguard kernel module (K), hypervisor (HV), shared memory (SHM) and GPU. After init and offload, the GPU checks in and then requests each memory phase (M) and compute phase (C) through SHM, with both sides spinning on SHM; M-WCET and C-WCET budgets and a period T are set up via the GG interface, and the CPU retrieves GPUguard statistics after cudaDeviceSynchronize.]

▶ Low performance due to excessive synchronization between CPU and GPU

SLIDE 21

Motivation/Approach » Time-Triggered scheduling

Another approach: Time-Triggered scheduling

▶ GPU jobs are often offloaded in batches (e.g. one video frame)
  ▶ the whole batch can be scheduled
  ▶ all parameters are known at least at offload time
  ▶ the processing pipeline is static (safety)

Pros:
▶ Low synchronization overhead
▶ Applies not only to the GPU but can span the whole chip

Cons:
▶ Cannot handle dynamic workload
▶ Over-provisioning due to uncertain execution time
  ▶ Reduced by our approach

SLIDE 27

Experiments and results

Outline

▶ Motivation/Approach
▶ Experiments and results
▶ Future work


SLIDE 28

Experiments and results

Overview

Interference | Approach                  | When
CPU-CPU      | PREM and TT scheduling    | past
GPU-GPU      | “PREM” and TT scheduling  | started in this paper
CPU-GPU      | TT scheduling + ?         | future

Experiments:

  • 1. Synchronization overhead
  • 2. Inter-kernel interference (2D convolution)
  • 3. Detailed interference characterization (2D convolution)

SLIDE 32

Experiments and results » Synchronization

Synchronization between GPU jobs/kernels

▶ Within one CUDA block (one SM of the GPU) – built-in
▶ Across multiple CUDA blocks (SMs):
  ▶ Spinlock-like in pinned (non-cached) memory: 2 µs
  ▶ Time-based (globaltimer register) – see the sketch after this slide:
    ▶ Default timer resolution is not sufficient: 1 µs
    ▶ nvprof reconfigures the resolution to about 160 ns

[Plot: globaltimer readings over ~70 iterations; with the default configuration the timer advances in 1 µs steps, after running nvprof in steps of about 160 ns.]
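A minimal sketch of the time-based mechanism, assuming inline PTX access to the %globaltimer register (our illustration; the kernel and parameter names are not from the paper): each block spins until a common absolute release time, avoiding the ~2 µs spinlock path through pinned memory.

    #include <cstdint>

    // Read the GPU-wide nanosecond clock (PTX register %globaltimer).
    __device__ uint64_t global_timer_ns()
    {
        uint64_t t;
        asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
        return t;
    }

    // Busy-wait until an absolute release time; blocks that share the same
    // release time begin their next phase (nearly) simultaneously.
    __device__ void wait_until(uint64_t release_ns)
    {
        while (global_timer_ns() < release_ns)
            ;   // spin
    }

    // Toy kernel: record how far past the release time each block started.
    __global__ void timed_start(uint64_t release_ns, uint64_t *lateness_ns)
    {
        wait_until(release_ns);
        if (threadIdx.x == 0)
            lateness_ns[blockIdx.x] = global_timer_ns() - release_ns;
    }

Since the default timer only advances in ~1 µs steps, this wait is only as precise as the ~160 ns resolution obtained after running nvprof.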

SLIDE 36

Experiments and results » Intra-GPU interference

2D convolution benchmark

▶ From the Polybench-ACC benchmark suite
▶ Original (legacy) implementation
▶ Tiled implementation with prefetching – see the sketch after this slide
  ▶ Tiles placed in shared memory
  ▶ Tile size ⇒ 4 kernels can run simultaneously

[Diagram: the dataset is split into tiles; block 0 (SM0) and block 1 (SM1) of the GPU kernel each prefetch a tile plus the convolution mask into shared memory before computing.]
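The tiled implementation can be sketched as a PREM-style kernel with explicit prefetch, compute and writeback phases separated by __syncthreads(). The sketch below is our reconstruction for a 3×3 mask; the tile size, names and halo handling are assumptions, not the benchmark's actual code.

    #define TILE 14   // assumed tile edge; block is (TILE+2)x(TILE+2) = 16x16 threads
    #define HALO 1    // a 3x3 mask needs a 1-pixel halo around each tile

    __constant__ float mask[3][3];

    __global__ void conv2d_tiled(const float *in, float *out, int width, int height)
    {
        __shared__ float tile[TILE + 2*HALO][TILE + 2*HALO];

        int gx = blockIdx.x * TILE + threadIdx.x - HALO;   // global coordinates,
        int gy = blockIdx.y * TILE + threadIdx.y - HALO;   // shifted by the halo

        // Prefetch phase: every thread copies one element (zero-padded at the
        // image border) into shared memory; the only global reads in the kernel.
        float v = 0.0f;
        if (gx >= 0 && gx < width && gy >= 0 && gy < height)
            v = in[gy * width + gx];
        tile[threadIdx.y][threadIdx.x] = v;
        __syncthreads();                                   // end of memory phase

        // Compute phase: interior threads convolve entirely from shared memory.
        bool interior = threadIdx.x >= HALO && threadIdx.x < TILE + HALO &&
                        threadIdx.y >= HALO && threadIdx.y < TILE + HALO;
        float acc = 0.0f;
        if (interior)
            for (int i = -HALO; i <= HALO; ++i)
                for (int j = -HALO; j <= HALO; ++j)
                    acc += mask[i + HALO][j + HALO]
                         * tile[threadIdx.y + i][threadIdx.x + j];
        __syncthreads();                                   // end of compute phase

        // Writeback phase: the only global writes in the kernel.
        if (interior && gx < width && gy < height)
            out[gy * width + gx] = acc;
    }

Launched with dim3 block(TILE + 2*HALO, TILE + 2*HALO), a block this small leaves enough threads and shared memory per SM for four such kernels to run simultaneously, as the slide requires.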

SLIDE 39

Experiments and results » Intra-GPU interference

Tiled 2D convolution schedule

▶ 4 kernels, 2 streaming multiprocessors
▶ prefetch, compute, writeback phases + spinning
▶ different kernels started with different offsets (a host-side sketch follows below)

[Gantt chart: number of active threads (512–2048) over time (32.5–50 µs) on SM 0 and SM 1 for kernels K0–K3 (blocks B0/B1), started with staggered offsets.]

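One plausible way to produce such a schedule, sketched under stated assumptions: sample the GPU clock once, then launch the four kernels in separate streams with staggered absolute release times. Here conv2d_tiled_timed is a hypothetical variant of the kernel above that first calls wait_until on its release time, and TILE/HALO/uint64_t come from the earlier sketches; the paper itself runs one block per SM and schedules individual phases, whereas this sketch only staggers whole-kernel starts.

    __global__ void read_epoch(uint64_t *epoch)   // sample %globaltimer once
    {
        asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(*epoch));
    }

    void launch_schedule(const float *in[4], float *out[4], int w, int h,
                         uint64_t offset_ns)      // e.g. ~1300 ns, per the results
    {
        uint64_t *epoch;
        cudaMallocManaged(&epoch, sizeof(*epoch));
        read_epoch<<<1, 1>>>(epoch);
        cudaDeviceSynchronize();

        dim3 block(TILE + 2*HALO, TILE + 2*HALO);
        dim3 grid((w + TILE - 1) / TILE, (h + TILE - 1) / TILE);

        cudaStream_t s[4];
        for (int k = 0; k < 4; ++k) {
            cudaStreamCreate(&s[k]);
            // Common epoch + launch slack + per-kernel offset staggers the
            // prefetch phases of the four kernels against each other.
            uint64_t release = *epoch + 100000 + (uint64_t)k * offset_ns;
            conv2d_tiled_timed<<<grid, block, 0, s[k]>>>(in[k], out[k], w, h, release);
        }
        for (int k = 0; k < 4; ++k) {
            cudaStreamSynchronize(s[k]);
            cudaStreamDestroy(s[k]);
        }
    }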

SLIDE 40

Experiments and results » Intra-GPU interference

Results: execution time + jitter

[Bar chart: average scenario execution time (2–6 ms) and jitter for five scenarios – Legacy: 1 kernel (jitter 1.84%), Legacy: 4 kernels (6.47%), Tiled: 4 kernels, no scheduler (1.47%), Tiled: 4 kernels, 1300 ns offset (0.15%), Tiled: 4 kernels, 1400 ns offset (0.04%).]

[Line chart: average scenario execution time (2.5–3.5 ms, with the single-kernel baseline) and jitter (0–1.5%) as a function of tile offset (500–1700 ns).]

SLIDE 42

Experiments and results » Interference characterization

Interference between prefetch and compute phases

[Bar chart: prefetch and compute phase execution times (up to ~4000 ns) and their jitter relative to the average phase execution time for increasing tile offsets (0–3 µs); prefetch jitter falls from roughly 82% with no offset to about 40% at the largest offset, compute jitter from roughly 25% to about 13%.]

▶ Less overlap of prefetch phases ⇒ shorter execution time and smaller jitter
▶ Compute phases interfere with each other (shared memory bank conflicts) ⇒ prevents straightforward application of PREM

SLIDE 45

Experiments and results » Interference characterization

Interference between writeback phases

[Bar chart: writeback phase execution time (500–1500 ns) and jitter relative to the average phase execution time for increasing tile offsets (0–1 µs); jitter falls from roughly 157–190% at the smallest offsets to about 54% at the largest.]

▶ Less overlap of writeback phases ⇒ shorter execution time and smaller jitter

SLIDE 47

Experiments and results » Interference characterization

Conclusion

▶ Time-triggered scheduling on the TX2 GPU is possible
▶ The GPU globaltimer register has sufficient resolution (160 ns) after running nvprof
▶ Even very simple scheduling (adding offsets) shows potential to reduce execution-time jitter

SLIDE 50

Future work

Future work: Interference-aware scheduling of complex GPU workloads

[Heat maps, traditional memory model: for ten elementary kernels (sqrt norm, sqrt mag, conj, sum ch, mul, div, add, mul c, add c, elemmul) run against 0–3 instances of each other kernel – average execution time in µs (roughly 3–110 µs, growing with the number of co-runners) and jitter relative to average execution time (roughly 0.3–21%).]