Enabling predictable parallelism in single-GPU systems with persistent CUDA threads

Paolo Burgio
University of Modena, Italy
paolo.burgio@unimore.it

Toulouse, 6 July 2016

SLIDE 2

GP-GPUs / General Purpose GPUs

 Born for graphics, subsequently used for general-purpose computation

 Massively parallel architectures

 Baseline for the next generation of power-efficient embedded devices

 Tremendous performance/Watt

 Growing interest also in automotive and avionics

 Still, not adoptable in (real-time) industrial settings

SLIDE 3

Why not real-time GPUs?

 Complex architecture hampers analyzability

 Poor predictability

 Non-openness of drivers, firmware...

 Hard to do research

 Typically, the GPU is treated as a "black box"

 Atomic shared resource

 Hard to extract timing guarantees

SLIDE 4

LightKer

 Expose the GPU architecture at the application level

 Host-accelerator architecture

 Clusters of cores

 Non-Uniform Memory Access (NUMA) system

 Same as modern accelerators

 Pure software approach

 No additional hardware!

[Figure: host-accelerator block diagram: host cores with L1 local mem/cache and an L2 (global) mem/cache; the GPU organized as clusters of cores, each cluster with its own local MEM]

SLIDE 5

Persistent GPU threads

 Run at user level

 Pinned to cores

 Continuously spin-wait for work to execute

1 CUDA thread ↔ 1 GPU core
1 CUDA block ↔ 1 GPU cluster
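The persistent-threads scheme above can be sketched as a CUDA kernel. This is an illustrative sketch, not LightKer's actual source: the `to_gpu`/`from_gpu` flag arrays and the command encoding are assumptions.

```cuda
#include <cuda_runtime.h>

#define WORK 1   // host has posted work for this cluster
#define HALT 2   // host asks the persistent kernel to terminate

// One CUDA block is pinned to each cluster/SM; thread 0 of the block
// acts as the master and spin-waits on a per-cluster mailbox flag
// written by the host.
__global__ void persistent_kernel(volatile int *to_gpu,
                                  volatile int *from_gpu,
                                  float *data)
{
    __shared__ int cmd;
    const int cluster = blockIdx.x;   // 1 CUDA block <-> 1 GPU cluster

    for (;;) {
        if (threadIdx.x == 0) {
            // Master thread spin-waits for the host trigger
            while ((cmd = to_gpu[cluster]) == 0) { /* busy wait */ }
        }
        __syncthreads();              // broadcast cmd to the whole block
        if (cmd == HALT) break;

        // ...the actual work: 1 CUDA thread <-> 1 GPU core...
        data[cluster * blockDim.x + threadIdx.x] += 1.0f;

        __syncthreads();
        if (threadIdx.x == 0) {
            to_gpu[cluster]   = 0;    // consume the request
            from_gpu[cluster] = 1;    // signal completion to the host
        }
    }
}
```

Because the kernel never returns between jobs, dispatching work costs only a mailbox write instead of a full kernel launch.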

SLIDE 6

Host-to-device communication

 Lock-free mailbox

 1 mailbox item for each cluster

 Clusters exposed at the application level

 Master thread for each cluster

[Figure: per-cluster mailbox between GPU MEM and host MEM: the host writes to_GPU and reads from_GPU, while the master core of each cluster reads to_GPU and writes from_GPU; the exchange is triggered by the host]
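A minimal host-side view of this mailbox might look as follows. This is a sketch assuming zero-copy (host-mapped) memory; the flag names, values, and cluster count are hypothetical.

```cuda
#include <cuda_runtime.h>

#define WORK 1

int main(void)
{
    const int nclusters = 16;      // e.g. one mailbox slot per cluster/SM
    volatile int *to_gpu, *from_gpu;

    // Host-mapped (zero-copy) memory so host and GPU poll the same flags
    cudaHostAlloc((void **)&to_gpu,   nclusters * sizeof(int),
                  cudaHostAllocMapped);
    cudaHostAlloc((void **)&from_gpu, nclusters * sizeof(int),
                  cudaHostAllocMapped);
    for (int c = 0; c < nclusters; c++) { to_gpu[c] = 0; from_gpu[c] = 0; }

    // ...launch the persistent kernel once, passing the device views...

    // Trigger cluster 3 and wait for its answer (the Trigger/Wait phases)
    int c = 3;
    from_gpu[c] = 0;
    __sync_synchronize();          // fence: the data must be visible first
    to_gpu[c] = WORK;              // a single write is the lock-free trigger
    while (from_gpu[c] == 0) { }   // spin until the master core answers

    cudaFreeHost((void *)to_gpu);
    cudaFreeHost((void *)from_gpu);
    return 0;
}
```

Since each slot has exactly one writer and one reader in each direction, plain flag writes suffice and no lock is needed.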

SLIDE 7

LK vs traditional execution model

 LK execution split in

 Init, { Copyin, Trigger, Wait, Copyout }, Dispose

 "Traditional" GPU kernel

 { Alloc, Copyin, Launch, Wait, Copyout, Dispose }

 Testbench

 NVIDIA GTX 980

 2048 CUDA cores, 16 clusters
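The two call sequences can be contrasted as follows. The `lk_*` names are hypothetical placeholders standing in for LightKer's API, not its real entry points.

```cuda
#include <cuda_runtime.h>

__global__ void kernel(float *d_buf) { /* ... */ }

// Hypothetical LightKer-style API (declarations only, for illustration)
void lk_init(void);
void lk_copyin(float *buf, size_t size);
void lk_trigger(void);
void lk_wait(void);
void lk_copyout(float *buf, size_t size);
void lk_dispose(void);

// "Traditional" model: every job pays the full sequence
// { Alloc, Copyin, Launch, Wait, Copyout, Dispose }
void traditional_job(float *h_buf, size_t size)
{
    float *d_buf;
    cudaMalloc(&d_buf, size);                                 // Alloc
    cudaMemcpy(d_buf, h_buf, size, cudaMemcpyHostToDevice);   // Copyin
    kernel<<<16, 128>>>(d_buf);                               // Launch
    cudaDeviceSynchronize();                                  // Wait
    cudaMemcpy(h_buf, d_buf, size, cudaMemcpyDeviceToHost);   // Copyout
    cudaFree(d_buf);                                          // Dispose
}

// LK model: Init/Dispose are paid once; each job is only
// { Copyin, Trigger, Wait, Copyout }
void lk_jobs(float **h_jobs, int njobs, size_t size)
{
    lk_init();                   // spawn the persistent kernel once
    for (int i = 0; i < njobs; i++) {
        lk_copyin(h_jobs[i], size);
        lk_trigger();            // mailbox write: no kernel launch
        lk_wait();               // spin on the from_GPU flag
        lk_copyout(h_jobs[i], size);
    }
    lk_dispose();                // tell the persistent threads to exit
}
```

Moving the launch cost out of the per-job path is what makes the Trigger phase so much cheaper than a CUDA kernel spawn in the validation numbers below.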

SLIDE 8

Validation

 Synthetic benchmark

 Copyin/out not yet considered

 Trigger phase is 1000x faster

 Synch/Wait is comparable

Measured costs:

             LK Init    LK Trigger    LK Wait    LK Dispose
Single SM    509M       239           190k       30M
Full GPU     503M       210           190k       30M

             CUDA Alloc    CUDA Spawn    CUDA Wait    CUDA Dispose
Single SM    496M          3.9k          175k         274k
Full GPU     497M          3.8k          176k         247k

SLIDE 9

Try it!

 LightKernel v0.2

 Open source

 http://hipert.mat.unimore.it/LightKer/

 ...and visit our poster 

This Project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement: 688860.