Enabling predictable parallelism in single-GPU systems with persistent CUDA threads

Paolo Burgio
University of Modena, Italy
paolo.burgio@unimore.it

Toulouse, 6 July 2016

SLIDE 2

GP-GPUs / General Purpose GPUs

 Born for graphics, subsequently used for general-purpose computation

 Massively parallel architectures

 Baseline for the next generation of power-efficient embedded devices

 Tremendous performance/Watt

 Growing interest also in automotive and avionics

 Still, not adoptable in (real-time) industrial settings

SLIDE 3

Why not real-time GPUs?

 Complex architecture hampers analyzability

 Poor predictability

 Non-openness of drivers, firmware...

 Hard to do research

 Typically, the GPU is treated as a "black box"

 Atomic shared resource

 Hard to extract timing guarantees

SLIDE 4

LightKer

 Expose the GPU architecture at the application level

 Host-accelerator architecture

 Clusters of cores

 Non-Uniform Memory Access (NUMA) system

 Same as modern accelerators

 Pure software approach

 No additional hardware!

[Figure: host-accelerator block diagram: host cores with L1 local mem/cache and an L2 (global) mem/cache; the GPU organized as clusters of cores, each cluster with its own local MEM]

SLIDE 5

Persistent GPU threads

 Run at user level

 Pinned to cores

 Continuously spin-wait for work to execute

1 CUDA thread ↔ 1 GPU core
1 CUDA block ↔ 1 GPU cluster
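The persistent-threads scheme above can be sketched as a CUDA kernel. This is an illustrative sketch, not LightKer's actual source: the `to_gpu`/`from_gpu` flag arrays and the command encoding are assumptions.

```cuda
#include <cuda_runtime.h>

#define WORK 1   // host has posted work for this cluster
#define HALT 2   // host asks the persistent kernel to terminate

// One CUDA block is pinned to each cluster/SM; thread 0 of the block
// acts as the master and spin-waits on a per-cluster mailbox flag
// written by the host.
__global__ void persistent_kernel(volatile int *to_gpu,
                                  volatile int *from_gpu,
                                  float *data)
{
    __shared__ int cmd;
    const int cluster = blockIdx.x;   // 1 CUDA block <-> 1 GPU cluster

    for (;;) {
        if (threadIdx.x == 0) {
            // Master thread spin-waits for the host trigger
            while ((cmd = to_gpu[cluster]) == 0) { /* busy wait */ }
        }
        __syncthreads();              // broadcast cmd to the whole block
        if (cmd == HALT) break;

        // ...the actual work: 1 CUDA thread <-> 1 GPU core...
        data[cluster * blockDim.x + threadIdx.x] += 1.0f;

        __syncthreads();
        if (threadIdx.x == 0) {
            to_gpu[cluster]   = 0;    // consume the request
            from_gpu[cluster] = 1;    // signal completion to the host
        }
    }
}
```

Because the kernel never returns between jobs, dispatching work costs only a mailbox write instead of a full kernel launch.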

SLIDE 6

Host-to-device communication

 Lock-free mailbox

 1 mailbox item for each cluster

 Clusters exposed at the application level

 Master thread for each cluster

[Figure: per-cluster mailbox between GPU MEM and host MEM: the host writes to_GPU and reads from_GPU, while the master core of each cluster reads to_GPU and writes from_GPU; the exchange is triggered by the host]
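A minimal host-side view of this mailbox might look as follows. This is a sketch assuming zero-copy (host-mapped) memory; the flag names, values, and cluster count are hypothetical.

```cuda
#include <cuda_runtime.h>

#define WORK 1

int main(void)
{
    const int nclusters = 16;      // e.g. one mailbox slot per cluster/SM
    volatile int *to_gpu, *from_gpu;

    // Host-mapped (zero-copy) memory so host and GPU poll the same flags
    cudaHostAlloc((void **)&to_gpu,   nclusters * sizeof(int),
                  cudaHostAllocMapped);
    cudaHostAlloc((void **)&from_gpu, nclusters * sizeof(int),
                  cudaHostAllocMapped);
    for (int c = 0; c < nclusters; c++) { to_gpu[c] = 0; from_gpu[c] = 0; }

    // ...launch the persistent kernel once, passing the device views...

    // Trigger cluster 3 and wait for its answer (the Trigger/Wait phases)
    int c = 3;
    from_gpu[c] = 0;
    __sync_synchronize();          // fence: the data must be visible first
    to_gpu[c] = WORK;              // a single write is the lock-free trigger
    while (from_gpu[c] == 0) { }   // spin until the master core answers

    cudaFreeHost((void *)to_gpu);
    cudaFreeHost((void *)from_gpu);
    return 0;
}
```

Since each slot has exactly one writer and one reader in each direction, plain flag writes suffice and no lock is needed.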

SLIDE 7

LK vs traditional execution model

 LK execution split in

 Init, { Copyin, Trigger, Wait, Copyout }, Dispose

 "Traditional" GPU kernel

 { Alloc, Copyin, Launch, Wait, Copyout, Dispose }

 Testbench

 NVIDIA GTX 980

 2048 CUDA cores, 16 clusters
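The two call sequences can be contrasted as follows. The `lk_*` names are hypothetical placeholders standing in for LightKer's API, not its real entry points.

```cuda
#include <cuda_runtime.h>

__global__ void kernel(float *d_buf) { /* ... */ }

// Hypothetical LightKer-style API (declarations only, for illustration)
void lk_init(void);
void lk_copyin(float *buf, size_t size);
void lk_trigger(void);
void lk_wait(void);
void lk_copyout(float *buf, size_t size);
void lk_dispose(void);

// "Traditional" model: every job pays the full sequence
// { Alloc, Copyin, Launch, Wait, Copyout, Dispose }
void traditional_job(float *h_buf, size_t size)
{
    float *d_buf;
    cudaMalloc(&d_buf, size);                                 // Alloc
    cudaMemcpy(d_buf, h_buf, size, cudaMemcpyHostToDevice);   // Copyin
    kernel<<<16, 128>>>(d_buf);                               // Launch
    cudaDeviceSynchronize();                                  // Wait
    cudaMemcpy(h_buf, d_buf, size, cudaMemcpyDeviceToHost);   // Copyout
    cudaFree(d_buf);                                          // Dispose
}

// LK model: Init/Dispose are paid once; each job is only
// { Copyin, Trigger, Wait, Copyout }
void lk_jobs(float **h_jobs, int njobs, size_t size)
{
    lk_init();                   // spawn the persistent kernel once
    for (int i = 0; i < njobs; i++) {
        lk_copyin(h_jobs[i], size);
        lk_trigger();            // mailbox write: no kernel launch
        lk_wait();               // spin on the from_GPU flag
        lk_copyout(h_jobs[i], size);
    }
    lk_dispose();                // tell the persistent threads to exit
}
```

Moving the launch cost out of the per-job path is what makes the Trigger phase so much cheaper than a CUDA kernel spawn in the validation numbers below.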

SLIDE 8

Validation

 Synthetic benchmark

 Copyin/out not yet considered

 Trigger phase is 1000x faster

 Synch/Wait is comparable

Measured costs:

             LK Init    LK Trigger    LK Wait    LK Dispose
Single SM    509M       239           190k       30M
Full GPU     503M       210           190k       30M

             CUDA Alloc    CUDA Spawn    CUDA Wait    CUDA Dispose
Single SM    496M          3.9k          175k         274k
Full GPU     497M          3.8k          176k         247k

SLIDE 9

Try it!

 LightKernel v0.2

 Open source

 http://hipert.mat.unimore.it/LightKer/

 ...and visit our poster 

This Project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement: 688860.