A GPU Run-Time for Event-Driven Task Parallelism




  1. A GPU Run-Time for Event-Driven Task Parallelism
  Reservoir Labs, Inc. R-Stream Team: Athanasios Konstantinidis, Benoit Meister, Muthu Baskaran, Tom Henretty, Benoit Pradelle, Tahina Ramananandro, Sanket Tavargeri, Ann Johnson, Richard Lethin
  Reservoir Labs, 2.3.15

  2. GPU Programming with CUDA
  • Massive data parallelism is required
  • Hides global memory access latency
  • What if our program is not data-parallel?
  [Figure: dependence graph (DAG) of an SPMD computation]

  3. GPU Programming with CUDA
  • Massive data parallelism is required
  • Hides global memory access latency
  • What if our program is not data-parallel?
  • We find synchronous chunks of data-parallel computations, i.e., wavefronts
  [Figure: dependence graph (DAG) of an SPMD computation; global synchronization overhead from repeated kernel invocations]
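The wavefront approach above can be sketched on the CPU. In this toy (the function name `wavefronts` and the example DAG are illustrative, not from the talk), each level gathers every currently-ready task; on a GPU, each level would become one data-parallel kernel launch, and the gap between launches is the repeated global synchronization the slide refers to.

```cpp
#include <vector>

// Wavefront ("synchronous chunk") scheduling of a task DAG, CPU sketch.
// succ[i] lists the successors of task i.
std::vector<std::vector<int>> wavefronts(std::vector<std::vector<int>> succ) {
    std::vector<int> indeg(succ.size(), 0);
    for (const auto& s : succ)
        for (int v : s) indeg[v]++;

    std::vector<std::vector<int>> levels;
    std::vector<int> ready;
    for (int i = 0; i < (int)succ.size(); ++i)
        if (indeg[i] == 0) ready.push_back(i);

    while (!ready.empty()) {
        levels.push_back(ready);      // one kernel launch per level,
        std::vector<int> next;        // with an implicit global barrier here
        for (int t : ready)
            for (int v : succ[t])
                if (--indeg[v] == 0) next.push_back(v);
        ready = next;
    }
    return levels;
}
```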

  4. A GPU Run-Time for Task Parallelism
  • Implements an Event-Driven Tasks (EDT) execution model
  • A single persistent GPU kernel executes the entire DAG (manages thread-block-level parallelism)
  • On-the-fly dependence resolution
  • Light-weight synchronization based on atomics
  • Work-stealing for load-balancing
  [Figure: dependence graph (DAG) of an SPMD computation; nodes are tasks, edges are light-weight atomic synchronizations (events)]
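A minimal CPU sketch of the event-driven alternative (illustrative only; the actual run-time is a CUDA persistent kernel with many concurrent workers): a single loop executes the whole DAG with no per-wavefront relaunch, and each finished task resolves its successors' dependences on the fly.

```cpp
#include <queue>
#include <vector>

struct Task {
    std::vector<int> succ;  // successor task ids
    int dcount;             // number of unsatisfied predecessors
};

// One persistent loop executes the entire DAG: finishing a task decrements
// its successors' dependence counters and enqueues any that reach zero.
std::vector<int> run_dag(std::vector<Task> tasks) {
    std::queue<int> work;
    for (int i = 0; i < (int)tasks.size(); ++i)
        if (tasks[i].dcount == 0) work.push(i);    // initially active tasks

    std::vector<int> order;                        // stands in for real work
    while (!work.empty()) {
        int t = work.front(); work.pop();
        order.push_back(t);                        // "execute" task t
        for (int s : tasks[t].succ)                // epilogue: resolve deps
            if (--tasks[s].dcount == 0) work.push(s);
    }
    return order;
}
```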

  5. Dependence Resolution: Event-Driven Tasks (EDTs)
  • Dependence counters: each task has a dependence counter (dcount)
  • After a task completes, it decrements its successors' dcounts
  • A task becomes active when its dcount reaches zero
  [Figure: an active task (dcount = 0) sends events to two inactive tasks (dcount = 1 and dcount = 2)]
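The event primitive itself can be sketched as an atomic decrement; here C++ `std::atomic` stands in for a CUDA atomic subtraction, and the function name is illustrative. The key property: exactly one signaller observes the transition to zero and is responsible for activating the task.

```cpp
#include <atomic>

// Signal one successor of a completed task by atomically decrementing its
// dependence counter. Returns true iff this call drove the counter to zero,
// i.e. this caller must activate (enqueue) the successor.
bool signal_successor(std::atomic<int>& dcount) {
    // fetch_sub returns the previous value, so "previous == 1" means the
    // counter is now zero and the task has become active.
    return dcount.fetch_sub(1, std::memory_order_acq_rel) == 1;
}
```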

  6. Run-Time Architecture
  [Figure: task meta-data feeds per-worker work queues (connected by work stealing); each queue is served by a thread block executing codelets]

  7. Run-Time Architecture
  • Defines the persistent GPU kernel
  [Figure: architecture diagram as before]

  8. Run-Time Architecture
  • Task meta-data holds the task parameters: dependence counters, codelet type, integer vectors
  [Figure: architecture diagram as before]

  9. Run-Time Architecture
  • Task parameters: dependence counters, codelet type, integer vectors
  • Codelet structure: prologue, computation, epilogue
  [Figure: architecture diagram as before]

  10. Run-Time Architecture
  • Codelet stages: the prologue unpacks the task parameters, the computation runs, and the epilogue performs dependence resolution
  [Figure: architecture diagram as before]

  11. Run-Time Architecture
  • Work queues reside in global memory and expose Put and Get operations
  [Figure: architecture diagram as before]
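A sketch of a Put/Get work queue over a shared array with atomic indices (the type, its layout, and the capacity are assumptions, not the published design). This minimal version assumes the queue never overflows and that a slot is fully written before it is consumed; a real GPU implementation also needs memory fences for the latter.

```cpp
#include <atomic>

// Fixed-capacity work queue in shared ("global") memory with atomic
// Put and Get indices. Illustrative sketch only.
struct WorkQueue {
    static const int CAP = 1024;
    int buf[CAP];
    std::atomic<int> tail{0};   // next free slot (Put side)
    std::atomic<int> head{0};   // next task to hand out (Get side)

    // Put: claim a slot with an atomic increment, then publish the task.
    void put(int task) { buf[tail.fetch_add(1) % CAP] = task; }

    // Get: claim the next task with a CAS; returns false when the queue
    // looks empty (the caller may then try to steal from another queue).
    bool get(int& task) {
        int h = head.load();
        while (h < tail.load()) {
            if (head.compare_exchange_weak(h, h + 1)) {
                task = buf[h % CAP];
                return true;
            }
        }
        return false;
    }
};
```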

  12. Run-Time Architecture
  • Workers: any number can be launched; each terminates after a maximum number of stealing rounds; agnostic to the intra-thread-block configuration
  [Figure: architecture diagram as before]
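The "max stealing rounds" knob can be sketched as a worker loop (names and details are illustrative, not the authors' implementation): a worker first drains its own queue, then tries to steal from the other workers' queues; a pass that finds no work anywhere counts as one stealing round, and the worker exits after `max_rounds` such passes.

```cpp
#include <deque>
#include <vector>

// Single worker with bounded stealing, CPU sketch. Returns the number of
// tasks it executed. queues[self] is its own queue; others are steal victims.
int worker_loop(std::vector<std::deque<int>>& queues, int self, int max_rounds) {
    int executed = 0;
    int rounds = 0;
    while (rounds < max_rounds) {
        int task = -1;
        if (!queues[self].empty()) {                 // own queue first
            task = queues[self].front();
            queues[self].pop_front();
        } else {
            for (int v = 0; v < (int)queues.size() && task < 0; ++v)
                if (v != self && !queues[v].empty()) {
                    task = queues[v].back();         // steal from the tail
                    queues[v].pop_back();
                }
        }
        if (task < 0) { ++rounds; continue; }        // empty pass: count it
        ++executed;                                  // "execute" the task
        rounds = 0;                                  // found work: reset
    }
    return executed;
}
```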

  13. Experimental Evaluation
  • Simple stencil programs from the PolyBench suite: Jacobi-2D 5pt, FDTD-2D, ADI
  • Compared against the best known wavefront implementations (Konstantinidis et al., LCPC 2013)
  • Rectangular parametric tiling is applied, for run-time tile-size exploration
  [Figure: a rectangular tile mapped to a thread block; task parallelism across tiles]

  14. Experimental Evaluation
  • NVIDIA GTX 670: Compute Capability 3.0, Driver/Runtime version 6.5, 2 GB global memory, 7 multiprocessors, ECC off

  15. Experimental Evaluation

  16. Experimental Evaluation: Jacobi-2D 5pt, execution timelines
  [Figure: worker-activity timelines (worker vs. time) for 23 workers and 16 workers]

  17. Experimental Evaluation: Jacobi-2D 5pt, execution timelines
  [Figure: worker-activity timelines for 10, 16, and 23 workers]

  18. Experimental Evaluation: Jacobi-2D 5pt, execution timelines
  [Figure: worker-activity timelines (worker vs. time) for 30 workers and 23 workers]

  19. Experimental Evaluation: Jacobi-2D 5pt, execution timelines
  [Figure: timeline for 30 workers; 23 are active, 7 are redundant]

  20. Experimental Evaluation: FDTD-2D, execution timelines
  [Figure: worker-activity timelines (worker vs. time) for 33 workers and 22 workers]

  21. Experimental Evaluation: ADI, execution timelines
  [Figure: worker-activity timelines (worker vs. time) for 33 workers and 22 workers]

  22. Conclusions
  • Effective task parallelism with on-the-fly dependence resolution
  • A single persistent GPU kernel avoids global synchronization overhead
  • Evaluated against wavefront parallelism on stencil computations

  23. The End. Questions?
