A GPU Run-Time for Event-Driven Task Parallelism




  1. A GPU Run-Time for Event-Driven Task Parallelism
  Reservoir Labs, Inc. R-Stream Team: Athanasios Konstantinidis, Benoit Meister, Muthu Baskaran, Tom Henretty, Benoit Pradelle, Tahina Ramananandro, Sanket Tavargeri, Ann Johnson, Richard Lethin
  Reservoir Labs, 2.3.15

  2. GPU Programming with CUDA
  • Massive data parallelism is required
  • Hides global memory access latency
  • What if our program is not data-parallel?
  [Figure: dependence graph (DAG) of an SPMD computation]

  3. GPU Programming with CUDA
  • Massive data parallelism is required
  • Hides global memory access latency
  • What if our program is not data-parallel?
  • We find synchronous chunks of data-parallel computations, i.e., wavefronts
  [Figure: dependence graph (DAG) of an SPMD computation; global synchronization overhead from repeated kernel invocations]
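The wavefront approach above can be sketched on the CPU. In this toy (the function name `wavefronts` and the example DAG are illustrative, not from the talk), each level gathers every currently-ready task; on a GPU, each level would become one data-parallel kernel launch, and the gap between launches is the repeated global synchronization the slide refers to.

```cpp
#include <vector>

// Wavefront ("synchronous chunk") scheduling of a task DAG, CPU sketch.
// succ[i] lists the successors of task i.
std::vector<std::vector<int>> wavefronts(std::vector<std::vector<int>> succ) {
    std::vector<int> indeg(succ.size(), 0);
    for (const auto& s : succ)
        for (int v : s) indeg[v]++;

    std::vector<std::vector<int>> levels;
    std::vector<int> ready;
    for (int i = 0; i < (int)succ.size(); ++i)
        if (indeg[i] == 0) ready.push_back(i);

    while (!ready.empty()) {
        levels.push_back(ready);      // one kernel launch per level,
        std::vector<int> next;        // with an implicit global barrier here
        for (int t : ready)
            for (int v : succ[t])
                if (--indeg[v] == 0) next.push_back(v);
        ready = next;
    }
    return levels;
}
```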

  4. A GPU Run-Time for Task Parallelism
  • Implements an Event-Driven Tasks (EDT) execution model
  • A single persistent GPU kernel executes the entire DAG (manages thread-block-level parallelism)
  • On-the-fly dependence resolution
  • Light-weight synchronization based on atomics
  • Work-stealing for load-balancing
  [Figure: dependence graph (DAG) of an SPMD computation; nodes are tasks, edges are light-weight atomic synchronizations (events)]
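A minimal CPU sketch of the event-driven alternative (illustrative only; the actual run-time is a CUDA persistent kernel with many concurrent workers): a single loop executes the whole DAG with no per-wavefront relaunch, and each finished task resolves its successors' dependences on the fly.

```cpp
#include <queue>
#include <vector>

struct Task {
    std::vector<int> succ;  // successor task ids
    int dcount;             // number of unsatisfied predecessors
};

// One persistent loop executes the entire DAG: finishing a task decrements
// its successors' dependence counters and enqueues any that reach zero.
std::vector<int> run_dag(std::vector<Task> tasks) {
    std::queue<int> work;
    for (int i = 0; i < (int)tasks.size(); ++i)
        if (tasks[i].dcount == 0) work.push(i);    // initially active tasks

    std::vector<int> order;                        // stands in for real work
    while (!work.empty()) {
        int t = work.front(); work.pop();
        order.push_back(t);                        // "execute" task t
        for (int s : tasks[t].succ)                // epilogue: resolve deps
            if (--tasks[s].dcount == 0) work.push(s);
    }
    return order;
}
```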

  5. Dependence Resolution: Event-Driven Tasks (EDTs)
  • Dependence counters: each task has a dependence counter (dcount)
  • After a task completes, it decrements its successors' dcounts
  • A task becomes active when its dcount reaches zero
  [Figure: an active task (dcount = 0) sends events to two inactive tasks (dcount = 1 and dcount = 2)]
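The event primitive itself can be sketched as an atomic decrement; here C++ `std::atomic` stands in for a CUDA atomic subtraction, and the function name is illustrative. The key property: exactly one signaller observes the transition to zero and is responsible for activating the task.

```cpp
#include <atomic>

// Signal one successor of a completed task by atomically decrementing its
// dependence counter. Returns true iff this call drove the counter to zero,
// i.e. this caller must activate (enqueue) the successor.
bool signal_successor(std::atomic<int>& dcount) {
    // fetch_sub returns the previous value, so "previous == 1" means the
    // counter is now zero and the task has become active.
    return dcount.fetch_sub(1, std::memory_order_acq_rel) == 1;
}
```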

  6. Run-Time Architecture
  [Figure: task meta-data feeds per-worker work queues (connected by work stealing); each queue is served by a thread block executing codelets]

  7. Run-Time Architecture
  • Defines the persistent GPU kernel
  [Figure: architecture diagram as before]

  8. Run-Time Architecture
  • Task meta-data holds the task parameters: dependence counters, codelet type, integer vectors
  [Figure: architecture diagram as before]

  9. Run-Time Architecture
  • Task parameters: dependence counters, codelet type, integer vectors
  • Codelet structure: prologue, computation, epilogue
  [Figure: architecture diagram as before]

  10. Run-Time Architecture
  • Codelet stages: the prologue unpacks the task parameters, the computation runs, and the epilogue performs dependence resolution
  [Figure: architecture diagram as before]

  11. Run-Time Architecture
  • Work queues reside in global memory and expose Put and Get operations
  [Figure: architecture diagram as before]
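A sketch of a Put/Get work queue over a shared array with atomic indices (the type, its layout, and the capacity are assumptions, not the published design). This minimal version assumes the queue never overflows and that a slot is fully written before it is consumed; a real GPU implementation also needs memory fences for the latter.

```cpp
#include <atomic>

// Fixed-capacity work queue in shared ("global") memory with atomic
// Put and Get indices. Illustrative sketch only.
struct WorkQueue {
    static const int CAP = 1024;
    int buf[CAP];
    std::atomic<int> tail{0};   // next free slot (Put side)
    std::atomic<int> head{0};   // next task to hand out (Get side)

    // Put: claim a slot with an atomic increment, then publish the task.
    void put(int task) { buf[tail.fetch_add(1) % CAP] = task; }

    // Get: claim the next task with a CAS; returns false when the queue
    // looks empty (the caller may then try to steal from another queue).
    bool get(int& task) {
        int h = head.load();
        while (h < tail.load()) {
            if (head.compare_exchange_weak(h, h + 1)) {
                task = buf[h % CAP];
                return true;
            }
        }
        return false;
    }
};
```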

  12. Run-Time Architecture
  • Workers: any number can be launched; each terminates after a maximum number of stealing rounds; agnostic to the intra-thread-block configuration
  [Figure: architecture diagram as before]
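The "max stealing rounds" knob can be sketched as a worker loop (names and details are illustrative, not the authors' implementation): a worker first drains its own queue, then tries to steal from the other workers' queues; a pass that finds no work anywhere counts as one stealing round, and the worker exits after `max_rounds` such passes.

```cpp
#include <deque>
#include <vector>

// Single worker with bounded stealing, CPU sketch. Returns the number of
// tasks it executed. queues[self] is its own queue; others are steal victims.
int worker_loop(std::vector<std::deque<int>>& queues, int self, int max_rounds) {
    int executed = 0;
    int rounds = 0;
    while (rounds < max_rounds) {
        int task = -1;
        if (!queues[self].empty()) {                 // own queue first
            task = queues[self].front();
            queues[self].pop_front();
        } else {
            for (int v = 0; v < (int)queues.size() && task < 0; ++v)
                if (v != self && !queues[v].empty()) {
                    task = queues[v].back();         // steal from the tail
                    queues[v].pop_back();
                }
        }
        if (task < 0) { ++rounds; continue; }        // empty pass: count it
        ++executed;                                  // "execute" the task
        rounds = 0;                                  // found work: reset
    }
    return executed;
}
```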

  13. Experimental Evaluation
  • Simple stencil programs from the PolyBench suite: Jacobi-2D 5pt, FDTD-2D, ADI
  • Compared against the best known wavefront implementations (Konstantinidis et al., LCPC 2013)
  • Rectangular parametric tiling is applied, for run-time tile-size exploration
  [Figure: a rectangular tile mapped to a thread block; task parallelism across tiles]

  14. Experimental Evaluation
  • NVIDIA GTX 670: Compute Capability 3.0, Driver/Runtime version 6.5, 2 GB global memory, 7 multiprocessors, ECC off

  15. Experimental Evaluation

  16. Experimental Evaluation: Jacobi-2D 5pt, execution timelines
  [Figure: worker-activity timelines (worker vs. time) for 23 workers and 16 workers]

  17. Experimental Evaluation: Jacobi-2D 5pt, execution timelines
  [Figure: worker-activity timelines for 10, 16, and 23 workers]

  18. Experimental Evaluation: Jacobi-2D 5pt, execution timelines
  [Figure: worker-activity timelines (worker vs. time) for 30 workers and 23 workers]

  19. Experimental Evaluation: Jacobi-2D 5pt, execution timelines
  [Figure: timeline for 30 workers; 23 are active, 7 are redundant]

  20. Experimental Evaluation: FDTD-2D, execution timelines
  [Figure: worker-activity timelines (worker vs. time) for 33 workers and 22 workers]

  21. Experimental Evaluation: ADI, execution timelines
  [Figure: worker-activity timelines (worker vs. time) for 33 workers and 22 workers]

  22. Conclusions
  • Effective task parallelism with on-the-fly dependence resolution
  • A single persistent GPU kernel avoids global synchronization overhead
  • Evaluated against wavefront parallelism on stencil computations

  23. The End. Questions?
