A GPU Run-Time for Event-Driven Task Parallelism

SLIDE 1

A GPU Run-Time for Event-Driven Task Parallelism

Reservoir Labs, Inc.

2.3.15

R-Stream Team: Athanasios Konstantinidis, Benoit Meister, Muthu Baskaran, Tom Henretty, Benoit Pradelle, Tahina Ramananandro, Sanket Tavargeri, Ann Johnson, Richard Lethin

SLIDE 2

GPU Programming with CUDA

  • Massive data parallelism is required
  • Hides global memory access latency
  • What if our program is not data-parallel?

[Figure: dependence graph (DAG) of an SPMD computation]

SLIDE 3

GPU Programming with CUDA

  • Massive data parallelism is required
  • Hides global memory access latency
  • What if our program is not data-parallel?
  • We find synchronous chunks of data-parallel computations, i.e., wavefronts

[Figure: dependence graph (DAG) of an SPMD computation, with global synchronization between wavefronts]

Global synchronization

  • Overhead from repeated kernel invocations
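The wavefront approach described on this slide can be sketched as a level-by-level traversal of the dependence graph. This is a minimal host-side illustration in Python, not the presenters' implementation: the function name `wavefronts` and the example graph are assumptions made for the sketch. Each returned wavefront would correspond to one kernel invocation followed by a global synchronization.

```python
from collections import defaultdict


def wavefronts(num_tasks, edges):
    """Group the nodes of a DAG into wavefronts (hypothetical helper).

    A task is placed in the wavefront right after the one holding its last
    predecessor, so all tasks in a wavefront are mutually independent.
    """
    succs = defaultdict(list)
    indeg = [0] * num_tasks
    for src, dst in edges:
        succs[src].append(dst)
        indeg[dst] += 1

    frontier = [t for t in range(num_tasks) if indeg[t] == 0]
    fronts = []
    while frontier:
        fronts.append(sorted(frontier))
        nxt = []
        for t in frontier:
            for s in succs[t]:
                indeg[s] -= 1
                if indeg[s] == 0:  # all predecessors are in earlier wavefronts
                    nxt.append(s)
        frontier = nxt
    return fronts


# Assumed example graph: 0 -> {1, 2} -> 3 -> 4
EXAMPLE_EDGES = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
fronts = wavefronts(5, EXAMPLE_EDGES)
print(len(fronts), "kernel launches:", fronts)
```

In this sketch the number of wavefronts equals the number of kernel launches, which is where the repeated-invocation overhead comes from.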

SLIDE 4

A GPU Run-Time for Task Parallelism

  • Implements an Event-Driven Tasks execution model
  • A single persistent GPU kernel executes the entire DAG (manages thread-block-level parallelism)
  • On-the-fly dependence resolution
  • Light-weight synchronization based on atomics
  • Work-stealing for load-balancing

[Figure: dependence graph (DAG) of an SPMD computation. Legend: task; light-weight atomic synchronization (event)]

SLIDE 5

Dependence Resolution – Event-Driven Tasks (EDTs)

  • Dependence counters
  • Each task has a dependence counter (dcount)
  • After task completion, decrement successors’ dcount
  • Task becomes active if dcount becomes zero

[Figure: tasks connected by events: one active task with dcount(0) and two inactive tasks with dcount(1) and dcount(2)]
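The dependence-counter rule above can be sketched in a few lines. This is a sequential Python simulation under stated assumptions, not the GPU run-time itself: the name `run_edts` and the edges between the three tasks are hypothetical, and the plain decrement stands in for what would be an atomic operation on the GPU.

```python
from collections import deque


def run_edts(dcount, successors):
    """Execute tasks in dependence order using per-task counters (sketch).

    dcount[t]     -- number of unresolved predecessors of task t
    successors[t] -- tasks that wait on an event from task t
    """
    dcount = list(dcount)  # work on a copy
    ready = deque(t for t, c in enumerate(dcount) if c == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)  # the task body would run here
        for s in successors[task]:
            dcount[s] -= 1  # assumption: an atomic decrement on the GPU
            if dcount[s] == 0:  # last dependence satisfied: task becomes active
                ready.append(s)
    return order


# Counters as in the figure (0, 1, 2); the edges themselves are assumed.
order = run_edts(dcount=[0, 1, 2], successors=[[1, 2], [2], []])
print(order)
```

No global barrier appears anywhere in the loop: a task is released as soon as its own counter reaches zero.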

SLIDE 6

Run-Time Architecture

[Figure: run-time architecture. Labels: work queues, thread blocks, codelets, work stealing, task meta-data]

SLIDE 7

Run-Time Architecture

[Figure: run-time architecture. Labels: work queues, thread blocks, codelets, work stealing, task meta-data]

  • Defines persistent GPU kernel

SLIDE 8

Run-Time Architecture

[Figure: run-time architecture. Labels: work queues, thread blocks, codelets, work stealing, task meta-data]

Task meta-data

  • Task parameters
  • Dependence counters
  • Codelet type
  • Integer vectors

  • Defines persistent GPU kernel

SLIDE 9

Run-Time Architecture

[Figure: run-time architecture. Labels: work queues, thread blocks, codelets, work stealing, task meta-data]

Task meta-data

  • Task parameters
  • Dependence counters
  • Codelet type
  • Integer vectors

Codelet

  • Prologue
  • Computation
  • Epilogue

SLIDE 10

Run-Time Architecture

[Figure: run-time architecture. Labels: work queues, thread blocks, codelets, work stealing, task meta-data]

Task meta-data

  • Task parameters
  • Dependence counters
  • Codelet type
  • Integer vectors

Codelet

  • Unpacks parameters
  • Computation
  • Dependence resolution
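The three codelet stages listed above (unpack parameters, compute, resolve dependences) can be sketched as follows. This is an illustrative Python model only: the dictionary layout of the task meta-data, the `make_codelet` helper, and the "add"/"mul" codelet types are all assumptions, chosen to mirror the fields named on the slide (parameters as an integer vector, a dependence counter, a codelet type).

```python
def make_codelet(compute):
    """Wrap a computation in a prologue and an epilogue (hypothetical helper)."""

    def codelet(task, tasks, out):
        params = tuple(task["params"])  # prologue: unpack the integer vector
        out.append(compute(*params))  # computation
        ready = []
        for s in task["successors"]:  # epilogue: dependence resolution
            tasks[s]["dcount"] -= 1
            if tasks[s]["dcount"] == 0:
                ready.append(s)
        return ready  # tasks that became active

    return codelet


# Codelet type -> codelet body (both types are made up for this sketch).
CODELETS = {
    "add": make_codelet(lambda a, b: a + b),
    "mul": make_codelet(lambda a, b: a * b),
}

tasks = [
    {"codelet": "add", "params": [2, 3], "dcount": 0, "successors": [1]},
    {"codelet": "mul", "params": [4, 5], "dcount": 1, "successors": []},
]

out = []
first_ready = CODELETS[tasks[0]["codelet"]](tasks[0], tasks, out)
for t in first_ready:
    CODELETS[tasks[t]["codelet"]](tasks[t], tasks, out)
print(out)
```

Dispatching on the codelet type lets one persistent loop run every kind of task, which is the role the slide assigns to the single GPU kernel.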

SLIDE 11

Run-Time Architecture

[Figure: run-time architecture. Labels: work queues, thread blocks, codelets, work stealing, task meta-data]

Work Queue

  • Global memory
  • Operations: Put, Get

SLIDE 12

Run-Time Architecture

  • Workers
  • Unrestricted number
  • Max stealing rounds
  • Agnostic to the intra-thread-block configuration

[Figure: run-time architecture. Labels: work queues, thread blocks, codelets, work stealing, task meta-data]
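The interaction between per-worker queues, work stealing, and a bound on stealing rounds can be sketched as below. This is a sequential Python model under assumptions: the function `get_task`, the round-robin victim order, and the choice to pop from one end locally and steal from the other are not stated in the slides. On a GPU the queue operations would need atomics, and other workers could refill queues between rounds.

```python
from collections import deque


def get_task(worker, queues, max_steal_rounds):
    """Return a task for `worker`, or None once stealing has failed (sketch)."""
    if queues[worker]:
        return queues[worker].pop()  # take from the worker's own queue

    n = len(queues)
    for _ in range(max_steal_rounds):
        for offset in range(1, n):  # visit the other workers in turn
            victim = (worker + offset) % n
            if queues[victim]:
                return queues[victim].popleft()  # steal from the opposite end
    return None  # give up after max_steal_rounds unsuccessful rounds


# Three workers; only worker 1 has work initially.
queues = [deque(), deque([10, 11]), deque()]
stolen_a = get_task(0, queues, max_steal_rounds=2)
stolen_b = get_task(2, queues, max_steal_rounds=2)
idle = get_task(0, queues, max_steal_rounds=2)
print(stolen_a, stolen_b, idle)
```

Bounding the number of rounds gives a worker a way to stop polling when no work is left, instead of spinning indefinitely.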

SLIDE 13

Experimental Evaluation

  • Simple stencil programs from the PolyBench suite
  • Jacobi-2D 5pt, FDTD-2D, ADI
  • Compared against best known wavefront implementations
  • Konstantinidis et al. LCPC 2013
  • Rectangular parametric tiling is applied
  • For run-time tile-size exploration

[Figure: each rectangular tile is a task; thread-block parallelism within a tile]
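As a rough illustration of parametric rectangular tiling, the sketch below splits a 2D iteration domain into tiles whose sizes are ordinary run-time arguments, with one task per tile. The function `tile_tasks` and the domain and tile sizes are assumptions made for this example; the slides do not give the actual sizes or the tiling code.

```python
def tile_tasks(n_rows, n_cols, tile_rows, tile_cols):
    """Return one (row_lo, row_hi, col_lo, col_hi) task per rectangular tile.

    Tile sizes are plain arguments, so they can be changed at run time
    without regenerating code. Upper bounds are exclusive.
    """
    tasks = []
    for r0 in range(0, n_rows, tile_rows):
        for c0 in range(0, n_cols, tile_cols):
            tasks.append(
                (r0, min(r0 + tile_rows, n_rows), c0, min(c0 + tile_cols, n_cols))
            )
    return tasks


# Same assumed 10 x 8 domain, two tile-size choices.
small_tiles = tile_tasks(10, 8, 4, 4)
large_tiles = tile_tasks(10, 8, 5, 8)
print(len(small_tiles), len(large_tiles))
```

Smaller tiles produce more tasks and more scheduling events, while larger tiles produce fewer tasks with more work each, which is the trade-off a tile-size exploration would probe.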

SLIDE 14

Experimental Evaluation

  • NVIDIA GTX 670
  • Compute Capability: 3.0
  • CUDA Driver/Runtime Version: 6.5
  • Global Memory: 2 GB
  • Multiprocessors: 7
  • ECC: OFF

SLIDE 15

Experimental Evaluation

SLIDE 16

Experimental Evaluation

  • Jacobi 2D 5pt – Execution Timelines

[Figure: execution timelines (axes: time, worker) for 23 workers and 16 workers]

SLIDE 17

Experimental Evaluation

  • Jacobi 2D 5pt – Execution Timelines

[Figure: execution timelines for 10, 16, and 23 workers]

SLIDE 18

Experimental Evaluation

  • Jacobi 2D 5pt – Execution Timelines

[Figure: execution timelines (axes: time, worker) for 30 workers and 23 workers]

SLIDE 19

Experimental Evaluation

  • Jacobi 2D 5pt – Execution Timelines

[Figure: execution timeline (axes: time, worker) for 30 workers: 23 active workers and 7 redundant workers]

SLIDE 20

Experimental Evaluation

  • FDTD 2D – Execution Timelines

[Figure: execution timelines (axes: time, worker) for 33 workers and 22 workers]

SLIDE 21

Experimental Evaluation

  • ADI – Execution Timelines

[Figure: execution timelines (axes: time, worker) for 33 workers and 22 workers]

SLIDE 22

Conclusions

  • Effective task-parallelism with on-the-fly dependence resolution
  • Single persistent GPU kernel prevents global synchronization overhead
  • Evaluated against wavefront parallelism on stencil computations

SLIDE 23

The End

  • Questions?