

SLIDE 1

DYNAMIC FINE-GRAIN SCHEDULING OF PIPELINE PARALLELISM

Daniel Sanchez, David Lo, Richard M. Yoo, Jeremy Sugerman, Christos Kozyrakis
Stanford University
PACT-20, October 11th 2011

SLIDE 2

Executive Summary

- Pipeline-parallel applications are hard to schedule
- Existing techniques either ignore pipeline parallelism, cannot handle its dependences, or suffer from load imbalance
- Contributions:
  - Design a runtime that dynamically schedules pipeline-parallel applications efficiently
  - Show it outperforms typical scheduling techniques from multicore, GPGPU, and streaming programming models

SLIDE 3

Outline

- Introduction
- GRAMPS Programming Model
- GRAMPS Runtime
- Evaluation

SLIDE 4

High-Level Programming Models

- High-level parallel programming models provide:
  - Simple, safe constructs to express parallelism
  - Automatic resource management and scheduling
- Many aspects; we focus on scheduling
  - Model, scheduler, and architecture are often intimately related
- In terms of scheduling, three main types of models:
  - Task-parallel models, typical in multicore (Cilk, X10)
  - Data-parallel models, typical in GPUs (CUDA, OpenCL)
  - Streaming models, typical in streaming architectures (StreamIt, StreamC)

SLIDE 5

Pipeline-Parallel Applications

- Some models (e.g., streaming) define applications as a graph of stages that communicate explicitly through queues
  - Each stage can be sequential or data-parallel
  - Arbitrary graphs are allowed (multiple inputs/outputs, loops)
- Well suited to many algorithms
  - Producer-consumer communication is explicit, making it easier to exploit to improve locality
- Traditional scheduling techniques have issues dynamically scheduling pipeline-parallel applications

[Figure: ray tracing pipeline with stages Camera, Tiler, Sampler, Intersect, Shade, Shadow Intersect, and Frame Buffer]

SLIDE 6

Task-Parallel – Task-Stealing

- Model: task-parallel with fork-join dependences or independent tasks (Cilk, X10, TBB, OpenMP, ...)
- Task-Stealing scheduler (a minimal code sketch follows the figure below):
  - Worker threads enqueue/dequeue tasks from a local queue
  - Steal from another queue when out of tasks
  - Efficient load balancing
  - Unable to handle the dependences of pipeline-parallel programs

[Figure: task stealing — worker threads T0…Tn enqueue and dequeue at their local queues and steal from others when idle]
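To make the mechanics concrete, here is a minimal sketch of the classic policy: the owner pushes and pops at the LIFO end of its per-worker deque, while an idle thief steals from the FIFO end of a victim. All names are illustrative; production runtimes (Cilk, TBB) use lock-free deques such as Chase-Lev rather than the mutex used here for brevity.

```cpp
// Minimal task-stealing sketch (all names illustrative): each worker owns
// a deque used as a LIFO; an idle worker steals from the opposite (FIFO)
// end of a victim's deque. A mutex keeps this sketch short.
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <random>
#include <vector>

using Task = std::function<void()>;

struct WorkerQueue {
    std::mutex m;
    std::deque<Task> tasks;

    void enqueue(Task t) {               // owner pushes at the LIFO end
        std::lock_guard<std::mutex> g(m);
        tasks.push_back(std::move(t));
    }
    std::optional<Task> dequeue() {      // owner pops at the LIFO end
        std::lock_guard<std::mutex> g(m);
        if (tasks.empty()) return std::nullopt;
        Task t = std::move(tasks.back());
        tasks.pop_back();
        return t;
    }
    std::optional<Task> steal() {        // thief takes from the FIFO end
        std::lock_guard<std::mutex> g(m);
        if (tasks.empty()) return std::nullopt;
        Task t = std::move(tasks.front());
        tasks.pop_front();
        return t;
    }
};

// One scheduling step for worker `self`: run local work, else pick a
// random victim and try to steal.
void scheduleStep(std::vector<WorkerQueue>& workers, size_t self,
                  std::mt19937& rng) {
    if (auto t = workers[self].dequeue()) { (*t)(); return; }
    std::uniform_int_distribution<size_t> pick(0, workers.size() - 1);
    size_t victim = pick(rng);
    if (victim != self)
        if (auto t = workers[victim].steal()) (*t)();
}
```

Stealing from the opposite end reduces contention with the owner and tends to grab older, larger pieces of work; but nothing in this scheme understands inter-stage dependences, which is exactly the limitation the slide notes.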

SLIDE 7

Data-Parallel – Breadth-First

- Model: sequence of data-parallel kernels (CUDA, OpenCL)
- Breadth-First scheduler: execute one stage at a time, in breadth-first order (source to sink)
  - Very simple model
  - Ignores pipeline parallelism → works poorly with sequential stages, worst-case memory footprint

[Figure: breadth-first execution — Stage 1, then Stage 2, then Stage 3 run to completion across workers T0–T3]

SLIDE 8

Streaming – Static Scheduling

- Model: graph of stages communicating through streams
- Static scheduler:
  - Assumes the app and architecture are regular and known in advance
  - Uses sophisticated compile-time analysis and scheduling to minimize inter-core communication and memory footprint
- Very efficient if the application and architecture are regular
- Load imbalance with irregular applications or unpredictable architectures (DVFS, multi-threading, ...)

SLIDE 9

Summary of Scheduling Techniques

                                  Task-Stealing   Breadth-First   Static   GRAMPS
Supports pipeline-parallel apps        no              no          yes      yes
Supports irregular apps/archs          yes             yes         no       yes

SLIDE 10

Outline

- Introduction
- GRAMPS Programming Model
- GRAMPS Runtime
- Evaluation

SLIDE 11

GRAMPS Programming Model

- Programming model for dynamic scheduling of irregular pipeline-parallel workloads
  - Brief overview here; details in [Sugerman 2010]
- Shader (data-parallel) and Thread (sequential) stages
- Stages send packets through fixed-size data queues
  - Queues can be ordered or unordered
  - Stages can enqueue full packets or push individual elements (coalesced into packets by the runtime)

[Figure: the ray tracing pipeline expressed in GRAMPS — Thread and Shader stages connected by queues and push queues]

SLIDE 12

GRAMPS: Threads vs Shaders

- Threads are stateful, instanced by the programmer
  - Arbitrary number of input and output queues
  - Block on an empty input or a full output queue
  - Can be preempted by the scheduler
- Shaders are stateless, automatically instanced (contrast sketched below)
  - Single input queue, one or more outputs
  - Each instance processes one input packet
  - Do not block
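A minimal C++ sketch of the contrast between the two stage kinds. These interface names are assumptions for exposition, not the published GRAMPS API:

```cpp
// Illustrative shapes of the two GRAMPS stage kinds (names assumed).
struct Packet;  // a fixed-size chunk of work items

struct ShaderStage {
    // Stateless; the runtime instances it automatically, once per runnable
    // input packet. May enqueue or push output packets; never blocks.
    virtual void execute(const Packet& in) = 0;
    virtual ~ShaderStage() = default;
};

struct ThreadStage {
    // Stateful; instanced by the programmer. run() blocks on an empty
    // input or full output queue and can be preempted by the scheduler.
    virtual void run() = 0;
    virtual ~ThreadStage() = default;
};
```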

SLIDE 13

GRAMPS Scheduling

- Similar model to streaming, but several features ease dynamic scheduling of irregular applications:
  - Packet granularity → reduced scheduling overheads
  - Stages can produce variable amounts of output (e.g., push queues)
  - Data-parallel stages and queue ordering are explicit
- Static scheduling requires applications to have a steady state; GRAMPS can schedule apps with no steady state
- GRAMPS was evaluated with an idealized scheduler when proposed; we implement a real multicore runtime

SLIDE 14

Outline

- Introduction
- GRAMPS Programming Model
- GRAMPS Runtime
- Evaluation

SLIDE 15

GRAMPS Runtime Overview

- Runtime = Scheduler + Buffer Manager
- Scheduler: decides what to run where
  - Dynamic, low-overhead, keeps a bounded footprint
  - Based on task-stealing with multiple task queues per thread
- Buffer Manager: provides dynamic allocation of packets
  - Generic memory allocators are too slow for communication-intensive applications
  - Low-overhead solution based on packet-stealing

SLIDE 16

Scheduler organization

- As many worker pthreads as hardware threads
- Work is represented with tasks (see the sketch below)
- Shader stages are function calls (stateless, non-preemptive)
  - One task per runnable shader instance
- Thread stages are user-level threads (stateful, preemptive)
  - User-level threads enable fast context switching (~100 cycles)
  - One task per runnable thread
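A hypothetical sketch of a two-flavor task record consistent with this description; the talk does not show the runtime's actual layout, so every field name here is an assumption:

```cpp
// Hypothetical two-flavor task record (layout assumed, not from the talk).
struct Packet;
struct UserLevelThread;  // e.g., fiber/ucontext-based; details elided

struct Task {
    enum class Kind { Shader, Thread } kind;
    // Shader task: stateless and non-preemptive -- just a function call
    // applied to one input packet.
    void (*shaderFn)(Packet*) = nullptr;
    Packet* inputPacket = nullptr;
    // Thread task: resumes a user-level thread; the context switch costs
    // on the order of 100 cycles, far cheaper than an OS thread switch.
    UserLevelThread* thread = nullptr;
};
```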

SLIDE 17

Scheduler: Task Queues

- Load balancing via task stealing
- Each thread has one LIFO task queue per stage
  - Stages are sorted in breadth-first order, giving higher priority to consumers
  - Dequeue from the highest-priority queue first; steal from the lowest-priority queue first (see the code sketch below)
  - Higher-priority tasks drain the pipeline and improve locality
  - Lower-priority tasks produce more work (less stealing)

[Figure: four per-stage task queues on each worker; the dequeue order scans from the highest-priority stage down, the steal order from the lowest-priority stage up]
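A sketch of the two scan orders under the stated assumptions (per-stage LIFO deques indexed in breadth-first order; synchronization elided for brevity, and all names illustrative):

```cpp
// Per-worker dequeue/steal policy sketch. A higher stage index means
// closer to the sink (a consumer), hence higher priority.
#include <deque>
#include <functional>
#include <optional>
#include <vector>

using Task = std::function<void()>;  // stands in for a shader/thread task

struct Worker {
    // One LIFO task queue per stage, indexed by breadth-first position.
    std::vector<std::deque<Task>> perStage;

    // Local dequeue: highest-priority (consumer) stages first; draining
    // consumers empties the pipeline and improves locality and footprint.
    std::optional<Task> dequeueLocal() {
        for (size_t s = perStage.size(); s-- > 0; ) {
            if (!perStage[s].empty()) {
                Task t = std::move(perStage[s].back());
                perStage[s].pop_back();
                return t;
            }
        }
        return std::nullopt;
    }

    // Victim side of a steal: lowest-priority (producer) stages first,
    // since producer tasks generate more work and reduce future steals.
    std::optional<Task> stealOne() {
        for (size_t s = 0; s < perStage.size(); ++s) {
            if (!perStage[s].empty()) {
                Task t = std::move(perStage[s].front());
                perStage[s].pop_front();
                return t;
            }
        }
        return std::nullopt;
    }
};
```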

SLIDE 18

Scheduler: Data Queues

- Thread input queues are maintained as linked lists
- Shader input queues are implicitly maintained in the task queues
  - Each shader task includes a pointer to its input packet
- Queue occupancy is tracked for all queues
- Backpressure: when a queue fills up, disable dequeues and steals from the queue's producers (sketched below)
  - Producers remain stalled until packets are consumed; workers shift to other stages
  - Queues never exceed capacity → bounded footprint
- Queues are optionally ordered (see the paper for details)
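A sketch of occupancy-based backpressure, assuming one atomic counter per data queue (the runtime's actual bookkeeping may differ):

```cpp
// Occupancy-based backpressure sketch (counter-per-queue is an assumption).
#include <atomic>
#include <cstddef>

struct DataQueue {
    std::atomic<size_t> occupancy{0};
    const size_t capacity;
    explicit DataQueue(size_t cap) : capacity(cap) {}

    bool tryReserve() {                       // producer reserves a slot
        size_t cur = occupancy.load();
        while (cur < capacity)
            if (occupancy.compare_exchange_weak(cur, cur + 1)) return true;
        return false;                         // full: producer must stall
    }
    void release() { occupancy.fetch_sub(1); }  // consumer frees a slot
};

// The scheduler checks this before dequeuing or stealing a task of the
// queue's producer stage; stalled producers are simply skipped, so workers
// shift to other stages and every queue stays within its bound.
inline bool producerRunnable(const DataQueue& out) {
    return out.occupancy.load() < out.capacity;
}
```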

SLIDE 19

Example

[Figure: worked example — Thread stage 1 feeds Queue 1 (capacity 20), Shader stage 2 consumes Queue 1 and fills Queue 2 (capacity 10), and Thread stage 3 drains Queue 2; the per-worker task queues T0–T3 and the queue occupancies evolve as tasks run]

SLIDE 20

Example (cont.)

[Figure: the example continues until Queue 2 fills (10/10)]

Queue 2 full → disable dequeues and steals from Stage 2

SLIDE 21

Packet-Stealing Buffer Manager

- Packets are pre-allocated into a set of pools
  - Each pool holds packets of a specific size
- Each worker thread maintains a LIFO queue per pool (sketched below)
  - Used input packets are released to the local queue
  - New output packets are allocated from the local queue; if it is empty, steal
- Due to bounded queue sizes, there is no need to allocate packets dynamically
- The LIFO policy results in high locality and reuse
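A sketch of the packet-stealing allocator under these assumptions: per-worker LIFO free lists per pool, with cross-worker synchronization elided for brevity (all names illustrative):

```cpp
// Packet-stealing allocator sketch: packets are pre-allocated into pools,
// one pool per packet size; each worker keeps a LIFO free list per pool.
#include <cstddef>
#include <vector>

struct PacketPool {
    // freeLists[worker] holds free packets of this pool's packet size.
    std::vector<std::vector<void*>> freeLists;

    explicit PacketPool(size_t numWorkers) : freeLists(numWorkers) {}

    void* allocate(size_t self) {
        auto& mine = freeLists[self];
        if (!mine.empty()) {               // fast path: local LIFO reuse,
            void* p = mine.back();         // hottest packet in this cache
            mine.pop_back();
            return p;
        }
        for (auto& other : freeLists) {    // slow path: steal a free packet
            if (!other.empty()) {
                void* p = other.back();
                other.pop_back();
                return p;
            }
        }
        return nullptr;  // unreachable if pools are sized to queue bounds
    }
    void release(size_t self, void* p) {   // used inputs return locally
        freeLists[self].push_back(p);
    }
};
```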

SLIDE 22

Outline

- Introduction
- GRAMPS Programming Model
- GRAMPS Runtime
- Evaluation

SLIDE 23

Methodology

- Test system: 2-socket, 12-core, 24-thread Westmere
  - 32KB L1 I+D, 256KB private L2, 12MB per-socket L3
  - 48GB 1333MHz DDR3 memory, 21GB/s peak bandwidth
- Benchmarks from different programming models:
  - GRAMPS: raytracer
  - MapReduce: histogram, lr, pca
  - Cilk: mergesort
  - StreamIt: fm, tde, fft2, serpent
  - CUDA: srad, recursiveGaussian

[Figure: benchmark graphs — MapReduce: Split → Map → Combine (opt) → Reduce; mergesort: Part → Serial Sort → Combine → Merge]

SLIDE 24

Alternative Schedulers

- The GRAMPS scheduler can be substituted with other implementations to compare scheduling approaches:
  - Task-Stealing: single LIFO task queue per thread, no backpressure
  - Breadth-First: one stage at a time; may make multiple passes due to loops; no backpressure
  - Static: the application is profiled first, partitioned using METIS, and scheduled with a min-latency schedule, using per-thread data queues

SLIDE 25

GRAMPS Scheduler Scalability

- All applications scale well
- Knee at 12 threads due to HW multithreading
- Sublinear scaling due to memory bandwidth (hist, CUDA)

[Figure: scalability curves for all benchmarks under the GRAMPS scheduler]

SLIDE 26

Performance Comparison

[Figure: execution-time breakdown of the GRAMPS, MapReduce, Cilk, StreamIt, and CUDA benchmarks under each scheduler]

SLIDE 27

Performance Comparison

- Dynamic runtime overheads are small in GRAMPS
- Task-Stealing performs worse on complex graphs (fm, tde, fft2)
- Breadth-First does poorly when parallelism comes from pipelining
- Static has no overheads and better locality, but higher stalled time due to load imbalance

SLIDE 28

Footprint Comparison

- Task-Stealing fails to keep the footprint bounded (tde)
- Breadth-First has worst-case footprints → much higher footprint and memory bandwidth requirements

SLIDE 29

Buffer Manager Performance

- Dynamic: allocate packets using malloc/free (tcmalloc)
- Per-Queue: use per-queue, shared packet buffers

SLIDE 30

Buffer Manager Performance

- A generic dynamic memory allocator causes up to 6x slowdown on buffer-intensive applications
- The per-queue allocator degrades locality and performance with lots of stages (tde)
- Packet-stealing has low overheads and maintains locality

SLIDE 31

Conclusions

- Traditional scheduling techniques have problems with pipeline-parallel applications
  - Task-Stealing: fails on complex graphs and ordered queues
  - Breadth-First: no pipeline overlap, terrible footprints
  - Static: load imbalance with any irregularity
- The GRAMPS runtime performs dynamic fine-grain scheduling of pipeline-parallel applications efficiently
  - Low scheduler and buffer manager overheads
  - Good locality

SLIDE 32

THANK YOU FOR YOUR ATTENTION
QUESTIONS?