Flexible Architectural Support for Fine‐Grain Scheduling


SLIDE 1

Flexible Architectural Support for Fine‐Grain Scheduling

Daniel Sanchez, Richard M. Yoo, Christos Kozyrakis

March 16th, 2010
Stanford University

SLIDE 2

Overview

  • Our focus: User‐level schedulers for parallel runtimes

– Cilk, TBB, OpenMP, …

  • Trends:

– More cores/chip → Need to exploit finer‐grain parallelism
– Deeper memory hierarchies, costlier cache coherence → Communication through shared memory increasingly inefficient

  • Existing fine‐grain schedulers:

– Software‐only: Slow, do not scale
– Hardware‐only: Fast, but inflexible

  • Our contribution: Hardware‐aided approach

– HW: Fast, asynchronous messages between threads (ADM)
– SW: Scalable message‐passing schedulers
– ADM schedulers scale like HW, flexible like SW schedulers

SLIDE 3

Outline

  • Introduction
  • Asynchronous Direct Messages (ADM)
  • ADM schedulers
  • Evaluation


SLIDE 4

Fine‐grain parallelism

  • Fine‐grain parallelism: Divide the work in a parallel phase into small tasks (~1K‐10K instructions)

  • Potential advantages:

– Expose more parallelism
– Reduce load imbalance
– Adapt to a dynamic environment (e.g. changing # of cores)

  • Potential disadvantages:

– Large scheduling overheads
– Poor locality (if the application has inter‐task locality)


SLIDE 5

Task‐stealing schedulers

[Figure: threads T0…Tn, one task queue per thread; threads enqueue and dequeue tasks locally and steal from other queues]

  • One task queue per thread
  • Threads enqueue and dequeue tasks from their own queue
  • When a thread runs out of work, it tries to steal tasks from another thread
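For reference, here is a minimal software sketch of this scheme in C++. The structure (owner pops LIFO, thieves steal FIFO) is the classic task-stealing design; the names and the mutex-per-queue locking are our simplification, not the implementation evaluated in these slides.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <vector>

using Task = std::function<void()>;

struct Worker {
    std::deque<Task> queue;  // one task queue per thread
    std::mutex lock;         // real runtimes use lock-free deques instead

    void enqueue(Task t) {
        std::lock_guard<std::mutex> g(lock);
        queue.push_back(std::move(t));
    }
    std::optional<Task> dequeue() {  // owner pops the newest task (LIFO)
        std::lock_guard<std::mutex> g(lock);
        if (queue.empty()) return std::nullopt;
        Task t = std::move(queue.back());
        queue.pop_back();
        return t;
    }
    std::optional<Task> steal() {    // thieves take the oldest task (FIFO)
        std::lock_guard<std::mutex> g(lock);
        if (queue.empty()) return std::nullopt;
        Task t = std::move(queue.front());
        queue.pop_front();
        return t;
    }
};

// Out of work: scan the other workers and try to steal from one of them.
std::optional<Task> find_work(std::vector<Worker>& workers, std::size_t self) {
    for (std::size_t i = 0; i < workers.size(); ++i)
        if (i != self)
            if (auto t = workers[i].steal()) return t;
    return std::nullopt;
}
```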

SLIDE 6

Task‐stealing: Components

  • 1. Queues
  • 2. Policies (enq/deq, steal)
  • 3. Communication

  • In software schedulers:

—Queues and policies are cheap
—Communication through shared memory increasingly expensive!

[Figure: per‐thread queues with enq/deq and steal arrows; execution time breakdown into App, Queues, Stealing, and Starved]

SLIDE 7

Hardware schedulers: Carbon

  • Carbon [ISCA ‘07]: HW queues, policies, communication

– One hardware LIFO task queue per core
– Special instructions to enqueue/dequeue tasks

  • Implementation:

– Centralized queues for fast stealing (Global Task Unit)
– One small task buffer per core to hide GTU latency (Local Task Units)

[Figure: execution time breakdowns (App, Queues, Stealing, Starved) showing 31x and 26x speedups]

Large benefits if the app matches the HW policies; useless if it doesn't

SLIDE 8

Approaches to fine‐grain scheduling

Fine‐grain scheduling approaches:

  • Software‐only (OpenMP, Cilk, TBB, X10, …): SW queues & policies, SW communication

– Flexible, no extra HW, but high‐overhead

  • Hardware‐only (Carbon, GPUs): HW queues & policies, HW communication

– Low‐overhead, but inflexible, special‐purpose HW

  • Hardware‐aided (Asynchronous Direct Messages): SW queues & policies, HW communication

– Low‐overhead, flexible, general‐purpose HW

SLIDE 9

Outline

  • Introduction
  • Asynchronous Direct Messages (ADM)
  • ADM schedulers
  • Evaluation


SLIDE 10

Asynchronous Direct Messages

  • ADM: Messaging between threads tailored to scheduling and control needs:

—Low‐overhead: short messages, sent from / received to registers, independent from coherence
—Overlap communication and computation: asynchronous messages with user‐level interrupts
—General‐purpose: generic interface allows reuse


SLIDE 11

ADM Microarchitecture

  • One ADM unit per core:

– Receive buffer holds messages until dequeued by the thread
– Send buffer holds sent messages pending acknowledgement
– Thread ID Translation Buffer translates TID → core ID on sends
– Small structures (16‐32 entries), don't grow with # of cores

SLIDE 12

ADM ISA

Instruction         Description
adm_send r1, r2     Sends a message of (r1) words (0‐6) to thread with ID (r2)
adm_peek r1, r2     Returns source and message length at head of rx buffer
adm_rx r1, r2       Dequeues message at head of rx buffer
adm_ei / adm_di     Enable / disable receive interrupts

  • Send and receive are atomic (single instruction)

– Send completes when the message is copied to the send buffer
– Receive blocks if the buffer is empty
– Peek doesn't block, enables polling

  • The ADM unit generates a user‐level interrupt on the running thread when a message is received

– No stack switching; handler code partially saves context (used registers) → fast
– Interrupts can be disabled to preserve atomicity w.r.t. message reception
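One way a runtime might expose this ISA is through compiler intrinsics. The __adm_* names and C signatures below are hypothetical stand-ins for the instructions in the table; only the semantics (blocking receive, non-blocking peek, interrupt enable/disable) come from this slide.

```cpp
#include <cstdint>

// Hypothetical intrinsics mirroring the ADM ISA; a real implementation
// would lower each call to the corresponding single instruction.
extern "C" {
    void __adm_send(const uint64_t* words, int len, int dest_tid); // adm_send
    int  __adm_peek(int* src_tid, int* len);  // adm_peek: nonzero if a message waits
    int  __adm_rx(uint64_t* words);           // adm_rx: blocks, returns source TID
    void __adm_ei();                          // enable receive interrupts
    void __adm_di();                          // disable receive interrupts
}

// Poll without blocking: peek first, dequeue only if a message is present.
bool try_receive(uint64_t* words, int* src_tid) {
    int len;
    if (!__adm_peek(src_tid, &len)) return false; // rx buffer empty
    *src_tid = __adm_rx(words);                   // dequeue the message
    return true;
}

// Run a critical section atomically w.r.t. message reception by
// disabling receive interrupts around it.
template <typename F>
void with_rx_interrupts_disabled(F body) {
    __adm_di();
    body();
    __adm_ei();
}
```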

SLIDE 13

Outline

  • Introduction
  • Asynchronous Direct Messages (ADM)
  • ADM schedulers
  • Evaluation


SLIDE 14

ADM Schedulers

  • Message‐passing schedulers
  • Replace parallel runtime’s (e.g. TBB) scheduler

— Application programmer is oblivious to this

  • Threads can perform two roles:

– Worker: Execute parallel phase, enqueue & dequeue tasks
– Manager: Coordinate task stealing & parallel phase termination

  • Centralized scheduler: A single manager coordinates all workers

[Figure: workers T0–T3; T0 is manager and worker!]
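To make the protocol on the next slides concrete, here is one plausible message vocabulary for such a scheduler. The message names follow the slides (UPDATE, STEAL_REQ, TASK); the enum encoding, the field layout, and the phase-termination message are our assumptions.

```cpp
#include <cstdint>

// Illustrative message vocabulary for the centralized scheduler.
enum class MsgType : uint8_t {
    Update,    // worker -> manager: approximate task count
    StealReq,  // manager -> victim: send tasks to a starved thief
    Task,      // victim -> thief: a stolen task descriptor
    Finished,  // worker -> manager: phase termination (assumed name)
};

struct Msg {
    MsgType  type;
    uint16_t src_tid;     // sender thread ID
    uint64_t payload[6];  // ADM messages carry up to 6 words (slide 12)
};
```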

SLIDE 15

Centralized Scheduler: Updates

[Figure: manager holds approximate task counts for workers T0–T3; as T0's queue grows, it sends UPDATE <4> and UPDATE <8> messages]

  • Manager keeps approximate task counts of each worker


  • Workers only notify manager at exponential thresholds
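A sketch of that exponential-threshold rule, assuming a hypothetical send_update() wrapper around an ADM message to the manager: the worker reports only when its count roughly doubles or halves, so the manager's view stays approximate while update traffic grows only logarithmically with queue size.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical stand-in for sending an ADM UPDATE <count> to the manager.
static void send_update(std::size_t count) {
    std::printf("UPDATE <%zu>\n", count);  // real code: adm_send to manager TID
}

// Notify the manager only at exponential thresholds: a queue that grows
// to N tasks triggers O(log N) updates instead of N.
struct UpdateFilter {
    std::size_t last_reported = 0;

    void on_count_change(std::size_t count) {
        if (count >= 2 * last_reported + 1 || count <= last_reported / 2) {
            send_update(count);
            last_reported = count;
        }
    }
};
```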
SLIDE 16

Centralized Scheduler: Steals

[Figure: a starved worker sends UPDATE <1>; the manager sends STEAL_REQ <T1→T2, 1> to victim T1, which sends a TASK message to thief T2]

  • Manager requests a steal from the worker with most tasks

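The manager side of that exchange might look like the sketch below, using the same hypothetical messaging wrappers. The choice of victim (highest approximate count) comes from this slide; the steal-half amount is a common work-stealing heuristic, not necessarily the exact policy used here.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical stand-in for an ADM STEAL_REQ <victim -> thief, n> message.
static void send_steal_req(int victim, int thief, int ntasks) {
    std::printf("STEAL_REQ <T%d->T%d, %d>\n", victim, thief, ntasks);
}

// On an UPDATE showing `thief` is starved, direct the worker with the
// most (approximately counted) tasks to send it work directly.
void on_starved(std::vector<int>& approx_counts, int thief) {
    int victim = -1, most = 0;
    for (int t = 0; t < (int)approx_counts.size(); ++t)
        if (t != thief && approx_counts[t] > most) {
            most = approx_counts[t];
            victim = t;
        }
    if (victim < 0) return;            // nobody has surplus work
    int n = most > 1 ? most / 2 : 1;   // steal half: a common heuristic
    approx_counts[victim] -= n;        // counts remain approximate
    approx_counts[thief]  += n;
    send_steal_req(victim, thief, n);  // victim will send TASK msgs to thief
}
```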

SLIDE 17

Hierarchical Scheduler

  • Centralized scheduler:

– Does all communication through messages
– Enables directed stealing, task prefetching
– Does not scale beyond ~16 threads

  • Solution: Hierarchical scheduler

—Workers and managers form a tree

[Figure: tree with workers T0–T7 at the leaves, 1st‐level managers T0 and T4, and a 2nd‐level manager T1]

SLIDE 18

Hierarchical Scheduler: Steals

[Figure: task counts aggregated up the tree; a steal request climbs the tree and descends into the subtree with surplus work, with victims sending TASK messages (x2, x4) to starved workers]

  • Steals can span multiple levels

— A single steal rebalances two partitions at once
— Scales to hundreds of threads
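A sketch of how a manager in the tree might pick where a multi-level steal descends. The fan-out value and the per-child bookkeeping are illustrative assumptions, not the exact configuration in these slides.

```cpp
#include <vector>

constexpr int kFanOut = 4;  // illustrative fan-out, not the paper's exact value

// Each manager tracks approximate task counts for its children
// (workers at the first level, lower-level managers above that).
struct TreeManager {
    std::vector<int> child_counts;  // one entry per child subtree

    // A steal request from a starved subtree is forwarded down into the
    // sibling subtree with the most work; because whole partitions are
    // rebalanced at once, one steal can move work for many threads.
    int richest_child() const {
        int best = -1, most = 0;
        for (int c = 0; c < (int)child_counts.size(); ++c)
            if (child_counts[c] > most) {
                most = child_counts[c];
                best = c;
            }
        return best;  // -1 if no child has surplus work
    }
};
```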

SLIDE 19

Outline

  • Introduction
  • Asynchronous Direct Messages (ADM)
  • ADM schedulers
  • Evaluation


SLIDE 20

Evaluation

  • Simulated machine: Tiled CMP

– 32, 64, 128 in‐order dual‐thread SPARC cores (64–256 threads)
– 3‐level cache hierarchy, directory coherence

  • Benchmarks:

– Loop‐parallel: canneal, cg, gtfold
– Task‐parallel: maxflow, mergesort, ced, hashjoin
– Focus on a representative subset of results; see the paper for the full set

[Figure: 64‐core, 16‐tile CMP, with per‐tile detail]

SLIDE 21

Results

[Figure: execution time breakdowns (App, Queues, Stealing, Starved) across schedulers and thread counts]

  • SW scalability limited by scheduling overheads
  • Carbon and ADM: Small overheads that scale
  • ADM matches Carbon → no need for a HW scheduler
SLIDE 22

Flexible policies: gtfold case study

  • In gtfold, FIFO queues allow tasks to clear critical dependences faster

—FIFO queues trivial in SW and ADM
—Carbon (HW) stuck with LIFO

  • ADM achieves 40x speedup over Carbon
  • Can't implement all scheduling policies in HW!
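The flexibility argument in miniature: with software queues, switching from LIFO to FIFO is a one-line policy change, sketched below with a std::deque (the actual runtime's queue structure may differ), whereas a hardware task queue's discipline is fixed at design time.

```cpp
#include <deque>

// With a software task queue, the scheduling discipline is just a policy
// choice: pop from the back for LIFO (depth-first, good locality) or from
// the front for FIFO (clears gtfold's critical dependences faster).
// Caller must ensure the queue is non-empty.
template <typename Task>
Task dequeue(std::deque<Task>& q, bool fifo) {
    Task t = fifo ? std::move(q.front()) : std::move(q.back());
    if (fifo) q.pop_front(); else q.pop_back();
    return t;
}
```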