Flexible Architectural Support for Fine-Grain Scheduling – PowerPoint PPT Presentation

Daniel Sanchez, Richard M. Yoo, Christos Kozyrakis


  1. Flexible Architectural Support for Fine-Grain Scheduling
     Daniel Sanchez, Richard M. Yoo, Christos Kozyrakis
     March 16th, 2010 — Stanford University

  2. Overview
     • Our focus: user-level schedulers for parallel runtimes — Cilk, TBB, OpenMP, …
     • Trends:
       – More cores per chip → need to exploit finer-grain parallelism
       – Deeper memory hierarchies → communication through shared memory increasingly inefficient
       – Costlier cache coherence
     • Existing fine-grain schedulers:
       – Software-only: slow, do not scale
       – Hardware-only: fast, but inflexible
     • Our contribution: a hardware-aided approach
       – HW: fast, asynchronous messages between threads (ADM)
       – SW: scalable message-passing schedulers
       – ADM schedulers scale like HW schedulers and stay as flexible as SW schedulers

  3. Outline
     • Introduction
     • Asynchronous Direct Messages (ADM)
     • ADM schedulers
     • Evaluation

  4. Fine-grain parallelism
     • Fine-grain parallelism: divide the work of a parallel phase into small tasks (~1K-10K instructions)
     • Potential advantages:
       – Expose more parallelism
       – Reduce load imbalance
       – Adapt to a dynamic environment (e.g. a changing number of cores)
     • Potential disadvantages:
       – Large scheduling overheads
       – Poor locality (if the application has inter-task locality)

  5. Task-stealing schedulers
     • One task queue per thread
     • Threads enqueue and dequeue tasks from their own queue
     • When a thread runs out of work, it tries to steal tasks from another thread
     [Figure: threads T0, T1, …, Tn, each with its own task queue; local enqueue/dequeue, steals across queues]
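The task-stealing scheme on this slide can be sketched in a few lines of Python. This is illustrative only — the names `Worker` and `try_steal` are hypothetical, not from any runtime the talk mentions: workers pop from the LIFO end of their own queue for locality, and a starved thread steals from the opposite end of a victim's queue.

```python
# Minimal sketch of a task-stealing scheduler (names are assumptions).
from collections import deque
import random

class Worker:
    def __init__(self, wid):
        self.wid = wid
        self.tasks = deque()          # this thread's own task queue

    def enqueue(self, task):
        self.tasks.append(task)       # push onto the LIFO end

    def dequeue(self):
        # Pop from the LIFO end (good locality); None means starved.
        return self.tasks.pop() if self.tasks else None

def try_steal(thief, workers):
    # A starved thread picks a victim that still has work and takes a
    # task from the opposite (FIFO) end of the victim's queue.
    victims = [w for w in workers if w is not thief and w.tasks]
    if not victims:
        return None
    victim = random.choice(victims)
    return victim.tasks.popleft()
```

In a software-only scheduler, each of these operations turns into shared-memory traffic, which is exactly the cost the rest of the talk targets.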

  6. Task-stealing: Components
     1. Queues (enqueue/dequeue)
     2. Policies
     3. Communication (steals)
     • In software schedulers:
       – Queues and policies are cheap
       – Communication through shared memory is increasingly expensive!
     [Figure: per-core execution-time breakdown — App, Queues, Stealing, Starved]

  7. Hardware schedulers: Carbon
     • Carbon [ISCA '07]: HW queues, policies, and communication
       – One hardware LIFO task queue per core
       – Special instructions to enqueue/dequeue tasks
     • Implementation:
       – Centralized queues for fast stealing (Global Task Unit)
       – One small task buffer per core to hide GTU latency (Local Task Units)
     • Large benefits if the app matches the HW policies; useless if it doesn't
     [Figure: speedups of 31x and 26x on apps that match the HW policies; breakdown — App, Queues, Stealing, Starved]

  8. Approaches to fine-grain scheduling
     • Software-only (OpenMP, TBB, Cilk, X10, …): SW queues & policies, SW communication
       ✗ High overhead  ✓ Flexible  ✓ No extra HW
     • Hardware-only (Carbon, GPUs, …): HW queues & policies, HW communication
       ✓ Low overhead  ✗ Inflexible  ✗ Special-purpose HW
     • Hardware-aided (Asynchronous Direct Messages): SW queues & policies, HW communication
       ✓ Low overhead  ✓ Flexible  ✓ General-purpose HW

  9. Outline
     • Introduction
     • Asynchronous Direct Messages (ADM)
     • ADM schedulers
     • Evaluation

  10. Asynchronous Direct Messages
     • ADM: messaging between threads, tailored to scheduling and control needs:
       – Low overhead: send from / receive into registers; independent from coherence
       – Short messages
       – Asynchronous: user-level interrupts overlap communication with computation
       – General-purpose: a generic interface allows reuse

  11. ADM Microarchitecture
     • One ADM unit per core:
       – Receive buffer holds messages until dequeued by the thread
       – Send buffer holds sent messages pending acknowledgement
       – Thread ID Translation Buffer translates TID → core ID on sends
       – Small structures (16-32 entries) that don't grow with the number of cores

  12. ADM ISA
     • adm_send r1, r2 — sends a message of (r1) words (0-6) to the thread with ID (r2)
     • adm_peek r1, r2 — returns the source and message length at the head of the receive buffer
     • adm_rx r1, r2 — dequeues the message at the head of the receive buffer
     • adm_ei / adm_di — enable / disable receive interrupts
     • Send and receive are atomic (single instruction)
       – Send completes when the message is copied to the send buffer
       – Receive blocks if the buffer is empty
       – Peek doesn't block, enabling polling
     • The ADM unit generates a user-level interrupt on the running thread when a message is received
       – No stack switching; handler code partially saves context (used registers only) → fast
       – Interrupts can be disabled to preserve atomicity w.r.t. message reception
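The semantics of these primitives can be modeled in software. The sketch below is a behavioral model only, not the hardware: the 16-entry buffer size comes from the previous slide, while the `ADMUnit` class and its method names are assumptions for illustration.

```python
# Behavioral model of the ADM send/peek/rx primitives (illustrative).
from collections import deque

RX_ENTRIES = 16                       # slides: small 16-32 entry structures

class ADMUnit:
    def __init__(self, tid):
        self.tid = tid
        self.rx = deque()             # receive buffer, drained by the thread

    def send(self, dest_unit, words):
        # adm_send: a message of 0-6 words, atomic from the sender's view.
        assert 0 <= len(words) <= 6
        if len(dest_unit.rx) >= RX_ENTRIES:
            raise RuntimeError("receive buffer full; sender would stall")
        dest_unit.rx.append((self.tid, list(words)))

    def peek(self):
        # adm_peek: non-blocking; (source, length) of the head, or None.
        if not self.rx:
            return None
        src, words = self.rx[0]
        return src, len(words)

    def receive(self):
        # adm_rx: dequeues the head message (a real thread would block
        # here if the buffer were empty).
        src, words = self.rx.popleft()
        return src, words
```

The non-blocking `peek` is what makes polling loops cheap; the user-level interrupt path on message arrival is not modeled here.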

  13. Outline
     • Introduction
     • Asynchronous Direct Messages (ADM)
     • ADM schedulers
     • Evaluation

  14. ADM Schedulers
     • Message-passing schedulers
     • Replace the parallel runtime's (e.g. TBB's) scheduler
       – The application programmer is oblivious to this
     • Threads can perform two roles:
       – Worker: execute the parallel phase, enqueue & dequeue tasks
       – Manager: coordinate task stealing & parallel-phase termination
     • Centralized scheduler: a single manager coordinates all workers
       – T0 acts as both manager and worker!
     [Figure: manager T0 above workers T0-T3]

  15. Centralized Scheduler: Updates
     • The manager keeps approximate task counts for each worker
     • Workers only notify the manager when their task count crosses exponential thresholds (UPDATE messages)
     [Figure: manager T0 holding approximate counts; workers T0-T3 with task queues, sending UPDATE <4> and UPDATE <8>]
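The exponential-threshold idea can be sketched as follows. This is one plausible implementation, not necessarily the paper's exact rule: a worker reports only when its count crosses a power-of-two boundary relative to the last value it reported, so UPDATE traffic grows logarithmically with queue length while the manager's view stays approximately right.

```python
# Sketch of exponential update thresholds (one assumed implementation).
def next_update_needed(last_reported, current):
    """Report when the current task count crosses a power-of-two
    boundary relative to the last reported value."""
    def level(n):
        # bit_length gives the power-of-two bucket: 0->0, 1->1,
        # 2-3 -> 2, 4-7 -> 3, 8-15 -> 4, ...
        return n.bit_length()
    return level(current) != level(last_reported)
```

A worker growing its queue from 1 to 9 tasks would send UPDATEs only at counts 1, 2, 4, and 8 under this rule.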

  16. Centralized Scheduler: Steals
     • The manager requests a steal from the worker with the most tasks (STEAL_REQ <T1→T2, 1>); the victim sends the TASK directly to the thief and updates its count
     [Figure: manager T0 directing a steal from T2 to T1; workers T0-T3 with task queues; STEAL_REQ, TASK, and UPDATE <1> messages]
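The manager's victim-selection step is simple to sketch. `pick_victim` is a hypothetical name; the logic below just formalizes "request a steal from the worker with the most tasks", using the approximate counts from the previous slide.

```python
# Sketch of manager-directed victim selection (names are assumptions).
def pick_victim(approx_counts, thief):
    """approx_counts: {worker_id: approximate task count}.
    Returns the most-loaded worker other than the thief, or None."""
    candidates = {w: n for w, n in approx_counts.items()
                  if w != thief and n > 0}
    if not candidates:
        return None                   # no work anywhere: thief stays idle
    return max(candidates, key=candidates.get)
```

Because the counts are only approximate, the chosen victim may occasionally have fewer tasks than expected — the steal then simply yields less work, which is harmless.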

  17. Hierarchical Scheduler
     • Centralized scheduler:
       ✓ Does all communication through messages
       ✓ Enables directed stealing, task prefetching
       ✗ Does not scale beyond ~16 threads
     • Solution: hierarchical scheduler
       – Workers and managers form a tree
     [Figure: 2nd-level manager T1 over 1st-level managers T0 and T4, over workers T0-T7]

  18. Hierarchical Scheduler: Steals
     • Steals can span multiple levels
       – A single steal rebalances two partitions at once
       – Scales to hundreds of threads
     [Figure: TASK messages (x2, x4) flowing across the manager tree between partitions]
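A multi-level steal might look like the sketch below. Everything here is an assumption for illustration — the function name, the "move half the victim's tasks" heuristic, and the flat dict representation of the tree — but it shows the key property from the slide: one steal moves a batch of tasks and rebalances two whole partitions at once.

```python
# Sketch of a two-level hierarchical steal (heuristics are assumptions).
def hierarchical_steal(partitions, empty_pid):
    """partitions: {partition_id: {worker_id: task_count}}.
    Moves a batch of tasks from the most-loaded partition into the
    empty one; returns how many tasks moved, or None if none exist."""
    totals = {pid: sum(w.values()) for pid, w in partitions.items()
              if pid != empty_pid}
    victim_pid = max(totals, key=totals.get)
    if totals[victim_pid] == 0:
        return None
    # Within the victim partition, steal from its most-loaded worker,
    # moving half of that worker's tasks in a single multi-level steal.
    victim_w = max(partitions[victim_pid], key=partitions[victim_pid].get)
    moved = partitions[victim_pid][victim_w] // 2
    partitions[victim_pid][victim_w] -= moved
    thief_w = next(iter(partitions[empty_pid]))
    partitions[empty_pid][thief_w] += moved
    return moved
```

Batching is what lets this scale: instead of one message per stolen task, a single escalated request repatriates many tasks across the tree.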

  19. Outline
     • Introduction
     • Asynchronous Direct Messages (ADM)
     • ADM schedulers
     • Evaluation

  20. Evaluation
     • Simulated machine: tiled CMP
       – 32, 64, or 128 in-order dual-thread SPARC cores (64-256 threads)
       – 3-level cache hierarchy, directory coherence
       – Shown: a 64-core, 16-tile CMP
     • Benchmarks:
       – Loop-parallel: canneal, cg, gtfold
       – Task-parallel: maxflow, mergesort, ced, hashjoin
       – Focus on a representative subset of results; see the paper for the full set

  21. Results
     • SW scalability is limited by scheduling overheads
     • Carbon and ADM: small overheads that scale
     • ADM matches Carbon → no need for a HW scheduler
     [Figure: execution-time breakdown — App, Queues, Stealing, Starved]

  22. Flexible policies: gtfold case study
     • In gtfold, FIFO queues let tasks clear critical dependences faster
       – FIFO queues are trivial in SW and ADM
       – Carbon (HW) is stuck with LIFO
     • ADM achieves a 40x speedup over Carbon
     • You can't implement all scheduling policies in HW!
     [Figure: gtfold speedups]
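The flexibility argument on this slide comes down to a one-line difference in software. The sketch below (hypothetical helper, not gtfold's actual code) shows that flipping a software queue between LIFO and FIFO execution order is trivial — whereas a hardware LIFO queue like Carbon's fixes the order permanently.

```python
# Sketch: queue policy is a one-line choice in a software scheduler.
from collections import deque

def run_order(tasks, policy):
    """Return the order tasks would execute under the given policy."""
    q = deque(tasks)
    order = []
    while q:
        # LIFO pops the newest task; FIFO pops the oldest. In HW, this
        # choice is baked into the queue and cannot be changed.
        order.append(q.pop() if policy == "lifo" else q.popleft())
    return order
```

For gtfold, running the oldest tasks first (FIFO) clears critical dependences sooner, which is where the 40x advantage over Carbon's fixed LIFO comes from.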
