Flexible Architectural Support for Fine‐Grain Scheduling


SLIDE 1

Flexible Architectural Support for Fine‐Grain Scheduling

Daniel Sanchez, Richard M. Yoo, Christos Kozyrakis

March 16th, 2010
Stanford University

SLIDE 2

Overview

  • Our focus: User‐level schedulers for parallel runtimes

– Cilk, TBB, OpenMP, …

  • Trends:

– More cores/chip → Need to exploit finer‐grain parallelism
– Deeper memory hierarchies, costlier cache coherence → Communication through shared memory increasingly inefficient

  • Existing fine‐grain schedulers:

– Software‐only: Slow, do not scale
– Hardware‐only: Fast, but inflexible

  • Our contribution: Hardware‐aided approach

– HW: Fast, asynchronous messages between threads (ADM)
– SW: Scalable message‐passing schedulers
– ADM schedulers scale like HW, flexible like SW schedulers

SLIDE 3

Outline

  • Introduction
  • Asynchronous Direct Messages (ADM)
  • ADM schedulers
  • Evaluation


SLIDE 4

Fine‐grain parallelism

  • Fine‐grain parallelism: Divide the work in a parallel phase into small tasks (~1K‐10K instructions)

  • Potential advantages:

– Expose more parallelism
– Reduce load imbalance
– Adapt to a dynamic environment (e.g. changing # of cores)

  • Potential disadvantages:

– Large scheduling overheads
– Poor locality (if the application has inter‐task locality)


SLIDE 5

Task‐stealing schedulers

[Figure: threads T0…Tn, one task queue per thread; threads enqueue and dequeue tasks locally and steal from other queues]

  • One task queue per thread
  • Threads enqueue and dequeue tasks from their own queue
  • When a thread runs out of work, it tries to steal tasks from another thread
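For reference, here is a minimal software sketch of this scheme in C++. The structure (owner pops LIFO, thieves steal FIFO) is the classic task-stealing design; the names and the mutex-per-queue locking are our simplification, not the implementation evaluated in these slides.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <vector>

using Task = std::function<void()>;

struct Worker {
    std::deque<Task> queue;  // one task queue per thread
    std::mutex lock;         // real runtimes use lock-free deques instead

    void enqueue(Task t) {
        std::lock_guard<std::mutex> g(lock);
        queue.push_back(std::move(t));
    }
    std::optional<Task> dequeue() {  // owner pops the newest task (LIFO)
        std::lock_guard<std::mutex> g(lock);
        if (queue.empty()) return std::nullopt;
        Task t = std::move(queue.back());
        queue.pop_back();
        return t;
    }
    std::optional<Task> steal() {    // thieves take the oldest task (FIFO)
        std::lock_guard<std::mutex> g(lock);
        if (queue.empty()) return std::nullopt;
        Task t = std::move(queue.front());
        queue.pop_front();
        return t;
    }
};

// Out of work: scan the other workers and try to steal from one of them.
std::optional<Task> find_work(std::vector<Worker>& workers, std::size_t self) {
    for (std::size_t i = 0; i < workers.size(); ++i)
        if (i != self)
            if (auto t = workers[i].steal()) return t;
    return std::nullopt;
}
```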

SLIDE 6

Task‐stealing: Components

  • 1. Queues
  • 2. Policies (enq/deq, steal)
  • 3. Communication

  • In software schedulers:

—Queues and policies are cheap
—Communication through shared memory increasingly expensive!

[Figure: per‐thread queues with enq/deq and steal arrows; execution time breakdown into App, Queues, Stealing, and Starved]

SLIDE 7

Hardware schedulers: Carbon

  • Carbon [ISCA ‘07]: HW queues, policies, communication

– One hardware LIFO task queue per core
– Special instructions to enqueue/dequeue tasks

  • Implementation:

– Centralized queues for fast stealing (Global Task Unit)
– One small task buffer per core to hide GTU latency (Local Task Units)

[Figure: execution time breakdowns (App, Queues, Stealing, Starved) showing 31x and 26x speedups]

Large benefits if the app matches the HW policies; useless if it doesn't

SLIDE 8

Approaches to fine‐grain scheduling

Fine‐grain scheduling approaches:

  • Software‐only (OpenMP, Cilk, TBB, X10, …): SW queues & policies, SW communication

– Flexible, no extra HW, but high‐overhead

  • Hardware‐only (Carbon, GPUs): HW queues & policies, HW communication

– Low‐overhead, but inflexible, special‐purpose HW

  • Hardware‐aided (Asynchronous Direct Messages): SW queues & policies, HW communication

– Low‐overhead, flexible, general‐purpose HW

SLIDE 9

Outline

  • Introduction
  • Asynchronous Direct Messages (ADM)
  • ADM schedulers
  • Evaluation


SLIDE 10

Asynchronous Direct Messages

  • ADM: Messaging between threads tailored to scheduling and control needs:

—Low‐overhead: short messages, sent from / received to registers, independent from coherence
—Overlap communication and computation: asynchronous messages with user‐level interrupts
—General‐purpose: generic interface allows reuse


SLIDE 11

ADM Microarchitecture

  • One ADM unit per core:

– Receive buffer holds messages until dequeued by the thread
– Send buffer holds sent messages pending acknowledgement
– Thread ID Translation Buffer translates TID → core ID on sends
– Small structures (16‐32 entries), don't grow with # of cores

SLIDE 12

ADM ISA

Instruction         Description
adm_send r1, r2     Sends a message of (r1) words (0‐6) to thread with ID (r2)
adm_peek r1, r2     Returns source and message length at head of rx buffer
adm_rx r1, r2       Dequeues message at head of rx buffer
adm_ei / adm_di     Enable / disable receive interrupts

  • Send and receive are atomic (single instruction)

– Send completes when the message is copied to the send buffer
– Receive blocks if the buffer is empty
– Peek doesn't block, enables polling

  • The ADM unit generates a user‐level interrupt on the running thread when a message is received

– No stack switching; handler code partially saves context (used registers) → fast
– Interrupts can be disabled to preserve atomicity w.r.t. message reception
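One way a runtime might expose this ISA is through compiler intrinsics. The __adm_* names and C signatures below are hypothetical stand-ins for the instructions in the table; only the semantics (blocking receive, non-blocking peek, interrupt enable/disable) come from this slide.

```cpp
#include <cstdint>

// Hypothetical intrinsics mirroring the ADM ISA; a real implementation
// would lower each call to the corresponding single instruction.
extern "C" {
    void __adm_send(const uint64_t* words, int len, int dest_tid); // adm_send
    int  __adm_peek(int* src_tid, int* len);  // adm_peek: nonzero if a message waits
    int  __adm_rx(uint64_t* words);           // adm_rx: blocks, returns source TID
    void __adm_ei();                          // enable receive interrupts
    void __adm_di();                          // disable receive interrupts
}

// Poll without blocking: peek first, dequeue only if a message is present.
bool try_receive(uint64_t* words, int* src_tid) {
    int len;
    if (!__adm_peek(src_tid, &len)) return false; // rx buffer empty
    *src_tid = __adm_rx(words);                   // dequeue the message
    return true;
}

// Run a critical section atomically w.r.t. message reception by
// disabling receive interrupts around it.
template <typename F>
void with_rx_interrupts_disabled(F body) {
    __adm_di();
    body();
    __adm_ei();
}
```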

SLIDE 13

Outline

  • Introduction
  • Asynchronous Direct Messages (ADM)
  • ADM schedulers
  • Evaluation


SLIDE 14

ADM Schedulers

  • Message‐passing schedulers
  • Replace parallel runtime’s (e.g. TBB) scheduler

— Application programmer is oblivious to this

  • Threads can perform two roles:

– Worker: Execute parallel phase, enqueue & dequeue tasks
– Manager: Coordinate task stealing & parallel phase termination

  • Centralized scheduler: A single manager coordinates all workers

[Figure: workers T0–T3; T0 is manager and worker!]
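To make the protocol on the next slides concrete, here is one plausible message vocabulary for such a scheduler. The message names follow the slides (UPDATE, STEAL_REQ, TASK); the enum encoding, the field layout, and the phase-termination message are our assumptions.

```cpp
#include <cstdint>

// Illustrative message vocabulary for the centralized scheduler.
enum class MsgType : uint8_t {
    Update,    // worker -> manager: approximate task count
    StealReq,  // manager -> victim: send tasks to a starved thief
    Task,      // victim -> thief: a stolen task descriptor
    Finished,  // worker -> manager: phase termination (assumed name)
};

struct Msg {
    MsgType  type;
    uint16_t src_tid;     // sender thread ID
    uint64_t payload[6];  // ADM messages carry up to 6 words (slide 12)
};
```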

SLIDE 15

Centralized Scheduler: Updates

[Figure: manager holds approximate task counts for workers T0–T3; as T0's queue grows, it sends UPDATE <4> and UPDATE <8> messages]

  • Manager keeps approximate task counts of each worker


  • Workers only notify manager at exponential thresholds
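A sketch of that exponential-threshold rule, assuming a hypothetical send_update() wrapper around an ADM message to the manager: the worker reports only when its count roughly doubles or halves, so the manager's view stays approximate while update traffic grows only logarithmically with queue size.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical stand-in for sending an ADM UPDATE <count> to the manager.
static void send_update(std::size_t count) {
    std::printf("UPDATE <%zu>\n", count);  // real code: adm_send to manager TID
}

// Notify the manager only at exponential thresholds: a queue that grows
// to N tasks triggers O(log N) updates instead of N.
struct UpdateFilter {
    std::size_t last_reported = 0;

    void on_count_change(std::size_t count) {
        if (count >= 2 * last_reported + 1 || count <= last_reported / 2) {
            send_update(count);
            last_reported = count;
        }
    }
};
```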
SLIDE 16

Centralized Scheduler: Steals

[Figure: a starved worker sends UPDATE <1>; the manager sends STEAL_REQ <T1→T2, 1> to victim T1, which sends a TASK message to thief T2]

  • Manager requests a steal from the worker with most tasks

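The manager side of that exchange might look like the sketch below, using the same hypothetical messaging wrappers. The choice of victim (highest approximate count) comes from this slide; the steal-half amount is a common work-stealing heuristic, not necessarily the exact policy used here.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical stand-in for an ADM STEAL_REQ <victim -> thief, n> message.
static void send_steal_req(int victim, int thief, int ntasks) {
    std::printf("STEAL_REQ <T%d->T%d, %d>\n", victim, thief, ntasks);
}

// On an UPDATE showing `thief` is starved, direct the worker with the
// most (approximately counted) tasks to send it work directly.
void on_starved(std::vector<int>& approx_counts, int thief) {
    int victim = -1, most = 0;
    for (int t = 0; t < (int)approx_counts.size(); ++t)
        if (t != thief && approx_counts[t] > most) {
            most = approx_counts[t];
            victim = t;
        }
    if (victim < 0) return;            // nobody has surplus work
    int n = most > 1 ? most / 2 : 1;   // steal half: a common heuristic
    approx_counts[victim] -= n;        // counts remain approximate
    approx_counts[thief]  += n;
    send_steal_req(victim, thief, n);  // victim will send TASK msgs to thief
}
```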

SLIDE 17

Hierarchical Scheduler

  • Centralized scheduler:

– Does all communication through messages
– Enables directed stealing, task prefetching
– Does not scale beyond ~16 threads

  • Solution: Hierarchical scheduler

—Workers and managers form a tree

[Figure: tree with workers T0–T7 at the leaves, 1st‐level managers T0 and T4, and a 2nd‐level manager T1]

SLIDE 18

Hierarchical Scheduler: Steals

[Figure: task counts aggregated up the tree; a steal request climbs the tree and descends into the subtree with surplus work, with victims sending TASK messages (x2, x4) to starved workers]

  • Steals can span multiple levels

— A single steal rebalances two partitions at once
— Scales to hundreds of threads
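A sketch of how a manager in the tree might pick where a multi-level steal descends. The fan-out value and the per-child bookkeeping are illustrative assumptions, not the exact configuration in these slides.

```cpp
#include <vector>

constexpr int kFanOut = 4;  // illustrative fan-out, not the paper's exact value

// Each manager tracks approximate task counts for its children
// (workers at the first level, lower-level managers above that).
struct TreeManager {
    std::vector<int> child_counts;  // one entry per child subtree

    // A steal request from a starved subtree is forwarded down into the
    // sibling subtree with the most work; because whole partitions are
    // rebalanced at once, one steal can move work for many threads.
    int richest_child() const {
        int best = -1, most = 0;
        for (int c = 0; c < (int)child_counts.size(); ++c)
            if (child_counts[c] > most) {
                most = child_counts[c];
                best = c;
            }
        return best;  // -1 if no child has surplus work
    }
};
```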

SLIDE 19

Outline

  • Introduction
  • Asynchronous Direct Messages (ADM)
  • ADM schedulers
  • Evaluation


SLIDE 20

Evaluation

  • Simulated machine: Tiled CMP

– 32, 64, 128 in‐order dual‐thread SPARC cores (64–256 threads)
– 3‐level cache hierarchy, directory coherence

  • Benchmarks:

– Loop‐parallel: canneal, cg, gtfold
– Task‐parallel: maxflow, mergesort, ced, hashjoin
– Focus on a representative subset of results; see the paper for the full set

[Figure: 64‐core, 16‐tile CMP, with per‐tile detail]

SLIDE 21

Results

[Figure: execution time breakdowns (App, Queues, Stealing, Starved) across schedulers and thread counts]

  • SW scalability limited by scheduling overheads
  • Carbon and ADM: Small overheads that scale
  • ADM matches Carbon → no need for a HW scheduler
SLIDE 22

Flexible policies: gtfold case study

  • In gtfold, FIFO queues allow tasks to clear critical dependences faster

—FIFO queues trivial in SW and ADM
—Carbon (HW) stuck with LIFO

  • ADM achieves 40x speedup over Carbon
  • Can't implement all scheduling policies in HW!
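The flexibility argument in miniature: with software queues, switching from LIFO to FIFO is a one-line policy change, sketched below with a std::deque (the actual runtime's queue structure may differ), whereas a hardware task queue's discipline is fixed at design time.

```cpp
#include <deque>

// With a software task queue, the scheduling discipline is just a policy
// choice: pop from the back for LIFO (depth-first, good locality) or from
// the front for FIFO (clears gtfold's critical dependences faster).
// Caller must ensure the queue is non-empty.
template <typename Task>
Task dequeue(std::deque<Task>& q, bool fifo) {
    Task t = fifo ? std::move(q.front()) : std::move(q.back());
    if (fifo) q.pop_front(); else q.pop_back();
    return t;
}
```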