Design Principles for End-to-End Multicore Schedulers - PowerPoint PPT Presentation


SLIDE 1

Design Principles for End-to-End Multicore Schedulers

Simon Peter⋆ Adrian Schüpbach⋆ Paul Barham† Andrew Baumann⋆ Rebecca Isaacs† Tim Harris† Timothy Roscoe⋆

⋆Systems Group, ETH Zurich † Microsoft Research

HotPar’10


SLIDE 2

Context: the Barrelfish multikernel operating system

◮ Developed at ETHZ and Microsoft Research
◮ Scalable research OS on heterogeneous multicore hardware
  ◮ Operating system principles and structure
  ◮ Programming models and language runtime systems
◮ Other scalable OS approaches are similar
  ◮ Tessellation, Corey, ROS, fos, ...
◮ Ideas in this talk are more widely applicable

SLIDE 3

Today’s talk topic

OS Scheduler architecture for today’s (and tomorrow’s) multicore machines

◮ General-purpose setting:
  ◮ Dynamic workload mix
  ◮ Multiple parallel apps
  ◮ Interactive parallel apps

SLIDE 4

Why this is a problem: A simple example

◮ Run 2 OpenMP applications concurrently
◮ On a 16-core AMD Shanghai system
◮ Intel OpenMP library
◮ Linux OS

SLIDE 5

Why this is a problem: 2x OpenMP on 16-core Linux

◮ One app is CPU-Bound:

#pragma omp parallel
for (;;)
    iterations[omp_get_thread_num()]++;

◮ The other is synchronization-intensive (e.g. BARRIER):

#pragma omp parallel
for (;;) {
    #pragma omp barrier
    iterations[omp_get_thread_num()]++;
}

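Putting the two snippets together: a minimal, self-contained reconstruction of both microbenchmarks might look as follows. This is a sketch based only on the slides; the padding against false sharing, the MAX_THREADS bound, and the command-line switch are assumptions, and the loops run forever because the experiment kills the processes externally.

// microbench.c: hedged reconstruction of the two microbenchmarks.
// Build: gcc -O2 -fopenmp microbench.c -o microbench
// Run:   ./microbench          (CPU-bound variant)
//        ./microbench barrier  (synchronization-intensive variant)
#include <omp.h>
#include <string.h>

#define MAX_THREADS 64

// One counter per thread, padded to a cache line so counters do not share
// lines (an assumption; the slides show only a flat array).
static struct {
    volatile unsigned long n;
    char pad[64 - sizeof(unsigned long)];
} iterations[MAX_THREADS];

int main(int argc, char **argv)
{
    int barrier = argc > 1 && strcmp(argv[1], "barrier") == 0;

    if (barrier) {
        #pragma omp parallel
        for (;;) {
            #pragma omp barrier   // whole team must arrive before anyone continues
            iterations[omp_get_thread_num()].n++;
        }
    } else {
        #pragma omp parallel
        for (;;)
            iterations[omp_get_thread_num()].n++;
    }
    return 0;  // unreachable; the experiment stops the processes with killall
}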

SLIDE 6

Why this is a problem: 2x OpenMP on 16-core Linux

◮ Run for x in [2..16]:

  ◮ OMP_NUM_THREADS=x ./BARRIER &
  ◮ OMP_NUM_THREADS=8 ./cpu_bound &
  ◮ sleep 20
  ◮ killall BARRIER cpu_bound

◮ Plot average iterations/thread/s over 20s


SLIDE 7

Why this is a problem: 2x OpenMP on 16-core Linux

[Plot: relative rate of progress vs. number of BARRIER threads, for the CPU-Bound and BARRIER apps]

Up to 8 BARRIER threads (space-partitioning):

◮ CPU-Bound stays at 1 (same thread allocation)
◮ BARRIER degrades (due to increasing cost)

From 9 BARRIER threads, i.e. threads > cores (time-multiplexing):

◮ CPU-Bound degrades linearly
◮ BARRIER drops sharply (only makes progress when all threads run concurrently)

SLIDE 16

Why this is a problem: 2x OpenMP on 16-core Linux

◮ Gang scheduling or smart core allocation would help
◮ Gang scheduling:
  ◮ The OS is unaware of apps' requirements
  ◮ The run-time system could have known them
  ◮ E.g. via annotations or the compiler
◮ Smart core allocation:
  ◮ The OS knows the general system state
  ◮ The run-time system chooses the number of threads
◮ Information and mechanisms are in the wrong place

SLIDE 17

Why this is a problem: 2x OpenMP on 16-core Linux

[Plot: relative rate of progress vs. number of BARRIER threads, for the CPU-Bound and BARRIER apps]

◮ Huge error bars (min/max over 20 runs)
◮ Random placement of threads onto cores

SLIDE 18

Why this is a problem: 16-core AMD Shanghai system

[Diagram: four dies connected by HyperTransport (HT) links; each die has four cores sharing an L3 cache]

◮ Same-die L3 access is twice as fast as cross-die access
◮ The OpenMP run-time does not know about this machine

SLIDE 21

Why this is a problem: 2x OpenMP on 16-core Linux

[Plot: relative rate of progress vs. number of BARRIER threads, for the CPU-Bound and BARRIER apps]

2-thread case: performance difference of 0.4


SLIDE 22

Why this is a problem: System diversity

[Diagram: Sun Niagara T2 die with cores C0-C7, per-core L2 caches, SPUs and FPUs, a full crossbar, four memory controller units (MCUs), and FB-DIMM memory channels]

Sun Niagara T2
◮ Flat, fast cache hierarchy

[Diagram: AMD Opteron (Magny-Cours) with two six-core dies, each sharing an L3 cache, connected by HT3 links]

AMD Opteron (Magny-Cours)
◮ On-chip interconnect

Intel Nehalem (Beckton)
◮ On-die ring network

◮ Manual tuning is increasingly difficult
◮ Architectures change too quickly
◮ Offline auto-tuning (e.g. ATLAS) is limited

SLIDE 24

Online adaptation

◮ Online adaptation remains viable
◮ Easier with contemporary runtime systems:
  ◮ OpenMP, Grand Central Dispatch, ConcRT, MPI, ...
  ◮ Synchronization patterns are more explicit
◮ But it needs information in the right places

SLIDE 25

The end-to-end approach

◮ The system stack:

Component              Related work
Hardware               Heterogeneous, ...
OS scheduler           CAMP, HASS, ...
Runtime systems        OpenMP, MPI, ConcRT, McRT, ...
Compilers              Auto-parallelization, ...
Programming paradigms  MapReduce, ICC, ...
Applications           Annotations, ...

◮ Involve all components, top to bottom
◮ Need to cut through classical OS abstractions
◮ Here we focus on OS / runtime-system integration

SLIDE 26

Design Principles


SLIDE 27

Design principles

1. Time-multiplexing cores is still needed

◮ Resource abundance = scheduler freedom
◮ Asymmetric multicore architectures:
  ◮ Contention for “big” cores
◮ Provide real-time QoS to interactive apps without wasting cores
◮ Avoid power wasted through over-provisioning

SLIDE 28

Design principles

2. Schedule at multiple timescales

◮ Interactive workloads are now parallel
  ◮ Requirements might change abruptly
  ◮ E.g. a parallel web browser
◮ Much shorter, interactive time scales
  ◮ Thus scheduling overhead must be small
◮ Synchronized scheduling on every time-slice won't scale

SLIDE 29

Implementation in Barrelfish

◮ Combination of techniques at different time granularities:
  ◮ Long-term placement of apps on cores
  ◮ Medium-term resource allocation
  ◮ Short-term per-core scheduling
◮ Phase-locked gang scheduling:
  ◮ Gang scheduling over interactive timescales

SLIDE 31

Phase-locked gang scheduling

◮ Decouple schedule synchronization from dispatch (see the sketch below)

Best-effort (actual trace):

◮ Progress only in small time windows

Phase-locked gang scheduling (actual trace):

◮ Synchronize core-local clocks
◮ Agree on a future gang start time and gang period
◮ Resync in the future when necessary
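To make the mechanism concrete, here is a small sketch; the names and structure are hypothetical, since the slides show traces rather than code. The point it illustrates: after a one-time agreement on a start time and period, each core decides purely locally, from its synchronized clock, whether the gang should be on the CPU, so no cross-core communication happens per time-slice.

// Hedged sketch of phase-locked gang dispatch; illustrative only, not Barrelfish code.
#include <stdbool.h>
#include <stdint.h>

struct gang_schedule {
    uint64_t start;    // agreed future start time, in synchronized clock ticks
    uint64_t period;   // length of one scheduling round
    uint64_t slice;    // portion of each round given to the gang (slice <= period)
};

// Called by each per-core scheduler on its local timer tick. Cores stay in
// phase because their clocks are synchronized and they share the same
// (start, period, slice) tuple.
static bool gang_runs_now(const struct gang_schedule *g, uint64_t now)
{
    if (now < g->start)
        return false;                 // the gang has not started yet
    uint64_t phase = (now - g->start) % g->period;
    return phase < g->slice;          // the gang owns the first part of each round
}

Resynchronization then only means agreeing on a fresh (start, period, slice) tuple, which is the "resync in the future when necessary" step above.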

SLIDE 37

Design principles

3. Reason online about the hardware

◮ We employ a system knowledge base (SKB)
  ◮ Contains a rich representation of the hardware
  ◮ Queries in a subset of first-order logic (loosely illustrated below)
  ◮ Logical unification aids dealing with diversity
◮ Both OS and apps use it
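The real SKB answers queries written in a subset of first-order logic; the C sketch below is only a loose illustration of the kind of hardware facts it might hold and one placement query over them. All names and the topology encoding are invented.

// Toy hardware representation and one "query" over it. The actual SKB stores
// logical facts and answers first-order-logic queries with unification; this
// only illustrates the information such a query can return.
#include <stddef.h>

struct core_fact { int id; int die; int l3; };   // one fact per core

// 16-core AMD Shanghai-like topology: four dies, four cores per die, one L3 per die.
static const struct core_fact cores[16] = {
    { 0,0,0},{ 1,0,0},{ 2,0,0},{ 3,0,0},  { 4,1,1},{ 5,1,1},{ 6,1,1},{ 7,1,1},
    { 8,2,2},{ 9,2,2},{10,2,2},{11,2,2},  {12,3,3},{13,3,3},{14,3,3},{15,3,3},
};

// Query: up to n cores that share L3 cache 'l3', the placement a runtime
// would want for a synchronization-intensive team such as BARRIER.
static size_t cores_sharing_l3(int l3, int *out, size_t n)
{
    size_t found = 0;
    for (size_t i = 0; i < 16 && found < n; i++)
        if (cores[i].l3 == l3)
            out[found++] = cores[i].id;
    return found;
}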

SLIDE 38

Design principles

4. Reason online about each application

◮ The OS should exploit knowledge about apps for efficiency
  ◮ E.g. gang-schedule the threads of an OpenMP team
  ◮ But there is no sense in gang-scheduling unrelated threads
◮ A single app might go through different phases
  ◮ The optimal allocation of resources changes over time

Implementation:

◮ Apps submit scheduling manifests to a planner
  ◮ Contain predicted long-term resource requirements
  ◮ Expressed as constrained cost-functions (hypothetical sketch below)
  ◮ May make use of any information in the SKB
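The slides do not show what a manifest looks like; purely as a hypothetical sketch, it could carry per-phase constraints along these lines (every name and field below is invented, and real manifests are cost-functions rather than a fixed struct):

// Hypothetical shape of a scheduling manifest.
#include <stdbool.h>

enum phase_kind { PHASE_CPU_BOUND, PHASE_SYNC_INTENSIVE };

struct phase_requirement {
    enum phase_kind kind;
    unsigned min_cores, max_cores;   // constraint on the core allocation
    bool gang;                       // gang-schedule this phase's threads
    bool same_l3;                    // prefer cores sharing an L3 (via SKB facts)
};

struct scheduling_manifest {
    unsigned nphases;                // predicted long-term phases of the app
    struct phase_requirement phases[8];
};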

SLIDE 39

Design principles

5. Applications and OS must communicate

◮ Implements the end-to-end principle
◮ Resource allocation may be renegotiated at runtime

Implementation:

◮ Hardware threads run user-level dispatchers
  ◮ Cf. Psyche, inheritance scheduling
◮ Related dispatchers are grouped into dispatcher groups
  ◮ Derived from the RTIDs of McRT
  ◮ Used as handles when renegotiating (hypothetical sketch below)
◮ Scheduler activations [Anderson 1992] to inform the app
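As a sketch of how these pieces could fit together; every name below is invented, since the slides only say that dispatcher groups serve as renegotiation handles and that scheduler activations inform the app:

// Hypothetical renegotiation interface; not Barrelfish's actual API.
struct scheduling_manifest;        // see the manifest sketch above

typedef int dispatcher_group_t;    // handle naming a set of related dispatchers

struct core_allocation {
    unsigned ncores;
    const int *core_ids;
};

// App -> OS: ask the planner for a different allocation for this group.
int dg_renegotiate(dispatcher_group_t dg, const struct scheduling_manifest *m);

// OS -> app: upcall in the spirit of scheduler activations, telling the
// runtime its allocation changed so it can adapt, e.g., its thread count.
void app_allocation_changed(dispatcher_group_t dg, const struct core_allocation *a);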

SLIDE 40

Implementation in the Barrelfish OS

[Diagram: per-core user-level dispatchers (Disp), grouped into dispatcher groups D1 and D2]


SLIDE 41

Open questions

◮ What are appropriate mechanisms and timescales for inter-core phase synchronization?
◮ How can programmers provide useful concurrency information to the runtime?
◮ How efficiently can the runtime specify requirements to the OS?
◮ Hidden cost (if any) of decoupling scheduling timescales?
◮ Tradeoffs between centralized and distributed planners?
◮ Appropriate level of expressivity for the SKB?