Design Principles for End-to-End Multicore Schedulers - PowerPoint PPT Presentation


SLIDE 1

Design Principles for End-to-End Multicore Schedulers

Simon Peter⋆ Adrian Schüpbach⋆ Paul Barham† Andrew Baumann⋆ Rebecca Isaacs† Tim Harris† Timothy Roscoe⋆

⋆Systems Group, ETH Zurich † Microsoft Research

HotPar’10


SLIDE 2

Context: the Barrelfish multikernel operating system

◮ Developed at ETHZ and Microsoft Research
◮ Scalable research OS on heterogeneous multicore hardware
  ◮ Operating system principles and structure
  ◮ Programming models and language runtime systems
◮ Other scalable OS approaches are similar
  ◮ Tessellation, Corey, ROS, fos, ...
◮ Ideas in this talk are more widely applicable

SLIDE 3

Today’s talk topic

OS Scheduler architecture for today’s (and tomorrow’s) multicore machines

◮ General-purpose setting:
  ◮ Dynamic workload mix
  ◮ Multiple parallel apps
  ◮ Interactive parallel apps

SLIDE 4

Why this is a problem: A simple example

◮ Run 2 OpenMP applications concurrently
◮ On a 16-core AMD Shanghai system
◮ Intel OpenMP library
◮ Linux OS

SLIDE 5

Why this is a problem: 2x OpenMP on 16-core Linux

◮ One app is CPU-Bound:

#pragma omp parallel
for (;;)
    iterations[omp_get_thread_num()]++;

◮ The other is synchronization-intensive (e.g. BARRIER):

#pragma omp parallel
for (;;) {
    #pragma omp barrier
    iterations[omp_get_thread_num()]++;
}

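Putting the two snippets together: a minimal, self-contained reconstruction of both microbenchmarks might look as follows. This is a sketch based only on the slides; the padding against false sharing, the MAX_THREADS bound, and the command-line switch are assumptions, and the loops run forever because the experiment kills the processes externally.

// microbench.c: hedged reconstruction of the two microbenchmarks.
// Build: gcc -O2 -fopenmp microbench.c -o microbench
// Run:   ./microbench          (CPU-bound variant)
//        ./microbench barrier  (synchronization-intensive variant)
#include <omp.h>
#include <string.h>

#define MAX_THREADS 64

// One counter per thread, padded to a cache line so counters do not share
// lines (an assumption; the slides show only a flat array).
static struct {
    volatile unsigned long n;
    char pad[64 - sizeof(unsigned long)];
} iterations[MAX_THREADS];

int main(int argc, char **argv)
{
    int barrier = argc > 1 && strcmp(argv[1], "barrier") == 0;

    if (barrier) {
        #pragma omp parallel
        for (;;) {
            #pragma omp barrier   // whole team must arrive before anyone continues
            iterations[omp_get_thread_num()].n++;
        }
    } else {
        #pragma omp parallel
        for (;;)
            iterations[omp_get_thread_num()].n++;
    }
    return 0;  // unreachable; the experiment stops the processes with killall
}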

SLIDE 6

Why this is a problem: 2x OpenMP on 16-core Linux

◮ Run for x in [2..16]:

  ◮ OMP_NUM_THREADS=x ./BARRIER &
  ◮ OMP_NUM_THREADS=8 ./cpu_bound &
  ◮ sleep 20
  ◮ killall BARRIER cpu_bound

◮ Plot average iterations/thread/s over 20s


SLIDE 7

Why this is a problem: 2x OpenMP on 16-core Linux

[Plot: relative rate of progress vs. number of BARRIER threads, for the CPU-Bound and BARRIER apps]

Up to 8 BARRIER threads (space-partitioning):

◮ CPU-Bound stays at 1 (same thread allocation)
◮ BARRIER degrades (due to increasing cost)

From 9 BARRIER threads, i.e. threads > cores (time-multiplexing):

◮ CPU-Bound degrades linearly
◮ BARRIER drops sharply (only makes progress when all threads run concurrently)

SLIDE 16

Why this is a problem: 2x OpenMP on 16-core Linux

◮ Gang scheduling or smart core allocation would help
◮ Gang scheduling:
  ◮ The OS is unaware of apps' requirements
  ◮ The run-time system could have known them
  ◮ E.g. via annotations or the compiler
◮ Smart core allocation:
  ◮ The OS knows the general system state
  ◮ The run-time system chooses the number of threads
◮ Information and mechanisms are in the wrong place

SLIDE 17

Why this is a problem: 2x OpenMP on 16-core Linux

[Plot: relative rate of progress vs. number of BARRIER threads, for the CPU-Bound and BARRIER apps]

◮ Huge error bars (min/max over 20 runs)
◮ Random placement of threads onto cores

SLIDE 18

Why this is a problem: 16-core AMD Shanghai system

[Diagram: four dies connected by HyperTransport (HT) links; each die has four cores sharing an L3 cache]

◮ Same-die L3 access is twice as fast as cross-die access
◮ The OpenMP run-time does not know about this machine

SLIDE 21

Why this is a problem: 2x OpenMP on 16-core Linux

[Plot: relative rate of progress vs. number of BARRIER threads, for the CPU-Bound and BARRIER apps]

2-thread case: performance difference of 0.4


SLIDE 22

Why this is a problem: System diversity

[Diagram: Sun Niagara T2 die with cores C0-C7, per-core L2 caches, SPUs and FPUs, a full crossbar, four memory controller units (MCUs), and FB-DIMM memory channels]

Sun Niagara T2
◮ Flat, fast cache hierarchy

[Diagram: AMD Opteron (Magny-Cours) with two six-core dies, each sharing an L3 cache, connected by HT3 links]

AMD Opteron (Magny-Cours)
◮ On-chip interconnect

Intel Nehalem (Beckton)
◮ On-die ring network

◮ Manual tuning is increasingly difficult
◮ Architectures change too quickly
◮ Offline auto-tuning (e.g. ATLAS) is limited

SLIDE 24

Online adaptation

◮ Online adaptation remains viable
◮ Easier with contemporary runtime systems:
  ◮ OpenMP, Grand Central Dispatch, ConcRT, MPI, ...
  ◮ Synchronization patterns are more explicit
◮ But it needs information in the right places

SLIDE 25

The end-to-end approach

◮ The system stack:

Component              Related work
Hardware               Heterogeneous, ...
OS scheduler           CAMP, HASS, ...
Runtime systems        OpenMP, MPI, ConcRT, McRT, ...
Compilers              Auto-parallelization, ...
Programming paradigms  MapReduce, ICC, ...
Applications           Annotations, ...

◮ Involve all components, top to bottom
◮ Need to cut through classical OS abstractions
◮ Here we focus on OS / runtime-system integration

SLIDE 26

Design Principles


SLIDE 27

Design principles

1. Time-multiplexing cores is still needed

◮ Resource abundance = scheduler freedom
◮ Asymmetric multicore architectures:
  ◮ Contention for “big” cores
◮ Provide real-time QoS to interactive apps without wasting cores
◮ Avoid power wasted through over-provisioning

SLIDE 28

Design principles

2. Schedule at multiple timescales

◮ Interactive workloads are now parallel
  ◮ Requirements might change abruptly
  ◮ E.g. a parallel web browser
◮ Much shorter, interactive time scales
  ◮ Thus scheduling overhead must be small
◮ Synchronized scheduling on every time-slice won't scale

SLIDE 29

Implementation in Barrelfish

◮ Combination of techniques at different time granularities:
  ◮ Long-term placement of apps on cores
  ◮ Medium-term resource allocation
  ◮ Short-term per-core scheduling
◮ Phase-locked gang scheduling:
  ◮ Gang scheduling over interactive timescales

SLIDE 31

Phase-locked gang scheduling

◮ Decouple schedule synchronization from dispatch (see the sketch below)

Best-effort (actual trace):

◮ Progress only in small time windows

Phase-locked gang scheduling (actual trace):

◮ Synchronize core-local clocks
◮ Agree on a future gang start time and gang period
◮ Resync in the future when necessary
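To make the mechanism concrete, here is a small sketch; the names and structure are hypothetical, since the slides show traces rather than code. The point it illustrates: after a one-time agreement on a start time and period, each core decides purely locally, from its synchronized clock, whether the gang should be on the CPU, so no cross-core communication happens per time-slice.

// Hedged sketch of phase-locked gang dispatch; illustrative only, not Barrelfish code.
#include <stdbool.h>
#include <stdint.h>

struct gang_schedule {
    uint64_t start;    // agreed future start time, in synchronized clock ticks
    uint64_t period;   // length of one scheduling round
    uint64_t slice;    // portion of each round given to the gang (slice <= period)
};

// Called by each per-core scheduler on its local timer tick. Cores stay in
// phase because their clocks are synchronized and they share the same
// (start, period, slice) tuple.
static bool gang_runs_now(const struct gang_schedule *g, uint64_t now)
{
    if (now < g->start)
        return false;                 // the gang has not started yet
    uint64_t phase = (now - g->start) % g->period;
    return phase < g->slice;          // the gang owns the first part of each round
}

Resynchronization then only means agreeing on a fresh (start, period, slice) tuple, which is the "resync in the future when necessary" step above.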

SLIDE 37

Design principles

3. Reason online about the hardware

◮ We employ a system knowledge base (SKB)
  ◮ Contains a rich representation of the hardware
  ◮ Queries in a subset of first-order logic (loosely illustrated below)
  ◮ Logical unification aids dealing with diversity
◮ Both OS and apps use it
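The real SKB answers queries written in a subset of first-order logic; the C sketch below is only a loose illustration of the kind of hardware facts it might hold and one placement query over them. All names and the topology encoding are invented.

// Toy hardware representation and one "query" over it. The actual SKB stores
// logical facts and answers first-order-logic queries with unification; this
// only illustrates the information such a query can return.
#include <stddef.h>

struct core_fact { int id; int die; int l3; };   // one fact per core

// 16-core AMD Shanghai-like topology: four dies, four cores per die, one L3 per die.
static const struct core_fact cores[16] = {
    { 0,0,0},{ 1,0,0},{ 2,0,0},{ 3,0,0},  { 4,1,1},{ 5,1,1},{ 6,1,1},{ 7,1,1},
    { 8,2,2},{ 9,2,2},{10,2,2},{11,2,2},  {12,3,3},{13,3,3},{14,3,3},{15,3,3},
};

// Query: up to n cores that share L3 cache 'l3', the placement a runtime
// would want for a synchronization-intensive team such as BARRIER.
static size_t cores_sharing_l3(int l3, int *out, size_t n)
{
    size_t found = 0;
    for (size_t i = 0; i < 16 && found < n; i++)
        if (cores[i].l3 == l3)
            out[found++] = cores[i].id;
    return found;
}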

SLIDE 38

Design principles

4. Reason online about each application

◮ The OS should exploit knowledge about apps for efficiency
  ◮ E.g. gang-schedule the threads of an OpenMP team
  ◮ But there is no sense in gang-scheduling unrelated threads
◮ A single app might go through different phases
  ◮ The optimal allocation of resources changes over time

Implementation:

◮ Apps submit scheduling manifests to a planner
  ◮ Contain predicted long-term resource requirements
  ◮ Expressed as constrained cost-functions (hypothetical sketch below)
  ◮ May make use of any information in the SKB
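The slides do not show what a manifest looks like; purely as a hypothetical sketch, it could carry per-phase constraints along these lines (every name and field below is invented, and real manifests are cost-functions rather than a fixed struct):

// Hypothetical shape of a scheduling manifest.
#include <stdbool.h>

enum phase_kind { PHASE_CPU_BOUND, PHASE_SYNC_INTENSIVE };

struct phase_requirement {
    enum phase_kind kind;
    unsigned min_cores, max_cores;   // constraint on the core allocation
    bool gang;                       // gang-schedule this phase's threads
    bool same_l3;                    // prefer cores sharing an L3 (via SKB facts)
};

struct scheduling_manifest {
    unsigned nphases;                // predicted long-term phases of the app
    struct phase_requirement phases[8];
};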

SLIDE 39

Design principles

5. Applications and OS must communicate

◮ Implements the end-to-end principle
◮ Resource allocation may be renegotiated at runtime

Implementation:

◮ Hardware threads run user-level dispatchers
  ◮ Cf. Psyche, inheritance scheduling
◮ Related dispatchers are grouped into dispatcher groups
  ◮ Derived from the RTIDs of McRT
  ◮ Used as handles when renegotiating (hypothetical sketch below)
◮ Scheduler activations [Anderson 1992] to inform the app
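As a sketch of how these pieces could fit together; every name below is invented, since the slides only say that dispatcher groups serve as renegotiation handles and that scheduler activations inform the app:

// Hypothetical renegotiation interface; not Barrelfish's actual API.
struct scheduling_manifest;        // see the manifest sketch above

typedef int dispatcher_group_t;    // handle naming a set of related dispatchers

struct core_allocation {
    unsigned ncores;
    const int *core_ids;
};

// App -> OS: ask the planner for a different allocation for this group.
int dg_renegotiate(dispatcher_group_t dg, const struct scheduling_manifest *m);

// OS -> app: upcall in the spirit of scheduler activations, telling the
// runtime its allocation changed so it can adapt, e.g., its thread count.
void app_allocation_changed(dispatcher_group_t dg, const struct core_allocation *a);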

SLIDE 40

Implementation in the Barrelfish OS

[Diagram: per-core user-level dispatchers (Disp), grouped into dispatcher groups D1 and D2]


SLIDE 41

Open questions

◮ What are appropriate mechanisms and timescales for inter-core phase synchronization?
◮ How can programmers provide useful concurrency information to the runtime?
◮ How efficiently can the runtime specify requirements to the OS?
◮ Hidden cost (if any) of decoupling scheduling timescales?
◮ Tradeoffs between centralized and distributed planners?
◮ Appropriate level of expressivity for the SKB?