The Barrelfish operating system for CMPs: research issues


slide-1
SLIDE 1

The Barrelfish operating system for CMPs: research issues

Tim Harris

Based on slides by Andrew Baumann and Rebecca Isaacs. Joint work with colleagues at MSR Cambridge and ETH Zurich.

slide-2
SLIDE 2

The Barrelfish project

  • Collaboration between ETH Zurich and Microsoft Research Cambridge (MSRC)

Andrew Baumann, Paul Barham, Richard Black, Tim Harris, Orion Hodson, Rebecca Isaacs, Simon Peter, Jan Rellermeyer, Timothy Roscoe, Adrian Schüpbach, Akhilesh Singhania, Pierre-Evariste Dagand, Ankush Gupta, Raffaele Sandrini, Dario Simone, Animesh Trivedi

slide-3
SLIDE 3

  • Introduction
  • Hardware and workloads
  • Multikernel design principles
  • Communication costs
  • Starting a domain

slide-4
SLIDE 4

Do we need a new OS?

Sun SPARC Enterprise M9000 server

  • M9000-64: up to 64 CPUs, 256 cores
  • 180cm x 167.4cm x 126cm
  • 1880kg

slide-5
SLIDE 5

Do we need a new OS?

SGI Origin 3000

  • Up to 512 processors
  • Up to 1TB memory

slide-6
SLIDE 6

Do we need a new OS?

  • How might the design of a CMP differ from these existing systems?
  • How might the workloads for a CMP differ from those of existing multi-processor machines?

slide-7
SLIDE 7

The clichéd single-threaded perf graph

[Graph: log(sequential perf) vs. year]

  • Historical 1-thread perf gains came via improved clock rate and transistors used to extract ILP
  • #transistors still growing, but delivered as additional cores and accelerators
  • The things that would have used this “lost” perf must now be written to use cores/accelerators

slide-8
SLIDE 8

Interactive perf

[Timeline: repeated cycles of user input followed by output]

slide-9
SLIDE 9

CC-NUMA architecture

[Diagram: nodes of CPUs + RAM (with directory) attached to the interconnect]

  • Adding more CPUs brings more of most other things
  • Locality property: only go to the interconnect for real I/O or sharing

slide-10
SLIDE 10

Machine architecture

[Diagram: Core1-Core4, pairs of cores sharing an L2 cache, one connection to RAM]

  • More cores bring more cycles
  • ...not necessarily proportionately more cache
  • ...nor more off-chip b/w or total RAM capacity

slide-11
SLIDE 11

Machine diversity: AMD 4-core

slide-12
SLIDE 12

...Sun Niagara-2

slide-13
SLIDE 13

...Sun Rock

slide-14
SLIDE 14

IEEE INTERNATIONAL SOLID-STATE CIRCUITS CONFERENCE 2010

  • J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl, D. Jenkins, H. Wilson, N. Borkar, G. Schrom, F. Pailet, S. Jain, T. Jacob, S. Yada, S. Marella, P. Salihundam, V. Erraguntla, M. Konow, M. Riepen, G. Droege, J. Lindemann, M. Gries, T. Apel, K. Henriss, T. Lund-Larsen, S. Steibl, S. Borkar, V. De, R. Van Der Wijngaart, T. Mattson (Intel: Hillsboro, OR; Bangalore, India; Braunschweig, Germany; Santa Clara, CA; DuPont, WA)

A 567mm² processor on 45nm CMOS integrates 48 IA-32 cores and 4 DDR3 channels in a 6×4 2D-mesh network. Cores communicate through message passing using 384KB of on-die shared memory. Fine-grain power management takes advantage of 8 voltage and 28 frequency islands to allow independent DVFS of cores and mesh. As performance scales, the processor dissipates between 25W and 125W.

slide-15
SLIDE 15

  • Introduction
  • Hardware and workloads
  • Multikernel design principles
  • Communication costs
  • Starting a domain

slide-16
SLIDE 16

The multikernel model

[Diagram: heterogeneous cores (x86, x64, ARM, GPU), each running an OS node with its own state replica; OS nodes exchange async messages over the hardware interconnect, and applications run across the nodes]

slide-17
SLIDE 17

Barrelfish: a multikernel OS

  • A new OS architecture for scalable multicore systems
  • Approach: structure the OS as a distributed system
  • Design principles:

– Make inter-core communication explicit
– Make OS structure hardware-neutral
– View state as replicated

slide-18
SLIDE 18

#1 Explicit inter-core communication

  • All communication with messages
  • Decouples system structure from inter-core communication mechanism
  • Communication patterns explicitly expressed
  • Better match for future hardware

– Naturally supports heterogeneous cores, non-coherent interconnects (PCIe)
– ...with cheap explicit message passing
– ...without cache coherence (e.g. Intel 80-core)

  • Allows split-phase operations
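
To make “split-phase” concrete, here is a minimal C sketch of the pattern, assuming hypothetical channel primitives (channel_t, chan_send, chan_try_recv) rather than Barrelfish's real messaging API: the requesting core issues a message, carries on with other work, and picks up the reply later.

    /* Split-phase operation over asynchronous messages (illustrative only;
     * channel_t, chan_send and chan_try_recv are assumed placeholders). */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { uint64_t op, arg, result; } msg_t;
    typedef struct channel channel_t;

    bool chan_send(channel_t *c, const msg_t *m);   /* non-blocking send */
    bool chan_try_recv(channel_t *c, msg_t *m);     /* non-blocking poll */
    void do_other_work(void);

    /* Phase 1: issue the request and return immediately. */
    void request_op(channel_t *c, uint64_t arg)
    {
        msg_t m = { .op = 1, .arg = arg };
        while (!chan_send(c, &m))
            do_other_work();            /* channel full: keep making progress */
    }

    /* Phase 2: poll for the reply whenever it is convenient. */
    bool poll_reply(channel_t *c, uint64_t *result)
    {
        msg_t m;
        if (!chan_try_recv(c, &m))
            return false;               /* no reply yet */
        *result = m.result;
        return true;
    }
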
slide-19
SLIDE 19

Communication latency

slide-20
SLIDE 20

Communication latency

slide-21
SLIDE 21

Message passing vs shared memory

  • Shared memory (move the data to the operation):

– Each core updates the same memory locations
– Cache coherence migrates modified cache lines
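
For contrast, a small self-contained C sketch of this shared-memory pattern (the struct layout is made up for illustration): every core updates the same locations, so the coherence protocol keeps migrating the modified lines between caches.

    /* Shared memory: move the data to the operation.  All cores write the
     * same lock word and counters, so those cache lines bounce between cores. */
    #include <pthread.h>
    #include <stdint.h>

    struct shared_state {
        pthread_mutex_t lock;
        uint64_t        counters[8];       /* updated by every core */
    };

    static struct shared_state state = { .lock = PTHREAD_MUTEX_INITIALIZER };

    void update(int idx)
    {
        pthread_mutex_lock(&state.lock);   /* lock line migrates to this core */
        state.counters[idx]++;             /* so does the modified data line  */
        pthread_mutex_unlock(&state.lock);
    }
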

slide-22
SLIDE 22

Shared memory scaling & latency

slide-23
SLIDE 23

Message passing

  • Message passing (move operation to the data):

– A single server core updates the memory locations
– Each client core sends RPCs to the server
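
A matching C sketch of the message-passing version (again with placeholder channel primitives): a single server core owns the data and applies every update, so the data lines stay in its cache and clients only exchange small request/reply messages.

    /* Message passing: move the operation to the data. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { uint64_t op, idx, reply; } msg_t;
    typedef struct channel channel_t;

    bool chan_try_recv(channel_t *c, msg_t *m);       /* non-blocking poll */
    void chan_reply(channel_t *c, const msg_t *m);

    void server_loop(channel_t *clients[], int nclients, uint64_t counters[])
    {
        for (;;) {
            for (int i = 0; i < nclients; i++) {
                msg_t m;
                if (!chan_try_recv(clients[i], &m))
                    continue;
                counters[m.idx]++;          /* data stays hot in this core's cache */
                m.reply = counters[m.idx];
                chan_reply(clients[i], &m);
            }
        }
    }
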

slide-24
SLIDE 24

Message passing

slide-25
SLIDE 25

Message passing

slide-26
SLIDE 26

#2 Hardware-neutral structure

  • Separate OS structure from hardware
  • Only hardware-specific parts:

– Message transports (highly optimised / specialised)
– CPU / device drivers

  • Adaptability to changing performance characteristics

– Late-bind protocol and message transport implementations
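
One way to read “late-bind the message transport” is an indirection like the following C sketch (the struct and transport names are assumptions, not the real Barrelfish interface): OS code is written against a transport-neutral table of operations, and the concrete implementation is chosen per channel at runtime.

    /* Hardware-neutral messaging: the transport is a table of operations
     * selected at channel-setup time. */
    #include <stdbool.h>
    #include <stddef.h>

    struct msg_transport {
        const char *name;
        bool (*send)(void *chan, const void *buf, size_t len);
        bool (*recv)(void *chan, void *buf, size_t len);
    };

    extern struct msg_transport urpc_shared_mem;   /* cache-coherent cores      */
    extern struct msg_transport pcie_transport;    /* non-coherent interconnect */

    const struct msg_transport *pick_transport(bool same_coherence_domain)
    {
        /* Late binding: decide per pair of cores, not at compile time. */
        return same_coherence_domain ? &urpc_shared_mem : &pcie_transport;
    }
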

slide-27
SLIDE 27

#3 Replicate common state

  • Potentially-shared state accessed as if it were a local replica

– Scheduler queues, process control blocks, etc.
– Required by the message-passing model

  • Naturally supports domains that do not share memory
  • Naturally supports changes to the set of running cores

– Hotplug, power management
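
A rough C sketch of what “state as replicas” means operationally (names and layout invented for illustration): reads never leave the core, while updates are applied to the local replica and then pushed to the other kernels as messages.

    /* Replicated state: local reads, message-based update propagation. */
    #include <stdint.h>

    #define MAX_CORES 64

    typedef struct channel channel_t;
    struct replica { uint64_t version; /* ... per-core copy of the state ... */ };

    void chan_send_update(channel_t *c, const struct replica *r);

    static struct replica local;                 /* this core's replica     */
    static channel_t *peer[MAX_CORES];           /* channels to other cores */

    uint64_t read_version(void)                  /* served entirely locally */
    {
        return local.version;
    }

    void update_state(int ncores, int mycore)
    {
        local.version++;                         /* apply locally first      */
        for (int i = 0; i < ncores; i++)         /* then notify the replicas */
            if (i != mycore)
                chan_send_update(peer[i], &local);
    }
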

slide-28
SLIDE 28

Replication vs sharing as the default

  • Replicas used as an optimisation in other systems
  • In a multikernel, sharing is a local optimisation

– Shared (locked) replica on closely-coupled cores
– Only when faster, as decided at runtime

  • Basic model remains split-phase
slide-29
SLIDE 29

  • Introduction
  • Hardware and workloads
  • Multikernel design principles
  • Communication costs
  • Starting a domain

slide-30
SLIDE 30

Applications running on Barrelfish

  • Slide viewer (but not today...)
  • Webserver (www.barrelfish.org)
  • Virtual machine monitor (runs unmodified Linux)
  • Parallel benchmarks:

– SPLASH-2 – OpenMP

  • SQLite
  • ECLiPSe (constraint engine)
  • more...
slide-31
SLIDE 31
1-way URPC message costs

                    Cycles    Msg / kcycle
2*4-core Intel
  Shared              180        11.97
  Non-shared          570         3.78
2*2-core AMD
  Same die            450         3.42
  1 hop               532         3.19
4*4-core AMD
  Shared              448         3.57
  1 hop               545         3.53
  2 hop               659         3.19
8*4-core AMD
  Shared              538         2.77
  1 hop               613         2.79
  2 hop               682         2.71

  • Two HyperTransport requests on AMD
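
The latencies above are for user-level RPC (URPC) channels in which each message is one cache line moved by the coherence protocol. The sketch below illustrates that general style of channel (sender fills a line and publishes it by writing a sequence word last, receiver polls); it is not the actual Barrelfish implementation and omits flow control.

    /* Cache-line-sized message channel, polled by the receiver. */
    #include <stdbool.h>
    #include <stdint.h>

    #define CACHE_LINE 64

    struct urpc_msg {
        volatile uint64_t payload[7];    /* 56 bytes of data             */
        volatile uint64_t seq;           /* written last: publishes msg  */
    } __attribute__((aligned(CACHE_LINE)));

    struct urpc_chan {
        struct urpc_msg *ring;           /* buffer shared with the peer  */
        uint64_t slots, pos, seq;        /* private to each endpoint     */
    };

    void urpc_send(struct urpc_chan *c, const uint64_t data[7])
    {
        struct urpc_msg *m = &c->ring[c->pos++ % c->slots];
        for (int i = 0; i < 7; i++)
            m->payload[i] = data[i];
        __sync_synchronize();            /* payload visible before seq   */
        m->seq = ++c->seq;
    }

    bool urpc_try_recv(struct urpc_chan *c, uint64_t data[7])
    {
        struct urpc_msg *m = &c->ring[c->pos % c->slots];
        if (m->seq != c->seq + 1)
            return false;                /* nothing new yet              */
        __sync_synchronize();
        for (int i = 0; i < 7; i++)
            data[i] = m->payload[i];
        c->pos++; c->seq++;
        return true;
    }
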

slide-32
SLIDE 32

Local vs remote messaging

  • URPC to a remote core compares favourably with IPC
  • No context switch: TLB unaffected
  • Lower cache impact
  • Higher throughput for pipelined messages

2*2-core AMD    Cycles   Msg / kcycle   I-cache lines used   D-cache lines used
URPC              450        3.42                9                    8
L4 IPC            424        2.36               25                   13

slide-33
SLIDE 33

Communication perf: IP loopback

  • 2*2-core AMD system, 1000-byte packets

– Linux: copy in / out of shared kernel buffers
– Barrelfish: point-to-point URPC channel

                                     Barrelfish    Linux
Throughput (Mbit/s)                        2154     1823
D-cache misses per packet                    21       77
Source->Sink HT bytes per packet           1868     2628
Sink->Source HT bytes per packet            752     2200
Source->Sink HT link utilization             8%      11%
Sink->Source HT link utilization             3%       9%

slide-34
SLIDE 34

Case study: TLB shoot-down

  • Send a message to every core with a mapping
  • Wait for acks
  • Linux/Windows:

– Send IPI
– Spin on shared ack count

  • Barrelfish:

– Request to local monitor domain
– 1-phase commit to remote cores
– Plug in different communication mechanisms
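
A compressed C sketch of the Barrelfish-style shoot-down just described, with placeholder channel functions: the initiator (via its monitor) sends an invalidate message to every core that holds the mapping and then collects acks, overlapping the wait with other work.

    /* TLB shoot-down as a 1-phase commit over message channels. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct channel channel_t;
    struct invalidate { uint64_t vaddr; };

    void chan_send_invalidate(channel_t *c, const struct invalidate *inv);
    bool chan_try_recv_ack(channel_t *c);
    void do_other_work(void);

    void tlb_shootdown(channel_t *peers[], int npeers, uint64_t vaddr)
    {
        struct invalidate inv = { .vaddr = vaddr };

        /* Send to every core that may hold the mapping. */
        for (int i = 0; i < npeers; i++)
            chan_send_invalidate(peers[i], &inv);

        /* Collect acks; stay split-phase rather than spinning uselessly. */
        bool done[npeers];
        for (int i = 0; i < npeers; i++)
            done[i] = false;
        int acked = 0;
        while (acked < npeers) {
            for (int i = 0; i < npeers; i++)
                if (!done[i] && chan_try_recv_ack(peers[i])) {
                    done[i] = true;
                    acked++;
                }
            do_other_work();
        }
    }
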

slide-35
SLIDE 35

TLB shoot-down: n*unicast

[Diagram: one cache-line channel per core; the initiator writes each line, each core reads its own]

slide-36
SLIDE 36


TLB shoot-down: 1*broadcast

slide-37
SLIDE 37

Messaging costs

slide-38
SLIDE 38

TLB shoot-down: multicast

[Diagram: cores grouped by package (shared L3)]

slide-39
SLIDE 39

TLB shoot-down: NUMA-aware multicast

[Diagram: cores grouped by package (shared L3); some groups are reached over more HyperTransport hops]

slide-40
SLIDE 40

Messaging costs

slide-41
SLIDE 41

End-to-end comparative latency

slide-42
SLIDE 42

2-PC (two-phase commit) pipelining

slide-43
SLIDE 43

  • Introduction
  • Hardware and workloads
  • Multikernel design principles
  • Communication costs
  • Starting a domain

slide-44
SLIDE 44

Terminology

  • Domain

– Protection domain/address space (“process”)

  • Dispatcher

– One per domain per core
– Scheduled by local CPU driver
– Invokes upcall, which then typically runs a core-local user-level thread scheduler

  • Domain spanning

– Start instances of a domain on multiple cores
– cf. starting affinitized threads
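
As a sketch of the dispatcher/upcall relationship (the struct and functions below are illustrative, not the real Barrelfish definitions): when the CPU driver gives the domain the core, it upcalls a fixed entry point in the dispatcher, which then runs its own user-level thread scheduler.

    /* Dispatcher upcall running a core-local user-level scheduler. */
    #include <stdbool.h>

    struct thread;                           /* user-level thread state      */

    struct dispatcher {
        bool           disabled;             /* true while inside the upcall */
        struct thread *runnable;             /* this core's run queue        */
    };

    struct thread *pick_next(struct dispatcher *d);   /* user-level policy   */
    void resume(struct thread *t);                    /* restore registers   */

    /* Entry point the CPU driver jumps to when this domain gets the core. */
    void dispatcher_run_upcall(struct dispatcher *d)
    {
        d->disabled = true;                  /* no nested upcalls while scheduling */
        struct thread *next = pick_next(d);
        d->disabled = false;
        resume(next);                        /* does not return */
    }
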
slide-45
SLIDE 45

Programming example: domain spanning

for i = 1..num_cores-1:
    create a new dispatcher on core i
while (num_dispatchers < num_cores-1):
    wait for the next message and handle it

dispatcher_create_callback:
    num_dispatchers++
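
The same loop in C, assuming a hypothetical asynchronous interface (dispatcher_create_on_core, event_dispatch) standing in for the real library calls: creation requests are fired off for every other core, and the caller then handles messages until all completion callbacks have arrived.

    /* C rendering of the spanning loop above (interface names are assumed). */
    void dispatcher_create_on_core(int core, void (*done)(int core));
    void event_dispatch(void);        /* wait for the next message, handle it */

    static int num_dispatchers = 0;

    static void dispatcher_create_callback(int core)
    {
        num_dispatchers++;            /* one more core has joined the domain */
    }

    void span_domain(int num_cores)
    {
        for (int i = 1; i < num_cores; i++)
            dispatcher_create_on_core(i, dispatcher_create_callback);

        while (num_dispatchers < num_cores - 1)
            event_dispatch();
    }
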

slide-46
SLIDE 46

Domain spanning: baseline

  • Centralized:

– Poor scalability, but correct

  • 1021 messages, 487 alloc. RPCs
  • 50 million cycles (40ms)

[Execution trace: monitor working / blocked / polling / bzero, spantest.exe, name service, memory server]

slide-47
SLIDE 47

Domain spanning: v2

  • Per-core memory servers
  • Better memset(!)

  • Was 50M cycles, now 9M


slide-48
SLIDE 48

Domain spanning: v3

  • Monitors use per-core mem. server
  • Move zeroing off the critical path

  • Was 9M cycles, now 4M


slide-49
SLIDE 49

Domain spanning: v4

  • Change the API
  • Create domains on all cores at once
  • 76 messages

  • Was 4M cycles, now 2.5M

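
A sketch of what the v4 API change might look like in C (names such as core_set_t and span_domain_on_cores are invented for illustration): instead of one create request per core, a single request carries the whole set of target cores and the monitors fan it out, which is what cuts the message count so sharply.

    /* Batched spanning API: one request for a whole set of cores. */
    #include <stdint.h>

    typedef uint64_t core_set_t;             /* bitmask of target cores (< 64) */

    /* Creates dispatchers on every core in the set; a single callback fires
     * once the whole operation has completed. */
    int span_domain_on_cores(core_set_t cores, void (*done)(void));

    static void span_done(void) { /* all dispatchers are up */ }

    int span_all_but_core0(int num_cores)
    {
        core_set_t all = (num_cores >= 64) ? ~0ULL
                                           : (((core_set_t)1 << num_cores) - 1);
        return span_domain_on_cores(all & ~1ULL, span_done);   /* skip core 0 */
    }
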

slide-50
SLIDE 50

  • Introduction
  • Hardware and workloads
  • Multikernel design principles
  • Communication costs
  • Starting a domain

slide-51
SLIDE 51

Current activity

  • Ports to other platforms

– ARM (32-bit), ongoing
– BEE3 FPGA platform

  • Better tracing infrastructure
  • Parallel file system
  • Exploration of 1-machine distributed algorithms
  • Programming model
  • Papers and source code

– http://www.barrelfish.org