The Multikernel: A new OS architecture for scalable multicore systems (PowerPoint presentation)


SLIDE 1

The Multikernel

A new OS architecture for scalable multicore systems

Andrew Baumann¹, Paul Barham², Pierre-Evariste Dagand³, Tim Harris², Rebecca Isaacs², Simon Peter¹, Timothy Roscoe¹, Adrian Schüpbach¹, Akhilesh Singhania¹

¹Systems Group, ETH Zurich  ²Microsoft Research, Cambridge  ³ENS Cachan Bretagne

Systems Group | Department of Computer Science | ETH Zurich
SOSP, 12th October 2009

SLIDE 2

Introduction

How should we structure an OS for future multicore systems?

◮ Scalability to many cores
◮ Heterogeneity and hardware diversity

12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 2

SLIDE 3

System diversity

[Diagram: microarchitecture block diagrams of three systems: the Sun Niagara T2 (cores C0–C7 with per-core FPU/SPU, banked L2 caches, a full crossbar, memory control units, and FB-DIMM channels), the AMD Opteron (Istanbul), and the Intel Nehalem (Beckton)]

SLIDE 4

The interconnect matters

Today’s 8-socket Opteron

SLIDE 5

The interconnect matters

Tomorrow’s 8-socket Nehalem

SLIDE 6

The interconnect matters

On-chip interconnects

[Diagrams: two on-chip interconnect examples. One: a many-core design with in-order multi-threaded wide-SIMD cores, per-core I$/D$, coherent L2 cache slices, memory controllers, texture logic, fixed-function units, and display/system interfaces. The other: a tiled design (Tile64-style) with per-tile processor, L1/L2 caches, register file, cache switch, 2D DMA, and mesh networks (MDN, TDN, UDN, IDN, STN), plus DDR2 controllers, PCIe, XAUI, GbE MACs/PHYs, SerDes, and UART/HPI/I2C/JTAG/SPI I/O]

SLIDE 7

Core diversity

◮ Within a system:

◮ Programmable NICs
◮ GPUs
◮ FPGAs (in CPU sockets)

◮ On a single die:

◮ Performance asymmetry
◮ Streaming instructions (SIMD, SSE, etc.)
◮ Virtualisation support

SLIDE 8

Summary

◮ Increasing core counts, increasing diversity
◮ Unlike HPC systems, cannot optimise at design time

SLIDE 9

The multikernel model

◮ It’s time to rethink the default structure of an OS

◮ Shared-memory kernel on every core
◮ Data structures protected by locks
◮ Anything else is a device

SLIDE 10

The multikernel model

◮ It’s time to rethink the default structure of an OS

◮ Shared-memory kernel on every core
◮ Data structures protected by locks
◮ Anything else is a device

◮ Proposal: structure the OS as a distributed system
◮ Design principles:
  1. Make inter-core communication explicit
  2. Make OS structure hardware-neutral
  3. View state as replicated

SLIDE 11

Outline

Introduction
  Motivation
  Hardware diversity
The multikernel model
  Design principles
  The model
Barrelfish
Evaluation
  Case study: Unmap

SLIDE 12
1. Make inter-core communication explicit

◮ All communication with messages (no shared state)

SLIDE 13
1. Make inter-core communication explicit

◮ All communication with messages (no shared state)
◮ Decouples system structure from inter-core communication mechanism
◮ Communication patterns explicitly expressed
◮ Naturally supports heterogeneous cores, non-coherent interconnects (PCIe)
◮ Better match for future hardware
  ◮ ...with cheap explicit message passing (e.g. Tile64)
  ◮ ...without cache-coherence (e.g. Intel 80-core)
◮ Allows split-phase operations
  ◮ Decouple requests and responses for concurrency
◮ We can reason about it

SLIDE 14

Message passing vs. shared memory: experiment

Shared memory (move the data to the operation):

◮ Each core updates the same memory locations (no locking)
◮ Cache-coherence protocol migrates modified cache lines
  ◮ Processor stalled while line is fetched or invalidated
  ◮ Limited by latency of interconnect round-trips
  ◮ Performance depends on data size (cache lines) and contention (number of cores)
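The shared-memory side of the experiment can be sketched in C with pthreads. This is an illustrative reconstruction, not the authors' benchmark: the buffer size, iteration count, and the `run_shared` helper are invented for the sketch, and the timing/stall measurement is omitted.

```c
#include <pthread.h>

#define CACHE_LINE 64
#define ITERS 100000L

/* Shared buffer: every core writes the same cache lines, so the
 * cache-coherence protocol must migrate each modified line between
 * cores, stalling writers while lines are fetched or invalidated. */
static struct { char lines[8][CACHE_LINE]; } shared_buf;

struct warg { int nlines; long writes; };

static void *writer(void *p) {
    struct warg *a = p;
    for (long i = 0; i < ITERS; i++)
        for (int l = 0; l < a->nlines; l++)
            shared_buf.lines[l][0] = (char)i;  /* dirties the line */
    a->writes = ITERS * a->nlines;
    return NULL;
}

/* Run `ncores` (<= 16) writer threads over `nlines` shared cache
 * lines; returns the total number of writes issued. */
long run_shared(int ncores, int nlines) {
    pthread_t t[16];
    struct warg a[16];
    long total = 0;
    for (int i = 0; i < ncores; i++) {
        a[i].nlines = nlines;
        pthread_create(&t[i], NULL, writer, &a[i]);
    }
    for (int i = 0; i < ncores; i++) {
        pthread_join(t[i], NULL);
        total += a[i].writes;
    }
    return total;
}
```

In the real experiment one would time `run_shared` with the cycle counter; here only the contended access pattern is shown.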

SLIDE 15

Shared memory results

[Graph: 4×4-core AMD system; latency (cycles × 1000) vs. number of cores; series: SHM1]

SLIDE 16

Shared memory results

[Graph: 4×4-core AMD system; latency (cycles × 1000) vs. number of cores; series: SHM1, SHM2]

SLIDE 17

Shared memory results

[Graph: 4×4-core AMD system; latency (cycles × 1000) vs. number of cores; series: SHM1, SHM2, SHM4]

SLIDE 18

Shared memory results

[Graph: 4×4-core AMD system; latency (cycles × 1000) vs. number of cores; series: SHM1, SHM2, SHM4, SHM8]

Stalled cycles (no locking!)

SLIDE 19

Message passing vs. shared memory: experiment

Message passing (move the operation to the data):

◮ A single server core updates the memory locations
◮ Each client core sends RPCs to the server
  ◮ Operation and results described in a single cache line
  ◮ Block while waiting for a response (in this experiment)
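The message-passing variant can be illustrated by packing the operation and its result into one 64-byte cache line. This is a hypothetical single-threaded sketch: the `rpc_msg` layout, `OP_ADD`, and the helper names are invented, and on real hardware the line would travel to a separate server core rather than being handled inline.

```c
#include <stdint.h>

/* Hypothetical RPC message: operation and result share one 64-byte
 * cache line, so a round trip moves a single line in each direction. */
struct __attribute__((aligned(64))) rpc_msg {
    uint64_t op;             /* operation code */
    uint64_t args[6];        /* arguments; args[0] doubles as the result */
    volatile uint64_t done;  /* server sets this when the reply is ready */
};

enum { OP_ADD = 1 };

/* Server core: apply the operation to its local data and publish the
 * result back in the same line. */
static void server_handle(struct rpc_msg *m) {
    if (m->op == OP_ADD)
        m->args[0] += m->args[1];
    m->done = 1;
}

/* Client core: fill in the line, hand it over, spin until done. */
static uint64_t client_call(struct rpc_msg *m, uint64_t a, uint64_t b) {
    m->op = OP_ADD; m->args[0] = a; m->args[1] = b; m->done = 0;
    server_handle(m);   /* in reality: message sent to the server core */
    while (!m->done)    /* block while waiting (as in the experiment) */
        ;
    return m->args[0];
}

uint64_t demo(void) {
    static struct rpc_msg m;
    return client_call(&m, 2, 3);
}
```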

SLIDE 20

Message passing vs. shared memory: tradeoff

[Graph: 4×4-core AMD system; latency (cycles × 1000) vs. number of cores; series: SHM1–SHM8, MSG1]

SLIDE 21

Message passing vs. shared memory: tradeoff

[Graph: 4×4-core AMD system; latency (cycles × 1000) vs. number of cores; series: SHM1–SHM8, MSG1, MSG8]

SLIDE 22

Message passing vs. shared memory: tradeoff

[Graph: 4×4-core AMD system; latency (cycles × 1000) vs. number of cores; series: SHM1–SHM8, MSG1, MSG8]

Messaging faster for: ≥4 cores ≥4 cache lines

SLIDE 23

Message passing vs. shared memory: tradeoff

[Graph: 4×4-core AMD system; latency (cycles × 1000) vs. number of cores; series: SHM1–SHM8, MSG1, MSG8, and the cost at the server]

Actual cost of update at server

SLIDE 24

Message passing vs. shared memory: tradeoff

[Graph: 4×4-core AMD system; latency (cycles × 1000) vs. number of cores; series: SHM1–SHM8, MSG1, MSG8, and the cost at the server]

Actual cost of update at server “spare” cycles if RPC was split-phase

SLIDE 25
2. Make OS structure hardware-neutral

◮ Separate OS structure from hardware
◮ Only hardware-specific parts:
  ◮ Message transports (highly optimised / specialised)
  ◮ CPU / device drivers

SLIDE 26
2. Make OS structure hardware-neutral

◮ Separate OS structure from hardware
◮ Only hardware-specific parts:
  ◮ Message transports (highly optimised / specialised)
  ◮ CPU / device drivers

◮ Adaptability to changing performance characteristics
◮ Late-bind protocol and message transport implementations

SLIDE 27
3. View state as replicated

◮ Potentially-shared state accessed as if it were a local replica

◮ Scheduler queues, process control blocks, etc.

SLIDE 28
3. View state as replicated

◮ Potentially-shared state accessed as if it were a local replica

◮ Scheduler queues, process control blocks, etc.

◮ Required by message-passing model
◮ Naturally supports domains that do not share memory
◮ Naturally supports changes to the set of running cores
  ◮ Hotplug, power management

SLIDE 29

Replication vs. sharing as default

◮ Replicas used as an optimisation in previous systems:

  Tornado, K42: clustered objects
  Linux: read-only data, kernel text

SLIDE 30

Replication vs. sharing as default

◮ Replicas used as an optimisation in previous systems:

  Tornado, K42: clustered objects
  Linux: read-only data, kernel text

◮ In a multikernel, sharing is a local optimisation

◮ Shared (locked) replica for threads or closely-coupled cores
◮ Hidden, local
◮ Only when faster, as decided at runtime
◮ Basic model remains split-phase

SLIDE 31

The multikernel model

SLIDE 32

Outline

Introduction
  Motivation
  Hardware diversity
The multikernel model
  Design principles
  The model
Barrelfish
Evaluation
  Case study: Unmap

SLIDE 33

Barrelfish

◮ From-scratch implementation of a multikernel
◮ Supports x86-64 multiprocessors (ARM soon)
◮ Open source (BSD licensed)

SLIDE 34

Barrelfish structure

Monitors and CPU drivers

◮ CPU driver serially handles traps and exceptions
◮ Monitor mediates local operations on global state
◮ URPC inter-core (shared memory) message transport on current (cache-coherent) x86 HW

SLIDE 35

Non-original ideas in Barrelfish

Multiprocessor techniques:

◮ Minimise shared state (Tornado, K42, Corey)
◮ User-space messaging decoupled from IPIs (URPC)
◮ Single-threaded non-preemptive kernel per core (K42)

Other ideas we liked:

◮ Capabilities for all resource management (seL4)
◮ Upcall processor dispatch (Psyche, Sched. Activations, K42)
◮ Push policy into application domains (Exokernel, Nemesis)
◮ Lots of information (Infokernel)
◮ Run drivers in their own domains (µkernels)
◮ EDF as per-core CPU scheduler (RBED)
◮ Specify device registers in a little language (Devil)

SLIDE 36

Applications running on Barrelfish

◮ Slide viewer (this one!)
◮ Webserver (www.barrelfish.org)
◮ Virtual machine monitor (runs unmodified Linux)
◮ SPLASH-2, OpenMP (benchmarks)
◮ SQLite
◮ ECLiPSe (constraint engine)
◮ more...

SLIDE 37

Outline

Introduction
  Motivation
  Hardware diversity
The multikernel model
  Design principles
  The model
Barrelfish
Evaluation
  Case study: Unmap

SLIDE 38

Evaluation goals

How do we evaluate an alternative OS structure?

◮ Good baseline performance
  ◮ Comparable to existing systems on current hardware
◮ Scalability with cores
◮ Adaptability to different hardware
◮ Ability to exploit message-passing for performance

SLIDE 39

Case study: Unmap (TLB shootdown)

◮ Send a message to every core with a mapping, wait for all to be acknowledged

◮ Linux/Windows:
  1. Kernel sends IPIs
  2. Spins on shared acknowledgement count/event

◮ Barrelfish:
  1. User request to local monitor domain
  2. Single-phase commit to remote cores

◮ How to implement communication?
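The Barrelfish path can be caricatured as a single-phase commit: send an unmap message to every core that holds the mapping, then collect acknowledgements. The sketch below simulates the remote cores inline; `struct core` and `unmap_commit` are invented names, and real code would poll URPC channels rather than flags.

```c
#include <stdbool.h>

#define MAX_CORES 32

/* Per-core state stands in for a URPC channel to that core's monitor. */
struct core {
    bool has_mapping;  /* does this core map the page? */
    bool tlb_flushed;
    bool acked;
};

/* Single-phase commit: send an unmap message to every core that has
 * the mapping, then check that all of them acknowledged.  The remote
 * work is simulated inline; real code would poll message channels. */
int unmap_commit(struct core cores[], int ncores) {
    int sent = 0;
    for (int i = 0; i < ncores; i++) {       /* send phase */
        if (!cores[i].has_mapping) continue;
        cores[i].tlb_flushed = true;         /* remote TLB invalidation */
        cores[i].acked = true;               /* remote acknowledgement */
        sent++;
    }
    for (int i = 0; i < ncores; i++)         /* wait for all acks */
        if (cores[i].has_mapping && !cores[i].acked)
            return -1;                       /* would keep polling here */
    return sent;
}

int demo_unmap(void) {
    struct core c[4] = {
        { true,  false, false },
        { false, false, false },
        { true,  false, false },
        { true,  false, false },
    };
    return unmap_commit(c, 4);  /* three cores hold the mapping */
}
```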

SLIDE 40

Unmap communication protocols

Unicast

SLIDE 41

Unmap communication protocols

Unicast Broadcast

SLIDE 42

Unmap communication protocols

Raw messaging cost. [Graph: latency (cycles × 1000) vs. number of cores (2–32); series: Broadcast, Unicast]

SLIDE 43

Why use multicast

8×4-core AMD system


SLIDE 45

Multicast communication

SLIDE 46

Multicast communication

◮ “NUMA-aware” multicast

SLIDE 47

Unmap communication protocols

Raw messaging cost. [Graph: latency (cycles × 1000) vs. number of cores (2–32); series: Broadcast, Unicast, Multicast, NUMA-Aware Multicast]

SLIDE 48

System knowledge base

◮ Constructing multicast tree requires hardware knowledge
  ◮ Mapping of cores to sockets (CPUID data)
  ◮ Messaging latency (online measurements)
◮ More generally, Barrelfish needs a way to reason about diverse system resources

SLIDE 49

System knowledge base

◮ Constructing multicast tree requires hardware knowledge
  ◮ Mapping of cores to sockets (CPUID data)
  ◮ Messaging latency (online measurements)
◮ More generally, Barrelfish needs a way to reason about diverse system resources
◮ We tackle this with constraint logic programming [Schüpbach et al., MMCS’08]
◮ System knowledge base stores rich, detailed representation of hardware, performs online reasoning
◮ Initial implementation: port of the ECLiPSe constraint solver
◮ Prolog query used to construct multicast routing tree
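The effect of such a query can be illustrated with a toy C routine that groups cores by socket and picks one aggregator core per socket for the first hop of a NUMA-aware multicast. The `core_socket[]` mapping and all names here are hypothetical; Barrelfish derives the real topology from CPUID data via the system knowledge base.

```c
#define NSOCKETS 8

/* Pick one "aggregator" core per socket: the root sends one message
 * per socket, and each aggregator forwards to its socket-local
 * siblings over the cheap intra-socket path.  Returns the number of
 * inter-socket messages the root must send; aggregator[s] gets the
 * chosen core for socket s, or -1 if the socket is unused. */
int build_multicast(const int core_socket[], int ncores, int aggregator[]) {
    for (int s = 0; s < NSOCKETS; s++)
        aggregator[s] = -1;
    int msgs = 0;
    for (int c = 0; c < ncores; c++) {
        int s = core_socket[c];
        if (aggregator[s] == -1) {  /* first core seen on this socket */
            aggregator[s] = c;
            msgs++;
        }
    }
    return msgs;
}

int demo_tree(void) {
    /* 8 cores on 4 sockets, two per socket (hypothetical topology) */
    int core_socket[8] = { 0, 0, 1, 1, 2, 2, 3, 3 };
    int aggregator[NSOCKETS];
    return build_multicast(core_socket, 8, aggregator);
}
```

The real tree also orders sockets by measured messaging latency, which this sketch ignores.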

SLIDE 50

Unmap latency

[Graph: Unmap latency (cycles × 1000) vs. number of cores (2–32); series: Windows, Linux, Barrelfish]

SLIDE 51

Summary of other results

◮ No penalty for shared-memory (SPLASH, OpenMP)
◮ Network throughput: 951.7 Mbit/s (same as Linux)
◮ Pipelined web server
  ◮ Static: 640 Mbit/s vs. 316 Mbit/s for lighttpd/Linux
  ◮ Dynamic: 3417 requests/s (17.1 Mbit/s), bottlenecked on SQL

SLIDE 52

Conclusion

◮ Modern computers are inherently distributed systems
◮ It’s time to rethink OS structure to match
◮ The Multikernel: model of the OS as a distributed system
  1. Explicit communication, replicated state
  2. Hardware-neutral OS structure

SLIDE 53

Conclusion

◮ Modern computers are inherently distributed systems
◮ It’s time to rethink OS structure to match
◮ The Multikernel: model of the OS as a distributed system
  1. Explicit communication, replicated state
  2. Hardware-neutral OS structure

◮ Barrelfish: our concrete implementation
  ◮ Reasonable performance on current hardware
  ◮ Better scalability/adaptability for future hardware
◮ Promising approach

www.barrelfish.org

SLIDE 54

Backup slides

SLIDE 55

URPC implementation

◮ Current hardware provides one communication mechanism: cache-coherent shared memory
◮ Can we “trick” the cache-coherence protocol into sending messages?
◮ User-level RPC (URPC) [Bershad et al., 1991]
  ◮ Channel is shared ring buffer
  ◮ Messages are cache-line sized
  ◮ Sender writes message into next line
  ◮ Receiver polls on last word
  ◮ Marshalling/demarshalling, naming, binding all implemented above
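A minimal sketch of such a channel in C is below. The names and layout are illustrative, not Barrelfish's actual URPC code, and a real cross-core implementation would additionally need the appropriate store ordering/barriers; the key idea shown is that each message is one cache line whose last word, written last, signals arrival.

```c
#include <stdint.h>
#include <string.h>

#define RING_SLOTS 16

/* One message per cache line; the last word is written last and acts
 * as the "message present" flag, so the receiver polls a single word. */
struct __attribute__((aligned(64))) urpc_msg {
    uint64_t payload[7];
    volatile uint64_t epoch;  /* receiver polls this word */
};

struct urpc_chan {
    struct urpc_msg ring[RING_SLOTS];
    unsigned send_pos, recv_pos;
    uint64_t send_epoch, recv_epoch;  /* bumped each time the ring wraps */
};

void urpc_init(struct urpc_chan *c) {
    memset(c, 0, sizeof *c);
    c->send_epoch = c->recv_epoch = 1;
}

/* Sender: write the payload words first, then the epoch word to
 * publish the message (a real version needs a store barrier here). */
void urpc_send(struct urpc_chan *c, const uint64_t payload[7]) {
    struct urpc_msg *m = &c->ring[c->send_pos];
    memcpy(m->payload, payload, sizeof m->payload);
    m->epoch = c->send_epoch;  /* publish: must be the last store */
    if (++c->send_pos == RING_SLOTS) { c->send_pos = 0; c->send_epoch++; }
}

/* Receiver: poll the last word; the line stays local to the receiver
 * until the sender's write actually arrives. */
int urpc_try_recv(struct urpc_chan *c, uint64_t out[7]) {
    struct urpc_msg *m = &c->ring[c->recv_pos];
    if (m->epoch != c->recv_epoch)
        return 0;  /* no message yet */
    memcpy(out, m->payload, sizeof m->payload);
    if (++c->recv_pos == RING_SLOTS) { c->recv_pos = 0; c->recv_epoch++; }
    return 1;
}

uint64_t demo_urpc(void) {
    static struct urpc_chan c;
    uint64_t msg[7] = { 42 }, out[7] = { 0 };
    urpc_init(&c);
    urpc_send(&c, msg);
    return urpc_try_recv(&c, out) ? out[0] : 0;
}
```

The epoch counter distinguishes a fresh message from a stale one after the ring wraps, so slots never need to be cleared by the receiver.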

SLIDE 56

Polling for receive

Tradeoff vs. IPIs

◮ Polling is cheap: line is local to receiver until message arrives
◮ Hardware-imposed costs for IPI (on 4×4-core AMD):
  ◮ ≈800 cycles to send (from user-mode)
  ◮ ≈1200 cycles lost in receive (to user-mode)

SLIDE 57

Polling for receive

Tradeoff vs. IPIs

◮ Polling is cheap: line is local to receiver until message arrives
◮ Hardware-imposed costs for IPI (on 4×4-core AMD):
  ◮ ≈800 cycles to send (from user-mode)
  ◮ ≈1200 cycles lost in receive (to user-mode)

◮ There is a tradeoff here!
◮ IPIs are decoupled from fast-path messaging, used only for:
  1. Specific (batches of) operations that require low latency, even when other tasks are executing
  2. Awakening cores that have blocked to save power (alternatively, MONITOR/MWAIT)
