The Barrelfish operating system for CMPs: research issues


slide-1
SLIDE 1

The Barrelfish operating system for CMPs: research issues

Tim Harris

Based on slides by Andrew Baumann and Rebecca Isaacs. Joint work with colleagues at MSR Cambridge and ETH Zurich.

slide-2
SLIDE 2

The Barrelfish project

  • Collaboration between ETH Zurich and Microsoft Research Cambridge (MSRC)

Andrew Baumann, Paul Barham, Richard Black, Tim Harris, Orion Hodson, Rebecca Isaacs, Simon Peter, Jan Rellermeyer, Timothy Roscoe, Adrian Schüpbach, Akhilesh Singhania, Pierre-Evariste Dagand, Ankush Gupta, Raffaele Sandrini, Dario Simone, Animesh Trivedi

slide-3
SLIDE 3

  • Introduction
  • Hardware and workloads
  • Multikernel design principles
  • Communication costs
  • Starting a domain

slide-4
SLIDE 4

Do we need a new OS?

Sun SPARC Enterprise M9000 server

  • M9000-64: up to 64 CPUs, 256 cores
  • 180cm x 167.4cm x 126cm
  • 1880kg

slide-5
SLIDE 5

Do we need a new OS?

SGI Origin 3000

  • Up to 512 processors
  • Up to 1TB memory

slide-6
SLIDE 6

Do we need a new OS?

  • How might the design of a CMP differ from these existing systems?
  • How might the workloads for a CMP differ from those of existing multi-processor machines?

slide-7
SLIDE 7

The clichéd single-threaded perf graph

[Graph: log(sequential perf) vs. year]

  • Historical 1-thread perf gains came via improved clock rate and transistors used to extract ILP
  • #transistors still growing, but delivered as additional cores and accelerators
  • The things that would have used this “lost” perf must now be written to use cores/accelerators

slide-8
SLIDE 8

Interactive perf

[Timeline: repeated cycles of user input followed by output]

slide-9
SLIDE 9

CC-NUMA architecture

[Diagram: nodes of CPUs + RAM (with directory) attached to the interconnect]

  • Adding more CPUs brings more of most other things
  • Locality property: only go to the interconnect for real I/O or sharing

slide-10
SLIDE 10

Machine architecture

[Diagram: Core1-Core4, pairs of cores sharing an L2 cache, one connection to RAM]

  • More cores bring more cycles
  • ...not necessarily proportionately more cache
  • ...nor more off-chip b/w or total RAM capacity

slide-11
SLIDE 11

Machine diversity: AMD 4-core

slide-12
SLIDE 12

...Sun Niagara-2

slide-13
SLIDE 13

...Sun Rock

slide-14
SLIDE 14

IEEE INTERNATIONAL SOLID-STATE CIRCUITS CONFERENCE 2010

  • J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl, D. Jenkins, H. Wilson, N. Borkar, G. Schrom, F. Pailet, S. Jain, T. Jacob, S. Yada, S. Marella, P. Salihundam, V. Erraguntla, M. Konow, M. Riepen, G. Droege, J. Lindemann, M. Gries, T. Apel, K. Henriss, T. Lund-Larsen, S. Steibl, S. Borkar, V. De, R. Van Der Wijngaart, T. Mattson (Intel: Hillsboro, OR; Bangalore, India; Braunschweig, Germany; Santa Clara, CA; DuPont, WA)

A 567mm² processor on 45nm CMOS integrates 48 IA-32 cores and 4 DDR3 channels in a 6×4 2D-mesh network. Cores communicate through message passing using 384KB of on-die shared memory. Fine-grain power management takes advantage of 8 voltage and 28 frequency islands to allow independent DVFS of cores and mesh. As performance scales, the processor dissipates between 25W and 125W.

slide-15
SLIDE 15

  • Introduction
  • Hardware and workloads
  • Multikernel design principles
  • Communication costs
  • Starting a domain

slide-16
SLIDE 16

The multikernel model

[Diagram: heterogeneous cores (x86, x64, ARM, GPU), each running an OS node with its own state replica; OS nodes exchange async messages over the hardware interconnect, and applications run across the nodes]

slide-17
SLIDE 17

Barrelfish: a multikernel OS

  • A new OS architecture for scalable multicore systems
  • Approach: structure the OS as a distributed system
  • Design principles:

– Make inter-core communication explicit
– Make OS structure hardware-neutral
– View state as replicated

slide-18
SLIDE 18

#1 Explicit inter-core communication

  • All communication with messages
  • Decouples system structure from inter-core communication mechanism
  • Communication patterns explicitly expressed
  • Better match for future hardware

– Naturally supports heterogeneous cores, non-coherent interconnects (PCIe)
– ...with cheap explicit message passing
– ...without cache coherence (e.g. Intel 80-core)

  • Allows split-phase operations
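
To make “split-phase” concrete, here is a minimal C sketch of the pattern, assuming hypothetical channel primitives (channel_t, chan_send, chan_try_recv) rather than Barrelfish's real messaging API: the requesting core issues a message, carries on with other work, and picks up the reply later.

    /* Split-phase operation over asynchronous messages (illustrative only;
     * channel_t, chan_send and chan_try_recv are assumed placeholders). */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { uint64_t op, arg, result; } msg_t;
    typedef struct channel channel_t;

    bool chan_send(channel_t *c, const msg_t *m);   /* non-blocking send */
    bool chan_try_recv(channel_t *c, msg_t *m);     /* non-blocking poll */
    void do_other_work(void);

    /* Phase 1: issue the request and return immediately. */
    void request_op(channel_t *c, uint64_t arg)
    {
        msg_t m = { .op = 1, .arg = arg };
        while (!chan_send(c, &m))
            do_other_work();            /* channel full: keep making progress */
    }

    /* Phase 2: poll for the reply whenever it is convenient. */
    bool poll_reply(channel_t *c, uint64_t *result)
    {
        msg_t m;
        if (!chan_try_recv(c, &m))
            return false;               /* no reply yet */
        *result = m.result;
        return true;
    }
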
slide-19
SLIDE 19

Communication latency

slide-20
SLIDE 20

Communication latency

slide-21
SLIDE 21

Message passing vs shared memory

  • Shared memory (move the data to the operation):

– Each core updates the same memory locations
– Cache coherence migrates modified cache lines
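
For contrast, a small self-contained C sketch of this shared-memory pattern (the struct layout is made up for illustration): every core updates the same locations, so the coherence protocol keeps migrating the modified lines between caches.

    /* Shared memory: move the data to the operation.  All cores write the
     * same lock word and counters, so those cache lines bounce between cores. */
    #include <pthread.h>
    #include <stdint.h>

    struct shared_state {
        pthread_mutex_t lock;
        uint64_t        counters[8];       /* updated by every core */
    };

    static struct shared_state state = { .lock = PTHREAD_MUTEX_INITIALIZER };

    void update(int idx)
    {
        pthread_mutex_lock(&state.lock);   /* lock line migrates to this core */
        state.counters[idx]++;             /* so does the modified data line  */
        pthread_mutex_unlock(&state.lock);
    }
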

slide-22
SLIDE 22

Shared memory scaling & latency

slide-23
SLIDE 23

Message passing

  • Message passing (move operation to the data):

– A single server core updates the memory locations
– Each client core sends RPCs to the server
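
A matching C sketch of the message-passing version (again with placeholder channel primitives): a single server core owns the data and applies every update, so the data lines stay in its cache and clients only exchange small request/reply messages.

    /* Message passing: move the operation to the data. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { uint64_t op, idx, reply; } msg_t;
    typedef struct channel channel_t;

    bool chan_try_recv(channel_t *c, msg_t *m);       /* non-blocking poll */
    void chan_reply(channel_t *c, const msg_t *m);

    void server_loop(channel_t *clients[], int nclients, uint64_t counters[])
    {
        for (;;) {
            for (int i = 0; i < nclients; i++) {
                msg_t m;
                if (!chan_try_recv(clients[i], &m))
                    continue;
                counters[m.idx]++;          /* data stays hot in this core's cache */
                m.reply = counters[m.idx];
                chan_reply(clients[i], &m);
            }
        }
    }
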

slide-24
SLIDE 24

Message passing

slide-25
SLIDE 25

Message passing

slide-26
SLIDE 26

#2 Hardware-neutral structure

  • Separate OS structure from hardware
  • Only hardware-specific parts:

– Message transports (highly optimised / specialised)
– CPU / device drivers

  • Adaptability to changing performance characteristics

– Late-bind protocol and message transport implementations
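
One way to read “late-bind the message transport” is an indirection like the following C sketch (the struct and transport names are assumptions, not the real Barrelfish interface): OS code is written against a transport-neutral table of operations, and the concrete implementation is chosen per channel at runtime.

    /* Hardware-neutral messaging: the transport is a table of operations
     * selected at channel-setup time. */
    #include <stdbool.h>
    #include <stddef.h>

    struct msg_transport {
        const char *name;
        bool (*send)(void *chan, const void *buf, size_t len);
        bool (*recv)(void *chan, void *buf, size_t len);
    };

    extern struct msg_transport urpc_shared_mem;   /* cache-coherent cores      */
    extern struct msg_transport pcie_transport;    /* non-coherent interconnect */

    const struct msg_transport *pick_transport(bool same_coherence_domain)
    {
        /* Late binding: decide per pair of cores, not at compile time. */
        return same_coherence_domain ? &urpc_shared_mem : &pcie_transport;
    }
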

slide-27
SLIDE 27

#3 Replicate common state

  • Potentially-shared state accessed as if it were a local replica

– Scheduler queues, process control blocks, etc.
– Required by the message-passing model

  • Naturally supports domains that do not share memory
  • Naturally supports changes to the set of running cores

– Hotplug, power management
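
A rough C sketch of what “state as replicas” means operationally (names and layout invented for illustration): reads never leave the core, while updates are applied to the local replica and then pushed to the other kernels as messages.

    /* Replicated state: local reads, message-based update propagation. */
    #include <stdint.h>

    #define MAX_CORES 64

    typedef struct channel channel_t;
    struct replica { uint64_t version; /* ... per-core copy of the state ... */ };

    void chan_send_update(channel_t *c, const struct replica *r);

    static struct replica local;                 /* this core's replica     */
    static channel_t *peer[MAX_CORES];           /* channels to other cores */

    uint64_t read_version(void)                  /* served entirely locally */
    {
        return local.version;
    }

    void update_state(int ncores, int mycore)
    {
        local.version++;                         /* apply locally first      */
        for (int i = 0; i < ncores; i++)         /* then notify the replicas */
            if (i != mycore)
                chan_send_update(peer[i], &local);
    }
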

slide-28
SLIDE 28

Replication vs sharing as the default

  • Replicas used as an optimisation in other systems
  • In a multikernel, sharing is a local optimisation

– Shared (locked) replica on closely-coupled cores
– Only when faster, as decided at runtime

  • Basic model remains split-phase
slide-29
SLIDE 29

  • Introduction
  • Hardware and workloads
  • Multikernel design principles
  • Communication costs
  • Starting a domain

slide-30
SLIDE 30

Applications running on Barrelfish

  • Slide viewer (but not today...)
  • Webserver (www.barrelfish.org)
  • Virtual machine monitor (runs unmodified Linux)
  • Parallel benchmarks:

– SPLASH-2 – OpenMP

  • SQLite
  • ECLiPSe (constraint engine)
  • more...
slide-31
SLIDE 31
1-way URPC message costs

                    Cycles    Msg / kcycle
2*4-core Intel
  Shared              180        11.97
  Non-shared          570         3.78
2*2-core AMD
  Same die            450         3.42
  1 hop               532         3.19
4*4-core AMD
  Shared              448         3.57
  1 hop               545         3.53
  2 hop               659         3.19
8*4-core AMD
  Shared              538         2.77
  1 hop               613         2.79
  2 hop               682         2.71

  • Two HyperTransport requests on AMD
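
The latencies above are for user-level RPC (URPC) channels in which each message is one cache line moved by the coherence protocol. The sketch below illustrates that general style of channel (sender fills a line and publishes it by writing a sequence word last, receiver polls); it is not the actual Barrelfish implementation and omits flow control.

    /* Cache-line-sized message channel, polled by the receiver. */
    #include <stdbool.h>
    #include <stdint.h>

    #define CACHE_LINE 64

    struct urpc_msg {
        volatile uint64_t payload[7];    /* 56 bytes of data             */
        volatile uint64_t seq;           /* written last: publishes msg  */
    } __attribute__((aligned(CACHE_LINE)));

    struct urpc_chan {
        struct urpc_msg *ring;           /* buffer shared with the peer  */
        uint64_t slots, pos, seq;        /* private to each endpoint     */
    };

    void urpc_send(struct urpc_chan *c, const uint64_t data[7])
    {
        struct urpc_msg *m = &c->ring[c->pos++ % c->slots];
        for (int i = 0; i < 7; i++)
            m->payload[i] = data[i];
        __sync_synchronize();            /* payload visible before seq   */
        m->seq = ++c->seq;
    }

    bool urpc_try_recv(struct urpc_chan *c, uint64_t data[7])
    {
        struct urpc_msg *m = &c->ring[c->pos % c->slots];
        if (m->seq != c->seq + 1)
            return false;                /* nothing new yet              */
        __sync_synchronize();
        for (int i = 0; i < 7; i++)
            data[i] = m->payload[i];
        c->pos++; c->seq++;
        return true;
    }
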

slide-32
SLIDE 32

Local vs remote messaging

  • URPC to a remote core compares favourably with IPC
  • No context switch: TLB unaffected
  • Lower cache impact
  • Higher throughput for pipelined messages

2*2-core AMD    Cycles   Msg / kcycle   I-cache lines used   D-cache lines used
URPC              450        3.42                9                    8
L4 IPC            424        2.36               25                   13

slide-33
SLIDE 33

Communication perf: IP loopback

  • 2*2-core AMD system, 1000-byte packets

– Linux: copy in / out of shared kernel buffers
– Barrelfish: point-to-point URPC channel

                                     Barrelfish    Linux
Throughput (Mbit/s)                        2154     1823
D-cache misses per packet                    21       77
Source->Sink HT bytes per packet           1868     2628
Sink->Source HT bytes per packet            752     2200
Source->Sink HT link utilization             8%      11%
Sink->Source HT link utilization             3%       9%

slide-34
SLIDE 34

Case study: TLB shoot-down

  • Send a message to every core with a mapping
  • Wait for acks
  • Linux/Windows:

– Send IPI
– Spin on shared ack count

  • Barrelfish:

– Request to local monitor domain
– 1-phase commit to remote cores
– Plug in different communication mechanisms
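
A compressed C sketch of the Barrelfish-style shoot-down just described, with placeholder channel functions: the initiator (via its monitor) sends an invalidate message to every core that holds the mapping and then collects acks, overlapping the wait with other work.

    /* TLB shoot-down as a 1-phase commit over message channels. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct channel channel_t;
    struct invalidate { uint64_t vaddr; };

    void chan_send_invalidate(channel_t *c, const struct invalidate *inv);
    bool chan_try_recv_ack(channel_t *c);
    void do_other_work(void);

    void tlb_shootdown(channel_t *peers[], int npeers, uint64_t vaddr)
    {
        struct invalidate inv = { .vaddr = vaddr };

        /* Send to every core that may hold the mapping. */
        for (int i = 0; i < npeers; i++)
            chan_send_invalidate(peers[i], &inv);

        /* Collect acks; stay split-phase rather than spinning uselessly. */
        bool done[npeers];
        for (int i = 0; i < npeers; i++)
            done[i] = false;
        int acked = 0;
        while (acked < npeers) {
            for (int i = 0; i < npeers; i++)
                if (!done[i] && chan_try_recv_ack(peers[i])) {
                    done[i] = true;
                    acked++;
                }
            do_other_work();
        }
    }
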

slide-35
SLIDE 35

TLB shoot-down: n*unicast

[Diagram: one cache-line channel per core; the initiator writes each line, each core reads its own]

slide-36
SLIDE 36


TLB shoot-down: 1*broadcast

slide-37
SLIDE 37

Messaging costs

slide-38
SLIDE 38

TLB shoot-down: multicast

[Diagram: cores grouped by package (shared L3)]

slide-39
SLIDE 39

TLB shoot-down: NUMA-aware multicast

[Diagram: cores grouped by package (shared L3); some groups are reached over more HyperTransport hops]

slide-40
SLIDE 40

Messaging costs

slide-41
SLIDE 41

End-to-end comparative latency

slide-42
SLIDE 42

2-PC (two-phase commit) pipelining

slide-43
SLIDE 43

  • Introduction
  • Hardware and workloads
  • Multikernel design principles
  • Communication costs
  • Starting a domain

slide-44
SLIDE 44

Terminology

  • Domain

– Protection domain/address space (“process”)

  • Dispatcher

– One per domain per core
– Scheduled by local CPU driver
– Invokes upcall, which then typically runs a core-local user-level thread scheduler

  • Domain spanning

– Start instances of a domain on multiple cores
– cf. starting affinitized threads
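
As a sketch of the dispatcher/upcall relationship (the struct and functions below are illustrative, not the real Barrelfish definitions): when the CPU driver gives the domain the core, it upcalls a fixed entry point in the dispatcher, which then runs its own user-level thread scheduler.

    /* Dispatcher upcall running a core-local user-level scheduler. */
    #include <stdbool.h>

    struct thread;                           /* user-level thread state      */

    struct dispatcher {
        bool           disabled;             /* true while inside the upcall */
        struct thread *runnable;             /* this core's run queue        */
    };

    struct thread *pick_next(struct dispatcher *d);   /* user-level policy   */
    void resume(struct thread *t);                    /* restore registers   */

    /* Entry point the CPU driver jumps to when this domain gets the core. */
    void dispatcher_run_upcall(struct dispatcher *d)
    {
        d->disabled = true;                  /* no nested upcalls while scheduling */
        struct thread *next = pick_next(d);
        d->disabled = false;
        resume(next);                        /* does not return */
    }
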
slide-45
SLIDE 45

Programming example: domain spanning

for i = 1..num_cores-1:
    create a new dispatcher on core i
while (num_dispatchers < num_cores-1):
    wait for the next message and handle it

dispatcher_create_callback:
    num_dispatchers++
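
The same loop in C, assuming a hypothetical asynchronous interface (dispatcher_create_on_core, event_dispatch) standing in for the real library calls: creation requests are fired off for every other core, and the caller then handles messages until all completion callbacks have arrived.

    /* C rendering of the spanning loop above (interface names are assumed). */
    void dispatcher_create_on_core(int core, void (*done)(int core));
    void event_dispatch(void);        /* wait for the next message, handle it */

    static int num_dispatchers = 0;

    static void dispatcher_create_callback(int core)
    {
        num_dispatchers++;            /* one more core has joined the domain */
    }

    void span_domain(int num_cores)
    {
        for (int i = 1; i < num_cores; i++)
            dispatcher_create_on_core(i, dispatcher_create_callback);

        while (num_dispatchers < num_cores - 1)
            event_dispatch();
    }
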

slide-46
SLIDE 46

Domain spanning: baseline

  • Centralized:

– Poor scalability, but correct

  • 1021 messages, 487 alloc. RPCs
  • 50 million cycles (40ms)

[Execution trace: monitor working / blocked / polling / bzero, spantest.exe, name service, memory server]

slide-47
SLIDE 47

Domain spanning: v2

  • Per-core memory servers
  • Better memset(!)

  • Was 50M cycles, now 9M


slide-48
SLIDE 48

Domain spanning: v3

  • Monitors use per-core mem. server
  • Move zeroing off the critical path

  • Was 9M cycles, now 4M


slide-49
SLIDE 49

Domain spanning: v4

  • Change the API
  • Create domains on all cores at once
  • 76 messages

  • Was 4M cycles, now 2.5M

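
A sketch of what the v4 API change might look like in C (names such as core_set_t and span_domain_on_cores are invented for illustration): instead of one create request per core, a single request carries the whole set of target cores and the monitors fan it out, which is what cuts the message count so sharply.

    /* Batched spanning API: one request for a whole set of cores. */
    #include <stdint.h>

    typedef uint64_t core_set_t;             /* bitmask of target cores (< 64) */

    /* Creates dispatchers on every core in the set; a single callback fires
     * once the whole operation has completed. */
    int span_domain_on_cores(core_set_t cores, void (*done)(void));

    static void span_done(void) { /* all dispatchers are up */ }

    int span_all_but_core0(int num_cores)
    {
        core_set_t all = (num_cores >= 64) ? ~0ULL
                                           : (((core_set_t)1 << num_cores) - 1);
        return span_domain_on_cores(all & ~1ULL, span_done);   /* skip core 0 */
    }
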

slide-50
SLIDE 50

  • Introduction
  • Hardware and workloads
  • Multikernel design principles
  • Communication costs
  • Starting a domain

slide-51
SLIDE 51

Current activity

  • Ports to other platforms

– ARM (32-bit), ongoing
– BEE3 FPGA platform

  • Better tracing infrastructure
  • Parallel file system
  • Exploration of 1-machine distributed algorithms
  • Programming model
  • Papers and source code

– http://www.barrelfish.org