The Barrelfish operating system for CMPs: research issues
Tim Harris
Based on slides by Andrew Baumann and Rebecca Isaacs. Joint work with colleagues at MSR Cambridge and ETH Zurich.
The Barrelfish project
- Collaboration between ETH Zurich and MSRC
Andrew Baumann, Paul Barham, Richard Black, Tim Harris, Orion Hodson, Rebecca Isaacs, Simon Peter, Jan Rellermeyer, Timothy Roscoe, Adrian Schüpbach, Akhilesh Singhania, Pierre-Evariste Dagand, Ankush Gupta, Raffaele Sandrini, Dario Simone, Animesh Trivedi
- Introduction
- Hardware and workloads
- Multikernel design principles
- Communication costs
- Starting a domain
Do we need a new OS?
Sun SPARC Enterprise M9000 server
- M9000-64: up to 64 CPUs, 256 cores
- 180cm × 167.4cm × 126cm
- 1880kg
Do we need a new OS?
SGI Origin 3000
- Up to 512 processors
- Up to 1TB memory
Do we need a new OS?
- How might the design of a CMP differ from these existing systems?
- How might the workloads for a CMP differ from those of existing multi-processor machines?
The clichéd single-threaded perf graph
[Graph: log(sequential perf) vs. year]
- Historical 1-thread perf gains came via improved clock rate and transistors used to extract ILP
- #transistors still growing, but now delivered as additional cores and accelerators
- The things that would have used this “lost” perf must now be written to use cores/accelerators
Interactive perf
[Timeline: repeated user input → output interactions]
CC-NUMA architecture
[Diagram: CC-NUMA nodes (CPUs + RAM with directory) attached to an interconnect]
- Adding more CPUs brings more of most other things
Locality property: only go to interconnect for real I/O or sharing
Machine architecture
[Diagram: four cores sharing per-pair L2 caches, with one path to RAM]
- More cores bring more cycles
- ...not necessarily proportionately more cache
- ...nor more off-chip b/w or total RAM capacity
Machine diversity: AMD 4-core
...Sun Niagara-2
...Sun Rock
IEEE INTERNATIONAL SOLID-STATE CIRCUITS CONFERENCE 2010
J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl, D. Jenkins, H. Wilson, N. Borkar, G. Schrom, F. Pailet, S. Jain, T. Jacob, S. Yada, S. Marella, P. Salihundam, V. Erraguntla, M. Konow, M. Riepen, G. Droege, J. Lindemann, M. Gries, T. Apel, K. Henriss, T. Lund-Larsen, S. Steibl, S. Borkar, V. De, R. Van Der Wijngaart, T. Mattson
Intel (Hillsboro, OR; Bangalore, India; Braunschweig, Germany; Santa Clara, CA; DuPont, WA)
A 567mm² processor on 45nm CMOS integrates 48 IA-32 cores and 4 DDR3 channels in a 6×4 2D-mesh network. Cores communicate through message passing using 384KB of on-die shared memory. Fine-grain power management takes advantage of 8 voltage and 28 frequency islands to allow independent DVFS of cores and mesh. As performance scales, the processor dissipates between 25W and 125W.
Introduction Hardware and workloads Multikernel design principles Communication costs Starting a domain
The multikernel model
[Diagram: per-core OS nodes on heterogeneous hardware (x86, x64, ARM, GPU), each holding a state replica and exchanging async messages over the hardware interconnect; apps run above]
Barrelfish: a multikernel OS
- A new OS architecture for scalable multicore systems
- Approach: structure the OS as a distributed system
- Design principles:
– Make inter-core communication explicit
– Make OS structure hardware-neutral
– View state as replicated
#1 Explicit inter-core communication
- All communication with messages
- Decouples system structure from inter-core communication mechanism
- Communication patterns explicitly expressed
- Better match for future hardware
– Naturally supports heterogeneous cores and non-coherent interconnects (e.g. PCIe), interconnects with cheap explicit message passing, and interconnects without cache-coherence (e.g. the Intel 80-core chip)
- Allows split-phase operations
Communication latency
Message passing vs shared memory
- Shared memory (move the data to the operation):
– Each core updates the same memory locations
– Cache-coherence migrates modified cache lines
Shared memory scaling & latency
Message passing
- Message passing (move operation to the data):
– A single server core updates the memory locations
– Each client core sends RPCs to the server
#2 Hardware-neutral structure
- Separate OS structure from hardware
- Only hardware-specific parts:
– Message transports (highly optimised / specialised)
– CPU / device drivers
- Adaptability to changing performance characteristics
– Late-bind protocol and message transport implementations
#3 Replicate common state
- Potentially-shared state accessed as if it were a local replica
– Scheduler queues, process control blocks, etc.
– Required by message-passing model
- Naturally supports domains that do not share memory
- Naturally supports changes to the set of running cores
– Hotplug, power management
Replication vs sharing as the default
- Replicas used as an optimisation in other systems
- In a multikernel, sharing is a local optimisation
– Shared (locked) replica on closely-coupled cores
– Only when faster, as decided at runtime
- Basic model remains split-phase
Introduction Hardware and workloads Multikernel design principles Communication costs Starting a domain
Applications running on Barrelfish
- Slide viewer (but not today...)
- Webserver (www.barrelfish.org)
- Virtual machine monitor (runs unmodified Linux)
- Parallel benchmarks:
– SPLASH-2 – OpenMP
- SQLite
- ECLiPSe (constraint engine)
- more...
1-way URPC message costs

                         Cycles   Msg / K-cycle
2*4-core Intel
  Shared                   180        11.97
  Non-shared               570         3.78
2*2-core AMD
  Same die                 450         3.42
  1 hop                    532         3.19
4*4-core AMD
  Shared                   448         3.57
  1 hop                    545         3.53
  2 hop                    659         3.19
8*4-core AMD
  Shared                   538         2.77
  1 hop                    613         2.79
  2 hop                    682         2.71

- Two hyper-transport requests on AMD
Local vs remote messaging
- URPC to a remote core compares favourably with IPC
- No context switch: TLB unaffected
- Lower cache impact
- Higher throughput for pipelined messages
2*2-core AMD    Cycles   Msg / K-cycle   I-cache lines used   D-cache lines used
URPC              450         3.42                9                    8
L4 IPC            424         2.36               25                   13
Communication perf: IP loopback
- 2*2-core AMD system, 1000-byte packets
– Linux: copy in / out of shared kernel buffers
– Barrelfish: point-to-point URPC channel
                                    Barrelfish   Linux
Throughput (Mbit/s)                     2154      1823
D-cache misses per packet                 21        77
Source->Sink HT bytes per packet        1868      2628
Sink->Source HT bytes per packet         752      2200
Source->Sink HT link utilization          8%       11%
Sink->Source HT link utilization          3%        9%
Case study: TLB shoot-down
- Send a message to every core with a mapping
- Wait for acks
- Linux/Windows:
– Send IPI
– Spin on shared ack count
- Barrelfish:
– Request to local monitor domain
– 1-phase commit to remote cores
– Plug in different communication mechanisms
TLB shoot-down: n*unicast
TLB shoot-down: 1*broadcast
Messaging costs
[Diagram: broadcast messaging; cores in the same package share the L3 cache]
TLB shoot-down: multicast
TLB shoot-down: NUMA-aware m’cast
[Diagram: NUMA-aware multicast: same-package cores (shared L3) first, then cores more hyper-transport hops away]
Messaging costs
End-to-end comparative latency
2-PC pipelining
Introduction Hardware and workloads Multikernel design principles Communication costs Starting a domain
Terminology
- Domain
– Protection domain/address space (“process”)
- Dispatcher
– One per domain per core
– Scheduled by local CPU driver
- Invokes upcall, which then typically runs a core-local user-level thread scheduler
- Domain spanning
– Start instances of a domain on multiple cores
– cf. starting affinitized threads
Programming example: domain spanning
for i = 1..num_cores-1:
    create a new dispatcher on core i
while (num_dispatchers < num_cores-1):
    wait for the next message and handle it

dispatcher_create_callback:
    num_dispatchers++
Domain spanning: baseline
- Centralized:
– Poor scalability, but correct
- 1021 messages, 487 alloc. RPCs
- 50 million cycles (40ms)
[Chart: per-core time split between monitor working / blocked / polling / bzero, spantest.exe, name service, memory server]
Domain spanning: v2
- Per-core memory servers
- Better memset(!)
Was 50M, now 9M
Domain spanning: v3
- Monitors use per-core mem. server
- Move zeroing off the critical path
Was 9M, now 4M
Domain spanning: v4
- Change the API
- Create domains on all cores at once
- 76 messages
Was 4M, now 2.5M
Introduction Hardware and workloads Multikernel design principles Communication costs Starting a domain
Current activity
- Ports to other platforms
– ARM (32-bit), ongoing
– BEE3 FPGA platform
- Better tracing infrastructure
- Parallel file system
- Exploration of 1-machine distributed algorithms
- Programming model
- Papers and source code