The Barrelfish operating system for CMPs: research issues


  1. The Barrelfish operating system for CMPs: research issues Tim Harris Based on slides by Andrew Baumann and Rebecca Isaacs. Joint work with colleagues at MSR Cambridge and ETH Zurich.

  2. The Barrelfish project • Collaboration between ETH Zurich and MSRC Andrew Baumann, Paul Barham, Richard Black, Tim Harris, Orion Hodson, Rebecca Isaacs, Simon Peter, Jan Rellermeyer, Timothy Roscoe, Adrian Schüpbach, Akhilesh Singhania, Pierre-Evariste Dagand, Ankush Gupta, Raffaele Sandrini, Dario Simone, Animesh Trivedi

  3. Introduction • Hardware and workloads • Multikernel design principles • Communication costs • Starting a domain

  4. Do we need a new OS? Sun SPARC Enterprise M9000 server (M9000-64): up to 64 CPUs, 256 cores; 180cm x 167.4cm x 126cm; 1880kg

  5. Do we need a new OS? SGI Origin 3000: up to 512 processors, up to 1TB memory

  6. Do we need a new OS? • How might the design of a CMP differ from these existing systems? • How might the workloads for a CMP differ from those of existing multi-processor machines?

  7. The clichéd single-threaded perf graph [figure: log(seq. perf) vs. year] Historical 1-thread perf gains came via improved clock rate and transistors used to extract ILP; #transistors are still growing, but are now delivered as additional cores and accelerators. The things that would have used this “lost” perf must now be written to use cores/accelerators.

  8. Interactive perf [timeline figure: alternating user input and output events]

  9. CC-NUMA architecture [figure: CPU + RAM nodes, each with RAM & directory, connected to the interconnect] Adding more CPUs brings more of most other things. Locality property: only go to the interconnect for real I/O or sharing.

  10. Machine architecture [figure: cores sharing L2 caches, with a single connection to RAM] More cores bring more cycles... but not necessarily proportionately more cache, nor more off-chip bandwidth or total RAM capacity.

  11. Machine diversity: AMD 4-core

  12. ...Sun Niagara-2

  13. ...Sun Rock

  14. IEEE International Solid-State Circuits Conference (ISSCC) 2010: J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl, D. Jenkins, H. Wilson, N. Borkar, G. Schrom, F. Pailet, S. Jain, T. Jacob, S. Yada, S. Marella, P. Salihundam, V. Erraguntla, M. Konow, M. Riepen, G. Droege, J. Lindemann, M. Gries, T. Apel, K. Henriss, T. Lund-Larsen, S. Steibl, S. Borkar, V. De, R. Van Der Wijngaart, T. Mattson (Intel: Hillsboro OR; Bangalore, India; Braunschweig, Germany; Santa Clara CA; DuPont WA). A 567mm2 processor on 45nm CMOS integrates 48 IA-32 cores and 4 DDR3 channels in a 6×4 2D-mesh network. Cores communicate through message passing using 384KB of on-die shared memory. Fine-grain power management takes advantage of 8 voltage and 28 frequency islands to allow independent DVFS of cores and mesh. As performance scales, the processor dissipates between 25W and 125W.

  15. Introduction • Hardware and workloads • Multikernel design principles • Communication costs • Starting a domain

  16. The multikernel model [figure: applications run over per-core OS nodes on heterogeneous hardware (x86, x64, ARM, GPU); each OS node holds its own state replica, and nodes exchange async messages over the hardware interconnect]

  17. Barrelfish: a multikernel OS • A new OS architecture for scalable multicore systems • Approach: structure the OS as a distributed system • Design principles: – Make inter-core communication explicit – Make OS structure hardware-neutral – View state as replicated

  18. #1 Explicit inter-core communication • All communication with messages • Decouples system structure from the inter-core communication mechanism • Communication patterns explicitly expressed • Better match for future hardware – naturally supports heterogeneous cores, non-coherent interconnects (PCIe) – with cheap explicit message passing – without cache-coherence (e.g. Intel 80-core) • Allows split-phase operations
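A minimal sketch of the kind of transport this implies: a point-to-point channel built from one cache-line-sized slot in shared memory, where the sender publishes a payload by writing a sequence word last and the receiver polls for it. This mirrors the spirit of Barrelfish's URPC transport, but the names, the single-slot layout and the C11-atomics rendering here are illustrative, not the actual Barrelfish code.

      /* Illustrative cache-line message channel (not the real Barrelfish URPC).
       * The slot lives in memory shared by exactly one sender and one receiver;
       * each endpoint keeps its own private sequence counter. */
      #include <stdatomic.h>
      #include <stdint.h>

      #define SLOT_WORDS 7                      /* payload words in a 64-byte line */

      struct urpc_slot {
          uint64_t payload[SLOT_WORDS];
          _Atomic uint64_t seq;                 /* written last: publishes the payload */
      };

      static void urpc_send(struct urpc_slot *s, uint64_t *my_seq,
                            const uint64_t msg[SLOT_WORDS])
      {
          for (int i = 0; i < SLOT_WORDS; i++)
              s->payload[i] = msg[i];
          /* Release store: payload is visible before the new sequence number. */
          atomic_store_explicit(&s->seq, ++*my_seq, memory_order_release);
      }

      static void urpc_recv(struct urpc_slot *s, uint64_t *my_seq,
                            uint64_t msg[SLOT_WORDS])
      {
          uint64_t want = *my_seq + 1;
          while (atomic_load_explicit(&s->seq, memory_order_acquire) != want)
              ;                                 /* poll for the next message */
          for (int i = 0; i < SLOT_WORDS; i++)
              msg[i] = s->payload[i];
          *my_seq = want;
      }

With a single slot the sender must not overwrite a message before it is consumed; a production transport would use a ring of cache lines with acknowledgements for back-pressure.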

  19. Communication latency

  20. Communication latency

  21. Message passing vs shared memory • Shared memory (move the data to the operation): – Each core updates the same memory locations – Cache-coherence migrates modified cache lines
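For contrast, a tiny illustration (not from the slides) of this shared-memory pattern: every core applies its update directly to the same location, so the cache line holding it migrates between cores on each modification.

      /* Shared-memory style: move the data to the operation. */
      #include <pthread.h>
      #include <stdatomic.h>
      #include <stdio.h>

      static _Atomic long shared_counter;            /* one contended cache line */

      static void *worker(void *arg)
      {
          (void)arg;
          for (int i = 0; i < 1000000; i++)
              atomic_fetch_add(&shared_counter, 1);  /* pulls the line to this core */
          return NULL;
      }

      int main(void)
      {
          pthread_t t[4];
          for (int i = 0; i < 4; i++)
              pthread_create(&t[i], NULL, worker, NULL);
          for (int i = 0; i < 4; i++)
              pthread_join(t[i], NULL);
          printf("%ld\n", atomic_load(&shared_counter));  /* 4000000 */
          return 0;
      }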

  22. Shared memory scaling & latency

  23. Message passing • Message passing (move operation to the data): – A single server core updates the memory locations – Each client core sends RPCs to the server
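And the corresponding sketch of the message-passing style: one server core owns the value, and a client ships its operation to it as a small RPC. The one-slot mailbox below is purely illustrative (it assumes a single client; the measured Barrelfish version uses one URPC channel per client), but it shows the key point: the hot data never leaves the server core's cache.

      /* Message-passing style: move the operation to the data. */
      #include <stdatomic.h>
      #include <stdint.h>

      struct rpc_mailbox {
          _Atomic uint64_t request;        /* nonzero = "please increment" */
          _Atomic uint64_t reply;          /* server posts the new value here */
      };

      /* Client core: send the operation, wait for the answer. */
      static uint64_t rpc_increment(struct rpc_mailbox *mb)
      {
          atomic_store_explicit(&mb->request, 1, memory_order_release);
          uint64_t r;
          while ((r = atomic_load_explicit(&mb->reply, memory_order_acquire)) == 0)
              ;                            /* spin until the server answers */
          atomic_store_explicit(&mb->reply, 0, memory_order_relaxed);
          return r;
      }

      /* Server core: the only writer of the state it owns. */
      static void rpc_server_loop(struct rpc_mailbox *mb)
      {
          uint64_t value = 0;              /* stays in this core's cache */
          for (;;) {
              if (atomic_exchange_explicit(&mb->request, 0, memory_order_acquire)) {
                  value++;
                  atomic_store_explicit(&mb->reply, value, memory_order_release);
              }
          }
      }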

  24. Message passing

  25. Message passing

  26. #2 Hardware-neutral structure • Separate OS structure from hardware • Only hardware-specific parts: – Message transports (highly optimised / specialised) – CPU / device drivers • Adaptability to changing performance characteristics – Late-bind protocol and message transport implementations
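One way to picture the late binding of message transports (an illustrative sketch, not the actual Barrelfish interfaces): generic OS code is written against an abstract send/receive table, and a concrete transport is plugged in at run time once the hardware topology and its performance characteristics are known.

      /* Abstract transport interface; concrete implementations are bound late. */
      #include <stddef.h>
      #include <stdint.h>
      #include <string.h>

      struct msg_transport {
          int  (*send)(void *chan, const void *buf, size_t len);
          int  (*recv)(void *chan, void *buf, size_t len);
          void *chan;                        /* transport-private channel state */
      };

      /* Generic code only ever sees the abstract interface. */
      static int notify_core(struct msg_transport *t, uint64_t event)
      {
          return t->send(t->chan, &event, sizeof event);
      }

      /* A trivial in-memory "transport" standing in for a cache-line URPC
       * channel, an interconnect-specific driver, etc.; the right one would
       * be chosen per pair of cores at run time. */
      static uint8_t loop_buf[64];
      static int loop_send(void *chan, const void *buf, size_t len)
      {
          (void)chan;
          memcpy(loop_buf, buf, len < sizeof loop_buf ? len : sizeof loop_buf);
          return 0;
      }
      static int loop_recv(void *chan, void *buf, size_t len)
      {
          (void)chan;
          memcpy(buf, loop_buf, len < sizeof loop_buf ? len : sizeof loop_buf);
          return 0;
      }
      static struct msg_transport loopback = { loop_send, loop_recv, NULL };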

  27. #3 Replicate common state • Potentially-shared state accessed as if it were a local replica – Scheduler queues, process control blocks, etc. – Required by message-passing model • Naturally supports domains that do not share memory • Naturally supports changes to the set of running cores – Hotplug, power management
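A toy sketch of the replica idea (illustrative only, not the Barrelfish replication protocol): reads always hit the core-local copy, while an update is turned into a message posted to every core, which each core applies the next time it polls. Agreeing on an order for concurrent updates is the hard part and is deliberately elided here; the basic model stays split-phase.

      #include <stdatomic.h>
      #include <stdint.h>

      #define NCORES 4

      struct per_core_state {
          uint64_t replica;                  /* core-local copy of shared state */
          _Atomic uint64_t inbox;            /* pending new value (0 = empty) */
      };

      static struct per_core_state cores[NCORES];

      /* Fast path: a read never leaves the local core. */
      static uint64_t read_state(int core) { return cores[core].replica; }

      /* An update is propagated as a message to every replica... */
      static void publish_update(uint64_t new_value)
      {
          for (int c = 0; c < NCORES; c++)
              atomic_store_explicit(&cores[c].inbox, new_value,
                                    memory_order_release);
      }

      /* ...and each core applies it the next time it polls its inbox. */
      static void poll_inbox(int core)
      {
          uint64_t v = atomic_exchange_explicit(&cores[core].inbox, 0,
                                                memory_order_acquire);
          if (v != 0)
              cores[core].replica = v;
      }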

  28. Replication vs sharing as the default • Replicas used as an optimisation in other systems • In a multikernel, sharing is a local optimisation – Shared (locked) replica on closely-coupled cores – Only when faster, as decided at runtime • Basic model remains split-phase

  29. Introduction • Hardware and workloads • Multikernel design principles • Communication costs • Starting a domain

  30. Applications running on Barrelfish • Slide viewer (but not today...) • Webserver (www.barrelfish.org) • Virtual machine monitor (runs unmodified Linux) • Parallel benchmarks: – SPLASH-2 – OpenMP • SQLite • ECLiPSe (constraint engine) • more...

  31. 1-way URPC message costs

      System          Config       Cycles   Msg / Kcycle
      2*4-core Intel  Shared          180          11.97
      2*4-core Intel  Non-shared      570           3.78
      2*2-core AMD    Same die        450           3.42
      2*2-core AMD    1 hop           532           3.19
      4*4-core AMD    Shared          448           3.57
      4*4-core AMD    1 hop           545           3.53
      4*4-core AMD    2 hop           659           3.19
      8*4-core AMD    Shared          538           2.77
      8*4-core AMD    1 hop           613           2.79
      8*4-core AMD    2 hop           682           2.71

      • Two HyperTransport requests on AMD

  32. Local vs remote messaging (2*2-core AMD)

                Cycles   Msg / Kcycle   I-cache lines used   D-cache lines used
      URPC         450           3.42                    9                    8
      L4 IPC       424           2.36                   25                   13

      • URPC to a remote core compares favourably with L4 IPC
      • No context switch: TLB unaffected
      • Lower cache impact
      • Higher throughput for pipelined messages

  33. Communication perf: IP loopback (2*2-core AMD system, 1000-byte packets)

                                          Barrelfish   Linux
      Throughput (Mbit/s)                       2154    1823
      D-cache misses per packet                   21      77
      Source->Sink HT bytes per packet          1868    2628
      Sink->Source HT bytes per packet           752    2200
      Source->Sink HT link utilization            8%     11%
      Sink->Source HT link utilization            3%      9%

      • Linux: copy in / out of shared kernel buffers
      • Barrelfish: point-to-point URPC channel

  34. Case study: TLB shoot-down • Send a message to every core with a mapping • Wait for acks • Linux/Windows: – Send IPI – Spin on shared ack count • Barrelfish: – Request to local monitor domain – 1-phase commit to remote cores – Plug in different communication mechanisms
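A sketch of the shoot-down as a one-phase commit (illustrative; the per-core request/ack flags below stand in for real message channels, and the actual Barrelfish path goes via each core's monitor): the initiator sends an invalidate request to every core that holds the mapping, then waits for all acknowledgements. Swapping how the requests are delivered (unicast, broadcast, multicast) gives the variants compared on the following slides.

      #include <stdatomic.h>
      #include <stdbool.h>

      #define NCORES 8

      struct shootdown {
          _Atomic bool request[NCORES];       /* "please flush this mapping" */
          _Atomic bool ack[NCORES];           /* per-core acknowledgement */
      };

      /* Initiator: one-phase commit across the cores holding the mapping. */
      static void tlb_shootdown(struct shootdown *sd,
                                const bool has_mapping[NCORES])
      {
          for (int c = 0; c < NCORES; c++)
              if (has_mapping[c])
                  atomic_store_explicit(&sd->request[c], true,
                                        memory_order_release);
          for (int c = 0; c < NCORES; c++)
              if (has_mapping[c])
                  while (!atomic_load_explicit(&sd->ack[c],
                                               memory_order_acquire))
                      ;                        /* wait for every ack */
      }

      /* Remote core (polled by its monitor): flush locally, then acknowledge. */
      static void handle_shootdown(struct shootdown *sd, int core)
      {
          if (atomic_exchange_explicit(&sd->request[core], false,
                                       memory_order_acquire)) {
              /* local_tlb_flush();  -- architecture-specific, omitted */
              atomic_store_explicit(&sd->ack[core], true, memory_order_release);
          }
      }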

  35. TLB shoot-down: n*unicast [figure: one cache-line channel per remote core; the initiator writes each line and the corresponding core reads it]

  36. TLB shoot-down: 1*broadcast [figure: a single broadcast channel written once and read by all remote cores]

  37. Messaging costs

  38. TLB shoot-down: multicast [figure: one channel per package; cores on the same package (shared L3) are reached through the same channel]

  39. TLB shoot-down: NUMA-aware multicast [figure: as multicast, but accounting for the extra HyperTransport hops between packages; cores on the same package share an L3]

  40. Messaging costs

  41. End-to-end comparative latency

  42. 2-PC pipelining

  43. Introduction • Hardware and workloads • Multikernel design principles • Communication costs • Starting a domain

  44. Terminology • Domain – protection domain / address space (“process”) • Dispatcher – one per domain per core – scheduled by the local CPU driver via an upcall, which then typically runs a core-local user-level thread scheduler • Domain spanning – start instances of a domain on multiple cores – cf. starting affinitized threads
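A rough sketch of what the dispatcher/upcall split implies for user code. All names here are hypothetical and this is not the real Barrelfish dispatcher interface: the point is only that instead of the kernel resuming a thread, the CPU driver upcalls a per-core entry point in the domain, and that entry point picks the next user-level thread itself.

      /* Hypothetical per-core dispatcher: the CPU driver upcalls run_entry()
       * whenever this dispatcher is given the core; it then schedules a
       * core-local user-level thread. */
      #include <stddef.h>

      struct ult {                          /* a user-level thread */
          struct ult *next;
          void (*resume)(struct ult *);     /* restore registers and jump back */
      };

      struct dispatcher {
          struct ult *runq;                 /* core-local run queue: only this
                                               core touches it, so no locks */
      };

      /* Upcall entry point, invoked by the local CPU driver. */
      static void run_entry(struct dispatcher *d)
      {
          struct ult *t = d->runq;
          if (t != NULL) {
              d->runq = t->next;            /* pop the next runnable thread */
              t->resume(t);                 /* does not return */
          }
          /* else: nothing runnable; yield the core back (omitted). */
      }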

  45. Programming example: domain spanning

      for i = 1..num_cores-1:
          create a new dispatcher on core i
      while (num_dispatchers < num_cores-1):
          wait for the next message and handle it

      dispatcher_create_callback:
          num_dispatchers++

  46. Domain spanning: baseline [execution trace of monitor, bzero, spantest.exe, name service and memory server; monitor shown working / blocked / polling] • Centralized – poor scalability, but correct • 1021 messages, 487 allocation RPCs • 50 million cycles (40ms)

  47. Domain spanning: v2 [execution trace as before] • Per-core memory servers • Better memset(!) • Was 50M cycles, now 9M

  48. Domain spanning: v3 [execution trace as before] • Monitors use the per-core memory server • Move zeroing off the critical path • Was 9M cycles, now 4M

  49. Domain spanning: v4 [execution trace as before] • Change the API – create domains on all cores at once • 76 messages • Was 4M cycles, now 2.5M

  50. Introduction • Hardware and workloads • Multikernel design principles • Communication costs • Starting a domain
