SLIDE 1

Supercomputing Operating Systems: A Naive View from Over the Fence

Timothy Roscoe (Mothy), Systems Group, ETH Zurich

ROSS Workshop, 22nd June 2012

SLIDE 2

Disclaimer: I am a stranger in a strange land

Thank you for inviting me!

  • I’m assuming your field is “Supercomputing”
  • Mine isn’t: I’m a “mainstream” OS researcher
    – Expect considerable naïveté on my part
  • This talk is about the possible intersection and interaction of “Supercomputing” and “OS research”
  • I will exaggerate for effect.
    – Please don’t take it the wrong way.




SLIDE 6

Traditionally…

  • Supercomputing people built and programmed their own machines
    – Wrote their own operating systems and/or complained about the existing ones
  • Mainstream OS people ignored them
    – Insignificant market, no real users
    – Weird, expensive hardware (too many cores)

This is, of course, changing.

SLIDE 7

What’s happening in general-purpose computing?

SLIDE 8

Lots more cores per chip

  • Core counts now follow Moore’s Law
  • Cores will come and go
    – Energy!
  • Diversity of system and processor configurations will grow
  • Cache coherence may not scale to the whole machine

SLIDE 9

Parallelism

  • “End of the free lunch”: cores are not getting faster!
  • Higher performance ⇒ better parallelism
  • New applications ⇒ parallel applications
    – Mining
    – Recognition
    – Synthesis

SLIDE 10

Cores will be heterogeneous

  • NUMA is the norm today
  • Heterogeneous cores for power reduction
  • Dark silicon, specialized cores
  • Integrated GPUs / Crypto / NPUs etc.
  • Programmable peripherals

SLIDE 11

Communication latency really matters

Example: 8 × quad-core AMD Opteron

[Diagram: eight sockets on the interconnect, two with PCIe attached; each socket holds four CPUs with private L1/L2 caches, a shared L3, and local RAM]

                 Access cycles   Normalized to L1   Per-hop cost
  L1 cache              2               1
  L2 cache             15               7.5
  L3 cache             75              37.5
  Other L1/L2         130              65
  1-hop cache         190              95                60
  2-hop cache         260             130                70

(The per-hop cost is the extra latency each additional interconnect hop adds: 190 − 130 = 60 cycles for the first hop, 260 − 190 = 70 for the second.)


SLIDE 13

Implications

  • Computers are systems of cores and other devices which:
    – Are connected by highly complex interconnects
    – Entail significant communication latency between nodes
    – Consist of heterogeneous cores
    – Show unpredictable diversity of system configurations
    – Have dynamic core set membership
    – Provide only limited shared memory or cache coherence

The OS model of cooperating processes over a shared-memory multithreaded kernel is dead.

SLIDE 14

What’s really new?

  • Actually, multiprocessors are nothing new in general-purpose computing
  • Neither are threads: people have been building systems with threads for a long time.
    – Word, databases, games, servers, browsers, etc.
  • Concurrency is old. We understand it.
  • Parallelism is new.

SLIDE 15

Parallels with Supercomputing

  • Lots of cores
  • Implies parallelism should be used!
  • Message passing predominates
  • Heterogeneous cores (GPUs, Cell BE, etc.)
  • Lots of algorithms highly tuned to complex interconnects, memory hierarchies, etc.

Surely we can use all the cool ideas in supercomputing for our new OS!

SLIDE 16

Barrelfish: our multikernel

  • ETH Zurich + Microsoft Research
  • Open source (MIT Licence)
  • Published 2009
  • Under active development
  • External user community
  • See www.barrelfish.org

SLIDE 17

Non-original ideas in Barrelfish: techniques we liked

  • Capabilities for resource management (seL4)
  • Minimize shared state (Tornado, K42)
  • Upcall processor dispatch (Psyche, Sched. Activations)
  • Push policy into user-space domains (Exokernel, Nemesis)
  • User-space RPC decoupled from IPIs (URPC)
  • Lots of information (Infokernel)
  • Single-threaded non-preemptive kernel per core (K42)
  • Run drivers in their own domains (µkernels, Xen)
  • Specify device registers in a little language (Devil) — sketched below
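To make the last point concrete: below is a sketch of the kind of hand-written C accessor code a Devil-style register specification is meant to generate mechanically. The register layout is invented for illustration; it is not from any real device or from Barrelfish.

```c
/* Hand-rolled accessors for an invented device register:
 *   bits 0-3 = transmit queue length, bit 7 = interrupt enable.
 * A Devil-style spec declares the layout once and generates
 * type-checked accessors like these automatically. */
#include <stdint.h>

#define REG_CTRL_OFFSET 0x10u   /* hypothetical register offset */

static inline uint32_t ctrl_read(volatile uint32_t *base)
{
    return base[REG_CTRL_OFFSET / 4];
}

static inline void ctrl_set_txq_len(volatile uint32_t *base, uint32_t len)
{
    uint32_t v = ctrl_read(base);
    v = (v & ~0xFu) | (len & 0xFu);          /* replace bits 0-3 */
    base[REG_CTRL_OFFSET / 4] = v;
}

static inline void ctrl_enable_irq(volatile uint32_t *base)
{
    base[REG_CTRL_OFFSET / 4] = ctrl_read(base) | (1u << 7);
}
```

Writing this by hand for every register is tedious and error-prone, which is exactly why a little language is attractive.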

SLIDE 18

What things does it run on?

  • PCs: 32-bit and 64-bit x86 architectures
    – Including a mixture of the two!
  • Intel SCC
  • Intel MIC platform
  • Various ARM platforms
  • Beehive
    – Experimental Microsoft Research softcore

Seamlessly with x86 host PCs!

SLIDE 19

What things run on it?

  • Many microbenchmarks
  • Webserver: http://www.barrelfish.org/
  • Databases: SQLite, PostgreSQL, etc.
  • Virtual machine monitor
    – Linux kernel binary
  • Microsoft Office 2010!
    – via Drawbridge
  • Parallel benchmarks:
    – PARSEC, SPLASH-2, NAS

More on this later…

SLIDE 20

Rethinking OS Design #1: the Multikernel Architecture

SLIDE 21

The Multikernel Architecture

  • Computers are systems of cores and other devices which:
    – Are connected by highly complex interconnects
    – Entail significant communication latency between nodes
    – Consist of heterogeneous cores
    – Show unpredictable diversity of system configurations
    – Have dynamic core set membership
    – Provide only limited shared memory or cache coherence

⇒ Forget about shared memory. The OS is a distributed system based on message passing.

SLIDE 22

Multikernel principles

  • Share no data between cores
    – All inter-core communication is via explicit messages
    – Each core can have its own implementation
  • OS state partitioned if possible, replicated if not
    – State is accessed as if it were a local replica
  • Invariants enforced by distributed algorithms, not locks
    – Many operations become split-phase and asynchronous (see the sketch below)
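To make “split-phase and asynchronous” concrete, here is a minimal hypothetical C sketch of the pattern (my illustration, not Barrelfish code): instead of locking shared state, a core sends a request to the core that owns the state and supplies a continuation that runs when the reply arrives.

```c
/* Hypothetical sketch of a split-phase, message-based OS operation,
 * in the spirit of the multikernel model (not actual Barrelfish code). */
#include <stdio.h>

typedef void (*continuation_t)(int result, void *arg);

struct msg {
    int op;                 /* requested operation           */
    int arg;                /* operation argument            */
    continuation_t cont;    /* run this when the reply comes */
    void *cont_arg;
};

/* Stand-in for a real inter-core channel: we "deliver" synchronously
 * here so the sketch runs on one core. */
static void remote_core_handle(struct msg *m)
{
    int result = m->arg * 2;        /* owner core updates its replica */
    m->cont(result, m->cont_arg);   /* reply triggers continuation    */
}

static void send_request(int op, int arg, continuation_t cont, void *carg)
{
    struct msg m = { op, arg, cont, carg };
    remote_core_handle(&m);         /* phase 1: request leaves core   */
}

static void on_reply(int result, void *arg)
{
    printf("phase 2: got result %d for request %s\n", result, (char *)arg);
}

int main(void)
{
    /* Phase 1 issues the request and returns immediately;
     * phase 2 (on_reply) runs when the owner core answers. */
    send_request(1, 21, on_reply, "update-pagetable");
    return 0;
}
```

In a real multikernel the delivery would of course be an asynchronous inter-core message rather than the direct call used here to keep the sketch runnable.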

SLIDE 23

The multikernel model

[Diagram: heterogeneous hardware — x86_64 CPUs, an ARM core, a NIC, and a GPU with CPU features — joined by one or more interconnects. Each core runs its own OS node holding a state replica; the nodes exchange asynchronous messages. Applications in user space span several OS nodes; only the architecture-specific code differs per node.]

SLIDE 24

…vs a monolithic OS on multicore

[Diagram: four x86 cores share a single kernel across the interconnect; main memory holds the global kernel data structures, and applications run on top of the one shared kernel.]

SLIDE 25

…vs a microkernel on multicore

[Diagram: four x86 cores run a shared microkernel; OS servers and applications run in user mode, each holding its own state, with the kernel’s own state kept minimal underneath.]



SLIDE 28

Replication vs sharing as the default

  • Replicas used as an optimization in other systems
  • In a multikernel, sharing is a local optimisation
    – Shared (locked) replica on closely-coupled cores
    – Only when faster, as decided at runtime
  • Basic model remains split-phase messaging

[Spectrum diagram, from traditional OSes to the multikernel: shared state with one big lock → finer-grained locking → clustered objects → partitioning → distributed state with replica maintenance]

SLIDE 29

Rethinking OS Design #2: the System Knowledge Base

SLIDE 30

System knowledge base

  • Computers are systems of cores and other devices which:
    – Are connected by highly complex interconnects
    – Entail significant communication latency between nodes
    – Consist of heterogeneous cores
    – Show unpredictable diversity of system configurations
    – Have dynamic core set membership
    – Provide only limited shared memory or cache coherence

⇒ Give the OS advanced reasoning techniques to make sense of the hardware and workload at runtime.

SLIDE 31

System knowledge base

  • Fundamental operating system service
  • Knowledge-representation framework
    – Database
    – RDF
    – Logic programming and inference
    – Description logics
    – Satisfiability modulo theories
    – Constraint satisfaction
    – Optimization

SLIDE 32

What goes in?

  1. Resource discovery (see the CPUID sketch below)
     – E.g. PCI enumeration, ACPI, CPUID…
  2. Online hardware profiling
     – Inter-core all-pairs latency, cache measurements…
  3. Operating system state
     – Locks, process placement, etc.
  4. “Things we just know”
     – Assertions from data sheets, etc.
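As a concrete taste of item 1, here is a hedged, standalone C sketch of one discovery probe — reading the CPU vendor string and a feature flag via CPUID, using GCC/Clang’s <cpuid.h>. The Prolog-ish output format merely hints at how such facts might be stored; Barrelfish’s actual probing code and SKB fact syntax are not shown here.

```c
/* Minimal sketch of CPUID-based resource discovery (GCC/Clang).
 * A real SKB loader would assert facts like these into its database;
 * this standalone probe just prints them in a fact-like syntax. */
#include <stdio.h>
#include <string.h>
#include <cpuid.h>

int main(void)
{
    unsigned eax, ebx, ecx, edx;
    char vendor[13] = {0};

    /* Leaf 0: highest supported leaf and the 12-byte vendor string. */
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 1;
    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    printf("vendor(%s).\n", vendor);

    /* Leaf 1: feature flags, e.g. whether SSE2 is present. */
    __get_cpuid(1, &eax, &ebx, &ecx, &edx);
    printf("feature(sse2, %s).\n", (edx & bit_SSE2) ? "true" : "false");
    return 0;
}
```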

SLIDE 33

What is it used for?

  • Name service and registry
  • Locking/coordination service
  • Device management
  • Hardware configuration
  • Spatial scheduling and thread placement (toy sketch below)
  • Optimization for hardware platform
  • Intra-machine routing
  • etc.
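To illustrate “spatial scheduling and thread placement”: a toy C sketch (mine, not SKB code) that brute-forces the placement of two communicating threads onto the pair of cores with the lowest core-to-core latency. The latency matrix is invented; an SKB would hold real all-pairs measurements (slide 32) and answer such queries with constraint solving rather than enumeration.

```c
/* Toy placement search: pick the pair of distinct cores with the
 * lowest measured core-to-core latency for two chatty threads.
 * The latency matrix is made up for illustration only. */
#include <stdio.h>

#define NCORES 4

static const int latency[NCORES][NCORES] = {   /* cycles, invented */
    {   0, 130, 190, 260 },
    { 130,   0, 190, 260 },
    { 190, 190,   0, 130 },
    { 260, 260, 130,   0 },
};

int main(void)
{
    int best_a = 0, best_b = 1, best = latency[0][1];
    for (int a = 0; a < NCORES; a++)
        for (int b = 0; b < NCORES; b++)
            if (a != b && latency[a][b] < best) {
                best = latency[a][b];
                best_a = a; best_b = b;
            }
    printf("place threads on cores %d and %d (latency %d cycles)\n",
           best_a, best_b, best);
    return 0;
}
```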

SLIDE 34

So what happened?

SLIDE 35

What happened?

  • Barrelfish achieved some of its goals
    – Showed scalability, adaptability, support for heterogeneous machines
    – More work in the pipeline
  • HPC people contacted us because, apparently, they wanted a new OS
    – We couldn’t understand why.
  • Much of what we borrowed from supercomputing turned out to be of limited use.
    – Why?

SLIDE 36

General-purpose computing ≠ Supercomputing

SLIDE 37

The hardware is different.

SLIDE 38

These are supercomputers.

Artistic case design! Plenty of custom hardware!

SLIDE 39

Supercomputers don’t just look cool

  • Supercomputers have cool hardware!
    – Message-passing networks
    – In-network collection and reduction primitives
    – Fault-tolerance & partial failure
    – Vector units
    – Etc.

SLIDE 40

This is not a supercomputer.

SLIDE 41

This is not a supercomputer.

This is Facebook.

SLIDE 42

Neither is this.

SLIDE 43

Neither is this.

This is actually a Microsoft 40-foot shipping container.

SLIDE 44

Not very glamorous case design.

SLIDE 45

These aren’t supercomputers either

SLIDE 46

The software is different

SLIDE 47

This is not a supercomputing application.

SLIDE 48

Computationally intensive, highly parallelizable

  • Vision and depth-cam processing
  • Skeletal body tracking
  • Facial feature and gesture recognition
  • Audio beamforming
  • Speech and phoneme recognition
  • 3D mesh construction

SLIDE 49

These are also not supercomputing applications.

  • Facebook
  • Google
  • Bing
  • Second Life
  • World of Warcraft
  • Twitter
  • YouTube
  • etc.

SLIDE 50

General-purpose software is…

  • Parallel (increasingly)
    – But complex, dynamic structure!
  • Continuous
    – Long-running services
  • Soft real-time
    – Bounded response time, interactivity
  • Imprecise
    – Sometimes it’s better to be wrong than late
  • Bursty, dynamic, interactive
    – No clear execution cycle, load changes unexpectedly

SLIDE 51

Overall workload is different.

SLIDE 52

Workload assumptions

  • General-purpose OS target:
    – Many concurrent tasks
    – Diverse performance requirements
    – Unpredictable mix
    – Goal: satisfy SLAs and then optimize power, throughput, responsiveness, etc.
  • Supercomputing:
    – Serial jobs. Complete each one ASAP.


SLIDE 56

Example: how long should a thread spin?

  • Operating systems answer:
    1. It depends (on the workload)
    2. The time taken to context switch (if you know nothing about the workload)
  • HPC answer:
    – As long as it takes for something to happen.
    – Intel OpenMP default spinwait time: 200 ms

600,000,000 cycles @ 3 GHz!
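The OS answer — spin for about one context-switch time, then block — is classic competitive spinning. A hedged C sketch using pthreads (illustrative only; the threshold constant is made up and would be measured in practice):

```c
/* Sketch of competitive spinning: spin for roughly the cost of a
 * context switch, then fall back to a blocking lock. The threshold
 * is an invented constant; a real system would calibrate it. */
#include <pthread.h>
#include <stdint.h>

#define SPIN_LIMIT_ITERS 10000   /* stand-in for context-switch cost */

static void competitive_lock(pthread_mutex_t *m)
{
    for (uint64_t i = 0; i < SPIN_LIMIT_ITERS; i++) {
        if (pthread_mutex_trylock(m) == 0)
            return;              /* won while spinning: cheap path */
    }
    pthread_mutex_lock(m);       /* spun one "context switch" worth:
                                    now block and yield the core    */
}

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

int main(void)
{
    competitive_lock(&lock);
    /* ... critical section ... */
    pthread_mutex_unlock(&lock);
    return 0;
}
```

With the threshold set near the context-switch cost, the thread wastes at most roughly twice the optimal amount of time (the standard 2-competitive argument); a 200 ms spin — 600,000,000 cycles at 3 GHz — is several orders of magnitude beyond that.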

SLIDE 57

Consequences


SLIDE 59

1. Hardware optimization techniques not directly applicable

  • Good performance ⇒ careful use of hardware
    – Caches and memory hierarchy
    – Microarchitecture dependencies
    – Interconnect topology
  • But:
    – Current hardware changes faster than software can
    – Commodity hardware already massively diverse
    – Dynamic sharing changes the problem

⇒ Cannot tune the OS, or any other program, to the hardware at design time

SLIDE 60

1. Hardware optimization techniques not directly applicable

  • Techniques can be used (and already are), but:
    – Can’t be baked into the software
    – Have to adapt dynamically to current hardware
  • We use the SKB to optimize spatial placement, cache awareness, etc.
    – Must interact with the OS scheduler
  • Use Scheduler Activations, SKB state, user-level threads, etc.
  • Much ongoing research!

SLIDE 61

2. Benchmarks of limited use

  • PARSEC-2, etc. are highly stylized
    – For good reason: they highlight a range of execution patterns
    – Focus on performance of “simple” codes
    – Very little I/O
  • Don’t stress the OS (or even the runtime)
  • A general-purpose job mix would have:
    – Concurrent programs with diverse requirements
    – Multiple parallel tasks within a program
    – Copious I/O and asynchronicity

SLIDE 62

2. Benchmarks of limited use

  • Still may be useful for:
    – Characterizing some execution patterns
    – Synthetic load generators
    – Building blocks for larger workloads?
  • Open question: how to benchmark general-purpose system software?
    – Cf. Avatar Kinect, etc.

SLIDE 63

3. Co-scheduling doesn’t work (yet)

  • Almost nothing benefits from gang scheduling
    – Competitive spinning ⇒ backfilling makes more efficient use of the machine
    – If one app needs it ⇒ schedule it with priority
    – More than one app ⇒ spatially partition, or greedily schedule as best-effort
    – Only of benefit when compute phase ≈ context switch time
  • Impact on turnaround time for one job is negligible.

SLIDE 64

3. Co-scheduling doesn’t work (yet)

  • Some kind of coordinated scheduling might be useful:
    – Multiple, parallel database joins
    – SMP virtual machines
  • Needs to understand:
    – I/O operations
    – IPC
    – Etc.

SLIDE 65

HPC folks were worried about OS “noise”

  • Two problems:
    1. Message latency
    2. CPU “jitter”
  • Message latency:
    – Custom message-passing hardware is rarely user-safe
    – Map the device into user space (VIA, etc.)
    – More recent tricks: abuse SR-IOV!

SLIDE 66

CPU jitter

  • CPU jitter is a spatial scheduling non-problem
    – At least in the OS research community
    – If you perform I/O, it’s game over anyway
    – If you don’t, your problem is caches and interrupts
  • So, if you really want performance isolation (see the sketch below):
    – Steer all your interrupts to different cores
    – Place applications to avoid cache crosstalk
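A hedged, Linux-flavoured C sketch of the second remedy — pinning an application to a chosen core with sched_setaffinity(2). Interrupt steering is the other half, done separately by writing CPU masks under /proc/irq/ (needs root, so it is only mentioned in a comment); core 3 is an arbitrary choice for the demo.

```c
/* Sketch: pin this process to one core so that neither the scheduler
 * nor (once IRQs are steered elsewhere) interrupts perturb it.
 * Linux-specific; core number 3 is arbitrary. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);                 /* run only on core 3 */

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to core 3; steer IRQs away via /proc/irq/*/smp_affinity\n");
    return 0;
}
```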

SLIDE 67

  • Q. Why does no general-purpose OS do this?
  • A. Nobody cares.
    – Plenty of tasks that you want to run anyway
    – Applications aren’t sensitive to jitter
    – Most spend lots of time in the kernel
  • However, Barrelfish can isolate applications…
    – Potentially useful for future applications
    – Investigate when Torsten Höfler arrives at ETHZ!


SLIDE 71

4. Messaging hardware isn’t useful (yet)

  • HPC-inspired proposals appearing for commodity hardware
    – E.g. Intel SCC message buffers
  • Tailored to a single user
    – Can’t be multiplexed efficiently
    – Requires kernel mediation for protection ⇒ prohibitively expensive to use
  • Tailored to a single application
    – Small, bounded buffers ⇒ expensive flow control
    – Hard to context switch

SLIDE 72

4. Messaging hardware isn’t useful (yet)

  • Design of useful hardware support for general-purpose messages is an open research area (a software baseline is sketched below)
    – User-level multiplexing
    – Decoupling notification from delivery
    – Flow control and congestion avoidance
    – API design
  • Many ideas from MPI, Blue Gene, etc. are highly relevant
    – But they require considerable changes!
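For context, this is roughly what multikernels build in software today in the absence of messaging hardware: a user-level channel over cache-coherent shared memory, in the spirit of the URPC technique listed on slide 17. The sketch below is hypothetical C11, single-producer/single-consumer, and deliberately leaves out multiplexing, notification, and flow control — exactly the open problems listed above.

```c
/* Minimal user-level message channel over shared memory (C11 atomics).
 * Single producer, single consumer; channel memory must start zeroed.
 * A hypothetical sketch in the spirit of URPC-style channels, not
 * actual Barrelfish code. */
#include <stdatomic.h>
#include <stdint.h>
#include <stdbool.h>

#define SLOTS 64                    /* power of two */

struct slot {
    uint64_t payload[7];            /* message body                */
    _Atomic uint64_t seq;           /* written last: publishes msg */
};

struct channel {
    struct slot ring[SLOTS];
    uint64_t send_next;             /* producer-private cursor */
    uint64_t recv_next;             /* consumer-private cursor */
};

/* Producer: write payload, then publish by storing the sequence. */
static bool chan_send(struct channel *c, const uint64_t *msg)
{
    struct slot *s = &c->ring[c->send_next % SLOTS];
    if (atomic_load_explicit(&s->seq, memory_order_acquire) != 0)
        return false;               /* slot unconsumed: flow control */
    for (int i = 0; i < 7; i++)
        s->payload[i] = msg[i];
    atomic_store_explicit(&s->seq, ++c->send_next, memory_order_release);
    return true;
}

/* Consumer: poll the next slot. Spinning here is the "notification
 * decoupled from delivery" question, in hardware-support terms. */
static bool chan_recv(struct channel *c, uint64_t *msg)
{
    struct slot *s = &c->ring[c->recv_next % SLOTS];
    if (atomic_load_explicit(&s->seq, memory_order_acquire) != c->recv_next + 1)
        return false;               /* nothing new yet */
    for (int i = 0; i < 7; i++)
        msg[i] = s->payload[i];
    atomic_store_explicit(&s->seq, 0, memory_order_release);  /* free slot */
    c->recv_next++;
    return true;
}

int main(void)
{
    static struct channel ch;       /* static ⇒ zeroed: all slots free */
    uint64_t out[7] = {42}, in[7];
    chan_send(&ch, out);
    return chan_recv(&ch, in) && in[0] == 42 ? 0 : 1;
}
```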

SLIDE 73

Conclusion

  • Supercomputing and OS research: traditionally disjoint areas
    – Things are changing in both areas
    – Each side has ideas useful to the other
  • Problems and assumptions remain very different
    – Cross-fertilization of fields is difficult (but interesting!)

SLIDE 74

Open questions

  • What ideas from supercomputing might be important to the design of general-purpose operating systems?
  • Are there concepts and challenges from general-purpose operating systems which are becoming a concern in supercomputing?