SLIDE 1 Improving the Reliability of Commodity Operating Systems
Mike Swift, Brian Bershad, Hank Levy, University of Washington
Slides courtesy of Michael Swift, University of Wisconsin-Madison
SLIDE 2 Outline
- Introduction
- Vision
- Design
- Evaluation
- Summary
SLIDE 3 The Problem
- Operating system crashes are a huge problem today
– 5% of Windows systems crash every day
- Device drivers are the biggest cause of crashes
– Drivers cause 85% of Windows XP crashes
– Drivers are 7 times buggier than the kernel in Linux
- We built Nooks, a system that prevents drivers from crashing the OS
– We can prevent 99% of faults in our tests that crash native Linux
SLIDE 4
Crashes Today
User Program Kernel Driver User Program
SLIDE 7 Outline
- Introduction
- Vision
- Design
- Evaluation
- Summary
SLIDE 8
Vision
User Program Kernel Driver User Program
SLIDE 10 Reality
– 113 million copies sold in 2002 – 40 million lines of code – $1 billion development cost – 35,000 drivers available
– 18 million users – 30 million lines of code – Equivalent $1 billion development cost
SLIDE 11 Vision Requirements
- 1. Isolation
- 2. Recovery
- 3. Compatibility
- No code changes
- No new languages
- No new OS
- No new hardware
- No new perspective
SLIDE 12 Outline
- Introduction
- Vision
- Design
- Evaluation
- Summary
SLIDE 13 Assumptions and Principles
– Drivers are generally well behaved
– Don’t need to prevent every crash to be useful
– Design for fault resistance (not fault tolerance)
– Design for mistakes (not abuse)
SLIDE 14 Goal
We want a practical, “best-effort” solution
- Prevents many crashes
- Good performance
- Works with today’s operating systems and drivers
SLIDE 15 Design of Nooks
- Standard Linux kernel and drivers
- Plus:
– Isolation
– Recovery
- Compatible with existing code
SLIDE 16
Existing Kernels
User Program Kernel Driver User Program
SLIDE 17
Isolation - Memory
User Program Kernel Driver User Program Stack Heap
Lightweight Kernel Protection Domains
SLIDE 18
Isolation - Control Transfer
User Program Kernel Driver User Program
SLIDE 19 Isolation - Control Transfer
User Program Kernel Driver User Program
XPC XPC
eXtension Procedure Call
SLIDE 20
Isolation - Data Access
User Program Kernel Driver User Program
SLIDE 21
Isolation - Data Access
User Program Kernel Driver User Program
Copy-in / Copy-out
SLIDE 22
Isolation - Interposition
User Program Kernel Driver User Program
SLIDE 23 Isolation - Interposition
User Program Kernel Driver User Program
XPC XPC
Wrappers
SLIDE 24 Design Summary
– Lightweight Kernel Protection Domains
– eXtension Procedure Call (XPC)
– Copy-in/Copy-out
– Wrappers
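As a rough illustration of how these four mechanisms could fit together, the sketch below simulates them in user-space Python. It is a toy model, not Nooks code: `ProtectionDomain`, `xpc_call`, `wrap`, and the two example drivers are hypothetical names invented for this example. The wrapper copies a kernel object into the driver's domain, transfers control through an XPC-like call, and copies the result back only if the driver returns cleanly.

```python
import copy


class DomainFault(Exception):
    """Raised when code running in a driver domain fails (hypothetical)."""


class ProtectionDomain:
    """Toy stand-in for a lightweight kernel protection domain."""

    def __init__(self, name):
        self.name = name
        self.private_heap = {}  # the only objects the driver may modify


def xpc_call(domain, func, *args):
    """Toy XPC: run func on behalf of the domain and report any failure
    as a DomainFault instead of an arbitrary exception."""
    try:
        return func(*args)
    except Exception as exc:
        raise DomainFault(f"{domain.name}: {exc!r}") from exc


def wrap(domain, driver_func):
    """Wrapper: copy-in, call the driver through XPC, copy-out on success."""
    def wrapper(kernel_obj):
        private = copy.deepcopy(kernel_obj)            # copy-in
        domain.private_heap[id(kernel_obj)] = private
        result = xpc_call(domain, driver_func, private)
        kernel_obj.update(private)                     # copy-out after a clean return
        return result
    return wrapper


def good_driver(pkt):
    pkt["sent"] = True
    return len(pkt["data"])


def buggy_driver(pkt):
    pkt["data"] = None  # corrupts only its private copy
    raise RuntimeError("bad pointer")


domain = ProtectionDomain("e1000")

packet = {"data": b"hello", "sent": False}
sent_bytes = wrap(domain, good_driver)(packet)

packet2 = {"data": b"world", "sent": False}
try:
    wrap(domain, buggy_driver)(packet2)
    fault_seen = False
except DomainFault:
    fault_seen = True
```

In this model the buggy driver damages only its private copy, so `packet2` is unchanged after the fault. Real hardware-enforced domains work at page granularity; the dictionary copy here only mimics the idea.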
SLIDE 25
Recovery - Fault Detection
User Program Kernel Driver User Program Processor Recovery
SLIDE 26
Recovery - Fault Detection
User Program Kernel Driver User Program Recovery
SLIDE 27
Recovery - Fault Detection
User Program Kernel Driver User Program Recovery Detector
SLIDE 28 Recovery
User Program Kernel Driver User Program Recovery
Stop
SLIDE 29
Recovery
User Program Kernel User Program Recovery
Stop / Unload
SLIDE 30 Recovery
User Program Kernel Driver User Program Recovery
Stop / Unload / Reload
GO
SLIDE 31 Design Summary
– Lightweight Kernel Protection Domains
– eXtension Procedure Call (XPC)
– Copy-in/Copy-out
– Wrappers
– Hardware and software checks
– Stop / Unload and GC / Reload
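The detect, stop, unload, reload sequence can be pictured as a small control loop. The following is a minimal sketch under stated assumptions, not the Nooks recovery manager: `Driver`, `DriverFault`, and `RecoveryManager` are hypothetical, a Python exception stands in for a hardware exception or failed software check, and the failed request is simply reported as `None`.

```python
class DriverFault(Exception):
    """Stands in for a hardware exception or a failed software check."""


class Driver:
    """Toy driver that faults on one specific request (for the demo)."""

    def __init__(self, name, fail_on=None):
        self.name = name
        self.fail_on = fail_on
        self.resources = []  # objects allocated on the driver's behalf
        self.loaded = True

    def handle(self, request):
        if request == self.fail_on:
            raise DriverFault(f"{self.name} faulted on {request!r}")
        self.resources.append(f"buf-{request}")
        return f"ok:{request}"


class RecoveryManager:
    """Routes requests to the driver and recovers when a fault is detected."""

    def __init__(self, factory):
        self.factory = factory
        self.driver = factory()
        self.log = []

    def call(self, request):
        try:
            return self.driver.handle(request)
        except DriverFault:
            self.recover()
            return None  # the failed request is reported as an error here

    def recover(self):
        old = self.driver
        old.loaded = False            # stop
        self.log.append("stop")
        old.resources.clear()         # unload and garbage-collect
        self.log.append("unload+gc")
        self.driver = self.factory()  # reload a fresh instance
        self.log.append("reload")


manager = RecoveryManager(lambda: Driver("sb", fail_on=2))
results = [manager.call(r) for r in (1, 2, 3)]
```

After the fault on request 2, the manager keeps running and request 3 is served by the reloaded driver instance.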
SLIDE 32 Some Limitations
- Blame the processor
- Blame the operating system
- Blame us
SLIDE 33 Outline
– Reliability
– Performance
– Implementation Cost
SLIDE 34 Tested Drivers
– SoundBlaster 16 (sb)
– Ensoniq 1371
– Intel Pro/1000 Gigabit Ethernet (e1000)
– AMD PCnet32 10/100 Mb Ethernet (pcnet32)
– 3Com 3c90x 10/100 Mb Ethernet
– 3Com 3c59x 10/100 Mb Ethernet
– VFAT Windows-compatible filesystem (vfat)
– kHTTPd in-kernel web server (khttpd)
SLIDE 35
Reliability Test Methodology
Test Inject bugs Reboot Load driver Nothing Failure
SLIDE 36
Reliability Test Methodology
Test Inject bugs Reboot Load driver Nothing Failure Recovery
SLIDE 37 Nooks Stops Crashes
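[Bar chart: number of crashes (axis 50 to 200) per extension, No Nooks vs. Nooks. Extensions shown: pcnet32. Data labels: 119]
SLIDE 38 Nooks Stops Crashes
[Bar chart: number of crashes (axis 50 to 200) per extension, No Nooks vs. Nooks. Extensions shown: pcnet32. Data labels: 119]
SLIDE 39 Nooks Stops Crashes
[Bar chart: number of crashes (axis 50 to 200) per extension, No Nooks vs. Nooks. Extensions shown: pcnet32, e1000. Data labels: 119, 52]
SLIDE 40 Nooks Stops Crashes
[Bar chart: number of crashes (axis 50 to 200) per extension, No Nooks vs. Nooks. Extensions shown: pcnet32, e1000. Data labels: 119, 52]
SLIDE 41 Nooks Stops Crashes
[Bar chart: number of crashes (axis 50 to 200) per extension, No Nooks vs. Nooks. Extensions shown: pcnet32, e1000, sb. Data labels: 119, 52, 10, 1]
SLIDE 42 Nooks Stops Crashes
[Bar chart: number of crashes (axis 50 to 200) per extension, No Nooks vs. Nooks. Extensions shown: pcnet32, e1000, sb, kHTTPd, VFAT. Data labels: 119, 52, 10, 1, 175, 2, 10, 2]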
SLIDE 43 Performance
– Performance depends on the frequency of interaction with the kernel
SLIDE 44 Relative Performance
[Bar chart: performance relative to native Linux (axis 0.2 to 1) by workload: Play MP3, Receive Stream, Send Stream, Apache SpecWeb, Compile Local, Simple Web. Driver labels shown: sb. XPC/sec labels shown: 150]
SLIDE 45 Relative Performance
[Same chart, more bars. Driver labels shown: sb, e1000, e1000. XPC/sec labels shown: 8,923; 60,352; 150]
SLIDE 46 Relative Performance
[Same chart, more bars. Driver labels shown: sb, e1000, e1000, e1000. XPC/sec labels shown: 8,923; 60,352; 1,960; 150]
SLIDE 47 Relative Performance
[Same chart, all bars. Driver labels shown: sb, e1000, e1000, e1000, VFAT, kHTTPd. XPC/sec labels shown: 61,183; 22,653; 8,923; 60,352; 1,960; 150]
SLIDE 48 Implementation Cost
– Kernel: 924 out of 1.1 million lines
– Device drivers+VFAT: 0 out of 33,000 lines
– kHTTPd: 13 out of 2,000 lines
– Nooks reliability layer: 22,266 lines
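To put these counts in proportion, the short calculation below turns each "changed out of total" pair from this slide into a percentage. It is plain arithmetic on the slide's own numbers; the dictionary keys are just labels chosen for this example.

```python
# (lines changed, total lines) as given on the slide
changes = {
    "kernel": (924, 1_100_000),
    "device drivers + VFAT": (0, 33_000),
    "kHTTPd": (13, 2_000),
}

percent_changed = {
    name: 100.0 * changed / total
    for name, (changed, total) in changes.items()
}

for name, pct in percent_changed.items():
    print(f"{name}: {pct:.3f}% of lines changed")
```

By this measure the kernel change is under a tenth of a percent of its code, and the kHTTPd change is under one percent.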
SLIDE 49 Summary
- Nooks provides a new reliability layer between drivers and the OS
- Nooks prevents 99% of tested faults that cause Linux to crash
- Nooks imposes a modest performance cost
SLIDE 50 Why didn’t we use a microkernel?
- Doesn’t address our limitations
– Isolation not much better
– Fault detection not much better
– Recovery not much better
– Doesn’t improve performance
- Requires more changes to the kernel
- Makes compatibility more difficult
SLIDE 51 Recovery
– Restore driver state so it can process requests as if it had never failed
– Conceal failure from applications
– Driver interface specifies how driver responds to requests
- Approach: Model drivers as state machines
SLIDE 52
Drivers as State Machines
send complete
SLIDE 53 Drivers as State Machines
– Advance driver from initial state to state at time of crash
– Reply to requests with valid responses according to driver state
close config
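One way to picture the state-machine model is a transition table keyed by (state, request). The sketch below is illustrative only: the state names and the `TRANSITIONS` table are assumptions made for this example, loosely based on the request names that appear on these slides (config, send, complete, close). Replaying the requests seen before a crash advances a fresh driver from its initial state to the state it was in when it failed.

```python
# Hypothetical transition table: (current state, request) -> next state
TRANSITIONS = {
    ("closed", "open"): "opened",
    ("opened", "config"): "opened",   # changes settings, not the state name
    ("opened", "send"): "sending",
    ("sending", "complete"): "opened",
    ("opened", "close"): "closed",
}


def advance(state, requests):
    """Apply requests in order and return the resulting state."""
    for req in requests:
        key = (state, req)
        if key not in TRANSITIONS:
            raise ValueError(f"request {req!r} is not valid in state {state!r}")
        state = TRANSITIONS[key]
    return state


# Requests observed before the (simulated) crash
history = ["open", "config", "send", "complete", "send"]
state_at_crash = advance("closed", history)
```

The same table also tells a recovery layer which requests are valid in the current state, which is what lets it reply sensibly while the real driver is down.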
SLIDE 54 Shadow Drivers
– Normally:
- Records state-changing inputs
– On failure:
- Restarts driver
- Replays inputs to recover
- Emulates driver to applications/OS
One shadow driver handles recovery for an entire class of drivers
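The record and replay behavior listed on this slide can be sketched as follows. This is a simplified model, not the authors' implementation: `FlakyDriver`, `ShadowDriver`, and their methods are hypothetical, the driver state is reduced to a dictionary of settings, and a failure is injected by setting a flag.

```python
class FlakyDriver:
    """Toy driver whose only state is a dict of configuration settings."""

    def __init__(self):
        self.settings = {}
        self.failed = False

    def config(self, key, value):
        if self.failed:
            raise RuntimeError("driver has failed")
        self.settings[key] = value

    def write(self, data):
        if self.failed:
            raise RuntimeError("driver has failed")
        return len(data)


class ShadowDriver:
    """Sits between callers and the driver: records state-changing inputs,
    and on failure restarts the driver and replays them."""

    def __init__(self, driver_factory):
        self.driver_factory = driver_factory
        self.driver = driver_factory()
        self.log = []  # state-changing inputs only

    def config(self, key, value):
        self.log.append((key, value))              # record during normal operation
        self._call(lambda d: d.config(key, value))

    def write(self, data):
        # Not logged: in this sketch a write does not change restorable state.
        return self._call(lambda d: d.write(data))

    def _call(self, op):
        try:
            return op(self.driver)
        except RuntimeError:
            self._recover()
            return op(self.driver)                 # retry once after recovery

    def _recover(self):
        self.driver = self.driver_factory()        # restart the driver
        for key, value in self.log:                # replay logged inputs
            self.driver.config(key, value)


shadow = ShadowDriver(FlakyDriver)
shadow.config("volume", 7)
shadow.config("rate", 44100)

shadow.driver.failed = True             # inject a failure
written = shadow.write(b"abcd")         # caller still gets a normal result
restored = dict(shadow.driver.settings)
```

The caller of `write` never sees the failure, and the restarted driver ends up with the same settings as the one that failed.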
SLIDE 55 write(…) write(…)
Shadow Driver Overview
Kernel Device Driver Tap Shadow Driver
write(…)
SLIDE 56 Preparing for Recovery
Kernel Device Driver Shadow Driver
config(…) config(…) config(…) config …
Tap
SLIDE 57 Recovering a Failed Driver
[Diagram: Kernel, Tap, Shadow Driver, Device Driver. Calls shown: register(…), init(…), connect, config …]
SLIDE 58 Recovering a Failed Driver
– Reset driver
– Reinitialize driver
– Replay logged requests
SLIDE 59 Spoofing a Failed Driver
Kernel Device Driver Shadow Driver Tap
write(…) write(…) return return
SLIDE 60 Spoofing a Failed Driver
Shadow acts as driver -- replies to requests with valid possible responses
– Applications and OS unaware that driver failed
– No device control
General Strategies:
- 1. Answer request from log
- 2. Act busy
- 3. Block caller
- 4. Queue request
- 5. Drop request
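A small dispatcher can illustrate how a shadow might choose among these strategies while the real driver is being recovered. This is a hedged sketch: the class `SpoofingShadow`, the request names, the `"EBUSY"` return value, and the mapping from request to strategy are all assumptions made for this example. Strategy 3 (block caller) is omitted because it needs threads to demonstrate.

```python
from collections import deque


class SpoofingShadow:
    """Answers requests on behalf of a failed driver (toy model)."""

    def __init__(self, logged_answers):
        self.logged_answers = dict(logged_answers)  # replies captured earlier
        self.queued = deque()                       # requests to reissue later

    def handle(self, request, arg=None):
        if request in self.logged_answers:          # 1. answer request from log
            return self.logged_answers[request]
        if request == "poll":                       # 2. act busy
            return "EBUSY"
        if request == "write":                      # 4. queue request
            self.queued.append(arg)
            return len(arg)
        return None                                 # 5. drop request


spoof = SpoofingShadow({"get_mac": "00:11:22:33:44:55"})
mac = spoof.handle("get_mac")
busy = spoof.handle("poll")
accepted = spoof.handle("write", b"abc")
dropped = spoof.handle("unknown_ioctl")
```

Once recovery completes, the queued writes would be handed to the restarted driver; that step is not shown here.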
SLIDE 61
Completing Recovery
Kernel Shadow Driver Tap Device Driver Tap Tap
SLIDE 62 Design Summary
– Lightweight Kernel Protection Domains
– eXtension Procedure Call (XPC)
– Object Table
– Wrappers
– Shadow Drivers
SLIDE 63 Outline
- Introduction
- Problem
- Design
- Evaluation
– Implementation
– Benefit
– Cost
SLIDE 64
Drivers Tested
Class: Drivers
– IDE Storage: ide-disk, ide-cd
– Network: Intel Pro/1000 Gigabit Ethernet, AMD PCnet32, Intel Pro/100 10/100, 3Com 3c59x 10/100, SMC Etherpower 100
– Sound: Soundblaster Audigy, Soundblaster 16, Soundblaster Live!, Intel 810 Audio, Ensoniq 1371, Crystal Sound 4232
SLIDE 65 Implementation Complexity
– Kernel: 924 out of 1.1 million lines
– Device drivers: 0 out of 50,000 lines
SLIDE 66 Implementation Complexity
– Isolation: 23,000 lines
– Recovery: 3,300 lines
Driver Class   Shadow Driver L.O.C.   1 Device Driver L.O.C.   All Drivers Count   All Drivers L.O.C.
Storage        321                    5,358                    8                   29,000
Network        198                    13,577                   190                 264,500
Sound          666                    7,381                    48                  118,981
SLIDE 68 Outline
- Introduction
- Problem
- Design
- Evaluation
– Implementation
– Benefit
– Cost
SLIDE 69
Reliability Test Methodology
Test Inject bugs Reboot Load driver Nothing Failure Recovery
SLIDE 70
Recovery Works
Sound Net Storage
SLIDE 71
Relative Performance
Sound Net Storage
SLIDE 72
CPU Usage
Sound Net Storage