Improving the Reliability of Commodity Operating Systems


slide-1
SLIDE 1

Improving the Reliability of Commodity Operating Systems

Mike Swift, Brian Bershad, Hank Levy University of Washington

Slides courtesy of Michael Swift University of Wisconsin-Madison

slide-2
SLIDE 2

Outline

  • Introduction
  • Vision
  • Design
  • Evaluation
  • Summary
slide-3
SLIDE 3

The Problem

  • Operating system crashes are a huge problem today
      – 5% of Windows systems crash every day
  • Device drivers are the biggest cause of crashes
      – Drivers cause 85% of Windows XP crashes
      – Drivers are 7 times buggier than the kernel in Linux
  • We built Nooks, a system that prevents drivers from crashing the OS
      – We can prevent 99% of faults in our tests that crash native Linux

slide-4
SLIDE 4

Crashes Today

[Diagram: User Program, Kernel, Driver, User Program]

slide-7
SLIDE 7

Outline

  • Introduction
  • Vision
  • Design
  • Evaluation
  • Summary
slide-8
SLIDE 8

Vision

[Diagram: User Program, Kernel, Driver, User Program]

slide-10
SLIDE 10

Reality

  • Windows XP:
      – 113 million copies sold in 2002
      – 40 million lines of code
      – $1 billion development cost
      – 35,000 drivers available
  • Linux:
      – 18 million users
      – 30 million lines of code
      – Equivalent $1 billion development cost

slide-11
SLIDE 11

Vision Requirements

  • 1. Isolation
  • 2. Recovery
  • 3. Compatibility
  • No code changes
  • No new languages
  • No new OS
  • No new hardware
  • No new perspective
slide-12
SLIDE 12

Outline

  • Introduction
  • Vision
  • Design
  • Evaluation
  • Summary
slide-13
SLIDE 13

Assumptions and Principles

  • Assumptions:
      – Drivers are generally well behaved
      – Don’t need to prevent every crash to be useful
  • Principles:
      – Design for fault resistance (not fault tolerance)
      – Design for mistakes (not abuse)

slide-14
SLIDE 14

Goal

We want a practical, “best-effort” solution

  • Prevents many crashes
  • Good performance
  • Works with today’s operating systems and drivers

slide-15
SLIDE 15

Design of Nooks

  • Standard Linux kernel and drivers
  • Plus:
      – Isolation
      – Recovery

  • Compatible with existing code
slide-16
SLIDE 16

Existing Kernels

[Diagram: User Program, Kernel, Driver, User Program]

slide-17
SLIDE 17

Isolation - Memory

[Diagram: User Program, Kernel, Driver, User Program, Stack, Heap]

Lightweight Kernel Protection Domains

slide-19
SLIDE 19

Isolation - Control Transfer

[Diagram: User Program, Kernel, Driver, User Program, XPC, XPC]

XPC: eXtension Procedure Call

slide-21
SLIDE 21

Isolation - Data Access

[Diagram: User Program, Kernel, Driver, User Program]

Copy-in / Copy-out

slide-23
SLIDE 23

Isolation - Interposition

[Diagram: User Program, Kernel, Driver, User Program, XPC, XPC]

Wrappers

slide-24
SLIDE 24

Design Summary

  • Isolation
      – Lightweight Kernel Protection Domains
      – eXtension Procedure Call (XPC)
      – Copy-in/Copy-out
      – Wrappers
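
The isolation pieces above can be pictured with a small user-space model. This is only an analogy, not Nooks code: the real system enforces protection in the kernel, whereas here a Python exception stands in for a detected fault. The names `ProtectionDomain`, `xpc`, and `wrap` are hypothetical and chosen for illustration.

```python
import copy

class ProtectionDomain:
    """Toy stand-in for a lightweight kernel protection domain."""
    def __init__(self, name):
        self.name = name
        self.xpc_count = 0  # number of cross-domain calls into this domain

def xpc(domain, func, *args):
    """Hypothetical XPC: count the crossing, then run func 'inside' the domain."""
    domain.xpc_count += 1
    return func(*args)

def wrap(domain, driver_func):
    """Wrapper: copy kernel data in, call the driver via XPC, copy results out."""
    def wrapper(kernel_obj):
        private = copy.deepcopy(kernel_obj)   # copy-in: driver sees a private copy
        try:
            result = xpc(domain, driver_func, private)
        except Exception:
            return None                       # fault: kernel_obj is left untouched
        kernel_obj.clear()
        kernel_obj.update(private)            # copy-out: publish the driver's changes
        return result
    return wrapper

def good_driver(pkt):
    pkt["sent"] = True
    return 0

def buggy_driver(pkt):
    pkt["len"] = -1                           # corrupts only its private copy
    raise RuntimeError("driver fault")

domain = ProtectionDomain("e1000")
packet = {"len": 64, "sent": False}
wrap(domain, good_driver)(packet)             # changes are copied out
wrap(domain, buggy_driver)(packet)            # corruption never reaches packet
```

The wrapper is the interposition point. The kernel calls `wrapper`, never `driver_func` directly, so every crossing is counted and every shared object is copied.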

slide-25
SLIDE 25

Recovery - Fault Detection

[Diagram: User Program, Kernel, Driver, User Program, Processor, Recovery]

slide-27
SLIDE 27

Recovery - Fault Detection

[Diagram: User Program, Kernel, Driver, User Program, Recovery, Detector]

slide-28
SLIDE 28

Recovery

[Diagram: User Program, Kernel, Driver, User Program, Recovery, STOP]

Stop

slide-29
SLIDE 29

Recovery

[Diagram: User Program, Kernel, User Program, Recovery]

Stop / Unload

slide-30
SLIDE 30

Recovery

[Diagram: User Program, Kernel, Driver, User Program, Recovery, GO]

Stop / Unload / Reload

slide-31
SLIDE 31

Design Summary

  • Isolation
      – Lightweight Kernel Protection Domains
      – eXtension Procedure Call (XPC)
      – Copy-in/Copy-out
      – Wrappers
  • Recovery
      – Hardware and software checks
      – Stop / Unload and GC / Reload
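
The recovery sequence (detect a fault, then stop, unload and garbage-collect, reload) can be sketched as a small user-space model. Everything here is hypothetical: `DriverFault` stands in for a hardware or software check firing, and `RecoveryManager` is an illustrative name, not part of Nooks.

```python
class DriverFault(Exception):
    """Stands in for a hardware or software check firing inside a driver."""

class RecoveryManager:
    def __init__(self, load_driver):
        self.load_driver = load_driver   # factory that returns a fresh driver
        self.driver = load_driver()
        self.log = []                    # recovery steps taken, for inspection

    def call(self, request):
        try:
            return self.driver(request)
        except DriverFault:
            self.recover()
            return None                  # the faulting request itself is lost

    def recover(self):
        self.log.append("stop")          # stop using the failed driver
        self.driver = None
        self.log.append("unload+gc")     # drop the instance and whatever it held
        self.driver = self.load_driver()
        self.log.append("reload")        # start a fresh instance

def make_flaky_loader():
    """Loader whose first driver instance faults on the request 'bad'."""
    loads = {"n": 0}
    def load():
        loads["n"] += 1
        first = loads["n"] == 1
        def driver(request):
            if first and request == "bad":
                raise DriverFault(request)
            return "ok:" + request
        return driver
    return load

mgr = RecoveryManager(make_flaky_loader())
```

In this sketch the caller sees `None` for the request that triggered the fault and normal replies afterwards. Recovery at this level restores service but does not hide the failure from the caller.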

slide-32
SLIDE 32

Some Limitations

  • Blame the processor
  • Blame the operating system
  • Blame us
slide-33
SLIDE 33

Outline

  • Vision
  • Design
  • Evaluation
      – Reliability
      – Performance
      – Implementation Cost
  • Summary
slide-34
SLIDE 34

Tested Drivers

  • Sound card drivers
      – SoundBlaster 16 (sb)
      – Ensoniq 1371
  • Network drivers
      – Intel Pro/1000 Gigabit Ethernet (e1000)
      – AMD PCnet32 10/100 Mb Ethernet (pcnet32)
      – 3Com 3c90x 10/100 Mb Ethernet
      – 3Com 3c59x 10/100 Mb Ethernet
  • Filesystems
      – VFAT Windows-compatible filesystem (vfat)
  • Other
      – kHTTPd in-kernel web server (khttpd)

slide-36
SLIDE 36

Reliability Test Methodology

[Flowchart: Test, Inject bugs, Reboot, Load driver, Nothing, Failure, Recovery]

slide-42
SLIDE 42

Nooks Stops Crashes

[Bar chart: number of crashes (y-axis, 0 to 200) per extension (pcnet32, e1000, sb, kHTTPd, VFAT), with series No Nooks and Nooks; data labels as extracted: 119, 52, 10, 1, 175, 2, 10, 2]

slide-43
SLIDE 43

Performance

  • Dominant cost is XPC

– Performance depends on the frequency of interaction with the kernel

slide-47
SLIDE 47

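  • Performance relative to native Linux

[Bar chart: relative performance (y-axis, 0 to 1) by workload (x-axis); bar heights are not recoverable from the text. Extension and XPC rate per workload:]

      – Play MP3 (sb): 150 XPC/sec
      – Receive Stream (e1000): 8,923 XPC/sec
      – Send Stream (e1000): 60,352 XPC/sec
      – Apache SpecWeb (e1000): 1,960 XPC/sec
      – Compile Local (VFAT): 22,653 XPC/sec
      – Simple Web (kHTTPd): 61,183 XPC/sec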

slide-48
SLIDE 48

Implementation Cost

  • Changes to old code
      – Kernel: 924 out of 1.1 million lines
      – Device drivers + VFAT: 0 out of 33,000 lines
      – kHTTPd: 13 out of 2,000 lines
  • New code
      – Nooks reliability layer: 22,266 lines

slide-49
SLIDE 49

Summary

  • Nooks provides a new reliability layer between drivers and the OS
  • Nooks prevents 99% of tested faults that cause Linux to crash
  • Nooks imposes a modest performance cost

slide-50
SLIDE 50

Why didn’t we use a microkernel?

  • Doesn’t address our limitations
      – Isolation not much better
      – Fault detection not much better
      – Recovery not much better
      – Doesn’t improve performance

  • Requires more changes to the kernel
  • Makes compatibility more difficult
slide-51
SLIDE 51

Recovery

  • Goals:
      – Restore driver state so it can process requests as if it had never failed
      – Conceal failure from applications
  • Observation:
      – Driver interface specifies how driver responds to requests

  • Approach: Model drivers as state machines
slide-52
SLIDE 52

Drivers as State Machines

[State machine diagram; labels: send, complete]

slide-53
SLIDE 53

Drivers as State Machines

  • Recovery:
      – Advance driver from initial state to state at time of crash
      – Reply to requests with valid responses according to driver state

[State machine diagram; labels: open, close, config]
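
A minimal sketch of a driver modeled as a state machine follows. The request names (open, config, send, complete, close) come from the diagram labels on these slides. The state names and the transition table are assumptions made for illustration, not the actual driver interface.

```python
# Hypothetical transition table: (state, request) -> next state.
TRANSITIONS = {
    ("closed", "open"): "opened",
    ("opened", "config"): "opened",
    ("opened", "send"): "sending",
    ("sending", "complete"): "opened",
    ("opened", "close"): "closed",
}

def step(state, request):
    """Return the next state, or raise if the request is invalid in this state."""
    try:
        return TRANSITIONS[(state, request)]
    except KeyError:
        raise ValueError(f"{request!r} is not valid in state {state!r}")

def advance(requests, state="closed"):
    """Advance from the initial state through a sequence of requests."""
    for request in requests:
        state = step(state, request)
    return state

# State a restarted driver must be brought back to after this history:
state_at_crash = advance(["open", "config", "send"])
```

The same table answers both recovery questions: `advance` says which state to restore, and `step` says which requests would be valid in that state.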

slide-54
SLIDE 54

Shadow Drivers

  • Generic code that:
      – Normally:
          • Records state-changing inputs
      – On failure:
          • Restarts driver
          • Replays inputs to recover
          • Emulates driver to applications/OS

One shadow driver handles recovery for an entire class of drivers
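
The record-and-replay idea can be sketched as follows. This is a user-space toy under stated assumptions: `Driver` is a made-up driver whose only state is its configuration, and the rule that only `config` calls change state is chosen for the example.

```python
class Driver:
    """Toy driver whose only state is its configuration (hypothetical)."""
    def __init__(self):
        self.settings = {}

    def config(self, key, value):
        self.settings[key] = value

    def write(self, data):
        return len(data)

class ShadowDriver:
    """Generic recorder and replayer; STATE_CHANGING is an assumption."""
    STATE_CHANGING = {"config"}

    def __init__(self, make_driver):
        self.make_driver = make_driver
        self.driver = make_driver()
        self.log = []

    def call(self, name, *args):
        """Tap: pass the call through, logging it if it changes driver state."""
        if name in self.STATE_CHANGING:
            self.log.append((name, args))
        return getattr(self.driver, name)(*args)

    def recover(self):
        """Restart the driver, then replay the logged inputs in order."""
        self.driver = self.make_driver()
        for name, args in self.log:
            getattr(self.driver, name)(*args)

shadow = ShadowDriver(Driver)
shadow.call("config", "mtu", 1500)
shadow.call("write", b"hello")   # not logged: it does not change driver state
failed = shadow.driver
shadow.recover()                 # fresh instance with the same configuration
```

Nothing in `ShadowDriver` refers to a specific driver, only to the interface names it logs. That is the sense in which one shadow can serve a whole class.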

slide-55
SLIDE 55

Shadow Driver Overview

[Diagram: Kernel, Device Driver, Tap, Shadow Driver; write(…) calls]

slide-56
SLIDE 56

Preparing for Recovery

[Diagram: Kernel, Device Driver, Shadow Driver, Tap; config(…) calls]

slide-57
SLIDE 57

Recovering a Failed Driver

[Diagram: Kernel, Shadow Driver, Device Driver, Tap; calls: register(…), init(…), connect, config …]

slide-58
SLIDE 58

Recovering a Failed Driver

  • Summary:
      – Reset driver
      – Reinitialize driver
      – Replay logged requests

slide-59
SLIDE 59

Spoofing a Failed Driver

[Diagram: Kernel, Device Driver, Shadow Driver, Tap; write(…) calls and returns]

slide-60
SLIDE 60

Spoofing a Failed Driver

Shadow acts as driver -- replies to requests with valid possible responses

– Applications and OS unaware that driver failed – No device control

General Strategies:

  • 1. Answer request from log
  • 2. Act busy
  • 3. Block caller
  • 4. Queue request
  • 5. Drop request
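
The five strategies can be sketched as a dispatch table. The mapping from request type to strategy in `POLICY`, the request names, and the return values are all hypothetical. The slide names only the strategies, not which requests get which.

```python
from enum import Enum

class Strategy(Enum):
    ANSWER_FROM_LOG = 1
    ACT_BUSY = 2
    BLOCK_CALLER = 3
    QUEUE_REQUEST = 4
    DROP_REQUEST = 5

# Hypothetical per-request policy, for illustration only.
POLICY = {
    "get_config": Strategy.ANSWER_FROM_LOG,
    "ioctl": Strategy.ACT_BUSY,
    "write": Strategy.QUEUE_REQUEST,
    "send": Strategy.DROP_REQUEST,
}

class SpoofingShadow:
    """Answers requests on behalf of a failed driver while it recovers."""
    def __init__(self, logged_config):
        self.logged_config = logged_config
        self.queue = []   # requests to hand to the driver once it is back

    def handle(self, request, payload=None):
        strategy = POLICY.get(request, Strategy.BLOCK_CALLER)
        if strategy is Strategy.ANSWER_FROM_LOG:
            return self.logged_config
        if strategy is Strategy.ACT_BUSY:
            return "EBUSY"
        if strategy is Strategy.QUEUE_REQUEST:
            self.queue.append((request, payload))
            return "queued"
        if strategy is Strategy.DROP_REQUEST:
            return "dropped"
        return "blocked"  # a real implementation would suspend the caller here

spoof = SpoofingShadow({"mtu": 1500})
```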
slide-61
SLIDE 61

Completing Recovery

[Diagram: Kernel, Shadow Driver, Device Driver, Tap (three positions)]

slide-62
SLIDE 62

Design Summary

  • Isolation
      – Lightweight Kernel Protection Domains
      – eXtension Procedure Call (XPC)
      – Object Table
      – Wrappers
  • Recovery
      – Shadow Drivers

slide-63
SLIDE 63

Outline

  • Introduction
  • Problem
  • Design
  • Evaluation
      – Implementation
      – Benefit
      – Cost
  • Summary and Future Work
slide-64
SLIDE 64

Drivers Tested

Drivers by class:

  • Sound: Soundblaster Audigy, Soundblaster 16, Soundblaster Live!, Intel 810 Audio, Ensoniq 1371, Crystal Sound 4232
  • Network: Intel Pro/1000 Gigabit Ethernet, AMD PCnet32, Intel Pro/100 10/100, 3Com 3c59x 10/100, SMC Etherpower 100
  • IDE Storage: ide-disk, ide-cd

slide-65
SLIDE 65

Implementation Complexity

  • Changes to existing code

      – Kernel: 924 out of 1.1 million lines
      – Device drivers: 0 out of 50,000 lines

slide-66
SLIDE 66

Implementation Complexity

Driver Class   Shadow Driver L.O.C.   1 Device Driver L.O.C.   All Drivers Count   All Drivers L.O.C.
Sound          666                    7,381                    48                  118,981
Network        198                    13,577                   190                 264,500
Storage        321                    5,358                    8                   29,000

  • New code
      – Isolation: 23,000 lines
      – Recovery: 3,300 lines
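
As a rough worked example, the figures on this slide let us compare each shadow driver's size with the driver code in its class. This assumes the extracted numbers are listed in the order Storage, Network, Sound, matching the class labels; `shadow_share` is a helper name introduced here.

```python
# (shadow driver LOC, all drivers LOC, all drivers count), taken from the slide;
# the assignment of numbers to classes is an assumption about the slide layout.
CLASSES = {
    "Storage": (321, 29_000, 8),
    "Network": (198, 264_500, 190),
    "Sound": (666, 118_981, 48),
}

def shadow_share(shadow_loc, class_loc):
    """Shadow driver size as a percentage of the driver code in its class."""
    return 100.0 * shadow_loc / class_loc

for name, (shadow_loc, class_loc, count) in CLASSES.items():
    share = shadow_share(shadow_loc, class_loc)
    print(f"{name}: {share:.2f}% of class code, {count} drivers covered")
```

Under that reading, each shadow driver is on the order of 1% or less of the code it protects.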

slide-68
SLIDE 68

Outline

  • Introduction
  • Problem
  • Design
  • Evaluation
      – Implementation
      – Benefit
          • Isolation
          • Recovery
      – Cost
  • Summary and Future Work
slide-69
SLIDE 69

Reliability Test Methodology

[Flowchart: Test, Inject bugs, Reboot, Load driver, Nothing, Failure, Recovery]

slide-70
SLIDE 70

Recovery Works

[Chart by driver class: Sound, Net, Storage; values not recoverable from the text]

slide-71
SLIDE 71

Relative Performance

[Chart by driver class: Sound, Net, Storage; values not recoverable from the text]

slide-72
SLIDE 72

CPU Usage

[Chart by driver class: Sound, Net, Storage; values not recoverable from the text]