Theseus: A State Spill-free Operating System

Kevin Boos, Lin Zhong. Rice Efficient Computing Group, Rice University. PLOS 2017, October 28.


slide-1
SLIDE 1

PLOS 2017 October 28

Kevin Boos Lin Zhong Rice Efficient Computing Group Rice University

Theseus: A State Spill-free Operating System

slide-2
SLIDE 2

Problems with Today’s OSes

  • Modern single-node OSes are vast and very complex
  • Results in an entangled web of components
  • Nigh impossible to decouple
  • Difficult to maintain, evolve, update safely, and run reliably

slide-3
SLIDE 3

Easy Evolution is Crucial

  • Computer hardware must endure longer upgrade cycles [1]
  • Exacerbated by the (economic) decline of Moore’s Law
  • Evolutionary advancements occur mostly in software
  • Extreme example: DARPA’s challenge for systems to “remain robust and functional in excess of 100 years” [2]

[1] The PC upgrade cycle slows to every five to six years, Intel’s CEO says. PCWorld article.
[2] DARPA seeks to create software systems that could last 100 years. https://www.darpa.mil/news-events/2015-04-08.

slide-4
SLIDE 4

What do we need?

Easier evolution by reducing complexity without reducing size (and features).

We need a disentangled OS that:

  • allows every component to evolve independently at runtime
  • prevents failures in one component from jeopardizing others

slide-5
SLIDE 5

“But surely existing systems have solved this already?”

– the astute audience member

slide-6
SLIDE 6

Existing attempts to decouple systems

  • 1. Traditional modularization
  • 2. Encapsulation-based
  • 3. Privilege-level separation
  • 4. Hardware-driven


slide-7
SLIDE 7

Existing attempts to decouple systems

  • 1. Traditional modularization
  • 2. Encapsulation-based
  • 3. Privilege-level separation
  • 4. Hardware-driven

1. Traditional modularization:

  • Decompose a large monolithic system into smaller entities of related functionality (separation of concerns)
  • Achieves some goals: code reuse
  • Often causes tight coupling, which inhibits other goals
  • Evidence: Linux is highly modular, but requires substantial effort to realize live update [3, 4, 5, 6] and fault isolation [7, 8, 9]

[3] J. Arnold and M. F. Kaashoek. Ksplice: Automatic rebootless kernel updates. EuroSys, 2009.
[4] M. Siniavine and A. Goel. Seamless kernel updates. DSN, 2013.
[5] G. Altekar, I. Bagrak, P. Burstein, and A. Schultz. OPUS: Online patches and updates for security. USENIX Security, 2005.
[6] K. Makris and K. D. Ryu. Dynamic and adaptive updates of non-quiescent subsystems in commodity OS. EuroSys, 2007.
[7] M. M. Swift, M. Annamalai, B. N. Bershad, and H. M. Levy. Recovering device drivers. OSDI, 2004.
[8] M. M. Swift, B. N. Bershad, and H. M. Levy. Improving the reliability of commodity operating systems. SOSP, 2003.
[9] C. Jacobsen, et al. Lightweight capability domains: Towards decomposing the Linux kernel. SIGOPS Oper. Syst. Rev., 2016.


slide-8
SLIDE 8

Existing attempts to decouple systems

  • 1. Traditional modularization
  • 2. Encapsulation-based
  • 3. Privilege-level separation
  • 4. Hardware-driven

2. Encapsulation-based:

  • Group related code and data together into a single entity
  • Strict boundaries between entities, e.g., classes in OOP
  • Achieves better maintainability and adaptability
  • Similar problems as traditional modularization, i.e., inextricably coupled entities that are difficult to interchange [10, 11]

[10] C. A. Soules, et al. System support for online reconfiguration. USENIX ATC, 2003.
[11] F. M. David, E. M. Chan, J. C. Carlyle, and R. H. Campbell. CuriOS: Improving reliability through operating system structure. OSDI, 2008.

slide-9
SLIDE 9

Existing attempts to decouple systems

  • 1. Traditional modularization
  • 2. Encapsulation-based
  • 3. Privilege-level separation
  • 4. Hardware-driven

3. Privilege-level separation:

  • Aims to decouple entities by forcing them into separate domains with boundaries based on privilege levels
  • Microkernels, virtual machines
  • Achieves fault isolation
  • Coarse spatial granularity [12]
  • Evolution remains difficult because microkernel userspace servers must still closely collaborate [13]

[12] J. N. Herder, H. Bos, B. Gras, P. Homburg, and A. S. Tanenbaum. MINIX 3: A highly reliable, self-repairing operating system. ACM OS Review, 2006.
[13] C. Giuffrida, A. Kuijsten, and A. S. Tanenbaum. Safe and automatic live update for operating systems. ASPLOS, 2013.

slide-10
SLIDE 10

Existing attempts to decouple systems

  • 1. Traditional modularization
  • 2. Encapsulation-based
  • 3. Privilege-level separation
  • 4. Hardware-driven

4. Hardware-driven:

  • Choose entity bounds based on the underlying hardware architecture (cores, coherence domains)
  • Barrelfish [14], Helios [15], fos [16], K2 [17]
  • Achieves scalable and energy-efficient performance
  • Does not facilitate evolution, runtime flexibility, or fault isolation

[14] A. Baumann, et al. The multikernel: A new OS architecture for scalable multicore systems. SOSP, 2009.
[15] E. B. Nightingale, et al. Helios: Heterogeneous multiprocessing with satellite kernels. SOSP, 2009.
[16] D. Wentzlaff, et al. An operating system for multicore and clouds. SoCC, 2010.
[17] Felix Lin, et al. K2: a mobile OS for heterogeneous coherence domains. ASPLOS, 2014.

slide-11
SLIDE 11

Key Insight: state spill is the root cause of entanglement within OSes

Overlooked by existing decoupling strategies

slide-12
SLIDE 12

What is state spill?

  • When one software entity’s state undergoes a lasting change as a result of handling an interaction with another entity.
  • Prevalent and deeply ingrained in modern system software
  • Causes entanglement
  • Individual entities cannot be easily interchanged
  • Multiple entities share fate
  • Hinders other goals as well

[EuroSys’17] Kevin Boos, et al. A Characterization of State Spill in Modern Operating Systems. EuroSys, 2017.
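
The definition above can be made concrete with a small sketch. This is only an illustration under assumed names (`Server`, `config`, `describe` are hypothetical and not Theseus code): one call leaves a lasting change in the server, the other does not.

```rust
use std::collections::HashMap;

/// A toy "server" entity. All names here are illustrative assumptions.
struct Server {
    // State that outlives any single interaction.
    sessions: HashMap<u32, String>,
}

impl Server {
    /// Handling this call spills state: the map still holds the client's
    /// configuration after the call returns.
    fn config(&mut self, client_id: u32, cfg: &str) {
        self.sessions.insert(client_id, cfg.to_string());
    }

    /// No spill: the result depends only on the argument, and the
    /// server is unchanged afterwards.
    fn describe(&self, cfg: &str) -> String {
        format!("configured with {}", cfg)
    }
}

fn main() {
    let mut server = Server { sessions: HashMap::new() };
    let before = server.sessions.len();

    let _ = server.describe("fast");
    assert_eq!(server.sessions.len(), before); // no lasting change

    server.config(7, "fast");
    assert_eq!(server.sessions.len(), before + 1); // lasting change: state spill
}
```

After `config`, the server can no longer be swapped out without also carrying over what client 7 told it, which is the entanglement the slide describes.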

slide-13
SLIDE 13

“Why is state spill a useful concept?”

– the skeptical audience member

slide-14
SLIDE 14

Modern OSes under the light of state spill

[Figure: module structure of (a) Monolithic Kernel, (b) Microkernel OS, (c) Theseus Kernel. Legend: module, submodule, entanglement via state spill. Labeled components: nano-core, task mgmt, kernel console, input event mux, keyboard indirection layer, mouse indirection layer, VGA indirection layer, graphics mux, scheduler (CFQ policy, FCFS policy, RR policy), filesystem, PIC IRQ, PIT clock IRQ, event dispatcher, syscall dispatcher, syscall indirection layer, heap allocator, frame allocator, stack allocator, userspace processes.]

  • A web of interacting modules that spill state into each other
  • Larger modules contain submodules that cannot be managed independently
slide-15
SLIDE 15

Theseus: a Disentangled OS

[Figure: module structure of (a) Monolithic Kernel, (b) Microkernel OS, (c) Theseus Kernel. Legend: module, submodule, entanglement via state spill.]

  • Implemented from scratch using Rust
  • Runtime composable
  • Inspired by distributed computing

slide-16
SLIDE 16

Our Namesake: Ship of Theseus

The ship wherein Theseus and the youth of Athens returned from Crete had thirty oars, and was preserved by the Athenians down even to the time of Demetrius Phalereus, for they took away the old planks as they decayed, putting in new and stronger timber in their places, insomuch that this ship became a standing example among the philosophers, for the logical question of things that grow; one side holding that the ship remained the same, and the other contending that it was not the same.
 – Plutarch (Theseus)

TODO: picture of ship

slide-17
SLIDE 17

Theseus Directives

Primary Directive: no state spill

  • eliminate state spill above all else (e.g., above performance and ease of programming)

Secondary Directive: elementary modules

  • no submodules; modules as small as possible

slide-18
SLIDE 18

Design Principles

slide-19
SLIDE 19

Design Principles

Decoupling entities based on state spill (pairwise)

  • 1. No Encapsulation
  • 2. Stateless Communication


Composing a Disentangled OS (multi-entity)

  • 3. Universal, connectionless interfaces
  • 4. Generic pattern reuse

slide-20
SLIDE 20

P1: No Encapsulation

  • Encapsulation: code & data are bundled together
  • Direct cause of state spill
  • Instead, eschew encapsulation
  • Client should maintain its own progress with the server
  • Must preserve information hiding

[Figure: Standard Encapsulation vs. Opaque Exportation + Stateless Communication. Each panel tracks client state and server state from start to end across the calls config(c), func1(), and func2(r1), with return values void, r1, r2 and state items c, s1, s2.]
slide-21
SLIDE 21

P2: Stateless Communication

  • An interaction from client to server must be self-sufficient, containing everything that the server requires to handle it
  • No assumption of prior interactions or state
  • Implications:
      ○ Server entity is effectively stateless
      ○ No hidden global dependencies

[Figure: calls fn0(), fn1(), fn2(); one panel is labeled “stateful”.]
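
A minimal sketch of a self-sufficient interaction, assuming a hypothetical read service (`ReadRequest` and `handle_read` are invented for illustration, not taken from Theseus). The request carries the data, offset, and length, so the server needs no memory of earlier calls; the client tracks its own progress.

```rust
/// A self-sufficient request: it names the data, the offset, and the
/// length. (Hypothetical type for illustration.)
struct ReadRequest<'a> {
    data: &'a [u8],
    offset: usize,
    len: usize,
}

/// The server is a pure function of the request; it keeps no cursor.
fn handle_read(req: &ReadRequest) -> Vec<u8> {
    let start = req.offset.min(req.data.len());
    let end = start.saturating_add(req.len).min(req.data.len());
    req.data[start..end].to_vec()
}

fn main() {
    let file = b"hello world";
    // The client, not the server, remembers how far it has read.
    let mut offset = 0;
    let first = handle_read(&ReadRequest { data: file, offset, len: 5 });
    offset += first.len();
    let second = handle_read(&ReadRequest { data: file, offset, len: 6 });
    assert_eq!(first, b"hello");
    assert_eq!(second, b" world");
}
```

Because each call stands alone, a replacement `handle_read` could be swapped in between the two calls without the client noticing.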

slide-22
SLIDE 22

Select Design and Implementation Decisions

slide-23
SLIDE 23

State Management via Opaque Exportation

  • P1 + P2 → opaque exportation
  • Server returns sealed representation of progress to client
  • Preserves information hiding
      ○ Client cannot inspect/modify state
  • Allows stateless communication
  • Handled transparently by compiler
  • Leverage Rust’s affine type system to avoid high overhead

[Figure: Standard Encapsulation vs. Opaque Exportation + Stateless Communication. Each panel tracks client state and server state from start to end across the calls config(c), func1(), and func2(r1), with return values void, r1, r2 and state items c, s1, s2.]
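
One way to picture opaque exportation in plain Rust is a token type with private fields that is moved between client and server. This is a hand-written sketch under assumed names (`server`, `Progress`, `config`, `step`); the slide says Theseus has the compiler handle this transparently, which this example does not attempt.

```rust
mod server {
    /// Sealed progress token. Its field is private to this module, so a
    /// client can hold and return it but cannot inspect or modify it.
    pub struct Progress {
        steps_done: u32,
    }

    /// Start an interaction: the server hands all progress to the client.
    pub fn config() -> Progress {
        Progress { steps_done: 0 }
    }

    /// Each call consumes the old token and returns a new one, so the
    /// server itself retains nothing between calls.
    pub fn step(p: Progress) -> (Progress, u32) {
        let done = p.steps_done + 1;
        (Progress { steps_done: done }, done * 10)
    }
}

fn main() {
    let p0 = server::config();
    let (p1, r1) = server::step(p0);
    // `p0` has been moved; using it again would be a compile-time error.
    let (_p2, r2) = server::step(p1);
    assert_eq!((r1, r2), (10, 20));
}
```

Moving the token by value is what the affine type system buys here: there is never a stale copy of the progress, and no copying or locking is needed to hand it back and forth.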

slide-24
SLIDE 24

“But can you always eliminate state spill?”

– the slightly persnickety audience member

slide-25
SLIDE 25

… some states must exist somewhere

  • State spill is not always avoidable, especially in low-level OS code that interfaces directly with hardware
  • e.g., frame allocator, interrupt handler

[Poster: Theseus: a State Spill-free Operating System, Kevin Boos and Lin Zhong]

OSes are complex and entangled

  • Existing OSes are a web of entangled entities
      ○ Cannot treat entities independently
  • OS components should be easily interchangeable at runtime, for fluid system evolution
      ○ Goal: Runtime Composability
  • Prior decoupling strategies are insufficient; entanglement remains between system entities
      ○ Modularization
      ○ Encapsulation (OOP)
      ○ Privilege-level separation (μkernel)
      ○ Hardware-driven (Barrelfish)

State Spill is the root cause

  • Scenario: source entity A (“client” role) communicates with destination B (“server” role).
  • State spill occurs when B’s state undergoes a lasting change after handling an interaction from A.

[Figure: caller entity A (“client”) and callee entity B (“server”), showing B’s initial state pre-interaction, changed state mid-interaction, and lasting changes post-interaction (state spill).]

Encapsulation causes state spill

[Figure: Standard Encapsulation vs. Opaque Exportation + Stateless Communication. Each panel tracks client state and server state across the calls config(c), fn1(), and fn2(r1), with return values void, r1, r2 and state items c, s1, s2.]

Main goals of Theseus

Directive 1: no state spill (above all else)
Directive 2: elementary modules

[Figure: Monolithic / Microkernel OS vs. Theseus, with the same modules in both (task mgmt, kernel console, input event mux, indirection layers, scheduler and its policies, graphics mux, filesystem, IRQs, dispatchers, allocators).]

For disentanglement, we focus only on states and how they propagate throughout entities in the system.

Theseus design principles

  • 1. No traditional encapsulation
      ○ Client A should maintain the state representing its progress with server B, instead of B
      ○ Preserve information hiding: A cannot inspect or modify state from B
  • 2. Stateless interactions
      ○ An interaction from A → B must include everything B needs to handle it
      ○ Implication: B can be practically stateless
  • 3. Universal, connectionless communication
      ○ All entities are accessible in a uniform way
      ○ Do not assume ongoing existence of interfaces
  • 4. Re-use of generic, spill-free patterns
      ○ Implement common OS design patterns once in a spill-free way, then re-use across system

Design & implementation decisions

Flat Module Architecture

  • Submodules contribute to complex entanglement
      ○ Extract submodules into first-order modules
  • Simplifies module logic → nano_core manages all
  • Permits communication and compositional hierarchy

Software-only Isolation and Safety

  • Modules are separate binaries: namespace isolation
  • Augment Rust compiler to permit minimal subset of unsafe code necessary for basic OS functionality
  • Error handling is mandatory, using Option & Result
      ○ Panics are disallowed and transformed into errors

State Management

  • At some point, some entities must hold some state
  • Export multi-client state as data blob jointly owned by all clients
  • Clientless states are owned by state_db metamodule
      ○ Entity caches a weak reference to it

[Figure: a system-wide entity (e.g., hardware resource) with several clients that share multi-client state; nano_core and state db, with weak references marked W.]

Current status and future work

  • Done: baseline OS from scratch, all in Rust
  • Now: analyze & rethink modules and interfaces to remove state spill
  • Far: no user/kernel distinction: “bag of modules”

  • Multi-client states are exported as blobs jointly owned by every client
  • Clientless states are owned by the state_db metamodule; the module caches a weak reference
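
The two ownership rules above map naturally onto Rust’s `Arc` and `Weak`. The sketch below is an assumption about how that could look, not the Theseus implementation: `StateDb` and `FrameAllocator` are hypothetical stand-ins, and a simple counter plays the role of the clientless state.

```rust
use std::sync::{Arc, Mutex, Weak};

/// Hypothetical stand-in for the state_db metamodule: it owns clientless state.
struct StateDb {
    frame_count: Arc<Mutex<u64>>,
}

/// A module caches only a weak reference to state owned by state_db.
struct FrameAllocator {
    cached: Weak<Mutex<u64>>,
}

impl FrameAllocator {
    fn alloc(&self) -> Result<u64, &'static str> {
        // Upgrading fails if state_db no longer holds the state.
        let state = self.cached.upgrade().ok_or("state no longer available")?;
        let mut next = state.lock().map_err(|_| "state lock poisoned")?;
        let frame = *next;
        *next += 1;
        Ok(frame)
    }
}

fn main() {
    let db = StateDb { frame_count: Arc::new(Mutex::new(0)) };
    let module = FrameAllocator { cached: Arc::downgrade(&db.frame_count) };
    assert_eq!(module.alloc(), Ok(0));
    assert_eq!(module.alloc(), Ok(1));

    // Multi-client state: each client holds a strong reference (joint ownership).
    let shared = Arc::new(Mutex::new(Vec::<u8>::new()));
    let (client_a, client_b) = (Arc::clone(&shared), Arc::clone(&shared));
    client_a.lock().unwrap().push(1);
    assert_eq!(client_b.lock().unwrap().len(), 1);

    // Once the owner is gone, the cached weak reference cannot be upgraded.
    drop(db);
    assert!(module.alloc().is_err());
}
```

Since the module holds no strong reference, the owner can take the state back at any time, and the module sees an ordinary error rather than a dangling pointer.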

slide-26
SLIDE 26

Software-only Safety & Isolation

  • Problem: we need isolation between modules
  • For fault isolation but also for interchangeability
  • Easy solution: rely on Rust’s memory safety guarantees
  • Why not just use existing techniques? (Singularity, SPIN, VINO)
  • Challenge: many modules, modules are in kernel core
  • No support for processes, SIPs, extensions, etc.

slide-27
SLIDE 27

Building Modules in Isolation

  • Each module is a separate Rust crate
  • Compiled into individual binaries, isolated into private “namespaces”

[Figure: Compiler vs. Compiler/Linker.]

  • Theseus: easy to extricate a single crate due to clear boundaries
  • Standard OS: no true distinction between modules, or blurry lines

slide-28
SLIDE 28

Many modules, all at the core

✓ Naming isolation done

  • Problem: Hardware protection or spatial multiplexing is infeasible for 100+ low-level modules
  • Solution: shared resources
  • Challenge: modules implement core kernel functionality
  • They need to execute unsafe code
  • Unsafe code can do … well, anything

slide-29
SLIDE 29

Unsafe code is a necessary evil

  • Some unsafe code is okay
      ○ Port I/O & MMIO
      ○ Register access
      ○ Interrupts & descriptor tables
  • Most unsafe code is bad!
      ○ Dereferencing arbitrary pointers
      ○ Random type reinterpretations
  • Solution: augment Rust compiler to permit minimal subset of necessary unsafe code
  • Principle of least privilege, per-module

    fn keyboard_irq_handler() {
        let scan_code: u8 = unsafe { in_byte(0x60) };
        log!("scan_code: {}", scan_code);
        // notify end of interrupt
        unsafe {
            out_byte(0x20, 0x20);
        }
    }

    fn bad_boy_bad_boy() {
        unsafe {
            *(0xFFFF1234 as *mut u32) = 0xDEADF00D;
        }
    }

slide-30
SLIDE 30

Module Management, flattened

[Figure: (a) Monolithic Kernel, (b) Microkernel OS, (c) Theseus. Legend: module, submodule, entanglement via state spill. Same module diagram as on slide 14, here labeled with nano_core.]

  • Submodules contribute to entanglement
  • Must be separated out into standalone, first-order modules with spill-free public interfaces
  • nano_core can then manage all modules indiscriminately

Two important clarifications:

  • 1. State spill freedom does not preclude arbitrary module interaction
  • 2. A module’s flat code structure does not preclude compositional hierarchy amongst modules

slide-31
SLIDE 31

Why Rust?

slide-32
SLIDE 32

Beneficial Features of Rust

  • High-level constructs with low-level flexibility

    fn clear_vga_screen() {
        range(0, 80*25, |i| {
            *((0xb8000 + i * 2) as *mut u16) = VgaChar::new(Black) << 12;
        });
    }

slide-33
SLIDE 33

Beneficial Features of Rust

  • High-level constructs with low-level flexibility

    let &mut MemoryManagementInfo { ref mut page_table, ref mut vmas, ref mut stack_allocator } = current_mmi;
    match page_table {
        &mut PageTable::Active(ref mut active_table) => {
            let mut frame_allocator = FRAME_ALLOCATOR.lock();
            if let Some((stack, stack_vma)) = stack_allocator.alloc_stack( ... ) {
                vmas.push(stack_vma);
                Ok(stack)
            }
        }
        _ => {
            Err("MemoryManagementInfo::alloc_stack: failed to allocate stack!")
        }
    }

slide-34
SLIDE 34

Beneficial Features of Rust

  • High-level constructs with low-level flexibility
  • Functional & imperative

    p3.and_then(|p3| p3.next_table(page.p3_index()))
        .and_then(|p2| p2.next_table(page.p2_index()))
        .and_then(|p1| p1[page.p1_index()].pointed_frame())
        .or_else(huge_page)
        .map(|frame| { frame.number * PAGE_SIZE + offset })

    sections.filter(|s| !s.is_allocated())
        .map(|s| s.addr)
        .min()

slide-35
SLIDE 35

Beneficial Features of Rust

  • High-level constructs with low-level flexibility
  • Functional & imperative
  • Strongly typed + type inference


slide-36
SLIDE 36

Beneficial Features of Rust

  • High-level constructs with low-level flexibility
  • Functional & imperative
  • Strongly typed + type inference
  • Clear ownership/borrowing semantics
  • Explicit lifetimes


slide-37
SLIDE 37

Beneficial Features of Rust

  • High-level constructs with low-level flexibility
  • Functional & imperative
  • Strongly typed + type inference
  • Clear ownership/borrowing semantics
  • Explicit lifetimes

    let y: &u32 = {
        let x = 5;
        &x
    };
    println!("{}", y);

    error: `x` does not live long enough
       |
     3 |     &x
       |      - borrow occurs here
     4 | };
       | ^ `x` dropped here while still borrowed
    ...
    10 | println!("{}", y);
       | - borrowed value needs to live until here

slide-38
SLIDE 38

Beneficial Features of Rust

  • High-level constructs with low-level flexibility
  • Functional & imperative
  • Strongly typed + type inference
  • Clear ownership/borrowing semantics
  • Explicit lifetimes
  • Compile-time checking & guarantees


slide-39
SLIDE 39

Concluding Remarks

slide-40
SLIDE 40

Current & Future Work

  • Applying the design described here to transform our existing baseline OS into a state spill-free design
  • Evaluate and demonstrate runtime composability

Related projects using Theseus

  • Exploring multi-entity resource accounting with state spill
  • Massive MU-MIMO LTE/5G Basestation system software
  • Goal: bring reliability, safety, flexibility, scalability to the edge
  • Refactoring network stack for network provenance and tolerance of DoS attacks

slide-41
SLIDE 41

Theseus in conclusion

  • Eliminate state spill above all else
  • In pursuit of runtime composability for easy long-term OS evolution
  • Implemented from scratch in Rust
  • Will be open-sourced soon!

Rice Efficient Computing Group

recg.org


slide-43
SLIDE 43

Backup Slides

slide-44
SLIDE 44

Rust disadvantages

  • Lifetime checking isn’t great
  • Copious overuse of panic
      ○ Must translate into normal error responses
  • Allocations are not as obvious as in C
  • Many more in paper

    let tasklist = task::get_tasklist().read();
    let mut curr_task = tasklist.get_current().unwrap().write();
    let curr_mmi = curr_task.mmi.as_ref().unwrap();
    let mut curr_mmi_locked = curr_mmi.lock();
    curr_mmi_locked.map_dma_memory(paddr, 512, PRESENT | WRITABLE);
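
As a rough illustration of translating panics into normal error responses, the sketch below replaces each `unwrap()` in a snippet of the same shape with `ok_or` and `?`. The types `Task` and `Mmi` and the function `map_dma` are simplified, hypothetical stand-ins; the real Theseus structures (locks, flags, page tables) are not modeled.

```rust
/// Hypothetical, simplified stand-ins for the task structures on the slide.
struct Mmi {
    mapped: Vec<(usize, usize)>,
}

struct Task {
    mmi: Option<Mmi>,
}

/// Every place that would have called `unwrap()` now returns an ordinary
/// error that the caller must handle.
fn map_dma(current: Option<&mut Task>, paddr: usize, len: usize) -> Result<(), &'static str> {
    let task = current.ok_or("no current task")?;
    let mmi = task.mmi.as_mut().ok_or("task has no memory management info")?;
    mmi.mapped.push((paddr, len));
    Ok(())
}

fn main() {
    let mut task = Task { mmi: Some(Mmi { mapped: Vec::new() }) };
    assert_eq!(map_dma(Some(&mut task), 0x1000, 512), Ok(()));
    assert_eq!(map_dma(None, 0x1000, 512), Err("no current task"));

    let mut bare = Task { mmi: None };
    assert!(map_dma(Some(&mut bare), 0x1000, 512).is_err());
}
```

The failure paths become values in the function’s signature, so a missing task is reported to the caller instead of taking down the kernel.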

slide-45
SLIDE 45

P3: Universal, Connectionless Interfaces

  • All entities should be easily accessible through a uniform invocation and management interface
  • Promotes easy interchangeability
  • Avoids module-specific logic
  • No expectation of an interface’s ongoing availability
  • Interactions cannot be stateful, must be “connectionless”

slide-46
SLIDE 46

Pattern Reuse

  • Common recurring OS design patterns should be implemented only once and reused throughout the OS
  • Examples: multiplexers, dispatchers, indirection layers
  • Pattern must enforce the absence of state spill regardless of specialization
  • Should be instantiable at compile time
  • Lowers development risk of adding new features
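
A sketch of what a reusable, spill-free pattern might look like, assuming Rust generics as the compile-time instantiation mechanism. The `mux` function and the handler names are hypothetical and chosen only for illustration; the same generic code is specialized once for key codes and once for IRQ lines.

```rust
/// A generic multiplexer pattern: it routes one input to one of several
/// handlers and keeps no state of its own. (Hypothetical sketch.)
fn mux<I, O>(handlers: &[fn(&I) -> O], select: usize, input: &I) -> Option<O> {
    handlers.get(select).map(|h| h(input))
}

// Handlers for two different specializations of the same pattern.
fn key_upper(code: &u8) -> char {
    (*code as char).to_ascii_uppercase()
}

fn key_lower(code: &u8) -> char {
    (*code as char).to_ascii_lowercase()
}

fn irq_name(line: &u32) -> String {
    format!("irq{}", line)
}

fn main() {
    let keys: [fn(&u8) -> char; 2] = [key_upper, key_lower];
    assert_eq!(mux(&keys, 0, &b'a'), Some('A'));
    assert_eq!(mux(&keys, 1, &b'A'), Some('a'));
    assert_eq!(mux(&keys, 5, &b'a'), None); // unknown route: a value, not a panic

    let irqs: [fn(&u32) -> String; 1] = [irq_name];
    assert_eq!(mux(&irqs, 0, &3), Some("irq3".to_string()));
}
```

Plain `fn` pointers, unlike closures, cannot capture an environment, which is one simple way to keep a specialization from smuggling state into the pattern; handlers could still reach global statics, so this is a convention rather than a full guarantee.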

slide-47
SLIDE 47

Modules must not jeopardize evolution

Regular Modules

  • In general, easy to update because modules are stateless
  • The nano-core relieves modules that do contain states from the burden of implementing their own state-saving logic
  • (Tedious approach, commonly used in live updates for reliable systems, e.g., MINIX 3 [40])
  • In Theseus, the compiler is the sole manager of these exported states; it produces control routines to move them in and out of a module’s bounds, a guaranteed-safe operation because they are not accessible at the source level
  • Updates can occur on demand without the modules’ cooperation or knowledge

Metamodules

  • Must also be easily updatable
  • state_db “recalls” lent-out states and externalizes them temporarily

slide-48
SLIDE 48

Related work (abridged)

  • Rust OSes with different goals: Tock, Redox
  • Static Composability: Flux OSKit, Think, Taligent
  • Componentized interfaces: OpenCOM, Knit
  • Microservices architecture and serverless
