We Crashed, Now What?

Cristiano Giuffrida, Lorenzo Cavallaro, Andrew S. Tanenbaum
Vrije Universiteit Amsterdam
6th USENIX Workshop on Hot Topics in System Dependability
October 3, 2010, Vancouver, BC, Canada

OS Dependability Threats

Are Core Components Safe?

“We’re getting bloated and huge. Yes, it’s a problem. [...] I’d like to say we have a plan.”
Linus Torvalds on the Linux kernel, 2009

High-coverage Crash Recovery

- Rapid evolution and huge size cause more bugs
- Crash recovery solution with smaller TCB needed
- Whole-OS crash recovery

How?
1. Extend existing work on isolated subsystems to the entire OS
2. Design a new high-coverage crash recovery infrastructure

Isolated Subsystems -> Entire OS?

- Work on extensions and drivers, e.g., SafeDrive, Nooks, MINIX 3
- Filesystems, e.g., Membrane
- Assume isolated untrusted parties with well-defined interfaces
- Several recoverer-recoveree pairs to scale to the entire OS
  ... it is like a dog chasing its tail!
- Complex and hard-to-maintain recovery infrastructure
- High exposure of the recovery code to the programmer

Emerging High-coverage Solutions

Shadow kernel vs. pure instrumentation
- Shadow kernel, e.g., Otherworld: best-effort (weak failure model)
- Pure instrumentation, e.g., Recovery Domains: heavyweight (high complexity, poor performance, poor scalability)

WWW: What We Want

- High coverage
- Low complexity
- Reasonable performance and scalability
- Good maintainability
- Address the many challenges of the crash recovery problem

The Crash Recovery Problem (I)

Crash detection
- Detect crashes proactively or reactively
- Isolate crashes so they do not disrupt the recovery process
State transfer
- Create a new execution context to restart execution
- Transfer the state from the old execution context
State consistency
- Restore a stable and consistent state in the new context
- Allow for deterministic execution upon restart

The Crash Recovery Problem (II)

State dependency tracking
- Preserve state dependencies among different contexts
- Allow for a globally coherent state upon restart
State corruption
- Detect arbitrary data corruption
- Attempt to recover from arbitrary data corruption
Restart
- Determine a safe execution point to resume operation
- Attempt to avoid further crashes
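The state transfer, state corruption, and restart steps listed above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the talk: the `Context` class, the `recover` function, and the use of a SHA-256 digest as the corruption check are hypothetical. The idea is only that a fresh context receives the old state when that state still looks intact, and falls back to an earlier checkpoint when it does not.

```python
import copy
import hashlib
import json


def digest(state):
    """Checksum of a JSON-serialisable state dict (hypothetical integrity check)."""
    return hashlib.sha256(json.dumps(state, sort_keys=True).encode()).hexdigest()


class Context:
    """An execution context holding a private copy of component state."""

    def __init__(self, state=None):
        self.state = copy.deepcopy(state) if state is not None else {}


def recover(old, live_digest, checkpoint, checkpoint_digest):
    """Create a new context and decide which state to transfer into it.

    The live state of the crashed context is transferred only if it still
    matches the digest recorded at its last legitimate update. Otherwise the
    earlier checkpoint is used, provided it passes its own check.
    """
    if digest(old.state) == live_digest:
        return Context(old.state), "live"
    if digest(checkpoint) == checkpoint_digest:
        return Context(checkpoint), "checkpoint"
    raise RuntimeError("no consistent state available")
```

A digest only detects corruption that happens after the digest was recorded. A bug that writes wrong values through the legitimate update path would go unnoticed, which is one reason the slide speaks of attempting recovery rather than guaranteeing it.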

Our Approach

Combine OS design and lightweight instrumentation

OS design
- Reduce complexity at recovery time
- Good performance and scalability
Lightweight compiler-based instrumentation
- High coverage and component-agnostic recovery
- Good maintainability and evolvability

OS Architecture

[Figure: layered architecture. R3: App ... App; VFS, SCH, NET, VM, PM, ...; PRN, HDD, NDD, SND; RS. R0: Microkernel.]

- We break down the OS into several userspace components
- Multiserver microkernel architecture based on message passing
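A toy model may help show why message passing between userspace components matters for recovery. The sketch below is an assumption-laden illustration, not the system from the talk: the `Microkernel` class, its `register` and `send` methods, and the `vfs` and `pm` handlers are hypothetical names, and a Python exception stands in for a component crash. Components never call each other directly, so a fault in one handler can be contained at the delivery point.

```python
class Microkernel:
    """Routes messages between registered components (illustrative only)."""

    def __init__(self):
        self.servers = {}
        self.crashed = []  # (component name, error) pairs seen so far

    def register(self, name, handler):
        self.servers[name] = handler

    def send(self, dest, message):
        """Deliver a message and contain any fault to the destination."""
        try:
            return self.servers[dest](message)
        except Exception as exc:  # stands in for a component crash
            self.crashed.append((dest, repr(exc)))
            return None


def vfs(message):
    """Hypothetical file server: only understands 'open'."""
    if message["op"] == "open":
        return {"fd": 3}
    raise ValueError("unknown op")


def pm(message):
    """Hypothetical process manager: always answers with a fixed pid."""
    return {"pid": 42}
```

In this model a bad request makes `vfs` fail, the kernel records the failure, and `pm` keeps answering requests. A real system would also have to isolate memory and restart the failed component, which this sketch does not attempt.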

The Programming Model

[Figure: O.S. component]

- We rely on an event-driven model
- Events trigger execution of the task loop
- Idempotent messages possible within the task loop
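One way to read the event-driven model is that the top of the task loop is a natural point to return to after a crash. The sketch below illustrates that reading under explicit assumptions: `Component`, `task_loop`, the injected failure, and the snapshot-and-retry policy are all hypothetical and are not described in the slides. State is snapshotted before each event, and a crash in the handler rolls the component back and replays the event.

```python
import copy


class Component:
    """Hypothetical OS component with a single injected crash."""

    def __init__(self):
        self.state = {"count": 0}
        self.fail_once = True  # fault injection flag, kept outside the state

    def handle(self, event):
        self.state["count"] += event  # partial update happens first
        if event == 5 and self.fail_once:
            self.fail_once = False
            raise RuntimeError("injected crash")
        self.state["last"] = event


def task_loop(component, events):
    """Process events one by one; on a crash, roll back and replay the event."""
    restarts = 0
    for event in events:
        while True:
            snapshot = copy.deepcopy(component.state)  # state at top of loop
            try:
                component.handle(event)
                break
            except RuntimeError:
                component.state = snapshot  # discard the partial update
                restarts += 1
    return restarts
```

Replaying an event repeats whatever the handler did before crashing, including any messages it sent to other components. Under this reading, that is why messages exchanged within the task loop would need to be idempotent.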
