Using Model Checking to Find Serious File System Errors Junfeng - PowerPoint PPT Presentation

Using Model Checking to Find Serious File System Errors
Junfeng Yang, Paul Twohey, Dawson Engler (Stanford University)
Madanlal Musuvathi (Microsoft Research)

Authors
[Photos: Dawson, Junfeng, Paul, Madan]

FS Errors are Destructive
• Kernel crash, FS corruption
• Recovery code is error-prone
  – Crash at any point, must recover
• Hard to test
  – Slow reboot, reconstruction
  – Many crash possibilities, hard to cover all

FiSC = File System Model Checker
• Leverages CMC [OSDI 02, NSDI 04]
  – Implementation-level model checker
• Generic and FS-specific checks
• Good at enumerating failures/crashes
• 32 bugs in JFS, ReiserFS and ext3
  – 10 unrecoverable losses of ‘/’, hard to get with static analysis
  – 3 security holes
  – 30 confirmed and 21 fixed quickly

Outline
• How FiSC works
• Two consistency checks
• How to plug a file system into FiSC
• Checking crashes during recovery
• Results

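Idealized Checking Process
[Figure: tree of file system states. Starting from root, each operation (creating file0, mkdir of dir0, …) yields a successor state.]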

Galactic View of FiSC
[Figure: architecture diagram. Labels: Test Driver, FS operations, ext3, User Mode Linux, fsck, disk r/w, libc r/w, FiSC libc interceptor, Disks, CMC.]

The Checking Loop
[Figure: checking loop. Labels: kernel threads, Test Driver (create file0, mkdir, …), scheduler, state queue, disk writes, "state seen?" (drop), permuter, checkers (Err!).]
• Our modified scheduler will enumerate through all kernel threads and file system operations
• Will talk about permuter later…
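The loop on this slide (pick an action, run the checkers, drop states already seen) can be illustrated with a toy explicit-state search. This is only a sketch under loud assumptions: the "file system" is a bitmask of three files, and `apply_action`, `explore`, and the two sample checkers are hypothetical names for illustration, not FiSC or CMC code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NFILES   3                  /* toy FS: state is a bitmask of existing files */
#define NSTATES  (1u << NFILES)
#define NACTIONS (2 * NFILES)       /* create(i) and unlink(i) for each file */

typedef bool (*checker_fn)(unsigned state);

/* Hypothetical transition function: actions 0..NFILES-1 create file i,
 * actions NFILES..2*NFILES-1 unlink file (a - NFILES). */
static unsigned apply_action(unsigned state, int a) {
    if (a < NFILES)
        return state | (1u << a);
    return state & ~(1u << (a - NFILES));
}

/* Explore every reachable state exactly once. Returns the number of
 * distinct states checked, or -1 as soon as the checker rejects a state. */
int explore(unsigned initial, checker_fn check) {
    bool seen[NSTATES];
    unsigned stack[NSTATES];        /* each state is pushed at most once */
    int top = 0, visited = 0;

    assert(initial < NSTATES);
    memset(seen, 0, sizeof seen);
    seen[initial] = true;
    stack[top++] = initial;

    while (top > 0) {
        unsigned s = stack[--top];
        visited++;
        if (!check(s))
            return -1;              /* "Err!" in the slide's diagram */
        for (int a = 0; a < NACTIONS; a++) {
            unsigned next = apply_action(s, a);
            if (!seen[next]) {      /* "state seen?": drop duplicates */
                seen[next] = true;
                stack[top++] = next;
            }
        }
    }
    return visited;
}

/* Sample checkers for the toy model. */
bool always_ok(unsigned state) { (void)state; return true; }

bool at_most_two_files(unsigned state) {
    int count = 0;
    for (int i = 0; i < NFILES; i++)
        count += (state >> i) & 1u;
    return count <= 2;
}
```

With `always_ok`, the search visits all eight toy states; with `at_most_two_files`, it stops at the state where all three files exist.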

Difference With Randomized Testing
• Randomized testing = only one possible execution
• Our approach = guided search
  – Systematic: enumerate through all actions
  – Better controlled: choose what to explore
  – Visibility: see all events
  – Repeatable: bugs are replicable

Long-lived JFS fsck Bug Fixed in 2 Days
• Loss of an extent of inodes!
• 3 years old, ever since the first version!
• Caused serious data loss
  – Dave Kleikamp (IBM JFS): “I'm sure this has bitten us before, but it's usually hard to go back and find out what causes the file system to get messed up so bad”
• Fixed in 2 days with our complete trace

Checking FS Operations are Correct
[Figure: Current State and Next State. abstract_mkdir takes the Abstract FS (root, file0) to the next Abstract FS (root, file0, dir1); actual_mkdir takes the actual FS (block cache, dirty blocks) to its next state.]
• Abstract FS: model of a file system. Currently tracks topology and file sizes. Can be extended
• Reference model, run in parallel with the actual FS

Checking FS Operations are Correct
[Figure: Next State. The Abstract FS (root, file0, dir1) is compared (=?) with the abstraction of the actual FS and its dirty blocks.]
• abstract = marshal the actual FS, record the topology and file sizes, throw away details
• Generic, implemented by FiSC
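A minimal sketch of this comparison, assuming the abstract FS is just a flat set of (path, directory flag, size) entries. The types and helpers (`abs_fs`, `abs_add`, `abs_mkdir`, `abs_equal`) are hypothetical and chosen for illustration; the slides do not show FiSC's real representation.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_NODES 8
#define MAX_PATH_LEN 16

/* Hypothetical abstract FS: topology and file sizes only, no on-disk detail. */
struct abs_node { char path[MAX_PATH_LEN]; bool is_dir; long size; };
struct abs_fs   { int n; struct abs_node nodes[MAX_NODES]; };

static int abs_find(const struct abs_fs *fs, const char *path) {
    for (int i = 0; i < fs->n; i++)
        if (strcmp(fs->nodes[i].path, path) == 0)
            return i;
    return -1;
}

/* Add an entry; fails if the model is full, the path is too long, or it exists. */
bool abs_add(struct abs_fs *fs, const char *path, bool is_dir, long size) {
    assert(fs != NULL && path != NULL);
    if (fs->n == MAX_NODES || strlen(path) >= MAX_PATH_LEN || abs_find(fs, path) >= 0)
        return false;
    struct abs_node *node = &fs->nodes[fs->n++];
    strcpy(node->path, path);
    node->is_dir = is_dir;
    node->size = size;
    return true;
}

/* Model-side mkdir: the reference behaviour the actual FS is compared against. */
bool abs_mkdir(struct abs_fs *fs, const char *path) {
    return abs_add(fs, path, true, 0);
}

/* Order-insensitive equality: same paths with the same kind and size. */
bool abs_equal(const struct abs_fs *a, const struct abs_fs *b) {
    if (a->n != b->n)
        return false;
    for (int i = 0; i < a->n; i++) {
        int j = abs_find(b, a->nodes[i].path);
        if (j < 0 || a->nodes[i].is_dir != b->nodes[j].is_dir ||
            a->nodes[i].size != b->nodes[j].size)
            return false;
    }
    return true;
}
```

In use, one `abs_fs` would be advanced by the model operation and a second one filled in by walking the real file system after the real operation; a mismatch from `abs_equal` is reported as an error.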

Permuter: Write Schedules are Recoverable
[Figure: the checking loop again. Labels: kernel threads, Test Driver (create file0, mkdir, …), scheduler, state queue, disk writes, "state seen?" (drop), permuter, checkers (Err!).]

Permuter: Write Schedules are Recoverable
[Figure: Current State and Next State. After mkdir, the Stable FS (root, file0) is unchanged while the new data sits in the block cache as dirty blocks.]
• Stable FS: what FS should recover to after crash
• FS-specific, provided by FS developers

Permuter: Write Schedules are Recoverable
[Figure: Next State. The disk is cloned, the dirty block writes are permuted, fsck is run, and the recovered FS is compared (=?) with the Stable FS (root, file0).]

Plugging an FS into FiSC
1. FS utilities: mkfs, fsck
2. Dirty buffers
  – Not needed if using standard system mark_dirty
3. Minimum disk and memory sizes
  – 2MB, 16 pages for ext3
4. Function to compute the Stable FS
  – Stable FS: what FS should recover to, FS-specific
• Roughly 1-2 weeks for us

Stable FS Trick for Journaling FS
• Only log write can update the Stable FS
  – Log write → use fsck to compute Stable FS
  – FS write → fsck and abstract, compare result to Stable FS
  – FS writes cannot change Stable FS
• Log write = commit + normal log write
  – Only commit can update the Stable FS
  – If easy to recognize commit, update Stable FS on commit

Checking More Thoroughly
• Downscale
  – Small disks. 2MB for ext3
  – Small memory. 16 pages for ext3
  – Tiny FS topology. 2-4 nodes
• Canonicalization
  – General rule: setting things to constants, e.g. inode generation #, mount count
  – Filenames. “x”, “y”, “z” == “1”, “2”, “3”
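The filename rule can be sketched as renaming every distinct name to a number in order of first appearance, so that two states differing only in the names chosen look identical to the "state seen?" test. `canonicalize_names` and its fixed-size buffers are assumptions made for this example, not FiSC's code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_FILES 8
#define NAME_LEN  16

/* Rewrite names in place: the first distinct name becomes "1", the second
 * "2", and so on. Repeated names map to the same number. */
void canonicalize_names(char names[][NAME_LEN], int n) {
    char seen[MAX_FILES][NAME_LEN];   /* distinct original names, in order */
    int nseen = 0;

    assert(n >= 0 && n <= MAX_FILES);
    for (int i = 0; i < n; i++) {
        int id = -1;
        for (int j = 0; j < nseen; j++) {
            if (strcmp(seen[j], names[i]) == 0) {
                id = j;
                break;
            }
        }
        if (id < 0) {                 /* first time this name appears */
            strcpy(seen[nseen], names[i]);
            id = nseen++;
        }
        snprintf(names[i], NAME_LEN, "%d", id + 1);
    }
}
```

After canonicalization, the name lists {"x", "y", "x"} and {"foo", "bar", "foo"} both become {"1", "2", "1"}, so a checker that hashes the canonical form explores only one of the two states.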

Exposing choice points
• Choice point = can abstractly do multiple actions, practically does one
• Want to explore all actions

    struct block* read_block (int i) {
        struct block *b;
        if ((b = cache_lookup(i)))
            if (fisc_choose(2) == 0)
                return b;
        return disk_read (i);
    }

• fisc_choose(2) returns twice: 1st time return 0, 2nd time return 1
• If there are N possible actions, call fisc_choose(N); it returns 0, 1, …, N-1
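The slide describes fisc_choose as returning several times from one call. The sketch below reaches the same coverage by a different, swapped-in technique: it re-runs the code once per choice sequence and replays the recorded prefix. All names (`toy_choose`, `explore_all`, `toy_read_block`) are hypothetical, and the code under test is assumed to be deterministic apart from its choices.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_DEPTH 16

static int choice[MAX_DEPTH];  /* alternative taken at each choice point */
static int limit[MAX_DEPTH];   /* number of alternatives at each choice point */
static int depth;              /* choice points reached so far in this run */
static int planned;            /* length of the prefix replayed from the last run */

/* Stand-in for a choose(N) primitive: returns a value in 0..n-1. */
int toy_choose(int n) {
    assert(n > 0 && depth < MAX_DEPTH);
    if (depth >= planned)      /* new choice point: start with alternative 0 */
        choice[depth] = 0;
    limit[depth] = n;
    return choice[depth++];
}

/* Move to the next unexplored choice sequence; false when all are done. */
static bool next_schedule(void) {
    while (depth > 0 && choice[depth - 1] + 1 >= limit[depth - 1])
        depth--;               /* this choice point is exhausted, backtrack */
    if (depth == 0)
        return false;
    choice[depth - 1]++;
    planned = depth;
    return true;
}

/* Run `run` once per distinct choice sequence; returns the number of runs. */
int explore_all(void (*run)(void)) {
    int runs = 0;
    planned = 0;
    do {
        depth = 0;
        run();
        runs++;
    } while (next_schedule());
    return runs;
}

/* Toy version of the slide's read_block on a cache hit: either use the
 * cached block or pretend the lookup failed and go to disk. */
int cache_hits, disk_reads;

void toy_read_block(void) {
    if (toy_choose(2) == 0)
        cache_hits++;
    else
        disk_reads++;
}
```

Calling `explore_all(toy_read_block)` runs the function twice, once down each branch. A function that hits a 2-way and then a 3-way choice point would be run six times.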

Scheduler is a Built-in Choice Point
[Figure: the checking loop again. Labels: kernel threads, Test Driver (create file0, mkdir, …), scheduler, state queue, disk writes, "state seen?" (drop), permuter, checkers (Err!).]
• Kernel threads and FS operations are possible actions. Enumerate through all of them.

The Basic Check
• Obtain a crashed disk image D
• Run fsck, recording all writes
• Simulate a crash during recovery
  – Apply a prefix of the recorded writes to D
  – Re-run fsck
  – Compare to Stable FS
• Repeat until all the prefixes are tried
• Effective, but slow (redundant crashes)
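The steps above can be sketched on a three-block toy disk with a one-entry journal. Everything here is an assumption for illustration: the disk layout, the two toy fsck variants, and the simplification that "compare to Stable FS" means comparing a single data block. `fsck_bad` clears the journal before writing the data, which is the kind of recovery ordering mistake the check is meant to expose.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NBLOCKS    3
#define MAX_WRITES 4

/* Hypothetical toy layout: a journal-valid flag, the journaled value, the data. */
enum { JOURNAL_VALID = 0, JOURNAL_DATA = 1, FS_DATA = 2 };

struct write { int block; int value; };

/* A toy fsck reads the disk and returns the writes it wants to perform. */
typedef int (*fsck_fn)(const int *disk, struct write *out);

/* Correct order: replay the journal into the FS, then clear the journal. */
int fsck_good(const int *disk, struct write *out) {
    int n = 0;
    if (disk[JOURNAL_VALID]) {
        out[n++] = (struct write){ FS_DATA, disk[JOURNAL_DATA] };
        out[n++] = (struct write){ JOURNAL_VALID, 0 };
    }
    return n;
}

/* Buggy order: the journal is cleared before the data reaches the FS. */
int fsck_bad(const int *disk, struct write *out) {
    int n = 0;
    if (disk[JOURNAL_VALID]) {
        out[n++] = (struct write){ JOURNAL_VALID, 0 };
        out[n++] = (struct write){ FS_DATA, disk[JOURNAL_DATA] };
    }
    return n;
}

static void apply(int *disk, const struct write *w, int n) {
    for (int i = 0; i < n; i++)
        disk[w[i].block] = w[i].value;
}

/* For every prefix of fsck's writes: crash there, re-run fsck to completion,
 * and compare the result with the expected stable data. */
bool check_recovery_crashes(const int *crashed, fsck_fn fsck, int stable_data) {
    struct write first[MAX_WRITES];
    int n = fsck(crashed, first);
    assert(n <= MAX_WRITES);

    for (int k = 0; k <= n; k++) {
        int disk[NBLOCKS];
        struct write second[MAX_WRITES];
        memcpy(disk, crashed, sizeof disk);
        apply(disk, first, k);            /* crash after k recovery writes */
        int n2 = fsck(disk, second);      /* second recovery attempt */
        apply(disk, second, n2);
        if (disk[FS_DATA] != stable_data)
            return false;                 /* recovery lost the stable data */
    }
    return true;
}
```

On the crashed image {1, 42, 7} with stable data 42, `fsck_good` passes at every crash point, while `fsck_bad` fails when the crash lands between its two writes.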

Assume: fsck is Deterministic
• Same inputs → same outputs
  – Inputs = disk reads, outputs = writes
• Is crash after a write redundant?
  – A write doesn’t change prior reads → 2nd fsck computes the same write → redundant crash, can be optimized away
• More optimizations in paper
  – Obvious: cache fsck results

Equivalent: Write But No Read
Schedule 1: read B1; write B2; …; done
Schedule 2: read B1; write B2; crash & re-run; read B1; write B2 (same!); …; done (same!)
• Schedule 1 = Schedule 2
• No read of B2 prior to write of B2

Equivalent: Dominated Write
Schedule 1: read B1; write B2; …; write B2; …; done
Schedule 2: read B1; write B2; …; write B2; crash & re-run; read B1; write B2; …; write B2 (same!); …; done (same!)
• Schedule 1 = Schedule 2
• 2nd write of B2 is dominated by 1st write of B2
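A simplified reading of the "write but no read" rule, as a sketch: given a recorded fsck trace, a crash right after a write is skipped when fsck had not read that block earlier, since the re-run would then see the same inputs. The trace format and `select_crash_points` are hypothetical, the dominated-write rule is not modeled, and the exact conditions in the paper may be stricter than this.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BLOCKS 64

enum op_kind { OP_READ, OP_WRITE };
struct op { enum op_kind kind; int block; };

/* Mark which writes still need a simulated crash right after them.
 * test_after[i] is set only for writes to a block that fsck read earlier
 * in the trace. Returns the number of crash points selected. */
int select_crash_points(const struct op *trace, int n, bool *test_after) {
    bool was_read[MAX_BLOCKS] = { false };
    int selected = 0;

    for (int i = 0; i < n; i++) {
        assert(trace[i].block >= 0 && trace[i].block < MAX_BLOCKS);
        test_after[i] = false;
        if (trace[i].kind == OP_READ) {
            was_read[trace[i].block] = true;
        } else if (was_read[trace[i].block]) {
            /* The write changes something fsck already read, so a re-run
             * may behave differently: keep this crash point. */
            test_after[i] = true;
            selected++;
        }
    }
    return selected;
}
```

For the trace read B1, write B2, read B2, write B2, write B1, the first write of B2 is skipped and the last two writes are kept as crash points.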

Results
Error Type VFS ext2 ext3 JFS Reiser total
Data loss N/A N/A 1 8 1 10
False clean N/A N/A 1 1 2
Security 2 2 1 3 + 2
Crashes 1 10 1 12
Other 1 1 1 3
Total 2 2 5 21 2 32
• 32 in total, 21 fixed, 9 of the remaining 11 confirmed

Recovery Write Ordering Bugs
• Under normal operation:
  – Changes must first be flushed to log before they can reach the actual FS
• All FS seem to get this right
• During recovery:
  – Changes must first be flushed to the actual FS before the log can be cleared
• Found this type of bug in all FS, total 5

ext3 Recovery Bug

    recover_ext3_journal(…) {
        // …
        retval = -journal_recover(journal);
        // …
        // clear the journal
        e2fsck_journal_release(…);
        // …
    }

    journal_recover(…) {
        // replay the journal
        // …
        // sync modifications to disk
        fsync_no_super (…);
    }

    // Error! Empty macro, doesn’t sync data!
    #define fsync_no_super(dev) do {} while (0)

• Code was directly adapted from the kernel
• But, fsync_no_super was defined as NOP!

Conclusion
• FiSC, an FS model checker
  – On average 1-2 weeks to plug in an FS
  – Checked JFS, ReiserFS and ext3
  – Serious data-loss bugs in all, 10 in total
• Model checking worked well
  – Can crash everywhere. Must always be recoverable.
  – Systematic
• Future work: anything that must handle failure correctly, always
  – RAID, databases, consensus algorithms…
