All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications - PowerPoint PPT Presentation

SLIDE 1

All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications

Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, Samer Al-Kiswany, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau

SLIDE 2

Crash Consistency

Maintaining data invariants across system crash

  • Ex: database transactions should be atomic

Important in many systems

  • Databases
  • File Systems
  • Key-Value Stores

SLIDE 3

File-System Crash Consistency

Many techniques to ensure file system remains consistent after a crash

  • Journaling
  • Copy-on-write
  • Soft Updates

Techniques ensure internal file-system structures are consistent

SLIDE 4

Application-Level Crash Consistency

Many applications run over file systems

  • Ex: SQLite, LevelDB, Git, etc.

SLIDE 5

Application-Level Crash Consistency

Many applications run over file systems

  • Ex: SQLite, LevelDB, Git, etc.

Provide user guarantees across system crashes

  • Ex: SQLite txs provide ACID guarantees

Interact with file systems via POSIX calls

SLIDE 6

Crash Recovery is Hard

Databases took a long time to get it right

  • Commercial database - System R (1981)
  • Crash Recovery algorithm - ARIES (1992)
  • ARIES proved correct (1997)

Applications must achieve high performance and correctness

  • Mutating persistent state synchronously too slow
  • Leads to complex update protocols

SLIDE 7

Belief: all POSIX file systems implement calls the same way

An application using the POSIX interface should work on any POSIX-compliant file system
POSIX specifies what happens in memory

  • Does not specify what happens on crash

SLIDE 8

Example

Goal: atomically update multiple blocks of a file

Initial state: /dir/file contains “my foo”; final state: /dir/file contains “my bar”

Solution: use write-ahead-logging

Protocol:

  • 1. Write to log (/dir/log)
  • 2. Update /dir/file
  • 3. Delete /dir/log

On crash: read log, do steps 2 and 3
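To make the protocol concrete, here is a minimal sketch in C of the naive update protocol above, using real POSIX calls (the next slide shows the same steps in the talk's path-based shorthand). It deliberately contains no fsync() calls; the following slides show why that is not enough. The log-entry format (“offset, size, data”) follows the later slides; the paths and error handling are illustrative.

/* Naive write-ahead-logging update of /dir/file (no fsync yet).
 * Sketch only: minimal error handling; the log format
 * ("offset, size, data", as on the slides) is illustrative. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const char *entry = "3, 3, bar";            /* log: offset 3, size 3, data "bar" */

    /* 1. Write to log (/dir/log) */
    int log_fd = open("/dir/log", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (log_fd < 0) { perror("open log"); return 1; }
    pwrite(log_fd, entry, strlen(entry), 0);
    close(log_fd);

    /* 2. Update /dir/file in place */
    int fd = open("/dir/file", O_WRONLY);
    if (fd < 0) { perror("open file"); return 1; }
    pwrite(fd, "bar", 3, 3);                    /* 3 bytes at offset 3 */
    close(fd);

    /* 3. Delete /dir/log; on crash, recovery replays the log instead */
    unlink("/dir/log");
    return 0;
}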

SLIDE 9

Example

  • 1. Write log

open(“/dir/log”, O_CREAT|O_WRONLY|O_TRUNC)
pwrite(“/dir/log”, log_entry, 0, 6)

  • 2. Update /dir/file

pwrite(“/dir/file”, data, 3, 3)

  • 3. Delete /dir/log

unlink(“/dir/log”)

SLIDE 10

Example

1 open(“/dir/log”, O_CREAT|O_WRONLY|O_TRUNC)
2 pwrite(“/dir/log”, log_entry, 0, 6)
3 pwrite(“/dir/file”, “bar”, 3, 3)
4 unlink(“/dir/log”)

Works in ext3 data journaling mode

Possible disk states after a crash:

  • After 1: /dir/file “my foo”, /dir/log empty
  • After 2: /dir/file “my foo”, /dir/log “3, 3, bar”
  • Middle of 3: /dir/file “my boo”, /dir/log “3, 3, bar”
  • After 4: /dir/file “my bar”

SLIDE 11

Example

1 open(“/dir/log”, O_CREAT|O_WRONLY|O_TRUNC)
2 pwrite(“/dir/log”, log_entry, 0, 6)
3 pwrite(“/dir/file”, “bar”, 3, 3)
4 unlink(“/dir/log”)

Fails in ext3 ordered mode! Possible crash state: /dir/file “my boo”, /dir/log present but empty


SLIDE 13

Example

1 open(“/dir/log”, O_CREAT|O_WRONLY|O_TRUNC)
2 pwrite(“/dir/log”, log_entry, 0, 6)
2.5 fsync(“/dir/log”)
3 pwrite(“/dir/file”, “bar”, 3, 3)
3.5 fsync(“/dir/file”)
4 unlink(“/dir/log”)

Fails in ext3 ordered mode! (crash state: /dir/file “my boo”, /dir/log empty); adding the fsync() calls at 2.5 and 3.5 fixes this

SLIDE 14

Example

1 open(“/dir/log”, O_CREAT|O_WRONLY|O_TRUNC)
2 pwrite(“/dir/log”, log_entry, 0, 6)
2.5 fsync(“/dir/log”)
3 pwrite(“/dir/file”, “bar”, 3, 3)
3.5 fsync(“/dir/file”)
4 unlink(“/dir/log”)

Fails in ext3 writeback mode! Possible crash state: /dir/file “my foo”, /dir/log contains garbage (“&%^”)

SLIDE 15

Example

Fix for ext3 writeback mode: add a checksum to the log entry

1 open(“/dir/log”, O_CREAT|O_WRONLY|O_TRUNC)
2 pwrite(“/dir/log”, log_entry + cxsum, 0, 6)
2.5 fsync(“/dir/log”)
3 pwrite(“/dir/file”, “bar”, 3, 3)
3.5 fsync(“/dir/file”)
4 unlink(“/dir/log”)

SLIDE 16

Example

1 open(“/dir/log”, O_CREAT|O_WRONLY|O_TRUNC)
2 pwrite(“/dir/log”, log_entry + cxsum, 0, 6)
2.5 fsync(“/dir/log”)
3 pwrite(“/dir/file”, “bar”, 3, 3)
3.5 fsync(“/dir/file”)
4 unlink(“/dir/log”)

May fail in a new POSIX file system! Possible crash state: /dir/file “my boo”, with the log not visible in the directory

SLIDE 17

Example

Fix for the new POSIX file system: also fsync the parent directory

1 open(“/dir/log”, O_CREAT|O_WRONLY|O_TRUNC)
2 pwrite(“/dir/log”, log_entry + cxsum, 0, 6)
2.5 fsync(“/dir/log”)
2.7 fsync(“/dir”)
3 pwrite(“/dir/file”, “bar”, 3, 3)
3.5 fsync(“/dir/file”)
4 unlink(“/dir/log”)
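Putting the whole progression together, a hedged sketch in C of the hardened protocol: fsync() the log for ordered mode, embed a checksum in the log entry for writeback mode, fsync() the parent directory so the log's directory entry is durable on other POSIX file systems, and fsync() the file before deleting the log. The entry layout and the toy checksum are illustrative choices, not the talk's exact format.

/* Hardened log-update protocol: steps 1, 2, 2.5, 2.7, 3, 3.5, 4 from the slides.
 * Sketch under stated assumptions: the log-entry layout and the simple
 * additive checksum are illustrative, not the talk's exact format. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static uint32_t checksum(const void *buf, size_t len) {
    const unsigned char *p = buf;
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++) sum += p[i];   /* toy checksum */
    return sum;
}

int main(void) {
    char entry[64];
    const char *payload = "3, 3, bar";               /* offset 3, size 3, data "bar" */
    int n = snprintf(entry, sizeof(entry), "%s, %u", payload,
                     (unsigned)checksum(payload, strlen(payload)));

    /* 1 + 2: create the log and write entry + checksum */
    int log_fd = open("/dir/log", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (log_fd < 0) { perror("open log"); return 1; }
    pwrite(log_fd, entry, (size_t)n, 0);
    fsync(log_fd);                                   /* 2.5: log contents durable */
    close(log_fd);

    int dir_fd = open("/dir", O_RDONLY);
    if (dir_fd >= 0) { fsync(dir_fd); close(dir_fd); } /* 2.7: log's directory entry durable */

    /* 3: apply the update in place */
    int fd = open("/dir/file", O_WRONLY);
    if (fd < 0) { perror("open file"); return 1; }
    pwrite(fd, "bar", 3, 3);
    fsync(fd);                                       /* 3.5: the update is durable */
    close(fd);

    /* 4: delete the log; recovery only trusts a log whose checksum matches */
    unlink("/dir/log");
    return 0;
}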

SLIDE 18

How do file systems vary in implementing POSIX calls?

How do applications maintain crash consistency?

Built Block-Order-Breaker (BOB) and analyzed 6 file systems
Built Application-Level Intelligent Crash Explorer (ALICE) and analyzed 11 applications

SLIDE 19

Outline

  • Introduction
  • Background
  • Analyzing file systems with BOB
  • Analyzing applications with ALICE
  • Application Study
  • Conclusion and Future Work

SLIDE 20

POSIX Standard

The POSIX standard is extremely weak
Example: POSIX fsync() need not flush data to disk

From the man page for Mac OS X fsync():

Specifically, if the drive loses power or the OS crashes, the application may find that only some or none of their data was written.

Flushing data to disk requires F_FULLFSYNC fcntl
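A small illustration of the point above: F_FULLFSYNC is the real fcntl command on Mac OS X for flushing through the drive's cache; wrapping it with a plain fsync() fallback, as sketched here, is just one reasonable way to use it.

/* Force data to stable storage on Mac OS X: fsync() alone may leave
 * data in the drive's write cache, so issue F_FULLFSYNC when available. */
#include <fcntl.h>
#include <unistd.h>

int flush_to_disk(int fd) {
#ifdef F_FULLFSYNC
    if (fcntl(fd, F_FULLFSYNC) == 0)   /* asks the drive to flush its cache too */
        return 0;
#endif
    return fsync(fd);                  /* fallback: plain POSIX fsync */
}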

SLIDE 21

Unwritten Standard

Developers coded to an unwritten standard

  • Based on ext3 (default Linux fs for many years)

Widely held belief:

  • Correct behavior == ext3 behavior
  • POSIX guarantees not widely known

SLIDE 22

All was well until…

ext4 introduced delayed allocation

  • data writes could be persisted after rename()

Changed guarantees given to applications
Broke hundreds of applications

  • write(tmp); rename(tmp, old) led to zero-length files

Caused widespread data loss
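For reference, the safe version of the write-then-rename pattern (fsync the temporary file before rename(), and optionally fsync the directory afterwards) can be sketched in C as below; the file names are hypothetical and error handling is abbreviated.

/* Atomically replace "config" with new contents via a temporary file.
 * Without the fsync() before rename(), ext4 with delayed allocation could
 * leave a zero-length file after a crash. Paths are illustrative. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int replace_file(const char *dir, const char *tmp, const char *final_name,
                 const char *data) {
    int fd = open(tmp, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) return -1;
    if (write(fd, data, strlen(data)) < 0 || fsync(fd) < 0) { /* data durable first */
        close(fd);
        return -1;
    }
    close(fd);

    if (rename(tmp, final_name) < 0) return -1;  /* atomic swap of directory entries */

    int dfd = open(dir, O_RDONLY);               /* make the rename itself durable */
    if (dfd >= 0) { fsync(dfd); close(dfd); }
    return 0;
}

int main(void) {
    return replace_file(".", "config.tmp", "config", "new contents\n");
}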

SLIDE 23

FS developers: your application is broken! It doesn’t follow POSIX!


Application developers: your file system is broken!

SLIDE 24

File-system developers added code to bring back old behavior in certain cases

All this could have been avoided if application requirements were known to file-system developers
Our tool, ALICE, allows developers to determine these requirements

SLIDE 25

Outline

  • Introduction
  • Background
  • Analyzing file systems with BOB
  • Analyzing applications with ALICE
  • Application Study
  • Conclusion and Future Work

SLIDE 26

Analyzing File Systems

File systems implement POSIX calls differently
Study behavior via Persistence Properties:

  • define how system calls are persisted
  • affect which on-disk states are possible after a system crash
  • two classes: atomicity and ordering

SLIDE 27

Persistence Properties: Example

Consider the following code snippet:

write(f1, “AA”)
write(f2, “BB”)

Atomicity: after a crash, f1 and f2 may be empty, fully written (“AA”, “BB”), partially written (“A”, “B”), or have only their size updated with garbage contents (“YY”, “ZZ”)
Ordering: after a crash, f1 alone may be written, both may be written, or f2 alone may be written (an ordering violation)

SLIDE 28

Block-Order-Breaker

Built Block-Order-Breaker (BOB) to study persistence properties
Methodology:

  • Re-order block IO to construct legal disk images
  • Inspect file-system state on constructed images
  • Test whether persistence properties hold

SLIDE 29

Block-Order-Breaker (BOB)

Test workload designed to stress a persistence property, e.g. write(12K)
Capture the block-level trace

  • of writes, flushes, barriers (e.g. W1, W2, F, W3 applied to the Initial Disk State)

SLIDE 30

Block-Order-Breaker (BOB)

Reconstruct crash states from the trace (W1, W2, F, W3 over the Initial Disk State)
States are limited by flushes and barriers

SLIDE 31

Block-Order-Breaker (BOB)

Reconstruct a crash state by applying a subset of the writes (e.g. only W1) to the Initial Disk State
States are limited by flushes and barriers

SLIDE 32

Block-Order-Breaker (BOB)

The result is a Crashed Disk State (here containing only W1)
States are limited by flushes and barriers

SLIDE 33

Block-Order-Breaker (BOB)

Mount the file system on the Crashed Disk State

SLIDE 34

Block-Order-Breaker (BOB)

Mount the file system on the Crashed Disk State
Check if the persistence property is violated, e.g. is the entire write() data present?
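A toy sketch in C of the enumeration BOB performs: given an in-memory block trace with flush markers, every write before the last flush preceding the crash point is on disk, any subset of the later writes may be, and a checker tests whether the persistence property holds in each resulting state. This only illustrates the idea; the real tool replays traces onto disk images and mounts the actual file system.

/* Toy illustration of the BOB enumeration (not the real tool): a block
 * trace of writes and flushes; for a crash at any point, every write
 * before the last preceding flush is on disk, and any subset of the
 * writes after it may be. Check a persistence property in each state. */
#include <stdbool.h>
#include <stdio.h>

enum { WRITE, FLUSH };
struct op { int kind; int block; };          /* block written, if kind == WRITE */

#define NBLOCKS 8

/* Property under test: blocks 0..2 of a 12K write are either all absent
 * or all present (multi-block atomicity). */
static bool property_holds(const bool disk[NBLOCKS]) {
    int present = disk[0] + disk[1] + disk[2];
    return present == 0 || present == 3;
}

int main(void) {
    /* W1 W2 F W3, roughly as in the slides: a 12K write split into
     * three blocks with a flush in the middle (illustrative trace). */
    struct op trace[] = { {WRITE, 0}, {WRITE, 1}, {FLUSH, 0}, {WRITE, 2} };
    int n = sizeof(trace) / sizeof(trace[0]);
    int violations = 0;

    for (int crash = 0; crash <= n; crash++) {            /* crash point */
        int last_flush = 0;                                /* writes before this are durable */
        for (int i = 0; i < crash; i++)
            if (trace[i].kind == FLUSH) last_flush = i + 1;

        int pending[NBLOCKS], npending = 0;                /* writes that may or may not hit disk */
        bool base[NBLOCKS] = {false};
        for (int i = 0; i < crash; i++) {
            if (trace[i].kind != WRITE) continue;
            if (i < last_flush) base[trace[i].block] = true;
            else pending[npending++] = trace[i].block;
        }

        for (int mask = 0; mask < (1 << npending); mask++) {  /* every subset of pending writes */
            bool disk[NBLOCKS];
            for (int b = 0; b < NBLOCKS; b++) disk[b] = base[b];
            for (int j = 0; j < npending; j++)
                if (mask & (1 << j)) disk[pending[j]] = true;
            if (!property_holds(disk)) violations++;
        }
    }
    printf("crash states violating the property: %d\n", violations);
    return 0;
}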

SLIDE 35

Caveats

BOB can be used to find:

  • Which properties don't hold
  • The exact case where the property fails

BOB does not explore all workloads
BOB cannot be used to prove a file system has a specific persistence property

SLIDE 36

Studying File Systems

Used BOB to study how properties varied over file systems

Studied six file systems

  • ext2, ext3, ext4, btrfs, xfs, reiserfs
  • A total of 16 configurations

SLIDE 37

Study Results: Atomicity

(Table: atomicity properties across 12 configurations: ext2 async, ext2 sync, ext3 writeback, ext3 ordered, ext3 journal, ext4 writeback, ext4 ordered, ext4 no-delalloc, ext4 journal, btrfs, xfs default, xfs wsync)

SLIDE 38

Study Results: Atomicity

Property tested: Single Sector (is write(512) atomic?)

SLIDE 39

Study Results: Atomicity

(Same table as Slide 38.)

SLIDE 40

Study Results: Atomicity

Property tested: Multi Sector (is write(1GB) atomic?)

SLIDE 41

Study Results: Atomicity

(Result: multi-sector writes are marked non-atomic in all 12 configurations)

SLIDE 42

Study Results: Atomicity

Property tested: Append Content (is open(file, O_APPEND); write(12K) atomic?)

SLIDE 43

Study Results: Atomicity

(Result: 16 non-atomic marks so far across the Multi Sector and Append Content rows)

SLIDE 44

Study Results: Atomicity

Property tested: Directory Operation (is rename(old, new) atomic?)

SLIDE 45

Study Results: Atomicity

(Result: 18 non-atomic marks in total across the Single Sector, Append Content, Multi Sector, and Directory Operation rows)

SLIDE 46

Study Results: Ordering

(Table: ordering properties across the same 12 configurations)

SLIDE 47

Study Results: Ordering

Property tested: Overwrite -> any op (e.g. is write(4K) persisted before a later rename()?)

SLIDE 48

Study Results: Ordering

(Result: Overwrite -> any op is marked as not guaranteed in 8 of the 12 configurations)

SLIDE 49

Study Results: Ordering

Property tested: Append -> any op (13 ordering-violation marks so far)

SLIDE 50

Study Results: Ordering

Property tested: Dir op -> any op (15 ordering-violation marks so far)

SLIDE 51

Study Results: Ordering

Properties tested: Overwrite -> any op, Append -> any op, Dir op -> any op, Append(f) -> rename(f)
(Result: 18 ordering-violation marks in total across the 12 configurations)


SLIDE 53

File-System Study Results

Persistence properties vary widely among file systems

  • Even within different configurations of the same file system

Applications should not rely on persistence properties
Testing application correctness on a single file system is not enough

SLIDE 54

Outline

  • Introduction
  • Background
  • Analyzing file systems with BOB
  • Analyzing applications with ALICE
  • Application Study
  • Conclusion and Future Work

SLIDE 55

Application-Level Intelligent Crash Explorer (ALICE)

ALICE: a tool to find application crash vulnerabilities

Crash vulnerabilities:

  • code that depends on specific persistence properties for correct behavior
  • ex: if the file system doesn't persist two system calls in order, it leads to data corruption

SLIDE 56

ALICE Methodology

Construct a crash state by violating a single persistence property
Run the application on the crash state (allow recovery)
Examine the application state
If the application is inconsistent, it depended on the persistence property violated in the crash state

SLIDE 57

ALICE Overview

(Diagram: Application Workload “git add file1” -> System-Call Trace: creat(index.lock), creat(tmp), append(tmp, 4K), fsync(tmp), link(tmp, perm) -> Crash State Constructor, guided by an FS Abstract Persistence Model -> Crash States -> Application Checker “git status”, which reports ERROR)

SLIDE 58

ALICE Overview

(ALICE pipeline diagram repeated; see Slide 57)

SLIDE 59

Tracing the Workload

Run the application workload
Collect the system-call trace
System calls are converted into logical operations:

  • Abstract away current file offset, fd, etc.
  • Group writev(), pwrite(), etc. into a single type of operation

SLIDE 60

ALICE Overview

(ALICE pipeline diagram repeated; see Slide 57)

SLIDE 61

Constructing Crash States

ALICE constructs crash states by applying a subset of operations to the initial disk image

(Diagram: Initial Disk State + a subset of {creat(index.lock), creat(tmp), append(tmp, 4K), fsync(tmp), link(tmp, perm)} -> Crash State)

SLIDE 62

Constructing Crash States

Persistence Properties Violated:

  • 1. Atomicity across system calls (Method: apply a prefix of operations)
  • 2. Atomicity within system calls

Trace: creat(index.lock), creat(tmp), append(tmp, 4K), fsync(tmp), link(tmp, perm)
SLIDE 63

Constructing Crash States

Persistence Properties Violated:

  • 1. Atomicity across system calls (Method: apply a prefix of operations)
  • 2. Atomicity within system calls (Method: apply a prefix + a partial operation)

Trace: creat(index.lock), creat(tmp), append(tmp, 4K), fsync(tmp), link(tmp, perm)

SLIDE 64

Constructing Crash States

Persistence Properties Violated:

  • 1. Atomicity across system calls (Method: apply a prefix of operations)
  • 2. Atomicity within system calls (Method: apply a prefix + a partial operation, e.g. append(tmp, 4K) split into append(tmp, 512) … append(tmp, 512))

Trace: creat(index.lock), creat(tmp), append(tmp, 4K), fsync(tmp), link(tmp, perm)

SLIDE 65

Constructing Crash States

(Same as Slide 64, with the partial-operation split highlighted.)

SLIDE 66

Constructing Crash States

Persistence Properties Violated:

  • 3. Ordering among system calls (Method: ignore an operation, then apply a prefix)

Trace: creat(index.lock), creat(tmp), append(tmp, 4K), fsync(tmp), link(tmp, perm)
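A rough sketch in C of the three construction methods from these slides, applied to the git trace: apply a prefix of operations (atomicity across system calls), apply a prefix plus a partially applied final operation (atomicity within a system call), and ignore one earlier operation before applying a prefix (ordering). It only prints descriptions of the crash states; the real tool materializes each state on disk and runs the application's checker.

/* Enumerate ALICE-style crash states for a small logical-operation trace
 * (the git-add trace from the slides). Illustrative only: it just prints
 * a description of each constructed state. */
#include <stdio.h>

static const char *trace[] = {
    "creat(index.lock)", "creat(tmp)", "append(tmp, 4K)",
    "fsync(tmp)", "link(tmp, perm)"
};
static const int N = sizeof(trace) / sizeof(trace[0]);

static void show(const char *label, int prefix, int partial, int skipped) {
    printf("%s: apply ops 1..%d", label, prefix);
    if (partial) printf(" with op %d only partially applied", prefix);
    if (skipped >= 0) printf(", ignoring op %d", skipped + 1);
    printf("\n");
}

int main(void) {
    /* 1. Atomicity across system calls: apply a prefix of operations. */
    for (int p = 0; p <= N; p++)
        show("prefix", p, 0, -1);

    /* 2. Atomicity within a system call: prefix + a partial final op
     *    (e.g. append(tmp, 4K) split into 512-byte pieces). */
    for (int p = 1; p <= N; p++)
        show("partial", p, 1, -1);

    /* 3. Ordering among system calls: ignore one earlier op, then apply a prefix. */
    for (int p = 1; p <= N; p++)
        for (int skip = 0; skip < p; skip++)
            show("reorder", p, 0, skip);
    return 0;
}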


SLIDE 68

ALICE Overview

(ALICE pipeline diagram repeated; see Slide 57)

SLIDE 69

FS Abstract Persistence Model

Each file system implements persistence properties differently

  • Ex: ext4 orders writes of a file before its rename

An APM defines which crash states are permitted
An APM defines atomicity and ordering constraints
APMs allow ALICE to model file-system behavior without the file-system implementation
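One way to picture what an APM might contain, purely as an illustration (this is not the tool's actual data structure): per-operation atomicity granularities plus a few ordering rules, such as whether a file's appends are persisted before its rename. The values below are made up for the example.

/* Illustrative-only sketch of what an abstract persistence model (APM)
 * could record: atomicity granularities and ordering constraints that the
 * crash-state constructor is allowed to assume. Not the tool's real
 * data structures; the values below are made up. */
#include <stdio.h>

typedef enum {
    ATOMIC_NONE,     /* the operation may be torn arbitrarily */
    ATOMIC_SECTOR,   /* torn only at 512-byte boundaries */
    ATOMIC_WHOLE     /* all-or-nothing */
} atomicity_t;

typedef struct {
    atomicity_t overwrite_atomicity;
    atomicity_t append_atomicity;
    atomicity_t dirop_atomicity;
    int orders_appends_before_rename;  /* e.g. the ext4 behavior mentioned above */
    int orders_all_operations;         /* a fully ordered file system */
} apm_t;

int main(void) {
    apm_t example = { ATOMIC_SECTOR, ATOMIC_NONE, ATOMIC_WHOLE, 1, 0 };  /* made-up values */
    printf("appends ordered before rename: %d\n", example.orders_appends_before_rename);
    return 0;
}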

SLIDE 70

ALICE Overview

(ALICE pipeline diagram repeated; see Slide 57)

SLIDE 71

Finding Crash Vulnerabilities

Identify persistence property violated

Identify the system calls involved
Identify the source-code lines involved

(Trace: creat(index.lock), creat(tmp), append(tmp, 4K), fsync(tmp), link(tmp, perm))

SLIDE 72

ALICE Overview

(ALICE pipeline diagram repeated; see Slide 57)

SLIDE 73

ALICE Limitations

Not complete

  • does not execute all code paths in application
  • does not explore all crash states
  • does not test combinations of persistence property violations (ex: atomicity + ordering)

Cannot prove an update protocol is correct

SLIDE 74

Outline

  • Introduction
  • Background
  • Analyzing file systems with BOB
  • Analyzing applications with ALICE
  • Application Study
  • Conclusion and Future Work

SLIDE 75

Application Study

Used ALICE to study eleven applications

  • Version Control Systems: Git, Mercurial
  • Key-Value Stores: LevelDB (1.10 and 1.15), GDBM, LMDB
  • Relational Databases: SQLite, PostgreSQL, HSQLDB
  • Distributed Systems: HDFS, ZooKeeper
  • Virtualization Platforms: VMWare Player

SLIDE 76

Study Goals

Analyzed applications using weak APM

  • Minimum constraints on possible crash states

Sought to answer:

  • Which persistence properties do applications depend upon?
  • What are the consequences of vulnerabilities?
  • How many vulnerabilities occur on today's file systems?

Did not seek to compare applications

SLIDE 77

Study: Setup

What is correct behavior for an application?

  • We use guarantees in documentation
  • In case of no documentation, we assume typical user expectations (“committed data is durable”)

Configurations change guarantees

  • We test each configuration separately
  • Tested 34 configurations across 11 applications

Post-crash, we run all appropriate application recovery mechanisms

SLIDE 78

Example: Git

git commit trace:

mkdir(o/x)
creat(o/x/tmp_y)
append(o/x/tmp_y)
fsync(o/x/tmp_y)
link(o/x/tmp_y, o/x/y)
unlink(o/x/tmp_y)
do(store object)
creat(branch.lock)
append(branch.lock)
append(branch.lock)
append(logs/branch)
append(logs/HEAD)
rename(branch.lock, x/branch)
stdout(“finished commit”)

(The first six calls form the “store object” phase of the commit.)

SLIDE 79

Example: Git

(Same git commit trace; this build highlights an atomicity vulnerability.)

SLIDE 80

Example: Git

(Same git commit trace; this build highlights an ordering vulnerability.)


SLIDE 82

Example: Git

(Same git commit trace; this build highlights a durability vulnerability.)

SLIDE 83

Vulnerability Types

(Chart: number of vulnerabilities per application (Git, Mercurial, LevelDB-1.10, LevelDB-1.15, GDBM, LMDB, PostgreSQL, HSQLDB, SQLite, HDFS, ZooKeeper, VMWare Player), broken down by type: multi-call atomicity, single-call atomicity, ordering, durability)

SLIDE 84

Vulnerability Types

(Chart build; the complete chart is on Slide 88.)

SLIDE 85

Vulnerability Types

(Chart build; the complete chart is on Slide 88.)

SLIDE 86

Vulnerability Types

(Chart build; the complete chart is on Slide 88.)

SLIDE 87

Vulnerability Types

(Chart build; the complete chart is on Slide 88.)

SLIDE 88

Vulnerability Types

(Chart: per-application vulnerability counts by type: multi-call atomicity, single-call atomicity, ordering, durability)

60 vulnerabilities across 11 applications

SLIDE 89

Vulnerability Consequences

(Chart: number of vulnerabilities per application, broken down by consequence: silent errors, data loss, cannot open, failed reads/writes, misc. Build; the complete chart is on Slide 94.)

SLIDE 90

Vulnerability Consequences

(Chart build; the complete chart is on Slide 94.)

SLIDE 91

Vulnerability Consequences

(Chart build; the complete chart is on Slide 94.)

SLIDE 92

Vulnerability Consequences

(Chart build; the complete chart is on Slide 94.)

SLIDE 93

Vulnerability Consequences

(Chart build; the complete chart is on Slide 94.)

SLIDE 94

Vulnerability Consequences

(Chart: per-application vulnerability counts by consequence: silent errors, data loss, cannot open, failed reads/writes, misc)

Many vulnerabilities result in data loss, silent errors, and failed reads/writes

SLIDE 95

Vulnerabilities on Current File Systems

(Bar chart: #vulnerabilities exposed under the Weak APM and under ext3-writeback, ext3-ordered, ext3-journal, ext4-ordered, and btrfs. Weak APM: 60.)

SLIDE 96

Vulnerabilities on Current File Systems

(Chart build; the complete chart is on Slide 99.)

SLIDE 97

Vulnerabilities on Current File Systems

(Chart build; the complete chart is on Slide 99.)

SLIDE 98

Vulnerabilities on Current File Systems

(Chart build; the complete chart is on Slide 99.)

SLIDE 99

Vulnerabilities on Current File Systems

(Bar chart: #vulnerabilities exposed. Weak APM: 60; the five current file-system configurations (ext3-writeback, ext3-ordered, ext3-journal, ext4-ordered, btrfs) expose 10, 12, 16, 17, and 31 vulnerabilities, with btrfs at 31.)

Every current file system exposes at least one vulnerability; btrfs exposes more than half

SLIDE 100

Observations

Applications very careful in overwriting user data

  • None required atomicity for multi-block overwrites

Applications not as careful in appending to logs

  • Multi-block appends require prefix atomicity
  • Ex: write(“ABC”) should result in “A”/“AB”/“ABC”

Atomicity across system calls doesn't seem useful
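A common way applications cope with appends that are only prefix-atomic (or whose tail may contain garbage, as in the writeback-mode example earlier) is to frame each log record with its length and a checksum, so recovery can stop at the first torn record. The sketch below is a generic illustration of that idea, not the specific fix used by any of the studied applications; the file name and record layout are hypothetical.

/* Append a framed log record: [length][checksum][payload], then fsync.
 * If a crash leaves only a prefix (or garbage), recovery detects the
 * broken tail by re-checking the length and checksum. Illustrative sketch. */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

static uint32_t checksum(const void *buf, size_t len) {
    const unsigned char *p = buf;
    uint32_t sum = 5381;
    for (size_t i = 0; i < len; i++) sum = sum * 33 + p[i];
    return sum;
}

int append_record(int fd, const void *payload, uint32_t len) {
    uint32_t header[2] = { len, checksum(payload, len) };
    if (write(fd, header, sizeof(header)) != (ssize_t)sizeof(header)) return -1;
    if (write(fd, payload, len) != (ssize_t)len) return -1;
    return fsync(fd);               /* record is durable (or detectably absent) */
}

int main(void) {
    int fd = open("app.log", O_CREAT | O_WRONLY | O_APPEND, 0644);
    if (fd < 0) return 1;
    const char msg[] = "committed txn 42";
    int rc = append_record(fd, msg, sizeof(msg) - 1);
    close(fd);
    return rc == 0 ? 0 : 1;
}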

SLIDE 101

Observations

Update protocols spread over layers and files

  • Ex: HSQLDB has 3 consecutive fsync() calls

Recovery code is poorly written and untested

  • Ex: LevelDB recovery does not correct errors

Documentation unclear or misleading

  • SQLite by default does not provide durability
  • GDBM_SYNC does not ensure durability

SLIDE 102

Reporting Vulnerabilities

Developers generally suspicious when we reported vulnerabilities

  • Dev #1: “Maybe it is the disk”
  • Dev #2: “File systems don't behave that way”

Tough to reproduce without tools like ALICE
Developers are acting on five vulnerabilities

  • Vulnerabilities in LevelDB, HDFS, ZooKeeper

SLIDE 103

Outline

  • Introduction
  • Background
  • Analyzing file systems with BOB
  • Analyzing applications with ALICE
  • Application Study
  • Conclusion and Future Work

SLIDE 104

Summary

Built BOB to study persistence properties

  • Studied 16 configurations of 6 file systems
  • Persistence properties vary widely

Built ALICE to study application-level crash-consistency protocols

  • Studied 11 applications
  • Found 60 vulnerabilities across all applications

SLIDE 105

Application-Level Consistency in the Cloud

Cloud computing and software-defined storage make this problem worse

  • Increased storage-stack diversity
  • Multiple storage media, file systems, etc.

Applications deployed in multiple environments

  • Can't rely on specific persistence properties

SLIDE 106

Portability in the Cloud

Need to match application requirements to storage-stack guarantees
Challenges:

  • specifying application requirements
  • computing how layers build on each other to provide guarantees
  • checking if requirements are met

SLIDE 107

Matching Applications to Stacks

Use a formal language (like Isar) to specify application requirements
Specify the high-level design of stack layers in Isar
Use proof assistants (like Isabelle) to verify that the requirements are provided by the stack
Do this on-the-fly as storage stacks are constructed

SLIDE 108

Benefits of Matching

Currently, applications are coarsely matched to storage stacks
Stacks provide either too much or too little
Verifying application correctness allows construction of optimal stacks

  • Cheapest stack that satisfies application requirements

SLIDE 109

Conclusion

Applications are moving to the cloud

  • For performance
  • For ease of use or availability
  • Correctness is often forgotten

To unlock the true potential of the cloud, portable applications need to be created
Figuring out application requirements is the first step towards this vision

SLIDE 110

Thank You

Source code: http://research.cs.wisc.edu/adsl/Software/alice/

Questions?