Verifying a high-performance crash-safe file system using a tree specification - PowerPoint PPT Presentation



slide-1
SLIDE 1

Verifying a high-performance crash-safe file system using a tree specification

Haogang Chen, Tej Chajed, Stephanie Wang, Alex Konradi, Atalay İleri, Adam Chlipala, M. Frans Kaashoek, Nickolai Zeldovich

slide-2
SLIDE 2

File systems are difficult to make correct

  • Complicated implementations
  • on-disk layout
  • in-memory data structures
  • Computer can crash at any time

2

slide-3
SLIDE 3

Despite much effort, file systems have bugs

  • File systems still have subtle bugs
  • Well documented [Lu, TOS ’14] [Min, SOSP ’15]
  • Example from ext4:


combination of two optimizations allows data to leak from one file to another on crash

  • Discovered after 6 years [Kara 2014]

3

slide-4
SLIDE 4

Approach: formal verification

  • Write a specification
  • Prove implementation meets the specification
  • Ensures implementation handles all corner cases
  • Proof assistant (Coq) ensures proof is correct
  • Avoids a large class of bugs

4

slide-5
SLIDE 5

Existing verified file systems

5

FSCQ [SOSP ’15] BilbyFS [ASPLOS ’16] Yggdrasil [OSDI ’16] ext4 btrfs ZFS

correctness performance

verified file systems

slide-6
SLIDE 6

Goal: verified high-performance file system

6

FSCQ [SOSP ’15] BilbyFS [ASPLOS ’16] Yggdrasil [OSDI ’16] ext4 btrfs ZFS

correctness performance

?

verified file systems

slide-7
SLIDE 7

Strawman: optimize FSCQ

7

correctness performance

FSCQ code

slide-8
SLIDE 8

Strawman: optimize FSCQ

7

correctness performance

FSCQ code spec proof

slide-9
SLIDE 9

Strawman: optimize FSCQ

7

correctness performance

FSCQ code spec proof fast FSCQ proof?

slide-10
SLIDE 10

Problem: specification incompatible with high performance

  • Achieving high performance requires optimizations
  • Some optimizations change file-system behavior
  • Requires changes to specification

8

slide-11
SLIDE 11

Example optimization: deferred commit

  • Deferred commit: buffer system calls until fsync
  • FSCQ's specification: "if create(f) has returned and the computer crashes, f exists"
  • Deferred commit requires a new specification

9

slide-12
SLIDE 12

Optimizations that change crash behavior

  • Deferred commit: buffer system calls until fsync
  • Log-bypass writes: skip log for data writes
  • Buffer cache: cache data until fdatasync
  • Existing specifications do not support these optimizations

10

slide-13
SLIDE 13

Contribution: DFSCQ file system

  • Precise specification for a subset of POSIX
  • supports deferred commit and log-bypass writes
  • Verified, crash-safe file system
  • Traditional journalling file-system design
  • Implements most of ext4's optimizations
  • Machine-checked proof that implementation meets specification
  • Performance on par with ext4 (but DFSCQ has fewer features)

11

slide-14
SLIDE 14

Specifying a file system

  • Design abstract state

12

slide-15
SLIDE 15

Specifying a file system

  • Design abstract state
  • Describe how system calls execute

12

slide-16
SLIDE 16

Specifying a file system

  • Design abstract state
  • Describe how system calls execute
  • Describe effect of crashes

12

slide-17
SLIDE 17

Starting point: tree as abstract state

Trees are a simplified abstraction of a file system

13

g f

slide-18
SLIDE 18

Specification abstracts implementation details

14

g f

abstract state vs. implementation's state

slide-19
SLIDE 19

Specify how system calls affect abstract state

15

unlink(g) g f f unlink(g)

specification describes transition

slide-20
SLIDE 20

Challenges in specifying crash behavior

  • Optimizations mean crashes can be complex
  • Problem 1: deferred commit
  • Problem 2: log-bypass writes
  • Problem 3: caching

16

slide-21
SLIDE 21

Problem 1: deferred commit leads to many crash states

17

unlink(g) g f f

slide-22
SLIDE 22

Problem 1: deferred commit leads to many crash states

17

unlink(g) g f f

crash: reset memory

slide-23
SLIDE 23

Problem 1: deferred commit leads to many crash states

17

unlink(g) g f f

crash: reset memory

g f f

slide-24
SLIDE 24

How do we specify crash outcomes with deferred commit?

18

f g f

slide-25
SLIDE 25

How do we specify crash outcomes with deferred commit?

18

f g f crash

slide-26
SLIDE 26

tree sequence

Specify deferred commit using tree sequences

19

f g

slide-27
SLIDE 27

tree sequence

Specify deferred commit using tree sequences

  • Abstract state is a sequence of trees

19

f g

slide-28
SLIDE 28

tree sequence

Specify deferred commit using tree sequences

  • Abstract state is a sequence of trees
  • Always read from the latest tree

19

f g

slide-29
SLIDE 29

Specify deferred commit using tree sequences

  • Metadata updates add new trees in the specification
  • Always read from the latest tree

20

unlink(g) f g f f g

slide-30
SLIDE 30

Specify deferred commit using tree sequences

  • Metadata updates add new trees in the specification
  • Always read from the latest tree

21

f g f

slide-31
SLIDE 31

Specify deferred commit using tree sequences

  • Metadata updates add new trees in the specification
  • Always read from the latest tree

22

truncate(f,2) f g f f f g f

slide-32
SLIDE 32

Specify deferred commit using tree sequences

  • Metadata updates add new trees in the specification
  • Always read from the latest tree

23

f g f f

slide-33
SLIDE 33

Specify deferred commit using tree sequences

  • Metadata updates add new trees in the specification
  • Always read from the latest tree

24

f g f f rename(f,/) f f g f f

slide-34
SLIDE 34

tree sequence

Behavior of tree sequences on crash

  • What about crash behavior?

25

f g f f f

slide-35
SLIDE 35

tree sequence

Behavior of tree sequences on crash

  • What about crash behavior?

25

f g post-crash tree sequence

crash

f g f f f

slide-36
SLIDE 36

Crash specification allows background commits

26

post-crash states: f f f g f tree sequence f g f f f crash

slide-37
SLIDE 37

Specification for fsync

27

f g f f f f fsync("/")

slide-38
SLIDE 38

Problem 2: log-bypass writes may reorder updates

  • Log-bypass writes: update file data blocks in place, skipping log

28

f f

rename

f

write

slide-39
SLIDE 39

Problem 2: log-bypass writes may reorder updates

  • Log-bypass writes: update file data blocks in place, skipping log
  • Effect: data writes and metadata updates can be reordered on crash

28

f f

rename

f

write crash

f

slide-40
SLIDE 40

Log-bypass writes

29

f g f f

At minimum, writes to latest tree

f g f f f f write(f,…)

slide-41
SLIDE 41

Log-bypass writes

30

Affects the same file in earlier trees

f g f f f f g f f f write(f,…)

slide-42
SLIDE 42

Specify that other files are unaffected

31

f g f f

Puts an obligation on the implementation to avoid block re-use within a tree sequence

b21 b21 b21 f g f f f f

?

write(f,…)

slide-43
SLIDE 43

Specify that other files are unaffected

32

f g f f b21 b21 f g f f f f

Puts an obligation on the implementation to avoid block re-use within a tree sequence

write(f,…) b21

slide-44
SLIDE 44

Specify that other files are unaffected

32

f g f f b21 b21 b51 f g f f f f

Puts an obligation on the implementation to avoid block re-use within a tree sequence

write(f,…) b21 b51

slide-45
SLIDE 45

Problem 3: data writes are cached

  • Write-back buffer cache

33

f f

write

f

crash

slide-46
SLIDE 46

Problem 3: data writes are cached

  • Write-back buffer cache
  • Data can be persisted in any order

33

f f

write

f f f

crash

f

slide-47
SLIDE 47

Specifying data caching: block sets

34

f g f

uncached: two possible values, old and new

f f

slide-48
SLIDE 48

f

Behavior of block sets on crash

f f g f f f g f f crash

slide-49
SLIDE 49

Behavior of block sets on crash

f f g f f f g f f f f f

two degrees of non-determinism in crash states:

crash

slide-50
SLIDE 50

Behavior of block sets on crash

f f g f f f g f f f f f

specification allows metadata and data updates to be reordered

two degrees of non-determinism in crash states:

crash

slide-51
SLIDE 51

Specification for fdatasync

37

f g f f f fdatasync(f)

slide-52
SLIDE 52

Specification for fdatasync

38

f g f f fdatasync(f) f g f f

fdatasync specification says block sets collapse in every tree

f f

slide-53
SLIDE 53

Summary: DFSCQ’s tree-based specifica6on

  • metadata operations add a new tree
  • fsync collapses to the latest tree
  • writes update block sets in every tree
  • fdatasync collapses block sets in every tree

39
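The four rules above can be sketched as a small executable model. This is only an illustrative toy, not DFSCQ's Coq definitions; the name TreeSeq and its representation (a tree maps file name to a list of blocks, each block a list of candidate on-disk values with the newest last) are invented here.

```python
import copy
import itertools

class TreeSeq:
    """Toy model of the tree-sequence abstract state (illustrative only)."""

    def __init__(self, tree):
        self.trees = [tree]  # oldest .. latest

    def metadata_op(self, op):
        # metadata operations (unlink, rename, ...) add a new tree
        self.trees.append(op(copy.deepcopy(self.trees[-1])))

    def fsync(self):
        # fsync collapses the sequence to the latest tree
        self.trees = [self.trees[-1]]

    def write(self, fname, blk, value):
        # log-bypass write: the new value becomes a candidate for this
        # block in *every* tree that contains the file
        for tree in self.trees:
            if fname in tree:
                tree[fname][blk].append(value)

    def fdatasync(self, fname):
        # fdatasync collapses the file's block sets in every tree
        for tree in self.trees:
            if fname in tree:
                tree[fname] = [[blk[-1]] for blk in tree[fname]]

    def crash_states(self):
        # a crash may expose any tree, with any candidate value per block
        states = []
        for tree in self.trees:
            names = sorted(tree)
            per_file = [list(itertools.product(*tree[f])) for f in names]
            for combo in itertools.product(*per_file):
                states.append({n: list(c) for n, c in zip(names, combo)})
        return states

ts = TreeSeq({"f": [["old"]], "g": [["x"]]})
ts.metadata_op(lambda t: {n: b for n, b in t.items() if n != "g"})  # unlink(g)
ts.write("f", 0, "new")
states = ts.crash_states()  # 4 states: g present or not, f old or new
ts.fdatasync("f")
ts.fsync()                  # now only {"f": ["new"]} remains
```

The example mirrors the slides: unlink(g) adds a tree, the write leaves both old and new data possible in every tree, and fdatasync plus fsync narrow the crash states down to one.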

slide-54
SLIDE 54

Prove implementation meets specification

40

length: 2 type: file … stat(g) g f g f length: 2 type: file … stat(g)

slide-55
SLIDE 55

Prove implementation meets specification

40

length: 2 type: file … stat(g) g f

return values match

g f length: 2 type: file … stat(g)

slide-56
SLIDE 56

Prove implementation meets specification

40

unlink(g) length: 2 type: file … stat(g) g f

return values match

g f g f f length: 2 type: file … stat(g) unlink(g)

slide-57
SLIDE 57

Prove implementation meets specification

40

unlink(g) length: 2 type: file … stat(g) g f

return values match; disk continues to relate to abstract state

g f g f f length: 2 type: file … stat(g) unlink(g)

slide-58
SLIDE 58

DFSCQ Design

41

buffer cache logging checksums deferred commit log-bypass API block allocator free-bit cache avoid re-use inode k-indirect blocks dirty blocks directory name cache

slide-59
SLIDE 59

Many single-layer optimizations

42

  • Affect only proof of single layer

buffer cache logging checksums deferred commit log-bypass API block allocator free-bit cache avoid re-use inode k-indirect blocks dirty blocks directory name cache

slide-60
SLIDE 60

Many single-layer optimizations

42

  • Affect only proof of single layer

buffer cache logging checksums deferred commit log-bypass API block allocator free-bit cache avoid re-use inode k-indirect blocks dirty blocks directory name cache cache free blocks

slide-61
SLIDE 61

Many single-layer optimizations

42

  • Affect only proof of single layer

buffer cache logging checksums deferred commit log-bypass API block allocator free-bit cache avoid re-use inode k-indirect blocks dirty blocks directory name cache improves performance with no change to abstraction cache free blocks

slide-62
SLIDE 62

Cross-layer optimizations

43

buffer cache logging checksums deferred commit log-bypass API block allocator free-bit cache avoid re-use inode k-indirect blocks dirty blocks directory name cache

  • Break abstraction boundaries

  • Complicate proofs
  • Good for performance
slide-63
SLIDE 63

Cross-layer optimizations

43

buffer cache logging checksums deferred commit log-bypass API block allocator free-bit cache avoid re-use inode k-indirect blocks dirty blocks directory name cache

  • Break abstraction boundaries

  • Complicate proofs
  • Good for performance

track dirty blocks in the cache

slide-64
SLIDE 64

Cross-layer optimizations

43

buffer cache logging checksums deferred commit log-bypass API block allocator free-bit cache avoid re-use inode k-indirect blocks dirty blocks directory name cache

  • Break abstraction boundaries

  • Complicate proofs
  • Good for performance

records dirent offset from inode layer track dirty blocks in the cache

slide-65
SLIDE 65

Implementation and proof

  • Extend FSCQ [SOSP ’15]
  • 75,000 lines of Coq (compared to 31,000 in FSCQ)

44

specification code proofs

Coq

OK

Coq proof checker

slide-66
SLIDE 66

Running DFSCQ

45

code

Coq code extraction

implementa6on

Haskell

slide-67
SLIDE 67

Running DFSCQ

45

code

Coq

DFSCQ FUSE server

GHC code extraction

implementa6on

Haskell

FUSE interface

Haskell

+

slide-68
SLIDE 68

Performance evaluation

  • Several workloads
  • micro benchmarks
  • application workloads
  • Compare with ext4 in default mode
  • Running on an SSD on a desktop

46

(see paper for more results)

slide-69
SLIDE 69

DFSCQ is competitive with ext4

47

[chart: smallfile benchmark (files/s), FSCQ vs DFSCQ vs ext4]

slide-70
SLIDE 70

DFSCQ is competitive with ext4

47

[charts: smallfile (files/s) and largefile (MB/s), FSCQ vs DFSCQ vs ext4]

slide-71
SLIDE 71

DFSCQ is competitive with ext4

  • DFSCQ still has high CPU overhead compared to ext4
  • Haskell code allocates large amounts of memory

47

[charts: smallfile (files/s) and largefile (MB/s), FSCQ vs DFSCQ vs ext4]

slide-72
SLIDE 72

DFSCQ outperforms ext4 on mailbench

48

[chart: mailbench (msgs/s), FSCQ vs DFSCQ vs ext4]

slide-73
SLIDE 73

DFSCQ outperforms ext4 on mailbench

48

[chart: mailbench (msgs/s), FSCQ vs DFSCQ vs ext4]

  • mailbench simulates a qmail-like mail server
  • metadata and fsync-heavy workload
slide-74
SLIDE 74

SQLite on DFSCQ is competitive with ext4

49

[chart: TPC-C on SQLite (txns/s), FSCQ vs DFSCQ vs ext4]

slide-75
SLIDE 75

SQLite on DFSCQ is competitive with ext4

49

[chart: TPC-C on SQLite (txns/s), FSCQ vs DFSCQ vs ext4]

  • Write-heavy database workload
  • DFSCQ issues less I/O, but has higher CPU overhead
slide-76
SLIDE 76

Future work

  • Reduce CPU overhead
  • Concurrency

50

slide-77
SLIDE 77

Summary

  • DFSCQ: verified, efficient, crash-safe file system
  • Precise tree-based specification of deferred commit and log-bypass writes

  • Proof that implementation meets specification
  • Performance on par with Linux ext4

51

https://github.com/mit-pdos/fscq

slide-78
SLIDE 78

Backup slides

  • ext4 async commit + log-bypass bug
  • verification architecture diagram
  • write-ahead logging
  • group commit
  • log-bypass writes
  • deferred commit and log-bypass perf
  • spec example
  • fsync(2)
  • atomic write
  • FUSE architecture
  • LOC

52

slide-79
SLIDE 79

Optimizations are hard to implement correctly

Subtle interaction between optimizations

  • bug where crash could leak data in Linux ext4
  • discovered after 6 years

ext4 now forbids both optimizations

53

Author: Jan Kara <jack@suse.cz>
Date:   Tue Nov 25 20:19:17 2014 -0500

    ext4: forbid journal_async_commit in data=ordered mode [...]

slide-80
SLIDE 80

Approach to avoid bugs: verification

54

disk hardware file system application

specification

verify FS correct

specification

verify applications

slide-81
SLIDE 81

Write-ahead logging

  • System calls can update multiple disk blocks

55

create(‘d/a’)

(address, block)

slide-82
SLIDE 82

Write-ahead logging

  • System calls can update multiple disk blocks
  • Logging ensures all updates are persisted or none, even if the computer crashes

55

disk log data create(‘d/a’)

(address, block)
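The all-or-nothing behavior of the log can be sketched as a toy Python model. Disk, commit, install, and recover are names invented for this sketch, not DFSCQ's verified code; the point is only that a single commit-record write decides whether all of a syscall's block updates survive a crash, or none do.

```python
class Disk:
    """Toy disk with a log region and a data region (illustrative only)."""
    def __init__(self):
        self.log = []           # [(address, block), ...]
        self.committed = False  # the commit record
        self.data = {}          # address -> block

def install(disk):
    # copy committed log entries into the data region, then clear the log
    for addr, block in disk.log:
        disk.data[addr] = block
    disk.log = []
    disk.committed = False

def commit(disk, updates):
    disk.log = list(updates)  # 1. write every update to the log
    disk.committed = True     # 2. one write makes the batch durable
    install(disk)             # 3. install into the data region

def recover(disk):
    # after a crash, replay the log only if the commit record made it;
    # otherwise discard it, so the transaction appears atomic
    if disk.committed:
        install(disk)
    else:
        disk.log = []

d = Disk()
d.log = [(3, "inode"), (7, "dirent")]  # crash before the commit record
recover(d)
empty_after_crash = dict(d.data)       # none of the updates applied

commit(d, [(3, "inode"), (7, "dirent")])
full_after_commit = dict(d.data)       # both updates applied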

slide-83
SLIDE 83

disk

Deferred commit enables high throughput

56

log data

slide-84
SLIDE 84

disk

Deferred commit enables high throughput

56

memory log data

  • 1. Buffer system calls in memory
slide-85
SLIDE 85

disk

Deferred commit enables high throughput

56

➡ mkdir(‘d’) ➡ create(‘d/a’) ➡ rename(‘d/a’, ‘d/b’)


memory log data

  • 1. Buffer system calls in memory
slide-86
SLIDE 86

disk

Deferred commit enables high throughput

56

➡ mkdir(‘d’) ➡ create(‘d/a’) ➡ rename(‘d/a’, ‘d/b’) ➡ fsync(‘d’)


memory log data

  • 1. Buffer system calls in memory
  • 2. fsync() flushes cached transactions to the on-disk log in a batch

slide-87
SLIDE 87

disk

Deferred commit enables high throughput

56

➡ mkdir(‘d’) ➡ create(‘d/a’) ➡ rename(‘d/a’, ‘d/b’) ➡ fsync(‘d’) memory log data

  • 1. Buffer system calls in memory
  • 2. fsync() flushes cached transactions to the on-disk log in a batch
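The buffer-then-batch behavior can be sketched as a toy Python model. LoggingFS and its methods are invented names for this illustration, not the verified implementation.

```python
class LoggingFS:
    """Toy sketch of deferred commit (illustrative only)."""
    def __init__(self):
        self.mem_txns = []  # transactions buffered in memory
        self.disk_log = []  # persistent on-disk log

    def syscall(self, name, *args):
        # e.g. mkdir('d'), create('d/a'): buffered, not yet durable
        self.mem_txns.append((name, args))

    def fsync(self):
        # one batched log write commits every buffered transaction
        self.disk_log.extend(self.mem_txns)
        self.mem_txns = []

    def after_crash(self):
        # memory is lost; only transactions in the on-disk log survive
        return list(self.disk_log)

fs = LoggingFS()
fs.syscall("mkdir", "d")
fs.syscall("create", "d/a")
lost = fs.after_crash()  # crash before fsync: both calls are lost
fs.fsync()
kept = fs.after_crash()  # crash after fsync: both calls survive
```

This is exactly why FSCQ's "if create(f) has returned and the computer crashes, f exists" specification no longer holds once commit is deferred.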

slide-88
SLIDE 88

Log-bypass writes avoid doubling data writes

57

disk log data ➡ mkdir(‘d’) ➡ create(‘d/a’)

  • 1. Record metadata updates in log as usual

transaction cache


slide-89
SLIDE 89

Log-bypass writes avoid doubling data writes

57

disk log data ➡ mkdir(‘d’) ➡ create(‘d/a’) ➡ write(‘d/a’,...)

  • 1. Record metadata updates in log as usual

transaction cache


slide-90
SLIDE 90

Log-bypass writes avoid doubling data writes

57

disk log data ➡ mkdir(‘d’) ➡ create(‘d/a’) ➡ write(‘d/a’,...)

  • 1. Record metadata updates in log as usual

  • 2. Bypass log for file data

transaction cache


slide-91
SLIDE 91

Log-bypass writes avoid doubling data writes

57

disk log data ➡ mkdir(‘d’) ➡ create(‘d/a’) ➡ write(‘d/a’,...)

  • 1. Record metadata updates in log as usual

  • 2. Bypass log for file data

transaction cache

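The two paths can be sketched as a toy Python model. BypassFS and its methods are invented names for this illustration, not the verified code; metadata goes through the transaction cache and log, while file data is written to the data region once, in place.

```python
class BypassFS:
    """Toy sketch of log-bypass writes (illustrative only)."""
    def __init__(self):
        self.txn_cache = []  # buffered metadata updates (logged on fsync)
        self.disk_log = []   # persistent log
        self.disk_data = {}  # data-block address -> contents

    def metadata_op(self, desc):
        # metadata updates go through the transaction cache and log
        self.txn_cache.append(desc)

    def write(self, addr, block):
        # file data bypasses the log: one in-place write to the data region
        self.disk_data[addr] = block

    def fsync(self):
        self.disk_log.extend(self.txn_cache)
        self.txn_cache = []

fs = BypassFS()
fs.metadata_op("create d/a")
fs.write(21, "hello")
# a crash here exposes the data write but not the metadata update
data_on_disk = dict(fs.disk_data)
log_on_disk = list(fs.disk_log)
```

Because data skips the log while metadata waits for fsync, a crash can expose the write without the create; this reordering is what the tree-sequence specification has to capture.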

slide-92
SLIDE 92

Deferred commit and log bypass matter in practice

58

ext4 largefile performance:
  synchronous        120 MB/s
  + deferred commit  150 MB/s
  + log-bypass       300 MB/s

fdatasync every 10 MB to an SSD

slide-93
SLIDE 93

Specifications

59

SPEC   unlink(cwd_ino, pathname)
PRE    disk: tree_rep(tree_seq)
POST   disk: tree_rep(tree_seq ++ [new_tree]) /\
       new_tree = tree_prune(tree_seq.latest, cwd_ino, pathname)
CRASH  disk: tree_intact(tree_seq ++ [new_tree])

slide-94
SLIDE 94

POSIX manual gives complicated specification

  • not clear enough about crash behavior

60

fsync() flushes modified buffer cache pages for fd to the disk device so that all changed information can be retrieved even after the system crashes or is rebooted. fsync() also flushes metadata information associated with the file (see inode(7)). fdatasync() is similar to fsync(), but does not flush modified metadata unless that metadata is needed in order to allow a subsequent data retrieval to be correctly handled.

paraphrase of fsync(2) manpage

slide-95
SLIDE 95

Evaluating the specification: atomic write pattern

61

f tmpfile

rename()

Goal: on crash, f either:

  • doesn't exist
  • or contains data
slide-96
SLIDE 96

Proved atomic write pattern crash safe

62

def atomic_write(data, name):
    with open(tmpfile, "cw") as f:
        ftruncate(f, len(data))
        write(f, data)
        fdatasync(f)
    rename(tmpfile, name)
    fsync(dirname(name))

prepare tmpfile; persist data; move to destination; persist metadata

slide-97
SLIDE 97

Proved atomic write pattern crash safe

62

def atomic_write(data, name):
    with open(tmpfile, "cw") as f:
        ftruncate(f, len(data))
        write(f, data)
        fdatasync(f)
    rename(tmpfile, name)
    fsync(dirname(name))

prepare tmpfile; persist data; move to destination; persist metadata

Specification is sufficient to prove application-level properties

slide-98
SLIDE 98

Atomic write is correct

63

/tmp name

Specification: on crash, name either does not exist

  • or contains data

/tmp name

crash states:

(just after rename)

slide-99
SLIDE 99

DFSCQ runs ordinary Linux programs using FUSE

64

DFSCQ FUSE server

userspace Linux kernel

$ mv src dst $ git clone FUSE

slide-100
SLIDE 100

Effort to implement DFSCQ

  • Total of 75,000 lines of verified code, specs, and proofs in Coq
  • Compared to FSCQ’s 31,000 lines
  • 4,800 lines of implementation
  • Took 5 authors 2 years (but less than 10 person years)

65

[pie chart: proof effort split 10% / 12% / 43% / 35% across CHL infrastructure, FS impl and proofs, Top-level API, Tree sequences]