Verifying a high-performance crash-safe file system using a tree specification
Haogang Chen, Tej Chajed, Stephanie Wang, Alex Konradi, Atalay İleri, Adam Chlipala, M. Frans Kaashoek, Nickolai Zeldovich
File systems are difficult to make correct
[Figure: correctness, performance]
[Figure: directory tree with files f and g; unlink(g); crash: reset memory]
tree sequence
[Figure: a sequence of trees over files f and g, extended by unlink(g), truncate(f,2), and rename(f,/)]
post-crash tree sequence
post-crash states:
[Figure: the trees of the sequence as possible states after a crash]
[Figure: tree sequence with fsync("/")]
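The tree-sequence slides above can be read as a small state machine. The sketch below is one possible model, not DFSCQ's Coq definitions: a tree is a plain dict from name to contents, and the names `TreeSeq`, `apply`, `crash_states`, and `fsync_root` are hypothetical. It assumes that a crash may recover any tree in the sequence and that fsync("/") leaves only the latest tree possible.

```python
from dataclasses import dataclass


@dataclass
class TreeSeq:
    trees: list  # oldest first; the last entry is the latest tree

    def apply(self, op):
        """Run a metadata operation on a copy of the latest tree and append it."""
        self.trees.append(op(dict(self.trees[-1])))

    def crash_states(self):
        """Assumption: after a crash, any tree in the sequence may be recovered."""
        return list(self.trees)

    def fsync_root(self):
        """Assumed model of fsync("/"): only the latest tree stays possible."""
        self.trees = [self.trees[-1]]


def unlink(name):
    def op(tree):
        del tree[name]
        return tree
    return op


def truncate(name, length):
    def op(tree):
        tree[name] = tree[name][:length]
        return tree
    return op


seq = TreeSeq([{"f": b"abcd", "g": b"xy"}])
seq.apply(unlink("g"))
seq.apply(truncate("f", 2))
before_sync = seq.crash_states()   # three candidate trees
seq.fsync_root()
after_sync = seq.crash_states()    # only the latest tree
```

Keeping every intermediate tree is what lets the specification describe a crash in the middle of a batch of unsynced operations.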
[Figure: rename, then write, then crash]
write(f,…)
At minimum, writes to latest tree
Affects the same file in earlier trees
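A rough sketch of what "affects the same file in earlier trees" could mean. This is an illustration under stated assumptions, not the DFSCQ specification: each tree maps a hypothetical inode number to the file's bytes, "the same file" is taken to mean the same inode number, and the sketch only handles in-place writes that fit inside the file in the latest tree.

```python
def write(tree_seq, inum, offset, data):
    """Apply an in-place data write across a tree sequence (illustrative model)."""
    latest = tree_seq[-1]
    if inum not in latest or offset + len(data) > len(latest[inum]):
        raise ValueError("sketch only handles in-place writes to the latest tree")
    for tree in tree_seq:
        old = tree.get(inum)
        # Earlier trees change only where the same file exists and covers the range.
        if old is not None and offset + len(data) <= len(old):
            tree[inum] = old[:offset] + data + old[offset + len(data):]


F, G = 1, 2  # hypothetical inode numbers for files f and g
tree_seq = [
    {F: b"aaaa", G: b"gg"},  # initial tree
    {F: b"aaaa"},            # after unlink(g)
    {F: b"aa"},              # after truncate(f, 2)
]
write(tree_seq, F, 1, b"Z")
```

After the call, f reads `b"aZaa"` in the first two trees and `b"aZ"` in the latest one, while g is untouched.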
write(f,…)
Puts an obligation on the implementation to avoid block re-use within a tree sequence
[Figure: blocks b21 and b51 in the trees of the sequence]
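One way an implementation could meet the no-re-use obligation is to hold freed blocks back until the next sync. The allocator below is a hypothetical sketch (class and method names are invented here, and DFSCQ's actual mechanism may differ): a block freed since the last sync might still back a file in an earlier tree, so it is not handed out again until the sequence collapses.

```python
class Allocator:
    def __init__(self, free_blocks):
        self.free = sorted(free_blocks)  # blocks that are safe to hand out
        self.pending = []                # freed since the last sync; not yet reusable

    def alloc(self):
        if not self.free:
            raise RuntimeError("no reusable blocks until the next sync")
        return self.free.pop(0)

    def release(self, block):
        # The block may still belong to a file in an earlier tree of the sequence.
        self.pending.append(block)

    def sync(self):
        # After a sync only the latest tree remains, so freed blocks become reusable.
        self.free = sorted(self.free + self.pending)
        self.pending = []


allocator = Allocator([21, 51])
b = allocator.alloc()        # block 21
allocator.release(b)         # e.g. the file holding block 21 was truncated
first = allocator.alloc()    # block 51: block 21 is held back
allocator.sync()
second = allocator.alloc()   # block 21 is reusable again
```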
[Figure: write, then crash]
[Figure: tree with files f and g]
uncached: two possible values, old and new
two degrees of non-determinism in crash states
specification allows metadata and data updates to be reordered
fdatasync(f)
fdatasync specification says block sets collapse in every tree
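The two degrees of non-determinism and the fdatasync rule above can be sketched together. This is an illustrative model only: every tree maps a file name to a list of block sets, each block set is a list of possible values with the newest last, and files are matched across trees by name for brevity. `crash_states` and `fdatasync` are hypothetical helper names.

```python
from itertools import product


def crash_states(tree_seq):
    """Enumerate crash states: pick a tree, then pick one value per block."""
    states = []
    for tree in tree_seq:                        # degree 1: which tree survives
        names = sorted(tree)
        block_sets = [vals for n in names for vals in tree[n]]
        for choice in product(*block_sets):      # degree 2: value of each block
            it = iter(choice)
            states.append({n: [next(it) for _ in tree[n]] for n in names})
    return states


def fdatasync(tree_seq, name):
    """Collapse every block set of `name` to its newest value, in every tree."""
    for tree in tree_seq:
        if name in tree:
            tree[name] = [[vals[-1]] for vals in tree[name]]


tree_seq = [
    {"f": [["old", "new"]], "g": [["g0"]]},  # before unlink(g)
    {"f": [["old", "new"]]},                 # after unlink(g)
]
before = crash_states(tree_seq)   # 2 trees x 2 values for f's block
fdatasync(tree_seq, "f")
after = crash_states(tree_seq)    # 2 trees x 1 value
```

A state taken from the first tree with f holding "new" shows the reordering the slides mention: the data write survived the crash while the earlier unlink(g) did not.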
[Figure: stat(g) returning length: 2, type: file, …, shown alongside unlink(g)]
buffer cache logging checksums deferred commit log-bypass API block allocator free-bit cache avoid re-use inode k-indirect blocks dirty blocks directory name cache
cache free blocks
improves performance with no change to abstraction
boundaries
track dirty blocks in the cache
records dirent offset from inode layer
specification, code, proofs (Coq) -> Coq proof checker -> OK
code (Coq) -> Coq code extraction -> implementation (Haskell)
implementation (Haskell) + FUSE interface (Haskell) -> GHC -> DFSCQ FUSE server
(see paper for more results)
[Figure: bar charts comparing FSCQ, DFSCQ, and ext4: smallfile (files/s), largefile (MB/s), mailbench (msgs/s), TPC-C on SQLite (txns/s)]
deferred commit and log-bypass writes
Author: Jan Kara <jack@suse.cz>
Date: Tue Nov 25 20:19:17 2014 -0500
ext4: forbid journal_async_commit in data=ordered mode
[...]
specification
[Figure: create(‘d/a’) yields (address, block) entries; the disk holds a log region and a data region]
[Figure: with a log in memory, mkdir(‘d’), create(‘d/a’), and rename(‘d/a’, ‘d/b’) accumulate there; fsync(‘d’) moves them to the on-disk log in a batch]
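The deferred-commit idea above can be sketched as a log that buffers transactions in memory and writes them to the on-disk log only on fsync. This is a toy model: `DeferredLog`, its fields, and the addresses and block contents in the example are hypothetical, and the real system also has to deal with checksums and recovery.

```python
class DeferredLog:
    def __init__(self):
        self.memory = []     # buffered transactions, each a list of (address, block)
        self.disk_log = []   # transactions that reached the on-disk log
        self.commits = 0     # number of batches written to the on-disk log

    def append(self, txn):
        """Record a system call's updates in memory only."""
        self.memory.append(txn)

    def fsync(self):
        """Write all buffered transactions to the on-disk log as one batch."""
        if self.memory:
            self.disk_log.extend(self.memory)
            self.memory = []
            self.commits += 1


log = DeferredLog()
log.append([(3, "inode for d")])                       # mkdir('d')
log.append([(4, "inode for d/a"), (7, "dirent a")])    # create('d/a')
log.append([(7, "dirent b")])                          # rename('d/a', 'd/b')
pending_before = len(log.memory)   # nothing is on disk yet
log.fsync()                        # fsync('d'): one batch for all three calls
```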
[Figure: disk with log and data regions; mkdir(‘d’), create(‘d/a’), write(‘d/a’,...); labels: usual, transaction cache]
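A minimal sketch of why bypassing the log can help data writes, assuming a conventional write-ahead log in which a logged block is written once to the log and once more when the log is applied to its home location. The `Disk` class and its counters are hypothetical and say nothing about how DFSCQ lays out its log.

```python
class Disk:
    def __init__(self):
        self.log = []          # (address, block) entries not yet applied
        self.data = {}         # address -> block at its home location
        self.block_writes = 0  # total blocks written to the device

    def logged_write(self, addr, block):
        """Usual path: the block goes to the log first."""
        self.log.append((addr, block))
        self.block_writes += 1

    def apply_log(self):
        """Copy logged blocks to their home locations."""
        for addr, block in self.log:
            self.data[addr] = block
            self.block_writes += 1
        self.log = []

    def bypass_write(self, addr, block):
        """Log-bypass path: file data goes straight to its home location."""
        self.data[addr] = block
        self.block_writes += 1


usual = Disk()
for addr in (100, 101, 102):
    usual.logged_write(addr, b"data")
usual.apply_log()

bypass = Disk()
for addr in (100, 101, 102):
    bypass.bypass_write(addr, b"data")
```

In this model the same three data blocks cost six device writes on the usual path and three on the bypass path.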
ext4 performance, largefile (fdatasync every 10 MB to an SSD):
synchronous: 120 MB/s
+ deferred commit: 150 MB/s
+ log-bypass: 300 MB/s
SPEC unlink(cwd_ino, pathname)
PRE disk: tree_rep(tree_seq)
POST disk: tree_rep(tree_seq ++ [new_tree]) /\
     new_tree = tree_prune(tree_seq.latest, cwd_ino, pathname)
CRASH disk: tree_intact(tree_seq ++ [new_tree])
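To make the shape of this specification concrete, here is an executable paraphrase. It is a sketch under assumptions, not the Coq specification: a tree is a dict from a directory's inode number to its entries, `tree_prune` is reimplemented naively, and `tree_intact` is read as "the disk represents some tree of the extended sequence", so the crash condition is returned as that list of candidates.

```python
def tree_prune(tree, cwd_ino, pathname):
    """Return a copy of `tree` with `pathname` removed from directory `cwd_ino`."""
    new_tree = {ino: dict(entries) for ino, entries in tree.items()}
    del new_tree[cwd_ino][pathname]
    return new_tree


def unlink_spec(tree_seq, cwd_ino, pathname):
    """Return (postcondition sequence, candidate crash trees) for unlink."""
    new_tree = tree_prune(tree_seq[-1], cwd_ino, pathname)
    post = tree_seq + [new_tree]   # POST: tree_seq ++ [new_tree]
    crash = list(post)             # CRASH (assumed reading): any tree of that sequence
    return post, crash


ROOT = 1  # hypothetical inode number of the working directory
tree_seq = [{ROOT: {"f": b"ab", "g": b"xy"}}]
post, crash = unlink_spec(tree_seq, ROOT, "g")
```

The precondition's sequence is left untouched, which mirrors how the specification only ever appends a new tree.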
fsync() flushes modified buffer cache pages for fd to the disk device so that all changed information can be retrieved even after the system crashes or is rebooted. fsync() also flushes metadata information associated with the file (see inode(7)). fdatasync() is similar to fsync(), but does not flush modified metadata unless that metadata is needed in order to allow a subsequent data retrieval to be correctly handled.
(paraphrase of fsync(2) manpage)
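From an application's point of view, the two calls look like this. The helper `append_durably` is a hypothetical name; `os.fsync` and `os.fdatasync` are the Python standard library wrappers, and the sketch falls back to `os.fsync` on platforms where `os.fdatasync` is not available.

```python
import os
import tempfile


def append_durably(path, data, metadata_too=True):
    """Append `data` to `path` and flush it before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        view = memoryview(data)
        while view:                        # os.write may write fewer bytes than asked
            view = view[os.write(fd, view):]
        if metadata_too or not hasattr(os, "fdatasync"):
            os.fsync(fd)                   # data plus the file's metadata
        else:
            os.fdatasync(fd)               # data, and only the metadata needed to read it
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "f")
    append_durably(path, b"hello", metadata_too=False)
    append_durably(path, b" world")
    with open(path, "rb") as fh:
        contents = fh.read()
```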
rename()
[Figure: /tmp and name; crash states]
[Figure: shell commands ($ mv src dst, $ git clone) go through FUSE to the DFSCQ FUSE server]
[Figure: breakdown chart; segment labels: CHL infrastructure, FS impl and proofs, Top-level API, Tree sequences; segment values (order as extracted): 10%, 12%, 43%, 35%]