Tolerating File-System Mistakes with EnvyFS
Lakshmi N. Bairavasundaram
NetApp, Inc.
Swaminathan Sundararaman Andrea C. Arpaci‐Dusseau Remzi H. Arpaci‐Dusseau
University of Wisconsin Madison
6/18/09 Tolerating File-System Mistakes with EnvyFS
[Figure: EnvyFS architecture. Components shown: Application, EnvyFS layer, child file systems (Child 1, Child 2, ..., Child N), SIS layer, disk driver, disk.]
– Needs to handle cases where the specification is not precise
– e.g., ordering of directory entries, inode number allocation
– No specified return order
– Can't blindly compare entries

– Read all entries from a directory (dir: test in our case) from all FSes
– Match entries from FSes
– Return majority results
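The three readdir steps can be sketched as follows. This is only an illustration of order-insensitive majority matching, not the EnvyFS implementation: the function name `majority_readdir` is hypothetical, entries are matched by name only, and sorting the result is a choice made here because no return order is specified.

```python
from collections import Counter

def majority_readdir(listings):
    """Return the entry names reported by a strict majority of children.

    `listings` holds one list of names per child file system. Order inside
    each list is arbitrary, so names are matched as sets, not by position.
    """
    needed = len(listings) // 2 + 1
    votes = Counter()
    for entries in listings:
        votes.update(set(entries))  # each child votes once per name
    # Sorted output is an assumption of this sketch.
    return sorted(name for name, count in votes.items() if count >= needed)

# Three children returning the same directory in different orders.
fs_x = ["File 1", "File 2", "File 3"]
fs_y = ["File 2", "File 3", "File 1"]
fs_z = ["File 3", "File 1", "File 2"]
result = majority_readdir([fs_x, fs_y, fs_z])
```

An entry reported by only one of three children is dropped, because it does not reach the majority threshold.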
[Figure: Readdir on directory "test". FS X, FS Y, and FS Z each hold File 1, File 2, and File 3 and return them in different orders; the EnvyFS layer matches the entries and returns File 1, File 2, File 3.]
[Figure: inode mapping table with columns Virt #, FS 1, FS 2, FS 3.]
– Inode numbers returned through system calls
– Each child file system issues different inode numbers
– Possible solution: Force file systems to use same algorithm?
– Our solution: Issue inode numbers at EnvyFS layer
[Figure: inode numbers under directory "test". FS X: File 1 (10), File 2 (15), File 3 (16). FS Y: File 2 (04), File 3 (44), File 1 (36). FS Z: File 3 (99), File 1 (65), File 2 (43). For Stat: File 1, the inode mapping table at the EnvyFS layer maps virtual inode number 15 to child inode numbers 10, 36, and 65.]
Inode mapping table is not persistently stored.
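A minimal sketch of an in-memory inode mapping table follows. The class and method names are hypothetical, and the sequential allocation policy is an assumption; the slides only state that EnvyFS issues its own inode numbers and does not store the table persistently. The starting value 15 is chosen purely to mirror the figure.

```python
import itertools

class InodeMappingTable:
    """In-memory map between a virtual inode number and child inode numbers."""

    def __init__(self, first_virtual=1):
        self._next = itertools.count(first_virtual)  # assumed allocation policy
        self._by_virtual = {}   # virtual ino -> tuple of child inos
        self._by_children = {}  # tuple of child inos -> virtual ino

    def virtual_for(self, child_inos):
        """Return the virtual number for this set of child numbers,
        allocating a new one the first time the set is seen."""
        key = tuple(child_inos)
        if key not in self._by_children:
            vino = next(self._next)
            self._by_children[key] = vino
            self._by_virtual[vino] = key
        return self._by_children[key]

    def children_for(self, vino):
        """Translate a virtual number back to the per-child numbers."""
        return self._by_virtual[vino]

# File 1 has inode 10 in FS X, 36 in FS Y, and 65 in FS Z.
table = InodeMappingTable(first_virtual=15)
vino = table.virtual_for([10, 36, 65])
```

Because the table lives only in memory, a real system would have to rebuild it after a remount; this sketch does not model that.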
– Inserted beneath VFS
– Simple design to avoid bugs

– Allocate N data buffers
– Read data block from the disk
– Compare: data, return code, file position
– Return: data, return code

– Allocate memory for each read operation
– Extra copy from allocated buffer to application
– Comparison overheads
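The simple read path can be sketched as below. It is a user-space illustration under stated assumptions, not kernel code: each child is modeled as a callable returning `(data, return_code, file_position)`, and the names `voted_read` and `make_child` are hypothetical. Note the one buffer per child and the full data comparison, which are the overheads listed above.

```python
from collections import Counter

def voted_read(children, size):
    """Read `size` bytes from every child and return the majority result.

    Every child gets its own buffer (a separate bytes object here), and
    the full tuple (data, return code, file position) is compared.
    """
    results = [child(size) for child in children]  # N separate buffers
    winner, count = Counter(results).most_common(1)[0]
    if count <= len(children) // 2:
        raise IOError("no majority among child file systems")
    data, return_code, _position = winner
    return data, return_code  # extra copy back to the caller

def make_child(payload):
    """Hypothetical child file system that serves a fixed payload."""
    def read(size):
        n = min(size, len(payload))
        return payload[:n], n, n
    return read

good = make_child(b"hello world")
bad = make_child(b"hellX world")   # one child returns corrupt data
data, rc = voted_read([good, bad, good], 5)
```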
[Figure: read path in the simple design. The application issues Read (file, 1 block) through the VFS layer to the EnvyFS layer (comparators, wrappers, inode mapping table), which issues a read to each child (ext3, JFS, ReiserFS). Each child returns data (D), an error code (err), and a file position (pos: x); EnvyFS compares them and returns data and error code to the application.]
– Same application buffer for all FS
– TCP-like checksums for data comparison
– Compare: checksums, return code, file position
– Read data until majority
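The optimized read can be sketched as follows. "TCP-like checksum" is interpreted here as the 16-bit ones'-complement Internet checksum, which is an assumption; the function names and the child interface (a callable that fills a shared buffer and returns `(return_code, file_position)`) are hypothetical.

```python
from collections import Counter

def internet_checksum(data):
    """16-bit ones'-complement checksum in the style used by TCP."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF

def checksum_read(children, size):
    """Read children one at a time into one shared buffer; stop at a majority."""
    needed = len(children) // 2 + 1
    buffer = bytearray(size)  # the single application buffer, reused
    votes = Counter()
    for reads_issued, child in enumerate(children, start=1):
        return_code, position = child(buffer)
        valid = bytes(buffer[:max(return_code, 0)])
        key = (internet_checksum(valid), return_code, position)
        votes[key] += 1
        if votes[key] >= needed:
            # The buffer holds the data of the read that completed the majority.
            return valid, return_code, reads_issued
    raise IOError("no majority among child file systems")

def make_child(payload):
    """Hypothetical child that copies a fixed payload into the buffer."""
    def read_into(buffer):
        n = min(len(buffer), len(payload))
        buffer[:n] = payload[:n]
        return n, n
    return read_into

good = make_child(b"hello world")
bad = make_child(b"hellX world")
data, rc, reads = checksum_read([good, good, bad], 5)
```

With three children, the third read is skipped whenever the first two agree. The buffer can be shared safely because the read that completes the majority is always the last one written into it.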
[Figure: optimized read path. Application, VFS layer, EnvyFS layer (comparators, wrappers, inode mapping table), child file systems ext3, JFS, and ReiserFS, and the disk. A Read (file, 1 block) is issued to the children; a checksum is recorded per file system (FS 1, FS 2, ..., FS N; example values 435, 435, ..., 436) and compared together with the error code (err) and file position (pos: x).]
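[Figure: storage layouts for the child file systems: partitions of a single disk (Part 1, Part 2, ..., Part N) and separate disks (Disk 1, Disk 2, ..., Disk N).]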
[Figure: Application, VFS layer, EnvyFS layer, child file systems FS 1, FS 2, ..., FS N, and the disk request queue (entries labeled 1, 1, 2, N, 1, 2, N).]
– Selectively merges data blocks

– Exports virtual disks to FSes
– Manages mapping, free space info.
– Not persistently stored on disk

– N data blocks merged to 1 data block
– Content hashes not stored persistently
– Metadata blocks not merged
– Inter-FS blocks and not intra-FS
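The merging rules can be illustrated with a toy single-instance store. This is a sketch under several assumptions, not SubSIST itself: the class name is hypothetical, SHA-256 stands in for the unspecified content hash, the caller tells the store whether a block is metadata, and overwrites and free-space management are ignored. All state is kept in memory, matching the note that hashes and mappings are not stored persistently.

```python
import hashlib

class SingleInstanceStore:
    """Identical data blocks from different virtual disks share one block."""

    def __init__(self):
        self.blocks = []    # physical blocks
        self.by_hash = {}   # content hash -> physical index (in memory only)
        self.users = {}     # physical index -> set of vdisks sharing it
        self.mapping = {}   # (vdisk, logical block) -> physical index

    def write(self, vdisk, logical, data, is_metadata=False):
        digest = hashlib.sha256(data).digest()  # hash choice is an assumption
        hit = self.by_hash.get(digest)
        # Merge only data blocks, and only across different virtual disks.
        if (not is_metadata and hit is not None
                and vdisk not in self.users[hit]
                and self.blocks[hit] == data):
            self.mapping[(vdisk, logical)] = hit
            self.users[hit].add(vdisk)
            return
        self.blocks.append(data)
        index = len(self.blocks) - 1
        self.mapping[(vdisk, logical)] = index
        self.users[index] = {vdisk}
        if not is_metadata:
            self.by_hash.setdefault(digest, index)

    def read(self, vdisk, logical):
        return self.blocks[self.mapping[(vdisk, logical)]]

store = SingleInstanceStore()
for vdisk in (1, 2, 3):
    store.write(vdisk, 0, b"D" * 4096)                    # same data, 3 vdisks
merged_count = len(store.blocks)                          # one physical block
store.write(1, 1, b"M" * 4096, is_metadata=True)
store.write(2, 1, b"M" * 4096, is_metadata=True)          # metadata not merged
store.write(1, 2, b"D" * 4096)                            # intra-FS: not merged
```

A block whose content differs from its siblings, such as a corrupted copy, hashes differently and is simply stored on its own.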
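[Figure: SubSIST architecture. Application, VFS layer, EnvyFS layer, and FS 1, FS 2, ..., FS N sit above SubSIST, which exports Vdisk 1, Vdisk 2, ..., Vdisk N and contains a read cache, a CHash layer, and free space management above the disk. Blocks are labeled D (data) and M (metadata).]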
– Due to bugs, bit flips, storage stack
– Corrupt data blocks not merged
– All other N-1 data blocks merged
– Corrupt data block fixed at next read

– Different code paths
– Different on-disk structures
[Figure: SubSIST data-block handling. Application, VFS layer, EnvyFS layer, FS 2, ..., FS N, Vdisk 1, Vdisk 2, ..., Vdisk N, read cache, CHash layer, and free space management; blocks are labeled D.]
[Figure: fault-injection setup showing the EnvyFS layer, a block driver, the disk, and blocks (B).]
– superblock, inode, block bitmap, file data, …
– mount, stat, creat, unlink, read, …
[Figure: VFS above ext3, JFS, and ReiserFS, with a pseudo device driver and blocks (B).]
Type-aware fault injection [Prabhakaran05]
[Figure: fault-injection results matrix (panel label: E). Workloads: path traversal, SET-1 (stat, …), SET-2 (chmod), read, readlink, getdirentries, creat, link, mkdir, rename, symlink, write, truncate, rmdir, unlink, mount, SET-3 (fsync), umount. Corrupted block types: INODE, DIR, BMAP, IMAP, INDIRECT, DATA, SUPER, JSUPER, GDESC. Outcome legend: Normal, Data loss, N/A, Cannot mount, Ops fail, Data corrupt, Crash, Read-only, Depends.]
[Figure: fault-injection results matrix for EnvyFS with children labeled E, R, and J. Workloads: path traversal, SET-1 (stat, …), SET-2 (chmod), read, readlink, getdirentries, creat, link, mkdir, rename, symlink, write, truncate, rmdir, unlink, mount, SET-3 (fsync), umount. Corrupted block types: INODE, DIR, BMAP, IMAP, INDIRECT, DATA, SUPER, JSUPER, GDESC. Outcomes shown: Normal, N/A. Annotation: kernel panic in ext3.]
Timeline in typical use:
– Unlink on corrupt inode: ext3_lookup (bug), then ext3_unlink
– Unmount (panic)

Timeline in EnvyFS3:
– Unlink on corrupt inode: ext3_lookup (bug); ext3 inode does not match
– Further ops not issued

In typical use, a problem is noticed only on panic.
In EnvyFS3, a problem is noticed the first time a child file system returns wrong results.
[Figure: fault-injection results matrix (panel label: J). Workloads: path traversal, SET-1, SET-2, read, readlink, getdirentries, creat, link, mkdir, rename, symlink, write, truncate, rmdir, unlink, mount, SET-3, umount. Corrupted block types: INODE, DIR, BMAP, IMAP, INTERNAL, DATA, SUPER, JSUPER, JDATA, AGGR-INODE, IMAPDESC, IMAPCNTL. Outcome legend: Normal, Data loss, N/A, Cannot mount, Ops fail, Data corrupt, Crash, Read-only, Depends.]
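[Figure: fault-injection results matrix for EnvyFS with children labeled E, R, and J. Workloads: path traversal, SET-1, SET-2, read, readlink, getdirentries, creat, link, mkdir, rename, symlink, write, truncate, rmdir, unlink, mount, SET-3, umount. Corrupted block types: INODE, DIR, BMAP, IMAP, INTERNAL, DATA, SUPER, JSUPER, JDATA, AGGR-INODE, IMAPDESC, IMAPCNTL. Outcomes shown: Normal, N/A, Crash. Annotation: kernel panic in EnvyFS3.]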
[Figure: bar chart of elapsed time (in seconds) by file system: ext3, JFS, ReiserFS, EnvyFS, EnvyFS+SIS. Annotation: 3% overhead.]
[Figure: bar chart of elapsed time (in seconds) for ext3, JFS, ReiserFS, EnvyFS, and EnvyFS+SIS on Postmark-10K, Postmark-100K, and Postmark-100K*. Bar labels as extracted: 129.0, 39.0, 26.4, 14.7, 9.6, 29, 107, 34, 851, 430, 128, 243, 78, 406, 271. Annotations: EnvyFS3: 3.3x + SubSIST: -32%; EnvyFS3: 8x + SubSIST: 4x; EnvyFS3: 1.7x + SubSIST: 11.5%.]