Tolerating File‐System Mistakes with EnvyFS



slide-1
SLIDE 1

Tolerating File‐System Mistakes with EnvyFS

Lakshmi N. Bairavasundaram

NetApp, Inc.

Swaminathan Sundararaman Andrea C. Arpaci‐Dusseau Remzi H. Arpaci‐Dusseau

University of Wisconsin Madison

slide-2
SLIDE 2

File Systems in Today’s World

  • Modern file systems are complex

– Tens of thousands of lines of code (e.g., XFS 45K LOC)

  • Storage stack is also getting deeper

– Hypervisor, network, logical volume manager

  • Need to handle a gamut of failures

– Memory allocation, disk faults, bit flips, system crashes

  • Preserve integrity of its meta‐data and user data

6/18/09 2 Tolerating File‐System Mistakes with EnvyFS

slide-3
SLIDE 3

File System Bugs

  • Bug reports for Linux 2.6 series from Bugzilla

– ext3: 64, JFS: 17, ReiserFS: 38

– Some are FS corruption causing permanent data loss

  • FS bugs broadly classified into two categories

– “fail‐stop”: System immediately crashes

  • Solutions: Nooks [Swift04], CuriOS [David08]

– “fail‐silent”: Accidentally corrupt on‐disk state

  • Many such bugs uncovered [Prabhakaran05, Gunawi08, Yang04, Yang06b]


slide-4
SLIDE 4

Bugs are inevitable in file systems

Challenge: how to cope with them?

slide-5
SLIDE 5
N‐Version File Systems

  • Based on N‐version programming [Avizienis77]

– NFS servers [Rodrigues01], databases [Vandiver07], security [Cox06]

  • EnvyFS: Simple software layer

– Store data in N child file systems

– Operations performed on all children

  • Rely on a simple software layer
  • Challenge: reducing overheads while retaining reliability

– SubSIST: Novel Single Instance Store

Figure (components): Application, EnvyFS layer, child file systems (Child 1, Child 2, …, Child N), SIS layer, disk driver, disk.

slide-6
SLIDE 6

Results

  • Robustness

– Traditional file systems handle few corruptions (< 4%)

– EnvyFS3 tolerates 98.9% of single file system mistakes

  • Performance

– Desktop workloads: EnvyFS3 has comparable performance

– I/O intensive workloads:

  • Normal mode: EnvyFS3 + SubSIST acceptable performance
  • Under memory pressure: EnvyFS3 + SubSIST large overheads

  • Potential as a debugging tool for FS developers

– Pinpoint the source of “fail‐silent” bug in ext3

slide-7
SLIDE 7

Outline

  • Introduction
  • Building reliable file systems
  • Reducing overheads with SubSIST
  • Evaluation
  • Conclusion

slide-8
SLIDE 8

N‐Version Systems

Development process:

  • 1. Producing the specification of software
  • 2. Implementing N versions of the software
  • 3. Creating N‐version layer

– Executes different versions

– Determines the consensus result
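
The third step (execute every version, then determine the consensus result) can be sketched in a few lines. This is only an illustration under stated assumptions: `run_n_versions` and the callables passed to it are hypothetical stand-ins, not EnvyFS code, and results are assumed to be hashable so they can be counted.

```python
from collections import Counter


def run_n_versions(versions, *args):
    """Run every version on the same input and return the majority outcome.

    `versions` is a hypothetical list of callables standing in for the
    independently developed implementations.
    """
    outcomes = []
    for version in versions:
        try:
            outcomes.append(("ok", version(*args)))
        except Exception as exc:
            # A failing version still votes, with its error type as the result.
            outcomes.append(("error", type(exc).__name__))

    winner, votes = Counter(outcomes).most_common(1)[0]
    if votes * 2 <= len(versions):
        raise RuntimeError("no majority among versions")
    return winner
```

With three versions, one wrong answer is outvoted by the other two; if all three disagree, the layer reports that no consensus exists rather than guessing.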

slide-9
SLIDE 9
1. Producing Specification

  • Our own specification?

– Impractical: Requires wide scale changes to file systems

– Specifications take years to get accepted

  • Can we leverage existing specification?

– Yes, can leverage VFS, but there are some issues

  • VFS not precise for N‐versioning purpose

– Needs to handle cases where specification is not precise

– e.g., Ordering directory entries, inode number allocation

slide-10
SLIDE 10

Imprecise VFS Specification

Ordering directory entries

  • Issue:

– No specified return order

– Can’t blindly compare entries

  • Solution:

– Read all entries from a directory (dir: test in our case) from all FSes

– Match entries from FSes

– Return majority results

Figure: Readdir: test issued through the EnvyFS layer. FS X, FS Y, and FS Z each hold Dir: test with File 1, File 2, and File 3, returned in different orders; EnvyFS returns File 1, File 2, File 3.
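
The matching step can be sketched as an order-insensitive vote over entry names. The function name `merge_readdir` and the choice to return names in sorted order are assumptions made for this sketch; the slides only say that entries are matched and the majority result is returned.

```python
from collections import Counter


def merge_readdir(listings):
    """Return the entries reported by a strict majority of child file systems.

    `listings` holds one list of entry names per child, in whatever order
    that child happened to return them.
    """
    votes = Counter()
    for names in listings:
        votes.update(set(names))  # each child votes at most once per name
    quorum = len(listings) // 2 + 1
    # Sorted output is an assumption of this sketch, chosen for determinism.
    return sorted(name for name, count in votes.items() if count >= quorum)
```

For example, if one of three children returns no entries at all, the two remaining listings still form a majority for every file.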

slide-11
SLIDE 11

Imprecise VFS Specification (cont)

  • Inode number allocation

– Inode numbers returned through system calls

– Each child file system issues different inode numbers

– Possible solution: Force file systems to use same algorithm?

– Our solution: Issue inode numbers at EnvyFS layer

Inode Mapping Table (not persistently stored)

Virt #   FS 1   FS 2   FS 3
15       10     36     65

Figure: Stat: File 1 in Dir: test issued through the EnvyFS layer. The child file systems (FS X, FS Y, FS Z) hold File 1 under inode numbers 10, 36, and 65; EnvyFS returns virtual inode number 15.
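
A minimal in-memory version of such a mapping table might look like the following. The class name `InodeMap`, the sequential allocation of virtual numbers, and the lookup by the first child's inode number are all assumptions of this sketch; the slides state only that EnvyFS issues its own inode numbers and that the table is not stored persistently.

```python
class InodeMap:
    """In-memory table: virtual inode number -> per-child inode numbers."""

    def __init__(self, first_virtual=1):
        self._next = first_virtual
        self._by_virtual = {}  # virtual inode -> tuple of child inodes
        self._by_child = {}    # (child index, child inode) -> virtual inode

    def register(self, child_inodes):
        """Return the virtual inode for a file, allocating one on first use."""
        key = (0, child_inodes[0])
        if key in self._by_child:
            return self._by_child[key]
        virtual = self._next
        self._next += 1
        self._by_virtual[virtual] = tuple(child_inodes)
        for index, inode in enumerate(child_inodes):
            self._by_child[(index, inode)] = virtual
        return virtual

    def children(self, virtual):
        """Return the child inode numbers behind a virtual inode."""
        return self._by_virtual[virtual]
```

Applications only ever see the virtual number, so the children remain free to allocate inodes however they like.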

slide-12
SLIDE 12
2. Implementing N versions of FS

  • Painful process

– High cost of development, long time delays

  • Lucky! Hard work already done for us

– 30 different disk based file systems in Linux 2.6

  • Which file systems to use?

– ext3, JFS, ReiserFS in a three‐version FS

– Others should work without modifications

slide-13
SLIDE 13
3. Creating N‐Version Layer

  • N‐Version layer (EnvyFS)

– Inserted beneath VFS

– Simple design to avoid bugs

  • Example: Reading a file

– Allocate N data buffers

– Read data block from the disk

– Compare: data, return code, file position

– Return: data, return code

  • Issues:

– Allocate memory for each read operation

– Extra copy from allocated buffer to application

– Comparison overheads

Figure: the application issues Read (file, 1 block) through the VFS layer to the EnvyFS layer (comparators, wrappers, inode mapping table), which issues Read (…) to ext3, JFS, and ReiserFS. Each child returns data (D), an error code (err), and a file position (pos: x) read from disk; EnvyFS returns data and error code to the application.

slide-14
SLIDE 14

Reading a File in EnvyFS

  • Solution:

– Same application buffer for all FS

– TCP‐like checksums for data comparison

– Compare: checksums, return code, file position

– Read data until majority

Figure: the same read path through the application, VFS layer, EnvyFS layer (comparators, wrappers, inode mapping table), and ext3, JFS, and ReiserFS, now with a per‐child checksum table (FS 1 #, FS 2 #, …, FS N #; example values 435, 435, …, 436) kept in the EnvyFS layer.
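
The optimized read can be sketched as follows. The 16‐bit ones'-complement sum is the standard Internet checksum, used here as one reading of "TCP‐like"; `read_block` and the child callables (each returning data, error code, and file position) are hypothetical stand-ins for this sketch, not the EnvyFS interfaces.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum, as used by TCP/IP headers."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # end-around carry
    return ~total & 0xFFFF


def read_block(children, offset, size):
    """Read from children one at a time until a majority agrees.

    `children` is a hypothetical list of callables:
    read(offset, size) -> (data, err, pos).
    """
    quorum = len(children) // 2 + 1
    votes = {}  # (checksum, err, pos) -> number of agreeing children
    for read in children:
        data, err, pos = read(offset, size)
        key = (internet_checksum(data), err, pos)
        votes[key] = votes.get(key, 0) + 1
        if votes[key] >= quorum:
            # The read that completes the majority holds the agreed data,
            # so no per-child buffers need to be kept around.
            return data, err
    raise IOError("no majority among child file systems")
```

Because only checksums are retained between reads, a single buffer suffices, and with three children the third is not read at all when the first two agree.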

slide-15
SLIDE 15

Outline

  • Introduction
  • Building reliable file systems
  • Reducing overheads with SubSIST
  • Evaluation
  • Conclusion

slide-16
SLIDE 16

Case for Single Instance Storage (SIS)

  • Ideal: One disk per FS
  • Practical: One disk for all FS
  • Overheads

– Effective storage space: 1/N

– N times more I/O (Read/write)

  • Challenge: Maintain diversity while minimizing overheads

Figure: Application, VFS layer, EnvyFS layer, and FS 1, FS 2, …, FS N. The ideal layout uses Disk 1, Disk 2, …, Disk N; the practical layout uses one disk divided into Part 1, Part 2, …, Part N, with requests from all file systems sharing the disk request queue.

slide-17
SLIDE 17

SubSIST: Single Instance Store

  • Variant of a Single Instance Store

– Selectively merges data blocks

  • Block addressable SIS

– Exports virtual disks to FSes

– Manages mapping, free space info.

– Not persistently stored on disk

  • EnvyFS writes through N file systems

– N data blocks merged to 1 data block

– Content hashes not stored persistently

– Meta‐data blocks not merged

– Merges inter‐FS blocks, not intra‐FS blocks

Figure: Application, VFS layer, EnvyFS layer, and FS 1, FS 2, …, FS N on Vdisk 1, Vdisk 2, …, Vdisk N exported by SubSIST (read cache, CHash layer, free space management) above the disk. Data blocks (D) are merged; meta‐data blocks (M) are not.
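
The merging rules above can be illustrated with a toy block store. Everything here is an assumption made for the sketch: the class and method names, the use of SHA‐1 as the content hash, and the `metadata` flag that tells the store which blocks must not be merged. Overwrites and free‐space reclamation are left out.

```python
import hashlib


class SingleInstanceStore:
    """Toy block-level single-instance store; all tables live in memory."""

    def __init__(self):
        self.data = {}     # physical block number -> bytes
        self.owners = {}   # physical block number -> set of vdisk ids
        self.by_hash = {}  # content hash -> physical block number
        self.vmap = {}     # (vdisk, virtual block) -> physical block number
        self._next = 0

    def write(self, vdisk, vblock, payload, metadata=False):
        digest = hashlib.sha1(payload).digest()
        phys = self.by_hash.get(digest)
        # Merge only data blocks, and only across different virtual disks.
        if metadata or phys is None or vdisk in self.owners[phys]:
            phys = self._next
            self._next += 1
            self.data[phys] = payload
            self.owners[phys] = set()
            if not metadata:
                self.by_hash[digest] = phys
        self.owners[phys].add(vdisk)
        self.vmap[(vdisk, vblock)] = phys

    def read(self, vdisk, vblock):
        return self.data[self.vmap[(vdisk, vblock)]]

    def physical_blocks(self):
        """Number of distinct physical blocks currently referenced."""
        return len(set(self.vmap.values()))
```

When all N children write the same data block, one physical block results; a child that writes different (for example, corrupted) contents simply fails to match the hash and keeps its own block.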

slide-18
SLIDE 18

Handling Data Block Corruptions?

✓ Corruption to data in a single FS

– Due to bugs, bit flips, storage stack

– Corrupt data blocks not merged

– All other N‐1 data blocks merged

– Corrupt data block fixed at next read

× Corruption to data block inside disk

  • Single copy of data

– Different code paths

– Different on‐disk structures

Figure: Application, VFS layer, EnvyFS layer, and FS 1, FS 2, …, FS N on Vdisk 1, Vdisk 2, …, Vdisk N; SubSIST (read cache, CHash layer, free space management) above the disk, with data blocks (D) shown at each level.

slide-19
SLIDE 19

Outline

  • Introduction
  • Building reliable file systems
  • Reducing overheads with SubSIST
  • Evaluation

– Reliability

– Performance

  • Conclusion

slide-20
SLIDE 20

Reliability Evaluation: Fault Injection

Robustness of EnvyFS in recovering from a child file system’s mistake?

  • Corruption: bugs in FS / storage stack
  • Types of disk blocks

– superblock, inode, block bitmap, file data, …

  • Perform different file ops

– mount, stat, creat, unlink, read, …

  • Report user visible results
  • All results are applicable with SubSIST except corruption to data blocks

Type‐aware fault injection [Prabhakaran05]

Figure: VFS, EnvyFS layer, child file systems (ext3, JFS, ReiserFS), pseudo device driver, block driver, and disk; blocks (B) flow between the disk and the child file systems through the pseudo device driver.
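
The idea of corrupting blocks by type can be sketched with a wrapper around a block‐read function. This is a hypothetical illustration, not the injection framework used in the evaluation: the class name, the `block_type_of` classifier, and the choice to flip every bit are all assumptions.

```python
class FaultInjectingDevice:
    """Wraps a block-read callable and corrupts one block type on the way up."""

    def __init__(self, read_block, block_type_of, target_type):
        self._read_block = read_block        # hypothetical: blkno -> bytes
        self._block_type_of = block_type_of  # hypothetical: blkno -> type name
        self._target_type = target_type
        self.injected = []                   # block numbers corrupted so far

    def read(self, blkno):
        data = self._read_block(blkno)
        if self._block_type_of(blkno) == self._target_type:
            self.injected.append(blkno)
            return bytes(b ^ 0xFF for b in data)  # flip every bit
        return data
```

Placing one such wrapper beneath a single child file system, and then running each file operation, gives one cell of a result matrix per (operation, block type) pair.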

slide-21
SLIDE 21

Result Matrix: ext3

Figure: matrix of user‐visible results for ext3. Operations: path traversal, SET‐1 (stat, …), SET‐2 (chmod), read, readlink, getdirentries, creat, link, mkdir, rename, symlink, write, truncate, rmdir, unlink, mount, SET‐3 (fsync), umount. Corrupted block types: INODE, DIR, BMAP, IMAP, INDIRECT, DATA, SUPER, JSUPER, GDESC. Legend: Normal, Data loss, N/A, Cannot mount, Ops fail, Data corrupt, Crash, Read‐only, Depends.

slide-22
SLIDE 22

Figure: ext3 result matrix, highlighting the Data loss, N/A, and Cannot mount results.

Ext3 stores many superblock copies; but, does not handle superblock corruption

slide-23
SLIDE 23

Figure: ext3 result matrix, highlighting the Data loss, N/A, Cannot mount, Ops fail, and Crash results.

  • In addition to operations failing, inode corruption leads to data loss
  • Unlink: system crash during unmount

slide-24
SLIDE 24

Figure: full ext3 result matrix. Legend: Normal, Data loss, N/A, Cannot mount, Ops fail, Data corrupt, Crash, Read‐only, Depends.

slide-25
SLIDE 25

EnvyFS3

Figure: result matrix for EnvyFS3 over the same operations (path traversal through umount) and ext3 block types (INODE, DIR, BMAP, IMAP, INDIRECT, DATA, SUPER, JSUPER, GDESC). Results shown: Normal, N/A. Callout: Kernel panic in ext3. Inset: E, R, J beneath EnvyFS.

EnvyFS3 works in every scenario

slide-26
SLIDE 26

Potential for Bug Isolation

ext3 (timeline):

– Unlink on corrupt inode: ext3_lookup (bug), then ext3_unlink

– Unmount (panic)

In typical use, a problem is noticed only on panic

EnvyFS3 (timeline):

– Unlink on corrupt inode: ext3_lookup (bug)

– ext3 inode does not match others

– Further ops not issued

In EnvyFS3, a problem is noticed the first time a child file system returns wrong results

slide-27
SLIDE 27

Result Matrix: JFS

Figure: matrix of user‐visible results for JFS. Operations: path traversal, SET‐1, SET‐2, read, readlink, getdirentries, creat, link, mkdir, rename, symlink, write, truncate, rmdir, unlink, mount, SET‐3, umount. Corrupted block types: INODE, DIR, BMAP, IMAP, INTERNAL, DATA, SUPER, JSUPER, JDATA, AGGR‐INODE, IMAPDESC, IMAPCNTL. Legend: Normal, Data loss, N/A, Cannot mount, Ops fail, Data corrupt, Crash, Read‐only, Depends.

slide-28
SLIDE 28

EnvyFS3

Figure: result matrix for EnvyFS3 over the same operations and JFS block types (INODE, DIR, BMAP, IMAP, INTERNAL, DATA, SUPER, JSUPER, JDATA, AGGR‐INODE, IMAPDESC, IMAPCNTL). Results shown: Normal, N/A, Crash. Inset: E, R, J beneath EnvyFS.

Kernel panic in EnvyFS3

slide-29
SLIDE 29

Performance Evaluation

  • Experimental setup

– AMD Opteron 2.2 GHz Processor

– 2GB RAM

– 80 GB Hitachi Deskstar 7200‐rpm SATA disk

– Linux 2.6.12

– 4GB disk partition for each file system

OpenSSH Benchmark

  • CPU Intensive
  • OpenSSH 4.5

– Copy, untar and make

Performance of EnvyFS3 is comparable to a single file system

3 % overhead

Figure: bar chart of elapsed time (in seconds) by file system: ext3, JFS, ReiserFS, EnvyFS, EnvyFS+SIS.

slide-30
SLIDE 30

Postmark Benchmark

  • I/O Intensive

– Mimics busy mail server workload

– Transaction: creates, deletes, reads, appends, …

  • Postmark Configuration

– 2500 files

– File size: 4Kb – 40Kb

– No. of transactions: 10K and 100K

Figure: bar chart of elapsed time (in seconds) for ext3, JFS, ReiserFS, EnvyFS, and EnvyFS+SIS under Postmark‐10K, Postmark‐100K, and Postmark‐100K*. Bar labels as extracted: 129.0 39.0 26.4 14.7 9.6 29 107 34 851 430 128 243 78 406 271

Chart annotations: EnvyFS3: 3.3x, + SubSIST: ‐32%; EnvyFS3: 8x, + SubSIST: 4x; EnvyFS3: 1.7x, + SubSIST: 11.5%

slide-31
SLIDE 31

Summary of Results

  • Robustness

– Traditional file systems vulnerable to corruptions

– EnvyFS3 tolerates almost all mistakes in one FS

  • Performance

– Desktop workloads: EnvyFS3 has comparable performance

– I/O intensive workloads:

  • Regular Operations: EnvyFS3 + SubSIST acceptable performance
  • Memory pressure: EnvyFS3 + SubSIST has large overhead

slide-32
SLIDE 32

Outline

  • Introduction
  • Building reliable file systems
  • Reducing overheads with SubSIST
  • Evaluation
  • Conclusion

slide-33
SLIDE 33

Conclusion

  • Bugs/mistakes are inevitable in any software

– Must cope, not just hope to avoid

  • EnvyFS: N‐version approach to tolerating FS bugs

– Built using existing specification and file systems

  • SubSIST: single instance store

– Decreases overheads while retaining reliability

slide-34
SLIDE 34

Thank You!

Advanced Systems Lab (ADSL)

University of Wisconsin‐Madison

http://www.cs.wisc.edu/adsl

slide-35
SLIDE 35

Future Work

  • Debugging tool for developers

– Run older and newer version of file systems

– Compare results with older version

  • File system repair

– Simple repair: copy data from other file system

– Complex repair: recreate entire file system tree

– How to do micro repair?

slide-36
SLIDE 36

Figure: ext3 result matrix, highlighting the Data loss, N/A, and Read‐only results.

  • Ext3 detects corruption for rmdir, unlink
  • creat, mkdir, symlink cause ext3 to reuse an inode, resulting in data loss