SLIDE 1

Can Applications Recover from fsync Failures?

Anthony Rebello, Yuvraj Patel, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau
University of Wisconsin-Madison

SLIDE 2

How does data reach the disk?

  • Applications use the file system
  • System calls – open( ), read( ), write( )
  • For Performance
  • Data buffered in the page cache
  • Modified pages are marked dirty
  • Periodically flushed to disk
  • Vulnerable to data loss while in RAM
  • For Correctness
  • Dirty pages can be flushed immediately using fsync( )

[Diagram: Applications write through the File System to the Disk. Clean pages have the same content as the disk; dirty pages hold new data to write to disk, flushed periodically or on fsync( ).]
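To make this path concrete, here is a minimal C sketch (my own illustration, not from the talk; the file name and data are made up) that buffers data with write( ) and then forces it to disk with fsync( ):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* write( ) only copies data into the page cache; the pages become dirty. */
    int fd = open("datafile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (write(fd, "abcd", 4) != 4) { perror("write"); return 1; }

    /* fsync( ) asks the file system to flush the dirty pages to disk now,
       instead of waiting for the periodic background writeback. */
    if (fsync(fd) != 0) { perror("fsync"); return 1; }

    close(fd);
    return 0;
}
```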
SLIDE 3

fsync is really important

  • Many applications care about durability
  • Ensure data on non-volatile storage before acknowledging client
  • Devices have volatile caches
  • Even with Direct I/O, fsync can issue a FLUSH command to persist the device cache
  • Ordering of writes is important
  • Force each write to disk with fsync before issuing the next
  • Optimistic Crash Consistency, Chidambaram et al. [SOSP’13]
  • Decouples ordering from durability


SLIDE 4

It’s hard to get durability correct

  • Applications find it difficult
  • Even when fsync works correctly
  • Example: persisting a newly created file (sketched in code below)
  • creat(/d/foo)
  • write(/d/foo, “abcd”)
  • fsync(/d/foo)
  • fsync(/d)  ← ensures the directory entry is persisted
  • All File Systems Are Not Created Equal Pillai et al. [OSDI’14]
  • Studied 11 applications
  • Update protocols are tricky
  • More than 30 vulnerabilities under ext3, ext4, btrfs

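A minimal C sketch of this protocol (my own illustration, reusing the slide’s /d/foo path): both the file’s data and the directory entry must be fsync’d, otherwise a crash can leave a durable file that is unreachable from its directory.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Create the file and write its contents. */
    int fd = open("/d/foo", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open /d/foo"); return 1; }
    if (write(fd, "abcd", 4) != 4) { perror("write"); return 1; }

    /* Persist the file data and inode. */
    if (fsync(fd) != 0) { perror("fsync /d/foo"); return 1; }
    close(fd);

    /* Persist the directory entry for foo: fsync the parent directory. */
    int dfd = open("/d", O_RDONLY);
    if (dfd < 0) { perror("open /d"); return 1; }
    if (fsync(dfd) != 0) { perror("fsync /d"); return 1; }
    close(dfd);
    return 0;
}
```

Forgetting the final fsync on the directory is exactly the kind of vulnerability the Pillai et al. study reported.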

SLIDE 5

fsync can fail

  • Durability becomes even harder to get right when fsync itself can fail
  • Failures before interacting with disk
  • Invalid arguments, insufficient space
  • Easy to handle
  • Failures while interacting with disk (see the sketch below)
  • EIO: An error occurred during synchronization
  • Transient disk errors, network disconnects
  • In-memory data structures may need to be reverted

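As a rough illustration of this split, the hedged sketch below (my own; the classification and policy are not prescribed by the talk) separates fsync errors that occur before touching the disk from an EIO that occurs during synchronization:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Returns 0 if the data is durable, -1 if the caller must treat it as not durable. */
int sync_and_classify(int fd) {
    if (fsync(fd) == 0)
        return 0;

    switch (errno) {
    case EBADF:
    case EINVAL:
    case ENOSPC:
        /* Failures before interacting with the disk (invalid arguments,
           insufficient space): relatively easy to handle. */
        fprintf(stderr, "fsync rejected: %s\n", strerror(errno));
        return -1;
    case EIO:
        /* Failure while interacting with the disk (transient disk error,
           network disconnect): the page cache may now disagree with the disk,
           and in-memory data structures may need to be reverted. */
        fprintf(stderr, "fsync I/O error: data may not be durable\n");
        return -1;
    default:
        fprintf(stderr, "fsync failed: %s\n", strerror(errno));
        return -1;
    }
}
```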

SLIDE 6

Why care about fsync failures?

“About a year ago the PostgreSQL community discovered that fsync (on Linux and some BSD systems) may not work the way we always thought it is [sic], with possibly disastrous consequences for data durability/consistency (which is something the PostgreSQL community really values).”

  • Tomas Vondra, FOSDEM 2019


SLIDE 7

Our work

  • Systematically understand fsync failures


[Diagram: Applications sit above the File System, which sits above the Disk.]

  1. File system reactions to fsync failures
  • Ext4, XFS, Btrfs
  2. Application reactions to fsync failures
  • Redis, LMDB, LevelDB, SQLite, PostgreSQL
SLIDE 8

File System Results

  • All file systems mark dirty pages clean on fsync failure
  • Retries are ineffective
  • File systems do not handle errors during fsync uniformly
  • Content in pages is different
  • Latest data (ext4, XFS), Old data (Btrfs)
  • Failure notifications not always immediate
  • Ext4 data mode reports errors later
  • In-memory data structures are not entirely reverted after fsync failure
  • Garbage/Zeroes in the files
  • Free space and block allocation unaltered (ext4, XFS)
  • User-space file descriptor offset unaltered (Btrfs)


SLIDE 9

Application Results

  • Simple strategies fail
  • Retries are ineffective
  • Crash/Restart can be incorrect
  • False Failures: Indicate failure but actually succeed
  • Incorrect recovery from WAL using the page cache
  • Defenseless against late error reporting
  • Ext4 data mode
  • Every application faced data loss
  • Most faced corruption (all except PostgreSQL)
  • Copy-on-write is good, but not invincible
  • Btrfs is bad for rollback strategies
  • But seems good for WAL recovery


SLIDE 10

Outline

  • Introduction
  • File Systems
  • Methodology (dm-loki, workloads)
  • Results
  • Applications
  • Methodology (CuttleFS)
  • Results
  • Challenges and Directions
  • Summary


SLIDE 11

File System | Methodology: Fault Injection

  • Goal: Understand file system reactions to fsync failures without modifying the kernel


[Diagram: dm-loki is a device-mapper layer between the file system and the disk; it intercepts bio requests.]

  • Intercept all block requests that go to disk
  • Custom device mapper target – dm-loki
  • Trace bio requests
  • Fail the i-th write to a given sector/block
SLIDE 12

File System | Methodology: Workloads

  • Common write patterns in applications
  • Reduced to their simplest form (see the sketch after this slide)
  • Single Block Update
  • Modify a single block in a file
  • Examples:
  • LMDB, PostgreSQL, SQLite
  • Multi Block Append
  • Add new blocks to the end of a file
  • Examples:
  • Redis append-only file
  • Write-ahead logs
  • PostgreSQL, LevelDB, SQLite


[Diagram: Single Block Update overwrites one block in the middle of a file (A B C becomes A X C); Multi Block Append adds new blocks to the end of the file.]
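A rough C sketch of the two workloads (my own reduction; the block size and file paths are illustrative only):

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BLOCK 4096

/* Single Block Update: overwrite one block in the middle of an existing file,
   then fsync (the LMDB/PostgreSQL/SQLite-style pattern). */
int single_block_update(const char *path, off_t block_no) {
    char buf[BLOCK];
    memset(buf, 'X', sizeof buf);
    int fd = open(path, O_WRONLY);
    if (fd < 0) return -1;
    if (pwrite(fd, buf, BLOCK, block_no * BLOCK) != BLOCK) { close(fd); return -1; }
    int rc = fsync(fd);
    close(fd);
    return rc;
}

/* Multi Block Append: add new blocks at the end of a file, then fsync
   (the append-only-file / write-ahead-log pattern). */
int multi_block_append(const char *path, int nblocks) {
    char buf[BLOCK];
    memset(buf, 'C', sizeof buf);
    int fd = open(path, O_WRONLY | O_APPEND);
    if (fd < 0) return -1;
    for (int i = 0; i < nblocks; i++)
        if (write(fd, buf, BLOCK) != BLOCK) { close(fd); return -1; }
    int rc = fsync(fd);
    close(fd);
    return rc;
}
```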

SLIDE 13

File System | Result #1: Clean Pages

  • Dirty page is marked clean after fsync failure on all three file systems
  • Feature, not a bug
  • Avoids memory leaks when a user removes a USB stick
  • Retries are ineffective (see the sketch after this slide)
  • No more dirty pages on the next fsync


[Diagram: (1) the middle page of the file is modified and marked dirty; (2) fsync( ) fails; (3) the page is nevertheless marked clean, so its new contents will never be written back.]
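This is why a retry loop around fsync is an anti-pattern after an EIO. The hedged sketch below (my own) shows the pattern that silently "succeeds" on the second call because the failed pages were already marked clean, a behavior the talk observed on ext4, XFS, and Btrfs:

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Naive retry loop: after the first EIO, the kernel has already marked the
   failed dirty pages clean, so the next fsync has nothing left to write and
   returns 0. The application believes the data is durable even though it
   never reached the disk. */
int fsync_with_retries(int fd, int max_retries) {
    for (int attempt = 0; attempt <= max_retries; attempt++) {
        if (fsync(fd) == 0)
            return 0;                      /* "success", possibly a false one */
        fprintf(stderr, "fsync attempt %d failed: errno=%d\n", attempt, errno);
    }
    return -1;
}
```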

SLIDE 14

File System | Result #2a: Page Content

  • File systems do not handle fsync errors uniformly
  • Page content depends on file system
  • Cannot reliably depend on page cache content after an fsync failure


[Diagram: (1) the middle page is modified (B becomes X); (2) fsync( ) fails and the page is marked clean; (2a) ext4 and XFS keep the latest data (X) in the page cache; (2b) Btrfs reverts the page to its on-disk state (B).]

SLIDE 15
File System | Result #2b: Notifications

  • File systems do not report fsync failures uniformly
  • Ext4 ordered mode, XFS, and Btrfs report failures immediately
  • Ext4 data mode reports failures later: it reports success too early
  • Calling fsync twice can work around the problem

[Diagram: (1) the middle page is modified; (2) fsync( ) succeeds once the data is written to the journal; (3) writing the journaled data to disk fails, and the error is only reported by the next fsync( ).]

SLIDE 16

File System | Result #3: In-memory state

  • In-memory data structures are not entirely reverted
  • Free space and block allocation are unaltered in ext4 and XFS
  • As a result, applications later read the block’s old, never-overwritten contents: corruption


[Diagram: (1) a new block is appended to the end of the file; (2) fsync( ) fails, no metadata is persisted, and the block remains allocated with its link held only in memory; (3a/3b) the link is persisted later, after some time, at unmount, or once future writes and fsyncs succeed, exposing the non-overwritten block.]

SLIDE 17

File System | Result #3: In-memory state

  • In-memory data structures are not entirely reverted
  • Btrfs leaves holes in files because the user-space file descriptor offset is not reverted
  • Applications then read zeroes at the hole offset: corruption (see the sketch after this slide)


[Diagram: (1) block B is written at the end of the file, advancing the file descriptor offset; (2) fsync( ) fails and Btrfs reverts the file state, but not the offset; (3) the next write, C, lands at the already-advanced offset and a later fsync( ) persists it there; (4) a hole is left in place of B.]
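A hedged C sketch of this sequence (my own illustration; the file name is made up, and the hole only appears when the first fsync actually fails on Btrfs):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Sequential appends through one descriptor, as an append-only log would do. */
    int fd = open("appendlog", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Write block B; the file descriptor offset advances past it. */
    write(fd, "BBBB", 4);

    if (fsync(fd) != 0) {
        /* On Btrfs the failed append is reverted on disk and in the page cache,
           but this descriptor's offset still points past where B would have been. */
        perror("fsync");
    }

    /* The next write lands at the already-advanced offset; if the previous fsync
       failed on Btrfs, the persisted file ends up with a hole (zeroes) in place of B. */
    write(fd, "CCCC", 4);
    if (fsync(fd) != 0) perror("fsync");

    close(fd);
    return 0;
}
```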

SLIDE 18

File System | Results Summary

After an fsync failure …

  • Dirty pages are marked clean
  • Retries are ineffective
  • Errors are not handled uniformly
  • Page content varies across file systems
  • Notifications are not always immediate
  • In-memory data structures are not correct
  • Future operations cause non-overwritten blocks (ext4, XFS), holes (Btrfs)
  • Both are corruptions to the application



SLIDE 19

Applications

SLIDE 20

Applications

  • Five widely used applications


  • Redis v5.0.7 – key-value store (server)
  • LMDB v0.9.24 – key-value store (embedded)
  • LevelDB v1.22 – key-value store (embedded)
  • SQLite v3.30.1 – relational database (embedded)
  • PostgreSQL v12.0 – relational database (server)

SLIDE 21

Applications | Methodology

  • Goal: Are application strategies effective when fsync fails?
  • Simple workload
  • Insert/Update a key-value pair
  • Use two-column table for RDBMS
  • Make fsync fail
  • Dump all key-value pairs
  • While the application is still running
  • On application restart
  • On page eviction
  • On machine restart


SLIDE 22

Applications | Methodology: CuttleFS

  • Deterministic fault injection with configurable post-failure reactions

  • Fail file offsets, not block numbers
  • User-space page cache
  • Easy to simulate different post-failure reactions

  • Dirty or clean pages
  • New or old content
  • Immediate or late error reporting
  • Fine grained control over page eviction


[Diagram: CuttleFS is a FUSE file system between applications and the real file system/disk; it intercepts file system requests.]

SLIDE 23

Applications | Results: Overview

[Results table: columns are Redis, LMDB, LevelDB, SQLite (rollback and WAL modes), and PostgreSQL (default and DirectIO); rows are ext4 ordered mode, ext4 data mode, Btrfs, and XFS. Each cell lists the observed failure modes as combinations of false failures, corruption, and data loss. Ext4 data mode causes data loss for every application, and XFS behaves the same as ext4 ordered mode.]

SLIDE 24

Applications | Results #1: Crash/Restart

  • Simple strategies fail
  • Crash/restart is incorrect: recovers wrong data from page cache
  • Example: PostgreSQL


[Diagram: PostgreSQL example. With A = 1 durable, SET A = 2 appends a record to the WAL; fsync( ) on the WAL fails, but the record remains in the now-clean page cache. After an application crash and restart, recovery replays the WAL through the page cache and the table reports A = 2, even though the client was told the update failed: a false failure.]

SLIDE 25

Applications | Results #1: False Failures

  • False Failures: Indicate failure but actually succeed
  • PostgreSQL, SQLite, LevelDB WAL are affected


  • Initially: expected A = 100, actual A = 100
  • UPDATE Table SET A = A - 1 reports failure (a false failure): expected A = 100, actual A = 99
  • Retrying UPDATE Table SET A = A - 1: expected A = 99, actual A = 98, a double decrement

SLIDE 26

Applications | Results #2: Late Error Reporting

  • All applications susceptible to data loss on ext4 data mode
  • Late error reporting
  • Example: PostgreSQL WAL


[Diagram: PostgreSQL on ext4 data mode. (1) SET A = 2 appends to the WAL and fsync( ) succeeds because the data reached the ext4 journal; (2) ext4 checkpointing later fails; (3) after a machine restart, the WAL contains only A = 1 and the update is gone: data loss.]

SLIDE 27

Applications | Results #3: Btrfs winning?

  • Btrfs’s copy-on-write strategy helps, but not everywhere
  • Reverts page cache to match disk
  • Works well for recovery from WAL
  • Bad for rollback techniques
  • Example: SQLite rollback mode


SLIDE 28

Applications | Results #3: Btrfs winning?


[Diagram: SQLite in rollback mode on Btrfs. (1) a query updates block B; (2a) B’s old value is first written to the rollback journal; (2b) B is then updated in the main database; (3) fsync( ) on the rollback journal fails and Btrfs reverts the journal’s page-cache contents, so there is nothing left to roll back: a false failure.]

Rollback should not assume page-cache contents; the same pattern causes corruptions on ext4 ordered mode and XFS.

SLIDE 29

Applications | Results Summary

  • Simple strategies fail
  • Applications have moved away from retries
  • Crash/Restart not entirely correct
  • Don’t trust the page cache while recovering
  • Defenseless against late error reporting
  • Ext4 Data Mode
  • Data loss in all applications
  • Corruptions in some
  • Double fsync should help
  • Copy-on-write file systems look promising
  • Btrfs
  • Works well with write-ahead logs
  • Problematic with rollback journals


SLIDE 30

Wrapping Up

  • Can applications recover from fsync failures?
  • Maybe, if …
  • Developers write file-system-specific code
  • Need to standardize file-system behavior for fsync failures


SLIDE 31

Challenges and Directions

  • How should post-failure behavior be standardized?
  • FreeBSD re-dirties pages
  • Should applications code for specific file systems?
  • Today, applications already write OS-specific code
  • We need a stronger contract for failed intentions (ext4 data mode)
  • Fault injection
  • Don’t mock system calls; exercise the file system’s error-handling code
  • dm-loki: https://github.com/WiscADSL/dm-loki
  • To test applications, emulate the file system’s post-error behavior instead
  • CuttleFS: https://github.com/WiscADSL/cuttlefs


SLIDE 32

Summary

  • Durability is important
  • Hard to get right
  • fsync is essential
  • Failures are inevitable
  • File systems don’t handle them uniformly
  • Applications have different strategies to achieve durability
  • No single strategy works well on all file systems


SLIDE 33

Questions?

  • Anthony Rebello
  • arebello@wisc.edu
  • https://github.com/WiscADSL/cuttlefs
  • https://github.com/WiscADSL/dm-loki


Thank You