Can Applications Recover from fsync Failures?
Anthony Rebello, Yuvraj Patel, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau University of Wisconsin-Madison
How does data reach the disk? Applications use the file
using fsync( )
Applications, File System, Disk
Clean pages: same content as disk
Dirty pages: new data to write to disk
Periodically
“About a year ago the PostgreSQL community discovered that fsync (on Linux and some BSD systems) may not work the way we always thought it is [sic], with possibly disastrous consequences for data durability/consistency (which is something the PostgreSQL community really values).”
1 • File system reactions to fsync failures
2 • Application reactions to fsync failures
Applications, File System, Disk
dm-loki: intercepts bio requests
[Figure: a three-page file (A, B, C) shown in panels 1 to 3]
Modify middle page
fsync( ) fails
Page is marked clean
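Because a failed fsync( ) leaves the page marked clean, a later fsync( ) may report success without the modified page ever reaching the disk. The Python sketch below shows one defensive pattern under that assumption: treat the first failure as fatal instead of retrying. The names `write_durably`, `DurabilityError`, and `failing_fsync`, and the injectable `fsync` parameter, are hypothetical and exist only for illustration.

```python
import os


class DurabilityError(Exception):
    """Raised when data could not be confirmed durable (hypothetical name)."""


def write_durably(fd, data, offset, fsync=os.fsync):
    """Write data at offset and fsync once; never retry a failed fsync.

    The fsync parameter is an illustrative hook so a failure can be injected.
    """
    os.lseek(fd, offset, os.SEEK_SET)
    written = 0
    while written < len(data):          # os.write may write fewer bytes
        written += os.write(fd, data[written:])
    try:
        fsync(fd)
    except OSError as exc:
        # The dirty page may already be marked clean, so a second fsync
        # could succeed without persisting this write. Surface the error.
        raise DurabilityError("fsync failed; on-disk state is unknown") from exc


def failing_fsync(fd):
    """Stand-in for a device that rejects the write (assumption for the demo)."""
    raise OSError(5, "Input/output error")
```

A caller that catches `DurabilityError` would typically stop and recover from a known-good on-disk state rather than call fsync( ) again.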
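[Figure: in-memory and on-disk pages A, B, C across panels 1, 2, 2a, 2b]
1 • Middle page modified (B becomes X in memory)
2 • fsync( ) fails; page is marked clean
2a • Ext4 and XFS keep latest data (X stays in memory)
2b • Btrfs reverts state (memory shows B again)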
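The two reactions above can be sketched with a toy model. This is not how any file system is implemented: `ToyFile`, its `on_failure` policy values, and the `disk_ok` flag are hypothetical names for illustration. The model also assumes that an fsync( ) with nothing dirty returns success; the slides state only that the page is marked clean.

```python
class ToyFile:
    """Toy model of one file's page cache and disk contents."""

    def __init__(self, pages, on_failure):
        self.disk = list(pages)        # contents on disk
        self.memory = list(pages)      # contents in the page cache
        self.dirty = set()             # indices of dirty pages
        self.on_failure = on_failure   # "keep" (ext4/XFS) or "revert" (Btrfs)

    def write(self, index, value):
        """Modify a page in memory and mark it dirty."""
        self.memory[index] = value
        self.dirty.add(index)

    def fsync(self, disk_ok=True):
        """Return True on success, False on failure."""
        if disk_ok:
            for i in self.dirty:
                self.disk[i] = self.memory[i]
        elif self.on_failure == "revert":
            for i in self.dirty:
                self.memory[i] = self.disk[i]   # drop the new data
        # Pages are marked clean whether or not the write reached the disk.
        failed = bool(self.dirty) and not disk_ok
        self.dirty.clear()
        return not failed


ext4_like = ToyFile(["A", "B", "C"], on_failure="keep")
ext4_like.write(1, "X")
first_result = ext4_like.fsync(disk_ok=False)    # fails
second_result = ext4_like.fsync(disk_ok=True)    # nothing dirty any more

btrfs_like = ToyFile(["A", "B", "C"], on_failure="revert")
btrfs_like.write(1, "X")
btrfs_like.fsync(disk_ok=False)
```

In the "keep" case the application reads X from memory while the disk still holds B, and in this model the retry reports success without writing anything.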
[Figure: in-memory and on-disk pages A, B, C with a journal, across panels 1 to 3]
1 • Middle page modified
2 • fsync( ) succeeds; data written to journal
3 • Failure when writing journal to disk; fails next fsync( )
[Figure: file with block A and appended block B across panels 1, 2, 3a, 3b]
1 • Write to end
No block allocated
2 • fsync( ) fails; no metadata persisted
Block allocated; link only in memory
3a • Link persisted after some time or unmount
3b • Link persisted if future writes + fsync succeeds
Non-overwritten block
[Figure: file with block A and appended blocks B and C across panels 1 to 4]
1 • Write to end
No block allocated
2 • fsync( ) fails; state is reverted
3 • Next write is at updated offset; fsync( ) persists at updated offset
4 • Hole in place of B
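The hole can be reproduced on an ordinary file by emulating the revert. In the Python sketch below, `os.ftruncate` is only a stand-in for "state is reverted" (an assumption for the demo, not what the file system does internally), and `append_with_reverted_failure` is a hypothetical helper name. The key point is that the descriptor's offset is not rewound when the file contents are.

```python
import os
import tempfile


def append_with_reverted_failure(block=4):
    """Emulate the append sequence and return the final file contents."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"A" * block)     # existing block A
        os.write(fd, b"B" * block)     # 1. write to end
        # 2. Emulate "fsync( ) fails, state is reverted": the file shrinks
        #    back to A, but the descriptor's offset is left untouched.
        os.ftruncate(fd, block)
        os.write(fd, b"C" * block)     # 3. next write lands at the updated offset
        os.fsync(fd)                   #    and this fsync succeeds
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, 3 * block)
    finally:
        os.close(fd)
        os.remove(path)


contents = append_with_reverted_failure()
```

The returned bytes are A, then a zero-filled gap where B would have been, then C.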
Key Value Store: Redis v5.0.7 (Server), LMDB v0.9.24 (Embedded), LevelDB v1.22 (Embedded)
Relational Database: SQLite v3.30.1 (Embedded), PostgreSQL v12.0 (Server)
configurable post-failure reactions
Applications, CuttleFS, File System, Disk
CuttleFS (FUSE): intercepts file system requests
Applications and configurations: Redis; LMDB; LevelDB; SQLite (Rollback, WAL); PostgreSQL (Default, DirectIO)
[Table: application outcomes after fsync failures, by file system]
File systems: Ext4 Ordered Mode, Ext4 Data Mode, Btrfs, XFS (same as ext4 ordered)
Outcomes: False Failures, Corruption, Data Loss
[Figure: key-value view, WAL, and table across panels 1, 2a, 2b, 3]
Key Val: A 1
SET A = 2; fsync( ) fails
App Crash + Restart; Key Val: A 2
False Failure
                                      Expected State   Actual State
Initially                             A=100            A=100
UPDATE Table SET A = A - 1
  Reports failure                     A=100            A=99     False Failure
Retry… UPDATE Table SET A = A - 1
                                      A=99             A=98     Double Decrement
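The double decrement can be replayed with a toy store. This is an illustrative model only: `ToyStore`, `decrement_with_retry`, and the `fsync_ok` flag are hypothetical names, and the store simply applies every update while reporting whatever the fsync result was.

```python
class ToyStore:
    """Toy store whose commit can report a false failure."""

    def __init__(self, value):
        self.value = value

    def decrement(self, fsync_ok=True):
        """Apply A = A - 1; return True if the commit is reported successful."""
        self.value -= 1        # the update takes effect regardless
        return fsync_ok        # but the caller only sees the fsync outcome


def decrement_with_retry(store, outcomes):
    """Naive client: retry until one decrement is reported successful."""
    for fsync_ok in outcomes:
        if store.decrement(fsync_ok):
            return True
    return False


store = ToyStore(100)
reported_ok = decrement_with_retry(store, [False, True])  # fail once, then succeed
expected_value = 100 - 1
actual_value = store.value
```

The client believes it applied one decrement, yet the stored value has dropped by two, so a reported failure cannot be read as "nothing changed".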
[Figure: key-value view, WAL, table, and ext4 journal across panels 1, 2a, 2b, 3, 4]
SET A = 2; fsync( ) succeeds; Key Val: A 2
Ext4 checkpointing fails
Machine Restart; Key Val: A 1
Data Loss
[Figure: SQLite DB (pages A, B, C) and Rollback Journal across panels 1, 2a, 2b, 3]
Query: Updates B
First write B to rollback journal
Update B in main db
fsync( ) on rollback journal fails
Btrfs reverts contents
Nothing to rollback anymore
False Failure
Rollback should not assume page-cache contents
Corruptions in ext4 ordered mode / XFS.