SLIDE 1

Can Applications Recover from fsync Failures?

Anthony Rebello, Yuvraj Patel, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau
University of Wisconsin-Madison

SLIDE 2

How does data reach the disk?

  • Applications use the file system
  • System calls – open( ), read( ), write( )
  • For Performance
  • Data buffered in the page cache
  • Modified pages are marked dirty
  • Periodically flushed to disk
  • Vulnerable to data loss while in RAM
  • For Correctness
  • Dirty pages can be flushed immediately using fsync( )

[Diagram: Applications write through the File System to the Disk. Clean pages have the same content as the disk; dirty pages hold new data to write to disk, flushed periodically or on fsync( ).]
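To make this path concrete, here is a minimal C sketch (my own illustration, not from the talk; the file name and data are made up) that buffers data with write( ) and then forces it to disk with fsync( ):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* write( ) only copies data into the page cache; the pages become dirty. */
    int fd = open("datafile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (write(fd, "abcd", 4) != 4) { perror("write"); return 1; }

    /* fsync( ) asks the file system to flush the dirty pages to disk now,
       instead of waiting for the periodic background writeback. */
    if (fsync(fd) != 0) { perror("fsync"); return 1; }

    close(fd);
    return 0;
}
```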
SLIDE 3

fsync is really important

  • Many applications care about durability
  • Ensure data on non-volatile storage before acknowledging client
  • Devices have volatile caches
  • Even with Direct I/O, fsync can issue a FLUSH command to persist the device cache
  • Ordering of writes is important
  • Force each write to disk with fsync before issuing the next
  • Optimistic Crash Consistency, Chidambaram et al. [SOSP’13]
  • Decouples ordering from durability


SLIDE 4

It’s hard to get durability correct

  • Applications find it difficult
  • Even when fsync works correctly
  • Example: persisting a newly created file (sketched in code below)
  • creat(/d/foo)
  • write(/d/foo, “abcd”)
  • fsync(/d/foo)
  • fsync(/d)  ← ensures the directory entry is persisted
  • All File Systems Are Not Created Equal Pillai et al. [OSDI’14]
  • Studied 11 applications
  • Update protocols are tricky
  • More than 30 vulnerabilities under ext3, ext4, btrfs

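A minimal C sketch of this protocol (my own illustration, reusing the slide’s /d/foo path): both the file’s data and the directory entry must be fsync’d, otherwise a crash can leave a durable file that is unreachable from its directory.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Create the file and write its contents. */
    int fd = open("/d/foo", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open /d/foo"); return 1; }
    if (write(fd, "abcd", 4) != 4) { perror("write"); return 1; }

    /* Persist the file data and inode. */
    if (fsync(fd) != 0) { perror("fsync /d/foo"); return 1; }
    close(fd);

    /* Persist the directory entry for foo: fsync the parent directory. */
    int dfd = open("/d", O_RDONLY);
    if (dfd < 0) { perror("open /d"); return 1; }
    if (fsync(dfd) != 0) { perror("fsync /d"); return 1; }
    close(dfd);
    return 0;
}
```

Forgetting the final fsync on the directory is exactly the kind of vulnerability the Pillai et al. study reported.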

SLIDE 5

fsync can fail

  • Durability becomes even harder to get right when fsync itself can fail
  • Failures before interacting with disk
  • Invalid arguments, insufficient space
  • Easy to handle
  • Failures while interacting with disk (see the sketch below)
  • EIO: An error occurred during synchronization
  • Transient disk errors, network disconnects
  • In-memory data structures may need to be reverted

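As a rough illustration of this split, the hedged sketch below (my own; the classification and policy are not prescribed by the talk) separates fsync errors that occur before touching the disk from an EIO that occurs during synchronization:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Returns 0 if the data is durable, -1 if the caller must treat it as not durable. */
int sync_and_classify(int fd) {
    if (fsync(fd) == 0)
        return 0;

    switch (errno) {
    case EBADF:
    case EINVAL:
    case ENOSPC:
        /* Failures before interacting with the disk (invalid arguments,
           insufficient space): relatively easy to handle. */
        fprintf(stderr, "fsync rejected: %s\n", strerror(errno));
        return -1;
    case EIO:
        /* Failure while interacting with the disk (transient disk error,
           network disconnect): the page cache may now disagree with the disk,
           and in-memory data structures may need to be reverted. */
        fprintf(stderr, "fsync I/O error: data may not be durable\n");
        return -1;
    default:
        fprintf(stderr, "fsync failed: %s\n", strerror(errno));
        return -1;
    }
}
```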

SLIDE 6

Why care about fsync failures?

“About a year ago the PostgreSQL community discovered that fsync (on Linux and some BSD systems) may not work the way we always thought it is [sic], with possibly disastrous consequences for data durability/consistency (which is something the PostgreSQL community really values).”

  • Tomas Vondra, FOSDEM 2019


SLIDE 7

Our work

  • Systematically understand fsync failures


[Diagram: Applications sit above the File System, which sits above the Disk.]

  1. File system reactions to fsync failures
  • Ext4, XFS, Btrfs
  2. Application reactions to fsync failures
  • Redis, LMDB, LevelDB, SQLite, PostgreSQL
SLIDE 8

File System Results

  • All file systems mark dirty pages clean on fsync failure
  • Retries are ineffective
  • File systems do not handle errors during fsync uniformly
  • Content in pages is different
  • Latest data (ext4, XFS), Old data (Btrfs)
  • Failure notifications not always immediate
  • Ext4 data mode reports errors later
  • In-memory data structures are not entirely reverted after fsync failure
  • Garbage/Zeroes in the files
  • Free space and block allocation unaltered (ext4, XFS)
  • User-space file descriptor offset unaltered (Btrfs)


SLIDE 9

Application Results

  • Simple strategies fail
  • Retries are ineffective
  • Crash/Restart can be incorrect
  • False Failures: Indicate failure but actually succeed
  • Incorrect recovery from WAL using the page cache
  • Defenseless against late error reporting
  • Ext4 data mode
  • Every application faced data loss
  • Most faced corruption (all except PostgreSQL)
  • Copy-on-write is good, but not invincible
  • Btrfs is bad for rollback strategies
  • But seems good for WAL recovery


SLIDE 10

Outline

  • Introduction
  • File Systems
  • Methodology (dm-loki, workloads)
  • Results
  • Applications
  • Methodology (CuttleFS)
  • Results
  • Challenges and Directions
  • Summary


SLIDE 11

File System | Methodology: Fault Injection

  • Goal: Understand file system reactions to fsync failures without modifying the kernel


[Diagram: dm-loki is a device-mapper layer between the file system and the disk; it intercepts bio requests.]

  • Intercept all block requests that go to disk
  • Custom device mapper target – dm-loki
  • Trace bio requests
  • Fail the i-th write to a given sector/block
SLIDE 12

File System | Methodology: Workloads

  • Common write patterns in applications
  • Reduced to their simplest form (see the sketch after this slide)
  • Single Block Update
  • Modify a single block in a file
  • Examples:
  • LMDB, PostgreSQL, SQLite
  • Multi Block Append
  • Add new blocks to the end of a file
  • Examples:
  • Redis append-only file
  • Write-ahead logs
  • PostgreSQL, LevelDB, SQLite


[Diagram: Single Block Update overwrites one block in the middle of a file (A B C becomes A X C); Multi Block Append adds new blocks to the end of the file.]
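A rough C sketch of the two workloads (my own reduction; the block size and file paths are illustrative only):

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BLOCK 4096

/* Single Block Update: overwrite one block in the middle of an existing file,
   then fsync (the LMDB/PostgreSQL/SQLite-style pattern). */
int single_block_update(const char *path, off_t block_no) {
    char buf[BLOCK];
    memset(buf, 'X', sizeof buf);
    int fd = open(path, O_WRONLY);
    if (fd < 0) return -1;
    if (pwrite(fd, buf, BLOCK, block_no * BLOCK) != BLOCK) { close(fd); return -1; }
    int rc = fsync(fd);
    close(fd);
    return rc;
}

/* Multi Block Append: add new blocks at the end of a file, then fsync
   (the append-only-file / write-ahead-log pattern). */
int multi_block_append(const char *path, int nblocks) {
    char buf[BLOCK];
    memset(buf, 'C', sizeof buf);
    int fd = open(path, O_WRONLY | O_APPEND);
    if (fd < 0) return -1;
    for (int i = 0; i < nblocks; i++)
        if (write(fd, buf, BLOCK) != BLOCK) { close(fd); return -1; }
    int rc = fsync(fd);
    close(fd);
    return rc;
}
```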

SLIDE 13

File System | Result #1: Clean Pages

  • Dirty page is marked clean after fsync failure on all three file systems
  • Feature, not a bug
  • Avoids memory leaks when a user removes a USB stick
  • Retries are ineffective (see the sketch after this slide)
  • No more dirty pages on the next fsync


[Diagram: (1) the middle page of the file is modified and marked dirty; (2) fsync( ) fails; (3) the page is nevertheless marked clean, so its new contents will never be written back.]
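This is why a retry loop around fsync is an anti-pattern after an EIO. The hedged sketch below (my own) shows the pattern that silently "succeeds" on the second call because the failed pages were already marked clean, a behavior the talk observed on ext4, XFS, and Btrfs:

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Naive retry loop: after the first EIO, the kernel has already marked the
   failed dirty pages clean, so the next fsync has nothing left to write and
   returns 0. The application believes the data is durable even though it
   never reached the disk. */
int fsync_with_retries(int fd, int max_retries) {
    for (int attempt = 0; attempt <= max_retries; attempt++) {
        if (fsync(fd) == 0)
            return 0;                      /* "success", possibly a false one */
        fprintf(stderr, "fsync attempt %d failed: errno=%d\n", attempt, errno);
    }
    return -1;
}
```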

SLIDE 14

File System | Result #2a: Page Content

  • File systems do not handle fsync errors uniformly
  • Page content depends on file system
  • Cannot reliably depend on page cache content after an fsync failure


[Diagram: (1) the middle page is modified (B becomes X); (2) fsync( ) fails and the page is marked clean; (2a) ext4 and XFS keep the latest data (X) in the page cache; (2b) Btrfs reverts the page to its on-disk state (B).]

SLIDE 15
File System | Result #2b: Notifications

  • File systems do not report fsync failures uniformly
  • Ext4 ordered mode, XFS, and Btrfs report failures immediately
  • Ext4 data mode reports failures later: it reports success too early
  • Calling fsync twice can work around the problem

[Diagram: (1) the middle page is modified; (2) fsync( ) succeeds once the data is written to the journal; (3) writing the journaled data to disk fails, and the error is only reported by the next fsync( ).]

SLIDE 16

File System | Result #3: In-memory state

  • In-memory data structures are not entirely reverted
  • Free space and block allocation are unaltered in ext4 and XFS
  • As a result, applications later read the block’s old, never-overwritten contents: corruption


[Diagram: (1) a new block is appended to the end of the file; (2) fsync( ) fails, no metadata is persisted, and the block remains allocated with its link held only in memory; (3a/3b) the link is persisted later, after some time, at unmount, or once future writes and fsyncs succeed, exposing the non-overwritten block.]

SLIDE 17

File System | Result #3: In-memory state

  • In-memory data structures are not entirely reverted
  • Btrfs leaves holes in files because the user-space file descriptor offset is not reverted
  • Applications then read zeroes at the hole offset: corruption (see the sketch after this slide)


[Diagram: (1) block B is written at the end of the file, advancing the file descriptor offset; (2) fsync( ) fails and Btrfs reverts the file state, but not the offset; (3) the next write, C, lands at the already-advanced offset and a later fsync( ) persists it there; (4) a hole is left in place of B.]
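A hedged C sketch of this sequence (my own illustration; the file name is made up, and the hole only appears when the first fsync actually fails on Btrfs):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Sequential appends through one descriptor, as an append-only log would do. */
    int fd = open("appendlog", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Write block B; the file descriptor offset advances past it. */
    write(fd, "BBBB", 4);

    if (fsync(fd) != 0) {
        /* On Btrfs the failed append is reverted on disk and in the page cache,
           but this descriptor's offset still points past where B would have been. */
        perror("fsync");
    }

    /* The next write lands at the already-advanced offset; if the previous fsync
       failed on Btrfs, the persisted file ends up with a hole (zeroes) in place of B. */
    write(fd, "CCCC", 4);
    if (fsync(fd) != 0) perror("fsync");

    close(fd);
    return 0;
}
```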

SLIDE 18

File System | Results Summary

After an fsync failure …

  • Dirty pages are marked clean
  • Retries are ineffective
  • Errors are not handled uniformly
  • Page content varies across file systems
  • Notifications are not always immediate
  • In-memory data structures are not correct
  • Future operations cause non-overwritten blocks (ext4, XFS), holes (Btrfs)
  • Both are corruptions to the application



SLIDE 19

Applications

SLIDE 20

Applications

  • Five widely used applications


  • Redis v5.0.7 – key-value store (server)
  • LMDB v0.9.24 – key-value store (embedded)
  • LevelDB v1.22 – key-value store (embedded)
  • SQLite v3.30.1 – relational database (embedded)
  • PostgreSQL v12.0 – relational database (server)

SLIDE 21

Applications | Methodology

  • Goal: Are application strategies effective when fsync fails?
  • Simple workload
  • Insert/Update a key-value pair
  • Use two-column table for RDBMS
  • Make fsync fail
  • Dump all key-value pairs
  • While the application is still running
  • On application restart
  • On page eviction
  • On machine restart


SLIDE 22

Applications | Methodology: CuttleFS

  • Deterministic fault injection with configurable post-failure reactions

  • Fail file offsets, not block numbers
  • User-space page cache
  • Easy to simulate different post-failure reactions

  • Dirty or clean pages
  • New or old content
  • Immediate or late error reporting
  • Fine grained control over page eviction


[Diagram: CuttleFS is a FUSE file system between applications and the real file system/disk; it intercepts file system requests.]

SLIDE 23

Applications | Results: Overview

[Results table: columns are Redis, LMDB, LevelDB, SQLite (rollback and WAL modes), and PostgreSQL (default and DirectIO); rows are ext4 ordered mode, ext4 data mode, Btrfs, and XFS. Each cell lists the observed failure modes as combinations of false failures, corruption, and data loss. Ext4 data mode causes data loss for every application, and XFS behaves the same as ext4 ordered mode.]

SLIDE 24

Applications | Results #1: Crash/Restart

  • Simple strategies fail
  • Crash/restart is incorrect: recovers wrong data from page cache
  • Example: PostgreSQL


[Diagram: PostgreSQL example. With A = 1 durable, SET A = 2 appends a record to the WAL; fsync( ) on the WAL fails, but the record remains in the now-clean page cache. After an application crash and restart, recovery replays the WAL through the page cache and the table reports A = 2, even though the client was told the update failed: a false failure.]

SLIDE 25

Applications | Results #1: False Failures

  • False Failures: Indicate failure but actually succeed
  • PostgreSQL, SQLite, LevelDB WAL are affected


  • Initially: expected A = 100, actual A = 100
  • UPDATE Table SET A = A - 1 reports failure (a false failure): expected A = 100, actual A = 99
  • Retrying UPDATE Table SET A = A - 1: expected A = 99, actual A = 98, a double decrement

SLIDE 26

Applications | Results #2: Late Error Reporting

  • All applications susceptible to data loss on ext4 data mode
  • Late error reporting
  • Example: PostgreSQL WAL


[Diagram: PostgreSQL on ext4 data mode. (1) SET A = 2 appends to the WAL and fsync( ) succeeds because the data reached the ext4 journal; (2) ext4 checkpointing later fails; (3) after a machine restart, the WAL contains only A = 1 and the update is gone: data loss.]

SLIDE 27

Applications | Results #3: Btrfs winning?

  • Btrfs’s copy-on-write strategy helps, but not everywhere
  • Reverts page cache to match disk
  • Works well for recovery from WAL
  • Bad for rollback techniques
  • Example: SQLite rollback mode


SLIDE 28

Applications | Results #3: Btrfs winning?


[Diagram: SQLite in rollback mode on Btrfs. (1) a query updates block B; (2a) B’s old value is first written to the rollback journal; (2b) B is then updated in the main database; (3) fsync( ) on the rollback journal fails and Btrfs reverts the journal’s page-cache contents, so there is nothing left to roll back: a false failure.]

Rollback should not assume page-cache contents; the same pattern causes corruptions on ext4 ordered mode and XFS.

SLIDE 29

Applications | Results Summary

  • Simple strategies fail
  • Applications have moved away from retries
  • Crash/Restart not entirely correct
  • Don’t trust the page cache while recovering
  • Defenseless against late error reporting
  • Ext4 Data Mode
  • Data loss in all applications
  • Corruptions in some
  • Double fsync should help
  • Copy-on-write file systems look promising
  • Btrfs
  • Works well with write-ahead logs
  • Problematic with rollback journals


SLIDE 30

Wrapping Up

  • Can applications recover from fsync failures?
  • Maybe, if …
  • Developers write file-system-specific code
  • Need to standardize file-system behavior for fsync failures


SLIDE 31

Challenges and Directions

  • How should post-failure behavior be standardized?
  • FreeBSD re-dirties pages
  • Should applications code for specific file systems?
  • Today, applications already write OS-specific code
  • We need a stronger contract for failed intentions (ext4 data mode)
  • Fault injection
  • Don’t mock system calls; exercise the file system’s error-handling code
  • dm-loki: https://github.com/WiscADSL/dm-loki
  • To test applications, emulate the file system’s post-error behavior instead
  • CuttleFS: https://github.com/WiscADSL/cuttlefs


SLIDE 32

Summary

  • Durability is important
  • Hard to get right
  • fsync is essential
  • Failures are inevitable
  • File systems don’t handle them uniformly
  • Applications have different strategies to achieve durability
  • No single strategy works well on all file systems


SLIDE 33

Questions?

  • Anthony Rebello
  • arebello@wisc.edu
  • https://github.com/WiscADSL/cuttlefs
  • https://github.com/WiscADSL/dm-loki


Thank You