Optimistic Crash Consistency



slide-1
SLIDE 1

Optimistic Crash Consistency

Vijay Chidambaram Thanumalayan Sankaranarayana Pillai Andrea Arpaci-Dusseau Remzi Arpaci-Dusseau

slide-2
SLIDE 2

SOSP 13

Crash Consistency Problem

Single file-system operation updates multiple on-disk data structures

System may crash in middle of updates

File-system is partially (incorrectly) updated

slide-3
SLIDE 3

Performance OR Consistency

Crash-consistency solutions degrade performance

Users forced to choose between high performance and strong consistency

  • Performance differs by 10x for some workloads

Many users choose performance

  • ext3 default configuration did not guarantee crash consistency for many years
  • Mac OS X fsync() does not ensure data is safe

“The Fast drives out the Slow even if the Fast is wrong” (Kahan)
slide-4
SLIDE 4

Ordering and Durability

Crash consistency is built upon ordered writes

File systems conflate ordering and durability

  • Ideal: {A, B} -> {C} (made durable later)
  • Current scenario
      • {A, B} durable
      • {C} durable

Inefficient when only ordering is required
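A minimal sketch of the conflation, assuming an application that only needs C to reach disk after A and B. It uses Python's standard os.fsync(); the file names and the helpers write_durably and write_ordered are made up for illustration. With only fsync() available, asking for order also forces each earlier write to be made durable first.

```python
import os
import tempfile

def write_durably(path, data):
    """Write data and wait until the OS reports it flushed (ordering + durability)."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to force the data to stable storage

def write_ordered(directory):
    # Hypothetical workload: C must not reach disk before A and B.
    a = os.path.join(directory, "A")
    b = os.path.join(directory, "B")
    c = os.path.join(directory, "C")
    write_durably(a, b"a")   # forced durable now, although only ordering was needed
    write_durably(b, b"b")
    write_durably(c, b"c")   # ideally this could become durable later
    return [a, b, c]

workdir = tempfile.mkdtemp()
paths = write_ordered(workdir)
```

Each call blocks on the storage device, which is the cost the slide calls inefficient when only ordering is required.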

slide-5
SLIDE 5

Can a file system provide both high performance and strong consistency?

Is there a middle ground between:

  • high performance but no consistency
  • strong consistency but low performance?

slide-6
SLIDE 6

Our solution

Optimistic File System (OptFS)

Journaling file system that provides performance and consistency by decoupling ordering and durability

Such decoupling allows OptFS to trade freshness for performance while maintaining crash consistency

slide-7
SLIDE 7

Results

Techniques: checksums, delayed writes, etc.

OptFS provides strong consistency

  • Equivalent to ext4 data journaling

OptFS improves performance significantly

  • 10x better than ext4 on some workloads

New primitive osync() provides ordering among writes at high performance

slide-8
SLIDE 8

Outline

Introduction
Ordering and Durability in Journaling
Optimistic File System
Results
Conclusion

slide-9
SLIDE 9

Outline

Introduction
Ordering and Durability in Journaling

  • Journaling Overview
  • Realizing Ordering on Disks
  • Journaling without Ordering

Optimistic File System
Results
Conclusion

slide-10
SLIDE 10

Journaling Overview

Before updating file system, write note describing update

Make sure note is safely on disk

Once note is safe, update file system

  • If interrupted, read note and redo updates
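The note-then-update idea can be sketched as follows. This is an in-memory toy, not the ext4 journal: the file system is a dict, the journal is a list, and appending to the list stands in for "note is safely on disk". The names apply_update and recover are assumptions for illustration.

```python
def apply_update(fs, journal, update, crash_before_apply=False):
    """Write a note describing the update, then apply it to the file system."""
    journal.append(dict(update))     # the note; assumed safely on disk once appended
    if crash_before_apply:
        return                       # simulated crash: note is safe, update not applied
    fs.update(update)
    journal.clear()                  # update applied, note no longer needed

def recover(fs, journal):
    """After a crash, read each note and redo its update."""
    for note in journal:
        fs.update(note)              # redo is idempotent: same keys, same values
    journal.clear()

fs, journal = {}, []
apply_update(fs, journal, {"inode 7": "size=4096", "bitmap": "block 12 used"},
             crash_before_apply=True)
state_after_crash = dict(fs)
recover(fs, journal)
```

Because the note is written before the file system is touched, recovery either finds no note (nothing to do) or a complete one (redo everything).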

slide-11
SLIDE 11

Journaling Overview

Workload: Creating and writing to a file

Journaling protocol (ordered journaling)

  • Data write (D)
  • Logging Metadata (JM)
  • Logging Commit (JC)
  • Checkpointing (M)

[Figure: APPLICATION, FILE SYSTEM, and DISK, with the disk divided into JOURNAL, METADATA, and DATA regions; blocks D, JM, JC, and M are added one step at a time]

slide-21
SLIDE 21

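How Writes are Ordered

[Figure: two panels, Original Disks and Disks with Write Buffers. Writes A and B go to the disk; with a write buffer they sit in the Disk Cache before reaching the Disk Platter, and a Flush is shown between them]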
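A toy model of a disk with a write buffer, to show why a flush between two writes fixes their order. The class CachedDisk and its random destaging policy are assumptions made for this sketch; real disks choose the order by their own scheduling rules.

```python
import random

class CachedDisk:
    """Toy disk with a volatile write cache."""
    def __init__(self, seed=0):
        self.cache = []      # blocks accepted but not yet durable
        self.platter = []    # blocks in the order they became durable
        self.rng = random.Random(seed)

    def write(self, block):
        self.cache.append(block)

    def flush(self):
        # Destage everything currently cached, in an order the disk chooses.
        self.rng.shuffle(self.cache)
        self.platter.extend(self.cache)
        self.cache = []

def a_always_before_b(use_flush, trials=200):
    for seed in range(trials):
        disk = CachedDisk(seed)
        disk.write("A")
        if use_flush:
            disk.flush()     # A is durable before B is even issued
        disk.write("B")
        disk.flush()         # final destage of whatever is left
        if disk.platter != ["A", "B"]:
            return False
    return True
```

With the intermediate flush, A and B are never in the cache together, so the disk has nothing to re-order.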

slide-22
SLIDE 22

Journaling with Flushes

Journaling protocol

  • Data write (D)
  • Logging Metadata (JM)
  • Logging Commit (JC)
  • Checkpointing (M)

[Figure: APPLICATION, FILE SYSTEM, DISK CACHE, and DISK PLATTER, with JOURNAL, METADATA, and DATA regions; D and JM are written, then a FLUSH is issued; JC is written, then a second FLUSH; finally M is checkpointed]
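A small sketch of what the two flushes buy. It assumes a simplified disk model (not taken from the slides): writes issued between two flushes may become durable in any order, but nothing issued after a flush becomes durable before everything issued ahead of it. Under that model the code enumerates every set of blocks a crash could leave on disk.

```python
from itertools import permutations

PROTOCOL = ["D", "JM", "FLUSH", "JC", "FLUSH", "M"]

def epochs(commands):
    """Split a command stream into groups of writes separated by FLUSH."""
    groups, current = [], []
    for cmd in commands:
        if cmd == "FLUSH":
            groups.append(current)
            current = []
        else:
            current.append(cmd)
    groups.append(current)
    return [g for g in groups if g]

def possible_crash_states(commands):
    """All sets of durable blocks that a crash could leave behind."""
    states = {frozenset()}
    done = frozenset()
    for group in epochs(commands):
        for order in permutations(group):
            for i in range(1, len(order) + 1):
                states.add(done | frozenset(order[:i]))
        done = done | frozenset(group)
    return states

def commit_is_safe(state):
    # A durable commit block must imply durable data and journaled metadata.
    return "JC" not in state or {"D", "JM"} <= state

with_flushes = possible_crash_states(PROTOCOL)
without_flushes = possible_crash_states([c for c in PROTOCOL if c != "FLUSH"])
```

With both flushes, every reachable state satisfies commit_is_safe; with the flushes stripped, states such as {JC} alone become possible.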

slide-35
SLIDE 35

Journaling without Ordering

Practitioners turn off flushes due to performance degradation

  • Ex: ext3 by default did not enable flushes for many years

Observe crashes do not cause inconsistency for some workloads

We term this probabilistic crash consistency

  • Studied in detail

slide-36
SLIDE 36

Journaling without Ordering

Without flushes, blocks may be reordered

  • Ex: JC and JM written first as disk head near journal

[Figure: the journaling protocol (D, JM, JC, M) across APPLICATION, FILE SYSTEM, DISK CACHE, and DISK PLATTER, first with both FLUSH commands and then with them removed]
slide-41
SLIDE 41

Probabilistic Crash Consistency

Re-ordering leads to windows of vulnerability

P-inconsistency = Time in window(s) / Total I/O Time

[Figure: timeline of D, JM, JC, and M in MEMORY and on DISK; JC is shown reaching disk first, and the Window is marked against the Total I/O Time]
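The ratio can be computed directly. The numbers below are made up for illustration and do not come from the paper; windows are given as (start, end) pairs and assumed not to overlap.

```python
def p_inconsistency(windows, total_io_time):
    """Fraction of I/O time spent inside windows of vulnerability.

    windows: list of (start, end) times, assumed non-overlapping.
    """
    time_in_windows = sum(end - start for start, end in windows)
    return time_in_windows / total_io_time

# Made-up example: two windows (5 ms and 20 ms) in a 100 ms run.
example = p_inconsistency([(10, 15), (40, 60)], total_io_time=100)
```

Here the windows cover 25 of 100 time units, so a crash at a uniformly random moment lands in a vulnerable window a quarter of the time.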

slide-45
SLIDE 45

Probabilistic Crash Consistency

p-inconsistency for different workloads

  • Read-heavy workloads have low p-inconsistency
  • Database workloads have high p-inconsistency

See paper for detailed study

  • Factors that affect p-inconsistency

Turning off flushing provides performance, but does not ensure consistency

Additional techniques required to obtain both performance and consistency

slide-46
SLIDE 46

Outline

Introduction
Ordering and Durability in Journaling
Optimistic File System

  • Overview
  • Handling Re-Ordering
  • New File-system Primitives

Results
Conclusion

slide-47
SLIDE 47

Optimistic File System

Achieves both performance and consistency by trading on new axis

Freshness indicates how up-to-date state is after a crash

OptFS provides strong consistency while trading freshness for increased performance

[Figure: State 1, State 2, State 3, and State 4 in sequence with a crash marked X; labels for ext4 and OptFS]

slide-48
SLIDE 48

Optimistic File System

Eliminates flushes in the common case

Blocks may be re-ordered without flushes

Optimistic Crash Consistency handles re-orderings with different techniques

  • Some re-orderings are detected after crash
  • Some re-orderings are prevented from occurring

slide-49
SLIDE 49

Modified Disk Interface

Asynchronous Durability Notifications (ADN) signal when block is made durable

[Figure: blocks A and B moving from the Disk Cache to the Disk Platter]

slide-50
SLIDE 50

Modified Disk Interface

ADNs increase disk freedom

  • Blocks can be destaged in any order
  • Blocks can be destaged at any time
  • Only requirement is to inform upper layer

OptFS uses ADNs to control which blocks are dirty at the same time in the disk cache

  • Re-ordering can only happen among these blocks

slide-52
SLIDE 52

Handling Re-Ordering: Removing Flush #1

Flush after JM is removed

  • Checksums used to handle reordering

[Figure: journaling protocol (D, JM, JC, M) across APPLICATION, FILE SYSTEM, DISK CACHE, and DISK PLATTER, with one FLUSH remaining]
slide-53
SLIDE 53

Technique #1: Checksums

JC could be re-ordered before D or JM

Re-ordering detected using checksums

  • Computed over data and metadata
  • Checked during recovery
  • Mismatch indicates blocks were lost during crash

[Figure: D, JM, JC, FLUSH, M]
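A rough sketch of checksum-based detection. CRC32 from Python's zlib stands in for whatever checksum OptFS actually uses, and the commit record layout (a dict holding only the checksum) is an assumption for illustration.

```python
import zlib

def transaction_checksum(data_blocks, metadata_blocks):
    """CRC32 over data and metadata blocks, taken in a fixed order."""
    crc = 0
    for block in list(data_blocks) + list(metadata_blocks):
        crc = zlib.crc32(block, crc)
    return crc

def make_commit(data_blocks, metadata_blocks):
    # Hypothetical commit record: only the checksum matters for this sketch.
    return {"checksum": transaction_checksum(data_blocks, metadata_blocks)}

def transaction_is_valid(commit, data_on_disk, metadata_on_disk):
    """Recovery-time check: a mismatch means some block never reached disk."""
    return commit["checksum"] == transaction_checksum(data_on_disk, metadata_on_disk)

D = [b"new file contents"]
JM = [b"inode update", b"bitmap update"]
JC = make_commit(D, JM)

intact = transaction_is_valid(JC, D, JM)
# Crash case: JC was re-ordered ahead of D, so the data block still holds old bytes.
lost_data = transaction_is_valid(JC, [b"stale bytes before write"], JM)
```

The re-ordering itself is not prevented; recovery simply refuses to replay a transaction whose commit block does not match what is on disk.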

slide-54
SLIDE 54

Handling Re-Ordering: Removing Flush #2

Flush after JC is removed

  • Delayed writes used to prevent reordering

[Figure: journaling protocol (D, JM, JC, M) across APPLICATION, FILE SYSTEM, DISK CACHE, and DISK PLATTER, with no FLUSH remaining]

slide-55
SLIDE 55

Technique #2: Delayed Writes

M could be re-ordered before D or JM or JC

Re-ordering prevented using delayed writes

  • Wait until ADNs arrive for D, JM, and JC
  • Then issue M to disk cache
  • Invariant: D/JM/JC and M never dirty in cache together

[Figure: D, JM, JC, M, with D, JM, and JC grouped apart from M]
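The delayed-write rule can be sketched as a small state machine. The class name DelayedCheckpoint, the callback issue_to_disk, and the method on_adn are hypothetical; they only model the rule that M is held back until a durability notification has arrived for each of D, JM, and JC.

```python
class DelayedCheckpoint:
    """Holds back M until D, JM, and JC have all been reported durable."""
    def __init__(self, issue_to_disk):
        self.issue_to_disk = issue_to_disk   # callback that sends a block to the disk cache
        self.waiting_on = set()
        self.pending_checkpoint = None

    def submit_transaction(self, d, jm, jc, m):
        self.waiting_on = {d, jm, jc}
        self.pending_checkpoint = m
        for block in (d, jm, jc):
            self.issue_to_disk(block)        # no flush between these writes

    def on_adn(self, block):
        """Called when the disk signals that `block` is durable."""
        self.waiting_on.discard(block)
        if not self.waiting_on and self.pending_checkpoint is not None:
            self.issue_to_disk(self.pending_checkpoint)
            self.pending_checkpoint = None

issued = []
fs = DelayedCheckpoint(issued.append)
fs.submit_transaction("D", "JM", "JC", "M")
fs.on_adn("JC")          # notifications may arrive in any order
fs.on_adn("D")
before_last_adn = list(issued)
fs.on_adn("JM")
```

M enters the disk cache only after the last notification, so M is never dirty in the cache alongside D, JM, or JC.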

slide-56
SLIDE 56

Optimistic Journaling

Checksums and Delayed Writes handle reordering from removing flushes

[Figure: journaling protocol (D, JM, JC, M) across APPLICATION, FILE SYSTEM, DISK CACHE, and DISK PLATTER, shown first with a FLUSH and then without it; in the last steps D, JM, and JC appear as a group apart from M]

slide-63
SLIDE 63

Optimistic Techniques

Other Techniques

  • In-order journal recovery and release
  • Reuse after notification
  • Selective data journaling

See paper for more details

slide-65
SLIDE 65

File-system Primitives

fsync() provides ordering and durability

    write(log)
    fsync(log)
    write(header)
    fsync(header)

OptFS splits fsync()

  • osync() for only ordering and high performance
  • dsync() for durability

    write(log)
    osync(log)
    write(header)
    dsync(header)

Primitives can increase performance

  • Ex: SQLite
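A sketch of the log-then-header pattern in application code. osync() and dsync() are OptFS primitives and do not exist on a stock kernel or in Python's os module, so the two functions below are stand-ins that fall back to os.fsync(); on OptFS the first call would be the cheaper ordering-only primitive. The function commit and the file names are assumptions for illustration.

```python
import os
import tempfile

def osync(fd):
    # Stand-in for OptFS osync(): ordering only. Falling back to fsync()
    # keeps the ordering guarantee but pays for durability as well.
    os.fsync(fd)

def dsync(fd):
    # Stand-in for OptFS dsync(): durability, approximated here by fsync().
    os.fsync(fd)

def commit(log_fd, header_fd, record):
    os.write(log_fd, record)
    osync(log_fd)                    # header must not reach disk before the log
    os.write(header_fd, b"valid")
    dsync(header_fd)                 # durability is requested only at the end

workdir = tempfile.mkdtemp()
log_path = os.path.join(workdir, "log")
header_path = os.path.join(workdir, "header")
log_fd = os.open(log_path, os.O_WRONLY | os.O_CREAT)
header_fd = os.open(header_path, os.O_WRONLY | os.O_CREAT)
commit(log_fd, header_fd, b"record 1")
os.close(log_fd)
os.close(header_fd)
```

An application that can tolerate losing the most recent transactions, but never a header without its log, would replace the final dsync() with osync() as well.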
slide-66
SLIDE 66

Implementation

OptFS based on ext4 code

  • Around 3000 lines of modified/added code

Required modifications to

  • Journaling layer
  • Virtual Memory subsystem

ADNs were emulated using timeouts

  • Block received by disk at time T
  • Block durable at time T+D
  • D = 30 s in our implementation (conservative)
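The timeout emulation can be sketched as a table of send times. The class TimeoutADN and its methods are hypothetical; time is passed in explicitly so the example does not need to sleep for 30 seconds.

```python
DURABILITY_DELAY = 30.0   # seconds; the D = 30 s from the slide

class TimeoutADN:
    """Treats a block as durable once the delay has passed since it was sent."""
    def __init__(self, delay=DURABILITY_DELAY):
        self.delay = delay
        self.sent_at = {}            # block -> time T at which the disk received it

    def block_sent(self, block, now):
        self.sent_at[block] = now

    def durable_blocks(self, now):
        """Return (and forget) every block whose T + D has been reached."""
        ready = sorted(b for b, t in self.sent_at.items() if now >= t + self.delay)
        for b in ready:
            del self.sent_at[b]
        return ready

adn = TimeoutADN()
adn.block_sent("D", now=0.0)
adn.block_sent("JC", now=5.0)
at_29 = adn.durable_blocks(now=29.0)
at_30 = adn.durable_blocks(now=30.0)
at_35 = adn.durable_blocks(now=35.0)
```

A conservative delay means a notification may be late but is never early, which is the direction that keeps the delayed-write invariant safe.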

slide-68
SLIDE 68

Evaluation

Does OptFS preserve file-system consistency after crashes?

  • OptFS consistent after 400 random crashes

How does OptFS perform?

  • OptFS 4-10x better than ext4 with flushes

Can meaningful application-level consistency be built on top of OptFS?

  • Studied gedit and SQLite on OptFS

slide-69
SLIDE 69

Testing Application-Level Consistency

Methodology

  • Start from initial disk image
  • Run application
  • Replace fsync() with osync()
  • Trace writes
  • Re-order writes
  • Drop writes after random point
  • Replay writes on initial disk image
  • Examine application state on new image
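The methodology can be sketched on a toy disk image. The image is a dict from block name to contents, and the trace is a list of epochs. One assumption is made loudly here: osync() is modeled as an epoch boundary, so writes are shuffled within an epoch but never across one. The helper names and the example invariant are hypothetical.

```python
import random

def crashed_image(initial_image, trace, rng):
    """Build one possible post-crash image from a write trace.

    initial_image: dict mapping block name -> contents
    trace: list of epochs; each epoch is a list of (block, contents) writes
    """
    reordered = []
    for epoch in trace:
        epoch = list(epoch)
        rng.shuffle(epoch)                 # re-order writes within an epoch
        reordered.extend(epoch)
    cut = rng.randint(0, len(reordered))   # drop writes after a random point
    image = dict(initial_image)
    for block, contents in reordered[:cut]:
        image[block] = contents            # replay surviving writes
    return image

def log_before_header(image):
    # Hypothetical application invariant: a valid header implies the log record exists.
    return image.get("header") != "valid" or image.get("log") == "record 1"

initial = {"log": "", "header": "empty"}
trace = [[("log", "record 1")], [("header", "valid")]]   # osync() between the two writes
rng = random.Random(0)
images = [crashed_image(initial, trace, rng) for _ in range(100)]
```

Every generated image is then examined with the application's own consistency check, here log_before_header.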

slide-70
SLIDE 70

SQLite Consistency

[Figure: writes W1, W2, W3, W4 take the database from the Initial Image to the Final Image; later steps show only W1, W2, W3 applied, producing a Crashed Image]

  • ext4 without flushes: 73% (Crashed Image)
  • Zero inconsistencies with OptFS or ext4 with flushes
  • ext4 with flushes: 50% / 50% between the Initial and Final Images; Time 150 ms
  • OptFS: 24% / 76% between the Initial and Final Images; Time 15 ms

osync() changes semantics from ACID to ACI-(Eventual Durability)

SQLite is able to provide ACI semantics with osync(), at 10x performance

slide-81
SLIDE 81

Summary

Problem: providing both performance and consistency

Solution: decoupling ordering and durability in OptFS

Eventual Durability maintains consistency while trading freshness for increased performance

osync() provides a cheap primitive to order application writes

slide-82
SLIDE 82

Conclusion

Storage-stack layers are increasing

  • 18 layers between application and storage [Thereska13]
  • Interfaces that provide freedom to each layer are the way forward

First impulse: trade consistency for performance

  • Trade-off not required in distributed systems [Escriva12]
  • By trading freshness, we can obtain both consistency and high performance

slide-83
SLIDE 83

Thank You

Source code

http://research.cs.wisc.edu/adsl/Software/optfs/

http://github.com/vijay03/optfs

Questions?