Resolving Journaling of Journal Anomaly via Weaving Recovery Information into DB Page



slide-1
SLIDE 1

Resolving Journaling of Journal Anomaly via Weaving Recovery Information into DB Page

Beomseok Nam, UNIST

NVRAMOS '14, October 30, 2014

slide-2
SLIDE 2

Outline

  • Motivation
  • Journaling of Journal Anomaly
  • How to resolve Journaling of Journal anomaly
  • Multi-Version B-Tree (MVBT)
  • Optimizations of Multi-Version B-Tree
      • Lazy Split
      • Reserved Buffer Space
      • Lazy Garbage Collection
      • Metadata Embedding
      • Disabling Sibling Redistribution
  • Evaluation
  • Conclusion

slide-3
SLIDE 3

Storage I/O Problems in Android

  • Performance Bottleneck
  • Lifetime of Storage

Cause: excessive I/O

slide-4
SLIDE 4

Android I/O Stack

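Apps → SQLite → EXT4 → Block Device Driver

  • Apps issue Insert/Update/Delete/Select to SQLite.
  • SQLite performs transaction journaling and issues Read/Write to EXT4.
  • EXT4 performs its own journaling for each write().

Misaligned Interaction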

slide-5
SLIDE 5

Journaling of Journal Anomaly

  • Journaling in SQLite (TRUNCATE mode)

SQLite: Insert a DB entry

1. Open rollback journal.
2. Record the data to journal.
3. Put commit mark to journal.
4. Insert entry to DB.
5. Truncate journal.
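The journaling steps above can be tried out with Python's standard `sqlite3` module. This is a minimal sketch, not the setup used in the talk: the table name `t`, the column `v`, and the helper name are made up for illustration. `PRAGMA journal_mode=TRUNCATE` is the SQLite setting that commits by truncating the rollback journal.

```python
import os
import sqlite3
import tempfile


def insert_with_truncate_journal(db_path, value):
    """Insert one row with SQLite's TRUNCATE rollback-journal mode.

    Returns (journal_mode, row_count) so the caller can confirm the mode
    took effect. Table `t` and column `v` are illustrative names only.
    """
    conn = sqlite3.connect(db_path)
    try:
        # Commit by truncating the rollback journal instead of deleting it.
        mode = conn.execute("PRAGMA journal_mode=TRUNCATE").fetchone()[0]
        conn.execute("CREATE TABLE IF NOT EXISTS t (v TEXT)")
        # One transaction per insert, as in the single-insert case on the slide.
        with conn:
            conn.execute("INSERT INTO t (v) VALUES (?)", (value,))
        count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
        return mode, count
    finally:
        conn.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(insert_with_truncate_journal(os.path.join(d, "demo.db"), "hello"))
```

Each such transaction touches both the journal file and the database file, which is what the following slides count at the EXT4 level.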

slide-6
SLIDE 6

EXT4 Journaling (ordered mode)

For each write() issued by SQLite, the diagram shows EXT4 performing:

  • Write data block
  • Write EXT4 journal
  • Write journal metadata
  • Write journal commit

slide-7
SLIDE 7

Journaling of Journal Anomaly

SQLite insert → EXT4

One insert of 100 bytes → 9 random writes of 4 KB each

For each write, the diagram shows EXT4 performing:

  • Write data block
  • Write EXT4 journal
  • Write journal metadata
  • Write journal commit
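The write amplification implied by these numbers can be worked out directly. The sketch below assumes "4KByte" means 4096 bytes; the function name is made up for illustration.

```python
def write_amplification(payload_bytes, block_writes, block_bytes=4096):
    """Bytes written to storage per byte of user payload."""
    return block_writes * block_bytes / payload_bytes


# Numbers from the slide: a 100-byte insert causes 9 writes of 4 KB each.
amp = write_amplification(payload_bytes=100, block_writes=9)
print(f"about {amp:.0f}x more bytes written than inserted")
```

Under that assumption, 9 × 4096 = 36,864 bytes reach storage for 100 bytes of payload, several hundred times the logical write size.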

slide-8
SLIDE 8

How to Resolve Journaling of Journal?

SQLite: Insert a DB entry

  • Database Journaling
  • Database Insertion

EXT4: journals the writes from both steps → Journaling of Journal anomaly.

slide-9
SLIDE 9

How to Resolve Journaling of Journal?

Journal + DB = Version-based B-Tree (MVBT)

[Diagram: fsync() calls on the Journal, on the DB, and on a combined Journal+DB structure]

slide-10
SLIDE 10

Version-based B-Tree (MVBT)

  • Versioning

Time T1: Insert 10 → 10 [T1, ∞)
Time T2: Update 10 with 20 → 10 [T1, T2) (dead entry), 20 [T2, ∞)

  • Do not overwrite old version data
  • Do not need a rollback journal
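The versioning idea can be sketched in a few lines. This is only an illustration of half-open version intervals on a single leaf, with hypothetical class and method names. It is not the SQLite or LS-MVBT implementation.

```python
import math
from dataclasses import dataclass

INF = math.inf


@dataclass
class Entry:
    key: int
    start: float        # version that created the entry
    end: float = INF    # version that killed it (INF while live)

    def visible_at(self, version):
        # An entry is visible for versions in the half-open range [start, end).
        return self.start <= version < self.end


class VersionedLeaf:
    """Hypothetical single-node store: updates never overwrite old entries."""

    def __init__(self):
        self.entries = []

    def insert(self, key, version):
        self.entries.append(Entry(key, version))

    def update(self, old_key, new_key, version):
        for e in self.entries:
            if e.key == old_key and e.end == INF:
                e.end = version                      # old entry becomes dead
                self.entries.append(Entry(new_key, version))
                return
        raise KeyError(old_key)

    def snapshot(self, version):
        return sorted(e.key for e in self.entries if e.visible_at(version))


leaf = VersionedLeaf()
leaf.insert(10, version=1)          # 10 [1, INF)
leaf.update(10, 20, version=2)      # 10 [1, 2), 20 [2, INF)
```

Because the old entry is kept with a closed interval, a reader at version 1 still sees 10, and undoing version 2 needs no separate rollback journal.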

slide-11
SLIDE 11

Node Split in Multi-Version B-Tree

Insert key 25, ver=5

P1: 5 [3~∞), 10 [2~∞), 12 [4~∞), 40 [1~∞)

slide-12
SLIDE 12

Node Split in Multi-Version B-Tree

Insert key 25, ver=5

P1 (dead node): 5 [3~5), 10 [2~5), 12 [4~5), 40 [1~5)

slide-13
SLIDE 13

Node Split in Multi-Version B-Tree

Insert key 25, ver=5

P1 (dead node): 5 [3~5), 10 [2~5), 12 [4~5), 40 [1~5)

slide-14
SLIDE 14

Node Split in Multi-Version B-Tree

Insert key 25, ver=5

P1 (dead node): 5 [3~5), 10 [2~5), 12 [4~5), 40 [1~5)
P3: 5 [5~∞), 10 [5~∞)
P2: 12 [5~∞), 40 [5~∞)

slide-15
SLIDE 15

Node Split in Multi-Version B-Tree

Insert key 25, ver=5

P4 (parent): ∞ [0~5) : P1, 10 [5~∞) : P3, ∞ [5~∞) : P2
P1 (dead node): 5 [3~5), 10 [2~5), 12 [4~5), 40 [1~5)
P3: 5 [5~∞), 10 [5~∞)
P2: 12 [5~∞), 25 [5~∞), 40 [5~∞)

slide-16
SLIDE 16

Node Split in Multi-Version B-Tree

P4 (parent, dirty): ∞ [0~5) : P1, 10 [5~∞) : P3, ∞ [5~∞) : P2
P1 (dead node, dirty): 5 [3~5), 10 [2~5), 12 [4~5), 40 [1~5)
P3 (dirty): 5 [5~∞), 10 [5~∞)
P2 (dirty): 12 [5~∞), 25 [5~∞), 40 [5~∞)

One more dirty page than the original B-Tree

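The split walked through on slides 11 to 16 can be sketched as follows. This is a simplified model under stated assumptions: entries are `(key, start_version, end_version)` tuples, the node capacity of 4 is inferred from the diagram, and the function name is hypothetical. A real MVBT applies further conditions before choosing a key split.

```python
INF = float("inf")


def legacy_split_insert(node, key, version, capacity):
    """Insert into a full leaf with a version split followed by a key split.

    Returns (dead_node, new_nodes, dirty_pages).
    """
    # Version split: every live entry in the old node dies at `version`.
    dead_node = [(k, s, version if e == INF else e) for k, s, e in node]
    # Live entries are copied forward with a fresh start version.
    live_keys = sorted(k for k, _, e in node if e == INF)
    copies = [(k, version, INF) for k in live_keys]
    # Key split: if the copies plus the new key overflow one node, halve them.
    if len(copies) + 1 > capacity:
        mid = len(copies) // 2
        new_nodes = [copies[:mid], copies[mid:]]
    else:
        new_nodes = [copies]
    # Place the new key in the last new node whose smallest key is <= key.
    target = new_nodes[0]
    for n in new_nodes[1:]:
        if n and key >= n[0][0]:
            target = n
    target.append((key, version, INF))
    target.sort()
    # Dirty pages: the dead node, each new node, and the parent indexing them.
    dirty_pages = 1 + len(new_nodes) + 1
    return dead_node, new_nodes, dirty_pages


p1 = [(5, 3, INF), (10, 2, INF), (12, 4, INF), (40, 1, INF)]
dead, new_nodes, dirty = legacy_split_insert(p1, key=25, version=5, capacity=4)
```

With the slide's input, the sketch yields one dead node, two new nodes, and four dirty pages, matching the count shown in the diagram.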
slide-17
SLIDE 17

I/O Traffic in MVBT

MVBT: dirty pages in the DB buffer cache are flushed by fsync().

Reduce the number of dirty pages!

slide-18
SLIDE 18

I/O Traffic in MVBT

MVBT vs. LS-MVBT

[Diagram: dirty pages in the DB buffer cache flushed by fsync(), shown for MVBT and for LS-MVBT]

slide-19
SLIDE 19

Optimizations in Android I/O

  • Write vs. Read: t_write >> t_read
  • Write Transaction 1, Write Transaction 2, Write Transaction 3

[Diagram labels: Single Insert Transaction; Multiple Insert]

slide-20
SLIDE 20

Lazy Split Multi-Version B-Tree (LS-MVBT)

  • Optimizations
      • Lazy Split
      • Reserved Buffer Space
      • Lazy Garbage Collection
      • Metadata Embedding
      • Disabling Sibling Redistribution

slide-21
SLIDE 21

Optimization1: Lazy Split

  • Legacy Split in MVBT

P4 (parent, dirty): ∞ [0~5) : P1, 10 [5~∞) : P3, ∞ [5~∞) : P2
P1 (dead node, dirty): 5 [3~5), 10 [2~5), 12 [4~5), 40 [1~5)
P3 (dirty): 5 [5~∞), 10 [5~∞)
P2 (dirty): 12 [5~∞), 25 [5~∞), 40 [5~∞)

4 dirty pages

slide-22
SLIDE 22

Optimization1: Lazy Split

Insert key 25, ver=5

P1: 5 [3~∞), 10 [2~∞), 12 [4~∞), 40 [1~∞)

slide-23
SLIDE 23

Optimization1: Lazy Split

Insert key 25, ver=5

P1 (lazy node: half-dead, half-live): 5 [3~∞), 10 [2~∞), 12 [4~5), 40 [1~5)

slide-24
SLIDE 24

Optimization1: Lazy Split

Insert key 25, ver=5

P1 (lazy node): 5 [3~∞), 10 [2~∞), 12 [4~5), 40 [1~5)

slide-25
SLIDE 25

Optimization1: Lazy Split

Insert key 25, ver=5

P1 (lazy node): 5 [3~∞), 10 [2~∞), 12 [4~5), 40 [1~5)
P2: 12 [5~∞), 40 [5~∞)

slide-26
SLIDE 26

Optimization1: Lazy Split

Insert key 25, ver=5

P3 (parent): 10 [5~∞) : P1, ∞ [0~5) : P1, ∞ [5~∞) : P2
P1 (lazy node): 5 [3~∞), 10 [2~∞), 12 [4~5), 40 [1~5)
P2: 12 [5~∞), 25 [5~∞), 40 [5~∞)
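The lazy split on slides 22 to 26 can be contrasted with the legacy split in a short sketch. It uses the same assumed `(key, start_version, end_version)` tuples, and the function name is hypothetical. Only the upper half of the live entries is killed and copied, so the original node stays in use as a half-dead, half-live "lazy node".

```python
INF = float("inf")


def lazy_split(node, version):
    """Kill and copy forward only the upper half of the live entries.

    Returns (lazy_node, new_node, dirty_pages).
    """
    live_keys = sorted(k for k, _, e in node if e == INF)
    moved = set(live_keys[len(live_keys) // 2:])
    # Entries that move get their interval closed; the rest stay live in place.
    lazy_node = [
        (k, s, version) if (e == INF and k in moved) else (k, s, e)
        for k, s, e in node
    ]
    new_node = [(k, version, INF) for k in sorted(moved)]
    # Dirty pages: the lazy node, the single new node, and the parent.
    dirty_pages = 3
    return lazy_node, new_node, dirty_pages


p1 = [(5, 3, INF), (10, 2, INF), (12, 4, INF), (40, 1, INF)]
lazy_p1, p2, dirty = lazy_split(p1, version=5)
```

In this model only one new node is allocated, which is where the saving of one dirty page relative to the legacy split comes from.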

slide-27
SLIDE 27

Optimization1: Lazy Split

  • Lazy Node Overflow

Insert key 8, ver = 6

P3 (parent): 10 [5~∞) : P1, ∞ [0~5) : P1, ∞ [5~∞) : P2
P1 (lazy node, full): 5 [3~∞), 10 [2~∞), 12 [4~5), 40 [1~5)
    12 [4~5) and 40 [1~5) are dead entries
P2: 12 [5~∞), 25 [5~∞), 40 [5~∞)

slide-28
SLIDE 28

Optimization1: Lazy Split

  • Lazy Node Overflow → Garbage collect dead entries

Insert key 8, ver = 6

P3 (parent): 10 [5~∞) : P1, ∞ [0~5) : P1, ∞ [5~∞) : P2
P1 (lazy node, full): 5 [3~∞), 10 [2~∞), 12 [4~5), 40 [1~5)
P2: 12 [5~∞), 25 [5~∞), 40 [5~∞)

slide-29
SLIDE 29

Optimization1: Lazy Split

  • Lazy Node Overflow → Garbage collect dead entries

Insert key 8, ver = 6

P3 (parent): 10 [5~∞) : P1, ∞ [0~5) : P1, ∞ [5~∞) : P2
P1 (after garbage collection): 5 [3~∞), 8 [6~∞), 10 [2~∞)
P2: 12 [5~∞), 25 [5~∞), 40 [5~∞)

But what if the dead entries are being accessed by other transactions?

slide-30
SLIDE 30

Optimization2: Reserved Buffer Space

  • Option 1. Wait for read transactions to finish

Insert key 8, ver = 6

P3 (parent): 10 [5~∞) : P1, ∞ [0~5) : P1, ∞ [5~∞) : P2
P1 (lazy node, full): 5 [3~∞), 10 [2~∞), 12 [4~5), 40 [1~5)
P2: 12 [5~∞), 25 [5~∞), 40 [5~∞)

slide-31
SLIDE 31

Optimization2: Reserved Buffer Space

  • Option 2. Split as in legacy MVBT split

Insert key 8, ver = 6

Parent (labeled P3 on earlier slides): 10 [5~∞) : P1, ∞ [0~5) : P1, ∞ [5~∞) : P2
P1 (dead node): 5 [3~6), 10 [2~6), 12 [4~5), 40 [1~5)
New node (labeled P3 on this slide): 5 [6~∞), 10 [6~∞)
P2: 12 [5~∞), 25 [5~∞), 40 [5~∞)

slide-32
SLIDE 32

Optimization2: Reserved Buffer Space

  • Option 2. Split as in legacy MVBT split

Insert key 8, ver = 6

Parent (labeled P3 on earlier slides): 10 [5~6) : P1, ∞ [0~5) : P1, ∞ [5~∞) : P2, 10 [6~∞) : P3
P1 (dead node): 5 [3~6), 10 [2~6), 12 [4~5), 40 [1~5)
New node (labeled P3 on this slide): 5 [6~∞), 8 [6~∞), 10 [6~∞)
P2: 12 [5~∞), 25 [5~∞), 40 [5~∞)

slide-33
SLIDE 33

Optimization2: Reserved Buffer Space

  • Option 3. Pad some buffer space in tree nodes
  • Buffer space is used when lazy node is full
  • If buffer space is also full, split as in legacy MVBT

Insert key 8, ver = 6

P3 (parent): 10 [5~∞) : P1, ∞ [0~5) : P1, ∞ [5~∞) : P2
P1 (as shown on the slide): 10 [2~∞), 12 [4~5), 40 [1~5), 9 [6~∞)
P2: 12 [5~∞), 25 [5~∞), 40 [5~∞)
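The decision the slides describe for a full lazy node can be laid out as a small sketch. It is one possible reading, not the authors' code: the function name, the `readers_active` flag, and the fixed-size `reserved` slot count are illustrative assumptions.

```python
INF = float("inf")


def insert_into_lazy_node(node, key, version, capacity, reserved, readers_active):
    """Choose how to place a new entry in a possibly full lazy node.

    Returns (action, resulting_node). Entries are (key, start, end) tuples.
    """
    entry = (key, version, INF)
    if len(node) < capacity:
        return "insert", node + [entry]
    if not readers_active:
        # No reader can still need the dead entries: reclaim their slots.
        live = [e for e in node if e[2] == INF]
        if len(live) < capacity:
            return "gc+insert", sorted(live + [entry])
    if len(node) < capacity + reserved:
        # Dead entries may still be in use: fall back to the reserved space.
        return "buffer", node + [entry]
    # Reserved space is exhausted too: the caller must do a legacy split.
    return "legacy-split", node


p1 = [(5, 3, INF), (10, 2, INF), (12, 4, 5), (40, 1, 5)]
no_readers = insert_into_lazy_node(p1, 8, 6, capacity=4, reserved=1, readers_active=False)
with_readers = insert_into_lazy_node(p1, 8, 6, capacity=4, reserved=1, readers_active=True)
```

With no active readers the sketch reproduces the node on slide 29. With readers, the entry lands in the reserved space and the dead entries stay readable.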

slide-34
SLIDE 34

Rollback in LS-MVBT

  • Similar to rollback in MVBT

Txn 5 crashes

Before rollback:
P3 (parent): 10 [5~∞) : P1, ∞ [0~5) : P1, ∞ [5~∞) : P2
P1 (lazy node): 5 [3~∞), 10 [2~∞), 12 [4~5), 40 [1~5)
P2: 12 [5~∞), 25 [5~∞), 40 [5~∞)

Rollback, as shown in the diagram sequence:
1. P2 is dropped; P3 and P1 are unchanged.
2. P3: ∞ [0~5) : P1
3. P3: ∞ [0~∞) : P1
4. P1: 5 [3~∞), 10 [2~∞), 12 [4~∞), 40 [1~∞)

The number of dirty nodes touched by rollback in LS-MVBT is also smaller than in MVBT.
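The rollback shown above amounts to two rules per entry, sketched here under the same assumed tuple layout (an optional trailing child pointer is allowed for parent entries). The function name is hypothetical.

```python
INF = float("inf")


def rollback(node, crashed_version):
    """Undo the effects of `crashed_version` on one node.

    Entries created by the crashed transaction are dropped, and entries it
    killed are revived by reopening their interval.
    """
    restored = []
    for key, start, end, *rest in node:
        if start == crashed_version:
            continue                  # created by the crashed transaction
        if end == crashed_version:
            end = INF                 # killed by the crashed transaction
        restored.append((key, start, end, *rest))
    return restored


p1 = [(5, 3, INF), (10, 2, INF), (12, 4, 5), (40, 1, 5)]
p2 = [(12, 5, INF), (25, 5, INF), (40, 5, INF)]
parent = [(10, 5, INF, "P1"), (INF, 0, 5, "P1"), (INF, 5, INF, "P2")]
```

Applied to the three nodes from the slide, P2 ends up empty, the parent collapses to a single entry pointing at P1, and P1 returns to its state before transaction 5.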

slide-35
SLIDE 35

Optimization3: Lazy Garbage Collection

  • Periodic Garbage Collection

[Diagram: Transaction 1 with pages P1, P2, P3; P1 is in the transaction buffer cache]

slide-36
SLIDE 36

Optimization3: Lazy Garbage Collection

  • Periodic Garbage Collection

[Diagram: Transaction 1 is stopped while garbage collection runs. Pages P1, P2, P3 appear in the GC buffer cache and the transaction buffer cache, and fsync() flushes the extra dirty pages.]

Extra Dirty Pages

slide-37
SLIDE 37

Optimization3: Lazy Garbage Collection

  • Lazy Garbage Collection
  • Do not garbage collect if no space is needed

[Diagram: Transaction 1 inserts into P1; pages P1, P2, P3; transaction buffer cache; fsync()]

No Extra Dirty Page by GC

slide-38
SLIDE 38
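
Optimization4: Metadata embedding

  • Version = “File Change Counter” in DB header page

[Diagram: Header Page 1 holds the counter; Page 2 ... Page N hold entries 25 [5~∞), 40 [1~∞), 15 [6~∞). A write transaction commits version 6 (counter 5 → 6) while a read transaction uses version 5. The header page and the modified page are both dirty.]

2 dirty pages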

slide-39
SLIDE 39

Optimization4: Metadata embedding

  • Flush “File Change Counter” to the last modified page and RAMDISK

[Diagram: Header Page 1, Page 2 ... Page N with entries 25 [5~∞), 40 [1~∞), 15 [6~∞). The write transaction commits version 6 while a read transaction uses version 5. Counter values shown: 5, 4, 6, 6.]

slide-40
SLIDE 40

Optimization4: Metadata embedding

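  • Flush “File Change Counter” to the last modified page and RAMDISK
  • The RAMDISK is volatile

[Diagram: Header Page 1, Page 2 ... Page N with entries 25 [5~∞), 40 [1~∞), 15 [6~∞). The write transaction commits version 6 while a read transaction uses version 5. The counter value 6 is held in the RAMDISK and in the modified page, and only that page is dirty. A CRASH is marked.]

1 dirty page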

slide-41
SLIDE 41

Optimization5: Disable Sibling Redistribution

  • Sibling redistribution hurts insertion performance
  • Disable sibling redistribution → avoid dirtying 4 pages

Insertion of key 20

[Diagram: parent P5 (10 : P4, 40 : P3, ∞ : P2) over leaves P4, P3, P2. With sibling redistribution, inserting key 20 marks four pages dirty.]

4 dirty pages

slide-42
SLIDE 42

Optimization5: Disable Sibling Redistribution

  • Disabled sibling redistribution

Insertion of key 20

[Diagram: parent (10 : P4, 15 : P5, 40 : P3, 90 : P2) over leaves P4, P5, P3, P2. Without sibling redistribution, inserting key 20 marks three pages dirty.]

3 dirty pages

slide-43
SLIDE 43

Summary

  • Optimizations (LS-MVBT)
      • Lazy Split: avoid dirtying an extra page when a split occurs
      • Reserved Buffer Space: reduce the probability of node split
      • Lazy Garbage Collection: delete dead entries on the next mutation
      • Metadata Embedding: avoid dirtying the header page
      • Disabling Sibling Redistribution: do not touch siblings to make search faster

slide-44
SLIDE 44

Performance Evaluation

  • Implemented LS-MVBT in SQLite 3.7.12
  • Testbed: Samsung Galaxy-S4 (Jelly Bean 4.3)
      • Exynos 5 Octa 5410 Quad Core 1.6GHz
      • 2GB Memory
      • 16GB eMMC
  • Performance Analysis Tools
      • Mobibench
      • MOST

slide-45
SLIDE 45

WAL (Write-Ahead-Logging) mode

  • Default mode in Jelly Bean and KitKat
  • Commit → Write to a log file
  • Checkpointing → Flush logs to DB pages

[Diagram: (1) a dirty DB page in DRAM is written to the WAL log; (2) the log is later flushed to the DB page]
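The commit and checkpoint steps above can be tried out with Python's `sqlite3` module. This is a minimal sketch with a made-up table `t` and helper name, not the benchmark used in the evaluation. `PRAGMA journal_mode=WAL` and `PRAGMA wal_checkpoint` are standard SQLite pragmas.

```python
import os
import sqlite3
import tempfile


def wal_insert_and_checkpoint(db_path, values):
    """Insert rows one transaction at a time in WAL mode, then checkpoint.

    Returns (journal_mode, busy_flag, row_count). Table `t` is illustrative.
    """
    conn = sqlite3.connect(db_path)
    try:
        mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
        conn.execute("CREATE TABLE IF NOT EXISTS t (v INTEGER)")
        for v in values:
            # Each commit appends the changed pages to the WAL file.
            with conn:
                conn.execute("INSERT INTO t (v) VALUES (?)", (v,))
        # Checkpointing copies the logged pages back into the database file.
        busy, _log_frames, _checkpointed = conn.execute(
            "PRAGMA wal_checkpoint(FULL)"
        ).fetchone()
        count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
        return mode, busy, count
    finally:
        conn.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(wal_insert_and_checkpoint(os.path.join(d, "demo.db"), [1, 2, 3]))
```

How often the checkpoint runs is the "WAL checkpointing interval" that the evaluation slides vary.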

slide-46
SLIDE 46

Analysis of Insert Performance

[Figure: SQLite Insert (Response Time) vs. WAL checkpointing interval; legend: LS-MVBT fsync(), WAL fsync(), B-tree computation]

LS-MVBT: 30~78% faster than WAL mode

slide-47
SLIDE 47

Analysis of Insert Performance

LS-MVBT flushes only one dirty page

[Figure: SQLite Insert (# of Dirty Pages per Insert)]

slide-48
SLIDE 48

Analysis of Block I/O Behavior

[Figure: SQLite Insert (# of Accesses to Block Device); panels in slide order: 10 insertions (LS-MVBT), 10 insertions (WAL), 100 insertions (WAL), 100 insertions (LS-MVBT); annotations in slide order: 20, 34, 148, and 218 block accesses]

slide-49
SLIDE 49

I/O Traffic at Block Device Level

  • LS-MVBT saves 67% of I/O traffic: 9.9 MB (LS-MVBT) vs. 31 MB (WAL)
  • x3 longer lifetime of NAND flash

[Figure: SQLite Insert (I/O Traffic per 10 msec); panels: 1000 insertions (LS-MVBT), 1000 insertions (WAL)]

slide-50
SLIDE 50
  • How about search performance?
  • LS-MVBT is optimized for write performance.

slide-51
SLIDE 51

Search Performance

LS-MVBT wins unless more than 93% of transactions are searches

[Figure: Transaction Throughput with varying Search/Insert ratio]

slide-52
SLIDE 52
  • How about recovery latency?
  • WAL mode exhibits longer recovery latency due to log replay.
  • How about LS-MVBT?

slide-53
SLIDE 53

Recovery

LS-MVBT recovers 5x to 6x faster than WAL.

[Figure: Recovery Time with varying # of insert operations to rollback]

slide-54
SLIDE 54
  • How much does each optimization improve performance?

slide-55
SLIDE 55

Effect of Disabling Sibling Redistribution

Disabling sibling redistribution → 20% faster insertion

[Figure: SQLite Insert (Response Time and # of Dirty Pages)]

slide-56
SLIDE 56

Performance Quantification

Throughput of SQLite: Combining All Optimizations

[Figure annotations: 40% improvement, 51% improvement, 70% improvement]

x1.7 performance improvement

  • LS-MVBT: 704 insertions/sec
  • WAL: 417 insertions/sec
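The headline figures follow from the two throughput numbers on this slide. The short check below uses only those numbers; the function name is made up for illustration.

```python
def speedup(new_rate, old_rate):
    """Ratio of the new throughput to the old one."""
    return new_rate / old_rate


# Throughput figures from the slide, in insertions per second.
ratio = speedup(new_rate=704, old_rate=417)
gain_percent = (ratio - 1) * 100
print(f"speedup x{ratio:.2f}, gain {gain_percent:.1f}%")
```

The ratio comes to roughly 1.69, which the slide rounds to x1.7 and a 70% improvement.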

slide-57
SLIDE 57

Conclusion

  • LS-MVBT resolves the Journaling of Journal anomaly by
      • reducing the number of fsync() calls with a version-based B-tree.
      • reducing the overhead of a single fsync() via lazy split, disabling sibling redistribution, metadata embedding, reserved buffer space, and lazy garbage collection.
  • LS-MVBT achieves a 70% performance gain over WAL mode (417 ins/sec → 704 ins/sec) solely via software optimization.

slide-58
SLIDE 58

Future/Ongoing Work

1. Improve the performance of WAL mode

  • 3 dirty pages per insertion → not really necessary
  • Easier work than replacing the entire B-tree
  • Compatible with current SQLite database files
  • We expect the performance benefit to be similar to LS-MVBT

2. SQLite for Non-Volatile Memory

  • Logging/Journaling in NVRAM

3. SQLite for multi-cores

slide-59
SLIDE 59

Credit

Woo-Hee Kim, Beomseok Nam, Dongil Park, Youjip Won

slide-60
SLIDE 60

Thank you…