The Unwritten Contract of Solid State Drives


slide-1
SLIDE 1

The Unwritten Contract of Solid State Drives

Jun He, Sudarsun Kannan, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau

Department of Computer Sciences, University of Wisconsin - Madison

slide-2
SLIDE 2

[Chart: enterprise HDD vs. enterprise SSD revenue in billions of dollars, 2012 through 2019 (estimated)]

Enterprise SSD revenue is expected to exceed enterprise HDD in 2017

Source: Gartner, Stifel Estimates — https://www.theregister.co.uk/2016/01/07/gartner_enterprise_ssd_hdd_revenue_crossover_in_2017/

slide-3
SLIDE 3

Storage stack is shifting from the HDD era to the SSD era

App FS

slide-4
SLIDE 4

Storage stack is shifting from the HDD era to the SSD era

App FS

slide-5
SLIDE 5

Storage stack is shifting from the HDD era to the SSD era

App FS App FS

slide-6
SLIDE 6

Storage stack is shifting from the HDD era to the SSD era

App FS App FS

? ?

slide-7
SLIDE 7

Storage stack is shifting from the HDD era to the SSD era

App FS App FS

? ? ?

slide-8
SLIDE 8

Storage stack is shifting from the HDD era to the SSD era

App FS App FS

? ? ?

slide-9
SLIDE 9

App

for SSD

FS

for SSD

Storage stack is shifting from the HDD era to the SSD era

App FS App FS

? ? ?

slide-10
SLIDE 10

App

for SSD

FS

for SSD

Storage stack is shifting from the HDD era to the SSD era

App FS App FS

? ? ? ? ?

slide-11
SLIDE 11

App

for SSD

FS

for SSD

Storage stack is shifting from the HDD era to the SSD era

App FS App FS

? ? ? ? ? ?

slide-12
SLIDE 12

The consequences of misusing SSDs

http://crestingwave.com/sites/default/files/collateral/velobit_whitepaper_ssdperformancetips.pdf

  • S. Boboila and P. Desnoyers. Write Endurance in Flash Drives: Measurements and Analysis. In Proceedings of the 8th USENIX Symposium on File and Storage Technologies (FAST ’10), San Jose, California, February 2010.

slide-13
SLIDE 13

The consequences of misusing SSDs

http://crestingwave.com/sites/default/files/collateral/velobit_whitepaper_ssdperformancetips.pdf

  • S. Boboila and P. Desnoyers. Write Endurance in Flash Drives: Measurements and Analysis. In Proceedings of the 8th USENIX Symposium on File and Storage Technologies (FAST ’10), San Jose, California, February 2010.

Performance degradation

slide-14
SLIDE 14

The consequences of misusing SSDs

http://crestingwave.com/sites/default/files/collateral/velobit_whitepaper_ssdperformancetips.pdf

  • S. Boboila and P. Desnoyers. Write Endurance in Flash Drives: Measurements and Analysis. In Proceedings of the 8th USENIX Symposium on File and Storage Technologies (FAST ’10), San Jose, California, February 2010.

Performance degradation Performance fluctuation

slide-15
SLIDE 15

The consequences of misusing SSDs

http://crestingwave.com/sites/default/files/collateral/velobit_whitepaper_ssdperformancetips.pdf

  • S. Boboila and P. Desnoyers. Write Endurance in Flash Drives: Measurements and Analysis. In Proceedings of the 8th USENIX Symposium on File and Storage Technologies (FAST ’10), San Jose, California, February 2010.

Early end of device life

Performance degradation Performance fluctuation

slide-16
SLIDE 16

What is the right way to achieve high performance on SSDs?

slide-17
SLIDE 17

What is the right way to achieve high performance on SSDs?

Block Device Interface: read(range), write(range), discard(range)

slide-18
SLIDE 18

What is the right way to achieve high performance on SSDs?

Block Device Interface: read(range), write(range), discard(range)

  • Sequential accesses are best
  • Nearby accesses are more efficient than farther ones

MEMS-based storage devices and standard disk interfaces: A square peg in a round hole? Steven W. Schlosser, Gregory R. Ganger FAST’04

Unwritten Contract of HDDs

slide-19
SLIDE 19

What is the right way to achieve high performance on SSDs?

Block Device Interface: read(range), write(range), discard(range)

  • Sequential accesses are best
  • Nearby accesses are more efficient than farther ones

MEMS-based storage devices and standard disk interfaces: A square peg in a round hole? Steven W. Schlosser, Gregory R. Ganger FAST’04

Unwritten Contract of HDDs Unwritten Contract of SSDs

?

slide-20
SLIDE 20

What is the right way to achieve high performance on SSDs?

slide-21
SLIDE 21

What is the right way to achieve high performance on SSDs?

  • Existing studies
slide-22
SLIDE 22

What is the right way to achieve high performance on SSDs?

  • Existing studies
  • Experience of implementing a detailed SSD simulator

slide-23
SLIDE 23

What is the right way to achieve high performance on SSDs?

  • Existing studies
  • Experience of implementing a detailed SSD simulator

  • Analysis of experiments
slide-24
SLIDE 24

What is the right way to achieve high performance on SSDs?

  • Existing studies
  • Experience of implementing a detailed SSD simulator

  • Analysis of experiments

The Unwritten Contract of SSDs

slide-25
SLIDE 25

Do the current apps/FSes comply with the unwritten contract of SSDs?

slide-26
SLIDE 26

Do the current apps/FSes comply with the unwritten contract of SSDs?

5 apps (for SSD): LevelDB, RocksDB, SQLite (rollback), SQLite (WAL), Varmail

slide-27
SLIDE 27

Do the current apps/FSes comply with the unwritten contract of SSDs?

5 apps (for SSD): LevelDB, RocksDB, SQLite (rollback), SQLite (WAL), Varmail

3 file systems (for SSD): ext4, F2FS, XFS

slide-28
SLIDE 28

Do the current apps/FSes comply with the unwritten contract of SSDs?

5 apps (for SSD): LevelDB, RocksDB, SQLite (rollback), SQLite (WAL), Varmail

3 file systems (for SSD): ext4, F2FS, XFS

Contract: Rule 1, Rule 2, Rule 3, Rule 4, Rule 5

slide-29
SLIDE 29

Do the current apps/FSes comply with the unwritten contract of SSDs?

5 apps (for SSD): LevelDB, RocksDB, SQLite (rollback), SQLite (WAL), Varmail

3 file systems (for SSD): ext4, F2FS, XFS

Contract: Rule 1, Rule 2, Rule 3, Rule 4, Rule 5

slide-30
SLIDE 30

Do the current apps/FSes comply with the unwritten contract of SSDs?

5 apps (for SSD): LevelDB, RocksDB, SQLite (rollback), SQLite (WAL), Varmail

3 file systems (for SSD): ext4, F2FS, XFS

Contract: Rule 1, Rule 2, Rule 3, Rule 4, Rule 5

To study the contract, we built a sophisticated SSD simulator and workload analyzer.

slide-31
SLIDE 31

In the paper

slide-32
SLIDE 32

We made 24 detailed observations

In the paper

slide-33
SLIDE 33

We made 24 detailed observations

We learned several high-level lessons

In the paper

slide-34
SLIDE 34

Outline

  • Overview
  • SSD Unwritten Contract
  • Violations of the Unwritten Contract
  • Conclusions

slide-35
SLIDE 35

Outline

  • Overview
  • SSD Unwritten Contract
  • Violations of the Unwritten Contract
  • Conclusions

slide-36
SLIDE 36

SSD Background

slide-37
SLIDE 37

SSD Background

P

slide-38
SLIDE 38

SSD Background

P P P Block …

slide-39
SLIDE 39

SSD Background

P P P Block … P P P … P P P … …

slide-40
SLIDE 40

SSD Background

P P P Block … P P P … P P P … … Channel

slide-41
SLIDE 41

SSD Background

P P P Block … P P P … P P P … … Channel P P P Block … P P P … P P P … … Channel

slide-42
SLIDE 42

SSD Background

P P P Block … P P P … P P P … … Channel P P P Block … P P P … P P P … … Channel P P P Block … P P P … P P P … … Channel …

slide-43
SLIDE 43

SSD Background

P P P Block … P P P … P P P … … Channel P P P Block … P P P … P P P … … Channel P P P Block … P P P … P P P … … Channel …

slide-44
SLIDE 44

SSD Background

P P P Block … P P P … P P P … … Channel P P P Block … P P P … P P P … … Channel P P P Block … P P P … P P P … … Channel …

slide-45
SLIDE 45

SSD Background

P P P Block … P P P … P P P … … Channel P P P Block … P P P … P P P … … Channel P P P Block … P P P … P P P … … Channel …

Controller

slide-46
SLIDE 46

SSD Background

P P P Block … P P P … P P P … … Channel P P P Block … P P P … P P P … … Channel P P P Block … P P P … P P P … … Channel …

Controller

FTL

slide-47
SLIDE 47

SSD Background

P P P Block … P P P … P P P … … Channel P P P Block … P P P … P P P … … Channel P P P Block … P P P … P P P … … Channel …

Controller

FTL

  • address mapping
  • garbage collection
  • wear-leveling
slide-48
SLIDE 48

SSD Background

P P P Block … P P P … P P P … … Channel P P P Block … P P P … P P P … … Channel P P P Block … P P P … P P P … … Channel …

Controller

RAM

FTL

  • address mapping
  • garbage collection
  • wear-leveling
slide-49
SLIDE 49

SSD Background

P P P Block … P P P … P P P … … Channel P P P Block … P P P … P P P … … Channel P P P Block … P P P … P P P … … Channel …

Controller

RAM

FTL

Mapping Table Data Cache

  • address mapping
  • garbage collection
  • wear-leveling
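For illustration, a page-level FTL can be sketched in a few lines of Python (a toy model under assumed geometry, not WiscSim's actual code): every write goes to a fresh physical page, the overwritten copy is merely marked invalid, and garbage collection must copy any still-valid pages out of a block before erasing it.

```python
# Minimal illustrative page-level FTL (hypothetical geometry and names).
PAGES_PER_BLOCK = 4
NUM_BLOCKS = 8

class ToyFTL:
    def __init__(self):
        self.mapping = {}                       # logical page -> (block, page)
        self.valid = [[False] * PAGES_PER_BLOCK for _ in range(NUM_BLOCKS)]
        self.erases = [0] * NUM_BLOCKS
        self.free = [(b, p) for b in range(NUM_BLOCKS) for p in range(PAGES_PER_BLOCK)]

    def write(self, lpn):
        """Out-of-place update: invalidate the old copy, program a fresh page."""
        if lpn in self.mapping:
            ob, op = self.mapping[lpn]
            self.valid[ob][op] = False          # old data becomes garbage
        b, p = self.free.pop(0)                 # flash pages cannot be rewritten in place
        self.valid[b][p] = True
        self.mapping[lpn] = (b, p)

    def gc_block(self, victim):
        """Copy still-valid pages elsewhere, then erase the victim block."""
        moved = 0
        for lpn, (b, p) in list(self.mapping.items()):
            if b == victim and self.valid[b][p]:
                self.write(lpn)                 # relocation = extra (amplified) write
                moved += 1
        self.valid[victim] = [False] * PAGES_PER_BLOCK
        self.erases[victim] += 1
        self.free.extend((victim, p) for p in range(PAGES_PER_BLOCK))
        return moved

ftl = ToyFTL()
for lpn in [0, 1, 2, 3, 0, 1]:                  # overwriting 0 and 1 leaves garbage behind
    ftl.write(lpn)
print("pages relocated during GC of block 0:", ftl.gc_block(0))
```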
slide-50
SLIDE 50

Rules of the Unwritten Contract

#1 Request Scale
#2 Locality
#3 Aligned Sequentiality
#4 Grouping by Death Time
#5 Uniform Data Lifetime

slide-51
SLIDE 51

Rule #1: Request Scale

SSD clients should issue large data requests or multiple outstanding data requests.
slide-52
SLIDE 52

Rule #1: Request Scale

SSD clients should issue large data requests or multiple outstanding data requests.

Channel Channel Channel Channel SSD

slide-53
SLIDE 53

Rule #1: Request Scale

SSD clients should issue large data requests or multiple outstanding data requests.

Channel Channel Channel Channel SSD

Request

slide-54
SLIDE 54

Rule #1: Request Scale

SSD clients should issue large data requests or multiple outstanding data requests.

Channel Channel Channel Channel SSD

slide-55
SLIDE 55

Rule #1: Request Scale

SSD clients should issue large data requests or multiple outstanding data requests.

Channel Channel Channel Channel SSD

High internal parallelism
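One way to keep several requests outstanding is simply to issue reads in parallel; the sketch below uses a thread pool of pread() calls so the SSD can serve multiple channels at once. It is only an illustration of the rule, and the file path, request size, and queue depth are placeholders (Linux AIO or io_uring would serve the same purpose).

```python
# Sketch: keep several large read requests in flight instead of one small request at a time.
import os
from concurrent.futures import ThreadPoolExecutor

PATH = "/tmp/testfile"          # hypothetical test file
REQUEST_SIZE = 256 * 1024       # large requests help exploit channel parallelism
DEPTH = 8                       # number of outstanding requests

def read_chunk(fd, offset):
    return len(os.pread(fd, REQUEST_SIZE, offset))

def parallel_read(path):
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offsets = range(0, size, REQUEST_SIZE)
        # DEPTH worker threads keep up to DEPTH requests outstanding.
        with ThreadPoolExecutor(max_workers=DEPTH) as pool:
            return sum(pool.map(lambda off: read_chunk(fd, off), offsets))
    finally:
        os.close(fd)

if __name__ == "__main__" and os.path.exists(PATH):
    print("bytes read:", parallel_read(PATH))
```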

slide-56
SLIDE 56

Rule #1: Request Scale Violation

slide-57
SLIDE 57

Rule #1: Request Scale Violation

Channel Channel Channel Channel SSD

slide-58
SLIDE 58

Rule #1: Request Scale Violation

Channel Channel Channel Channel SSD

slide-59
SLIDE 59

Rule #1: Request Scale Violation

Channel Channel Channel Channel SSD

Wasted

slide-60
SLIDE 60

Rule #1: Request Scale Violation

Channel Channel Channel Channel SSD

Wasted

If you violate the rule:

  • Low internal parallelism
slide-61
SLIDE 61

Rule #1: Request Scale Violation

Channel Channel Channel Channel SSD

Wasted

If you violate the rule:

  • Low internal parallelism

Performance impact: 18x read bandwidth 10x write bandwidth

  • F. Chen, R. Lee, and X. Zhang. Essential Roles of Exploiting Internal Parallelism of Flash Memory Based Solid State Drives in High-speed Data Processing. In Proceedings of the 17th International Symposium on High Performance Computer Architecture (HPCA-11), pages 266–277, San Antonio, Texas, February 2011.

slide-62
SLIDE 62

Translation Cache

Rule 2: Locality

SSD clients should access with locality RAM FLASH

Logical to Physical Mapping Table

SSD

P P P P P P P P P P P P P P P P P P P P P P P P

slide-63
SLIDE 63

Translation Cache

Rule 2: Locality

SSD clients should access with locality RAM FLASH

Logical to Physical Mapping Table

SSD

P P P P P P P P P P P P P P P P P P P P P P P P P P

slide-64
SLIDE 64

Translation Cache

Rule 2: Locality

SSD clients should access with locality RAM FLASH

Logical to Physical Mapping Table

SSD

P P P P P P P P P P P P P P P P P P P P P P P P P P

slide-65
SLIDE 65

Translation Cache

Rule 2: Locality

SSD clients should access with locality RAM FLASH

Logical to Physical Mapping Table

SSD

P P P P P P P P P P P P P P P P P P P P P P P P P P

Hit

slide-66
SLIDE 66

Translation Cache

Rule 2: Locality

SSD clients should access with locality RAM FLASH

Logical to Physical Mapping Table

SSD

P P P P P P P P P P P P P P P P P P P P P P P P P P

Hit

High Cache Hit Ratio
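A rough way to see why locality matters is to model the FTL's translation cache: an LRU cache that holds only part of the page-level mapping table hits almost always for a workload with locality, but keeps faulting mapping entries in from flash for a scattered workload. The cache size and access streams below are invented for illustration.

```python
# Toy LRU translation-cache model: hit ratio for local vs. scattered accesses.
import random
from collections import OrderedDict

CACHE_ENTRIES = 1024            # mapping entries that fit in SSD RAM (hypothetical)
TOTAL_PAGES = 1_000_000         # logical pages covered by the full mapping table

def hit_ratio(accesses):
    cache, hits = OrderedDict(), 0
    for lpn in accesses:
        if lpn in cache:
            hits += 1
            cache.move_to_end(lpn)
        else:
            cache[lpn] = True                   # miss: mapping entry read from flash
            if len(cache) > CACHE_ENTRIES:
                cache.popitem(last=False)       # evict least-recently-used entry
    return hits / len(accesses)

random.seed(0)
local = [random.randrange(CACHE_ENTRIES // 2) for _ in range(100_000)]
scattered = [random.randrange(TOTAL_PAGES) for _ in range(100_000)]
print("hit ratio with locality:    %.2f" % hit_ratio(local))
print("hit ratio without locality: %.2f" % hit_ratio(scattered))
```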

slide-67
SLIDE 67

Logical to Physical Mapping Table

Rule 2: Locality Violation

SSD clients should access with locality RAM FLASH SSD

P P P P P P P P P P P P P P P P P P P P P P P P

slide-68
SLIDE 68

Logical to Physical Mapping Table

Rule 2: Locality Violation

SSD clients should access with locality RAM FLASH SSD

P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P

slide-69
SLIDE 69

Logical to Physical Mapping Table

Rule 2: Locality Violation

SSD clients should access with locality RAM FLASH SSD

P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P

If you violate the rule:

  • Translation cache misses
  • More translation entry reads/writes
slide-70
SLIDE 70

Logical to Physical Mapping Table

Rule 2: Locality Violation

SSD clients should access with locality RAM FLASH SSD

P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P

If you violate the rule:

  • Translation cache misses
  • More translation entry reads/writes

Performance impact: 2.2x average response time

  • Y. Zhou, F. Wu, P. Huang, X. He, C. Xie, and J. Zhou. An Efficient Page-level FTL to Optimize Address Translation in Flash Memory. In Proceedings of the EuroSys Conference (EuroSys ’15), Bordeaux, France, April 2015.

slide-71
SLIDE 71

Rule 3: Aligned Sequentiality

Details in the paper

slide-72
SLIDE 72

Rule 4: Grouping By Death Time

Data with similar death times should be placed in the same block.

slide-73
SLIDE 73

Rule 4: Grouping By Death Time

Data with similar death times should be placed in the same block.

1pm 1pm 1pm 1pm 2pm 2pm 2pm 2pm

slide-74
SLIDE 74

Rule 4: Grouping By Death Time

Data with similar death times should be placed in the same block.

1pm 1pm 1pm 1pm 2pm 2pm 2pm 2pm

Time

1pm 2pm

slide-75
SLIDE 75

Rule 4: Grouping By Death Time

Data with similar death times should be placed in the same block.

1pm 1pm 1pm 1pm 2pm 2pm 2pm 2pm

Time

1pm 2pm

slide-76
SLIDE 76

Rule 4: Grouping By Death Time

Data with similar death times should be placed in the same block.

1pm 1pm 1pm 1pm 2pm 2pm 2pm 2pm

Time

1pm 2pm

1pm 1pm 1pm 1pm

slide-77
SLIDE 77

Rule 4: Grouping By Death Time

Data with similar death times should be placed in the same block.

1pm 1pm 1pm 1pm 2pm 2pm 2pm 2pm

Time

1pm 2pm

1pm 1pm 1pm 1pm

slide-78
SLIDE 78

Rule 4: Grouping By Death Time

Data with similar death times should be placed in the same block.

1pm 1pm 1pm 1pm 2pm 2pm 2pm 2pm

Time

1pm 2pm

1pm 1pm 1pm 1pm

😁
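The benefit can be sketched numerically: when pages that die at 1pm and pages that die at 2pm are grouped into separate blocks, whole blocks become dead at once and can simply be erased; when they are mixed, every block still holds live 2pm data at 1pm, so garbage collection must copy it out first. The block geometry below is hypothetical.

```python
# Sketch: pages GC must copy when death times are grouped per block vs. mixed.
PAGES_PER_BLOCK = 4

def pages_copied_to_reclaim(blocks):
    """After the 1pm data dies, GC cleans every block holding dead pages;
    any still-live (2pm) pages in those blocks must be copied elsewhere first."""
    dirty = [b for b in blocks if "1pm" in b]
    return sum(page == "2pm" for b in dirty for page in b)

# Grouped: each block holds data with a single death time.
grouped = [["1pm"] * PAGES_PER_BLOCK, ["1pm"] * PAGES_PER_BLOCK,
           ["2pm"] * PAGES_PER_BLOCK, ["2pm"] * PAGES_PER_BLOCK]

# Mixed: every block holds half 1pm data and half 2pm data.
mixed = [["1pm", "1pm", "2pm", "2pm"] for _ in range(4)]

print("pages copied (grouped):", pages_copied_to_reclaim(grouped))  # 0: whole blocks die together
print("pages copied (mixed):  ", pages_copied_to_reclaim(mixed))    # 8: live data must move
```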

slide-79
SLIDE 79

Rule 4: Grouping By Death Time Violation

1pm 1pm 2pm 2pm 1pm 1pm 2pm 2pm

Time

1pm 2pm

slide-80
SLIDE 80

Rule 4: Grouping By Death Time Violation

1pm 1pm 2pm 2pm 1pm 1pm 2pm 2pm

Time

1pm 2pm

slide-81
SLIDE 81

Rule 4: Grouping By Death Time Violation

1pm 1pm 2pm 2pm 1pm 1pm 2pm 2pm

Time

1pm 2pm

1pm 1pm 2pm 2pm 1pm 1pm 2pm 2pm

slide-82
SLIDE 82

1pm 1pm 2pm 2pm 1pm 1pm 2pm 2pm

Time

1pm 2pm

1pm 1pm 2pm 2pm

Rule 4: Grouping By Death Time Violation

2pm 2pm

slide-83
SLIDE 83

1pm 1pm 2pm 2pm 1pm 1pm 2pm 2pm

Time

1pm 2pm

1pm 1pm 2pm 2pm

Rule 4: Grouping By Death Time Violation

2pm 2pm

slide-84
SLIDE 84

1pm 1pm 2pm 2pm

Time

1pm 2pm

1pm 1pm 2pm 2pm

Rule 4: Grouping By Death Time Violation

2pm 2pm

slide-85
SLIDE 85

1pm 1pm 2pm 2pm

Time

1pm 2pm

1pm 1pm 2pm 2pm

Rule 4: Grouping By Death Time Violation

Data movement!!!

2pm 2pm

slide-86
SLIDE 86

1pm 1pm 2pm 2pm

Time

1pm 2pm

1pm 1pm 2pm 2pm

Rule 4: Grouping By Death Time Violation

Data movement!!!

2pm 2pm

😟

slide-87
SLIDE 87

1pm 1pm 2pm 2pm

Time

1pm 2pm

1pm 1pm 2pm 2pm

Rule 4: Grouping By Death Time Violation

Data movement!!!

2pm 2pm

😟

If you violate the rule:

  • Performance penalty
  • Write amplification
slide-88
SLIDE 88

1pm 1pm 2pm 2pm

Time

1pm 2pm

1pm 1pm 2pm 2pm

Rule 4: Grouping By Death Time Violation

Data movement!!!

2pm 2pm

😟

If you violate the rule:

  • Performance penalty
  • Write amplification

Performance impact: 4.8x write bandwidth 1.6x throughput 1.8x block erasure count

  • C. Lee, D. Sim, J.-Y. Hwang, and S. Cho. F2FS: A New File System for Flash Storage. In Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST ’15), Santa Clara, California, February 2015.
  • J.-U. Kang, J. Hyun, H. Maeng, and S. Cho. The Multi-streamed Solid-State Drive. In 6th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage ’14), Philadelphia, PA, June 2014.
  • Y. Cheng, F. Douglis, P. Shilane, G. Wallace, P. Desnoyers, and K. Li. Erasing Belady’s Limitations: In Search of Flash Cache Offline Optimality. In 2016 USENIX Annual Technical Conference (USENIX ATC 16), pages 379–392, Denver, CO, 2016. USENIX Association.
slide-89
SLIDE 89

Rule 5: Uniform Data Lifetime

Clients of SSDs should create data with similar lifetimes

slide-90
SLIDE 90

Rule 5: Uniform Data Lifetime

Clients of SSDs should create data with similar lifetimes Lifetime

1 Day

slide-91
SLIDE 91

Rule 5: Uniform Data Lifetime

Clients of SSDs should create data with similar lifetimes

Usage Count: SSD

Lifetime

1 Day

slide-92
SLIDE 92

Rule 5: Uniform Data Lifetime

Clients of SSDs should create data with similar lifetimes

Usage Count:

3 3 3 3

SSD

Lifetime

1 Day

slide-93
SLIDE 93

Rule 5: Uniform Data Lifetime

Clients of SSDs should create data with similar lifetimes

Usage Count:

3 3 3 3

No wear-leveling needed

SSD

Lifetime

1 Day

slide-94
SLIDE 94

Lifetime

1 Day 1000 Years

Rule 5: Uniform Data Lifetime Violation

SSD Usage Count:

slide-95
SLIDE 95

Lifetime

1 Day 1000 Years

Rule 5: Uniform Data Lifetime Violation

SSD

1 1

365*1000 365*1000

Usage Count:

slide-96
SLIDE 96

Lifetime

1 Day 1000 Years

Rule 5: Uniform Data Lifetime Violation

SSD

1 1

365*1000 365*1000

Usage Count:

Some blocks wear out sooner Frequent wear-leveling needed!!!

slide-97
SLIDE 97

Lifetime

1 Day 1000 Years

Rule 5: Uniform Data Lifetime Violation

SSD

1 1

365*1000 365*1000

Usage Count:

Some blocks wear out sooner Frequent wear-leveling needed!!!

If you violate the rule:

  • Performance penalty
  • Write amplification
slide-98
SLIDE 98

Lifetime

1 Day 1000 Years

Rule 5: Uniform Data Lifetime Violation

SSD

1 1

365*1000 365*1000

Usage Count:

Some blocks wear out sooner Frequent wear-leveling needed!!!

If you violate the rule:

  • Performance penalty
  • Write amplification

Performance impact: 1.6x write latency

  • S. Boboila and P. Desnoyers. Write Endurance in Flash Drives: Measurements and Analysis. In Proceedings of the 8th USENIX Symposium on File and Storage Technologies (FAST ’10), San Jose, California, February 2010.
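The imbalance can be illustrated with the usage counts on the slide: if half of the data lives one day and half lives a thousand years, the blocks holding short-lived data absorb essentially all of the writes, so the FTL must keep migrating the cold data to level wear. The numbers below are purely illustrative.

```python
# Sketch: per-block write pressure over one year with uniform vs. mixed data lifetimes.
DAYS = 365

def writes_per_block(lifetimes_in_days):
    """One data item per block; each item is rewritten once per lifetime."""
    return [DAYS / life for life in lifetimes_in_days]

uniform = writes_per_block([1, 1, 1, 1])                     # all data lives one day
skewed  = writes_per_block([1, 1, 365 * 1000, 365 * 1000])   # half lives ~1000 years

for name, counts in (("uniform", uniform), ("skewed", skewed)):
    avg = sum(counts) / len(counts)
    print(f"{name:8s} writes per block: {[round(c, 3) for c in counts]} "
          f"(max/avg = {max(counts) / avg:.1f}x)")
```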

slide-99
SLIDE 99

Outline

  • Overview
  • SSD Unwritten Contract
  • Violations of the Unwritten Contract
  • Conclusions

slide-100
SLIDE 100

Do applications/file systems comply with the unwritten contract?

slide-101
SLIDE 101

We conduct vertical analysis to find violations of SSD contract

slide-102
SLIDE 102

We conduct vertical analysis to find violations of SSD contract

App FS + SSD

slide-103
SLIDE 103

We conduct vertical analysis to find violations of SSD contract

Block Trace

App FS + SSD

slide-104
SLIDE 104

We conduct vertical analysis to find violations of SSD contract

Block Trace

SSD Simulator: WiscSim

App FS + SSD

slide-105
SLIDE 105

We conduct vertical analysis to find violations of SSD contract

Block Trace

SSD Simulator: WiscSim

App FS + SSD

Rule violation? Analyzer: WiscSee

slide-106
SLIDE 106

We conduct vertical analysis to find violations of SSD contract

Block Trace Root Cause

SSD Simulator: WiscSim

App FS + SSD

Rule violation? Analyzer: WiscSee
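As a sketch of what such an analyzer consumes, the snippet below derives two request-scale metrics (request size and outstanding requests over time) from a simplified block trace of (issue time, completion time, offset, size) tuples. This record format is a stand-in for illustration, not WiscSee's actual input.

```python
# Sketch: request-scale metrics from a simplified block I/O trace (made-up format).
from statistics import median

trace = [
    (0.000, 0.004, 0,      4096),
    (0.001, 0.006, 4096,   131072),
    (0.002, 0.007, 135168, 131072),
    (0.010, 0.013, 266240, 4096),
]

sizes = [size for _, _, _, size in trace]

def outstanding_at(t, records):
    """Requests issued but not yet completed at time t (a proxy for queue depth)."""
    return sum(issue <= t < done for issue, done, _, _ in records)

depths = [outstanding_at(issue, trace) for issue, _, _, _ in trace]
print("median request size (KB):", median(sizes) / 1024)
print("median outstanding requests:", median(depths))
```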

slide-107
SLIDE 107

2 of Our 24 Observations

  • 1. Linux page cache limits request scale
  • 2. F2FS incurs more GC overhead than traditional file systems
slide-108
SLIDE 108

2 of Our 24 Observations

  • 1. Linux page cache limits request scale
  • 2. F2FS incurs more GC overhead than traditional file systems
slide-109
SLIDE 109

[Figure: per-workload request size (KB) and number of concurrent requests for leveldb, rocksdb, sqlite-rb, sqlite-wal, and varmail (read/write; seq/rand/mix; small/large/mix) on ext4, F2FS, and XFS]

We evaluate request scale by request size and number of concurrent requests

slide-110
SLIDE 110

[Figure: per-workload request size (KB) and number of concurrent requests for leveldb, rocksdb, sqlite-rb, sqlite-wal, and varmail (read/write; seq/rand/mix; small/large/mix) on ext4, F2FS, and XFS]

We evaluate request scale by request size and number of concurrent requests

slide-111
SLIDE 111

[Figure: per-workload request size (KB) and number of concurrent requests for leveldb, rocksdb, sqlite-rb, sqlite-wal, and varmail (read/write; seq/rand/mix; small/large/mix) on ext4, F2FS, and XFS]

LevelDB & RocksDB: Insertions Background Compactions

We evaluate request scale by request size and number of concurrent requests

slide-112
SLIDE 112

[Figure: per-workload request size (KB) and number of concurrent requests for leveldb, rocksdb, sqlite-rb, sqlite-wal, and varmail (read/write; seq/rand/mix; small/large/mix) on ext4, F2FS, and XFS]

LevelDB & RocksDB: Insertions Background Compactions Median: ~100KB

We evaluate request scale by request size and number of concurrent requests

slide-113
SLIDE 113

[Figure: per-workload request size (KB) and number of concurrent requests for leveldb, rocksdb, sqlite-rb, sqlite-wal, and varmail (read/write; seq/rand/mix; small/large/mix) on ext4, F2FS, and XFS]

LevelDB & RocksDB: Insertions Background Compactions Median: ~100KB Median: ~2

We evaluate request scale by request size and number of concurrent requests

slide-114
SLIDE 114

LevelDB and RocksDB can access files in large chunks.

Why was the request scale low?

slide-115
SLIDE 115

Buffered read(): Page cache implementation splits and serializes user requests

App Page Cache Block Layer SSD

slide-116
SLIDE 116

Buffered read(): Page cache implementation splits and serializes user requests

App Page Cache Block Layer SSD read()

2MB

slide-117
SLIDE 117

Buffered read(): Page cache implementation splits and serializes user requests

App Page Cache Block Layer SSD read()

2MB

128KB 128KB 128KB …

slide-118
SLIDE 118

Buffered read(): Page cache implementation splits and serializes user requests

App Page Cache Block Layer SSD read()

2MB

128KB 128KB 128KB … One request at a time

slide-119
SLIDE 119

Buffered read(): Page cache implementation splits and serializes user requests

App Page Cache Block Layer SSD read()

2MB

128KB 128KB 128KB …

Surprise! Even reading 2MB at a time in your app will not utilize the SSD well.

One request at a time

slide-120
SLIDE 120

Cause of violation: large reads are throttled by small prefetching (readahead).
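One hedged user-space workaround is to hint the kernel that the file will be read sequentially, which typically enlarges the readahead window, or to issue the reads concurrently as in the Rule #1 sketch. The snippet below shows the fadvise hint; it is not the paper's fix, and the path is a placeholder.

```python
# Sketch: ask the kernel for more aggressive readahead on a sequentially-read file.
# POSIX_FADV_SEQUENTIAL typically enlarges the readahead window on Linux.
import os

PATH = "/tmp/testfile"          # placeholder path
CHUNK = 2 * 1024 * 1024         # the application-level read size from the slide

def sequential_read(path):
    fd = os.open(path, os.O_RDONLY)
    try:
        # Hint: the whole file will be read sequentially.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        total = 0
        while True:
            buf = os.read(fd, CHUNK)
            if not buf:
                break
            total += len(buf)
        return total
    finally:
        os.close(fd)

if __name__ == "__main__" and os.path.exists(PATH):
    print("bytes read:", sequential_read(PATH))
```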

slide-121
SLIDE 121

2 of Our 24 Observations

  • 1. Linux page cache limits request scale
  • 2. F2FS incurs more GC overhead than traditional file systems
slide-122
SLIDE 122

2 of Our 24 Observations

  • 1. Linux page cache limits request scale
  • 2. F2FS incurs more GC overhead than traditional file systems
slide-123
SLIDE 123

We study GC (rule #4: grouping by death time) by zombie curves

slide-124
SLIDE 124

What’s a zombie curve?

?

We study GC (rule #4: grouping by death time) by zombie curves

slide-125
SLIDE 125

We study GC (rule #4: grouping by death time) by zombie curves

What’s a zombie curve?

slide-126
SLIDE 126

We study GC (rule #4: grouping by death time) by zombie curves

What’s a zombie curve? Run workloads with infinite space over-provisioning

slide-127
SLIDE 127

We study GC (rule #4: grouping by death time) by zombie curves

What’s a zombie curve? Run workloads with infinite space over-provisioning

slide-128
SLIDE 128

We study GC (rule #4: grouping by death time) by zombie curves

What’s a zombie curve? Run workloads with infinite space over-provisioning

Valid

slide-129
SLIDE 129

We study GC (rule #4: grouping by death time) by zombie curves

What’s a zombie curve? Run workloads with infinite space over-provisioning

Valid Invalid

slide-130
SLIDE 130

Valid ratio 1.0 0.25 0.75 0.75

We study GC (rule #4: grouping by death time) by zombie curves

slide-131
SLIDE 131

Valid ratio 1.0 0.25 0.75 0.75

We study GC (rule #4: grouping by death time) by zombie curves

slide-132
SLIDE 132

Valid ratio 1.0 0.25 0.75 0.75

We study GC (rule #4: grouping by death time) by zombie curves

slide-133
SLIDE 133

Valid ratio 1.0 0.25 0.75 0.75

We study GC (rule #4: grouping by death time) by zombie curves

slide-134
SLIDE 134

Valid ratio 1.0 0.25 0.75 0.75

We study GC (rule #4: grouping by death time) by zombie curves

slide-135
SLIDE 135

Valid ratio 1.0 0.25 0.75 0.75

We study GC (rule #4: grouping by death time) by zombie curves

slide-136
SLIDE 136

Valid ratio 1.0 0.25 0.75 0.75

We study GC (rule #4: grouping by death time) by zombie curves

slide-137
SLIDE 137

What’s a good zombie curve?

Over-provisioned Over-provisioned

slide-138
SLIDE 138

What’s a good zombie curve?

Over-provisioned Over-provisioned

Ready to be used Ready to be used

slide-139
SLIDE 139

What’s a bad zombie curve?

Over-provisioned

slide-140
SLIDE 140

What’s a bad zombie curve?

Over-provisioned

Move data before use

slide-141
SLIDE 141

BTW, zombie curve helps you choose over-provisioning ratio

Over-provisioned

slide-142
SLIDE 142

BTW, zombie curve helps you choose over-provisioning ratio

Over-provisioned
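A zombie curve can be approximated from a simulator's per-block state: capture each block's valid-page ratio once the workload reaches a steady state, sort the blocks, and plot the sorted ratios against cumulative flash space. The sketch below does this for an invented list of valid ratios; it is not WiscSee's actual plotting code.

```python
# Sketch: build a zombie curve from per-block valid ratios captured in steady state.
# The ratios below are invented; in practice they would come from an SSD simulator.
valid_ratios = [1.0, 0.25, 0.75, 0.75, 0.0, 0.0, 0.5, 1.0]

def zombie_curve(ratios):
    """Return (cumulative flash-space fraction, valid ratio) points, cheapest blocks first."""
    ordered = sorted(ratios)                    # blocks with the least valid data first
    n = len(ordered)
    return [((i + 1) / n, r) for i, r in enumerate(ordered)]

for space, ratio in zombie_curve(valid_ratios):
    # A curve that stays near 0 for a long stretch means blocks can be erased cheaply;
    # a curve that rises quickly means GC must move a lot of live data.
    print(f"flash space <= {space:.2f}: valid ratio {ratio:.2f}")
```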

slide-143
SLIDE 143

F2FS incurs a worse zombie curve (higher GC overhead) than ext4 for SQLite

[Figure: zombie curve — valid ratio (y-axis, 0.25 to 1.0) vs. flash space (x-axis, 0.5 to 2), with the over-provisioned region marked]

Animations cannot be displayed in PDF. Please see the animations at http://pages.cs.wisc.edu/~jhe/zombie.html

slide-144
SLIDE 144

F2FS incurs a worse zombie curve (higher GC overhead) than ext4 for SQLite

[Figure: zombie curve — valid ratio (y-axis, 0.25 to 1.0) vs. flash space (x-axis, 0.5 to 2), with the over-provisioned region marked]

Animations cannot be displayed in PDF. Please see the animations at http://pages.cs.wisc.edu/~jhe/zombie.html

slide-145
SLIDE 145

F2FS incurs a worse zombie curve (higher GC overhead) than ext4 for SQLite

[Figure: zombie curve for ext4 — valid ratio vs. flash space, with the over-provisioned region marked]

Animations cannot be displayed in PDF. Please see the animations at http://pages.cs.wisc.edu/~jhe/zombie.html

slide-146
SLIDE 146

F2FS incurs a worse zombie curve (higher GC overhead) than ext4 for SQLite

[Figure: zombie curves for ext4 and F2FS — valid ratio vs. flash space, with the over-provisioned region marked]

Animations cannot be displayed in PDF. Please see the animations at http://pages.cs.wisc.edu/~jhe/zombie.html

slide-147
SLIDE 147

F2FS incurs a worse zombie curve (higher GC overhead) than ext4 for SQLite

[Figure: zombie curves for ext4 and F2FS — valid ratio vs. flash space, with the over-provisioned region marked]

Stable-state curves characterize workloads.

Animations cannot be displayed in PDF. Please see the animations at http://pages.cs.wisc.edu/~jhe/zombie.html

slide-148
SLIDE 148

F2FS incurs a worse zombie curve (higher GC overhead) than ext4 for SQLite

[Figure: zombie curves for ext4 and F2FS — valid ratio vs. flash space, with the over-provisioned region marked]

Stable-state curves characterize workloads.

Animations cannot be displayed in PDF. Please see the animations at http://pages.cs.wisc.edu/~jhe/zombie.html

slide-149
SLIDE 149

Why did F2FS incur a worse zombie curve (GC overhead)?

slide-150
SLIDE 150

Why did F2FS incur a worse zombie curve (GC overhead)?

  • SQLite fragmented F2FS
slide-151
SLIDE 151

Why did F2FS incur a worse zombie curve (GC overhead)?

  • SQLite fragmented F2FS
  • F2FS did not discard data that was deleted by SQLite
slide-152
SLIDE 152

Why did F2FS incur a worse zombie curve (GC overhead)?

  • SQLite fragmented F2FS
  • F2FS did not discard data that was deleted by SQLite
  • F2FS was not able to stay log-structured for SQLite’s I/O pattern
slide-153
SLIDE 153

More Observations

slide-154
SLIDE 154

More Observations

Legacy file system allocation policies break locality

slide-155
SLIDE 155

More Observations

Legacy file system allocation policies break locality

Application log structuring does not reduce GC

slide-156
SLIDE 156

More Observations

Legacy file system allocation policies break locality

Application log structuring does not reduce GC

24 observations in the paper

slide-157
SLIDE 157

Lessons Learned

slide-158
SLIDE 158

Lessons Learned

The SSD contract is multi-dimensional

  • Optimizing for one dimension is not enough
  • We need more sophisticated tools to analyze workloads

slide-159
SLIDE 159

Lessons Learned

The SSD contract is multi-dimensional

  • Optimizing for one dimension is not enough
  • We need more sophisticated tools to analyze workloads

Although not perfect, traditional file systems perform surprisingly well on SSDs

slide-160
SLIDE 160

Lessons Learned

The SSD contract is multi-dimensional

  • Optimizing for one dimension is not enough
  • We need more sophisticated tools to analyze workloads

Although not perfect, traditional file systems perform surprisingly well on SSDs

Myths spread if the unwritten contract is not clarified

  • “Random writes increase GC overhead”
slide-161
SLIDE 161

Conclusions

slide-162
SLIDE 162

Conclusions

Understanding the unwritten contract is crucial for designing high-performance applications and file systems

slide-163
SLIDE 163

Conclusions

Understanding the unwritten contract is crucial for designing high-performance applications and file systems

System design demands more vertical analysis

slide-164
SLIDE 164

Conclusions

Understanding the unwritten contract is crucial for designing high-performance applications and file systems

System design demands more vertical analysis

WiscSee (analyzer) and WiscSim (SSD simulator) are available at: http://research.cs.wisc.edu/adsl/Software/wiscsee