SLIDE 1

Physical Separation in Modern Storage Systems

Lanyue Lu

Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Shan Lu, Michael Swift, Xinyu Zhang
University of Wisconsin-Madison

Tuesday, December 1, 2015

SLIDE 2

Local Storage Systems Are Important

[Diagram: local storage systems such as ext4, NTFS, and SQLite form the foundation for distributed file systems (GFS, HDFS), virtualization platforms (VMware, Docker), and databases and key-value stores (Riak, MongoDB).]

SLIDE 3

Data Layout of Storage Systems

Data layout is fundamental
➡ how to organize data on disks and in memory
➡ impacts both reliability and performance

Locality is the key
➡ store related data together
➡ locality is pursued in various storage systems: file systems, key-value stores, databases
➡ better performance (caching and prefetching)
➡ high space utilization
➡ optimized for hard drives

SLIDE 4

Problems of Data Locality

New environments
➡ fast storage hardware (e.g., SSDs)
➡ servers with many cores and large memory
➡ sharing infrastructure is the reality: virtualization, containers, data centers

Unexpected entanglement
➡ shared failures (e.g., VMs, containers)
➡ bundled performance (e.g., apps)
➡ lack of flexibility to manage data differently

SLIDE 5

New Technique: Physical Separation

Redesign data layout
➡ rethink existing data layouts
➡ key: separate data structures
➡ apply in both file systems and key-value stores

Many new benefits
➡ IceFS: disentangle structures and transactions
  ➡ isolated failures, faster recovery
  ➡ customized performance
➡ WiscKey: key-value separation
  ➡ minimize I/O amplification
  ➡ leverage the device's internal parallelism

SLIDE 6

Research Contributions

1. A study of Linux file system evolution
➡ the first comprehensive file-system study
➡ published in FAST '13 (best paper award)

2. Physical disentanglement in IceFS
➡ localized failures, localized recovery
➡ specialized journaling performance
➡ published in OSDI '14

3. Key-value separation in WiscKey
➡ an SSD-conscious LSM-tree
➡ over 100x performance improvement
➡ submitted to FAST '16

SLIDE 7

Outline

Introduction

Disentanglement in IceFS
➡ File System Disentanglement
➡ The Ice File System
➡ Evaluation

Key-Value Separation in WiscKey
➡ Key-Value Separation Idea
➡ Challenges and Optimization
➡ Evaluation

Conclusion

SLIDE 8

Isolation Is Important

Reliability
➡ independent failures and recovery

Performance
➡ isolated performance

Isolation in various scenarios
➡ computing: virtual machines, Linux containers
➡ security: BSD jails, sandboxes
➡ cloud: multi-tenant systems

SLIDE 9

File Systems Lack Isolation

Local file systems are core building blocks
➡ manage user data
➡ long-standing and stable
➡ foundation for distributed file systems

Existing abstractions provide only logical isolation
➡ file, directory, namespace
➡ this isolation is just an illusion

Physical entanglement in local file systems prevents isolation
➡ entangled data structures and transactions

SLIDE 10

Metadata Entanglement

[Diagram: the inodes of foo.txt and bar.c reside in one shared 4KB inode block; an I/O failure or metadata corruption in that block affects both files.]

Shared metadata for multiple files
➡ e.g., multiple files share one inode block
➡ many shared structures: bitmaps, directory blocks

Problem: faults in shared structures lead to shared failures and shared recovery

SLIDE 11

Transaction Entanglement

[Diagram: dirty data of foo.txt and bar.c sit in memory; fsync(bar.c) commits the shared journal transaction, writing both files' updates to disk.]

A shared transaction holds all updates

Problem: shared transactions lead to entangled performance

SLIDE 12

Our Solution: IceFS

Propose a data container abstraction: the cube
Disentangle data structures and transactions
Provide reliability and performance isolation

Benefits for local file systems
➡ isolated failures for data containers
➡ up to 8x faster localized recovery
➡ up to 50x higher performance

Benefits for high-level services
➡ virtualized systems: reduce downtime by over 5x
➡ HDFS: improve recovery efficiency by over 7x

SLIDE 13

Data Container Abstraction: Cube

[Diagram: a directory tree rooted at /; directory b (with files b1, b2) is placed in cube1 and directory d (with file d1) in cube2, while /, a, c, and c1 remain outside; each cube occupies its own region on disk.]

A cube is an isolated directory in a file system
➡ physically disentangled on disk and in memory

SLIDE 14

Principles of Disentanglement

No shared physical resources
➡ no shared metadata: e.g., block groups
➡ no shared disk blocks or buffers

No dependencies
➡ partition linked lists and trees
➡ avoid directory hierarchy dependencies

No entangled updates
➡ use separate transactions
➡ enable customized journaling modes

SLIDE 15

Outline

Introduction

Disentanglement in IceFS
➡ File System Disentanglement
➡ The Ice File System
➡ Evaluation

Key-Value Separation in WiscKey
➡ Key-Value Separation Idea
➡ Challenges and Optimization
➡ Evaluation

Conclusion

SLIDE 16

IceFS Overview

A data-container-based file system
➡ isolated reliability and performance for containers

Disentanglement techniques
➡ physical resource isolation
➡ directory indirection
➡ transaction splitting

A prototype based on Ext3
➡ local file system: Ext3/JBD
➡ kernel: VFS
➡ user-level tool: e2fsprogs

SLIDE 17

Ext3 Disk Layout

[Diagram: the disk starts with a super block (SB) and is divided into block groups; each block group contains metadata (group descriptors, bitmaps, inodes) followed by data blocks.]

A disk is divided into block groups
➡ physical partitions for disk locality

SLIDE 18

IceFS Disk Layout

[Diagram: after the super block (SB), each cube has its own sub-super block (S0, S1, ...) and its own set of block groups; cube metadata is kept separate.]

Each cube has isolated metadata
➡ a sub-super block (Si) and isolated block groups

SLIDE 19

Directory Indirection

[Diagram: the same directory tree as before, with cube1 rooted at /a/b and cube2 at /d.]

1. Load cube pathnames from the sub-super blocks
   ➡ e.g., "/a/b/" maps to cube1's dentry, "/d/" to cube2's dentry
2. Match the pathname prefix
   ➡ reading file "/a/b/b2" matches cube1, so the lookup jumps directly to cube1's top directory

A minimal sketch of this prefix-matching lookup follows below.
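The following Python sketch illustrates the prefix-matching idea under simple assumptions: cube prefixes are kept in an in-memory table (loaded, in real IceFS, from the sub-super blocks), and a lookup jumps to the owning cube instead of walking the whole hierarchy. The names CubeTable and resolve() are illustrative, not IceFS code.

class CubeTable:
    def __init__(self):
        # prefix -> cube id, e.g. "/a/b/" -> "cube1", "/d/" -> "cube2"
        self.prefixes = {}

    def register(self, prefix, cube_id):
        self.prefixes[prefix] = cube_id

    def resolve(self, path):
        """Return (cube_id, remainder) for the longest matching cube prefix,
        or (None, path) if the path lives outside any cube."""
        best = None
        for prefix, cube in self.prefixes.items():
            if path.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
                best = (prefix, cube)
        if best is None:
            return None, path
        prefix, cube = best
        return cube, path[len(prefix):]

table = CubeTable()
table.register("/a/b/", "cube1")
table.register("/d/", "cube2")
print(table.resolve("/a/b/b2"))   # ('cube1', 'b2'): jump to cube1's top directory
print(table.resolve("/c/c1"))     # (None, '/c/c1'): handled by the regular lookup path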

SLIDE 20

Ext3/4 Transaction

[Diagram: dirty data of file1, file2, and file3 sit in memory; fsync(file1) commits one shared journal transaction to disk, carrying the updates of all three files.]

SLIDE 21

IceFS Transaction Splitting

[Diagram: the same three dirty files, but each cube runs its own journal transaction; fsync(file1), fsync(file2), and fsync(file3) each commit only their own cube's transaction.]

A sketch of this per-cube commit is shown below.
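A toy model of transaction splitting, assuming one in-memory transaction per cube: fsync() commits only the dirty blocks of the cube that owns the file, instead of one global journal transaction. Class and method names are made up for illustration; real IceFS implements this inside Ext3/JBD.

class CubeJournal:
    def __init__(self, name):
        self.name = name
        self.dirty = []          # blocks buffered in the running transaction

    def add_update(self, block):
        self.dirty.append(block)

    def commit(self):
        # A real journal would write and flush a transaction record here.
        committed, self.dirty = self.dirty, []
        print(f"{self.name}: committed {len(committed)} block(s)")

class SplitJournalFS:
    def __init__(self):
        self.cubes = {}

    def write(self, cube, block):
        self.cubes.setdefault(cube, CubeJournal(cube)).add_update(block)

    def fsync(self, cube):
        # Only the cube owning the file is committed; other cubes' dirty
        # data stays buffered, so they do not pay for this fsync.
        self.cubes[cube].commit()

fs = SplitJournalFS()
fs.write("cube1", "data of file1")
fs.write("cube2", "data of file2")
fs.fsync("cube1")   # commits cube1 only; cube2's transaction is untouched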

SLIDE 22

Benefits of Disentanglement

Localized reactions to failures
➡ per-cube read-only and crash behavior
➡ encourages more runtime checking

Localized recovery
➡ only check the faulty cubes
➡ offline and online

Specialized journaling
➡ concurrent and independent transactions
➡ diverse journaling modes (e.g., no journal, no fsync)

SLIDE 23

Outline

Introduction

Disentanglement in IceFS
➡ File System Disentanglement
➡ The Ice File System
➡ Evaluation

Key-Value Separation in WiscKey
➡ Key-Value Separation Idea
➡ Challenges and Optimization
➡ Evaluation

Conclusion

SLIDE 24

Evaluation

Does IceFS isolate failures?
➡ injected around 200 faults
➡ per-cube failures (read-only or crash) in IceFS

SLIDE 25

Evaluation

Does IceFS isolate failures?
➡ injected around 200 faults
➡ per-cube failures (read-only or crash) in IceFS

Does IceFS have faster recovery?

SLIDE 26

Recovery in Ext3

Ext3: 20 directories

Fsck time vs. file-system capacity (Ext3):
  200 GB: 231 s
  400 GB: 476 s
  600 GB: 723 s
  800 GB: 1007 s

SLIDE 27

Fast Recovery in IceFS

Ext3: 20 directories; IceFS: 20 cubes

Fsck time vs. file-system capacity:
  200 GB: Ext3 231 s, IceFS 35 s
  400 GB: Ext3 476 s, IceFS 64 s
  600 GB: Ext3 723 s, IceFS 91 s
  800 GB: Ext3 1007 s, IceFS 122 s

Partial recovery of a single cube is up to 8x faster.

SLIDE 28

Evaluation

Does IceFS isolate failures?
➡ injected around 200 faults
➡ per-cube failures (read-only or crash) in IceFS

Does IceFS have faster recovery?
➡ independent recovery for a cube

SLIDE 29

Evaluation

Does IceFS isolate failures?
➡ injected around 200 faults
➡ per-cube failures (read-only or crash) in IceFS

Does IceFS have faster recovery?
➡ independent recovery for a cube

Does IceFS have better performance?

SLIDE 30

Workloads

SQLite
➡ a database application
➡ sequentially writes large key-value pairs
➡ asynchronous

Varmail
➡ an email server workload
➡ randomly writes small blocks
➡ fsync after each write

SLIDE 31

Ext3 Journaling

Ext3 runs with 2 directories.

Throughput (MB/s), each workload running alone on Ext3: SQLite 146.7, Varmail 20.0

SLIDE 32

Ext3 Journaling

Ext3 runs with 2 directories.

Throughput (MB/s):
  SQLite:  alone 146.7, together on Ext3 76.1
  Varmail: alone 20.0, together on Ext3 1.9

Shared transactions hurt performance (over 10x).

SLIDE 33

Isolated Journaling in IceFS

Ext3 runs with 2 directories; IceFS runs with 2 cubes.

Throughput (MB/s):
  SQLite:  alone in Ext3 146.7, together in Ext3 76.1, together in IceFS 120.6
  Varmail: alone in Ext3 20.0, together in Ext3 1.9, together in IceFS 9.8

Parallel transactions in IceFS provide isolated performance (over 5x).

SLIDE 34

Specialized Journaling in IceFS

Both cubes use ordered mode.

Throughput (MB/s): SQLite 120.6, Varmail 9.8

SLIDE 35

Specialized Journaling in IceFS

SQLite runs with no journal; Varmail runs with ordered mode.

Throughput (MB/s): SQLite 220.3, Varmail 5.6

SLIDE 36

Specialized Journaling in IceFS

SQLite runs with ordered mode; Varmail runs with no journal.

Throughput (MB/s): SQLite 125.4, Varmail 103.4

Specialized journaling in IceFS provides flexibility between consistency and performance (over 50x).

SLIDE 37

Evaluation

Isolate failures?
➡ injected around 200 faults
➡ per-cube failures (read-only or crash) in IceFS

Faster recovery?
➡ independent recovery for a cube

Better journaling performance?
➡ isolated journaling performance
➡ flexibility between consistency and performance

SLIDE 38

Evaluation

Isolate failures?
➡ injected around 200 faults
➡ per-cube failures (read-only or crash) in IceFS

Faster recovery?
➡ independent recovery for a cube

Better journaling performance?
➡ isolated journaling performance
➡ flexibility between consistency and performance

Useful for applications?

SLIDE 39

Server Virtualization

[Diagram: three virtual machines (vm1, vm2, vm3), each with its own virtual disk, all stored in one shared file system on a single disk.]

Failures and recovery of the shared file system impact all virtual machines.

SLIDE 40

Virtual Machines

[Timeline plot: throughput (IOPS) of VM1, VM2, and VM3 over time; after metadata corruption is injected, the whole file system must be checked, so all three VMs stay down for fsck (496 s) plus boot-up (68 s) before throughput resumes.]

SLIDE 41

Server Virtualization with IceFS

[Diagram: the same three VMs, but each virtual disk is stored in its own cube (cube1, cube2, cube3) within the shared file system.]

SLIDE 42

Server Virtualization with IceFS

[Timeline plots: throughput (IOPS) of VM1, VM2, and VM3 over time after metadata corruption is injected into one cube. IceFS-Offline recovers only the faulty cube offline: fsck 35 s + boot-up 67 s. IceFS-Online: fsck 74 s + boot-up 39 s.]

SLIDE 43

Server Virtualization with IceFS

[Same timeline plots, now highlighting online recovery of a cube: with IceFS-Online, the faulty cube is recovered while the other VMs keep running (fsck 74 s + boot-up 39 s), compared with offline recovery of the cube (fsck 35 s + boot-up 67 s).]

SLIDE 44

Evaluation

Isolate failures?
➡ injected around 200 faults
➡ per-cube failures (read-only or crash) in IceFS

Faster recovery?
➡ independent recovery for a cube

Better journaling performance?
➡ isolated journaling performance for cubes
➡ flexibility between consistency and performance

Useful for applications?
➡ significantly reduces system downtime

SLIDE 45

Summary of IceFS

Local file systems lack physical isolation
➡ physical entanglement
➡ reliability and performance problems

IceFS provides isolation with data containers

Computing is becoming virtualized, shared, and multi-tenant
➡ isolation is the key

Systems need to rethink isolation
➡ avoid entanglement
➡ provide useful abstractions for applications

SLIDE 46

Outline

Introduction

Disentanglement in IceFS
➡ File System Disentanglement
➡ The Ice File System
➡ Evaluation

Key-Value Separation in WiscKey
➡ Key-Value Separation Idea
➡ Challenges and Optimization
➡ Evaluation

Conclusion

SLIDE 47

Key-Value Stores

Key-value stores are important
➡ web indexing, e-commerce, social networks
➡ local and distributed key-value stores
➡ hash tables, B-trees
➡ log-structured merge-trees (LSM-trees)

LSM-tree based key-value stores are popular
➡ optimized for write-intensive workloads
➡ advanced features: range queries, snapshots
➡ widely deployed
  ➡ BigTable and LevelDB at Google
  ➡ HBase, Cassandra, and RocksDB at Facebook

SLIDE 48

LSM-trees Background

[Diagram of LevelDB: key-value writes go to an in-memory memtable; when it fills, it becomes an immutable memtable and is flushed to disk. On disk, a write-ahead log plus sorted levels L0 (8 MB), L1 (10 MB), L2 (100 MB), ..., L6 (1 TB); data is compacted from each level into the next.]

Batch and write sequentially
Sort data for quick lookups

A toy sketch of this write path follows below.
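A minimal, purely in-memory sketch of the LSM-tree write path the slide describes: updates are batched in a memtable and, once it fills, are flushed as a sorted run (what LevelDB would persist as an SSTable). Compaction between levels is omitted; sizes and names are illustrative, not LevelDB's implementation.

MEMTABLE_LIMIT = 4          # entries; stand-in for LevelDB's memtable budget

class TinyLSM:
    def __init__(self):
        self.memtable = {}
        self.runs = []          # newest-first list of sorted runs on "disk"

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= MEMTABLE_LIMIT:
            # Flush: sort the memtable and append it as an immutable run.
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:              # newest run wins
            for k, v in run:
                if k == key:
                    return v
        return None

db = TinyLSM()
for i in range(10):
    db.put(f"k{i:02d}", f"v{i}")
print(db.get("k03"), len(db.runs))        # value found in an older sorted run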

SLIDE 49

I/O Amplification in LSM-trees

Problems: large write amplification, large read amplification

Random load: a 100 GB database; random lookup: 100,000 lookups

[Chart, amplification ratio (log scale) for LevelDB on the 100 GB database: write amplification 14, read amplification 327.]

SLIDE 50

Why LSM-trees?

Good for hard drives
➡ high write throughput
➡ sequential vs. random throughput can differ by up to 1000x

Not optimal for SSDs
➡ large write/read amplification
  ➡ wastes device resources
  ➡ decreases the device's lifetime
➡ unique characteristics of SSDs
  ➡ fast random reads
  ➡ internal parallelism

SLIDE 51

Our Solution: WiscKey

An SSD-conscious LSM-tree store
➡ main idea: separate keys from values
➡ harness the SSD's internal parallelism for range queries
➡ online and lightweight garbage collection
➡ minimize I/O amplification while remaining crash consistent

Performance of WiscKey
➡ 2.5x to 111x for loading, 1.6x to 14x for lookups
➡ both micro- and macro-benchmarks

[Diagram: keys go into the LSM-tree, values into a separate value log.]

SLIDE 52

Key-Value Separation

[Diagram: on the SSD, the LSM-tree stores (key, addr) pairs while the value log stores the values; addr points to the value's location in the log.]

Main idea: only keys are required to be sorted; values can be managed separately.

A minimal sketch of this put/get path follows below.
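A minimal sketch of key-value separation, assuming a single append-only value-log file: put() appends the value to the log and stores only (key, address) in an in-memory index (the LSM-tree's role in WiscKey); get() reads the address from the index and then the value from the log. SeparatedStore and its record format are illustrative.

import os, struct

class SeparatedStore:
    def __init__(self, path):
        self.log = open(path, "a+b")      # value log (append-only)
        self.index = {}                   # key -> (offset, length); LSM-tree stand-in

    def put(self, key, value):
        self.log.seek(0, os.SEEK_END)
        offset = self.log.tell()
        record = struct.pack("II", len(key), len(value)) + key + value
        self.log.write(record)
        self.log.flush()
        self.index[key] = (offset, len(record))

    def get(self, key):
        offset, length = self.index[key]
        self.log.seek(offset)
        record = self.log.read(length)
        ksize, vsize = struct.unpack("II", record[:8])
        return record[8 + ksize: 8 + ksize + vsize]

store = SeparatedStore("vlog.bin")
store.put(b"key1", b"a large value that never enters the sorted tree")
print(store.get(b"key1"))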

SLIDE 53

Outline

Introduction

Disentanglement in IceFS
➡ File System Disentanglement
➡ The Ice File System
➡ Evaluation

Key-Value Separation in WiscKey
➡ Key-Value Separation Idea
➡ Challenges and Optimization
➡ Evaluation

Conclusion

SLIDE 54

Parallel Range Query

SSD read performance
➡ sequential, random, and parallel reads

[Chart: read throughput (MB/s) on a Samsung 840 EVO 500 GB SSD for request sizes from 1 KB to 256 KB, reading a 100 GB file on ext4; series: Sequential, Random (1 thread), Random (32 threads).]

SLIDE 55

Parallel Range Query

Challenge
➡ LevelDB relies on sequential reads
➡ WiscKey reads keys and values separately

Parallel range query
➡ leverage the SSD's parallel random reads
➡ prefetch key-value pairs in advance
  ➡ range query interface: seek(), next(), prev()
  ➡ detect a sequential access pattern
  ➡ prefetch concurrently in the background

A sketch of the prefetching iterator is shown below.
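A sketch of the parallel range-query idea, assuming values are fetched by address from a value log: once a sequential scan is underway, values for upcoming keys are prefetched concurrently from a thread pool, so the SSD's internal parallelism hides the random reads that key-value separation introduces. fetch_value(), range_query(), and PREFETCH_DEPTH are placeholders, not WiscKey's API.

from concurrent.futures import ThreadPoolExecutor

PREFETCH_DEPTH = 32

def fetch_value(addr):
    # Stand-in for a random read of the value log at `addr`.
    return f"value@{addr}"

def range_query(sorted_keys, key_to_addr, pool):
    """Yield (key, value) pairs; values for upcoming keys are read in parallel."""
    pending = {}
    for i, key in enumerate(sorted_keys):
        # Keep a window of in-flight reads ahead of the cursor.
        for j in range(i, min(i + PREFETCH_DEPTH, len(sorted_keys))):
            k = sorted_keys[j]
            if k not in pending:
                pending[k] = pool.submit(fetch_value, key_to_addr[k])
        yield key, pending.pop(key).result()

keys = [f"k{i:03d}" for i in range(100)]
addrs = {k: i * 4096 for i, k in enumerate(keys)}
with ThreadPoolExecutor(max_workers=32) as pool:
    for k, v in range_query(keys, addrs, pool):
        pass  # consume the cursor; values arrive from the prefetch window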

SLIDE 56

Garbage Collection

[Diagram: the value log stores records of the form (ksize, vsize, key, value); the LSM-tree stores (key, addr) entries; tail and head pointers bound the valid range of the log.]

Online and lightweight
➡ append (ksize, vsize, key, value) records to the value log
➡ tail and head pointers delimit the valid range
➡ the tail and head are stored in the LSM-tree

SLIDE 57

Garbage Collection

[Diagram: the GC reads records at the tail, checks each key's address against the LSM-tree, and writes valid pairs back at the head.]

1. Read records from the tail
2. Check the LSM-tree: does the stored address still match?
3. Write valid key-value pairs back to the head
4. Free the reclaimed space and update the pointers

A minimal sketch of this loop follows below.
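A sketch of the garbage-collection loop above, assuming a simple list-based value log: records are read from the tail, kept only if the LSM-tree stand-in still points at them, rewritten at the head, and the tail then advances so the reclaimed range can be freed. In WiscKey the updated tail and head would also be persisted in the LSM-tree; the names here are illustrative.

def garbage_collect(value_log, index, tail, head, chunk=4):
    """value_log: list of (key, value, addr); index: key -> addr (LSM-tree stand-in).
    Returns the new (tail, head) after collecting up to `chunk` records from the tail."""
    for _ in range(chunk):
        if tail >= head:
            break
        key, value, addr = value_log[tail]
        if index.get(key) == addr:              # still valid: rewrite at the head
            value_log.append((key, value, len(value_log)))
            index[key] = len(value_log) - 1     # LSM-tree now points at the new copy
        # else: the record is stale (overwritten or deleted) and is dropped
        tail += 1                               # space before `tail` can be freed
    head = len(value_log)
    return tail, head

# Tiny usage example: k2 was overwritten, so its old record is garbage.
log = [("k1", "v1", 0), ("k2", "old", 1), ("k2", "new", 2)]
idx = {"k1": 0, "k2": 2}
print(garbage_collect(log, idx, tail=0, head=3))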

SLIDE 58

Optimizing the LSM-tree Log

[Diagram: without the optimization, each (ksize, vsize, key, value) record is written both to the LSM-tree log and to the value log.]

LSM-tree log
➡ used for recovery in case of a crash
➡ performance overhead for small key-value pairs

Remove the LSM-tree log in WiscKey
➡ store the head pointer in the LSM-tree periodically
➡ scan the value log from the head to recover

A recovery sketch follows below.
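A sketch of the recovery idea on this slide: instead of an LSM-tree write-ahead log, the value-log head is persisted periodically; after a crash, the log past that head is scanned and the missing (key, address) entries are re-inserted into the index. The record layout matches the earlier SeparatedStore sketch and is an assumption, not WiscKey's on-disk format.

import struct

def recover(vlog_path, index, persisted_head):
    """Replay value-log records written after `persisted_head` into `index`."""
    with open(vlog_path, "rb") as log:
        log.seek(persisted_head)
        offset = persisted_head
        while True:
            header = log.read(8)
            if len(header) < 8:
                break                      # clean end of log: stop here
            ksize, vsize = struct.unpack("II", header)
            body = log.read(ksize + vsize)
            if len(body) < ksize + vsize:
                break                      # partially written record from the crash
            key = body[:ksize]
            index[key] = (offset, 8 + ksize + vsize)
            offset += 8 + ksize + vsize
    return index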

SLIDE 59

WiscKey Implementation

Based on LevelDB
➡ a separate vLog file for values
➡ modified I/O paths to separate keys and values
➡ straightforward to implement

Range query
➡ a background thread pool
➡ detect sequential patterns through the Iterator interface

File-system support
➡ fadvise to predeclare access patterns
➡ hole punching to free space

A sketch of these two file-system hooks is shown below.
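A hedged sketch of the two file-system hooks mentioned above, on Linux: posix_fadvise() declares the value log's access pattern, and fallocate() with FALLOC_FL_PUNCH_HOLE frees garbage-collected ranges without rewriting or shrinking the file. The ctypes call and constants are Linux-specific, error handling is minimal, and the helper names are made up for illustration.

import ctypes, ctypes.util, os

FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02
_libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
_libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                            ctypes.c_longlong, ctypes.c_longlong]

def advise_sequential(fd, offset=0, length=0):
    # Hint that the value log will be read sequentially (e.g., during GC).
    os.posix_fadvise(fd, offset, length, os.POSIX_FADV_SEQUENTIAL)

def punch_hole(fd, offset, length):
    # Release the reclaimed range [offset, offset+length) back to the file system.
    ret = _libc.fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                          offset, length)
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))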

SLIDE 60

Outline

Introduction

Disentanglement in IceFS
➡ File System Disentanglement
➡ The Ice File System
➡ Evaluation

Key-Value Separation in WiscKey
➡ Key-Value Separation Idea
➡ Challenges and Optimization
➡ Evaluation

Conclusion

SLIDE 61

Experiment Setup

Testing machine
➡ 16 cores (3.3 GHz), 64 GB memory
➡ Samsung 840 EVO SSD (500 GB)
  ➡ maximum sequential read: 500 MB/s
  ➡ maximum sequential write: 400 MB/s

Workloads
➡ micro-benchmarks (db_bench)
➡ YCSB benchmark

SLIDE 62

Evaluation

How does key-value separation impact the performance of WiscKey?

SLIDE 63

Sequential Load

[Chart: throughput (MB/s) of sequentially loading a 100 GB database; key size 16 B, value size from 64 B to 256 KB; series: LevelDB, WiscKey.]

WiscKey is over 3x faster due to its write buffer and the removal of the LSM-tree log; log writing in LevelDB has high overhead.

SLIDE 64

Random Load

[Chart: throughput (MB/s) of loading a 100 GB database in random order; key size 16 B, value size from 64 B to 256 KB; series: LevelDB, WiscKey. LevelDB reaches only 2 MB/s to 4.1 MB/s because of its large write amplification (12 to 16).]

Small write amplification in WiscKey due to key-value separation (up to 111x higher throughput).

SLIDE 65

Random Lookup

[Chart: throughput (MB/s) of 100,000 lookups on a randomly loaded 100 GB database; key size 16 B, value size from 64 B to 256 KB; series: LevelDB, WiscKey. LevelDB suffers from large read amplification.]

The smaller LSM-tree in WiscKey leads to better lookup performance (1.6x to 14x).

SLIDE 66

Evaluation

How does key-value separation impact the performance of WiscKey?
➡ low write and read amplification
➡ load (2.5x to 111x), lookup (1.6x to 14x)

Is the parallel range query fast enough?

SLIDE 67

Range Query

[Chart: throughput (MB/s) of reading 4 GB from a randomly loaded 100 GB database; key size 16 B, value size from 64 B to 256 KB; series: LevelDB-Rand, WiscKey-Rand.]

On an unsorted (randomly loaded) database, WiscKey is better for large key-value pairs but worse for small ones: it is limited by the SSD's parallel random-read performance.

SLIDE 68

Range Query

[Chart: throughput (MB/s) of reading 4 GB from a sequentially loaded 100 GB database; series: LevelDB-Rand, WiscKey-Rand, LevelDB-Seq, WiscKey-Seq.]

Sorted databases help WiscKey's range query: both WiscKey and LevelDB then read sequentially.

SLIDE 69

Evaluation

How does key-value separation impact the performance of WiscKey?
➡ low write and read amplification
➡ load (2.5x to 111x), lookup (1.6x to 14x)

Is the parallel range query fast enough?
➡ limited by random read performance
➡ sorting helps

How about real workloads? What is the effect of garbage collection?

SLIDE 70

YCSB Benchmarks

Workloads: A: 50% R, 50% U; B: 95% R, 5% U; C: 100% R; D: 95% R, 5% I; E: 95% Scan, 5% I; F: 50% R, 50% RMW

[Chart: performance normalized to LevelDB (log scale); key size 16 B, value size 1 KB; systems: LevelDB, RocksDB, WiscKey-GC, WiscKey; workloads: LOAD, A, B, C, D, E, F. WiscKey's speedups, read left to right across the workloads: LOAD 48x-116x, A 6x-16x, B 2x-20x, C 2.6x-25x, D 1.5x-4x, E 1x-7x, F 6x-8x. Workload E gains the least because it issues many small range queries.]

SLIDE 71

Evaluation

How does key-value separation impact the performance of WiscKey?
➡ low write and read amplification
➡ load (2.5x to 111x), lookup (1.6x to 14x)

Is the parallel range query fast enough?
➡ limited by random read performance
➡ sorting helps

How about real workloads? What is the effect of garbage collection?
➡ faster on all workloads
➡ performance similar to the micro-benchmarks

SLIDE 72

Summary of WiscKey

LSM-trees are not optimized for SSD devices
WiscKey separates keys from values with an SSD-conscious design

Many novel storage systems have been built for hard drives

Transition to new storage hardware
➡ leverage existing software
➡ explore new ways to utilize the new hardware
➡ get the best of both worlds

SLIDE 73

Outline

Introduction

Disentanglement in IceFS
➡ File System Disentanglement
➡ The Ice File System
➡ Evaluation

Key-Value Separation in WiscKey
➡ Key-Value Separation Idea
➡ Challenges and Optimization
➡ Evaluation

Conclusion

SLIDE 74

Lessons Learned

A large-scale study is feasible and valuable
Research should match reality
History repeats itself
Don't settle for existing abstractions
Isolation should be a fundamental design goal
Don't run old software on new hardware
Fundamental details matter
Work on systems that are extremely slow or unreliable

SLIDE 75

Conclusion

Local storage systems are important

Physical separation is useful
➡ improves both reliability and performance by over 10x
➡ better reliability: isolated failures, localized recovery
➡ better performance: specialized journaling, minimized I/O amplification

Computing and storage are evolving
➡ virtualized, shared, and fast
➡ physical separation is the key
➡ IceFS and WiscKey are just a beginning