

SLIDE 1

Beyond Block I/O: Rethinking Traditional Storage Primitives

Xiangyong Ouyang*†, David Nellans†, Robert Wipfel†, David Flynn†, D. K. Panda*

*The Ohio State University   †Fusion-io

SLIDE 2

Agenda

  • Introduction and Motivation
    – Solid State Storage (SSS) Characteristics
    – Duplicated efforts at SSS and upper layers
  • Atomic-Write Primitive within FTL
  • Leverage Atomic-Write in DBMS
    – Example with MySQL
  • Experimental Results
  • Conclusion and Future Work

SLIDE 3

Evolution of Storage Devices

  • Interface to persistent storage remains unchanged for decades
    – seek, read, write
    – Fits well with mechanical hard disks
  • Solid State Storage (SSS)
    – Merits:
      • Fast random access, high throughput
      • Low power consumption
      • Shock resistance, small form factor
    – Exposes the same disk-based block I/O interface
    – Challenges…

SLIDE 4

NAND-flash Based Solid State Storage (SSS)

  • Pitfalls
    – Asymmetric read/write latency
    – Cannot overwrite before erasure
      • Erasure at a large unit (64-256 pages), very slow (1+ ms)
    – Flash wear-out: limited write durability
      • SLC: 30K erase/program cycles, MLC: 3K erase/program cycles

[Figure: storage stack, top to bottom: Applications, OS, File System, Flash Translation Layer, Flash Media]

  • Flash Translation Layer (FTL)
    – Input: Logical Block Address (LBA)
    – Output: Physical Block Address (PBA)

SLIDE 5

Log-Structured FTL

[Figure: the mapping table translates LBAs to PBAs; writes are appended between the log head and the log tail across the physical blocks (PBA 10-16)]

SLIDE 6

Log-Structured FTL

[Figure: an incoming write request to LBAs 2 and 3 is appended at the log tail into fresh physical blocks; the mapping table is redirected to the new locations, leaving the old copies stale]

  • Log-FTL Advantages:
    – Avoids in-place updates (block remapping)
    – Even wear-leveling
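The append-and-remap behavior pictured above can be sketched as a tiny user-space model. All names and structures here are illustrative assumptions; a real FTL also garbage-collects stale blocks and allocates blocks wear-aware, which this sketch omits:

```c
#include <assert.h>

#define NUM_LBAS 32
#define INVALID  -1

/* Minimal log-structured FTL model: every write is appended at the log
 * tail and the LBA->PBA map is redirected; old physical blocks go stale. */
typedef struct {
    int map[NUM_LBAS];  /* LBA -> PBA, INVALID if unmapped          */
    int log_tail;       /* next free physical block (append point)  */
} ftl_t;

void ftl_init(ftl_t *f) {
    for (int i = 0; i < NUM_LBAS; i++) f->map[i] = INVALID;
    f->log_tail = 0;
}

/* Write one logical block: no in-place update, just append + remap. */
int ftl_write(ftl_t *f, int lba) {
    int pba = f->log_tail++;  /* append at the log tail                */
    f->map[lba] = pba;        /* redirect the mapping                  */
    return pba;               /* the previous PBA (if any) is now stale */
}

int ftl_read(const ftl_t *f, int lba) { return f->map[lba]; }
```

Because every overwrite lands on a fresh physical block, writes naturally spread across the media, which is where the even wear-leveling comes from.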

SLIDE 7

Duplicated Efforts at Upper Layers and FTL

  • Multi-versioning at the upper layers
    – DBMS (transactional log)
    – File systems (metadata journaling, copy-on-write)
    – Goal: write atomicity
      • ACID: Atomicity, Consistency, Isolation, Durability
  • Block remapping at the FTL
    – Avoids in-place updates in the critical path
  • Common thread: multiple versions of the same data
  • Why duplicate this effort?
  • Proposed approach:
    – Offload the write-atomicity guarantee to the FTL
    – Provide an Atomic-Write primitive to upper layers

SLIDE 8

Agenda

  • Introduction and Motivation
  • Atomic-Write Primitive at FTL
  • Leverage Atomic-Write in DBMS
  • Experimental Results
  • Conclusion and Future Work

SLIDE 9

Atomic-Write: a New Block I/O Primitive

  • Offloads the write-atomicity guarantee into the FTL
  • Combines multi-block writes into one logical group (contiguous or non-contiguous)
  • Commits the group as an atomic unit if the compound operation succeeds
  • Rolls back the whole group if any individual write fails
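These all-or-nothing semantics can be modeled in user space as follows. This is a sketch under assumed names; the real primitive operates inside the FTL on flash blocks, not on an in-memory array:

```c
#include <assert.h>
#include <string.h>

#define NBLK 16
#define BSZ  8

/* All-or-nothing group write, modeled with a shadow copy: the group's
 * writes are applied to scratch state first; only if every one succeeds
 * is the scratch state installed, otherwise the store is left exactly
 * as it was (the "rollback"). */
typedef struct { char blk[NBLK][BSZ]; } store_t;

int atomic_group_write(store_t *s, const int *lbas,
                       char data[][BSZ], int n) {
    store_t shadow = *s;                    /* work on a scratch copy    */
    for (int i = 0; i < n; i++) {
        if (lbas[i] < 0 || lbas[i] >= NBLK) /* any failure aborts...     */
            return -1;                      /* ...leaving *s untouched   */
        memcpy(shadow.blk[lbas[i]], data[i], BSZ);
    }
    *s = shadow;                            /* commit as one unit        */
    return 0;
}
```

Note the group may name non-contiguous LBAs; the caller sees either all of them updated or none.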

SLIDE 10

Atomic-Write (1): Flag Bit in Block Header

  • One flag bit per block header
    – Identifies blocks belonging to the same atomic group

[Figure: blocks at the log tail carry flag bits; the blocks of an Atomic-Write group are flagged 0 0 … 1, with the 1 on the group's final block, while non-atomic writes always carry flag 1]

  • Non-atomic writes are not allowed to interleave with an Atomic-Write group
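One way to model this flag-bit scheme in code (the structures and names are illustrative, not the driver's actual on-flash format):

```c
#include <assert.h>

#define LOG_CAP 64

/* Per-block header flag, as on the slide: blocks of an in-flight atomic
 * group carry 0; the group's final block (and any normal write) carries 1. */
typedef struct { int lba; int flag; } block_t;

typedef struct { block_t log[LOG_CAP]; int tail; } aw_log_t;

/* Append an atomic group: flag 0 on every block except the last. */
void aw_append_group(aw_log_t *l, const int *lbas, int n) {
    for (int i = 0; i < n; i++) {
        l->log[l->tail].lba  = lbas[i];
        l->log[l->tail].flag = (i == n - 1) ? 1 : 0;
        l->tail++;
    }
}

/* A 0 at the log tail means the last group never completed. */
int aw_tail_group_complete(const aw_log_t *l) {
    return l->tail == 0 || l->log[l->tail - 1].flag == 1;
}
```

This is also why interleaving is forbidden: a non-atomic write's flag-1 block landing mid-group would make a partial group look complete.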

SLIDE 11

Atomic-Write (2): Deferred Mapping Table Update

  • Defer the mapping table update until the whole group is written
    – Does not expose partial state to readers

[Figure: an incoming Atomic-Write group for LBAs 4, 6, and 8 is appended at the log tail; the LBA-to-PBA mapping table still points at the old locations until the whole group completes]
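The deferral can be sketched like this (hypothetical structures and names; in the real device the staged updates live in the FTL, and readers consult only the committed map):

```c
#include <assert.h>

#define NUM_LBAS  32
#define GROUP_MAX 16

/* Deferred mapping update: blocks of an atomic group are appended to the
 * log immediately, but the LBA->PBA map is touched only at commit time,
 * so readers never observe a partially applied group. */
typedef struct {
    int map[NUM_LBAS];                            /* committed view     */
    int log_tail;
    int pend_lba[GROUP_MAX], pend_pba[GROUP_MAX]; /* staged map updates */
    int pend_n;
} ftl_t;

void aw_stage(ftl_t *f, int lba) {
    int pba = f->log_tail++;          /* data hits the log now...      */
    f->pend_lba[f->pend_n] = lba;     /* ...but the map update waits   */
    f->pend_pba[f->pend_n] = pba;
    f->pend_n++;
}

void aw_commit(ftl_t *f) {            /* apply all map updates at once */
    for (int i = 0; i < f->pend_n; i++)
        f->map[f->pend_lba[i]] = f->pend_pba[i];
    f->pend_n = 0;
}
```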

SLIDE 12

Atomic-Write (3): Failure Recovery

[Figure: an Atomic-Write group for LBAs 4, 6, and 8 is written at the log tail, then the mapping table is updated; a failure can strike during the writes, after the writes, or while updating the FTL]

(1) Failure during writing:
  • The log tail carries a "0" flag bit: incomplete Atomic-Write group
  • Scan backwards, discard blocks with "0" flag bits
  • Roll back the partial blocks to the previous version
(2) Failure after writing:
  • The log tail carries a "1" flag bit: complete Atomic-Write group
  • Scan the log from the beginning, rebuild the FTL mapping
(3) Failure when updating the FTL:
  • Same as (2)
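The backward scan of case (1) might look like the following sketch (illustrative types; the subsequent forward replay that rebuilds the mapping table for cases (2) and (3) is not shown):

```c
#include <assert.h>

#define LOG_CAP 64

typedef struct { int lba; int flag; } block_t;

/* Crash recovery, following the slide: if the block at the log tail
 * carries flag 0, the last atomic group never completed, so scan
 * backwards and discard its partial blocks. The surviving log can then
 * be replayed from the beginning to rebuild the LBA->PBA mapping. */
int recover_tail(block_t *log, int tail) {
    while (tail > 0 && log[tail - 1].flag == 0)
        tail--;             /* drop blocks of the partial group       */
    return tail;            /* new tail: ends in a flag-1 block (or empty) */
}
```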

SLIDE 13

Agenda

  • Introduction and Motivation
  • Atomic-Write Primitive at FTL
  • Leverage Atomic-Write in DBMS
    – Example with MySQL
  • Experimental Results
  • Conclusion and Future Work

SLIDE 14

Proposed Storage Stack

[Figure: DBMS, applications, and file systems sit on a Generalized Solid State Storage Layer that provides Write Atomicity, Wear-Leveling, and more, on top of the Solid State Storage]

Example: Leverage Atomic-Write in DBMS (MySQL)

SLIDE 15

DoubleWrite with MySQL InnoDB Storage Engine

  • Dirty buffer pages are flushed from the buffer pool to the table file on:
    – memory pressure
    – commit()
    – timeout

[Figure: Phase I copies dirty pages from the in-memory buffer pool to the DoubleWrite area on stable storage; Phase II writes them to the TableSpace area of the table file]

  • Every data page is written twice!
    – Impacts performance
    – Doubles the amount of writes to the flash media, halving the device's lifespan
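Stripped to its essence, the two-phase flush looks roughly like this. It is a sketch, not InnoDB's actual code; `doublewrite_flush`, the fixed doublewrite offset, and the layout are assumptions for illustration:

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define PAGE 4096

/* DoubleWrite, reduced to its essence: Phase I writes the dirty pages
 * sequentially into a dedicated doublewrite area and syncs; Phase II
 * writes the same pages to their real tablespace offsets and syncs
 * again. A torn Phase-II page can be repaired from its copy in the
 * doublewrite area -- at the cost of writing every page twice. */
int doublewrite_flush(int fd, long dw_off, char pages[][PAGE],
                      const long *offsets, int n) {
    for (int i = 0; i < n; i++)           /* Phase I: doublewrite area */
        if (pwrite(fd, pages[i], PAGE, dw_off + (long)i * PAGE) != PAGE)
            return -1;
    if (fsync(fd) != 0) return -1;
    for (int i = 0; i < n; i++)           /* Phase II: real locations  */
        if (pwrite(fd, pages[i], PAGE, offsets[i]) != PAGE)
            return -1;
    return fsync(fd);
}
```

The doubled write traffic visible here is exactly what the Atomic-Write path on the next slide eliminates.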

SLIDE 16

MySQL InnoDB: Atomic-Write

[Figure: dirty pages move from the in-memory buffer pool straight to the table file on stable storage through a single call:]

int atomic_write(int fd, void *buf[], long *length[], long *offsets[], int num);

  • Reduces the data written by half
  • Doubles the effective wear-out life
  • Simplifies the upper-layer design
  • Better performance
  • Guarantees the same level of data integrity as DoubleWrite
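The prototype above suggests a call site like the following. Since the real atomic_write() lives in the research branch of the Fusion-io driver, this sketch substitutes a plainly non-atomic pwrite()-based stand-in just to illustrate the scatter-list calling convention (the stand-in also flattens the slide's `long *length[]` to `long length[]`, an assumption on my part):

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Stand-in for the primitive on the slide:
 *   int atomic_write(int fd, void *buf[], long *length[],
 *                    long *offsets[], int num);
 * The real implementation commits all writes as one unit inside the
 * FTL; this user-space version merely issues the scattered writes with
 * pwrite() and provides NO atomicity -- it only shows the interface. */
int atomic_write_standin(int fd, void *buf[], long length[],
                         long offsets[], int num) {
    for (int i = 0; i < num; i++)
        if (pwrite(fd, buf[i], (size_t)length[i], offsets[i]) != length[i])
            return -1;             /* the real primitive would roll back */
    return fsync(fd);
}
```

One call replaces the DoubleWrite dance: a batch of non-contiguous pages, handed down together, either all become durable or none do.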

SLIDE 17

Agenda

  • Introduction and Motivation
  • Atomic-Write Primitive at FTL
  • Leverage Atomic-Write in DBMS
  • Experimental Results
  • Conclusion and Future Work

SLIDE 18

Experiment Setup

  • Fusion-io 320 GB MLC NAND-flash based device
  • Atomic-Write implemented in a research branch of the v2.1 Fusion-io driver
  • MySQL 5.1.49 InnoDB (extended with Atomic-Write)
    – 2 machines connected with 1 GigE
    – Both the transaction log and the table file stored on solid state

Processor           Xeon X3210 @ 2.13 GHz
DRAM                8 GB DDR2 667 MHz, 4x2 GB
Boot Device         250 GB SATA-II 3.0 Gb/s
DB Storage Device   Fusion-io ioDrive 320 GB, PCIe 1.0 4x lanes
OS                  Ubuntu 9.10, Linux kernel 2.6.33

SLIDE 19

Micro Benchmark

  • Different write mechanisms:
    – Synchronous: write() + fsync()
    – Asynchronous: libaio
    – Atomic-Write
  • Different write patterns:
    – Sequential
    – Strided
    – Random
  • Buffer strategies:
    – Buffered_IO: uses the OS page cache
    – Direct_IO: bypasses the OS page cache

SLIDE 20

I/O Microbenchmark: Latency

Write latency in microseconds, lower is better (64 blocks, 512 B each):

Pattern      Buffering   Sync   Async   A-Write
Random       Buffered    4042   1112    NA
Random       DirectIO    3542    851    671
Strided      Buffered    4006   1146    NA
Strided      DirectIO    3447    857    669
Sequential   Buffered    3955    330    NA
Sequential   DirectIO    3402    898    685

  • Atomic-Write: all blocks in one compound write
  • Synchronous write: write() + fsync()
  • Asynchronous write: Linux libaio

SLIDE 21

I/O Microbenchmark: Bandwidth

Write bandwidth in MB/s, higher is better (64 blocks, 16 KB each):

Pattern      Buffering   Sync   Async   A-Write
Random       Buffered     302    301    NA
Random       DirectIO     212    505    513
Strided      Buffered     306    300    NA
Strided      DirectIO     217    503    513
Sequential   Buffered     308    304    NA
Sequential   DirectIO     213    507    514

  • Atomic-Write: all blocks in one compound write
  • Synchronous write: write() + fsync()
  • Asynchronous write: Linux libaio
SLIDE 22

Transaction Throughput

[Figure: normalized transaction throughput for TPC-C, TPC-H, and SysBench; Atomic-Write yields a 23% improvement (ACID compliant), while disabling DoubleWrite yields an 8% improvement (not ACID compliant)]

  • DB workloads: TPC-C (DBT2), TPC-H (DBT3), SysBench
  • Buffer Pool : Database = 1 : 10

SLIDE 23

Data Written to SSS

[Figure: normalized data written to SSS for TPC-C, TPC-H, and SysBench; disabling DoubleWrite reduces writes by 46% (not ACID compliant), Atomic-Write by 43% (ACID compliant); higher throughput generates more transaction log]

  • DB workloads: TPC-C (DBT2), TPC-H (DBT3), SysBench
  • Buffer Pool : Database = 1 : 10

SLIDE 24

Transaction Latency

[Figure: normalized transaction latency for TPC-C, TPC-H, and SysBench; Atomic-Write improves latency by 20% (ACID compliant), disabling DoubleWrite by 9% (not ACID compliant)]

  • DB workloads: TPC-C (DBT2), TPC-H (DBT3), SysBench
  • Buffer Pool : Database = 1 : 10

SLIDE 25

DB-Buffer-Pool Size : DB On-Disk Size

[Figure: Atomic-Write relative to the DoubleWrite baseline as the buffer pool : database ratio varies from 1:1 to 1:1000; transactions/minute (higher is better) improve by 7% to 33%, and data written (lower is better) drops; the 1:10 ratio corresponds to the results in the previous slides]

  • DB workload: TPC-C (DBT2)
  • Vary the Buffer Pool : Database size ratio
  • Atomic-Write vs. DoubleWrite

SLIDE 26

DB Records Update Ratio

[Figure: Atomic-Write relative to the DoubleWrite baseline as the update ratio varies from 0% to 100%; transactions/second (higher is better) improve by up to 33%, and data written (lower is better) is reduced by 28-40%]

  • DB workload: SysBench
  • Vary the update ratio in the total workload
  • Atomic-Write vs. DoubleWrite

SLIDE 27

Conclusions

  • Solid State Storage opens opportunities for higher-order primitives in storage interfaces
  • Atomic-Write allows multi-block write operations to be completed as an atomic unit
  • Benefits upper layers with ACID requirements
    – OS, file systems, DBMS, applications
    – Reduced complexity
    – Improved performance
    – Improved device durability

SLIDE 28

Future Work

  • Work with Linux kernel maintainers to integrate atomic-write in a non-proprietary way
  • Support multiple outstanding atomic-write groups
    – Full transactional support
  • Explore other higher-order I/O primitives

SLIDE 29

Thank You!