Beyond Block I/O: Rethinking Traditional Storage Primitives
Xiangyong Ouyang*┼, David Nellans┼, Robert Wipfel┼, David Flynn┼, D. K. Panda*
* The Ohio State University ┼ Fusion-io
Flash Translation Layer
Storage stack, top to bottom: Applications, OS, File System, Flash Translation Layer, Flash Media
[Figure: log-structured FTL. A Mapping LBA->PBA table points logical blocks into an append-only log of physical blocks (PBA 10 to 16), delimited by a log head and a log tail.]
[Figure: a Write Request for LBA 3 is appended at the log tail; the Mapping LBA->PBA table is updated and the log tail advances.]
Log-FTL Advantages
Avoid in-place update (Block Remapping)
Even wear-leveling
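The remapping behaviour behind these advantages can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Fusion-io FTL: the class name `LogFTL`, its fields, the starting PBA of 10, and the LBA values are all hypothetical and chosen only for illustration. It shows that an overwrite never touches the old physical block; it appends at the log tail and redirects the mapping.

```python
class LogFTL:
    """Toy log-structured FTL (illustrative names, not a real device API)."""

    def __init__(self, first_pba=10):
        self.mapping = {}      # LBA -> PBA holding the newest copy
        self.log = {}          # PBA -> (LBA, data); blocks are never rewritten
        self.tail = first_pba  # next free physical block at the log tail

    def write(self, lba, data):
        """Append the block at the log tail and remap the LBA to it."""
        pba = self.tail
        self.log[pba] = (lba, data)
        self.tail += 1
        stale = self.mapping.get(lba)  # old copy, left for garbage collection
        self.mapping[lba] = pba
        return pba, stale

    def read(self, lba):
        """Follow the mapping to the newest copy of the block."""
        return self.log[self.mapping[lba]][1]


ftl = LogFTL(first_pba=10)
for lba in (2, 3, 4, 5):
    ftl.write(lba, f"v1-{lba}")

# Overwriting LBA 3 lands in a fresh physical block at the tail.
new_pba, stale_pba = ftl.write(3, "v2-3")
```

Because every write consumes the next block in the log, writes are spread across the media regardless of which logical addresses are hot, which is the intuition behind the wear-leveling point above.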
ACID: Atomicity, Consistency, Isolation, Durability
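[Figure: each block in the log carries a Flag Bit next to its LBA and PBA. Blocks of an Atomic-Write group carry flags 0 0 … 1 (only the last block is flagged 1); a non-atomic write (Non-AW) carries flag 1. New blocks are appended at the log tail.]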
[Figure: an incoming Atomic-Write group for LBAs 4, 6, 8 is appended at the log tail (PBA 10 to 18 shown), next to the Mapping LBA->PBA table.]
Atomic-Write Group
Write LBA 4, 6, 8, then Update Mapping
[Figure: state of the log and flag bits for each failure case below.]
(1) Failure during writing: incomplete Atomic-Write group with "0" flag bits; revert to the previous version
(2) Failure after writing: complete Atomic-Write group; rebuild the FTL mapping
(3) Failure when updating FTL: complete Atomic-Write group; rebuild the FTL mapping
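The flag-bit scheme and the failure cases can be illustrated with a small simulation. This is a minimal sketch under assumptions, not the device firmware: the function names, the `fail_after` crash knob, and the tuple layout `(flag, lba)` are hypothetical, and the sketch assumes the blocks of one atomic group are contiguous in the log. Recovery drops any trailing run of flag-0 blocks (an incomplete group) and rebuilds the mapping by replaying what remains.

```python
def append_plain(log, lba):
    """Non-atomic write: a single block with flag 1."""
    log.append((1, lba))


def append_atomic_group(log, lbas, fail_after=None):
    """Append an atomic group: flag 0 on every block except the last (flag 1).

    fail_after=k simulates a crash after k blocks reached the log.
    """
    for i, lba in enumerate(lbas):
        if fail_after is not None and i >= fail_after:
            return
        flag = 1 if i == len(lbas) - 1 else 0
        log.append((flag, lba))


def recover(log, first_pba=10):
    """Discard an incomplete trailing group, then rebuild LBA -> PBA."""
    end = len(log)
    while end > 0 and log[end - 1][0] == 0:
        end -= 1  # trailing flag-0 blocks belong to an unfinished group
    mapping = {}
    for i in range(end):
        _flag, lba = log[i]
        mapping[lba] = first_pba + i  # later entries override earlier ones
    return mapping


base = []
for lba in (3, 4, 5, 6):
    append_plain(base, lba)

# Case (1): crash after two of the three blocks were written.
crashed = list(base)
append_atomic_group(crashed, [4, 6, 8], fail_after=2)
after_crash = recover(crashed)

# Cases (2) and (3): all blocks written, mapping lost before or during update.
complete = list(base)
append_atomic_group(complete, [4, 6, 8])
after_rebuild = recover(complete)
```

In case (1) the recovered mapping still points at the previous versions of LBAs 4 and 6; in cases (2) and (3) the replay picks up the whole group, so either all three blocks are visible or none are.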
Layered view, top to bottom:
DBMS, Applications, File System
Generalized Solid State Storage Layer: Write Atomicity, Wear-Leveling, More …
Solid State Storage
Flush dirty buffer pages to TableFile
Memory: Buffer Pool, DoubleWrite Buffer
Stable Storage (Table File): DoubleWrite Area, TableSpace Area
Phase I: pages are written to the DoubleWrite Area
Phase II: the same pages are written to the TableSpace Area
Impacts performance
Doubles the amount of writes to Flash media, halving the device's lifespan
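The write amplification of the two-phase flush is simple arithmetic, sketched below. The helper names are hypothetical, and the 16 KiB page size is an assumption (it is InnoDB's default page size, but the slides do not state the value used).

```python
PAGE_SIZE = 16 * 1024  # assumed page size in bytes; not given in the slides


def bytes_written_doublewrite(num_pages, page_size=PAGE_SIZE):
    """Two-phase flush: each page goes to the DoubleWrite Area, then the TableSpace Area."""
    phase1 = num_pages * page_size
    phase2 = num_pages * page_size
    return phase1 + phase2


def bytes_written_single(num_pages, page_size=PAGE_SIZE):
    """Single-pass flush: each page is written to the TableSpace Area only."""
    return num_pages * page_size


pages = 64
amplification = bytes_written_doublewrite(pages) / bytes_written_single(pages)
```

The ratio is 2 for any page count and page size, which is why doubling the writes is expected to roughly halve the number of page flushes the flash media can absorb over its lifetime.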
Memory: Buffer Pool
Stable Storage: Table File
int atomic_write(int fd, void* buf[], long *length[], long * offsets[], int num);
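The slide gives only the prototype, so the semantics below are an assumption: a vectored call that applies every (buffer, length, offset) triple or none of them. The Python function `atomic_write_model`, its return values, and the in-memory `bytearray` standing in for the table file are all hypothetical; this models the all-or-nothing contract and is not a binding to the real interface.

```python
def atomic_write_model(storage, bufs, lengths, offsets):
    """Apply all writes or none (assumed semantics of a vectored atomic write).

    Returns the number of buffers written, or -1 if any write is invalid,
    in which case `storage` is left untouched.
    """
    if not (len(bufs) == len(lengths) == len(offsets)):
        return -1
    staged = bytearray(storage)  # work on a copy so a failure has no effect
    for buf, length, off in zip(bufs, lengths, offsets):
        if length < 0 or off < 0 or length > len(buf) or off + length > len(staged):
            return -1
        staged[off:off + length] = buf[:length]
    storage[:] = staged  # publish every write in one step
    return len(bufs)


table_file = bytearray(16)

# Two dirty pages flushed in one compound call, with no DoubleWrite phase.
ok = atomic_write_model(table_file, [b"AAAA", b"BB"], [4, 2], [0, 8])
snapshot = bytes(table_file)

# The second write runs past the end, so neither write is applied.
bad = atomic_write_model(table_file, [b"CC", b"DD"], [2, 2], [2, 15])
```

On a device that provides this guarantee natively, the database can skip the DoubleWrite Area entirely, since a torn multi-page flush can no longer be observed.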
Processor: Xeon X3210 @ 2.13GHz
DRAM: 8GB DDR2 667MHz, 4X2GB
Boot Device: 250GB SATA-II 3.0Gb/s
DB Storage Device: Fusion-io ioDrive 320GB, PCIe 1.0 4x Lanes
OS: Ubuntu 9.10, Linux Kernel 2.6.33
Latency (us) by write strategy
Write Pattern   Buffering   Sync   Async   A-Write
Random          Buffered    4042   1112    NA
Random          DirectIO    3542   851     671
Strided         Buffered    4006   1146    NA
Strided         DirectIO    3447   857     669
Sequential      Buffered    3955   330     NA
Sequential      DirectIO    3402   898     685
Atomic Write: all blocks in one compound write
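To make the latency figures easier to compare, the relative reduction of A-Write against the Sync and Async DirectIO rows can be computed directly from the table. The numbers are copied from the DirectIO rows above; the dictionary layout and the helper name `reduction_pct` are illustrative choices, not part of the original study.

```python
# DirectIO latencies in microseconds, taken from the table above.
latency_us = {
    "Random": {"Sync": 3542, "Async": 851, "A-Write": 671},
    "Strided": {"Sync": 3447, "Async": 857, "A-Write": 669},
    "Sequential": {"Sync": 3402, "Async": 898, "A-Write": 685},
}


def reduction_pct(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


summary = {
    pattern: {
        strategy: round(reduction_pct(row[strategy], row["A-Write"]), 1)
        for strategy in ("Sync", "Async")
    }
    for pattern, row in latency_us.items()
}

for pattern, row in summary.items():
    print(f"{pattern}: {row['Sync']}% lower than Sync, {row['Async']}% lower than Async")
```

Across the three patterns A-Write latency comes out roughly 80% below Sync and a little over 20% below Async for DirectIO.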
Bandwidth (MB/s) by write strategy
Write Pattern   Buffering   Sync   Async   A-Write
Random          Buffered    302    301     NA
Random          DirectIO    212    505     513
Strided         Buffered    306    300     NA
Strided         DirectIO    217    503     513
Sequential      Buffered    308    304     NA
Sequential      DirectIO    213    507     514
Atomic Write: all blocks in one compound write
[Figure: Transaction Throughput for TPC-C, TPC-H and SysBench; series: MySQL, DoubleWrite Disabled, Atomic-Write.]
23% improvement (ACID compliant)
8% improvement (not ACID compliant)
[Figure: Data Written for TPC-C, TPC-H and SysBench; series: MySQL, DoubleWrite Disabled, Atomic-Write.]
46% reduction (not ACID compliant)
43% reduction (ACID compliant) (higher throughput generates more transaction log)
[Figure: Transaction Latency for TPC-C, TPC-H and SysBench; series: MySQL, DoubleWrite Disabled, Atomic-Write.]
20% improvement (ACID compliant)
9% improvement (not ACID compliant)
[Figure: Trans/Minute (Higher is Better) and Data Written (Lower is Better), with DoubleWrite as the Baseline, at ratios 1:1, 1:2, 1:4, 1:10, 1:25, 1:100, 1:500, 1:1000. One point is marked "Results in previous slides".]
7% improvement; 33% improvement
DB workload: TPC-C (DBT2)
[Figure: Trans/Second (Higher is Better) and Data Written (Lower is Better), with DoubleWrite as the Baseline, at 0%, 10%, 33%, 50%, 67%, 90%, 100%.]
33% improvement
28 - 40% Reduction
DB workload: SysBench