Towards 0-Latency Durability
Sang-Won Lee (swlee@skku.edu)
Ack.: Moon, Yang, Oh and SKKU VLDB Lab. Members
NVRAMOS 2014

NVRAM is for 0-latency Durability

(DB) Transaction and ACID
E.g., a $100 transfer from account A to account B
[Diagram: database pages cached in the BUFFER POOL in MAIN MEMORY (volatile); the database itself resides on DISK (non-volatile)]
– ~2ms @ HDD
– ~0.2ms @ SSD
– ~0-latency @ NVDRAM??
[Diagram: between Begin_tx1 and Commit_tx1, log records accumulate in the Log Buffer in MAIN MEMORY (volatile); at Commit_tx1 the Log Buffer is forced to the LOG on DISK (non-volatile), while dirty pages remain in the BUFFER POOL]
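The commit path in the diagram above (acknowledge a transaction only after its log records are forced to stable storage) can be sketched as a minimal write-ahead log. This is an illustrative sketch, not any real DBMS's implementation; the file path and record format are made up.

```python
import os

class WriteAheadLog:
    """Minimal WAL sketch: records accumulate in a volatile buffer and
    are forced to stable storage at commit time (hypothetical format)."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.buffer = bytearray()      # volatile log buffer (no I/O yet)

    def append(self, record: bytes):
        self.buffer += record + b"\n"  # buffered only; cheap

    def commit(self):
        # Force the log: write() moves the bytes to the OS page cache,
        # fsync() pushes them to the device. This fsync is where the
        # commit latency (HDD ms, SSD sub-ms, NVRAM ~0) is paid.
        os.write(self.fd, bytes(self.buffer))
        os.fsync(self.fd)
        self.buffer.clear()

log = WriteAheadLog("/tmp/wal.log")    # illustrative path
log.append(b"begin tx1")
log.append(b"update A -= 100; update B += 100")
log.append(b"commit tx1")
log.commit()                           # durable once fsync returns
```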
[Plots: SSD IOPS over time. Left: IOPS (10,000-100,000 range) over 1,400 seconds, with GC/WL (garbage collection / wear leveling) effects marked. Right: read vs. write IOPS (1,000-7,000 range) over 4,500 seconds]
[Diagram: flush path in the Database Buffer. The buffer manager scans the Main LRU List from the Tail toward the Head, collects dirty pages (D) into a Dirty Page Set, writes them first to the Double Write Buffer and then to the Database, and returns clean frames to the Free buffer List]
Issue      | Technique        | Problem
-----------|------------------|----------------------------------------------------------
Latency    | Buffer pool      | Read is blocked until dirty pages are written to storage
Atomicity  | Redundant writes | One write to the double write buffer, the other to the data pages
Durability | Write barrier    | Flush dirty pages from the OS to the device, then from the write cache to the media
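The redundant-write row can be illustrated with a minimal double-write sketch: dirty pages are written sequentially to a double write buffer area first, and only after that copy is durable are they written in place, so a torn in-place write can later be repaired from the DWB. This is a simplified sketch of the general technique (the function name and 4KB page size are assumptions, not InnoDB's actual code).

```python
import os

PAGE = 4096  # assumed page size for this sketch

def flush_dirty_pages(db_fd, dwb_fd, dirty):
    """Double-write sketch: (page_no, data) pairs in `dirty` are made
    durable in the DWB before being written to their home locations."""
    # Step 1: sequential write of all dirty pages to the DWB area.
    os.lseek(dwb_fd, 0, os.SEEK_SET)
    for page_no, data in dirty:
        os.write(dwb_fd, data)
    os.fsync(dwb_fd)              # barrier #1: DWB copy is durable
    # Step 2: write each page to its home location in the data file.
    for page_no, data in dirty:
        os.lseek(db_fd, page_no * PAGE, os.SEEK_SET)
        os.write(db_fd, data)
    os.fsync(db_fd)               # barrier #2: in-place copies durable
```

Note the two fsync barriers per flush cycle: this is exactly the latency/redundancy cost the table above attributes to the technique.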
[Diagram: the durable write path. The DBMS buffer manager issues Write(P1, ..., Pn); fsync() makes the OS (write barrier enabled) flush the data plus file metadata and issue flush_cache to the storage; the drive's volatile write buffer (16MB-512MB) is drained through FTL address mapping to the persistent media (flash memory or magnetic disk). The caller is blocked until the whole chain completes]
[Plot: IOPS (log scale, 15-15,000) vs. number of write pages per fsync (1, 4, 8, 16, 32, 64, 128, 256, and no fsync) for DuraSSD-NoBarrier, DuraSSD, SSD-A, SSD-B, and HDD-15k rpm. One SSD series grows from 225 IOPS at 1 page per fsync to 15,319 with no fsync; the HDD series stays between 59 and 387]
Issue      | Existing Technique | Solution
-----------|--------------------|--------------------------------------------------
Latency    | Buffer pool        | Fast write with a write cache
Atomicity  | Redundant writes   | Single atomic write for small pages (4KB or 8KB)
Durability | Write barrier      | No WRITE_BARRIER in the command queue
TPS (Transactions Per Second) by Write Barrier / Double Write Buffer setting
(page size 16KB, buffer 10GB, DB 100GB, 128 clients):

Write Barrier/Double Write Buffer | TPS
ON/ON   | 1,346
ON/OFF  | 5,809
OFF/ON  | 10,034
OFF/OFF | 13,090

(chart annotations: 7X, 4X)
– Better read/write IOPS
– Better buffer-pool hit ratio
– vs. [SIGMOD09]: no write optimization, less effect of page-size tuning

[Plot: MySQL buffer hit ratio (LinkBench) rises from ~92% at a 2GB buffer pool to ~97% at 10GB]
LinkBench (OFF/OFF), TPS (Transactions Per Second) by page size:

Page Size | TPS
16KB | 13,090
8KB  | 22,253
4KB  | 29,974
LinkBench Transaction Latency (mean), OFF/OFF with 4KB vs. ON/ON with 16KB:

Operation     | OFF/OFF, 4KB (ms) | ON/ON, 16KB (ms)
Get Node      | 1.5  | 67
Cnt Link      | 1.2  | 45.5
Get Link_List | 1.4  | 65.3
Mltget Link   | 1.3  | 67.6
Add Node      | 8.9  | 51.6
Del Node      | 9.6  | 82.2
Upd Node      | 9.8  | 86.8
Add Link      | 11.2 | 214.9
Del Link      | 5.4  | 155.4
Upd Link      | 11.1 | 217.6

(the first four operations are reads, the rest are writes)
TPC-C (relational database; page size 8KB, buffer 2GB, DB 100GB):

Write Barrier | TpmC (Transactions per minute Count)
ON  | 4,845
OFF | 110,400
YCSB on Couchbase, OPS (Operations per second) by batch size:

Batch Size | Barrier ON | Barrier OFF
1   | 195   | 2,406
2   | 390   | 3,464
5   | 1,400 | 4,209
10  | 2,041 | 5,461
100 | 4,921 | 6,208
– Gap filler between durability latency and bandwidth
– Is the IOPS crisis solved? NVMe = excessive IOPS/GB?
– The 5-minute rule (Jim Gray)
– WAL logging in BigTable, MongoDB, Cassandra, Amazon Dynamo, Netflix Blitz4j, Yahoo WALNUT, Facebook, Twitter
– Two-phase commit: SAP HANA, Hekaton
– Eventual consistency: replication
[Diagram: at commit, the Log Buffer in MAIN MEMORY (volatile) is forced to the Redo Log File on DISK (non-volatile) in 512-byte blocks, including wastage from partially filled blocks]
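The fsync-batching numbers earlier suggest amortizing the log force across transactions. A common technique for this is group commit: committers enqueue their records and block, and a single flusher drains everything queued so far with one write + fsync. This is a minimal sketch of the general idea (class and field names are made up, and real engines use a dedicated flusher thread and timeouts).

```python
import os
import threading

class GroupCommitLog:
    """Group commit sketch: many committing transactions can share a
    single fsync, so forces per second stay low even at high TPS."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.lock = threading.Lock()
        self.pending = []        # (record, done_event) pairs awaiting a force
        self.flushes = 0         # how many fsyncs were actually issued

    def commit(self, record: bytes):
        done = threading.Event()
        with self.lock:
            self.pending.append((record, done))
        self._flush()            # whoever runs first drains the whole group
        done.wait()              # return only once our record is durable

    def _flush(self):
        with self.lock:
            batch, self.pending = self.pending, []
        if not batch:
            return               # another committer already forced our record
        os.write(self.fd, b"".join(r + b"\n" for r, _ in batch))
        os.fsync(self.fd)        # one force for the whole group
        self.flushes += 1
        for _, done in batch:
            done.set()
```

Under concurrency, several records land in one `batch`, so `flushes` can be far smaller than the number of commits; with one fsync costing ~2ms on an HDD, that batching is what keeps commit throughput above ~500 TPS.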
– 40 cores: 4 sockets, 10 cores/socket, 2GHz/core
– 32GB 1333MHz DDR3 DRAM
in SAP HANA", IEEE Data Engineering Bulletin, June 2013
[Diagram: two-phase commit between Coordinator and Participant. Phase 1: on Prepare, the participant performs local prepare work, writes a prepare record in its log (forced), and votes yes. Phase 2: the coordinator writes the commit record in its log (forced) and sends Commit; the participant does local commit work, writes a completion record in its log (lazy), and acks when durable; the coordinator also writes its completion record lazily. Both sides pass through the states Active → Prepared → Committing → Committed]
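The state machine and the forced-vs-lazy log writes in the diagram can be sketched in a few lines. This is a simplified single-process simulation of the protocol structure, not SAP HANA's or Hekaton's implementation; log writes are modeled as tagged tuples rather than real fsyncs.

```python
# Two-phase-commit sketch: "forced" marks a log write that must be
# durable before the protocol proceeds; "lazy" may be flushed later.

class Participant:
    def __init__(self):
        self.state, self.log = "Active", []

    def prepare(self):
        self.log.append(("prepare", "forced"))  # durable before voting yes
        self.state = "Prepared"
        return "yes"

    def commit(self):
        self.state = "Committing"
        # ... local commit work happens here ...
        self.log.append(("completion", "lazy")) # may be flushed later
        self.state = "Committed"

class Coordinator:
    def __init__(self, participants):
        self.participants, self.state, self.log = participants, "Active", []

    def run(self):
        votes = [p.prepare() for p in self.participants]   # phase 1
        if all(v == "yes" for v in votes):
            self.log.append(("commit", "forced"))          # the commit point
            self.state = "Committing"
            for p in self.participants:                    # phase 2
                p.commit()
            self.log.append(("completion", "lazy"))
            self.state = "Committed"
        return self.state
```

Only two log writes on the critical path are forced (the participant's prepare record and the coordinator's commit record), which is why 2PC latency is dominated by those two forces; with NVRAM-backed logs both approach zero.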
LNKD-SSD and LNKD-DISK demonstrate the importance of write latency in practice. Immediately after write commit, LNKD-SSD had a 97.4% probability of consistent reads, reaching over a 99.999% probability of consistent reads after 5 ms. LNKD-DISK had only a 43.9% probability of consistent reads and, 10 ms later, only a 92.5% probability. This suggests that SSDs may greatly improve consistency due to reduced write variance.