Physical Separation in Modern Storage Systems
Lanyue Lu
Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Shan Lu, Michael Swift, Xinyu Zhang. University of Wisconsin–Madison
➡ how to organize data on disks and in memory ➡ impacts both reliability and performance
➡ store relevant data together ➡ locality is pursued in various storage systems ➡ file systems, key-value stores, databases ➡ better performance (caching and prefetching) ➡ high space utilization ➡ optimize for hard drives
➡ fast storage hardware (e.g., SSDs) ➡ servers with many cores and large memory ➡ sharing infrastructure is the reality ➡ virtualization, containers, data centers
➡ shared failures (e.g., one fault makes the whole file system read-only or crashes it)
➡ bundled performance (e.g., one application's fsync flushes other applications' data) ➡ lack flexibility to manage data differently
➡ rethink existing data layouts ➡ key: separate data structures ➡ apply in both file systems and key-value stores
➡ IceFS: disentangle structures and transactions ➡ isolated failures, faster recovery ➡ customized performance ➡ WiscKey: key-value separation ➡ minimize I/O amplification ➡ leverage devices’ internal parallelism
➡ the first comprehensive file-system study ➡ published in FAST ’13 (best paper award)
➡ localized failure, localized recovery ➡ specialized journaling performance ➡ published in OSDI ’14
➡ an SSD-conscious LSM-tree ➡ over 100x performance improvement ➡ submitted to FAST ’16
➡ File system Disentanglement ➡ The Ice File System ➡ Evaluation
➡ Key-value Separation Idea ➡ Challenges and Optimization ➡ Evaluation
➡ independent failures and recovery
➡ isolated performance
➡ computing: virtual machines, Linux containers ➡ security: BSD jail, sandbox ➡ cloud: multi-tenant systems
➡ manage user data ➡ long-standing and stable ➡ foundation for distributed file systems
➡ file, directory, namespace ➡ just an illusion of isolation
➡ entangled data structures and transactions
➡ e.g., multiple files share one inode block ➡ many shared structures: bitmap, directory block
➡ isolated failures for data containers ➡ up to 8x faster localized recovery ➡ up to 50x higher performance
➡ virtualized systems: reduce downtime by over 5x ➡ HDFS: improve recovery efficiency by over 7x
➡ physically disentangled on disk and in memory
➡ no shared metadata: e.g., block groups ➡ no shared disk blocks or buffers
➡ partition linked lists or trees ➡ avoid directory hierarchy dependency
➡ use separate transactions ➡ enable customized journaling modes
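A minimal sketch of what the three principles above imply for one container's state (all names here are illustrative, not IceFS's actual kernel structures): each cube owns its own sub-super block, block groups, lists, journaling mode, and failure state, so nothing is shared across cubes.

/* Hypothetical sketch of a disentangled "cube": every field below is private to
 * one container, so a fault or an fsync in one cube never touches another.
 * Names are illustrative, not IceFS's real kernel structures. */
#include <stdint.h>

#define GROUPS_PER_CUBE 128

struct cube_group {                /* an isolated block group owned by one cube */
    uint64_t block_bitmap;         /* this group's own block bitmap location    */
    uint64_t inode_bitmap;         /* this group's own inode bitmap location    */
    uint64_t inode_table;          /* this group's own inode table location     */
};

struct cube {
    uint32_t cube_id;
    uint64_t sub_super_block;                   /* per-cube sub-super block (Si)          */
    struct cube_group groups[GROUPS_PER_CUBE];  /* no block group shared with other cubes */
    struct cube *next_cube;                     /* per-cube list: no global lists or trees
                                                 * crossing container boundaries          */
    int journal_mode;                           /* per-cube mode: ordered, no journal,
                                                 * no fsync, ...                          */
    int failure_state;                          /* per-cube failure: healthy, read-only,
                                                 * or crashed                             */
};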
➡ File system Disentanglement ➡ The Ice File System ➡ Evaluation
➡ Key-value Separation Idea ➡ Challenges and Optimization ➡ Evaluation
➡ isolated reliability and performance for containers
➡ physical resource isolation ➡ directory indirection ➡ transaction splitting
➡ local file system: Ext3/JBD ➡ kernel: Linux ➡ user level tool: e2fsprogs
➡ physical partition for disk locality
➡ sub-super block (Si) and isolated block groups
[Diagram: transaction splitting — each cube commits its own transaction]
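A small user-space sketch of the transaction-splitting idea in the diagram above (hypothetical types, not ext3/JBD's real interfaces): each cube has its own running transaction, so committing one cube never flushes another cube's buffered updates.

/* Conceptual sketch of transaction splitting (hypothetical types, not JBD's API). */
#include <stdio.h>

struct cube_tx {
    int cube_id;
    int pending_updates;   /* stand-in for this cube's buffered journal records */
};

/* Commit only this cube's transaction; other cubes' updates stay buffered. */
static void tx_commit(struct cube_tx *t)
{
    printf("cube %d: commit tx (%d updates)\n", t->cube_id, t->pending_updates);
    t->pending_updates = 0;
}

int main(void)
{
    struct cube_tx db_cube   = { 1, 3 };   /* database container    */
    struct cube_tx mail_cube = { 2, 7 };   /* mail-server container */

    /* fsync from the database cube commits only its own transaction ... */
    tx_commit(&db_cube);
    /* ... while the mail cube's updates remain buffered (and that cube could even
     * use a different journaling mode, e.g., no journal or no fsync).            */
    printf("cube %d still has %d buffered updates\n",
           mail_cube.cube_id, mail_cube.pending_updates);
    return 0;
}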
➡ per-cube read-only and crash ➡ encourage more runtime checking
➡ only check faulty cubes ➡ offline and online
➡ concurrent and independent transactions ➡ diverse journal modes (e.g., no journal, no fsync)
➡ File system Disentanglement ➡ The Ice File System ➡ Evaluation
➡ Key-value Separation Idea ➡ Challenges and Optimization ➡ Evaluation
➡ inject around 200 faults ➡ per-cube failure (read-only or crash) in IceFS
[Chart: fsck time (s) vs. file-system capacity — Ext3: 231 / 476 / 723 / 1007 s and IceFS: 35 / 64 / 91 / 122 s at 200GB / 400GB / 600GB / 800GB]
➡ inject around 200 faults ➡ per-cube failure (read-only or crash) for IceFS
➡ independent recovery for a cube
➡ a database application ➡ sequentially write large key/value pairs ➡ asynchronous
➡ an email server workload ➡ randomly write small blocks ➡ fsync after each write
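An illustrative sketch of the two I/O patterns described above (not the actual benchmark code): the database-style workload streams large writes without forcing them to disk, while the mail-style workload issues small random writes and fsyncs after each one, which on a shared journal also flushes the other workload's dirty data.

/* Illustrative sketch of the two I/O patterns (not the actual benchmark code). */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void db_style_writes(int fd)              /* large, sequential, asynchronous */
{
    char buf[64 * 1024];
    memset(buf, 'a', sizeof(buf));
    for (int i = 0; i < 1024; i++)
        write(fd, buf, sizeof(buf));             /* no fsync: durability is deferred */
}

static void mail_style_writes(int fd, off_t file_size)  /* small, random, fsync each time */
{
    char buf[4096];
    memset(buf, 'b', sizeof(buf));
    for (int i = 0; i < 1024; i++) {
        off_t off = (rand() % (file_size / 4096)) * 4096;
        pwrite(fd, buf, sizeof(buf), off);
        fsync(fd);   /* forces a journal commit; on a shared journal this also
                      * drags the other workload's dirty data out with it      */
    }
}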
[Chart: throughput (MB/s) — SQLite: 146.7 / 76.1 / 120.6; Varmail: 20 / 1.9 / 9.8]
[Chart: throughput (MB/s) — SQLite: 120.6 / 220.3 / 125.4; Varmail: 9.8 / 5.6 / 103.4]
➡ inject around 200 faults ➡ per-cube failure (read-only or crash) for IceFS
➡ independent recovery for a cube
➡ isolated journaling performance ➡ flexibility between consistency and performance
[Chart: VM throughput (IOPS) over time (s), VM1/VM2/VM3 — fsck: 496s + bootup: 68s]
[Charts: VM throughput (IOPS) over time (s), VM1/VM2/VM3 — IceFS-Offline: fsck 35s + bootup 67s; IceFS-Online: fsck 74s + bootup 39s]
➡ inject around 200 faults ➡ per-cube failure (read-only or crash) for IceFS
➡ independent recovery for a cube
➡ isolated journaling performance for cubes ➡ flexibility between consistency and performance
➡ significantly reduce system downtime
➡ physical entanglement ➡ reliability and performance problems
➡ isolation is the key
➡ avoid entanglement ➡ provide useful abstractions for applications
➡ File system Disentanglement ➡ The Ice File System ➡ Evaluation
➡ Key-value Separation Idea ➡ Challenges and Optimization ➡ Evaluation
➡ web indexing, e-commerce, social networks ➡ local and distributed key-value stores ➡ hash tables, B-trees ➡ log-structured merge trees (LSM-trees)
➡ optimize for write intensive workloads ➡ advanced features: range query, snapshot ➡ widely deployed ➡ BigTable and LevelDB at Google ➡ HBase, Cassandra and RocksDB at Facebook
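A minimal conceptual sketch of the LSM-tree write path (illustrative, not LevelDB's real code): puts are logged and buffered in a memtable, and full memtables are flushed into on-disk tables that compaction later merges down the levels, which is the source of the write amplification discussed below.

/* Minimal conceptual sketch of an LSM-tree write path (not LevelDB's real code). */
#include <stdio.h>

#define MEMTABLE_CAP 4

struct entry { char key[16]; char value[64]; };

static struct entry memtable[MEMTABLE_CAP];   /* in-memory, sorted in a real LSM-tree */
static int memtable_size;

static void flush_memtable(void)
{
    /* In a real LSM-tree the memtable is written out as a level-0 table; compaction
     * then rewrites it into lower levels, so a key-value pair may be written many
     * times over its lifetime — hence write amplification. */
    printf("flush %d entries to a level-0 table\n", memtable_size);
    memtable_size = 0;
}

void lsm_put(const char *key, const char *value)
{
    /* 1. append (key, value) to a write-ahead log for crash recovery (omitted) */
    /* 2. buffer the pair in the memtable, flushing when it is full             */
    if (memtable_size == MEMTABLE_CAP)
        flush_memtable();
    snprintf(memtable[memtable_size].key,   sizeof(memtable[0].key),   "%s", key);
    snprintf(memtable[memtable_size].value, sizeof(memtable[0].value), "%s", value);
    memtable_size++;
}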
LSM-tree levels: L0 (8MB), L1 (10MB), L2 (100MB), ..., L6 (1TB)
[Chart: write and read amplification (log scale) — write: 14x, read: 327x]
➡ high write throughput ➡ sequential vs random bandwidth: can differ by up to 1000x
➡ large write/read amplification ➡ wastes device resources ➡ decreases the device's lifetime ➡ unique characteristics of SSDs ➡ fast random reads ➡ internal parallelism
➡ main idea: separate keys and values ➡ harness SSD's internal parallelism for range queries ➡ online and lightweight garbage collection ➡ minimize I/O amplification while remaining crash consistent
➡ 2.5x to 111x for loading, 1.6x to 14x for lookups ➡ both micro and macro benchmarks
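A minimal sketch of the key-value separation idea, with a tiny in-memory array standing in for the LSM-tree index (illustrative helpers, not WiscKey's actual code): values are appended to a separate value log (vLog), and the index stores only keys and value addresses.

/* Sketch of key-value separation (illustrative; a tiny array stands in for the
 * LSM-tree index): values live in an append-only value log (vLog), the index
 * maps key -> value address. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

struct vlog_addr { uint64_t offset; uint32_t size; };   /* where a value lives in the vLog */

struct index_entry { char key[32]; struct vlog_addr addr; };
static struct index_entry index_stub[1024];   /* stand-in for the real LSM-tree */
static int index_count;

static int vlog_fd;                 /* append-only value log file (opened elsewhere) */
static uint64_t vlog_head;          /* next append offset in the vLog */

void kv_put(const char *key, const char *value, uint32_t vsize)
{
    struct vlog_addr addr = { .offset = vlog_head, .size = vsize };
    pwrite(vlog_fd, value, vsize, (off_t)vlog_head);    /* value goes to the vLog     */
    vlog_head += vsize;
    snprintf(index_stub[index_count].key, sizeof(index_stub[0].key), "%s", key);
    index_stub[index_count++].addr = addr;              /* only key + address indexed */
}

int kv_get(const char *key, char *out)
{
    for (int i = index_count - 1; i >= 0; i--)           /* newest version wins */
        if (strcmp(index_stub[i].key, key) == 0)
            return (int)pread(vlog_fd, out, index_stub[i].addr.size,
                              (off_t)index_stub[i].addr.offset);   /* one random read on SSD */
    return -1;                                            /* key not found */
}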
➡ File system Disentanglement ➡ The Ice File System ➡ Evaluation
➡ Key-value Separation Idea ➡ Challenges and Optimization ➡ Evaluation
[Chart: SSD throughput (MB/s) vs. request size (1KB to 256KB) — Sequential, Rand-1thread, Rand-32threads]
➡ sequential, random, parallel
➡ sequential reads in LevelDB ➡ read keys and values separately in WiscKey
➡ leverage parallel random reads of SSDs ➡ prefetch key-value pairs in advance ➡ range query interface: seek(), next(), prev() ➡ detect a sequential pattern ➡ prefetch concurrently in background
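A rough sketch of parallel value prefetching for range queries (hypothetical structures; WiscKey's real implementation keeps a persistent background thread pool): once a sequential pattern is detected, the value addresses of upcoming keys are read from the vLog concurrently to exploit the SSD's internal parallelism.

/* Rough sketch of parallel value prefetching (hypothetical structures). */
#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

#define PREFETCH_DEPTH 32

struct fetch_job {
    int vlog_fd;
    uint64_t offset;     /* value address taken from the LSM-tree iterator */
    uint32_t size;
    char *buf;           /* filled by the worker, later returned by next() */
};

static void *fetch_value(void *arg)
{
    struct fetch_job *job = arg;
    pread(job->vlog_fd, job->buf, job->size, (off_t)job->offset);  /* random read on the SSD */
    return NULL;
}

/* Once a sequential pattern is detected, fetch the next batch of values concurrently. */
void prefetch_values(struct fetch_job jobs[PREFETCH_DEPTH])
{
    pthread_t workers[PREFETCH_DEPTH];
    for (int i = 0; i < PREFETCH_DEPTH; i++)
        pthread_create(&workers[i], NULL, fetch_value, &jobs[i]);
    for (int i = 0; i < PREFETCH_DEPTH; i++)
        pthread_join(workers[i], NULL);   /* a real design would reuse a persistent thread pool */
}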
➡ append (ksize, vsize, key, value) in value log ➡ tail and head pointers for the valid range ➡ tail and head are stored in LSM-tree
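A sketch of the vLog record format and append path (illustrative, not WiscKey's exact on-disk layout): each record is (ksize, vsize, key, value) appended at the head; garbage collection later reads records from the tail, re-appends the ones the LSM-tree still points to, and advances the tail.

/* Sketch of the vLog record format and append path (illustrative layout). */
#include <stdint.h>
#include <unistd.h>

struct vlog {
    int fd;
    uint64_t tail;   /* oldest possibly-valid record; garbage collection scans from here */
    uint64_t head;   /* next append offset; new and re-appended live records go here     */
};

/* Append one (ksize, vsize, key, value) record and return its address, which is
 * stored next to the key in the LSM-tree. */
uint64_t vlog_append(struct vlog *log, const char *key, uint32_t ksize,
                     const char *value, uint32_t vsize)
{
    uint64_t addr = log->head;
    uint32_t hdr[2] = { ksize, vsize };
    pwrite(log->fd, hdr, sizeof(hdr), (off_t)log->head);
    pwrite(log->fd, key, ksize, (off_t)(log->head + sizeof(hdr)));
    pwrite(log->fd, value, vsize, (off_t)(log->head + sizeof(hdr) + ksize));
    log->head += sizeof(hdr) + ksize + vsize;
    return addr;
}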
➡ used for recovery in case of a crash ➡ performance overhead for small kv pairs
➡ store head in LSM-tree periodically ➡ scan the value log from the head to recover
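A small sketch of the recovery path described above (hypothetical helper; re-insertion into the LSM-tree is elided): since the head offset is persisted in the LSM-tree periodically, only the suffix of the vLog written after that point needs to be scanned after a crash.

/* Recovery sketch: scan the vLog from the last head persisted in the LSM-tree
 * to the end of the file and re-insert the keys found there (the LSM-tree
 * insertion itself is elided). */
#include <stdint.h>
#include <sys/stat.h>
#include <unistd.h>

void vlog_recover(int vlog_fd, uint64_t saved_head)
{
    struct stat st;
    fstat(vlog_fd, &st);                          /* actual end of the log after the crash */
    uint64_t pos = saved_head;
    while (pos < (uint64_t)st.st_size) {
        uint32_t hdr[2];                          /* (ksize, vsize) */
        if (pread(vlog_fd, hdr, sizeof(hdr), (off_t)pos) != sizeof(hdr))
            break;                                /* torn record: stop at the last good one */
        /* read the key at pos + sizeof(hdr) and re-insert <key, pos> into the LSM-tree */
        pos += sizeof(hdr) + hdr[0] + hdr[1];
    }
}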
➡ a separate vLog file for values ➡ modify I/O paths to separate keys and values ➡ straightforward to implement
➡ a background thread pool ➡ detect sequential pattern with the Iterator interface
➡ fadvise to predeclare access patterns ➡ hole-punching to free space
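The two file-system hooks mentioned above are real Linux calls; how a vLog would use them is sketched here as an assumption: posix_fadvise() predeclares the expected access pattern, and fallocate() with FALLOC_FL_PUNCH_HOLE frees a garbage-collected range without rewriting or shrinking the file.

/* Sketch only: real Linux calls, assumed usage. */
#define _GNU_SOURCE
#include <fcntl.h>

/* Hint that the vLog will mostly be read at random offsets (point lookups). */
void vlog_declare_pattern(int vlog_fd)
{
    posix_fadvise(vlog_fd, 0, 0, POSIX_FADV_RANDOM);
}

/* After garbage collection has moved live records past new_tail, release the
 * dead prefix without shrinking or rewriting the file. */
int vlog_free_space(int vlog_fd, off_t old_tail, off_t new_tail)
{
    return fallocate(vlog_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     old_tail, new_tail - old_tail);
}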
➡ File system Disentanglement ➡ The Ice File System ➡ Evaluation
➡ Key-value Separation Idea ➡ Challenges and Optimization ➡ Evaluation
➡ 16 cores (3.3 GHz), 64 GB memory ➡ Samsung 840 EVO SSD (500 GB) ➡ maximum sequential read: 500 MB/s ➡ maximum sequential write: 400 MB/s
➡ micro benchmarks (db_bench) ➡ macro benchmarks (YCSB)
[Charts: load throughput (MB/s), key: 16B, value: 64B to 256KB — LevelDB vs. WiscKey]
➡ low write and read amplification ➡ load (2.5x to 111x), lookup (1.6x to 14x)
[Chart: query throughput (MB/s), key: 16B, value: 64B to 256KB — LevelDB-Rand, WiscKey-Rand, LevelDB-Seq, WiscKey-Seq]
➡ low write and read amplification ➡ load (2.5x to 111x), lookup (1.6x to 14x)
➡ limited by random read performance ➡ sorting helps
[Chart: YCSB normalized performance (log scale), key: 16B, value: 1KB, workloads Load and A–F — LevelDB, RocksDB, WiscKey-GC, WiscKey]
➡ low write and read amplification ➡ load (2.5x to 111x), lookup (1.6x to 14x)
➡ limited by random read performance ➡ sorting helps
➡ faster on all workloads ➡ performance similar to micro benchmarks
➡ leverage existing software ➡ explore new ways to utilize the new hardware ➡ get the best of both worlds
➡ File system Disentanglement ➡ The Ice File System ➡ Evaluation
➡ Key-value Separation Idea ➡ Challenges and Optimization ➡ Evaluation
➡ improve both reliability and performance by over 10x ➡ better reliability: isolated failures, localized recovery ➡ better performance: specialized journaling, minimized I/O amplification
➡ virtualized, shared and fast ➡ physical separation is the key ➡ IceFS and WiscKey are just the beginning