An Efficient Memory-Mapped Key-Value Store for Flash Storage
Anastasios Papagiannis, Giorgos Saloustros, Pilar González-Férez, and Angelos Bilas
Institute of Computer Science (ICS), Foundation for Research and Technology Hellas (FORTH)
Saving CPU Cycles In Data Access
Data grows exponentially
A Seagate report claims that data grows 2x every 2 years
Need to process more data with same number of servers
Cannot increase number of servers - power, energy limitations
Data access for data serving/analytics incurs high cost
Today, key-value stores are used broadly for data access
Social networks, data analytics, IoT
Key-value stores consume a lot of CPU cycles/operation - they are optimized for HDDs
Important to reduce CPU cycles in key-value stores
Dominant indexing methods
Inserts are important for key-value stores
Reads constitute the majority of operations
However, key-value stores need to handle bursty inserts of variable-size items
B-tree optimal for reads
Needs a single I/O per insert as the dataset grows
Main approach: Buffer writes in some manner
… and use a single I/O to the device for multiple inserts (see the sketch after this slide)
Examples: LSM-Tree, Bε-Tree, Fractal Tree
Most popular: LSM-Tree
Used by most key-value stores today
Great for HDDs - always performs large sequential I/Os
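To make the buffering idea concrete, here is a minimal, self-contained sketch (not Kreon's or any particular store's code; all names are illustrative): inserts accumulate in an in-memory buffer and reach the device with one large write instead of one I/O each.

```c
/* Minimal sketch of write buffering: inserts accumulate in an in-memory
 * buffer and reach the device with one large write() instead of one I/O
 * per insert. Hypothetical structures, not Kreon's actual code. */
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE (4 * 1024 * 1024)   /* flush granularity: 4 MB */

struct write_buffer {
    char    data[BUF_SIZE];
    size_t  used;
    int     dev_fd;                  /* device or file descriptor */
};

static void flush(struct write_buffer *b)
{
    /* One sequential I/O covers all buffered inserts. */
    write(b->dev_fd, b->data, b->used);
    b->used = 0;
}

void put(struct write_buffer *b, const void *kv, size_t len)
{
    if (b->used + len > BUF_SIZE)
        flush(b);                    /* amortize device I/O over many inserts */
    memcpy(b->data + b->used, kv, len);
    b->used += len;
}
```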
New Opportunities: From HDDs To Flash
In many applications, fast devices (SSDs) dominate
Take advantage of device characteristics to increase serving density in key-value stores
Serve the same amount of data with fewer cycles
High throughput even for random I/Os at high concurrency
SSD Performance For Various Request Sizes
User Space Caching Overhead
User-space cache: no system calls for hits, but explicit I/O for misses
Copies from user to kernel space during I/O
Hits incur overhead in the user-space index and data path on every traversal
Our Key Value Store: Kreon
In this paper we deal with two main sources of overhead
Aggressive data reorganization (compaction)
User-space caching
We increase I/O randomness to reduce CPU cycles
We use memory-mapped I/O instead of a user-space cache
Outline of this talk
Motivation
Discuss Kreon design and motivate decisions
Indexing data structure
DRAM caching and I/O to devices
Evaluation
Overall Efficiency – Throughput
I/O amplification
Efficiency breakdown
Tail latency
Kreon Persistent Index
Kreon introduces partial reorganization
Allows eliminating sorting [bLSM’12]
Key-value pairs stored in a log [Atlas’15, WiscKey’16, Tucana’16]
Index organized in unsorted levels, with a B-tree index per level (see the layout sketch after this slide)
Efficient merging – spill
Reads less data from 𝑀𝑗+1 compared to LSM
Inserts take place in buffered mode, as in LSM
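A hedged sketch of what such a layout could look like; the struct and field names (kv_log_entry, leaf_slot, key_prefix) are ours, not Kreon's: key-value pairs sit in an append-only log, and each level's B-tree leaves hold pointers (device offsets) into that log, so levels need not be sorted on the device.

```c
/* Hypothetical layout sketch (field names are ours, not Kreon's):
 * key-value pairs live in an append-only log; each level keeps a B-tree
 * whose leaves point into that log, so values are never re-sorted. */
#include <stdint.h>

struct kv_log_entry {            /* appended to the key-value log */
    uint32_t key_size;
    uint32_t value_size;
    char     payload[];          /* key bytes followed by value bytes */
};

struct leaf_slot {               /* one entry in a level's B-tree leaf */
    uint64_t key_prefix;         /* short prefix to avoid log accesses on most compares */
    uint64_t log_offset;         /* device offset of the kv_log_entry */
};

struct leaf_node {
    uint32_t         num_slots;
    struct leaf_slot slots[256]; /* illustrative fan-out */
};
```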
Compaction vs. Kreon spill
[Figure, three animation steps: an LSM compaction vs. a Kreon spill moving data between memory, Level(i), and Level(i+1); a minimal sketch of the spill follows.]
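A minimal sketch of the spill idea, under the same assumptions as the previous layout sketch: only index slots (pointers into the log) are merged from level i into level i+1; the key-value log itself is not rewritten, unlike an LSM compaction that rewrites whole sorted runs. The helper is ours and compares only key prefixes for brevity; a real comparison would fall back to the full keys in the log.

```c
/* Sketch of a spill: merge the B-tree slots of level i into level i+1 in
 * key order; the key-value log is untouched and level i can then be freed
 * in bulk. Names are ours, not Kreon's. */
#include <stddef.h>
#include <stdint.h>

struct slot { uint64_t key_prefix; uint64_t log_offset; };

/* dst holds level i+1's slots (dst_n of them) and must have room for
 * dst_n + src_n entries; src holds level i's slots. Returns the new count. */
static size_t spill(struct slot *dst, size_t dst_n,
                    const struct slot *src, size_t src_n)
{
    size_t i = dst_n, j = src_n, k = dst_n + src_n;

    while (j > 0) {                           /* classic backward merge */
        if (i > 0 && dst[i - 1].key_prefix > src[j - 1].key_prefix)
            dst[--k] = dst[--i];
        else
            dst[--k] = src[--j];
    }
    return dst_n + src_n;                     /* level i's extents can now be freed */
}
```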
Kreon Performs Adaptive Reorganization
With partial reorganization, repeated scans are expensive
With repeated scans, it is worth fully organizing the data
Kreon reorganizes data during scans
Based on a policy (currently threshold-based), as in the sketch below
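The talk only states that the policy is threshold-based, so the following is a hypothetical illustration with made-up names and an arbitrary threshold: a region that keeps being visited by scans is eventually reorganized in full.

```c
/* Hedged sketch of a threshold-based reorganization policy (the exact policy
 * is not specified in the talk; counters and threshold are illustrative). */
#include <stdbool.h>
#include <stdint.h>

#define SCAN_THRESHOLD 4   /* illustrative value */

struct region_stats {
    uint32_t scan_hits;    /* scans that touched this partially organized region */
    bool     sorted;       /* already fully organized? */
};

/* Called on every scan that touches the region; returns true when the
 * region should be fully organized as a side effect of the scan. */
static bool should_reorganize(struct region_stats *r)
{
    if (r->sorted)
        return false;
    return ++r->scan_hits >= SCAN_THRESHOLD;
}
```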
Reduce caching overheads with memory mapped I/O
Avoid the overhead of user-kernel data copies
Lower overhead for hits by using virtual memory mappings
Either served from the TLB or via a page-table walk
Eliminates serialization: common layout in memory and on storage
Using memory-mapped I/O has two implications
Requires a common allocator for memory and the device (see the sketch after this slide)
Linux kernel mmap introduces challenges
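A minimal sketch, with our own names (open_volume, volume_alloc, to_offset) and a toy bump allocator, of the common-allocator idea: structures are allocated inside a memory mapping of the device file, so the in-memory layout is also the on-device layout and an offset from the mapping base acts as a persistent pointer.

```c
/* Sketch (ours, not Kreon's API) of a common allocator over a memory-mapped
 * device file: allocations land inside the mapping, so no serialization is
 * needed between DRAM and storage. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static char    *base;           /* start of the mapping */
static uint64_t bump;           /* next free offset (toy bump allocator) */

int open_volume(const char *path, size_t size)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return base == MAP_FAILED ? -1 : 0;
}

/* Allocate space that is simultaneously DRAM-cached and device-backed. */
void *volume_alloc(size_t n)
{
    void *p = base + bump;
    bump += (n + 4095) & ~4095UL;   /* device-block (4 KB) granularity */
    return p;
}

/* A persistent pointer is just an offset, valid across restarts. */
uint64_t to_offset(const void *p) { return (const char *)p - base; }
void    *to_ptr(uint64_t off)     { return base + off; }
```

After a restart, re-mapping the same file makes every stored offset meaningful again, which is why no (de)serialization step is needed.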
Challenges of Common Data Layout
Small random reads: less overhead with mmap
Log writes are large – irrelevant
Index updates could cause 4K random writes to the device
Kreon generates large writes by using copy-on-write and extent allocation on the device (see the sketch after this slide)
Recovery with a common data layout
Requires ordering operations in memory and on the device
Kreon does this with CoW and sync
Extent allocation works well with a common data layout in key-value stores
Spills generate large frees for the index
Key-value stores usually experience group deletes
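A sketch of a copy-on-write index update built on the allocator sketched earlier (names are ours): the modified node is written into freshly allocated space within a large extent, so index updates reach the device as large sequential writes, and the old version remains intact so recovery always sees a consistent tree.

```c
/* Copy-on-write update of a child node, using the hypothetical common
 * allocator from the previous sketch. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

void    *volume_alloc(size_t n);      /* common allocator (see earlier sketch) */
uint64_t to_offset(const void *p);

struct node {
    uint64_t children[512];           /* child offsets etc.; exactly one 4 KB block */
};

/* Returns the offset of the new copy; the parent is itself CoW-updated,
 * up to the root, before the new tree version becomes visible. */
uint64_t cow_update_child(struct node *parent, int slot,
                          const struct node *old_child)
{
    struct node *copy = volume_alloc(sizeof *copy);
    memcpy(copy, old_child, sizeof *copy);      /* copy, then modify the copy */
    /* ... apply the insert/update to 'copy' here ... */
    parent->children[slot] = to_offset(copy);   /* old version stays for recovery */
    return to_offset(copy);
}
```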
mmap Challenges for Key Value Stores
Cannot pin 𝑀0 in memory
I/O amortization relies on 𝑀0 being in memory
Prioritize index nodes across levels and with respect to the log
Unnecessary read-modify-write operations from the device
Writes to newly allocated pages: no need to read them
Long pauses during user requests and high tail latency
mmap performs lazy memory cleaning and results in bursty I/O
Persistence requires msync, which uses coarse-grain locking
Kreon implements a custom mmap path (kmmap)
Introduces per-page priorities
Separate LRUs per priority: 𝑀0 (most significant), index, log
Detects accesses to new pages and eliminates the device fetch
Keeps a non-persistent bitmap with page status (free/allocated)
Bitmap updated by Kreon's allocator (see the sketch after this slide)
Improved tail latency:
kmmap bounds the memory used
Eager eviction policy
Higher concurrency in msync
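A user-level model, not the actual kernel-module code, of one plausible way kmmap could combine the allocation bitmap with per-priority LRUs: a fault on a newly allocated page skips the device read, and every faulted page is tracked in the LRU of its priority class so log pages are evicted before M0 and index pages. All helpers and the bitmap semantics are assumptions.

```c
/* User-level model of kmmap fault handling (illustrative, not the real code). */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum prio { PRIO_M0, PRIO_INDEX, PRIO_LOG, PRIO_LEVELS };

struct lru { uint64_t pages[1024]; unsigned n; };        /* toy LRU */

extern uint8_t    alloc_bitmap[];    /* assumption: bit set for pages allocated
                                        but not yet written back to the device */
extern struct lru lru[PRIO_LEVELS];  /* one LRU per priority class */

void read_page_from_device(uint64_t pfn, void *page);    /* assumed helpers */
void lru_insert(struct lru *l, uint64_t pfn);

static bool page_is_new(uint64_t pfn)
{
    return alloc_bitmap[pfn / 8] & (1u << (pfn % 8));
}

void handle_fault(uint64_t pfn, void *page, enum prio p)
{
    if (page_is_new(pfn))
        memset(page, 0, 4096);                 /* new page: skip the device read */
    else
        read_page_from_device(pfn, page);      /* cold page: fetch from flash */
    lru_insert(&lru[p], pfn);                  /* log pages are evicted before M0/index */
}
```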
Kreon increases concurrency during msync
msync orders writing and persisting pages by blocking
Opportunity in Kreon:
Due to CoW, the same page is never written/persisted concurrently
Kreon orders writes by using epochs
msync evicts all pages of the previous epoch
Newly modified pages belong to the new epoch
Epochs are possible in Kreon due to CoW (see the sketch after this slide)
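A user-level model, again not the real kmmap code, of epoch-based msync: pages record the epoch in which they were dirtied, msync advances the epoch and writes back only the previous epoch's pages, and new modifications proceed concurrently under the new epoch. This is safe because CoW ensures a page is never modified and persisted at the same time.

```c
/* Model of epoch-based msync (illustrative, per-page metadata assumed). */
#include <stdatomic.h>
#include <stdint.h>

#define NPAGES 1024

static _Atomic uint64_t current_epoch = 1;
static uint64_t page_epoch[NPAGES];      /* epoch in which each page was dirtied */
static int      page_dirty[NPAGES];

void write_page_to_device(uint64_t pfn); /* assumed helper */

void mark_dirty(uint64_t pfn)
{
    page_dirty[pfn] = 1;
    page_epoch[pfn] = atomic_load(&current_epoch);
}

void kreon_msync(void)
{
    /* New modifications from here on belong to the next epoch. */
    uint64_t prev = atomic_fetch_add(&current_epoch, 1);

    for (uint64_t pfn = 0; pfn < NPAGES; pfn++)
        if (page_dirty[pfn] && page_epoch[pfn] == prev) {
            write_page_to_device(pfn);   /* persist only the previous epoch */
            page_dirty[pfn] = 0;
        }
}
```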
kmmap Operation
[Figure, three animation steps: 𝑀0, 𝑀1, and log pages in DRAM and on the device, each tagged with an epoch identifier (𝑓𝑞1, 𝑓𝑞2); pages of the previous epoch are written to the device while newly modified pages receive the new epoch.]
Outline of this talk
Motivation
Discuss Kreon design and motivate decisions
Indexing data structure
DRAM caching and I/O to devices
Persistence and failure atomicity
Evaluation
Overall efficiency – throughput
I/O amplification
Tail latency
Efficiency breakdown
Experimental Setup
Compare Kreon with RocksDB version 5.6.1
Platform:
Two Intel Xeon E5-2630 with 256 GB DRAM in total
Six Samsung 850 PRO (256 GB) SSDs in a RAID-0 configuration
YCSB
Insert only, read only, and various mixes
We examine two cases
Dataset contains 100M records, resulting in a 120 GB dataset
Two configurations: small uses 192 GB of DRAM, large uses 16 GB
Overall Improvement over RocksDB
(a) Efficiency (cycles/op): Small up to 6x, average 2.7x; Large up to 8.3x, average 3.4x
(b) Throughput (ops/s): Small up to 5x, average 2.8x; Large up to 14x, average 4.7x
I/O amplification to devices
[Figure: I/O amplification to the devices (GB written and read) and average request size (KB) for writes and reads, RocksDB vs. Kreon; annotated improvements of 4x and 6x.]
Contribution of individual techniques
[Figure: breakdown of Kcycles/operation into index/spill and caching-I/O components for YCSB Load A and Run C, RocksDB vs. Kreon; per-component improvements between 2.4x and 6.3x.]
kmmap impact on tail latency
393x lower 99.99% tail latency than RocksDB
99x lower 99.99% tail latency than Kreon-mmap
[Figure: tail latency for YCSB Load A, latency (µs) per operation vs. percentile (50th to 99.99th), for RocksDB, Kreon-mmap, and Kreon.]
Conclusions
Kreon: An efficient key-value store in terms of cycles/op
Trades device randomness for CPU efficiency
CPU is the most important resource today
Main techniques
LSM with partially organized levels and a full index per level
DRAM caching via custom memory-mapped I/O (kmmap)
Up to 8.3x better efficiency compared to RocksDB
Both index and DRAM caching important
Questions?
Giorgos Saloustros
Institute of Computer Science, FORTH – Heraklion, Greece
E-mail: gesalous@ics.forth.gr
Web: http://www.ics.forth.gr/carv