  1. LSM-trie: An LSM-tree-based Ultra-Large Key-Value Store for Small Data, by Xingbo Wu, Yuehai Xu, Zili Shao, and Song Jiang. Presented by Daniel Herring.

  2. Design Goals and Assumptions
  • Goals
  ▫ Efficient KV store
  ▫ Inserts must support high throughput
  ▫ Lookups must be fast, so the store must be kept sorted
  • Design Decisions
  ▫ Write amplification must be minimized to support write-heavy workloads
  ▫ Utilize the fixed exponential growth pattern of an LSM-tree
  ▫ Use a prefix tree (trie) to organize data in the store
  ▫ Convert variable-length keys to fixed-size keys via a hash function
  ▫ Range searching is not required

  3. Overall Architecture
  [Diagram: Insert(K, V) becomes Insert(FSK, [K, V]), where FSK = hash(K) is a fixed-size key. New items go into a MemTable; a minor compaction flushes the immutable data into Level 0 tables (Table 0.0 … Table 0.N). Major compactions move data into the piles of Level 1 (Pile 0 … Pile i, holding Table 1.00 … Table 1.Ni), where N is the number of entries in a level and i is the pile number within an entry. Tables support different structures based on data size. A short sketch of the insert path follows.]
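The flow in this diagram can be made concrete with a small Python sketch. This is a minimal illustration, assuming a SHA-1 hash as the fixed-size key and an arbitrary flush threshold; the class and method names (`MemTable`, `to_immutable_table`) are mine, not the paper's.

```python
import hashlib

class MemTable:
    """In-memory buffer: Insert(K, V) is stored as Insert(FSK, [K, V])."""

    def __init__(self, flush_threshold=4096):
        self.items = {}                        # FSK -> (original key, value)
        self.flush_threshold = flush_threshold

    @staticmethod
    def fsk(key: bytes) -> bytes:
        """Hash a variable-length key into a fixed-size key (FSK)."""
        return hashlib.sha1(key).digest()      # 160-bit FSK assumed for this sketch

    def insert(self, key: bytes, value: bytes) -> bool:
        self.items[self.fsk(key)] = (key, value)
        return len(self.items) >= self.flush_threshold   # caller triggers a minor compaction

    def to_immutable_table(self):
        """Minor compaction: freeze the buffer into an immutable, FSK-sorted Level-0 table."""
        table = sorted(self.items.items())
        self.items = {}
        return table
```

A real minor compaction would write the frozen table out as a Level-0 HTable or SSTable-trie on disk; this only shows the in-memory hand-off.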

  4. Using a prefix tree
  • The FSK allows bit masking to index the data store; this is similar to hardware page table trees
  ▫ FSK length governs the maximum store size
  ▫ The number of piles is controlled by the bit mask size (sketched below)
    - Level N has at most 2^(M*N) piles, where M = bit mask size
    - e.g. Level 4 with bit mask size 3 has 2^(4*3) = 4096 max piles
  [Diagram: trie levels 0 through 3]
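A small sketch of the bit masking described above, assuming a 64-bit FSK and a 3-bit mask per level as in the slide's example (the paper's actual key width differs).

```python
def pile_index(fsk: int, level: int, mask_bits: int = 3, fsk_bits: int = 64) -> int:
    """
    Pile a key belongs to at `level`: the top `mask_bits * level` bits of the
    fixed-size key form the trie prefix, so Level N has at most 2**(mask_bits*N) piles.
    """
    used = mask_bits * level
    if used == 0:
        return 0                                  # Level 0 has a single root container
    return fsk >> (fsk_bits - used)

# Level 4 with a 3-bit mask: 2**(4*3) = 4096 possible piles.
key = 0b101_110_001_011 << 52                     # a 64-bit FSK whose top 12 bits are set
assert pile_index(key, level=4) == 0b101_110_001_011
```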

  5. Why Write Amplification Matters
  • Scalability
  ▫ SILT has a variable write amplification (worked out below):
    - Max sorted store size 100M entries, max entries to merge 7.5M: WA = 2.075 + 100M/7.5M ≈ 15.4
    - Max sorted store size 1B entries, max entries to merge 10M: WA = 2.075 + 1B/10M ≈ 102.1
  ▫ LSM-trie has a fixed write amplification: WA ≈ 5
  (1) "In the meantime, for some KV stores, such as SILT [24], major efforts are made to optimize reads by minimizing metadata size, while write performance can be compromised without conducting multi-level incremental compactions." Explain how high write amplifications are produced in SILT.
  SILT data taken from Section 5 of [24]: Lim, H., Fan, B., Andersen, D. G., and Kaminsky, M. SILT: A memory-efficient, high-performance key-value store.
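The arithmetic above, written out. The 2.075 base and the store/merge sizes come straight from the slide; treating write amplification as that base plus the store-to-merge-batch ratio is my reading of the slide's formula.

```python
def silt_write_amplification(sorted_store_entries: float,
                             merge_batch_entries: float,
                             base_wa: float = 2.075) -> float:
    """WA grows with the ratio of sorted-store size to the batch merged into it."""
    return base_wa + sorted_store_entries / merge_batch_entries

print(round(silt_write_amplification(100e6, 7.5e6), 1))   # ~15.4
print(round(silt_write_amplification(1e9, 10e6), 1))      # ~102.1
# LSM-trie's write amplification stays at roughly 5 regardless of store size.
```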

  6. Measured write amplifications. SILT data taken from Section 5 of [24]: Lim, H., Fan, B., Andersen, D. G., and Kaminsky, M. SILT: A memory-efficient, high-performance key-value store.

  7. Minimizing Write Amplification
  • Pile architecture
  ▫ Sorting happens at the next level's major compaction
  ▫ Only full piles need to be compacted to the next level
  ▫ Piles can contain non-full HTables
  • Bit masking forces child containers to be non-overlapping
  ▫ Sorting does not affect other containers in the child level
  ▫ Key 001 110 does not have to affect containers and tables holding key 001 111
  • Sorting a pile takes a fixed maximum time
  ▫ A pile only ever has N tables in it
  ▫ The sort at each level can discard the upper bits already consumed by the trie prefix (see the sketch after this slide)
    - A 64-bit key at level 4 only has to compare 52 bits to sort keys
    - A 256-bit key at level 75 only has to compare 31 bits
  (5) Use Figures 2 and 3 to describe the LSM-trie's structure and how compaction is performed in the trie.
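A sketch of the "discard upper bits" point: inside a level-k container every key shares the same mask_bits*k-bit prefix, so sorting only needs the remaining suffix. The key widths and mask size follow the slide's two examples; the helper names are made up.

```python
def suffix_bits(key_bits: int, level: int, mask_bits: int = 3) -> int:
    """Bits still worth comparing when sorting keys inside a level-`level` container."""
    return key_bits - mask_bits * level

assert suffix_bits(64, 4) == 52     # 64-bit key at level 4
assert suffix_bits(256, 75) == 31   # 256-bit key at level 75

def sort_container(fsks, key_bits, level, mask_bits=3):
    """Sort one container's keys by their low-order suffix; the shared prefix is ignored."""
    mask = (1 << suffix_bits(key_bits, level, mask_bits)) - 1
    return sorted(fsks, key=lambda k: k & mask)
```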

  8. Growth Patterns
  • LevelDB
  ▫ Exponential growth: each level is 10x larger than the previous, with space for 10x more data
  • LSM-trie
  ▫ Combined linear and exponential growth: levels are 8x larger than the previous but have space for 64x the data (compared below)
  (3) Use Figure 1 to explain the difference between linear and exponential growth patterns.
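A sketch of the two growth patterns as I read the slide: LevelDB grows each level by 10x, while LSM-trie grows table size by 8x per level and fills each level linearly with 8 sub-levels, giving 8*8 = 64x the space. The base unit is an arbitrary placeholder.

```python
def leveldb_capacity(level: int, base: float = 1.0) -> float:
    """Pure exponential growth: each level holds 10x the previous one."""
    return base * 10 ** level

def lsm_trie_capacity(level: int, base: float = 1.0,
                      growth: int = 8, sublevels: int = 8) -> float:
    """Combined growth: tables are 8x larger per level and a level fills 8 sub-levels linearly."""
    return base * growth ** level * sublevels

for lvl in range(5):
    print(lvl, leveldb_capacity(lvl), lsm_trie_capacity(lvl))
```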

  9. Trade-offs vs LevelDB
  • Gains (Pros)
  ▫ Minimizes insert time
  ▫ Minimizes the time spent re-sorting a level
  ▫ Able to effectively partition over multiple physical stores
  ▫ Able to offload compaction work of the lower levels
  ▫ Fast key lookup: only one pile at each level can contain a given key
  • Losses (Cons)
  ▫ Increased read time due to pile searches
  ▫ Unable to do range searching
  ▫ Uses different data structures for small vs. KB-scale data: HTable for small data, SSTable-trie for large data
  (2) "Note that LSM-trie uses hash functions to organize its data and accordingly does not support range search." Does LevelDB support range search?

  10. SSTable-trie
  • The sorted string table variant for LSM-trie
  ▫ Maintains a Bloom filter for fast exclusion lookups
  ▫ Used only for larger data items (KB-scale and above)
  (4) "Among all compactions moving data from Lk to Lk+1, we must make sure their key ranges are not overlapped to keep any two SSTables at Level Lk+1 from having overlapped key ranges. However, this cannot be achieved with the LevelDB data organization …" Please explain why LevelDB cannot achieve this.

  11. HTables
  • A hash-indexed data store (a minimal sketch follows)
  ▫ Uses the lower-order bits of the FSK to determine which hash bucket the data goes in
  ▫ Maintains a Bloom filter for fast exclusion lookups
  ▫ Optimized for small data (below KB-scale)
  (8) What's the difference between an SSTable in LevelDB and an HTable in LSM-trie?
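A minimal HTable sketch, assuming a power-of-two bucket count so the low-order FSK bits can be masked directly; the structure and names are illustrative, not the paper's on-disk layout.

```python
class HTable:
    """Hash-indexed table: low-order bits of the FSK select the bucket."""

    def __init__(self, num_buckets: int = 4096):
        assert num_buckets & (num_buckets - 1) == 0, "need a power of two for bit masking"
        self.num_buckets = num_buckets
        self.buckets = [[] for _ in range(num_buckets)]

    def bucket_of(self, fsk: int) -> int:
        return fsk & (self.num_buckets - 1)        # lower-order bits of the fixed-size key

    def put(self, fsk: int, key: bytes, value: bytes) -> None:
        self.buckets[self.bucket_of(fsk)].append((fsk, key, value))

    def get(self, fsk: int):
        for f, _k, v in self.buckets[self.bucket_of(fsk)]:
            if f == fsk:
                return v
        return None                                # a Bloom filter would avoid most misses
```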

  12. HTable (cont.)
  • During compaction
  ▫ HTables may balance buckets and record the relocation information in the HTable metadata (a rough sketch follows)
  ▫ Each HTable is limited to a 95% fill to leave room for balancing randomly sized items
  • Bloom filters
  ▫ Size is 16 bits per item
  ▫ Sized to minimize the false-positive rate
  ▫ Filters are stored in the HTable
  (9) "However, a challenging issue is whether the buckets can be load balanced in terms of aggregate size of KV items hashed into them." Why may the buckets in an HTable be load-unbalanced? How can the problem be corrected?
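A rough sketch of the bucket-balancing idea: items are migrated out of over-full buckets into lighter ones, and each move is recorded so lookups can follow the relocation. The greedy, item-granularity policy below is my illustration; the paper balances buckets during compaction and caps them at 95% fill.

```python
def balance_buckets(buckets, capacity):
    """
    Move items out of buckets whose aggregate value size exceeds `capacity`
    into the currently lightest bucket, returning (fsk, from, to) tuples that
    would live in the HTable's metadata as relocation records.
    """
    sizes = [sum(len(v) for _, _, v in b) for b in buckets]
    relocations = []
    for src, bucket in enumerate(buckets):
        while sizes[src] > capacity and bucket:
            dst = min(range(len(buckets)), key=lambda i: sizes[i])
            if dst == src:                 # every bucket is over capacity; nothing to do
                break
            fsk, key, value = bucket.pop()
            buckets[dst].append((fsk, key, value))
            sizes[src] -= len(value)
            sizes[dst] += len(value)
            relocations.append((fsk, src, dst))
    return relocations
```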

  13. Bloom Filter Sizing
  • Memory usage
  ▫ Bloom filters use the majority of memory (the sizing math is sketched below)
    - A store of 32 sub-levels with an average 64 B item size has 4.5 GB of 16-bit-per-item Bloom filters
  ▫ Relocation records also use memory
    - The same store uses an estimated 0.5 GB of memory for relocation records
  (7) "Therefore, the Bloom filter must be beefed up by using more bits." Use an example to show why the Bloom filters have to be longer.
  (6) "The indices and Bloom filters in a KV store can grow very large." Use an example to show that this metadata in LevelDB may have to be kept out of core.
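The sizing math behind "16 bits per item" is standard Bloom filter analysis rather than anything specific to the paper: with m/n bits per item and the optimal k = (m/n) ln 2 hash functions, the false-positive rate is roughly 0.5^k.

```python
import math

def bloom_false_positive_rate(bits_per_item: float) -> float:
    """Approximate FPR assuming the optimal number of hash functions."""
    k = bits_per_item * math.log(2)        # optimal k = (m/n) * ln 2
    return 0.5 ** k

def bloom_filter_bytes(num_items: int, bits_per_item: int = 16) -> float:
    return num_items * bits_per_item / 8

print(f"{bloom_false_positive_rate(16):.4%}")                  # ~0.046% per filter
print(bloom_filter_bytes(10**9) / 2**30, "GiB per 1B items")   # ~1.86 GiB at 16 bits/item
```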

  14. Performance Comparisons

  15. One step past this paper

  16. Key points
  • Keys to an effective distributed application:
  1. Each system works on non-overlapping subsets of the problem
  2. The same function runs on each processor
  3. Reading/writing data is not a bottleneck
  • LSM-trie
  ▫ Has non-overlapping data subsets
  ▫ Data location is not that important, provided:
    - Low-latency access to data
    - High-throughput reading/writing of data

  17. Distributed LSM-trie Architecture
  [Diagram: a client talks to a KV store front end (N = 0). A Level N thread holds a MemTable and the L0 piles; compaction distributes data to the next level's per-prefix KV store servers (KV Store L1.000, L1.001, …, L1.111), each running a Level N+1 thread with its own MemTable (e.g. L1.00) and piles.]

  18. Use case: Get(K)
  • Read-value communications (a toy sketch follows)
  ▫ When Level 0 receives a request, it sends a message to all level servers with the key, asking whether they have it (HaveKey(k)?)
  ▫ Servers that have the data respond with their server id and server level
  ▫ The Level 0 system then determines which level is the newest (if the key is not in its own data store) and requests the value from the appropriate server
  ▫ The expected lookup time is close to O(1) when the number of server processes is allowed to grow as O(log(n))
  [Diagram labels: input tables, MemTable, compaction, HaveKey(k)?, piles, sort piles to the next layer's KV store servers]
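A toy sketch of the read flow on this slide. The server interface (`have_key`, `fetch`) is hypothetical; the slide only describes the messages at a high level.

```python
def distributed_get(fsk, local_store, level_servers):
    """
    level_servers: objects exposing have_key(fsk) -> (found, level) and fetch(fsk) -> value.
    The front end checks its own store first, then broadcasts HaveKey(k)? to every level.
    """
    if fsk in local_store:                           # the newest copy, if present locally
        return local_store[fsk]
    hits = []
    for server in level_servers:                     # broadcast HaveKey(k)?
        found, level = server.have_key(fsk)
        if found:
            hits.append((level, server))
    if not hits:
        return None
    _, newest = min(hits, key=lambda hit: hit[0])    # shallowest level holds the newest value
    return newest.fetch(fsk)
```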

  19. Use case: Put(K, V)
  • Write-value communications (a matching sketch follows)
  ▫ When Level 0 receives a request, it stores the data in its MemTable
  ▫ When the MemTable is full, it is converted to an immutable HTable or SSTable-trie and put in a local pile
  ▫ When a local pile fills up, the piles are sorted and the data is sent to lower-level servers based on the trie partitioning
  ▫ Lower-level servers store the data into their MemTables and make them immutable when appropriate, continuing down the tree
  ▫ The expected insert time is O(1) when the number of server processes is allowed to grow as O(log(n))
  [Same flow diagram as the Get(K) slide.]
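A matching sketch for the write path; the class shape (`memtable`, `local_pile`, `children`) and the 3-bit prefix routing are assumptions made to keep the flow concrete.

```python
class LevelServer:
    """One level's server process in a sketched distributed LSM-trie."""

    def __init__(self, memtable_limit, pile_limit, children=None):
        self.memtable, self.local_pile = {}, []
        self.memtable_limit, self.pile_limit = memtable_limit, pile_limit
        self.children = children or {}                    # next-level servers keyed by trie prefix

    def put(self, fsk: int, value: bytes) -> None:
        self.memtable[fsk] = value
        if len(self.memtable) >= self.memtable_limit:
            self.local_pile.append(dict(self.memtable))   # freeze into an immutable table
            self.memtable.clear()
        if len(self.local_pile) >= self.pile_limit:
            self._compact_to_children()

    def _compact_to_children(self) -> None:
        """Sort the pile and route each item to the child that owns its trie prefix."""
        if not self.children:                             # leaf level: keep the data here
            return
        for table in self.local_pile:
            for fsk, value in sorted(table.items()):
                prefix = fsk >> 61                        # top 3 bits of a 64-bit FSK (level-0 routing)
                self.children[prefix].put(fsk, value)
        self.local_pile.clear()
```

Deeper levels would route on the next 3-bit slice of the key rather than the top bits, mirroring the trie prefixes from the earlier slides.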
