SLIDE 1

LSM-trie

An LSM-tree-based Ultra-Large Key-Value Store for Small Data
by: Xingbo Wu, Yuehai Xu, Zili Shao, and Song Jiang

Daniel Herring

SLIDE 2

Design Goals and Assumptions

  • Goals

▫ Efficient KV store
▫ Inserts must support high throughput
▫ Lookups must be fast, so the store must be sorted

  • Design Decisions

▫ Write amplification must be minimized to support high write workloads
▫ Utilize the exponential fixed growth pattern of an LSM-tree
▫ Use a prefix tree to organize data in the store
▫ Convert variable-length keys to fixed-size keys via a hash function
▫ Range searching is not required

SLIDE 3

Overall Architecture

[Architecture diagram: Insert(K, V) computes FSK = hash(K) and issues Insert(FSK, [K, V]) into the MemTable; immutable tables Table 0.0 … Table 0.N form Level 0; minor compaction moves them into Level 1 piles (Table 1.00, Table 1.01, … Table 1.0i, through Table 1.N0 … Table 1.Ni); major compaction moves data on to the next level. N = number of entries in a level, i = pile number within an entry.]

FSK: Fixed Size Key

*Supports using different table structures based on data size (Level 1 shown divided into Pile 0 … Pile i)
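
A toy sketch of the insert path shown above, in Python. The class name, the MEMTABLE_LIMIT threshold, and the use of SHA-1 for the FSK are illustrative assumptions, not the paper's API:

    import hashlib

    class LSMTrieStore:
        """Sketch of the write path: Insert(K, V) -> Insert(FSK, [K, V])."""

        MEMTABLE_LIMIT = 4096            # assumed flush threshold

        def __init__(self):
            self.memtable = {}           # mutable, in memory
            self.level0_tables = []      # immutable tables awaiting minor compaction

        @staticmethod
        def fsk(key: bytes) -> bytes:
            """Fixed-size key: a hash of the variable-length user key."""
            return hashlib.sha1(key).digest()

        def insert(self, key: bytes, value: bytes):
            self.memtable[self.fsk(key)] = (key, value)
            if len(self.memtable) >= self.MEMTABLE_LIMIT:
                # Freeze the MemTable into an immutable Level 0 table;
                # minor compaction later moves these tables into Level 1 piles.
                self.level0_tables.append(dict(self.memtable))
                self.memtable.clear()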

SLIDE 4

Using a prefix tree

  • FSK allows bit masking to index data store

 This is similar to hardware page table trees

▫ FSK length governs the maximum store size
▫ No. of piles is controlled by the bit mask size

 Level N has 2^(M*N) maximum piles, M = bit mask size

 e.g. Level 4 with bit mask size 3 has 2^(4*3) = 2^12 = 4096 max piles

[Diagram: prefix-tree levels 0 through 3]
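
A minimal sketch of how the FSK prefix selects a pile at each level. The 3-bit mask and the SHA-1 digest width are assumptions used only for illustration:

    import hashlib

    BITS_PER_LEVEL = 3    # M: bit-mask size, so up to 2^3 = 8 children per node
    FSK_BITS = 160        # width of the fixed-size key (SHA-1 here)

    def fsk(key: bytes) -> int:
        """Fixed-size key as an integer, so the top bits can be sliced off."""
        return int.from_bytes(hashlib.sha1(key).digest(), "big")

    def pile_index(fsk_value: int, level: int) -> int:
        """Pile that may hold this key at a given level: the top M*level bits.
        Level N thus has at most 2^(M*N) piles; level 4 gives 2^12 = 4096."""
        used_bits = BITS_PER_LEVEL * level
        return fsk_value >> (FSK_BITS - used_bits)

    print(pile_index(fsk(b"example-key"), 4))   # some index in [0, 4096)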

SLIDE 5

Why Write Amplification matters

  • Scalability

▫ SILT variable write amplification:

 Max sorted store size: 100M entries
  ▫ Max entries to merge: 7.5M
  ▫ WA = 2.075 + 100M/7.5M ≈ 15.4

 Max sorted store size: 1B entries
  ▫ Max entries to merge: 10M
  ▫ WA = 2.075 + 1B/10M ≈ 102.1

▫ LSM-trie fixed write amplification: WA ≈ 5
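
A quick sanity check of these figures using the slide's own formula (write amplification = 2.075 + sorted-store size / merge batch size); the function name is just illustrative:

    def silt_write_amplification(sorted_entries: float, merge_entries: float,
                                 base: float = 2.075) -> float:
        """Fixed base cost plus the ratio of sorted-store size to merge batch size."""
        return base + sorted_entries / merge_entries

    print(silt_write_amplification(100e6, 7.5e6))   # ~ 15.4
    print(silt_write_amplification(1e9, 10e6))      # ~ 102.1

The ratio term grows with the store, which is why SILT's write amplification is variable while LSM-trie's stays roughly constant.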

(1) “In the meantime, for some KV stores, such as SILT [24], major efforts are made to optimize reads by minimizing metadata size, while write performance can be compromised without conducting multi-level incremental compactions.” Explain how high write amplification is produced in SILT.

SILT data taken from Section 5 of [24]: LIM, H., FAN, B., ANDERSEN, D. G., AND KAMINSKY, M. SILT: A memory-efficient, high-performance key-value store.

SLIDE 6

Measured write amplifications

SILT data taken from Section 5 of [24]: LIM, H., FAN, B., ANDERSEN, D. G., AND KAMINSKY, M. SILT: A memory-efficient, high-performance key-value store.

SLIDE 7

Minimizing Write Amplification

  • Pile architecture

▫ Sorting happens at the next level's major compaction
▫ Only full piles need to be compacted to the next level
▫ Piles can contain non-full HTables

  • Bit masking forces child containers to be non-overlapping

▫ Sorting does not affect other containers in the child level
▫ Key 001 110 does not have to affect containers and tables holding key 001 111

  • Sorting a pile takes a fixed maximum time

▫ A pile only ever has N tables in it
▫ The sort at each level can discard the upper bits

 A 64-bit key at level 4 only has to compare 52 bits to sort keys
 A 256-bit key at level 75 only has to compare 31 bits
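
A tiny helper illustrating the "discard the upper bits" point above, assuming the 3-bit mask from the earlier example:

    BITS_PER_LEVEL = 3   # M, the bit-mask size per level (assumed)

    def bits_to_compare(key_bits: int, level: int) -> int:
        """Bits that still distinguish keys inside one pile at a given level:
        the top M*level bits are identical by construction, so the sort skips them."""
        return key_bits - BITS_PER_LEVEL * level

    print(bits_to_compare(64, 4))     # 52
    print(bits_to_compare(256, 75))   # 31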

(5) Use Figures 2 and 3 to describe the LSM-trie’s structure and how compaction is performed in the trie.

SLIDE 8

Growth Patterns

  • LevelDB

▫ Exponential Growth – each level is 10x larger than the previous one, with space for 10x more data

  • LSM-trie

▫ Combined Linear and Exponential Growth – levels are 8x larger than the previous one but have space for 64x the data

(3) Use Figure 1 to explain the difference between linear and exponential growth patterns.

SLIDE 9

Trade-offs vs LevelDB

  • Gain (Pro)

▫ Minimizes insert time
▫ Minimizes time spent re-sorting a level
▫ Able to effectively partition over multiple physical stores
▫ Able to offload compaction work of lower levels
▫ Fast key lookup – only 1 pile at each level can contain a key

  • Loss (Con)

▫ Increased read time due to pile searches
▫ Unable to do range searching
▫ Uses different data structures for small data vs. larger (KB-scale) data
 HTable for small data
 SSTable-trie for large data

(2) “Note that LSM-trie uses hash functions to organize its data and accordingly does not support range search.” Does LevelDB support range search?

SLIDE 10

SSTable-trie

  • Sorted String Table for LSM-trie

▫ Maintains a Bloom filter for fast exclusion lookup
▫ Used only for larger data items (KB-scale and above)

(4) “Among all compactions moving data from Lk to Lk+1, we must make sure their key ranges are not overlapped to keep any two SSTables at Level Lk+1 from having overlapped key ranges. However, this cannot be achieved with the LevelDB data organization …” Please explain why LevelDB cannot achieve it.

SLIDE 11

HTables

  • Hash indexed data store

▫ Uses lower-order bits of the FSK to determine which hash bucket to put data in
▫ Maintains a Bloom filter for fast exclusion lookup
▫ Optimized for small data (< ~1 KB)
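
A minimal sketch of the bucket selection described above, assuming a power-of-two bucket count (the paper's HTable layout has more detail; this only shows the low-order-bits idea):

    def bucket_index(fsk_value: int, num_buckets: int) -> int:
        """Pick an HTable bucket from the lower-order bits of the FSK.
        The upper bits were already consumed by the trie to select the pile."""
        # With num_buckets a power of two, masking the low bits equals modulo.
        return fsk_value & (num_buckets - 1)

    print(bucket_index(0b101101110010, 8))   # uses only the lowest 3 bits -> 2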

(8) What’s the difference between SSTable in LevelDB and HTable in LSM-trie?

SLIDE 12

HTable (cont.)

  • During compaction

▫ HTables may balance buckets and record relocation information in the HTable metadata
▫ Each HTable is limited to 95% fill to allow for balancing of randomly sized items

  • Bloom filters

▫ Size is 16 bits per item
▫ Sized to minimize the false positive rate
▫ Filters stored in the HTable
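
For context on why 16 bits per item is enough: with the optimal number of hash functions, a Bloom filter's false-positive rate is approximately 0.6185^(bits per item). A minimal sketch of that standard approximation:

    def bloom_false_positive_rate(bits_per_item: float) -> float:
        """Approximate FPR with the optimal k = bits_per_item * ln 2 hash functions."""
        return 0.6185 ** bits_per_item

    print(bloom_false_positive_rate(16))   # ~ 0.0005, i.e. roughly 1 in 2000 lookups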

(9) “However, a challenging issue is whether the buckets can be load balanced in terms of aggregate size of KV items hashed into them.” Why may the buckets in an HTable be load unbalanced? How can the problem be corrected?

SLIDE 13

Bloom Filter Sizing

  • Memory Usage

▫ Bloom filters use the majority of memory

 A store of 32 sub-levels with an average 64 B item size has 4.5 GB of 16-bit-per-item Bloom filters

▫ Relocation Records use memory

 The above store is estimated to use 0.5 GB of memory for relocation records
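
A generic sizing sketch (illustrative only; the slide's 4.5 GB figure depends on the store capacity assumed for the 32-sub-level example, which is not restated here):

    def bloom_memory_gib(store_bytes: float, avg_item_bytes: float,
                         bits_per_item: float = 16) -> float:
        """Total Bloom-filter memory: one filter entry per KV item."""
        num_items = store_bytes / avg_item_bytes
        return num_items * bits_per_item / 8 / 2**30

    print(bloom_memory_gib(1e12, 64))   # ~ 29 GiB of filters for a 1 TB store of 64 B items

Filters at that scale can outgrow DRAM, which is the point behind question (6) below.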

(7) “Therefore, the Bloom filter must be beefed up by using more bits.” Use an example to show why the Bloom filters have to be longer.

(6) “The indices and Bloom filters in a KV store can grow very large.” Use an example to show that these metadata in LevelDB may have to be out of core.

SLIDE 14

Performance Comparisons

SLIDE 15

SLIDE 16

One step past this paper

SLIDE 17

Key points

  • Keys to an effective distributed application

1. Each system works on non-overlapping subsets of the problem
2. The same function runs on each processor
3. Reading/writing data is not a bottleneck

  • LSM-trie

▫ Has non-overlapping data subsets
▫ Data location is not that important provided:
 Low-latency access to data
 High-throughput reading/writing of data

SLIDE 18

Distributed LSM-trie Architecture

[Diagram: a front-end client sends requests to a Level 0 KV store server holding a MemTable and L0 piles; each Level 1 pile is served by its own KV store server (KV Store L1.000 … L1.111) with its own MemTable and piles; a Level N thread feeds compacted data to the Level N+1 thread (shown for N = 0).]

SLIDE 19

Use case: Get(K)

  • Read value communications

▫ When Level 0 receives the request, it sends a message with the key to all level servers, asking if they have it
▫ Servers having the data respond with their server ID and server level
▫ The Level 0 system then determines which level is newest; if the key is not in its own data store, it requests it from the appropriate server
▫ Expected lookup time is close to O(1) when the number of server processes is allowed to grow as O(log(n))

[Diagram: the MemTable and input tables answer HaveKey(K)?; compaction sorts piles out to the next layer's KV store servers]
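
A sketch of the lookup flow described above; all method names (have_key, get, etc.) are hypothetical:

    def distributed_get(level0, level_servers, key):
        """Get(K): ask every level server whether it holds the key, then
        fetch the value from the newest (shallowest) level that answers yes."""
        fsk = level0.hash(key)
        if level0.has_local(fsk):                 # check own MemTable / L0 piles first
            return level0.get_local(fsk)
        hits = [srv for srv in level_servers if srv.have_key(fsk)]
        if not hits:
            return None
        newest = min(hits, key=lambda srv: srv.level)   # lower level = newer data
        return newest.get(fsk)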

SLIDE 20

Use case: Put(K,V)

  • Write value communications

▫ When Level 0 receives the request, it stores the data in its MemTable
▫ When the MemTable is full, it is converted to an immutable HTable or SSTable-trie and put in the local pile
▫ When the local pile fills up, piles are sorted and the data is sent to lower-level servers based on the trie partitioning
▫ Lower-level servers store the data into their MemTable and make it immutable when appropriate, on down the tree
▫ Expected insert time is O(1) when the number of server processes is allowed to grow as O(log(n))
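
A matching sketch of the write flow; again the method names are hypothetical:

    def distributed_put(level0, key, value, memtable_limit=4096):
        """Put(K, V): write to the Level 0 MemTable; freeze a full MemTable into
        an immutable table on the local pile; ship a full pile to the next
        level's servers according to the trie partitioning."""
        fsk = level0.hash(key)
        level0.memtable[fsk] = (key, value)
        if len(level0.memtable) >= memtable_limit:
            level0.local_pile.append(level0.freeze_memtable())   # HTable / SSTable-trie
        if level0.local_pile_full():
            for prefix, table in level0.sort_pile_by_trie_prefix():
                level0.next_level_server(prefix).receive(table)  # compaction to Level N+1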