
PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees
Pandian Raju (UT Austin), Rohan Kadekodi (UT Austin), Vijay Chidambaram (UT Austin and VMware Research), Ittai Abraham (VMware Research)

What is a key-value store?


  1. FLSM structure. [Diagram: an in-memory memtable above on-disk levels. Level 0 holds files 2..37 and 23..48 with no guards; Level 1 has guards 15 and 70 over files 1..12, 15..59, 77..87, 82..95; Level 2 has guards 15, 40, 70, 95 over files 2..8, 15..23, 16..32, 45..65, 70..90, 96..99.] Note how files are logically grouped within guards.

  2. FLSM structure. [Same FLSM diagram as above.] Guards get more fine-grained deeper into the tree. A sketch of this layout follows.
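The guard layout above can be made concrete with a small model. This is an illustrative Python sketch, not PebblesDB's actual C++ code; the names (SSTable, Guard, Level, guard_for) are invented for exposition. The key invariant: files within a guard may overlap each other, but no file crosses a guard boundary.

```python
import bisect
from dataclasses import dataclass, field

@dataclass
class SSTable:
    smallest: int                              # smallest key in the file
    largest: int                               # largest key in the file
    data: dict = field(default_factory=dict)   # stand-in for the on-disk contents

@dataclass
class Guard:
    key: int                                   # guard covers [key, next guard's key)
    files: list = field(default_factory=list)  # files here may overlap each other

class Level:
    def __init__(self, guard_keys):
        # A sentinel guard catches keys smaller than the first real guard.
        self.guards = [Guard(float("-inf"))] + [Guard(k) for k in sorted(guard_keys)]

    def guard_for(self, key):
        # Rightmost guard whose key <= the lookup key (binary search).
        keys = [g.key for g in self.guards]
        return self.guards[bisect.bisect_right(keys, key) - 1]

# Level 1 from the slides: guards 15 and 70.
level1 = Level([15, 70])
level1.guard_for(23).files.append(SSTable(15, 59))   # file 15..59 lands in guard 15
```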

  3. How does FLSM reduce write amplification?

  4. How does FLSM reduce write amplification? [Diagram: a new file 30..68 arrives from memory; Level 0 already holds files 2..37 and 23..48 atop the same FLSM structure.] The maximum number of files in level 0 is configured to be 2.

  5. How does FLSM reduce write amplification? [Diagram: the level-0 files 2..37, 23..48, 30..68 are merged into a sorted run 2..68 and split at guard 15 into fragments 2..14 and 15..68.] Compacting level 0.

  6. How does FLSM reduce write amplification? [Diagram: fragments 2..14 and 15..68 move down toward Level 1.] Fragmented files are just appended to the next level.

  7. How does FLSM reduce write amplification? [Diagram: Level 1 now holds 1..12 and 2..14 before guard 15, files 15..59 and 15..68 under guard 15, and 77..87, 82..95 under guard 70.] Guard 15 in Level 1 is to be compacted.

  8. How does FLSM reduce write amplification? [Diagram: guard 15's files 15..59 and 15..68 are merged and split at the next level's guard 40 into fragments 15..39 and 40..68.] Files are combined, sorted and fragmented.

  9. How does FLSM reduce write amplification? [Diagram: fragments 15..39 and 40..68 are appended under guards 15 and 40 of Level 2.] Fragmented files are just appended to the next level.

  10. How does FLSM reduce write amplification? FLSM doesn't re-write data to the same level in most cases. How does FLSM maintain read performance? FLSM maintains partially sorted levels to efficiently reduce the search space. A sketch of the compaction step follows.
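Building on the Level/Guard model above, here is a hedged outline of the compaction step just walked through: merge and sort one guard's files exactly once, fragment the sorted run at the next level's guard keys, and append each fragment below without rewriting the next level's existing files. make_sstable and compact_guard are illustrative names, not the real API.

```python
def make_sstable(pairs):
    # pairs: a sorted list of (key, value) tuples
    return SSTable(pairs[0][0], pairs[-1][0], dict(pairs))

def compact_guard(guard, next_level):
    # 1. Combine and sort: merge every file in the guard exactly once.
    merged = {}
    for sst in guard.files:                    # assume files ordered oldest-first
        merged.update(sst.data)                # newer values overwrite older ones
    items = sorted(merged.items())
    guard.files.clear()

    # 2. Fragment the sorted run at the next level's guard boundaries,
    # 3. then append each fragment to its guard below. Existing files in
    #    the next level are untouched: that is the write-amplification saving.
    frag, target = [], None
    for k, v in items:
        g = next_level.guard_for(k)
        if g is not target and frag:
            target.files.append(make_sstable(frag))
            frag = []
        target = g
        frag.append((k, v))
    if frag:
        target.files.append(make_sstable(frag))
```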

  11. Selecting Guards • Guards are chosen randomly and dynamically • Dependent on the distribution of data

  12. Selecting Guards • Guards are chosen randomly and dynamically • Dependent on the distribution of data. [Diagram, built up over three animation frames: a keyspace from 1 to 1e+9 with guards picked at random points.] One plausible selection scheme is sketched below.
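The slides only say guards are picked randomly and dynamically from the inserted keys. Below is one plausible, skip-list-style scheme (hedged: the exact PebblesDB rule may differ): hash each key and use the trailing-zero count of the hash to decide the shallowest level the key may guard, so deeper levels naturally get exponentially more guards, and a guard at one level is also a guard at every deeper level. BITS_PER_LEVEL and NUM_LEVELS are illustrative knobs.

```python
import hashlib

BITS_PER_LEVEL = 2      # illustrative tuning knob
NUM_LEVELS = 7          # levels 0..6; level 0 has no guards

def shallowest_guard_level(key: bytes):
    h = int.from_bytes(hashlib.sha1(key).digest()[:8], "big")
    zeros = (h & -h).bit_length() - 1 if h else 64   # trailing zero bits of the hash
    for level in range(1, NUM_LEVELS):
        # Shallow levels demand more trailing zeros, so they get fewer guards.
        if zeros >= BITS_PER_LEVEL * (NUM_LEVELS - level):
            return level        # also a guard of every level deeper than this
    return None                 # most keys never become guards

# Deterministic per key and requiring no coordination, so guards can be
# chosen on the fly as keys arrive, adapting to the data distribution.
print(shallowest_guard_level(b"user4832"))
```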

  15. Operations: Write. Put(1, “abc”). [Diagram: Write(key, value) enters the in-memory memtable atop the same FLSM structure as above.]

  16. Operations: Get. Get(23) on the FLSM structure. [Same FLSM diagram as above.]

  17. Operations: Get. Get(23). [Same diagram.] Search level by level, starting from memory.

  18. Operations: Get. Get(23). [Same diagram.] All level 0 files need to be searched.

  19. Operations: Get. Get(23). [Same diagram.] Level 1: the file under guard 15 is searched.

  20. Operations: Get. Get(23). [Same diagram.] Level 2: both files under guard 15 are searched. The sketch below traces this path.
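The Get(23) walk above translates into a short sketch (again illustrative, reusing the Level/Guard model from earlier): consult the memtable, then every level-0 file, then exactly one guard per deeper level, searching that guard's files newest-first since they may overlap.

```python
def flsm_get(memtable, level0_files, deeper_levels, key):
    # 1. The in-memory memtable has the freshest data.
    if key in memtable:
        return memtable[key]
    # 2. Level 0 has no guards, so every file may hold the key.
    for sst in reversed(level0_files):                 # newest file first
        if sst.smallest <= key <= sst.largest and key in sst.data:
            return sst.data[key]
    # 3. Deeper levels: binary-search to one guard, then check each of
    #    its files. The extra files per guard are FLSM's read cost.
    for level in deeper_levels:
        guard = level.guard_for(key)
        for sst in reversed(guard.files):
            if sst.smallest <= key <= sst.largest and key in sst.data:
                return sst.data[key]
    return None
```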

  21. High write throughput in FLSM • Compaction from memory to level 0 is stalled • Writes to memory are also stalled. [Diagram: Write(key, value) entering the memtable; Level 0 holds files 2..98, 23..48, 1..37, 18..48.] If the rate of insertion is higher than the rate of compaction, write throughput depends on the rate of compaction. FLSM has faster compaction because of less I/O, and hence higher write throughput.

  23. Challenges in FLSM • Every read/range-query operation needs to examine multiple files per level • For example, if every guard has 5 files, read latency increases by 5x (assuming no cache hits). Trade-off between write I/O and read performance.

  24. Outline • Log-Structured Merge Tree (LSM) • Fragmented Log-Structured Merge Tree (FLSM) • Building PebblesDB using FLSM • Evaluation • Conclusion

  25. PebblesDB • Built by modifying HyperLevelDB (~9,100 LOC) to use FLSM • HyperLevelDB, built over LevelDB, provides improved parallelism and compaction • API-compatible with LevelDB, but not with RocksDB

  26. Optimizations in PebblesDB • Challenge (get/range query): multiple files in a guard • Get() performance is improved using file-level bloom filters

  27. Optimizations in PebblesDB • Challenge (get/range query): multiple files in a guard • Get() performance is improved using file-level bloom filters. [Diagram: a bloom filter asked “Is key 25 present?” answers either “Definitely not” or “Possibly yes”.]

  28. Optimizations in PebblesDB • Challenge (get/range query): multiple files in a guard • Get() performance is improved using file-level bloom filters. [Diagram: Level 1 with guards 15 and 70 over files 1..12, 15..39, 82..95, 77..97, each file with its own bloom filter maintained in-memory.]

  29. Optimizations in PebblesDB • Challenge (get/range query): multiple files in a guard • Get() performance is improved using file-level bloom filters. [Same diagram as above.] PebblesDB reads the same number of files as any LSM-based store.

  30. Optimizations in PebblesDB • Challenge (get/range query): multiple files in a guard • Get() performance is improved using file-level bloom filters • Range query performance is improved using parallel threads and better compaction. A sketch of the bloom-filter path follows.
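A minimal sketch of the file-level bloom-filter idea: keep one small in-memory filter per sstable and consult it before opening the file, so with high probability only one file per guard is actually read. The BloomFilter class below is a toy (PebblesDB reuses existing bloom-filter code rather than this); sizes and hash counts are illustrative.

```python
import hashlib

class BloomFilter:
    def __init__(self, nbits=1024, nhashes=4):
        self.nbits, self.nhashes, self.bits = nbits, nhashes, 0

    def _positions(self, key):
        digest = hashlib.sha256(str(key).encode()).digest()
        for i in range(self.nhashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def may_contain(self, key):
        # False means "definitely not"; True means "possibly yes".
        return all(self.bits >> p & 1 for p in self._positions(key))

def get_in_guard(files_with_filters, key):
    # files_with_filters: list of (SSTable, BloomFilter) pairs, oldest-first.
    # Only open files whose filter says the key might be present.
    for sst, bf in reversed(files_with_filters):       # newest first
        if bf.may_contain(key) and key in sst.data:
            return sst.data[key]
    return None
```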

  31. Outline • Log-Structured Merge Tree (LSM) • Fragmented Log-Structured Merge Tree (FLSM) • Building PebblesDB using FLSM • Evaluation • Conclusion

  32. Evaluation • Real-world workloads - YCSB • Crash recovery • Micro-benchmarks • CPU and memory • Low memory usage • Small dataset • Aged file system • NoSQL applications

  34. Real world workloads - YCSB • Yahoo! Cloud Serving Benchmark: industry-standard macro-benchmark • Insertions: 50M, operations: 10M, key size: 16 bytes, value size: 1 KB. [Bar chart: throughput ratio relative to HyperLevelDB (y-axis 0 to 2.5) for Load A, Runs A-D, Load E, Runs E-F, and total IO; annotated throughputs range from 5.8 to 57.87 Kops/s, with 952.93 GB total IO.] Workloads: Load A - 100% writes; Run A - 50% reads, 50% writes; Run B - 95% reads, 5% writes; Run C - 100% reads; Run D - 95% reads (latest), 5% writes; Load E - 100% writes; Run E - 95% range queries, 5% writes; Run F - 50% reads, 50% read-modify-writes.

  40. NoSQL stores - MongoDB • YCSB on MongoDB, a widely used NoSQL store • Inserted 20M key-value pairs with 1 KB value size, then ran 10M operations. [Bar chart: throughput ratio relative to WiredTiger (y-axis 0 to 2.5) for the same YCSB workloads and total IO; annotated throughputs range from 0.65 to 23.53 Kops/s, with 426.33 GB total IO.] PebblesDB combines the low write IO of WiredTiger with the high performance of RocksDB.

  46. Outline • Log-Structured Merge Tree (LSM) • Fragmented Log-Structured Merge Tree (FLSM) • Building PebblesDB using FLSM • Evaluation • Conclusion

  47. Conclusion • PebblesDB: a key-value store built on Fragmented Log-Structured Merge Trees • Increases write throughput and reduces write IO at the same time • Achieves 6x the write throughput of RocksDB • As key-value stores become more widely used, there have been several attempts to optimize them • PebblesDB combines algorithmic innovation (the FLSM data structure) with careful systems building

  48. https://github.com/utsaslab/pebblesdb

  50. Backup slides

  51. Operations: Seek • Seek(target): returns the smallest key in the database that is >= target • Used for range queries (for example, return all entries between 5 and 18). [Example: Get(1) over Level 0 – 1, 2, 100, 1000; Level 1 – 1, 5, 10, 2000; Level 2 – 5, 300, 500.]

  52. Operations: Seek • Seek(target): returns the smallest key in the database that is >= target • Used for range queries (for example, return all entries between 5 and 18). [Example: Seek(200) over Level 0 – 1, 2, 100, 1000; Level 1 – 1, 5, 10, 2000; Level 2 – 5, 300, 500; the smallest key >= 200 is 300.]


  54. Operations: Seek. Seek(23) on the FLSM structure. [Same FLSM diagram as above.]

  55. Operations: Seek. Seek(23). [Same diagram.] All levels and the memtable need to be searched. A sketch of the seek path follows.
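Seek over FLSM can be sketched as "seek in every candidate sorted run, take the minimum": each level-0 file, one guard per deeper level, and the memtable each contribute their first key >= target. Representing each run as a sorted key list is a simplification of real sstable iterators; the example replays Seek(200) from the earlier backup slide.

```python
import bisect

def seek(sorted_runs, target):
    # Per-run seek: first key >= target, via binary search.
    candidates = []
    for run in sorted_runs:
        i = bisect.bisect_left(run, target)
        if i < len(run):
            candidates.append(run[i])
    # The overall answer is the smallest candidate across all runs.
    return min(candidates, default=None)

runs = [[1, 2, 100, 1000],    # level 0
        [1, 5, 10, 2000],     # level 1
        [5, 300, 500]]        # level 2
print(seek(runs, 200))        # -> 300
```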

  56. Optimizations in PebblesDB • Challenge with reads: multiple sstable reads per level • Optimized using sstable-level bloom filters • Bloom filter: determines whether an element might be in a set (no false negatives, occasional false positives). [Diagram: a bloom filter asked “Is key 25 present?” answers “Definitely not” or “Possibly yes”.]

  57. Optimizations in PebblesDB • Challenge with reads: multiple sstable reads per level • Optimized using sstable-level bloom filters. [Diagram: Level 1 with guards 15 and 70 over files 1..12, 15..39, 82..95, 77..97, each with an in-memory bloom filter; for Get(97), one filter answers True.]

  58. Optimizations in PebblesDB • Challenge with reads: multiple sstable reads per level • Optimized using sstable-level bloom filters. [Same diagram; for Get(97), one filter answers False and another True, so only the matching file is read.]

  59. Optimizations in PebblesDB • Challenge with reads: multiple sstable reads per level • Optimized using sstable-level bloom filters. [Same diagram.] PebblesDB reads at most one file per guard with high probability.

  60. Optimizations in PebblesDB • Challenge with seeks: multiple sstable reads per level • Parallel seeks: parallel threads call seek() on the files in a guard. [Diagram: Seek(85) on guard 70 in Level 1; Thread 1 seeks file 77..97 while Thread 2 seeks file 82..95.]

  61. Optimizations in PebblesDB • Challenge with seeks: multiple sstable reads per level • Parallel seeks: parallel threads call seek() on the files in a guard • Seek-based compaction: triggers compaction for a level during a seek-heavy workload, reducing the average number of sstables per guard and the number of active levels. Seek-based compaction increases write I/O as a trade-off to improve seek performance. A sketch of the parallel-seek path follows.
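The parallel-seek optimization can be sketched as one worker per file in the guard, each running the per-file seek, with the smallest hit winning. ThreadPoolExecutor stands in for whatever thread pool the real implementation uses; seek_one_file reuses the per-run binary search from the sketch above.

```python
import bisect
from concurrent.futures import ThreadPoolExecutor

def seek_one_file(run, target):
    # Per-file seek: first key >= target in one sorted run.
    i = bisect.bisect_left(run, target)
    return run[i] if i < len(run) else None

def parallel_seek_guard(runs_in_guard, target):
    # One thread per file in the guard seeks concurrently.
    with ThreadPoolExecutor(max_workers=len(runs_in_guard)) as pool:
        results = list(pool.map(lambda r: seek_one_file(r, target), runs_in_guard))
    hits = [r for r in results if r is not None]
    return min(hits, default=None)

# Seek(85) in guard 70 from the slide: files 77..97 and 82..95.
print(parallel_seek_guard([[77, 90, 97], [82, 85, 95]], 85))   # -> 85
```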
