SLIDE 1 The Power of the Log
LSM & Append Only Data Structures
Ben Stopford
Confluent Inc
SLIDE 3 The Log
Connectors Connectors
Producer Consumer
Streaming Engine
Kafka: a Streaming Platform
SLIDE 4 KAFKA’s Distributed Log
Linear Scans Append Only
SLIDE 5 Messaging is a Log-Shaped Problem
Linear Scans Append Only
SLIDE 6
Not all problems are Log-Shaped
SLIDE 7
Many problems benefit from being addressed in a “log-shaped” way
SLIDE 8
Supporting Lookups
SLIDE 9
Lookups in a log
Head Tail
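To make the cost of a lookup in a plain log concrete, here is a minimal sketch. The `log`, `append` and `lookup` names are hypothetical, chosen only for illustration. Appends are cheap, but a lookup has to scan records one by one from newest to oldest.

```python
# A hypothetical in-memory append-only log of (key, value) records.
log = []

def append(key, value):
    """Add a record at the end; existing records are never modified."""
    log.append((key, value))

def lookup(key):
    """Scan from the newest record back to the oldest; return the latest value."""
    for k, v in reversed(log):
        if k == key:
            return v
    return None  # the whole log was scanned without a match

append("bob", "carpenter")
append("dave", "plumber")
append("bob", "cabinet maker")
print(lookup("bob"))
```

A miss is the expensive case: every record is visited before `None` comes back.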
SLIDE 10 Trees provide Selectivity
bob dave fred hary mike steve vince
Index
SLIDE 11 But the overarching structure implies Dispersed Writes
bob dave fred hary mike steve vince
Random IO
SLIDE 12
Log Structured Merge Trees 1996
SLIDE 13 Used in a range of modern databases
- BigTable
- HBase
- LevelDB
- SQLite4
- RocksDB
- MongoDB
- WiredTiger
- Cassandra
- MySQL
- InfluxDB ...
SLIDE 14 If a system has a natural grain, it is one formed of sequential
operations which favour locality
SLIDE 15 Caching & Prefetching
L3 cache L2 cache L1 cache
Pre-fetch is your friend
CPU Caches Page Cache Application-level caching Disk Controller
SLIDE 16 Write efficiency comes from amortising writes into sequential operations
SLIDE 17 Taken from ACM Queue: The Pathologies of Big Data
SLIDE 18
So if we go against the grain of the system, RAM can actually be slower than disk
SLIDE 19 Going against the grain means dispersed
operations that break locality
Poor Locality Good Locality
SLIDE 20 The beauty of the log lies in its sequentiality
Linear Scans Append Only
SLIDE 21
LSM is about re-imagining search as a “log-shaped” problem
SLIDE 22 Arrange writes to be Append Only
Append Only Journal (Sequential IO): Bob = Carpenter, Bob = Cabinet Maker
Update in Place Ordered File (Random IO): Bob = Carpenter -> Bob = Cabinet Maker
SLIDE 23
Avoid dispersed writes
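As a rough illustration of the append-only journal, the sketch below appends `key=value` lines to the end of a file and rebuilds the current state by replaying them in order. The file layout and the `append_record` and `replay` helpers are assumptions made for this example, not something taken from the slides.

```python
import os
import tempfile

def append_record(path, key, value):
    """Append one 'key=value' line to the end of the journal (sequential IO)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{key}={value}\n")

def replay(path):
    """Rebuild current state by reading the journal in order; later records win."""
    state = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            key, value = line.rstrip("\n").split("=", 1)
            state[key] = value
    return state

# Hypothetical journal location in a fresh temporary directory.
journal_path = os.path.join(tempfile.mkdtemp(), "journal.log")
append_record(journal_path, "Bob", "Carpenter")
append_record(journal_path, "Bob", "Cabinet Maker")
state = replay(journal_path)
print(state)
```

Both versions of Bob's record stay in the file; only the replayed state reflects the latest one.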
SLIDE 24
Simple LSM
SLIDE 25 Writes are collected in memory
Writes sort write to disk
files small index file RAM
SLIDE 26 When enough have buffered, sort.
Writes write to disk
files small index file Batched sorted RAM
SLIDE 27 Write the sorted file to disk
Writes write to disk
files Small, sorted immutable file Batched sorted
SLIDE 28 Repeat...
Writes write to disk Older files New files Batched sorted
SLIDE 29 Batching -> Fast Sequential IO
Writes write to disk Older files New files Batched Sorted memtable
SLIDE 30
That’s the core write path
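The write path described on the preceding slides can be sketched as follows. `SimpleLSM` and its `memtable_limit` threshold are hypothetical names, and the "files" are kept as in-memory tuples to keep the example short; a real store would write each sorted batch to disk sequentially.

```python
class SimpleLSM:
    """Toy write path: buffer writes in memory, then flush a sorted immutable batch."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.memtable = {}                    # the only mutable structure
        self.files = []                       # oldest first; each is a sorted tuple

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Sort the buffered writes and 'write' them out as one immutable file."""
        if not self.memtable:
            return
        self.files.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

lsm = SimpleLSM(memtable_limit=3)
for key, value in [("mike", 1), ("bob", 2), ("steve", 3), ("dave", 4)]:
    lsm.put(key, value)
print(lsm.files, lsm.memtable)
```

After the third write the buffer is sorted and flushed as one file; the fourth write starts a fresh memtable.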
SLIDE 31
What about reads?
SLIDE 32 Search reverse-chronologically
files newer files
(1) Is “bob” here? (2) Is “bob” here? (3) Is “bob” here? (4) Is “bob” here?
SLIDE 33 Worst Case
We consult every file
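A sketch of the read path, assuming each file is a list of `(key, value)` pairs sorted by key and that `files` is ordered oldest first. The `get` and `search_file` helpers are illustrative names. `get` also returns how many files were consulted, which makes the worst case visible.

```python
from bisect import bisect_left

def search_file(sorted_file, key):
    """Binary search one sorted file; return the value or None if absent."""
    # Building the key list is O(n); a real store would keep an index instead.
    keys = [k for k, _ in sorted_file]
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return sorted_file[i][1]
    return None

def get(memtable, files, key):
    """Check the memtable, then files from newest to oldest.

    Returns (value, number_of_files_consulted).
    """
    if key in memtable:
        return memtable[key], 0
    consulted = 0
    for sorted_file in reversed(files):
        consulted += 1
        value = search_file(sorted_file, key)
        if value is not None:
            return value, consulted
    return None, consulted

files = [
    [("bob", "carpenter"), ("fred", "baker")],       # oldest
    [("dave", "plumber"), ("mike", "driver")],
    [("bob", "cabinet maker"), ("vince", "chef")],   # newest
]
memtable = {"steve": "pilot"}
print(get(memtable, files, "bob"))
print(get(memtable, files, "zed"))
```

A key that lives only in the oldest file, or that does not exist at all, forces a visit to every file.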
SLIDE 34
We might have a lot of files!
SLIDE 35 LSM naturally optimises for writes over reads.
This is a reasonable tradeoff to make
SLIDE 36 Optimizing reads is easier than optimizing writes
SLIDE 37 Optimisation 1
Bound the number of files
SLIDE 38 Create levels
Level-0 Level-1
SLIDE 39 Separate thread merges old files, de-duplicating them.
Level-0 Level-1
SLIDE 40 Separate thread merges old files, de-duplicating them.
Level-0 Level-1
SLIDE 41
Merging process is reminiscent of merge sort
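The merge step can be sketched with a k-way merge over the sorted files, keeping only the newest value for each key. This is a simplified illustration, not the compaction algorithm of any particular database; `compact` is a hypothetical helper, and it assumes keys are unique within each file and that `files` is ordered oldest first.

```python
import heapq

def compact(files):
    """Merge sorted files into one sorted, de-duplicated file (newest value wins)."""
    # Tag each record with the negated file age so that, for equal keys,
    # the record from the newer file sorts first.
    tagged = [
        [(key, -age, value) for key, value in f]
        for age, f in enumerate(files)
    ]
    merged = []
    for key, _, value in heapq.merge(*tagged):
        # The first record seen for a key is the newest; skip older duplicates.
        if not merged or merged[-1][0] != key:
            merged.append((key, value))
    return merged

old_files = [
    [("bob", "carpenter"), ("fred", "baker")],          # older
    [("bob", "cabinet maker"), ("dave", "plumber")],    # newer
]
print(compact(old_files))
```

Each input is read once, front to back, and the output is written in key order, so the whole process stays sequential.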
SLIDE 42 Take this further with levels
Level-0 Level-1 Level-2 Level-3 Memtable
SLIDE 43 But single reads still require many individual lookups:
– 1 per base level
– 1 per level above
SLIDE 44 Optimisation 2
Caching & Friends
SLIDE 45 Add Memory
i.e. More Caching / Pre-fetch
SLIDE 46 Read Ahead & Prefetch
L3 cache L2 cache L1 cache
Pre-fetch is your friend
Page Cache Disk Controller
SLIDE 47
If only there was a more efficient way to avoid searching each file!
SLIDE 48
Elven Magic?
SLIDE 49 Bloom Filters
Answers the question: Do I need to look in this file to find the value for this key?
Size -> probability of false positive
Key Hash Function Bit Set
SLIDE 50 Bloom Filters
- Space efficient, probabilistic data structure
– p(collision) increases as keys are added
– Index size is fixed
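A minimal Bloom filter sketch, assuming a fixed bit set and a few hash positions derived from SHA-256 with different seeds. The sizes and the hashing scheme are choices made for this example only; production filters use faster hash functions and a packed bit array.

```python
import hashlib

class BloomFilter:
    """Fixed-size probabilistic set: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        """Derive num_hashes bit positions for a key by seeding SHA-256."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        """False means definitely absent; True means the file must be checked."""
        return all(self.bits[pos] for pos in self._positions(key))

bloom = BloomFilter()
for name in ["bob", "dave", "fred"]:
    bloom.add(name)
print(bloom.might_contain("bob"))
```

A `False` answer lets a read skip the file entirely. As more keys are added to the same fixed-size bit set, more bits are set and a `True` answer for an absent key becomes more likely.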
SLIDE 51 Many more degrees of freedom for optimising reads
RAM Disk file metadata & bloom filter
SLIDE 52 Log Structured Merge Trees
- A collection of small, immutable indexes
- All sequential operations, de-duplicate by merging files
- Index/Bloom in RAM to increase read performance
SLIDE 53 Subtleties
- Writes are 1 x IO (blind writes), rather than 2 x IOs (read + modify)
- Batching writes decreases write amplification. In trees, leaf pages must be updated.
SLIDE 54 Immutability => Simpler locking semantics
Only memtable is mutable
SLIDE 55 Does it work?
Lots of real world examples
SLIDE 56 Measurable in the real world
- InnoDB vs MyRocks results, taken from Mark Callaghan’s blog: http://bit.ly/2mhWT7p
- There are many subtleties. Take all benchmarks with a pinch of salt.
SLIDE 57 Elements of Beauty
- Reframing the problem to be Log-Centric. To go with the grain of the system.
- Optimise for the harder problem
- Compartmentalises writes (coordination) to a single point. Reads -> immutable structures.
SLIDE 58 Applies in many other areas
– Databases: write ahead logs
– Columnar databases: Merge Joins
– Kafka
– Snapshot isolation over explicit locking.
– Replication (state machine replication)
SLIDE 59
Log-Centric Approaches Work in Applications too
SLIDE 60 Event Sourcing
Diagram: an object updated “in place” vs. a journal of state changes (+ 10.36, + 23.70, + 13.33)
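A small event-sourcing sketch using the three amounts from the slide. State changes are appended to a journal and the current value is derived by folding over it, rather than being overwritten in place. The `record` and `current_balance` names, and the choice to treat the amounts as an account balance, are assumptions for illustration.

```python
from decimal import Decimal

# Hypothetical journal of state changes; entries are only ever appended.
journal = []

def record(change):
    """Append one state change. Decimal avoids float rounding on money-like values."""
    journal.append(Decimal(change))

def current_balance():
    """Derive the current state by summing every recorded change."""
    return sum(journal, Decimal("0"))

for change in ["10.36", "23.70", "13.33"]:
    record(change)
print(current_balance())
```

Because the journal keeps every change, the state at any earlier point can be recomputed by summing a prefix of it.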
SLIDE 61 CQRS
Client Command Query
Write Optimised Read Optimised
log
SLIDE 62
How Applications or Services share state
SLIDE 63 Log-Centric Services
Writer Read-Replica Read-Replica Read-Replica Writes are localised to a single service
SLIDE 64 Log-Centric Services
Writer Read-Replica Read-Replica Read-Replica Immutable log
SLIDE 65 Log-Centric Services
Writer Read-Replica Read-Replica Read-Replica Many, independent read replicas
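The single-writer, many-readers arrangement can be sketched as below. `Log` and `ReadReplica` are hypothetical classes for this example; in practice the log would be a system such as Kafka. Each replica keeps its own offset and builds its own view, so replicas need no coordination with each other or with the writer.

```python
class Log:
    """Append-only log: the single point where writes are coordinated."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset):
        return self._records[offset:]

class ReadReplica:
    """Builds its own view by replaying the log from its own offset."""

    def __init__(self, log):
        self.log = log
        self.offset = 0
        self.view = {}

    def catch_up(self):
        for key, value in self.log.read_from(self.offset):
            self.view[key] = value
            self.offset += 1

log = Log()
replica_a, replica_b = ReadReplica(log), ReadReplica(log)

log.append(("bob", "carpenter"))
replica_a.catch_up()                 # replica_a has seen one record
log.append(("bob", "cabinet maker"))
replica_b.catch_up()                 # replica_b has seen both records
print(replica_a.view, replica_b.view)
```

The two replicas can briefly disagree, but replaying the same immutable log brings them to the same view.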
SLIDE 66 Elements of Beauty
- Reframing the problem to be Log-Centric. To go with the grain of the system.
- Optimise for the harder problem
- Compartmentalises writes (coordination) to a single point. Reads -> immutable structures.
SLIDE 67 Decentralised Design
In both database design and application development
SLIDE 68 The Log is the central building block
Pushes us towards the natural grain of the system
SLIDE 69 The Log
A single unifying abstraction
SLIDE 70 References
LSM:
- benstopford.com/2015/02/14/log-structured-merge-trees/
- smalldatum.blogspot.co.uk/2017/02/using-modern-sysbench-to-compare.html
- www.quora.com/How-does-the-Log-Structured-Merge-Tree-work
- bLSM paper: http://bit.ly/2mT7Vje
Other
- Pat Helland (Immutability) cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf
- Peter Bailis (Coordination Avoidance): http://bit.ly/2m7XxnI
- Jay Kreps: I Heart Logs (O’Reilly 2014)
- The Data Dichotomy: http://bit.ly/2hk9c2K
SLIDE 71 Thank you
@benstopford http://benstopford.com ben@confluent.io