The Power of the Log: LSM & Append Only Data Structures



slide-1
SLIDE 1

The Power of the Log

LSM & Append Only Data Structures

Ben Stopford

Confluent Inc

slide-2
SLIDE 2

@benstopford

slide-3
SLIDE 3

The Log

Connectors Connectors

Producer Consumer

Streaming Engine

Kafka: a Streaming Platform

slide-4
SLIDE 4

KAFKA’s Distributed Log

Linear Scans Append Only

slide-5
SLIDE 5

Messaging is a Log-Shaped Problem

Linear Scans Append Only

slide-6
SLIDE 6

Not all problems are Log-Shaped

slide-7
SLIDE 7

Many problems benefit from being addressed in a “log-shaped” way

slide-8
SLIDE 8

Supporting Lookups

slide-9
SLIDE 9

Lookups in a log

Head Tail

slide-10
SLIDE 10

Trees provide Selectivity

bob dave fred hary mike steve vince

Index

slide-11
SLIDE 11

But the overarching structure implies Dispersed Writes

bob dave fred hary mike steve vince

Random IO

slide-12
SLIDE 12

Log Structured Merge Trees (1996)

slide-13
SLIDE 13

Used in a range of modern databases

  • BigTable
  • HBase
  • LevelDB
  • SQLite4
  • RocksDB
  • MongoDB
  • WiredTiger
  • Cassandra
  • MySQL
  • InfluxDB ...
slide-14
SLIDE 14

If a system has a natural grain, it is one formed of sequential operations which favour locality
slide-15
SLIDE 15

Caching & Prefetching

L3 cache L2 cache L1 cache

Pre-fetch is your friend

CPU Caches Page Cache Application-level caching Disk Controller

slide-16
SLIDE 16

Write efficiency comes from amortising writes into sequential operations
slide-17
SLIDE 17

Taken from ACMQueue: The Pathologies of Big Data

slide-18
SLIDE 18

So if we go against the grain of the system, RAM can actually be slower than disk

slide-19
SLIDE 19

Going against the grain means dispersed operations that break locality

Poor Locality Good Locality

slide-20
SLIDE 20

The beauty of the log lies in its sequentiality

Linear Scans Append Only

slide-21
SLIDE 21

LSM is about re-imagining search as a “log-shaped” problem

slide-22
SLIDE 22

Arrange writes to be Append Only

Append Only Journal (Sequential IO): Bob = Carpenter, then Bob = Cabinet Maker appended

Update in Place Ordered File (Random IO): Bob = Carpenter overwritten with Bob = Cabinet Maker
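
The contrast on this slide can be sketched in a few lines. This is only an illustration: the two in-memory structures and the function names (`journal_put`, `journal_get`, `inplace_put`) are hypothetical stand-ins for the journal and the ordered file, not anything taken from the talk.

```python
# Hypothetical in-memory stand-ins for the two files on the slide.
journal = []   # append-only journal: every change is a new record at the tail
ordered = {}   # update-in-place store: one slot per key, overwritten on change


def journal_put(key, value):
    """Append a new record; earlier records for the key are left untouched."""
    journal.append((key, value))


def journal_get(key):
    """Scan from the tail so the most recent record for the key wins."""
    for k, v in reversed(journal):
        if k == key:
            return v
    return None


def inplace_put(key, value):
    """Overwrite the existing slot for the key (a dispersed write on disk)."""
    ordered[key] = value


journal_put("bob", "Carpenter")
journal_put("bob", "Cabinet Maker")
inplace_put("bob", "Carpenter")
inplace_put("bob", "Cabinet Maker")
```

Both structures answer "Cabinet Maker" for bob, but the journal still holds both records and only ever wrote at its tail.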

slide-23
SLIDE 23

Avoid dispersed writes

slide-24
SLIDE 24

Simple LSM

slide-25
SLIDE 25

Writes are collected in memory

[Diagram: Writes; RAM; sort; write to disk; older files; small index file]

slide-26
SLIDE 26

When enough have buffered, sort.

[Diagram: Writes; RAM; Batched; sorted; write to disk; older files; small index file]

slide-27
SLIDE 27

Write the sorted file to disk

[Diagram: Writes; Batched; sorted; write to disk; older files; small, sorted immutable file]

slide-28
SLIDE 28

Repeat...

Writes write to disk Older files New files Batched sorted

slide-29
SLIDE 29

Batching -> Fast Sequential IO

Writes write to disk Older files New files Batched Sorted memtable

slide-30
SLIDE 30

That’s the core write path
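
The write path from the preceding slides (collect in memory, sort when enough have buffered, write an immutable sorted file, repeat) might look roughly like the sketch below. The class name `SimpleLSM`, the `memtable_limit` parameter and the use of in-memory tuples in place of real files on disk are all assumptions made for illustration.

```python
class SimpleLSM:
    """Toy write path: buffer writes in a memtable, flush sorted immutable runs."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.memtable = {}                    # the only mutable structure
        self.files = []                       # immutable sorted runs, oldest first

    def put(self, key, value):
        # A blind write: no read of existing data is needed.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        if not self.memtable:
            return
        # Sort the batch once, then write it out in a single sequential pass.
        # A tuple stands in for a file on disk to emphasise immutability.
        run = tuple(sorted(self.memtable.items()))
        self.files.append(run)
        self.memtable = {}


lsm = SimpleLSM(memtable_limit=3)
for name in ["mike", "bob", "fred", "dave"]:
    lsm.put(name, name.upper())
```

After the four writes above, one sorted run of three entries has been flushed and "dave" is still waiting in the memtable.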

slide-31
SLIDE 31

What about reads?

slide-32
SLIDE 32

Search reverse-chronologically

[Diagram: older files; newer files]

(1) Is “bob” here? (2) Is “bob” here? (3) Is “bob” here? (4) Is “bob” here?

slide-33
SLIDE 33

Worst Case

We consult every file
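
A reverse-chronological read could be sketched as follows, assuming each file is a sorted list of (key, value) pairs held oldest first. The function names and the returned count of files searched are illustrative additions, not part of the talk.

```python
from bisect import bisect_left


def search_file(run, key):
    """Binary search one sorted run of (key, value) pairs."""
    # (key,) sorts just before any (key, value) pair, so bisect_left lands on
    # the entry for key if it exists, without ever comparing values.
    i = bisect_left(run, (key,))
    if i < len(run) and run[i][0] == key:
        return run[i][1]
    return None


def get(memtable, files, key):
    """Return (value, number_of_files_searched), checking newest data first."""
    if key in memtable:
        return memtable[key], 0
    searched = 0
    for run in reversed(files):  # files are stored oldest first
        searched += 1
        value = search_file(run, key)
        if value is not None:
            return value, searched
    return None, searched        # worst case: every file was consulted


files = [
    [("bob", "Carpenter"), ("dave", "Baker")],      # older file
    [("bob", "Cabinet Maker"), ("mike", "Smith")],  # newer file
]
newest_bob = get({}, files, "bob")
missing = get({}, files, "zed")
```

A key that is absent, like "zed" here, is the worst case: both files are searched before the lookup gives up.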

slide-34
SLIDE 34

We might have a lot of files!

slide-35
SLIDE 35

LSM naturally optimises for writes, over reads

This is a reasonable tradeoff to make

slide-36
SLIDE 36

Optimising reads is easier than optimising writes
slide-37
SLIDE 37

Optimisation 1

Bound the number of files

slide-38
SLIDE 38

Create levels

Level-0 Level-1

slide-39
SLIDE 39

Separate thread merges old files, de-duplicating them.

Level-0 Level-1

slide-40
SLIDE 40

Separate thread merges old files, de-duplicating them.

Level-0 Level-1

slide-41
SLIDE 41

Merging process is reminiscent of merge sort
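
One way to picture the merge step is a k-way merge over already-sorted files that keeps only the newest value for each key. The sketch below uses Python's `heapq.merge` for this; the function name `compact` and the oldest-first ordering of the input are assumptions for the example, and a real system would stream from and to disk rather than build lists in memory.

```python
import heapq


def compact(runs):
    """Merge sorted runs (oldest first) into one run, newest value per key."""
    # Tag each entry so that, for equal keys, entries from newer runs sort first.
    tagged = [
        [(key, -age, value) for key, value in run]
        for age, run in enumerate(runs)
    ]
    merged = []
    for key, _, value in heapq.merge(*tagged):
        # The first entry seen for a key is the newest; skip older duplicates.
        if not merged or merged[-1][0] != key:
            merged.append((key, value))
    return merged


level0 = [
    [("bob", "Carpenter"), ("fred", "Baker")],      # older file
    [("bob", "Cabinet Maker"), ("dave", "Smith")],  # newer file
]
level1 = compact(level0)
```

The two input files collapse into one sorted file with a single entry for bob, which is why merging bounds the number of files a read must consult.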

slide-42
SLIDE 42

Take this further with levels

Level-0 Level-1 Level-2 Level-3 Memtable

slide-43
SLIDE 43

But single reads still require many individual lookups:

  • Number of searches:
    – 1 per base level
    – 1 per level above

slide-44
SLIDE 44

Optimisation 2

Caching & Friends

slide-45
SLIDE 45

Add Memory

i.e. More Caching / Pre-fetch

slide-46
SLIDE 46

Read Ahead & Prefetch

L3 cache L2 cache L1 cache

Pre-fetch is your friend

Page Cache Disk Controller

slide-47
SLIDE 47

If only there was a more efficient way to avoid searching each file!

slide-48
SLIDE 48

Elven Magic?

slide-49
SLIDE 49

Bloom Filters

Answers the question: Do I need to look in this file to find the value for this key?

Size -> probability of false positive

Key Hash Function Bit Set
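
The key, hash function and bit set on this slide fit together roughly as in the sketch below. The class name, the default sizes and the choice of seeded SHA-256 as the hash family are assumptions for illustration; production filters typically use cheaper hashes and a packed bit array.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means the key is definitely absent, so the file can be skipped.
        # True means the key may be present, so the file must be searched.
        return all(self.bits[pos] for pos in self._positions(key))


file_filter = BloomFilter()
file_filter.add("bob")
```

A read would call `might_contain` for each file's filter and only open the files that answer True.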

slide-50
SLIDE 50

Bloom Filters

  • Space efficient, probabilistic data structure
  • As keyspace grows:
    – p(collision) increases
    – Index size is fixed
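
The relationship between a fixed filter size and a growing keyspace can be made concrete with the standard approximation for a Bloom filter's false-positive probability. The symbols are introduced here and do not appear on the slides: m is the number of bits, k the number of hash functions and n the number of keys inserted, with the hash functions assumed independent and uniform.

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k}
```

With m and k held fixed, p rises towards 1 as n grows, which is the "p(collision) increases" bullet above.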

slide-51
SLIDE 51

Many more degrees of freedom for optimising reads

RAM Disk file metadata & bloom filter

slide-52
SLIDE 52

Log Structured Merge Trees

  • A collection of small, immutable indexes
  • All sequential operations, de-duplicate by merging files
  • Index/Bloom in RAM to increase read performance
slide-53
SLIDE 53

Subtleties

  • Writes are 1 x IO (blind writes), rather than 2 x IOs (read + modify)
  • Batching writes decreases write amplification. In trees, leaf pages must be updated.

slide-54
SLIDE 54

Immutability => Simpler locking semantics

Only memtable is mutable

slide-55
SLIDE 55

Does it work?

Lots of real world examples

slide-56
SLIDE 56

Measurable in the real world

  • InnoDB vs MyRocks results, taken from Mark Callaghan’s blog: http://bit.ly/2mhWT7p
  • There are many subtleties. Take all benchmarks with a pinch of salt.
slide-57
SLIDE 57

Elements of Beauty

  • Reframing the problem to be Log-Centric. To go with the grain of the system.
  • Optimise for the harder problem
  • Compartmentalises writes (coordination) to a single point. Reads -> immutable structures.
slide-58
SLIDE 58

Applies in many other areas

  • Sequentiality
    – Databases: write ahead logs
    – Columnar databases: Merge Joins
    – Kafka
  • Immutability
    – Snapshot isolation over explicit locking.
    – Replication (state machine replication)

slide-59
SLIDE 59

Log-Centric Approaches Work in Applications too

slide-60
SLIDE 60

Event Sourcing

  • Journaling of state changes
  • No “update in place”

[Diagram: Object; Journal: + 10.36; 12.12; + 23.70; + 13.33]
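
A minimal sketch of the idea: state changes are appended to a journal and the current state is derived by replaying it, so nothing is updated in place. The amounts below are made-up illustrative values rather than the figures from the slide, and the function names are hypothetical.

```python
from decimal import Decimal

journal = []  # append-only list of state changes (signed amounts)


def record(change):
    """Journal a state change; existing entries are never modified."""
    journal.append(Decimal(change))


def current_state(upto=None):
    """Derive state by replaying the journal, optionally only the first `upto` entries."""
    return sum(journal[:upto], Decimal("0"))


record("+10.00")
record("-2.50")
record("+5.25")

balance_now = current_state()
balance_after_two = current_state(upto=2)
```

Because the journal keeps every change, earlier states such as `balance_after_two` remain recoverable, which an update-in-place object cannot offer.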

slide-61
SLIDE 61

CQRS

Client Command Query

Write Optimised Read Optimised

log
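
The split on this slide can be sketched as a command side that only appends to a log and a query side that builds its own read-optimised view by replaying that log. Every name here (`handle_command`, `OrdersByCustomer`, the event fields) is a hypothetical example, not an API from Kafka or from the talk.

```python
from collections import defaultdict

log = []  # write-optimised side: an append-only list of events


def handle_command(customer, item):
    """Command side: validate if needed, then append an event to the log."""
    log.append({"customer": customer, "item": item})


class OrdersByCustomer:
    """Query side: a read-optimised view derived from the log."""

    def __init__(self):
        self.offset = 0                # how far into the log this view has read
        self.view = defaultdict(list)  # customer -> items, shaped for the query

    def catch_up(self, events):
        # Apply only the events this view has not yet seen.
        for event in events[self.offset:]:
            self.view[event["customer"]].append(event["item"])
        self.offset = len(events)

    def query(self, customer):
        return list(self.view.get(customer, []))


handle_command("bob", "saw")
handle_command("dave", "plane")
handle_command("bob", "chisel")

orders = OrdersByCustomer()
orders.catch_up(log)
bobs_orders = orders.query("bob")
```

Each view tracks its own offset, so several independent views can be built from the same log without coordinating with the writer.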

slide-62
SLIDE 62

How Applications or Services share state

slide-63
SLIDE 63

Log-Centric Services

Writer Read-Replica Read-Replica Read-Replica Writes are localised to a single service

slide-64
SLIDE 64

Log-Centric Services

Writer Read-Replica Read-Replica Read-Replica Immutable log

slide-65
SLIDE 65

Log-Centric Services

Writer Read-Replica Read-Replica Read-Replica Many, independent read replicas

slide-66
SLIDE 66

Elements of Beauty

  • Reframing the problem to be Log-Centric. To go with the grain of the system.
  • Optimise for the harder problem
  • Compartmentalises writes (coordination) to a single point. Reads -> immutable structures.
slide-67
SLIDE 67

Decentralised Design

In both database design and application development

slide-68
SLIDE 68

The Log is the central building block

Pushes us towards the natural grain of the system

slide-69
SLIDE 69

The Log

A single unifying abstraction

slide-70
SLIDE 70

References

LSM:

  • benstopford.com/2015/02/14/log-structured-merge-trees/
  • smalldatum.blogspot.co.uk/2017/02/using-modern-sysbench-to-compare.html
  • www.quora.com/How-does-the-Log-Structured-Merge-Tree-work
  • bLSM paper: http://bit.ly/2mT7Vje

Other

  • Pat Helland (Immutability) cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf
  • Peter Bailis (Coordination Avoidance): http://bit.ly/2m7XxnI
  • Jay Kreps: I Heart Logs (O’Reilly 2014)
  • The Data Dichotomy: http://bit.ly/2hk9c2K
slide-71
SLIDE 71

Thank you

@benstopford http://benstopford.com ben@confluent.io