Anti-Entropy using CRDTs on HA Datastores


slide-1
SLIDE 1

Anti-Entropy using CRDTs on HA Datastores

Sailesh Mukil Senior Software Engineer, Netflix

slide-2
SLIDE 2

Timeline

  • 2011: Cassandra adoption
  • 2013: Multi-region
  • 2016: Dynomite

slide-3
SLIDE 3

Dynomite

NETFLIX

Makes non-distributed datastores distributed

slide-4
SLIDE 4 NETFLIX

Dynomite Overview

[Diagram: datastore split into three shares of 33% each]

slide-5
SLIDE 5 NETFLIX

Replica 1 Replica 2 Replica 3

Dynomite Overview

slide-6
SLIDE 6 NETFLIX

Replica 1 Replica 2 Replica 3 Client

slide-7
SLIDE 7 NETFLIX

Replica 1 Replica 2 Replica 3 Client

slide-8
SLIDE 8 NETFLIX

Replica 1 Replica 2 Replica 3 Client

slide-9
SLIDE 9 NETFLIX

Dynomite Overview

  • Global replication
  • High availability
  • Shared nothing
  • Auto-sharding
  • Linear scale
  • Pluggable datastores (Redis primarily)
  • Multiple quorum levels
  • Supports datastore API

slide-10
SLIDE 10 NETFLIX

Dynomite footprint @ Netflix

  • ~1000 customer facing nodes
  • ~1M OPS/s
  • Largest cluster holds ~6 TB
slide-11
SLIDE 11

The problem

NETFLIX

Entropy in the system

slide-12
SLIDE 12 NETFLIX

R-2 R-3 R-1

Entropy in the system SET K 123

slide-13
SLIDE 13 NETFLIX

R-2 R-3 R-1

Entropy in the system SET K 123

K: 123 K: 123 K: 123

slide-14
SLIDE 14 NETFLIX

R-2 R-3 R-1

Entropy in the system

K: 123 K: 123 K: 123

OK

slide-15
SLIDE 15 NETFLIX

R-2 R-3 R-1

Entropy in the system

K: 123 K: 123 K: 123

SET K 456

slide-16
SLIDE 16 NETFLIX

R-2 R-3 R-1

Entropy in the system

K: 123 K: 123 K: 123

SET K 456

K: 456

slide-17
SLIDE 17 NETFLIX

R-2 R-3 R-1

Entropy in the system

K: 123 K: 123 K: 123 K: 456

ERR

slide-18
SLIDE 18 NETFLIX

R-2 R-3 R-1

Entropy in the system

K: 123 K: 123 K: 123 K: 456

SET K 789

slide-19
SLIDE 19 NETFLIX

R-2 R-3 R-1

Entropy in the system

K: 123 K: 123 K: 123 K: 456

SET K 789

K: 789

slide-20
SLIDE 20 NETFLIX

R-2 R-3 R-1

Entropy in the system

K: 123 K: 123 K: 123 K: 456 K: 789

ERR

slide-21
SLIDE 21 NETFLIX

R-2 R-3 R-1 K: 123 K: 123 K: 123 K: 456

GET K

K: 789

GET K

slide-22
SLIDE 22 NETFLIX

R-2 R-3 R-1 K: 123 K: 123 K: 123 K: 456 K: 789

789 456

slide-23
SLIDE 23 NETFLIX

R-2 R-3 R-1 K: 123 K: 123 K: 123 K: 456

GET K (w/quorum)

K: 789

slide-24
SLIDE 24 NETFLIX

R-2 R-3 R-1 K: 123 K: 123 K: 123 K: 456

GET K (w/quorum)

K: 789

slide-25
SLIDE 25 NETFLIX

R-2 R-3 R-1 K: 123 K: 123 K: 123 K: 456

GET K (w/quorum)

K: 789

123 456

slide-26
SLIDE 26 NETFLIX

R-2 R-3 R-1 K: 123 K: 123 K: 123 K: 456 K: 789

ERR: QUORUM FAILED

slide-27
SLIDE 27 NETFLIX

R-2 R-3 R-1 K: 123 K: 123 K: 123 K: 456 K: 789

123 456 ERR: QUORUM FAILED

slide-28
SLIDE 28 NETFLIX

Replicas will go out of sync
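
The failure walked through on the preceding slides can be reproduced with a small sketch. It assumes that a quorum read succeeds only when at least 2 of the 3 replicas return the same value; the function names and the assignment of values to replicas are hypothetical and are not taken from Dynomite's code.

```python
from collections import Counter

# Assumed replica contents after the two partially failed writes:
# one replica still has 123, one received 456, one received 789.
replicas = {"R-1": "123", "R-2": "456", "R-3": "789"}


def quorum_read(values, quorum=2):
    """Return the value that at least `quorum` replicas agree on, else raise."""
    value, votes = Counter(values).most_common(1)[0]
    if votes < quorum:
        raise RuntimeError("ERR: QUORUM FAILED")
    return value


def try_read(values):
    """Helper that turns the quorum failure into a printable result."""
    try:
        return quorum_read(values)
    except RuntimeError as err:
        return str(err)


in_sync = try_read(["123", "123", "123"])        # all replicas agree
out_of_sync = try_read(list(replicas.values()))  # no two replicas agree
print(in_sync, "|", out_of_sync)
```

Once no two replicas agree, every quorum read on the key fails until something repairs the replicas.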

slide-29
SLIDE 29

Timeline

  • 2011: Cassandra adoption
  • 2013: Multi-region
  • 2016: Dynomite
  • 2019: Dynomite w/ CRDTs

slide-30
SLIDE 30 NETFLIX

Achieving anti-entropy (traditionally)

Last Writer Wins

  • Uses physical timestamps
  • Vulnerable to clock skew

Vector Clocks

  • Show causal relationships
  • But cannot order concurrent writes
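
To illustrate the vector clock limitation, here is a minimal sketch of a clock comparison. It assumes clocks are dicts mapping a replica name to a counter; `compare` is a hypothetical helper, not part of any particular library.

```python
def compare(a, b):
    """Compare two vector clocks given as dicts of replica -> counter."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a causally precedes b
    if b_le_a:
        return "after"       # b causally precedes a
    return "concurrent"      # neither dominates: no winner can be picked


# The second write saw the first one, so the order is known.
causal = compare({"R-1": 1}, {"R-1": 2})

# Two writes accepted by different replicas without seeing each other.
conflict = compare({"R-1": 2, "R-2": 0}, {"R-1": 1, "R-2": 1})
print(causal, conflict)
```

The clocks detect the conflict in the second case, but they give no rule for resolving it.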
slide-31
SLIDE 31

The solution

NETFLIX

Conflict-free replicated data types

slide-32
SLIDE 32

Conflict-free replicated data types

A CRDT is a data structure which can be replicated across the network, where the replicas can be updated independently and concurrently without coordination between the replicas, and where it is always mathematically possible to resolve inconsistencies which might result.

slide-33
SLIDE 33 NETFLIX

  • Associative: Grouping of operations does not matter. (X + Y) + Z = X + (Y + Z)
  • Commutative: Order of operations does not matter. X + Y = Y + X
  • Idempotent: Duplication of operations does not matter. X + X = X
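
As a concrete instance of the three properties, the sketch below uses an element-wise maximum over per-replica counters as the "+" operation. That choice of merge function is an assumption made for illustration; the slides do not say which merge Dynomite uses.

```python
def merge(a, b):
    """Merge two per-replica counter maps by taking the element-wise maximum."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}


x = {"R-1": 2, "R-2": 0}
y = {"R-1": 1, "R-2": 3}
z = {"R-3": 5}

associative = merge(merge(x, y), z) == merge(x, merge(y, z))
commutative = merge(x, y) == merge(y, x)
idempotent = merge(x, x) == x
print(associative, commutative, idempotent)
```

Because all three hold, replicas can exchange and re-apply states in any order, any grouping, and any number of times and still converge.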

slide-34
SLIDE 34 NETFLIX

Types of operations on CRDTs

  • Update: Updates local state
  • Merge: Converges replica states
slide-35
SLIDE 35 NETFLIX

Introduction to CRDTs

  • When we write, we update
  • When we repair, we merge
  • Read repair = merge on read path

slide-36
SLIDE 36 NETFLIX

CRDTs provide strong eventual consistency

Introduction to CRDTs

slide-37
SLIDE 37 NETFLIX

R-2 R-3 R-1

Naive distributed counter

CTR: 1 CTR: 1 CTR: 1

INCR CTR

slide-38
SLIDE 38 NETFLIX

R-2 R-3 R-1

Naive distributed counter

CTR: 1 CTR: 1 CTR: 1

DECR CTR INCR CTR

CTR: 0 CTR: 2

slide-39
SLIDE 39 NETFLIX

R-2 R-3 R-1

Naive distributed counter

CTR: 1 CTR: 1 CTR: 1 CTR: 0 CTR: 2

Repair based on timestamp? The latest value is 2, which is incorrect: one INCR and one DECR applied to 1 should leave the counter at 1.
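
The sketch below replays this scenario with a last-writer-wins repair. The timestamps and the choice of which replica receives which write are assumptions made for the example.

```python
# Each replica holds (value, timestamp of its last local write).
# Assumed scenario: all replicas start at 1, then one replica applies a DECR
# at t=10 and another applies an INCR at t=11; neither write is replicated.
replicas = {"R-1": (1, 0), "R-2": (0, 10), "R-3": (2, 11)}


def lww_repair(states):
    """Overwrite every replica with the value carrying the newest timestamp."""
    winner = max(states.values(), key=lambda state: state[1])
    return {name: winner for name in states}


repaired = lww_repair(replicas)
lww_value = repaired["R-1"][0]
expected_value = 1 + 1 - 1  # the INCR and the DECR should cancel out
print(lww_value, expected_value)
```

The repair keeps the INCR and silently drops the DECR, because a plain value carries no record of which operations produced it.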

slide-40
SLIDE 40

CRDT: PNCounters

NETFLIX

Each replica maintains 2 “local” counters

  • Positive counter: Tracks increments
  • Negative counter: Tracks decrements

Final counter value:

(Sum of all PCounters - Sum of all NCounters)
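
A minimal PN-Counter sketch following the description above. The class and method names are hypothetical, and the merge rule (element-wise maximum of each replica's counters) is the usual state-based formulation rather than something stated on the slide.

```python
class PNCounter:
    """State-based PN-Counter: one P and one N slot per replica."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.p = {}  # replica id -> number of increments seen from it
        self.n = {}  # replica id -> number of decrements seen from it

    def incr(self, amount=1):
        self.p[self.replica_id] = self.p.get(self.replica_id, 0) + amount

    def decr(self, amount=1):
        self.n[self.replica_id] = self.n.get(self.replica_id, 0) + amount

    def merge(self, other):
        """Take the element-wise maximum of both counter maps."""
        for mine, theirs in ((self.p, other.p), (self.n, other.n)):
            for rid, count in theirs.items():
                mine[rid] = max(mine.get(rid, 0), count)

    def value(self):
        return sum(self.p.values()) - sum(self.n.values())


r1, r2, r3 = PNCounter("R-1"), PNCounter("R-2"), PNCounter("R-3")
r1.incr()                      # INCR CTR, replicated to everyone
r2.merge(r1)
r3.merge(r1)
r2.decr()                      # DECR CTR reaches only R-2
r3.incr()                      # INCR CTR reaches only R-3
local_values = [r.value() for r in (r1, r2, r3)]

for a in (r1, r2, r3):         # repair: every replica merges the others
    for b in (r1, r2, r3):
        if a is not b:
            a.merge(b)
merged_values = [r.value() for r in (r1, r2, r3)]
print(local_values, merged_values)
```

Before the repair the replicas disagree; after merging, each one computes the same value from the same set of per-replica counters.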

slide-41
SLIDE 41 NETFLIX

R-2 R-3 R-1

CRDT: PNCounter

INCR CTR

[Diagram: P and N counters on each replica]

slide-42
SLIDE 42 NETFLIX

R-2 R-3 R-1

CRDT: PNCounter

DECR CTR, INCR CTR

[Diagram: P and N counters on each replica]

slide-43
SLIDE 43 NETFLIX

R-2 R-3 R-1

CRDT: PNCounter

[Diagram: P and N counters on each replica]

CTR = 0, CTR = 1, CTR = 2

slide-44
SLIDE 44 NETFLIX

R-2 R-3 R-1

CRDT: PNCounter

[Diagram: P and N counters on each replica]

GET CTR

slide-45
SLIDE 45 NETFLIX

R-2 R-3 R-1

CRDT: PNCounter

[Diagram: P and N counters on each replica]

GET CTR

repair (merge) on all three replicas

slide-46
SLIDE 46 NETFLIX

R-2 R-3 R-1

CRDT: PNCounter

[Diagram: P and N counters on each replica after repair]

CTR = 1, CTR = 1, CTR = 1

slide-47
SLIDE 47

CRDT: LWW-Element Set

NETFLIX

Used to maintain key metadata

  • Add set: Latest update timestamps for keys
  • Remove set: Timestamps at which keys were removed

Registers can take arbitrary values

  • Hence we still require LWW to resolve conflicts

Used for registers, hashmaps and sorted sets
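
The add set and remove set can be sketched as two timestamp maps kept next to the data. This is an in-memory illustration under assumed names (`LWWStore`, `put`, `remove`) with integer timestamps standing in for t1 to t4; it is not Dynomite's storage layout.

```python
class LWWStore:
    """Key-value data plus LWW-element-set metadata for one replica."""

    def __init__(self):
        self.data = {}  # key -> current value
        self.add = {}   # key -> timestamp of the latest accepted write
        self.rem = {}   # key -> timestamp of the latest delete

    def put(self, key, value, ts):
        if ts > max(self.add.get(key, -1), self.rem.get(key, -1)):
            self.data[key] = value
            self.add[key] = ts

    def remove(self, key, ts):
        if ts > self.rem.get(key, -1):
            self.rem[key] = ts
        if ts > self.add.get(key, -1):
            self.data.pop(key, None)

    def get(self, key):
        # A key is visible only if its latest write is newer than its delete.
        if self.add.get(key, -1) > self.rem.get(key, -1):
            return self.data.get(key)
        return None

    def merge(self, other):
        for key, ts in other.rem.items():
            self.remove(key, ts)
        for key, ts in other.add.items():
            if key in other.data:
                self.put(key, other.data[key], ts)


r1, r2, r3 = LWWStore(), LWWStore(), LWWStore()
for r in (r1, r2, r3):
    r.put("K1", "123", ts=1)   # SET K1 123 (t1) reaches every replica
r1.put("K1", "456", ts=2)      # SET K1 456 (t2) reaches one replica
r1.put("K2", "999", ts=3)      # SET K2 999 (t3) reaches two replicas
r2.put("K2", "999", ts=3)
r1.remove("K2", ts=4)          # DEL K2 (t4) reaches one replica

for a in (r1, r2, r3):         # repair: every replica merges the others
    for b in (r1, r2, r3):
        if a is not b:
            a.merge(b)
k1_values = [r.get("K1") for r in (r1, r2, r3)]
k2_values = [r.get("K2") for r in (r1, r2, r3)]
print(k1_values, k2_values)
```

After the repair every replica reports the t2 value for K1 and treats K2 as deleted, because t4 in the remove set is newer than t3 in the add set.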

slide-48
SLIDE 48 NETFLIX

R-2 R-3 R-1

LWW-Element Set SET K1 123 (t1)

add rem add rem add rem K1 t1 K1 t1 K1 t1

K1: 123 K1: 123 K1: 123

slide-49
SLIDE 49 NETFLIX

R-2 R-3 R-1

LWW-Element Set

add rem add rem add rem K1 t1 K1 t1 K1 t1

K1: 123 K1: 123 K1: 123

SET K1 456 (t2)

t2

K1: 456

slide-50
SLIDE 50 NETFLIX

R-2 R-3 R-1

LWW-Element Set

add rem add rem add rem K1 t1 K1 t1 K1 t1

K1: 123 K1: 123

t2

SET K2 999 (t3)

K2 t3

K2: 999

K2 t3

K2: 999 K1: 456

slide-51
SLIDE 51 NETFLIX

R-2 R-3 R-1

LWW-Element Set

add rem add rem add rem K1 t1 K1 t1 K1 t1

K1: 123 K1: 123

t2 K2 t3

K2: 999

K2 t3

K2: 999 K1: 456

GET K1

K1 = 456 (t2), K1 = 123 (t1)

t2 > t1 => 456 is the latest value

t2

K1: 456

repair

slide-52
SLIDE 52 NETFLIX

R-2 R-3 R-1

LWW-Element Set

add rem add rem add rem K1 K1 t1 K1 t1

K1: 123

t2 K2 t3

K2: 999

K2 t3

K2: 999 K1: 456

t2

K1: 456

“456”

repair

t2

K1: 456

slide-53
SLIDE 53 NETFLIX

R-2 R-3 R-1

LWW-Element Set

add rem add rem add rem K1 K1 K1 t1 t2 K2 t3

K2: 999

K2 t3

K2: 999 K1: 456

t2

K1: 456

t2

K1: 456

GET K2 (nil) K2 = 999 (t3)

slide-54
SLIDE 54 NETFLIX

R-2 R-3 R-1

LWW-Element Set

add rem add rem add rem K1 K1 K1 t1 t2 K2 t3

K2: 999

K2 t3

K2: 999 K1: 456

t2

K1: 456

t2

K1: 456

“999”

repair

K2 t3

K2: 999

slide-55
SLIDE 55 NETFLIX

R-2 R-3 R-1

LWW-Element Set

add rem add rem add rem K1 K1 K1 t1 t2 K2 t3

K2: 999

K2 t3

K2: 999 K1: 456

t2

K1: 456

t2

K1: 456

K2 t3

K2: 999

DEL K2 (t4)

K2 t4

slide-56
SLIDE 56 NETFLIX

R-2 R-3 R-1

LWW-Element Set

add rem add rem add rem K1 K1 K1 t1 t2 K2 t3 K2 t3

K2: 999 K1: 456

t2

K1: 456

t2

K1: 456

K2 t3

K2: 999

GET K2 “999”

K2 t4

slide-57
SLIDE 57 NETFLIX

R-2 R-3 R-1

LWW-Element Set

add rem add rem add rem K1 K1 K1 t1 t2 K2 t3 K2 t3

K2: 999 K1: 456

t2

K1: 456

t2

K1: 456

K2 t3

K2: 999

GET K2 K2 del @t4

K2 t4

K2 = 999 (t3)

K2 t4

slide-58
SLIDE 58 NETFLIX

R-2 R-3 R-1

LWW-Element Set

add rem add rem add rem K1 K1 K1 t1 t2 K2 t3 K2 t3

K2: 999 K1: 456

t2

K1: 456

t2

K1: 456

K2 t3

(nil)

K2 t4

DEL K2 (t4)

K2 t4 K2 t4

repair

slide-59
SLIDE 59

Implementation challenges (LWW-element set)

NETFLIX

Redis doesn’t maintain timestamps

Dynomite can track the timestamp of the client request

slide-60
SLIDE 60

Implementation challenges (LWW-element set)

NETFLIX

We’d like Dynomite to remain stateless

Store the metadata inside Redis

slide-61
SLIDE 61

Implementation challenges (LWW-element set)

NETFLIX

Operations must modify data and metadata atomically

Rewrite operations into Redis Lua scripts (guarantees atomicity)

slide-62
SLIDE 62

Implementation challenges (LWW-element set)

NETFLIX

Does the remove set grow forever?

  • Delete metadata from the remove set ASAP if ALL replicas agree
  • A background thread cleans the rest
  • Maintain the remove set as a sorted set
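
One possible reading of the "all replicas agree" rule is sketched below: a remove-set entry is dropped once every replica holds it with the same timestamp. The function name and the dict-based remove sets are assumptions; the background cleanup and the sorted-set representation are not modelled here.

```python
def purge_agreed_tombstones(remove_sets):
    """Drop entries that every replica has recorded with the same timestamp."""
    first, *rest = remove_sets
    agreed = {
        key for key, ts in first.items()
        if all(other.get(key) == ts for other in rest)
    }
    for rem in remove_sets:
        for key in agreed:
            del rem[key]
    return agreed


# K2's delete has reached all replicas; K5's delete has reached only one.
rem_r1 = {"K2": 4, "K5": 9}
rem_r2 = {"K2": 4}
rem_r3 = {"K2": 4}
purged = purge_agreed_tombstones([rem_r1, rem_r2, rem_r3])
print(purged, rem_r1)
```

Entries that are not yet agreed on, such as K5 here, stay in the remove set until a later repair or cleanup pass.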

slide-63
SLIDE 63

Implementation challenges (LWW-element set)

NETFLIX

What does an example Lua script look like?

  • Check if the update is old
  • Discard it if it is
  • Update data + metadata otherwise
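
The three steps above can be modelled in a few lines. This is a Python sketch of the logic only, over plain dicts; in the design described here the same check-and-update would run as a single Lua script inside Redis so that data and metadata change atomically. The function name and the metadata layout are assumptions.

```python
def apply_set(store, metadata, key, value, ts):
    """Apply a write only if it is newer than what this replica has seen."""
    last_seen = max(metadata["add"].get(key, -1), metadata["rem"].get(key, -1))
    if ts <= last_seen:
        return False              # the update is old: discard it
    store[key] = value            # update the data ...
    metadata["add"][key] = ts     # ... and the metadata in the same step
    return True


store = {}
metadata = {"add": {}, "rem": {}}
first = apply_set(store, metadata, "K1", "123", ts=1)
second = apply_set(store, metadata, "K1", "456", ts=2)
stale = apply_set(store, metadata, "K1", "123", ts=1)  # late duplicate
print(first, second, stale, store)
```

The late duplicate of the t1 write is rejected, so replaying or reordering writes cannot roll the key back to an older value.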

slide-64
SLIDE 64 NETFLIX

Repairs occur on read path in Dynomite

Repairs for point reads only

slide-65
SLIDE 65

Background repairs

NETFLIX

(Note: Ongoing work)

slide-66
SLIDE 66 NETFLIX

Background repairs

Repairing on range reads is expensive

E.g.:

  • Give me all members of a set
  • Return everything in this hashmap
  • Return me a range from this sorted set

slide-67
SLIDE 67 NETFLIX

Background repairs

How do we target keys that need repairing?

Full key walk? (like Cassandra)

slide-68
SLIDE 68 NETFLIX

Background repairs

How do we target keys that need repairing?

  • Maintain a list of recently written keys
  • Run the merge operation on them (async)
  • But merge operations on large structures are expensive

slide-69
SLIDE 69 NETFLIX

Background repairs

Delta-state CRDTs

  • Maintain a list of recent mutations done to keys
  • Ship only the delta-state instead of the entire data structure for merge
  • Confirm which replicas have received it
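
A rough sketch of the idea, using a per-replica counter map as the data structure. `DeltaLog` and its methods are hypothetical names for the list of recent mutations and the acknowledgement tracking; since this is described as ongoing work, the sketch should not be read as the actual design.

```python
class DeltaLog:
    """Buffer of recent mutations (deltas) awaiting acknowledgement."""

    def __init__(self, peers):
        self.peers = set(peers)
        self.pending = {}   # delta id -> (delta, set of peers that acked)
        self.next_id = 1

    def record(self, delta):
        delta_id = self.next_id
        self.next_id += 1
        self.pending[delta_id] = (delta, set())
        return delta_id

    def ack(self, delta_id, peer):
        _, acked = self.pending[delta_id]
        acked.add(peer)
        if acked == self.peers:
            del self.pending[delta_id]  # every peer has it: safe to forget


def merge(state, delta):
    """Element-wise max merge; works for full states and deltas alike."""
    for replica, count in delta.items():
        state[replica] = max(state.get(replica, 0), count)


r1_state = {"R1": 1, "R2": 1}
r2_state = {"R1": 1, "R2": 1}

r1_state["R1"] += 1              # INCR CTR on R1
delta = {"R1": 2}                # only the slot that changed, not the full map
log = DeltaLog(peers=["R2"])
delta_id = log.record(delta)

merge(r2_state, delta)           # R2 applies the delta ...
log.ack(delta_id, "R2")          # ... and acknowledges it
print(r2_state, log.pending)
```

The delta is merged with the same function as a full state, so shipping it twice is harmless, and the log entry is dropped only after every peer has acknowledged it.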

slide-70
SLIDE 70 NETFLIX

Background repairs

What is a delta-state?

INCR CTR

[Diagram: R1 sends the full counter state to R2]

slide-71
SLIDE 71 NETFLIX

Background repairs

What is a delta-state?

INCR CTR

[Diagram: R1 sends only the delta state, R1 = 2, to R2]

slide-72
SLIDE 72 NETFLIX

Background repairs

What is a delta-state?

R1 R3 R2 R2 R3 Mutations

𝜺-1 𝜺-2 𝜺-3 𝜺-4

slide-73
SLIDE 73 NETFLIX

Background repairs

What is a delta-state?

R1 R3 R2 R2 R3 Mutations

𝜺-1 𝜺-2 𝜺-3 𝜺-4

ACK ACK

slide-74
SLIDE 74 NETFLIX

Background repairs

What is a delta-state?

R1 R3 R2 R2 R3 Mutations

𝜺-1 𝜺-2 𝜺-3 𝜺-4

ack ack ack ack

slide-75
SLIDE 75 NETFLIX

Background repairs

What is a delta-state?

R1 R3 R2 R2 R3 Mutations

𝜺-1 𝜺-2 𝜺-3 𝜺-4

ack ack ack ack ACK

slide-76
SLIDE 76 NETFLIX

Background repairs

What is a delta-state?

R1 R3 R2 R2 R3 Mutations

𝜺-1 𝜺-2 𝜺-3 𝜺-4

ack ack ack ack

slide-77
SLIDE 77 NETFLIX

Background repairs

Challenges with delta-state CRDTs

  • Durability
  • Practical overhead of maintaining the list

slide-78
SLIDE 78

Sailesh Mukil smukil@netflix

Thank You.