Anti-Entropy using CRDTs on HA Datastores
Sailesh Mukil Senior Software Engineer, Netflix
Timeline
2011: Cassandra adoption
2013: Multi-region
2016: Dynomite
Dynomite: makes non-distributed datastores, distributed
Datastore
Replica 1 (33%), Replica 2 (33%), Replica 3 (33%)
A client talks to the replicas
Dynomite
Entropy in the system
1. SET K 123 reaches all three replicas (K: 123 everywhere); the client gets OK
2. SET K 456 reaches only one replica (K: 456 there) before failing; the client gets ERR
3. SET K 789 reaches only another replica (K: 789 there) before failing; the client gets ERR
4. A plain GET K now returns 789, 456 or 123 depending on which replica answers
5. GET K (w/quorum): the replicas answer 123, 456 and 789; no majority agrees, so the client gets ERR: QUORUM FAILED
Replicas will go out of sync
Timeline
2011: Cassandra adoption
2013: Multi-region
2016: Dynomite
2019: Dynomite w/ CRDTs
Achieving anti-entropy (traditionally)
Last Writer Wins
Vector Clocks
Conflict free replicated data types
A CRDT is a data structure which can be replicated across the network, where the replicas can be updated independently and concurrently without coordination between the replicas, and where it is always mathematically possible to resolve inconsistencies which might result.
Associative: grouping of operations does not matter, (X + Y) + Z = X + (Y + Z)
Commutative: order of operations does not matter, X + Y = Y + X
Idempotent: duplication of operations does not matter, X + X = X
Types of operations on CRDTs: Update and Merge
When we write, we update
When we repair, we merge
Read repair = merge on the read path
Introduction to CRDTs
CRDTs provide strong eventual consistency
Naive distributed counter
1. CTR is at 1 on all three replicas
2. One replica then receives DECR CTR (CTR: 0) while another concurrently receives INCR CTR (CTR: 2)
3. Repair based on timestamp? The latest value is 2, which is incorrect (the true value is 1)
Each replica maintains 2 "local" counters: a PCounter for increments and an NCounter for decrements
Final counter value:
(Sum of all PCounters - Sum of all NCounters)
CRDT: PNCounter
1. An INCR CTR is applied everywhere; each replica computes CTR = 1
2. One replica then receives DECR CTR (recorded in its NCounter) while another concurrently receives INCR CTR (recorded in its PCounter)
3. The replicas now locally compute CTR = 0, CTR = 1 and CTR = 2
4. GET CTR triggers a repair (merge): the replicas exchange their PCounters and NCounters
5. After the merge, every replica computes CTR = 1
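The walkthrough above can be sketched as a minimal PNCounter. This is an illustrative sketch, not Dynomite's actual code; the class and method names are assumptions.

```python
# Minimal PNCounter sketch: each replica bumps only its own slot in the
# P (increment) and N (decrement) maps, so concurrent updates never conflict.
class PNCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.p = {}  # replica_id -> increments applied at that replica
        self.n = {}  # replica_id -> decrements applied at that replica

    def incr(self):
        self.p[self.replica_id] = self.p.get(self.replica_id, 0) + 1

    def decr(self):
        self.n[self.replica_id] = self.n.get(self.replica_id, 0) + 1

    def value(self):
        # Final counter value: sum of all PCounters - sum of all NCounters.
        return sum(self.p.values()) - sum(self.n.values())

    def merge(self, other):
        # Element-wise max: associative, commutative and idempotent,
        # so repairs can run in any order, any number of times.
        for rid, c in other.p.items():
            self.p[rid] = max(self.p.get(rid, 0), c)
        for rid, c in other.n.items():
            self.n[rid] = max(self.n.get(rid, 0), c)

# Reproduce the slides' scenario.
r1, r2, r3 = PNCounter("r1"), PNCounter("r2"), PNCounter("r3")
r1.incr()                     # INCR CTR lands on r1
r2.merge(r1); r3.merge(r1)    # replicated: CTR = 1 everywhere
r1.decr()                     # concurrent DECR at r1 -> r1 sees CTR = 0
r3.incr()                     # concurrent INCR at r3 -> r3 sees CTR = 2
for a in (r1, r2, r3):        # read repair: merge every pair of replicas
    for b in (r1, r2, r3):
        a.merge(b)
# All replicas now agree: CTR = 1.
```

Note that a timestamp-based repair would have picked 2 here; the merge instead combines every replica's contribution, which is why the PNCounter converges to the correct value.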
Used to maintain key metadata
Registers can take arbitrary values
Used for registers, hashmaps and sorted sets
LWW-Element Set
Each replica keeps per-key metadata in an add set and a remove set, each entry tagged with the timestamp of the operation.
1. SET K1 123 (t1) reaches all three replicas: each stores K1: 123 and records (K1, t1) in its add set
2. SET K1 456 (t2) reaches only one replica: it stores K1: 456 and bumps K1's add-set entry to t2
3. SET K2 999 (t3) misses a replica: the others store K2: 999 and record (K2, t3) in their add sets
4. GET K1: the replicas answer 456 (t2) and 123 (t1); t2 > t1, so 456 is the latest value; the client gets "456" and a read repair writes K1: 456 with t2 to the stale replicas
5. GET K2: one replica answers (nil), another answers K2 = 999 (t3); the client gets "999" and the repair writes K2: 999 with t3 to the replica that missed it
6. DEL K2 (t4) reaches only one replica: it drops K2 and records (K2, t4) in its remove set
7. A GET K2 served only by a replica without the tombstone still returns "999"
8. A GET K2 that compares replicas sees "K2 del @ t4" against K2 = 999 (t3); t4 > t3, so the delete wins: the client gets (nil) and the repair propagates (K2, t4) into the other remove sets
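The timestamped add/remove bookkeeping above can be sketched as a minimal LWW-element set. The class shape and the merge helper are illustrative assumptions, not Dynomite's implementation.

```python
# Minimal LWW-element-set sketch: per key, keep the latest add timestamp
# (with its value) and the latest remove timestamp; the newer one wins.
class LWWElementSet:
    def __init__(self):
        self.add = {}  # key -> (timestamp of latest add, value)
        self.rem = {}  # key -> timestamp of latest remove

    def set(self, key, value, ts):
        if key not in self.add or ts > self.add[key][0]:
            self.add[key] = (ts, value)

    def delete(self, key, ts):
        if ts > self.rem.get(key, -1):
            self.rem[key] = ts

    def get(self, key):
        # Present only if the latest add is newer than the latest remove
        # (the delete at t4 > t3 wins in the slides' scenario).
        if key in self.add and self.add[key][0] > self.rem.get(key, -1):
            return self.add[key][1]
        return None

    def merge(self, other):
        # Take the newest add and newest remove per key: commutative,
        # associative and idempotent, so read repair can merge any replicas.
        for key, (ts, value) in other.add.items():
            self.set(key, value, ts)
        for key, ts in other.rem.items():
            self.delete(key, ts)

# Reproduce the slides' scenario.
r1, r2, r3 = LWWElementSet(), LWWElementSet(), LWWElementSet()
for r in (r1, r2, r3):
    r.set("K1", "123", ts=1)   # SET K1 123 (t1) reaches everyone
r1.set("K1", "456", ts=2)      # SET K1 456 (t2) reaches only r1
r2.set("K2", "999", ts=3)      # SET K2 999 (t3) misses r3
r2.delete("K2", ts=4)          # DEL K2 (t4)
for a in (r1, r2, r3):         # read repair: merge every pair of replicas
    for b in (r1, r2, r3):
        a.merge(b)
# All replicas agree: K1 = "456", K2 deleted.
```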
Redis doesn’t maintain timestamps
Dynomite can track the timestamp of the client request
We’d like Dynomite to remain stateless
Store the metadata inside Redis
Operations must modify data and metadata atomically
Rewrite operations into Redis Lua scripts (guarantees atomicity)
Does the remove set grow forever?
Delete metadata from the remove set ASAP once ALL replicas agree
A background thread cleans up the rest
Maintain the remove set as a sorted set
What does an example Lua script look like?
Check if the update is old
Discard it if it is
Otherwise, update data + metadata
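In Dynomite the logic above runs as a Redis Lua script under EVAL, which is what makes the data + metadata write atomic. As a rough sketch of that script's control flow (expressed in Python against plain dicts standing in for Redis; the function and variable names are illustrative, not Dynomite's):

```python
# Sketch of the script's logic for a SET: data holds key -> value,
# meta holds key -> timestamp of the last accepted write.
def lww_set(data, meta, key, value, ts):
    # Check if the update is old: compare against the stored timestamp.
    if ts <= meta.get(key, -1):
        return False          # discard stale updates
    # Otherwise update data + metadata together. In Redis this whole
    # body is one Lua script, so the pair of writes is atomic.
    data[key] = value
    meta[key] = ts
    return True

data, meta = {}, {}
lww_set(data, meta, "K", "456", ts=2)
accepted = lww_set(data, meta, "K", "123", ts=1)   # stale write, discarded
```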
Repairs occur on read path in Dynomite
Repairs for point reads only
(Note: Ongoing work)
Repairing on range reads is expensive
E.g.: give me all members of a set; return everything in this hashmap; return a range from this sorted set
Background repairs
How do we target keys that need repairing?
Full key walk? (like Cassandra)
Instead: maintain a list of recently written keys and run the merge operation on them (async)
But merge operations on large structures are expensive
Delta-state CRDTs
Maintain a list of recent mutations done to keys
Ship only the delta-state instead of the entire data structure for the merge
Confirm which replicas have received it
What is a delta-state?
Full state: after an INCR CTR on R1, R1 ships its entire counter state to R2 for the merge
Delta state: R1 ships only the entry that changed (R1 = 2)
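The full-state vs delta-state difference above can be sketched for a per-replica counter map (slot-per-replica as in the PNCounter; the helper names are illustrative):

```python
# Full-state vs delta-state shipping for a per-replica counter map.
def incr(state, replica_id):
    state[replica_id] = state.get(replica_id, 0) + 1
    # The delta is only the entry that changed, e.g. {"R1": 2},
    # not the whole map.
    return {replica_id: state[replica_id]}

def merge(state, incoming):
    # Works the same whether `incoming` is a full state or a delta:
    # element-wise max over whatever entries were shipped.
    for rid, c in incoming.items():
        state[rid] = max(state.get(rid, 0), c)

r1 = {"R1": 1, "R2": 1}
r2 = {"R1": 1, "R2": 1}
delta = incr(r1, "R1")   # R1's slot goes to 2
merge(r2, delta)         # ship just {"R1": 2} instead of r1's full state
```

Because merge is the same element-wise max either way, shipping the delta saves bandwidth on large structures without changing the convergence guarantee.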
Background repairs
Each replica keeps a log of recent mutations (𝜺-1, 𝜺-2, 𝜺-3, 𝜺-4), each tagged with the replicas it still needs to reach
Peers ACK each mutation they receive
Once all replicas have ACKed a mutation, it can be dropped from the log
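The ACK bookkeeping in these frames can be sketched as a small mutation log, assuming each mutation tracks the peers that still owe an acknowledgment (the structure is an illustrative assumption):

```python
# Each replica keeps recent mutations plus the peers that have not yet
# ACKed them; a fully-ACKed mutation can be dropped from the log.
class MutationLog:
    def __init__(self, peers):
        self.peers = set(peers)
        self.pending = {}  # mutation id -> peers still awaiting ACK

    def record(self, mutation_id):
        self.pending[mutation_id] = set(self.peers)

    def ack(self, mutation_id, peer):
        waiting = self.pending.get(mutation_id)
        if waiting is not None:
            waiting.discard(peer)
            if not waiting:                # every replica has ACKed
                del self.pending[mutation_id]

log = MutationLog(peers=["R2", "R3"])
log.record("e-1")
log.ack("e-1", "R2")   # still waiting on R3
log.ack("e-1", "R3")   # fully ACKed -> dropped from the log
```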
Challenge with Delta-state CRDTs
Durability
Practical overhead of maintaining the mutation list
Sailesh Mukil smukil@netflix