Just-Right Consistency: Closing the CAP Gap
Christopher S. Meiklejohn (@cmeik), Peter Lash
LightKone
Outline: Closing the CAP Gap
Just-Right Consistency: available as possible, and consistent when necessary
The first database that provides transactions with strong semantics, targeted at the JRC approach
Antidote’s path forward from research to company and product
[Figure: world map with a single replica, A]
Centralized database.
Clients read and write against the primary copy.
[Figure: world map with three replicas: A, B, and C]
Geo-replicated for both fault-tolerance and high-availability.
Clients read and write locally for low-latency.
What happens if C can’t communicate with other replicas?
Maintains “single system image”
Coordination is expensive; Spanner typically has to wait 100ms to commit an update transaction
Operations issued against local copy, and across the cluster in parallel
Stale reads and write conflicts will occur without synchronization
Single system image: high cost, low availability, synchronization
Local operations: low cost, high availability, anomalies
Choosing one model for everything is either over-conservative or risks anomalies
Instead, tailor consistency choices to the application-level invariants of each operation
Applications that are correct when executed sequentially should remain correct under concurrency
Strongest AP model; invariants that only require “one way” communications
Transactions that require coordination; “two way” communication invariants
Identify and verify application has sufficient synchronization to ensure application invariants
Danish National Joint Medicine Card; operating 24x7 since 2013 for 6 million Danish citizens
Involves patient, pharmacy, and doctor management around active prescriptions in Denmark
“Correct individually” (the C in ACID): each operation preserves the application-level invariants
Create prescription for patient, doctor, pharmacy
Add or increase medication to prescription
Deliver a medication by a pharmacy
Query functions to return information about prescriptions
Create a prescription and reference it by a patient
Create prescription, then update doctor, patient, and pharmacy
Medication should not be over-delivered
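The operations and invariants listed above can be sketched as a purely sequential model. This is only an illustration in Python: the class `MedicineCard`, its method names, and its fields are hypothetical and do not reflect the real system's data model.

```python
class MedicineCard:
    """Sequential model: every operation leaves the invariants intact."""

    def __init__(self):
        self.remaining = {}  # prescription id -> doses still deliverable
        self.refs = {"doctor": {}, "patient": {}, "pharmacy": {}}

    def create_prescription(self, rx, doctor, patient, pharmacy, doses):
        # Create the prescription first, then reference it from each party.
        self.remaining[rx] = doses
        for role, who in (("doctor", doctor), ("patient", patient), ("pharmacy", pharmacy)):
            self.refs[role].setdefault(who, set()).add(rx)

    def add_medication(self, rx, doses):
        self.remaining[rx] += doses

    def deliver(self, rx, doses):
        # Precondition: never deliver more than what remains.
        if doses > self.remaining[rx]:
            raise ValueError("medication would be over-delivered")
        self.remaining[rx] -= doses

    def prescriptions_of(self, role, who):
        return sorted(self.refs[role].get(who, set()))

    def invariants_hold(self):
        referenced = {rx for index in self.refs.values()
                      for ids in index.values() for rx in ids}
        no_dangling_refs = referenced <= set(self.remaining)
        never_negative = all(n >= 0 for n in self.remaining.values())
        return no_dangling_refs and never_negative


card = MedicineCard()
card.create_prescription("rx1", doctor="dr_a", patient="pt_b", pharmacy="ph_c", doses=2)
card.add_medication("rx1", 1)
card.deliver("rx1", 3)
try:
    card.deliver("rx1", 1)   # nothing left, so this must be rejected
    rejected = False
except ValueError:
    rejected = True
```

Run sequentially, each operation checks its own precondition and the invariants hold after every step; the rest of the talk is about keeping that true when replicas run these operations concurrently.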
Updates occur locally without blocking, no synchronization in the critical path
Updates are fast, available, and exploit concurrency
Relative order and joint update invariants can be preserved
[Figure: timeline for replicas RA and RB. After set(1), the replicas concurrently apply set(2) and set(3); once the updates are exchanged, one replica holds 2 and the other holds 3.]
Concurrent assignments don’t commute!
How do we deterministically pick a value to keep?
Do we use a timestamp (as Cassandra does) and drop a value?
[Figure: the same timeline, with both replicas resolving the conflict to max(2,3) = 3]
Deterministic conflict resolution function.
CRDTs generalize this framework.
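The difference between plain assignment and a deterministic resolution function can be reproduced in a few lines. A minimal Python sketch; the helper names `apply_in_order` and `resolve_with_max` are made up for this illustration.

```python
from itertools import permutations


def apply_in_order(initial, writes):
    """Plain register: whichever assignment is applied last wins."""
    value = initial
    for w in writes:
        value = w
    return value


def resolve_with_max(initial, writes):
    """Register whose conflict resolution keeps the maximum value seen."""
    value = initial
    for w in writes:
        value = max(value, w)
    return value


concurrent = [2, 3]  # set(2) at one replica, set(3) at the other

# Each replica may receive the two concurrent writes in a different order.
plain = {apply_in_order(1, order) for order in permutations(concurrent)}
resolved = {resolve_with_max(1, order) for order in permutations(concurrent)}

print(plain)     # two distinct final values: the replicas diverge
print(resolved)  # one final value: the replicas converge
```

Because max is commutative, associative, and idempotent, the delivery order no longer matters.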
Extension of sequential data type that encapsulates deterministic merge function
Sets, counters, registers, flags, maps
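As an illustration of a data type that carries its own merge function, here is a grow-only counter, a standard textbook CRDT. This Python sketch is an assumption made for illustration and is not Antidote's implementation; the class name `GCounter` is ours.

```python
class GCounter:
    """State-based grow-only counter: one slot per replica, merged by max."""

    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def increment(self, replica, amount=1):
        # Each replica only ever advances its own slot.
        self.counts[replica] = self.counts.get(replica, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Slot-wise max is commutative, associative, and idempotent.
        keys = self.counts.keys() | other.counts.keys()
        return GCounter({k: max(self.counts.get(k, 0), other.counts.get(k, 0))
                         for k in keys})


a, b = GCounter(), GCounter()
a.increment("RA", 2)   # concurrent updates at two replicas
b.increment("RB", 3)

ab = a.merge(b)
ba = b.merge(a)
```

Merging in either order, or merging the same state twice, yields the same counter, which is what lets replicas exchange state without coordination.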
Maintain program order implication invariant.
For instance, P => Q.
[Figure: timeline for replicas RA and RB, with updates true(Q) and true(P)]
Make Q true.
Make P true.
Program order implies ordering relationship.
Ordering is respected at other replicas.
Out-of-order propagation violates invariant!
P is true, Q is NOT true!
Change default administrator password.
Enable administrator login.
Replica A is secure.
Replica B is secure.
Reordering allows the default password to be used to log in!
Ensure updates are delivered in the causal order [Lamport 78]
Always able to return some compatible version for an object
Causal consistency is sufficient for providing referential integrity in an AP database
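One way to picture causal delivery for the administrator-password example is a replica that buffers an update until everything it depends on has been applied. This Python sketch assumes each update carries an explicit set of dependency identifiers; real systems usually encode dependencies more compactly (for example with vector clocks), and the names `Replica`, `deps`, and the update ids are hypothetical.

```python
class Replica:
    """Applies an update only after all of its dependencies have been applied."""

    def __init__(self):
        self.state = {}       # key -> value
        self.applied = set()  # ids of updates already applied
        self.pending = []     # updates still waiting for dependencies

    def receive(self, update):
        self.pending.append(update)
        self._drain()

    def _drain(self):
        # Keep applying buffered updates until no more become deliverable.
        progress = True
        while progress:
            progress = False
            for update in list(self.pending):
                if update["deps"] <= self.applied:
                    self.state[update["key"]] = update["value"]
                    self.applied.add(update["id"])
                    self.pending.remove(update)
                    progress = True


change_password = {"id": "u1", "deps": set(), "key": "password_changed", "value": True}
enable_login = {"id": "u2", "deps": {"u1"}, "key": "login_enabled", "value": True}

rb = Replica()
rb.receive(enable_login)             # arrives first, out of order
state_after_first = dict(rb.state)   # nothing visible yet: u2 is buffered
rb.receive(change_password)          # u1 applies, which then releases u2
```

The replica never exposes a state in which login is enabled while the default password is still in place, even though the network delivered the updates in the wrong order.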
[Figure: timeline for replicas RA and RB and client C1, with updates create Rx, update Dr(Rx), update Pt(Rx), and update Ph(Rx)]
Client performing reads.
Create prescription.
Add reference in doctor record.
Add reference in patient record.
Add reference in pharmacy record.
Updates are causally consistent.
Client can read inconsistent state.
Client is missing update to pharmacy.
[Figure: the same updates grouped into transaction T1, followed by transaction T2]
Group updates into an atomic transaction.
Updates reflect the “all-or-nothing” property through snapshots.
Transactions are delivered in causal order.
Therefore, snapshots are causally consistent.
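The all-or-nothing behaviour can be sketched with a store in which a snapshot is simply a prefix of the committed-transaction log. This is a deliberately simplified single-node Python model, not Antidote's protocol; `Store`, `commit`, and `read` are names invented for the example.

```python
class Store:
    """Keeps committed transactions in order; a snapshot is a prefix of that log."""

    def __init__(self):
        self.log = []  # each entry holds all writes of one committed transaction

    def commit(self, writes):
        # The whole write set becomes visible in a single step.
        self.log.append(dict(writes))
        return len(self.log)  # snapshot id that includes this transaction

    def read(self, snapshot, keys):
        view = {}
        for writes in self.log[:snapshot]:
            view.update(writes)
        return {k: view.get(k) for k in keys}


store = Store()
before = len(store.log)  # snapshot taken before T1 commits
t1 = store.commit({"Rx": "rx-1", "Dr": ["rx-1"], "Pt": ["rx-1"], "Ph": ["rx-1"]})

keys = ["Rx", "Dr", "Pt", "Ph"]
old_view = store.read(before, keys)  # sees none of T1's updates
new_view = store.read(t1, keys)      # sees all of T1's updates
```

A reader either observes the prescription together with all three references or observes none of them; the state where the pharmacy reference is missing cannot be read.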
[Figure: replicas RA, RB, and RC, each starting with a count of 2; RA delivers one medication]
Three replicas each with two available medications.
Replica A checks precondition and delivers medication.
Correct outcome where one medication remains.
[Figure: replicas RA, RB, and RC, each starting with a count of 2; RA delivers one medication while RC applies add(3)]
Three replicas each with two available medications.
Replica A checks precondition and delivers medication.
Replica C adds three medications to the prescription.
Correct outcome with four remaining medications.
[Figure: replicas RA, RB, and RC, each starting with a count of 2; RA delivers one medication while RC delivers two]
Three replicas each with two available medications.
Replica A checks precondition and delivers medication.
Replica C concurrently checks precondition and delivers two medications.
Incorrect outcome violating non-negative invariant.
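The three scenarios above can be replayed with a counter whose updates are propagated as deltas. A minimal Python sketch, assuming each replica checks the precondition only against its local copy; the function names are illustrative.

```python
def deliver(local_value, amount):
    """Check the precondition locally, then return the delta to propagate."""
    if local_value - amount < 0:
        raise ValueError("medication would be over-delivered")
    return -amount


def add(amount):
    """Increasing the prescription has no precondition."""
    return amount


def converge(initial, deltas):
    """Every replica eventually applies every delta; addition commutes."""
    return initial + sum(deltas)


# Scenario 1: a single delivery at replica A.
s1 = converge(2, [deliver(2, 1)])

# Scenario 2: replica A delivers one while replica C concurrently adds three.
s2 = converge(2, [deliver(2, 1), add(3)])

# Scenario 3: replicas A and C deliver concurrently; each local check passes.
s3 = converge(2, [deliver(2, 1), deliver(2, 2)])

print(s1, s2, s3)
```

Both deliveries in the third scenario pass their local precondition against a count of 2, yet the converged value is negative: the effects commute, but the precondition of one delivery is not preserved by the other.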
Prevent operations from proceeding without synchronization to enforce invariant
Allow operation to proceed, knowing that the invariant may be violated under concurrent operations
[Figure: replicas RA, RB, and RC, with the invariant I and the preconditions Upre and Vpre marked as open questions]
Analyze possible pairs of concurrent operations to identify those where the invariant can be violated.
Individual operations never violate the invariant
Concurrent effects commute
Preconditions are stable under every pair
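The three checks above can be illustrated by brute force on the medication counter. This Python sketch enumerates a small, hand-picked range of states and two hypothetical operations; it is a toy stand-in for the kind of analysis the talk refers to, not that analysis itself.

```python
from itertools import product


def invariant(n):
    return n >= 0


# Each operation: (precondition on the local state, effect applied at every replica).
OPS = {
    "add(1)": (lambda n: True, lambda n: n + 1),
    "deliver(1)": (lambda n: n >= 1, lambda n: n - 1),
}
STATES = range(0, 6)  # small finite range, enough to expose the problem


def individually_safe(op):
    """Rule 1: run alone from a valid state, the operation keeps the invariant."""
    pre, eff = OPS[op]
    return all(invariant(eff(n)) for n in STATES if invariant(n) and pre(n))


def commute(u, v):
    """Rule 2: applying the two effects in either order gives the same state."""
    eff_u, eff_v = OPS[u][1], OPS[v][1]
    return all(eff_u(eff_v(n)) == eff_v(eff_u(n)) for n in STATES)


def stable(u, v):
    """Rule 3: u's precondition still holds after v's concurrent effect."""
    pre_u = OPS[u][0]
    pre_v, eff_v = OPS[v]
    return all(pre_u(eff_v(n)) for n in STATES if pre_u(n) and pre_v(n))


unsafe_pairs = [(u, v) for u, v in product(OPS, OPS) if not stable(u, v)]
print(unsafe_pairs)
```

Only the pair of two concurrent deliveries fails the stability check, so that is the one pair that needs synchronization; additions and a delivery concurrent with an addition can proceed without it.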
Developed in Erlang, on top of the Riak Core distributed systems framework
Only industrial-grade database providing both causal consistency and all-or-nothing transactions
Currently under development, but an alpha release of the product is available on GitHub
[Figure: data centers A and B, nodes N1 and N2, and the components TxnMgr, Materializer, Log, and InterDC-Repl]
Each data center contains multiple nodes, each operating a transaction manager, materializers, and a log.
Strong consistency inside the data center, with a causal consistency protocol running in the wide area.
Register
Set
Map
Counter
Graph
Sequence
%% ClientIdentifier is assumed to be bound earlier.
User1 = {michel, antidote_crdt_mvreg, user_bucket},
{ok, Time2} = antidote:update_objects(ignore, [],
    [{User1, assign, {["Michel", "michel@blub.org"], ClientIdentifier}}]),
%% Pass Time2 as the minimum snapshot time so the read observes the update.
{ok, Result, _Time3} = antidote:read_objects(Time2, [], [User1]).
Identify an object by object identifier.
Use the update API to assign a value to this register.
Read the object, providing a minimum snapshot time.
{ok, TxId} = antidote:start_transaction(Timestamp, []),
{ok, _} = antidote:read_objects([Set], TxId),
{ok, _} = antidote:commit_transaction(TxId).
Start a transaction with the transaction API, with a given snapshot time and return a transaction identifier.
Read objects using the interactive transaction API.
Update objects using the interactive transaction API.
Once finished updating, commit the transaction.
[Figure: throughput (Kops/s) by read(update) ratio, 99(1), 90(10), 75(25), and 50(50), for DCs × servers configurations 1 × 5, 1 × 10, 1 × 25, 2 × 25, and 3 × 25; LWW registers, 100k keys per partition, power-law distribution]
[Figure: throughput (Kops/s) of Eiger, GR, Cure, and EC by read(update) ratio, 99(1), 90(10), 75(25), and 50(50); 3 DCs × 25 servers, LWW registers]
[Figure: throughput (Kops/s) of Cure and EC at 1KB and 10KB by read(update) ratio, 99(1), 90(10), 75(25), and 50(50); 3 DCs × 25 servers, CRDT sets]
Antidote provides no replication within the data center and assumes only geo-replication at the moment
For Antidote to provide all of JRC, it needs ACID transaction support: no research needed, only implementation
Originally a research prototype to build a database requiring reduced synchronization (SyncFree FP7) with Basho, Rovio, and Trifork
LightKone (H2020) will investigate moving AntidoteDB close to the edge to provide DDN services
Obtaining seed funding to start a company to industrialize AntidoteDB
AntidoteDB
Documentation for AntidoteDB
Website
Try out Antidote!