SLIDE 1 Paxos Consensus, Abstracted and Deconstructed
Álvaro García Pérez, Alexey Gotsman, Yuri Meshman, and Ilya Sergey
April 19th 2018
SLIDE 4 Consensus
- Several nodes, which can crash
- Each node proposes a value
- All non-crashed nodes agree on a single value
[Diagram: nodes propose v1, v2, v3; one node has crashed (✘); the remaining nodes agree on v2]
SLIDE 5
Deterministic state machine
c1 c2 c3
Clients submit commands
SLIDE 6
c1 c2 c3
Machine totally orders commands and computes the sequence of results
Deterministic state machine
r1, r2, r3 c1, c2, c3
SLIDE 7
c1 c2 c3
Machine totally orders commands and computes the sequence of results
Deterministic state machine ✘
c1, c2, c3
SLIDE 8
State machine replication
Clients send commands to all replicas
Replicas may receive commands in different orders
[Diagram: clients submit c1, c2, c3; the replicas receive them as (c3, c2, c1), (c1, c2, c3) and (c2, c1, c3)]
SLIDE 9 State machine replication
- Totally order commands via a sequence of consensus instances
[Diagram: the replicas receive (c3, c2, c1), (c1, c2, c3) and (c2, c1, c3); after consensus every replica holds the order c2, c1, c3]
SLIDE 10
State machine replication
Replicas compute the same sequence of results
[Diagram: every replica executes the agreed order c2, c1, c3 and returns r2, r1, r3]
SLIDE 12
State machine replication
Replicas compute the same sequence of results
[Diagram: one replica has crashed (✘); the remaining replicas still return r2, r1, r3 for the agreed order c2, c1, c3]
Correctness: the replicated implementation is linearizable with respect to the single-server one, so replication is transparent to clients
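The idea behind these slides can be illustrated with a minimal sketch. The `Counter` machine and its `add`/`mul` commands are hypothetical, chosen only because their results depend on the order of application; the agreed order is simply given here, whereas in the talk it comes from consensus.

```python
from dataclasses import dataclass


@dataclass
class Counter:
    """Hypothetical deterministic state machine (not from the talk)."""
    value: int = 0

    def apply(self, cmd):
        op, arg = cmd
        if op == "add":
            self.value += arg
        elif op == "mul":
            self.value *= arg
        else:
            raise ValueError(f"unknown command {op!r}")
        return self.value


def run_replica(agreed_log):
    """Apply the agreed command sequence to a fresh machine, collecting results."""
    machine = Counter()
    return [machine.apply(cmd) for cmd in agreed_log]


c1, c2, c3 = ("add", 2), ("mul", 5), ("add", 1)

# Replicas may receive c1, c2, c3 in different orders, but all of them
# execute the single order fixed by consensus (assumed here: c2, c1, c3).
agreed = [c2, c1, c3]
results = [run_replica(agreed) for _ in range(3)]
```

Because the machine is deterministic, every replica that applies the same order produces the same results; applying the commands in a different order generally would not.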
SLIDE 13 The zoo of consensus protocols
- Viewstamped replication (1988)
- Paxos (1998)
- Disk Paxos (2003)
- Cheap Paxos (2004)
- Generalized Paxos (2004)
- Paxos Commit (2004)
- Fast Paxos (2006)
- Stoppable Paxos (2008)
- Mencius (2008)
- Vertical Paxos (2009)
- ZAB (2009)
- Ring Paxos (2010)
- Egalitarian Paxos (2013)
- Raft (2014)
- M2Paxos (2016)
- Flexible Paxos (2016)
- Caesar (2017)
SLIDE 14 The zoo of consensus protocols
Complex protocols: constant fight for better performance
SLIDE 24
Broken [Michael+ 2016]
SLIDE 26 Goals
- Develop methods for proving protocols correct, including realistic deployments
- Get insights into their structure
- Focus on single-decree Paxos and Multi-Paxos
SLIDE 28 Approach
P1 P2 P3
- Modular reasoning: verify parts of the protocol separately instead of the whole thing
- Linearizability implies refinement [Filipovic+ 2009]
SLIDE 29 Approach
P1 P2 P3
P1 ⊑ S1
SLIDE 31 Approach
S1 P2 P3
P1 ⊑ S1
atomic { ... }
SLIDE 32 Approach
S1 P2 P3
P1 ⊑ S1
atomic { ... }
P2(S1) ⊑ S2
SLIDE 33 Approach
S2 P3
P2(S1) ⊑ S2 P1 ⊑ S1
atomic { ... ... }
SLIDE 34 Approach
S2 P3
P2(S1) ⊑ S2 P1 ⊑ S1
atomic { ... ... }
P3(S2) ⊑ S3
SLIDE 35 Approach
S3
P2(S1) ⊑ S2 P3(S2) ⊑ S3 P1 ⊑ S1
atomic { ... ... ... }
SLIDE 37 Approach
- Modular reasoning: verify parts of the protocol separately instead of the whole thing
- Linearizability implies refinement [Filipovic+ 2009]
- Transformations of the network semantics, à la Verified System Transformers of the Verdi framework [Wilcox+ 2015]
Prove one variant of the protocol without unpacking the proof of a simpler variant
SLIDE 38
- Acceptors = members of parliament: can vote to accept a value, majority wins
- Proposer = parliament speaker: proposes its value to vote on
[Diagram: nodes 1, 2, 3 with values v1, v2, v3, labelled as acceptors and proposers]
SLIDE 39
- Phase 1: a proposer chooses a round r and convinces a majority of acceptors to switch to r
- An acceptor switches only if its current round is lower than r
[Diagram: nodes 1, 2, 3; acceptor states: (Round#: 0, Accepted: ?), (Round#: 0, Accepted: ?), (Round#: 0, Accepted: ?)]
SLIDE 40
- Phase 1: a proposer chooses a round r and convinces a majority of acceptors to switch to r
- An acceptor switches only if its current round is lower than r
[Diagram: nodes 1, 2, 3; message sent: r; acceptor states: (Round#: r, Accepted: ?), (Round#: 0, Accepted: ?), (Round#: 0, Accepted: ?)]
SLIDE 41
- Phase 1: a proposer chooses a round r and convinces a majority of acceptors to switch to r
- An acceptor switches only if its current round is lower than r
[Diagram: nodes 1, 2, 3; reply: ok; acceptor states: (Round#: r, Accepted: ?), (Round#: r, Accepted: ?), (Round#: 0, Accepted: ?)]
SLIDE 42
- Phase 2: the proposer sends its value tagged with the round number
- An acceptor only accepts a value tagged with the round it is in
[Diagram: nodes 1, 2, 3; message sent: r, v2; acceptor states: (Round#: r, Accepted: v2), (Round#: r, Accepted: ?), (Round#: 0, Accepted: ?)]
SLIDE 43
- Phase 2: the proposer sends its value tagged with the round number
- An acceptor only accepts a value tagged with the round it is in
[Diagram: nodes 1, 2, 3; reply: ok; acceptor states: (Round#: r, Accepted: v2), (Round#: r, Accepted: v2), (Round#: 0, Accepted: ?); annotation: ✔ Reply v2 to client]
SLIDE 44
- Phase 1: a proposer chooses a round r' and convinces a majority of acceptors to switch to r'
[Diagram: nodes 1, 2, 3; message sent: r'; acceptor states: (Round#: r, Accepted: v2), (Round#: r, Accepted: v2), (Round#: r', Accepted: ?); annotation: ✔ Reply v2 to client]
SLIDE 45
- Phase 1: a proposer chooses a round r' and convinces a majority of acceptors to switch to r'
- An acceptor sends the proposer its round number and value
[Diagram: nodes 1, 2, 3; reply: ok, r, v2; acceptor states: (Round#: r, Accepted: v2), (Round#: r', Accepted: v2), (Round#: r', Accepted: ?); annotation: ✔ Reply v2 to client]
SLIDE 47
- Phase 1: a proposer chooses a round r' and convinces a majority of acceptors to switch to r'
- An acceptor sends the proposer its round number and value
- If some acceptor has accepted a value, the proposer proposes the value with the highest round number
[Diagram: nodes 1, 2, 3; reply: ok, r, v2; acceptor states: (Round#: r, Accepted: v2), (Round#: r', Accepted: v2), (Round#: r', Accepted: v2); annotation: ✔ Reply v2 to client]
Ensures that the chosen value v2 will not be changed
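The walkthrough above can be replayed with a minimal sketch of the acceptor rules. The class and function names, the use of direct method calls in place of messages, and the concrete round numbers 1 and 2 (standing for r and r') are assumptions made for illustration; the acceptance rule follows the slide wording ("tagged with the round it is in").

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Acceptor:
    current_round: int = 0                        # Round#
    accepted: Optional[Tuple[int, str]] = None    # (round, value), None for "?"

    def on_phase1(self, r):
        """Switch to round r only if the current round is lower; reply with the accepted pair."""
        if r > self.current_round:
            self.current_round = r
            return ("ok", self.accepted)
        return None  # no reply for a stale round

    def on_phase2(self, r, value):
        """Accept the value only if it is tagged with the round this acceptor is in."""
        if r == self.current_round:
            self.accepted = (r, value)
            return "ok"
        return None


def choose_value(replies, own_value):
    """If some acceptor has accepted a value, take the one with the highest round."""
    pairs = [rep[1] for rep in replies if rep is not None and rep[1] is not None]
    if not pairs:
        return own_value
    return max(pairs, key=lambda pair: pair[0])[1]


a1, a2, a3 = Acceptor(), Acceptor(), Acceptor()

# A proposer with value v2 runs round r = 1 on the majority {a1, a2}.
replies = [a.on_phase1(1) for a in (a1, a2)]
v = choose_value(replies, "v2")
acks = [a.on_phase2(1, v) for a in (a1, a2)]

# A later proposer with value v3 runs round r' = 2 on the majority {a2, a3}.
replies_later = [a.on_phase1(2) for a in (a2, a3)]
v_later = choose_value(replies_later, "v3")
```

Any two majorities intersect, so the second proposer necessarily hears from an acceptor that already holds v2 and adopts it instead of its own v3.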
SLIDE 48 Modular structure in single-decree Paxos
- Steal abstractions from an existing analysis of Paxos [Boichat+ 2003]
- Show their linearizability ➜ modular proof of Paxos
SLIDE 49 Round Based Register
[Boichat+ 2003]
- Register encapsulating the state of acceptors
- read: Phase 1 of Paxos
- write: Phase 2 of Paxos
[Diagram: module stack RB Register, RB Consensus, Paxos]
SLIDE 50 Round Based Register - read
read(int k) {
  query acceptors and switch them to round k;
  if (majority of acceptors acknowledge) {
    if (no acceptor has accepted a value) {
      return (true, undef);
    } else {
      v := value at acceptor with highest round;
      return (true, v);
    }
  } else {
    return (false, undef);
  }
}
SLIDE 53 Round Based Register - write
write(int k, val v) {
  update acceptors at round k with value v;
  if (majority of acceptors acknowledge) {
    return true;
  } else {
    return false;
  }
}
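A minimal executable sketch of the two register operations follows. It is an illustration under assumptions, not the implementation from [Boichat+ 2003]: acceptors live in one process, the hypothetical `reachable` argument stands in for the set of acceptors that actually receive the message, and `read` returns `(True, None)` when a majority acknowledges but nothing has been accepted yet, so that a caller such as proposeRC can fall back to its own value.

```python
class Acceptor:
    def __init__(self):
        self.round = 0        # highest round this acceptor has switched to
        self.accepted = None  # (round, value) of the last accepted write, or None


class RoundBasedRegister:
    def __init__(self, n):
        self.acceptors = [Acceptor() for _ in range(n)]

    def _is_majority(self, count):
        return count > len(self.acceptors) // 2

    def read(self, k, reachable):
        """Phase 1: switch reachable acceptors to round k and collect what they accepted."""
        acks = []
        for i in reachable:
            acc = self.acceptors[i]
            if k > acc.round:          # switch only if the current round is lower
                acc.round = k
                acks.append(acc.accepted)
        if not self._is_majority(len(acks)):
            return (False, None)
        pairs = [pair for pair in acks if pair is not None]
        if not pairs:
            return (True, None)        # nothing accepted so far
        return (True, max(pairs, key=lambda pair: pair[0])[1])

    def write(self, k, v, reachable):
        """Phase 2: acceptors accept v only if they are in round k."""
        acks = 0
        for i in reachable:
            acc = self.acceptors[i]
            if k == acc.round:
                acc.accepted = (k, v)
                acks += 1
        return self._is_majority(acks)


reg = RoundBasedRegister(3)
first_read = reg.read(1, reachable=[0, 1])
wrote = reg.write(1, "v2", reachable=[0, 1])
stale_write = reg.write(1, "v9", reachable=[2])   # acceptor 2 never joined round 1
later_read = reg.read(2, reachable=[1, 2])
```

The later read overlaps the earlier write in acceptor 1, which is why it returns the value written in round 1.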
SLIDE 56 Round Based Consensus
[Boichat+ 2003]
- Routine leading Phase 1 and Phase 2 in the Paxos algorithm
[Diagram: module stack RB Consensus, Paxos]
SLIDE 57 Round Based Consensus - proposeRC
proposeRC(int k, val v0) {
  (res, v) := read(k);
  if (res) {
    if (v = undef) { v := v0; }
    res := write(k, v);
    if (res) { return (true, v); }
  }
  return (false, undef);
}
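The control flow of proposeRC can be exercised in isolation with a small sketch. `propose_rc` transcribes the pseudocode, with `None` standing for undef; `ScriptedRegister` is a hypothetical stand-in that returns fixed answers, used only to drive each branch.

```python
def propose_rc(register, k, v0):
    """One attempt at consensus in round k, proposing v0 if the register holds no value."""
    ok, v = register.read(k)
    if ok:
        if v is None:          # undef: nothing accepted so far, use our own value
            v = v0
        if register.write(k, v):
            return (True, v)
    return (False, None)


class ScriptedRegister:
    """Hypothetical register whose read and write outcomes are fixed in advance."""

    def __init__(self, read_result, write_result):
        self.read_result = read_result
        self.write_result = write_result
        self.written = None

    def read(self, k):
        return self.read_result

    def write(self, k, v):
        if self.write_result:
            self.written = v
        return self.write_result


fresh = propose_rc(ScriptedRegister((True, None), True), 1, "a")       # own value is written
adopted = propose_rc(ScriptedRegister((True, "b"), True), 2, "a")      # existing value wins
read_failed = propose_rc(ScriptedRegister((False, None), True), 3, "a")
write_failed = propose_rc(ScriptedRegister((True, None), False), 4, "a")
```

The second case is the one that matters for safety: a proposer that reads a previously accepted value writes that value rather than its own.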
SLIDE 59 Paxos
- Entry module, encapsulates rounds
Paxos
SLIDE 60 Paxos - proposeP
proposeP(val v0) {
  pick a round k;
  do {
    (res, v) := proposeRC(k, v0);
    increase round k;
  } while (!res);
  return v;
}
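The retry loop can be sketched as follows. The slide leaves "pick a round" and "increase round" open; the scheme used here (proposer `pid` out of `n` uses rounds pid, pid + n, pid + 2n, ...) is an assumption that keeps rounds of different proposers distinct. `flaky_propose_rc` is a hypothetical stand-in that fails twice before succeeding.

```python
def propose_p(propose_rc, v0, pid, n):
    """Retry propose_rc in increasing rounds until one attempt succeeds."""
    k = pid                      # assumed initial round for proposer pid (pid >= 1)
    while True:
        ok, v = propose_rc(k, v0)
        k += n                   # assumed way to "increase round k"
        if ok:
            return v


attempted_rounds = []


def flaky_propose_rc(k, v0):
    """Hypothetical round-based consensus that only succeeds on the third attempt."""
    attempted_rounds.append(k)
    if len(attempted_rounds) == 3:
        return (True, v0)
    return (False, None)


decided = propose_p(flaky_propose_rc, "v2", pid=2, n=3)
```

As in the pseudocode, the loop has no bound: termination depends on some round eventually succeeding.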
SLIDE 62 Contribution
Round-based register is linearizable with respect to an atomic, single-server specification strong enough to prove single-decree Paxos correct
[Diagram: module stack RB Register, RB Consensus, Paxos, shown once with the replicated implementation of the register and once with its atomic specification]
SLIDE 64 Atomic, non-deterministic methods
abs_v := undef; abs_round := 0; vals := {undef};

read(int k) {
  atomic {
    pick random vR from vals;
    pick random boolean b;
    if (b) {
      if (k >= abs_round) {
        abs_round := k;
        if (!(abs_v = undef)) { v := abs_v; }
        else { v := vR; }
      } else { v := vR; }
      return (true, v);
    } else { return (false, undef); }
  }
}

write(int k, val vW) {
  atomic {
    pick random boolean b;
    vals := vals ∪ {vW};
    if (b && (k >= abs_round)) {
      abs_round := k;
      abs_v := vW;
      return true;
    } else { return false; }
  }
}
SLIDE 65
[Specification repeated from slide 64]
Every proposed value vW is stored in vals, whether the write fails or not
SLIDE 66
[Specification repeated from slide 64]
Write can fail even if k is greater than or equal to the current round ➜ the round will be “stolen” by a later write, modeled by boolean b
SLIDE 67
[Specification repeated from slide 64]
Read may succeed even if k is lower than the current round ➜ a failing write “contaminates” some acceptor, modeled by value vR and boolean b
SLIDE 68
[Specification repeated from slide 64]
If a write succeeds, a subsequent read will pick the written value ➜ a decision taken in consensus cannot be changed
SLIDE 69
[Specification repeated from slide 64, shown next to proposeRC from slide 57]
If a write succeeds, a subsequent read will pick the written value ➜ a decision taken in consensus cannot be changed
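The atomic specification is small enough to run. The sketch below transcribes it into Python, with `None` for undef and an injected random source for the non-deterministic picks; each method body is one uninterrupted step, which plays the role of `atomic`. The `run_once` harness is an assumption added for illustration: three hypothetical proposers with distinct rounds each perform a read and then, if it succeeds, a write, with the individual steps interleaved at random.

```python
import random


class AtomicRegisterSpec:
    def __init__(self, rng):
        self.rng = rng
        self.abs_v = None      # undef
        self.abs_round = 0
        self.vals = [None]     # undef plus every value ever passed to write

    def read(self, k):
        v_r = self.rng.choice(self.vals)
        b = self.rng.choice([True, False])
        if not b:
            return (False, None)
        if k >= self.abs_round:
            self.abs_round = k
            v = self.abs_v if self.abs_v is not None else v_r
        else:
            v = v_r
        return (True, v)

    def write(self, k, v_w):
        b = self.rng.choice([True, False])
        if v_w not in self.vals:
            self.vals.append(v_w)      # stored whether the write fails or not
        if b and k >= self.abs_round:
            self.abs_round = k
            self.abs_v = v_w
            return True
        return False


def run_once(seed):
    """Interleave the read and write steps of three proposers; return decided values."""
    rng = random.Random(seed)
    spec = AtomicRegisterSpec(rng)
    pending = [("read", k, f"v{k}") for k in (1, 2, 3)]
    decided = []
    while pending:
        step, k, v = pending.pop(rng.randrange(len(pending)))
        if step == "read":
            ok, seen = spec.read(k)
            if ok:
                pending.append(("write", k, v if seen is None else seen))
        elif spec.write(k, v):
            decided.append(v)
    return decided
```

In every run of this harness all successful writes carry the same value, which is the property the slide states: once a write has succeeded, the decision cannot be changed.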
SLIDE 72 Multi-Paxos
State machine replication requires solving a sequence of consensus instances
- Naive solution: execute a separate Paxos instance for each sequence element
- Multi-Paxos: execute Phase 1 once for multiple sequence elements
[Diagram: the replicas receive (c3, c2, c1), (c1, c2, c3) and (c2, c1, c3), and all agree on the order c2, c1, c3]
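The difference between the two bullets can be sketched in a few lines. This is an illustration under assumptions, not the protocol as formalised in the talk: `MultiAcceptor` keeps one round shared by all slots plus a per-slot record of accepted values, and the hypothetical `lead` function runs Phase 1 once and then Phase 2 for each slot, using direct calls instead of messages and ignoring failures and concurrency.

```python
class MultiAcceptor:
    def __init__(self):
        self.round = 0
        self.accepted = {}   # slot -> (round, value)

    def on_phase1(self, r):
        """One Phase 1 exchange covers every slot: report all accepted values at once."""
        if r > self.round:
            self.round = r
            return dict(self.accepted)
        return None

    def on_phase2(self, r, slot, value):
        if r == self.round:
            self.accepted[slot] = (r, value)
            return True
        return False


def lead(acceptors, r, commands):
    """Run Phase 1 once in round r, then Phase 2 per slot; return the log or None."""
    majority = len(acceptors) // 2 + 1
    replies = [rep for rep in (a.on_phase1(r) for a in acceptors) if rep is not None]
    if len(replies) < majority:
        return None
    queue = list(commands)
    log = []
    slot = 0
    while queue or any(slot in rep for rep in replies):
        prior = [rep[slot] for rep in replies if slot in rep]
        if prior:
            # Some acceptor already accepted a value here: keep the highest-round one.
            value = max(prior, key=lambda pair: pair[0])[1]
        else:
            value = queue.pop(0)
        acks = sum(a.on_phase2(r, slot, value) for a in acceptors)
        if acks < majority:
            return None
        log.append(value)
        slot += 1
    return log


acceptors = [MultiAcceptor() for _ in range(3)]
log_first_leader = lead(acceptors, 1, ["c2", "c1"])
log_second_leader = lead(acceptors, 2, ["c3"])
```

The naive solution would repeat the Phase 1 exchange for every slot; here the second leader pays for it once, learns what was accepted in slots 0 and 1, and only then appends its own command.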
SLIDE 73
Contribution
Multi-Paxos refines the naive solution, shown by transformations of the network semantics à la Verdi [Wilcox+ 2015]
SLIDE 74 Simple Semantics
[Diagram: nodes 1, 2, 3; events snd(2, P1(r)) and rcv(1, P1(r))]
SLIDE 76 Out-of-Thin-Air Semantics
[Diagram: nodes 1, 2, 3; event rcv(1, P1(r)) with side condition Pred(δ1, P1(r))]
SLIDE 77 Slot-Replicating Semantics
[Diagram: per-slot copies of nodes 1, 2, 3 (1_1, 2_1, 3_1, ..., 1_i, 2_i, 3_i); events snd(2, P1(r), 1) and rcv(1, P1(r), 1)]
SLIDE 81 Widening Semantics
[Diagram: per-slot copies of nodes 1, 2, 3 (1_1, 2_1, 3_1, ..., 1_i, 2_i, 3_i); event snd(2, P1(r), 1) with P1(r) ∈ T; events rcv(1, P1(r), 1), ..., rcv(1, P1(r), i)]
Out-of-thin-air compliant: if slot i receives m ∈ T from p, then Pred(δp, m)
SLIDE 83 Optimised Widening Semantics
[Diagram: per-slot copies of nodes 1, 2, 3 (1_1, 2_1, 3_1, ..., 1_i, 2_i, 3_i); event snd(2, P1(r), 1); event rcv(1, P1(r)) carries no slot index]
Transformations amortise Phase 1 of single-decree Paxos ➜ results in Multi-Paxos
SLIDE 84 Summary
- Modular reasoning to verify each component separately
- Linearisability as a correctness criterion for refinement
- Deconstruction of single-decree Paxos by [Boichat+ 2003] linearises with respect to non-deterministic specifications
- Behaviour-preserving transformations of the network semantics à la Verdi [Wilcox+ 2015]
- Multi-Paxos can be verified without unpacking the correctness proof of single-decree Paxos
Thanks!