
slide-1
SLIDE 1

Paxos Consensus, Abstracted and Deconstructed

Álvaro García Pérez, Alexey Gotsman, Yuri Meshman, and Ilya Sergey

April 19th 2018

slide-4
SLIDE 4

Consensus

  • Several nodes, which can crash
  • Each node proposes a value
  • All non-crashed nodes agree on a single value

[Figure: three nodes proposing v1, v2 and v3; the non-crashed nodes decide v2]

slide-5
SLIDE 5

Deterministic state machine

c1 c2 c3

Clients submit commands

slide-6
SLIDE 6

Deterministic state machine

Machine totally orders commands and computes the sequence of results

[Figure: clients submit c1, c2, c3; the machine orders them as c1, c2, c3 and returns r1, r2, r3]

slide-7
SLIDE 7

Deterministic state machine ✘

Machine totally orders commands and computes the sequence of results

[Figure: commands c1, c2, c3 submitted to a single machine marked ✘]

slide-8
SLIDE 8

State machine replication

Clients send commands to all replicas
Replicas may receive commands in different orders

[Figure: clients send c1, c2, c3; the replicas receive them as (c3, c2, c1), (c1, c2, c3) and (c2, c1, c3)]

slide-9
SLIDE 9

State machine replication

  • Totally order commands via a sequence of consensus instances

[Figure: replicas receive the commands as (c3, c2, c1), (c1, c2, c3) and (c2, c1, c3), and all agree on the order c2, c1, c3]

slide-10
SLIDE 10

State machine replication

Replicas compute the same sequence of results

[Figure: every replica applies the agreed order c2, c1, c3 and returns the results r2, r1, r3]

slide-12
SLIDE 12

State machine replication

Replicas compute the same sequence of results

[Figure: replicas apply the agreed order c2, c1, c3 and return the results r2, r1, r3]

Correctness: the replicated implementation is linearizable with respect to the single-server one, so replication is transparent to clients
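
The replication argument on the preceding slides can be sketched in Python. This is a minimal illustration rather than anything taken from the talk: the `Doubler` machine, its `apply` rule, the integer encoding of c1, c2, c3 and the hand-picked `agreed_order` are all hypothetical stand-ins for a real service and for the output of consensus.

```python
from typing import Dict, List


class Doubler:
    """Hypothetical deterministic state machine whose results depend on command order."""

    def __init__(self) -> None:
        self.state = 0

    def apply(self, command: int) -> int:
        # Non-commutative update, so a different order gives different results.
        self.state = 2 * self.state + command
        return self.state


def run_replica(order: List[int]) -> List[int]:
    """Apply the commands in the given order on a fresh replica."""
    machine = Doubler()
    return [machine.apply(c) for c in order]


# Commands c1=1, c2=2, c3=3 reach the three replicas in different orders.
arrival_orders: Dict[str, List[int]] = {
    "replica_a": [3, 2, 1],
    "replica_b": [1, 2, 3],
    "replica_c": [2, 1, 3],
}
# Without agreement, each replica computes a different result sequence.
diverging = {name: run_replica(order) for name, order in arrival_orders.items()}

# Assumption: consensus has fixed the order c2, c1, c3 for every replica.
agreed_order = [2, 1, 3]
agreed = {name: run_replica(agreed_order) for name in arrival_orders}
```

Because the machine is deterministic, agreeing on the order is enough for every replica to return the same results, which is what lets clients treat the replicated service as a single server.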

slide-14
SLIDE 14

The zoo of consensus protocols

  • Viewstamped replication (1988)
  • Paxos (1998)
  • Disk Paxos (2003)
  • Cheap Paxos (2004)
  • Generalized Paxos (2004)
  • Paxos Commit (2004)
  • Fast Paxos (2006)
  • Stoppable Paxos (2008)
  • Mencius (2008)
  • Vertical Paxos (2009)
  • ZAB (2009)
  • Ring Paxos (2010)
  • Egalitarian Paxos (2013)
  • Raft (2014)
  • M2Paxos (2016)
  • Flexible Paxos (2016)
  • Caesar (2017)

Complex protocols: constant fight for better performance

slide-16
SLIDE 16
slide-17
SLIDE 17
slide-18
SLIDE 18
slide-19
SLIDE 19
slide-20
SLIDE 20
slide-21
SLIDE 21
slide-22
SLIDE 22
slide-23
SLIDE 23
slide-24
SLIDE 24

Broken [Michael+ 2016]

slide-26
SLIDE 26

Goals

  • Develop methods for proving protocols correct, including realistic deployments
  • Get insights into their structure
  • Focus on single-decree Paxos and Multi-Paxos

slide-29
SLIDE 29

Approach

  • Modular reasoning: verify parts of the protocol separately instead of the whole thing
  • Linearizability implies refinement [Filipovic+ 2009]

[Figure: protocol components P1, P2, P3; P1 ⊑ S1]

slide-32
SLIDE 32

Approach

  • Modular reasoning: verify parts of the protocol separately instead of the whole thing
  • Linearizability implies refinement [Filipovic+ 2009]

[Figure: components S1, P2, P3, where S1 is an atomic { ... } specification; P1 ⊑ S1 and P2(S1) ⊑ S2]

slide-34
SLIDE 34

Approach

  • Modular reasoning: verify parts of the protocol separately instead of the whole thing
  • Linearizability implies refinement [Filipovic+ 2009]

[Figure: components S2, P3, where S2 is an atomic { ... ... } specification; P1 ⊑ S1, P2(S1) ⊑ S2 and P3(S2) ⊑ S3]

slide-35
SLIDE 35

Approach

  • Modular reasoning: verify parts of the protocol separately instead of the whole thing
  • Linearizability implies refinement [Filipovic+ 2009]

[Figure: a single component S3, an atomic { ... ... ... } specification; P1 ⊑ S1, P2(S1) ⊑ S2 and P3(S2) ⊑ S3]
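
The three refinement steps can be read as one short derivation. This is a sketch under two assumptions: each context P_i(·) is monotone with respect to ⊑ (which is what "linearizability implies refinement" is used for here) and ⊑ is transitive. The composed form P3(P2(P1)) is notation introduced for this sketch and does not appear on the slides.

```latex
\begin{align*}
P_1 \sqsubseteq S_1
  &\implies P_2(P_1) \sqsubseteq P_2(S_1) \sqsubseteq S_2 \\
  &\implies P_3(P_2(P_1)) \sqsubseteq P_3(S_2) \sqsubseteq S_3
\end{align*}
```

Each line uses one of the separately proved facts P1 ⊑ S1, P2(S1) ⊑ S2 and P3(S2) ⊑ S3, so no step needs to look inside the proof of the previous one.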

slide-37
SLIDE 37

Approach

  • Modular reasoning: verify parts of the protocol separately instead of the whole thing
  • Linearizability implies refinement [Filipovic+ 2009]
  • Transformations of the network semantics, à la Verified System Transformers of the Verdi framework [Wilcox+ 2015]

Prove one variant of the protocol without unpacking the proof of a simpler variant

slide-38
SLIDE 38

  • Acceptors = members of parliament: can vote to accept a value, majority wins
  • Proposer = parliament speaker: proposes its value to vote on

[Figure: nodes 1, 2, 3 with values v1, v2, v3, acting as proposers and acceptors]
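
"Majority wins" works because any two majorities share at least one acceptor. The following sketch checks this exhaustively for three acceptors; the names `a1`, `a2`, `a3` and the helper `majorities` are hypothetical and only serve the illustration.

```python
from itertools import combinations
from typing import List, Set


def majorities(acceptors: List[str]) -> List[Set[str]]:
    """All subsets that contain more than half of the acceptors."""
    needed = len(acceptors) // 2 + 1
    return [
        set(subset)
        for size in range(needed, len(acceptors) + 1)
        for subset in combinations(acceptors, size)
    ]


quorums = majorities(["a1", "a2", "a3"])
# Every pair of majorities has a non-empty intersection.
all_intersect = all(q1 & q2 for q1 in quorums for q2 in quorums)
```

The shared acceptor is what carries information about an earlier vote into any later round.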

slide-39
SLIDE 39

  • Phase 1: a proposer chooses a round r and convinces a majority of acceptors to switch to r
  • Acceptor switches only if its current round is less

[Figure: acceptors 1, 2, 3; acceptor states (Round#, Accepted): (0, ?), (0, ?), (0, ?)]

slide-40
SLIDE 40

  • Phase 1: a proposer chooses a round r and convinces a majority of acceptors to switch to r
  • Acceptor switches only if its current round is less

[Figure: acceptors 1, 2, 3; message r; acceptor states (Round#, Accepted): (r, ?), (0, ?), (0, ?)]

slide-41
SLIDE 41

  • Phase 1: a proposer chooses a round r and convinces a majority of acceptors to switch to r
  • Acceptor switches only if its current round is less

[Figure: acceptors 1, 2, 3; message ok; acceptor states (Round#, Accepted): (r, ?), (r, ?), (0, ?)]

slide-42
SLIDE 42

  • Phase 2: the proposer sends its value tagged with the round number
  • Acceptor only accepts a value tagged with the round it is in

[Figure: acceptors 1, 2, 3; message r, v2; acceptor states (Round#, Accepted): (r, v2), (r, ?), (0, ?)]

slide-43
SLIDE 43

  • Phase 2: the proposer sends its value tagged with the round number
  • Acceptor only accepts a value tagged with the round it is in

[Figure: acceptors 1, 2, 3; message ok; acceptor states (Round#, Accepted): (r, v2) ✔ Reply v2 to client, (r, v2), (0, ?)]

slide-44
SLIDE 44

  • Phase 1: a proposer chooses a round r′ and convinces a majority of acceptors to switch to r′

[Figure: acceptors 1, 2, 3; acceptor states (Round#, Accepted): (r, v2) ✔ Reply v2 to client, (r, v2), (r′, ?)]

slide-45
SLIDE 45

  • Phase 1: a proposer chooses a round r′ and convinces a majority of acceptors to switch to r′
  • Acceptor sends to the proposer its round number and value

[Figure: acceptors 1, 2, 3; message ok, r, v2; acceptor states (Round#, Accepted): (r, v2) ✔ Reply v2 to client, (r′, v2), (r′, ?)]

slide-47
SLIDE 47

  • Phase 1: a proposer chooses a round r′ and convinces a majority of acceptors to switch to r′
  • Acceptor sends to the proposer its round number and value
  • If some acceptor has accepted a value, the proposer proposes the value with the highest round number

Ensures that the chosen value v2 will not be changed

[Figure: acceptors 1, 2, 3; message ok, r, v2; acceptor states (Round#, Accepted): (r, v2) ✔ Reply v2 to client, (r′, v2), (r′, v2)]
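
The walkthrough above can be replayed with a small in-memory sketch. It follows the rules stated on the slides (switch only to a higher round, accept only a value tagged with the current round, re-propose the value with the highest round), but the class layout, the `accepted_round` field, the helper names and the concrete round numbers 1 and 2 standing for r and r′ are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Reply = Tuple[Optional[int], Optional[str]]  # (round of last accept, accepted value)


@dataclass
class Acceptor:
    round: int = 0                        # "Round#" on the slides
    accepted_round: Optional[int] = None  # round in which `accepted` was accepted
    accepted: Optional[str] = None        # "Accepted: ?" is modelled as None

    def phase1(self, r: int) -> Optional[Reply]:
        """Switch to round r only if the current round is lower, then reply."""
        if self.round < r:
            self.round = r
            return (self.accepted_round, self.accepted)
        return None  # no reply: already in round r or higher

    def phase2(self, r: int, value: str) -> bool:
        """Accept a value only if it is tagged with the current round."""
        if r == self.round:
            self.accepted_round, self.accepted = r, value
            return True
        return False


def collect(acceptors: List[Acceptor], r: int) -> List[Reply]:
    """Run Phase 1 against the given acceptors and keep the replies."""
    replies = (a.phase1(r) for a in acceptors)
    return [reply for reply in replies if reply is not None]


def value_to_propose(replies: List[Reply], own_value: str) -> str:
    """Re-propose the value accepted in the highest round, if there is one."""
    accepted = [(r, v) for r, v in replies if r is not None and v is not None]
    if not accepted:
        return own_value
    return max(accepted, key=lambda rv: rv[0])[1]


a1, a2, a3 = Acceptor(), Acceptor(), Acceptor()

# Round r = 1: the proposer of v2 reaches the majority {a1, a2}.
chosen_r = value_to_propose(collect([a1, a2], 1), "v2")
acks_r = [a.phase2(1, chosen_r) for a in (a1, a2)]

# Round r' = 2: another proposer, holding v3, reaches the majority {a2, a3}.
chosen_r2 = value_to_propose(collect([a2, a3], 2), "v3")
```

The second proposer ends up proposing v2 instead of its own v3, because the acceptor shared by both majorities reports the value it accepted in round r.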

slide-48
SLIDE 48

Modular structure in single-decree Paxos

  • Steal abstractions from an existing analysis of Paxos [Boichat+ 2003]
  • Show their linearizability ➜ modular proof of Paxos

slide-49
SLIDE 49

Round Based Register [Boichat+ 2003]

  • Data type encapsulating the state of acceptors
  • read(int k): Phase 1 of Paxos
  • write(int k, val v): Phase 2 of Paxos

[Figure: module stack RB Register, RB Consensus, Paxos]

slide-50
SLIDE 50

Round Based Register - read

read(int k) {
  query acceptors and switch them to round k;
  if (majority of acceptors acknowledge) {
    if (no acceptor has accepted a value) {
      return (true, undef);
    } else {
      v := value at acceptor with highest round;
      return (true, v);
    }
  } else {
    return (false, undef);
  }
}

slide-53
SLIDE 53

Round Based Register - write

write(int k, val v) {
  update acceptors at round k with value v;
  if (majority of acceptors acknowledge) {
    return true;
  } else {
    return false;
  }
}
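
A minimal in-memory sketch of a round-based register with both operations is given below. It is not the replicated implementation from the talk: method calls stand in for messages, the hypothetical `reachable` parameter simulates acceptors that cannot be contacted, and two points are assumptions of the sketch, namely that an acceptor joins only a strictly higher round and that `read` succeeds with no value when a majority joins but nothing has been accepted yet (so that a first write can follow).

```python
from typing import List, Optional, Tuple


class Acceptor:
    """State of one acceptor: its current round and the last accepted (round, value)."""

    def __init__(self) -> None:
        self.round = 0
        self.accepted: Optional[Tuple[int, str]] = None


class RoundBasedRegister:
    """In-memory stand-in: method calls play the role of messages."""

    def __init__(self, acceptors: List[Acceptor]) -> None:
        self.acceptors = acceptors

    def _majority(self, acks: int) -> bool:
        return 2 * acks > len(self.acceptors)

    def read(self, k: int, reachable: Optional[List[Acceptor]] = None) -> Tuple[bool, Optional[str]]:
        """Phase 1: switch the reachable acceptors to round k and collect their state."""
        targets = self.acceptors if reachable is None else reachable
        acks = []
        for acceptor in targets:
            if acceptor.round < k:  # assumption: only a strictly higher round is joined
                acceptor.round = k
                acks.append(acceptor.accepted)
        if not self._majority(len(acks)):
            return (False, None)
        accepted = [entry for entry in acks if entry is not None]
        if not accepted:
            return (True, None)  # assumption: success, but no value accepted yet
        return (True, max(accepted, key=lambda entry: entry[0])[1])

    def write(self, k: int, v: str, reachable: Optional[List[Acceptor]] = None) -> bool:
        """Phase 2: ask the reachable acceptors that are in round k to accept v."""
        targets = self.acceptors if reachable is None else reachable
        acks = 0
        for acceptor in targets:
            if acceptor.round == k:
                acceptor.accepted = (k, v)
                acks += 1
        return self._majority(acks)


acceptors = [Acceptor() for _ in range(3)]
register = RoundBasedRegister(acceptors)
first_read = register.read(1)                                   # all three join round 1
first_write = register.write(1, "v2", reachable=acceptors[:2])  # two of three accept
second_read = register.read(2, reachable=acceptors[1:])         # sees the accepted value
```

In this run the second read contacts a different majority than the write did, yet it still returns the written value through the acceptor the two majorities share.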

slide-56
SLIDE 56

Round Based Consensus [Boichat+ 2003]

  • Routine leading Phase 1 and Phase 2 in the Paxos algorithm
  • proposeRC(int k, val v)

[Figure: module stack RB Consensus, Paxos]

slide-57
SLIDE 57

Round Based Consensus - proposeRC

proposeRC(int k, val v0) {
  (res, v) := read(k);
  if (res) {
    if (v = undef) { v := v0; }
    res := write(k, v);
    if (res) { return (true, v); }
  }
  return (false, undef);
}
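
The routine translates almost line by line into Python. In the sketch below, `propose_rc` mirrors the pseudocode, while `SimpleRegister` is a hypothetical single-copy register that never fails spuriously; it exists only so the example runs on its own and is not the replicated register from the talk.

```python
from typing import Optional, Tuple


class SimpleRegister:
    """Hypothetical single-copy register: calls succeed unless the round is stale."""

    def __init__(self) -> None:
        self.round = 0
        self.value: Optional[str] = None

    def read(self, k: int) -> Tuple[bool, Optional[str]]:
        if k < self.round:
            return (False, None)
        self.round = k
        return (True, self.value)

    def write(self, k: int, v: str) -> bool:
        if k < self.round:
            return False
        self.round = k
        self.value = v
        return True


def propose_rc(register: SimpleRegister, k: int, v0: str) -> Tuple[bool, Optional[str]]:
    """One attempt at deciding in round k; falls back to v0 if nothing was written."""
    res, v = register.read(k)
    if res:
        if v is None:  # undef: no value written so far
            v = v0
        if register.write(k, v):
            return (True, v)
    return (False, None)


register = SimpleRegister()
first = propose_rc(register, 1, "a")   # nothing written yet, so "a" is written
second = propose_rc(register, 2, "b")  # reads "a" and writes it again
stale = propose_rc(register, 1, "c")   # round 1 is now stale
```

The second call shows the key behaviour: a proposer that reads an existing value adopts it and drops its own proposal.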

slide-59
SLIDE 59

Paxos

  • Entry module, encapsulates rounds
  • proposeP(val v)

[Figure: module Paxos]

slide-60
SLIDE 60

Paxos - proposeP

proposeP(val v0) {
  pick a round k;
  do {
    (res, v) := proposeRC(k, v0);
    increase round k;
  } while (!res);
  return v;
}
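
The retry loop can be sketched as follows. The parameters `first_round` and `step` are hypothetical: the slides only say "pick a round" and "increase round", and a common convention (assumed here, not stated in the talk) is to give each proposer its own starting round and a shared step so that no two proposers ever use the same round. `flaky_propose_rc` is a stub used only to exercise the loop.

```python
from typing import Callable, List, Optional, Tuple

ProposeRC = Callable[[int, str], Tuple[bool, Optional[str]]]


def propose_p(propose_rc: ProposeRC, v0: str, first_round: int = 1, step: int = 1) -> Optional[str]:
    """Retry propose_rc in increasing rounds until one attempt succeeds."""
    k = first_round
    while True:
        res, v = propose_rc(k, v0)
        k += step  # move to a higher round before any retry
        if res:
            return v


attempted_rounds: List[int] = []


def flaky_propose_rc(k: int, v0: str) -> Tuple[bool, Optional[str]]:
    """Hypothetical stub: fails in rounds 1 and 2, succeeds from round 3 on."""
    attempted_rounds.append(k)
    return (True, v0) if k >= 3 else (False, None)


decided = propose_p(flaky_propose_rc, "v1")
```

As in the pseudocode, the loop has no bound: termination depends on some round eventually succeeding.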

slide-62
SLIDE 62

Contribution

Round-based register is linearizable wrt an atomic, single-server specification strong enough to prove single-decree Paxos correct

[Figure: the module stack RB Register, RB Consensus, Paxos, shown for the replicated implementation and for the atomic specification]

slide-64
SLIDE 64

Atomic, non-deterministic methods

abs_v := undef; abs_round := 0; vals := {undef};

read(int k) {
  atomic {
    pick random vR from vals;
    pick random boolean b;
    if (b) {
      if (k >= abs_round) {
        abs_round := k;
        if (!(abs_v = undef)) { v := abs_v; } else { v := vR; }
      } else {
        v := vR;
      }
      return (true, v);
    } else {
      return (false, undef);
    }
  }
}

write(int k, val vW) {
  atomic {
    pick random boolean b;
    vals := vals ∪ {vW};
    if (b && (k >= abs_round)) {
      abs_round := k;
      abs_v := vW;
      return true;
    } else {
      return false;
    }
  }
}
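
The specification can be transcribed into Python to experiment with its behaviour. This is a sketch under stated assumptions: `undef` is modelled as `None`, the `atomic` blocks are taken for granted because the code is single-threaded, and the injectable `rng` as well as the helper `written_value_is_stable` are additions made here for testing, not part of the specification.

```python
import random
from typing import List, Optional, Tuple


class AtomicRegisterSpec:
    """Single-copy, non-deterministic specification of the round-based register."""

    def __init__(self, rng: Optional[random.Random] = None) -> None:
        self.abs_v: Optional[str] = None
        self.abs_round = 0
        self.vals: List[Optional[str]] = [None]  # the set {undef}
        self.rng = rng if rng is not None else random.Random()

    def read(self, k: int) -> Tuple[bool, Optional[str]]:
        v_r = self.rng.choice(self.vals)
        b = self.rng.choice([True, False])
        if not b:
            return (False, None)
        if k >= self.abs_round:
            self.abs_round = k
            v = self.abs_v if self.abs_v is not None else v_r
        else:
            v = v_r
        return (True, v)

    def write(self, k: int, v_w: str) -> bool:
        b = self.rng.choice([True, False])
        if v_w not in self.vals:  # vals := vals ∪ {vW}, even if the write fails
            self.vals.append(v_w)
        if b and k >= self.abs_round:
            self.abs_round = k
            self.abs_v = v_w
            return True
        return False


def written_value_is_stable(seed: int) -> bool:
    """After a successful write(1, "a"), successful reads in round 2 return "a"."""
    spec = AtomicRegisterSpec(random.Random(seed))
    if not spec.write(1, "a"):
        return "a" in spec.vals  # a failed write still records its value
    reads = [spec.read(2) for _ in range(20)]
    return all(v == "a" for ok, v in reads if ok)
```

Calling `written_value_is_stable` with different seeds explores different resolutions of the random choices; the property holds for each of them because a successful write fixes `abs_v`.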

slide-65
SLIDE 65

Every proposed value vW is stored in vals, whether the write fails or not

slide-66
SLIDE 66

Write can fail even if k is greater than or equal to the current round ➜ a round can be “stolen” by a later write, modeled by boolean b

slide-67
SLIDE 67

Read may succeed even if k is lower than the current round ➜ a failing write “contaminates” some acceptor, modeled by value vR and boolean b

slide-68
SLIDE 68

If a write succeeds, a subsequent successful read in a round k >= abs_round will pick the written value ➜ a decision taken in consensus cannot be changed

slide-69
SLIDE 69

If a write succeeds, a subsequent successful read in a round k >= abs_round will pick the written value ➜ a decision taken in consensus cannot be changed

proposeRC(int k, val v0) {
  (res, v) := read(k);
  if (res) {
    if (v = undef) { v := v0; }
    res := write(k, v);
    if (res) { return (true, v); }
  }
  return (false, undef);
}

slide-72
SLIDE 72

Multi-Paxos

State machine replication requires solving a sequence of consensus instances

  • Naive solution: execute a separate Paxos instance for each sequence element
  • Multi-Paxos: execute Phase 1 once for multiple sequence elements

[Figure: replicas receive c1, c2, c3 in different orders and agree on the order c2, c1, c3]
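
The difference between the two bullets can be illustrated with acceptors that keep one accepted value per slot. The sketch assumes a single stable leader, no failures and no previously accepted values; `MultiAcceptor`, its per-slot dictionary and the round number 1 are hypothetical choices for this example and are not taken from the talk.

```python
from typing import Dict, List, Optional, Tuple


class MultiAcceptor:
    """Acceptor with one shared round and one accepted (round, value) per slot."""

    def __init__(self) -> None:
        self.round = 0
        self.accepted: Dict[int, Tuple[int, str]] = {}

    def phase1(self, r: int) -> Optional[Dict[int, Tuple[int, str]]]:
        """Join round r for all slots at once and report everything accepted so far."""
        if self.round < r:
            self.round = r
            return dict(self.accepted)
        return None

    def phase2(self, r: int, slot: int, value: str) -> bool:
        """Accept a value for one slot if it is tagged with the current round."""
        if r == self.round:
            self.accepted[slot] = (r, value)
            return True
        return False


acceptors: List[MultiAcceptor] = [MultiAcceptor() for _ in range(3)]

# Phase 1 runs once and covers every slot.
promises = [a.phase1(1) for a in acceptors]
# A real leader would have to re-propose any values reported in `promises`;
# here they are all empty, so the leader is free to choose.

log: Dict[int, str] = {}
for slot, command in enumerate(["c2", "c1", "c3"], start=1):
    acks = sum(a.phase2(1, slot, command) for a in acceptors)  # Phase 2 per slot
    if 2 * acks > len(acceptors):
        log[slot] = command
```

Under these assumptions the naive solution would repeat Phase 1 for each of the three slots, whereas here it is executed a single time.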

slide-73
SLIDE 73

Contribution

Multi-Paxos refines the naive solution, shown by transformations of the network semantics à la Verdi [Wilcox+ 2015]

slide-74
SLIDE 74

Simple Semantics

[Figure: nodes 1, 2, 3; transitions snd(2, P1(r)) and rcv(1, P1(r))]

slide-76
SLIDE 76

Out-of-Thin-Air Semantics

[Figure: nodes 1, 2, 3; transition rcv(1, P1(r)) and predicate Pred(δ1, P1(r))]

slide-77
SLIDE 77

Slot-Replicating Semantics

[Figure: slots 1 to i, each with its own copy of nodes 1, 2, 3; transitions snd(2, P1(r), 1) and rcv(1, P1(r), 1)]

slide-81
SLIDE 81

Widening Semantics

Out-of-thin-air compliant: if slot i receives m ∈ T from p, then Pred(δp, m)

[Figure: slots 1 to i, each with its own copy of nodes 1, 2, 3; P1(r) ∈ T; transitions snd(2, P1(r), 1), rcv(1, P1(r), 1), ..., rcv(1, P1(r), i)]

slide-83
SLIDE 83

Optimised Widening Semantics

Transformations amortise Phase 1 of single-decree Paxos ➜ results in Multi-Paxos

[Figure: slots 1 to i, each with its own copy of nodes 1, 2, 3; transitions snd(2, P1(r), 1) and rcv(1, P1(r))]

slide-84
SLIDE 84

Summary

  • Modular reasoning to verify each component separately
  • Linearisability as a correctness criterion for refinement
  • Deconstruction of single-decree Paxos by [Boichat+ 2003] linearises wrt non-deterministic specifications
  • Behaviour-preserving transformations of the network semantics à la Verdi [Wilcox+ 2015]
  • Multi-Paxos can be verified without unpacking the correctness proof of single-decree Paxos

Thanks!
