slide-1
SLIDE 1

Model Checking of Fault-Tolerant Distributed Algorithms

Igor Konnov

joint work with Annu Gmeiner, Ulrich Schmid, Helmut Veith, and Josef Widder

slide-2
SLIDE 2

Distributed Systems

Are they always working?

Igor Konnov 2/63

slide-3
SLIDE 3
No... some failing systems

Therac-25 (1985)

radiation therapy machine gave massive overdoses, e.g., due to race conditions of concurrent tasks

Qantas Airbus in-flight upset near Learmonth (2008)

1 out of 3 replicated components failed; the computer initiated a dangerous altitude drop

Ariane 501 maiden flight (1996)

primary/backup, i.e., 2 replicated computers, both ran into the same integer overflow

Netflix outages due to Amazon’s cloud (ongoing)

one is not sure what is going on there

hundreds of computers involved


slide-7
SLIDE 7

Why do they fail?

faults at design/implementation phase

approach: find and fix faults before operation ⇒ model checking

faults at runtime

outside of control of designer/developer

e.g., to the right: crack in a diode in the data link interface of the Space Shuttle ⇒ led to erroneous messages being sent

approach: keep system operational despite faults ⇒ fault-tolerant distributed algorithms

Driscoll (Honeywell)


slide-9
SLIDE 9

Bringing both together

Goal: automatically verified fault-tolerant distributed algorithms, e.g., Paxos, Fast Byzantine Consensus, etc.

model checking FTDAs is a research challenge:

computers run independently at different speeds

exchange messages with uncertain delays

faults

parameterization

...

fault-tolerance makes model checking harder

slide-10
SLIDE 10

Why Model Checking?

an alternative proof approach

useful counter-examples

ability to define and vary assumptions about the system and see why it breaks

closer to code level

good degree of automation

[Figure: a transition system with labeled states s0 : {r}, s1 : {y}, s2 : {y}, s3 : {r, y, g}, s4 : {g}, and Linear Temporal Logic formulas of the form F (...) and G (...) evaluated over its paths]

slide-11
SLIDE 11

Distributed Algorithms: Model Checking Challenges

unbounded data types

unbounded number of rounds (round numbers part of messages)

parameterization in multiple parameters

among n processes f ≤ t are faulty with n > 3t

contrast to concurrent programs

diverse fault models (adverse environments)

continuous time

fault-tolerant clock synchronization

degrees of concurrency: synchronous, asynchronous

partially synchronous: a process makes at most 5 steps between 2 steps of any other process

slide-12
SLIDE 12

Challenge #1: fault models

clean crashes: least severe

faulty processes prematurely halt after/before “send to all”

crash faults:

faulty processes prematurely halt (also) in the middle of “send to all”

omission faults:

faulty processes follow the algorithm, but some messages sent by them might be lost

symmetric faults:

faulty processes send arbitrarily to all or nobody

Byzantine faults: most severe

faulty processes can do anything

encompass all behaviors of above models
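One way to compare the fault models is to ask which sets of recipients can end up with a message when a faulty process executes "send to all". The following is a minimal Python sketch under stated assumptions: recipients are numbered 0..n-1, a crash in the middle of the send loop is assumed to reach a prefix of them, and message contents are ignored (so the extra power of Byzantine processes to send arbitrary contents is not shown). The function name `possible_deliveries` is hypothetical.

```python
from itertools import combinations

def possible_deliveries(model, n):
    """Sets of recipients that may get the message when a faulty process
    executes "send to all" to recipients 0..n-1 (illustrative model only)."""
    everyone = frozenset(range(n))
    nobody = frozenset()
    if model in ("clean", "symmetric"):
        # all or nobody: the send happens entirely or not at all
        return {nobody, everyone}
    if model == "crash":
        # halting in the middle of the loop: assumed to reach a prefix
        return {frozenset(range(k)) for k in range(n + 1)}
    if model in ("omission", "byzantine"):
        # any subset of the messages may be lost (or arbitrarily sent)
        return {frozenset(c)
                for k in range(n + 1)
                for c in combinations(range(n), k)}
    raise ValueError(model)
```

In this simplified view the models are ordered by inclusion: every behavior of a clean crash is also a crash behavior, and every crash behavior is also an omission or Byzantine behavior.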

slide-13
SLIDE 13

Challenges #2 & #3: Pseudo-code and Communication

Translate pseudo-code to a formal description that allows us to verify the algorithm and does not oversimplify the original algorithm.

Assumptions about the communication medium are usually written in plain English, spread across research papers, or constitute folklore knowledge.


slide-15
SLIDE 15

Typical Structure of a Computation Step

receive messages

compute using messages and local variables (description in English with basic control flow if-then-else)

send messages

atomic

implicit pseudo-code

slide-16
SLIDE 16

Challenge #4: Parameterized Model Checking

Parameterized model checking problem:

given a process template P(n, t, f),

resilience condition RC : n > 3t ∧ t ≥ f ≥ 0,

fairness constraints Φ, e.g., “all messages will be delivered”,

and an LTL-X formula ϕ,

show for all n, t, and f satisfying RC:

(P(n, t, f))^(n−f) + f faults ⊨ (Φ → ϕ)
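The quantification "for all n, t, and f satisfying RC" ranges over infinitely many instances. A minimal Python sketch of the resilience condition and of the finite prefix that fixed-parameter model checking can cover is given below; the function names and the bound `max_n` are illustrative assumptions, not part of the talk.

```python
def resilience_condition(n, t, f):
    """RC from the slide: n > 3t and t >= f >= 0."""
    return n > 3 * t and t >= f >= 0

def instances(max_n):
    """All (n, t, f) with n <= max_n that satisfy RC."""
    return [(n, t, f)
            for n in range(1, max_n + 1)
            for t in range(n)
            for f in range(t + 1)
            if resilience_condition(n, t, f)]
```

For example, `instances(4)` contains six parameter triples, and raising the bound only adds more; no finite bound covers the whole family, which is why a parameterized technique is needed.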


slide-18
SLIDE 18

Challenge #5: Liveness in Distributed Algorithms

Interplay of safety and liveness is a central challenge in DAs

achieving safety and liveness is non-trivial

asynchrony and faults lead to impossibility results [Fischer, Lynch, Paterson’85]

Rich literature to verify safety (e.g., in concurrent systems)

Distributed algorithms perspective:

“doing nothing is always safe”

“tools verify algorithms that actually might do nothing”

Verification efforts often have to simplify assumptions

slide-19
SLIDE 19

Summary

We have to model:

faults,

communication medium captured in English,

algorithms written in pseudo-code

and check:

safety and liveness of parameterized systems with unbounded integers and non-standard fairness constraints

slide-20
SLIDE 20

Model Checking for Small System Sizes



slide-23
SLIDE 23

Fault-tolerant distributed algorithms

n processes communicate by messages

all processes know that at most t of them might be faulty

f are actually faulty


slide-25
SLIDE 25

Asynchronous Reliable Broadcast [Srikanth & Toueg’87]

The core of the classic broadcast algorithm from the DA literature. It solves an agreement problem depending on the inputs v_i.

Variables of process i:

v_i : {0, 1}, initially 0 or 1
accept_i : {0, 1}, initially 0

An atomic step:

if v_i = 1
then send (echo) to all;

if received (echo) from at least t + 1 distinct processes and not sent (echo) before
then send (echo) to all;

if received (echo) from at least n - t distinct processes
then accept_i := 1;

asynchronous

t Byzantine faults

correct if n > 3t

the code is parameterized in n and t ⇒ process template P(n, t, f)
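The atomic step can be read as a function from the local state and the number of echoes received so far to a new local state. Below is a minimal Python sketch, not the authors' model: it assumes no faulty processes (f = 0), treats a repeated echo from the same process as a no-op, and runs under one particular schedule in which every echo sent in a step is delivered to everyone before the next step. The names `atomic_step` and `run` are hypothetical.

```python
def atomic_step(v, sent, accepted, echoes, n, t):
    """One atomic step of a correct process.

    v        -- input value of the process (0 or 1)
    sent     -- True if the process already sent <echo>
    accepted -- current value of accept_i
    echoes   -- number of distinct processes whose <echo> was received
    Returns (send_now, sent, accepted)."""
    send_now = False
    if not sent and (v == 1 or echoes >= t + 1):
        send_now, sent = True, True
    if echoes >= n - t:
        accepted = True
    return send_now, sent, accepted

def run(inputs, n, t):
    """Run n correct processes until no process sends anything new."""
    sent = [False] * n
    accepted = [False] * n
    echoes = 0  # number of distinct processes that have sent <echo>
    while True:
        new_senders = 0
        for i in range(n):
            s, sent[i], accepted[i] = atomic_step(
                inputs[i], sent[i], accepted[i], echoes, n, t)
            new_senders += s
        if new_senders == 0:
            return accepted
        echoes += new_senders
```

With n = 4 and t = 1, two processes starting with v = 1 are enough to make everyone echo and accept, a single one is not, and with all inputs 0 nobody ever accepts. A real model check explores all schedules and fault behaviors, not just this one.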


slide-28
SLIDE 28

Threshold-Guarded Distributed Algorithms

Standard construct: quantified guards (t = f = 0)

Existential Guard: if received m from some process then ...

Universal Guard: if received m from all processes then ...

what if faults might occur?

Fault-Tolerant Algorithms: n processes, at most t are Byzantine

Threshold Guard: if received m from n − t processes then ...

(the processes cannot refer to f!)
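The three kinds of guards differ only in the number of distinct senders they require. A small Python sketch, assuming the set of senders seen so far is available as `received_from` (a hypothetical name), makes the difference concrete:

```python
def existential_guard(received_from):
    """Fires once a message from some process has arrived."""
    return len(received_from) >= 1

def universal_guard(received_from, n):
    """Fires only after messages from all n processes have arrived."""
    return len(received_from) >= n

def threshold_guard(received_from, n, t):
    """Fires after messages from n - t distinct processes have arrived."""
    return len(received_from) >= n - t
```

With n = 4 and t = 1, a single silent faulty process keeps the universal guard from ever firing, while the threshold guard fires on the three correct senders. Conversely, a single Byzantine sender is enough to trigger the existential guard. Note that the guards use only n and t, never f.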


slide-31
SLIDE 31

Counting Argument in Threshold-Guarded Algorithms

[Figure: n processes, at most t of them faulty and f actually faulty; a group of t + 1 senders is highlighted]

t + 1: at least one non-faulty process sent the message

Correct processes count incoming messages from distinct processes
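The counting argument can be checked exhaustively for small instances. The Python sketch below assumes, without loss of generality, that processes 0..f-1 are the faulty ones; the function name is hypothetical.

```python
from itertools import combinations

def some_correct_sender_guaranteed(n, f, threshold):
    """True if every set of `threshold` distinct senders among n processes,
    f of which are faulty, contains at least one correct process."""
    faulty = set(range(f))  # w.l.o.g. processes 0..f-1 are faulty
    return all(not set(senders) <= faulty
               for senders in combinations(range(n), threshold))
```

For n = 7 and t = f = 2, a threshold of t + 1 = 3 guarantees a correct sender, whereas a threshold of t = 2 does not, because the two faulty processes alone could reach it.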


slide-34
SLIDE 34

Modeling threshold-based algorithms in Promela

As the distributed algorithms are given in pseudo-code, we have to decide on how to encode in PROMELA:

send to all and receive

counting expressions: “received <m> from n − t distinct processes”

faults

In what follows, we compare side-by-side two solutions:

A straightforward encoding with PROMELA channels and explicit representation of faulty processes. [Solution 1]

An advanced encoding with shared variables and fault injection. [Solution 2]

slide-35
SLIDE 35

Modeling Message Passing

All our case studies are designed with the assumption of classic reliable asynchronous message passing as in [FLP85]:

non-blocking communication: operations “receive” and “send” are executed immediately;

if a message can be received now, it may also be received later: a process does not have to receive a message as soon as it is able to;

every sent message is eventually received, but there are no bounds on the delays.

slide-36
SLIDE 36

Experiments: Solution 1 vs. Solution 2

[Plot: number of states (logscale) vs. number of processes N, for N = 3 to 8]

[Plot: memory in MB (logscale, ≤ 12 GB) vs. number of processes N, for N = 3 to 8]

Solution 1: channels + explicit Byzantine processes (blue)

Solution 2: shared variables + fault injection (red)

in the presence of one Byzantine faulty process (f = 1)

(case f = 2 runs out of memory too fast)

slide-37
SLIDE 37

Case Studies

We consider a number of threshold-based algorithms.

Our running example ST87 for

1. Byzantine faults (BYZ)
2. omission faults (OMIT)
3. symmetric faults (SYMM)
4. clean crashes (CLEAN)

5. Folklore reliable broadcast for clean crashes [Chandra & Toueg 96, CT96]

(to be continued)

slide-38
SLIDE 38

Case Studies (cont.): Larger Algorithms

more involved algorithms in the purely asynchronous setting:

6. Asynchronous Byzantine Agreement (Bracha & Toueg 85, BT85)

Byzantine faults
two phases and two message types
five status values
properties: unforgeability, correctness (liveness), agreement (liveness)

7. Condition-based Consensus (Mostéfaoui et al. 01, MRRR01)

crash faults
two phases and four message types
nine status variables
properties: validity, agreement, termination (liveness)

8. Fast Byzantine Consensus: common case (Martin, Alvisi 06, MA06)

Byzantine faults
the core part of the algorithm
no cryptography

slide-39
SLIDE 39

Experimental Results at a Glance

| Algorithm | Fault | Parameters | Resilience | Properties | Time |
|---|---|---|---|---|---|
| 1. ST87 | BYZ | n = 7, t = 2, f = 2 | n > 3t | U, C, R | 6 sec. |
| 1. ST87 | BYZ | n = 7, t = 3, f = 2 | n > 3t | U, C, R | 5 sec. |
| 1. ST87 | BYZ | n = 7, t = 1, f = 2 | n > 3t | U, C, R | 1 sec. |
| 2. ST87 | OMIT | n = 5, t = 2, f = 2 | n > 2t | U, C, R | 4 sec. |
| 2. ST87 | OMIT | n = 5, t = 2, f = 3 | n > 2t | U, C, R | 5 sec. |
| 3. ST87 | SYMM | n = 5, t = 1, fp = 1, fs = 0 | n > 2t | U, C, R | 1 sec. |
| 3. ST87 | SYMM | n = 5, t = 2, fp = 3, fs = 1 | n > 2t | U, C, R | 1 sec. |
| 4. ST87 | CLEAN | n = 3, t = 2, fc = 2, fnc = 0 | n > t | U, C, R | 1 sec. |
| 5. CT96 | CRASH | n = 2 | - | U, C, R | 1 sec. |
| 6. BT85 | BYZ | n = 5, t = 1, f = 1 | n > 3t | R | 131 sec. |
| 6. BT85 | BYZ | n = 5, t = 1, f = 2 | n > 3t | R | 1 sec. |
| 6. BT85 | BYZ | n = 5, t = 2, f = 2 | n > 3t | R | 1 sec. |
| 7. MRRR01 | CRASH | n = 3, t = 1, f = 1 | n > 2t | V0, V1, A, T | 1 sec. |
| 7. MRRR01 | CRASH | n = 3, t = 1, f = 2 | n > 2t | V0, V1, A, T | 1 sec. |
| 8. MA06 | BYZ | p = 4, a = 6, l = 4, t = 1, f = 1 | p > 3t, a > 5t, l > 3t | CS1, CS3, CL1, CL2 | 3 hrs. |
| 8. MA06 | BYZ | p = 4, a = 5, l = 4, t = 1, f = 1 | p > 3t, a > 5t, l > 3t | CS1, CS3, CL1, CL2 | 14 min. |
| 8. MA06 | BYZ | p = 4, a = 6, l = 4, t = 1, f = 2 | p > 3t, a > 5t, l > 3t | CS1, CS3, CL1, CL2 | 2 sec. |

slide-40
SLIDE 40

Summary

We show how to model threshold-based fault-tolerant algorithms starting with an imprecise description [Spin’13]

We create PROMELA models using expert advice.

The tool demonstrates that the model behaves as predicted by theory (for fixed parameters)

This reference implementation allows us to optimize the encoding

... and to make the model amenable to parameterized verification

slide-41
SLIDE 41

Verifying for all system sizes



slide-44
SLIDE 44

[Figure: control flow automaton of a process with locations qI, q0, ..., q8, qF. Edge labels: sv = V1, ¬(sv = V1), inc nsnt, sv := SE, rcvd := z where (rcvd ≤ z ∧ z ≤ nsnt + f), t + 1 ≤ rcvd, ¬(t + 1 ≤ rcvd), sv = V0, ¬(sv = V0), n − t ≤ rcvd, ¬(n − t ≤ rcvd), sv := AC]

concrete values are not important

thresholds are essential: 0, 1, t + 1, n − t

intervals with symbolic boundaries:

I_0 = [0, 1)
I_1 = [1, t + 1)
I_{t+1} = [t + 1, n − t)
I_{n−t} = [n − t, ∞)

Parametric Interval Abstraction (PIA)

Similar to interval abstraction: [t + 1, n − t) rather than [4, 10).

Total order: 0 < 1 < t + 1 < n − t for all parameters satisfying RC: n > 3t, t ≥ f ≥ 0.
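The data abstraction maps every concrete counter value to the parametric interval that contains it. The following Python sketch evaluates this mapping for concrete n and t; the interval names are plain strings chosen for illustration, and the function names are hypothetical.

```python
def abstract(value, n, t):
    """Map a concrete counter value to its parametric interval."""
    if value < 1:
        return "I0"
    if value < t + 1:
        return "I1"
    if value < n - t:
        return "It+1"
    return "In-t"

def thresholds_ordered(n, t):
    """Check the strict total order 0 < 1 < t + 1 < n - t."""
    return 0 < 1 < t + 1 < n - t
```

Two observations follow. First, the truth value of each guard (t + 1 ≤ rcvd, n − t ≤ rcvd) is determined by the interval alone, which is what makes the abstraction precise enough for threshold guards. Second, the order is strict only for t ≥ 1: with t = 0 the thresholds 1 and t + 1 coincide and the interval [1, t + 1) is empty.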


slide-47
SLIDE 47

Technical challenges

We have to reduce the verification of an infinite number of instances where

1. the process code is parameterized
2. the number of processes is parameterized

to one finite-state model checking instance.

We do that by:

1. PIA data abstraction
2. PIA counter abstraction

abstraction is an over-approximation ⇒ possible abstract behavior that does not correspond to a concrete behavior.

3. Refining spurious counter-examples


slide-49
SLIDE 49

Abstraction overview

Parameterized family

{M(p) = P(p) · · · P(p) (size(p) processes) : n > 3t, t ≥ f, f ≥ 0}

EXTRACT

Parametric Interval Domain D̂

PARAMETRIC INTERVAL DATA ABSTRACTION

Uniform parameterized family

{M̂(p) = P̂ · · · P̂ (size(p) processes) : n > 3t, t ≥ f, f ≥ 0}

P̂ does not depend on p

P̂ simulates P(p)

CHANGE REPRESENTATION

Counter representation

PARAMETRIC INTERVAL COUNTER ABSTRACTION

one abstract system A that simulates for every p the behavior of M(p)

finite-state model checking

replay the counter-example

refine the system

slide-50
SLIDE 50

Data + counter abstraction over parametric intervals

n = 6, t = 1, f = 1

t + 1 = 2, n − t = 5

[Figure: a grid of counters (nr. processes) indexed by local state; received ranges over the concrete values up to 6, with one row for sent and one for accepted]

Local state is (sv, rcvd), where sv ∈ {sent, accepted} and 0 ≤ rcvd ≤ n

3 processes at (sent, received=3)

1 process at (accepted, received=5)
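The two abstraction steps can be combined on the example above: first abstract the rcvd component of each local state, then count processes per abstract local state, then abstract the counts themselves. The Python sketch below is an illustration under these assumptions and is not the tool's implementation; the names `interval` and `abstract_counters` are hypothetical.

```python
from collections import Counter

def interval(value, n, t):
    """Parametric interval containing a concrete value."""
    if value < 1:
        return "I0"
    if value < t + 1:
        return "I1"
    if value < n - t:
        return "It+1"
    return "In-t"

def abstract_counters(local_states, n, t):
    """local_states is a list of (sv, rcvd) pairs, one per process."""
    # data abstraction: replace rcvd by its interval
    abstract_states = [(sv, interval(rcvd, n, t)) for sv, rcvd in local_states]
    # counter representation: number of processes per abstract local state
    counts = Counter(abstract_states)
    # counter abstraction: replace each count by its interval
    return {state: interval(k, n, t) for state, k in counts.items()}
```

For the processes named on the slide (three at (sent, 3) and one at (accepted, 5), with n = 6 and t = 1), the three senders land in counter interval It+1 for the state (sent, It+1), and the single accepting process in I1 for (accepted, In-t).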


slide-54
SLIDE 54

Data + counter abstraction over parametric intervals

n > 3 · t ∧ t ≥ f

Parametric intervals:

I_0 = [0, 1)
I_1 = [1, t + 1)
I_{t+1} = [t + 1, n − t)
I_{n−t} = [n − t, ∞)

[Figure: a grid of counters (nr. processes) indexed by local state; received ranges over I_0, I_1, I_{t+1}, I_{n−t}, with one row for sent and one for accepted; one region of the grid is highlighted]

when all correct processes accepted, all non-zero counters are in this area

A local state is (sv, rcvd), where sv ∈ {sent, accepted} and rcvd ∈ {I_0, I_1, I_{t+1}, I_{n−t}}

slide-55
SLIDE 55

Concrete vs. parameterized (Byzantine case)

[Plots: time to check relay (sec, logscale) and memory to check relay (MB, logscale)]

Parameterized model checking performs well (the red line).

Experiments for fixed parameters quickly degrade (n = 9 runs out of memory).

We found counter-examples for the cases n = 3t and f > t, where the resilience condition is violated.

slide-56
SLIDE 56

Experimental results at a glance

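| Algorithm | Fault | Resilience | Property | Valid? | #Refinements | Time |
|---|---|---|---|---|---|---|
| ST87 | BYZ | n > 3t | U | ✓ | | 4 sec. |
| ST87 | BYZ | n > 3t | C | ✓ | 10 | 32 sec. |
| ST87 | BYZ | n > 3t | R | ✓ | 10 | 24 sec. |
| ST87 | SYMM | n > 2t | U | ✓ | | 1 sec. |
| ST87 | SYMM | n > 2t | C | ✓ | 2 | 3 sec. |
| ST87 | SYMM | n > 2t | R | ✓ | 12 | 16 sec. |
| ST87 | OMIT | n > 2t | U | ✓ | | 1 sec. |
| ST87 | OMIT | n > 2t | C | ✓ | 5 | 6 sec. |
| ST87 | OMIT | n > 2t | R | ✓ | 5 | 10 sec. |
| ST87 | CLEAN | n > t | U | ✓ | | 2 sec. |
| ST87 | CLEAN | n > t | C | ✓ | 4 | 8 sec. |
| ST87 | CLEAN | n > t | R | ✓ | 13 | 31 sec. |
| CT96 | CLEAN | n > t | U | ✓ | | 1 sec. |
| CT96 | CLEAN | n > t | A | ✓ | | 1 sec. |
| CT96 | CLEAN | n > t | R | ✓ | | 1 sec. |
| CT96 | CLEAN | n > t | C | ✗ | | 1 sec. |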

slide-57
SLIDE 57

When resilience condition is wrong...

| Algorithm | Fault | Resilience | Property | Valid? | #Refinements | Time |
|---|---|---|---|---|---|---|
| ST87 | BYZ | n > 3t ∧ f ≤ t+1 | U | ✗ | 9 | 56 sec. |
| ST87 | BYZ | n > 3t ∧ f ≤ t+1 | C | ✗ | 11 | 52 sec. |
| ST87 | BYZ | n > 3t ∧ f ≤ t+1 | R | ✗ | 10 | 17 sec. |
| ST87 | BYZ | n ≥ 3t ∧ f ≤ t | U | ✓ | | 5 sec. |
| ST87 | BYZ | n ≥ 3t ∧ f ≤ t | C | ✓ | 9 | 32 sec. |
| ST87 | BYZ | n ≥ 3t ∧ f ≤ t | R | ✗ | 30 | 78 sec. |
| ST87 | SYMM | n > 2t ∧ f ≤ t+1 | U | ✗ | | 2 sec. |
| ST87 | SYMM | n > 2t ∧ f ≤ t+1 | C | ✗ | 2 | 4 sec. |
| ST87 | SYMM | n > 2t ∧ f ≤ t+1 | R | ✓ | 8 | 12 sec. |
| ST87 | OMIT | n ≥ 2t ∧ f ≤ t | U | ✓ | | 1 sec. |
| ST87 | OMIT | n ≥ 2t ∧ f ≤ t | C | ✗ | | 2 sec. |
| ST87 | OMIT | n ≥ 2t ∧ f ≤ t | R | ✗ | | 2 sec. |

slide-58
SLIDE 58

What’s next?


slide-59
SLIDE 59

Scaling: acceleration and partial orders

partial orders: we need to check computations of bounded length

complete SAT-based model checking (safety) [CONCUR’14]

sort the transitions between the milestones:

t1: true; t3: true; t2: x++; t4: x++; t5: x ≥ n − f, y++; t6: y ≥ t

accelerate adjacent transitions of the same type:

t′1: true (×2); t′2: x++ (×2); t′5: x ≥ n − f, y++ (×1); t′6: y ≥ t
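The acceleration step can be pictured as run-length encoding of a schedule: adjacent transitions of the same type collapse into one transition with an acceleration factor. A minimal Python sketch, in which a schedule is simply a list of transition-type labels (an assumption made for illustration; the name `accelerate` is hypothetical):

```python
from itertools import groupby

def accelerate(schedule):
    """Merge runs of adjacent transitions of the same type into
    (type, acceleration factor) pairs."""
    return [(kind, len(list(run))) for kind, run in groupby(schedule)]
```

Applied to the sorted schedule on the slide, six transitions shrink to four accelerated ones. On an unsorted schedule such as true, x++, true, x++ nothing can be merged, which is why the transitions are sorted first.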

slide-60
SLIDE 60

Scaling further: partial orders without counter abstraction

encode representative executions in linear integer arithmetic (SMT) [submitted to CAV’15]

Now we can verify safety of:

Reliable broadcast (FRB, STRB, ABA)

Condition-based consensus (CBC)

One-step consensus (CF1S, C1CS, BOSCO)

Liveness?

“Liveness is whatever prevents an empty system from being correct.” (Orna Kupferman)

slide-61
SLIDE 61

Partial orders and SMT beat counter abstraction

[Plot: time to verify an instance in seconds (logscale, 10^0 to 10^5) vs. number of checked benchmarks (5 to 25); series: smt, sat:lingeling, sat:minisat, bdd, fast]

slide-62
SLIDE 62

Summary

Standard model checking tools are not tuned to computational models of fault-tolerant distributed algorithms

Computational primitives in FTDAs are simpler than the standard ones

Thinking in terms of parameterized systems helps to develop efficient techniques

[Timeline: 85 ABA; 87 STRB; 96 FRB; 97 NBAC; 01 CBC, C1CS; 02 NBACG; 06 CF1S, FBC; 08 BOSCO]


slide-65
SLIDE 65

PBFT? RAFT? What else?

Thank you!

http://forsyte.at/software/bymc

slide-66
SLIDE 66

Our current work

[Table: timing models vs. problem classes]

Timing models: discrete synchronous; discrete partially synchronous; discrete asynchronous; continuous synchronous; continuous partially synchronous

Problem classes: one instance / finite payload; many instances / finite payload; many instances / unbounded payload; messages with reals

Covered: core of {ST87, BT87, CT96}, CBC, CF1S, C1CS, BOSCO (one-shot broadcast, c.b. consensus)

slide-67
SLIDE 67

Future work: threshold guards + orthogonal features

[Table: timing models vs. problem classes]

Timing models: discrete synchronous; discrete partially synchronous; discrete asynchronous; continuous synchronous; continuous partially synchronous

Problem classes: one instance / finite payload; many instances / finite payload; many instances / unbounded payload; messages with reals

Covered: core of {ST87, BT87, CT96}, CBC, CF1S, C1CS, BOSCO (one-shot broadcast, c.b. consensus)

Further entries: DHM12; ST87; AK00; CT96 (failure detector); DLS86, MA06, L98 (Paxos); ST87, BT87, CT96, DAs with failure-detectors; DLPSW86; DFLPS13; WS07; ST87 (JACM); FSFK06; WS09

Problems: clock sync, broadcast, approx. agreement

slide-68
SLIDE 68

Template in Promela

We implement the following loop on the right.

receive messages

compute using messages and local variables (description in English with basic control flow if-then-else)

send messages

atomic

/* shared state: a variable or a channel */
active[N(n,t,f)] proctype P() {
  /* local variable to count messages from distinct processes */
  int nrcvd;
  /* initialization */
loop:
  atomic {
    /* 1. receive and count messages
       2. compute using nrcvd
       3. send messages */
  }
  goto loop;
}


slide-70
SLIDE 70

Solution 1: Message Passing using Promela channels

A straightforward encoding using message channels:

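/* message type */
mtype = { ECHO };
/* point-to-point channels: p2p[i][j] carries messages from i to j */
chan p2p[N][N] = [1] of { mtype };
/* tag received messages */
bit rx[N][N];

Sending a message to all processes:

for (i : 0 .. N - 1) { p2p[_pid][i]!ECHO; }

Note: _pid denotes the process identifier in PROMELA (we use it solely to encode message passing). The two-index notation is shorthand: PROMELA arrays are one-dimensional, so an executable model stores p2p[i][j] at a flattened index such as p2p[i * N + j].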

slide-71
SLIDE 71

Solution 1: Message Passing (cont.)

Receiving and counting messages from distinct processes (no faults yet):

/* local */ int nrcvd = 0; /* initially, no messages */
...
i = 0;
do
  /* is there a message from process i? */
  :: (i < N) && nempty(p2p[i][_pid]) ->
       p2p[i][_pid]?ECHO;        /* remove it */
       if
         :: !rx[i][_pid] ->      /* 1. the first time: */
              rx[i][_pid] = 1;   /* a. mark as received */
              nrcvd++; break;    /* b. increase local counter */
         :: rx[i][_pid];         /* 2. ignore a duplicate */
       fi;
       i++;                      /* next process */
  :: (i < N) -> i++;             /* receive nothing from i */
  :: i == N -> break;
od

slide-72
SLIDE 72

Solution 2: Simulating message passing with variables

Keeping the number of send-to-all’s by (correct) processes:

int nsnt; /* shared variable: number of send-to-all's sent by correct processes */

Sending a message to all:

nsnt++;

Receiving and counting messages from distinct processes (no faults):

if /* pick a larger value <= nsnt */
  :: ((nrcvd + 1) <= nsnt) -> nrcvd++;  /* one more message */
  :: skip;                               /* or nothing */
fi;

Reliable communication as a fairness property: F G [∀i. nrcvd_i ≥ nsnt]
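The nondeterministic receive can be summarized by the set of values the local counter may take after one step. The Python sketch below follows the update rule "rcvd := z where rcvd ≤ z ∧ z ≤ nsnt + f" from the control flow automaton shown earlier in the talk, where f accounts for messages injected on behalf of faulty processes; unlike the PROMELA fragment, it allows the counter to jump by more than one in a single step. The function name is hypothetical.

```python
def receive_choices(nrcvd, nsnt, f=0):
    """All values the local counter may take after one receive step.

    The counter never decreases and never exceeds nsnt + f, where nsnt
    counts send-to-all's by correct processes and f is the number of
    faulty processes whose messages are injected."""
    upper = max(nrcvd, nsnt + f)
    return list(range(nrcvd, upper + 1))
```

Without faults, a process that has counted 0 messages while nsnt = 2 may stay at 0 or move to 1 or 2. With f = 1 the upper bound grows by one. The fairness property then rules out executions in which a process stays below nsnt forever.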