SLIDE 1
Computing in a Distributed System in the Presence of Benign Failures
Bernadette CHARRON-BOST, CNRS (joint work with André SCHIPER, EPFL)
SLIDE 2 Distributed System
computational unit
medium of communication
No universal computational model for distributed systems
SLIDE 3 Two Basic Principles
- The model must specify why faults occur
Causes of two different natures:
- Degree of synchronism
- Failure model
SLIDE 5 Two Basic Principles
- The model must specify why faults occur
- The model must specify by whom (culprit) faults occur
SLIDE 6 Two Basic Principles
- The model must specify why faults occur
- The model must specify by whom faults occur
The notion of faulty component is necessary and useful for the analysis of distributed computations
SLIDE 7 First Principle
[Figure: delay axis with regions labelled bounded delays, finite delays and arbitrary delays, and classes labelled synchronous, asynchronous and failure]
... breaks the natural continuum from bounded to infinite delays!
SLIDE 8 A classical type of system
Synchronous system + crash failures
SLIDE 9 A classical type of system
Synchronous system + crash failures
- transmission delays bounded
- process speeds bounded or infinite
SLIDE 10 First Principle
- breaks the natural continuum from bounded to infinite delays
- synchronism degree and failure model are not independent
SLIDE 11 Second Principle
- may lead to undesirable conclusions
Only one transmission fault from each node
Send omission model:
- each process is considered faulty
(no algorithm when the entire system is faulty)
SLIDE 12 Second Principle
- may lead to undesirable conclusions
- faulty processes are allowed to have deviant behaviors
“Every correct process eventually decides”
One transmission failure for a message sent by p to q
Send omission model:
- p is allowed to make no decision
Link failure model:
- p and q must make a decision
Receive omission model:
- q is allowed to make no decision
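The dependence on the failure model can be made concrete with a minimal Python sketch. It assumes that a transmission fault is recorded as a (sender, receiver) pair; the function name `must_decide` and the model labels are illustrative and do not come from the slides.

```python
def must_decide(procs, lost, model):
    """Processes bound by "every correct process eventually decides".

    lost is a list of (sender, receiver) transmission faults; model names
    which component is blamed for each fault (hypothetical labels).
    """
    if model == "send_omission":
        faulty = {sender for sender, _ in lost}      # the sender is the culprit
    elif model == "receive_omission":
        faulty = {receiver for _, receiver in lost}  # the receiver is the culprit
    elif model == "link_failure":
        faulty = set()                               # no process is blamed
    else:
        raise ValueError(f"unknown model: {model}")
    return set(procs) - faulty


procs = ["p", "q", "r"]
one_fault = [("p", "q")]
send_case = must_decide(procs, one_fault, "send_omission")
link_case = must_decide(procs, one_fault, "link_failure")
recv_case = must_decide(procs, one_fault, "receive_omission")

# One transmission fault from each node under send omission:
# every process is faulty, so nobody is required to decide.
all_faulty = must_decide(procs, [("p", "q"), ("q", "r"), ("r", "p")], "send_omission")
```

The same lost message yields three different sets of processes that are obliged to decide, which is the point made on this slide.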
SLIDE 13 Second Principle
- may lead to undesirable conclusions
- faulty processes are allowed to have deviant behaviors
- real causes of transmission failures are often unknown
SLIDE 14 Second Principle
- may lead to undesirable conclusions
- faulty processes are allowed to have deviant behaviors
- real causes of transmission failures are often unknown
- no evidence that the notion of faulty component is helpful
SLIDE 15
The Heard-Of Model
We just specify transmission faults: we no longer consider by whom or why faults occur
SLIDE 16 HO: a Round-Based Model
[Figure: round r at process p, with a sending phase (to all), a receive phase and a local computation]
At each round, every process sends messages to all
- allows us to distinguish semantic and operational features of computations
SLIDE 17 HO: a Round-Based Model
[Figure: round r at process p, with a sending phase (to all), a receive phase and a local computation]
If m is received at round r, then m was sent at round r
Rounds are communication-closed layers
SLIDE 18 First Principle
[Figure: delay axis: bounded delays (synchronous), arbitrary delays (failure)]
- late messages are discarded
[Dwork, Lynch & Stockmeyer, 1988] and [Gafni, 1998]
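Discarding late messages can be sketched as a filter in the receive phase. This is a minimal Python sketch, assuming each message is tagged with the round in which it was sent; the names `Message` and `receive_phase` are illustrative and not part of the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str      # process that sent the message
    round: int       # round in which the message was sent (assumed tag)
    payload: object


def receive_phase(current_round, inbox):
    """Keep only the messages sent in the current round.

    A message tagged with an earlier round arrived too late and is dropped,
    so a round never sees messages from another round. Messages from a later
    round would be buffered in a real system; this sketch ignores them.
    Returns the received messages by sender and the resulting heard-of set.
    """
    mu = {m.sender: m.payload for m in inbox if m.round == current_round}
    return mu, set(mu)


inbox = [Message("p1", 3, "a"), Message("p2", 2, "late"), Message("p3", 3, "c")]
mu, heard_of = receive_phase(3, inbox)
```

From the receiver's point of view, the late message from p2 is indistinguishable from a lost one: p2 is simply absent from the heard-of set of round 3.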
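SLIDE 19 HO Process
$States_p$, $Init_p \subseteq States_p$
Sending function $S_p : (s, q) \mapsto m_q$
Transition function $T_p : (s, \mu) \mapsto s'$
[Figure: at round r, process p moves from state s to state s′]
At round r, process p receives messages from $HO(p, r)$: $\mathrm{supp}(\mu) = HO(p, r)$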
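The two functions of an HO process can be sketched in Python. The example process below, which keeps the minimum value it has seen, is an assumption chosen for illustration, as are the names `send`, `transition` and `run_round`.

```python
def send(state, q):
    """Sending function S_p: the message that a process in `state` sends to q.

    In this illustrative process the same value is sent to everyone.
    """
    return state


def transition(state, mu):
    """Transition function T_p: next state from the current state and mu.

    mu is the partial vector of received messages, keyed by sender; its set
    of keys plays the role of supp(mu) = HO(p, r).
    """
    return min([state, *mu.values()])


def run_round(p, states, ho_p):
    """One round at process p, which hears exactly the senders in ho_p."""
    mu = {q: send(states[q], p) for q in ho_p}
    return transition(states[p], mu)


states = {"p1": 5, "p2": 3, "p3": 8}
new_state_p1 = run_round("p1", states, ho_p={"p1", "p3"})
new_state_p3 = run_round("p3", states, ho_p={"p2"})
```

Process p1 does not hear p2 in this round, so it keeps the value 5 even though a smaller value exists in the system.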
SLIDE 20 Second Principle
- Faults are specified but not the culprits
[Santoro & Widmayer 1989]
SLIDE 21 HO Algorithm
- Distributed algorithm on Π
$A = (States_p, Init_p, S_p, T_p)_{p \in \Pi}$
$(s_p^0)_{p \in \Pi}$ with $s_p^0 \in Init_p$
$(HO(p, r))_{p \in \Pi,\, r > 0}$
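A run can be computed from an initial configuration and a collection of heard-of sets, as in the following minimal Python sketch. It assumes a finite number of rounds and a dictionary `ho` keyed by (process, round); the function `run` and the min-value algorithm are illustrative choices, not taken from the slides.

```python
def run(procs, send, transition, init, ho, num_rounds):
    """Execute num_rounds rounds; ho[(p, r)] is the heard-of set HO(p, r)."""
    states = dict(init)
    for r in range(1, num_rounds + 1):
        # Sending phase: every process computes a message for every process.
        sent = {(q, p): send(states[q], p) for q in procs for p in procs}
        # Receive phase and local computation: p only sees senders in HO(p, r).
        states = {
            p: transition(states[p], {q: sent[(q, p)] for q in ho[(p, r)]})
            for p in procs
        }
    return states


def send_value(state, q):
    return state                       # same message to everyone


def keep_min(state, mu):
    return min([state, *mu.values()])  # smallest value seen so far


procs = ["p1", "p2", "p3"]
init = {"p1": 5, "p2": 3, "p3": 8}
ho = {
    ("p1", 1): {"p1", "p3"}, ("p2", 1): {"p1", "p2", "p3"}, ("p3", 1): {"p2", "p3"},
    ("p1", 2): {"p1", "p2", "p3"}, ("p2", 2): {"p2"}, ("p3", 2): {"p1", "p3"},
}
final = run(procs, send_value, keep_min, init, ho, 2)
```

Nothing in the sketch names a faulty process: the heard-of sets alone determine which messages are delivered.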
SLIDE 22
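- Kernel (of round r):
$K(r) = \bigcap_{p \in \Pi} HO(p, r)$
$coK(r) = \Pi \setminus K(r)$
- Global kernel (of a run):
$K = \bigcap_{p \in \Pi,\, r > 0} HO(p, r) = \bigcap_{r > 0} K(r)$
- Global coKernel (of a run):
$coK = \Pi \setminus K$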
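The kernels are plain set intersections, as the following minimal Python sketch shows. A run is infinite, so the sketch computes the global kernel over a finite prefix of rounds, which is an assumption; the function names are illustrative.

```python
def round_kernel(procs, ho, r):
    """K(r): the processes heard by every process at round r."""
    return set.intersection(*(ho[(p, r)] for p in procs))


def global_kernel(procs, ho, rounds):
    """K restricted to the given rounds: the intersection of all K(r)."""
    return set.intersection(*(round_kernel(procs, ho, r) for r in rounds))


procs = ["p1", "p2", "p3"]
ho = {
    ("p1", 1): {"p1", "p2"}, ("p2", 1): {"p1", "p2", "p3"}, ("p3", 1): {"p2", "p3"},
    ("p1", 2): {"p1", "p2", "p3"}, ("p2", 2): {"p2", "p3"}, ("p3", 2): {"p1", "p2"},
}
k1 = round_kernel(procs, ho, 1)
k = global_kernel(procs, ho, [1, 2])
co_k = set(procs) - k
```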
SLIDE 23 Communication Predicate
Predicate over collections of heard-of sets
$P_{nosplit} :: \forall p, q,\ \forall r : HO(p, r) \cap HO(q, r) \neq \emptyset$
$P_{sp\_unif} :: \forall p, q,\ \forall r : HO(p, r) = HO(q, r)$
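Both predicates can be checked mechanically on a finite prefix of a run. In the minimal Python sketch below, the no-split predicate requires any two heard-of sets of the same round to intersect, and space uniformity requires them to be equal; the function names and the restriction to finitely many rounds are assumptions.

```python
def no_split(procs, ho, rounds):
    """Any two heard-of sets of the same round have a common element."""
    return all(ho[(p, r)] & ho[(q, r)]
               for r in rounds for p in procs for q in procs)


def space_uniform(procs, ho, rounds):
    """All processes have the same heard-of set within each round."""
    return all(ho[(p, r)] == ho[(q, r)]
               for r in rounds for p in procs for q in procs)


procs = ["p1", "p2", "p3"]
everyone = {"p1", "p2", "p3"}
ho = {
    # Round 1: each process hears a different majority.
    ("p1", 1): {"p1", "p2"}, ("p2", 1): {"p2", "p3"}, ("p3", 1): {"p1", "p3"},
    # Round 2: everybody hears everybody.
    ("p1", 2): everyone, ("p2", 2): everyone, ("p3", 2): everyone,
}
# A partitioned round: p1 is isolated from p2 and p3.
split_ho = {("p1", 1): {"p1"}, ("p2", 1): {"p2", "p3"}, ("p3", 1): {"p2", "p3"}}

majority_no_split = no_split(procs, ho, [1, 2])
majority_uniform = space_uniform(procs, ho, [1, 2])
round2_uniform = space_uniform(procs, ho, [2])
partition_no_split = no_split(procs, split_ho, [1])
```

Heard-of sets that each contain a majority of processes always intersect, which is why round 1 satisfies the no-split check without being uniform.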
SLIDE 24 Communication Predicate
Predicate over collections of heard-of sets
- endogenous definition of the system properties
(≠ Failure Detector model)
SLIDE 25
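$P_K^f :: |K| \geq n - f$
$P_{HO}^f :: \forall p,\ \forall r : |HO(p, r)| \geq n - f$
$P_{reg} :: \forall p, q,\ \forall r : HO(p, r + 1) \subseteq HO(q, r)$
$P_{unif} :: \exists \Pi_0,\ \forall p,\ \forall r : HO(p, r) = \Pi_0$
$P_{\Diamond unif} :: \exists \Pi_0,\ \exists r_0,\ \forall p,\ \forall r > r_0 : HO(p, r) = \Pi_0$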
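These predicates translate directly into checks over a finite prefix of heard-of sets, sketched below in Python. Two assumptions are worth stating: the prefix is finite, and the eventual-uniformity check only reports the smallest r0 after which the prefix is uniform, since "eventually" cannot be decided on finitely many rounds. All function names are illustrative.

```python
def kernel(procs, ho, rounds):
    """Global kernel K restricted to the given rounds."""
    return set.intersection(*(ho[(p, r)] for p in procs for r in rounds))


def p_k(procs, ho, rounds, f):
    """|K| >= n - f."""
    return len(kernel(procs, ho, rounds)) >= len(procs) - f


def p_ho(procs, ho, rounds, f):
    """Every heard-of set has at least n - f elements."""
    return all(len(ho[(p, r)]) >= len(procs) - f for p in procs for r in rounds)


def p_reg(procs, ho, rounds):
    """HO(p, r + 1) is a subset of HO(q, r) for all p, q and consecutive rounds."""
    return all(ho[(p, r + 1)] <= ho[(q, r)]
               for r in rounds if r + 1 in rounds
               for p in procs for q in procs)


def p_unif(procs, ho, rounds):
    """A single set is heard by every process at every given round."""
    return len({frozenset(ho[(p, r)]) for p in procs for r in rounds}) == 1


def diamond_unif_r0(procs, ho, rounds):
    """Smallest r0 such that all rounds r > r0 of the prefix are uniform.

    Returns None if even the last round of the prefix is not uniform.
    """
    ordered = sorted(rounds)
    for i, first in enumerate(ordered):
        if p_unif(procs, ho, ordered[i:]):
            return first - 1
    return None


procs = ["p1", "p2", "p3"]
everyone, pair = {"p1", "p2", "p3"}, {"p1", "p2"}
ho = {
    ("p1", 1): everyone, ("p2", 1): pair, ("p3", 1): everyone,
    ("p1", 2): pair, ("p2", 2): pair, ("p3", 2): pair,
    ("p1", 3): pair, ("p2", 3): pair, ("p3", 3): pair,
}
rounds = [1, 2, 3]
```

In this example p3 is never in the kernel, so the prefix satisfies the kernel and heard-of size bounds for f = 1 but not for f = 0.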
SLIDE 26
system type | communication predicate
Synchronous, reliable links, at most f faulty senders | $P_K^f$
Synchronous, reliable links, at most f crash failures | $P_K^f \wedge P_{reg}$
Asynchronous, reliable links, at most f crash failures | $P_{HO}^f$
Asynchronous, reliable links, at most f initial crash failures | $P_{HO}^f \wedge P_{\Diamond unif}$
Idem with n > 2f | $P_K^f \wedge P_{unif}$
Asynchronous, reliable links, and failure detector S | $P_K^1$
♦ synchronous, reliable links, at most f crash failures | $P_{HO}^f \wedge P_{\Diamond unif}$
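The entry for synchronous systems with crash failures can be illustrated by generating heard-of sets from a crash schedule. This minimal Python sketch is one possible encoding, under the assumptions that a process crashing in a round reaches only some receivers in that round and sends nothing afterwards, and that heard-of sets are defined for all processes, crashed or not. The name `crash_run_ho` and the schedule format are hypothetical.

```python
def crash_run_ho(procs, crashes, num_rounds):
    """Heard-of sets of a synchronous run with crash failures.

    crashes maps a process to (round, receivers): the process crashes during
    that round after reaching only `receivers`, then stays silent.
    """
    ho = {}
    for r in range(1, num_rounds + 1):
        for p in procs:
            heard = set()
            for q in procs:
                if q not in crashes or r < crashes[q][0]:
                    heard.add(q)   # q has not crashed yet: message delivered
                elif r == crashes[q][0] and p in crashes[q][1]:
                    heard.add(q)   # q crashed mid-round but still reached p
            ho[(p, r)] = heard
    return ho


procs = ["p1", "p2", "p3", "p4"]
num_rounds = 3
# p4 crashes in round 2 after sending to p1 only (f = 1).
ho = crash_run_ho(procs, {"p4": (2, {"p1"})}, num_rounds)

kernel = set.intersection(*ho.values())
regular = all(ho[(p, r + 1)] <= ho[(q, r)]
              for r in range(1, num_rounds) for p in procs for q in procs)
```

With one crash among four processes, the kernel keeps the three processes that never crash, and every heard-of set of a round is contained in all heard-of sets of the previous round.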
SLIDE 27 Our Results
- Shorter and simpler proofs of important computability results
- Communication predicates for which Consensus is solvable
What is necessary and sufficient to solve Consensus?
- Interrelationships between communication predicates
(or, how not to get lost in translation ...)
- Agreement problems: new algorithms for new systems
- Realistic solutions to cope with transient and dynamic failures