Computing in a Distributed System in the Presence of Benign Failures



slide-1
SLIDE 1

Computing in a Distributed System in the Presence of Benign Failures

Bernadette CHARRON-BOST, CNRS (joint work with André SCHIPER, EPFL)

slide-2
SLIDE 2

Distributed System

computational unit

medium of communication

No universal computational model for distributed systems

slide-3
SLIDE 3

Two Basic Principles

  • The model must specify why faults occur

Causes of two different natures:

  • Degree of synchronism
  • Failure model
slide-5
SLIDE 5

Two Basic Principles

  • The model must specify why faults occur
  • The model must specify by whom (culprit) faults occur
slide-6
SLIDE 6

Two Basic Principles

  • The model must specify why faults occur
  • The model must specify by whom faults occur

The notion of faulty component is necessary and useful for the analysis of distributed computations

slide-7
SLIDE 7

First Principle

[Figure: delay axis. Labels: "bounded delays", "arbitrary delays", "finite delays"; annotations: "(synchronous)", "(asynchronous)", "(failure)"]

... breaks the natural continuum from bounded to infinite delays!

slide-8
SLIDE 8

A classical type of system

Synchronous system + crash failures

slide-9
SLIDE 9

A classical type of system

Synchronous system + crash failures

  • transmission delays bounded
  • process speeds bounded or infinite
slide-10
SLIDE 10

First Principle

  • breaks the natural continuum from bounded to infinite delays
  • synchronism degree and failure model are not independent
slide-11
SLIDE 11

Second Principle

  • may lead to undesirable conclusions

Only one transmission fault from each node

Send omission model:

  • each process is considered faulty

(no algorithm when the entire system is faulty)

slide-12
SLIDE 12

Second Principle

  • may lead to undesirable conclusions
  • faulty processes are allowed to have deviant behaviors

“Every correct process eventually decides”

One transmission failure for a message sent by p to q

Send omission model:

  • p is allowed to make no decision

Link failure model:

  • p and q must make a decision

Receive omission model:

  • q is allowed to make no decision
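
The way each failure model assigns blame for the same lost message can be sketched in a few lines. This is only an illustration: the function names and the string labels for the three models are hypothetical, and the rule "a process is faulty if the model blames it for at least one lost message" is the assumption being encoded.

```python
def faulty_processes(lost, model):
    """Processes declared faulty when the messages in `lost` are not delivered.

    `lost` is a set of (sender, receiver) pairs; `model` is a hypothetical label.
    """
    if model == "send_omission":
        return {sender for sender, _ in lost}      # the sender is blamed
    if model == "receive_omission":
        return {receiver for _, receiver in lost}  # the receiver is blamed
    if model == "link_failure":
        return set()                               # the link is blamed, no process
    raise ValueError(f"unknown model: {model}")


def must_decide(processes, lost, model):
    """Processes bound by 'every correct process eventually decides'."""
    return set(processes) - faulty_processes(lost, model)


lost = {("p", "q")}  # one transmission failure for a message sent by p to q
for model in ("send_omission", "link_failure", "receive_omission"):
    print(model, sorted(must_decide({"p", "q"}, lost, model)))
# send_omission ['q']
# link_failure ['p', 'q']
# receive_omission ['p']
```

With one lost message per sender, for example `{(0, 1), (1, 2), (2, 0)}` on three nodes, `must_decide` returns the empty set under the send omission model: every process counts as faulty and no process is required to decide.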
slide-13
SLIDE 13

Second Principle

  • may lead to undesirable conclusions
  • faulty processes are allowed to have deviant behaviors
  • real causes of transmission failures are often unknown
slide-14
SLIDE 14

Second Principle

  • may lead to undesirable conclusions
  • faulty processes are allowed to have deviant behaviors
  • real causes of transmission failures are often unknown
  • no evidence that the notion of faulty component is helpful
slide-15
SLIDE 15

The Heard-Of Model

We just specify transmission faults: we no longer consider by whom or why faults occur

slide-16
SLIDE 16

HO: a Round-Based Model

[Figure: round r at process p, made of a sending phase (to all), a receive phase and a local computation]

At each round, every process sends messages to all

  • allows us to distinguish semantic and operational features of computations

slide-17
SLIDE 17

HO: a Round-Based Model

[Figure: round r at process p, made of a sending phase (to all), a receive phase and a local computation]

If m is received at round r then m has been sent at round r

Rounds are communication-closed layers

slide-18
SLIDE 18

First Principle

[Figure: delay axis. Labels: "bounded delays", "arbitrary delays"; annotations: "(synchronous)", "(failure)"]

  • late messages are discarded

[Dwork, Lynch & Stockmeyer, 1988] and [Gafni, 1998]
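
A receive phase that enforces communication-closed rounds and discards late messages might look like the sketch below. The `Message` type, the round tag carried by each message, and the choice to keep early messages pending are assumptions made for illustration; the slides do not prescribe an implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    round_no: int   # round in which the message was sent (assumed tag)
    payload: object


def receive_phase(current_round, inbox):
    """Split an inbox for the receive phase of `current_round`.

    Returns (delivered, pending):
      delivered: sender -> payload for messages sent in the current round
      pending:   messages from future rounds, kept for later
    Messages from earlier rounds are late and are silently discarded.
    """
    delivered = {m.sender: m.payload for m in inbox if m.round_no == current_round}
    pending = [m for m in inbox if m.round_no > current_round]
    return delivered, pending


inbox = [Message("p", 1, "late"), Message("q", 2, "x"), Message("r", 3, "early")]
delivered, pending = receive_phase(2, inbox)
print(delivered)  # {'q': 'x'}
```

From the receiver's point of view, the late message from `p` is indistinguishable from a lost one: `p` is simply not heard of in round 2.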

slide-19
SLIDE 19

HO Process

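$\mathit{States}_p$, $\mathit{Init}_p \subseteq \mathit{States}_p$

$S_p : (s, q) \mapsto m_q$

$T_p : (s, \mu) \mapsto s'$

[Figure: in round r, process p moves from state s to state s′]

At round r, process p receives messages from HO(p, r):

$\mathrm{supp}(\mu) = \mathrm{HO}(p, r)$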
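
One round of such a system can be simulated as follows. This is a minimal sketch: `run_round`, the dictionary encoding of states and heard-of sets, and the min-flooding example are assumptions for illustration, with `send` playing the role of S_p and `transition` the role of T_p.

```python
def run_round(states, send, transition, ho):
    """Execute one round for all processes.

    states:     dict process -> current state s
    send:       send(p, s, q) -> message that p sends to q in state s  (S_p)
    transition: transition(p, s, mu) -> new state s'                   (T_p)
    ho:         dict process -> heard-of set HO(p, r) of this round
    """
    new_states = {}
    for p, s in states.items():
        # mu is the partial vector of received messages; its support is HO(p, r).
        mu = {q: send(q, states[q], p) for q in ho[p]}
        new_states[p] = transition(p, s, mu)
    return new_states


# Hypothetical example: every process floods its value and keeps the minimum.
def flood_send(p, s, q):
    return s


def flood_transition(p, s, mu):
    return min([s, *mu.values()])


states = {"a": 3, "b": 1, "c": 2}
ho = {"a": {"a", "b"}, "b": {"b"}, "c": {"a", "c"}}
result = run_round(states, flood_send, flood_transition, ho)
print(result)  # {'a': 1, 'b': 1, 'c': 2}
```

Process `c` does not hear of `b` in this round, so it cannot learn the value 1. Nothing in the code says whether `b`, `c` or the link between them is to blame.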

slide-20
SLIDE 20

Second Principle

  • Faults are specified but not the culprits

[Santoro & Widmayer 1989]

slide-21
SLIDE 21

HO Algorithm

  • Distributed algorithm on Π

$A = (\mathit{States}_p, \mathit{Init}_p, S_p, T_p)_{p \in \Pi}$

  • Run of algorithm A

$(s^0_p)_{p \in \Pi}$ with $s^0_p \in \mathit{Init}_p$

$(\mathrm{HO}(p, r))_{p \in \Pi,\, r > 0}$

slide-22
SLIDE 22
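  • Kernel of round r:

$K(r) = \bigcap_{p \in \Pi} \mathrm{HO}(p, r)$

  • coKernel of round r:

$\mathit{coK}(r) = \Pi \setminus K(r)$

  • Global kernel (of a run):

$K = \bigcap_{p \in \Pi,\, r > 0} \mathrm{HO}(p, r) = \bigcap_{r > 0} K(r)$

  • Global coKernel (of a run):

$\mathit{coK} = \Pi \setminus K$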
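
The kernel definitions translate directly into set operations. The sketch below assumes each round is given as a dictionary from process to heard-of set, and it computes the global kernel over a finite prefix of rounds only, whereas the definition ranges over all r > 0; the function names are hypothetical.

```python
def kernel(ho_round, processes):
    """K(r): processes heard of by every process in the round."""
    return set(processes).intersection(*ho_round.values())


def cokernel(ho_round, processes):
    """coK(r) = Pi \\ K(r)."""
    return set(processes) - kernel(ho_round, processes)


def global_kernel(ho_rounds, processes):
    """K restricted to the finite prefix `ho_rounds` (a list of rounds)."""
    k = set(processes)
    for ho_round in ho_rounds:
        k &= kernel(ho_round, processes)
    return k


pi = {"a", "b", "c"}
round1 = {"a": {"a", "b", "c"}, "b": {"a", "b"}, "c": {"a", "b", "c"}}
round2 = {"a": {"a", "c"}, "b": {"a", "b", "c"}, "c": {"a", "c"}}

print(sorted(kernel(round1, pi)))                   # ['a', 'b']
print(sorted(kernel(round2, pi)))                   # ['a', 'c']
print(sorted(global_kernel([round1, round2], pi)))  # ['a']
```

Here only `a` is heard by everyone in both rounds, so the global coKernel of this prefix is `{"b", "c"}`.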

slide-23
SLIDE 23

Communication Predicate

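Predicate over collections of heard-of sets

$P_{\mathrm{nosplit}} :: \forall p, q, \forall r : \mathrm{HO}(p, r) \cap \mathrm{HO}(q, r) \neq \emptyset$

$P_{\mathrm{sp\_unif}} :: \forall p, q, \forall r : \mathrm{HO}(p, r) = \mathrm{HO}(q, r)$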
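
Both predicates can be checked mechanically on a recorded prefix of a run. In this sketch the no-split predicate is read as "any two heard-of sets of the same round intersect" and space uniformity as "all heard-of sets of a round are equal"; the function names and the list-of-dictionaries encoding are assumptions.

```python
def p_nosplit(ho_rounds):
    """Any two heard-of sets of the same round have a process in common."""
    return all(
        len(ho[p] & ho[q]) > 0
        for ho in ho_rounds
        for p in ho
        for q in ho
    )


def p_sp_unif(ho_rounds):
    """In every round, all processes have the same heard-of set."""
    return all(
        len({frozenset(heard) for heard in ho.values()}) <= 1
        for ho in ho_rounds
    )


majority = [{"a": {"a", "b"}, "b": {"b", "c"}, "c": {"a", "c"}}]
split = [{"a": {"a"}, "b": {"b", "c"}, "c": {"b", "c"}}]
uniform = [{"a": {"a", "b"}, "b": {"a", "b"}, "c": {"a", "b"}}]

print(p_nosplit(majority), p_sp_unif(majority))  # True False
print(p_nosplit(split), p_sp_unif(split))        # False False
print(p_nosplit(uniform), p_sp_unif(uniform))    # True True
```

The `majority` round shows that the two predicates are distinct: every process hears from two out of three, so any two heard-of sets overlap, yet no two of them are equal.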

slide-24
SLIDE 24

Communication Predicate

Predicate over collections of heard-of sets

  • endogenous definition of the system properties

(≠ Failure Detector model)

slide-25
SLIDE 25

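$P^f_K :: |K| \geq n - f$

$P^f_{HO} :: \forall p, \forall r : |\mathrm{HO}(p, r)| \geq n - f$

$P_{\mathrm{reg}} :: \forall p, q, \forall r : \mathrm{HO}(p, r + 1) \subseteq \mathrm{HO}(q, r)$

$P_{\mathrm{unif}} :: \exists \Pi_0, \forall p, \forall r : \mathrm{HO}(p, r) = \Pi_0$

$P_{\Diamond\mathrm{unif}} :: \exists \Pi_0, \exists r_0, \forall p, \forall r > r_0 : \mathrm{HO}(p, r) = \Pi_0$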
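
These predicates can also be evaluated on a finite prefix of a run, as sketched below. Two caveats apply. The predicates quantify over all rounds, so a check on a prefix can refute them but not establish them. The eventual predicate is handled by passing a candidate r0 explicitly, since on a finite prefix some r0 always exists. The function names and the encoding (a list of rounds, round 1 first, each a dictionary from process to heard-of set) are assumptions.

```python
def p_k(ho_rounds, processes, f):
    """|K| >= n - f, with K the global kernel of the prefix."""
    k = set(processes)
    for ho in ho_rounds:
        for heard in ho.values():
            k &= heard
    return len(k) >= len(processes) - f


def p_ho(ho_rounds, processes, f):
    """Every heard-of set has at least n - f elements."""
    n = len(processes)
    return all(len(heard) >= n - f for ho in ho_rounds for heard in ho.values())


def p_reg(ho_rounds):
    """HO(p, r + 1) is a subset of HO(q, r) for all p, q, r."""
    return all(
        nxt[p] <= cur[q]
        for cur, nxt in zip(ho_rounds, ho_rounds[1:])
        for p in nxt
        for q in cur
    )


def p_unif_after(ho_rounds, r0=0):
    """All heard-of sets of rounds r > r0 are one and the same set.

    r0 = 0 gives the uniform predicate; r0 > 0 tests one witness for
    the eventual version.
    """
    tail = {frozenset(heard) for ho in ho_rounds[r0:] for heard in ho.values()}
    return len(tail) <= 1


pi = {"a", "b", "c"}
rounds = [
    {"a": {"a", "b", "c"}, "b": {"a", "b", "c"}, "c": {"a", "b"}},
    {"a": {"a", "b"}, "b": {"a", "b"}, "c": {"a", "b"}},
    {"a": {"a", "b"}, "b": {"a", "b"}, "c": {"a", "b"}},
]

print(p_k(rounds, pi, f=1), p_ho(rounds, pi, f=1), p_reg(rounds))  # True True True
print(p_unif_after(rounds, r0=0), p_unif_after(rounds, r0=1))      # False True
```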

slide-26
SLIDE 26

system type                                                       communication predicate

Synchronous, reliable links, at most f faulty senders             $P^f_K$
Synchronous, reliable links, at most f crash failures             $P^f_K \wedge P_{\mathrm{reg}}$
Asynchronous, reliable links, at most f crash failures            $P^f_{HO}$
Asynchronous, reliable links, at most f initial crash failures    $P^f_{HO} \wedge P_{\Diamond\mathrm{unif}}$
Idem with n > 2f                                                  $P^f_K \wedge P_{\mathrm{unif}}$
Asynchronous, reliable links, and failure detector S              $P^1_K$
♦ synchronous, reliable links, at most f crash failures           $P^f_{HO} \wedge P_{\Diamond\mathrm{unif}}$

slide-27
SLIDE 27

Our Results

  • Shorter and simpler proofs of important computability results
  • Communication predicates for which Consensus is solvable

What is necessary and sufficient to solve Consensus?

  • Interrelationships between communication predicates

(or, how not to get lost in translation ...)

  • Agreement problems: new algorithms for new systems
  • Realistic solutions to cope with transient and dynamic failures