
Distributed Algorithms
Part II: Asynchronous Systems
Bertinoro, March 2009

Asynchronous Systems

• Synchronous
  – lock-step synchrony, rounds
  – computation and message delivery
• Asynchronous
  – we lose the notion of "time"
  – processes can take any amount of time in performing computation steps
  – channels can take any amount of time in delivering a message
• Difficult to deal with
  – many impossibility results

Asynchronous Model

• I/O automaton model
  – a very general model for describing asynchronous systems
• Modular description
  – composition operation
  – can combine automata to create a larger automaton
• Invariant assertion proofs

I/O Automata Model

• Every component of a distributed system is described by means of an automaton
• I/O
  – the automaton has input and output actions
  – but also internal actions

Example:
  State (variables): a = 1, msg = "Hello"
  Transitions (actions):
    Send(msg)
      Precondition: a < 5
      Effect: a := a + 1
  Typical actions: init(v)i, decide(v)i, send(m)i,j, recv(m)j,i

I/O Automaton Model

• Formally, an IOA A consists of 5 components
  – sig(A), a set of actions (input, output, internal)
  – states(A)
    • a (not necessarily finite) set of states
  – start(A)
    • a nonempty subset of states(A)
  – trans(A), a state-transition relation
    • trans(A) ⊆ states(A) × acts(sig(A)) × states(A)
    • ∀ state s and input action π, ∃ state s' s.t. (s, π, s') ∈ trans(A)
  – tasks(A)
    • an equivalence relation on the "locally controlled" actions of A

I/O Automaton Model — the Channel(i,j) automaton

Signature:
  Input:  send(m)i,j, m ∈ M
  Output: receive(m)i,j, m ∈ M

States:
  queue, a FIFO queue of elements of M, initially empty

Transitions:
  send(m)i,j
    Effect: add m to queue
  receive(m)i,j
    Precondition: m is first on queue
    Effect: remove first element of queue

Tasks:
  {receive(m)i,j : m ∈ M}
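
A minimal Python sketch (not from the slides) of this Channel(i,j) automaton, useful for experimenting with executions; the class name and the small driver are illustrative assumptions.

from collections import deque

class Channel:
    """Sketch of the Channel(i,j) I/O automaton: a reliable FIFO link."""

    def __init__(self, i, j):
        self.i, self.j = i, j
        self.queue = deque()          # state: FIFO queue of messages, initially empty

    # input action send(m)i,j -- always enabled, controlled by the sender
    def send(self, m):
        self.queue.append(m)          # effect: add m to queue

    # output action receive(m)i,j -- locally controlled
    def receive_enabled(self):
        return len(self.queue) > 0    # precondition: some m is first on queue

    def receive(self):
        assert self.receive_enabled()
        return self.queue.popleft()   # effect: remove first element of queue

if __name__ == "__main__":
    c = Channel(1, 2)
    c.send("a"); c.send("b")
    print(c.receive(), c.receive())   # a b  (FIFO order)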

I/O Automaton Model

• Execution
  – s0, π1, s1, π2, s2, …
    • each (sk, πk+1, sk+1) is a transition of A
    • s0 is a start state of A
• Examples of executions for Channel(i,j) (where [λ] denotes the empty queue):
  α1 = [λ], send(1)i,j, [1], receive(1)i,j, [λ], send(2)i,j, [2], receive(2)i,j, [λ]
  α2 = [λ], send(1)i,j, [1], receive(1)i,j, [λ], send(2)i,j, [2]
  α3 = [λ], send(1)i,j, [1], send(1)i,j, [11], send(1)i,j, [111], …

I/O Automaton Model

• Operations on automata
  – Composition
  – Hiding
• Composition
  – a complex automaton is described by smaller automata
    • each describing a simpler piece of the system
  – the smaller automata are combined together
  – actions with the same name are executed together
    • only one automaton has "control" of the action
    • the INPUT action of one automaton gets executed when the OUTPUT action with the same name of another automaton gets executed

I/O Automaton Model — the Process(i) automaton

Signature:
  Input:  init(v)i, v ∈ V
          receive(m)j,i, m ∈ M, j ≠ i
  Output: decide(v)i, v ∈ V
          send(m)i,j, m ∈ M, j ≠ i

States:
  val, a vector indexed by {1,2,…,n} of elements in V ∪ {⊥}, all initially ⊥

Transitions:
  init(v)i
    Effect: val[i] := v
  receive(v)j,i, v ∈ V
    Effect: val[j] := v
  send(v)i,j, v ∈ V
    Precondition: val[i] = v
    Effect: none
  decide(v)i, v ∈ V
    Precondition: ∀ j = 1,…,n, val[j] ≠ ⊥ and v = f(val[1],…,val[n])
    Effect: none

Tasks:
  for every j ≠ i: {send(v)i,j : v ∈ V}; {decide(v)i : v ∈ V}

I/O Automaton Model

• Now we can compose Channel(i,j) and pi for every i, j
  – when p1 executes send(m)1,2, channel C1,2 executes send(m)1,2
  – when channel C1,2 executes receive(m)1,2, p2 executes receive(m)1,2
• Internal actions
  – are unobservable externally, so they are local and cannot be involved in the composition
    • to compose two automata, the two sets of internal actions must be disjoint
• Output actions
  – also disjoint
    • otherwise two automata would "control" the same action; an output action can correspond to one (or more) input actions in other automata
• Input actions
  – can be executed at any time

I/O Automaton Model

[Figure: three processes p1, p2, p3 composed with channels C1,2, C2,1, C2,3, C3,2, C1,3, C3,1; matching send(m)i,j / receive(m)i,j actions connect each process to its channels.]

I/O Automaton Model

[Figure: the composed system of p1, p2, p3 and all channels; only the external actions init(v)i and decide(v)i remain visible.]

• Hiding
  – we are "hiding" the "output" actions send and receive
  – those actions are reclassified as internal to A

I/O Automaton Model

• Fairness
  – in a distributed system all "components" should get fair turns to execute steps
• Tasks
  – a partition of the locally controlled actions
  – each equivalence class represents a "task" that the automaton is supposed to perform
• An execution α is fair if for each class C of tasks(A):
  – if α is finite, then C is not enabled in the final state
  – if α is infinite, then α contains either infinitely many events from C or infinitely many occurrences of states in which C is not enabled

I/O Automaton Model

• Example: a clock with two tasks
  – in a fair execution the clock ticks forever and responds to the requests

Clock automaton

Signature:
  Input:    request
  Internal: tick
  Output:   clock(t), t ∈ Nat

States:
  counter ∈ Nat, initially 0
  flag, a Boolean, initially false

Transitions:
  request
    Effect: flag := true
  tick
    Precondition: true
    Effect: counter := counter + 1
  clock(t)
    Precondition: counter = t and flag = true
    Effect: flag := false

Tasks:
  {tick}
  {clock(t) : t ∈ Nat}

I/O Automaton Model

• Invariant assertions
  – a property that is true in all reachable states
  – usually proved by induction on the number of steps in the execution
• Properties
  – Safety properties
    • something "bad" does not happen
  – Liveness properties
    • something (good) will eventually happen

I/O Automaton Model

• Simulation proofs
  – used to prove that one automaton (usually lower-level) "implements" another automaton (higher-level)
• A simulation relation
  – a one-way relationship f between the two automata

[Diagram: a step s →π s' of the low-level automaton is matched by a corresponding step f(s) →π f(s') of the high-level automaton.]

I/O Automaton Model

• Complexity measures
  – asynchronous system: no time bounds
  – we can assume that a given particular task is executed within some time bound
    • a process performs an enabled step within time l
    • a channel delivers a message within time d
    • time bounds are not known to the processes
  – used only for analysis
• Messages
  – as before

Reliable FIFO channel — the Channel(i,j) automaton

Signature:
  Input:  send(m)i,j, m ∈ M
  Output: receive(m)i,j, m ∈ M
States:
  queue, a FIFO queue of elements of M, initially empty
Transitions:
  send(m)i,j
    Effect: add m to queue
  receive(m)i,j
    Precondition: m is first on queue
    Effect: remove first element of queue
Tasks:
  {receive(m)i,j : m ∈ M}

Reliable reordering channel — the Channel(i,j) automaton

Signature:
  Input:  send(m)i,j, m ∈ M
  Output: receive(m)i,j, m ∈ M
States:
  queue, a (multi)set of elements of M, initially empty
Transitions:
  send(m)i,j
    Effect: add m to queue
  receive(m)i,j
    Precondition: m is in queue
    Effect: remove m from queue
Tasks:
  {receive(m)i,j : m ∈ M}

A lossy reordering channel — the Channel(i,j) automaton

Signature:
  Input:  send(m)i,j, m ∈ M
  Output: receive(m)i,j, m ∈ M
States:
  queue, a (multi)set of elements of M, initially empty
Transitions:
  send(m)i,j
    Effect: add any finite number of copies of m to queue
  receive(m)i,j
    Precondition: m is in queue
    Effect: remove m from queue
Tasks:
  {receive(m)i,j : m ∈ M}

A broadcast channel B — the BCAST automaton

Signature:
  Input:  bcast(m)i, m ∈ M
  Output: receive(m)i,j, m ∈ M
States:
  for every i,j: queue(i,j), a FIFO queue of elements of M, initially empty
Transitions:
  bcast(m)i
    Effect: for all j, add m to queue(i,j)
  receive(m)i,j
    Precondition: m is first on queue(i,j)
    Effect: remove first element of queue(i,j)
Tasks:
  for every i,j: {receive(m)i,j : m ∈ M}

Asynchronous algorithms

LEADER ELECTION IN ASYNCHRONOUS SYSTEMS

LCR algorithm

• The same LCR algorithm we have seen for synchronous rings works for asynchronous rings
  – each process sends its identifier
  – when a process receives an identifier
    • it compares it to its own
    • if the incoming identifier is greater, it keeps passing the identifier
    • if the incoming identifier is smaller, it discards the identifier
    • if it is equal, it outputs "leader"
• Differences
  – each process has to maintain a buffer for incoming messages (see the sketch below)
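
A minimal Python sketch (not from the slides) of the per-process logic of asynchronous LCR and a tiny ring driver; the class, the scheduling order, and the driver are illustrative assumptions.

class LCRProcess:
    """Sketch of one AsynchLCR process: pass larger UIDs around a unidirectional ring."""

    def __init__(self, uid):
        self.uid = uid
        self.send_buf = [uid]          # FIFO buffer of UIDs waiting to be sent clockwise
        self.status = "unknown"        # unknown / chosen / reported

    def on_receive(self, v):
        # case analysis from the slide: relay larger UIDs, drop smaller, elect on own UID
        if v > self.uid:
            self.send_buf.append(v)
        elif v == self.uid:
            self.status = "chosen"

    def next_message(self):
        return self.send_buf.pop(0) if self.send_buf else None

def run_ring(uids):
    procs = [LCRProcess(u) for u in uids]
    n = len(procs)
    progress = True
    while progress:
        progress = False
        for i, p in enumerate(procs):
            m = p.next_message()
            if m is not None:
                procs[(i + 1) % n].on_receive(m)   # deliver on the channel i -> i+1
                progress = True
    return [p.uid for p in procs if p.status == "chosen"]

print(run_ring([3, 7, 2, 9, 4]))   # [9] -- the maximum UID is elected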

Asynchronous LCR — the AsynchLCR(i) automaton

Signature:
  Input:  receive(v)i-1,i, v ∈ UID
  Output: send(v)i,i+1, v ∈ UID
          leaderi

States:
  u, a UID, initially i's UID
  send, a FIFO queue of UIDs, initially containing only i's UID
  status ∈ {unknown, chosen, reported}, initially unknown

Transitions:
  send(v)i,i+1
    Precondition: v is first on send
    Effect: remove first element of send
  receive(v)i-1,i
    Effect: case
      v > u: add v to send
      v = u: status := chosen
      v < u: do nothing
    endcase
  leaderi
    Precondition: status = chosen
    Effect: status := reported

Tasks:
  {leaderi}, {send(v)i,i+1 : v ∈ UID}

Complete system (reliable channels)

[Figure: processes p1, …, pn on a unidirectional ring, composed with channels Channel(1,2), Channel(2,3), …, Channel(n,1); each send(v)i,i+1 output of pi is an input of Channel(i,i+1), whose receive(v)i,i+1 output is an input of pi+1.]

Asynchronous LCR

Invariant 1: In any reachable state:
  1. if i ≠ imax and j ∈ [imax, i), then ui ∉ sendj
  2. if i ≠ imax and j ∈ [imax, i), then ui ∉ queuej,j+1

Proof: By induction on the length of the execution.
Basis: in the start state queuej,j+1 is empty and sendj = {uj}. We need to worry only if i = j, but this cannot be since j ∈ [imax, i).
Inductive step: assume the invariant holds in a reachable state s of the execution. We need to prove that it holds in s', for any step (s, π, s'). Proceed by case analysis:
  π = leaderk: nothing is added to sendk or queuek,k+1.
  π = send(v)k-1,k: an element is taken out of sendk-1, so the first statement cannot be made false. The same value v is added to queuek-1,k; but since the first statement is true in s, we have ui ∉ s.sendk-1, hence the second statement also remains true in s'.

Asynchronous LCR

Proof (ctnd):
  π = receive(v)k-1,k: v is (might be) added to sendk. If v ≠ ui, the statements cannot be made false, so consider the case v = ui. We need to worry only when imax ≤ k < i.
  Consider first k = imax. By the code of process k = imax, since i ≠ imax, ui gets discarded.
  Consider now the case imax < k < i. We claim that this is not possible. Indeed, since the invariant is true in state s, we have ui ∉ s.queuek-1,k. But by the precondition of π we have ui ∈ s.queuek-1,k. Contradiction. ☐

Invariant 2: In any reachable state, if i ≠ imax then statusi = unknown.

Asynchronous LCR

Lemma 3: In any fair execution, process imax eventually performs a leader output. Hence AsynchLCR solves leader election.

• Messages:
  – O(n²), as in the synchronous case
• Time?
  – naive analysis: O(n²(d + l))
  – it can be proven that the time is O(n(d + l))

Other algorithms

• The HS synchronous algorithm
  – can be easily adapted to the asynchronous case
  – uses bidirectional communication
  – O(n log n) messages
  – Exercise: determine an upper bound for the time
• The Peterson algorithm
  – uses only unidirectional communication
  – O(n log n) messages
  – O(n(d + l)) time

Peterson's algorithm

• Informal
  – each process can be in one of two modes
    • "active" or "relay"
  – active processes carry out the "real work"
    • relay processes just pass messages along
  – computation proceeds in (logical) phases
  – in each phase the number of active processes is reduced by a factor of at least 2
    • so there are at most log n phases
  – initially each process is active

Peterson's algorithm

• Informal (cnt'd)
  – in each phase
    • each active process sends its UID to the "next" and to the "next-to-next" active process
    • process i receives UIDi-1 and UIDi-2
    • it attempts to adopt UIDi-1 as its new temporary UID
      – only if UIDi-1 > UIDi and UIDi-1 > UIDi-2
    • if this is not possible, then i goes into "relay" mode for the next phase
  – Leader condition
    • when a process receives its own (temporary) UID from its "previous" active process, it means that it is the only active process left in the ring
    • it becomes the leader
  – (a sketch of one phase step appears after this list)
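
A minimal Python sketch (not from the slides) of the decision an active process makes in one Peterson phase, given the two UIDs received from its nearest active predecessors; the function name and return convention are illustrative assumptions.

def peterson_phase_step(own_uid, uid_prev, uid_prev2):
    """One active process's decision in a Peterson phase.

    own_uid   -- this process's current temporary UID
    uid_prev  -- UID received from the previous active process
    uid_prev2 -- UID received from the previous-to-previous active process
    Returns (new_mode, new_uid).
    """
    if uid_prev == own_uid:
        return "leader", own_uid       # own UID came back: only active process left
    if uid_prev > own_uid and uid_prev > uid_prev2:
        return "active", uid_prev      # adopt the local maximum as temporary UID
    return "relay", own_uid            # otherwise become a relay for later phases

# tiny usage example
print(peterson_phase_step(5, 9, 3))    # ('active', 9)
print(peterson_phase_step(5, 2, 8))    # ('relay', 5)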

Peterson's algorithm

• Complexity analysis
• How many messages?
  – we have at most log n phases
  – in each phase at most 2n messages are sent
    • this accounts both for the initial 2 messages sent by each active process and for the subsequent relays
  – O(n log n) messages
• What about time?
  – O(n(d + l)) time

Peterson's algorithm

• Exercise: sketch a proof that Peterson's algorithm solves the leader election problem
• Exercise: sketch a proof of the O(n(d + l)) time bound
• Exercise: answer the following question, providing also a particular input (or inputs) that shows that your answer is correct:
  – in the Peterson algorithm, which UID gets elected leader? (max, min, arbitrary)

Leader in general graph

• We can use an algorithm similar to FloodSet
  – we "simulate" rounds by attaching "round numbers" to each message
    • processes wait to receive all round-r messages before proceeding to round r+1
  – round numbers are necessary because the algorithm relies on the knowledge of diam
  – we need to execute at least diam rounds
• Remark
  – in the synchronous case there is an easy optimization
    • send a message only if there is new information (that is, a new max UID)
  – in the asynchronous case we cannot do this
    • we need messages to make the "rounds" progress

Leader in general graph

• Remark (cntd)
  – we can actually use the optimization
    • processes will eventually know the leader
  – but they will not know that they know!
    • might be useful if we have a different way to detect termination
• Other approaches for general graphs
  – use broadcast and convergecast
  – use a synchronizer to simulate a synchronous algorithm
  – use a global snapshot to detect termination of an asynchronous algorithm

Recap: Leader Election

Name                             Graph                    Synchrony   Messages         Time
LCR                              Ring - unidirectional    SYNCH       O(n²)            O(n)
HS                               Ring - bidirectional     SYNCH       O(n log n)       O(n)
lower bound for comparison-based Ring                     SYNCH       Ω(n log n)       -
TimeSlice                        Ring - unidirectional    SYNCH       O(n)             O(n × umin)
Peterson                         Ring - unidirectional    ASYNCH      O(n log n)       O(n(d + l))
Leader with BFS                  General - bidirect.      SYNCH       O(|E|)           O(diam)
Leader with BFS                  General - unidirect.     SYNCH       O(diam × |E|)    O(diam²)

Other basic algorithms

Spanning trees, BFS, Shortest paths, MST

Spanning trees

• Assume the network graph is bidirectional
• Spanning tree for the network graph
  – rooted at some node i0
• Each node has to report, with a parent action, the name of its parent in the spanning tree
• SynchBFS algorithm
  – constructs a BFS spanning tree rooted at i0
  – each process selects as parent the first neighbor from which it hears
  – can be used also in the asynchronous setting, although the tree constructed is not necessarily breadth-first

Asynchronous Spanning Tree — the AsynchSpanningTree(i) automaton

Signature:
  Input:  receive("search")j,i, j ∈ nbrs
  Output: send("search")i,j, j ∈ nbrs
          parent(j)i, j ∈ nbrs

States:
  parent ∈ nbrs ∪ {⊥}, initially ⊥
  reported, a Boolean, initially false
  for every j ∈ nbrs: send(j) ∈ {"search", ⊥}, initially "search" if i = i0, ⊥ otherwise

Transitions:
  send("search")i,j
    Precondition: send(j) = "search"
    Effect: send(j) := ⊥
  receive("search")j,i
    Effect: if i ≠ i0 and parent = ⊥ then
              parent := j
              for all k ∈ nbrs − {j}: send(k) := "search"
  parent(j)i
    Precondition: parent = j and reported = false
    Effect: reported := true

Tasks:
  for every j ∈ nbrs: {send("search")i,j}; {parent(j)i : j ∈ nbrs}
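
A minimal Python sketch (not from the slides) of the same "adopt the first neighbour you hear from" rule, with a tiny driver; the class, the graph, and the scheduling are illustrative assumptions.

class SpanningTreeProcess:
    """Sketch of AsynchSpanningTree(i): adopt the first neighbour heard from as parent."""

    def __init__(self, pid, neighbours, root):
        self.pid = pid
        self.neighbours = set(neighbours)
        self.parent = None
        self.is_root = (pid == root)
        self.pending = set(neighbours) if self.is_root else set()   # send(j) = "search"

    def on_search(self, sender):
        # first "search" message fixes the parent, then flood to the remaining neighbours
        if not self.is_root and self.parent is None:
            self.parent = sender
            self.pending |= self.neighbours - {sender}

    def next_send(self):
        # one enabled send("search") action, or None if none is pending
        return self.pending.pop() if self.pending else None

if __name__ == "__main__":
    edges = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
    procs = {i: SpanningTreeProcess(i, nbrs, root=1) for i, nbrs in edges.items()}
    work = True
    while work:
        work = False
        for i, p in procs.items():
            j = p.next_send()
            if j is not None:
                procs[j].on_search(i)      # asynchronous delivery of the "search" message
                work = True
    print({i: p.parent for i, p in procs.items()})   # e.g. {1: None, 2: 1, 3: 1, 4: 3}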

Spanning tree

Invariant: In any reachable state, the edges defined by all the parent variables form a spanning tree of a subgraph of G containing i0; moreover, if there is a message in any channel Ci,j, then i is in the spanning tree.

• The above invariant is about safety: we always have a spanning tree (of a subgraph).
• Exercise: write an invariant about liveness: each node eventually gets included in the spanning tree.

Spanning tree

• Complexity analysis
• Messages
  – O(|E|) messages
• Time
  – parent actions are performed within time diam × (l + d) + l
• Remark: a message on a path longer than diam might travel faster, since the system is asynchronous
  – this is why the spanning tree might not be breadth-first

Spanning tree

• Broadcast
  – as for SynchBFS, it is easy to augment AsynchSpanningTree to broadcast a message m
    • piggyback m on the "search" messages
• Child pointers
  – easy, since communication is bidirectional
  – each node responds to a "search" message either with "parent" or with "non-parent"
  – Exercise: write IOA code for handling such messages
• Child pointers can be used for convergecast

Spanning tree

• Application to leader election
• We can use asynchronous broadcast and convergecast to elect a leader
  – general graph
  – no distinguished source node
  – no knowledge of n
  – no knowledge of the diameter
• Exercise: design such an algorithm and write IOA code.

BFS

• AsynchSpanningTree
  – the asynchronous version of SynchBFS, but it does not produce a breadth-first tree
• Breadth-first tree
  – we need a different strategy, since messages can travel at different speeds
  – send messages with an explicit "distance" from the source node

Asynchronous BFS — the AsynchBFS(i) automaton

Signature:
  Input:  receive(m)j,i, m ∈ Nat, j ∈ nbrs
  Output: send(m)i,j, m ∈ Nat, j ∈ nbrs

States:
  dist ∈ Nat ∪ {∞}, initially 0 if i = i0, ∞ otherwise
  parent ∈ nbrs ∪ {⊥}, initially ⊥
  for every j ∈ nbrs: send(j), a FIFO queue of Nats, initially {0} if i = i0, empty otherwise

Transitions:
  send(m)i,j
    Precondition: m is first on send(j)
    Effect: remove first element of send(j)
  receive(m)j,i
    Effect: if m + 1 < dist then
              dist := m + 1
              parent := j
              for all k ∈ nbrs − {j}: add dist to send(k)

Tasks:
  for every j ∈ nbrs: {send(m)i,j : m ∈ Nat}
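
A minimal Python sketch (not from the slides) of the AsynchBFS relaxation rule with a tiny driver; the class and the scheduling loop are illustrative assumptions.

import math

class BFSProcess:
    """Sketch of AsynchBFS(i): relax the distance estimate on every received message."""

    def __init__(self, pid, neighbours, root):
        self.pid = pid
        self.neighbours = list(neighbours)
        self.dist = 0 if pid == root else math.inf
        self.parent = None
        # per-neighbour FIFO of distances still to be sent
        self.outbox = {j: ([0] if pid == root else []) for j in self.neighbours}

    def on_receive(self, sender, m):
        if m + 1 < self.dist:                      # better estimate: adopt it and re-flood
            self.dist = m + 1
            self.parent = sender
            for k in self.neighbours:
                if k != sender:
                    self.outbox[k].append(self.dist)

    def next_send(self, j):
        return self.outbox[j].pop(0) if self.outbox[j] else None

if __name__ == "__main__":
    edges = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
    procs = {i: BFSProcess(i, nbrs, root=1) for i, nbrs in edges.items()}
    changed = True
    while changed:
        changed = False
        for i, p in procs.items():
            for j in p.neighbours:
                m = p.next_send(j)
                if m is not None:
                    procs[j].on_receive(i, m)
                    changed = True
    print({i: (p.dist, p.parent) for i, p in procs.items()})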

Asynchronous BFS

Theorem: In any fair execution of the AsynchBFS algorithm, the system eventually stabilizes to a state in which the parent variables represent a breadth-first spanning tree.

• Complexity
  – O(n|E|) messages
    • Exercise: can be improved to O(diam × |E|) if the diameter is known
  – O(diam × n(d + l)) time
• Termination
  – actually the algorithm doesn't terminate
  – we need a version with Acks (Exercise)

Other BFS algorithms

Algorithm                    Messages                  Time
AsynchBFS                    O(diam × |E|)             O(diam × d)
LayeredBFS                   O(|E| + n × diam)         O(diam² × d)
HybridBFS, 1 ≤ m ≤ diam      O(m|E| + n × diam/m)      O(d × diam²/m)

• Exercise
  – modify AsynchBFS to get AsynchBellmanFord, to compute shortest paths from a source node
    • you now have weights on the edges

MST

• Minimum Spanning Tree
  – G = (V, E), connected and undirected
  – weight(i,j), known to i and j
  – we want a spanning tree of minimum weight
• Unique weights
  – we will see how to easily extend to the general case
• GHS algorithm
  – Gallager, Humblet, Spira

MST

Lemma: Let G = (V, E) be a weighted undirected graph, and let {(Vi, Ei) : 1 ≤ i ≤ k} be any spanning forest for G, where k > 1. Fix any i, 1 ≤ i ≤ k. Let e be an edge of smallest weight in the set {e' : e' has exactly one endpoint in Vi}. Then there is a spanning tree for G that includes ∪j Ej and e, and this tree is of minimum weight among all spanning trees that include ∪j Ej.

MST strategy: start with the spanning forest of singletons. Repeatedly choose any minimum outgoing edge (i,j) and combine the two trees Ti and Tj into one tree. Stop when the forest has only one tree.

MST

• Can we use the same strategy in a distributed setting?
• Yes and no
  – yes, in the sense that the lemma is (obviously) true also in a distributed setting
  – no, in the sense that in a distributed setting we have to be careful about how we determine the minimum outgoing edge
    • several components could determine minimum outgoing edges concurrently, creating a cycle

GHS synchronous

• GHS: a "synchronous" version
  – builds components in "levels". For each k, the level-k components form a spanning forest; each component has at least 2^k nodes and a distinguished leader node. Each level is "computed" in a fixed, O(n), number of rounds.
  – Level 0: singletons.
  – Assume level k has been computed. To get level k+1, each leader conducts a search in its tree for the minimum outgoing edge, with a broadcast/convergecast strategy.
  – When all leaders have computed their minimum outgoing edges, merge trees on the minimum outgoing edges and elect one leader; the ID of the new leader is broadcast into the new component. Keep merging until all level-k components have merged into a level-(k+1) component.
  – When no minimum outgoing edge can be found, the algorithm stops because an MST has been found.

GHS synchronous

• Time: O(n log n)
  – each "level" takes O(n) rounds
  – at each level components have size at least 2^k, hence we have at most log n levels
• Messages: O((n + |E|) log n)
  – at each level, at most
    • O(n) messages in total are sent along all tree edges
    • an additional O(|E|) messages are required to find the minimum outgoing edge
  – can be reduced to O(n log n + |E|)
• Non-unique edge weights
  – use identifiers to break "ties"
  – (w,i,j) < (w,i',j') if i < i' or (i = i' and j < j')

GHS asynchronous

• Asynchrony poses several difficulties
  – processes can be at different "levels"
  – some components may get very large and some others very small
    • not good for efficiency
  – the message complexity is based on the fact that levels are kept synchronized
• We can "tag" messages with the level they belong to
• A level-(k+1) component is formed by "merging" two level-k components
  – but a higher-level component can also "absorb" a lower-level component

GHS asynchronous

• Messages used
  – Initiate. Broadcast by the leader so that all nodes of the component collaborate in finding the minimum outgoing edge.
  – Report. Convergecast information about the minimum outgoing edge to the leader.
  – Test. A process i sends such a message to a process j to check whether i and j are in the same component.
  – Accept and Reject. Responses to Test messages.
  – Changeroot. The leader of a component that wants to perform a merge sends this message to ask for a merge (with the other tree).
  – Connect. Used to perform the merge operation (or an absorb, if the other component is lower-level).

GHS

Theorem: The GHS algorithm solves the MST problem in an arbitrary undirected weighted graph.

• Complexity
  – O(n log n + |E|) messages
    • as in the improved synchronous version
  – O(n log n (d + l)) time
• Communication complexity must be Ω(n log n)
  – from an MST algorithm one can easily get a leader election algorithm; we know that leader election needs Ω(n log n) messages

Asynchronous consensus

ASYNCHRONOUS CONSENSUS WITH PROCESS FAILURES

Asynchronous consensus

• Agreement: in any execution, all decision values are identical
• Validity: in any execution, if all initial values are equal to some v, then v is the only possible decision value
• f-failure termination: in any execution in which at most f processes stop, all non-faulty processes decide
  – failure-free case, f = 0: all processes decide
  – wait-free case, f = n: every non-faulty process decides

Asynchronous consensus

Theorem: There is no algorithm that solves consensus in the asynchronous setting if even just one process can fail.

• Known as the FLP impossibility result
  – Fischer, Lynch, Paterson [1985]
• We will prove the theorem using the exact same model as the original paper
  – that is, we will not use I/O automata
• Assume V = {0,1}

FLP impossibility result

• System of n processes
  – p1, p2, …, pn
  – each with a 1-bit input
    • containing the input; it cannot be changed
  – and a 1-bit output
    • which can be written only once
• Communication
  – by sending and receiving messages
  – channels are modeled with a global (reliable) buffer
    • a send action places a message in the buffer
    • a receive action takes an arbitrary message from the buffer and delivers it to its recipient (deleting it from the buffer)

• The communication buffer is a multiset of pairs (p, m):
  – Buffer = {(p, m) : m is a message for p}
• The system has two abstract operations
  – send(p, m)
    • denotes the sending of message m to process p; places (p, m) in the buffer
  – receive(p)
    • deletes some message (p, m) from the buffer and delivers m to p, OR returns the special null marker ∅

FLP impossibility result

• Configuration
  – state of each process + content of the buffer
• Initial configuration
  – the buffer is empty
  – each input bit is set
• Step
  – brings the system from one configuration to another
  – is performed by a single process p
    • the process performs a receive(p) to obtain a message m ∈ M ∪ {∅}
    • then p executes some local code, changing its internal local state and possibly sending a finite number of messages

• Event
  – the pair e = (p, m), which represents the receipt of m by p
• A schedule σ is a finite or infinite sequence of events, applied starting from a configuration C
  – σ(C) is the resulting configuration
• An execution is a schedule starting from an initial configuration

FLP impossibility result

Lemma 1: Consider a configuration c and two schedules σ1 and σ2 applicable to c. Let c1 = σ1(c) and c2 = σ2(c). If the sets of processes taking steps in σ1 and σ2 are disjoint, then σ2(c1) = σ1(c2).

[Diagram: the commuting square c →σ1 c1, c →σ2 c2, c1 →σ2 c', c2 →σ1 c'.]

FLP impossibility result

• 0-valent configuration
  – a configuration that leads only to final decision 0
• 1-valent configuration
  – a configuration that leads only to final decision 1
• univalent configuration
  – a configuration that is either 0-valent or 1-valent
• bivalent configuration
  – a configuration from which we can reach either decision 0 or decision 1

FLP impossibility result

• Assume now that there exists an algorithm A that solves consensus
• We will show that we can construct an execution in which A never reaches a decision
• We show
  – that there exists a bivalent initial configuration
  – that it is possible to keep taking execution steps without ever reaching a univalent configuration

FLP impossibility result

Lemma 2: There is a bivalent initial configuration for A.
Proof: By contradiction, assume all initial configurations are univalent.
Let c0 be the initial configuration with all initial values = 0. By validity, c0 is 0-valent.
For each i = 1, 2, …, n, consider the initial configuration ci, which differs from ci-1 only in the input of pi: the input of pi is 0 in ci-1 and 1 in ci. Hence cn is the initial configuration with all initial values = 1. By validity, cn is 1-valent.
There must therefore be an i such that ci-1 is 0-valent and ci is 1-valent; they differ only in the input of pi.

FLP impossibility result

Proof (ctnd):
Consider a schedule σ starting from ci-1 in which pi stops immediately (taking no actions). The input of pi cannot become known to any other process.
Consider the same schedule σ starting from ci, again with pi stopping immediately. Again the input of pi cannot become known to any other process.
The input of pi is the only difference between ci-1 and ci. Since pi fails immediately, the two executions are indistinguishable to all other processes pj. Hence the non-faulty processes decide the same value in both executions. This is a contradiction, since ci-1 is 0-valent and ci is 1-valent. ☐

FLP impossibility result

Lemma 3: Let c be a bivalent configuration and let e = (p, m) be an event applicable to c. Let C be the set of configurations reachable from c without applying e, and let D = {e(c') : c' ∈ C and e applicable to c'}. Then D contains a bivalent configuration.

Proof: By contradiction, assume D contains only univalent configurations.
Let ck be k-valent and reachable from c, k ∈ {0,1}. Both c0 and c1 exist since c is bivalent.
Let dk be k-valent with dk ∈ D. These must exist. Indeed, if ck ∈ C, then by definition of D, dk = e(ck) ∈ D. Otherwise, ck ∉ C, so e was applied in reaching ck; the configuration reached just after e, call it dk, belongs to D, and since it leads to the k-valent ck and is univalent by assumption, it is k-valent.

FLP impossibility result

[Diagram, case c0 ∈ C: the bivalent c reaches the 0-valent c0 ∈ C and the 1-valent c1; applying e = (p, m) to c0 gives d0 ∈ D, which is 0-valent.]
[Diagram, case c0 ∉ C: e = (p, m) was applied on the path from c to c0; the configuration d0 ∈ D reached just after e is univalent and leads to the 0-valent c0, so it is 0-valent.]

• Remember that we assumed that D does not contain bivalent configurations (only univalent ones).
  – D contains both 0-valent and 1-valent configurations.

FLP impossibility result

There exist in C two "neighbor" configurations b0 and b1, that is b1 = e'(b0) for some event e' = (p', m'), such that d0 = e(b0) is 0-valent and d1 = e(b1) is 1-valent.

[Diagram: inside C, b0 →e' b1; applying e = (p, m) gives d0 = e(b0) (0-valent) and d1 = e(b1) (1-valent) in D.]

FLP impossibility result

It cannot be that p ≠ p'. Indeed, if p ≠ p' we could apply e' to d0 and, by Lemma 1, we would get to d1. This is impossible, since d0 is 0-valent and d1 is 1-valent.

[Diagram: with disjoint processes, e and e' commute, so d1 = e'(d0), contradicting the valencies.]

FLP impossibility result

It cannot be that p = p'. Consider any finite deciding run σ from b0 in which p takes no steps. By Lemma 1 we can apply σ also after d0 and after d1.

[Diagram: σ leads from b0 to a univalent configuration t; by Lemma 1, applying e to t reaches the 0-valent side and applying e' then e reaches the 1-valent side, so both a 0-valent and a 1-valent configuration are reachable from the univalent t. Impossible.]

Hence D must contain a bivalent configuration. ☐

FLP impossibility result

Theorem: There is an execution in which A never terminates.
Proof: Immediate from Lemma 2 (there exists a bivalent initial configuration) and Lemma 3 applied repeatedly (from a bivalent configuration we can take steps and get to another bivalent configuration). The resulting execution is an infinite execution in which A never reaches a decision. ☐

Randomized consensus

• We relax the termination condition
  – terminate within time t, with probability 1 − ε(t)
• The BenOr algorithm
  – works for n > 3f and V = {0,1}
• Each pi has local variables x and y
  – x initially contains pi's input value
• The algorithm proceeds in (logical) phases
  – each consisting of two (logical) rounds
  – processes keep executing, even after deciding

BenOr algorithm

Phase s (any s):

• Round 1: pi broadcasts (s, "r1", x), then waits for n − f messages of the form (s, "r1", *). If all of them have the same value v, then pi sets y := v, else y := ⊥.
• Round 2: pi broadcasts (s, "r2", y), then waits for n − f messages of the form (s, "r2", *).
  – if all have the same value v (≠ ⊥), then pi sets x := v and decides v, if it has not done so already
  – if at least n − 2f have the same value v (≠ ⊥), then x := v, but pi does not decide
  – otherwise x := 0 or 1 (chosen at random)
  – (a sketch of this phase logic appears after this list)
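
A minimal Python sketch (not from the slides) of the per-process decision logic for one BenOr phase, assuming the messaging layer has already collected the n − f values; function names and data shapes are illustrative assumptions.

import random
from collections import Counter

def benor_round1(values_r1):
    """values_r1: the n-f values x received in round 1 of this phase."""
    c = Counter(values_r1)
    v, cnt = c.most_common(1)[0]
    return v if cnt == len(values_r1) else None      # y := v if unanimous, else ⊥ (None)

def benor_round2(values_r2, n, f):
    """values_r2: the n-f values y (0, 1 or None) received in round 2.
    Returns (new_x, decided_value_or_None)."""
    c = Counter(v for v in values_r2 if v is not None)
    if c:
        v, cnt = c.most_common(1)[0]
        if cnt == len(values_r2):        # all n-f carry the same non-⊥ value: decide v
            return v, v
        if cnt >= n - 2 * f:             # enough support: adopt v but do not decide
            return v, None
    return random.choice([0, 1]), None   # otherwise flip a coin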

BenOr algorithm

Validity: Suppose all processes start with v. All messages sent in round 1 of the first phase are (1, "r1", v), and all messages sent in round 2 are of the form (1, "r2", v). Hence any process that completes phase 1 must decide v.

Agreement: Suppose pi decides v at stage s and no process decides at any smaller-numbered stage. It must be that pi receives n − f messages (s, "r2", v). This implies that any other process pj that completes stage s receives at least n − 2f messages (s, "r2", v), because there can be at most f failures.
Since n > 3f, pj cannot decide on a value different from v and will set x := v. Hence all processes that start stage s+1 start with x = v. By the same argument used for validity, it follows that all (non-faulty) processes decide v by the end of stage s+1.

BenOr algorithm

Termination: for any s ≥ 0, all non-faulty processes decide within s+1 stages, with probability at least 1 − (1 − 1/2^n)^s.
Case s = 0: trivial.
Consider any s ≥ 1. We claim that with probability at least 1/2^n, all non-faulty processes choose the same value of x at the end of stage s.
Consider any shortest finite execution α in which some non-faulty pi has received n − f (s, "r1", *) messages (α ends with the delivery of one such message). If at least f+1 of these messages contain a value v, call v "good". There can be either 1 or 2 good values.
CASE: two good values.
If there are two good values, then every non-faulty process receives both a 0 and a 1 among its "r1" messages, and thus every "r2" message in any extension of α must contain ⊥. The value of x for every process then depends only on its random choice. With probability at least 1/2^n, all n processes make the same random choice.

BenOr algorithm

CASE: one good value, v.
If there is only one good value, then every "r2" message in any extension of α must contain either v or ⊥. In this case there are some processes, say m of them, that deterministically choose x = v. The other n − m processes choose x at random. The probability that the n − m processes all choose x = v is 1/2^(n−m) ≥ 1/2^n.
In either case, with probability at least 1/2^n all processes choose the same x. Hence with probability at least 1/2^n all processes decide in the next phase (see the validity argument). Thus with probability at most 1 − 1/2^n no decision is taken in this phase.
The argument is independent from phase to phase. Therefore with probability at least 1 − (1 − 1/2^n)^s, all non-faulty processes decide by phase s+1. ☐

Other algorithms and relaxations

• Other algorithms by Rabin and by Feldman achieve much better (in fact, constant) time complexity … they use secret sharing.
• Other relaxations of the problem
  – approximate agreement
  – k-set consensus

Approximate agreement

• If values are real numbers, we can define the following consensus problem
  – Agreement: all decision values are within ε of each other
  – Validity: any decision value is between the smallest and the biggest initial values
• A simple algorithm works for this relaxed version
  – requires n > 3f
  – Exercise: try to design the algorithm
    • Hint: broadcast values, wait for n − f. Compute a mean value. Repeat to get values closer. (A sketch of one round appears below.)
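
A minimal Python sketch (not from the slides) of one round along the lines of the hint. The trimming of the f smallest and f largest values before averaging is an assumption borrowed from standard approximate-agreement algorithms, not something the slide states.

def approx_agreement_round(received, f):
    """One round of the hinted scheme: average the collected values.

    received -- the n-f real values gathered this round
    f        -- maximum number of faulty processes
    """
    vals = sorted(received)
    trimmed = vals[f:len(vals) - f] or vals     # drop extremes when possible (assumption)
    return sum(trimmed) / len(trimmed)

print(approx_agreement_round([0.0, 0.1, 0.9, 1.0, 100.0], f=1))   # 0.666...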

k-consensus

• Agreement: there are at most k different decisions.
• There is a simple algorithm that solves the problem for f < k.
  – Exercise: design such an algorithm
• The problem cannot be solved if f ≥ k.

Asynchronous algorithms

CAUSALITY AND LOGICAL TIME

Logical time

• Knowing the relative order in which events take place can be useful
• If we could tag each event with the (real) time of its occurrence, then we could order them
  – this is not possible in asynchronous systems
• Processes might have local clocks
  – keeping clocks synchronized is not easy

• Causality
  – events whose executions depend on other (preceding) events
• Logical time
  – a partial ordering of events
  – an order that is consistent with the dependencies among the components
  – looks like a real-time ordering
• Useful when the relative order of events at different locations is not important
  – what is important are the dependencies (causality)

Logical time

[Figure: a space-time diagram of three processes p1, p2, p3 exchanging messages.]

Logical time

• Let α be an execution of an asynchronous network. A logical-time assignment for α is defined to be an assignment of values from a set T such that
  – no two events get the same logical time
  – the logical times of the events at each process are strictly increasing, according to their order of occurrence in α
  – the logical time of any send event is strictly smaller than that of the corresponding receive
  – for any particular t ∈ T, there are only finitely many events with a smaller logical time

Logical time

[Figure: the space-time diagram of p1, p2, p3 with each event labeled by a logical time from 1 to 23.]

[Figure: the same diagram, used to illustrate reordering the events according to their logical times.]

Logical time

Theorem: Let α be an execution (with reliable channels) and let ltime be a logical time assignment for α. Then there is an execution α' such that
  – α' contains the same events as α
  – the events in α' occur in the order of their ltime
  – α' is indistinguishable from α to every automaton

Logical time

• How can we compute logical times?
• We can transform any algorithm A into a new algorithm L(A) that generates logical times
• The LamportTime transformation
  – Lamport's famous paper "Time, Clocks, and the Ordering of Events in a Distributed System"
  – T = {(x, i) : x ∈ Nat, i ∈ UID}
  – process indices are used as tie-breakers
• Order of t1 = (x1, i) and t2 = (x2, j):
  – t1 < t2 if x1 < x2 or (x1 = x2 and i < j)

Logical time — the LamportTime(A)i automaton

Signature: as Ai, except that send(m)i,j and receive(m)j,i are replaced by send(m,c)i,j and receive(m,c)j,i, with c ∈ Nat

States: as Ai, plus
  clock ∈ Nat, initially 0

Transitions: as Ai, with the following modifications
  send(m,c)i,j
    Precondition: as in Ai, plus c = clock + 1
    Effect: as in Ai, plus clock := c
  receive(m,c)j,i
    Effect: as in Ai, plus clock := max(clock, c) + 1
  for all other actions, add "clock := clock + 1" to the code of the Effect

Tasks: as for Ai (with the replacements)
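
A minimal Python sketch (not from the slides) of the Lamport clock rules that this transformation attaches to an algorithm; the class name and the usage lines are illustrative assumptions.

class LamportClock:
    """Sketch of the clock rules used by the LamportTime transformation."""

    def __init__(self, pid):
        self.pid = pid
        self.clock = 0

    def local_event(self):
        self.clock += 1
        return (self.clock, self.pid)          # logical time, tie-broken by process index

    def send(self):
        self.clock += 1                        # timestamp c = clock + 1 travels with the message
        return self.clock

    def receive(self, c):
        self.clock = max(self.clock, c) + 1    # receive is later than both local past and the send
        return (self.clock, self.pid)

# usage: a send by p1 and the matching receive by p2
p1, p2 = LamportClock(1), LamportClock(2)
c = p1.send()
print(p1.clock, p2.receive(c))   # 1 (2, 2) -- the receive gets a strictly larger logical time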

Logical time

[Figure: the space-time diagram of p1, p2, p3 with each event labeled by its Lamport time (counter, process index), e.g. (1,1), (2,2), (3,2), (4,3), (5,3), (6,1), (7,1).]

[Figure: a smaller space-time diagram of p1, p2, p3 with events numbered 1 through 10, used in the money-counting example that follows.]

Logical time

• An example of application
  – CountMoney
• Processes are bank locations
  – there are no external deposits/withdrawals
  – only money transfers (messages) between processes
• Given a (logical) time t
  – processes need to decide on a local balance, in such a way that the total is the correct amount of money in the system

• Processes start with some initial sum
  – they make transfers
  – they use logical time
• We want to know the balance at (logical) time t
• Money at process i
  – starting balance
    + all amounts received before (logical) time t
    − all amounts sent before (logical) time t
• Money in channel (i,j)
  – money sent before (or at) time t and delivered after time t

Logical time

[Figure: the numbered diagram of p1, p2, p3 with initial balances 100, 300, 200 and transfer amounts 50, 1, 3, 20, 40 attached to the messages. Question: how much money is there at logical time t = 7.5?]

[Figure: the same diagram with the cut at logical time t = 7.5 drawn; the local balances along the cut are 140, 180 and 260, and the money in channels crossing the cut makes up the rest of the total.]

Logical time

• Global snapshot
  – an instantaneous view of the system, at a given time
• The approach used before is actually general enough
  – instead of just counting money we can look at
    • the entire state of each process
    • the content of the channels
• Useful for
  – backup versions of the system (in case of failures)
  – determining properties of the algorithm

Logical time

• If event e1 causally precedes an event e2, then ltime(e1) < ltime(e2)
  – the converse is not necessarily true
• (Logical) vector clocks
  – each process maintains a vector of local times
  – the jth component corresponds to pj
  – pi increments the ith component for every local event
  – upon reception of a message from pj
    • pi updates the jth component (according to the timestamp included in the message)

Logical time

[Figure: the space-time diagram of p1, p2, p3 with each event labeled by its vector clock, e.g. (1,0,0), (1,1,0), (1,2,0), (0,0,1), (0,2,2), (0,2,3), (2,0,3), (3,3,3), (1,3,1), (1,4,1).]

Logical time

• Comparing vector clocks
  – VC1 ≤ VC2
    • if VC1[i] ≤ VC2[i] for all i
  – VC1 < VC2
    • if VC1[i] ≤ VC2[i] for all i, and VC1 ≠ VC2
  – incomparable
    • if neither VC1 ≤ VC2 nor VC2 ≤ VC1
• Exercise: prove that VC(e1) < VC(e2) if and only if e1 causally precedes event e2. (A small sketch of the clock rules follows below.)
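
A minimal Python sketch (not from the slides) of vector-clock maintenance and comparison. The component-wise max on receipt is the standard vector-clock rule; it is slightly more explicit than the slide's "update the jth component", and the class and usage names are illustrative assumptions.

class VectorClock:
    """Sketch of vector clock maintenance for n processes."""

    def __init__(self, pid, n):
        self.pid = pid
        self.vc = [0] * n

    def local_event(self):
        self.vc[self.pid] += 1                 # pi increments its own component
        return tuple(self.vc)

    def send(self):
        self.vc[self.pid] += 1
        return tuple(self.vc)                  # the timestamp travels with the message

    def receive(self, ts):
        # component-wise max with the received timestamp, then count the receive event
        self.vc = [max(a, b) for a, b in zip(self.vc, ts)]
        self.vc[self.pid] += 1
        return tuple(self.vc)

def leq(vc1, vc2):
    return all(a <= b for a, b in zip(vc1, vc2))

def lt(vc1, vc2):
    return leq(vc1, vc2) and vc1 != vc2        # strictly less: ≤ and different

# usage: a message from p0 to p1
p0, p1 = VectorClock(0, 3), VectorClock(1, 3)
t_send = p0.send()
t_recv = p1.receive(t_send)
print(lt(t_send, t_recv))    # True -- the send causally precedes the receive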

Asynchronous systems

FAILURE DETECTORS

Asynchronous system

• Processes can take arbitrarily long times to execute actions
• How can we tell a very slow process from a faulty (stopped) one?
  – p is waiting for a message from q
    • p stops waiting and makes a decision assuming q is faulty
      – if q is only very slow this is (or might be) a problem
    • p keeps waiting
      – if q is faulty, p will wait forever
• This is one of the major problems in dealing with fault-tolerant algorithms in asynchronous systems

Failure detectors

• "To stop waiting or not to stop waiting, that's the question"
  – quote from Raynal's paper
• Failure detectors
  – introduced by Chandra and Toueg [91]
• Oracles
  – that can "solve" the question
  – the system is augmented with a "failure detector" black box that gives "hints" about failures

Failure detectors

• Why study them?
• Assume you can use a certain failure detector F to solve a problem P
  – if someone tells you how to actually implement F, then you have solved P
• "Weakest" failure detector Fmin to solve P
  – necessary and sufficient conditions to solve P
  – important for both theory and practice

Failure detectors

• Can give a better understanding of the problems
• Can classify problems depending on the weakest failure detector needed to solve them
  – if Fmin(P1) ⇒ Fmin(P2) then P2 is "easier" than P1
    • here A ⇒ B means that A is more powerful than B
• Some failure detectors can be implemented
  – depending on the system characteristics
    • for example, it is easy to implement (stop-)failure detectors in a synchronous system

Failure detectors

• Can be modeled as an automaton
  – with output actions inform-stopped(j)i

[Figure: processes p1, …, p9 connected to a FAILURE DETECTOR component; besides the usual send(m)5,3 / receive(m)5,3 actions, the detector issues outputs such as inform-stopped(4)1, inform-stopped(4)3 and inform-stopped(6)8.]

Failure detectors

• Let's start with the simplest one: the perfect failure detector
• A perfect failure detector
  – reports only failures that actually occurred
  – eventually reports all failures
• We are not yet asking whether we can actually build such a failure detector
• But we can use it to solve consensus

Perfect FD consensus

• Informally
  – each process pi attempts to stabilize on
    • a vector val[1..n]i of values in V ∪ {⊥}; val[j]i = v means that pi has learned that pj's initial value is v
    • a set stoppedi of process indices; j ∈ stoppedi means that pi has learned that pj has stopped
  – each process broadcasts (val, stopped)
    • and updates its knowledge of (val, stopped)
    • it ignores messages from stopped processes
  – it keeps track of messages that "ratify" its knowledge
    • that is, messages that are the same as its current (val, stopped)
  – when its data has stabilized (ratification from all non-stopped processes), pi decides on the non-null value with smallest index in its vali
  – (a sketch of this loop appears after this list)
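
A minimal Python sketch (not from the slides) of the per-process state update of this scheme; the broadcast layer is not shown, and the class, the decision rule "first non-null value", and the message format are illustrative assumptions matching the automaton below.

class FDConsensusProcess:
    """Sketch of the perfect-FD consensus loop at pi."""

    def __init__(self, i, n, v):
        self.i, self.n = i, n
        self.val = [None] * n
        self.val[i] = v                     # pi's own initial value
        self.stopped = set()                # indices reported by the perfect failure detector
        self.ratified = {i}
        self.decision = None

    def on_inform_stopped(self, j):
        self.stopped.add(j)

    def on_receive(self, j, w, s):
        if j in self.stopped:
            return                          # ignore messages from stopped processes
        if (w, s) == (self.val, self.stopped):
            self.ratified.add(j)            # j ratifies pi's current data
        else:                               # merge new information, restart ratification
            self.stopped |= s
            for k, v in enumerate(w):
                if self.val[k] is None:
                    self.val[k] = v
            self.ratified = {self.i}

    def maybe_decide(self):
        if self.decision is None and self.ratified | self.stopped == set(range(self.n)):
            # decide on the non-null value with smallest index (the rule f of the slides)
            self.decision = next(v for v in self.val if v is not None)
        return self.decision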

Perfect FD consensus — the PerfectFDconsensus(i) automaton

Signature:
  Input:  init(v)i, v ∈ V
          receive(w,S)j,i, w ∈ W, S ⊆ {1,2,…,n}
          inform-stopped(j)i, j ≠ i
  Output: bcast(w,S)i, w ∈ W, S ⊆ {1,2,…,n}
          decide(v)i, v ∈ V

States:
  val ∈ W (vectors indexed by {1,…,n} over V ∪ {⊥}), all components initially ⊥
  stopped ⊆ {1,2,…,n}, initially {}
  ratified ⊆ {1,2,…,n}, initially {}
  decided, a Boolean, initially false

Transitions:
  init(v)i
    Effect: val[i] := v; ratified := {i}
  bcast(w,S)i
    Precondition: w = val, S = stopped
    Effect: none
  inform-stopped(j)i
    Effect: stopped := stopped ∪ {j}
  receive(w,S)j,i
    Effect: if j ∉ stopped then
              if (w,S) = (val, stopped) then ratified := ratified ∪ {j}
              else
                stopped := stopped ∪ S
                for all k = 1,…,n: if val[k] = ⊥ then val[k] := w[k]
                ratified := {i}
  decide(v)i
    Precondition: ratified ∪ stopped = {1,2,…,n}, decided = false, v = f(val)
    Effect: decided := true

Failure Detectors

• Does a perfect failure detector exist (in the asynchronous setting)?
  – No, otherwise we could solve consensus
• A number of failure detectors have been studied
  – the failure detector W
    • there is a time after which every process that stops is always suspected by some correct process
    • there is a time after which some correct process is never suspected by any correct process
    • sufficient to solve consensus
      – requires n > 2f
      – it is the "weakest" failure detector for solving consensus for n > 2f

Asynchronous algorithms

SYNCHRONIZERS

Synchronizers

• Transforming algorithms designed for the synchronous setting
  – to make them work in the asynchronous setting
• Simulate rounds
  – attach a round number to each message
• Messages are sent to the synchronizer
  – the synchronizer waits for all messages for round r
  – then delivers them
  – once a process gets all its messages for round r, it proceeds to round r+1

Synchronizers

• Tagged messages
  – pairs (m, i) where m ∈ M and 1 ≤ i ≤ n
    • i is the recipient of message m
• User automaton Ui
  – has output actions user-send(T,r)i
    • T is a set of tagged messages (or ⊥, if there are no messages)
    • r ∈ Nat+ is a round number
  – has input actions user-receive(T,r)i
• Send/receive actions are supposed to be well-formed
  – that is, each Ui executes them in the expected round order

Synchronizers

• Synchronizer
  – for each round r
    • waits for the (asynchronous) messages for round r
    • delivers all messages for round r
  – Synchronization
    • given by the fact that delivery of messages happens after all processes have sent their messages (just like in the synchronous case)

[Figure: a Global Synchronizer component connected to user automata U1, …, Ui, …, Un through user-send(i) / user-receive(i) actions.]

Synchronizers — the Synchronizer automaton

Signature:
  Input:  user-send(T,r)i, T a set of tagged messages, r ∈ Nat+, 1 ≤ i ≤ n
  Output: user-receive(T,r)i, T a set of tagged messages, r ∈ Nat+, 1 ≤ i ≤ n

States:
  tray, an array indexed by {1,…,n} × Nat+ of sets of tagged messages, initially all {}
  user-sent, an array indexed by {1,…,n} × Nat+ of Booleans, initially all false
  user-rcvd, an array indexed by {1,…,n} × Nat+ of Booleans, initially all false

Transitions:
  user-send(T,r)i
    Effect: user-sent(i,r) := true
            for all j ≠ i: tray(j,r) := tray(j,r) ∪ {(m,i) | (m,j) ∈ T}
  user-receive(T,r)i
    Precondition: for all j ∈ nbrs ∪ {i}: user-sent(j,r) = true
                  user-rcvd(i,r) = false
                  T = tray(i,r)
    Effect: user-rcvd(i,r) := true

Tasks:
  for every i and r: {user-receive(T,r)i : T a set of tagged messages}
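
A minimal Python sketch (not from the slides) of the global synchronizer's hold-until-everyone-has-sent rule; the class and the short usage example are illustrative assumptions.

class GlobalSynchronizer:
    """Sketch of the Synchronizer: hold round-r messages until everyone has sent round r."""

    def __init__(self, n):
        self.n = n
        self.tray = {}        # (recipient, round) -> list of (message, sender)
        self.sent = {}        # (sender, round) -> True once user-send(T, r) happened

    def user_send(self, i, tagged_msgs, r):
        self.sent[(i, r)] = True
        for m, j in tagged_msgs:                       # (message, recipient) pairs
            self.tray.setdefault((j, r), []).append((m, i))

    def deliverable(self, i, r):
        # user-receive(T, r)i is enabled only after every process has sent for round r
        return all(self.sent.get((j, r), False) for j in range(self.n))

    def user_receive(self, i, r):
        assert self.deliverable(i, r)
        return self.tray.pop((i, r), [])

# usage
sync = GlobalSynchronizer(2)
sync.user_send(0, [("hello", 1)], r=1)
print(sync.deliverable(1, 1))       # False -- process 1 has not sent its round-1 messages yet
sync.user_send(1, [], r=1)
print(sync.user_receive(1, 1))      # [('hello', 0)]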

Synchronizers

• Every synchronous algorithm can be restated using the Synchronizer
  – automaton Ui models the algorithm at pi, with slight modifications for the send/receive actions
• Question: can we really use all the synchronous algorithms that we have seen?
  – the transformation works only for the failure-free case
  – in the asynchronous setting many problems cannot be solved
  – e.g., Consensus

Synchronizers

• The book describes other synchronizers
  – work by Awerbuch
• They achieve better communication complexity
• There are some lower bounds on time complexity

Partially Synchronous Systems

PARTIAL SYNCHRONY

Partial synchrony

• Synchronous system
  – actions taken in lock-step
• Asynchronous system
  – unbounded delays
  – time bounds used only for time analysis
• Partially synchronous system
  – processes have a weak form of synchrony
    • access to almost-synchronized clocks
    • known time bounds on process step time or message delivery time

Partial synchrony

• Partially synchronous systems are probably more realistic
  – real systems do have clocks
    • although it is very difficult to keep clocks synchronized
  – in real systems we can usually make assumptions about the time needed to perform an action
• The theory of partially synchronous systems
  – is not nearly as well developed as that of synchronous and asynchronous systems

Partial synchrony

• Messages are delivered within time d
• Actions are taken within time l
  – actually we should assume time bounds [l1, l2]
    • l1 is a lower bound
    • l2 is an upper bound
    • relative speed L = l2/l1
• Processes know l and d

Partial synchrony

• In a partially synchronous system we can implement a perfect failure detector
• Perfect failure detector
  – each process pi continually sends messages to all other processes pj
  – if pi does not hear from pj for some time, namely at least d + l, then it records that pj has stopped
    • and performs an inform-stopped(j)i action
  – (a sketch of this timeout rule appears after this list)
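
A minimal Python sketch (not from the slides) of the timeout rule, using wall-clock time as a stand-in for the model's bounds d and l; the class and the heartbeat convention are illustrative assumptions.

import time

class PerfectFailureDetector:
    """Sketch of the timeout rule: suspect pj after silence longer than d + l."""

    def __init__(self, peers, d, l):
        self.timeout = d + l                       # known bounds of the partially synchronous model
        self.last_heard = {j: time.monotonic() for j in peers}
        self.stopped = set()

    def on_message(self, j):
        self.last_heard[j] = time.monotonic()      # any message from pj counts as a heartbeat

    def check(self):
        """Return newly suspected processes (the inform-stopped(j)i outputs)."""
        now = time.monotonic()
        new = [j for j, t in self.last_heard.items()
               if j not in self.stopped and now - t > self.timeout]
        self.stopped.update(new)
        return new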

Partial synchrony

Theorem: The previous algorithm implements a perfect failure detector.
Proof: If a process pj stops, then it will not send any more messages. In-transit messages will be delivered within time d. No more messages from pj will ever be received by pi, and by the code pi will eventually execute inform-stopped(j)i.
Moreover, inform-stopped(j)i is performed only for actual failures (of pj). Suppose inform-stopped(j)i is executed. Then pi has waited at least d + l time without receiving a message from pj. But this is impossible unless pj has stopped. ☐

• With a perfect failure detector we can solve consensus
  – as discussed previously

Partial synchrony

• A general transformation for the synchronous consensus algorithms seen in Part I
• Let A be any synchronous algorithm for a complete graph
• Recall
  – inputs appear in the initial state
  – the decision (output) is written in a write-once variable
  – most algorithms tolerate f failures and require exactly f+1 rounds

Partial synchrony

• B = simulated(A)
  – Each pi is the composition of two automata: Qi, which is the perfect failure detector at node pi, and Ri, which is the main automaton. Ri has the inform-stopped actions as inputs.
  – Ri simulates A by maintaining a variable with the simulated state of A.
  – For each round r, Ri first sends all its round-r messages (using the simulated state of A and the msgsi function of A). Then Ri waits, for each j ≠ i, for either a round-r message or an inform-stopped(j)i action.
  – Then it determines the new simulated state of A using the transi function of A (with the ⊥ message as the message from stopped processes).
  – Then it goes to round r+1.

Partial synchrony

Theorem: B solves the agreement problem in the partially synchronous model and guarantees f-failure termination.

Theorem (lower bound): In the partially synchronous model we cannot solve the problem in less than (f+1)d time.

• Exercise: sketch a proof which obtains a contradiction from the corresponding lower bound in the synchronous model.

Partial synchrony

• "Consensus in the presence of partial synchrony"
  – Dwork, Lynch, Stockmeyer
• Several models, depending on
  – whether d and L are known
  – whether d and L hold only eventually
• Solvability of consensus in the various models and with various types of faults

Partial synchrony

• Partially synchronous channels, synchronous processors
  – L = 1
  – the bound d on message delivery
    • holds but is not known to the processes, or
    • is known but holds only eventually
• Partially synchronous processors, synchronous channels
  – d holds and is known a priori
  – the bound L on relative speed
    • holds but is not known to the processes, or
    • is known but holds only eventually
• Both processors and channels are partially synchronous

Partial synchrony

• Summary of results
  – smallest number of processors needed to have an f-resilient consensus algorithm

Fault model        SYNCH   ASYNCH   PART SYNCH   PART SYNCH   PART SYNCH channels
                                    channels     processors   AND processors
Fail-stop          f       ∞        2f+1         f            2f+1
Omission           f       ∞        2f+1         [2f, 2f+1]   2f+1
Auth. Byzantine    f       ∞        3f+1         2f+1         3f+1
Byzantine          3f+1    ∞        3f+1         3f+1         3f+1