
Distributed Computing Group

The Consensus Problem

Roger Wattenhofer — a lot of kudos to Maurice Herlihy and Costas Busch for some of their slides


Sequential Computation

[Figure: a single thread applying operations to objects in memory]

Concurrent Computation

[Figure: multiple threads concurrently applying operations to objects in memory]

Asynchrony

  • Sudden unpredictable delays

– Cache misses (short)
– Page faults (long)
– Scheduling quantum used up (really long)


Model Summary

  • Multiple threads

– Sometimes called processes

  • Single shared memory
  • Objects live in memory
  • Unpredictable asynchronous delays


Road Map

  • We are going to focus on principles

– Start with idealized models
– Look at a simplistic problem
– Emphasize correctness over pragmatism
– “Correctness may be theoretical, but incorrectness has practical impact”


You may ask yourself …

I’m no theory weenie - why all the theorems and proofs?


Fundamentalism

  • Distributed & concurrent systems are hard

– Failures
– Concurrency

  • Easier to go from theory to practice than vice-versa


The Two Generals

Red army wins if both sides attack together


Communications

Red armies send messengers across valley


Communications

Messengers don’t always make it


Your Mission: Design a protocol to ensure that the red armies attack simultaneously


Theorem: There is no non-trivial protocol that ensures the red armies attack simultaneously


Proof Strategy

  • Assume a protocol exists
  • Reason about its properties
  • Derive a contradiction


Proof

1. Consider the protocol that sends the fewest messages
2. It still works if the last message is lost
3. So just don’t send it

– Messengers’ union happy

4. But now we have a shorter protocol!
5. Contradicting #1


Fundamental Limitation

  • Need an unbounded number of messages
  • Or possible that no attack takes place


You May Find Yourself …

I want a real-time YAFA compliant Two Generals protocol using UDP datagrams running on our enterprise-level fiber tachyon network ...


You might say

I want a real-time YAFA compliant Two Generals protocol using UDP datagrams running on our enterprise-level fiber tachyon network ... — Yes, Ma’am, right away!


You might say

I want a real-time dot-net compliant Two Generals protocol using UDP datagrams running on our enterprise-level fiber tachyon network ... — Yes, Ma’am, right away!

Advantage:

  • Buys time to find another job
  • No one expects software to work anyway


You might say

I want a real-time dot-net compliant Two Generals protocol using UDP datagrams running on our enterprise-level fiber tachyon network ... — Yes, Ma’am, right away!

Advantage:

  • Buys time to find another job
  • No one expects software to work anyway

Disadvantage:

  • You’re doomed
  • Without this course, you may not even know you’re doomed


You might say

I want a real-time YAFA compliant Two Generals protocol using UDP datagrams running on our enterprise-level fiber tachyon network ... — I can’t find a fault-tolerant algorithm, I guess I’m just a pathetic loser.


You might say

I want a real-time YAFA compliant Two Generals protocol using UDP datagrams running on our enterprise-level fiber tachyon network ... — I can’t find a fault-tolerant algorithm, I guess I’m just a pathetic loser.

Advantage:

  • No need to take course


You might say

I want a real-time YAFA compliant Two Generals protocol using UDP datagrams running on our enterprise-level fiber tachyon network ... — I can’t find a fault-tolerant algorithm, I guess I’m just a pathetic loser.

Advantage:

  • No need to take course

Disadvantage:

  • Boss fires you, hires University St. Gallen graduate


You might say

I want a real-time YAFA compliant Two Generals protocol using UDP datagrams running on our enterprise-level fiber tachyon network ... — Using skills honed in course, I can avert certain disaster!

  • Rethink problem spec, or
  • Weaken requirements, or
  • Build on different platform

Consensus: Each Thread has a Private Input

[Figure: three threads with private inputs 32, 19, 21]


They Communicate


They Agree on Some Thread’s Input

[Figure: all three threads decide on 19]


Consensus is important

  • With consensus, you can implement anything you can imagine…
  • Examples: with consensus you can decide on a leader, implement mutual exclusion, or solve the two generals problem


You gonna learn

  • In some models, consensus is possible
  • In some other models, it is not
  • Goal of this and next lecture: to learn whether for a given model consensus is possible or not … and prove it!


Consensus #1 shared memory

  • n processors, with n > 1
  • Processors can atomically read or write (not both) a shared memory cell


Protocol (Algorithm?)

  • There is a designated memory cell c.
  • Initially c is in a special state “?”
  • Processor 1 writes its value v1 into c, then decides on v1.
  • A processor j (j ≠ 1) reads c until j reads something other than “?”, and then decides on that. (A sketch follows below.)
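A minimal Java sketch of this protocol (not from the slides — the cell c, the null encoding of “?”, and the processor ids are illustrative assumptions):

    import java.util.concurrent.atomic.AtomicReference;

    // Sketch of Consensus #1: processor 1 writes, everybody else spins.
    class Consensus1 {
        // the designated cell c; null plays the role of the special state "?"
        private final AtomicReference<Integer> c = new AtomicReference<>(null);

        int decide(int id, int v) {
            if (id == 1) {         // processor 1 writes v1 into c, then decides v1
                c.set(v);
                return v;
            }
            Integer r;
            while ((r = c.get()) == null) {
                Thread.onSpinWait();  // read c until it holds something other than "?"
            }
            return r;              // decide on the value read
        }
    }

The spin loop is exactly the weakness the next slides exploit: if processor 1 is delayed, everybody else waits forever.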


Unexpected Delay

[Figure: the writing thread is swapped out; the other threads wait on “?” (“???”) until it is back]


Heterogeneous Architectures

[Figure: two fast Pentium threads wait (“???”) for a slow 286 thread (“yawn”)]


Fault-Tolerance

[Figure: the other threads wait (“???”) for a thread that has crashed]


Consensus #2 wait-free shared memory

  • n processors, with n > 1
  • Processors can atomically read or write (not both) a shared memory cell
  • Processors might crash (halt)
  • Wait-free implementation… huh?


Wait-Free Implementation

  • Every process (method call) completes in a finite number of steps
  • Implies no mutual exclusion
  • We assume that we have wait-free atomic registers (that is, reads and writes to the same register do not overlap)

A wait-free algorithm…

  • There is a cell c, initially c = “?”
  • Every processor i does the following:

r = Read(c);
if (r == “?”) then
    Write(c, vi); decide vi;
else
    decide r;


Is the algorithm correct?

[Figure: c is initially “?”; processors with inputs 32 and 17 both read “?” before either writes — one decides 32, the other 17. The algorithm is not correct.]
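The failure can be reproduced in a few lines of Java (a sketch, not from the slides; the volatile field stands in for the shared cell):

    // Sketch: Read and Write are separate atomic steps, so both threads
    // can pass the "?" test before either one writes.
    class BrokenWaitFreeConsensus {
        private volatile Integer c = null;   // null encodes "?"

        int decide(int v) {
            Integer r = c;                   // r = Read(c)
            if (r == null) {                 // both threads may see "?" here
                c = v;                       // Write(c, vi) — not atomic with the read
                return v;                    // decide vi
            }
            return r;                        // decide r
        }
    }
    // Interleaving Read(32), Read(17), Write(32), Write(17): one thread
    // decides 32, the other 17 — agreement is violated.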


Theorem: No wait-free consensus



Proof Strategy

  • Make it simple

– n = 2, binary input

  • Assume that there is a protocol
  • Reason about the properties of any such protocol
  • Derive a contradiction
  • Derive a contradiction

Wait-Free Computation

  • Either A or B “moves”
  • Moving means

– Register read
– Register write

[Figure: from each state, either “A moves” or “B moves”]

The Two-Move Tree

[Figure: the tree of all executions, from the initial state at the root down to the final states at the leaves]


Decision Values

[Figure: each final state is labeled with its decision value, 0 or 1]


Bivalent: Both Possible

[Figure: a state from which both 0 and 1 are still possible decision values is bivalent]


Univalent: Single Value Possible

[Figure: a state from which only a single decision value is possible is univalent]


1-valent: Only 1 Possible

[Figure: a state from which only decision value 1 is possible is 1-valent]


0-valent: Only 0 possible

[Figure: a state from which only decision value 0 is possible is 0-valent]


Summary

  • Wait-free computation is a tree
  • Bivalent system states

– Outcome not fixed

  • Univalent states

– Outcome is fixed
– May not be “known” yet
– 1-valent and 0-valent states


Claim

Some initial system state is bivalent (The outcome is not always fixed from the start.)


A 0-Valent Initial State

  • All executions lead to decision of 0


A 0-Valent Initial State

  • Solo execution by A also decides 0


A 1-Valent Initial State

  • All executions lead to decision of 1



A 1-Valent Initial State

  • Solo execution by B also decides 1



A Univalent Initial State?

  • Can all executions lead to the same decision?


State is Bivalent

  • Solo execution by A must decide 0
  • Solo execution by B must decide 1


Critical States

[Figure: a bivalent state whose next states are 0-valent (one move) and 1-valent (the other move) is critical]


Critical States

  • Starting from a bivalent initial state
  • The protocol can reach a critical state

– Otherwise we could stay bivalent forever
– And the protocol is not wait-free


From a Critical State

[Figure: critical state c — if A goes first, the protocol decides 0 (0-valent); if B goes first, it decides 1 (1-valent)]


Model Dependency

  • So far, memory-independent!
  • True for

– Registers
– Message-passing
– Carrier pigeons
– Any kind of asynchronous computation


What are the Threads Doing?

  • Reads and/or writes
  • To same/different registers

Possible Interactions

            y.write()  x.write()  y.read()  x.read()
y.write()       ?          ?          ?         ?
x.write()       ?          ?          ?         ?
y.read()        ?          ?          ?         ?
x.read()        ?          ?          ?         ?


Reading Registers

[Figure: from critical state c, A can run solo and decide 0; or B first reads x and then A runs solo and decides 1. The two states look the same to A — contradiction.]


Possible Interactions

            y.write()  x.write()  y.read()  x.read()
y.write()       ?          ?         no        no
x.write()       ?          ?         no        no
y.read()       no         no         no        no
x.read()       no         no         no        no


Writing Distinct Registers

[Figure: from critical state c, A writes y then B writes x, or B writes x then A writes y — “the song remains the same”: both orders yield the same state, contradicting that one branch is 0-valent and the other 1-valent]


Possible Interactions

            y.write()  x.write()  y.read()  x.read()
y.write()       ?         no         no        no
x.write()      no          ?         no        no
y.read()       no         no         no        no
x.read()       no         no         no        no


Writing Same Registers

[Figure: from critical state c — if A writes x and runs solo, it decides 0; if B writes x first and A then overwrites x and runs solo, it decides 1. Both states look the same to A: contradiction.]


That’s All, Folks!

            y.write()  x.write()  y.read()  x.read()
y.write()      no         no         no        no
x.write()      no         no         no        no
y.read()       no         no         no        no
x.read()       no         no         no        no


Theorem

  • It is impossible to solve consensus using read/write atomic registers

– Assume protocol exists
– It has a bivalent initial state
– Must be able to reach a critical state
– Case analysis of interactions

  • Reads vs others
  • Writes vs writes

What Does Consensus have to do with Distributed Systems?


We want to build a Concurrent FIFO Queue


With Multiple Dequeuers!


A Consensus Protocol

[Figure: a 2-element array plus a FIFO queue holding two balls — the coveted red ball and the dreaded black ball]


Protocol: Write Value to Array

[Figure: each thread first writes its value into its array slot]


Protocol: Take Next Item from Queue

[Figure: each thread then dequeues a ball from the queue]


[Figure: the dequeuing threads — “I got the coveted red ball, so I will decide my value”; “I got the dreaded black ball, so I will decide the other’s value from the array”]


Why does this Work?

  • If one thread gets the red ball
  • Then the other gets the black ball
  • Winner can take her own value
  • Loser can find winner’s value in array

– Because threads write array before dequeuing from queue (see the sketch below)
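Putting the three protocol slides together in Java (a sketch under the slides’ assumptions: the queue stands for the given wait-free two-dequeuer FIFO queue — ArrayDeque here is only a naive stand-in — and thread indices are 0 and 1):

    import java.util.ArrayDeque;
    import java.util.Queue;

    class QueueConsensus {
        enum Ball { RED, BLACK }
        private final Queue<Ball> queue = new ArrayDeque<>(); // stand-in for the
        private final int[] announce = new int[2];            // two-dequeuer queue

        QueueConsensus() {
            queue.add(Ball.RED);      // coveted red ball is dequeued first
            queue.add(Ball.BLACK);    // dreaded black ball second
        }

        int decide(int i, int v) {
            announce[i] = v;          // write value to array before dequeuing
            if (queue.remove() == Ball.RED)
                return announce[i];       // winner decides her own value
            else
                return announce[1 - i];   // loser finds the winner’s value
        }
    }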


Implication

  • We can solve 2-thread consensus using only

– A two-dequeuer queue
– Atomic registers


Implications

  • Assume there exists

– A queue implementation from atomic registers

  • Given

– A consensus protocol from queue and registers

  • Substitution yields

– A wait-free consensus protocol from atomic registers

Contradiction!


Corollary

  • It is impossible to implement a two-dequeuer wait-free FIFO queue with read/write shared memory.
  • This was a proof by reduction; important beyond NP-completeness…


Consensus #3 read-modify-write shared mem.

  • n processors, with n > 1
  • Wait-free implementation
  • Processors can atomically read and write a shared memory cell in one atomic step: the value written can depend on the value read
  • We call this a RMW register

Protocol

  • There is a cell c, initially c = “?”
  • Every processor i does the following RMW(c), in one atomic step:

if (c == “?”) then
    Write(c, vi); decide vi;
else
    decide c;


Discussion

  • Protocol works correctly

– One processor accesses c first; this processor determines the decision

  • Protocol is wait-free
  • RMW is quite a strong primitive

– Can we achieve the same with a weaker primitive?


Read-Modify-Write more formally

  • Method takes 2 arguments:

– Variable x
– Function f

  • Method call:

– Returns value of x
– Replaces x with f(x)


Read-Modify-Write

public abstract class RMW {
    private int value;

    // return type fixed to int — the method returns the prior value
    public int rmw(java.util.function.IntUnaryOperator f) {
        int prior = this.value;                 // return prior value
        this.value = f.applyAsInt(this.value);  // apply function f
        return prior;
    }
}


Example: Read

public abstract class RMW {
    private int value;

    public int read() {
        int prior = this.value;
        this.value = this.value;   // identity function
        return prior;
    }
}


Example: test&set

public abstract class RMW {
    private int value;

    public int TAS() {
        int prior = this.value;
        this.value = 1;            // constant function
        return prior;
    }
}


Example: fetch&inc

public abstract class RMW {
    private int value;

    public int fai() {
        int prior = this.value;
        this.value = this.value + 1;   // increment function
        return prior;
    }
}


Example: fetch&add

public abstract class RMW {
    private int value;

    public int faa(int x) {
        int prior = this.value;
        this.value = this.value + x;   // addition function
        return prior;
    }
}


Example: swap

public abstract class RMW {
    private int value;

    public int swap(int x) {
        int prior = this.value;
        this.value = x;            // constant function
        return prior;
    }
}


Example: compare&swap

public abstract class RMW {
    private int value;

    // second parameter renamed: "new" is a reserved word in Java
    public int CAS(int old, int update) {
        int prior = this.value;
        if (this.value == old)
            this.value = update;   // complex function
        return prior;
    }
}


“Non-trivial” RMW

  • Not simply read
  • But

– test&set, fetch&inc, fetch&add, swap, compare&swap, general RMW

  • Definition: A RMW is non-trivial if there exists a value v such that v ≠ f(v)
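A worked check of the definition (not on the slide):

    \text{test\&set: } f(v) = 1, \text{ so } f(0) = 1 \neq 0 \quad\Rightarrow\quad \text{non-trivial}
    \text{read: } f(v) = v, \text{ so there is no } v \text{ with } v \neq f(v) \quad\Rightarrow\quad \text{trivial}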


Consensus Numbers (Herlihy)

  • An object has consensus number n

– If it can be used (together with atomic read/write registers) to implement n-thread consensus
– But not (n+1)-thread consensus

Consensus Numbers

  • Theorem

– Atomic read/write registers have consensus number 1

  • Proof

– Works with 1 process
– We have shown impossibility with 2


Consensus Numbers

  • Consensus numbers are a useful way of measuring synchronization power
  • Theorem

– If you can implement X from Y
– And X has consensus number c
– Then Y has consensus number at least c


Synchronization Speed Limit

  • Conversely

– If X has consensus number c
– And Y has consensus number d < c
– Then there is no way to construct a wait-free implementation of X by Y

  • This theorem will be very useful

– Unforeseen practical implications!


Theorem

  • Any non-trivial RMW object has consensus number at least 2
  • Implies no wait-free implementation of RMW registers from read/write registers
  • Hardware RMW instructions not just a convenience


Proof

public class RMWConsensusFor2 implements Consensus {
    private RMW r;                       // initialized to v

    public Object decide() {
        int i = Thread.myIndex();
        if (r.rmw(f) == v)               // am I first?
            return this.announce[i];     // yes, return my input
        else
            return this.announce[1-i];   // no, return other’s input
    }
}


Proof

  • We have displayed

– A two-thread consensus protocol
– Using any non-trivial RMW object


Interfering RMW

  • Let F be a set of functions such that for all fi and fj, either

– They commute: fi(fj(x)) = fj(fi(x))
– They overwrite: fi(fj(x)) = fi(x)

  • Claim: Any such set of RMW objects has consensus number exactly 2


Examples

  • Test-and-Set

– Overwrite

  • Swap

– Overwrite

  • Fetch-and-inc

– Commute

(The commute/overwrite equations are checked right below.)
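Checking two of these against the definitions above (worked out here, not on the slide):

    \text{fetch\&inc: } f_i(v) = f_j(v) = v + 1:\quad f_i(f_j(x)) = x + 2 = f_j(f_i(x)) \quad\text{(commute)}
    \text{swap: } f_i(v) = a,\ f_j(v) = b:\quad f_i(f_j(x)) = a = f_i(x) \quad\text{(overwrite)}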


Meanwhile Back at the Critical State c

[Figure: at critical state c, A is about to apply fA (leading to a 0-valent state) and B is about to apply fB (leading to a 1-valent state)]


Maybe the Functions Commute

[Figure: from c, apply fA then fB, or fB then fA, then let C run solo. One branch is 0-valent, the other 1-valent — yet because fA and fB commute, the two resulting states look the same to C. Contradiction.]


Maybe the Functions Overwrite

[Figure: from c, applying fB and then fA yields fA(fB(x)) = fA(x), the same as applying fA alone — so after C runs solo, the 0-valent and 1-valent branches look the same to C. Contradiction.]


Impact

  • Many early machines used these “weak” RMW instructions

– Test-and-set (IBM 360)
– Fetch-and-add (NYU Ultracomputer)
– Swap

  • We now understand their limitations

– But why do we want consensus anyway?


CAS has Unbounded Consensus Number

public class RMWConsensus implements Consensus {
    private RMW r;                     // initialized to -1

    public Object decide() {
        int i = Thread.myIndex();
        int j = r.CAS(-1, i);
        if (j == -1)                   // am I first?
            return this.announce[i];   // yes, return my input
        else
            return this.announce[j];   // no, return other’s input
    }
}


The Consensus Hierarchy

1   Read/Write Registers, …
2   T&S, F&I, Swap, …
⋮
∞   CAS, …


Consensus #4 Synchronous Systems

  • In real systems, one can sometimes tell if a processor has crashed

– Timeouts
– Broken TCP connections

  • Can one solve consensus at least in synchronous systems?


Communication Model

  • Complete graph
  • Synchronous

[Figure: five processors p1 … p5, pairwise connected]


Send a message to all processors in one round: Broadcast

[Figure: p1 sends message a to p2, p3, p4, p5]


At the end of the round: everybody receives a

[Figure: each of p1 … p5 now holds message a]


Broadcast: Two or more processes can broadcast in the same round

[Figure: p1 broadcasts a while another processor broadcasts b]


At end of round…

[Figure: the processors have received a, b, or both (a,b)]


Crash Failures

[Figure: the faulty processor crashes while broadcasting a]


Some of the messages are lost, they are never received

[Figure: only some of the faulty processor’s messages arrive before the crash]


Effect

[Figure: only some processors received a from the faulty processor]


Failure

After a failure, the process disappears from the network

[Figure: rounds 1–5 — p3 fails in round 3 and is absent from the network in rounds 4 and 5]

Consensus: Everybody has an initial value

[Figure: Start — the processors hold initial values 0, 1, 2, 3, 4]

Everybody must decide on the same value

[Figure: Finish — all processors decide 3]


Validity condition: if everybody starts with the same value, they must decide on that value

[Figure: Start — all processors hold 1; Finish — all decide 1]


A simple algorithm

Each processor (only one round is needed):

1. Broadcast own value to all processors
2. Decide on the minimum


[Figure: Start — initial values 0, 1, 2, 3, 4]


Broadcast values

[Figure: after broadcasting, every processor knows 0,1,2,3,4]


Decide on minimum

[Figure: every processor holds 0,1,2,3,4 and picks 0]


Finish


This algorithm satisfies the validity condition: if everybody starts with the same initial value, everybody sticks to that value (minimum)


Consensus with Crash Failures

With crash failures, the simple algorithm (each processor broadcasts its value, then decides on the minimum) doesn’t work


[Figure: Start — one processor fails while broadcasting]

The failed processor doesn’t broadcast its value to all processors


Broadcasted values

[Figure: some processors know 0,1,2,3,4; the others only 1,2,3,4]


Decide on minimum

[Figure: some processors decide 0, the others decide 1]


Finish — No Consensus!

[Figure: some processors decided 0, others 1]


If an algorithm solves consensus for f failed processes, we say it is an f-resilient consensus algorithm


Example: The input and output of a 3-resilient consensus algorithm

[Figure: Start — five processes with various values; Finish — the two surviving processes decide 1]


New validity condition: if all non-faulty processes start with the same value, then all non-faulty processes decide on that value

[Figure: Start — all processes hold 1; Finish — the non-faulty processes decide 1]


An f-resilient algorithm

Round 1: Broadcast my value
Round 2 to round f+1: Broadcast any new received values
End of round f+1: Decide on the minimum value received
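A Java sketch of this algorithm (the synchronous Round interface below is an illustrative assumption, not a real API):

    import java.util.Set;
    import java.util.TreeSet;

    class FloodingConsensus {
        interface Round {                          // assumed synchronous rounds
            void broadcast(Set<Integer> values);
            Iterable<Set<Integer>> receive();      // messages delivered this round
        }

        int decide(int myValue, int f, Round round) {
            TreeSet<Integer> known = new TreeSet<>();
            known.add(myValue);
            Set<Integer> fresh = new TreeSet<>(known);  // round 1: my own value
            for (int r = 1; r <= f + 1; r++) {
                round.broadcast(fresh);                 // later rounds: only new values
                fresh = new TreeSet<>();
                for (Set<Integer> msg : round.receive())
                    for (int v : msg)
                        if (known.add(v))               // newly learned value
                            fresh.add(v);
            }
            return known.first();                       // decide on the minimum
        }
    }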


Example: f = 1 failure, f+1 = 2 rounds needed

[Figure: Start — initial values 0, 1, 2, 3, 4]


Round 1: Broadcast all values to everybody

[Figure: one processor fails mid-broadcast; the new values received are 0,1,2,3,4 at some processors and only 1,2,3,4 at others]


Round 2: Broadcast all new values to everybody

[Figure: now every surviving processor knows 0,1,2,3,4]


Finish: Decide on minimum value

[Figure: every surviving processor decides 0]


Example: f = 2 failures, f+1 = 3 rounds needed — an execution with 2 failures

[Figure: Start — initial values 0, 1, 2, 3, 4]


Round 1: Broadcast all values to everybody

[Figure: failure 1 — only one processor receives 0,1,2,3,4; the rest receive 1,2,3,4]


Round 2: Broadcast new values to everybody

[Figure: failure 2 — the processor that knew 0,1,2,3,4 fails during its broadcast]


Round 3: Broadcast new values to everybody

[Figure: now every surviving processor knows 0,1,2,3,4]


Finish: Decide on the minimum value

[Figure: every surviving processor decides 0]


Example: 5 failures, 6 rounds

If there are f failures and f+1 rounds, then there is a round with no failed process

[Figure: rounds 1–6, one of which has no failure]


At the end of the round with no failure:

  • Every (non-faulty) process knows about all the values of all the other participating processes
  • This knowledge doesn’t change until the end of the algorithm


Therefore, at the end of the round with no failure: everybody would decide on the same value. However, as we don’t know the exact position of this round, we have to let the algorithm execute for f+1 rounds


Validity of algorithm: when all processes start with the same input value, the consensus is that value. This holds, since the value decided by each process is some input value


A Lower Bound

Theorem: Any f-resilient consensus algorithm requires at least f+1 rounds


Proof sketch: Assume for contradiction that f or fewer rounds are enough

Worst case scenario: There is a process that fails in each round


Worst case scenario, round 1: before a process fails, it sends its value a to only one other process


Worst case scenario, round 2: before the process that received a fails, it forwards a to only one other process


Worst case scenario, rounds 1 … f: at the end of round f, only one process pn knows about value a

Process pn may decide on a, and all other processes may decide on another value (b)


Therefore f rounds are not enough — at least f+1 rounds are needed


Consensus #5 Byzantine Failures

Different processes receive different values

[Figure: the faulty processor sends a, b, a, c to the four other processes]


A Byzantine process can also behave like a crash-failed process: some messages may be lost

[Figure: the faulty processor sends a to only some of the processes]


Failure

After failure the process continues functioning in the network

[Figure: rounds 1–6 — p3 fails in round 3 but, unlike a crashed process, keeps participating in the later rounds]

Consensus with Byzantine Failures

f-resilient consensus algorithm: solves consensus for f failed processes


Example: The input and output of a 1-resilient consensus algorithm

[Figure: Start — processes with various values; Finish — the non-faulty processes decide 3]


Validity condition: if all non-faulty processes start with the same value, then all non-faulty processes decide on that value

[Figure: Start — all processes hold 1; Finish — the non-faulty processes decide 1]


Lower bound on number of rounds

Theorem: Any f-resilient consensus algorithm requires at least f+1 rounds

Proof: follows from the crash failure lower bound


Upper bound on failed processes

Theorem: There is no f-resilient algorithm for n processes, where f ≥ n/3

Plan: First we prove the 3 process case, and then the general case


The 3 processes case

Lemma: There is no 1-resilient algorithm for 3 processes

Proof: Assume for contradiction that there is a 1-resilient algorithm for 3 processes


[Figure: processes p1, p2, p3 run local algorithms A, B, C with initial values A(0), B(1), C(0)]


[Figure: all three processes must reach the same decision value]


Assume 6 processes are in a ring (just for fun)

[Figure: p1 … p6 in a ring, running A(0), B(1), C(1), A(1), C(0), B(0)]


Processes think they are in a triangle

[Figure: two neighbors on the ring cannot distinguish the ring from a triangle in which the faulty third process plays both C(0) and C(1)]


[Figure: both non-faulty processes in this triangle start with 1, so by the validity condition they must decide 1]


[Figure: another triangle — the non-faulty processes run C(0) and B(0), while the faulty process plays both A(0) and A(1)]


[Figure: both non-faulty processes in this triangle start with 0, so by the validity condition they must decide 0]


[Figure: a third triangle — the non-faulty processes run A(1) and C(0), while the faulty process plays both B(1) and B(0)]


[Figure: by the two previous cases, A(1) must decide 1 and C(0) must decide 0 — the two non-faulty processes disagree]


Impossibility

[Figure: one non-faulty process decides 1 while the other decides 0 — contradiction]


Conclusion

There is no algorithm that solves consensus for 3 processes in which 1 is a Byzantine process


The n processes case

Assume for contradiction that there is an f-resilient algorithm A for n processes, where f ≥ n/3. We will use algorithm A to solve consensus for 3 processes and 1 failure (which is impossible, thus we have a contradiction)


Algorithm A

[Figure: n processes p1 … pn start with various values; despite failures, all non-faulty processes finish deciding 1]


Each process q simulates algorithm A on n/3 of the “p” processes

[Figure: the n simulated processes p1 … pn are partitioned among q1, q2, q3]


When a single q is Byzantine, then n/3 of the “p” processes are Byzantine too

[Figure: one q fails, taking its n/3 simulated processes with it]


Finish of algorithm A: since algorithm A tolerates n/3 failures, all simulated processes of the non-faulty q’s decide the same value k

[Figure: every surviving simulated process decides k]


Final decision

[Figure: the two non-faulty q’s decide k — we reached consensus with 1 failure. Impossible!]


Conclusion

There is no f-resilient algorithm for n processes with f ≥ n/3


The King Algorithm

Solves consensus with n processes and f failures, where f < n/4, in f+1 “phases”:

– There are f+1 phases
– Each phase has two rounds
– In each phase there is a different king


Example: 12 processes, 2 faults, 3 kings

[Figure: 12 processes with initial values (mostly 1s and 2s); two of them are faulty]


Remark: There is a king that is not faulty

[Figure: the same 12 processes with King 1, King 2, King 3 marked — with only 2 faults, at least one of the 3 kings is non-faulty]


The King algorithm

Each processor pi has a preferred value vi; in the beginning, the preferred value is set to the initial value


The King algorithm: Phase k

Round 1, processor pi:

  • Broadcast preferred value vi
  • Set vi to the majority of the values received


The King algorithm: Phase k

Round 2, king pk:

  • Broadcast new preferred value vk

Round 2, processor pi:

  • If vi had a majority of less than n/2 + f, then set vi to vk


The King algorithm

End of Phase f+1: Each process decides on its preferred value
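The two-round phase structure, written out as a Java sketch (the Round interface and the majority helper are illustrative assumptions; the threshold n/2 + f is the one from the slides):

    import java.util.HashMap;
    import java.util.Map;

    class KingConsensus {
        interface Round {                                // assumed synchronous rounds
            int[] exchange(int preferred);               // round 1: all-to-all broadcast
            int kingValue(int kingId, int preferred);    // round 2: king’s broadcast
        }

        int decide(int myValue, int n, int f, Round round) {
            int pref = myValue;                          // preferred value = initial value
            for (int phase = 1; phase <= f + 1; phase++) {
                int[] received = round.exchange(pref);   // round 1
                pref = majority(received);               // adopt the majority value
                int count = 0;
                for (int v : received) if (v == pref) count++;
                int kingVal = round.kingValue(phase, pref); // round 2: a new king each phase
                if (count < n / 2 + f)                   // weak majority: obey the king
                    pref = kingVal;
            }
            return pref;                                 // decide at end of phase f+1
        }

        private static int majority(int[] vals) {        // most frequent value received
            Map<Integer, Integer> counts = new HashMap<>();
            int best = vals[0];
            for (int v : vals)
                if (counts.merge(v, 1, Integer::sum) > counts.getOrDefault(best, 0))
                    best = v;
            return best;
        }
    }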


Example: 6 processes, 1 fault

[Figure: six processes, one faulty, with initial values including 1, 1, 1, 2; king 1 and king 2 are marked]


Phase 1, Round 1: Everybody broadcasts

[Figure: the faulty process sends inconsistent values; the received multisets are 2,1,1,0,0,0 / 2,1,1,1,0,0 / 2,1,1,1,0,0 / 2,1,1,0,0,0 / 2,1,1,0,0,0]


Phase 1, Round 1: Choose the majority

Each majority population was 3 ≤ n/2 + f = 4, so in round 2 everybody will choose the king’s value


Phase 1, Round 2: The king broadcasts

[Figure: king 1 sends its preferred value to everybody]


Phase 1, Round 2: Everybody chooses the king’s value

[Figure: all processes adopt king 1’s value]


Phase 2, Round 1: Everybody broadcasts

[Figure: again the received multisets are 2,1,1,0,0,0 / 2,1,1,1,0,0 / …]


Phase 2, Round 1: Choose the majority

Each majority population is 3 ≤ n/2 + f = 4; in round 2, everybody will choose the king’s value


Phase 2, Round 2: The king broadcasts

[Figure: king 2 sends its preferred value 1 to everybody]


Phase 2, Round 2: Everybody chooses the king’s value — final decision

[Figure: all non-faulty processes decide 1]


Invariant / Conclusion

In the phase where the king is non-faulty, everybody will choose the king’s value v. After that phase, the majority remains with value v, with a majority population of at least n − f > n/2 + f
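Filling in the arithmetic behind this invariant (a worked step, not spelled out on the slide): the non-faulty processes number at least n − f, so the invariant requires

    n - f > \frac{n}{2} + f
    \iff \frac{n}{2} > 2f
    \iff n > 4f,

which is exactly the assumption f < n/4.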


Exponential Algorithm

Solves consensus with n processes and f failures, where f < n/3, in f+1 “phases”. But: uses messages of exponential size


Atomic Broadcast

  • One process wants to broadcast a message to all other processes
  • Either everybody should receive the (same) message, or nobody should receive the message
  • Closely related to Consensus: First send the message to all, then agree!


Summary

  • We have solved consensus in a variety of models; in particular we have seen

– algorithms
– wrong algorithms
– lower bounds
– impossibility results
– reductions
– etc.


Questions?