SLIDE 1

Byzantine Fault Tolerance

CS 425: Distributed Systems, Fall 2011

Material derived from slides by I. Gupta and N. Vaidya

SLIDE 2

Reading List

  • L. Lamport, R. Shostak, M. Pease, “The Byzantine Generals Problem,” ACM TOPLAS, 1982.
  • M. Castro and B. Liskov, “Practical Byzantine Fault Tolerance,” OSDI 1999.

SLIDE 3

Byzantine Generals Problem

A sender wants to send a message to n-1 other peers

  • Fault-free nodes must agree
  • Sender fault-free ⇒ agree on its message
  • Up to f failures

SLIDE 5

Byzantine Generals Algorithm

[Diagram: source S sends value v to peers 1, 2, and 3; one peer is faulty]

SLIDE 6

Byzantine Generals Algorithm

[Diagram: each fault-free peer relays the value v it received from S to the other peers]

SLIDE 7

Byzantine Generals Algorithm

[Diagram: the faulty peer relays arbitrary values, shown as “?”, to the other peers]

SLIDE 8

Byzantine Generals Algorithm

[Diagram: the relay round completes; fault-free peers relay v, the faulty peer relays “?”]

SLIDE 9

Byzantine Generals Algorithm

[Diagram: the faulty peer relays an arbitrary value x; each fault-free peer ends up holding the vector [v, v, ?]]

SLIDE 10

Byzantine Generals Algorithm

[Diagram: each fault-free peer holds the vector [v, v, ?] and takes a majority vote]

Majority vote yields the correct value v at the good peers.

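The two rounds sketched on slides 5-10 can be written out directly. Below is a minimal Python sketch, assuming n = 4 peers and f = 1 failure as in the figures; the names DEFAULT, majority, and byzantine_generals_vote are illustrative, not from the slides:

    from collections import Counter

    DEFAULT = "RETREAT"  # deterministic fallback when no strict majority exists

    def majority(values):
        # Return the strict-majority value, or DEFAULT if there is none.
        value, count = Counter(values).most_common(1)[0]
        return value if count > len(values) / 2 else DEFAULT

    def byzantine_generals_vote(from_source, relayed):
        # from_source: the value this peer received directly from S (round 1).
        # relayed: the values the other peers claim S sent them (round 2).
        return majority([from_source] + relayed)

    # Fault-free source, one faulty peer (slides 5-10): each good peer
    # ends up holding [v, v, ?], and the majority vote still yields v.
    print(byzantine_generals_vote("v", ["v", "?"]))  # -> v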

SLIDE 11

Byzantine Generals Algorithm

[Diagram: faulty source S sends different values v, w, x to peers 1, 2, and 3]

SLIDE 12

Byzantine Generals Algorithm

[Diagram: the peer that received w relays it to the other two peers]

SLIDE 13

Byzantine Generals Algorithm

[Diagram: every peer relays the value it received from S to the other peers]

SLIDE 14

Byzantine Generals Algorithm

[Diagram: after the exchange, every peer holds the same vector [v, w, x]]

SLIDE 15

Byzantine Generals Algorithm

[Diagram: every peer holds the vector [v, w, x]]

The vote result is identical at the good peers.

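Continuing the sketch above: with a faulty source (slides 11-15), every fault-free peer ends up holding the same multiset {v, w, x}. No strict majority exists, so all good peers fall back to the same deterministic default and still agree:

    # Faulty source: S sends v, w, x to peers 1, 2, 3. After relaying,
    # each good peer holds a permutation of the same values, so the
    # vote returns the same default value everywhere.
    print(byzantine_generals_vote("v", ["w", "x"]))  # peer 1 -> RETREAT
    print(byzantine_generals_vote("w", ["v", "x"]))  # peer 2 -> RETREAT
    print(byzantine_generals_vote("x", ["v", "w"]))  # peer 3 -> RETREAT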

SLIDE 16

Known Results

  • Need 3f + 1 nodes to tolerate f failures
  • Need Ω(n²) messages in general

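As a quick sanity check on these bounds (a sketch; the helper name is mine, not from the slides):

    def min_replicas(f):
        # Tolerating f Byzantine failures requires at least 3f + 1 nodes.
        return 3 * f + 1

    for f in (1, 2, 3):
        n = min_replicas(f)
        # Omega(n^2): in the worst case every pair of nodes must exchange
        # at least one message, so the count grows quadratically with n.
        print(f"f={f}: n={n} nodes, on the order of {n * n} messages")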

SLIDE 17

Ω(n²) Message Complexity

  • Each message carries at least 1 bit
  • Hence Ω(n²) bits of “communication complexity” just to agree on a 1-bit value

SLIDE 18

Practical Byzantine Fault Tolerance

  • Computer systems provide crucial services
  • Computer systems fail
    – Crash-stop failure
    – Crash-recovery failure
    – Byzantine failure
  • Examples: natural disaster, malicious attack, hardware failure, software bug, etc.
  • Need highly available service

Replicate to increase availability

SLIDE 19

Challenges

[Diagram: two clients send Request A and Request B to the replicated service]

SLIDE 20

Requirements

  • All replicas must handle the same requests despite failures.
  • Replicas must handle requests in identical order despite failures.

SLIDE 21

Challenges

[Diagram: the clients’ requests must be ordered: 1: Request A, 2: Request B]

SLIDE 22

State Machine Replication

[Diagram: every replica executes 1: Request A and then 2: Request B in the same order]

How do we assign sequence numbers to requests?

SLIDE 23

Primary Backup Mechanism

[Diagram: in view 0, the primary assigns sequence numbers to client requests (1: Request A, 2: Request B) and the backups follow]

What if the primary is faulty?
  – Agreeing on the sequence number
  – Agreeing on changing the primary (view change)

SLIDE 24

Normal Case Operation

  • Three-phase algorithm:
    – PRE-PREPARE picks the order of requests
    – PREPARE ensures order within views
    – COMMIT ensures order across views
  • Replicas remember messages in a log
  • Messages are authenticated
    – {.}σk denotes a message sent by k

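The three message types and the {.}σk notation can be sketched as plain records. Field names follow the slides (view v, sequence number n, request m, digest D(m)); the dataclass layout and the replica-id field are my own illustration, loosely following the PBFT paper:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class PrePrepare:   # {PRE-PREPARE, v, n, m}σ_primary
        view: int       # v: current view number
        seq: int        # n: sequence number chosen by the primary
        request: bytes  # m: the client request itself
        sig: bytes      # σ: the primary's signature

    @dataclass(frozen=True)
    class Prepare:      # {PREPARE, v, n, D(m), i}σ_i
        view: int
        seq: int
        digest: bytes   # D(m): digest of the request, not the request
        replica: int    # i: id of the sending replica
        sig: bytes

    @dataclass(frozen=True)
    class Commit:       # {COMMIT, v, n, D(m), i}σ_i
        view: int
        seq: int
        digest: bytes
        replica: int
        sig: bytes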

SLIDE 25

Pre-prepare Phase

[Diagram: client request m arrives at the primary, replica 0, which multicasts {PRE-PREPARE, v, n, m}σ0 to replicas 1-3; replica 3 has failed]

SLIDE 26

Prepare Phase

[Diagram: replicas 1 and 2 accept the PRE-PREPARE from the primary; replica 3 has failed]

SLIDE 27

Prepare Phase

[Diagram: each replica that accepted the PRE-PREPARE multicasts a PREPARE, e.g. {PREPARE, v, n, D(m), 1}σ1 from replica 1]

SLIDE 28

Prepare Phase

[Diagram: replicas exchange PREPARE messages]

Each replica collects the PRE-PREPARE plus 2f matching PREPAREs.

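The “collect PRE-PREPARE + 2f matching PREPARE” rule is a predicate over the message log. A sketch using the record types above (the flat-list log layout is an assumption):

    def prepared(log, view, seq, digest, f):
        # Prepared certificate (slide 28): a logged PRE-PREPARE for (v, n)
        # plus PREPAREs matching (v, n, D(m)) from 2f distinct replicas.
        has_pre_prepare = any(
            isinstance(m, PrePrepare) and (m.view, m.seq) == (view, seq)
            for m in log)
        senders = {m.replica for m in log
                   if isinstance(m, Prepare)
                   and (m.view, m.seq, m.digest) == (view, seq, digest)}
        return has_pre_prepare and len(senders) >= 2 * f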

SLIDE 29

Commit Phase

[Diagram: once prepared, each replica multicasts a COMMIT, e.g. {COMMIT, v, n, D(m)}σ2 from replica 2]

SLIDE 30

Commit Phase (2)

[Diagram: replicas exchange COMMIT messages]

Collect 2f + 1 matching COMMITs: execute the request and reply.

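The commit rule composes with the prepared predicate above: a replica executes and replies once it is prepared and holds 2f + 1 matching COMMITs from distinct replicas. A sketch under the same assumptions:

    def committed_local(log, view, seq, digest, f):
        # Commit rule (slide 30): prepared certificate plus 2f + 1
        # matching COMMIT messages from distinct replicas.
        senders = {m.replica for m in log
                   if isinstance(m, Commit)
                   and (m.view, m.seq, m.digest) == (view, seq, digest)}
        return prepared(log, view, seq, digest, f) and len(senders) >= 2 * f + 1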

SLIDE 31

View Change

  • Provides liveness when the primary fails
    – Timeouts trigger view changes
    – Select new primary (= view number mod 3f + 1)
  • Brief protocol
    – Replicas send a VIEW-CHANGE message along with the requests they have prepared so far
    – The new primary collects 2f + 1 VIEW-CHANGE messages
    – It constructs information about committed requests in previous views

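The primary-selection rule above is a single modular step; a sketch (the timeout plumbing is omitted):

    def primary_of(view, n):
        # With n = 3f + 1 replicas, the primary rotates round-robin:
        # replica (view mod n) leads that view (slide 31).
        return view % n

    # A timeout in view v triggers a change to view v + 1, led by the
    # next replica in rotation.
    print([primary_of(v, 4) for v in range(6)])  # [0, 1, 2, 3, 0, 1]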

SLIDE 32

View Change Safety

  • Goal: no two different committed requests with the same sequence number across views

[Diagram: any quorum for a Committed Certificate (m, v, n) intersects any View Change Quorum, so at least one correct replica holds the Prepared Certificate (m, v, n)]

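The safety argument is a counting argument: any commit quorum of 2f + 1 replicas and any view-change quorum of 2f + 1 replicas, out of n = 3f + 1 total, overlap in at least (2f + 1) + (2f + 1) − (3f + 1) = f + 1 replicas. At most f of those can be faulty, so at least one correct replica carries the prepared certificate into the new view. A quick check:

    def min_overlap(f):
        # Smallest possible intersection of two quorums of size 2f + 1
        # drawn from n = 3f + 1 replicas.
        quorum, n = 2 * f + 1, 3 * f + 1
        return 2 * quorum - n  # = f + 1

    for f in range(1, 6):
        assert min_overlap(f) == f + 1  # always exceeds f, the fault bound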

SLIDE 33

Related Works

Fault Tolerance
  • Fail-Stop Fault Tolerance
    – Paxos, 1989 (TR)
    – VS Replication, PODC 1988
  • Byzantine Fault Tolerance
    – Byzantine Agreement: Rampart, TPDS 1995; SecureRing, HICSS 1998; PBFT, OSDI ’99; BASE, TOCS ’03
    – Byzantine Quorums: Malkhi-Reiter, JDC 1998; Phalanx, SRDS 1998; Fleet, ToKDI ’00; Q/U, SOSP ’05
    – Hybrid Quorum: HQ Replication, OSDI ’06