Fault-Tolerant State Machine Replication - Chinasa T. Okolo

SLIDE 1

Fault-Tolerant State Machine Replication

Chinasa T. Okolo

Slides borrowed from Hakim Weatherspoon and Drew Zagieboylo

SLIDE 2

Authors

Fred Schneider

  • Samuel B. Eckert Professor of Computer Science
  • AAAS, ACM, and IEEE Fellow
  • Concurrent and distributed systems for high-integrity and mission-critical settings

SLIDE 3

Outline

  • Motivation
  • State Machine Replication Approach
  • Implementation
  • Fault Tolerance
  • Chain Replication
  • Conclusions

SLIDE 4

Motivation

[Diagram: clients send get(x) to a single server holding X = 10; one request returns 10, another gets no response]

SLIDE 5

Motivation

  • Need replication for fault tolerance
  • What happens in scenarios without replication?
  • Storage - Disk Failure
  • Web service - Network failure
  • Be able to reason about failure tolerance
  • How badly can things go wrong and have our system continue to function?

SLIDE 6

Motivation

[Diagram: a client and a replicated server; four replicas each hold X = 10]

SLIDE 7

Motivation

[Diagram: four server replicas each hold X = 3; a put(x,10) request arrives]

SLIDE 8

Motivation

[Diagram: three replicas hold X = 10, one still holds X = 3; one get(x) returns 10, another get(x) returns 3. Problem!]

SLIDE 9

Problem

How can we ensure that all replicas are in the same state all of the time?

SLIDE 10

Outline

  • Motivation
  • State Machine Replication Approach
  • Implementation
  • Fault Tolerance
  • Chain Replication
  • Conclusions

SLIDE 11

State Machines

[Diagram: state X = Y, on command c, becomes state X = Z via f(c)]

  • c is a Command
  • f is a Transition Function
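
The picture above can be sketched in a few lines of Python. This is only an illustration, assuming a key-value state and hypothetical put/get commands; the names Command and transition are not from the slides.

```python
from typing import NamedTuple, Optional

class Command(NamedTuple):
    op: str                      # "put" or "get" (hypothetical command names)
    key: str
    value: Optional[int] = None

def transition(state: dict, c: Command):
    """Transition function f: maps (state, command) to (new state, output)."""
    if c.op == "put":
        new_state = dict(state)  # copy, so the old state is left untouched
        new_state[c.key] = c.value
        return new_state, "ok"
    if c.op == "get":
        return state, state.get(c.key)
    raise ValueError(f"unknown command: {c.op}")

state = {"x": 3}
state, out = transition(state, Command("put", "x", 10))
state, value = transition(state, Command("get", "x"))
print(value)
```

Because transition depends only on its arguments, replaying the same commands from the same initial state always gives the same result, which is the property replication relies on.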

SLIDE 12

State Machine Coding

  • State machines are procedures
  • Client calls procedure
  • Avoid loops
  • Flexible structure

SLIDE 13

State Machine Replication

  • Each replica starts in the same initial state
  • Each executes the same requests
  • Requires consensus so that all replicas execute in the same order
  • Being deterministic, each replica does exactly the same thing
  • Each produces the same output

SLIDE 14

State Machine Replication

All non-faulty servers need:

  • Agreement

○ Every replica needs to accept the same set of requests

  • Order

○ All replicas process requests in the same relative order
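
A small sketch of why both requirements matter, assuming requests are simple put(key, value) pairs (the function names here are illustrative). Two replicas that apply the same requests in the same order end in the same state; a replica that applies them in a different order diverges.

```python
def apply(state, request):
    """Deterministic transition: request is (key, value), meaning put(key, value)."""
    key, value = request
    new_state = dict(state)
    new_state[key] = value
    return new_state

def run_replica(initial, requests):
    """Start from the initial state and apply every request in the given order."""
    state = dict(initial)
    for r in requests:
        state = apply(state, r)
    return state

initial = {"x": 3}
log = [("x", 10), ("x", 30)]                            # the agreed-upon order

replica_a = run_replica(initial, log)
replica_b = run_replica(initial, log)
replica_c = run_replica(initial, list(reversed(log)))   # violates Order

print(replica_a == replica_b)   # True: same requests, same order
print(replica_a == replica_c)   # False: same requests, different order
```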

SLIDE 15

Outline

  • Motivation
  • State Machines
  • Implementation
  • Fault Tolerance
  • Chain Replication
  • Conclusions

SLIDE 16

Implementation

Agreement

  • Transmitter proposes a request; if it is non-faulty, all servers will accept that request

  • Transmitter can be client or server
  • Client or Server can propose the request

SLIDE 17

Implementation

Agreement

  • IC1: All non-faulty processors agree on the same value
  • IC2: If the transmitter is non-faulty, all non-faulty processors agree on its value

SLIDE 18

Ordering

“The Order requirement can be satisfied by assigning unique identifiers to requests and having state machine replicas process requests according to a total ordering relation on these unique identifiers.”

SLIDE 19

Implementation

  • Order
  • Assign unique ids to requests and process them in ascending order.
  • How do we assign unique ids in a distributed system?

SLIDE 20

Implementation: Client Generated IDs

Ordering via clocks

  • Logical Clocks
  • Synchronized Clocks
  • Ideas from last class! [Lamport 1978]
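
One way a client could generate unique, totally ordered ids from a logical clock is sketched below. It assumes each node has a distinct numeric node_id used to break ties; the class and method names are illustrative only.

```python
class LamportClock:
    """Logical clock whose ids are (counter, node_id) pairs.

    Pairs compare lexicographically, so ids are totally ordered, and the
    node_id component keeps ids from different nodes distinct.
    """

    def __init__(self, node_id: int):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        """Local event (e.g. issuing a request): advance and return a unique id."""
        self.counter += 1
        return (self.counter, self.node_id)

    def receive(self, timestamp):
        """On receiving a message, move past the sender's counter."""
        self.counter = max(self.counter, timestamp[0]) + 1

a, b = LamportClock(1), LamportClock(2)
r1 = a.tick()        # (1, 1)
r2 = b.tick()        # (1, 2)
b.receive(r1)        # b's counter becomes 2
r3 = b.tick()        # (3, 2)
ordered = sorted([r3, r2, r1])
print(ordered)       # [(1, 1), (1, 2), (3, 2)]
```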

SLIDE 21

Can the replicas generate unique identifiers?

Of course!

SLIDE 22

Implementation: Replica Generated IDs

  • 2 Phase ID generation
  • Every replica proposes a candidate
  • One candidate is chosen and agreed upon by all replicas

SLIDE 23

Implementation: Replica Generated IDs

  • When do we know a candidate is stable?
  • A candidate is accepted
  • No other pending requests with smaller candidate ids
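
The stability test above could be sketched as follows. The data layout is an assumption made for illustration: accepted maps a request to its final id, and pending maps requests this replica has seen but not yet accepted to their candidate ids. Treating a tie as "not stable" is a conservative choice made here, not something the slide specifies.

```python
def is_stable(request, accepted, pending):
    """Return True if `request` may be processed at this replica.

    accepted: dict of request -> agreed (final) id
    pending:  dict of request -> candidate id, for requests seen but not accepted
    """
    if request not in accepted:
        return False                  # a candidate must be accepted first
    uid = accepted[request]
    # Stable only if no pending request could still end up ordered before it.
    return all(candidate > uid for candidate in pending.values())

accepted = {"r1": 5, "r2": 9}
pending = {"r3": 7}

print(is_stable("r1", accepted, pending))   # True: 7 > 5
print(is_stable("r2", accepted, pending))   # False: r3 might be ordered before r2
print(is_stable("r3", accepted, pending))   # False: not accepted yet
```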

SLIDE 24

Stability Testing

  • Stability tests for logical and synchronized clocks?
  • Disadvantages
  • Stability tests require all nodes to communicate

■ Logical: stabilizing requests
■ Synchronized: clock synchronization

SLIDE 25

Outline

  • Motivation
  • State Machines
  • Implementation
  • Fault Tolerance
  • Chain Replication
  • Conclusions

SLIDE 26

When does behavior become faulty?

When it’s no longer consistent with specification!

SLIDE 27

Fault Tolerance

  • Fail-Stop
  • A faulty server can be detected as faulty
  • Crash Failures
  • Server can stop responding without notification (subset of Byzantine)
  • Byzantine
  • Faulty servers can do arbitrary, perhaps malicious things

SLIDE 28

Fault Tolerance

  • Fail-Stop Tolerance

○ To tolerate t failures, need t+1 servers.
○ As long as 1 server remains, we’re OK!
○ Only need to participate in protocols with other live servers

SLIDE 29

Fault Tolerance

Byzantine Failures

To tolerate t failures, need 2t + 1 servers

  • Protocols now involve votes

○ Can only trust server response if the majority of servers say the same thing

  • t + 1 servers need to participate in replication protocols
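
The replica counts and the voting rule from the last two slides can be written down as a short sketch. The function names are illustrative; the vote helper assumes the client has already collected one response per replica.

```python
from collections import Counter

def replicas_needed(t: int, byzantine: bool) -> int:
    """Replicas required to tolerate t failures: t + 1 fail-stop, 2t + 1 Byzantine."""
    return 2 * t + 1 if byzantine else t + 1

def vote(responses, t):
    """Return the value reported by at least t + 1 replicas, or None.

    With at most t faulty replicas, any value backed by t + 1 matching
    responses was reported by at least one non-faulty replica.
    """
    if not responses:
        return None
    value, count = Counter(responses).most_common(1)[0]
    return value if count >= t + 1 else None

print(replicas_needed(1, byzantine=False))   # 2
print(replicas_needed(1, byzantine=True))    # 3
print(vote([10, 10, 3], t=1))                # 10: two of three replicas agree
print(vote([10, 3, 7], t=1))                 # None: no value has t + 1 votes
```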

SLIDE 30

Takeaways

  • Can represent a deterministic distributed system as a Replicated State Machine
  • Each replica reaches the same conclusion about the system independently
  • Formalizes notions of fault-tolerance in SMR

SLIDE 31

Discussion

  • Why is State Machine Replication so important?
  • What is the best case scenario in terms of replications for fault tolerance?

  • Is the state machine approach still feasible?

SLIDE 32

Outline

  • Motivation
  • State Machines
  • Implementation
  • Fault Tolerance
  • Chain Replication
  • Conclusions

SLIDE 33

Chain Replication

Authors

  • Robert Van Renesse

○ Senior Researcher at Cornell
○ ACM Fellow and Ukulele enthusiast
○ Systems and Networking

  • Fred Schneider

SLIDE 34

Chain Replication

  • Fault Tolerant Storage Service
  • Requests:
  • Update(x, y) => set object x to value y
  • Query(x) => read value of object x

SLIDE 35

Chain Replication

[Diagram: four replicas, each holding X = 3]

SLIDE 36

Chain Replication

[Diagram: chain of four replicas from Head to Tail, each holding X = 3; the client issues get(x) and receives 3]

SLIDE 37

Chain Replication

[Diagram: chain of four replicas from Head to Tail, each holding X = 3; the client issues put(x,30)]

SLIDE 38

Chain Replication

[Diagram: the client issues put(x,30); Head now holds X = 30, the other three replicas still hold X = 3. Head's request table: Req. r0, UID 1]

1) Head assigns uid

SLIDE 39

Chain Replication

[Diagram: the first two replicas in the chain hold X = 30, the last two still hold X = 3; each updated replica records Req. r0, UID 1]

2) Head sends message to next node

SLIDE 40

Chain Replication

[Diagram: the first three replicas in the chain hold X = 30, the Tail still holds X = 3; each updated replica records Req. r0, UID 1]

3) Repeat until tail is reached

SLIDE 41

Chain Replication

[Diagram: all four replicas from Head to Tail hold X = 30 and record Req. r0, UID 1; the client receives x = 30]

4) Respond to client with success

SLIDE 42

Chain Replication Assumptions

  • No partition tolerance
  • High throughput
  • Fail-stop processors
  • A universally accessible, failure resistant or replicated Master

SLIDE 43

Chain Replication

How does Chain Replication implement State Machine Replication?

  • Agreement
  • Only Update modifies state, can ignore Query
  • Client always sends update to Head. Head propagates request down chain to Tail.

  • Everyone accepts the request!

SLIDE 44

Chain Replication

How does Chain Replication implement State Machine Replication?

  • Order
  • Unique IDs generated implicitly by Head’s ordering
  • FIFO order preserved down the chain
  • Tail interleaves Query requests
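
The update and query paths described on the last two slides can be sketched as below. This is a simplification under stated assumptions: the chain lives in one process, forwarding is a synchronous method call instead of a network message, and there are no failures. The class names Node and Chain are illustrative.

```python
class Node:
    def __init__(self):
        self.store = {}
        self.history = []      # (uid, key, value) in the order applied
        self.next = None       # successor in the chain, None at the tail

    def apply(self, uid, key, value):
        """Apply an update locally, then forward it down the chain."""
        self.store[key] = value
        self.history.append((uid, key, value))
        if self.next is not None:
            return self.next.apply(uid, key, value)   # FIFO hand-off to successor
        return "success"                               # the tail replies

class Chain:
    def __init__(self, length):
        self.nodes = [Node() for _ in range(length)]
        for node, successor in zip(self.nodes, self.nodes[1:]):
            node.next = successor
        self.next_uid = 0

    def update(self, key, value):
        """Updates enter at the head, which assigns the uid (and so the order)."""
        self.next_uid += 1
        return self.nodes[0].apply(self.next_uid, key, value)

    def query(self, key):
        """Queries are answered by the tail."""
        return self.nodes[-1].store.get(key)

chain = Chain(4)
chain.update("x", 3)
reply = chain.update("x", 30)
print(reply, chain.query("x"))
```

Since every node sees the updates in the head's order, all histories are identical, which is how the chain gets both Agreement and Order.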

SLIDE 45

Chain Replication Fault Tolerance

  • Trusted Master

○ Fault-tolerant state machine
○ Trusted by all replicas
○ Monitors all replicas & issues commands

SLIDE 46

Chain Replication Fault Tolerance

  • Head Fails

○ Master assigns 2nd node as Head

  • Intermediate Node Fails

○ Master coordinates chain link-up

  • Tail Fails

○ Master assigns 2nd to last node as Tail
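
The three cases above amount to removing the failed node and re-reading who is first and last. A minimal sketch of the master's bookkeeping follows, assuming the chain is just an ordered list of node names; it deliberately ignores any handling of updates that were in flight when the node failed. The function name remove_failed is hypothetical.

```python
def remove_failed(chain, failed):
    """Return (new_chain, new_head, new_tail) after the master drops `failed`."""
    if failed not in chain:
        raise ValueError(f"{failed} is not in the chain")
    new_chain = [node for node in chain if node != failed]
    if not new_chain:
        raise RuntimeError("no replicas left")
    return new_chain, new_chain[0], new_chain[-1]

chain = ["a", "b", "c", "d"]

print(remove_failed(chain, "a"))   # head fails: "b" becomes the head
print(remove_failed(chain, "c"))   # middle fails: "b" now links directly to "d"
print(remove_failed(chain, "d"))   # tail fails: "c" becomes the tail
```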

SLIDE 47

Outline

  • Motivation
  • State Machines
  • Implementation
  • Fault Tolerance
  • Chain Replication
  • Conclusions

SLIDE 48

Conclusions

  • Implements the “exercise left to the reader” hinted at by Lamport’s paper
  • Provides some of the concrete details needed to actually implement this idea
  • But still a fair number of details in real implementations that would need to be considered
  • Chain replication illustrates a “simple” example with fully concrete details
  • A key contribution that bridges the gap between academia and practicality for SMR

SLIDE 49

Chain Replication Discussion

  • Comparison to other primary/backup protocols?
  • What are the tradeoffs of Chain Replication?
  • Latency
  • Consistency
  • Any thoughts on the Trusted Master system?
