EECS 591 Distributed Systems, Manos Kapritsos, Fall 2020: State Machine Replication

SLIDE 1

EECS 591 DISTRIBUTED SYSTEMS

Manos Kapritsos Fall 2020

SLIDE 2

STATE MACHINE REPLICATION

SLIDE 3

MODELING FAULTS

Mean Time To Failure / Mean Time To Recover
    used mostly for disks
    of questionable value in expressing reliability

Threshold: f out of n
    makes condition for correct operation explicit
    measures fault-tolerance of the architecture, not of individual components

Enumerate failure scenarios

SLIDE 4

A HIERARCHY OF FAILURE MODELS

Crash
Fail-stop
Send omission
Receive omission
General omission
Arbitrary (Byzantine) failures

All of the above except arbitrary (Byzantine) failures are benign failures.

SLIDE 5

A HIERARCHY OF FAILURE MODELS

crash

SLIDE 6

FAULT TOLERANCE: THE PROBLEM

[Figure: clients and a server]

Solution: replicate the server

SLIDE 7

REPLICATION IN TIME

When a server fails, restart it or replace it
Failures are detected, not masked
Lower maintenance, lower availability
Tolerates only benign failures

SLIDE 8

REPLICATION IN SPACE

Run multiple copies of a server (replicas)
Vote on replica output
Failures are masked
High availability and can tolerate arbitrary failures, but at high cost

SLIDE 9

THE ENEMY: NON-DETERMINISM

An event is non-deterministic if its output is not uniquely determined by its input

The problem with non-determinism:
    Replication in time: must reproduce the original outcome of all non-deterministic events
    Replication in space: each replica must handle non-deterministic events identically

SLIDE 10

THE SOLUTION: STATE MACHINES

Design the server as a deterministic state machine

[Figure: state machine diagram; labels 1, 2, 3, 4 and a, b, c, d, e, f]

SLIDE 11

THE SOLUTION: STATE MACHINES

State machine example: a switch

[Figure: switch state machine; two transitions, both labeled "click"]
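
The switch example can be sketched as a two-state deterministic machine. This is only an illustration: the state names `off` and `on`, the `TRANSITIONS` table, and the helpers `step` and `run` are assumptions, not taken from the slides.

```python
# Hypothetical two-state switch; all names here are illustrative.
TRANSITIONS = {
    ("off", "click"): "on",
    ("on", "click"): "off",
}


def step(state, command):
    """Return the next state, which depends only on (state, command)."""
    return TRANSITIONS[(state, command)]


def run(initial, commands):
    """Apply a sequence of commands starting from `initial`."""
    state = initial
    for command in commands:
        state = step(state, command)
    return state
```

Because `step` has no hidden inputs (no clock, no randomness), replaying the same commands from the same initial state always ends in the same state.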

SLIDE 12

STATE MACHINE REPLICATION

Ingredients: a server

1. Make server deterministic (state machine)
2. Replicate server
3. Ensure that all replicas go through the same sequence of state transitions
4. Vote on replica outputs

[Figure labels: x = 1, x = 2]

SLIDE 13

STATE MACHINE REPLICATION

3. Ensure that all replicas go through the same sequence of state transitions

All state machines receive all commands in the same order

[Figure labels: x = 1, x = 2]
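
Why command order matters can be seen with the two commands from the figure, x = 1 and x = 2. The sketch below is a minimal illustration under assumptions: the `Replica` class, its dictionary state, and the `replay` helper are hypothetical and do not come from the slides.

```python
class Replica:
    """A deterministic state machine holding variable assignments (hypothetical)."""

    def __init__(self):
        self.state = {}

    def apply(self, command):
        # A command is a (name, value) assignment such as ("x", 1).
        name, value = command
        self.state[name] = value
        return self.state[name]


def replay(commands, n_replicas=3):
    """Feed the same command sequence, in the same order, to every replica."""
    replicas = [Replica() for _ in range(n_replicas)]
    for command in commands:
        for replica in replicas:
            replica.apply(command)
    return [replica.state for replica in replicas]


# Same commands, same order: every replica ends in the same state.
same_order = replay([("x", 1), ("x", 2)])

# A replica that sees the commands in the opposite order diverges.
diverged = Replica()
for command in [("x", 2), ("x", 1)]:
    diverged.apply(command)
```

Here every replica in `same_order` ends with x = 2, while `diverged` ends with x = 1, even though it received exactly the same set of commands.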

SLIDE 15

4. Vote on replica outputs

When in trouble, cheat!

Voter and client share fate!
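
Voting on replica outputs might look like the sketch below, written as if the voter runs inside the client (one reading of "share fate"). The threshold is an assumption not stated on the slides: with at most `f` faulty replicas, a value reported by `f + 1` replicas must come from at least one correct replica. The function name `vote` is hypothetical.

```python
from collections import Counter


def vote(replies, f):
    """Return the reply reported by at least f + 1 replicas, else None.

    Assumes at most f replicas are faulty, so any value with f + 1
    matching replies was produced by at least one correct replica.
    Replies must be hashable.
    """
    if not replies:
        return None
    value, count = Counter(replies).most_common(1)[0]
    return value if count >= f + 1 else None
```

For example, with three replicas and `f = 1`, the replies `[2, 2, 7]` yield `2`, while `[1, 2, 3]` yields `None` because no value has two matching replies.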

SLIDE 16

ADMINISTRIVIA

Send me your paper preferences by tonight
Send me your group declaration preferences by Oct 1
Homework #2 will be sent out later today
    due Monday, Oct 12, before class
Implementation project will be out next Monday
    due Monday, October 26, by end of day
Research project topics due next Thursday, 10/08

SLIDE 17

PRIMARY-BACKUP

SLIDE 18

THE MODEL

Failure model: crash
Network model: synchrony
    All messages are delivered within a bounded time
    Reliable, FIFO channels
Tolerates crash failures

SLIDE 19

THE IDEA

Clients communicate with a single replica (primary)
Primary:
    sequences and processes clients' requests
    updates other replicas (backups)
Backups use timeouts to detect failure of primary
On primary failure, a backup becomes the new primary
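
Timeout-based detection can be sketched as follows. The slides do not say what the backup waits for, so this sketch assumes the primary sends a periodic heartbeat; `heartbeat_interval`, `delta` (the bound on message delay), and the `Backup` class are all hypothetical names. Time is passed in explicitly so the logic stays deterministic.

```python
class Backup:
    """Backup that promotes itself when the primary stays silent too long."""

    def __init__(self, heartbeat_interval, delta, now=0.0):
        # Heartbeats sent every `heartbeat_interval` and delayed by at most
        # `delta` arrive at most `heartbeat_interval + delta` apart while
        # the primary is alive and channels are reliable.
        self.timeout = heartbeat_interval + delta
        self.last_heard = now
        self.is_primary = False

    def on_heartbeat(self, now):
        """Record the arrival time of a heartbeat from the primary."""
        self.last_heard = now

    def tick(self, now):
        """Check the timeout; return True once this backup has taken over."""
        if not self.is_primary and now - self.last_heard > self.timeout:
            self.is_primary = True
        return self.is_primary
```

Under the synchrony and reliable-channel assumptions, a silence longer than the timeout can only mean the primary has crashed, which is why a timeout is a safe detector in this model.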

SLIDE 20

A SIMPLE PRIMARY-BACKUP PROTOCOL

[Figure labels: request, reply, sync, new primary]

Passive replication: sync = state update
Active replication: sync = client request(s)
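
The difference between the two kinds of sync message can be sketched as below. This is an illustration under assumptions: requests are modeled as `(key, value)` writes, and `execute`, `passive_sync`, `active_sync`, and `backup_apply` are hypothetical helpers, not part of the protocol on the slides.

```python
def execute(state, request):
    """Deterministic request handler; a request is a (key, value) write."""
    key, value = request
    new_state = dict(state)
    new_state[key] = value
    return new_state


def passive_sync(primary_state, request):
    """Primary executes the request, then ships the resulting state."""
    new_state = execute(primary_state, request)
    return new_state, ("state", new_state)


def active_sync(primary_state, request):
    """Primary executes the request, then forwards the request itself."""
    new_state = execute(primary_state, request)
    return new_state, ("request", request)


def backup_apply(backup_state, sync_message):
    """Backup either adopts the shipped state or re-executes the request."""
    kind, payload = sync_message
    if kind == "state":
        return dict(payload)
    return execute(backup_state, payload)
```

Both variants leave the backup in the primary's state. Active replication relies on `execute` being deterministic, since the backup recomputes the result; passive replication does not re-execute anything at the backup.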

SLIDE 21

WEAKENING THE MODEL

Failure model: crash
Network model: synchrony
    Unreliable, FIFO channels
    Channels may drop messages
    All messages are delivered within a bounded time (looks paradoxical)
Tolerates crash failures

SLIDE 22

A SLIGHTLY DIFFERENT PRIMARY-BACKUP PROTOCOL

[Figure labels: request, reply, sync, new primary, ack]

SLIDES 23-30

GENERALIZING TO MORE BACKUPS

[Figure sequence: primary and backups. Labels, in order of appearance: update; (active updates); (passive updates); ack (five times)]

SLIDES 31-35

HANDLING QUERIES

[Figure sequence: primary and backups; label: query]

However…

[Figure sequence: primary and backups; labels: query, ack (five times)]

The primary cannot respond until it has received all acks for prior updates
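
The rule that a query waits for the acks of all prior updates can be sketched as follows. Everything here is an assumption made for illustration: the `Primary` class, the integer update ids, and the choice to snapshot the answer when the query arrives are not specified on the slides.

```python
class Primary:
    """Primary that holds query replies until prior updates are fully acked."""

    def __init__(self, backups):
        self.backups = list(backups)
        self.state = {}
        self.pending = {}   # update id -> set of backups that have not acked
        self.waiting = []   # (barrier, key, value) for deferred queries
        self.next_id = 0

    def update(self, key, value):
        """Apply an update locally and start waiting for every backup's ack."""
        self.state[key] = value
        update_id = self.next_id
        self.next_id += 1
        self.pending[update_id] = set(self.backups)
        return update_id

    def ack(self, update_id, backup):
        """Record an ack; return any query replies that are now releasable."""
        missing = self.pending.get(update_id)
        if missing is not None:
            missing.discard(backup)
            if not missing:
                del self.pending[update_id]
        return self._release()

    def query(self, key):
        """Snapshot the answer; updates with id < barrier are 'prior'."""
        self.waiting.append((self.next_id, key, self.state.get(key)))
        return self._release()

    def _release(self):
        ready, still_waiting = [], []
        for barrier, key, value in self.waiting:
            if any(uid < barrier for uid in self.pending):
                still_waiting.append((barrier, key, value))
            else:
                ready.append((key, value))
        self.waiting = still_waiting
        return ready
```

With two backups, a query issued right after an update returns nothing until both acks arrive; a later query with no outstanding updates is answered immediately.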