EECS 591 Distributed Systems, Manos Kapritsos, Fall 2020: State Machine Replication



slide-1
SLIDE 1

EECS 591 DISTRIBUTED SYSTEMS

Manos Kapritsos Fall 2020

slide-2
SLIDE 2

STATE MACHINE REPLICATION

slide-3
SLIDE 3

MODELING FAULTS

Mean Time To Failure/Mean Time To Recover

  • used mostly for disks
  • of questionable value in expressing reliability

Threshold: f out of n

  • makes condition for correct operation explicit
  • measures fault-tolerance of the architecture, not of individual components

Enumerate failure scenarios

slide-4
SLIDE 4

A HIERARCHY OF FAILURE MODELS

Crash
Fail-stop
Send omission
Receive omission
General omission
Arbitrary (Byzantine) failures

Benign failures = every model above except arbitrary (Byzantine) failures

slide-5
SLIDE 5

A HIERARCHY OF FAILURE MODELS

crash

slide-6
SLIDE 6

FAULT TOLERANCE: THE PROBLEM

[Diagram: clients and a single server]

Solution: replicate the server

slide-7
SLIDE 7

REPLICATION IN TIME

When a server fails, restart it or replace it
Failures are detected, not masked
Lower maintenance, lower availability
Tolerates only benign failures

slide-8
SLIDE 8

REPLICATION IN SPACE

Run multiple copies of a server (replicas)
Vote on replica output
Failures are masked
High availability and can tolerate arbitrary failures, but at high cost

slide-9
SLIDE 9

THE ENEMY: NON-DETERMINISM

An event is non-deterministic if its output is not uniquely determined by its input

The problem with non-determinism:

  • Replication in time: must reproduce the original outcome of all non-deterministic events
  • Replication in space: each replica must handle non-deterministic events identically
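
One common way to meet the replication-in-time requirement is to log the outcome of each non-deterministic event and feed the logged value back in on replay. The sketch below illustrates that idea only; `handle`, `log`, and the use of a random number as the non-deterministic event are assumptions made for illustration, not something taken from the slides.

```python
import random


def handle(request, nondet):
    """Deterministic handler: the non-deterministic value is an explicit input."""
    return f"{request}-{nondet}"


# Original execution: draw the non-deterministic value once and log it.
log = []
value = random.randint(0, 999)
log.append(value)
original = handle("req", value)

# Replay after a restart: reuse the logged value instead of drawing a new one.
replayed = handle("req", log[0])

print(original == replayed)  # True
```

Because the random draw is passed in as an argument, the handler's output is uniquely determined by its inputs, which is the property the definition above asks for.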

slide-10
SLIDE 10

THE SOLUTION: STATE MACHINES

Design the server as a deterministic state machine

[Diagram: state machine; labels 1, 2, 3, 4 and a, b, c, d, e, f]

slide-11
SLIDE 11

THE SOLUTION: STATE MACHINES

State machine example: a switch

[Diagram: two states, off and on; each transition is labeled click]
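
The switch can be written down as a transition table. This is a minimal sketch; the class name `Switch`, the method `apply`, and the choice of "off" as the initial state are assumptions made for illustration.

```python
class Switch:
    """Deterministic state machine with two states, 'off' and 'on'."""

    # Transition table: (current state, input) -> next state.
    TRANSITIONS = {
        ("off", "click"): "on",
        ("on", "click"): "off",
    }

    def __init__(self):
        self.state = "off"  # assumed initial state

    def apply(self, command):
        # The next state depends only on the current state and the input.
        self.state = self.TRANSITIONS[(self.state, command)]
        return self.state


switch = Switch()
trace = [switch.apply("click") for _ in range(3)]
print(trace)  # ['on', 'off', 'on']
```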

slide-12
SLIDE 12

STATE MACHINE REPLICATION

Ingredients: a server

  • 1. Make server deterministic (state machine)
  • 2. Replicate server
  • 3. Ensure that all replicas go through the same sequence of state transitions
  • 4. Vote on replica outputs

[Diagram labels: x = 1, x = 2]

slide-13
SLIDE 13

STATE MACHINE REPLICATION

Ingredients: a server

  • 1. Make server deterministic (state machine)
  • 2. Replicate server
  • 3. Ensure that all replicas go through the same sequence of state transitions
  • 4. Vote on replica outputs

[Diagram labels: x = 1, x = 2]

All state machines receive all commands in the same order
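
To see why the order matters, consider a hypothetical state machine with a single integer `x` and two commands, `set` and `add`. The sketch below is an illustration under those assumptions (the `Register` class and the command names are not from the slides): replicas that apply the same commands in the same order agree, while a replica that sees them in a different order ends up in a different state.

```python
class Register:
    """Deterministic state machine holding a single integer x (illustrative)."""

    def __init__(self):
        self.x = 0

    def apply(self, command):
        op, value = command
        if op == "set":
            self.x = value
        elif op == "add":
            self.x += value
        else:
            raise ValueError(f"unknown command: {op}")


def run(commands):
    """Start a fresh replica, apply the commands in order, return its final x."""
    replica = Register()
    for command in commands:
        replica.apply(command)
    return replica.x


commands = [("set", 1), ("add", 1)]

same_order = [run(commands) for _ in range(3)]  # three replicas, same order
reordered = run(list(reversed(commands)))       # one replica, different order

print(same_order)  # [2, 2, 2]
print(reordered)   # 1
```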

slide-14
SLIDE 14

STATE MACHINE REPLICATION

Ingredients: a server

  • 1. Make server deterministic (state machine)
  • 2. Replicate server
  • 3. Ensure that all replicas go through the same sequence of state transitions
  • 4. Vote on replica outputs
slide-15
SLIDE 15

When in trouble, cheat!

Voter and client share fate!

  • 4. Vote on replica outputs
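
One reading of this slide is that the voter runs inside the client, so a voter failure is indistinguishable from a client failure. The sketch below shows what such a client-side voter might look like. The function name `vote` and the strict-majority rule are assumptions; the slides do not say how many matching outputs are required.

```python
from collections import Counter


def vote(outputs):
    """Return the output reported by a strict majority of replicas, else None.

    Assumes `outputs` is a non-empty list of hashable replica outputs.
    """
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


print(vote([2, 2, 1]))  # 2
print(vote([1, 2, 3]))  # None
```
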
slide-16
SLIDE 16

ADMINISTRIVIA

Send me your paper preferences by tonight
Send me your group declaration preferences by Oct 1
Homework #2 will be sent out later today

  • due Monday, Oct 12, before class

Implementation project will be out next Monday

  • due Monday, October 26, by end of day

Research project topics due next Thursday, 10/08

slide-17
SLIDE 17

PRIMARY-BACKUP

slide-18
SLIDE 18

THE MODEL

Failure model: crash
Network model: synchrony

  • All messages are delivered within a bounded time
  • Reliable, FIFO channels

Tolerates crash failures

slide-19
SLIDE 19

THE IDEA

Clients communicate with a single replica (primary)
Primary:

  • sequences and processes clients’ requests
  • updates other replicas (backups)

Backups use timeouts to detect failure of primary
On primary failure, a backup becomes the new primary

slide-20
SLIDE 20

A SIMPLE PRIMARY-BACKUP PROTOCOL

[Diagram labels: request, reply, sync, new primary]

Passive replication: sync = state update
Active replication: sync = client request(s)
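
The difference between the two sync styles can be sketched with a toy key-value server. Everything named here (`KVStore`, `passive_sync`, `active_sync`, copying the whole state dictionary) is an assumption made for illustration; the slides only state what the sync message carries.

```python
class KVStore:
    """Toy deterministic server: a key-value map (illustrative)."""

    def __init__(self):
        self.data = {}

    def apply(self, request):
        key, value = request
        self.data[key] = value
        return "ok"


def passive_sync(primary, backup, request):
    """Passive replication: the primary executes, then ships its state."""
    reply = primary.apply(request)
    backup.data = dict(primary.data)  # sync = state update
    return reply


def active_sync(primary, backup, request):
    """Active replication: the primary forwards the request to be re-executed."""
    reply = primary.apply(request)
    backup.apply(request)  # sync = client request
    return reply


p1, b1 = KVStore(), KVStore()
passive_sync(p1, b1, ("x", 1))

p2, b2 = KVStore(), KVStore()
active_sync(p2, b2, ("x", 1))

print(b1.data == p1.data, b2.data == p2.data)  # True True
```

In the active variant the backup only stays in step with the primary if the server is deterministic, which is why the state-machine requirement from the earlier slides matters here.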

slide-21
SLIDE 21

WEAKENING THE MODEL

Failure model: crash
Network model: synchrony

  • Unreliable, FIFO channels
  • Channels may drop messages
  • All messages are delivered within a bounded time (looks paradoxical)

Tolerates crash failures

slide-22
SLIDE 22

A SLIGHTLY DIFFERENT PRIMARY-BACKUP PROTOCOL

[Diagram labels: request, reply, sync, new primary, ack]
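
With channels that may drop messages, an ack lets the primary learn that the backup actually received the sync before it replies to the client. The sketch below is one possible way to use acks, assuming the primary resends the sync until an ack arrives and the backup applies each sequence number idempotently. The retransmission policy, `LossyChannel`, `handle_request`, and `max_attempts` are all hypothetical and are not taken from the slides.

```python
class LossyChannel:
    """Test double: drops messages following a fixed pattern (True = drop)."""

    def __init__(self, drop_pattern):
        self.drop_pattern = list(drop_pattern)

    def deliver(self):
        # Once the pattern is exhausted, every message gets through.
        dropped = self.drop_pattern.pop(0) if self.drop_pattern else False
        return not dropped


def handle_request(seq, request, backup_log, channel, max_attempts=5):
    """Resend the sync until the backup's ack arrives, then reply."""
    for attempt in range(1, max_attempts + 1):
        if channel.deliver():              # sync reached the backup
            backup_log[seq] = request      # idempotent: keyed by sequence number
            if channel.deliver():          # ack reached the primary
                return ("reply", attempt)
    return ("no-ack", max_attempts)


backup_log = {}
# Attempt 1: sync dropped. Attempt 2: sync arrives, ack dropped.
# Attempt 3: sync and ack both arrive.
channel = LossyChannel([True, False, True, False, False])
result = handle_request(0, ("x", 1), backup_log, channel)

print(result)      # ('reply', 3)
print(backup_log)  # {0: ('x', 1)}
```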

slide-23
SLIDE 23

GENERALIZING TO MORE BACKUPS

Primary backups

slide-24
SLIDE 24

GENERALIZING TO MORE BACKUPS

Primary backups

update

slide-25
SLIDE 25

GENERALIZING TO MORE BACKUPS

Primary backups

update

slide-26
SLIDE 26

GENERALIZING TO MORE BACKUPS

(active updates) Primary backups

slide-27
SLIDE 27

GENERALIZING TO MORE BACKUPS

(passive updates) Primary backups

slide-28
SLIDE 28

GENERALIZING TO MORE BACKUPS

(passive updates) Primary backups

slide-29
SLIDE 29

GENERALIZING TO MORE BACKUPS

ack ack ack ack

Primary backups

ack

slide-30
SLIDE 30

GENERALIZING TO MORE BACKUPS

Primary backups

reply

slide-31
SLIDE 31

HANDLING QUERIES

Primary backups

query

slide-32
SLIDE 32

HANDLING QUERIES

Primary backups

slide-33
SLIDE 33

HANDLING QUERIES

Primary backups

reply

However…

slide-34
SLIDE 34

HANDLING QUERIES

Primary backups

query

slide-35
SLIDE 35

HANDLING QUERIES

ack ack ack ack

Primary backups

The primary cannot respond until it has received all acks for prior updates

query ack
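
The rule on this slide can be sketched as a primary that tracks outstanding acks and defers queries while any are missing. This is a simplified illustration: the `Primary` class, its method names, and the per-update ack counter are assumptions, and the sketch treats every unacknowledged update as "prior" (it does not handle updates that arrive after a deferred query). A return value of `None` from `query` means "deferred" here.

```python
class Primary:
    """Primary that answers queries only when no acks are outstanding."""

    def __init__(self, num_backups):
        self.num_backups = num_backups
        self.state = {}
        self.pending = {}    # update sequence number -> acks still missing
        self.next_seq = 0
        self.deferred = []   # keys of queries waiting for outstanding acks

    def update(self, key, value):
        seq = self.next_seq
        self.next_seq += 1
        self.state[key] = value
        self.pending[seq] = self.num_backups
        return seq

    def ack(self, seq):
        """Record one backup's ack; return replies to any queries it unblocks."""
        self.pending[seq] -= 1
        if self.pending[seq] == 0:
            del self.pending[seq]
        if self.pending:
            return []
        replies = [self.state.get(key) for key in self.deferred]
        self.deferred = []
        return replies

    def query(self, key):
        if self.pending:
            self.deferred.append(key)  # cannot respond yet
            return None
        return self.state.get(key)


primary = Primary(num_backups=2)
seq = primary.update("x", 1)
first = primary.query("x")       # deferred: two acks still missing
after_one = primary.ack(seq)     # one ack still missing
after_two = primary.ack(seq)     # all acks in, deferred query is answered

print(first, after_one, after_two)  # None [] [1]
```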