EECS 591 DISTRIBUTED SYSTEMS
Manos Kapritsos Fall 2020
STATE MACHINE REPLICATION
MODELING FAULTS
Mean Time To Failure/Mean Time To Recover
  used mostly for disks
  of questionable value in expressing reliability
Threshold: f out of n
  makes the condition for correct operation explicit
  measures fault-tolerance of the architecture, not of individual components
Enumerate failure scenarios
  Crash
  Fail-stop
  Send omission
  Receive omission
  General omission
  Arbitrary (Byzantine) failures
All of the above except arbitrary failures = benign failures
[Figure: clients send requests to a single server; the server crashes]
Solution: replicate the server
Replication in time:
  When a server fails, restart it or replace it
  Failures are detected, not masked
  Lower maintenance, lower availability
  Tolerates only benign failures
Replication in space:
  Run multiple copies of a server (replicas)
  Vote on replica output
  Failures are masked
  High availability and can tolerate arbitrary failures, but at high cost
An event is non-deterministic if its output is not uniquely determined by its input
The problem with non-determinism:
  Replication in time: must reproduce the original outcome of non-deterministic events
  Replication in space: each replica must handle non-deterministic events identically
Design the server as a deterministic state machine
[Figure: state machine with states 1-4 and transitions a-f]
State machine example: a switch
[Figure: a switch with two states; each click moves it to the other state]
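The switch can be sketched as a tiny deterministic state machine. This is only an illustration: the class name `Switch` and the state names "on" and "off" are assumptions, since the slide shows only the click transitions.

```python
class Switch:
    """A two-state deterministic state machine: each click flips the state."""

    def __init__(self, state="off"):
        self.state = state

    def click(self):
        # The next state depends only on the current state and the input,
        # so the same sequence of clicks always yields the same outputs.
        self.state = "on" if self.state == "off" else "off"
        return self.state


switch = Switch()
outputs = [switch.click() for _ in range(3)]
print(outputs)  # ['on', 'off', 'on']
```

Because `click` reads nothing but the current state, two switches that start in the same state and see the same number of clicks end in the same state.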
Ingredients: a server
sequence of state transitions
[Figure: state transitions x = 1, then x = 2]
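A minimal sketch of why the sequence of state transitions matters, assuming a hypothetical `Replica` class whose commands are (key, value) assignments such as x = 1 and x = 2. Replicas that start in the same state and apply the same commands in the same order end in the same state; a replica that sees a different order may not.

```python
class Replica:
    """A deterministic replica whose commands are (key, value) assignments."""

    def __init__(self):
        self.state = {}  # every replica starts in the same (empty) state

    def apply(self, command):
        key, value = command
        self.state[key] = value
        return value


log = [("x", 1), ("x", 2)]  # the agreed sequence of state transitions

replicas = [Replica() for _ in range(3)]
for replica in replicas:
    for command in log:
        replica.apply(command)
states = [replica.state for replica in replicas]
print(states)  # [{'x': 2}, {'x': 2}, {'x': 2}]

# A replica that applies the same commands in a different order diverges.
reordered = Replica()
for command in reversed(log):
    reordered.apply(command)
print(reordered.state)  # {'x': 1}
```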
When in trouble, cheat!
Voter and client share fate!
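One reading of this slide is that the voter runs inside the client, so there is no separate voter whose failure would matter to anyone but that client. The sketch below assumes this reading. The function `vote` and the parameter `f` (the maximum number of faulty replicas) are hypothetical names; the rule it uses is that f + 1 matching replies must include at least one reply from a correct replica.

```python
from collections import Counter


def vote(replies, f):
    """Return the value reported by at least f + 1 replicas, else None.

    Assumes at most f replicas are faulty, so any value backed by
    f + 1 replies was produced by at least one correct replica.
    """
    if not replies:
        return None
    value, count = Counter(replies).most_common(1)[0]
    return value if count >= f + 1 else None


agreed = vote([2, 2, 7], f=1)
undecided = vote([2, 7, 9], f=1)
print(agreed)     # 2
print(undecided)  # None
```

Returning `None` rather than guessing lets the client keep waiting for more replies when no value has enough support yet.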
Send me your paper preferences by tonight
Send me your group declaration preferences by Oct 1
Homework #2 will be sent out later today
  due Monday, Oct 12, before class
Implementation project will be out next Monday
  due Monday, October 26, by end of day
Research project topics due next Thursday, 10/08
Failure model: crash
Network model: synchrony
  All messages are delivered within a bounded time
  Reliable, FIFO channels
Tolerates crash failures
Clients communicate with a single replica (the primary)
Primary:
  sequences and processes clients' requests
  updates other replicas (backups)
Backups use timeouts to detect failure of the primary
On primary failure, a backup becomes the new primary
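The timeout-based detection above can be sketched with simulated time. Everything here is an assumption made for illustration: the class `Backup`, its `timeout` parameter, and the idea that the backup records the time of the last message it received from the primary. With synchronous, reliable channels, silence longer than the timeout can only mean the primary has crashed.

```python
class Backup:
    """A backup that suspects the primary after a period of silence."""

    def __init__(self, timeout):
        self.timeout = timeout    # must exceed the message delay bound
        self.last_heard = 0.0     # time of the last message from the primary
        self.is_primary = False

    def on_message(self, now):
        self.last_heard = now

    def check(self, now):
        # Under synchrony, a missed deadline implies the primary crashed.
        if now - self.last_heard > self.timeout:
            self.is_primary = True
        return self.is_primary


backup = Backup(timeout=3.0)
backup.on_message(now=1.0)
before = backup.check(now=2.0)
after = backup.check(now=5.0)
print(before, after)  # False True
```

With several backups, they would also need a rule for which one takes over; this sketch leaves that out.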
[Figure: message flow labeled request, reply, sync, and new primary]
Passive replication: sync = state update
Active replication: sync = client request(s)
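The difference between the two sync messages can be sketched as follows. The request format (a key and an increment) and all function names are assumptions chosen for illustration. In both modes the backup ends in the primary's state, but only active replication requires the backup to execute the request itself, which is why the transition must be deterministic.

```python
def apply_request(state, request):
    """Deterministic transition: a request is a (key, delta) increment."""
    key, delta = request
    new_state = dict(state)
    new_state[key] = new_state.get(key, 0) + delta
    return new_state


def passive_sync(primary_state, request):
    # The primary executes the request; sync carries the resulting state.
    new_state = apply_request(primary_state, request)
    return new_state, ("state", new_state)


def active_sync(primary_state, request):
    # The primary executes the request; sync carries the request itself.
    new_state = apply_request(primary_state, request)
    return new_state, ("request", request)


def backup_handle(backup_state, sync):
    kind, payload = sync
    if kind == "state":
        return dict(payload)                     # install the shipped state
    return apply_request(backup_state, payload)  # re-execute the request


start = {"x": 1}
request = ("x", 2)
primary_p, sync_p = passive_sync(start, request)
primary_a, sync_a = active_sync(start, request)
backup_p = backup_handle(start, sync_p)
backup_a = backup_handle(start, sync_a)
print(backup_p, backup_a)  # {'x': 3} {'x': 3}
```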
Failure model: crash
Network model: synchrony
  Unreliable, FIFO channels
  Channels may drop messages
  All messages are delivered within a bounded time (looks paradoxical)
Tolerates crash failures
[Figure: message flow labeled request, reply, sync, new primary, and ack]
[Figure sequence: the primary sends an update to the backups (active updates or passive updates); the backups send acks; the primary sends the reply. A later query to the primary is answered with a reply.]
However…
[Figure: a query reaches the primary while acks from the backups are still outstanding]
The primary cannot respond until it has received all acks for prior updates
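The rule that the primary cannot respond until it has all acks for prior updates can be sketched like this. The class `Primary` and its method names are hypothetical, and message delivery is modeled as direct method calls. Each query remembers which updates were unacknowledged when it arrived and is released only once all of them are fully acked.

```python
class Primary:
    """Holds query replies until all prior updates are acked by every backup."""

    def __init__(self, backups):
        self.backups = set(backups)
        self.state = {}
        self.pending = {}  # update id -> backups that have not acked yet
        self.held = []     # (key, ids of updates the query must wait for)
        self.next_id = 0

    def update(self, key, value):
        self.state[key] = value
        uid = self.next_id
        self.next_id += 1
        self.pending[uid] = set(self.backups)
        return uid

    def ack(self, uid, backup):
        waiting = self.pending.get(uid)
        if waiting is not None:
            waiting.discard(backup)
            if not waiting:
                del self.pending[uid]  # every backup has acked this update
        return self._release()

    def query(self, key):
        # Remember the updates that were still unacked when the query arrived.
        self.held.append((key, set(self.pending)))
        return self._release()

    def _release(self):
        ready, still_held = [], []
        for key, prior in self.held:
            if prior.intersection(self.pending):
                still_held.append((key, prior))
            else:
                ready.append((key, self.state.get(key)))
        self.held = still_held
        return ready


primary = Primary(["b1", "b2"])
uid = primary.update("x", 1)
early = primary.query("x")       # held: no acks yet
partial = primary.ack(uid, "b1")  # still held: b2 has not acked
late = primary.ack(uid, "b2")     # released
print(early, partial, late)  # [] [] [('x', 1)]
```

Holding the reply means a client never observes an update that a backup taking over as the new primary might not have received.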