Fault-Tolerance, Fast and Slow: Exploiting Failure Asynchrony in Distributed Systems
Ramnatthan Alagappan, Aishwarya Ganesan, Jing Liu, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau
OSDI ‘18
Replication protocols (Viewstamped Replication, Raft, Paxos) are the foundation upon which datacenter systems such as GFS, Colossus, and BigTable are built.
How and where to store system state?

World-1: Disk-durable (synchronously persist updates to disks)
- Paxos, Raft [ATC ‘14], ZAB [DSN ‘11], Gaios [NSDI ‘11], ZooKeeper, etcd, LogCabin, ...
- safe, but suffers from poor performance

World-2: Memory-durable (buffer updates only in volatile memory)
- Viewstamped Replication, NOPaxos [OSDI ‘16], SpecPaxos [NSDI ‘15], ...
- performant, but risks unsafety or unavailability

Neither approach is ideal: each is either reliable or performant, not both.
SAUCR: situation-aware updates and crash recovery
- with many or all nodes up, buffer in memory (fast mode)
- with failures, if only a bare majority is up, flush to disk (slow mode)
SAUCR's effectiveness depends upon the simultaneity of failures:
- independent and non-simultaneous correlated failures (gap of a few milliseconds to a few seconds): SAUCR can react and switch from fast to slow mode, preserving durability and availability
- many truly simultaneous correlated failures: no gap, so SAUCR cannot react and remains unavailable
- however, existing data hints that truly simultaneous failures are extremely rare (the Non-Simultaneity Conjecture)
Implemented in ZooKeeper. SAUCR improves reliability compared to memory-durable systems:
- durable and available in 100s of crash scenarios in which memory-durable systems lose data or become unavailable

The improvements come at little or no cost. Compared to disk-durable systems:
- slight reduction in availability in extremely rare cases
- improved performance: 2.5x on SSDs, 100x on HDDs
- Introduction
- Distributed updates and crash recovery
  - disk-durable protocols
  - memory-durable protocols
- Situation-aware updates and crash recovery
- Results
- Summary and conclusion
Disk-durable updates and crash recovery

Update path (3-node example, leader plus two followers; the client writes A=2):
- the leader and each follower append the update and fsync it to disk
- once the fsyncs complete on a majority, the leader acknowledges the update to the client as committed

Crash recovery:
- a recovered node just reads its state from its local disk and is immediately ready
- this is safe: if the system ack'd anyone, the data is on disk
- a node that missed updates rejoins as lagging and catches up

Safe and available, but poor performance due to fsync: 50x slower on HDDs, 2.5x on SSDs.
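The update path just described can be sketched in a few lines of Python; this is an illustrative model, not ZooKeeper's actual code (the log layout and helper names are ours). The fsync before acknowledgment is both what makes recovery from the local disk safe and what makes the protocol slow:

```python
import os

def append_and_fsync(log_path, entry):
    """Append an entry and fsync so it survives a crash before the
    update is acknowledged (the expensive step)."""
    with open(log_path, "ab") as f:
        f.write(entry + b"\n")
        f.flush()
        os.fsync(f.fileno())

def commit(replica_logs, entry):
    """An update is committed only once a majority of replicas have
    persisted it to disk."""
    acks = 0
    for log in replica_logs:          # leader + followers
        append_and_fsync(log, entry)
        acks += 1
    return acks > len(replica_logs) // 2

def recover(log_path):
    """Disk-durable recovery: just read the local disk; the node is
    immediately ready."""
    with open(log_path, "rb") as f:
        return f.read().splitlines()
```

In this sketch every replica persists the entry before the check; a real protocol acknowledges as soon as any majority has fsync'd.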
Memory-durable updates: oblivious crash recovery (e.g., ZooKeeper with forceSync = false; practitioners do use this configuration)

Update path (the client writes A=2):
- the leader and followers buffer the update only in volatile memory
- once the update is buffered on a majority, the leader acknowledges it as committed

Crash recovery:
- a crashed node loses its in-memory data but is oblivious: it does not realize the loss and immediately rejoins as ready

Performant, but can lead to data loss.
How oblivious recovery loses data (5-node cluster):
- A=1 is committed after being buffered on three nodes; the other two nodes are slow or failed and never receive it
- one of the three nodes crashes, then recovers; it loses its data but, being oblivious, immediately rejoins
- the two lagging nodes and the recovered node now form a majority in which no node knows the committed update, so the committed update is lost
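The loss above is just quorum arithmetic and can be replayed as a tiny simulation (the node names and set-based state are our invention, assuming a 5-node memory-durable cluster with oblivious recovery):

```python
# Five nodes; A=1 was committed after being buffered on {n1, n2, n3}.
# n4 and n5 are slow/failed and never saw it.
nodes = {"n%d" % i: set() for i in range(1, 6)}
for n in ("n1", "n2", "n3"):
    nodes[n].add("A=1")          # held in volatile memory only

# n1 crashes: its in-memory state is gone. Oblivious recovery rejoins it
# immediately with empty state instead of repairing from peers.
nodes["n1"] = set()

# n1, n4, n5 now form a majority in which nobody has A=1, so a leader
# elected from this majority silently drops the committed update.
majority = ["n1", "n4", "n5"]
survivors_with_update = [n for n in majority if "A=1" in nodes[n]]
assert len(majority) > len(nodes) // 2   # a valid majority...
assert survivors_with_update == []       # ...yet the committed update is lost
```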
Memory-durable updates: loss-aware crash recovery (e.g., Viewstamped Replication)

Update path: same as before; an update is committed once it is buffered in memory on a majority.

Crash recovery:
- a recovering node realizes it may have lost data; it waits for responses from a majority of other nodes and rebuilds its state from them before becoming ready

Avoids data loss (unlike oblivious recovery), but can lead to unavailability.
How loss-aware recovery becomes unavailable (5-node cluster):
- A=1 is committed; two nodes crash
- a third node crashes and then recovers; it cannot collect majority responses, so it is stuck, although a majority of nodes are up
- even when the two failed nodes recover, they too stay in the recovering state: the cluster remains unavailable even after all nodes recover
- Introduction
- Distributed updates and crash recovery
- Situation-aware updates and crash recovery
  - SAUCR insights, guarantees, and overview
  - situation-aware updates
  - situation-aware crash recovery
- Results
- Summary and conclusion
Existing protocols are static in nature: they do not adapt to failures.
- Memory-durable: buffers even with many failures; always poor reliability
- Disk-durable: persists even when there are no failures; always poor performance

Insight: reacting to failures and adapting to the situation can achieve both reliability and performance.
- In the common case, when many or all nodes are up: buffer in memory (like memory-durable)
- When failures arise and only the minimum number of nodes are up: flush to disk (like disk-durable)
With non-simultaneous failures, a gap exists, so SAUCR can react and ensure durability:
- independent failures: the likelihood of many nodes failing together is negligible
- correlated failures: many nodes fail together; but although many nodes fail, the failures are not necessarily simultaneous, and in most cases they are non-simultaneous

With simultaneous correlated failures, there is no gap; SAUCR cannot react and becomes unavailable. We conjecture these are extremely rare, i.e., a gap exists between failures:
- correlated failures are often a few seconds apart [Ford et al., OSDI ‘10]
- our analysis reveals a gap of 50 ms or more almost always

Most cases (any number of independent or non-simultaneous correlated failures): same guarantees as disk-durable. Rare cases (more than a majority crash truly simultaneously): remain unavailable.
Updates:
- when more than a majority of nodes are up, buffer updates in memory (fast mode), e.g., 4 or 5 nodes up in a 5-node cluster
- when nodes fail and only a bare majority is alive, flush to disk (slow mode), e.g., only 3 nodes up in a 5-node cluster

Crash recovery:
- when a node recovers from a crash, it recovers its data either from its disk (if it crashed in slow mode) or from the other nodes (if it crashed in fast mode)
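The update rule above boils down to a threshold on the number of live nodes; a minimal sketch (the function name and return values are ours, not SAUCR's):

```python
def choose_mode(alive, cluster_size):
    """SAUCR's rule: run in fast mode (buffer in memory) only while
    MORE than a bare majority is alive; once failures leave just a
    bare majority, switch to slow mode (flush to disk)."""
    bare_majority = cluster_size // 2 + 1
    if alive < bare_majority:
        return "unavailable"     # cannot commit at all
    if alive == bare_majority:
        return "slow"            # bare majority: flush to disk
    return "fast"                # more than a bare majority: buffer in memory

# 5-node cluster, as in the examples above:
assert choose_mode(5, 5) == "fast"   # all nodes up
assert choose_mode(4, 5) == "fast"   # more than a majority up
assert choose_mode(3, 5) == "slow"   # only a bare majority up
assert choose_mode(2, 5) == "unavailable"
```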
Mode switching in action (5-node cluster, L marks the leader):
- all nodes up: fast mode, buffer updates
- one node fails, 4 nodes up (more than a majority): remain in fast mode
- another node fails, only a bare majority up: switch to slow mode, flush to disk, and commit subsequent updates in slow mode
- a failed node recovers and catches up (more than a majority up again): switch back to fast mode
Reacting to failures:
- basic failure-detection mechanism: heartbeats
- follower failures: while more than a majority remain, stay in fast mode; once only a bare majority is left, switch to slow mode
- leader failures: the leader steps down and the followers flush to disk
- challenges: too many packets, spurious elections, too much data to flush; techniques in the paper ...
- result: SAUCR can react to failures even when they are only a few milliseconds apart, preserving durability and availability
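A hedged sketch of the detection side, assuming a simple per-follower heartbeat table (the timeout value and data structure are illustrative, not SAUCR's actual parameters):

```python
import time

def alive_count(last_heartbeat, now, timeout):
    """Count followers whose last heartbeat arrived within the timeout
    window; the leader adds itself and uses the total to decide whether
    to stay in fast mode or switch to slow mode."""
    live_followers = sum(1 for t in last_heartbeat.values()
                         if now - t <= timeout)
    return live_followers + 1  # +1 for the leader itself

# 5-node cluster: followers n2, n3 are fresh; n4, n5 silent for 200 ms.
now = time.monotonic()
last_hb = {"n2": now, "n3": now, "n4": now - 0.2, "n5": now - 0.2}
alive = alive_count(last_hb, now, timeout=0.05)  # 50 ms window
assert alive == 3  # only a bare majority left: flush to disk
```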
Crash recovery:
- disk-durable: always recover from the local disk
- memory-durable (loss-aware): always recover from other nodes
- SAUCR: if the node crashed in slow mode, it recovers from its local disk and is ready immediately; if it crashed in fast mode, its buffered updates are lost, so it recovers from other nodes and becomes ready after a bare minority (bare majority - 1) of responses
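SAUCR's recovery dispatch can be sketched as below; this is our illustrative model under the stated rules, with invented function names and a simplistic set-union merge standing in for real log reconciliation:

```python
def saucr_recover(crashed_in_slow_mode, read_local_disk,
                  fetch_from_peers, cluster_size):
    """Slow-mode crashes recover from the local disk immediately;
    fast-mode crashes lost their buffered updates and must rebuild
    from a bare minority (bare majority - 1) of peers."""
    if crashed_in_slow_mode:
        return sorted(read_local_disk())          # immediate, like disk-durable
    bare_minority = (cluster_size // 2 + 1) - 1   # e.g., 2 in a 5-node cluster
    responses = fetch_from_peers(bare_minority)   # block until enough respond
    state = set()
    for r in responses:                           # merge peer logs (illustrative)
        state |= r
    return sorted(state)

# 5-node cluster: a slow-mode crash reads its disk; a fast-mode crash
# merges the entries reported by two peers.
assert saucr_recover(True, lambda: {"A=1", "A=2"}, None, 5) == ["A=1", "A=2"]
assert saucr_recover(False, None,
                     lambda k: [{"A=1"}, {"A=1", "A=2"}][:k], 5) == ["A=1", "A=2"]
```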
Why fast-mode recovery is safe:
- assume update A is committed, and S1 recovers having seen A before its crash
- safety condition: update A must be recovered
- if A was committed in fast mode, then at least one node in any bare minority must contain A
- if A was committed in slow mode, S1 recovers it from its disk
- proof sketch in the paper ...
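The fast-mode condition is quorum arithmetic: for every bare minority to contain at least one holder of the update, a fast-mode commit must reach at least a bare majority plus one node (our inference from the stated condition, checkable exhaustively for small clusters):

```python
from itertools import combinations

def intersects_all_bare_minorities(n, commit_count):
    """Check that every bare-minority subset of an n-node cluster
    contains at least one node from the commit set (without loss of
    generality, nodes 0..commit_count-1 hold the update)."""
    bare_minority = (n // 2 + 1) - 1
    commit_set = set(range(commit_count))
    return all(commit_set & set(group)
               for group in combinations(range(n), bare_minority))

# 5-node cluster: a fast-mode commit on 4 nodes (bare majority + 1) is
# seen by every bare minority of 2, but a commit on only 3 is not.
assert intersects_all_bare_minorities(5, 4) is True
assert intersects_all_bare_minorities(5, 3) is False
```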
- Introduction
- Distributed updates and crash recovery
- Situation-aware updates and crash recovery
- Results
- Summary and conclusion
We implement SAUCR in ZooKeeper and compare SAUCR's reliability and performance against:
- disk-durable ZooKeeper (forceSync = true)
- memory-durable ZooKeeper (forceSync = false)
- viewstamped replication (an idealized model)
Cluster crash-testing framework: generates cluster-state sequences over the five nodes (e.g., 12345, 1234, 1235, 1245, 1345, 2345, 123, 124, 345, ..., 12, 1, 2, 3, 4, 5, 125, 45, 14). How it works: please see our paper ...
Reliability results (non-simultaneous: crashes 50 ms apart; simultaneous: no gap; table of correct/unavailable/data-loss counts per system omitted):
- memory-durable ZooKeeper silently loses data
- viewstamped replication leads to permanent unavailability
- SAUCR reacts to non-simultaneous failures: durable and available
- under simultaneous failures, SAUCR by design remains unavailable in some cases
Performance (throughput in KOps/s on HDDs and SSDs; chart omitted):
- compared to disk-durable ZooKeeper, both memory-durable ZooKeeper and SAUCR are faster: 100x on HDDs, 2.5x on SSDs
- SAUCR's performance matches memory-durable ZooKeeper: within 9% even for write-intensive workloads
Summary:
- Replication protocols are an important foundation; they need to be performant, yet also provide high reliability.
- Dichotomy: disk-durable vs. memory-durable protocols; unsavory choices, either performant or reliable.
- SAUCR, situation-aware updates and crash recovery, provides both high performance and reliability.

Conclusion:
- Paying careful attention to how failures occur, we can find approaches that provide both performance and reliability; more data from real-world deployments?
- The hybrid approach, an effective systems-design technique, is applicable to distributed updates and recovery too; is it worthwhile to look at other important protocols/systems where we make similar two-ends-of-the-spectrum tradeoffs?