Fault-Tolerance, Fast and Slow: Exploiting Failure Asynchrony in Distributed Systems
Ramnatthan Alagappan, Aishwarya Ganesan, Jing Liu, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau
OSDI ‘18
Replication protocols (Viewstamped Replication, Raft, Paxos) are the foundation upon which datacenter systems such as GFS, Colossus, and BigTable are built.
How and where to store system state?

World-1: Disk-durable (synchronously persist updates to disks)
- Paxos, Raft [ATC ‘14], ZAB [DSN ‘11], Gaios [NSDI ‘11], ZooKeeper, etcd, LogCabin, ...
- safe, but suffers from poor performance

World-2: Memory-durable (buffer updates only in volatile memory)
- Viewstamped Replication, NOPaxos [OSDI ‘16], SpecPaxos [NSDI ‘15], ...
- performant, but risks unsafety or unavailability

Neither approach is ideal: each is either reliable or performant, not both.
SAUCR: situation-aware updates and crash recovery
- with many or all nodes up, buffer in memory (fast mode)
- with failures, if only a bare majority is up, flush to disk (slow mode)
SAUCR's effectiveness depends upon the simultaneity of failures:
- independent and non-simultaneous correlated failures (gap of a few milliseconds to a few seconds): SAUCR can react and switch from fast to slow mode, preserving durability and availability
- many truly simultaneous correlated failures: no gap, so SAUCR cannot react and remains unavailable
- however, existing data hints that truly simultaneous failures are extremely rare (the Non-Simultaneity Conjecture)
Implemented in ZooKeeper. SAUCR improves reliability compared to memory-durable systems:
- durable and available in 100s of crash scenarios in which memory-durable systems lose data or become unavailable

The improvements come at little or no cost. Compared to disk-durable systems:
- slight reduction in availability in extremely rare cases
- improved performance: 2.5x on SSDs, 100x on HDDs
- Introduction
- Distributed updates and crash recovery
  - disk-durable protocols
  - memory-durable protocols
- Situation-aware updates and crash recovery
- Results
- Summary and conclusion
Disk-durable updates and crash recovery

Update path (3-node example, leader plus two followers; the client writes A=2):
- the leader and each follower append the update and fsync it to disk
- once the fsyncs complete on a majority, the leader acknowledges the update to the client as committed

Crash recovery:
- a recovered node just reads its state from its local disk and is immediately ready
- this is safe: if the system ack'd anyone, the data is on disk
- a node that missed updates rejoins as lagging and catches up

Safe and available, but poor performance due to fsync: 50x slower on HDDs, 2.5x on SSDs.
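The update path just described can be sketched in a few lines of Python; this is an illustrative model, not ZooKeeper's actual code (the log layout and helper names are ours). The fsync before acknowledgment is both what makes recovery from the local disk safe and what makes the protocol slow:

```python
import os

def append_and_fsync(log_path, entry):
    """Append an entry and fsync so it survives a crash before the
    update is acknowledged (the expensive step)."""
    with open(log_path, "ab") as f:
        f.write(entry + b"\n")
        f.flush()
        os.fsync(f.fileno())

def commit(replica_logs, entry):
    """An update is committed only once a majority of replicas have
    persisted it to disk."""
    acks = 0
    for log in replica_logs:          # leader + followers
        append_and_fsync(log, entry)
        acks += 1
    return acks > len(replica_logs) // 2

def recover(log_path):
    """Disk-durable recovery: just read the local disk; the node is
    immediately ready."""
    with open(log_path, "rb") as f:
        return f.read().splitlines()
```

In this sketch every replica persists the entry before the check; a real protocol acknowledges as soon as any majority has fsync'd.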
Memory-durable updates: oblivious crash recovery (e.g., ZooKeeper with forceSync = false; practitioners do use this configuration)

Update path (the client writes A=2):
- the leader and followers buffer the update only in volatile memory
- once the update is buffered on a majority, the leader acknowledges it as committed

Crash recovery:
- a crashed node loses its in-memory data but is oblivious: it does not realize the loss and immediately rejoins as ready

Performant, but can lead to data loss.
How oblivious recovery loses data (5-node cluster):
- A=1 is committed after being buffered on three nodes; the other two nodes are slow or failed and never receive it
- one of the three nodes crashes, then recovers; it loses its data but, being oblivious, immediately rejoins
- the two lagging nodes and the recovered node now form a majority in which no node knows the committed update, so the committed update is lost
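The loss above is just quorum arithmetic and can be replayed as a tiny simulation (the node names and set-based state are our invention, assuming a 5-node memory-durable cluster with oblivious recovery):

```python
# Five nodes; A=1 was committed after being buffered on {n1, n2, n3}.
# n4 and n5 are slow/failed and never saw it.
nodes = {"n%d" % i: set() for i in range(1, 6)}
for n in ("n1", "n2", "n3"):
    nodes[n].add("A=1")          # held in volatile memory only

# n1 crashes: its in-memory state is gone. Oblivious recovery rejoins it
# immediately with empty state instead of repairing from peers.
nodes["n1"] = set()

# n1, n4, n5 now form a majority in which nobody has A=1, so a leader
# elected from this majority silently drops the committed update.
majority = ["n1", "n4", "n5"]
survivors_with_update = [n for n in majority if "A=1" in nodes[n]]
assert len(majority) > len(nodes) // 2   # a valid majority...
assert survivors_with_update == []       # ...yet the committed update is lost
```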
Memory-durable updates: loss-aware crash recovery (e.g., Viewstamped Replication)

Update path: same as before; an update is committed once it is buffered in memory on a majority.

Crash recovery:
- a recovering node realizes it may have lost data; it waits for responses from a majority of other nodes and rebuilds its state from them before becoming ready

Avoids data loss (unlike oblivious recovery), but can lead to unavailability.
How loss-aware recovery becomes unavailable (5-node cluster):
- A=1 is committed; two nodes crash
- a third node crashes and then recovers; it cannot collect majority responses, so it is stuck, although a majority of nodes are up
- even when the two failed nodes recover, they too stay in the recovering state: the cluster remains unavailable even after all nodes recover
- Introduction
- Distributed updates and crash recovery
- Situation-aware updates and crash recovery
  - SAUCR insights, guarantees, and overview
  - situation-aware updates
  - situation-aware crash recovery
- Results
- Summary and conclusion
Existing protocols are static in nature: they do not adapt to failures.
- Memory-durable: buffers even with many failures; always poor reliability
- Disk-durable: persists even when there are no failures; always poor performance

Insight: reacting to failures and adapting to the situation can achieve both reliability and performance.
- In the common case, when many or all nodes are up: buffer in memory (like memory-durable)
- When failures arise and only the minimum number of nodes are up: flush to disk (like disk-durable)
With non-simultaneous failures, a gap exists, so SAUCR can react and ensure durability:
- independent failures: the likelihood of many nodes failing together is negligible
- correlated failures: many nodes fail together; but although many nodes fail, the failures are not necessarily simultaneous, and in most cases they are non-simultaneous

With simultaneous correlated failures, there is no gap; SAUCR cannot react and becomes unavailable. We conjecture these are extremely rare, i.e., a gap exists between failures:
- correlated failures are often a few seconds apart [Ford et al., OSDI ‘10]
- our analysis reveals a gap of 50 ms or more almost always

Most cases (any number of independent or non-simultaneous correlated failures): same guarantees as disk-durable. Rare cases (more than a majority crash truly simultaneously): remain unavailable.
Updates:
- when more than a majority of nodes are up, buffer updates in memory (fast mode), e.g., 4 or 5 nodes up in a 5-node cluster
- when nodes fail and only a bare majority is alive, flush to disk (slow mode), e.g., only 3 nodes up in a 5-node cluster

Crash recovery:
- when a node recovers from a crash, it recovers its data either from its disk (if it crashed in slow mode) or from the other nodes (if it crashed in fast mode)
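The update rule above boils down to a threshold on the number of live nodes; a minimal sketch (the function name and return values are ours, not SAUCR's):

```python
def choose_mode(alive, cluster_size):
    """SAUCR's rule: run in fast mode (buffer in memory) only while
    MORE than a bare majority is alive; once failures leave just a
    bare majority, switch to slow mode (flush to disk)."""
    bare_majority = cluster_size // 2 + 1
    if alive < bare_majority:
        return "unavailable"     # cannot commit at all
    if alive == bare_majority:
        return "slow"            # bare majority: flush to disk
    return "fast"                # more than a bare majority: buffer in memory

# 5-node cluster, as in the examples above:
assert choose_mode(5, 5) == "fast"   # all nodes up
assert choose_mode(4, 5) == "fast"   # more than a majority up
assert choose_mode(3, 5) == "slow"   # only a bare majority up
assert choose_mode(2, 5) == "unavailable"
```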
Mode switching in action (5-node cluster, L marks the leader):
- all nodes up: fast mode, buffer updates
- one node fails, 4 nodes up (more than a majority): remain in fast mode
- another node fails, only a bare majority up: switch to slow mode, flush to disk, and commit subsequent updates in slow mode
- a failed node recovers and catches up (more than a majority up again): switch back to fast mode
Reacting to failures:
- basic failure-detection mechanism: heartbeats
- follower failures: while more than a majority remain, stay in fast mode; once only a bare majority is left, switch to slow mode
- leader failures: the leader steps down and the followers flush to disk
- challenges: too many packets, spurious elections, too much data to flush; techniques in the paper ...
- result: SAUCR can react to failures even when they are only a few milliseconds apart, preserving durability and availability
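A hedged sketch of the detection side, assuming a simple per-follower heartbeat table (the timeout value and data structure are illustrative, not SAUCR's actual parameters):

```python
import time

def alive_count(last_heartbeat, now, timeout):
    """Count followers whose last heartbeat arrived within the timeout
    window; the leader adds itself and uses the total to decide whether
    to stay in fast mode or switch to slow mode."""
    live_followers = sum(1 for t in last_heartbeat.values()
                         if now - t <= timeout)
    return live_followers + 1  # +1 for the leader itself

# 5-node cluster: followers n2, n3 are fresh; n4, n5 silent for 200 ms.
now = time.monotonic()
last_hb = {"n2": now, "n3": now, "n4": now - 0.2, "n5": now - 0.2}
alive = alive_count(last_hb, now, timeout=0.05)  # 50 ms window
assert alive == 3  # only a bare majority left: flush to disk
```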
Crash recovery:
- disk-durable: always recover from the local disk
- memory-durable (loss-aware): always recover from other nodes
- SAUCR: if the node crashed in slow mode, it recovers from its local disk and is ready immediately; if it crashed in fast mode, its buffered updates are lost, so it recovers from other nodes and becomes ready after a bare minority (bare majority - 1) of responses
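SAUCR's recovery dispatch can be sketched as below; this is our illustrative model under the stated rules, with invented function names and a simplistic set-union merge standing in for real log reconciliation:

```python
def saucr_recover(crashed_in_slow_mode, read_local_disk,
                  fetch_from_peers, cluster_size):
    """Slow-mode crashes recover from the local disk immediately;
    fast-mode crashes lost their buffered updates and must rebuild
    from a bare minority (bare majority - 1) of peers."""
    if crashed_in_slow_mode:
        return sorted(read_local_disk())          # immediate, like disk-durable
    bare_minority = (cluster_size // 2 + 1) - 1   # e.g., 2 in a 5-node cluster
    responses = fetch_from_peers(bare_minority)   # block until enough respond
    state = set()
    for r in responses:                           # merge peer logs (illustrative)
        state |= r
    return sorted(state)

# 5-node cluster: a slow-mode crash reads its disk; a fast-mode crash
# merges the entries reported by two peers.
assert saucr_recover(True, lambda: {"A=1", "A=2"}, None, 5) == ["A=1", "A=2"]
assert saucr_recover(False, None,
                     lambda k: [{"A=1"}, {"A=1", "A=2"}][:k], 5) == ["A=1", "A=2"]
```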
Why fast-mode recovery is safe:
- assume update A is committed, and S1 recovers having seen A before its crash
- safety condition: update A must be recovered
- if A was committed in fast mode, then at least one node in any bare minority must contain A
- if A was committed in slow mode, S1 recovers it from its disk
- proof sketch in the paper ...
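The fast-mode condition is quorum arithmetic: for every bare minority to contain at least one holder of the update, a fast-mode commit must reach at least a bare majority plus one node (our inference from the stated condition, checkable exhaustively for small clusters):

```python
from itertools import combinations

def intersects_all_bare_minorities(n, commit_count):
    """Check that every bare-minority subset of an n-node cluster
    contains at least one node from the commit set (without loss of
    generality, nodes 0..commit_count-1 hold the update)."""
    bare_minority = (n // 2 + 1) - 1
    commit_set = set(range(commit_count))
    return all(commit_set & set(group)
               for group in combinations(range(n), bare_minority))

# 5-node cluster: a fast-mode commit on 4 nodes (bare majority + 1) is
# seen by every bare minority of 2, but a commit on only 3 is not.
assert intersects_all_bare_minorities(5, 4) is True
assert intersects_all_bare_minorities(5, 3) is False
```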
- Introduction
- Distributed updates and crash recovery
- Situation-aware updates and crash recovery
- Results
- Summary and conclusion
We implement SAUCR in ZooKeeper and compare SAUCR's reliability and performance against:
- disk-durable ZooKeeper (forceSync = true)
- memory-durable ZooKeeper (forceSync = false)
- viewstamped replication (an idealized model)
Cluster crash-testing framework: generates cluster-state sequences over the five nodes (e.g., 12345, 1234, 1235, 1245, 1345, 2345, 123, 124, 345, ..., 12, 1, 2, 3, 4, 5, 125, 45, 14). How it works: please see our paper ...
Reliability results (non-simultaneous: crashes 50 ms apart; simultaneous: no gap; table of correct/unavailable/data-loss counts per system omitted):
- memory-durable ZooKeeper silently loses data
- viewstamped replication leads to permanent unavailability
- SAUCR reacts to non-simultaneous failures: durable and available
- under simultaneous failures, SAUCR by design remains unavailable in some cases
Performance (throughput in KOps/s on HDDs and SSDs; chart omitted):
- compared to disk-durable ZooKeeper, both memory-durable ZooKeeper and SAUCR are faster: 100x on HDDs, 2.5x on SSDs
- SAUCR's performance matches memory-durable ZooKeeper: within 9% even for write-intensive workloads
Summary:
- Replication protocols are an important foundation; they need to be performant, yet also provide high reliability.
- Dichotomy: disk-durable vs. memory-durable protocols; unsavory choices, either performant or reliable.
- SAUCR, situation-aware updates and crash recovery, provides both high performance and reliability.

Conclusion:
- Paying careful attention to how failures occur, we can find approaches that provide both performance and reliability; more data from real-world deployments?
- The hybrid approach, an effective systems-design technique, is applicable to distributed updates and recovery too; is it worthwhile to look at other important protocols/systems where we make similar two-ends-of-the-spectrum tradeoffs?