SLIDE 1

Reaching reliable agreement in an unreliable world

Heidi Howard
heidi.howard@cl.cam.ac.uk | twitter: @heidiann | blog: hh360.user.srcf.net
Cambridge Tech Talks, 17th November 2015
slides: hh360.user.srcf.net/slides/cam_tech_talks.pdf

SLIDE 2

Distributed Systems in Practice

  • Social networks
  • Banking
  • Government information systems
  • E-commerce
  • Web servers

SLIDE 3

Distributed Systems in Theory

Leslie Lamport: "… a collection of distinct processes which are spatially separated and which communicate with one another by exchanging messages … the message delay is not negligible compared to the time between events in a single process" [CACM '78]

SLIDE 4

Introducing Alice

Alice is a new graduate entering the world of work. She joins a cool new start-up, where she is responsible for a distributed system.

SLIDE 5

Key Value Store

[Diagram: a key-value store holding A=7, B=2, C=1]

SLIDE 6

Key Value Store

[Diagram: a client sends "A?" and the store, holding A=7, B=2, C=1, replies 7]

SLIDE 7

Key Value Store

[Diagram: the client then sends "B=5"; the store still holds A=7, B=2, C=1]

SLIDE 8

Key Value Store

[Diagram: the store applies the write, now holding A=7, B=5, C=1, and replies OK]

SLIDE 9

Key Value Store

[Diagram: the client sends "B?" and the store, holding A=7, B=5, C=1, replies 5]
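
The exchange in the last few slides is easy to capture in code. A minimal sketch of a single-node key-value store (my own illustration, not code from the talk):

    # Single-node key-value store, mirroring the "A?" / "B=5" exchanges above.
    class KVStore:
        def __init__(self):
            self.data = {"A": 7, "B": 2, "C": 1}

        def handle(self, request):
            # "B?"  -> read the current value; "B=5" -> write and reply OK.
            if request.endswith("?"):
                return self.data.get(request[:-1])
            key, value = request.split("=")
            self.data[key] = int(value)
            return "OK"

    store = KVStore()
    print(store.handle("A?"))   # 7
    print(store.handle("B=5"))  # OK
    print(store.handle("B?"))   # 5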

SLIDE 10

Requirements

  • Scalability - High throughout processing of
  • perations.
  • Latency - Low latency commit of operation as

perceived by the client.

  • Fault-tolerance - Availability in the face of machine

and network failures.

  • Linearizable semantics - Operate as if a single

server system.

SLIDE 11

Single Server System

[Diagram: a single server holding A=7, B=2, with Clients 1, 2 and 3 connected]

SLIDE 12

Single Server System

[Diagram: a client sends "A?" and the server, holding A=7, B=2, replies 7]

SLIDE 13

Single Server System

[Diagram: a client sends "B=3"; the server sets B to 3 and replies OK]

SLIDE 14

Single Server System

[Diagram: another client sends "B?" and the server replies 3]

SLIDE 15

Single Server System

Pros

  • easy to deploy
  • low latency (1 RTT in the common case)
  • requests executed in order

Cons

  • system unavailable if the server or network fails
  • throughput limited to one server

SLIDE 16

Single Server System (v.2)

Pros

  • easy to deploy
  • low latency (1 RTT in the common case)
  • linearizable semantics
  • durability with write-ahead logging
  • partition tolerance with retransmission & a command cache (see the sketch after this list)

Cons

  • system unavailable if the server fails
  • throughput limited to one server
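
A minimal sketch of what write-ahead logging plus a command cache could look like (my own illustration; the file name and request IDs are assumptions, not details from the talk):

    import json, os

    # v.2 single-server store: log each write durably before applying it, and
    # cache replies so a retransmitted request is not executed twice.
    class DurableKVStore:
        def __init__(self, log_path="kv.wal"):
            self.data = {}
            self.replies = {}                 # (client_id, request_id) -> reply
            self.log = open(log_path, "a")

        def write(self, client_id, request_id, key, value):
            if (client_id, request_id) in self.replies:   # retransmission
                return self.replies[(client_id, request_id)]
            self.log.write(json.dumps([client_id, request_id, key, value]) + "\n")
            self.log.flush()
            os.fsync(self.log.fileno())        # durable before we reply
            self.data[key] = value
            self.replies[(client_id, request_id)] = "OK"
            return "OK"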

SLIDE 17

Backups

aka Primary backup replication

[Diagram: a primary holding A=7, B=2 and three backups holding the same state; a client sends "A?" and the primary replies 7]

SLIDE 18

Backups

aka Primary backup replication

[Diagram: a client sends "B=1" to the primary; primary and backups still hold A=7, B=2]

SLIDE 19

Backups

aka Primary backup replication

[Diagram: the primary applies the write, now holding A=7, B=1, and forwards "B=1" to the backups]

SLIDE 20

Backups

aka Primary backup replication

[Diagram: the backups apply the write and acknowledge; the primary replies OK to the client, and every replica now holds A=7, B=1]
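
A minimal sketch of the write path in slides 17-20, assuming the primary can reach every backup reliably and in order (names are mine, not from the talk):

    # Primary-backup replication: apply the write locally, forward it to every
    # backup, and reply OK only after all backups have acknowledged.
    class Backup:
        def __init__(self):
            self.data = {}

        def apply(self, key, value):
            self.data[key] = value
            return "OK"

    class Primary:
        def __init__(self, backups):
            self.data = {}
            self.backups = backups

        def write(self, key, value):
            self.data[key] = value
            acks = [b.apply(key, value) for b in self.backups]  # assumes no failures
            assert all(ack == "OK" for ack in acks)
            return "OK"

    primary = Primary([Backup(), Backup(), Backup()])
    print(primary.write("B", 1))   # OK

The assumption hidden in that loop is exactly the "big gotcha" on the next slide: every replica must see the same writes, reliably and in the same order.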

SLIDE 21

Big Gotcha

We are assuming totally ordered broadcast

SLIDE 22

Totally Ordered Broadcast

(aka atomic broadcast): the guarantee that messages are received reliably and in the same order by all nodes.

SLIDE 23

Intro (Review)

So far we have:

  • Defined our notion of a distributed system
  • Introduced an example distributed system (Alice and her key-value store)
  • Seen that straw-man approaches to building this system are not sufficient

Any questions so far?

SLIDE 24

Doing the Impossible

SLIDE 25

CAP Theorem

Pick 2 of 3:

  • Consistency
  • Availability
  • Partition tolerance

Proposed by Brewer in 1998; still debated and regarded as misleading. [Brewer'12] [Kleppmann'15]

[Photo: Eric Brewer]

SLIDE 26

FLP Impossibility

It is impossible to guarantee consensus when messages may be delayed, if even one node may fail. [JACM'85]

SLIDE 27

Consensus is impossible

[PODC’89] Nancy Lynch

SLIDE 28

Aside from Simon PJ

"Don't drag your reader or listener through your blood-stained path." - Simon Peyton Jones

SLIDE 29

Paxos

Paxos is at the foundation of (almost) all distributed consensus protocols. It is a general approach built on two phases and majority quorums. It takes much more to construct a complete fault-tolerant distributed system.
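
To make "two phases and majority quorums" concrete, here is a minimal single-decree Paxos acceptor; a sketch only (it omits the proposer's retry logic, networking and persistence, and the names are mine):

    # Paxos acceptor: phase 1 (prepare/promise) and phase 2 (accept/accepted).
    # A value is chosen once a majority of acceptors accept the same proposal.
    class Acceptor:
        def __init__(self):
            self.promised = 0        # highest proposal number promised so far
            self.accepted_n = 0      # proposal number of the accepted value
            self.accepted_v = None   # the accepted value, if any

        def prepare(self, n):
            # Phase 1: promise to ignore proposals below n, and report any
            # value we have already accepted.
            if n > self.promised:
                self.promised = n
                return ("promise", self.accepted_n, self.accepted_v)
            return ("nack", self.promised, None)

        def accept(self, n, value):
            # Phase 2: accept the value unless we promised a higher number.
            if n >= self.promised:
                self.promised = n
                self.accepted_n, self.accepted_v = n, value
                return "accepted"
            return "nack"

A proposer first sends prepare(n) to the acceptors; if a majority promise, it sends accept(n, v), where v is the value returned with the highest accepted_n, or its own value if none was reported.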

SLIDE 30

Consensus is hard

SLIDE 31

Doing the Impossible (Review)

In this section, we have:

  • Learned about various impossibility results in the field, such as the CAP theorem and the FLP result
  • Introduced the fundamental (yet famously difficult to understand) Paxos algorithm

Any questions so far?

SLIDE 32

A raft in the sea of confusion

SLIDE 33

Case Study 1: Raft

Raft, the understandable replication algorithm. It provides linearizable semantics with, in the best case, 2 RTTs of latency, and a complete(ish) architecture for making our application fault-tolerant.

SLIDE 34

State Machine Replication

[Diagram: a client and three servers, each holding A=7, B=2; the client sends "B=3"]

SLIDE 35

State Machine Replication

[Diagram: the client's "B=3" command is passed on to the other servers]

SLIDE 36

State Machine Replication

[Diagram: the command reaches all three servers, which still hold A=7, B=2]

SLIDE 37

State Machine Replication

[Diagram: all three servers now hold the replicated command "B=3"]
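
The idea in slides 34-37 is that every server applies the same log of commands, in the same order, to a deterministic state machine, so all replicas reach the same state. A minimal sketch (my own illustration):

    # State machine replication: replicas that apply an identical log of
    # commands in the same order end up with identical state.
    def apply_log(log):
        state = {"A": 7, "B": 2}
        for key, value in log:            # commands such as ("B", 3)
            state[key] = value
        return state

    shared_log = [("B", 3)]
    replicas = [apply_log(shared_log) for _ in range(3)]
    assert replicas[0] == replicas[1] == replicas[2] == {"A": 7, "B": 3}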

SLIDE 38

Leadership

[State diagram: nodes start (or restart) as Follower; a Follower that times out becomes a Candidate; a Candidate either wins the election and becomes Leader or times out and tries again; Candidates and Leaders step down to Follower]

SLIDE 39

Ordering

Each node stores its own perspective on a value known as the term. Each message includes the sender's term, and this is checked by the recipient. The term orders periods of leadership, which helps to avoid conflict. Each node has one vote per term, so there is at most one leader per term.
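
A minimal sketch of the term and voting rules just described (the method name is mine; real Raft additionally checks that the candidate's log is at least as up to date before granting a vote):

    # One vote per term, so there can be at most one leader per term.
    class Node:
        def __init__(self, node_id):
            self.node_id = node_id
            self.term = 0
            self.voted_for = None           # who we voted for in the current term

        def on_request_vote(self, candidate_id, candidate_term):
            if candidate_term > self.term:  # newer term: adopt it, reset our vote
                self.term = candidate_term
                self.voted_for = None
            if candidate_term < self.term:  # stale candidate: reject
                return False
            if self.voted_for in (None, candidate_id):
                self.voted_for = candidate_id
                return True
            return False

    nodes = [Node(i) for i in range(1, 6)]
    votes = sum(n.on_request_vote(candidate_id=4, candidate_term=1) for n in nodes)
    print(votes >= 3)   # True: node 4 has a majority and becomes leader for term 1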

SLIDE 40

[Diagram: five nodes with IDs 1-5, all in term 0, none having voted]

SLIDE 41

Leadership

[State diagram: the same Follower / Candidate / Leader transitions as before]

SLIDE 42

[Diagram: node 4 moves to term 1, votes for itself, and asks the others, still in term 0, to "Vote for me in term 1!"]

SLIDE 43

[Diagram: all five nodes are now in term 1 and have voted for node 4, which wins the election]

SLIDE 44

Replication

Each node has a log of client commands and an index into this log, representing which commands have been committed. A command is considered committed when the leader has replicated it into the logs of a majority of servers.
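
A minimal sketch of that commit rule from the leader's point of view (names are mine; real Raft also restricts this to entries from the leader's current term):

    # The leader tracks how much of its log each follower has replicated and
    # commits an entry once it is stored on a majority of the cluster.
    def committed_index(leader_log_length, follower_match_indexes, cluster_size):
        replicated = sorted([leader_log_length] + follower_match_indexes, reverse=True)
        majority = cluster_size // 2 + 1
        return replicated[majority - 1]     # highest index present on a majority

    # 5-node cluster: the leader has 7 entries; followers have 7, 6, 3 and 2.
    print(committed_index(7, [7, 6, 3, 2], 5))   # 6 -> entries 1..6 are committed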

SLIDE 45

Evaluation

  • The leader is a serious bottleneck -> limited scalability
  • Can only handle the failure of a minority of nodes
  • Some rare network partitions leave the protocol in livelock

SLIDE 46

Raft in the sea of confusion (Review)

In this section, we have:

  • Introduced the Raft algorithm
  • Seen how Raft elects a leader from a collection of nodes
  • Evaluated the Raft algorithm

Any questions so far?

SLIDE 47

Beyond Raft

SLIDE 48

Case Study 2: Tango

Tango is designed to be a scalable replication protocol. It is a variant of chain replication. It is leaderless and pushes more work onto clients.
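
A minimal sketch of the sequencer-plus-chain write path shown in the next few slides (an in-process simplification of my own; failure handling is omitted and the names are not from the talk):

    # Leaderless replication with a sequencer: the client reserves a log
    # position, then writes its entry to every server in the chain.
    class Sequencer:
        def __init__(self):
            self.next = 1

        def reserve(self):
            position, self.next = self.next, self.next + 1
            return position

    class Server:
        def __init__(self):
            self.log = {}                    # position -> command

        def write(self, position, command):
            self.log[position] = command
            return "OK"

    def client_append(sequencer, chain, command):
        position = sequencer.reserve()       # "Next?" -> 1
        for server in chain:                 # write down the chain in order
            assert server.write(position, command) == "OK"
        return position

    chain = [Server(), Server(), Server()]
    print(client_append(Sequencer(), chain, "B=5"))   # appended at position 1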

SLIDE 49

Simple Replication

[Diagram: two clients with local views (Client 1: A=7, B=2; Client 2: A=4, B=2), a sequencer whose counter reads Next: 1, and three servers whose logs already contain an earlier A=4 entry; Client 2 wants to write B=5]

SLIDE 50

Simple Replication

[Diagram: Client 2 asks the sequencer "Next?" and is assigned position 1; the sequencer's counter advances to Next: 2]

SLIDE 51

Simple Replication

[Diagram: Client 2 sends "B=5 @ 1" to Server 1, which stores the entry and replies OK]

SLIDE 52

Simple Replication

[Diagram: "B=5 @ 1" is written to Server 2, which stores the entry and acknowledges]

SLIDE 53

Simple Replication

[Diagram: "B=5 @ 1" is written to Server 3; all three servers now hold entry 1: B=5]

SLIDE 54

Simple Replication

[Diagram: Client 2's local view now reads A=4, B=5; the write is complete]

SLIDE 55

Beyond Raft (Review)

In this section, we have:

  • Introduced an alternative algorithm, known as Tango
  • Seen that Tango is scalable, since the leader is no longer the bottleneck, but at the cost of higher latency

Any questions so far?

SLIDE 56

Next Steps

SLIDE 57

wait… we’re not finished yet!

SLIDE 58

Requirements

  • Scalability - high-throughput processing of operations.
  • Latency - low-latency commit of operations, as perceived by the client.
  • Fault-tolerance - availability in the face of machine and network failures.
  • Linearizable semantics - operate as if a single-server system.

SLIDE 59

Many more examples

  • Raft [ATC'14] - good starting point; an understandable algorithm built from SMR + a multi-Paxos variant
  • Tango [SOSP'13] - scalable algorithm for f+1 nodes; uses CR + a multi-Paxos variant
  • VRR [MIT-TR'12] - Raft with round-robin leadership & more evenly distributed load
  • Zookeeper [ATC'10] - primary-backup replication + an atomic broadcast protocol (Zab [DSN'11])
  • EPaxos [SOSP'13] - leaderless Paxos variant for WANs

SLIDE 60

Can we do even better?

  • Self-scaling replication - adapting resources to maintain the desired resilience level
  • Geo-replication - strong consistency across wide-area links
  • Auto-configuration - adapting timeouts and configuration as the network changes
  • Integration with unikernels, virtualisation, containers and other such deployment tech

SLIDE 61

Evaluation is hard

  • few common evaluation metrics
  • often only one experimental setup is used
  • workloads differ from paper to paper
  • evaluations are designed to demonstrate a protocol's strengths

SLIDE 62

Lessons Learned

  • Reaching consensus in distributed systems is doable
  • Exploit domain knowledge
  • Raft is a good starting point, but we can do much better!

Any Questions?
