SLIDE 1

Reaching reliable agreement in an unreliable world

Heidi Howard
heidi.howard@cl.cam.ac.uk | twitter: @heidiann | blog: hh360.user.srcf.net
Cambridge Tech Talks, 17th November 2015
slides: hh360.user.srcf.net/slides/cam_tech_talks.pdf

SLIDE 2

Distributed Systems in Practice

  • Social networks
  • Banking
  • Government information systems
  • E-commerce
  • Web servers

SLIDE 3

Distributed Systems in Theory

Leslie Lamport: "… a collection of distinct processes which are spatially separated and which communicate with one another by exchanging messages … the message delay is not negligible compared to the time between events in a single process" [CACM '78]

SLIDE 4

Introducing Alice

Alice is a new graduate entering the world of work. She joins a cool new start-up, where she is responsible for a distributed system.

SLIDE 5

Key Value Store

[Diagram: a key-value store holding A=7, B=2, C=1]

SLIDE 6

Key Value Store

[Diagram: a client sends "A?" and the store, holding A=7, B=2, C=1, replies 7]

SLIDE 7

Key Value Store

[Diagram: the client then sends "B=5"; the store still holds A=7, B=2, C=1]

SLIDE 8

Key Value Store

[Diagram: the store applies the write, now holding A=7, B=5, C=1, and replies OK]

SLIDE 9

Key Value Store

[Diagram: the client sends "B?" and the store, holding A=7, B=5, C=1, replies 5]
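
The exchange in the last few slides is easy to capture in code. A minimal sketch of a single-node key-value store (my own illustration, not code from the talk):

    # Single-node key-value store, mirroring the "A?" / "B=5" exchanges above.
    class KVStore:
        def __init__(self):
            self.data = {"A": 7, "B": 2, "C": 1}

        def handle(self, request):
            # "B?"  -> read the current value; "B=5" -> write and reply OK.
            if request.endswith("?"):
                return self.data.get(request[:-1])
            key, value = request.split("=")
            self.data[key] = int(value)
            return "OK"

    store = KVStore()
    print(store.handle("A?"))   # 7
    print(store.handle("B=5"))  # OK
    print(store.handle("B?"))   # 5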

SLIDE 10

Requirements

  • Scalability - High throughout processing of
  • perations.
  • Latency - Low latency commit of operation as

perceived by the client.

  • Fault-tolerance - Availability in the face of machine

and network failures.

  • Linearizable semantics - Operate as if a single

server system.

SLIDE 11

Single Server System

[Diagram: a single server holding A=7, B=2, with Clients 1, 2 and 3 connected]

SLIDE 12

Single Server System

[Diagram: a client sends "A?" and the server, holding A=7, B=2, replies 7]

SLIDE 13

Single Server System

[Diagram: a client sends "B=3"; the server sets B to 3 and replies OK]

SLIDE 14

Single Server System

[Diagram: another client sends "B?" and the server replies 3]

SLIDE 15

Single Server System

Pros

  • easy to deploy
  • low latency (1 RTT in the common case)
  • requests executed in order

Cons

  • system unavailable if the server or network fails
  • throughput limited to one server

SLIDE 16

Single Server System (v.2)

Pros

  • easy to deploy
  • low latency (1 RTT in the common case)
  • linearizable semantics
  • durability with write-ahead logging
  • partition tolerance with retransmission & a command cache (see the sketch after this list)

Cons

  • system unavailable if the server fails
  • throughput limited to one server
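
A minimal sketch of what write-ahead logging plus a command cache could look like (my own illustration; the file name and request IDs are assumptions, not details from the talk):

    import json, os

    # v.2 single-server store: log each write durably before applying it, and
    # cache replies so a retransmitted request is not executed twice.
    class DurableKVStore:
        def __init__(self, log_path="kv.wal"):
            self.data = {}
            self.replies = {}                 # (client_id, request_id) -> reply
            self.log = open(log_path, "a")

        def write(self, client_id, request_id, key, value):
            if (client_id, request_id) in self.replies:   # retransmission
                return self.replies[(client_id, request_id)]
            self.log.write(json.dumps([client_id, request_id, key, value]) + "\n")
            self.log.flush()
            os.fsync(self.log.fileno())        # durable before we reply
            self.data[key] = value
            self.replies[(client_id, request_id)] = "OK"
            return "OK"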

SLIDE 17

Backups

aka Primary backup replication

[Diagram: a primary holding A=7, B=2 and three backups holding the same state; a client sends "A?" and the primary replies 7]

SLIDE 18

Backups

aka Primary backup replication

[Diagram: a client sends "B=1" to the primary; primary and backups still hold A=7, B=2]

SLIDE 19

Backups

aka Primary backup replication

[Diagram: the primary applies the write, now holding A=7, B=1, and forwards "B=1" to the backups]

SLIDE 20

Backups

aka Primary backup replication

[Diagram: the backups apply the write and acknowledge; the primary replies OK to the client, and every replica now holds A=7, B=1]
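
A minimal sketch of the write path in slides 17-20, assuming the primary can reach every backup reliably and in order (names are mine, not from the talk):

    # Primary-backup replication: apply the write locally, forward it to every
    # backup, and reply OK only after all backups have acknowledged.
    class Backup:
        def __init__(self):
            self.data = {}

        def apply(self, key, value):
            self.data[key] = value
            return "OK"

    class Primary:
        def __init__(self, backups):
            self.data = {}
            self.backups = backups

        def write(self, key, value):
            self.data[key] = value
            acks = [b.apply(key, value) for b in self.backups]  # assumes no failures
            assert all(ack == "OK" for ack in acks)
            return "OK"

    primary = Primary([Backup(), Backup(), Backup()])
    print(primary.write("B", 1))   # OK

The assumption hidden in that loop is exactly the "big gotcha" on the next slide: every replica must see the same writes, reliably and in the same order.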

SLIDE 21

Big Gotcha

We are assuming totally ordered broadcast

SLIDE 22

Totally Ordered Broadcast

(aka atomic broadcast): the guarantee that messages are received reliably and in the same order by all nodes.

SLIDE 23

Intro (Review)

So far we have:

  • Defined our notion of a distributed system
  • Introduced an example distributed system (Alice and her key-value store)
  • Seen that straw-man approaches to building this system are not sufficient

Any questions so far?

SLIDE 24

Doing the Impossible

SLIDE 25

CAP Theorem

Pick 2 of 3:

  • Consistency
  • Availability
  • Partition tolerance

Proposed by Brewer in 1998; still debated and regarded as misleading. [Brewer'12] [Kleppmann'15]

[Photo: Eric Brewer]

SLIDE 26

FLP Impossibility

It is impossible to guarantee consensus when messages may be delayed, if even one node may fail. [JACM'85]

SLIDE 27

Consensus is impossible

[PODC’89] Nancy Lynch

SLIDE 28

Aside from Simon PJ

"Don't drag your reader or listener through your blood-stained path." - Simon Peyton Jones

SLIDE 29

Paxos

Paxos is at the foundation of (almost) all distributed consensus protocols. It is a general approach built on two phases and majority quorums. It takes much more to construct a complete fault-tolerant distributed system.
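
To make "two phases and majority quorums" concrete, here is a minimal single-decree Paxos acceptor; a sketch only (it omits the proposer's retry logic, networking and persistence, and the names are mine):

    # Paxos acceptor: phase 1 (prepare/promise) and phase 2 (accept/accepted).
    # A value is chosen once a majority of acceptors accept the same proposal.
    class Acceptor:
        def __init__(self):
            self.promised = 0        # highest proposal number promised so far
            self.accepted_n = 0      # proposal number of the accepted value
            self.accepted_v = None   # the accepted value, if any

        def prepare(self, n):
            # Phase 1: promise to ignore proposals below n, and report any
            # value we have already accepted.
            if n > self.promised:
                self.promised = n
                return ("promise", self.accepted_n, self.accepted_v)
            return ("nack", self.promised, None)

        def accept(self, n, value):
            # Phase 2: accept the value unless we promised a higher number.
            if n >= self.promised:
                self.promised = n
                self.accepted_n, self.accepted_v = n, value
                return "accepted"
            return "nack"

A proposer first sends prepare(n) to the acceptors; if a majority promise, it sends accept(n, v), where v is the value returned with the highest accepted_n, or its own value if none was reported.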

SLIDE 30

Consensus is hard

SLIDE 31

Doing the Impossible (Review)

In this section, we have:

  • Learned about various impossibility results in the field, such as the CAP theorem and the FLP result
  • Introduced the fundamental (yet famously difficult to understand) Paxos algorithm

Any questions so far?

SLIDE 32

A raft in the sea of confusion

SLIDE 33

Case Study 1: Raft

Raft, the understandable replication algorithm. It provides linearizable semantics with, in the best case, 2 RTTs of latency, and a complete(ish) architecture for making our application fault-tolerant.

SLIDE 34

State Machine Replication

[Diagram: a client and three servers, each holding A=7, B=2; the client sends "B=3"]

SLIDE 35

State Machine Replication

[Diagram: the client's "B=3" command is passed on to the other servers]

SLIDE 36

State Machine Replication

[Diagram: the command reaches all three servers, which still hold A=7, B=2]

SLIDE 37

State Machine Replication

[Diagram: all three servers now hold the replicated command "B=3"]
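
The idea in slides 34-37 is that every server applies the same log of commands, in the same order, to a deterministic state machine, so all replicas reach the same state. A minimal sketch (my own illustration):

    # State machine replication: replicas that apply an identical log of
    # commands in the same order end up with identical state.
    def apply_log(log):
        state = {"A": 7, "B": 2}
        for key, value in log:            # commands such as ("B", 3)
            state[key] = value
        return state

    shared_log = [("B", 3)]
    replicas = [apply_log(shared_log) for _ in range(3)]
    assert replicas[0] == replicas[1] == replicas[2] == {"A": 7, "B": 3}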

SLIDE 38

Leadership

[State diagram: nodes start (or restart) as Follower; a Follower that times out becomes a Candidate; a Candidate either wins the election and becomes Leader or times out and tries again; Candidates and Leaders step down to Follower]

SLIDE 39

Ordering

Each node stores its own perspective on a value known as the term. Each message includes the sender's term, and this is checked by the recipient. The term orders periods of leadership, which helps to avoid conflict. Each node has one vote per term, so there is at most one leader per term.
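
A minimal sketch of the term and voting rules just described (the method name is mine; real Raft additionally checks that the candidate's log is at least as up to date before granting a vote):

    # One vote per term, so there can be at most one leader per term.
    class Node:
        def __init__(self, node_id):
            self.node_id = node_id
            self.term = 0
            self.voted_for = None           # who we voted for in the current term

        def on_request_vote(self, candidate_id, candidate_term):
            if candidate_term > self.term:  # newer term: adopt it, reset our vote
                self.term = candidate_term
                self.voted_for = None
            if candidate_term < self.term:  # stale candidate: reject
                return False
            if self.voted_for in (None, candidate_id):
                self.voted_for = candidate_id
                return True
            return False

    nodes = [Node(i) for i in range(1, 6)]
    votes = sum(n.on_request_vote(candidate_id=4, candidate_term=1) for n in nodes)
    print(votes >= 3)   # True: node 4 has a majority and becomes leader for term 1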

SLIDE 40

[Diagram: five nodes with IDs 1-5, all in term 0, none having voted]

SLIDE 41

Leadership

[State diagram: the same Follower / Candidate / Leader transitions as before]

SLIDE 42

[Diagram: node 4 moves to term 1, votes for itself, and asks the others, still in term 0, to "Vote for me in term 1!"]

SLIDE 43

[Diagram: all five nodes are now in term 1 and have voted for node 4, which wins the election]

SLIDE 44

Replication

Each node has a log of client commands and an index into this log, representing which commands have been committed. A command is considered committed when the leader has replicated it into the logs of a majority of servers.
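
A minimal sketch of that commit rule from the leader's point of view (names are mine; real Raft also restricts this to entries from the leader's current term):

    # The leader tracks how much of its log each follower has replicated and
    # commits an entry once it is stored on a majority of the cluster.
    def committed_index(leader_log_length, follower_match_indexes, cluster_size):
        replicated = sorted([leader_log_length] + follower_match_indexes, reverse=True)
        majority = cluster_size // 2 + 1
        return replicated[majority - 1]     # highest index present on a majority

    # 5-node cluster: the leader has 7 entries; followers have 7, 6, 3 and 2.
    print(committed_index(7, [7, 6, 3, 2], 5))   # 6 -> entries 1..6 are committed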

SLIDE 45

Evaluation

  • The leader is a serious bottleneck -> limited scalability
  • Can only handle the failure of a minority of nodes
  • Some rare network partitions leave the protocol in livelock

SLIDE 46

Raft in the sea of confusion (Review)

In this section, we have:

  • Introduced the Raft algorithm
  • Seen how Raft elects a leader from a collection of nodes
  • Evaluated the Raft algorithm

Any questions so far?

SLIDE 47

Beyond Raft

SLIDE 48

Case Study 2: Tango

Tango is designed to be a scalable replication protocol. It is a variant of chain replication. It is leaderless and pushes more work onto clients.
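
A minimal sketch of the sequencer-plus-chain write path shown in the next few slides (an in-process simplification of my own; failure handling is omitted and the names are not from the talk):

    # Leaderless replication with a sequencer: the client reserves a log
    # position, then writes its entry to every server in the chain.
    class Sequencer:
        def __init__(self):
            self.next = 1

        def reserve(self):
            position, self.next = self.next, self.next + 1
            return position

    class Server:
        def __init__(self):
            self.log = {}                    # position -> command

        def write(self, position, command):
            self.log[position] = command
            return "OK"

    def client_append(sequencer, chain, command):
        position = sequencer.reserve()       # "Next?" -> 1
        for server in chain:                 # write down the chain in order
            assert server.write(position, command) == "OK"
        return position

    chain = [Server(), Server(), Server()]
    print(client_append(Sequencer(), chain, "B=5"))   # appended at position 1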

SLIDE 49

Simple Replication

[Diagram: two clients with local views (Client 1: A=7, B=2; Client 2: A=4, B=2), a sequencer whose counter reads Next: 1, and three servers whose logs already contain an earlier A=4 entry; Client 2 wants to write B=5]

SLIDE 50

Simple Replication

[Diagram: Client 2 asks the sequencer "Next?" and is assigned position 1; the sequencer's counter advances to Next: 2]

SLIDE 51

Simple Replication

[Diagram: Client 2 sends "B=5 @ 1" to Server 1, which stores the entry and replies OK]

SLIDE 52

Simple Replication

[Diagram: "B=5 @ 1" is written to Server 2, which stores the entry and acknowledges]

SLIDE 53

Simple Replication

[Diagram: "B=5 @ 1" is written to Server 3; all three servers now hold entry 1: B=5]

SLIDE 54

Simple Replication

[Diagram: Client 2's local view now reads A=4, B=5; the write is complete]

SLIDE 55

Beyond Raft (Review)

In this section, we have:

  • Introduced an alternative algorithm, known as Tango
  • Seen that Tango is scalable, since the leader is no longer the bottleneck, but at the cost of higher latency

Any questions so far?

SLIDE 56

Next Steps

SLIDE 57

wait… we’re not finished yet!

SLIDE 58

Requirements

  • Scalability - high-throughput processing of operations.
  • Latency - low-latency commit of operations, as perceived by the client.
  • Fault-tolerance - availability in the face of machine and network failures.
  • Linearizable semantics - operate as if a single-server system.

SLIDE 59

Many more examples

  • Raft [ATC'14] - good starting point; an understandable algorithm built from SMR + a multi-Paxos variant
  • Tango [SOSP'13] - scalable algorithm for f+1 nodes; uses CR + a multi-Paxos variant
  • VRR [MIT-TR'12] - Raft with round-robin leadership & more evenly distributed load
  • Zookeeper [ATC'10] - primary-backup replication + an atomic broadcast protocol (Zab [DSN'11])
  • EPaxos [SOSP'13] - leaderless Paxos variant for WANs

SLIDE 60

Can we do even better?

  • Self-scaling replication - adapting resources to maintain the desired resilience level
  • Geo-replication - strong consistency across wide-area links
  • Auto-configuration - adapting timeouts and configuration as the network changes
  • Integration with unikernels, virtualisation, containers and other such deployment tech

SLIDE 61

Evaluation is hard

  • few common evaluation metrics
  • often only one experimental setup is used
  • workloads differ from paper to paper
  • evaluations are designed to demonstrate a protocol's strengths

SLIDE 62

Lessons Learned

  • Reaching consensus in distributed systems is doable
  • Exploit domain knowledge
  • Raft is a good starting point, but we can do much better!

Any Questions?
