Reaching reliable agreement in an unreliable world
Heidi Howard heidi.howard@cl.cam.ac.uk twitter: @heidiann blog: hh360.user.srcf.net Cambridge Tech Talks 17th November 2015 slides: hh360.user.srcf.net/slides/cam_tech_talks.pdf
Leslie Lamport: "… a collection of distinct processes which are spatially separated, and which communicate with one another by exchanging messages … the message delay is not negligible compared to the time between events in a single process" [CACM '78]
Alice is a new graduate, new to the world of work. She joins a cool new start-up, where she is responsible for a distributed system.
A single server holds the store A=7, B=2, C=1. The client asks "A?" and gets 7; sends "B=5" and gets OK (the store is now A=7, B=5, C=1); then asks "B?" and gets 5.
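The read/write exchange above can be sketched as a dict behind two handlers (a minimal sketch in Python; the names are mine, not from the talk):

```python
# The single-server key-value store from the trace: a dict plus
# read ("A?") and write ("B=5") operations.
store = {"A": 7, "B": 2, "C": 1}

def read(key):
    """Handle a query like 'A?' by returning the current value."""
    return store.get(key)

def write(key, value):
    """Handle an update like 'B=5' and acknowledge with 'OK'."""
    store[key] = value
    return "OK"

print(read("A"))      # A?  -> 7
print(write("B", 5))  # B=5 -> OK
print(read("B"))      # B?  -> 5
```

Everything that follows in the talk is about what happens when this one server, or the network around it, fails.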
We want a system that is:
- correct as perceived by the client.
- tolerant of server and network failures.
- comparable in performance to a single-server system.
One server, several clients. The server holds A=7, B=2. Client 1 asks "A?" and gets 7; Client 2 sends "B=3" and gets OK (the server now stores B=3); Client 3 then asks "B?" and gets 3.
Pros: simple, and low latency (one round trip in the common case).
Cons: no fault tolerance; everything depends on the one server.
Pros: simple, and low latency (one round trip in the common case); durability can be added with logging; lost messages can be handled with retransmission & a command cache.
Cons: the system is unavailable whenever the server fails; everything still depends on a single server.
aka primary-backup replication

A primary (A=7, B=2) is mirrored by backups holding the same state. Client 1 asks "A?" and the primary answers 7. Client 2 sends "B=1": the primary applies the update (now A=7, B=1), forwards it to each backup, collects their OKs, and only then replies OK to the client.
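The primary-backup exchange above can be sketched as follows (a simplification I am adding: synchronous calls and no failures; the class and method names are mine):

```python
# Primary-backup replication: the primary applies a write, forwards it
# to every backup, and only acknowledges the client once all backups
# have confirmed.
class Replica:
    def __init__(self):
        self.store = {"A": 7, "B": 2}

    def apply(self, key, value):
        self.store[key] = value
        return "OK"

class Primary(Replica):
    def __init__(self, backups):
        super().__init__()
        self.backups = backups

    def write(self, key, value):
        self.apply(key, value)
        # Replicate to every backup before replying to the client.
        acks = [b.apply(key, value) for b in self.backups]
        assert all(a == "OK" for a in acks)
        return "OK"

backups = [Replica() for _ in range(3)]
primary = Primary(backups)
primary.write("B", 1)  # acknowledged only after all backups store B=1
```

Note the hidden assumption: every replica must see the same updates in the same order, which is exactly the totally ordered broadcast discussed next.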
We have been assuming totally ordered broadcast (aka atomic broadcast): the guarantee that messages are received reliably and in the same order by all nodes.
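To illustrate why this guarantee matters (my example, not the talk's): deterministic replicas fed the same command sequence in the same order always end in identical states.

```python
# If every replica receives the same commands in the same order,
# deterministic replicas converge to identical state.
commands = [("B", 5), ("A", 1), ("B", 3)]  # one hypothetical broadcast order

def deliver(replica, cmds):
    for key, value in cmds:
        replica[key] = value
    return replica

replicas = [deliver({"A": 7, "B": 2}, commands) for _ in range(3)]
assert replicas[0] == replicas[1] == replicas[2]  # all agree: A=1, B=3
```

Deliver the commands in different orders and the replicas diverge; implementing this broadcast under failures is the hard part.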
So far we have:
- met Alice (and her key-value store)
- seen that the obvious approaches to building a reliable system are not sufficient
Any questions so far?
Pick 2 of 3: consistency, availability, and partition tolerance. Proposed by Eric Brewer in 1998; still debated, and regarded by many as misleading. [Brewer'12] [Kleppmann'15]
It is impossible to guarantee consensus when messages may be delayed, if even one node may fail. [JACM'85]
"Consensus is impossible." Nancy Lynch [PODC'89]
Don’t drag your reader or listener through your blood strained path. Simon Peyton Jones
28
Paxos is at the foundation of (almost) all distributed consensus protocols. It is a general approach built on two phases and majority quorums. It takes much more to construct a complete fault-tolerant distributed system.
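A toy single-decree Paxos round, showing the two phases and majority quorums just mentioned. This is a sketch under strong simplifying assumptions (no failures, synchronous delivery, one proposer at a time); all names are mine:

```python
# Single-decree Paxos: phase 1 (prepare/promise) then phase 2
# (accept), each requiring a majority of acceptors.
class Acceptor:
    def __init__(self):
        self.promised = -1    # highest ballot promised so far
        self.accepted = None  # (ballot, value) accepted, if any

    def prepare(self, ballot):            # phase 1
        if ballot > self.promised:
            self.promised = ballot
            return ("promise", self.accepted)
        return ("nack", None)

    def accept(self, ballot, value):      # phase 2
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return "accepted"
        return "nack"

def propose(acceptors, ballot, value):
    majority = len(acceptors) // 2 + 1
    promises = [a.prepare(ballot) for a in acceptors]
    granted = [acc for tag, acc in promises if tag == "promise"]
    if len(granted) < majority:
        return None                       # retry with a higher ballot
    # Safety rule: if any acceptor already accepted a value, we must
    # propose the one with the highest ballot instead of our own.
    prior = [acc for acc in granted if acc is not None]
    if prior:
        value = max(prior)[1]
    acks = [a.accept(ballot, value) for a in acceptors]
    return value if acks.count("accepted") >= majority else None

acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, ballot=1, value="B=5"))  # prints B=5: chosen
```

A later proposer with a higher ballot, even one wanting a different value, is forced by the safety rule to re-propose the already-chosen value; that is the heart of why Paxos is safe.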
Consensus is hard
In this section, we have:
- surveyed classic results in the field such as the CAP theorem and the FLP result
- met the famous (and famously hard to understand) Paxos algorithm
Any questions so far?
Raft, the understandable replication algorithm. It provides us with linearisable semantics, in the best case 2 RTT latency, and a complete(ish) architecture for making our application fault-tolerant.
Three servers each hold A=7, B=2. The client sends "B=3" to one server (the leader), which appends the command to its log and replicates it to the other two; once the entry is in a majority of logs it is committed and applied, and B becomes 3 on every server.
Node states: Follower, Candidate, Leader. On startup or restart a node is a Follower; a timeout turns it into a Candidate; winning the election makes it Leader; a timed-out candidate tries again; and a node steps down to Follower when it discovers a higher term.
Each node stores its own perspective on a value known as the term. Each message includes the sender's term, and this is checked by the recipient. The term orders periods of leadership to help avoid conflict. Each node has one vote per term, so there is at most one leader per term.
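The one-vote-per-term rule can be sketched like this (names are mine; a real Raft implementation also checks that the candidate's log is up to date before granting a vote):

```python
# Vote granting: adopt any newer term, and grant at most one vote per term.
class Node:
    def __init__(self):
        self.term = 0
        self.voted_for = None

    def request_vote(self, candidate_id, candidate_term):
        if candidate_term > self.term:
            # Newer term: adopt it and forget any vote from the old term.
            self.term = candidate_term
            self.voted_for = None
        if candidate_term == self.term and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False

node = Node()
print(node.request_vote(4, 1))  # True: vote for node 4 in term 1
print(node.request_vote(2, 1))  # False: only one vote per term
```

Because each node grants at most one vote per term, two candidates cannot both assemble a majority in the same term, which is what makes the leader unique.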
Five nodes (IDs 1-5) start in term 0, none having voted.
Node 4 times out, moves to term 1, votes for itself, and asks the others: "Vote for me in term 1!"
The others reply "OK!": every node is now in term 1 with its vote recorded for node 4, which becomes leader.
Each node has a log of client commands and an index into this log representing which commands have been committed. A command is considered committed when the leader has replicated it into the logs of a majority of servers.
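The majority-commit rule can be computed directly from the replication state (a sketch; `match_index` is my name for the highest log index known to be stored on each server, leader included):

```python
# A command is committed once a majority of servers hold it, so the
# commit index is the largest log index that a majority has reached.
def commit_index(match_index):
    n = len(match_index)
    majority = n // 2 + 1
    # Sort descending: the majority-th largest value is an index held
    # by at least a majority of the servers.
    return sorted(match_index, reverse=True)[majority - 1]

print(commit_index([5, 4, 2, 4, 1]))  # prints 4: indexes up to 4 are on 3 of 5
```

Any entry at or below this index is safe to apply, since any future leader must win votes from a majority, which necessarily overlaps the majority holding the entry.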
Remaining issues include scalability (every request goes through the leader) and livelock (repeated elections can stall progress).
In this section, we have:
- seen how Raft elects a leader and replicates a log across nodes
Any questions so far?
Tango is designed to be a scalable replication protocol. It is a variant of chain replication: it is leaderless and pushes more work onto clients.
Tango keeps a shared log spread across log servers, with a sequencer handing out positions. Client 1's local view of the store is A=7, B=2; Client 2's is A=4, B=2; the log already holds A=4 on each of three log servers, and the sequencer's next free position is 1. To append B=5, Client 2 asks the sequencer "Next?" and receives position 1 (the sequencer advances to 2). It then writes "B=5 @ 1" to the log servers; once each server stores entry 1 as B=5, the append is acknowledged with OK and Client 2's view becomes A=4, B=5.
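The sequencer-then-append pattern in the trace can be sketched as follows (heavily simplified; real Tango appends along a chain and must handle holes and contention, all of which I ignore here):

```python
# Tango-style append: fetch a position from the sequencer, then write
# the entry at that position on every log server.
class Sequencer:
    def __init__(self):
        self.next = 0

    def get_next(self):
        pos, self.next = self.next, self.next + 1
        return pos

def append(sequencer, servers, entry):
    pos = sequencer.get_next()   # "Next?" -> position
    for s in servers:
        s[pos] = entry           # e.g. "B=5 @ 1" ... "OK"
    return pos

seq = Sequencer()
servers = [{} for _ in range(3)]
append(seq, servers, "A=4")           # lands at position 0
pos = append(seq, servers, "B=5")     # lands at position 1
print(pos, servers[0])                # prints 1 {0: 'A=4', 1: 'B=5'}
```

The sequencer only hands out numbers, so it does far less work than a Raft leader; the price is that clients do the replication themselves.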
In this section, we have:
- introduced Tango
- seen that it avoids the leader bottleneck but has high latency
Any questions so far?
wait… we’re not finished yet!
Recall our goals, a system that is:
- correct as perceived by the client.
- tolerant of server and network failures.
- comparable in performance to a single-server system.
In practice we find the same ideas:
- an algorithm from SMR + a multi-Paxos variant
- CR + a multi-Paxos variant
- distributed load
- an atomic broadcast protocol (Zab [DSN'11])
Open challenges:
- reconfiguring to maintain the resilience level
- wide-area links
- systems that reconfigure as the network changes
- deployment tech
Building reliable distributed systems is doable, and we can do much better! Any questions?