

SLIDE 1

CANOPUS: A SCALABLE AND MASSIVELY PARALLEL CONSENSUS PROTOCOL

Bernard Wong, CoNEXT 2017. Joint work with Sajjad Rizvi and Srinivasan Keshav.

SLIDE 2

CONSENSUS PROBLEM

Agreement between a set of nodes in the presence of failures

• Asynchronous environment

Primarily used to provide fault tolerance

[Figure: replicated log. Each replica's log holds W(x=3), W(y=1), W(z=1), followed by an undecided slot. Node A proposes W(x=1) and Node B proposes W(x=2); consensus determines which write fills the next slot in every replica's log.]

SLIDE 3

A BUILDING BLOCK IN DISTRIBUTED SYSTEMS

System applications: Spanner, Akka, Mesos, Hadoop, HBase, Kafka, BookKeeper, …

Coordination services: ZooKeeper, Consul, Chef, Puppet, etcd, Chubby, …

Consensus and atomic broadcast: ZAB, Paxos, Raft, …; Mencius, EPaxos, SPaxos, …; AllConcur, NOPaxos, NetPaxos, …

SLIDE 4

A BUILDING BLOCK IN DISTRIBUTED SYSTEMS

System applications: Spanner, Akka, Mesos, Hadoop, HBase, Kafka, BookKeeper, …

Coordination services: ZooKeeper, Consul, Chef, Puppet, etcd, Chubby, …

Consensus and atomic broadcast: ZAB, Paxos, Raft, …; Mencius, EPaxos, SPaxos, …; AllConcur, NOPaxos, NetPaxos, …

Current consensus protocols are not scalable. However, most applications only require a small number of replicas for fault tolerance.

SLIDE 5

PERMISSIONED BLOCKCHAINS

A distributed ledger shared by all the participants

Consensus at a large scale

• Large number of participants (e.g., financial institutions)
• Must validate a block before committing it to the ledger

Examples

• Hyperledger, Microsoft Coco, Kadena, Chain, …


SLIDE 6

CANOPUS

Consensus among a large set of participants

• Targets thousands of nodes distributed across the globe

Decentralized protocol

• Nodes execute steps independently and in parallel

Designed for modern datacenters

• Takes advantage of high-performance networks and hardware redundancies


SLIDE 7

SYSTEM ASSUMPTIONS

Non-uniform network latencies and link capacities

• Scalability is bandwidth-limited
• The protocol must be network-topology aware

Deployment consists of racks of servers connected by redundant links

• Full rack failures and network partitions are rare


[Figure: deployment model. Global view: datacenters connected over a WAN. Within a datacenter: racks of servers connected by redundant links.]

SLIDE 8

CONSENSUS CYCLES

Execution divided into a sequence of consensus cycles

• In each cycle, Canopus determines the order of writes (state changes) received during the previous cycle

[Figure: Canopus servers d, e, f, g, h, i receiving client writes such as x = 1, y = 3, x = 5, and z = 2 during a cycle.]
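
To make the cycle structure concrete, here is a minimal sketch of how a node might batch writes per cycle. The names (Node, receive_write, run_cycle) and the order_globally callback are illustrative assumptions, not part of the Canopus implementation.

```python
# Minimal sketch of the consensus-cycle structure (illustrative names only;
# not the Canopus implementation).

class Node:
    def __init__(self):
        self.pending_writes = []   # writes received during the current cycle
        self.log = []              # globally ordered, committed writes

    def receive_write(self, write):
        # A write received now is ordered during the *next* consensus cycle.
        self.pending_writes.append(write)

    def run_cycle(self, order_globally):
        # Seal the batch collected during the previous cycle...
        batch, self.pending_writes = self.pending_writes, []
        # ...and let the consensus protocol (a callback here) produce one
        # total order over all nodes' batches for this cycle.
        self.log.extend(order_globally(batch))

# Example: a trivial stand-in for the protocol that just keeps local order.
n = Node()
n.receive_write(("x", 1))
n.receive_write(("y", 3))
n.run_cycle(order_globally=lambda batch: batch)
```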

SLIDE 9

SUPER-LEAVES AND VNODES

Nodes in the same rack form a logical group called a super-leaf.

An intra-super-leaf consensus protocol replicates write requests among the nodes in the same super-leaf.

[Figure: servers d, e, f and g, h, i grouped into two super-leaves.]

SLIDE 10

SUPER-LEAVES AND VNODES

Nodes in the same rack form a logical group called a super-leaf.

An intra-super-leaf consensus protocol replicates write requests among the nodes in the same super-leaf.

[Figure: the two super-leaves {d, e, f} and {g, h, i}, represented by height-1 vnodes b and c.]

Represent the state of each super-leaf as a height-1 virtual node (vnode)

SLIDE 11

ACHIEVING CONSENSUS

[Figure: the vnode tree. Height-1 vnodes b (over d, e, f) and c (over g, h, i) are computed by consensus in round 1; the root vnode a, covering both super-leaves, is computed in round 2.]

Members of a height-1 vnode exchange state with members of nearby height-1 vnodes to compute a height-2 vnode

• State exchange is greatly simplified since each vnode is fault tolerant

A consensus cycle consists of h rounds, where h is the height of the vnode tree. A node completes a consensus cycle once it has computed the state of the root vnode.
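
The rough sketch below shows the round structure over the vnode tree, starting from the height-1 vnode states produced by the intra-super-leaf round. The fixed fan-in, the merge rule (sorting proposals by a key), and all names are illustrative assumptions rather than the paper's exact construction.

```python
# Rough sketch of the rounds that follow the intra-super-leaf round.
# The merge rule (sorting by a per-proposal key) and the fixed fan-in are
# illustrative assumptions, not the paper's exact construction.

def merge(child_states):
    # Deterministic merge so that every node computes the same parent state.
    return sorted(p for state in child_states for p in state)

def remaining_rounds(height1_states, fan_in=2):
    """height1_states: one state per super-leaf (height-1 vnode).
    Merges level by level until the root vnode state is known."""
    level = list(height1_states)
    rounds = 0
    while len(level) > 1:
        # One round: nearby vnodes exchange state and compute their parent vnode.
        level = [merge(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
        rounds += 1
    return level[0], rounds

# Two super-leaves -> one extra round to compute the root vnode.
root_state, extra_rounds = remaining_rounds(
    [[(634, "A1")], [(538, "C1"), (746, "B1")]])
```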

SLIDE 12

CONSENSUS PROTOCOL WITHIN A SUPER-LEAF

[Figure: super-leaf nodes a, b, c with pending write requests A1, B1, and C1, C2.]

  • Exploit low latency within a rack
  • Reliable broadcast
  • Raft
SLIDE 13

CONSENSUS PROTOCOL WITHIN A SUPER-LEAF

[Figure: each node attaches a random number to its proposal: a → (634, A1), b → (746, B1), c → (538, C1, C2).]

  • 1. Nodes prepare a proposal message that contains a random number and a list of pending write requests

SLIDE 14

CONSENSUS PROTOCOL WITHIN A SUPER-LEAF

  • 2. Nodes use reliable broadcast to exchange proposals within a super-leaf

[Figure: after the broadcast, every node holds all three proposals: (746, B1), (634, A1), (538, C1, C2).]

SLIDE 15

CONSENSUS PROTOCOL WITHIN A SUPER-LEAF

  • 3. Every node orders the proposals

[Figure: all three nodes hold the same set of proposals: (746, B1), (634, A1), (538, C1, C2).]

SLIDE 16

CONSENSUS PROTOCOL WITHIN A SUPER-LEAF

These three steps make up a consensus round. At the end, all three nodes hold the same state for their common parent vnode.

[Figure: nodes a, b, c each hold the identical proposal set (746, B1), (634, A1), (538, C1, C2), i.e., the state of their parent vnode.]
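
A compact sketch of one intra-super-leaf round, under the assumption that a reliable-broadcast primitive is available and that the random number is used as the deterministic sort key. The function names and message layout are hypothetical.

```python
# Sketch of one intra-super-leaf round (slides 13-16). The message layout,
# the use of the random number as the sort key, and all names are
# illustrative assumptions.
import random

def make_proposal(node_id, pending_writes):
    # Step 1: a proposal carries a random number and the node's pending writes.
    return {"rand": random.getrandbits(16), "node": node_id, "writes": pending_writes}

def order_proposals(all_proposals):
    # Step 3: every node applies the same deterministic rule, so all members
    # derive an identical state for their common parent vnode.
    ordered = sorted(all_proposals, key=lambda p: (p["rand"], p["node"]))
    return [w for p in ordered for w in p["writes"]]

# Step 2 is reliable broadcast; here we simply assume every node ends up
# with the same set of proposals.
proposals = [make_proposal("a", ["A1"]),
             make_proposal("b", ["B1"]),
             make_proposal("c", ["C1", "C2"])]
parent_vnode_state = order_proposals(proposals)
```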

SLIDE 17

CONSENSUS PROTOCOL BETWEEN SUPER-LEAVES

[Figure: super-leaves {d, e, f} (vnode b) and {g, h, i} (vnode c) under the root vnode a; one node per super-leaf acts as a representative, and nodes in the remote super-leaf act as emulators of their vnode.]

SLIDE 18

CONSENSUS PROTOCOL BETWEEN SUPER-LEAVES

  • 1. Representatives send proposal requests to fetch the states of vnodes

[Figure: a representative in each super-leaf sends a {proposal request} to an emulator of the other super-leaf's vnode.]

SLIDE 19

CONSENSUS PROTOCOL BETWEEN SUPER-LEAVES

  • 2. Emulators reply with proposals

[Figure: each emulator returns a {proposal response} carrying its vnode's state to the requesting representative.]

SLIDE 20

CONSENSUS PROTOCOL BETWEEN SUPER-LEAVES

  • 3. Representatives use reliable broadcast to share the fetched proposals within their super-leaf

[Figure: the representative forwards the remote vnode's state to the other members of its super-leaf.]

SLIDE 21

CONSENSUS PROTOCOL BETWEEN SUPER-LEAVES

[Figure: the vnode tree, with the representative and emulator roles labelled in each super-leaf.]

Consensus cycle ends for a node when it has completed the last round
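
The sketch below shows how a round between super-leaves could be wired together, assuming the representative fetches a remote vnode's state from an emulator and re-broadcasts it locally. The class names, message shapes, and the final sorted merge are assumptions; emulator selection and failure handling are omitted entirely.

```python
# Sketch of one inter-super-leaf round (slides 17-21). Class names, message
# shapes, and the final merge rule are illustrative assumptions; emulator
# selection and failure handling are omitted.

class Emulator:
    """A node that can answer for a vnode whose state it has already computed."""
    def __init__(self, vnode_state):
        self.vnode_state = vnode_state

    def on_proposal_request(self):
        # Step 2: reply with the vnode's state as a proposal response.
        return {"proposal_response": self.vnode_state}

class Representative:
    """A super-leaf member that fetches remote vnode state on behalf of its peers."""
    def __init__(self, local_vnode_state, broadcast):
        self.local_vnode_state = local_vnode_state
        self.broadcast = broadcast   # reliable broadcast within the super-leaf

    def run_round(self, remote_emulator):
        # Step 1: send a proposal request to an emulator of the remote vnode.
        reply = remote_emulator.on_proposal_request()
        # Step 3: reliably re-broadcast the response inside the local super-leaf.
        self.broadcast(reply["proposal_response"])
        # Every member merges local and remote state into the parent vnode's state.
        return sorted(self.local_vnode_state + reply["proposal_response"])

# Usage: vnode b = {d, e, f} fetches the state of vnode c = {g, h, i}.
rep = Representative(["A1", "B1"], broadcast=lambda msg: None)
parent_vnode_state = rep.run_round(Emulator(["C1", "C2"]))
```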

SLIDE 22

READ REQUESTS

Read requests can be serviced locally by any Canopus node

• Reads do not need to be disseminated to other participating nodes

Provides linearizability by

• Buffering read requests until the global ordering of writes has been determined
• Locally ordering its pending reads and writes to preserve the request order of its clients

Significantly reduces bandwidth requirements for read requests

Achieves total ordering of both read and write requests
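
As a rough illustration of the read path, the sketch below buffers reads until the current cycle's write order is known and then answers them against the updated store. It ignores the finer interleaving of a client's reads with its own later writes, and all names are hypothetical.

```python
# Sketch of read handling (slide 22). Reads are served locally and never
# disseminated; they are only delayed until the current cycle's write order
# is known. Names are illustrative, and per-client interleaving of pending
# reads with pending writes is not modelled.

class ReadPath:
    def __init__(self):
        self.store = {}           # key -> value, reflects committed writes
        self.buffered_reads = []  # (key, reply_callback) awaiting the cycle result

    def on_read(self, key, reply):
        # Buffer the read; no network traffic to other Canopus nodes.
        self.buffered_reads.append((key, reply))

    def on_cycle_committed(self, ordered_writes):
        # Apply the globally ordered writes of this cycle first...
        for key, value in ordered_writes:
            self.store[key] = value
        # ...then answer the buffered reads against the updated state.
        for key, reply in self.buffered_reads:
            reply(self.store.get(key))
        self.buffered_reads.clear()

# Usage: a read of x issued during the cycle sees the cycle's committed writes.
rp = ReadPath()
rp.on_read("x", reply=print)
rp.on_cycle_committed([("x", 3), ("y", 1)])
```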


SLIDE 23

ADDITIONAL OPTIMIZATIONS

Pipelining consensus cycles

• Critical to achieving high throughput over high-latency links

Write leases

• For read-mostly workloads with low-latency requirements
• Reads can complete without waiting for the end of a consensus cycle


SLIDE 24

EVALUATION: MULTI-DATACENTER CASE

3, 5, and 7 datacenters

• Each datacenter corresponds to a super-leaf

3 nodes per datacenter (up to 21 nodes in total)

• EC2 c3.4xlarge instances

100 clients on five machines per datacenter

• Each client is connected to a random node in the same datacenter


[Table: latencies across datacenters, in ms. Regions: Ireland (IR), California (CA), Virginia (VA), Tokyo (TK), Oregon (OR), Sydney (SY), Frankfurt (FF).]

SLIDE 25

CANOPUS VS. EPAXOS (20% WRITES)

SLIDE 26

EVALUATION: SINGLE-DATACENTER CASE

3 super-leaves, each with 3, 5, 7, or 9 servers (i.e., up to 27 servers in total)

• Each server has 32 GB RAM, a 200 GB SSD, and 12 cores running at 2.1 GHz

Each server has a 10 Gbps link to its ToR switch

• The aggregation switch has dual 10 Gbps links to each ToR switch

180 clients, uniformly distributed across 15 machines

• 5 machines in each rack


SLIDE 27

ZKCANOPUS VS. ZOOKEEPER

SLIDE 28

LIMITATIONS

We trade off fault tolerance for performance and understandability

• Cannot tolerate full rack failures or network partitions

We trade off latency for throughput

• At low throughputs, latencies can be higher than those of other consensus protocols

Stragglers can hold up the system (temporarily)

• Super-leaf peers detect and remove them


SLIDE 29

ONGOING WORK

Handling super-leaf failures

• For applications with high-availability requirements
• Detect and remove failed super-leaves so that the protocol can continue

Byzantine fault tolerance

• Canopus currently tolerates only crash-stop failures
• Aiming to maintain our current throughput


SLIDE 30

CONCLUSIONS

Emerging applications involve consensus at large scales

• A key barrier is the lack of a scalable consensus protocol

Addressed by Canopus

• Decentralized
• Network-topology aware
• Optimized for modern datacenters
