

SLIDE 1

CANOPUS: A SCALABLE AND MASSIVELY PARALLEL CONSENSUS PROTOCOL

Bernard Wong, CoNEXT 2017. Joint work with Sajjad Rizvi and Srinivasan Keshav.

SLIDE 2

CONSENSUS PROBLEM

Agreement between a set of nodes in the presence of failures

• Asynchronous environment

Primarily used to provide fault tolerance

[Figure: replicated log. Each replica's log holds W(x=3), W(y=1), W(z=1), followed by an undecided slot. Node A proposes W(x=1) and Node B proposes W(x=2); consensus determines which write fills the next slot in every replica's log.]

SLIDE 3

A BUILDING BLOCK IN DISTRIBUTED SYSTEMS

System applications: Spanner, Akka, Mesos, Hadoop, HBase, Kafka, BookKeeper, …

Coordination services: ZooKeeper, Consul, Chef, Puppet, etcd, Chubby, …

Consensus and atomic broadcast: ZAB, Paxos, Raft, …; Mencius, EPaxos, SPaxos, …; AllConcur, NOPaxos, NetPaxos, …

SLIDE 4

A BUILDING BLOCK IN DISTRIBUTED SYSTEMS

System applications: Spanner, Akka, Mesos, Hadoop, HBase, Kafka, BookKeeper, …

Coordination services: ZooKeeper, Consul, Chef, Puppet, etcd, Chubby, …

Consensus and atomic broadcast: ZAB, Paxos, Raft, …; Mencius, EPaxos, SPaxos, …; AllConcur, NOPaxos, NetPaxos, …

Current consensus protocols are not scalable. However, most applications only require a small number of replicas for fault tolerance.

SLIDE 5

PERMISSIONED BLOCKCHAINS

A distributed ledger shared by all the participants

Consensus at a large scale

• Large number of participants (e.g., financial institutions)
• Must validate a block before committing it to the ledger

Examples

• Hyperledger, Microsoft Coco, Kadena, Chain, …


SLIDE 6

CANOPUS

Consensus among a large set of participants

• Targets thousands of nodes distributed across the globe

Decentralized protocol

• Nodes execute steps independently and in parallel

Designed for modern datacenters

• Takes advantage of high-performance networks and hardware redundancies


SLIDE 7

SYSTEM ASSUMPTIONS

Non-uniform network latencies and link capacities

• Scalability is bandwidth-limited
• The protocol must be network-topology aware

Deployment consists of racks of servers connected by redundant links

• Full rack failures and network partitions are rare


[Figure: deployment model. Global view: datacenters connected over a WAN. Within a datacenter: racks of servers connected by redundant links.]

SLIDE 8

CONSENSUS CYCLES

Execution divided into a sequence of consensus cycles

• In each cycle, Canopus determines the order of writes (state changes) received during the previous cycle

[Figure: Canopus servers d, e, f, g, h, i receiving client writes such as x = 1, y = 3, x = 5, and z = 2 during a cycle.]
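
To make the cycle structure concrete, here is a minimal sketch of how a node might batch writes per cycle. The names (Node, receive_write, run_cycle) and the order_globally callback are illustrative assumptions, not part of the Canopus implementation.

```python
# Minimal sketch of the consensus-cycle structure (illustrative names only;
# not the Canopus implementation).

class Node:
    def __init__(self):
        self.pending_writes = []   # writes received during the current cycle
        self.log = []              # globally ordered, committed writes

    def receive_write(self, write):
        # A write received now is ordered during the *next* consensus cycle.
        self.pending_writes.append(write)

    def run_cycle(self, order_globally):
        # Seal the batch collected during the previous cycle...
        batch, self.pending_writes = self.pending_writes, []
        # ...and let the consensus protocol (a callback here) produce one
        # total order over all nodes' batches for this cycle.
        self.log.extend(order_globally(batch))

# Example: a trivial stand-in for the protocol that just keeps local order.
n = Node()
n.receive_write(("x", 1))
n.receive_write(("y", 3))
n.run_cycle(order_globally=lambda batch: batch)
```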

SLIDE 9

SUPER-LEAVES AND VNODES

Nodes in the same rack form a logical group called a super-leaf.

An intra-super-leaf consensus protocol replicates write requests among the nodes in the same super-leaf.

[Figure: servers d, e, f and g, h, i grouped into two super-leaves.]

SLIDE 10

SUPER-LEAVES AND VNODES

Nodes in the same rack form a logical group called a super-leaf.

An intra-super-leaf consensus protocol replicates write requests among the nodes in the same super-leaf.

[Figure: the two super-leaves {d, e, f} and {g, h, i}, represented by height-1 vnodes b and c.]

Represent the state of each super-leaf as a height-1 virtual node (vnode)

SLIDE 11

ACHIEVING CONSENSUS

[Figure: the vnode tree. Height-1 vnodes b (over d, e, f) and c (over g, h, i) are computed by consensus in round 1; the root vnode a, covering both super-leaves, is computed in round 2.]

Members of a height-1 vnode exchange state with members of nearby height-1 vnodes to compute a height-2 vnode

• State exchange is greatly simplified since each vnode is fault tolerant

A consensus cycle consists of h rounds, where h is the height of the vnode tree. A node completes a consensus cycle once it has computed the state of the root vnode.
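
The rough sketch below shows the round structure over the vnode tree, starting from the height-1 vnode states produced by the intra-super-leaf round. The fixed fan-in, the merge rule (sorting proposals by a key), and all names are illustrative assumptions rather than the paper's exact construction.

```python
# Rough sketch of the rounds that follow the intra-super-leaf round.
# The merge rule (sorting by a per-proposal key) and the fixed fan-in are
# illustrative assumptions, not the paper's exact construction.

def merge(child_states):
    # Deterministic merge so that every node computes the same parent state.
    return sorted(p for state in child_states for p in state)

def remaining_rounds(height1_states, fan_in=2):
    """height1_states: one state per super-leaf (height-1 vnode).
    Merges level by level until the root vnode state is known."""
    level = list(height1_states)
    rounds = 0
    while len(level) > 1:
        # One round: nearby vnodes exchange state and compute their parent vnode.
        level = [merge(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
        rounds += 1
    return level[0], rounds

# Two super-leaves -> one extra round to compute the root vnode.
root_state, extra_rounds = remaining_rounds(
    [[(634, "A1")], [(538, "C1"), (746, "B1")]])
```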

SLIDE 12

CONSENSUS PROTOCOL WITHIN A SUPER-LEAF

[Figure: super-leaf nodes a, b, c with pending write requests A1, B1, and C1, C2.]

  • Exploit low latency within a rack
  • Reliable broadcast
  • Raft
SLIDE 13

CONSENSUS PROTOCOL WITHIN A SUPER-LEAF

[Figure: each node attaches a random number to its proposal: a → (634, A1), b → (746, B1), c → (538, C1, C2).]

  • 1. Nodes prepare a proposal message that contains a random number and a list of pending write requests

SLIDE 14

CONSENSUS PROTOCOL WITHIN A SUPER-LEAF

  • 2. Nodes use reliable broadcast to exchange proposals within a super-leaf

[Figure: after the broadcast, every node holds all three proposals: (746, B1), (634, A1), (538, C1, C2).]

SLIDE 15

CONSENSUS PROTOCOL WITHIN A SUPER-LEAF

  • 3. Every node orders the proposals

[Figure: all three nodes hold the same set of proposals: (746, B1), (634, A1), (538, C1, C2).]

SLIDE 16

CONSENSUS PROTOCOL WITHIN A SUPER-LEAF

These three steps make up a consensus round. At the end, all three nodes hold the same state for their common parent vnode.

[Figure: nodes a, b, c each hold the identical proposal set (746, B1), (634, A1), (538, C1, C2), i.e., the state of their parent vnode.]
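
A compact sketch of one intra-super-leaf round, under the assumption that a reliable-broadcast primitive is available and that the random number is used as the deterministic sort key. The function names and message layout are hypothetical.

```python
# Sketch of one intra-super-leaf round (slides 13-16). The message layout,
# the use of the random number as the sort key, and all names are
# illustrative assumptions.
import random

def make_proposal(node_id, pending_writes):
    # Step 1: a proposal carries a random number and the node's pending writes.
    return {"rand": random.getrandbits(16), "node": node_id, "writes": pending_writes}

def order_proposals(all_proposals):
    # Step 3: every node applies the same deterministic rule, so all members
    # derive an identical state for their common parent vnode.
    ordered = sorted(all_proposals, key=lambda p: (p["rand"], p["node"]))
    return [w for p in ordered for w in p["writes"]]

# Step 2 is reliable broadcast; here we simply assume every node ends up
# with the same set of proposals.
proposals = [make_proposal("a", ["A1"]),
             make_proposal("b", ["B1"]),
             make_proposal("c", ["C1", "C2"])]
parent_vnode_state = order_proposals(proposals)
```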

SLIDE 17

CONSENSUS PROTOCOL BETWEEN SUPER-LEAVES

[Figure: super-leaves {d, e, f} (vnode b) and {g, h, i} (vnode c) under the root vnode a; one node per super-leaf acts as a representative, and nodes in the remote super-leaf act as emulators of their vnode.]

SLIDE 18

CONSENSUS PROTOCOL BETWEEN SUPER-LEAVES

  • 1. Representatives send proposal requests to fetch the states of vnodes

[Figure: a representative in each super-leaf sends a {proposal request} to an emulator of the other super-leaf's vnode.]

SLIDE 19

CONSENSUS PROTOCOL BETWEEN SUPER-LEAVES

  • 2. Emulators reply with proposals

[Figure: each emulator returns a {proposal response} carrying its vnode's state to the requesting representative.]

SLIDE 20

CONSENSUS PROTOCOL BETWEEN SUPER-LEAVES

  • 3. Representatives use reliable broadcast to share the fetched proposals within their super-leaf

[Figure: the representative forwards the remote vnode's state to the other members of its super-leaf.]

SLIDE 21

CONSENSUS PROTOCOL BETWEEN SUPER-LEAVES

[Figure: the vnode tree, with the representative and emulator roles labelled in each super-leaf.]

Consensus cycle ends for a node when it has completed the last round
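
The sketch below shows how a round between super-leaves could be wired together, assuming the representative fetches a remote vnode's state from an emulator and re-broadcasts it locally. The class names, message shapes, and the final sorted merge are assumptions; emulator selection and failure handling are omitted entirely.

```python
# Sketch of one inter-super-leaf round (slides 17-21). Class names, message
# shapes, and the final merge rule are illustrative assumptions; emulator
# selection and failure handling are omitted.

class Emulator:
    """A node that can answer for a vnode whose state it has already computed."""
    def __init__(self, vnode_state):
        self.vnode_state = vnode_state

    def on_proposal_request(self):
        # Step 2: reply with the vnode's state as a proposal response.
        return {"proposal_response": self.vnode_state}

class Representative:
    """A super-leaf member that fetches remote vnode state on behalf of its peers."""
    def __init__(self, local_vnode_state, broadcast):
        self.local_vnode_state = local_vnode_state
        self.broadcast = broadcast   # reliable broadcast within the super-leaf

    def run_round(self, remote_emulator):
        # Step 1: send a proposal request to an emulator of the remote vnode.
        reply = remote_emulator.on_proposal_request()
        # Step 3: reliably re-broadcast the response inside the local super-leaf.
        self.broadcast(reply["proposal_response"])
        # Every member merges local and remote state into the parent vnode's state.
        return sorted(self.local_vnode_state + reply["proposal_response"])

# Usage: vnode b = {d, e, f} fetches the state of vnode c = {g, h, i}.
rep = Representative(["A1", "B1"], broadcast=lambda msg: None)
parent_vnode_state = rep.run_round(Emulator(["C1", "C2"]))
```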

SLIDE 22

READ REQUESTS

Read requests can be serviced locally by any Canopus node

• Reads do not need to be disseminated to other participating nodes

Provides linearizability by

• Buffering read requests until the global ordering of writes has been determined
• Locally ordering its pending reads and writes to preserve the request order of its clients

Significantly reduces bandwidth requirements for read requests

Achieves total ordering of both read and write requests
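
As a rough illustration of the read path, the sketch below buffers reads until the current cycle's write order is known and then answers them against the updated store. It ignores the finer interleaving of a client's reads with its own later writes, and all names are hypothetical.

```python
# Sketch of read handling (slide 22). Reads are served locally and never
# disseminated; they are only delayed until the current cycle's write order
# is known. Names are illustrative, and per-client interleaving of pending
# reads with pending writes is not modelled.

class ReadPath:
    def __init__(self):
        self.store = {}           # key -> value, reflects committed writes
        self.buffered_reads = []  # (key, reply_callback) awaiting the cycle result

    def on_read(self, key, reply):
        # Buffer the read; no network traffic to other Canopus nodes.
        self.buffered_reads.append((key, reply))

    def on_cycle_committed(self, ordered_writes):
        # Apply the globally ordered writes of this cycle first...
        for key, value in ordered_writes:
            self.store[key] = value
        # ...then answer the buffered reads against the updated state.
        for key, reply in self.buffered_reads:
            reply(self.store.get(key))
        self.buffered_reads.clear()

# Usage: a read of x issued during the cycle sees the cycle's committed writes.
rp = ReadPath()
rp.on_read("x", reply=print)
rp.on_cycle_committed([("x", 3), ("y", 1)])
```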


SLIDE 23

ADDITIONAL OPTIMIZATIONS

Pipelining consensus cycles

• Critical to achieving high throughput over high-latency links

Write leases

• For read-mostly workloads with low-latency requirements
• Reads can complete without waiting for the end of a consensus cycle


SLIDE 24

EVALUATION: MULTI-DATACENTER CASE

3, 5, and 7 datacenters

• Each datacenter corresponds to a super-leaf

3 nodes per datacenter (up to 21 nodes in total)

• EC2 c3.4xlarge instances

100 clients on five machines per datacenter

• Each client is connected to a random node in the same datacenter


[Table: latencies across datacenters, in ms. Regions: Ireland (IR), California (CA), Virginia (VA), Tokyo (TK), Oregon (OR), Sydney (SY), Frankfurt (FF).]

SLIDE 25

CANOPUS VS. EPAXOS (20% WRITES)

SLIDE 26

EVALUATION: SINGLE-DATACENTER CASE

3 super-leaves, each with 3, 5, 7, or 9 servers (i.e., up to 27 servers in total)

• Each server has 32 GB RAM, a 200 GB SSD, and 12 cores running at 2.1 GHz

Each server has a 10 Gbps link to its ToR switch

• The aggregation switch has dual 10 Gbps links to each ToR switch

180 clients, uniformly distributed across 15 machines

• 5 machines in each rack


SLIDE 27

ZKCANOPUS VS. ZOOKEEPER

SLIDE 28

LIMITATIONS

We trade off fault tolerance for performance and understandability

• Cannot tolerate full rack failures or network partitions

We trade off latency for throughput

• At low throughputs, latencies can be higher than those of other consensus protocols

Stragglers can hold up the system (temporarily)

• Super-leaf peers detect and remove them


SLIDE 29

ONGOING WORK

Handling super-leaf failures

• For applications with high-availability requirements
• Detect and remove failed super-leaves so that the protocol can continue

Byzantine fault tolerance

• Canopus currently tolerates only crash-stop failures
• Aiming to maintain our current throughput


SLIDE 30

CONCLUSIONS

Emerging applications involve consensus at large scales

• A key barrier is the lack of a scalable consensus protocol

Addressed by Canopus

• Decentralized
• Network-topology aware
• Optimized for modern datacenters
