SLIDE 1

RAFT Consensus

Slide content borrowed from Diego Ongaro, John Ousterhout, and Alberto Montresor

SLIDE 2

Log Consensus

  • Bit consensus: agree on a single bit, based on inputs
      • (0,1,0,0,1,0,0) -> 1
  • Log consensus: agree on contents and order of events in a log
      • {A, B, Q, R, W, Z} -> [A, Q, R, B, Z]
SLIDE 3

Banks / cryptocurrencies

  • State: account balances
      • Alice: $100
      • Bob: $200
      • Charlie: $50
  • Events: transactions
      • Alice pays Bob $20
      • Charlie pays Alice $50
      • Charlie pays Bob $50
SLIDE 4

Databases (e.g., enrollment)

  • State: database tables
      • Classes:
          • Alice: CS425, CS438
          • Bob: CS425, CS411
          • Charlie: ECE428, ECE445
      • Rooms:
          • CS425: DCL1320
          • ECE445: ECEB3013
  • Events: transactions
      • Alice drops CS425
      • Bob switches to 3 credits
      • Charlie signs up for CS438
      • ECE445 moves to ECEB1013
SLIDE 5

Filesystems

  • State: all files on the system
      • Midterm.tex
      • HW2-solutions.tex
      • Assignments.html
  • Events: updates
      • Save midterm solutions to midterm-solutions.tex
      • Append MP2 to Assignments.html
      • Delete exam-draft.tex
SLIDE 6

State machines

  • State: complete state of a program
  • Events: messages received
  • Assumption: all state machines are deterministic (see the sketch below)
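
As a minimal illustration of a deterministic state machine, here is a sketch in Python; the KVStateMachine class and its command format are hypothetical, not taken from the slides:

    class KVStateMachine:
        # State: a key-value store. Events: commands received.
        def __init__(self):
            self.state = {}

        def apply(self, command):
            # Deterministic: the next state depends only on the current
            # state and the command, never on timing or randomness.
            op, key, value = command
            if op == "set":
                self.state[key] = value
            elif op == "delete":
                self.state.pop(key, None)   # value is ignored for deletes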

SLIDE 7

Replicated State Machines

  • A state machine can fail, taking the state with it
  • Replicate for
      • Availability: can continue operation even if one SM fails
      • Durability: data is not lost
  • Must ensure:
      • Consistency!
SLIDE 8

Log-based

  • Each replica maintains a log of events (from client(s))
  • Replicas apply events in the log to update their state
  • Same initial state + same order of events in the log => consistent final state (sketched below)
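
A minimal sketch of that invariant, assuming a toy "set" command format (hypothetical, for illustration only):

    def apply_log(state, log):
        # Apply every event, in log order, to the given state.
        for op, key, value in log:
            if op == "set":
                state[key] = value
        return state

    log = [("set", "x", 1), ("set", "y", 2), ("set", "x", 3)]
    replica_a = apply_log({}, log)   # same initial state ({}), same log...
    replica_b = apply_log({}, log)
    assert replica_a == replica_b == {"x": 3, "y": 2}   # ...same final state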

SLIDE 9

Log Consensus

  • All replicas must agree on the order of events in the log
  • Is this possible in asynchronous systems?
SLIDE 10

Log Consensus

  • All replicas must agree on the order of events in the log
  • Is this possible in asynchronous systems?
  • Totally correct implementation impossible (FLP)!
  • Safety
      • Replicas always add events in consistent order
  • Liveness
      • If a majority of nodes is available, they will eventually establish a consistent log order
      • Available = not failed, and not delayed beyond a bound
SLIDE 11

The distributed log (I)

  • Each server stores a log containing commands
  • Consensus algorithm ensures that all logs contain the same commands in the same order
  • State machines always execute commands in the log order
  • They will remain consistent as long as command executions have deterministic results

SLIDE 12

The distributed log (II)

SLIDE 13

The distributed log (III)

  • Client sends a command to one of the servers
  • Server adds the command to its log
  • Server forwards the new log entry to the other servers
  • Once a consensus has been reached, each server's state machine processes the command and sends its reply to the client

SLIDE 14

Paxos

Recent archaeological discoveries on the island of Paxos reveal that the parliament functioned despite the peripatetic propensity of its part-time legislators. The legislators maintained consistent copies of the parliamentary record, despite their frequent forays from the chamber and the forgetfulness of their messengers. The Paxon parliament’s protocol provides a new way of implementing the state-machine approach to the design of distributed systems — an approach that has received limited attention because it leads to designs of insufficient complexity.

SLIDE 15

Paxos Timeline

  • 1989: Lamport wrote a 42-page (!) DEC technical report
  • 1990: Submitted to and rejected from ACM Transactions on Computer Systems
  • 1998: The original paper is resubmitted and accepted by TOCS
  • 2001: Lamport publishes “Paxos made simple” in ACM SIGACT News
  • 2007: T. D. Chandra, R. Griesemer, J. Redstone. Paxos made live: an engineering perspective. PODC 2007, Portland, Oregon.

SLIDE 16

Paxos

  • Google uses the Paxos algorithm in their Chubby distributed lock service. Chubby is used by BigTable, which is now in production in Google Analytics and other products
  • Amazon Web Services uses the Paxos algorithm extensively to power its platform
  • Windows Fabric, used by many of the Azure services, makes use of the Paxos algorithm for replication between nodes in a cluster
  • Neo4j HA graph database implements Paxos, replacing Apache ZooKeeper used in previous versions
  • Apache Mesos uses the Paxos algorithm for its replicated log coordination
SLIDE 17

Paxos limitations (I)

  • Exceptionally difficult to understand

“The dirty little secret of the NSDI* community is that at most five people really, truly understand every part of Paxos ;-).” – Anonymous NSDI reviewer

*The USENIX Symposium on Networked Systems Design and Implementation

SLIDE 18

Paxos limitations (II)

  • Very difficult to implement

“There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system…the final system will be based on an unproven protocol.” – Chubby authors

SLIDE 19

Designing for understandability

  • Main objective of RAFT
      • Whenever possible, select the alternative that is the easiest to understand
  • Techniques that were used include
      • Dividing problems into smaller problems
      • Reducing the number of system states to consider
          • Could logs have holes in them? No
SLIDE 20

Raft consensus algorithm (I)

  • Servers start by electing a leader
      • Sole server entitled to accept commands from clients
      • Will enter them in its log and forward them to other servers
      • Will tell them when it is safe to apply these log entries to their state machines
SLIDE 21

Raft consensus algorithm (II)

  • Decomposes the problem into three fairly independent subproblems
      • Leader election: how servers will pick a single leader
      • Log replication: how the leader will accept log entries from clients, propagate them to the other servers, and ensure their logs remain in a consistent state
      • Safety
SLIDE 22

Avoiding split elections

  • Raft uses randomized election timeouts
      • Chosen randomly from a fixed interval
      • Increases the chances that a single follower will detect the loss of the leader before the others (see the sketch below)
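
A sketch of this idea; the 150-300 ms interval below matches the range discussed in the Raft paper, but any fixed interval works:

    import random

    ELECTION_TIMEOUT_MIN = 0.150   # seconds
    ELECTION_TIMEOUT_MAX = 0.300

    def new_election_timeout():
        # Each follower draws its own timeout, so one of them usually
        # fires well before the others and can win the election before
        # a competing candidate appears.
        return random.uniform(ELECTION_TIMEOUT_MIN, ELECTION_TIMEOUT_MAX)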

SLIDE 23

Example

[Figure: timeline of a leader and two followers; after the last heartbeat the leader fails (X), election timeouts run, and the follower with the shortest timeout becomes the new leader]

SLIDE 24

Log replication

  • Leaders
      • Accept client commands
      • Append them to their log (new entry)
      • Issue AppendEntry RPCs in parallel to all followers
      • Apply the entry to their state machine once it has been safely replicated
          • Entry is then committed (see the sketch below)
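
A sketch of these leader-side steps. The Leader class and the send_append_entries stub are hypothetical illustrations of the bookkeeping, not the paper's exact interfaces, and a real leader issues the RPCs in parallel:

    class Leader:
        def __init__(self, term, followers):
            self.current_term = term
            self.followers = followers
            self.log = []            # entries of the form {"term": ..., "command": ...}
            self.commit_index = -1

    def send_append_entries(follower, entries):
        # Stub standing in for the AppendEntry RPC; returns the
        # follower's success flag.
        return True

    def on_client_command(leader, command):
        entry = {"term": leader.current_term, "command": command}
        leader.log.append(entry)                  # append to the leader's own log
        acks = 1                                  # the leader counts itself
        for f in leader.followers:
            if send_append_entries(f, [entry]):
                acks += 1
        if acks * 2 > len(leader.followers) + 1:  # safely replicated on a majority
            leader.commit_index = len(leader.log) - 1   # the entry is committed
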
SLIDE 25

A client sends a request

  • Leader stores request on its log and forwards it to its followers

[Figure: the client sends a request to the leader; the leader and both followers each hold a log and a state machine]
SLIDE 26

The followers receive the request

  • Followers store the request on their logs and acknowledge its receipt

[Figure: both followers append the request to their logs and send acknowledgments back to the leader]

SLIDE 27

The leader tallies followers' ACKs

  • Once it ascertains the request has been processed by a majority of the servers, it updates its state machine

[Figure: the leader, having tallied a majority of ACKs, applies the request to its state machine]

SLIDE 28

The leader tallies followers' ACKs

  • Leader's heartbeats convey the news to its followers: they update their state machines

[Figure: on the next heartbeats, both followers apply the request to their state machines]

SLIDE 29

Log organization

[Figure: log organization across the servers; colors identify terms]

SLIDE 30

Handling slow followers, …

  • Leader reissues the AppendEntry RPC
  • They are idempotent
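
A sketch of why the retry is safe, assuming a toy log of {"term": ...} entries: re-delivering an entry the follower already holds leaves its log unchanged.

    def append_at(log, index, entry):
        # Idempotent append: if the follower already has this entry
        # (same index, same term), the reissued RPC is a no-op.
        if index < len(log) and log[index]["term"] == entry["term"]:
            return
        del log[index:]            # otherwise drop any conflicting suffix
        log.append(entry)

    log = []
    entry = {"term": 1, "command": "x=1"}
    append_at(log, 0, entry)
    append_at(log, 0, entry)       # the reissued RPC
    assert log == [entry]          # log unchanged by the retry
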
SLIDE 31

Committed entries

  • Guaranteed to be both
      • Durable
      • Eventually executed by all the available state machines
  • Committing an entry also commits all previous entries
      • All AppendEntry RPCs, including heartbeats, include the index of the leader's most recently committed entry (see the sketch below)
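
A sketch of how a follower uses that piggybacked index; the Follower fields and the apply hook are hypothetical bookkeeping names:

    class Follower:
        def __init__(self):
            self.log = []
            self.commit_index = -1
            self.last_applied = -1

    def apply(entry):
        pass   # stub: hand the command to the state machine

    def on_append_entries(follower, leader_commit):
        # The leader's commit index arrives with every AppendEntry RPC,
        # heartbeats included. Committing an entry commits everything
        # before it, so entries are applied strictly in log order.
        follower.commit_index = min(leader_commit, len(follower.log) - 1)
        while follower.last_applied < follower.commit_index:
            follower.last_applied += 1
            apply(follower.log[follower.last_applied])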

SLIDE 32

Why?

  • Raft commits entries in strictly sequential order
  • Requires followers to accept log entry appends in the same sequential order
      • Cannot "skip" entries

Greatly simplifies the protocol

SLIDE 33

Raft log matching property

  • If two entries in different logs have the same index and term
      • These entries store the same command
      • All previous entries in the two logs are identical
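
This property is enforced by a consistency check on every AppendEntry call: the leader names the index and term of the entry that precedes the new ones, and the follower rejects the call on any mismatch. A minimal sketch, using 0-based indices with -1 meaning "before the first entry":

    def prev_entry_matches(log, prev_index, prev_term):
        # Accept the append only if our log holds the leader's previous
        # entry with the same term; by induction, everything before it
        # is then identical as well.
        if prev_index == -1:
            return True                 # appending at the very start of the log
        return (prev_index < len(log) and
                log[prev_index]["term"] == prev_term)
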
SLIDE 34

Handling leader crashes (I)

  • Can leave the cluster in an inconsistent state if the old leader had not fully replicated a previous entry
      • Some followers may have in their logs entries that the new leader does not have
      • Other followers may miss entries that the new leader has
SLIDE 35

Handling leader crashes (II)

[Figure: the servers' logs at the start of a new term]

SLIDE 36

An election starts

  • Candidate for leader position requests votes of other former followers
      • Includes a summary of the state of its log

[Figure: the candidate sends vote requests, with its log credentials, to the other servers]

SLIDE 37

Former followers reply

  • Former followers compare the state of their logs with the credentials of the candidate
  • Vote for candidate unless
      • Their own log is more "up to date"
      • They have already voted for another server

[Figure: each former follower checks the candidate's credentials against its own log before answering]

SLIDE 38

Handling leader crashes (III)

  • Raft's solution is to let the new leader force the followers' logs to duplicate its own
      • Conflicting entries in followers' logs will be overwritten
SLIDE 39

The new leader is in charge

  • Newly elected leader forces all its followers to duplicate in their logs the contents of its own log

[Figure: the new leader pushes its log out to both followers]

SLIDE 40

How? (I)

  • Leader maintains a nextIndex for each follower
      • Index of the next entry it will send to that follower
  • New leader sets its nextIndex to the index just after its last log entry
      • 11 in the example
  • Broadcasts it to all its followers
SLIDE 41

How? (II)

  • Followers that have missed some AppendEntry calls will refuse all further AppendEntry calls
  • Leader will decrement its nextIndex for that follower and redo the previous AppendEntry call
      • Process will be repeated until a point where the logs of the leader and the follower match
      • Will then send to the follower all the log entries it missed (see the sketch below)
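
A sketch of this repair loop; follower_accepts and send_entries are hypothetical stubs standing in for AppendEntry RPCs whose consistency check may fail:

    def follower_accepts(follower, next_index):
        # Stub: a real leader sends AppendEntries naming the entry that
        # precedes next_index and reads back the follower's answer.
        return True

    def send_entries(follower, entries):
        pass                           # stub for the actual AppendEntry RPC

    def repair(leader_log, follower, next_index):
        # Back up one entry at a time until the follower's log matches,
        # then ship every entry it missed in a single call.
        while next_index > 0 and not follower_accepts(follower, next_index):
            next_index -= 1
        send_entries(follower, leader_log[next_index:])
        return len(leader_log)         # new nextIndex for this follower
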
SLIDE 42

How? (III)

  • By successive trial and error, the leader finds out that the first log entry that follower (b) will accept is log entry 5
  • It then forwards to (b) log entries 5 to 10
SLIDE 43

Interesting question

  • How will the leader know which log entries it can commit?
      • Cannot always gather a majority, since some of the replies were sent to the old leader
  • Fortunately for us, any follower accepting an AppendEntry RPC implicitly acknowledges it has processed all previous AppendEntry RPCs
      • Followers' logs cannot skip entries

SLIDE 44

A last observation

  • Handling log inconsistencies does not require a special sub-algorithm
  • Rolling back AppendEntry calls is enough
SLIDE 45

Safety

  • Two main issues
      • What if the log of a new leader did not contain all previously committed entries?
          • Must impose conditions on new leaders
      • How to commit entries from a previous term?
          • Must tune the commit mechanism
SLIDE 46

Election restriction (I)

  • The log of any new leader must contain all previously committed entries
      • Candidates include in their RequestVote RPCs information about the state of their log
          • Details in the paper
      • Before voting for a candidate, servers check that the log of the candidate is at least as up to date as their own log (see the sketch below)
      • Majority rule does the rest
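
A sketch of the "at least as up to date" comparison (per the paper, the later last-entry term wins, and on a tie the longer log wins); Python tuple comparison expresses this directly:

    def candidate_is_up_to_date(my_last_term, my_last_index,
                                cand_last_term, cand_last_index):
        # Compare last-entry terms first; break ties on log length.
        return (cand_last_term, cand_last_index) >= (my_last_term, my_last_index)

    # e.g. a voter whose log ends at (term 3, index 7) refuses a
    # candidate whose log ends at (term 2, index 9):
    assert not candidate_is_up_to_date(3, 7, 2, 9)
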
SLIDE 47

Election restriction (II)

[Figure: the set of servers holding the last committed log entry and the set of servers that elected the new leader; two majorities of the same cluster must intersect]

SLIDE 48

Committing entries from a previous term

  • A leader cannot immediately conclude that an entry from a previous term is committed even if it is stored on a majority of servers
      • See next figure
  • Leader should never commit log entries from previous terms by counting replicas
      • Should only do it for entries from the current term (see the sketch below)
      • Once it has been able to do that for one entry, all prior entries are committed indirectly
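
A sketch of this tuned commit rule; match_index is the leader's usual per-follower record of the highest replicated index (hypothetical bookkeeping names):

    def maybe_advance_commit(log, current_term, commit_index, match_index):
        # Scan candidate indices from the end of the log backwards.
        for n in range(len(log) - 1, commit_index, -1):
            if log[n]["term"] != current_term:
                break        # never commit old-term entries by counting replicas
            replicas = 1 + sum(1 for m in match_index.values() if m >= n)
            if replicas * 2 > len(match_index) + 1:    # majority of the cluster
                return n     # committing n indirectly commits all prior entries
        return commit_index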

SLIDE 49

Committing entries from a previous term

SLIDE 50

Explanations

  • In (a) S1 is leader and partially replicates the log entry at index 2.
  • In (b) S1 crashes; S5 is elected leader for term 3 with votes from S3, S4, and itself, and accepts a different entry at log index 2.
  • In (c) S5 crashes; S1 restarts, is elected leader, and continues replication.
      • Log entry from term 2 has been replicated on a majority of the servers, but it is not committed.

SLIDE 51

Explanations

  • If S1 crashes as in (d), S5 could be elected leader (with votes from S2, S3, and S4) and overwrite the entry with its own entry from term 3.
  • However, if S1 replicates an entry from its current term on a majority of the servers before crashing, as in (e), then this entry is committed (S5 cannot win an election).
  • At this point all preceding entries in the log are committed as well.
SLIDE 52

Cluster membership changes

  • Not possible to do an atomic switch
      • Changing the membership of all servers at once
  • Will use a two-phase approach:
      • Switch first to a transitional joint consensus configuration
      • Once the joint consensus has been committed, transition to the new configuration

SLIDE 53

The joint consensus configuration

  • Log entries are transmitted to all servers, old and new
  • Any server can act as leader
  • Agreement for entry commitment and elections requires majorities from both old and new configurations (see the sketch below)
  • Cluster configurations are stored and replicated in special log entries
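
A sketch of that dual-majority rule, in effect while the joint configuration (C_old,new in the paper) is active; the server IDs and acks set are hypothetical:

    def joint_quorum(acks, old_servers, new_servers):
        # During joint consensus, an entry (or an election) needs a
        # majority of the old configuration AND of the new one.
        def majority(servers):
            return sum(1 for s in servers if s in acks) * 2 > len(servers)
        return majority(old_servers) and majority(new_servers)

    # e.g. old = {1, 2, 3}, new = {3, 4, 5}: acks from {2, 3, 4} satisfy both
    assert joint_quorum({2, 3, 4}, {1, 2, 3}, {3, 4, 5})
    assert not joint_quorum({1, 2}, {1, 2, 3}, {3, 4, 5})   # no majority of "new"
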
SLIDE 54

The joint consensus configuration

SLIDE 55

Implementations

  • Two thousand lines of C++ code, not including tests, comments, or blank lines
  • About 25 independent third-party open source implementations in various stages of development

  • Some commercial implementations