Raft: A Consensus Algorithm for Replicated Logs
Diego Ongaro and John Ousterhout
Stanford University

Goal: Replicated Log
[Figure: clients send commands (e.g. shl) to the servers; each server contains a consensus module, a log (add, jmp, mov, shl), and a state machine]
● Replicated log => replicated state machine
  ○ All servers execute same commands in same order
● Consensus module ensures proper log replication
● System makes progress as long as any majority of servers are up
● Failure model: fail-stop (not Byzantine), delayed/lost messages
March 3, 2013 Raft Consensus Algorithm Slide 2
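A minimal sketch of why identical logs give identical state machines, assuming a toy integer register as the state machine. The function name `apply_log` and the operands attached to `mov`, `add`, and `shl` are illustrative assumptions; the slide only names the commands.

```python
def apply_log(log):
    """Apply every (op, arg) command in order to a toy integer register."""
    state = 0
    for op, arg in log:
        if op == "mov":
            state = arg          # overwrite the register
        elif op == "add":
            state += arg         # add to the register
        elif op == "shl":
            state <<= arg        # shift the register left
        else:
            raise ValueError(f"unknown command: {op}")
    return state


# Three servers hold the same log, so a deterministic machine ends in the same state.
log = [("mov", 3), ("add", 4), ("shl", 1)]
servers = {name: list(log) for name in ("s1", "s2", "s3")}
states = {name: apply_log(entries) for name, entries in servers.items()}

# The same commands in a different order generally give a different result.
reordered = apply_log([("add", 4), ("mov", 3), ("shl", 1)])
```

The point of the sketch is that order matters: agreement on the log's contents and order is what the consensus module has to provide.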

Approaches to Consensus
Two general approaches to consensus:
● Symmetric, leader-less:
  ○ All servers have equal roles
  ○ Clients can contact any server
● Asymmetric, leader-based:
  ○ At any given time, one server is in charge, others accept its decisions
  ○ Clients communicate with the leader
● Raft uses a leader:
  ○ Decomposes the problem (normal operation, leader changes)
  ○ Simplifies normal operation (no conflicts)
  ○ More efficient than leader-less approaches

Raft Overview
1. Leader election:
  ○ Select one of the servers to act as leader
  ○ Detect crashes, choose new leader
2. Normal operation (basic log replication)
3. Safety and consistency after leader changes
4. Neutralizing old leaders
5. Client interactions
  ○ Implementing linearizable semantics
6. Configuration changes:
  ○ Adding and removing servers

Server States
● At any given time, each server is either:
  ○ Leader: handles all client interactions, log replication
    ○ At most 1 viable leader at a time
  ○ Follower: completely passive (issues no RPCs, responds to incoming RPCs)
  ○ Candidate: used to elect a new leader
● Normal operation: 1 leader, N-1 followers
[Figure: state diagram. start -> Follower; Follower -> Candidate (timeout, start election); Candidate -> Candidate (timeout, new election); Candidate -> Leader (receive votes from majority of servers); Candidate -> Follower (discover current server or higher term); Leader -> Follower ("step down": discover server with higher term)]
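The transitions named on this slide can be sketched as a small lookup table. The event names (`timeout`, `majority_votes`, and so on) and the helper `next_role` are hypothetical labels chosen for this sketch, not identifiers from the slides.

```python
from enum import Enum


class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


# (current role, event) -> next role, one entry per arrow in the state diagram.
TRANSITIONS = {
    (Role.FOLLOWER, "timeout"): Role.CANDIDATE,            # start election
    (Role.CANDIDATE, "timeout"): Role.CANDIDATE,           # new election
    (Role.CANDIDATE, "majority_votes"): Role.LEADER,
    (Role.CANDIDATE, "current_leader_or_higher_term"): Role.FOLLOWER,
    (Role.LEADER, "higher_term"): Role.FOLLOWER,           # step down
}


def next_role(role, event):
    """Return the next role; an event with no arrow leaves the role unchanged."""
    return TRANSITIONS.get((role, event), role)
```

Keeping the transitions in one table makes it easy to see that a follower never becomes leader directly: it has to pass through the candidate state.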

Terms
[Figure: timeline divided into Term 1 through Term 5; a term starts with an election followed by normal operation; a split vote leaves a term with no leader]
● Time divided into terms:
  ○ Election
  ○ Normal operation under a single leader
● At most 1 leader per term
● Some terms have no leader (failed election)
● Each server maintains current term value
● Key role of terms: identify obsolete information
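A minimal sketch of how a term number can flag obsolete information, assuming every message carries the sender's term. The function `observe_term` is a hypothetical helper, not a name from the slides.

```python
def observe_term(current_term, message_term):
    """Compare an incoming message's term with the local term.

    Returns (new_current_term, message_is_stale).
    """
    if message_term < current_term:
        # The sender is behind: its information is obsolete.
        return current_term, True
    # Same or newer term: accept it and catch up if the term is newer.
    return max(current_term, message_term), False
```

For example, a server in term 3 treats a message from term 2 as stale, while a message from term 5 moves the server forward to term 5.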

Raft Protocol Summary

Followers
• Respond to RPCs from candidates and leaders.
• Convert to candidate if election timeout elapses without either:
  • Receiving valid AppendEntries RPC, or
  • Granting vote to candidate

Candidates
• Increment currentTerm, vote for self
• Reset election timeout
• Send RequestVote RPCs to all other servers, wait for either:
  • Votes received from majority of servers: become leader
  • AppendEntries RPC received from new leader: step down
  • Election timeout elapses without election resolution: increment term, start new election
  • Discover higher term: step down

Leaders
• Initialize nextIndex for each to last log index + 1
• Send initial empty AppendEntries RPCs (heartbeat) to each follower; repeat during idle periods to prevent election timeouts
• Accept commands from clients, append new entries to local log
• Whenever last log index ≥ nextIndex for a follower, send AppendEntries RPC with log entries starting at nextIndex, update nextIndex if successful
• If AppendEntries fails because of log inconsistency, decrement nextIndex and retry
• Mark log entries committed if stored on a majority of servers and at least one entry from current term is stored on a majority of servers
• Step down if currentTerm changes

Persistent State
Each server persists the following to stable storage synchronously before responding to RPCs:
  currentTerm    latest term server has seen (initialized to 0 on first boot)
  votedFor       candidateId that received vote in current term (or null if none)
  log[]          log entries

Log Entry
  term           term when entry was received by leader
  index          position of entry in the log
  command        command for state machine

RequestVote RPC
Invoked by candidates to gather votes.
Arguments:
  candidateId    candidate requesting vote
  term           candidate's term
  lastLogIndex   index of candidate's last log entry
  lastLogTerm    term of candidate's last log entry
Results:
  term           currentTerm, for candidate to update itself
  voteGranted    true means candidate received vote
Implementation:
1. If term > currentTerm, currentTerm ← term (step down if leader or candidate)
2. If term == currentTerm, votedFor is null or candidateId, and candidate's log is at least as complete as local log, grant vote and reset election timeout

AppendEntries RPC
Invoked by leader to replicate log entries and discover inconsistencies; also used as heartbeat.
Arguments:
  term           leader's term
  leaderId       so follower can redirect clients
  prevLogIndex   index of log entry immediately preceding new ones
  prevLogTerm    term of prevLogIndex entry
  entries[]      log entries to store (empty for heartbeat)
  commitIndex    last entry known to be committed
Results:
  term           currentTerm, for leader to update itself
  success        true if follower contained entry matching prevLogIndex and prevLogTerm
Implementation:
1. Return if term < currentTerm
2. If term > currentTerm, currentTerm ← term
3. If candidate or leader, step down
4. Reset election timeout
5. Return failure if log doesn’t contain an entry at prevLogIndex whose term matches prevLogTerm
6. If existing entries conflict with new entries, delete all existing entries starting with first conflicting entry
7. Append any new entries not already in the log
8. Advance state machine with newly committed entries
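The two RequestVote implementation steps from the summary can be sketched as follows. This is a sketch under stated assumptions, not a complete server: the `Server` class, its field names, and the 1-based indexing are illustrative. The summary does not define "at least as complete" at this point; the sketch assumes the usual Raft rule of comparing the last entry's term first and its index second. Persistence and the election timer are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Server:
    current_term: int = 0
    voted_for: Optional[str] = None
    log: List[Tuple[int, str]] = field(default_factory=list)  # (term, command)
    role: str = "follower"

    def last_log(self):
        """Return (last_log_index, last_log_term); (0, 0) for an empty log."""
        if not self.log:
            return 0, 0
        return len(self.log), self.log[-1][0]

    def request_vote(self, candidate_id, term, last_log_index, last_log_term):
        """Handle a RequestVote RPC; returns (term, vote_granted)."""
        # Step 1: adopt a newer term and step down. votedFor is per term,
        # so it is cleared when the term changes.
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
            self.role = "follower"
        # Step 2: grant at most one vote per term, and only to a candidate
        # whose log is at least as complete as ours (assumed comparison:
        # last term first, then last index).
        my_index, my_term = self.last_log()
        log_ok = (last_log_term, last_log_index) >= (my_term, my_index)
        grant = (
            term == self.current_term
            and self.voted_for in (None, candidate_id)
            and log_ok
        )
        if grant:
            self.voted_for = candidate_id
            # A real server would also reset its election timeout here.
        return self.current_term, grant
```

A repeated request from the same candidate in the same term is granted again, which keeps retried RPCs harmless.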

Heartbeats and Timeouts
● Servers start up as followers
● Followers expect to receive RPCs from leaders or candidates
● Leaders must send heartbeats (empty AppendEntries RPCs) to maintain authority
● If electionTimeout elapses with no RPCs:
  ○ Follower assumes leader has crashed
  ○ Follower starts new election
  ○ Timeouts typically 100-500ms
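A minimal sketch of a follower's election timer, assuming time is passed in explicitly as integer milliseconds so the logic can be exercised without a real clock. The class `ElectionTimer` and its method names are hypothetical.

```python
class ElectionTimer:
    """Tracks when a follower should give up on the leader."""

    def __init__(self, timeout_ms, now_ms=0):
        self.timeout_ms = timeout_ms
        self.deadline_ms = now_ms + timeout_ms

    def reset(self, now_ms):
        """Call when an RPC arrives from a leader or candidate."""
        self.deadline_ms = now_ms + self.timeout_ms

    def expired(self, now_ms):
        """True once electionTimeout has elapsed with no RPCs."""
        return now_ms >= self.deadline_ms


# A 300 ms timeout: a heartbeat at t=200 pushes the deadline to t=500.
timer = ElectionTimer(timeout_ms=300)
timer.reset(now_ms=200)
```

As long as heartbeats keep arriving more often than the timeout, `expired` stays false and the follower never starts an election.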

Election Basics
● Increment current term
● Change to Candidate state
● Vote for self
● Send RequestVote RPCs to all other servers, retry until either:
  1. Receive votes from majority of servers:
     ○ Become leader
     ○ Send AppendEntries heartbeats to all other servers
  2. Receive RPC from valid leader:
     ○ Return to follower state
  3. No-one wins election (election timeout elapses):
     ○ Increment term, start new election
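The steps above can be sketched as two small functions: one for what a server does when it starts an election, and one for the three ways an election round can end. Both function names and the dictionary layout are assumptions made for this sketch; no RPCs are actually sent.

```python
def start_election(current_term, self_id):
    """State of a server right after it becomes a candidate."""
    return {
        "term": current_term + 1,   # increment current term
        "role": "candidate",        # change to Candidate state
        "voted_for": self_id,       # vote for self
        "votes": 1,                 # the self-vote counts toward the majority
    }


def election_outcome(cluster_size, votes, heard_valid_leader, timed_out):
    """Decide what a candidate does next; `votes` includes its own vote."""
    if votes > cluster_size // 2:
        return "become_leader"
    if heard_valid_leader:
        return "return_to_follower"
    if timed_out:
        return "start_new_election"
    return "keep_waiting"
```

In a 5-server cluster, a candidate with its own vote plus two others (3 votes) wins; with only 2 votes it keeps waiting until a leader appears or its timeout fires.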

Elections, cont’d
● Safety: allow at most one winner per term
  ○ Each server gives out only one vote per term (persist on disk)
  ○ Two different candidates can’t accumulate majorities in same term
[Figure: a majority of the servers voted for candidate A, so B can’t also get a majority]
● Liveness: some candidate must eventually win
  ○ Choose election timeouts randomly in [T, 2T]
  ○ One server usually times out and wins election before others wake up
  ○ Works well if T >> broadcast time
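A short sketch of both ideas, assuming `T` is given in milliseconds. The helper names `pick_election_timeout` and `majority` are hypothetical; the random draw uses Python's standard `random.uniform`.

```python
import random


def pick_election_timeout(T, rng=random):
    """Draw an election timeout uniformly from [T, 2T]."""
    return rng.uniform(T, 2 * T)


def majority(cluster_size):
    """Smallest number of servers that forms a majority."""
    return cluster_size // 2 + 1


# Safety: two majorities of the same cluster always share at least one server,
# and that server gives out only one vote per term.
overlap_guaranteed = all(2 * majority(n) > n for n in range(1, 100))

# Liveness: each server draws its own timeout, so they rarely fire together.
rng = random.Random(0)
timeouts = [pick_election_timeout(150, rng) for _ in range(5)]
```

Passing a seeded `random.Random` instance, as in the last lines, keeps the draws reproducible in tests.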

Log Structure
[Figure: logs of one leader and four followers; each cell shows the entry's term above its command; a bracket marks the committed entries]
log index   1    2    3    4    5    6    7    8
leader      1    1    1    2    3    3    3    3
            add  cmp  ret  mov  jmp  div  shl  sub
followers   1    1    1    2    3
            add  cmp  ret  mov  jmp
            1    1    1    2    3    3    3    3
            add  cmp  ret  mov  jmp  div  shl  sub
            1    1
            add  cmp
            1    1    1    2    3    3    3
            add  cmp  ret  mov  jmp  div  shl
● Log entry = index, term, command
● Log stored on stable storage (disk); survives crashes
● Entry committed if known to be stored on majority of servers
  ○ Durable, will eventually be executed by state machines
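A minimal sketch of the log entry and the "stored on a majority" test, using the five logs from the slide's figure. The names `LogEntry` and `is_stored_on_majority` are assumptions for this sketch, and it checks only the majority condition stated here; the protocol summary adds a further condition about entries from the current term, which is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    index: int      # position of the entry in the log (1-based)
    term: int       # term when the leader received the entry
    command: str    # command for the state machine


def is_stored_on_majority(index, logs):
    """True if more than half of the given logs hold an entry at `index`."""
    holders = sum(1 for log in logs if len(log) >= index)
    return holders > len(logs) // 2


terms = [1, 1, 1, 2, 3, 3, 3, 3]
commands = ["add", "cmp", "ret", "mov", "jmp", "div", "shl", "sub"]
leader = [LogEntry(i + 1, t, c) for i, (t, c) in enumerate(zip(terms, commands))]

# Leader plus four followers holding 5, 8, 2 and 7 entries, as in the figure.
logs = [leader, leader[:5], leader[:8], leader[:2], leader[:7]]
```

With these logs, index 7 is held by three of five servers, while index 8 is held by only two.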

Normal Operation
● Client sends command to leader
● Leader appends command to its log
● Leader sends AppendEntries RPCs to followers
● Once new entry committed:
  ○ Leader passes command to its state machine, returns result to client
  ○ Leader notifies followers of committed entries in subsequent AppendEntries RPCs
  ○ Followers pass committed commands to their state machines
● Crashed/slow followers?
  ○ Leader retries RPCs until they succeed
● Performance is optimal in common case:
  ○ One successful RPC to any majority of servers
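One way a leader could find the newest entry that has reached a majority is sketched below. It assumes the leader tracks, for each follower, the highest log index known to be stored there; that bookkeeping and the name `highest_majority_index` are assumptions of this sketch. The additional current-term condition from the protocol summary is not modeled.

```python
def highest_majority_index(leader_last_index, follower_stored_indexes):
    """Highest log index stored on a majority, counting the leader itself."""
    stored = sorted([leader_last_index, *follower_stored_indexes], reverse=True)
    # After sorting in descending order, the element at position n // 2 is
    # held by at least n // 2 + 1 servers, which is a majority of n.
    return stored[len(stored) // 2]


# Five servers: the leader has 8 entries, followers have 5, 8, 2 and 7.
commit_candidate = highest_majority_index(8, [5, 8, 2, 7])
```

Slow or crashed followers only lower their own number; they do not hold back the result as long as a majority keeps up.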

Log Consistency
High level of coherency between logs:
● If log entries on different servers have same index and term:
  ○ They store the same command
  ○ The logs are identical in all preceding entries
[Figure: two logs that agree on entries 1-5 and differ at index 6]
log index   1    2    3    4    5    6
term        1    1    1    2    3    3
command     add  cmp  ret  mov  jmp  div

term        1    1    1    2    3    4
command     add  cmp  ret  mov  jmp  sub
● If a given entry is committed, all preceding entries are also committed
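The coherency property can be written as a check over two logs. This is a sketch for illustration, with logs represented as lists of `(term, command)` pairs and the function name `satisfies_log_matching` chosen here; Raft maintains the property by construction rather than by running such a check.

```python
def satisfies_log_matching(log_a, log_b):
    """Check the property for two logs of (term, command) pairs.

    If entries at the same index have the same term, the logs must be
    identical up to and including that index.
    """
    # Scan from the highest shared index down. If the prefix is identical
    # at the highest index where the terms agree, it is identical at every
    # lower index as well.
    for i in range(min(len(log_a), len(log_b)) - 1, -1, -1):
        if log_a[i][0] == log_b[i][0]:
            return log_a[: i + 1] == log_b[: i + 1]
    return True


# The two logs from the slide: they agree through index 5 and differ at 6.
log_1 = [(1, "add"), (1, "cmp"), (1, "ret"), (2, "mov"), (3, "jmp"), (3, "div")]
log_2 = [(1, "add"), (1, "cmp"), (1, "ret"), (2, "mov"), (3, "jmp"), (4, "sub")]
```

The slide's two logs pass the check: their entries at index 6 have different terms, so nothing is claimed about them, and everything before is identical.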

AppendEntries Consistency Check
● Each AppendEntries RPC contains index, term of entry preceding new ones
● Follower must contain matching entry; otherwise it rejects request
● Implements an induction step, ensures coherency
[Figure: two leader/follower pairs; term above command in each cell]
AppendEntries succeeds: matching entry
log index   1    2    3    4    5
leader      1    1    1    2    3
            add  cmp  ret  mov  jmp
follower    1    1    1    2
            add  cmp  ret  mov

AppendEntries fails: mismatch
leader      1    1    1    2    3
            add  cmp  ret  mov  jmp
follower    1    1    1    1
            add  cmp  ret  shl
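The follower side of this check can be sketched as a pure function. Logs and new entries are lists of `(term, command)` pairs with 1-based indexes, and `prev_log_index == 0` is assumed to mean "nothing precedes the new entries"; both conventions, and the function name, are assumptions of this sketch. Deleting conflicting entries follows steps 6 and 7 of the AppendEntries implementation in the protocol summary.

```python
def append_entries(log, prev_log_index, prev_log_term, entries):
    """Follower-side consistency check; returns (success, new_log)."""
    # Reject unless the log holds an entry at prev_log_index with the same term.
    if prev_log_index > 0:
        if len(log) < prev_log_index or log[prev_log_index - 1][0] != prev_log_term:
            return False, log
    new_log = list(log)
    for offset, entry in enumerate(entries):
        pos = prev_log_index + offset          # 0-based slot for this entry
        if pos < len(new_log):
            if new_log[pos][0] != entry[0]:    # conflicting term at this index
                del new_log[pos:]              # drop it and everything after it
                new_log.append(entry)
            # Otherwise the entry is already present; leave it untouched.
        else:
            new_log.append(entry)
    return True, new_log


# The slide's two cases: the leader sends entry 5 (term 3, jmp),
# preceded by index 4 with term 2.
ok_follower = [(1, "add"), (1, "cmp"), (1, "ret"), (2, "mov")]
bad_follower = [(1, "add"), (1, "cmp"), (1, "ret"), (1, "shl")]
ok_result = append_entries(ok_follower, 4, 2, [(3, "jmp")])
bad_result = append_entries(bad_follower, 4, 2, [(3, "jmp")])
```

After the rejection, a leader that backs up one entry and retries with `prev_log_index=3` would succeed against `bad_follower`, replacing its conflicting `shl` entry.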
