

SLIDE 1

Coordinating distributed systems part II

Marko Vukolić Distributed Systems and Cloud Computing

SLIDE 2

Last Time

  • Coordinating distributed systems part I
  • Zookeeper
  • At the heart of Zookeeper is the ZAB atomic broadcast protocol

  • Today
  • Atomic broadcast protocols
  • Paxos and ZAB
  • Very briefly


SLIDE 3

Zookeeper components (high-level)

[Figure: write requests flow through the request processor and ZAB atomic broadcast (as transactions, Tx) into the replicated in-memory DB backed by a commit log; read requests are served directly from the local replica DB.]

SLIDE 4

Atomic broadcast

  • A.k.a. total order broadcast
  • Critical synchronization primitive in many distributed systems
  • Fundamental building block for building replicated state machines


SLIDE 5

Atomic Broadcast (safety)

  • Total Order property
  • Let m and m’ be any two messages.
  • Let pi be any correct process that delivers m without having delivered m’
  • Then no correct process delivers m’ before m
  • Integrity (a.k.a. No creation)
  • No message is delivered unless it was broadcast
  • No duplication
  • No message is delivered more than once
  • ZAB deviates from this


SLIDE 6

State machine replication

  • Think of, e.g., a database
  • Use atomic broadcast to totally order database operations/transactions
  • All database replicas apply updates/queries in the same order
  • Since the database is deterministic, the state of the database is fully replicated
  • Extends to any (deterministic) state machine (see the sketch below)
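To make this concrete, here is a minimal Python sketch (not from the slides) of a replica applying totally ordered commands to a deterministic state machine; all names are illustrative assumptions.

    class Replica:
        def __init__(self):
            self.state = {}          # replicated key-value state
            self.next_seqno = 1      # next command to apply

        def apply(self, command):
            # Deterministic: the same command sequence yields the same state.
            op, key, value = command
            if op == "put":
                self.state[key] = value
            elif op == "delete":
                self.state.pop(key, None)

        def deliver(self, seqno, command):
            # Atomic broadcast guarantees all replicas see the same
            # (seqno, command) pairs; applying them in seqno order keeps
            # every replica's state identical.
            assert seqno == self.next_seqno, "deliver in total order, no gaps"
            self.apply(command)
            self.next_seqno += 1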


SLIDE 7

Consistency of total order

  • Very strong consistency
  • “Single-replica” semantics


SLIDE 8

Atomic broadcast implementations

  • Numerous
  • Paxos [Lamport98, Lamport01] is probably the most celebrated
  • We will cover the basics of Paxos and compare it to ZAB, the atomic broadcast protocol used in Zookeeper


SLIDE 9

Paxos

  • Assume a module that elects a leader within a set of replicas
  • Leader election is only eventually reliable
  • For some time, multiple processes may believe that they are the leader
  • 2f+1 replicas, crash-recovery model
  • At any given point in time, a majority of replicas is assumed to be correct
  • Q: Is Paxos CP or AP?


SLIDE 10

Simplified Paxos

upon tobroadcast(val) by leader
    inc(seqno)
    send ‹[IMPOSE, seqno, val]› to all

upon receive ‹[IMPOSE, seq, v]›
    myestimates[seq] = v
    send ‹[ACK, seq, v]› to all

upon receive ‹[ACK, seq, v]› from majority and myestimates[seq] = v
    ordered[seq] = v

upon exists sno: ordered[sno] ≠ nil and delivered[sno] = nil and forall sno’ < sno: delivered[sno’] ≠ nil
    delivered[sno] = ordered[sno]
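The pseudocode above maps to the following minimal Python sketch (an illustration, not production Paxos); it assumes a send(msg, to=...) primitive provided by the network layer, and all other names are ours.

    from collections import defaultdict

    class SimplifiedPaxos:
        def __init__(self, n, send):
            self.n = n                       # number of replicas (2f+1)
            self.send = send                 # assumed network primitive
            self.seqno = 0                   # leader's sequence counter
            self.myestimates = {}            # seq -> value
            self.acks = defaultdict(set)     # (seq, value) -> set of ackers
            self.ordered = {}                # seq -> decided value
            self.delivered = {}              # seq -> delivered value

        def tobroadcast(self, val):          # executed by the leader
            self.seqno += 1
            self.send(("IMPOSE", self.seqno, val), to="all")

        def on_impose(self, seq, v, me):
            self.myestimates[seq] = v
            self.send(("ACK", seq, v, me), to="all")

        def on_ack(self, seq, v, sender):
            self.acks[(seq, v)].add(sender)
            if (len(self.acks[(seq, v)]) > self.n // 2
                    and self.myestimates.get(seq) == v):
                self.ordered[seq] = v
                self.try_deliver()

        def try_deliver(self):
            # Deliver in sequence order with no holes, per the last rule above.
            sno = len(self.delivered) + 1
            while sno in self.ordered:
                self.delivered[sno] = self.ordered[sno]
                sno += 1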


SLIDE 11


Simplified Paxos Failure-Free Message Flow

[Figure: failure-free message flow. A client C sends a request to the leader S1; the leader sends IMPOSE to all replicas S1…Sn (impose phase); replicas send ACK to all; the client receives a reply.]

SLIDE 12

Simplified Paxos

  • Works fine if:
  • The leader is stable (no multiple processes believe they are the leader)
  • The leader is correct
  • This will actually be the case most of the time
  • Yet there will certainly be times when it is not


SLIDE 13

What if the leader is not stable?

  • Two leaders might compete to propose different commands for the same sequence number
  • The leader might fail without having completed the broadcast
  • This is dangerous: in case of a partition, it cannot be distinguished from the case where the leader completed its part of the broadcast and some replicas already delivered the command while others were partitioned


SLIDE 14

Accounting for multiple leaders

  • Leader failover
  • New leader must learn what the previous leader imposed
  • Multiple leaders
  • Need to distinguish among values imposed by different leaders
  • To this end we use epoch (a.k.a. ballot) numbers
  • Assume these are also output by the leader election module
  • Monotonically increasing


SLIDE 15

Multi-leader Paxos: Impose phase

upon tobroadcast(val) by leader
    inc(seqno)
    send ‹[IMPOSE, seqno, epoch, val]› to all

upon receive ‹[IMPOSE, seq, epoch, v]›
    if lastKnownEpoch <= epoch
        myestimates[seq] = <v, epoch>
        send ‹[ACK, seq, epoch, v]› to all

upon receive ‹[ACK, seq, epoch, v]› from majority and myestimates[seq] = <v, epoch>
    ordered[seq] = v
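Relative to the simplified sketch earlier, only the handlers change; a hedged Python rendering of the pseudocode above (illustrative names, meant to be added to the SimplifiedPaxos sketch):

    def on_impose(self, seq, epoch, v, me):
        # Ignore IMPOSEs from leaders with stale epochs; lastKnownEpoch
        # itself is advanced in the read phase (next slides).
        if self.last_known_epoch <= epoch:
            self.myestimates[seq] = (v, epoch)
            self.send(("ACK", seq, epoch, v, me), to="all")

    def on_ack(self, seq, epoch, v, sender):
        self.acks[(seq, epoch, v)].add(sender)
        if (len(self.acks[(seq, epoch, v)]) > self.n // 2
                and self.myestimates.get(seq) == (v, epoch)):
            self.ordered[seq] = v
            self.try_deliver()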


SLIDE 16

Read phase

  • Need read phase as well
  • For leader failover
  • New leader must learn what previous leader(s) left over and pick up from there
  • Additional latency
  • Upside: need to do the read phase only once per leader change


SLIDE 17

Read phase

upon elected leader
    send ‹[READ, epoch]› to all

upon receive ‹[READ, epoch]› from p
    if lastKnownEpoch < epoch
        lastKnownEpoch = epoch
        send ‹[GATHER, epoch, myestimates]› to p

upon receive GATHER messages from majority (at p)
    foreach seqno: select the val in myestimates[seqno] with the highest epoch number
    for other (missing) seqno: select noop
    proceed to impose phase for all seqno
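The GATHER aggregation step is the subtle part; a minimal Python sketch (illustrative: gathered is a list of myestimates dicts mapping seqno -> (value, epoch) received from a majority, and NOOP is an assumed no-op command):

    NOOP = "noop"

    def pick_values(gathered, max_seqno):
        # For each slot, adopt the value reported with the highest epoch;
        # fill slots nobody reported (holes) with a no-op.
        chosen = {}  # seqno -> (value, epoch)
        for estimates in gathered:
            for seqno, (value, epoch) in estimates.items():
                if seqno not in chosen or epoch > chosen[seqno][1]:
                    chosen[seqno] = (value, epoch)
        return {s: chosen[s][0] if s in chosen else NOOP
                for s in range(1, max_seqno + 1)}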


SLIDE 18


Paxos Leader failover Message Flow

[Figure: leader-failover message flow. The new leader sends READ to all replicas S1…Sn and replicas answer with GATHER (read phase); the leader then sends IMPOSE to all, replicas send ACK, and the client C receives a reply (impose phase).]

SLIDE 19

Paxos

  • This completes the high-level pseudocode of Paxos
  • Implements atomic broadcast
  • Noop fills holes


SLIDE 20

Implementing Paxos

  • [Chandra07]
  • Google Paxos implementation for Chubby lock service
  • Implementing Paxos is much more difficult than the 2-page pseudocode suggests
  • “our complete implementation contains several thousand lines of C++ code”


SLIDE 21

Some of the engineering concerns

  • Crash recovery
  • Database snapshots
  • Operator errors
  • e.g., give the wrong address of just one node in the cluster: Paxos will mask it, but will then effectively tolerate only f-1 failures

  • Adapting to the higher level spec
  • In Google case of the Chubby spec
  • Handling disk corruption
  • Replica is correct but disk is corrupted
  • And a few more…


SLIDE 22

Example: Corrupted disks

  • A replica with a corrupted disk rebuilds its state as follows
  • It participates in Paxos as a non-voting member;
  • meaning that it uses the catch-up mechanism to catch up but does not respond with GATHER/ACK messages
  • It remains in this state until it observes one complete instance of Paxos that was started after the replica started rebuilding its state
  • Waiting for this extra instance of Paxos ensures that the replica could not have reneged on an earlier promise


SLIDE 23

ZAB

  • ZAB is the atomic broadcast protocol used in Zookeeper
  • It is a variant of Paxos
  • Differences
  • ZAB implements leader order as well
  • Based on the observation that commands proposed by the same leader might have causal dependencies
  • Paxos does not account for this


SLIDE 24

Leader order

  • Local leader order
  • If a leader broadcasts a message m before it broadcasts m’, then a process that delivers m’ must deliver m before m’
  • Global leader order
  • Let mi and mj be two messages broadcast as follows:
  • A leader i broadcasts mi in epoch ei
  • A leader j in epoch ej > ei broadcasts mj
  • Then, if a process p delivers both mj and mi, p must deliver mi before mj (see the checker sketch below)
  • Paxos does not implement leader order
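A minimal Python checker for global leader order (not from the slides; broadcast_epoch maps each message to the epoch it was broadcast in, delivered is one process's delivery sequence, both illustrative):

    def violates_global_leader_order(delivered, broadcast_epoch):
        # A violation means some message was delivered after a message
        # that was broadcast in a strictly higher epoch.
        highest_epoch_seen = -1
        for m in delivered:
            epoch = broadcast_epoch[m]
            if epoch < highest_epoch_seen:
                return True   # an earlier-epoch message was delivered too late
            highest_epoch_seen = epoch
        return False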


SLIDE 25

Leader order and Paxos

  • Assume 26 commands are properly ordered
  • Assume 3 replicas
  • A leader l1 starts epoch 126
  • Learns nothing about commands after 26
  • Imposes A as 27th command and B as 28th command
  • These IMPOSE messages reach only one replica (l1)
  • Then leader l2 starts epoch 127
  • Learns nothing about commands after 26
  • Imposes C as 27th command
  • These IMPOSE messages reach only l2 and l3


SLIDE 26

Leader order and Paxos

  • Then leader l3 starts epoch 128
  • Only l1 and l3 are alive
  • l3 will impose C as 27th command and B as 28th command
  • But l1 did impose A as 27th command before it imposed B as 28th command
  • Leader order violation
  • Sketch these executions


SLIDE 27

Further reading (optional)

Flavio Paiva Junqueira, Benjamin C. Reed, Marco Serafini: Zab: High-performance broadcast for primary-backup systems. DSN 2011: 245-256

Tushar Deepak Chandra, Robert Griesemer, Joshua Redstone: Paxos made live: an engineering perspective. PODC 2007: 398-407

Leslie Lamport: Paxos made simple. SIGACT News (2001)

Leslie Lamport: The Part-Time Parliament. ACM Trans. Comput. Syst. 16(2): 133-169 (1998)


SLIDE 28

Exercise: Read/Write Locks

WriteLock(filename)

1: myLock = create(filename + “/write-”, “”, EPHEMERAL & SEQUENTIAL)
2: C = getChildren(filename, false)
3: if myLock is the lowest znode in C then return
4: else
5:     precLock = znode in C ordered just before myLock
6:     if exists(precLock, true)
7:         wait for precLock watch
8:     goto 2


SLIDE 29

Exercise: Read/Write Locks

ReadLock(filename)

1: myLock = create(filename + “/read-”, “”, EPHEMERAL & SEQUENTIAL)
2: C = getChildren(filename, false)
3: if no “/write-” znode in C then return
4: else
5:     precLock = “/write-” znode in C ordered just before myLock
6:     if exists(precLock, true)
7:         wait for precLock watch
8:     goto 2

Release(filename)

delete(myLock)
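For comparison, a hedged sketch of WriteLock/Release using the kazoo Python client (an assumption on our part; the slides use the raw ZooKeeper API, and in practice kazoo also ships a ready-made Lock recipe). Error handling and connection-loss recovery are omitted.

    import threading
    from kazoo.client import KazooClient

    def write_lock(zk: KazooClient, filename: str) -> str:
        zk.ensure_path(filename)
        # 1: ephemeral + sequential znode, e.g. filename + "/write-0000000007"
        my_lock = zk.create(filename + "/write-", b"",
                            ephemeral=True, sequence=True)
        my_name = my_lock.rsplit("/", 1)[1]
        while True:
            # 2: list contenders; zero-padded sequence suffixes sort correctly
            children = sorted(zk.get_children(filename),
                              key=lambda c: c.rsplit("-", 1)[1])
            # 3: the lowest znode holds the lock
            if my_name == children[0]:
                return my_lock
            # 5: watch only the znode just before ours (avoids a thundering herd)
            prec = children[children.index(my_name) - 1]
            event = threading.Event()
            # 6-7: if the predecessor still exists, wait for its watch to fire
            if zk.exists(filename + "/" + prec, watch=lambda _: event.set()):
                event.wait()
            # 8: goto 2

    def release(zk: KazooClient, my_lock: str):
        zk.delete(my_lock)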
