
Unicamp MC714 Distributed Systems. Slides by Maarten van Steen, adapted from Distributed Systems, 3rd edition. Chapter 08: Fault Tolerance (introduction to fault tolerance, basic concepts, dependability basics).


  1. Fault tolerance: Process resilience / Failure masking and replication / Groups and failure masking. [Figure: reaching agreement with four processes, of which process 3 is faulty. (a) What each process sends to the others (1, 2, and 4 send their own values; process 3 sends different values x, y, z). (b) The vector each process assembles from what it received, e.g. process 1 got (1, 2, x, 4). (c) The vectors each process receives from the others in the second step; since every nonfaulty process sees a majority of correct values, the faulty process can be outvoted.] 16 / 57

  2. Fault tolerance: Process resilience / Failure masking and replication / Groups and failure masking. [Figure: the same scheme with only three processes, of which process 3 is faulty. (a) What each process sends to the others. (b) What each one got from the others, e.g. process 1 got (1, 2, x). (c) What each one got in the second step; with only two nonfaulty processes there is no majority, so agreement cannot be reached.] 17 / 57

  4. Fault tolerance: Reliable client-server communication / RPC semantics in the presence of failures. Reliable remote procedure calls: what can go wrong? (1) The client is unable to locate the server. (2) The request message from the client to the server is lost. (3) The server crashes after receiving a request. (4) The reply message from the server to the client is lost. (5) The client crashes after sending a request. Two “easy” solutions: for (1) (cannot locate server), just report back to the client; for (2) (request was lost), just resend the message. 18 / 57

  5. Fault tolerance: Reliable client-server communication / RPC semantics in the presence of failures. Reliable RPC: server crash. [Figure: (a) the normal case: the server receives the request, executes it, and replies; (b) the server crashes after executing the request but before replying; (c) the server crashes after receiving the request but before executing it.] Problem: where (a) is the normal case, situations (b) and (c) require different solutions; however, we don't know what happened. Two approaches: At-least-once semantics: the server guarantees it will carry out an operation at least once, no matter what. At-most-once semantics: the server guarantees it will carry out an operation at most once. Server crashes 19 / 57
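
A tiny sketch (not from the slides; names such as reply_cache and do_operation are made up) of how a server can approximate at-most-once semantics: it remembers the IDs of requests it has already executed and answers retransmissions from a reply cache instead of re-executing them. Note that a crash between executing and recording the reply still leaves exactly the ambiguity discussed on the next slide.

    class AtMostOnceServer:
        def __init__(self):
            self.reply_cache = {}                        # request_id -> reply already computed

        def handle_request(self, request_id, operation, args):
            if request_id in self.reply_cache:           # retransmitted request: do not re-execute
                return self.reply_cache[request_id]
            reply = self.do_operation(operation, args)   # execute the operation once
            self.reply_cache[request_id] = reply         # remember the reply before answering
            return reply

        def do_operation(self, operation, args):
            return ('done', operation, args)             # placeholder for the real work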

  6. Fault tolerance: Reliable client-server communication / RPC semantics in the presence of failures. Why fully transparent server recovery is impossible. Three types of events at the server (assume the server is requested to update a document): M: send the completion message; P: complete the processing of the document; C: crash. Six possible orderings (actions between parentheses never take place): (1) M → P → C: crash after reporting completion and after the update. (2) M → C (→ P): crash after reporting completion, but before the update. (3) P → M → C: crash after the update and after reporting completion. (4) P → C (→ M): the update took place, and then a crash before reporting. (5) C (→ P → M): crash before doing anything. (6) C (→ M → P): crash before doing anything. Server crashes 20 / 57

  7. Fault tolerance: Reliable client-server communication / RPC semantics in the presence of failures. Why fully transparent server recovery is impossible. Possible combinations of client reissue strategy and server strategy:

    Client reissue strategy     Server strategy M → P          Server strategy P → M
                                MPC    MC(P)   C(MP)           PMC    PC(M)   C(PM)
    Always                      DUP    OK      OK              DUP    DUP     OK
    Never                       OK     ZERO    ZERO            OK     OK      ZERO
    Only when ACKed             DUP    OK      ZERO            DUP    OK      ZERO
    Only when not ACKed         OK     ZERO    OK              OK     DUP     OK

  OK = document updated once; DUP = document updated twice; ZERO = document not updated at all. No combination of client and server strategy behaves correctly in all cases. Server crashes 21 / 57

  8. Fault tolerance: Reliable client-server communication / RPC semantics in the presence of failures. Reliable RPC: lost reply messages. The real issue: what the client notices is that it is not getting an answer; however, it cannot decide whether this is caused by a lost request, a crashed server, or a lost reply. Partial solution: design the server so that its operations are idempotent, i.e., repeating the same operation has the same effect as carrying it out exactly once. Examples: pure read operations and strict overwrite operations. Many operations are inherently nonidempotent, such as many banking transactions. Lost reply messages 22 / 57
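
A small illustration (mine, not the slides') of the difference: a strict overwrite is idempotent, so a client may safely resend it after a lost reply, whereas an increment-style update is not.

    accounts = {'alice': 100}

    def set_balance(account, amount):      # idempotent: repeating it has no extra effect
        accounts[account] = amount

    def add_to_balance(account, amount):   # nonidempotent: repeating it changes the result
        accounts[account] += amount

    set_balance('alice', 150)
    set_balance('alice', 150)              # retry after a lost reply: balance is still 150
    add_to_balance('alice', 50)
    add_to_balance('alice', 50)            # retry applies the update twice: 250 instead of 200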

  9. Fault tolerance: Reliable client-server communication / RPC semantics in the presence of failures. Reliable RPC: client crash. Problem: the server is doing work and holding resources for nothing (this is called an orphan computation). Solutions: (1) the orphan is killed (or rolled back) by the client when it recovers; (2) the client broadcasts a new epoch number when recovering, and servers kill that client's orphans; (3) require computations to complete within T time units; old ones are simply removed. Client crashes 23 / 57
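
A minimal sketch of the epoch-based variant, under my own naming (OrphanKillingServer and new_epoch are illustrative, not from the slides): after recovering, the client broadcasts a fresh epoch number, and each server discards computations it started for that client under an older epoch.

    class OrphanKillingServer:
        def __init__(self):
            self.computations = []                     # (client_id, epoch, task) currently running

        def start(self, client_id, epoch, task):
            self.computations.append((client_id, epoch, task))

        def new_epoch(self, client_id, new_epoch):
            # Kill (or roll back) orphans: work started for this client in an older epoch.
            self.computations = [(c, e, t) for (c, e, t) in self.computations
                                 if c != client_id or e >= new_epoch]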

  10. Fault tolerance: Reliable group communication / Simple reliable group communication. Intuition: a message sent to a process group G should be delivered to each member of G. Important: make a distinction between receiving and delivering messages. [Figure: the sender and each recipient run a message-handling component on top of the local OS, with group-membership functionality above it; a message is received by the message-handling component and only later delivered to the application.] 24 / 57

  11. Fault tolerance: Reliable group communication Less simple reliable group communication Reliable communication in the presence of faulty processes Group communication is reliable when it can be guaranteed that a message is received and subsequently delivered by all nonfaulty group members. Tricky part Agreement is needed on what the group actually looks like before a received message can be delivered. 25 / 57

  12. Fault tolerance: Reliable group communication / Simple reliable group communication. Reliable communication, but assume nonfaulty processes: reliable group communication now boils down to reliable multicasting: is a message received and delivered to each recipient, as intended by the sender? [Figure: the sender keeps transmitted messages in a history buffer and tags them with sequence numbers. After message M25 is multicast, three receivers have Last = 24 and acknowledge 25, while one receiver still has Last = 23: it missed message 24 and reports this so that 24 can be retransmitted.] 26 / 57
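
A compact sketch of the scheme in the figure (class and method names are mine): the sender numbers messages and keeps them in a history buffer; a receiver delivers messages in order and, when it notices a gap, asks for the missing message to be retransmitted.

    class MulticastSender:
        def __init__(self, group):
            self.group = group
            self.history = {}                          # seq -> message (history buffer)
            self.next_seq = 0

        def multicast(self, msg):
            self.history[self.next_seq] = msg
            for r in self.group:
                r.receive(self, self.next_seq, msg)
            self.next_seq += 1

        def retransmit(self, receiver, seq):
            receiver.receive(self, seq, self.history[seq])

    class MulticastReceiver:
        def __init__(self):
            self.last = -1                             # highest sequence number delivered so far
            self.pending = {}                          # received but not yet deliverable

        def receive(self, sender, seq, msg):
            self.pending[seq] = msg
            while self.last + 1 in self.pending:       # deliver in order, without gaps
                self.last += 1
                print('delivered', self.pending.pop(self.last))
            if self.pending:                           # a gap remains: ask for the missing message
                sender.retransmit(self, self.last + 1)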

  13. Fault tolerance: Reliable group communication / Atomic multicast. Atomic multicast. [Figure: reliable multicast by multiple point-to-point messages in the presence of view changes. P1 joins the group, giving G = {P1, P2, P3, P4}; P3 crashes, giving G = {P1, P2, P4}, and the partial multicast from P3 is discarded; later P3 rejoins, giving G = {P1, P2, P3, P4} again.] Idea: formulate reliable multicasting in the presence of process failures in terms of process groups and changes to group membership. 27 / 57

  14. Fault tolerance: Distributed commit Distributed commit protocols Problem Have an operation being performed by each member of a process group, or none at all. Reliable multicasting: a message is to be delivered to all recipients. Distributed transaction: each local transaction must succeed. 28 / 57

  15. Fault tolerance: Distributed commit / Two-phase commit protocol (2PC). Essence: the client who initiated the computation acts as coordinator; the processes required to commit are the participants. Phase 1a: the coordinator sends VOTE-REQUEST to the participants (also called a pre-write). Phase 1b: when a participant receives VOTE-REQUEST, it returns either VOTE-COMMIT or VOTE-ABORT to the coordinator; if it sends VOTE-ABORT, it aborts its local computation. Phase 2a: the coordinator collects all votes; if all are VOTE-COMMIT, it sends GLOBAL-COMMIT to all participants, otherwise it sends GLOBAL-ABORT. Phase 2b: each participant waits for GLOBAL-COMMIT or GLOBAL-ABORT and handles it accordingly. 29 / 57

  16. Fault tolerance: Distributed commit / 2PC finite state machines. [Figure: coordinator FSM: from INIT, on Commit it sends Vote-request and moves to WAIT; on a Vote-abort it sends Global-abort and moves to ABORT; once all Vote-commit messages are in, it sends Global-commit and moves to COMMIT. Participant FSM: from INIT, on Vote-request it replies Vote-commit and moves to READY, or replies Vote-abort and moves to ABORT; in READY, a Global-abort is acknowledged (ACK) and leads to ABORT, and a Global-commit is acknowledged (ACK) and leads to COMMIT.] 30 / 57

  21. Fault tolerance: Distributed commit / 2PC: failing participant. Analysis: a participant crashes in state S and recovers to S. INIT: no problem, the participant was unaware of the protocol. READY: the participant is waiting to either commit or abort; after recovery it needs to know which state transition to make, so the coordinator's decision must be logged. ABORT: merely make the entry into the abort state idempotent, e.g., removing the workspace of results. COMMIT: also make the entry into the commit state idempotent, e.g., copying the workspace to storage. Observation: when distributed commit is required, having participants use temporary workspaces to keep their results allows for simple recovery in the presence of failures. 31 / 57

  22. Fault tolerance: Distributed commit / 2PC: failing participant. Alternative: when recovery to the READY state is needed, check the state of the other participants; then there is no need to log the coordinator's decision. The recovering participant P contacts another participant Q:

    State of Q    Action by P
    COMMIT        Make transition to COMMIT
    ABORT         Make transition to ABORT
    INIT          Make transition to ABORT
    READY         Contact another participant

  Result: if all participants are in the READY state, the protocol blocks; apparently the coordinator has failed. Note: the protocol prescribes that we need the decision of the coordinator. 32 / 57
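
The table above translates directly into a small decision function (illustrative only; real recovery code would keep asking participants until one of them knows the outcome).

    def decision_from_peer(q_state):
        if q_state in ('COMMIT', 'ABORT'):
            return q_state        # the coordinator's decision is known: follow it
        if q_state == 'INIT':
            return 'ABORT'        # Q never voted, so a global commit is impossible
        return None               # Q is READY as well: contact another participant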

  23. Fault tolerance: Distributed commit / 2PC: failing coordinator. Observation: the real problem is that the coordinator's final decision may not be available for some time (or may actually be lost). Alternative: let a participant P in the READY state time out when it has not received the coordinator's decision; P then tries to find out what other participants know (as discussed). Observation: the essence of the problem is that a recovering participant cannot make a local decision: it depends on other (possibly failed) processes. 33 / 57

  24. Fault tolerance: Distributed commit / Coordinator in Python:

    class Coordinator:

        def run(self):
            yetToReceive = list(participants)
            self.log.info('WAIT')
            self.chan.sendTo(participants, VOTE_REQUEST)
            while len(yetToReceive) > 0:
                msg = self.chan.recvFrom(participants, TIMEOUT)
                if (not msg) or (msg[1] == VOTE_ABORT):
                    self.log.info('ABORT')
                    self.chan.sendTo(participants, GLOBAL_ABORT)
                    return
                else:  # msg[1] == VOTE_COMMIT
                    yetToReceive.remove(msg[0])
            self.log.info('COMMIT')
            self.chan.sendTo(participants, GLOBAL_COMMIT)

  34 / 57

  25. Fault tolerance: Distributed commit / Participant in Python:

    class Participant:

        def run(self):
            msg = self.chan.recvFrom(coordinator, TIMEOUT)
            if (not msg):  # Crashed coordinator - give up entirely
                decision = LOCAL_ABORT
            else:  # Coordinator will have sent VOTE_REQUEST
                decision = self.do_work()
                if decision == LOCAL_ABORT:
                    self.chan.sendTo(coordinator, VOTE_ABORT)
                else:  # Ready to commit, enter READY state
                    self.chan.sendTo(coordinator, VOTE_COMMIT)
                    msg = self.chan.recvFrom(coordinator, TIMEOUT)
                    if (not msg):  # Crashed coordinator - check the others
                        self.chan.sendTo(all_participants, NEED_DECISION)
                        while True:
                            msg = self.chan.recvFromAny()
                            if msg[1] in [GLOBAL_COMMIT, GLOBAL_ABORT, LOCAL_ABORT]:
                                decision = msg[1]
                                break
                    else:  # Coordinator came to a decision
                        decision = msg[1]

            while True:  # Help any other participant when coordinator crashed
                msg = self.chan.recvFrom(all_participants)
                if msg[1] == NEED_DECISION:
                    self.chan.sendTo([msg[0]], decision)

  35 / 57

  26. Fault tolerance: Distributed commit / Three-phase commit. Model (again, the client acts as coordinator): Phase 1a: the coordinator sends vote-request to the participants. Phase 1b: when a participant receives vote-request, it returns either vote-commit or vote-abort to the coordinator; if it sends vote-abort, it aborts its local computation. Phase 2a: the coordinator collects all votes; if all are vote-commit, it sends prepare-commit to all participants, otherwise it sends global-abort and halts. Phase 2b: each participant waits for prepare-commit, or waits for global-abort after which it halts. Phase 3a: (prepare to commit) the coordinator waits until all participants have sent ready-commit, and then sends global-commit to all. Phase 3b: (prepare to commit) each participant waits for global-commit. 36 / 57

  27. Fault tolerance: Distributed commit / Three-phase commit. [Figure: finite state machines for 3PC. Coordinator: from INIT, Commit/Vote-request leads to WAIT; Vote-abort/Global-abort to ABORT; Vote-commit/Prepare-commit to PRECOMMIT; Ready-commit/Global-commit to COMMIT. Participant: from INIT, Vote-request/Vote-commit leads to READY (or Vote-request/Vote-abort to ABORT); Global-abort/ACK to ABORT; Prepare-commit/Ready-commit to PRECOMMIT; Global-commit/ACK to COMMIT.] 37 / 57

  28. Fault tolerance: Distributed commit / 3PC: failing participant. Basic issue: can P find out what it should do after crashing in the READY or PRECOMMIT state, even if other participants or the coordinator failed? Reasoning: Essence: on their way to commit, the coordinator and the participants never differ by more than one state transition. Consequence: if a participant times out in the READY state, it can find out from the coordinator or the other participants whether it should abort or enter the PRECOMMIT state. Observation: if a participant already made it to the PRECOMMIT state, it can always safely commit (but it is not allowed to do so on its own, to account for other processes that may have failed). Observation: we may need to elect another coordinator to send off the final COMMIT. 38 / 57

  29. Fault tolerance: Recovery / Introduction. Recovery: background. Essence: when a failure occurs, we need to bring the system into an error-free state. Forward error recovery: find a new state from which the system can continue operation. Backward error recovery: bring the system back into a previous error-free state. Practice: use backward error recovery, which requires that we establish recovery points. Observation: recovery in distributed systems is complicated by the fact that processes need to cooperate in identifying a consistent state from where to recover. 39 / 57

  30. Fault tolerance: Recovery / Checkpointing. Consistent recovery state. Requirement: every message that has been received is also shown to have been sent in the state of the sender. Recovery line: assuming processes regularly checkpoint their state, the most recent consistent global checkpoint. [Figure: processes P1 and P2 take checkpoints over time, starting from the initial state; after a failure of P1, the recovery line is the most recent collection of checkpoints in which no message is recorded as received without having been sent; a message sent from P2 to P1 makes a later collection of checkpoints inconsistent.] 40 / 57

  35. Fault tolerance: Recovery / Checkpointing. Coordinated checkpointing. Essence: each process takes a checkpoint after a globally coordinated action. Simple solution: use a two-phase blocking protocol: (1) a coordinator multicasts a checkpoint-request message; (2) when a participant receives such a message, it takes a checkpoint, stops sending (application) messages, and reports back that it has taken a checkpoint; (3) when all checkpoints have been confirmed at the coordinator, the latter broadcasts a checkpoint-done message to allow all processes to continue. Observation: it is possible to consider only those processes that depend on the recovery of the coordinator, and ignore the rest. Coordinated checkpointing 41 / 57
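
A minimal sketch of this two-phase blocking protocol, written in the style of the earlier 2PC code. The message constants, take_local_checkpoint, and the channel interface (recvFrom returning a (sender, content) pair) are assumptions, not the slides' code.

    CHECKPOINT_REQUEST, CHECKPOINT_TAKEN, CHECKPOINT_DONE = range(3)

    def checkpoint_coordinator(chan, participants):
        chan.sendTo(participants, CHECKPOINT_REQUEST)      # phase 1: ask everyone to checkpoint
        confirmed = set()
        while confirmed != set(participants):
            msg = chan.recvFrom(participants)              # msg = (sender, content)
            if msg[1] == CHECKPOINT_TAKEN:
                confirmed.add(msg[0])
        chan.sendTo(participants, CHECKPOINT_DONE)         # phase 2: everyone may continue

    def checkpoint_participant(chan, coordinator, take_local_checkpoint):
        msg = chan.recvFrom([coordinator])
        if msg[1] == CHECKPOINT_REQUEST:
            take_local_checkpoint()                        # save local state to stable storage
            # ... stop sending application messages until CHECKPOINT_DONE arrives ...
            chan.sendTo([coordinator], CHECKPOINT_TAKEN)
            chan.recvFrom([coordinator])                   # blocks until CHECKPOINT_DONE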

  36. Fault tolerance: Recovery / Checkpointing. Cascaded rollback. Observation: if checkpointing is done at the “wrong” instants, the recovery line may lie at system startup time; we then have a so-called cascaded rollback. [Figure: processes P1 and P2 checkpoint independently, starting from the initial state; after a failure, rolling back past message m forces the other process back as well because of message m*, and so on, cascading back toward the initial state.] Independent checkpointing 42 / 57

  42. Fault tolerance: Recovery / Checkpointing. Independent checkpointing. Essence: each process independently takes checkpoints, with the risk of a cascaded rollback to system startup. Let CP_i(m) denote the m-th checkpoint of process P_i, and INT_i(m) the interval between CP_i(m-1) and CP_i(m). When process P_i sends a message in interval INT_i(m), it piggybacks (i, m). When process P_j receives a message in interval INT_j(n), it records the dependency INT_i(m) → INT_j(n). The dependency INT_i(m) → INT_j(n) is saved to storage when taking checkpoint CP_j(n). Observation: if process P_i rolls back to CP_i(m-1), P_j must roll back to CP_j(n-1). Independent checkpointing 43 / 57
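
A sketch of the bookkeeping just described, with illustrative names: every outgoing message piggybacks the pair (i, m), every incoming message is recorded as a dependency, and the recorded dependencies are written out with the next checkpoint.

    class CheckpointingProcess:
        def __init__(self, pid):
            self.pid = pid
            self.interval = 1          # we are currently in interval INT_pid(1)
            self.deps = []             # dependencies recorded during this interval
            self.stable = []           # checkpoints written to stable storage

        def send(self, dest, payload):
            dest.receive((self.pid, self.interval), payload)   # piggyback (i, m)

        def receive(self, piggyback, payload):
            # Record the dependency INT_i(m) -> INT_pid(n).
            self.deps.append((piggyback, (self.pid, self.interval)))

        def take_checkpoint(self, state):
            # CP_pid(interval) closes the current interval; save its dependencies with it.
            self.stable.append({'checkpoint': (self.pid, self.interval),
                                'state': state, 'deps': list(self.deps)})
            self.interval += 1
            self.deps = []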

  43. Fault tolerance: Recovery Message logging Message logging Alternative Instead of taking an (expensive) checkpoint, try to replay your (communication) behavior from the most recent checkpoint ⇒ store messages in a log. Assumption We assume a piecewise deterministic execution model: The execution of each process can be considered as a sequence of state intervals Each state interval starts with a nondeterministic event (e.g., message receipt) Execution in a state interval is deterministic Conclusion If we record nondeterministic events (to replay them later), we obtain a deterministic execution model that will allow us to do a complete replay. 44 / 57

  44. Fault tolerance: Recovery / Message logging. Message logging and consistency. When should we actually log messages? Avoid orphan processes: process Q has just received and delivered messages m1 and m2; assume that m2 is never logged. After delivering m1 and m2, Q sends message m3 to process R; process R receives and subsequently delivers m3. If Q now crashes and recovers, m2 is never replayed, so neither is m3: R has become an orphan. [Figure: timeline of processes P, Q, and R; m1 is a logged message, m2 is unlogged; after Q crashes and recovers, only m1 can be replayed.] 45 / 57

  45. Fault tolerance: Recovery / Message logging. Message-logging schemes. Notations: DEP(m): the processes to which m has been delivered; if message m* is causally dependent on the delivery of m, and m* has been delivered to Q, then Q ∈ DEP(m). COPY(m): the processes that have a copy of m, but have not (yet) reliably stored it. FAIL: the collection of crashed processes. Characterization: Q is orphaned ⇔ ∃m: Q ∈ DEP(m) and COPY(m) ⊆ FAIL. 46 / 57
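
The characterization can be transcribed almost literally into Python (sets for DEP, COPY, and FAIL; the data below is illustrative and mirrors the previous slide's scenario).

    def is_orphan(Q, messages, DEP, COPY, FAIL):
        # Q is orphaned iff some message it depends on survives only at crashed processes.
        return any(Q in DEP[m] and COPY[m] <= FAIL for m in messages)

    DEP  = {'m2': {'Q', 'R'}}          # R delivered m3, which causally depends on m2
    COPY = {'m2': {'Q'}}               # only Q still holds an (unlogged) copy of m2
    FAIL = {'Q'}                       # Q has crashed
    print(is_orphan('R', ['m2'], DEP, COPY, FAIL))   # True: R is an orphan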

  46. Fault tolerance: Recovery / Message logging. Message-logging schemes. Pessimistic protocol: for each nonstable message m, there is at most one process dependent on m, that is, |DEP(m)| ≤ 1. Consequence: an unstable message in a pessimistic protocol must be made stable before sending a next message. 47 / 57
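
A few lines (illustrative, not from the slides) capturing the pessimistic rule: every message delivered so far is forced to stable storage before the process is allowed to send anything.

    class PessimisticLogger:
        def __init__(self):
            self.stable_log = []       # messages safely written to stable storage
            self.unstable = []         # delivered but not yet logged

        def deliver(self, msg):
            self.unstable.append(msg)

        def send(self, chan, dest, msg):
            self.stable_log.extend(self.unstable)   # make all unstable messages stable first
            self.unstable = []
            chan.sendTo(dest, msg)                   # only now may the message be sent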

  47. Fault tolerance: Recovery / Message logging. Message-logging schemes. Optimistic protocol: for each unstable message m, we ensure that if COPY(m) ⊆ FAIL, then eventually also DEP(m) ⊆ FAIL. Consequence: to guarantee that DEP(m) ⊆ FAIL, we generally roll back each orphan process Q until Q ∉ DEP(m). 48 / 57

  48. Fault tolerance: Consensus Consensus in faulty systems with crash failures Consensus Prerequisite In a fault-tolerant process group, each nonfaulty process executes the same commands, and in the same order, as every other nonfaulty process. Reformulation Nonfaulty group members need to reach consensus on which command to execute next. 49 / 57

  50. Fault tolerance: Consensus / Consensus in faulty systems with crash failures. Flooding-based consensus. System model: a process group P = {P1, ..., Pn}; fail-stop failure semantics, i.e., with reliable failure detection; a client contacts a Pi requesting it to execute a command; every Pi maintains a list of proposed commands. Basic algorithm (based on rounds): (1) in round r, Pi multicasts its known set of commands C_i^r to all others; (2) at the end of round r, each Pi merges all received commands into a new set C_i^(r+1); (3) the next command cmd_i is selected through a globally shared, deterministic function: cmd_i ← select(C_i^(r+1)). 50 / 57
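
A sketch of one round, reusing the channel style of the earlier 2PC code (the select rule and the channel behavior are assumptions, not the slides' code).

    def select(commands):
        return min(commands)                            # any deterministic rule shared by all processes

    def flooding_round(chan, group, my_commands, TIMEOUT):
        chan.sendTo(group, frozenset(my_commands))      # multicast C_i^r to the others
        merged = set(my_commands)
        for _ in group:                                 # collect the other processes' command sets
            msg = chan.recvFrom(group, TIMEOUT)
            if msg:                                     # msg = (sender, command set)
                merged |= msg[1]
            # a missing message means the failure detector reported a crash
        return merged                                   # this is C_i^(r+1)

    # A process that received the command sets of every other nonfaulty process can decide:
    #   next_command = select(flooding_round(...))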

  51. Fault tolerance: Consensus / Consensus in faulty systems with crash failures. Flooding-based consensus: example. [Figure: four processes P1, P2, P3, P4; P1 crashes during the first round so that only P2 receives its proposed commands; P2, P3, and P4 are each marked as deciding.] Observations: P2 received all proposed commands from all other processes and therefore makes a decision. P3 may have detected that P1 crashed, but does not know whether P2 received anything, i.e., P3 cannot know if it has the same information as P2, and therefore cannot make a decision (the same holds for P4). 51 / 57

  52. Fault tolerance: Consensus / Example: Paxos. Realistic consensus: Paxos. Assumptions (rather weak, and realistic): a partially synchronous system (in fact, it may even be asynchronous); communication between processes may be unreliable: messages may be lost, duplicated, or reordered; corrupted messages can be detected (and thus subsequently ignored); all operations are deterministic: once an execution is started, it is known exactly what it will do; processes may exhibit crash failures, but not arbitrary failures; processes do not collude. Understanding Paxos: we will build up Paxos from scratch to understand where many consensus algorithms actually come from. Essential Paxos 52 / 57

  53. Fault tolerance: Consensus Example: Paxos Paxos essentials A collection of (replicated) threads, collectively fulfilling the following roles: Client: Requests to have an operation performed Proposer: Takes a client’s requests and attempts to have the operation accepted Learner: (Eventually) performs an operation Acceptor: Votes for the execution of an operation Essential Paxos 53 / 57

  54. Fault tolerance: Consensus / Example: Paxos. Paxos essentials. Paxos properties: Safety (nothing bad will happen): only proposed operations will be learned; at most one operation will be learned (and subsequently executed before a next operation is learned). (Eventual) liveness (eventually something good will happen): if enough processes do not fail, then a proposed operation will eventually be learned (and thus executed). Essential Paxos 54 / 57

  55. Fault tolerance: Consensus / Example: Paxos. Paxos essentials. [Figure: several clients send requests to a set of replicated server processes; each server process contains a proposer (P), an acceptor (A), and a learner (L). A single client request/response is shown alongside other requests.] Essential Paxos 55 / 57

  56. Paxos: Phase 1a (prepare). A proposer P: has a unique ID, say i, and communicates only with a quorum of acceptors. For a requested operation cmd, it selects a counter m higher than any of its previous counters, leading to a proposal number r = (m, i). Note: (m, i) < (n, j) iff m < n, or m = n and i < j. It then sends prepare(r) to a majority of acceptors. Goal: the proposer tries to get its proposal number anchored: every previous proposal either failed, or also proposed cmd. Note: "previous" is defined with respect to proposal numbers. 8.2 Process resilience: Paxos

  57. Paxos: Phase 1b (promise). What the acceptor does: if r is the highest proposal number seen from any proposer, it returns promise(r) to P, telling the proposer that the acceptor will ignore any future proposals with a lower proposal number. If r is the highest, but a previous proposal (r', cmd') had already been accepted, it additionally returns (r', cmd') to P; this allows the proposer to decide on the final operation that needs to be accepted. Otherwise it does nothing: there is a proposal with a higher proposal number in the works. 8.2 Process resilience: Paxos

  58. Paxos: Phase 2a (accept). It is the proposer's turn again: if it does not receive any accepted operation, it sends accept(r, cmd) to a majority of acceptors. If it receives one or more accepted operations, it sends accept(r, cmd*), where r is the proposer's selected proposal number and cmd* is the operation whose proposal number is highest among all accepted operations received from the acceptors. 8.2 Process resilience: Paxos

  59. Paxos: Phase 2b (learn). An acceptor receives an accept(r, cmd) message: if it did not send a promise(r') with r' > r, it must accept cmd, and it says so to the learners: learn(cmd). A learner receiving learn(cmd) from a majority of acceptors will execute the operation cmd. Observation: the essence of Paxos is that the proposers drive a majority of the acceptors to the accepted operation with the highest anchored proposal number. 8.2 Process resilience: Paxos
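
To make phases 1b and 2b concrete, here is a compact single-decree acceptor plus the proposer's phase-2a value choice (illustrative code, not the course's; networking, quorum counting, and learners are left out). Proposal numbers are (counter, proposer-id) tuples, which Python compares lexicographically, matching the ordering defined in phase 1a.

    class Acceptor:
        def __init__(self):
            self.promised = None               # highest proposal number promised so far
            self.accepted = None               # (r', cmd') most recently accepted, if any

        def on_prepare(self, r):               # phase 1b
            if self.promised is None or r > self.promised:
                self.promised = r
                return ('PROMISE', r, self.accepted)   # report any earlier accepted (r', cmd')
            return None                        # a higher-numbered proposal is in the works

        def on_accept(self, r, cmd):           # phase 2b
            if self.promised is None or r >= self.promised:
                self.promised = r
                self.accepted = (r, cmd)
                return ('LEARN', cmd)          # would be sent to the learners
            return None

    def phase2a_value(promises, my_cmd):
        # Phase 2a: adopt the operation with the highest accepted proposal number, if any.
        accepted = [acc for (_tag, _r, acc) in promises if acc is not None]
        return max(accepted, key=lambda rc: rc[0])[1] if accepted else my_cmd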

  60. Essential Paxos: slides by Hein Meling, associate professor at the University of Stavanger.

  61.–67. Essential Paxos: Normal case. [A sequence of animation slides illustrating the message flow of a failure-free Paxos run; the slides contain no further text.]

  68.–81. Essential Paxos: Problematic case. [A sequence of animation slides illustrating a Paxos run in the presence of failures; the slides contain no further text.]
