
You Only Live Multiple Times: Black-Box Re-use of Crash-Stop Algorithms in Realistic Crash-Recovery Settings



  1. You Only Live Multiple Times: Black-Box Re-use of Crash-Stop Algorithms in Realistic Crash-Recovery Settings
     David Kozhaya (1), Ognjen Maric (2), and Yvonne-Anne Pignolet (1)
     (1) ABB Corporate Research, Switzerland; (2) Digital Asset, Switzerland
     A big thank you to Klaus-Tycho Foerster for presenting on our behalf!

  2. — Mitigating the Effect of Failures: Fault Tolerance in Distributed Systems
     • Failures happen in real systems.
     • Consensus is the typical way to mitigate the effect of failures.
     • To minimize the impact of failures on service interruption, implement consensus protocols that tolerate failures.
     • Distributed parties agree on the actions to perform despite failures.
     • Consensus protocols have been studied under different synchrony and failure assumptions.
     1/8/2019, You only live multiple times, Kozhaya, Maric, Pignolet, OPODIS 2018

  3. — Solving Consensus in Presence of Failures
     Asynchronous system model
     • Unbounded processing delays
     • Unbounded communication delays
     Crash-stop failure model (CS model)
     • The simplest failure model
     • A process crashes by permanently ceasing to execute the algorithm
     [Diagram labels: Algorithm Solving Consensus; Failure Detector; options numbered 1 and 2; system boxes "Partial synchrony + Crash-stop failures" and "Asynchronous system + Crash-stop failures"]
     Partial synchrony and failure detectors rely on system conditions that eventually hold forever.
     In reality, failure and recovery modes of processes and links are probabilistic and temporary.

  4. — Crash-Recovery Settings: A Way to Capture a System's Dynamicity
     Crash-recovery failure model (CR model)
     • Processes can join and leave unannounced
     • Does not address communication
     Subsequently:
     • New failure detectors, with new consensus algorithms on top
     • Processes that crash and recover infinitely often are excluded: they are not required to satisfy consensus properties
     [Diagram labels: New Algorithm Solving Consensus; New Failure Detector; Asynchronous system + Crash-recovery failures]
     What remains unanswered: can the plethora of existing crash-stop algorithms be reused unchanged in crash-recovery settings?

  5. — Our Contribution
     Re-use crash-stop consensus algorithms, with their reliable links and failure detectors, in the crash-recovery model.
     [Diagram labels: Crash-Stop Consensus Algorithm; Failure Detector; Reliable Channel; Our solution; a system where all processes and links can crash and recover unboundedly]
     Deterministic crash-stop (CS) consensus algorithms implement consensus with probability 1 in CR systems where processes and links crash and recover unboundedly.

  6. — The Rest of This Talk
     • What is different about our approach compared to existing work
     • What we assume in our models
     • How our wrapper works
     • What class of algorithms benefits from our results

  7. — What is different compared to existing work?

  8. — Difference with Existing Literature
     Existing solutions
     • Existing deterministic crash-recovery solutions:
       • Implement consensus deterministically
       • Exclude processes that crash and recover infinitely often
       • Introduce new failure detector definitions and consensus algorithms
     • Existing probabilistic solutions:
       • Implement consensus with probability 1
       • Introduce new consensus algorithms, e.g., based on random coin flips
     Our approach
     • Implements consensus with probability 1
     • Includes processes and links that crash and recover unboundedly
     • Is modular: it does not introduce new algorithms but rather uses existing crash-stop algorithms

  9. — Description of Our Models

  10. — Reliable asynchronous crash-stop model
     [Diagram labels: Reliable asynchronous CS Algorithm; Reliable Channel; Failure Detector]
     • Time: asynchronous processes (no clocks), asynchronous links
     • Failures: processes may crash; all messages get delivered

  11. — Lossy synchronous crash-recovery model
     [Diagram labels: Lossy Synchronous CR Algorithm; Lossy Channel]
     • Time: synchronous steps, synchronous links (upper bounds on step and transmission times)
     • Failures: processes may crash and recover infinitely often; messages may get lost

  12. — Probabilistic crash-recovery model
     [Diagram labels: Probabilistic CR Algorithm; Probabilistic lossy Channel]
     • Time: synchronous processes, synchronous links
     • Failures: processes and links crash and recover with probability in (0,1)
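One way to build intuition for the probabilistic model is a small simulation. The sketch below is not taken from the talk: it assumes, purely for illustration, that each of `n` processes is up in each synchronous step independently with the same probability `p_up`, and it counts how many steps pass before the first run of `window` consecutive steps in which every process is up. The function name and all parameters are hypothetical.

```python
import random


def steps_until_good_window(n, p_up, window, seed=0, max_steps=1_000_000):
    """Count steps until `window` consecutive steps have all n processes up.

    Assumption (not from the slides): each process is up in each step
    independently with probability p_up. Returns None if no such window
    occurs within max_steps.
    """
    rng = random.Random(seed)
    streak = 0  # length of the current run of "everyone is up" steps
    for step in range(1, max_steps + 1):
        all_up = all(rng.random() < p_up for _ in range(n))
        streak = streak + 1 if all_up else 0
        if streak == window:
            return step
    return None


if __name__ == "__main__":
    print(steps_until_good_window(n=3, p_up=0.9, window=5))
```

Under this assumed model, any fixed-length failure-free window has positive probability, so the simulation finds one after finitely many steps in practice; lowering `p_up` or lengthening `window` only makes the wait longer.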

  13. — Model overview
     • Reliable Asynchronous CS: async processes and links; processes may crash; reliable communication
     • Lossy Synchronous CR: sync processes and links; processes may crash and recover; lossy communication
     • Probabilistic CR: sync processes and links; processes crash and recover probabilistically; messages dropped probabilistically

  14. — How our Wrapper works

  15. — Crash-Recovery Wrapper
     The red box is our wrapper.
     [Diagram labels: Crash-Stop Consensus Algorithm, Reliable Channel, and Failure Detector above the wrapper, exchanging crash-stop messages and failure detector output with it; Probabilistic Channel below the wrapper, carrying crash-recovery messages and acks]
     • Create a synchronous crash-recovery step using multiple crash-stop steps (each handling one message)
     • Use a round-by-round failure detector to produce outputs to be fed to the CS algorithm
     • Provide reliable links by LIFO buffering and retransmission until ack
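The last wrapper ingredient, reliable links built from buffering and retransmission until ack, can be sketched in a few lines. This is a minimal illustration and not the authors' wrapper: the `LossyChannel` class, the `send_all_reliably` function, the independent per-message delivery probability `p_deliver`, and the choice to always retransmit the top of a stack are all assumptions made for the example.

```python
import random


class LossyChannel:
    """Delivers each message independently with probability p_deliver (assumed model)."""

    def __init__(self, p_deliver, rng):
        self.p_deliver = p_deliver
        self.rng = rng

    def transmit(self, msg):
        # Return the message if it gets through, otherwise None (lost).
        return msg if self.rng.random() < self.p_deliver else None


def send_all_reliably(payloads, p_deliver=0.3, seed=0, max_steps=100_000):
    """Retransmit buffered messages over lossy links until each one is acked."""
    rng = random.Random(seed)
    forward = LossyChannel(p_deliver, rng)   # sender -> receiver
    backward = LossyChannel(p_deliver, rng)  # receiver -> sender (acks)

    buffer = list(enumerate(payloads))  # unacked (seq, payload); list end = stack top
    delivered = {}                      # receiver side: seq -> payload, duplicates ignored
    steps = 0
    while buffer and steps < max_steps:
        steps += 1
        seq, payload = buffer[-1]       # LIFO: retransmit the most recently buffered message
        got = forward.transmit((seq, payload))
        if got is None:
            continue                    # message lost; retry in the next step
        delivered.setdefault(got[0], got[1])
        ack = backward.transmit(got[0])  # receiver acks every copy it sees
        if ack == seq:
            buffer.pop()                # acked: stop retransmitting this message
    return delivered, steps


if __name__ == "__main__":
    delivered, steps = send_all_reliably(["propose", "vote", "decide"])
    print(delivered, steps)
```

Because the receiver keys messages by sequence number, repeated copies caused by lost acks are harmless. The sender stops only when an ack arrives, so a message may need many attempts when `p_deliver` is small.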

  16. — What algorithms benefit from our results

  17. — Bounded Algorithms: The Class of Algorithms to Which Our Results Apply
     A "bounded" crash-stop consensus algorithm satisfies, for fixed B, B_s, B_Δ:
     • (B1) Communication-closed rounds: processes operate in rounds, and only messages from the current round are considered.
     • (B2) Externally triggered state changes: in every round, a process changes state only upon message receipt or a change in failure detector output.
     • (B3) Bounded round messages: in any round, a process sends at most B_s messages to any other process.
     • (B4) Bounded round gap: the fastest (n-f) processes are always at most B_Δ rounds apart.
     • (B5) Bounded termination: given any time t where the fastest (n-f) processes are correct, all other processes are faulty, and the failure detector output is perfect after t, all (n-f) fastest processes decide before any of them reaches round B_max = max_round(t) + B.
     Theorem. If a bounded algorithm solves consensus in the crash-stop setting, then this algorithm using our wrapper solves consensus with probability 1 in our crash-recovery setting.
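As a concrete reading of two of these conditions, the sketch below checks (B4) on a snapshot of per-process round numbers and computes the round bound from (B5). The function names and the representation of a snapshot as a plain list of integers are hypothetical choices for this example, not definitions from the talk.

```python
def fastest_within_gap(rounds, f, b_delta):
    """Check (B4) on one snapshot: the fastest n-f processes span at most b_delta rounds.

    rounds: current round number of each of the n processes (assumed representation).
    f: number of processes that may be excluded as the slowest; requires f < n.
    """
    n = len(rounds)
    if not 0 <= f < n:
        raise ValueError("need 0 <= f < n")
    fastest = sorted(rounds, reverse=True)[: n - f]
    return fastest[0] - fastest[-1] <= b_delta


def termination_round_bound(rounds, b):
    """Compute B_max = max_round(t) + B from (B5) for a snapshot taken at time t."""
    return max(rounds) + b


if __name__ == "__main__":
    snapshot = [7, 6, 5, 1]  # the process in round 1 is the slow one
    print(fastest_within_gap(snapshot, f=1, b_delta=2))
    print(termination_round_bound(snapshot, b=3))
```

With `f=1`, the slow process in round 1 is ignored and the remaining three processes span rounds 5 to 7, so the check passes for `b_delta=2` and would fail for `b_delta=1`.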

  18. — Bounded Algorithms: Examples
     Examples of existing algorithms that are bounded:
     • The Chandra-Toueg algorithm [1]
     • Algorithms in the generic indulgent framework of [2]
     [1] Tushar Deepak Chandra and Sam Toueg. Unreliable failure detectors for reliable distributed systems. J. ACM, 43(2):225-267, 1996.
     [2] Rachid Guerraoui and Michel Raynal. A generic framework for indulgent consensus. In ICDCS, 2003.

  19. — Conclusion
     • Introduced system models that closely capture the messy reality of distributed systems
     • Allowed processes and links to fail and recover an unbounded number of times
     • Proposed a wrapper to deploy crash-stop algorithms as a black box in our crash-recovery setting
     • Determined the conditions for reusing crash-stop algorithms unchanged in our crash-recovery setting
     david.kozhaya@ch.abb.com; ogi.yolmt@mynosefroze.com; yvonneanne@pignolet.ch
