You Only Live Multiple Times: Black box re-use of Crash-Stop Algorithms In Realistic Crash-Recovery Settings



slide-1
SLIDE 1

DAVID KOZHAYA (1), OGNJEN MARIC (2), AND YVONNE-ANNE PIGNOLET (1)

(1) ABB CORPORATE RESEARCH SWITZERLAND, (2) DIGITAL ASSET SWITZERLAND

You Only Live Multiple Times

Black box re-use of Crash-Stop Algorithms In Realistic Crash-Recovery Settings

A big thank you to Klaus-Tycho Foerster for presenting on our behalf!

slide-2
SLIDE 2

— Mitigating The Effect of Failures

Fault Tolerance in Distributed Systems

  • Failures happen in real systems
  • Consensus: the typical way to mitigate the effect of failures
  • Consensus protocols have been studied under different synchrony and failure assumptions

To minimize service interruption caused by failures, implement consensus protocols that tolerate failures

  • Distributed parties agree on the actions to perform despite failures

2 1/8/2019 You only live multiple times, Kozhaya, Maric, Pignolet, OPODIS 2018

slide-3
SLIDE 3

— Solving Consensus in Presence of Failures

Asynchronous system model

  • Unbounded processing delays
  • Unbounded communication delays

Crash-stop failure model (CS model)

  • The simplest failure model
  • A process crashes by permanently ceasing to execute the algorithm

Partial synchrony and failure detectors rely on system conditions that eventually hold forever

[Figure: "Algorithm Solving Consensus" built on (1) "Partial synchrony + Crash-stop failures" or (2) a "Failure Detector" over "Asynchronous system + Crash-stop failures"]

In reality, failure and recovery modes of processes and links are probabilistic and temporary


slide-4
SLIDE 4

— Crash Recovery Settings

Crash-recovery failure model (CR model)

  • Processes can join and leave unannounced
  • Does not address communication

Subsequently:

  • New failure detectors and new consensus algorithms on top
  • Processes that crash and recover infinitely often are excluded: they are not required to satisfy consensus properties

A Way To Capture a System's Dynamicity

What remains unanswered: Can the plethora of existing crash-stop algorithms be reused unchanged in crash-recovery settings?

[Figure: "New Algorithm Solving Consensus" over a "New Failure Detector" over "Asynchronous system + Crash-Recovery failures"]


slide-5
SLIDE 5

— Our Contribution

Re-use crash-stop consensus algorithms, designed for reliable links and failure detectors, in the crash-recovery model

Deterministic CS (crash-stop) consensus algorithms implement consensus with probability 1 in CR systems where processes and links crash and recover unboundedly

[Figure: "Our solution" placed between a "Crash-Stop Consensus Algorithm" (with "Failure Detector" and "Reliable Channel") and "A system where all processes and links can crash and recover unboundedly"]
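One way to read "with probability 1" is the following back-of-the-envelope bound. It is only a sketch of the intuition, not the authors' proof, and it rests on an assumption that the slides do not state: the execution is split into consecutive windows, and each window is independently "good" (failure-free for long enough that the crash-stop algorithm can decide) with probability at least some q > 0. The symbols q and k are introduced here for illustration.

```latex
\Pr[\text{no good window among the first } k \text{ windows}]
  \;\le\; (1-q)^{k} \;\xrightarrow[k \to \infty]{}\; 0
```

Under that assumption a good window eventually occurs with probability 1, even though no fixed number of windows guarantees it. This is the sense in which a deterministic algorithm can only be promised to terminate probabilistically when failures and recoveries never stop.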


slide-6
SLIDE 6

  • What is different about our approach compared to existing work
  • What we assume in our models
  • How our wrapper works
  • Which class of algorithms benefits from our results

The Rest of This Talk


slide-7
SLIDE 7

— What is different compared to existing work?

slide-8
SLIDE 8

— Difference with Existing Literature

Existing solutions vs. our approach

– Existing crash-recovery deterministic solutions:

  • Implement consensus deterministically
  • Exclude processes that crash and recover unboundedly
  • Introduce new failure detector definitions and consensus algorithms

– Existing probabilistic solutions:

  • Implement consensus with probability 1
  • Introduce new consensus algorithms, e.g., based on random coin flips

– Our approach:

  • Implements consensus with probability 1
  • Includes processes and links that crash and recover infinitely often
  • Is modular: does not introduce new algorithms but rather uses existing crash-stop algorithms


slide-9
SLIDE 9

— Description of Our Models

slide-10
SLIDE 10

— Reliable asynchronous crash-stop model

[Figure: Reliable asynchronous CS stack: Algorithm, Failure Detector, Reliable Channel]

  • Time: asynchronous processes (no clocks), asynchronous links
  • Failures: processes may crash, all messages get delivered


slide-11
SLIDE 11

— Lossy synchronous crash-recovery model

[Figure: Lossy synchronous CR stack: Algorithm, Lossy Channel]

  • Time: synchronous steps, synchronous links (upper bounds for steps and transmission)
  • Failures: processes may crash and recover infinitely often, messages may get lost


slide-12
SLIDE 12

— Probabilistic crash-recovery model

[Figure: Probabilistic CR stack: Algorithm, Probabilistically Lossy Channel]

  • Time: synchronous processes, synchronous links
  • Failures: processes and links crash and recover with probability in (0,1)
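The probabilistic crash-recovery model above can be explored with a small simulation. The sketch below is illustrative only and simplifies the model: it assumes that in every synchronous step each process is up independently with probability `p_up` and each directed link delivers independently with probability `p_link`, both strictly between 0 and 1. The function name and parameters are hypothetical and do not come from the talk.

```python
import random


def first_failure_free_step(n, p_up, p_link, rng, max_steps=1_000_000):
    """Return the first step (1-based) in which every process is up and
    every directed link delivers, or None if none occurs in max_steps.

    Illustrative assumption: processes and links behave independently in
    each step, with probabilities p_up and p_link in the open interval (0, 1).
    """
    if not (0.0 < p_up < 1.0 and 0.0 < p_link < 1.0):
        raise ValueError("p_up and p_link must lie strictly between 0 and 1")
    links = n * (n - 1)  # one directed link per ordered pair of processes
    for step in range(1, max_steps + 1):
        procs_ok = all(rng.random() < p_up for _ in range(n))
        links_ok = all(rng.random() < p_link for _ in range(links))
        if procs_ok and links_ok:
            return step
    return None


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so that runs are repeatable
    print(first_failure_free_step(n=3, p_up=0.9, p_link=0.9, rng=rng))
```

Running it with different seeds shows that the waiting time for a failure-free step varies from run to run and has no fixed upper bound, which is why guarantees in this model are stated with probability 1 instead of deterministically.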


slide-13
SLIDE 13

— Model overview

Reliable Asynchronous CS

  • Async processes and links
  • Processes may crash, reliable communication

Lossy Synchronous CR

  • Sync processes and links
  • Processes may crash and recover, lossy communication

Probabilistic CR

  • Sync processes and links
  • Processes crash and recover probabilistically, messages dropped probabilistically


slide-14
SLIDE 14

— How our Wrapper works

slide-15
SLIDE 15

— Crash-Recovery Wrapper

[Figure: the wrapper sits between the "Probabilistic Channel" and the "Crash-Stop Consensus Algorithm" with its "Failure Detector" and "Reliable Channel". Towards the algorithm it exchanges crash-stop messages and failure detector output; towards the channel it exchanges crash-recovery messages and acks.]

  • Create a synchronous crash-recovery step using multiple crash-stop steps (each handling one message)
  • Round-by-round failure detector to produce outputs to be fed to the CS algorithm
  • Provide reliable links by LIFO buffering and retransmission until ack


The red box is our wrapper
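The third wrapper task listed above, reliable links through LIFO buffering and retransmission until ack, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all class and function names are hypothetical, process crashes are not modelled, and "LIFO" is read here as "retransmit the newest unacknowledged message first".

```python
import random


class LossyChannel:
    """Delivers each item independently with probability p_deliver (assumed)."""

    def __init__(self, p_deliver, rng):
        self.p_deliver = p_deliver
        self.rng = rng

    def send(self, item):
        return item if self.rng.random() < self.p_deliver else None


class Receiver:
    def __init__(self):
        self.delivered = {}  # msg_id -> payload

    def receive(self, msg):
        msg_id, payload = msg
        self.delivered.setdefault(msg_id, payload)  # duplicates are ignored
        return msg_id  # the ack carries the message id


class ReliableSender:
    def __init__(self, channel):
        self.channel = channel
        self.unacked = []  # LIFO buffer: newest message at the end
        self.next_id = 0

    def submit(self, payload):
        self.unacked.append((self.next_id, payload))
        self.next_id += 1

    def step(self, receiver):
        """One synchronous step: retransmit the newest unacked message."""
        if not self.unacked:
            return
        delivered = self.channel.send(self.unacked[-1])
        if delivered is None:
            return  # message lost; retry in a later step
        ack = self.channel.send(receiver.receive(delivered))
        if ack is not None:  # the ack may be lost as well
            self.unacked = [m for m in self.unacked if m[0] != ack]


def run(payloads, p_deliver=0.5, seed=1, max_steps=10_000):
    rng = random.Random(seed)
    sender, receiver = ReliableSender(LossyChannel(p_deliver, rng)), Receiver()
    for payload in payloads:
        sender.submit(payload)
    steps = 0
    while sender.unacked and steps < max_steps:
        sender.step(receiver)
        steps += 1
    return receiver.delivered, steps
```

A message leaves the buffer only once its ack arrives, so a lost message or a lost ack simply causes another retransmission, and the receiver discards the resulting duplicates by message id.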

slide-16
SLIDE 16

— What algorithms benefit from our results

slide-17
SLIDE 17

Bounded Algorithms: The Class of Algorithms To Which Our Results Apply

A “bounded” crash-stop consensus algorithm satisfies, for fixed B, B_s, B_∆:

  • (B1) Communication-closed rounds: Processes operate in rounds, only messages from current round are considered.
  • (B2) Externally triggered state changes: In every round, a process changes state only upon message receipts or failure detector output changes.
  • (B3) Bounded round messages: In any round, a process sends at most B_s messages to any other process.
  • (B4) Bounded round gap: The fastest (n-f) processes are always at most B_∆ rounds apart.
  • (B5) Bounded termination: Given any time t where the fastest (n-f) processes are correct, all other processes are faulty, and the failure detector output is perfect after t, then all (n-f) fastest processes decide before any of them reaches round B_max = max_round(t)+B.

Theorem. If a bounded algorithm solves consensus in the crash-stop setting, then this algorithm using our wrapper solves consensus with probability 1 in our crash-recovery setting
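Property (B4) above can be made concrete with a small check on a single snapshot of round numbers. This is an illustrative sketch only: the function name and argument layout are assumptions, and the actual property must hold at all times during an execution, not just at one instant.

```python
def satisfies_round_gap(rounds, f, b_delta):
    """Check (B4) on one snapshot: the fastest n - f processes are at most
    b_delta rounds apart.

    rounds  -- current round number of each of the n processes
    f       -- number of processes that may be faulty
    b_delta -- the bound written B_∆ on the slide
    """
    n = len(rounds)
    if not 0 <= f < n:
        raise ValueError("f must satisfy 0 <= f < n")
    fastest = sorted(rounds, reverse=True)[: n - f]
    return fastest[0] - fastest[-1] <= b_delta


# The three fastest processes are in rounds 10, 9 and 8: a gap of 2.
print(satisfies_round_gap([10, 9, 8, 2], f=1, b_delta=2))
# The three fastest processes are in rounds 10, 9 and 5: a gap of 5.
print(satisfies_round_gap([10, 9, 5, 2], f=1, b_delta=2))
```

The slowest f processes are ignored on purpose, since the property only constrains the fastest (n-f).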


slide-18
SLIDE 18

Examples of existing algorithms that are bounded are:

  • The Chandra-Toueg algorithm [1]
  • Algorithms in the generic indulgent framework of [2]

[1] Tushar Deepak Chandra and Sam Toueg. Unreliable failure detectors for reliable distributed systems. J. ACM, 43(2):225–267, 1996.
[2] Rachid Guerraoui and Michel Raynal. A generic framework for indulgent consensus. In ICDCS, 2003.

Bounded Algorithms

Examples


slide-19
SLIDE 19

  • Introduced system models that closely capture the messy reality of distributed systems
  • Allowed processes and links to fail and recover an unbounded number of times
  • Proposed a wrapper to deploy crash-stop algorithms as a black box in our crash-recovery setting
  • Determined the conditions for reusing crash-stop algorithms unchanged in our crash-recovery setting

Conclusion

david.kozhaya@ch.abb.com; gi.yolmt@mynosefroze.com; yvonneanne@pignolet.ch
