Term 2 2020 Complete your myExperience and shape the future of - PowerPoint PPT Presentation

Fault Tolerance and Inconsistent Information in Distributed Systems
Dr Vladimir Z. Tosic
Term 2 2020

Complete your myExperience and shape the future of education at UNSW. Click the link in Moodle or log in to myExperience.unsw.edu.au (use z1234567@ad.unsw.edu.au to log in). The survey is confidential; your identity will never be released. Survey results are not released to teaching staff until after your results are published.

MAIN TOPICS IN THE LAST LECTURE (NOT IN THE BEN-ARI TEXTBOOK!)
• Ricart-Agrawala algorithm demo in DAJ
• Revision of message passing using CSP channels
• The actor model for message-passing concurrency
• Brief overview of some other distributed message-passing and distributed shared memory paradigms
• Notes on some other concepts for programming asynchronous distributed systems

MAIN TOPICS IN THIS LECTURE (BEN-ARI TEXTBOOK CHAPTER 12)
• Fault tolerance and inconsistent information in distributed systems – the problem of consensus
• Byzantine Generals algorithm explanation
• Byzantine Generals algorithm examples and demo in DAJ
• King algorithm explanation and examples

CONSENSUS – INTRODUCTION
From Chapter 12 in Ben-Ari’s textbook and additional materials

FAIL-SAFE AND FAULT-TOLERANT DISTRIBUTED SYSTEM – DEFINITIONS
• Reliability has several meanings
• We focus on 2 aspects:
1. Fail-safe – 1 or more failures do not damage the system or its users
2. Fault-tolerant – the system continues to fulfill its requirements even when 1 or more failures happen
• E.g. the Ricart-Agrawala algorithm for distributed mutual exclusion is NOT fault-tolerant because it will deadlock when 1 node fails

CRASH FAILURES VS. BYZANTINE FAILURES – DEFINITIONS
• Crash failure – failed node(s) stop sending messages
• Assume we know that a node crashed (e.g. a timeout occurred)
• Byzantine failure – failed/malfunctioning node(s) can send arbitrary messages, possibly even according to a malicious plan
• We must account for the worst possible scenario, i.e. the biggest negative impact of messages sent by this failed node
• The name comes from the Byzantine Empire (Eastern Roman Empire, 395 – 1453), which had many civil wars and treasons

EXAMPLE ARCHITECTURE OF A RELIABLE DS USING REPLICATION
• Many complications are possible
• E.g. different sensors give somewhat different data
• E.g. all CPUs run software with the same bug
• No absolute reliability in DS, always some limits

REPLICATION, PARTITIONING, REDUNDANCY – DEFINITIONS
• Replication – multiple nodes doing the same work
• Apart from replication, there are other ways to achieve reliability
• Notably: partitioning (each node does an independent subset of the processing) with redundancy (additional information that enables discovering and fixing some failures)
• E.g. parity/CRC in RAID
• Many uses of these methods, e.g. in cloud computing
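To make "partitioning with redundancy" concrete, here is a minimal sketch of XOR parity in the spirit of RAID. It assumes equally sized data blocks and at most one lost block; the function names `xor_parity` and `recover_missing` are hypothetical and chosen only for this illustration.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover_missing(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors plus the parity block."""
    return xor_parity(surviving_blocks + [parity])


# Three partitions held by three nodes (assumed to have the same length).
data = [b"node", b"fail", b"safe"]
# Redundant information held by a fourth node.
parity = xor_parity(data)
# The node holding data[1] is lost; rebuild its block from the others.
recovered = recover_missing([data[0], data[2]], parity)
```

A single parity block can repair only one missing block; detecting corrupted (rather than missing) data needs a checksum such as a CRC.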

BROADER CONTEXT – AUTONOMIC / SELF-MANAGING SYSTEMS
• Automating the work of network/system administrators
• Self-management: self-healing, self-adaptation, …
• Autonomic computing – an IBM self-management initiative
• Analogy with the human autonomic nervous system
• Use of various artificial intelligence (AI) techniques to make decisions using incomplete or inconsistent information
• Not only technical, but also business information (e.g. costs and benefits of various options)

CONSENSUS – PROBLEM DESCRIPTION
• Each node chooses an initial value
• E.g. the result of a measurement or computation
• It is required that all nodes in the system decide to use the same value – 1 of the initial choices of these nodes
• If no faults: each node sends its choice to every other node and then a decision is made using some algorithm (e.g. majority voting) to obtain the consensus value
• All nodes have the same data and run the same decision algorithm, so they all decide upon the same value
• If there are faults: … [to be discussed in this lesson!]
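The fault-free case can be sketched in a few lines. This is an illustration under stated assumptions, not the textbook's code: message exchange is modelled as every node simply ending up with the full list of initial values, and the helper names `majority` and `fault_free_consensus` as well as the explicit `tie_value` parameter are hypothetical.

```python
from collections import Counter


def majority(values, tie_value):
    """Return the most common value, or tie_value if the top two counts are equal."""
    counts = Counter(values).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_value
    return counts[0][0]


def fault_free_consensus(initial_values, tie_value):
    """Every node receives every choice and runs the same decision function."""
    n = len(initial_values)
    # With no faults, each node's view is the complete list of initial values.
    views = [list(initial_values) for _ in range(n)]
    return [majority(view, tie_value) for view in views]


decisions = fault_free_consensus(["A", "R", "A", "A"], tie_value="R")
```

Because all views are identical and `majority` is deterministic, every node reaches the same decision; the interesting cases arise once some views differ because of faults.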

(CONSENSUS EXAMPLE) COMMITMENT – PROBLEM DESCRIPTION (1/2)
• n agents collaborate on a database transaction
• Each of the agents has done its share of the transaction
• They want to come to an agreement on whether to commit the transaction results for later use by other transactions
• Each agent has formed an initial vote but has not yet made the final decision
• All that remains to be done is to ensure that no two agents make different decisions

(CONSENSUS EXAMPLE) COMMITMENT – PROBLEM DESCRIPTION (2/2)
• All agents that reach a decision reach the same one
• If there are no failures and all agents voted to commit, then the decision reached is to commit
• If an agent decides to commit, this means that all agents voted to commit
• Failure model: Only agents can fail, and if they do then they crash

(CONSENSUS EXAMPLE) COMMITMENT SOLUTION – 2-PHASE COMMIT
• A distinguished agent, e.g. #1, collects the other agents' votes
• If all votes (incl. #1’s) are “commit” then #1 tells all other agents to commit
• Otherwise (if any agent voted “abort” or any agent did not send its vote, e.g. it crashed), #1 tells all other agents to abort
• “All or nothing”
• 2-Phase Commit solves the commitment problem but may fail to terminate if processes fail
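The decision rule of the distinguished agent can be sketched as follows. This is a simplified illustration: the function name `coordinator_decision` is hypothetical, and a missing vote (for example after a timeout) is assumed to be recorded as `None`.

```python
def coordinator_decision(votes):
    """Decide for all agents.

    votes maps an agent id to "commit", "abort", or None
    (None is an assumed marker for a vote that never arrived, e.g. a crash).
    """
    if all(vote == "commit" for vote in votes.values()):
        return "commit"
    # Any "abort" vote or any missing vote forces abort: all or nothing.
    return "abort"


all_yes = coordinator_decision({1: "commit", 2: "commit", 3: "commit"})
one_crashed = coordinator_decision({1: "commit", 2: None, 3: "commit"})
one_abort = coordinator_decision({1: "commit", 2: "abort", 3: "commit"})
```

The sketch covers only the decision step. It does not model a crash of the distinguished agent itself, which is the kind of failure that can leave the other agents waiting for a decision.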

CONSENSUS – THE NEED FOR SYNCHRONY
• Theorem: It is impossible for a set of processes in an asynchronous distributed system to agree on a binary value, even if only a single process is subject to an unannounced crash
• Proof by contradiction (sketch): Assume the algorithm always makes correct decisions; its result depends on some process – but if this process crashes then the other processes must choose arbitrarily and will sometimes make wrong decisions; contradiction with the assumption ∎
• Conclusion: Some synchrony is needed to reach consensus in the presence of faults; it also helps tolerate some Byzantine failures

THE BYZANTINE GENERALS ALGORITHM
From Chapter 12 in Ben-Ari’s textbook

(CONSENSUS EXAMPLE) BYZANTINE GENERALS - PROBLEM DESCRIPT. (1/2)
• Several Byzantine generals (each with own army) decide whether to attack some enemy or to retreat (to avoid defeat)
• To win, they must ALL attack together; if they do not attack all together, they will be defeated
• There are reliable messengers delivering messages between the generals
• Some of the generals might be traitors working towards defeat
• Devise an algorithm so that all loyal generals come to a consensus plan based on the majority vote of initial choices, and if tied choose retreat

(CONSENSUS EXAMPLE) BYZANTINE GENERALS - PROBLEM DESCRIPT. (2/2)
• Analogy with real-life distributed systems:
• General – potentially failed/malfunctioning node
• Traitor – failed/malfunctioning node
• Messenger – reliable communications channel
• The BG algorithm executes concurrently with the underlying computation
• Messages of the BG algorithm are disjoint from computation messages
• Messages of the BG algorithm are synchronous: request with reply
• In send/receive statements, message types are omitted

BYZANTINE GENERALS ALG. 1-ROUND VERSION - PSEUDOCODE
• Note: planType = {A; R} for attack and retreat

BYZANTINE GENERALS ALG. 1-ROUND VERSION – ERROR SCENARIO
• Zoe (attack) and Leo (retreat) are loyal, Basil (attack) is a traitor
• Basil sends A to Leo, but then crashes before sending to Zoe
• Leo chooses A, Zoe chooses R – no consensus by loyal generals
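The scenario can be replayed with a small simulation. It is a sketch under stated assumptions rather than the slide's pseudocode: each general votes over its own plan plus the plans that actually reached it, ties resolve to retreat as required by the problem description, and the names `majority`, `one_round` and `delivered` are hypothetical.

```python
from collections import Counter


def majority(plans):
    """Majority vote over 'A' (attack) and 'R' (retreat); a tie means retreat."""
    counts = Counter(plans)
    return "A" if counts["A"] > counts["R"] else "R"


def one_round(initial, delivered):
    """Each general votes over its own plan plus the plans it received.

    delivered is a set of (sender, receiver) pairs whose message arrived.
    """
    final = {}
    for general in initial:
        known = [initial[general]]
        known += [initial[sender] for sender in initial
                  if sender != general and (sender, general) in delivered]
        final[general] = majority(known)
    return final


initial = {"Zoe": "A", "Leo": "R", "Basil": "A"}
# Basil's message reaches Leo, then Basil crashes before sending to Zoe.
delivered = {("Zoe", "Leo"), ("Zoe", "Basil"),
             ("Leo", "Zoe"), ("Leo", "Basil"),
             ("Basil", "Leo")}
final = one_round(initial, delivered)
```

Leo votes over R, A, A and chooses A; Zoe votes over A, R only, which is a tie, so she chooses R. The two loyal generals disagree.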

BYZANTINE GENERALS ALG. 1-ROUND VERSION – ERROR DISCUSSION
• The 1-Round Algorithm cannot tolerate 1 crash failure among 3 generals
• Because it does not use the fact that certain generals are loyal
• This scenario can be extended to an arbitrary number of generals
• Even if just a few generals crash, they can prevent consensus in the 1-Round Algorithm if the vote is very close
• Idea: Relay received messages in a further round

BYZANTINE GENERALS ALGORITHM – MAIN IDEAS AND DATA STRUCTURES
• First round: Each general sends own plan to all other generals and receives plans from them
• After it, array plan holds the plans of all generals
• Subsequent round(s): Each general sends all other generals what it received from other generals about their plans and receives such reports from the other generals
• Loyal generals always relay what they have received
• Matrix cell reportedPlan[G,G’] stores the plan that G reported receiving from G’
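The two data structures can be illustrated by replaying the earlier crash scenario with a relay round. This is only a sketch of how `plan` and `reportedPlan` get filled, under the assumption that a crashed general reaches some receivers in the first round and sends nothing afterwards; how the reports are then combined into a final decision is not shown here. The names `run_rounds`, `crashed_after` and `reported` are hypothetical.

```python
def run_rounds(initial, crashed_after):
    """Fill each general's local plan array and reportedPlan matrix.

    crashed_after maps a crashed general to the set of receivers it
    reached in the first round; it sends nothing in the second round.
    """
    generals = list(initial)

    # First round: plan[me][sender] is the plan that 'me' received from 'sender'.
    plan = {me: {me: initial[me]} for me in generals}
    for sender in generals:
        for me in generals:
            if me == sender:
                continue
            if sender not in crashed_after or me in crashed_after[sender]:
                plan[me][sender] = initial[sender]

    # Second round: reported[me][(G, G2)] is the plan G reported receiving from G2.
    reported = {me: {} for me in generals}
    for relay in generals:
        if relay in crashed_after:
            continue  # a crashed general relays nothing
        for subject, value in plan[relay].items():
            if subject == relay:
                continue
            for me in generals:
                if me not in (relay, subject):
                    reported[me][(relay, subject)] = value
    return plan, reported


initial = {"Zoe": "A", "Leo": "R", "Basil": "A"}
plan, reported = run_rounds(initial, crashed_after={"Basil": {"Leo"}})
```

After the first round Zoe has no entry for Basil, but after the relay round her matrix holds Leo's report of Basil's plan, which is the information the 1-round version was missing.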
