Fault tolerance 101 Joe Armstrong Monday, March 3, 2014

Fault
• “behaves as per specification”
• “does not crash”

Many systems have no specification

Programming is the act of turning an inexact description of something (the specification) into an exact description of the thing (the program)

A program is the most precise description of the problem that we have

What is fault tolerance?
• The ability to behave in a sensible manner in the presence of failure. Consumer software, websites, ...
• The ability to behave exactly as specified despite failures. Air traffic control, nuclear power station control.
“In a sensible manner” is rather woolly. Exact specification is extremely difficult. When there is no spec, “in a sensible manner” means: does not crash.

• History
• Hardware Fault Tolerance
• Software Fault Tolerance
• Specifications and code
• Erlang FT
• Demo

We cannot prevent failures

Automata Studies, ed. C. Shannon, Princeton Univ. Press, 1956

Q: Can we make reliable systems that behave reasonably from unreliable components?
A: Yes

The Cornerstones of FT
• Detect Errors
• Correct Errors
• Stop Errors from Propagating

Needs > 1 computer
Error detection must work across machine boundaries
Must write distributed programs
Decoupling and separation helps stop errors from propagating
Programs run in parallel
[Diagram: Computer 1 does the job. Computer 2 watches computer 1. Computer 3 watches computer 1. Computer ... watches computer 1.]
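
One way the "computer N watches computer 1" idea might be sketched is with heartbeat timeouts. The slide does not say which detection mechanism is used, so the heartbeat scheme, the class name `HeartbeatMonitor`, and the node name `computer1` are all assumptions made for illustration. Timestamps are passed in explicitly so the sketch runs without real machines or clocks.

```python
class HeartbeatMonitor:
    """Watcher-side bookkeeping (hypothetical): who was last heard from, and when."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # node name -> time of its last heartbeat

    def heartbeat(self, node, now):
        """Record that `node` reported in at time `now`."""
        self.last_seen[node] = now

    def suspected_failed(self, now):
        """Return the nodes that have been silent for longer than the timeout."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=5.0)
monitor.heartbeat("computer1", now=0.0)
print(monitor.suspected_failed(now=3.0))  # still within the timeout
print(monitor.suspected_failed(now=6.0))  # silent for too long
```

A silent node is only suspected, not known, to have failed: across machine boundaries a slow network looks the same as a crash.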

Things to ponder
• Hardware can fail
• Software either complies with a spec = works, or does not do what the spec says = fails
• What should the software do when the system behaves in a way that is not described in the spec?
• What do we do when we don’t have a spec?
• Can we make reliable systems that behave reasonably from unreliable components?

• Detecting or masking errors?
• Correcting errors
• Propagation of errors
• Error firewalls
• Self-repairing zones
• Static/Dynamic error detection

Hardware fault tolerance
• Systems that use redundancy to mask (hide) errors. Examples: RAID disks, error-correcting bits in memory hardware, etc.
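
As a rough illustration of masking an error through redundancy, the sketch below uses majority voting over three replicas. Majority voting is not one of the slide's examples (RAID and error-correcting memory work differently); it is swapped in here only because it is the shortest way to show the idea. The function name `majority_vote` and the sample values are hypothetical.

```python
from collections import Counter


def majority_vote(readings):
    """Return the value reported by a strict majority of replicas.

    Raises ValueError when no value has a strict majority, because the
    error can then be detected but not masked.
    """
    value, count = Counter(readings).most_common(1)[0]
    if count * 2 <= len(readings):
        raise ValueError("no majority: error detected but not masked")
    return value


# Three hypothetical replicas compute the same result; one is faulty.
print(majority_vote([42, 42, 17]))  # prints 42
```

The caller never sees the faulty reading, which is what "masking" means here. When all three disagree, the fault can no longer be hidden and has to be reported instead.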

Tandem NonStop II (1981)

Tandem ...

Tandem Computers, Inc. was the dominant manufacturer of fault-tolerant computer systems for ATM networks, banks, stock exchanges, telephone switching centers, and other similar commercial transaction processing applications requiring maximum uptime and zero data loss.

To contain the scope of failures and of corrupted data, these multi-computer systems have no shared central components, not even main memory. Conventional multi-computer systems all use shared memories and work directly on shared data objects. Instead, NonStop processors cooperate by exchanging messages across a reliable fabric, and software takes periodic snapshots for possible rollback of program memory state.

Besides handling failures well, this "shared-nothing" messaging system design also scales extremely well to the largest commercial workloads. Each doubling of the total number of processors would double system throughput, up to the maximum configuration of 4000 processors. In contrast, the performance of conventional multiprocessor systems is limited by the speed of some shared memory, bus, or switch. Adding more than 4–8 processors that way gives no further system speedup. NonStop systems have more often been bought to meet scaling requirements than for extreme fault tolerance. They compete well against IBM's largest mainframes, despite being built from simpler minicomputer technology.

All quotes from Wikipedia

1.10 on tuesday dec 10

What do we do when we detect an error?
• Mask it (try again)
• Do nothing (crash later - not a totally brilliant idea)
• Or ...

LET IT CRASH

Programming the Ericsson Diavox (1976)
If you’re in a three-way call, at any time you can press the # key, then press:
1 to talk to party 1
2 to talk to party 2
* to enter a conference call

Defensive programming

    if(state == 3waycall && key == "#"){
        key = get_next_key();
        if(key == "1"){
            park(2);
            connect([self,1]);
        } elseif(key == "2"){
            park(1);
            connect([self,2]);
        } elseif(key == "*"){
            connect([self,1,2]);
        } elseif(key == "onhook"){
            /* Uuugh what do I do here */
        }
    }

Oh Dear
• The Spec tells what to do when things happen
• The Spec does not say what to do when the behavior goes “off-spec”
• The number of ways we can go “off spec” is huge
• Most specifications do not include failure analysis, and do not say what to do when you are “off spec”

Joe: “So what happens if we’re in a 3-way conference, and the guy presses hash and then puts the hook down, and doesn’t press 1, 2 or star?”
Bernt: “So what you do is stop the conference, send the phone a ring tone and when they answer go back to the point where you were expecting them to enter 1, 2 or star.”
Joe: “But that’s not in the spec.”
Bernt: “But everybody knows.”
Joe: “I didn’t know.”

Calls are “files”
• If a process crashes the OS closes all files opened by the process
• If a call crashes the OS closes all calls opened by the process
• The OS’s job is to “keep files safe” (i.e. it maintains invariants)

Let it crash philosophy
• If a process crashes the OS detects this
• The OS protects the resources being used by the process
• Programs should crash when going off spec

Defensive programming

    if(state == 3waycall && key == "#"){
        key = get_next_key();
        if(key == "1"){
            park(2);
            connect([self,1]);
        } elseif(key == "2"){
            park(1);
            connect([self,2]);
        } elseif(key == "*"){
            connect([self,1,2]);
        } else{
            exit(out_of_spec1);
        }
    }

Failed pattern matching provides the exit

    confcall("#") ->
        case get_next_key() of
            "1" ->
                park(2),
                connect([self,1]);
            "2" ->
                park(1),
                connect([self,2]);
            "*" ->
                connect([self,1,2])
        end.

Non-defensive programming: there is no error detection or correction code
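
For readers who do not write Erlang, the same idea can be approximated in Python with a lookup table: only the keys named in the spec have an entry, and anything else raises an exception instead of being handled. This is a loose analogy, not a translation of the slide's code. The function name `conf_call`, the `log` list, and the tuple encoding of `park` and `connect` are assumptions made so the example runs without a telephone switch.

```python
def conf_call(next_key, log):
    """Handle the key pressed after '#' in a three-way call.

    Only the keys named in the spec are handled. Any other key raises
    KeyError, which plays the role of the failed pattern match.
    """
    actions = {
        "1": [("park", 2), ("connect", ("self", 1))],
        "2": [("park", 1), ("connect", ("self", 2))],
        "*": [("connect", ("self", 1, 2))],
    }
    for action in actions[next_key]:  # KeyError here means "off spec"
        log.append(action)
    return log


print(conf_call("1", []))

try:
    conf_call("onhook", [])
except KeyError as off_spec:
    print("crashed on off-spec key:", off_spec)
```

There is no branch for "onhook" and no catch-all: the function says nothing about inputs the spec says nothing about, and the crash is left for something outside the function to deal with.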

Are hardware and software faults fundamentally different?

Are there any pure functions?

Class (a) functions: If computing f(X) fails and f is a pure function, computing f(X) will always fail.
Class (b) functions: If computing f(X) fails and f is a non-pure function, it might succeed if we call f(X) again.
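
The practical difference between the two classes shows up when a call is retried. The sketch below is illustrative only: `call_with_retry`, `Flaky`, and `pure_div` are hypothetical names, and the "transient failure" is simulated with a call counter rather than a real timing or hardware fault.

```python
def call_with_retry(f, x, attempts=3):
    """Call f(x) up to `attempts` times and re-raise the last error if all fail."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return f(x)
        except Exception as error:
            last_error = error
    raise last_error


class Flaky:
    """Class (b): depends on hidden state, so a later call may succeed."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        if self.calls <= self.failures:
            raise RuntimeError("transient failure")
        return x * 2


def pure_div(x):
    """Class (a): same input, same outcome, so retrying cannot help."""
    return 10 / x


flaky = Flaky(failures=2)
print(call_with_retry(flaky, 21))  # succeeds on the third call

try:
    call_with_retry(pure_div, 0)
except ZeroDivisionError:
    print("pure function failed on every attempt")
```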

Is this a pure function?

    function f() {
        int a = 10;
        int b = 2;
        return a / b;
    }

Cosmic ray hits the memory cell where b is stored and changes the 2 into zero

    function f() {
        int a = 10;
        int b = 2;
        return a / b;
    }

A heisenbug

• Heisenbug - A bug that seems to disappear or alter its behavior when one attempts to study it.
• Bohrbug - A "good, solid bug". Like the deterministic Bohr atom model, they do not change their behavior and are relatively easily detected.
• Mandelbug (named after Benoît Mandelbrot's fractal) - A bug whose causes are so complex that it defies repair, or makes its behavior appear chaotic or even non-deterministic.
• Schrödinbug (named after Erwin Schrödinger and his thought experiment) - A bug that manifests itself in running software after a programmer notices that the code should never have worked in the first place.
• Hindenbug (named after the Hindenburg disaster) - A bug with catastrophic behavior.
Source: Wikipedia

• If a process fails, restart it (fixes many heisenbugs, especially those due to subtle timing errors)
• If you have tried restarting a process more than N times in K seconds, then give up. Try and do something simpler instead.
• Build trees of processes; if low-level nodes fail and cannot be restarted, fail higher up the tree
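
The restart rule above can be sketched in a few lines. This is a single-threaded toy under stated assumptions, not how Erlang supervisors are implemented: the names `supervise` and `RestartLimitExceeded` are hypothetical, a "process" is just a Python callable, and the `clock` parameter exists only so the time window can be controlled in a test.

```python
import time
from collections import deque


class RestartLimitExceeded(Exception):
    """Raised when a worker fails too often; the parent has to deal with it."""


def supervise(worker, max_restarts, window_seconds, clock=time.monotonic):
    """Run worker(), restarting it after each crash.

    Gives up with RestartLimitExceeded once more than max_restarts
    restarts have been needed within window_seconds.
    """
    restarts = deque()  # timestamps of recent restarts
    while True:
        try:
            return worker()
        except Exception as error:
            now = clock()
            restarts.append(now)
            # Forget restarts that have fallen out of the sliding window.
            while restarts and now - restarts[0] > window_seconds:
                restarts.popleft()
            if len(restarts) > max_restarts:
                raise RestartLimitExceeded(repr(error)) from error


attempts = {"count": 0}


def flaky_worker():
    """Hypothetical worker that crashes twice and then succeeds."""
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise RuntimeError("timing glitch")
    return "ok"


print(supervise(flaky_worker, max_restarts=2, window_seconds=10.0))  # prints ok
```

Because giving up is itself an exception, a parent can wrap a call to `supervise` inside its own `supervise`, which is one way to picture failure moving higher up the tree.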

Supervision trees
[Diagram labels: supervisors, workers]
Don’t forget the manual backup :-)

The failure model is part of the specification (especially for air-traffic control software etc.)
The customer should understand the failure model

“I want fault tolerant storage.”
“That’s impossible. We’ll make three copies of your data, on three different machines. We’ll guarantee that if one machine crashes you’ll never lose any data.”
“What happens if 2 machines crash at the same time?”
“You can still save data on the third machine, but it will be unsafe. Our guarantee will not apply.”
“But I want more safety.”
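
The guarantee in this exchange can be made concrete with a toy in-memory model. Everything in it is an assumption made for illustration: the class name `ReplicatedStore`, the use of dictionaries as "machines", and in particular the reading that a write is covered by the guarantee only when at least two copies of it exist.

```python
class ReplicatedStore:
    """Toy model: each 'machine' is a dict, and a crash loses its contents."""

    def __init__(self, replica_count=3):
        self.replicas = [dict() for _ in range(replica_count)]
        self.alive = [True] * replica_count

    def crash(self, index):
        """Simulate machine `index` crashing and losing its copy."""
        self.alive[index] = False
        self.replicas[index].clear()

    def put(self, key, value):
        """Write to every live machine.

        Returns True when at least two copies were written (one more crash
        is survivable) and False when the write succeeded but is unsafe.
        """
        copies = 0
        for replica, alive in zip(self.replicas, self.alive):
            if alive:
                replica[key] = value
                copies += 1
        if copies == 0:
            raise RuntimeError("no machine left to store the data")
        return copies >= 2

    def get(self, key):
        """Read from the first live machine that holds the key."""
        for replica, alive in zip(self.replicas, self.alive):
            if alive and key in replica:
                return replica[key]
        raise KeyError(key)


store = ReplicatedStore()
print(store.put("invoice", 100))  # three copies, guarantee applies
store.crash(0)
print(store.get("invoice"))       # one crash, data still there
store.crash(1)
print(store.put("receipt", 5))    # saved on the last machine, but unsafe
```

The return value of `put` is the failure model made visible to the customer: the store keeps working after two crashes, but it says when the guarantee no longer applies.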
