Fault Tolerance and Robustness in Concurrent Systems

Faults, errors, failures, and fault tolerance have many different definitions.
• What working definition should we use for fault?
• What does it mean to be fault-tolerant?

Faults, errors, failures, and fault tolerance have many different definitions.
• The definitions of the three terms fault, error, and failure are not standardized.
  - One view is that failures are random events and errors are designed in. Both cause faults.
  - Another view is that a fault is the underlying defect, which may or may not manifest itself and lead to a failure. This view fits the term fault-tolerant better: the system can tolerate faults without failing.
• What it means to be fault-tolerant is meant as an open-ended question.

If not handled, faults can exhibit themselves in a system in a number of different ways.
• Actions: the wrong actions are performed
• Timing: the right actions are performed but at the wrong time
• Sequence: the right actions are performed but in the wrong sequence
• Amount: the wrong number of actions are performed

Fault-tolerance is a system-level attribute that needs to be designed in rather than tacked on.
• In a broad sense, what are the two major categories of activities that have to go on to achieve fault-tolerance?

Fault-tolerance is a system-level attribute that needs to be designed in rather than tacked on.
• The two major categories of activities are detection and recovery (taking action).
• We need mechanisms in place to detect that something is going wrong and to identify the underlying fault.
• Then we need to recover without the fault leading to a failure or, at worst, fail safely.

A simple software watchdog is a first detection mechanism.
• Software components are required to report a heartbeat to their supervisor or to a central monitor.
• The assumption is that as long as the heartbeat is received, the component is working.
• How much does this tell us about the operation of the component?
• What could be an extension to the simple watchdog concept that could tell us more?

How much does this tell us about the operation of the component?
• Hardware watchdogs are regularly built into the hardware of safety-critical systems. Unless the watchdog is reset within its timeout period, a hardware reset is issued to restart the system.
• The heartbeat only tells us that the component is regularly reaching the point in its execution where the heartbeat is sent. It tells us little else about the operation of the component.
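The software watchdog described above might be sketched in Java as follows. This is only an illustration: the class name `HeartbeatMonitor`, its method names, and the millisecond timeout are assumptions made for the example, not part of the original slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical central monitor: components call beat(), the monitor flags silent ones. */
class HeartbeatMonitor {
    private final Map<String, Long> lastBeatNanos = new ConcurrentHashMap<>();
    private final long timeoutNanos;

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutNanos = timeoutMillis * 1_000_000L;
    }

    /** Called by a component from a fixed point in its processing loop. */
    void beat(String componentId) {
        beat(componentId, System.nanoTime());
    }

    /** Same as above with an explicit timestamp, which keeps the logic easy to test. */
    void beat(String componentId, long nowNanos) {
        lastBeatNanos.put(componentId, nowNanos);
    }

    /** Returns the components whose last heartbeat is older than the timeout. */
    List<String> overdue(long nowNanos) {
        List<String> late = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastBeatNanos.entrySet()) {
            if (nowNanos - e.getValue() > timeoutNanos) {
                late.add(e.getKey());
            }
        }
        return late;
    }
}
```

Note what the monitor can and cannot see: it learns only that `beat` was reached recently, not whether the work done between beats was correct.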

What could be an extension to the simple watchdog concept that could tell us more?
• If the component sends more than a bare heartbeat at regular intervals, a watchdog monitor that knows how the component is supposed to operate can check the component for incorrect operation.
• This requires that:
  - the watchdog understands all of the possible correct paths of execution of the component under observation,
  - the component sends some indication whenever it reaches a significant point, and
  - the watchdog takes action when the information does not match correct operation.
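One way to picture the extended watchdog is as a table of legal transitions between checkpoints. The sketch below assumes the component reports named checkpoints and that the correct execution paths can be written down as "checkpoint X may be followed by any of Y". `CheckpointWatchdog` and the checkpoint names are hypothetical.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

/** Hypothetical watchdog that checks reported checkpoints against the legal execution paths. */
class CheckpointWatchdog {
    private final Map<String, Set<String>> allowedNext; // checkpoint -> legal successors
    private String current;
    private boolean faultDetected = false;

    CheckpointWatchdog(String start, Map<String, Set<String>> allowedNext) {
        this.current = start;
        this.allowedNext = allowedNext;
    }

    /** Called by the component at each significant point; returns false on an illegal step. */
    synchronized boolean report(String checkpoint) {
        Set<String> legal = allowedNext.getOrDefault(current, Collections.emptySet());
        if (!legal.contains(checkpoint)) {
            faultDetected = true; // a real system would trigger recovery here
            return false;
        }
        current = checkpoint;
        return true;
    }

    synchronized boolean faultDetected() {
        return faultDetected;
    }
}
```

A table like this catches sequence faults only. Catching timing faults would additionally need a deadline per transition, which this sketch leaves out.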

There are a number of responses that can be taken once you find out that something is wrong.
• What are some approaches that can be used to deal with a broken component, and with an operation that may not have been done correctly?
• What concerns do you have to consider?

First, we will establish some terminology.
• Cancellation
  - Task-level termination
  - May or may not result in stopping threads
• Interruption
  - Thread-level termination
  - Get a thread to terminate with or without completion of the current operation
• Shutdown
  - Application- or service-level termination
  - Stop all tasks, and associated threads, with or without completion
These definitions are not necessarily universally accepted.

If you are not using a framework with fault handling, you will have to deal with it all yourself.
• A framework without fault handling may not give you many options.
• Define cancellation and interruption policies:
  - how cancellation is requested, when the request is checked, and what is done in response.
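A cancellation policy answers the three questions above: how, when, and what. The following sketch shows one possible policy, assuming a task made of small, non-blocking units of work. The class `CancellableCounter` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical task with a simple flag-based cancellation policy. */
class CancellableCounter implements Runnable {
    // How: another thread requests cancellation by setting this flag.
    private volatile boolean cancelled = false;
    private final List<Integer> done = new ArrayList<>();
    private final int limit;

    CancellableCounter(int limit) {
        this.limit = limit;
    }

    void cancel() {
        cancelled = true;
    }

    @Override
    public void run() {
        for (int i = 0; i < limit; i++) {
            // When: the flag is checked before each unit of work.
            if (cancelled) {
                return; // What: units already finished are kept, nothing new is started.
            }
            synchronized (done) {
                done.add(i);
            }
        }
    }

    /** Snapshot of the units of work that were completed. */
    List<Integer> completed() {
        synchronized (done) {
            return new ArrayList<>(done);
        }
    }
}
```

A flag like this is only noticed between units of work. If a unit can block, for example on a queue, the flag is never re-checked while the thread is blocked, which is one reason to use interruption instead.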

You have some design decisions to make regarding how to handle being interrupted.
• At the task level
  - Finish current work or stop immediately?
  - Does the task own the thread?
    - Yes: end the thread?
    - No, i.e. it is running from a thread pool: let the thread manager handle the interrupt for the thread, by either
      - preserving the interrupted status, or
      - throwing InterruptedException
• At the thread level
  - Propagate the interrupt if the code that detects it does not implement the interruption policy
  - Otherwise, implement the interruption policy
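For a task that does not own its thread, the two options (throw `InterruptedException` or preserve the interrupted status) might look like this in Java. The class and method names are hypothetical. `BlockingQueue.take()`, `Thread.currentThread().interrupt()`, and `Thread.interrupted()` are standard JDK calls.

```java
import java.util.concurrent.BlockingQueue;

/** Hypothetical helpers showing two ways to pass an interrupt on to the thread's owner. */
class InterruptHandling {

    /** Option 1: propagate. The caller sees InterruptedException and decides what to do. */
    static String takeOrPropagate(BlockingQueue<String> queue) throws InterruptedException {
        return queue.take();
    }

    /**
     * Option 2: the method cannot throw InterruptedException (for example inside a Runnable),
     * so it restores the interrupted status for code higher up the call stack to see.
     */
    static String takeOrNull(BlockingQueue<String> queue) {
        try {
            return queue.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupted status
            return null;
        }
    }
}
```

Catching `InterruptedException` and doing nothing is the case to avoid: the thread's owner would never learn that an interrupt was requested.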

There are other things that you need to consider if you want to build a fault-tolerant system.
• What is the most common indication that your program had a problem?

    Exception in thread "main" java.lang.SomeException
        at com.example.myproject.Class1.method1(Class2.java:16)
        at com.example.myproject.Class2.method2(Class3.java:25)
        at com.example.myproject.TopClass.main(TopClass.java:14)

• If it is operationally critical that the system keeps running, tries to recover from errors, or at a minimum does a graceful, failsafe shutdown, what do you do?

What do you do?
• The most common response is “handle all exceptions”, but this cannot always be done.
• If a class you use throws an unchecked exception or an error, nothing in its method signatures warns you that it might come at you.
• Interface Thread.UncaughtExceptionHandler
  - This provides a mechanism for you to catch every Throwable, which includes all Exceptions and Errors, that would otherwise terminate a thread uncaught.
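A minimal sketch of a last-resort handler built on `Thread.UncaughtExceptionHandler`. The class name `LastResortHandler` and what it does with the exception (record and print it) are assumptions for illustration. A real system would decide here whether to restart the component or begin a failsafe shutdown.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical handler that records whatever killed a thread. */
class LastResortHandler implements Thread.UncaughtExceptionHandler {
    final AtomicReference<Throwable> last = new AtomicReference<>();

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        last.set(e);
        System.err.println("Thread " + t.getName() + " terminated by " + e);
        // Placeholder: notify a supervisor, restart the component, or shut down safely.
    }
}
```

The handler can be attached to a single thread with `thread.setUncaughtExceptionHandler(handler)` or installed for all threads with `Thread.setDefaultUncaughtExceptionHandler(handler)`. It runs in the dying thread, so it acts as a notification point and cannot resume the failed work.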

Shutdown of a service should take down all tasks and threads that it owns.
• At the task level
  - Let a running task complete?
  - Let scheduled but not started tasks complete?
  - Provide information about what work was not finished.
• Once tasks are handled, interrupt the threads in the pool.
• ExecutorServices provide some support:
  - shutdown()
  - shutdownNow()
  - awaitTermination()
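The three `ExecutorService` methods listed above are commonly combined into a two-phase shutdown. The sketch below is one such combination. The class `GracefulShutdown`, the method `stop`, and the grace period are hypothetical choices.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: orderly shutdown first, forced shutdown if the grace period expires. */
class GracefulShutdown {

    /** Returns the tasks that were queued but never started. */
    static List<Runnable> stop(ExecutorService pool, long graceMillis) {
        pool.shutdown(); // accept no new tasks; running and queued tasks continue
        try {
            if (pool.awaitTermination(graceMillis, TimeUnit.MILLISECONDS)) {
                return Collections.emptyList(); // everything finished in time
            }
            // Grace period expired: interrupt running tasks and collect unstarted ones.
            return pool.shutdownNow();
        } catch (InterruptedException e) {
            List<Runnable> notStarted = pool.shutdownNow();
            Thread.currentThread().interrupt(); // preserve interrupted status for the caller
            return notStarted;
        }
    }
}
```

The returned list covers only tasks that never started. Tasks that were running when `shutdownNow()` interrupted them are not reported, so a service that must account for all unfinished work has to track those itself.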
