Fault Tolerance and Robustness in Concurrent Systems


SLIDE 1

Fault Tolerance and Robustness in Concurrent Systems

SLIDE 2

Faults, errors, failures, and fault tolerance have many different definitions.

What working definition should we use for fault? What does it mean to be fault-tolerant?

SLIDE 3

Faults, errors, failures, and fault tolerance have many different definitions.

  • The definitions of these terms are not standardized.
  • One view is that failures are random events and errors are designed in. Both cause faults.
  • Another view is that a fault is the underlying defect, which may or may not manifest itself and lead to a failure. This view fits the term fault-tolerant better: the system can tolerate faults without failing.
  • What it means to be fault-tolerant is meant as an open-ended question.

SLIDE 4

If not handled, faults can exhibit themselves in a system in a number of different ways.

  • Actions – the wrong actions are performed
  • Timing – the right actions are performed but at the wrong time
  • Sequence – the right actions are performed but in the wrong sequence
  • Amount – the wrong number of actions are performed

SLIDE 5

Fault-tolerance is a system-level attribute that needs to be designed in rather than tacked on.

In a broad sense, what are the two major categories of activities that have to go on to achieve fault-tolerance?

SLIDE 6

Fault-tolerance is a system-level attribute that needs to be designed in rather than tacked on.

  • The two major categories of activities are detection and recovery (taking action).
  • We need mechanisms in place to detect that something is going wrong and to identify the underlying fault.
  • Then we need to recover without leading to a failure or, at worst, fail safely.

SLIDE 7

A simple software watchdog is a first detection mechanism.

Software components are required to report a heartbeat to their supervisor or to a central monitor. The assumption is that as long as the heartbeat is received, the component is working.

How much does this tell us about the operation of the component?

What could be an extension to the simple watchdog concept that could tell us more?

SLIDE 8

How much does this tell us about the operation of the component?

  • Hardware watchdogs are regularly built into the hardware of safety-critical systems. Unless the watchdog is reset within its timeout period, a hardware reset will be issued to restart the system.
  • The heartbeat only tells us that the component is regularly getting to the point in its execution where the heartbeat is sent. It tells us nothing much else about the operation of the component.
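A minimal Java sketch of the software heartbeat idea. The class name `HeartbeatMonitor`, its methods, and the timeout policy are hypothetical and chosen only for illustration. Timestamps are passed in explicitly so the behaviour is easy to follow.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical heartbeat monitor; names and timeout policy are illustrative only. */
final class HeartbeatMonitor {
    private final Map<String, Long> lastBeatMillis = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Called by a component each time it reaches its heartbeat point. */
    void beat(String component, long nowMillis) {
        lastBeatMillis.put(component, nowMillis);
    }

    /** True if the component never reported or its last beat is older than the timeout. */
    boolean isSuspect(String component, long nowMillis) {
        Long last = lastBeatMillis.get(component);
        return last == null || nowMillis - last > timeoutMillis;
    }
}
```

A component would call `beat` at its heartbeat point and a supervisor would poll `isSuspect`. A fresh heartbeat shows only that the component reached that point recently, not that its results are correct.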

SLIDE 9

What could be an extension to the simple watchdog concept that could tell us more?

  • If we have the component send information that is more than a heartbeat at regular intervals, a watchdog monitor that knows how the component is supposed to operate could check the component for incorrect operation.
  • This would require that the watchdog understands all of the possible correct paths of execution of the component under observation, that some indication is sent whenever the component gets to a significant point, and that the watchdog takes action when the information does not match correct operation.
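One way to sketch this extension in Java is a watchdog that holds a table of legal checkpoint transitions. `SequenceWatchdog`, the checkpoint names, and the "reject and keep the old state" response are assumptions made for the example and are not part of the slides.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical watchdog that checks reported checkpoints against known-correct paths. */
final class SequenceWatchdog {
    private final Map<String, Set<String>> allowedNext;
    private String current;

    SequenceWatchdog(String start, Map<String, Set<String>> allowedNext) {
        this.current = start;
        this.allowedNext = allowedNext;
    }

    /**
     * Called by the component at each significant point.
     * Returns true and records the checkpoint if the transition is legal;
     * returns false (state unchanged) to signal incorrect operation.
     */
    synchronized boolean report(String checkpoint) {
        Set<String> legal = allowedNext.getOrDefault(current, Set.of());
        if (!legal.contains(checkpoint)) {
            return false;
        }
        current = checkpoint;
        return true;
    }
}
```

A real monitor would also need a timeout between checkpoints and a defined action on a `false` result. Both are left out here to keep the sketch short.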
SLIDE 10

There are a number of responses that can be taken once you find out that something is wrong.

What are some approaches that can be used to deal with a broken component, and an operation that may not have been done correctly? What concerns do you have to consider?

SLIDE 11

First, we will establish some terminology.

  • Cancellation
      • Task-level termination
      • May or may not result in stopping threads
  • Interruption
      • Thread-level termination
      • Get a thread to terminate with or without completion of the current operation
  • Shutdown
      • Application- or service-level termination
      • Stop all tasks, and associated threads, with or without completion

These definitions are not necessarily universally accepted.
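In Java, task-level cancellation is commonly expressed through `Future.cancel`. The sketch below, built around a hypothetical `CancellationDemo` class, cancels one blocked task and then checks that the same pool thread still runs a second task. This illustrates that cancelling a task need not stop a thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical demo class; not from the slides. */
final class CancellationDemo {
    /** Returns true if the first task was cancelled and the pool still ran a second task. */
    static boolean cancelTaskKeepThread() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            CountDownLatch neverReleased = new CountDownLatch(1);
            Future<Object> blocked = pool.submit(() -> {
                neverReleased.await(); // blocks until the worker thread is interrupted
                return null;
            });
            // Task-level cancellation; interrupts the worker if the task is already running.
            blocked.cancel(true);
            Future<String> next = pool.submit(() -> "still running");
            return blocked.isCancelled() && "still running".equals(next.get());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupted status for the caller
            return false;
        } catch (ExecutionException e) {
            return false;
        } finally {
            pool.shutdownNow(); // service-level shutdown of the demo pool
        }
    }
}
```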

SLIDE 12

If you are not using a framework with fault handling, you will have to deal with it all yourself.

  • A framework without fault handling may not give you many options
  • Define cancellation and interruption policies
      • How cancellation is requested, when the request is checked, and what is done in response
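A small Java sketch of one possible cancellation policy. The class `CancellableBatch` and its choices are assumptions for illustration: cancellation is requested through a flag (how), the flag and the thread's interrupted status are checked between items (when), and the task finishes the current item and reports how far it got (what is done).

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

/** Hypothetical task with an explicit, documented cancellation policy. */
final class CancellableBatch {
    private final AtomicBoolean cancelled = new AtomicBoolean();

    /** How: any thread may request cancellation by setting the flag. */
    void cancel() {
        cancelled.set(true);
    }

    /**
     * Processes items in order.
     * When: the flag and the interrupted status are checked before each item.
     * What: the current item always completes; the return value is the index
     * of the first unprocessed item (items.size() if everything was processed).
     */
    <T> int run(List<T> items, Consumer<T> work) {
        for (int i = 0; i < items.size(); i++) {
            if (cancelled.get() || Thread.currentThread().isInterrupted()) {
                return i;
            }
            work.accept(items.get(i));
        }
        return items.size();
    }
}
```

Checking only between items keeps each item atomic, at the cost of a cancellation delay of up to one item's processing time.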
SLIDE 13

You have some design decisions to make regarding how to handle being interrupted.

  • At the task level
      • Finish current work or stop immediately
      • Does it own the thread?
          • Yes: end the thread?
          • No, i.e. it is running from a thread pool: let the thread manager handle it for the thread
              • Preserve interrupted status
              • Throw InterruptedException
  • At the thread level
      • Propagate the interrupt if the code where it is detected does not implement the interruption policy
      • Otherwise, implement the interruption policy
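The two task-level options for code that does not own its thread can be sketched in Java as follows. `InterruptHandling` and its method names are hypothetical, and returning `null` on interruption is just a convention chosen for this example.

```java
import java.util.concurrent.BlockingQueue;

/** Hypothetical helpers showing the two ways to hand an interrupt back to the thread's owner. */
final class InterruptHandling {
    /** Option 1: propagate. The caller, which knows the interruption policy, decides. */
    static String takePropagating(BlockingQueue<String> queue) throws InterruptedException {
        return queue.take();
    }

    /** Option 2: the signature cannot throw, so restore the flag for code higher up the stack. */
    static String takePreservingStatus(BlockingQueue<String> queue) {
        try {
            return queue.take();
        } catch (InterruptedException e) {
            // Catching InterruptedException clears the flag; set it again so it is not lost.
            Thread.currentThread().interrupt();
            return null; // assumed "no result" convention for this sketch
        }
    }
}
```

Swallowing the exception without restoring the flag would hide the interrupt from the thread pool or other owner of the thread.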
SLIDE 14

There are other things that you need to consider if you want to build a fault-tolerant system.

  • What is the most common indication that your program had a problem?

    Exception in thread "main" java.lang.SomeException
        at com.example.myproject.Class1.method1(Class2.java:16)
        at com.example.myproject.Class2.method2(Class3.java:25)
        at com.example.myproject.TopClass.main(TopClass.java:14)

If it is operationally critical that the system keeps running, tries to recover from errors, or at a minimum does a graceful, fail-safe shutdown, what do you do?

SLIDE 15

What do you do?

  • The most common response is “handle all exceptions”, but this cannot always be done.
  • If a class you use throws an unchecked exception or an error, you have no indication that it might come at you.

Interface Thread.UncaughtExceptionHandler

  • This provides a last-resort mechanism for handling any uncaught Throwable, which includes all Exceptions and Errors, when a thread terminates because of it.
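A short Java sketch of installing a per-thread handler. `UncaughtDemo` and `runAndCapture` are hypothetical names. The handler here only records the Throwable, where a real system would log it and start recovery or a fail-safe shutdown.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical demo of Thread.UncaughtExceptionHandler. */
final class UncaughtDemo {
    /** Runs the task on its own thread and returns the Throwable the handler saw, or null. */
    static Throwable runAndCapture(Runnable task) {
        AtomicReference<Throwable> seen = new AtomicReference<>();
        Thread worker = new Thread(task, "worker");
        // Invoked on the worker thread itself, just before it terminates.
        worker.setUncaughtExceptionHandler((thread, throwable) -> seen.set(throwable));
        worker.start();
        try {
            worker.join(); // the handler has finished once join returns
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen.get();
    }
}
```

`Thread.setDefaultUncaughtExceptionHandler` installs a handler for all threads that do not have their own. In both cases the thread still terminates, since the handler is a notification and not a way to resume the failed work.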

SLIDE 16

Shutdown of a service should take down all tasks and threads that it owns.

  • At the task level
      • Let a running task complete?
      • Let scheduled but not started tasks complete?
      • Provide information about what work was not finished.
  • Once tasks are handled, interrupt threads in pool
  • ExecutorServices provide some support
      • shutdown()
      • shutdownNow()
      • awaitTermination()
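The three ExecutorService methods are typically combined into a two-phase shutdown. The sketch below assumes a hypothetical helper `GracefulShutdown.shutdownAndReport` and a caller-chosen grace period. It returns the tasks that never started so that unfinished work can be reported.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for a two-phase shutdown of a pool the caller owns. */
final class GracefulShutdown {
    /** Returns the queued tasks that never started (empty if everything finished in time). */
    static List<Runnable> shutdownAndReport(ExecutorService pool, long graceMillis) {
        pool.shutdown(); // phase 1: reject new tasks; running and queued tasks continue
        try {
            if (pool.awaitTermination(graceMillis, TimeUnit.MILLISECONDS)) {
                return List.of();
            }
            // Phase 2: interrupt running tasks and drain the queue.
            return pool.shutdownNow();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupted status
            return pool.shutdownNow();
        }
    }
}
```

Tasks that were already running when `shutdownNow()` was called are interrupted but are not in the returned list. Tracking those would need extra bookkeeping inside the tasks themselves.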