4/1/2014
ECE 753: FAULT-TOLERANT COMPUTING
Kewal K. Saluja
Department of Electrical and Computer Engineering
High-Level Fault Tolerance: Checkpointing and Recovery
Introductory material
Overview
- Introduction and basic concept
- Fault model and fault coverage
- Checkpointing and backward error recovery (rollback)
– General principles
– Uniprocessor systems
- Summary
- Cost, Overhead, Latency issues
- Distributed Systems
Introduction
- References
– Text Chapter 6
– [Prad:96] Chapter 3: sections on rollback and reconfiguration
Introduction (contd.)
- Somewhat higher level than ECC and watchdog; uses re-execution as the basic recovery strategy
- In practice, it is a hardware-assisted software method
- Basic concept: save a fault-free state of the system and, if and when an error is detected, reload the fault-free state and re-execute
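The basic concept above (save a fault-free state, reload it when an error is detected, then re-execute) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Checkpointed` class, `run_step`, the `error_detected` predicate, and the injected fault are all hypothetical names invented for this sketch, and the state is a plain in-memory dictionary rather than real processor or process state.

```python
import copy


class Checkpointed:
    """Hypothetical helper: live state plus one saved fault-free copy."""

    def __init__(self, state):
        self.state = state
        self.saved = copy.deepcopy(state)   # initial state is assumed fault-free

    def checkpoint(self):
        # Save the current (assumed fault-free) state.
        self.saved = copy.deepcopy(self.state)

    def rollback(self):
        # Reload the last saved fault-free state.
        self.state = copy.deepcopy(self.saved)


def run_step(system, step, error_detected, max_retries=3):
    """Run `step`; if an error is detected, reload the saved state and re-execute."""
    for _ in range(max_retries + 1):
        step(system.state)
        if not error_detected(system.state):
            system.checkpoint()             # result looks fault-free, so save it
            return True
        system.rollback()                   # error detected: reload and retry
    return False                            # error persisted across all retries


# Usage: a step that is corrupted by an injected transient fault on its first run.
attempts = {"n": 0}


def flaky_increment(state):
    attempts["n"] += 1
    state["x"] += 1
    if attempts["n"] == 1:                  # injected transient fault (assumption)
        state["x"] = -999


system = Checkpointed({"x": 0})
ok = run_step(system, flaky_increment, lambda s: s["x"] < 0)
```

Re-execution only helps here because the injected fault is transient; a fault that recurs on every attempt would exhaust `max_retries` and return `False`.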
Introduction - Basic Concept (contd.)
- Three phases of recovery
– Error detection
– Damage assessment
– Recovery: error elimination and arrival at the point where the error was detected
- Often entails restarting fresh on a system presumed to be fault-free
- Backward error recovery
– Current process is rolled back to some error-free point and re-executes
– Trivial solution: start afresh from the beginning of the program
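To make the difference between rolling back to an error-free point and the trivial solution concrete, here is a small simulation sketch. It assumes a single transient error that is detected at the step where it occurs, and it ignores the cost of taking checkpoints. The function `run_with_rollback` and its parameters are hypothetical names chosen for this example.

```python
def run_with_rollback(n_steps, checkpoint_interval, fault_at):
    """Sum 1..n_steps one step at a time with one injected transient error.

    checkpoint_interval=None models the trivial solution (restart from the
    beginning). Returns (result, total_steps_executed).
    """
    total, step = 0, 0          # current process state
    saved = (0, 0)              # error-free point; initially the program start
    executed = 0
    fault_pending = True
    while step < n_steps:
        step += 1
        total += step
        executed += 1
        if fault_pending and step == fault_at:
            fault_pending = False
            total, step = saved  # error detected: roll back and re-execute
            continue
        if checkpoint_interval and step % checkpoint_interval == 0:
            saved = (total, step)  # record a new error-free point
    return total, executed


with_checkpoints = run_with_rollback(100, 10, 57)    # (5050, 107)
restart_from_start = run_with_rollback(100, None, 57)  # (5050, 157)
```

Both runs reach the same result, but with a checkpoint every 10 steps only 7 steps are redone, whereas restarting from the beginning redoes all 57.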
Fault model and fault coverage
- Possible scenarios
– Hardware is faulty, software is fault-free
– Fault detection mechanism exists (in hardware or in software form)
– Hardware is fault-free, software is faulty
– Both hardware and software are faulty
- Assumptions for backward error recovery