3/25/2014
ECE 753: FAULT-TOLERANT COMPUTING
Kewal K. Saluja
Department of Electrical and Computer Engineering
Low Level Fault-Tolerance: Watchdog and Re-execution
Overview
- Introduction
- Watchdog techniques
– Timers, watchdog processors, error model, control flow checking, memory access and assertion checking
- Re-execution for fault-tolerance
– Basic techniques: RESO concept, program re-execution, instruction re-execution
– Case studies: fine-grain parallel architecture (CRAY), SMT architecture, multiscalar architecture, chip multiprocessor
- Summary
Introduction
- References
- Watchdog - [mahm:88]
- Re-execution - [rotenberg:99], [rashid:00], [subra:10], [kala:13]
- Sohi, Franklin, and Saluja, “A study of time-redundant fault-tolerant techniques for high-performance pipelined computers,” Proceedings FTCS-19, June 1989, pp. 436-443.
Introduction (contd.)
- Somewhat higher level than ECC and masking at circuit level
- Bordering between hardware and software (hardware often assisted by software)
- These are some of the very first fault-tolerance methods
Watchdog techniques
- Key concept
– The actions of a process or processor are checked by another unit, normally a hardware unit
– Actions checked include whether the process is still active (alive), whether it is executing an incorrect path, etc.
[Figure: Processor and watchdog]
Watchdog: Timers
- Check for aliveness
– Processor resets the timer at certain intervals or on certain conditions
– Timer raises an error flag if it is not reset before it overruns
[Figure: Processor and timer, with Error output]
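The aliveness check can be illustrated with a small software simulation. This is only a sketch under stated assumptions, not the hardware design from the lecture: the class name `WatchdogTimer`, the helper `run`, and the parameters `timeout_ticks`, `processor_alive_until`, and `total_ticks` are all hypothetical, and time is modeled as discrete ticks.

```python
class WatchdogTimer:
    """Counts ticks and latches an error flag if not reset before it overruns."""

    def __init__(self, timeout_ticks):
        # Hypothetical parameter: number of ticks allowed between resets.
        self.timeout_ticks = timeout_ticks
        self.count = 0
        self.error = False

    def reset(self):
        """Called by the processor to signal that it is still alive."""
        self.count = 0

    def tick(self):
        """Advance the timer by one tick; raise the error flag on overrun."""
        self.count += 1
        if self.count >= self.timeout_ticks:
            self.error = True


def run(processor_alive_until, total_ticks, timeout_ticks=3):
    """Simulate a processor that resets the timer every tick until it hangs.

    The processor is assumed to stop responding at tick
    `processor_alive_until`; the timer keeps counting regardless.
    """
    wd = WatchdogTimer(timeout_ticks)
    for t in range(total_ticks):
        if t < processor_alive_until:
            wd.reset()  # processor is alive and resets the timer
        wd.tick()       # timer runs independently of the processor
    return wd.error


if __name__ == "__main__":
    print(run(processor_alive_until=10, total_ticks=10))  # healthy processor
    print(run(processor_alive_until=4, total_ticks=10))   # processor hangs at tick 4
```

In this sketch the error flag stays low while the processor keeps resetting the timer, and it is raised a few ticks after the resets stop. The detection latency is bounded by `timeout_ticks`, which is the usual trade-off when choosing the timer interval: a short interval detects a hang sooner but forces the processor to reset the timer more often.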