2/25/2014
ECE 753: FAULT-TOLERANT COMPUTING
Kewal K. Saluja
Department of Electrical and Computer Engineering
System Diagnosis
Overview
- Introduction
- System Model
- Diagnosis Problem - PMC model
- Other Models and Comments
- Sequential Diagnosability
- Other Formulations, Algorithms, and Problems
- Summary
Introduction
- Reference
- [prad:96] Chapter 8; original paper in IEEE TC (Dec. 1967)
- Diagnosis: an important part of recovery,
maintenance and reconfiguration
- What is system-level diagnosis: diagnosing failed components in a large, possibly multiprocessor, system
- Underlying needs: failures are inevitable, and units are intelligent enough to test other units, hence the need for a different model and a corresponding theory
System Model
- Model and Assumptions
– Graph model
- Processors/processes expressed as nodes
- Interconnects as links between nodes
– Each processor is sufficiently powerful to test other processors comprehensively
– An example model with four nodes
– Test model: if node Vi tests node Vj, draw a directed link from Vi to Vj
Diagnosis - PMC model (contd.)
- Example – Test Model
[Figure: example test graph with four nodes v1, v2, v3, v4]
Diagnosis - PMC model (contd.)
- Assumptions
– System with n units
– Tests are comprehensive
– Test results are binary: good (0) / faulty (1)
– Faulty units cannot be trusted for their test outcomes (denoted x, meaning the outcome can be 0 or