
1/21/2014

ECE 753: FAULT-TOLERANT COMPUTING

Kewal K. Saluja

Department of Electrical and Computer Engineering

Fault Modeling

Lectures Set 2

Overview

  • Fault Modeling
  • References
  • Introduction
  • Fault models at different levels (HW)
  • Error models
  • High-level failure models (process or system failure)
  • Summary

Recap

  • Think about PROJECT
  • Terminology and definitions
  • Fundamental principles - Redundancy
    – Hardware - low and high level
    – Software
    – Time
    – Information
  • FEF Chain and methods to break it (barriers)
    – Attributes of faults and fault types - such as permanent, transient, intermittent (please read)

Fault Modeling

References

  • [abra:86] Abraham and Fuchs, Fault and error modeling for VLSI, Proc. IEEE, May 1986
  • [kala:13] Kalayappan and Sarangi, A survey of checker architectures, ACM Computing Surveys, Aug 2013
  • [mull:93] Hadzilacos and Toueg, Fault tolerant broadcast and related problems, in Distributed Systems (book)

Fault Modeling (contd.)

Introduction

  • What is a model?
    – An abstraction that captures the behavior of the original system
      • must be simple
      • must lead to accurate conclusions

Fault Modeling (contd.)

Introduction

  • Why use a model?
    – tractability of analysis
    – a non-destructive method to study (low cost, alternative to fault injection)
    – manageable study space (can check equivalence and reduce the study space)


Fault Modeling (contd.)

Introduction

  • Different models at different levels of abstraction:
    – Chip level - manufacturing defects, random faults, transistor faults, gate failures, aging, …
    – System level
      • HW - aging, interconnect failures, chip failures, …
      • SW - bugs, design flaws, incorrect algorithms, …

Fault Modeling (contd.)

Fault models at different levels (HW)

  • Process level
  • Transistor level
  • Gate level
  • Function level (often error models)
  • Behaviour level (often timing failure models)

. . .

  • System level (usually failure models)

Fault Modeling (contd.)

Fault models at different levels (contd.)

  • Process level - Defect models
    • cluster defects
    • point and random defects
    • used to predict the process yield
    • tested using optical and parametric tests
    • effect of defect
      • chip fails to perform its function
      • unacceptable parameters - large capacitance, large delay, slow speed, high current
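
The slide states that defect models are used to predict process yield but gives no formula. As an illustration only, the sketch below uses two common textbook yield models that are not taken from the slides: a Poisson model for independent point defects, Y = exp(-A·D), and a negative-binomial model for clustered defects, Y = (1 + A·D/alpha)^(-alpha). The function names and the numeric values are hypothetical.

```python
import math

def poisson_yield(area_cm2: float, defect_density: float) -> float:
    """Yield assuming independent point defects: Y = exp(-A * D)."""
    return math.exp(-area_cm2 * defect_density)

def clustered_yield(area_cm2: float, defect_density: float, alpha: float) -> float:
    """Negative-binomial yield; a smaller alpha means stronger clustering."""
    return (1.0 + area_cm2 * defect_density / alpha) ** (-alpha)

if __name__ == "__main__":
    # Hypothetical numbers: a 1 cm^2 die with 0.5 defects per cm^2 on average.
    area, density = 1.0, 0.5
    print(f"Poisson yield:   {poisson_yield(area, density):.3f}")
    print(f"Clustered yield: {clustered_yield(area, density, alpha=2.0):.3f}")
```

Under these assumptions, clustering gives a higher predicted yield than the Poisson model for the same average defect density, because defects that pile up on a few dies leave more dies defect-free.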

Fault Modeling (contd.)

Fault models at different levels (contd.)

  • Transistor level - failure of a transistor
    • fabrication level causes - point defects, mask misalignment, design rule violation
    • physical faults - shorts, opens, line-bridges, others
    • size variations -> altered delays
    • coupling/crosstalk
    • degradation of elements - electromigration
    • alpha particle hits
    • power transients
    • missing/extra transistors – PLAs
    • function modification/alteration - FPGA

Fault Modeling (contd.)

Fault models at different levels (contd.)

  • Transistor level - erroneous behaviors
    • high current
    • incorrect logic output
    • intermediate voltage
    • different performance (operating speed)
    • state change - alpha particle hit

Fault Modeling (contd.)

Fault models at different levels (contd.)

  • Transistor level - prevalent fault models
    • stuck-on and stuck-off faults
    • bridging fault
    • strength of signals
    • delay fault
    • coupling and crosstalk
  • Limitations
    • very large number of possible faults makes it difficult to handle these faults (intractability due to large model space)


Fault Modeling (contd.)

Fault models at different levels (contd.)

  • Transistor level - comments (these are fairly general and are not restricted to transistor level models)
    • increasing computing power implies that we can handle a large number of faults and complex models
    • these models are used for test generation and not for fault tolerance per se
    • methods have been proposed to reduce the number of faults that need to be studied - e.g. fault equivalence
    • classical methods and newer methods (such as current testing) are employed in real testing
    • design for testability and built-in self-test are becoming prevalent
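
Fault equivalence, mentioned above as a way to shrink the fault list, can be illustrated on a single gate. The sketch below is not from the slides: it takes a hypothetical 2-input AND gate with lines a, b (inputs) and z (output), applies each single stuck-at fault, and groups faults that produce the same faulty truth table. Faults in the same group cannot be told apart by any input, so only one representative per group needs to be studied.

```python
from itertools import product

LINES = ("a", "b", "z")  # two inputs and the output of one AND gate

def and_gate(a, b, fault=None):
    """Evaluate z = a AND b with an optional single stuck-at fault (line, value)."""
    if fault and fault[0] == "a":
        a = fault[1]
    if fault and fault[0] == "b":
        b = fault[1]
    z = a & b
    if fault and fault[0] == "z":
        z = fault[1]
    return z

def equivalence_classes():
    """Group faults whose faulty truth tables are identical."""
    classes = {}
    for fault in product(LINES, (0, 1)):
        table = tuple(and_gate(a, b, fault) for a, b in product((0, 1), repeat=2))
        classes.setdefault(table, []).append(fault)
    return list(classes.values())

if __name__ == "__main__":
    for group in equivalence_classes():
        print(group)
```

For this gate the six faults collapse into four classes: a stuck-at-0, b stuck-at-0 and z stuck-at-0 all force the output to 0 and are therefore equivalent.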

Fault Modeling (contd.)

Fault models at different levels (contd.)

  • Gate level - causes
    • same as for transistors
    • additional causes in SSI and board level - failed resistor, failed solder joint, failed wire wrap, …
  • Gate level - erroneous behaviors
    • similar to those for transistors

(one of the most commonly used models - why? See next slides)

Fault Modeling (contd.)

Fault models at different levels (contd.)

  • Gate level - different models
    • Stuck-at: a line value stays the same irrespective of the signal applied to the line
      • Advantages
        • simplicity
        • accuracy
        • can model most real faults
        • tractable model space - count the possible number of faults
        • easy to use and easy to quantify (for quality metric)
        • substantial empirical evidence of its practical use
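
To make "tractable model space" and "easy to quantify" concrete, here is a minimal sketch on a made-up circuit, z = (a AND b) OR c, with an internal line d = a AND b. The circuit, the line names, and the helper functions are assumptions for illustration. A circuit with n lines has 2n single stuck-at faults (10 here), compared with 3^n - 1 multiple stuck-at combinations (242 here). Fault coverage, the fraction of single stuck-at faults a test set detects, is used below as one example of a quality metric.

```python
from itertools import product

LINES = ("a", "b", "c", "d", "z")  # d = a AND b, z = d OR c

def simulate(a, b, c, fault=None):
    """Return z for the hypothetical circuit z = (a AND b) OR c."""
    def line(name, value):
        # A stuck-at fault overrides whatever value drives the line.
        return fault[1] if fault and fault[0] == name else value
    a, b, c = line("a", a), line("b", b), line("c", c)
    d = line("d", a & b)
    return line("z", d | c)

def single_stuck_at_faults():
    """Two faults (stuck-at-0, stuck-at-1) per line."""
    return list(product(LINES, (0, 1)))

def detecting_vectors(fault):
    """Input vectors whose output differs between the good and faulty circuit."""
    return [v for v in product((0, 1), repeat=3)
            if simulate(*v) != simulate(*v, fault=fault)]

def fault_coverage(test_set):
    """Fraction of single stuck-at faults detected by at least one vector."""
    faults = single_stuck_at_faults()
    detected = [f for f in faults
                if any(simulate(*v) != simulate(*v, fault=f) for v in test_set)]
    return len(detected) / len(faults)

if __name__ == "__main__":
    print(len(single_stuck_at_faults()), "single faults,",
          3 ** len(LINES) - 1, "multiple-fault combinations")
    print(detecting_vectors(("a", 0)))
    print(fault_coverage([(1, 1, 0), (0, 1, 0), (1, 0, 0), (0, 0, 1)]))
```

In this example four of the eight possible input vectors are enough to detect every single stuck-at fault.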

Fault Modeling (contd.)

Fault models at different levels (contd.)

  • Gate level - different models
    • Stuck-at (contd.)
      • Disadvantages
        • with increasing device density the model is often questioned and is losing many of its advantages
        • some real defects cannot be modeled by this model
        • more powerful computers are making it possible to handle other models - even at fabrication level

Fault Modeling (contd.)

Fault models at different levels (contd.)

  • Gate level - different models
    • Bridging faults - a pair of lines in a circuit (at gate level) are shorted. Many variations such as intergate, intragate, neighboring lines, …
      • Advantages
        • simple
        • realistic
      • Disadvantages
        • large number of faults
        • difficult to relate to the quality metric
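
A minimal sketch of the bridging model follows. It assumes the common wired-AND / wired-OR abstraction, in which both shorted lines take the AND (or OR) of their driven values; the slides do not say which abstraction is intended. The circuit z = (a AND b) OR c and all function names are hypothetical. The pair count n(n-1)/2 shows why the number of candidate faults grows quickly.

```python
from itertools import combinations, product

def bridge(x, y, kind):
    """Value seen on both shorted lines under a wired-AND or wired-OR assumption."""
    return (x & y) if kind == "AND" else (x | y)

def circuit(a, b, c, bridged=None, kind="AND"):
    """Hypothetical circuit z = (a AND b) OR c; `bridged` names two shorted inputs."""
    v = {"a": a, "b": b, "c": c}
    if bridged:
        p, q = bridged
        v[p] = v[q] = bridge(v[p], v[q], kind)
    return (v["a"] & v["b"]) | v["c"]

def candidate_bridges(lines):
    """Every unordered pair of lines: n(n-1)/2 candidates."""
    return list(combinations(lines, 2))

def detecting_vectors(bridged, kind):
    """Input vectors on which the bridged circuit differs from the good one."""
    return [t for t in product((0, 1), repeat=3)
            if circuit(*t) != circuit(*t, bridged=bridged, kind=kind)]

if __name__ == "__main__":
    print(len(candidate_bridges(range(100))), "pairs among 100 lines")
    print(detecting_vectors(("a", "c"), "AND"))
    print(detecting_vectors(("a", "b"), "AND"))  # empty: this bridge is undetectable
```

In practice only physically adjacent lines are realistic candidates, which is one way the list is pruned; the sketch above does not model layout.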

Fault Modeling (contd.)

Fault models at different levels (contd.)

  • Gate level - different models
    • Stuck-open/Stuck-on - a transistor-based open fault can be modeled at the logic level. Sometimes extra logic gates are used to model opens in this manner, similar to modeling bridging faults
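
The sketch below shows why a stuck-open fault is harder to test than a stuck-at fault. It is a behavioral model of a 2-input CMOS NAND, assumed here for illustration, in which the pull-up transistor driven by input a never conducts. When neither the pull-up nor the pull-down network conducts, the output node floats and keeps its previous value, so detection needs a two-pattern test. The class and attribute names are hypothetical, and the model ignores charge leakage.

```python
class CmosNand:
    """Switch-level 2-input CMOS NAND with an optional stuck-open pull-up on input a."""

    def __init__(self, pmos_a_stuck_open=False):
        self.pmos_a_stuck_open = pmos_a_stuck_open
        self.z = 0  # charge held on the output node (arbitrary initial value)

    def apply(self, a, b):
        pull_down = a == 1 and b == 1                      # series nMOS pair
        pull_up_a = a == 0 and not self.pmos_a_stuck_open  # pMOS driven by a
        pull_up_b = b == 0                                 # pMOS driven by b
        if pull_down:
            self.z = 0
        elif pull_up_a or pull_up_b:
            self.z = 1
        # Otherwise the output floats and keeps its previous value (memory effect).
        return self.z

if __name__ == "__main__":
    good, bad = CmosNand(), CmosNand(pmos_a_stuck_open=True)
    # Two-pattern test: (1, 1) initializes z to 0, then (0, 1) should pull z up to 1.
    for pattern in [(1, 1), (0, 1)]:
        print(pattern, good.apply(*pattern), bad.apply(*pattern))
```

If the gate is initialized with (0, 0) instead, the faulty output already holds 1 and the pattern (0, 1) no longer exposes the fault.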


Fault Modeling (contd.)

Fault models at different levels (contd.)

  • Gate level - different models
    • Delay faults - delay of a gate or a line is different from the nominal or known delay in a perfect process
      • Deals with critical paths - gate delay, path delay, ...
      • Advantages
        • Performance oriented modeling
        • Quite general
      • Disadvantages
        • Difficult to use and intractable (path delay)
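
A rough sketch of the path-delay view: enumerate every input-to-output path in a small netlist, sum the gate delays along it, and flag paths that exceed the clock period once extra delay is added to a gate. The netlist, delays, and function names are made up for illustration. Explicit enumeration like this is exactly what becomes intractable in real designs, since the number of paths can grow exponentially with circuit size.

```python
# Hypothetical netlist: each gate maps to (delay in ns, list of fan-in nodes).
GATES = {
    "g1": (1.0, ["a", "b"]),
    "g2": (1.5, ["b", "c"]),
    "g3": (2.0, ["g1", "g2"]),
    "z":  (1.0, ["g3", "c"]),
}

def paths_to(node):
    """All (path, delay) pairs from a primary input to `node`."""
    if node not in GATES:            # primary input: contributes zero delay
        return [([node], 0.0)]
    delay, fanin = GATES[node]
    return [(path + [node], d + delay)
            for src in fanin
            for path, d in paths_to(src)]

def violating_paths(output, clock_period, extra=None):
    """Paths whose delay exceeds the clock period; `extra` adds delay to named gates."""
    extra = extra or {}
    result = []
    for path, d in paths_to(output):
        d += sum(extra.get(n, 0.0) for n in path)
        if d > clock_period:
            result.append((path, d))
    return result

if __name__ == "__main__":
    print(len(paths_to("z")), "paths, critical delay",
          max(d for _, d in paths_to("z")), "ns")
    # A gate delay fault: g1 is 1.5 ns slower than nominal.
    for path, d in violating_paths("z", clock_period=5.0, extra={"g1": 1.5}):
        print(" -> ".join(path), d)
```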

Fault Modeling (contd.)

Fault models at different levels (contd.)

  • Gate level - different models
    • Other models
      • coupling between pair of lines
      • pin or I/O faults in gates (or chips)
      • speedup/slowdown of signals (sub-micron technologies)
      • aging (such as NBTI in sub-micron technologies)

Fault Modeling (contd.)

Fault models at different levels (contd.)

  • Function Level - when used
    • lower level description is not available
    • function level processing (e.g. simulation) is often faster
    • design available only in mixed form (gate and function)

Fault Modeling (contd.)

Fault models at different levels (contd.)

  • Function Level - where used
    • combinational circuits
      • logic blocks
      • decoders
    • finite state machines
    • large complex circuits
      • microprocessors (often only a mixed format is available, such as ALU at gate level, memory at functional level, etc.)
    • for other building blocks
      • PLAs, RAMs, FPGAs

Fault Modeling (contd.)

Fault models at different levels (contd.)

  • System Level - when used
    • interconnected systems
      • ad hoc connected systems
      • regular connected systems
    • failure of a system or systems, or interconnects
    • many failure models exist and will be discussed later in the course

Fault Modeling (contd.)

Error models

Means of classifying the effect of physical fault(s) in a system - note that, from a modeling point of view, it is not necessary to deduce it using a fault model

  • Goals
    • extent of information corrupted
    • extent of error(s) propagated
    • latency issue

Fault Modeling (contd.)

Error models (contd.)

  • Error effects
    • data
    • control
    • state
  • Error Types (HW)
    • bit errors (data, control, state) - single bit error assumption commonly used in practice
    • unidirectional errors (mostly in data)
    • byte errors (data)
    • other - intermediate logic level
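
The hardware error types above can be told apart by comparing an expected word with the observed one. The sketch below is an illustration, not part of the slides: the function name, the 16-bit default width, and the choice of aligned 8-bit bytes are assumptions. An error is labeled unidirectional when every flipped bit goes the same way (all 0 to 1 or all 1 to 0), and a byte error when all flipped bits fall inside one byte. The labels overlap: a single-bit error is also trivially unidirectional and confined to one byte.

```python
def classify_error(expected: int, observed: int, width: int = 16) -> set:
    """Label the difference between two `width`-bit words with hardware error types."""
    mask = (1 << width) - 1
    diff = (expected ^ observed) & mask   # 1 wherever a bit flipped
    if diff == 0:
        return set()                      # no error
    labels = {"single-bit" if bin(diff).count("1") == 1 else "multi-bit"}
    zero_to_one = diff & observed         # bits that flipped 0 -> 1
    one_to_zero = diff & expected         # bits that flipped 1 -> 0
    if zero_to_one == 0 or one_to_zero == 0:
        labels.add("unidirectional")      # all flips go in the same direction
    # Byte error: every flipped bit lies inside one aligned 8-bit byte.
    if any((diff & ~(0xFF << shift)) == 0 for shift in range(0, width, 8)):
        labels.add("byte")
    return labels

if __name__ == "__main__":
    print(sorted(classify_error(0b1010, 0b1000)))   # one bit dropped from 1 to 0
    print(sorted(classify_error(0xFF00, 0x0000)))   # a whole byte cleared
    print(sorted(classify_error(0x00F0, 0x000F)))   # mixed directions within one byte
    print(sorted(classify_error(0x0101, 0x1010)))   # scattered, mixed directions
```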

Fault Modeling (contd.)

Error models (contd.)

  • Error Types (SW)
    • branch error
    • missing instruction error
    • missing/dangling pointer errors

Fault Modeling (contd.)

High-level failure models (process or system failure)

  • System model
    • single or multiple processor system
    • single - multiple processes executing
    • key - interacting processes - such as message passing systems, distributed systems, ...

Fault Modeling (contd.)

High-level failure models (process or system failure)

  • General classification
    • crash failure - a faulty processor or system stops permanently
    • omission failure - a faulty process sometimes omits inputs/outputs, but when it works, it works correctly
    • timing failure - inputs/outputs are delayed or arrive too early
    • Byzantine failure or arbitrary failure - a faulty processor can exhibit arbitrary behavior, including malicious behavior
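
As a rough illustration of the four classes, the sketch below labels a finite trace of a process's outputs. Everything in it is an assumption made for the example: the trace format, the function name, and the rule that a trace ending in missing outputs is called a crash (on a finite trace, a crash cannot really be distinguished from a long run of omissions). A missing output combined with a mistimed one is reported as a timing failure, treating an omission as an infinitely late output.

```python
from typing import List, Optional, Tuple

Observation = Optional[Tuple[int, float]]  # (value, arrival time), or None if missing

def classify_failure(expected: List[Tuple[int, float, float]],
                     observed: List[Observation]) -> str:
    """Classify a finite trace; expected items are (value, earliest, latest)."""
    pairs = list(zip(expected, observed))
    if any(o is not None and o[0] != e[0] for e, o in pairs):
        return "byzantine"   # wrong value: arbitrary behavior
    if any(o is not None and not (e[1] <= o[1] <= e[2]) for e, o in pairs):
        return "timing"      # correct value, but too early or too late
    missing = [o is None for o in observed]
    if any(missing):
        first = missing.index(True)
        # Everything after the first gap is missing: looks like a permanent stop.
        return "crash" if all(missing[first:]) else "omission"
    return "correct"

if __name__ == "__main__":
    spec = [(1, 0.0, 1.0), (2, 1.0, 2.0), (3, 2.0, 3.0)]
    print(classify_failure(spec, [(1, 0.5), None, None]))
    print(classify_failure(spec, [(1, 0.5), None, (3, 2.5)]))
    print(classify_failure(spec, [(1, 0.5), (2, 2.4), (3, 2.5)]))
    print(classify_failure(spec, [(1, 0.5), (9, 1.5), (3, 2.5)]))
```

The order of the checks reflects the usual view that each class is more general than the one before it: a crash is a special case of omission, and arbitrary behavior covers everything else.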

Summary

  • Fault modeling
    – References
    – Fault models at different levels
    – Error models
    – Process or system failure models