Overview: ECE 753 Fault-Tolerant Computing

ECE 753: Fault-Tolerant Computing (PDF document)

2/4/2014. Kewal K. Saluja, Department of Electrical and Computer Engineering. Overview: introduction and sources, hardware redundancy, information redundancy, time redundancy.


  1. 2/4/2014

     ECE 753: FAULT-TOLERANT COMPUTING
     Kewal K. Saluja
     Department of Electrical and Computer Engineering
     Basic Concepts in Fault-Tolerance

     Overview
     • Introduction - Sources
     • Hardware redundancy
     • Information redundancy
     • Time redundancy
     • Software redundancy

     Introduction
     • Sources
       – Main source: Text Chapters 2 and 3
       – Other sources
         • [prad:96] Chapter 1
         • [siew:99] Chapter 3
         • [Shooman:02] Chapter 4
     These three books contain sufficient material covering this part of the course.

     Introduction (contd.)
     • Scope - Explain using the example of a filter
       – inputs
       – A/D
       – digital subsystem - DSP/custom design
       – D/A
       – outputs
     • Problems and solutions
       – inputs out of range
         • add extra code to check out of range inputs and outputs
         • can also add code to check large deviations between samples
         • software redundancy normally - could do in hardware but costly

     Introduction (contd.)
     • Problems and solutions - contd.
       – Power transients may corrupt the values or fault the algorithm
         • read values twice, execute algorithm twice and compare results
         • Time redundancy
       – Values transmitted by A/D to the digital system may get corrupted
         • encode the values and decode them at the destination
         • Information redundancy
       – Components (DSP processor or A/D or D/A) may fail
         • duplicate such parts
         • Hardware redundancy

     Hardware redundancy
     • Passive hardware redundancy
       – TMR with a voter
         • main problem: single point of failure in hardware or software
         • justification - voter is much lower complexity and can be designed using more reliable technology
         • alternative - use of restoring organ: TMR with triplicated voter
       – NMR voter based generalization
       – Hardware voter (1-bit), software voter - simple
       – Timing issue - sandwich between pairs of FFs
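The TMR-with-voter scheme on the last slide above can be sketched in a few lines. This is a minimal illustration, not the course's implementation: the `scale` function is a hypothetical stand-in for the filter's digital subsystem, and the `fault_in` / `fault_mask` parameters are an assumed way to model one faulty module by flipping bits in its output.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority, i.e. a 1-bit hardware voter applied per bit."""
    return (a & b) | (b & c) | (a & c)


def tmr(module, x, fault_in=None, fault_mask=0):
    """Run three copies of `module` on x and vote on their outputs.

    fault_in (0, 1, or 2) selects the copy whose output is XORed with
    fault_mask; both are hypothetical fault-injection knobs.
    """
    outputs = [module(x) for _ in range(3)]
    if fault_in is not None:
        outputs[fault_in] ^= fault_mask
    return majority_vote(*outputs)


def scale(x: int) -> int:
    # Hypothetical stand-in for the DSP stage of the filter example.
    return (3 * x) & 0xFF


if __name__ == "__main__":
    reference = scale(25)
    voted = tmr(scale, 25, fault_in=1, fault_mask=0b0101)
    print("fault masked:", voted == reference)
```

A single faulty copy is outvoted by the other two. The voter itself remains the single point of failure noted on the slide, which is what the triplicated-voter alternative addresses.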

  2. Hardware redundancy (contd.)
     • Passive hardware redundancy (contd.)
       – Comparison between hw and sw voter schemes

                              hw                          sw
         cost                 high                        low
         flexibility          inflexible                  flexible
         synchronization      tightly                     loosely
         performance          high (fast)                 low (slow)
         types of voting*     majority (others costly)    different (no extra cost)

     Hardware redundancy (contd.)
     • Passive hardware redundancy (contd.)
       – types of voting
         • majority
           – in many practical situations it is meaningless
         • average
           – can have poor performance if a sensor always provides a very low value
         • mid value
           – a good choice - can be very costly to implement in HW

     Hardware redundancy (contd.)
     • Active hardware redundancy
       – Key - detect fault, locate, reconfigure
       – duplicate with comparison
         • single point of failure
       – standby sparing
         • one operational unit - it has its own fault detection mechanism
         • on occurrence of fault a second unit (spare) is used
           – cold standby - standby is in unknown state
           – hot standby - standby is in same state as system - quick start
         • can generalize to n - one active and n-1 standby spares

     Active approach to FT
     • See figure 1.6 of [prad:96]
     [Figure: Basic operations in active fault tolerance. Source: Pradhan 1996]

     Hardware redundancy (contd.)
     • Active hardware redundancy (contd.)
       – Pair-and-a-spare - this combines “duplicate with comparison” with “standby sparing”
         • duplicate units (pair of units) are used to compare and signal an error to the reconfiguration unit
         • second duplicate (pair, and possibly more in case of pair and k-spare) is used to take over in case the working duplicate (pair) detects an error
         • a pair is always operational
       – Watchdog timer
         • a “timer” - substantially low cost hardware monitors the function of the working unit

     Hardware redundancy (contd.)
     • Hybrid hardware redundancy
       – Key - combine passive and active redundancy schemes
       – NMR with spares
         • example - 5 units
           – 3 in TMR mode
           – 2 spares
           – all 5 connected to a switch that can be reconfigured
         • comparison with 5MR
           – 5MR can tolerate only two faults whereas hybrid scheme can tolerate three faults that occur sequentially
           – cost of the extra fault-tolerance: switch
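The 5MR versus TMR-with-two-spares comparison can be checked with a small simulation. This is a sketch under stated assumptions that are not in the slides: faulty units all output the same wrong value (a worst case for the voter), faults arrive strictly one at a time with reconfiguration completing in between, and the voter and switch are themselves fault-free. The function names are hypothetical.

```python
from collections import Counter

GOOD, BAD = "good", "bad"


def vote(outputs):
    """Return the strict-majority value among outputs, or None if there is none."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


def faults_tolerated_nmr(n):
    """Count sequential faults masked by a static n-unit majority voter."""
    units = [GOOD] * n
    tolerated = 0
    for i in range(n):
        units[i] = BAD                      # one more unit fails
        if vote(units) != GOOD:
            break
        tolerated += 1
    return tolerated


def faults_tolerated_hybrid(active_n=3, spares=2):
    """Count sequential faults masked by NMR with spares."""
    active = [GOOD] * active_n
    tolerated = 0
    while active:
        active[0] = BAD                     # a fault hits one active unit
        if vote(active) != GOOD:
            break
        tolerated += 1
        # Switch: disconnect every unit that disagrees with the voted output.
        active = [u for u in active if u == GOOD]
        if spares > 0:                      # bring in a spare if one is left
            active.append(GOOD)
            spares -= 1
    return tolerated


if __name__ == "__main__":
    print("5MR:", faults_tolerated_nmr(5))
    print("TMR + 2 spares:", faults_tolerated_hybrid(3, 2))
```

Under these assumptions the static 5MR voter is outvoted once a third unit fails, while the hybrid scheme replaces each failed unit before the next fault arrives and so still has a two-out-of-three majority at the third fault.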

  3. Hardware redundancy (contd.)
     • Hybrid hardware redundancy (contd.)
       – Self purging redundancy
         • initially start with NMR
         • purge one unit at a time till arrive at 3MR
         – can tolerate more faults initially compared to NMR with spare
         – cost of the switch - higher?
         – How does it compare to sift-out redundancy?
       – Triple-duplex redundancy
         • combines duplication-with-compare and TMR

     Information redundancy
     • Key concept - add redundancy to information/data
       – all schemes use error detecting or error correcting coding
     • Use of parity
       – very effective single error detection
       – encoding and decoding cost is low
       – commonly used in memories, transmission over short reliable channels
       – limitations
         • unable to detect common multiple errors
         • cannot be used in data transformation - for example addition does not preserve parity

     Information redundancy (Contd.)
     • Error correcting codes
       – triplication
       – Hamming code - you have learnt it
       – byte error detection/correction - to be discussed later
       – cyclic code - see book
     • m-out-of-n codes
       – encode each word (data/control) such that the coded word is of length n and each coded word has exactly m 1’s in it
         • can detect all single errors
         • can detect all unidirectional multiple errors

     Information redundancy (Contd.)
     • Berger codes
       – n information bits are encoded into an n+k bit code word. The k check bits are binary encoding of the number of 1’s (or 0’s) in the n information bits
         • can detect all single errors
         • can detect all unidirectional multiple errors if carefully designed
     • Arithmetic codes
       – AN code
         • used for arithmetic function unit designs
         • each data word is multiplied by a constant A
         • makes use of the identity A(N+M) = AN + AM
         • choice of A is important

     Information redundancy (Contd.)
     • Arithmetic codes (Contd.)
       – Residue code
         • discussed earlier in the course using modulo addition
         • makes use of the fact (M+N) mod k = (M mod k + N mod k) mod k
       – Checksums
         • data is sent/stored with a checksum and when used the checksum is regenerated and compared to the a priori known checksum
         • functions used for checksum
           – add, exclusive-OR (bit wise), add with end-around carry, LFSR, …
         • limitation
           – can only perform (normally) error detection

     Information redundancy (Contd.)
     • Self-Checking
       – This is a form of hardware redundancy but often it is closely related to ECC techniques, therefore I have chosen to include it here
       – Assumptions: inputs are coded and outputs are coded
       – Objective: in the presence of a fault the circuit should either continue to provide correct output(s) or indicate by providing an error indication that there is a fault.
         • Clearly error indication cannot be 1-bit output (why?)
         • With 2-bits output, 00 and 11 may indicate no failure
         • other output combinations (10, 01) may indicate a failure
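The two arithmetic-code identities on this page, A(N+M) = AN + AM and (M+N) mod k = (M mod k + N mod k) mod k, can be exercised directly. The following is a minimal sketch: the constants `A = 3` and `K = 3` and all function names are illustrative assumptions, not values taken from the course.

```python
A = 3   # assumed AN-code constant; a bit flip changes a word by +/- 2**i, never a multiple of 3
K = 3   # assumed residue-code modulus


def an_encode(n: int) -> int:
    """AN code: multiply the data word by the constant A."""
    return A * n


def an_check(word: int) -> bool:
    """A valid AN code word is divisible by A."""
    return word % A == 0


def residue_encode(n: int) -> tuple[int, int]:
    """Residue code: carry the value together with its residue mod K."""
    return (n, n % K)


def residue_add(x: tuple[int, int], y: tuple[int, int]) -> tuple[int, int]:
    """Add the values and, independently, add the check parts mod K."""
    (n, rn), (m, rm) = x, y
    return (n + m, (rn + rm) % K)


def residue_check(word: tuple[int, int]) -> bool:
    """The regenerated residue must match the carried check part."""
    value, check = word
    return value % K == check


if __name__ == "__main__":
    total = an_encode(5) + an_encode(7)          # equals an_encode(12)
    print("AN sum valid:", an_check(total))
    print("AN sum with bit 2 flipped valid:", an_check(total ^ 0b100))

    rsum = residue_add(residue_encode(5), residue_encode(7))
    print("residue sum valid:", residue_check(rsum))
    print("corrupted residue sum valid:", residue_check((rsum[0] ^ 1, rsum[1])))
```

Unlike parity, both checks survive addition, which is why these codes suit arithmetic units. With A = 3 every single-bit error in the code word is detected, since no power of two is divisible by 3; other choices of A trade cost against coverage.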
