
  1. Institut für Technische Informatik, Chair for Embedded Systems, Prof. Dr. J. Henkel. Lecture in the summer semester 2013. Lars Bauer, Artjom Grudnitsky, Hongyan Zhang, Jörg Henkel

  2. Chapter 8: Fault Tolerance and Reliability in FPGA-based Systems

  3. Outline:
     1. Introduction
     2. Overview
     3. Special Instructions
     4. Fine-Grained Reconfigurable Processors
     5. Configuration Prefetching
     6. Coarse-Grained Reconfigurable Processors
     7. Adaptive Reconfigurable Processors
     8. Fault-tolerance by Reconfiguration
        • Introduction
        • Fault Detection and Mitigation Techniques
        • Applications of Reliability Techniques
        • LHC
        • Space
        • OTERA
     L. Bauer, KIT, 2013

  5. • CMOS scaling increases the occurrence of
       ◦ Manufacturing defects
       ◦ Post-deployment degradation
       ◦ Especially important for FPGAs, as they have a large number of transistors and interconnect wires
     • Environmental conditions can cause temporary faults
       ◦ E.g. aerospace industry: hardened devices are used for mission-critical tasks, FPGAs for non-critical data processing
     • Unlike ASICs, FPGAs can adapt to deal with permanent and temporary faults
     [Figures: photo of Gordon E. Moore (co-founded Intel in 1968); plot of the number of dopant atoms in the channel, src: ITRS]

  6. • Permanent faults: e.g. stuck-at failures in CLBs and opens, bridges, shorts in the programmable switching matrix
       ◦ Can occur during the fabrication process without being detected
       ◦ Damage to device resources may also appear during the life cycle of an FPGA
     • Intermittent faults: have a permanent cause in the structure of the circuit, but their effect is intermittent, e.g. depending on temperature or power consumption
     • Transient faults: have a temporary cause that can alter signal values or state stored in memory cells, which creates indefinite and incorrect states in the computation
       ◦ E.g. a high-energy particle strike resulting in an energy exchange and charge displacement

  7. • Breakdown of Si-H bonds at the silicon-oxide interface due to voltage/thermal stress
     • Causes interface traps
     • Affects mostly P-MOSFETs because of the negative gate bias
       ◦ Effect in N-MOSFETs is negligible
     • Despite research focus: NBTI is observed, but not yet fully understood
     [Figure: cross-section of a P-type MOSFET (gate, oxide, source, drain) with Si-H bonds, H+ and a trap at the silicon-oxide interface; V_g < 0 means stress]

  8. • NBTI manifests itself as a shift in V_th
       ◦ Causes an increase in transistor delay
       ◦ NBTI leads to delay faults and resulting circuit failure
     • Recovery effect in periods of no stress
       ◦ When voltage and temperature are low, V_th can shift back towards its original value
       ◦ Full recovery from a stress period is only possible in infinite time
     • In practice, the overall V_th shift increases over longer periods, e.g. months or years
     [Figure: V_th shift [V] and V_g [V] (between 0 and -1) over time, with alternating stress and recovery phases]

  9. • Temperature is an important factor in NBTI modeling
     • Higher temperatures increase the shift in threshold voltage
     • ΔV_th is approximately 50% higher at 75°C than at 55°C
     • The NBTI effect at a constant 75°C is approximately equal to that of alternating between 85°C and 25°C
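
One way to read the 50% figure above is through an Arrhenius-type temperature dependence, ΔV_th ∝ exp(-E_a / (k·T)). The slides do not state this model; it is an assumption made here for illustration, and the function names are hypothetical. Under that assumption, a minimal Python sketch can back out the effective activation energy implied by the two temperatures:

```python
import math

K_BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def effective_activation_energy(ratio, t_low_c, t_high_c):
    """Solve ratio = exp(Ea/k * (1/T_low - 1/T_high)) for Ea in eV.

    Assumes an Arrhenius dependence of the V_th shift on temperature,
    which is an illustrative assumption, not a statement from the slides.
    """
    t_low = t_low_c + 273.15
    t_high = t_high_c + 273.15
    return K_BOLTZMANN_EV * math.log(ratio) / (1.0 / t_low - 1.0 / t_high)


def shift_ratio(ea_ev, t_ref_c, t_c):
    """V_th shift at t_c relative to the shift at t_ref_c (same model)."""
    t_ref = t_ref_c + 273.15
    t = t_c + 273.15
    return math.exp(ea_ev / K_BOLTZMANN_EV * (1.0 / t_ref - 1.0 / t))


# 50% higher shift at 75 degrees C than at 55 degrees C
ea = effective_activation_energy(1.5, 55.0, 75.0)
print(f"effective activation energy: {ea:.3f} eV")  # roughly 0.2 eV
```

The same `shift_ratio` helper can then be used to estimate the relative shift at other temperatures, with the caveat that a single-exponent model ignores the recovery effect described on the previous slide.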

  10. • The NBTI effect is minimal where the NBTI stress is distributed equally between the two PMOS transistors of the SRAM cell
      [Figure: static noise margin (SNM) degradation after 7 years in 32nm (0% to 40%) versus percentage of time that the cell stores zero [%]]
      src: S. Kothawade, K. Chakraborty, S. Roy, "Analysis and mitigation of NBTI aging in register file: An end-to-end approach"

  11. • Hot-Carrier Injection (HCI): build-up of trapped charges in the gate-channel interface region
        ◦ Progressive reduction of carrier mobility and increase in CMOS threshold voltage
        ◦ Switching speed becomes slower, which leads to timing problems

  12. • Time-Dependent Dielectric Breakdown (TDDB): over time, a conducting path forms in thin oxide layers [CCMA10]
      [Figure: transistor with gate (G), source (S) and drain (D)]

  13. • Example: read noise margin
      • Worst case: "half-selected" state (wordline + bitlines high)
      [Figure: SRAM cell (WL, BL, BR, pass-gate, nodes V_L and V_R storing "1" and "0") and its V_L versus V_R curves from 0 to V_dd for a fresh cell and for p-source, n-source and drain breakdown]
      src: Stathis, IRPS (2008)

  14. • Most device problems can be traced back to high-field effects, which are related to the failure to follow Dennard scaling
      src: Radhakrishnan et al., IEDM (2001)

  15. • Transistor and power scaling are no longer balanced
        ◦ Scaling is limited by power
      • Higher power density leads to thermal problems
        ◦ Accelerates aging effects

      | Quantity | Classical scaling (Dennard) | Power-limited scaling |
      |---|---|---|
      | Device count | S^2 | S^2 |
      | Device frequency | S | S |
      | Device power (cap) | 1/S | 1/S |
      | Device power (V_dd) | 1/S^2 | ~1 |
      | Power density | 1 | S^2 |

      S: scaling factor
      src: G. Venkatesh et al., "Conservation Cores: Reducing the Energy of Mature Computations", ASPLOS '10
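
The power density row of the scaling comparison is the product of the four rows above it. A minimal Python sketch of that arithmetic follows; the function name is hypothetical, and the "~1" entry for power-limited scaling is treated as exactly 1 for simplicity:

```python
from fractions import Fraction


def power_density(s, vdd_scales):
    """Relative power density after scaling by factor s.

    Multiplies the per-row factors: device count (S^2), frequency (S),
    capacitance term (1/S) and the V_dd term (1/S^2 under classical
    Dennard scaling, taken as 1 when supply voltage no longer scales).
    """
    s = Fraction(s)
    device_count = s ** 2
    frequency = s
    power_cap = 1 / s
    power_vdd = 1 / s ** 2 if vdd_scales else Fraction(1)
    return device_count * frequency * power_cap * power_vdd


for s in (2, 4):
    print(s, power_density(s, vdd_scales=True), power_density(s, vdd_scales=False))
```

With V_dd scaling the product stays at 1 for any S; without it the product grows as S^2, which is the thermal problem the slide points to.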

  16. • Electromigration: thermally activated metal ions may leave their potential wells
        ◦ The electric field and the momentum exchange with electrons direct the migration of the metal ions
        ◦ Can lead to open/short circuits
      src: Wikipedia

  17. • Radiation-induced faults
        ◦ Single Event Upsets (SEU) / Single Event Transients
        ◦ Most common: single bit flip in an SRAM cell
        ◦ SEU effect on ASICs
            • Transient (only variation is the time duration of the fault)
            • Even if latched, will eventually be overwritten
        ◦ SEU effect on FPGAs
            • Permanent (until reset/reconfiguration) if the configuration memory is hit by the SEU
      [Figure: high-energy particle (neutron or proton) striking a transistor cross-section (gate, isolation, n+, p+, N-well, P-well, depletion region, P-substrate) and displacing charge]
      Sources: Intel, S. Borker@DAC'03, Patrick-Emil Zörner, W.D. Nix, 1992, L. Finkelstein, Intel 2005, R. Baumann, TI@Design&Test'05, Ziegler, IBM@IBM JRD'96
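
The difference between a transient upset and a persistent one in configuration memory can be illustrated with a toy model of a 4-input lookup table. This is a sketch under simplifying assumptions: the `Lut4` class and its methods are hypothetical, and the stored golden copy merely stands in for the external bitstream used for reconfiguration.

```python
class Lut4:
    """Toy 4-input LUT: 16 configuration bits define the logic function."""

    def __init__(self, config_bits):
        self.golden = config_bits & 0xFFFF  # stands in for the stored bitstream
        self.config = self.golden           # SRAM configuration memory

    def evaluate(self, a, b, c, d):
        # The inputs select one of the 16 configuration bits.
        index = (d << 3) | (c << 2) | (b << 1) | a
        return (self.config >> index) & 1

    def inject_seu(self, bit):
        # A particle strike flips a single configuration bit.
        self.config ^= 1 << bit

    def reconfigure(self):
        # Rewriting the configuration removes the upset.
        self.config = self.golden


lut = Lut4(0x8000)       # 4-input AND: only index 15 yields 1
lut.inject_seu(15)       # upset hits the bit that encodes AND(1,1,1,1)
wrong_results = [lut.evaluate(1, 1, 1, 1) for _ in range(3)]  # stays wrong
lut.reconfigure()
repaired = lut.evaluate(1, 1, 1, 1)
print(wrong_results, repaired)
```

The flipped bit changes the implemented function itself, so every later evaluation is affected until the configuration is rewritten, unlike a flipped data value that the next computation overwrites.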

  19. • Masks errors, but does not correct the underlying fault
        ◦ Problem: error accumulation
      • External
        ◦ Multiple FPGAs working in lockstep, i.e. performing the same operation in each cycle
        ◦ Output sent to a radiation-hardened voter
      • Internal
        ◦ Replicate functional block in the FPGA
      • Popular configuration: Triple Modular Redundancy (TMR)
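
A bitwise majority voter over three replicas is the usual way to picture TMR. The sketch below is illustrative only: `majority_vote`, `tmr` and the fault masks are hypothetical names, and faults are modeled simply as bits flipped at a replica's output.

```python
def majority_vote(a, b, c):
    """Bitwise 2-out-of-3 majority of three replica outputs."""
    return (a & b) | (a & c) | (b & c)


def tmr(module, x, fault_masks=(0, 0, 0)):
    """Run three replicas of `module`; each mask flips output bits of one replica."""
    outputs = [module(x) ^ mask for mask in fault_masks]
    return majority_vote(*outputs)


def example_module(x):
    # Arbitrary 8-bit function standing in for the protected block.
    return (3 * x + 1) & 0xFF


correct = example_module(5)
one_fault = tmr(example_module, 5, fault_masks=(0, 0b100, 0))
two_faults = tmr(example_module, 5, fault_masks=(0b100, 0b100, 0))
print(correct, one_fault, two_faults)
```

A single faulty replica is outvoted, but the fault itself stays in place. Once a second replica fails in the same bit, the voter passes the wrong value, which is the error accumulation problem named above.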

  20. src: [SCC08]

      | Technique | Detection speed | Resource overhead | Performance overhead | Granularity | Coverage |
      |---|---|---|---|---|---|
      | Modular redundancy | Fast: as soon as fault is manifest | Very large: triplicate + voter | Very small: voter delay | Coarse: protect module-sized blocks | Good: all manifest errors detected |

  21. • More space-efficient design than modular redundancy
      • Error coding algorithms (e.g. parity) at data flows/stores
      • Time redundancy can be used for concurrent error detection
        ◦ Repeat the computation in a way that allows errors to be detected
        ◦ First computation at t0: compute result in combinational logic, store result
        ◦ Second computation at t0+d: encode operands, compute in combinational logic, decode result, compare to first result
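
As an illustration of error coding at a data store, the following sketch attaches a single even-parity bit to a stored word. The helper names are hypothetical; real designs may use stronger codes.

```python
def parity(word):
    """Even-parity bit: 1 if the word has an odd number of set bits."""
    return bin(word).count("1") & 1


def store(word):
    """Store the word together with its parity bit."""
    return (word, parity(word))


def check(stored):
    """Return True if the stored word still matches its parity bit."""
    word, parity_bit = stored
    return parity(word) == parity_bit


word, parity_bit = store(0b1011)
single_flip = (word ^ 0b0100, parity_bit)   # one bit upset
double_flip = (word ^ 0b0110, parity_bit)   # two bits upset
print(check((word, parity_bit)), check(single_flip), check(double_flip))
```

A single parity bit detects any odd number of flipped bits but misses an even number, which is one reason the coverage of concurrent error detection is a tradeoff against resource overhead.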

  22. src: [LCR03]

  23. • Different techniques for encode/decode, e.g. bit inversion to detect stuck-at faults
      • Recomputation with shifted operands (RESO) for faulty arithmetic slices
        ◦ Encode: left-shift operands
        ◦ Decode: right-shift result
      • Combine with double modular redundancy (DMR)
        ◦ RESO determines which module is faulty, DMR uses the result of the other module
        ◦ Less area required than TMR
        ◦ Slightly slower (time-shifted recomputation)
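
The RESO idea can be sketched with a bit-sliced ripple-carry adder in which one slice has its sum output stuck at 0. The model is deliberately simplified and all names are hypothetical: only the sum output of one slice is faulty, and the adder is two slices wider than the 8-bit operands to make room for the shift and the carry.

```python
WIDTH = 10  # 8-bit operands + 1 bit for the left shift + 1 bit for the carry


def sliced_add(a, b, stuck_at_zero_slice=None):
    """Ripple-carry adder built from WIDTH one-bit slices."""
    result, carry = 0, 0
    for i in range(WIDTH):
        x, y = (a >> i) & 1, (b >> i) & 1
        s = x ^ y ^ carry
        carry = (x & y) | (carry & (x ^ y))
        if i == stuck_at_zero_slice:
            s = 0  # faulty slice: sum output stuck at 0
        result |= s << i
    return result


def reso_add(a, b, faulty_slice=None):
    """Return (first_result, error_detected) using shifted recomputation."""
    first = sliced_add(a, b, faulty_slice)                     # at t0
    second = sliced_add(a << 1, b << 1, faulty_slice) >> 1     # encode, compute, decode
    return first, first != second


print(reso_add(5, 2))                   # fault-free adder
print(reso_add(5, 2, faulty_slice=1))   # slice 1 stuck at 0
```

Because the shifted operands route each result bit through a different slice, the two computations are corrupted in different bit positions and the comparison flags the mismatch. In the DMR combination described above, that flag would select the other module's result.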

  24. src: [SCC08]

      | Technique | Detection speed | Resource overhead | Performance overhead | Granularity | Coverage |
      |---|---|---|---|---|---|
      | Modular redundancy | Fast: as soon as fault is manifest | Very large: triplicate + voter | Very small: voter delay | Coarse: protect module-sized blocks | Good: all manifest errors detected |
      | Concurrent error detection | Fast: as soon as fault is manifest | Medium: tradeoff with coverage | Small: CRC logic delay | Medium: tradeoff with resource | Medium: not practical for all types of functionality |

  25. • Built-in Self-Test: does not use external test equipment
      • In FPGAs: test configurations containing
        ◦ Test pattern generator
        ◦ Output response analyzer
        ◦ Between them: device (i.e. logic and interconnect) under test (DUT)
      • Can test for faults that are difficult to cover in online tests, e.g. clock network
      • Major drawback: system must enter a dedicated test mode
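
The three parts of a test configuration can be sketched in software: an LFSR as test pattern generator, a small logic function as device under test, and a response analyzer that compares a collected signature against a fault-free reference. All names, the 4-bit LFSR and the example logic function are assumptions chosen for illustration, not taken from the slides.

```python
def lfsr_patterns(seed=0b0001, count=15):
    """Test pattern generator: 4-bit LFSR with feedback from bits 3 and 2."""
    state = seed
    for _ in range(count):
        yield state
        feedback = ((state >> 3) ^ (state >> 2)) & 1
        state = ((state << 1) | feedback) & 0xF


def golden_dut(p):
    """Fault-free device under test: (a AND b) OR (c AND d)."""
    a, b, c, d = p & 1, (p >> 1) & 1, (p >> 2) & 1, (p >> 3) & 1
    return (a & b) | (c & d)


def faulty_dut(p):
    """Same logic with input a stuck at 0."""
    return golden_dut(p & ~1)


def response_signature(dut, patterns):
    """Output response analyzer: shift each response bit into a signature."""
    signature = 0
    for p in patterns:
        signature = (signature << 1) | (dut(p) & 1)
    return signature


def bist_passes(dut):
    golden = response_signature(golden_dut, lfsr_patterns())
    return response_signature(dut, lfsr_patterns()) == golden


print(bist_passes(golden_dut), bist_passes(faulty_dut))
```

The signature here simply concatenates all response bits, so it cannot alias. A hardware analyzer would typically compress the responses into a shorter signature, trading a small aliasing probability for area.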
