
Reconfigurable and Adaptive Systems (RAS)



  1. Institut für Technische Informatik, Chair for Embedded Systems, Prof. Dr. J. Henkel
     Lecture, summer semester 2013
     Reconfigurable and Adaptive Systems (RAS)
     Lars Bauer, Artjom Grudnitsky, Hongyan Zhang, Jörg Henkel

     8. Fault Tolerance and Reliability in FPGA based Systems

  2. RAS Topic Overview
     1. Introduction
     2. Overview
     3. Special Instructions
     4. Fine-Grained Reconfigurable Processors
     5. Configuration Prefetching
     6. Coarse-Grained Reconfigurable Processors
     7. Adaptive Reconfigurable Processors
     8. Fault-tolerance by Reconfiguration
        • Introduction
        • Fault Detection and Mitigation Techniques
        • Applications of Reliability Techniques
        • LHC
        • Space
        • OTERA
     L. Bauer, CES, KIT, 2013

     8.1 Introduction

  3. Why Fault Tolerance?
     • CMOS scaling increases the occurrence of
       ◦ Manufacturing defects
       ◦ Post-deployment degradation
       ◦ Especially important for FPGAs, as they have a large number of transistors and interconnect wires
     • Environmental conditions can cause temporary faults
       ◦ E.g. aerospace industry: hardened devices are used for mission-critical tasks, FPGAs for non-critical data processing
     • Unlike ASICs, FPGAs can adapt to deal with permanent and temporary faults
     [Figure: number of dopant atoms in T-channel (source: ITRS); portrait of Gordon E. Moore, who co-founded Intel in 1968]

     Types of Faults
     • Permanent faults: e.g. stuck-at failures in CLBs, and opens, bridges and shorts in the programmable switching matrix
       ◦ Can occur during the fabrication process without being detected
       ◦ Damage to device resources may also appear during the life cycle of an FPGA
     • Intermittent faults: have a permanent cause in the structure of the circuit, but their effect is intermittent, e.g. depending on temperature or power consumption
     • Transient faults: have a temporary cause that can alter signal values or state stored in memory cells, which creates indefinite and incorrect states in the computation
       ◦ E.g. a high-energy particle strike resulting in an energy exchange and charge displacement

  4. Negative Bias Temperature Instability (NBTI)
     • Breakdown of Si-H bonds at the silicon-oxide interface due to voltage/thermal stress
     • Causes interface traps
     • Affects mostly P-MOSFETs because of their negative gate bias
       ◦ Effect in N-MOSFETs is negligible
     • Despite the research focus, NBTI is observed but not yet fully understood
     [Figure: cross-section of a P-type MOSFET (gate, oxide, source S, drain D) with Si-H bonds at the interface, released H+ and resulting traps; Vg < 0 → stress]

     Negative Bias Temperature Instability (NBTI) (cont'd)
     • NBTI manifests itself as a shift in Vth
       ◦ Causes an increase in transistor delay
       ◦ NBTI leads to delay faults and resulting circuit failure
     • Recovery effect in periods of no stress
       ◦ When voltage and temperature are low, Vth can shift back towards its original value
       ◦ Full recovery from a stress period is only possible in infinite time
     • In practice, the overall Vth shift increases over longer periods, e.g. months or years
     [Figure: Vth shift [V] and Vg [V] (between 0 and -1) over time, with alternating stress and recovery phases]

  5. NBTI and Temperature
     • Temperature is an important aspect of NBTI modeling
     • Higher temperatures increase the shift in threshold voltage
     • ΔVth is approximately 50% higher at 75°C than at 55°C
     • The NBTI effect at 75°C is approximately equal to that of alternating between 85°C and 25°C

     NBTI Impact on Lifetime of SRAM
     [Figure: Signal to Noise Margin (SNM) degradation after 7 years in 32nm (axis from 0% to 40%) versus percentage of time that the cell stores zero [%]. Annotation: the NBTI effect is minimal here because the NBTI stress is distributed equally between the two PMOS transistors in the SRAM cell.]
     src: S. Kothawade, K. Chakraborty, S. Roy, "Analysis and mitigation of NBTI aging in register file: An end-to-end approach"

  6. Types of Degradation (cont'd)
     • Hot-Carrier Injection (HCI): build-up of trapped charges in the gate-channel interface region
       ◦ Progressive reduction of carrier mobility → increase in CMOS threshold voltage
       ◦ Switching speed becomes slower, which leads to timing problems

     Types of Degradation (cont'd)
     • Time-Dependent Dielectric Breakdown (TDDB): over time, a conducting path forms in thin oxide layers [CCMA10]
     [Figure: transistor with gate G, source S and drain D]

  7. Example: Effect of TDDB on SRAM
     • Example: read noise margin
     • Worst case: "half-selected" state (wordline + bitlines high)
     [Figure: SRAM cell schematic (wordline WL, bitlines BL and BR, pass-gate, stored "1" and "0") with breakdown at the p-source, n-source and drain, and the corresponding VR versus VL curves from 0 to Vdd for the fresh cell and for each breakdown case]
     src: Stathis, IRPS (2008)

     Main Reason for many of these effects: High Fields
     • Most device problems can be traced back to high-field effects, which are related to the failure to follow Dennard scaling
     src: Radhakrishnan et al., IEDM (2001)

  8. Dennard Scaling vs. Power Density
     • Transistor and power scaling are no longer balanced
       ◦ Scaling is limited by power
     • Higher power density leads to thermal problems
       ◦ Accelerates aging effects

     Quantity           | Classical scaling (Dennard) | Power-limited scaling
     -------------------|-----------------------------|----------------------
     Device count       | S^2                         | S^2
     Device frequency   | S                           | S
     Device power (cap) | 1/S                         | 1/S
     Device power (Vdd) | 1/S^2                       | ~1
     Power density      | 1                           | S^2

     S: scaling factor
     src: G. Venkatesh et al., "Conservation Cores: Reducing the Energy of Mature Computations", ASPLOS '10

     Types of Degradation (cont'd)
     • Electromigration: thermally activated metal ions may leave their potential wells
       ◦ The electric field and momentum exchange with electrons direct the migration of the metal ions
       ◦ Can lead to open/short circuits
     [wikipedia]
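Returning to the scaling table of the Dennard Scaling slide: the power density row is simply the product of the four rows above it. The sketch below checks that arithmetic. The function name `power_density` and the example value S = 2 are illustrative assumptions, not part of the lecture.

```python
def power_density(s: float, vdd_scales: bool = True) -> float:
    """Relative power density after scaling by factor s.

    Multiplies the per-row factors from the slide's table:
    device count * frequency * power(cap) * power(Vdd).
    """
    device_count = s ** 2          # S^2 more devices per unit area
    frequency = s                  # devices switch S times faster
    cap_power = 1 / s              # capacitance shrinks by 1/S
    # Classical (Dennard) scaling lowers Vdd, giving a 1/S^2 factor;
    # in power-limited scaling Vdd stays roughly constant (~1).
    vdd_power = 1 / s ** 2 if vdd_scales else 1.0
    return device_count * frequency * cap_power * vdd_power


if __name__ == "__main__":
    s = 2.0  # hypothetical scaling factor
    print("classical:    ", power_density(s, vdd_scales=True))
    print("power-limited:", power_density(s, vdd_scales=False))
```

With Vdd scaling the factors cancel and the density stays at 1; without it the density grows with S^2, which is the imbalance the slide refers to.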

  9. Radiation induced faults
     • Radiation induced faults
       ◦ Single Event Upsets (SEU) / Single Event Transients
       ◦ Most common: single bit flip in an SRAM cell
       ◦ SEU effect on ASICs
           • Transient (the only variation is the time duration of the fault)
           • Even if latched, it will eventually be overwritten
       ◦ SEU effect on FPGAs
           • Permanent (until reset/reconfiguration) if the configuration memory is hit by the SEU
     [Figure: high-energy particle (neutron or proton) striking a transistor cross-section (gate, isolation, n+ and p+ regions, N-well, P-well, P-substrate) and displacing charge in the depletion region]
     Sources: Intel, S. Borker@DAC'03, Patrick-Emil Zörner, W.D. Nix, 1992, L.Finkelstein, Intel 2005, R. Baumann, TI@Design&Test'05, Ziegler, IBM@IBM JRD'96

     8.2 Fault Detection and Mitigation Techniques

  10. Modular Redundancy
     • Masks errors, but does not correct the underlying fault
       ◦ Problem: error accumulation
     • External
       ◦ Multiple FPGAs working in lockstep, i.e. performing the same operation in each cycle
       ◦ Output is sent to a radiation-hardened voter
     • Internal
       ◦ Replicate the functional block inside the FPGA
     • Popular configuration: Triple Modular Redundancy (TMR)

     Fault detection methods comparison (src: [SCC08])

     Method             | Detection Speed                    | Resource Overhead             | Performance Overhead    | Granularity                         | Coverage
     -------------------|------------------------------------|-------------------------------|-------------------------|-------------------------------------|----------------------------------
     Modular Redundancy | Fast: as soon as fault is manifest | Very large: triplicate + voter | Very small: voter delay | Coarse: protect module-sized blocks | Good: all manifest errors detected
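The TMR configuration described above can be sketched in software as three copies of a module feeding a bitwise majority voter. This is a minimal illustration under stated assumptions: the module is modeled as a Python function, and a fault is modeled as an XOR mask on one copy's output. The names `majority_vote`, `tmr` and `increment` are hypothetical.

```python
from typing import Callable, Sequence


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of three module outputs."""
    return (a & b) | (a & c) | (b & c)


def tmr(module: Callable[[int], int], x: int,
        fault_masks: Sequence[int] = (0, 0, 0)) -> int:
    """Run three copies of `module` and vote on their outputs.

    fault_masks[i] is XORed onto the output of copy i to model a
    fault (an assumption made for this sketch); 0 means fault-free.
    """
    outputs = [module(x) ^ mask for mask in fault_masks]
    return majority_vote(*outputs)


def increment(x: int) -> int:
    """Example 8-bit functional block."""
    return (x + 1) & 0xFF


if __name__ == "__main__":
    print(tmr(increment, 5))                     # fault-free
    print(tmr(increment, 5, (0b100, 0, 0)))      # one faulty copy is masked
    print(tmr(increment, 5, (0b100, 0b100, 0)))  # two faulty copies win the vote
```

The last call shows the error accumulation problem: the voter only masks errors, so once a second copy fails in the same bit position the wrong value is voted through.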

  11. Concurrent Error Detection
     • More space-efficient design than modular redundancy
     • Error coding algorithms (e.g. parity) at data flows/stores
     • Time redundancy can be used for concurrent error detection
       ◦ Repeat the computation in a way that allows errors to be detected
       ◦ First computation at t0: compute the result in combinational logic, store the result
       ◦ Second computation at t0+d: encode the operands, compute in combinational logic, decode the result, compare it to the first result

     Concurrent Error Detection (cont'd)
     [Figure, src: [LCR03]]
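As a small illustration of the error coding bullet above, the following sketch attaches an even-parity bit to a stored word and checks it on read. It is only one possible coding scheme; the function names and the tuple representation of a stored word are assumptions made for this example.

```python
from typing import Tuple


def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if `word` has an odd number of 1 bits."""
    return bin(word).count("1") % 2


def encode(word: int) -> Tuple[int, int]:
    """Store a non-negative word together with its parity bit."""
    return word, parity_bit(word)


def is_consistent(stored: Tuple[int, int]) -> bool:
    """Recompute the parity on read and compare with the stored bit."""
    word, parity = stored
    return parity_bit(word) == parity


if __name__ == "__main__":
    word, parity = encode(0b1011)
    print(is_consistent((word, parity)))            # unmodified word
    print(is_consistent((word ^ 0b0001, parity)))   # single bit flip
    print(is_consistent((word ^ 0b0011, parity)))   # double bit flip
```

A single parity bit detects any odd number of bit flips but misses an even number, which is one reason the comparison table rates the coverage of concurrent error detection as medium.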

  12. Concurrent Error Detection (cont'd)
     • Different techniques for encode/decode, e.g. bit inversion to detect stuck-at faults
     • Recomputation with shifted operands (RESO) for faulty arithmetic slices
       ◦ Encode: left-shift the operands
       ◦ Decode: right-shift the result
     • Combine with double modular redundancy (DMR)
       ◦ RESO determines which module is faulty, DMR uses the result of the other module
       ◦ Less area required than TMR
       ◦ Slightly slower (time-shifted re-computation)

     Fault detection methods comparison (src: [SCC08])

     Method                     | Detection Speed                    | Resource Overhead              | Performance Overhead   | Granularity                         | Coverage
     ---------------------------|------------------------------------|--------------------------------|------------------------|-------------------------------------|-------------------------------------------------------
     Modular Redundancy         | Fast: as soon as fault is manifest | Very large: triplicate + voter | Very small: voter delay | Coarse: protect module-sized blocks | Good: all manifest errors detected
     Concurrent error detection | Fast: as soon as fault is manifest | Medium: tradeoff with coverage | Small: CRC logic delay | Medium: tradeoff with resource      | Medium: not practical for all types of functionality
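The RESO and DMR bullets above can be sketched as follows. The fault model is an assumption chosen for illustration: an adder in which one sum output bit is stuck at 0. The adder is two bits wider than the operands so that the left-shifted sum still fits. All names (`faulty_add`, `reso_add`, `dmr_with_reso`) are hypothetical.

```python
from typing import Optional, Sequence, Tuple

WIDTH = 8                 # operand width in bits
ADDER_BITS = WIDTH + 2    # room for the carry and the 1-bit left shift


def faulty_add(a: int, b: int, stuck_bit: Optional[int] = None) -> int:
    """Adder whose output bit `stuck_bit` is stuck at 0 (None = fault-free)."""
    total = (a + b) & ((1 << ADDER_BITS) - 1)
    if stuck_bit is not None:
        total &= ~(1 << stuck_bit)
    return total


def reso_add(a: int, b: int, stuck_bit: Optional[int] = None) -> Tuple[int, bool]:
    """Compute a+b twice; return (first result, True if both runs agree)."""
    first = faulty_add(a, b, stuck_bit)
    # Encode: left-shift the operands. Decode: right-shift the result.
    # The faulty slice now affects a different bit of the decoded sum.
    second = faulty_add(a << 1, b << 1, stuck_bit) >> 1
    return first, first == second


def dmr_with_reso(a: int, b: int,
                  stuck_bits: Sequence[Optional[int]] = (None, None)) -> int:
    """Two adder modules; use the result of one that passes its RESO check."""
    for stuck_bit in stuck_bits:
        result, ok = reso_add(a, b, stuck_bit)
        if ok:
            return result
    raise RuntimeError("both modules failed the RESO check")


if __name__ == "__main__":
    print(reso_add(5, 3))                  # fault-free module
    print(reso_add(5, 3, stuck_bit=3))     # fault corrupts the first run
    print(dmr_with_reso(5, 3, (3, None)))  # DMR falls back to the healthy module
```

Because the shifted recomputation maps the faulty slice onto a different result bit, the two runs disagree whenever the stuck-at fault corrupts either of them, at the cost of a second, time-shifted computation.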
