
Institut für Technische Informatik Chair for Embedded Systems - Prof. Dr. J. Henkel

Lars Bauer, Artjom Grudnitsky, Hongyan Zhang, Jörg Henkel

Lecture, Summer Semester (SS) 2013

Reconfigurable and Adaptive Systems (RAS)

8. Fault Tolerance and Reliability in FPGA-based Systems

L. Bauer, CES, KIT, 2013

RAS Topic Overview

  • 1. Introduction
  • 2. Overview
  • 3. Special Instructions
  • 4. Fine-Grained Reconfigurable Processors
  • 5. Configuration Prefetching
  • 6. Coarse-Grained Reconfigurable Processors
  • 7. Adaptive Reconfigurable Processors
  • 8. Fault-Tolerance by Reconfiguration
      • Introduction
      • Fault Detection and Mitigation Techniques
      • Applications of Reliability Techniques: LHC, Space, OTERA


8.1 Introduction


Why Fault Tolerance?

[Figure: ITRS projection of the number of dopant atoms in the transistor channel; photo of Gordon E. Moore, who co-founded Intel in 1968]

CMOS scaling increases the occurrence of

  • Manufacturing defects
  • Post-deployment degradation
  • Especially important for FPGAs, as they contain a high number of transistors and interconnect wires

Environmental conditions can incur temporary faults

  • E.g. the aerospace industry uses hardened devices for mission-critical tasks and FPGAs for non-critical data processing

Unlike ASICs, FPGAs can adapt to deal with permanent and temporary faults


Types of Faults

Permanent faults: e.g. stuck-at failures in CLBs, and opens, bridges, and shorts in the programmable switching matrix

  • Could occur during the fabrication process without being detected
  • Damage to device resources may also appear during the life cycle of FPGAs

Intermittent faults: have a permanent cause in the structure of the circuit, but their effect is intermittent, e.g. depending on temperature or power consumption

Transient faults: have a temporary cause that can alter signal values or state stored in memory cells, creating indefinite and incorrect states in the computation

  • E.g. a high-energy particle strike resulting in an energy exchange and charge displacement


Negative Bias Temperature Instability (NBTI)

Breakdown of Si-H bonds at the silicon-oxide interface due to voltage/thermal stress causes interface traps

Affects mostly P-MOSFETs because of the negative gate bias

  • The effect in N-MOSFETs is negligible

Despite being a research focus, NBTI is observed but not yet fully understood

[Figure: cross-section of a P-type MOSFET under negative gate bias (Vg < 0, stress), showing Si-H bonds at the gate-oxide interface breaking and leaving an interface trap]


Negative Bias Temperature Instability (NBTI) (cont'd)

NBTI manifests itself as a shift in Vth

  • Causes an increase in transistor delay
  • NBTI leads to delay faults and resulting circuit failure

Recovery effect in periods of no stress

  • When voltage and temperature are low, Vth can shift back towards its original value
  • Full recovery from a stress period is only possible in infinite time

In practice, the overall Vth shift increases over longer periods, e.g. months or years

[Figure: Vth shift over time under alternating stress (Vg < 0) and recovery phases]
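The stress/recovery behavior can be illustrated with a toy model. A power-law stress term (ΔVth growing with A·t^n) combined with partial recovery is a common modeling form in the aging literature, but this is not from the slides, and the constants `A`, `n`, and `recovery_fraction` below are arbitrary placeholders:

```python
def nbti_shift(phases, A=1e-3, n=0.25, recovery_fraction=0.5):
    """Toy NBTI model: the Vth shift grows ~ A * t^n during stress phases
    and is only partially removed during recovery phases.

    phases: list of ('stress' | 'recovery', duration) tuples."""
    shift = 0.0
    for kind, duration in phases:
        if kind == "stress":
            shift += A * duration ** n          # power-law degradation
        else:
            shift *= (1.0 - recovery_fraction)  # partial recovery only
    return shift
```

Running alternating stress/recovery phases shows the net shift still creeping upward over long periods, matching the observation above that full recovery would take infinite time.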

NBTI and Temperature

Temperature plays an important role in NBTI modeling: higher temperatures increase the shift in threshold voltage

  • ΔVth is approximately 50% higher at 75°C than at 55°C
  • The NBTI effect at a constant 75°C is approximately equal to alternating between 85°C and 25°C


NBTI Impact on Lifetime of SRAM

[Figure: Signal-to-Noise-Margin (SNM) degradation after 7 years in 32nm, plotted over the percentage of time (0%-40%) that the cell stores a zero]

src: S. Kothawade, K. Chakraborty, S. Roy, "Analysis and mitigation of NBTI aging in register file: An end-to-end approach"

The NBTI effect is minimal here because the NBTI stress is distributed equally between the two PMOS transistors in the SRAM cell


Types of Degradation (cont’d)

Hot-Carrier Injection (HCI): build-up of trapped charges in the gate-channel interface region

  • Progressive reduction of carrier mobility
  • Increase in CMOS threshold voltage
  • Slower switching speed, which leads to timing problems

Types of Degradation (cont'd)

Time-Dependent Dielectric Breakdown (TDDB): over time, a conducting path forms in thin oxide layers [CCMA10]

[Figure: transistor cross-section (gate G, drain D, source S) showing a breakdown path through the gate oxide]


Example: Effect of TDDB on SRAM

Example: read noise margin. Worst case: half-selected state (wordline and bitlines high)

[Figure: butterfly curves (VL vs. VR) of an SRAM cell for a fresh cell and for p-source, drain, and n-source breakdown of the pass-gate; breakdown progressively degrades the read noise margin]

src: Stathis, IRPS (2008)


Main reason for many of these effects: high fields

src: Radhakrishnan et al., IEDM (2001)

Most device problems can be traced back to high-field effects – related to the failure to follow Dennard scaling


Dennard Scaling vs. Power Density

Transistor and power scaling are no longer balanced

  • Scaling is limited by power

Higher power density leads to thermal problems

  • Accelerates aging effects

src: G. Venkatesh et al., "Conservation Cores: Reducing the Energy of Mature Computations", ASPLOS '10

S: scaling factor

                     | Classical scaling (Dennard) | Power-limited scaling
Device count         | S²                          | S²
Device frequency     | S                           | S
Device power (cap)   | 1/S                         | 1/S
Device power (Vdd)   | 1/S²                        | ~1
Power density        | 1                           | S²
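The two scaling regimes can be reproduced numerically. This is a direct transcription of the table above into Python (the function names are mine):

```python
def classical_scaling(S):
    """Classical (Dennard) scaling: Vdd scales down with feature size."""
    count = S ** 2
    freq = S
    power = (1 / S) * (1 / S ** 2)  # capacitive term times Vdd^2 term
    return {"device_count": count, "frequency": freq,
            "device_power": power, "power_density": count * freq * power}


def power_limited_scaling(S):
    """Power-limited scaling: Vdd stays ~constant, so per-device power
    only shrinks by the capacitive 1/S term."""
    count = S ** 2
    freq = S
    power = (1 / S) * 1.0
    return {"device_count": count, "frequency": freq,
            "device_power": power, "power_density": count * freq * power}
```

For S = 2 the classical power density stays at 1, while the power-limited density grows to S² = 4 – the imbalance the slide points out.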


Types of Degradation (cont'd)

Electromigration: thermally activated metal ions may leave their potential wells

  • The electric field and momentum exchange with electrons direct the metal ion migration
  • Can lead to open/short circuits

[Figure: electromigration damage in an interconnect; src: Wikipedia]


Sources: Intel, S. Borker@DAC’03, Patrick-Emil Zörner, W.D. Nix, 1992, L.Finkelstein, Intel 2005, R. Baumann, TI@Design&Test’05, Ziegler, IBM@IBM JRD’96

[Figure: high-energy particle (neutron or proton) striking a CMOS substrate and depositing charge along its track through the depletion region]

Radiation induced faults

  • Single Event Upsets (SEU) / Single Event Transients (SET)
  • Most common: single bit flip in an SRAM cell
  • SEU effect on ASICs: transient (the only variation is the time duration of the fault); even if latched, it will eventually be overwritten
  • SEU effect on FPGAs: permanent (until reset/reconfiguration) if the configuration memory is hit by the SEU


8.2 Fault Detection and Mitigation Techniques


Modular Redundancy

Masks errors, but does not correct the underlying fault

  • Problem: error accumulation

External

  • Multiple FPGAs working in lockstep, i.e. performing the same operation in each cycle
  • Output sent to a radiation-hardened voter

Internal

  • Replicate a functional block within the FPGA

Popular configuration: Triple Modular Redundancy (TMR)
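The voting step of internal TMR can be sketched in a few lines: three replica outputs go through a bitwise majority vote, so any single faulty replica is masked. This is a behavioral sketch, not the actual FPGA voter netlist:

```python
def tmr_vote(a, b, c):
    """Bitwise majority vote over three replica output words."""
    return (a & b) | (a & c) | (b & c)


def tmr_mismatch(a, b, c):
    """True if any replica disagrees: a hint that a repair is needed
    before a second fault accumulates (the accumulation problem above)."""
    return not (a == b == c)
```

A single corrupted replica is outvoted, but the underlying fault stays latent until it is actually repaired:

```python
good = 0b1011
tmr_vote(good, good, 0b0001)   # still yields 0b1011
tmr_mismatch(good, good, 0b0001)  # True: repair should be triggered
```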


Fault detection methods comparison

Method             | Detection speed                  | Resource overhead              | Performance overhead    | Granularity                          | Coverage
Modular redundancy | Fast: as soon as fault manifests | Very large: triplicate + voter | Very small: voter delay | Coarse: protects module-sized blocks | Good: all manifest errors detected

src: [SSC08]


Concurrent Error Detection

More space-efficient design than modular redundancy

Error-coding algorithms (e.g. parity) at data flows/stores

Time redundancy can be used for concurrent error detection

  • Repeat the computation in a way that allows errors to be detected
  • First computation at t0: compute the result in combinational logic, store the result
  • Second computation at t0+d: encode the operands, compute in combinational logic, decode the result, compare to the first result


Concurrent Error Detection (cont'd)

[Figure: block diagram of the time-redundant error detection scheme; src: [LCR03]]


Concurrent Error Detection (cont'd)

Different techniques for encode/decode, e.g. bit inversion to detect stuck-at faults

Recomputation with shifted operands (RESO) for faulty arithmetic slices

  • Encode: left-shift the operands
  • Decode: right-shift the result

Combine with double modular redundancy (DMR)

  • RESO determines which module is faulty; DMR uses the result of the other module
  • Less area required than TMR
  • Slightly slower (time-shifted re-computation)
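The RESO scheme can be sketched behaviorally. Below, a stuck-at-0 fault in one result slice of an adder is modeled; because the second computation shifts the operands, the same physical slice corrupts a different logical bit, so the two decoded results disagree. The names and the simple fault model are illustrative:

```python
def faulty_add(x, y, stuck_bit=None):
    """Adder with an optional stuck-at-0 fault in one result bit slice."""
    s = x + y
    if stuck_bit is not None:
        s &= ~(1 << stuck_bit)  # the faulty slice always outputs 0
    return s


def reso_check(x, y, stuck_bit=None):
    """Return (first_result, fault_detected)."""
    first = faulty_add(x, y, stuck_bit)                  # computation at t0
    # t0+d: encode = left-shift operands, decode = right-shift result
    second = faulty_add(x << 1, y << 1, stuck_bit) >> 1
    return first, first != second
```

In the fault-free case both computations agree; with a stuck slice, the shifted recomputation exposes the mismatch.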


Fault detection methods comparison

Method                     | Detection speed                  | Resource overhead              | Performance overhead    | Granularity                          | Coverage
Modular redundancy         | Fast: as soon as fault manifests | Very large: triplicate + voter | Very small: voter delay | Coarse: protects module-sized blocks | Good: all manifest errors detected
Concurrent error detection | Fast: as soon as fault manifests | Medium: tradeoff with coverage | Small: CRC logic delay  | Medium: tradeoff with resources      | Medium: not practical for all types of functionality

src: [SSC08]


Offline BIST

Built-in Self-Test: does not use external test equipment

In FPGAs: test configurations containing

  • A test pattern generator
  • An output response analyzer
  • Between them: the device (i.e. logic and interconnect) under test (DUT)

Can test for faults that are difficult to cover in online tests, e.g. the clock network

Major drawback: the system must enter a dedicated test mode


Fault detection methods comparison

Method                     | Detection speed                  | Resource overhead              | Performance overhead    | Granularity                          | Coverage
Modular redundancy         | Fast: as soon as fault manifests | Very large: triplicate + voter | Very small: voter delay | Coarse: protects module-sized blocks | Good: all manifest errors detected
Concurrent error detection | Fast: as soon as fault manifests | Medium: tradeoff with coverage | Small: CRC logic delay  | Medium: tradeoff with resources      | Medium: not practical for all types of functionality
Off-line BIST              | Slow: only when offline          | Very small                     | Small: start-up delay   | Fine: possible to detect exact error | Very good: all faults, including dormant ones

src: [SSC08]


Roving

Online BIST: split the FPGA into equal-sized regions; one region performs the self-test while the others perform the design function

When the test is complete: swap the test region with an untested functional region and test the new region

Lower area overhead (1 region + controller logic)

Problems:

  • Swapping may "stretch" connections between regions → timing (may require a clock speed reduction)
  • Functional blocks may be inoperable during the swap (depends on how it is implemented)


Self-Testing AReas (Roving STARs)

STARs consist of tiles performing BIST

  • STARs rove over the FPGA left↔right (H-STAR) and up↔down (V-STAR)
  • The Test Pattern Generator (TPG) sends data to the Block Under Test (BUT); the Output Response Analyzer (ORA) detects faults

src: [ESSA00]


Roving STARs

Roving controlled by an embedded processor

Blocks under test are tested in different configurations, e.g. user RAM, LUT, adder, etc.

The test strategy does not use signature analysis; instead it tests 2 identically configured blocks (TPG → BUT, BUT → ORA) and compares their responses

  • Each block in a tile is tested twice, with a different partner block each time

H-STAR is 2 rows high, V-STAR is 2 columns wide

  • Tiles are not necessarily 2x2; they can also be 2x3, etc.
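The compare-two-identically-configured-blocks strategy can be sketched as follows: each block is compared against two different partners, and a block whose comparisons both mismatch is the faulty one. This is my own minimal formulation of the pairing logic, not the [ESSA00] implementation:

```python
def diagnose_faulty_blocks(comparisons):
    """comparisons: dict mapping a (block_a, block_b) pair to True if the
    two identically configured blocks produced identical responses.
    Assumes each block appears in two comparisons with different partners
    and at most one block is faulty."""
    mismatches = {}
    for (a, b), responses_match in comparisons.items():
        if not responses_match:
            for block in (a, b):
                mismatches[block] = mismatches.get(block, 0) + 1
    # a fault-free block mismatches at most once (against the faulty one);
    # the faulty block mismatches in both of its comparisons
    return sorted(block for block, n in mismatches.items() if n == 2)
```

With blocks A-D and a faulty C, the two mismatching comparisons single out C while clearing its partners.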


Roving STARs

Depending on the current location of the STARs, the working area of the FPGA is divided into 1, 2 or 4 regions

  • Virtual coordinate system of the working area without STARs

src: [ESSA00]


Roving STARs

Model: the FPGA system function is composed of "logic cell functions"

  • Each fits into 1 Programmable Logic Block on the FPGA
  • Logic cell functions are defined by coordinates in the virtual coordinate system
  • Programmable Logic Blocks are defined in the physical coordinate system

Blocks can be faulty, partially usable, or fault-free

  • Partially faulty blocks can implement some, but not all, logic cell functions
  • STARs test blocks in different modes and can determine which modes are fault-free


Roving STARs

Fault tolerance approach, 3 levels:

  • I. STAR parking: when a fault is detected, the STAR that detected it stops moving. The user application is notified for a possible rollback. Determine the fault and report it to the controller
  • II. Reconfigure the system function: if the logic cell can use the block (usable or sufficiently partially usable), do not reconfigure. Otherwise remap the logic cell to a spare working block. Remapping is performed by the controller while the STARs are parked. When done, the STARs continue roving
  • III. STAR stealing: when no spares are available, take out part of the STARs and use them as spares. Tiles may then no longer be able to perform BIST. Try to maintain at least 1 roving STAR


Fault detection methods comparison

Method                     | Detection speed                  | Resource overhead                    | Performance overhead                                          | Granularity                          | Coverage
Modular redundancy         | Fast: as soon as fault manifests | Very large: triplicate + voter       | Very small: voter delay                                       | Coarse: protects module-sized blocks | Good: all manifest errors detected
Concurrent error detection | Fast: as soon as fault manifests | Medium: tradeoff with coverage       | Small: CRC logic delay                                        | Medium: tradeoff with resources      | Medium: not practical for all types of functionality
Off-line BIST              | Slow: only when offline          | Very small                           | Small: start-up delay                                         | Fine: possible to detect exact error | Very good: all faults, including dormant ones
Roving                     | Medium: order of 1 second        | Medium: empty test block + controller | Large: stop clock to swap blocks; critical paths may lengthen | Fine: possible to detect exact error | Very good: multiple manifest and latent faults detected

src: [SSC08]


Scrubbing

Repair faults in configuration memory by updating the affected configuration frame

For Xilinx FPGAs there are 3 ways to access configuration memory: JTAG (slow, external), SelectMAP (fast, external), ICAP (fast, internal)

Scrubbing protects only configuration data, not memory elements

  • Cannot scrub LUTs that are used as user RAM ("distributed RAM")
  • Cannot scrub BlockRAM (embedded memory in FPGAs)
  • Use other protection schemes for memory elements, e.g. parity or error-correcting codes


Blind Scrubbing

Strategy: continuous overwriting

  • Read the original configuration frame from external memory
  • Write it to the FPGA, even if no SEUs are present

Advantages: simple implementation, minimal additional hardware, fast repair

src: [HSWK09]
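Blind scrubbing is essentially an unconditional copy loop from the golden bitstream into configuration memory. A behavioral sketch; on a real device each frame write would go through JTAG, SelectMAP, or the ICAP:

```python
def blind_scrub(config_mem, golden_frames):
    """Continuously overwrite every frame, whether or not an SEU occurred."""
    for index, frame in enumerate(golden_frames):
        config_mem[index] = frame  # write even if the frame is intact
```

Any flipped configuration bit is repaired on the next pass, at the cost of rewriting all the intact frames too.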


Readback Scrubbing

Strategy: only overwrite a frame if a fault is detected

  • Read back the configuration data
  • Check it against the original configuration data (e.g. CRC comparison)
  • On error: write the corrected configuration data back to the FPGA

Advantage: SEU logging

src: [HSWK09]
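Readback scrubbing adds the detection step and, as noted above, enables SEU logging. A sketch using CRC-32 as the comparison code; the CRC scheme and frame layout of a real device are vendor-specific:

```python
import zlib  # stdlib CRC-32

def readback_scrub(config_mem, golden_frames, seu_log):
    """Read each frame back, rewrite only on CRC mismatch, log every hit."""
    for index, frame in enumerate(config_mem):
        if zlib.crc32(frame) != zlib.crc32(golden_frames[index]):
            seu_log.append(index)                     # SEU logging
            config_mem[index] = golden_frames[index]  # targeted repair
```

Only corrupted frames are rewritten, and the log records which frames were hit – useful for estimating the SEU rate.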


Internal Scrubbing

Strategy:

  • Read a configuration frame via the ICAP
  • Check the frame-internal CRC code and correct errors if necessary
  • Write the configuration frame back via the ICAP

Xilinx-proprietary method; no external memory required

Uses BRAM → the scrubber itself is vulnerable to SEUs

Error correction can only correct 1-bit errors; 2-bit errors are detected but not corrected; 4- and 8-bit errors can go completely undetected


Partial Reconfiguration Scrubbing

Traditional scrubbing methods cannot be used with partial reconfiguration (PR)

  • Scrubbing uses the configuration port constantly
  • When loading a PR bitstream, the scrubber tries to read/write configuration memory while the PR logic tries to write to it
  • Even if scrubbing pauses for PR, the scrubber will immediately overwrite the PR region again (i.e. the scrubber 'repairs' the region)

Potential solution: update the "golden" bitstream

  • The golden bitstream is the reference bitstream, kept in radiation-hardened memory and used for scrubbing
  • Write the PR modifications to the golden bitstream in an atomic operation (i.e. scrubbing must not read that part of hardened memory in between)
  • Then scrubbing will reconfigure the PR part onto the FPGA after a short delay


Partial Reconfiguration Scrubbing

Implemented on a Virtex-4

Communication interface – UART, receives bitstreams from a host computer

Memory – 64 MB SDRAM for bitstream storage

  • An arbiter resolves decoder/scrubber memory access conflicts

src: [HSWK09]


Partial Reconfiguration Scrubbing

Bitstream decoder – prepares the bitstream for insertion into the golden bitstream

Configuration Controller – manages scrubbing

  • Read a frame from the golden bitstream and from configuration memory
  • Compute CRC values
  • If they differ, write the frame from the golden bitstream to configuration memory

Partial reconfiguration is done automatically by the Configuration Controller

  • The golden bitstream is updated with the PR bitstream
  • The Configuration Controller detects 'SEUs' in the modified frames
  • The frames in configuration memory are overwritten → PR complete


Other Fault Repair Techniques

Column/row shifting: spare lines of cells at the end of the array

  • When an error is detected in a row/column → bypass the whole row/column via multiplexers and use the spare

Alternative configurations: split the FPGA into tiles such that multiple configurations for each tile implement the same functionality

  • Once the error is located, load a configuration that does not use the faulty resource

Others: online re-routing, …


8.3 Reliability for LHC

ALICE – A Large Ion Collider Experiment

  • One of the experiments using the Large Hadron Collider (LHC) at CERN
  • Task: characterize the quark-gluon plasma produced through collisions of heavy ions
  • The Transition Radiation Detector (TRD) identifies fast electrons in the central barrel
  • Consists of 540 readout chambers

src: [ALICE]


Detector Control System (DCS)

Task: ensure safe operation of the TRD

  • Provide the front-end electronics with configuration and calibration data

Some design goals from the Design Report:

  • Coherent and homogeneous: to allow for integration of independently developed components
  • Flexible and scalable: e.g. hardware upgrades, procedural changes
  • Must be operational throughout the lifetime of the experiment, even during shutdown phases
  • Available, safe, reliable: safety of the detector equipment
  • Equipment configuration and data archiving easily maintainable


Detector Control System (DCS)

DCS Board

  • Developed at the Kirchhoff Institute of Physics (Heidelberg)

Several variants for different components of the detector; using an FPGA allows using the same board layout

  • Interface with the front-end electronics in the readout chambers – 540 boards
  • Low/high voltage power control & trigger control – 50 boards
  • Control & configure the readout control units (which pass measurement data to the data acquisition systems) – 216 boards

src: [K08]


DCS Board – Hardware

Altera Excalibur FPGA

  • SRAM-based
  • 4190 Logic Elements (about 100k gates)
  • Embedded ARM9 processor: MMU, SDRAM controller, UART, watchdog, etc.

32 MB SDRAM, 8 MB Flash (FPGA configuration data, bootloader, software)

ARM's Advanced High-performance Bus (AHB) used for on-board interconnect

Ethernet (↔ PC), LVDS (↔ front-end electronics)


DCS Board – Software

Bootloader

  • At the beginning of flash memory
  • Initializes the CPU, configures the FPGA, loads the kernel into RAM

Linux kernel; file system with user software

  • Drivers for most board components as modules
  • Application for detector control
  • Standard UNIX utilities


DCS Board – Flash Reconfiguration

If a board fails to start up (e.g. flash image corrupted by radiation), it can be reconfigured from a neighbor board

  • Boards are connected in a ring, in addition to Ethernet
  • Accessible via JTAG
  • A special FPGA configuration receives data over Ethernet and writes it to flash → bypasses the CPU and reduces reconfiguration time


DCS Board Radiation Tolerance

More potential points of failure than a dedicated ASIC controller

  • But: also more mechanisms to deal with such faults

Expected: no permanent damage to the hardware, only Single Event Upsets (SEUs) in memory/registers

Radiation tests at the level of radiation expected in the detector: 1 SEU every few hours per board


DCS Fault Tests

SDRAM: fill the memory with a pattern, read it out and verify, send a UDP packet via the network on error

  • CPU not used, no OS needed → 100% of the memory can be tested

FPGA configuration SRAM

  • Triple modular redundancy + majority voter detect functional errors
  • No readback of configuration data is possible with this FPGA
  • Find configuration errors by testing the TMR functionality

The SDRAM and SRAM tests can be used to estimate radiation susceptibility – not used in regular operation

Online memory self-test

  • Fill unused memory with test patterns and verify
  • Implemented as a kernel module


8.4 Reliability in Space


Reconfigurable Fault Tolerance (RFT) in Space

Different scenario: FPGAs in space-based applications; data is preprocessed on-board to minimize downlink bandwidth

Common fault detection/mitigation

  • Radiation-hardened devices – very expensive, lower performance
  • TMR – problem: area overhead (> 200% more), assumes the worst-case scenario

Use reconfiguration to adapt to the desired level of redundancy/performance

Developed at the University of Florida


RFT Architecture

SoC with Partial Reconfiguration Regions (PRRs) that contain additional processing modules/accelerators

  • All components except the PRRs may be protected by TMR

A MicroBlaze keeps track of the modules

  • Active or not
  • Switches fault tolerance strategies using the ICAP
  • Initiates recovery when a module encounters an error

src: [JGC09]


RFT Modes

Triple Modular Redundancy (TMR) mode: replicate the module in three different PRRs

  • Voting implemented in the RFT Controller
  • Error → interrupt to the MicroBlaze, which initiates recovery: save the system state, reconfigure the PRR, load the module state back

High Performance mode: no fault tolerance by the system

  • Reliability through module-internal means is still possible


RFT Modes

Self-Checking Pair (SCP) mode:

  • Replicate the module in two different PRRs
  • Error → reconfigure both, repeat the computation

Switching Reconfigurable Fault Tolerance (RFT) modes:

  • Triggered by external events or prior knowledge of the environment
  • The RFT controller disables the affected PRRs and changes the voting procedures
  • Partial bitstreams are sent to the ICAP
  • The RFT controller re-enables the bus connections
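The mode-switching policy amounts to choosing a redundancy level from the predicted radiation environment. A minimal sketch; the function name and numeric thresholds are my own, whereas the actual system is driven by prior orbit knowledge and external events:

```python
def select_rft_mode(predicted_seu_rate, tmr_threshold=1.0, scp_threshold=0.1):
    """Pick a fault-tolerance mode from a predicted SEU rate (events/hour).
    Thresholds are illustrative placeholders."""
    if predicted_seu_rate >= tmr_threshold:
        return "TMR"  # replicate module in 3 PRRs, vote in the RFT controller
    if predicted_seu_rate >= scp_threshold:
        return "SCP"  # self-checking pair: detect, reconfigure, recompute
    return "HP"       # high performance: all PRRs run useful modules
```

The payoff is the one the case studies below quantify: full TMR only when the environment demands it, extra throughput otherwise.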


RFT Reconfiguration

Modules can checkpoint their state; the state is restored after reconfiguration

The memory holding the PR bitstreams needs to be reliable → expensive, small

  • PR bitstreams are already smaller than full bitstreams
  • But: a separate PR bitstream is required for each processing module and PRR combination
  • Bitstream relocation allows loading a bitstream into any of the PRRs


RFT Case Study – ISS

International Space Station

  • Low Earth Orbit – 400 km altitude, 92 min per orbit; avoids travel over the poles to minimize radiation exposure to the crew

SEU rates depend on solar activity, the particular device, etc.

  • Here: only estimates

src: [JGC09]


RFT Case Study – ISS

Prior knowledge of orbit and solar conditions

High Performance mode in orbit sections with low SEU rates

Reconfigure to TMR mode when the radiation exposure is high

During both modes: scrubbing of configuration memory in 30-second cycles

src: [JGC09]


RFT Case Study – ISS

Results

  • The configuration memory repair rate (scrubbing) is much higher than the SEU rate
  • During high radiation periods, traditional TMR and RFT perform similarly (RFT is in TMR mode)
  • During low radiation parts of the orbit, RFT performs better
  • Average performance of RFT over TMR: 2.3x


RFT Case Study – HEO

Highly Elliptical Orbit (HEO): used by communication satellites; they stay longer over an area and can cover polar regions, whereas geostationary orbits only cover equatorial regions

src: [JGC09]


RFT Case Study – HEO

Average radiation is higher

  • The system switches between TMR (3 PRRs used) and Self-Checking Pair (4 PRRs used, running 2 applications) modes
  • Modules checkpoint their state every 5 minutes

Results

  • Performance of 'static SCP' and 'adaptive RFT' is similar
  • Adaptive gains performance with more SEUs
  • Adaptive only uses 3 out of 4 PRRs in TMR mode: turn off the 4th PRR to conserve power, or use it for a performance gain


RFT Case Study – HEO

[Figure: HEO experiment results; src: [JGC09]]


8.5 OTERA


OTERA – Online TEst strategies for runtime Reconfigurable Architectures

RISPP revisited: reliable online reconfiguration using online tests

  • Is the fabric fault-free?
  • Did the reconfiguration process complete correctly?

Must be ensured at runtime!

[Figure: RISPP architecture – core pipeline (IF, ID, EXE, MEM, WB) with data cache/scratchpad, off-chip memory, load/store units & address generation units, and an interface to reconfigurable containers connected via inter-container buses]


OTERA – Test Methods

Pre-configuration test (PRET)

  • Tests the structural integrity of the reconfigurable fabric
  • Executed online, before reconfiguration with the mission logic

Post-configuration test (PORT)

  • Tests correct reconfiguration and interconnection
  • Functional, software-based test
  • Executed online, at speed


Example: Testing a Lookup-Table

Principal structure:

  • Truth table, multiplexer

2 test configurations

  • Set each memory cell to 0 and to 1
  • XOR and XNOR
  • Exhaustive test set (2^n patterns)

Optimizations: C-testable array; pipelining for at-speed test

[Figure: 4-input LUT in the XOR configuration – the truth-table cells hold the parity of the select inputs]
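The XOR/XNOR pair of test configurations can be simulated directly: since the two truth tables are bitwise complements, every LUT memory cell is checked against both 0 and 1, so any single stuck cell fails at least one configuration. A behavioral sketch with a simple stuck-cell fault model (names are mine):

```python
def configure_lut(truth_table, stuck_cell=None):
    """Effective cell contents after configuration; a stuck cell
    (index, value) keeps its value regardless of the loaded truth table."""
    cells = list(truth_table)
    if stuck_cell is not None:
        index, value = stuck_cell
        cells[index] = value
    return cells


def lut_test(n_inputs=4, stuck_cell=None):
    """Run the XOR and XNOR test configurations with all 2^n patterns.
    Returns True if the LUT passes both (i.e. is fault-free)."""
    xor_tt = [bin(p).count("1") % 2 for p in range(2 ** n_inputs)]
    for expected in (xor_tt, [1 - v for v in xor_tt]):  # XOR, then XNOR
        cells = configure_lut(expected, stuck_cell)
        if any(cells[p] != expected[p] for p in range(2 ** n_inputs)):
            return False
    return True
```

A cell stuck at its XOR value passes the XOR run but is caught by the complementary XNOR run, which is exactly why two configurations suffice.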

OTERA – Test Procedure

  • 1. Basic pre-configuration online test (PRET)

[Figure: the run-time system runs PRET on an empty container before reconfiguration; src: [BBI+12]]

OTERA – Test Procedure

  • 2. Reconfigure the accelerator into the container

[Figure: the run-time system loads the bitstream data into the tested container; src: [BBI+12]]

OTERA – Test Procedure

  • 3. Post-reconfiguration online test (PORT)
      • After reconfiguration
      • Periodically during operation

[Figure: PORT runs on the reconfigured container; src: [BBI+12]]


OTERA – PRET System Integration

  • Connect the Test Pattern Generator (TPG) and Output Response Analyzer (ORA) with the reconfigurable containers
  • They can use the inter-container buses for communication
  • After loading a Test Configuration (TC), the test is performed like a regular application-specific Special Instruction

[Figure: RISPP architecture with TPG & ORA attached to the inter-container buses; the run-time system loads the TC data via the ICAP; src: [BBI+12]]


Test Configurations

9 test configurations (TCs) cover all targeted faults in the CLBs

Test configuration scheduling is integrated into the system scheduling & configuration infrastructure

TC | Tested CLB subcomponents         | PRET overhead [CLBs] | Bitstream size [KB] | Freq. [MHz] | Number of patterns
1  | LUT as XOR, via FF               | 2                    | 24.0                | 207         | 64
2  | LUT as XNOR, via FF              | 2                    | 24.0                | 207         | 64
3  | Carry MUX, via latch             | 1                    | 28.6                | 168         | 6
4  | Carry MUX, via latch             | 1                    | 26.1                | 154         | 6
5  | Carry XOR, via FF                | 1                    | 28.0                | 168         | 6
6  | Carry XOR, via FF                | 1                    | 28.2                | 154         | 6
7  | Carry-I/O multiplexed            | 1                    | 27.1                | 183         | 6
8  | LUT as shift reg. with slice MUX | 1                    | 22.9                | 157         | 6
9  | LUT as RAM with slice output     | 7                    | 22.3                | 225         | 320


OTERA – Test Scheduling

[Figure: container schedules over time for the accelerators SAD, Transform, SAV, QuadSub, PointFilter, and Clip in containers 1-5: a) accelerator configurations without tests, b) 1 test configuration per accelerator configuration, c) 9 test configurations per accelerator configuration]


OTERA – Performance Overhead

H.264 video encoding running on the reconfigurable system

Investigating different test frequencies

  • 1 Test Configuration (TC) per X Accelerator Configurations (ACs)

Negligible application performance impact

  • Typically < 1%

[Figure: performance loss (0%-1.4%) over the number of reconfigurable containers (5-14) for 1 TC per 1, 2, 3, and 4 ACs; src: [BBI+12]]


OTERA – Test Latency

Test latency: the time to complete all tests (9 test configurations for all containers)

Short test latency (between 1.2 and 14.1 s)

Depends on the number of containers and the test frequency

[Figure: average test latency (2-16 s) over the number of reconfigurable containers (5-14) for 1 TC per 1, 2, 3, and 4 ACs; src: [BBI+12]]


Module Diversification

Implement functional modules in different ways in terms of CLB usage (placement constraint)

  • Diversified configurations

[Figure: configurations A1-A4 of the same module, each using a different subset of CLBs (used / unused / faulty)]


Generate Diversified Configurations

  • Goal: Create a minimal set of diversified configurations that tolerate any single-CLB fault

[Figure: score matrix over the CLBs, updated as diversified configurations A1, A2, A3 are generated]
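The goal can be phrased as a set-cover problem: every CLB must be left unused by at least one configuration in the set, so that any single faulty CLB can be avoided by switching to that configuration. A greedy sketch of this formulation (illustrative; names are assumptions and this is not the exact OTERA algorithm):

```python
def covers_all_faults(configs, all_clbs):
    """A config set tolerates any single-CLB fault iff every CLB is
    left unused by at least one configuration in the set."""
    return all(any(clb not in cfg for cfg in configs) for clb in all_clbs)

def greedy_min_diverse_set(candidate_configs, all_clbs):
    """Greedy set cover: repeatedly pick the candidate configuration
    (given as its set of used CLBs) that spares the most CLBs that
    every previously chosen configuration still uses."""
    uncovered = set(all_clbs)  # CLBs used by every chosen config so far
    chosen = []
    while uncovered:
        best = max(candidate_configs, key=lambda cfg: len(uncovered - cfg))
        if not (uncovered - best):
            raise ValueError("candidates cannot tolerate all single-CLB faults")
        chosen.append(best)
        uncovered &= best  # only CLBs still used by all chosen configs remain
    return chosen

# Example: 4 CLBs, three candidate placements (sets of used CLBs)
clbs = {0, 1, 2, 3}
candidates = [{0, 1}, {2, 3}, {1, 2}]
print(greedy_min_diverse_set(candidates, clbs))  # → [{0, 1}, {2, 3}]
```

Greedy set cover is not guaranteed minimal in general, but it gives small sets cheaply; the score matrix in the figure plays the role of the `uncovered` bookkeeping here.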


Stress Balancing for Aging Mitigation

  • CLBs are stressed non-uniformly
  • Decrease stress = reduce aging
  • Distribute the stress over CLBs
  • Stress estimation


Stress Reduction

[Figure: a), b) two diversified configurations; c) an alternating schedule; d) a balanced schedule of the minimal set (4 configurations)]
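Given a set of diversified configurations, a balanced schedule can be derived greedily: at each period, pick the configuration that minimizes the maximum accumulated per-CLB stress. A sketch under the simplifying assumption that each used CLB accumulates one unit of stress per period (illustrative, not the OTERA stress model):

```python
def balanced_schedule(configs, clbs, periods):
    """Greedy stress balancing over `periods` scheduling periods.
    Each configuration is a set of used CLBs; a used CLB gains one
    unit of stress per period it is active."""
    stress = {clb: 0 for clb in clbs}
    schedule = []
    for _ in range(periods):
        # Pick the config whose activation keeps the worst-case
        # (maximum) accumulated stress as low as possible.
        def worst_after(cfg):
            return max(stress[c] + (1 if c in cfg else 0) for c in clbs)
        cfg = min(configs, key=worst_after)
        for c in cfg:
            stress[c] += 1
        schedule.append(cfg)
    return schedule, stress

# Example: two complementary configurations over 4 CLBs, 4 periods
sched, final_stress = balanced_schedule([{0, 1}, {2, 3}], range(4), 4)
print(sched)         # → [{0, 1}, {2, 3}, {0, 1}, {2, 3}]
print(final_stress)  # → {0: 2, 1: 2, 2: 2, 3: 2}
```

With complementary configurations this degenerates to the alternating schedule of case c); with overlapping CLB usage the min-max criterion spreads stress more evenly, as in the balanced schedule of case d).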


Conclusion

  • Developed a thorough CLB test and integrated it into a reconfigurable system
    • Using system facilities for reconfiguration and test access
    • Extended tool-chain to create partial bitstreams for Test Configurations
  • Transparent for the application
  • Very low area and performance overhead
    • Performance penalty typically much less than 1%
  • Fast test latency in the order of seconds
    • More than fast enough for targeting aging-induced faults
  • Validated on HW prototype with fault injection


Sources, References, Further Reading

[L99] J. R. Lloyd: "Electromigration in integrated circuit conductors", J. Phys. D: Appl. Phys. 32, R109, 1999.
[CCMA10] M. Choudhury, V. Chandra, K. Mohanram, R. Aitken: "Analytical model for TDDB-based performance degradation in combinational logic", Design, Automation and Test in Europe (DATE), pp. 423-428, 2010.
[LCR03] F. Lima, L. Carro, R. Reis: "Designing fault tolerant systems into SRAM-based FPGAs", Design Automation Conference (DAC), pp. 650-655, 2003.
[CCCV05] N. Campregher, P.Y.K. Cheung, G.A. Constantinides, M. Vasilko: "Analysis of yield loss due to random photolithographic defects in the interconnect structure of FPGAs", International Symposium on Field-Programmable Gate Arrays (FPGA), pp. 138-148, 2005.
[SSC08] E. Stott, P. Sedcole, P. Cheung: "Fault tolerant methods for reliability in FPGAs", International Conference on Field Programmable Logic and Applications (FPL), pp. 415-420, 2008.
[ESSA00] J. Emmert, C. Stroud, B. Skaggs, M. Abramovici: "Dynamic fault tolerance in FPGAs via partial reconfiguration", IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 165-174, 2000.
[LC07] A. Lesea, K. Castellani-Coulie: "Experimental study and analysis of soft errors in 90nm Xilinx FPGA and beyond", 9th European Conference on Radiation and its Effects on Components and Systems (RADECS), pp. 1-5, 2007.
[B06] M. Berg: "Fault tolerance implementation within SRAM based FPGA designs based upon the increased level of single event upset susceptibility", 12th IEEE International On-Line Testing Symposium (IOLTS), pp. 89-91, 2006.


Sources, References, Further Reading (continued)

[HSWK09] J. Heiner, B. Sellers, M. Wirthlin, J. Kalb: "FPGA partial reconfiguration via configuration scrubbing", International Conference on Field Programmable Logic and Applications (FPL), pp. 99-104, 2009.
[K08] T. Krawutschke: "A flexible and reliable embedded system for detector control in a high energy physics experiment", International Conference on Field Programmable Logic and Applications (FPL), pp. 155-160, 2008.
[M07] J. Mercado: "The ALICE Transition Radiation Detector Control System", International Conference on Accelerators and Large Experimental Physics Control Systems (ICALEPCS), pp. 181-183, 2007.
[ALCol03] ALICE Collaboration: "ALICE Technical Design Report of the Trigger, Data Acquisition, High-Level Trigger and Control System", ISBN 92-9083-217-7, pp. 359-412, 2003.
[ALICE] CERN, ALICE Set Up, http://aliceinfo.cern.ch/Public/Objects/Chapter2/ALICE-SetUp-NewSimple.jpg
[JGC09] A. Jacobs, A. George, G. Cieslewski: "Reconfigurable fault tolerance: A framework for environmentally adaptive fault mitigation in space", International Conference on Field Programmable Logic and Applications (FPL), pp. 199-204, 2009.
[BBI+12] L. Bauer, C. Braun, M. E. Imhof, M. A. Kochte, H. Zhang, H.-J. Wunderlich, J. Henkel: "OTERA: Online Test Strategies for Reliable Reconfigurable Architectures", NASA/ESA Conference on Adaptive Hardware and Systems (AHS), pp. 38-45, 2012.