
Impact of Intermittent Faults on Nanocomputing Devices
Cristian Constantinescu
June 28th, 2007
Dependable Systems and Networks

Outline
• Fault classes
  – Permanent faults
  – Transient faults
  – Intermittent faults
• Field fault/error data collection
• Intermittent faults
  – Impact of scaling
• Mitigation techniques
  – HW vs. SW solutions
• Summary
• Q&A

Fault Classes
• Permanent faults, e.g. stuck-at, bridges, opens
  – Reflect irreversible physical changes
  – Occur at the same location, are always active
• Transient faults, e.g. particle-induced SEU, noise, ESD
  – Induced by temporary environmental conditions
  – Occur at different locations, at random time instances
• Intermittent faults, e.g. manufacturing residues, oxide breakdown
  – Occur due to unstable, marginal hardware
  – Occur at the same location
  – May be activated and deactivated
  – Induce bursts of errors

Fault/Error Data Collection

Fault/Error Data Collection Study
• Servers from two manufacturers were instrumented to collect errors
  – Manufacturer A: 193 servers, 16 months
  – Manufacturer B: 64 servers, 10 months
• Examples of reported errors
  – Memory
  – Front side bus
• Failure analysis performed when possible
Source: C. Constantinescu, SELSE 2006

Server Instrumentation
[Diagram: Event Log, CI Service, CI Device, MCH Driver, HAL, Chipset, CPU]
HAL – hardware abstraction layer
MCH – machine check handler
CI – component instrumentation
Instrumentation validated by fault injection

Corrected Memory Errors
[Histogram: number of systems versus number of single-bit errors]
• 310.7 server years
• Servers experiencing intermittent faults: 16 out of 257, i.e. 6.2%
• Corrected single-bit errors (SBE) induced by intermittent faults: 12990 out of 16069, i.e. 80.8%
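The figures on this slide can be cross-checked against the study sizes given earlier in the deck (193 servers for 16 months, 64 servers for 10 months). The short Python sketch below only redoes that arithmetic; the variable and function names are illustrative and not part of the original material.

```python
# Study sizes and counts quoted in the deck.
SERVERS_A, MONTHS_A = 193, 16
SERVERS_B, MONTHS_B = 64, 10
SERVERS_WITH_INTERMITTENT = 16
SBE_TOTAL = 16069
SBE_INTERMITTENT = 12990


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


servers_total = SERVERS_A + SERVERS_B
# Observation time expressed in server years (servers * months / 12).
server_years = round((SERVERS_A * MONTHS_A + SERVERS_B * MONTHS_B) / 12, 1)
server_share = percent(SERVERS_WITH_INTERMITTENT, servers_total)
sbe_share = percent(SBE_INTERMITTENT, SBE_TOTAL)

print(f"servers observed: {servers_total}")
print(f"server years: {server_years}")
print(f"servers with intermittent faults: {server_share}%")
print(f"SBE induced by intermittent faults: {sbe_share}%")
```

The point of the check is the contrast between the two shares: a small fraction of the servers accounts for most of the corrected single-bit errors.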

Typical Signature of Memory Intermittent Faults
Failure analysis: SBE induced intermittently by poly residue within memory chips
[Figure: daily number of corrected SBE versus time in days]
Source: Hynix Semiconductor

Processor Front Side Bus Errors
• Front side bus (FSB) errors
  – Bursts of single-bit errors (SBE) on data path
  – SBE detected and corrected (data path protected by ECC)

  Server 1                  Server 2
  P0     P1    P2    P3     P0     P1     P2    P3
  3264   15    0     0      108    121    97    101
  7104   20    0     0      -      -      -     -

• Servers experiencing FSB intermittent faults: 2 out of 64 (3%)
  – Burst duration examples: 7104 errors in 3 sec; 3264 errors in 18 sec
• Failure analysis
  – Intermittent contacts at solder joints
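One way to make the notion of an error burst concrete is to group corrected-error timestamps whenever consecutive errors are close together in time. The sketch below is a minimal illustration of that idea, not the instrumentation used in the study: the function name, the `max_gap` parameter and the sample log are all hypothetical. Only the two burst sizes and durations at the end are taken from the slide.

```python
from typing import List


def split_into_bursts(timestamps: List[float], max_gap: float) -> List[List[float]]:
    """Group error timestamps; a gap larger than max_gap starts a new burst."""
    bursts: List[List[float]] = []
    for t in sorted(timestamps):
        if bursts and t - bursts[-1][-1] <= max_gap:
            bursts[-1].append(t)   # close to the previous error: same burst
        else:
            bursts.append([t])     # long quiet period: new burst
    return bursts


# Hypothetical corrected-error log in seconds (not data from the study).
log = [0.0, 0.1, 0.2, 0.3, 50.0, 50.1, 900.0]
bursts = split_into_bursts(log, max_gap=1.0)
burst_sizes = [len(b) for b in bursts]
print("burst sizes:", burst_sizes)

# Average error rates of the two bursts quoted on the slide.
rate_a = 7104 / 3    # errors per second
rate_b = 3264 / 18   # errors per second
print(f"{rate_a:.0f} errors/s and {rate_b:.1f} errors/s")
```

Rates of hundreds to thousands of errors per second, all at one location, are what separates this signature from isolated transient upsets.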

More on Intermittent Faults

Timing Violations
[Image: BLM delamination]
• Timing violations due to increased resistance; slow rise and fall times
  – Intermittent behavior occurs before the fault becomes permanent (specific to the 90 nm node and beyond)
  – Permanent failures for previous technology nodes
Source: C. Constantinescu, SELSE 2006

Crosstalk Induced Errors
• Pulse induced by the affecting line into a victim line
• Timing violations due to crosstalk
  – Signal speedup or delay
    ▪ Signal speedup: two adjacent lines switch in the same direction
    ▪ Signal delay: two adjacent lines switch in opposite directions
• Process, voltage and temperature (PVT) variations amplify crosstalk induced skew
• Crosstalk increases with interconnect scaling and higher clock frequencies
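The speedup and delay cases can be illustrated with a common first-order textbook model that is not part of the slides: the coupling capacitance to a neighbouring line is weighted by a switching (Miller) factor of 0 when the neighbour switches in the same direction, 1 when it is quiet, and 2 when it switches in the opposite direction. The sketch below applies that assumption to a lumped RC wire; the resistance and capacitance values are hypothetical.

```python
import math

# Assumed switching factors applied to the coupling capacitance.
MILLER_FACTOR = {"same": 0.0, "quiet": 1.0, "opposite": 2.0}


def wire_delay(r_ohm: float, c_ground_f: float, c_couple_f: float, neighbour: str) -> float:
    """50% delay of a lumped RC wire with one coupled neighbour, in seconds."""
    c_eff = c_ground_f + MILLER_FACTOR[neighbour] * c_couple_f
    return math.log(2) * r_ohm * c_eff


# Hypothetical wire parameters, chosen only for illustration.
R_OHM = 1_000.0
C_GROUND_F = 50e-15
C_COUPLE_F = 30e-15

delays = {n: wire_delay(R_OHM, C_GROUND_F, C_COUPLE_F, n) for n in MILLER_FACTOR}
for neighbour, delay in delays.items():
    print(f"{neighbour:>8}: {delay * 1e12:.1f} ps")
```

Under this model the victim line is fastest when its neighbour switches the same way and slowest when it switches the opposite way, so the same path can meet or miss timing depending on the data pattern.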

Ultra-thin Oxide Faults
• Ultra-thin oxide reliability
  – Rate of defect generation decreases with supply voltage
  – Tunnel current increases exponentially with decreasing gate oxide thickness
• Soft breakdown (SBD)
  – Intermittent fluctuating current, high leakage
  – SBD examples
    ▪ Erratic erasure of flash memory cells
    ▪ Erratic fluctuations of Vmin in SRAM
[Figure: SRAM Vmin, 90 nm technology; Vmin [V] versus time [s]]
Source: M. Agostinelli et al., IEDM 2005

Scaling Trend of the Vmin Sensitivity
[Figure: Vmin sensitivity to gate leakage; Vmin [a.u.] versus Rg [Ohms] for the 90 nm, 65 nm and 45 nm nodes; annotation: increased cell sensitivity]
Source: M. Agostinelli et al., IEDM 2005

Impact of Process Variations
• Increasingly difficult to accurately control device parameters
  – Channel length and width
  – Oxide thickness
  – Doping profile
• Intra-die variations, e.g. different transistor threshold voltages within the same SRAM cell
  – Intermittent failure of read/write operations
• Impact of process variations is increasing with scaling

Activation of Intermittent Faults
[Figure: voltage and frequency shmoo; supply voltage from 1.20 V to 1.70 V versus clock period from 40 ns to 80 ns. Letter codes appear only at low voltage and short clock periods; all other points are marked "*".]
– Voltage
– Frequency
– Temperature
– Workload
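A shmoo plot of this kind is produced by sweeping two operating parameters and recording pass or fail at each point. The sketch below shows the sweep under loudly stated assumptions: the pass/fail rule is a made-up alpha-power style delay model with hypothetical constants (`VT`, `ALPHA`, `K_NS`), "*" is assumed to mean pass, and a single "F" stands in for the letter codes of the original plot. It does not model the device shown on the slide.

```python
# Hypothetical constants for an illustrative delay model (not measured data).
VT = 0.40      # threshold voltage in volts
ALPHA = 1.3    # velocity-saturation exponent
K_NS = 30.0    # scale factor in nanoseconds


def critical_delay_ns(v: float) -> float:
    """Assumed critical-path delay at supply voltage v."""
    return K_NS * v / (v - VT) ** ALPHA


def passes(v: float, period_ns: int) -> bool:
    """A point passes when the clock period covers the critical-path delay."""
    return period_ns >= critical_delay_ns(v)


def shmoo_rows() -> list:
    """Sweep 1.70 V down to 1.20 V and 40 ns up to 80 ns; '*' pass, 'F' fail."""
    rows = []
    for millivolts in range(1700, 1150, -50):
        v = millivolts / 1000
        cells = "".join("*" if passes(v, p) else "F" for p in range(40, 85, 5))
        rows.append(f"{v:.2f}V | {cells} |")
    return rows


rows = shmoo_rows()
print("\n".join(rows))
```

In this toy model the failing region sits in the low-voltage, short-period corner, the same corner where the letter codes appear in the slide's plot. A marginal device near that boundary can flip between pass and fail as voltage, frequency, temperature or workload drift.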

Mitigation Techniques

HW Solutions: IBM G5/G6 CPU
[Diagram: two mirrored I & E units, comparator, R-unit, cache]
• Mirrored instruction and execution (I & E) units
• Comparator and register unit (R-unit)
• Compare outputs in the n-1 instruction pipeline stage
  – No error: update checkpoint array (register content and instruction address into R-unit) in last pipeline stage and continue normal execution
  – Error detected: reset CPU (except R-unit), purge cache and its directory, reload last correct state from checkpoint array, retry
• Transient faults are recovered from
• Error threshold can be used for intermittent faults
• Permanent faults require activation of a spare CPU under OS control
Source: L. Spainhower, T. A. Gregg, IBM JR&D, 1999
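The compare, checkpoint and retry flow described on this slide can be sketched in software. The code below is a simplified abstraction, not the G5/G6 design: instructions are plain Python functions on an integer state, the two mirrored units are callables, the checkpoint is a single variable, and the threshold counts total retries. All names (`run_mirrored`, `make_flaky_unit`, `ErrorThresholdExceeded`) are hypothetical.

```python
from typing import Callable, List, Set, Tuple

Instr = Callable[[int], int]
Unit = Callable[[Instr, int], int]


class ErrorThresholdExceeded(Exception):
    """Raised when retries exceed the threshold (escalate, e.g. to a spare CPU)."""


def run_mirrored(program: List[Instr], unit_a: Unit, unit_b: Unit,
                 state: int, threshold: int) -> Tuple[int, int]:
    """Run each instruction on both units; commit on match, retry on mismatch."""
    checkpoint = state   # last state on which both units agreed
    retries = 0
    pc = 0
    while pc < len(program):
        out_a = unit_a(program[pc], checkpoint)
        out_b = unit_b(program[pc], checkpoint)
        if out_a == out_b:
            checkpoint = out_a   # no error: update checkpoint and continue
            pc += 1
        else:
            retries += 1         # error: discard results, retry from checkpoint
            if retries > threshold:
                raise ErrorThresholdExceeded(f"{retries} retries")
    return checkpoint, retries


def good_unit(instr: Instr, state: int) -> int:
    return instr(state)


def make_flaky_unit(bad_calls: Set[int]) -> Unit:
    """Unit that flips the lowest result bit on the given call numbers."""
    calls = 0

    def unit(instr: Instr, state: int) -> int:
        nonlocal calls
        calls += 1
        result = instr(state)
        return result ^ 1 if calls in bad_calls else result

    return unit


program = [lambda s: s + 5, lambda s: s * 3, lambda s: s - 4]

# A single transient error is absorbed by one retry.
final_state, retries = run_mirrored(program, good_unit, make_flaky_unit({2}), 1, threshold=3)

# A burst of errors at the same unit exceeds the threshold and is escalated.
try:
    run_mirrored(program, good_unit, make_flaky_unit({2, 3, 4, 5, 6}), 1, threshold=3)
    escalated = False
except ErrorThresholdExceeded:
    escalated = True

print(final_state, retries, escalated)
```

The second run shows why a threshold matters for intermittent faults: retry alone recovers from an isolated error, but a burst keeps failing the comparison until something decides to stop retrying.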

HW Solutions: IBM G5/G6 CPU
• Pros
  – Lower design complexity
  – Shorter development and validation time
  – No performance penalty (compare and detect cycles are overlapped)
• Cons
  – Total circuit overhead about 40%
  – It may not scale well with frequency

SW Solutions: AR-SMT
• Active-stream/Redundant-stream Simultaneous Multithreading (AR-SMT)
  – Two copies of the same program run concurrently, using the SMT microarchitecture
  – Results of the two threads are compared
  – A-STREAM errors are detected with a delay
  – R-STREAM errors are detected before commit
  – Recovery from transient faults (e.g. particle-induced soft error) is possible
    ▪ Use committed state of R-STREAM
[Diagram: A-STREAM and R-STREAM between FETCH and COMMIT, linked by a DELAY BUFFER]
Source: E. Rotenberg, FTCS, 1999
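The interaction between the two streams and the delay buffer can be sketched as follows. This is a toy model under stated assumptions, not Rotenberg's microarchitecture: instructions are Python functions on an integer state, the A-stream runs ahead and queues its results, and the R-stream re-executes from its own committed state and compares before committing. The names `ar_smt_run`, `make_flaky`, `buffer_depth` and `max_recoveries` are hypothetical.

```python
from collections import deque
from typing import Callable, List, Set, Tuple

Instr = Callable[[int], int]
Exec = Callable[[Instr, int], int]


def ar_smt_run(program: List[Instr], a_exec: Exec, r_exec: Exec, state: int,
               buffer_depth: int = 2, max_recoveries: int = 8) -> Tuple[int, int]:
    """Return (committed state, number of recoveries) for a redundant run."""
    committed = state            # R-stream committed state, used for recovery
    a_state = state              # A-stream state, running ahead
    delay_buffer: deque = deque()
    a_pc = r_pc = recoveries = 0
    while r_pc < len(program):
        # A-stream runs ahead until the delay buffer is full.
        while a_pc < len(program) and len(delay_buffer) < buffer_depth:
            a_state = a_exec(program[a_pc], a_state)
            delay_buffer.append(a_state)
            a_pc += 1
        # R-stream re-executes the oldest instruction and compares before commit.
        r_result = r_exec(program[r_pc], committed)
        if r_result == delay_buffer.popleft():
            committed = r_result
            r_pc += 1
        else:
            recoveries += 1
            if recoveries > max_recoveries:
                raise RuntimeError("too many recoveries; fault is not transient")
            delay_buffer.clear()   # squash A-stream work past the mismatch
            a_pc, a_state = r_pc, committed
    return committed, recoveries


def clean(instr: Instr, state: int) -> int:
    return instr(state)


def make_flaky(bad_calls: Set[int]) -> Exec:
    """Executor that flips the lowest result bit on the given call numbers."""
    calls = 0

    def run(instr: Instr, state: int) -> int:
        nonlocal calls
        calls += 1
        result = instr(state)
        return result ^ 1 if calls in bad_calls else result

    return run


program = [lambda s: s + 5, lambda s: s * 3, lambda s: s - 4, lambda s: s * 2]
committed, recoveries = ar_smt_run(program, make_flaky({2}), clean, 1)
print(committed, recoveries)
```

In the example the A-stream error on its second instruction is only noticed after the A-stream has already executed a third one, which is the "detected with a delay" behaviour. The `max_recoveries` guard is an addition of this sketch to keep it from looping forever on a persistent fault.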

SW Solutions: AR-SMT
• Pros
  – AR-SMT relies on existing microarchitectural features, e.g. SMT
  – No HW overhead
• Cons
  – Increased execution time, 10% - 30%
  – Increased performance penalty or even failure in the case of bursts of high frequency errors
