Impact of Intermittent Faults on Nanocomputing




  1. Impact of Intermittent Faults on Nanocomputing Devices
     Cristian Constantinescu
     June 28th, 2007
     Dependable Systems and Networks

  2. Outline
     • Fault classes
       – Permanent faults
       – Transient faults
       – Intermittent faults
     • Field fault/error data collection
     • Intermittent faults
       – Impact of scaling
     • Mitigation techniques
       – HW vs. SW solutions
     • Summary
     • Q&A

  3. Fault Classes
     • Permanent faults, e.g. stuck-at, bridges, opens
       – Reflect irreversible physical changes
       – Occur at the same location, are always active
     • Transient faults, e.g. particle-induced SEU, noise, ESD
       – Induced by temporary environmental conditions
       – Occur at different locations, at random time instances
     • Intermittent faults, e.g. manufacturing residues, oxide breakdown
       – Occur due to unstable, marginal hardware
       – Occur at the same location
       – May be activated and deactivated
       – Induce bursts of errors
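
The three fault classes above differ in where errors recur and whether the fault stays active. As a rough illustration only, the sketch below applies those criteria to a time-ordered log of accesses. The log format, the function name `classify_faults`, and the decision rules are assumptions made for this example; they are not the method used in the study.

```python
from collections import defaultdict


def classify_faults(observations):
    """Classify each failing location from (location, failed) pairs in time order.

    Hypothetical heuristic:
      permanent    - once the location fails, every later access fails too
      transient    - the location fails exactly once
      intermittent - the location fails repeatedly, with good accesses in between
    """
    history = defaultdict(list)
    for location, failed in observations:
        history[location].append(bool(failed))

    classes = {}
    for location, outcomes in history.items():
        if True not in outcomes:
            continue  # no error ever seen at this location
        after = outcomes[outcomes.index(True):]  # from the first failure onward
        if len(after) > 1 and all(after):
            classes[location] = "permanent"
        elif sum(after) == 1:
            classes[location] = "transient"
        else:
            classes[location] = "intermittent"
    return classes


log = [
    ("a", False), ("a", True), ("a", True), ("a", True),
    ("b", True), ("b", False), ("b", False),
    ("c", True), ("c", False), ("c", True), ("c", True), ("c", False),
    ("d", False),
]
result = classify_faults(log)
```

A real classifier would also need time windows and failure analysis; a short log cannot distinguish a permanent fault from an intermittent one that happens to stay active.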

  4. Fault/Error Data Collection

  5. Fault/Error Data Collection Study
     • Servers from two manufacturers were instrumented to collect errors
       – Manufacturer A: 193 servers, 16 months
       – Manufacturer B: 64 servers, 10 months
     • Examples of reported errors
       – Memory
       – Front side bus
     • Failure analysis performed when possible
     Source: C. Constantinescu, SELSE 2006

  6. Server Instrumentation
     • HAL – hardware abstraction layer
     • MCH – machine check handler
     • CI – component instrumentation
     • Instrumentation validated by fault injection
     [Diagram blocks: Event Log, CI Service, CI Device, MCH Driver, HAL, Chipset, CPU]

  7. Corrected Memory Errors
     [Figure: histogram of the number of systems (0 to 140) versus the number of single-bit errors]
     • 310.7 server years
     • Servers experiencing intermittent faults: 16 out of 257, i.e. 6.2%
     • Corrected single-bit errors (SBE) induced by intermittent faults: 12990 out of 16069, i.e. 80.8%
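
The figures on this slide follow from the fleet sizes and observation periods given for the study (193 servers for 16 months and 64 servers for 10 months). The short check below reproduces the arithmetic; the variable names are chosen for this example.

```python
# (number of servers, months observed) per manufacturer, as stated in the study
fleet = {"A": (193, 16), "B": (64, 10)}

server_months = sum(servers * months for servers, months in fleet.values())
server_years = server_months / 12
total_servers = sum(servers for servers, _ in fleet.values())

# Share of servers that showed intermittent faults
faulty_server_pct = 100 * 16 / total_servers

# Share of corrected single-bit errors attributed to intermittent faults
intermittent_sbe_pct = 100 * 12990 / 16069

print(f"{server_years:.1f} server years")
print(f"{faulty_server_pct:.1f}% of {total_servers} servers")
print(f"{intermittent_sbe_pct:.1f}% of corrected SBE")
```

About 6% of the servers account for about 81% of the corrected single-bit errors.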

  8. Typical Signature of Memory Intermittent Faults
     • Failure analysis: SBE induced intermittently by poly residue within memory chips
     [Figure: daily number of corrected SBE (0 to 120) versus time in days (80 to 448)]
     Source: Hynix Semiconductor

  9. Processor Front Side Bus Errors
     • Front side bus (FSB) errors
       – Bursts of single-bit errors (SBE) on data path
       – SBE detected and corrected (data path protected by ECC)

       Server 1                 Server 2
       P0     P1    P2    P3    P0     P1     P2    P3
       3264   15    0     0     108    121    97    101
       7104   20    0     0     -      -      -     -

     • Servers experiencing FSB intermittent faults: 2 out of 64 (3%)
       – Burst duration examples: 7104 errors in 3 sec; 3264 errors in 18 sec
     • Failure analysis
       – Intermittent contacts at solder joints
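
To put the burst examples in perspective, the sketch below converts them to average error rates and shows one simple way to group corrected-error timestamps into bursts. The grouping rule, the `max_gap` parameter, and the sample timestamps are assumptions for illustration; the slide does not describe how bursts were delimited.

```python
def group_bursts(timestamps, max_gap):
    """Group error timestamps (seconds) into bursts.

    Two consecutive errors belong to the same burst when they are at most
    max_gap seconds apart (hypothetical rule).
    """
    bursts = []
    for t in sorted(timestamps):
        if bursts and t - bursts[-1][-1] <= max_gap:
            bursts[-1].append(t)
        else:
            bursts.append([t])
    return bursts


# Average error rates for the two burst examples on the slide
rate_fast = 7104 / 3    # errors per second
rate_slow = 3264 / 18   # errors per second

# Share of servers with FSB intermittent faults
fsb_faulty_pct = 100 * 2 / 64

# Made-up timestamps, only to exercise the grouping function
sample = [0.0, 0.1, 0.2, 50.0, 50.05, 200.0]
bursts = group_bursts(sample, max_gap=1.0)
```

The first example averages 2368 errors per second, which shows why per-error handling in software can become expensive during a burst.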

  10. More on Intermittent Faults

  11. Timing Violations
     [Figure: BLM delamination]
     • Timing violations due to increased resistance; slow rise and fall times
       – Intermittent behavior occurs before the fault becomes permanent; specific to the 90 nm node and beyond
       – Permanent failures for previous technology nodes
     Source: C. Constantinescu, SELSE 2006

  12. Crosstalk Induced Errors
     • Pulse induced by the affecting line into a victim line
     • Timing violations due to crosstalk
       – Signal speedup or delay
         ▪ Signal speedup: two adjacent lines switch in the same direction
         ▪ Signal delay: two adjacent lines switch in opposite directions
     • Process, voltage and temperature (PVT) variations amplify crosstalk-induced skew
     • Crosstalk increases with interconnect scaling and higher clock frequencies
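
The speedup and delay effects can be illustrated with the common first-order Miller-factor model, in which the coupling capacitance seen by the victim is scaled by 0, 1 or 2 depending on how the neighboring line switches. This model and all component values below are assumptions for illustration; the slides do not give a delay model.

```python
MILLER_FACTOR = {
    "same": 0.0,      # neighbor switches in the same direction: speedup
    "quiet": 1.0,     # neighbor does not switch: nominal delay
    "opposite": 2.0,  # neighbor switches in the opposite direction: delay
}


def victim_delay(r_ohm, c_ground_f, c_coupling_f, neighbor):
    """50% delay of a lumped RC victim line, using t = 0.69 * R * C_eff."""
    c_eff = c_ground_f + MILLER_FACTOR[neighbor] * c_coupling_f
    return 0.69 * r_ohm * c_eff


# Hypothetical wire: 1 kOhm driver and wire resistance, 10 fF to ground, 5 fF coupling
delays = {
    case: victim_delay(1_000.0, 10e-15, 5e-15, case)
    for case in MILLER_FACTOR
}
skew = delays["opposite"] - delays["same"]
```

With these values the victim delay doubles between the two switching cases. That spread has to fit inside the timing margin, and the margin shrinks as clock frequency rises.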

  13. Ultra-thin Oxide Faults
     • Ultra-thin oxide reliability
       – Rate of defect generation decreases with supply voltage
       – Tunnel current increases exponentially with decreasing gate oxide thickness
     • Soft breakdown (SBD)
       – Intermittent fluctuating current, high leakage
       – SBD examples
         ▪ Erratic erasure of flash memory cells
         ▪ Erratic fluctuations of Vmin in SRAM
     [Figure: SRAM Vmin, 90 nm technology; Vmin [V] (0.5 to 0.8) versus Time [s] (0 to 1500)]
     Source: M. Agostinelli et al., IEDM 2005

  14. Scaling Trend of the Vmin Sensitivity
     [Figure: Vmin sensitivity to gate leakage; Vmin [a.u.] (0 to 16) versus Rg [Ohms] (1.00E+07 down to 1.00E+05) for the 90 nm, 65 nm and 45 nm nodes; annotation: increased cell sensitivity]
     Source: M. Agostinelli et al., IEDM 2005

  15. Impact of Process Variations
     • Increasingly difficult to accurately control device parameters
       – Channel length and width
       – Oxide thickness
       – Doping profile
     • Intra-die variations, e.g., different transistor threshold voltages within the same SRAM cell
       – Intermittent failure of read/write operations
     • Impact of process variations is increasing with scaling

  16. Activation of Intermittent Faults
     [Figure: voltage and frequency shmoo, 1.20 V to 1.70 V versus 40 ns to 80 ns; letter-coded entries appear only in the four lowest-voltage rows, toward the 40 ns end, and all other entries are marked *]
       – Voltage
       – Frequency
       – Temperature
       – Workload
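
A shmoo plot of this kind is produced by sweeping supply voltage and clock period and recording the test outcome at each point. The sketch below renders such a grid from a pass/fail callback. The `toy_device` timing model and its constants are entirely hypothetical, and the sketch assumes `*` marks a passing point and uses a single `F` for failures where the original plot uses letter codes.

```python
def shmoo(passes, voltages, periods_ns):
    """Render a text shmoo: one row per voltage (highest first), one column per period."""
    rows = []
    for v in sorted(voltages, reverse=True):
        cells = "".join("*" if passes(v, p) else "F" for p in periods_ns)
        rows.append(f"{v:.2f}V |{cells}|")
    return "\n".join(rows)


def toy_device(voltage, period_ns):
    """Hypothetical device: the minimum working clock period grows as voltage drops."""
    min_period_ns = 31.0 + 6.0 / (voltage - 0.9)
    return period_ns >= min_period_ns


plot = shmoo(toy_device, [1.20, 1.45, 1.70], range(40, 85, 5))
print(plot)
```

A device with an intermittent fault would also show a failing region that shifts with temperature and workload, which a single two-dimensional sweep does not capture.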

  17. Mitigation Techniques

  18. HW Solutions: IBM G5/G6 CPU
     • Mirrored Instruction and Execution (I & E) units
     • Comparator and register unit (R-unit)
     • Compare outputs in the n-1 instruction pipeline stage
       – No error: update checkpoint array (register content and instruction address into R-unit) in last pipeline stage and continue normal execution
       – Error detected: reset CPU (except R-unit), purge cache and its directory, reload last correct state from checkpoint array, retry
     • Transient faults are recovered from
     • Error threshold can be used for intermittent faults
     • Permanent faults require activation of a spare CPU under OS control
     [Diagram blocks: two I & E units, comparator, R-unit, cache]
     Source: L. Spainhower, T. A. Gregg, IBM JR&D, 1999
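
The compare, checkpoint and retry flow above can be mimicked in software to show the control logic. The sketch below is a toy model only: the names `run_with_retry`, `ThresholdExceeded` and `make_flaky_unit`, the representation of instructions as Python functions, and the choice to count errors over the whole run are assumptions made for this example, not details of the IBM design.

```python
class ThresholdExceeded(Exception):
    """Raised when mismatches reach the threshold; a spare CPU would take over."""


def run_with_retry(program, registers, unit_a, unit_b, error_threshold=3):
    """Run each instruction on two mirrored units and compare their outputs.

    On a match the checkpoint (instruction address, registers) advances.
    On a mismatch the instruction is retried from the checkpoint.
    """
    checkpoint = (0, dict(registers))
    errors = 0
    while checkpoint[0] < len(program):
        pc, regs = checkpoint
        out_a = unit_a(program[pc], dict(regs))
        out_b = unit_b(program[pc], dict(regs))
        if out_a == out_b:
            checkpoint = (pc + 1, out_a)  # no error: update checkpoint, continue
        else:
            errors += 1                   # error: discard outputs, retry
            if errors >= error_threshold:
                raise ThresholdExceeded(f"{errors} mismatches at instruction {pc}")
    return checkpoint[1], errors


def good_unit(instruction, regs):
    return instruction(regs)


def make_flaky_unit(bad_calls):
    """Unit that flips one bit of 'acc' on the given call numbers (1-based)."""
    calls = {"n": 0}

    def unit(instruction, regs):
        calls["n"] += 1
        out = instruction(regs)
        if calls["n"] in bad_calls:
            out = dict(out, acc=out["acc"] ^ 1)
        return out

    return unit


program = [lambda r: {**r, "acc": r["acc"] + 1}] * 5
final, errors = run_with_retry(program, {"acc": 0}, good_unit, make_flaky_unit({2}))
```

A single mismatch is absorbed by one retry. A unit that keeps failing reaches the threshold, which is how this model separates a transient fault from an intermittent or permanent one.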

  19. HW Solutions: IBM G5/G6 CPU
     • Pros
       – Lower design complexity
       – Shorter development and validation time
       – No performance penalty (compare and detect cycles are overlapped)
     • Cons
       – Total circuit overhead about 40%
       – It may not scale well with frequency

  20. SW Solutions: AR-SMT
     • Active-stream/Redundant-stream Simultaneous Multithreading (AR-SMT)
       – Two copies of the same program run concurrently, using the SMT microarchitecture
       – Results of the two threads are compared
       – A-stream errors are detected with a delay
       – R-stream errors are detected before commit
       – Recovery from transient faults (e.g. particle-induced soft error) is possible
         ▪ Use committed state of R-stream
     [Diagram blocks: fetch, A-stream, R-stream, delay buffer, commit]
     Source: E. Rotenberg, FTCS, 1999
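
The interplay of the two streams and the delay buffer can be sketched as a small simulation. This is a toy model under stated assumptions: state is a single integer, instructions are Python functions, and the names `run_ar_smt`, `make_flaky_exec`, `delay` and `max_recoveries` are invented for this example. It is not the microarchitecture described in the cited paper.

```python
from collections import deque


def run_ar_smt(program, initial, a_exec, r_exec, delay=4, max_recoveries=10):
    """A-stream runs ahead and queues results; R-stream re-executes and compares.

    Only results confirmed by the R-stream are committed. On a mismatch both
    streams restart from the R-stream's committed state.
    """
    committed = initial          # R-stream committed state (recovery point)
    a_state = initial
    a_pc = r_pc = 0
    buffer = deque()             # delay buffer of A-stream results
    recoveries = 0
    while r_pc < len(program):
        if a_pc < len(program) and len(buffer) < delay:
            a_state = a_exec(program[a_pc], a_state)   # A-stream runs ahead
            buffer.append(a_state)
            a_pc += 1
            continue
        r_result = r_exec(program[r_pc], committed)    # R-stream re-executes
        if r_result == buffer.popleft():
            committed = r_result                       # results agree: commit
            r_pc += 1
        else:
            recoveries += 1
            if recoveries > max_recoveries:
                raise RuntimeError("too many recoveries; fault is not transient")
            buffer.clear()                             # squash the A-stream
            a_state, a_pc = committed, r_pc
    return committed, recoveries


def make_flaky_exec(bad_calls):
    """Executor that corrupts its result on the given call numbers (1-based)."""
    calls = {"n": 0}

    def execute(instruction, state):
        calls["n"] += 1
        result = instruction(state)
        return result + 100 if calls["n"] in bad_calls else result

    return execute


def clean_exec(instruction, state):
    return instruction(state)


program = [lambda x: x + 1] * 6
a_fault = run_ar_smt(program, 0, make_flaky_exec({3}), clean_exec)
r_fault = run_ar_smt(program, 0, clean_exec, make_flaky_exec({2}))
```

In the first run the A-stream error is caught only when the R-stream reaches the same instruction, which is the detection delay noted on the slide. The `max_recoveries` guard reflects that repeated errors can stall progress.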

  21. SW Solutions: AR-SMT
     • Pros
       – AR-SMT relies on existing microarchitectural features, e.g. SMT
       – No HW overhead
     • Cons
       – Increased execution time, 10% to 30%
       – Increased performance penalty or even failure in the case of bursts of high-frequency errors
