Impact of Intermittent Faults on Nanocomputing Devices




SLIDE 1

Impact of Intermittent Faults on Nanocomputing Devices

Cristian Constantinescu
June 28th, 2007
Dependable Systems and Networks

SLIDE 2

Outline

  • Fault classes
    – Permanent faults
    – Transient faults
    – Intermittent faults
  • Field fault/error data collection
  • Intermittent faults
    – Impact of scaling
  • Mitigation techniques
    – HW vs. SW solutions
  • Summary
  • Q&A

SLIDE 3

Fault Classes

  • Permanent faults, e.g. stuck-at, bridges, opens
    – Reflect irreversible physical changes
    – Occur at the same location, are always active
  • Transient faults, e.g. particle induced SEU, noise, ESD
    – Induced by temporary environmental conditions
    – Occur at different locations, at random time instances
  • Intermittent faults, e.g. manufacturing residues, oxide breakdown
    – Occur due to unstable, marginal hardware
    – Occur at the same location
    – May be activated and deactivated
    – Induce bursts of errors
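The three fault classes can be told apart, roughly, by where and when their errors show up in a log. The following is a minimal heuristic sketch, not a method taken from the talk: the function name `classify_locations`, the notion of fixed observation windows, and the decision rules are all assumptions made for illustration.

```python
from collections import defaultdict

def classify_locations(errors, n_windows):
    """Guess a fault class per location from logged errors.

    errors    -- iterable of (location, window_index) pairs (hypothetical log format)
    n_windows -- number of observation windows in the log (should be > 1)
    """
    windows_by_loc = defaultdict(set)   # windows in which each location reported errors
    count_by_loc = defaultdict(int)     # total errors per location
    for loc, window in errors:
        windows_by_loc[loc].add(window)
        count_by_loc[loc] += 1

    classes = {}
    for loc, windows in windows_by_loc.items():
        if len(windows) == n_windows:
            classes[loc] = "permanent"      # same location, active in every window
        elif count_by_loc[loc] > 1:
            classes[loc] = "intermittent"   # same location, recurring, not always active
        else:
            classes[loc] = "transient"      # single isolated event
    return classes

# Made-up log: dimm0 fails in every window, dimm1 produces two bursts, dimm2 one error.
example_log = (
    [("dimm0", w) for w in range(5)]
    + [("dimm1", 1)] * 40
    + [("dimm1", 3)] * 25
    + [("dimm2", 2)]
)
example_classes = classify_locations(example_log, n_windows=5)
```

A real classifier would also need to handle repeated particle strikes at one location and permanent faults that are only exercised by some workloads; this sketch ignores both.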

SLIDE 4

Fault/Error Data Collection

SLIDE 5

Fault/Error Data Collection Study

  • Servers from two manufacturers were instrumented to collect errors
    – Manufacturer A: 193 servers, 16 months
    – Manufacturer B: 64 servers, 10 months
  • Examples of reported errors
    – Memory
    – Front side bus
  • Failure analysis performed when possible

Source: C. Constantinescu, SELSE 2006

SLIDE 6

Server Instrumentation

[Block diagram: CPU, Chipset, HAL, MCH, CI Device Driver, CI Service, Event Log]

HAL – hardware abstraction layer
MCH – machine check handler
CI – component instrumentation

Instrumentation validated by fault injection

SLIDE 7

Corrected Memory Errors

[Histogram: number of systems vs. number of single-bit errors]

  • 310.7 server years
  • Servers experiencing intermittent faults: 16 out of 257, i.e. 6.2%
  • Corrected single-bit errors (SBE) induced by intermittent faults: 12990 out of 16069, i.e. 80.8%
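The figures on this slide can be cross-checked against the fleet sizes given on the data collection study slide (193 servers for 16 months, 64 servers for 10 months). The sketch below assumes every server was observed for the full period, which is not stated in the slides but reproduces the reported total; the variable names are illustrative.

```python
# Fleet data from the study slide: manufacturer -> (servers, months observed).
fleets = {"A": (193, 16), "B": (64, 10)}

# Assumption: each server was monitored for the whole collection period.
server_years = sum(servers * months / 12 for servers, months in fleets.values())
total_servers = sum(servers for servers, _ in fleets.values())

# Share of servers with intermittent faults, and share of corrected SBE they caused.
pct_servers = 100 * 16 / total_servers
pct_sbe = 100 * 12990 / 16069

print(round(server_years, 1), round(pct_servers, 1), round(pct_sbe, 1))
# prints: 310.7 6.2 80.8
```

In other words, about 6% of the servers account for about 81% of the corrected single-bit errors.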

SLIDE 8

Typical Signature of Memory Intermittent Faults

[Plot: daily number of corrected SBE (y-axis, up to 120) vs. time in days (x-axis labels: 80, 86, 89, 92, 95, 135, 138, 344, 445, 448)]

Failure analysis: SBE induced intermittently by poly residue, within memory chips

Source: Hynix Semiconductor

SLIDE 9

Processor Front Side Bus Errors

  • Front side bus (FSB) errors
    – Bursts of single-bit errors (SBE) on data path
    – SBE detected and corrected (data path protected by ECC)
  • Servers experiencing FSB intermittent faults: 2 out of 64 (3%)
    – Burst duration examples: 7104 errors in 3 sec; 3264 errors in 18 sec
  • Failure analysis
    – Intermittent contacts at solder joints

FSB errors per processor (P0 to P3):

            P0     P1     P2     P3
Server 1    3264   15     108    121
Server 2    97     101    7104   20
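To put the two burst examples on a common scale, they can be converted to average error rates. This is plain arithmetic on the numbers quoted on the slide; the helper name `burst_rate` is made up for the example.

```python
def burst_rate(errors, seconds):
    """Average error rate of a burst, in errors per second."""
    return errors / seconds

# (errors, duration in seconds) for the two bursts quoted on the slide.
bursts = [(7104, 3), (3264, 18)]

for errors, seconds in bursts:
    print(f"{errors} errors in {seconds} s: about {burst_rate(errors, seconds):.0f} errors/s")
```

The first burst averages 2368 errors per second and the second about 181 errors per second.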

SLIDE 10

More on Intermittent Faults

SLIDE 11

Timing Violations

[Figure: BLM delamination]

  • Timing violations due to increased resistance; slow rise and fall times
    – Intermittent behavior occurs before the fault becomes permanent; specific to the 90 nm node and beyond
    – Permanent failures for previous technology nodes

Source: C. Constantinescu, SELSE 2006

SLIDE 12

Crosstalk Induced Errors

  • Pulse induced by the affecting line into a victim line
  • Timing violations due to crosstalk
    – Signal speedup or delay
        Signal speedup – two adjacent lines switch in the same direction
        Signal delay – two adjacent lines switch in opposite directions
  • Process, voltage and temperature (PVT) variations amplify crosstalk induced skew
  • Crosstalk increases with interconnect scaling and higher clock frequencies
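The speedup and delay cases can be illustrated with a common first-order model that is not part of the slides: the coupling capacitance to the neighbor is weighted by a Miller coupling factor of 0 (neighbor switches in the same direction), 1 (neighbor quiet) or 2 (neighbor switches in the opposite direction), and the delay is taken as 0.69·R·C. The function name and all component values below are hypothetical.

```python
# Idealized Miller coupling factors for the neighboring line's activity.
MILLER_FACTOR = {"same": 0.0, "quiet": 1.0, "opposite": 2.0}

def victim_delay(r_driver, c_ground, c_coupling, neighbor):
    """First-order 50% delay of the victim line, in seconds.

    r_driver   -- driver resistance in ohms
    c_ground   -- victim capacitance to ground in farads
    c_coupling -- coupling capacitance to the neighboring line in farads
    neighbor   -- "same", "quiet" or "opposite"
    """
    c_effective = c_ground + MILLER_FACTOR[neighbor] * c_coupling
    return 0.69 * r_driver * c_effective

# Hypothetical values: 1 kOhm driver, 50 fF to ground, 50 fF coupling.
for activity in ("same", "quiet", "opposite"):
    delay_ps = victim_delay(1e3, 50e-15, 50e-15, activity) * 1e12
    print(f"neighbor {activity}: {delay_ps:.1f} ps")
```

With these made-up values the victim delay ranges from 34.5 ps to 103.5 ps depending only on what the neighbor does, which is the kind of data-dependent skew that PVT variations can push past the timing margin.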

SLIDE 13

Ultra-thin Oxide Faults

  • Ultra-thin oxide reliability
    – Rate of defect generation decreases with supply voltage
    – Tunnel current increases exponentially with decreasing gate oxide thickness
  • Soft breakdown (SBD)
    – Intermittent fluctuating current, high leakage
    – SBD examples
        Erratic erasure of flash memory cells
        Erratic fluctuations of Vmin in SRAM

[Plot: SRAM Vmin [V] (axis 0.5 to 0.8) vs. time [s] (axis 300 to 1500), 90 nm technology]

Source: M. Agostinelli et al, IEDM 2005

SLIDE 14

Scaling Trend of the Vmin Sensitivity

[Plot: Vmin sensitivity to gate leakage; Vmin [a.u.] vs. Rg [Ohms] from 1.00E+05 to 1.00E+07, curves for 45 nm, 65 nm and 90 nm; annotation: increased cell sensitivity]

Source: M. Agostinelli et al, IEDM 2005

SLIDE 15

Impact of Process Variations

  • Increasingly difficult to accurately control device parameters
    – Channel length and width
    – Oxide thickness
    – Doping profile
  • Intra-die variations, e.g., different transistor threshold voltages within the same SRAM cell
    – Intermittent failure of read/write operations
  • Impact of process variations is increasing with scaling

SLIDE 16

Activation of Intermittent Faults

– Voltage
– Frequency
– Temperature
– Workload

[Figure: voltage and frequency shmoo; supply voltage from 1.20 V to 1.70 V vs. clock period from 40 ns to 80 ns. Points marked "*" fill the plot at higher voltages and longer periods; letter-coded points appear only in the lowest voltage rows (1.20 V up to just below 1.45 V) at the 40 ns end of the axis]

SLIDE 17

Mitigation Techniques

SLIDE 18

HW Solutions: IBM G5/G6 CPU

  • Mirrored Instruction and Execution units
  • Comparator and register unit
  • Compare outputs in n-1 instruction pipeline stage
    – No error: update checkpoint array (register content and instruction address into R-unit) in last pipeline stage and continue normal execution
    – Error detected: reset CPU (except R-unit), purge cache and its directory, reload last correct state from checkpoint array, retry
  • Transient faults are recovered from
  • Error threshold can be used for intermittent faults
  • Permanent faults require activation of a spare CPU under OS control

[Block diagram: two I & E units, comparator, R-unit, cache]

Source: L. Spainhower, T. A. Gregg, IBM JR&D, 1999
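The compare, checkpoint and retry flow described on this slide can be mimicked in a few lines. The following is a toy software model under stated assumptions, not the G5/G6 design: the state is a single integer, a fault flips one bit in the second unit's result, and the retry threshold, the `run_duplicated` name and the "escalate" outcome (standing in for switching to a spare CPU) are all invented for illustration.

```python
def run_duplicated(program, fault, threshold=3):
    """Execute each operation on two mirrored units and compare the results.

    program   -- list of functions mapping an int state to a new int state
    fault     -- fault(index, attempt) -> True if unit B is corrupted on that try
    threshold -- hypothetical number of retries tolerated per instruction
    Returns (status, last_good_state, total_retries).
    """
    checkpoint = 0          # last known-good state (plays the role of the R-unit)
    total_retries = 0
    for index, op in enumerate(program):
        attempt = 0
        while True:
            result_a = op(checkpoint)           # unit A
            result_b = op(checkpoint)           # unit B, mirrored
            if fault(index, attempt):
                result_b ^= 1                   # injected single-bit error
            if result_a == result_b:
                checkpoint = result_a           # no error: update checkpoint
                break
            attempt += 1                        # error: discard and retry from checkpoint
            total_retries += 1
            if attempt > threshold:
                return ("escalate", checkpoint, total_retries)
    return ("ok", checkpoint, total_retries)

program = [lambda x: x + 1] * 5

transient = run_duplicated(program, lambda i, a: i == 2 and a == 0)
burst = run_duplicated(program, lambda i, a: i == 2 and a < 3)
permanent = run_duplicated(program, lambda i, a: True)
```

In this model a transient fault costs one retry, a short intermittent burst costs several retries but still completes, and a fault that never goes away exceeds the threshold and escalates.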

SLIDE 19

HW Solutions: IBM G5/G6 CPU

  • Pros
    – Lower design complexity
    – Shorter development and validation time
    – No performance penalty (compare and detect cycles are overlapped)
  • Cons
    – Total circuit overhead about 40%
    – It may not scale well with frequency

SLIDE 20

SW Solutions: AR-SMT

  • Active-stream/Redundant-stream Simultaneous Multithreading (AR-SMT)
    – Two copies of the same program run concurrently, using the SMT microarchitecture
    – Results of the two threads are compared
    – A-STREAM errors are detected with a delay
    – R-STREAM errors are detected before commit
    – Recovery from transient faults (e.g. particle induced soft error) is possible
        Use committed state of R-STREAM

[Diagram: A-STREAM and R-STREAM, FETCH, DELAY BUFFER, COMMIT]

Source: E. Rotenberg, FTCS, 1999
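The interplay of the two streams and the delay buffer can be sketched as a small simulation. This is a simplified illustration rather than Rotenberg's microarchitecture: state is a single integer, a transient fault flips one bit once, and the function name, the `delay` parameter and the fault sets are assumptions.

```python
from collections import deque

def run_ar_smt(program, delay=2, a_faults=(), r_faults=()):
    """Toy A-stream/R-stream execution with a delay buffer.

    program  -- list of functions mapping an int state to a new int state
    delay    -- delay buffer capacity, i.e. how far the A-stream may run ahead (>= 1)
    a_faults -- instruction indices hit once by a transient fault in the A-stream
    r_faults -- same for the R-stream
    Returns (final committed state, number of recoveries).
    """
    a_faults, r_faults = set(a_faults), set(r_faults)
    a_state = r_state = 0
    a_pc = r_pc = 0
    buffer = deque()            # A-stream results waiting to be checked
    recoveries = 0
    while r_pc < len(program):
        # The A-stream runs ahead until the delay buffer is full.
        while a_pc < len(program) and len(buffer) < delay:
            a_state = program[a_pc](a_state)
            if a_pc in a_faults:
                a_faults.discard(a_pc)
                a_state ^= 1    # corrupted value propagates in the A-stream
            buffer.append(a_state)
            a_pc += 1
        # The R-stream re-executes the oldest instruction and checks before commit.
        result = program[r_pc](r_state)
        if r_pc in r_faults:
            r_faults.discard(r_pc)
            result ^= 1
        if result == buffer.popleft():
            r_state = result    # results agree: commit
            r_pc += 1
        else:
            recoveries += 1     # mismatch: restart both from the R-stream committed state
            buffer.clear()
            a_state, a_pc = r_state, r_pc
    return r_state, recoveries

program = [lambda x: x + 1] * 6
clean = run_ar_smt(program)
a_hit = run_ar_smt(program, a_faults={2})
r_hit = run_ar_smt(program, r_faults={4})
```

An A-stream error is only noticed when the R-stream reaches the same instruction, up to `delay` instructions later, while an R-stream error is caught on the spot because the comparison happens before the R-stream commits.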

SLIDE 21

SW Solutions: AR-SMT

  • Pros
    – AR-SMT relies on existing microarchitectural features, e.g. SMT
    – No HW overhead
  • Cons
    – Increased execution time, 10% - 30%
    – Increased performance penalty or even failure in the case of bursts of high frequency errors

SLIDE 22

Comparing Fault/Error Handling Techniques

  • HW implementations are fast (e.g. ECC) and can handle bursts of errors induced by intermittent faults
  • SW detection and recovery is slower
    – Performance penalty in the case of large bursts of errors
    – Near coincident fault scenario, in the case of high rate bursts of errors => SW fault/error handling may fail before recovery is completed
  • SW solutions are better suited for failure prediction and resource reconfiguration
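The near coincident fault scenario comes down to comparing the recovery latency with the time between errors: if the next error arrives before recovery finishes, the handler is hit while it is still recovering. The sketch below applies that check to the FSB bursts reported earlier in the talk (7104 errors in 3 sec, 3264 errors in 18 sec). The 1 ms software and 1 µs hardware latencies are hypothetical placeholders, not measurements, and errors are assumed to be evenly spaced within a burst.

```python
def recovery_keeps_up(errors, burst_seconds, recovery_seconds):
    """True if recovery finishes before the next error arrives, on average.

    Assumes errors are evenly spaced over the burst, which is a simplification.
    """
    mean_gap = burst_seconds / errors
    return recovery_seconds < mean_gap

SW_RECOVERY = 1e-3   # hypothetical software recovery latency: 1 ms
HW_RECOVERY = 1e-6   # hypothetical hardware correction latency: 1 microsecond

for errors, seconds in [(7104, 3), (3264, 18)]:
    gap_ms = 1000 * seconds / errors
    print(
        f"{errors} errors in {seconds} s (mean gap {gap_ms:.2f} ms): "
        f"SW keeps up: {recovery_keeps_up(errors, seconds, SW_RECOVERY)}, "
        f"HW keeps up: {recovery_keeps_up(errors, seconds, HW_RECOVERY)}"
    )
```

With these placeholder latencies, software recovery keeps pace with the slower burst but not with the faster one, while the hardware path keeps pace with both.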

SLIDE 23

Summary

  • Semiconductor technology is a double-edged sword
    – Smaller dimensions, lower voltages and higher frequencies have led to tremendous performance gains
    – Intermittent and transient faults have become a serious challenge to developers and manufacturers
  • Designing for particle induced soft errors is too narrowly focused
  • Software only techniques cannot effectively handle bursts of errors occurring at a high rate

FAULT TOLERANT CHIPS ARE THE FUTURE

SLIDE 24

Q & A

[Figure labels: Performance, Dependability]