Impact of Intermittent Faults on Nanocomputing Devices
Cristian Constantinescu
June 28th, 2007
Dependable Systems and Networks
Outline
- Fault classes
– Permanent faults
– Transient faults
– Intermittent faults
- Field fault/error data collection
- Intermittent faults
– Impact of scaling
- Mitigation techniques
– HW vs. SW solutions
- Summary
- Q&A
Fault Classes
- Permanent faults, e.g. stuck-at, bridges, opens
– Reflect irreversible physical changes
– Occur at the same location, are always active
- Transient faults, e.g. particle induced SEU, noise, ESD
– Induced by temporary environmental conditions
– Occur at different locations, at random time instances
- Intermittent faults, e.g. manufacturing residues, oxide breakdown
– Occur due to unstable, marginal hardware
– Occur at the same location
– May be activated and deactivated
– Induce bursts of errors
Fault/Error Data Collection
Fault/Error Data Collection Study
- Servers from two manufacturers were instrumented to collect errors
– Manufacturer A: 193 servers, 16 months
– Manufacturer B: 64 servers, 10 months
- Examples of reported errors
– Memory
– Front side bus
- Failure analysis performed when possible
Source: C. Constantinescu, SELSE 2006
Server Instrumentation
HAL – hardware abstraction layer
MCH – machine check handler
CI – component instrumentation
Instrumentation validated by fault injection
[Diagram: CPU, chipset, CI device driver, MCH, CI service, event log, HAL]
Corrected Memory Errors
[Figure: histogram of the number of systems versus the number of single-bit errors]
- 310.7 server years
- Servers experiencing intermittent faults: 16 out of 257, i.e. 6.2%
- Corrected single-bit errors (SBE) induced by intermittent faults: 12990 out of 16069, i.e. 80.8%
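The totals on this slide follow from the study sizes quoted earlier in the deck. The short check below recomputes them; the variable names are illustrative, and the only inputs are figures stated in the slides.

```python
# Study sizes from the data collection slide: (servers, months observed)
fleets = {"A": (193, 16), "B": (64, 10)}

servers = sum(n for n, _ in fleets.values())
server_years = sum(n * months for n, months in fleets.values()) / 12

faulty_share = 100 * 16 / servers    # % of servers with intermittent faults
sbe_share = 100 * 12990 / 16069      # % of corrected SBE due to intermittent faults

print(servers, round(server_years, 1))               # 257 310.7
print(round(faulty_share, 1), round(sbe_share, 1))   # 6.2 80.8
```

A small fraction of the servers therefore accounts for most of the corrected errors.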
Typical Signature of Memory Intermittent Faults
[Figure: daily number of corrected SBE versus time (days)]
Failure analysis: SBE induced intermittently by poly residue, within memory chips
Source: Hynix Semiconductor
Processor Front Side Bus Errors
- Front side bus (FSB) errors
– Bursts of single-bit errors (SBE) on data path
– SBE detected and corrected (data path protected by ECC)
- Servers experiencing FSB intermittent faults: 2 out of 64 (3%)
– Burst duration examples: 7104 errors in 3 sec; 3264 errors in 18 sec
- Failure analysis
– Intermittent contacts at solder joints
          P0    P1    P2    P3
Server 1  3264  15    108   121
Server 2  97    101   7104  20
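To put the two burst examples in perspective, the average error rate and the mean spacing between errors can be derived from the quoted counts and durations. This is a back-of-the-envelope sketch; it assumes errors are spread evenly within a burst, which the slides do not state.

```python
# (errors, duration in seconds) for the two quoted FSB bursts
bursts = [(7104, 3), (3264, 18)]

# Average errors per second and mean time between errors in milliseconds
rates = [errors / seconds for errors, seconds in bursts]
spacings_ms = [1000 * seconds / errors for errors, seconds in bursts]

for (errors, seconds), rate, gap in zip(bursts, rates, spacings_ms):
    print(f"{errors} errors in {seconds} s: {rate:.0f} errors/s, {gap:.2f} ms apart")
```

Under that assumption, the shorter burst delivers an error roughly every 0.4 ms and the longer one roughly every 5.5 ms.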
More on Intermittent Faults
Timing Violations
BLM delamination
- Timing violations due to increased resistance; slow rise and fall times
– Intermittent behavior occurs before the fault becomes permanent; specific to the 90nm node and beyond
– Permanent failures for previous technology nodes
Source: C. Constantinescu, SELSE 2006
Crosstalk Induced Errors
- Pulse induced by the affecting line into a victim line
- Timing violations due to crosstalk
– Signal speedup or delay
  Signal speedup – two adjacent lines switch in the same direction
  Signal delay – two adjacent lines switch in opposite directions
- Process, voltage and temperature (PVT) variations amplify crosstalk induced skew
- Crosstalk increases with interconnect scaling and higher clock frequencies
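The speedup and delay cases are often approximated with a switching (Miller) factor applied to the coupling capacitance. The sketch below uses that first-order RC model, which is an assumption brought in for illustration and not taken from the talk; `victim_delay` and the resistance and capacitance values are hypothetical.

```python
import math

def victim_delay(r_ohm, c_ground, c_couple, neighbor):
    """First-order 50% RC delay of a victim line, in seconds.

    neighbor: 'same' (neighbor switches in the same direction),
    'quiet' (neighbor does not switch), or 'opposite'.
    """
    # Assumed switching factors: 0 for same direction, 1 quiet, 2 opposite
    miller = {"same": 0.0, "quiet": 1.0, "opposite": 2.0}[neighbor]
    c_eff = c_ground + miller * c_couple   # effective load seen by the driver
    return math.log(2) * r_ohm * c_eff

# Hypothetical line: 1 kOhm driver, 20 fF to ground, 30 fF to the neighbor
delays = {case: victim_delay(1e3, 20e-15, 30e-15, case)
          for case in ("same", "quiet", "opposite")}
for case, d in delays.items():
    print(f"{case}: {d * 1e12:.1f} ps")
```

In this model the same wire is fastest when its neighbor switches with it and slowest when the neighbor switches against it, so the delay depends on the data pattern.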
Ultra-thin Oxide Faults
- Ultrathin oxide reliability
– Rate of defect generation decreases with supply voltage
– Tunnel current increases exponentially with decreasing gate oxide thickness
- Soft breakdown (SBD)
– Intermittent fluctuating current, high leakage
– SBD examples
  Erratic erasure of flash memory cells
  Erratic fluctuations of Vmin in SRAM
[Figure: SRAM Vmin [V] versus time [s], 90 nm technology]
Source: M. Agostinelli et al, IEDM 2005
Scaling Trend of the Vmin Sensitivity
[Figure: Vmin sensitivity to gate leakage; Vmin [a.u.] versus Rg [Ohms] for the 45nm, 65nm and 90nm nodes, annotated "Increased cell sensitivity"]
Source: M. Agostinelli et al, IEDM 2005
Impact of Process Variations
- Increasingly difficult to accurately control device parameters
– Channel length and width
– Oxide thickness
– Doping profile
- Intra-die variations, e.g., different transistor threshold voltages within the same SRAM cell
– Intermittent failure of read/write operations
- Impact of process variations is increasing with scaling
Activation of Intermittent Faults
– Voltage
– Frequency
– Temperature
– Workload
[Figure: voltage and frequency shmoo; supply voltage from 1.20V to 1.70V versus cycle time from 40ns to 80ns, with letter-coded entries clustered at low voltage and short cycle times]
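A shmoo plot of this kind is produced by sweeping two operating parameters and recording the outcome at each point. The following is a minimal sketch with a made-up pass/fail model; `passes`, its coefficients, the grid and the '*'/'F' symbols are hypothetical and do not reproduce the measured device.

```python
def passes(voltage, period_ns):
    # Hypothetical device model: the minimum working cycle time grows as
    # the supply voltage drops below 1.45 V. Real data comes from a tester.
    min_period_ns = 40 + 60 * max(0.0, 1.45 - voltage)
    return period_ns >= min_period_ns

def shmoo(voltages, periods_ns):
    """Return one text row per voltage: '*' for pass, 'F' for fail."""
    rows = []
    for v in voltages:
        cells = "".join("*" if passes(v, p) else "F" for p in periods_ns)
        rows.append(f"{v:.2f}V |{cells}|")
    return rows

voltages = [1.70, 1.45, 1.20]
periods_ns = [40, 50, 60, 70, 80]
rows = shmoo(voltages, periods_ns)
print("\n".join(rows))
```

With this toy model only the 1.20V row fails, and only at the two shortest cycle times, which mirrors the general shape of a failing corner at low voltage and high frequency.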
Mitigation Techniques
HW Solutions: IBM G5/G6 CPU
- Mirrored Instruction and Execution units
- Comparator and register unit
- Compare outputs in n-1 instruction pipeline stage
– No error: update checkpoint array (register content and instruction address into R-unit) in last pipeline stage and continue normal execution
– Error detected: reset CPU (except R-unit), purge cache and its directory, reload last correct state from checkpoint array, retry
- Recovery from transient faults
- Error threshold can be used for intermittent faults
- Permanent faults require activation of a spare CPU under OS control
[Diagram: two mirrored I & E units, comparator, R-unit, cache]
Source: L. Spainhower, T. A. Gregg, IBM JR&D, 1999
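The detect, checkpoint and retry flow on this slide can be modeled in software terms. This is an illustrative sketch only, not IBM's implementation: `run_with_retry`, the unit callables and the `threshold` value are hypothetical, and real hardware compares signals in the pipeline instead of calling functions.

```python
def run_with_retry(instructions, unit_a, unit_b, state, threshold=3):
    """Execute each instruction on two mirrored units and compare results.

    On a match the checkpoint is updated. On a mismatch the result is
    discarded and the instruction is retried from the last checkpoint.
    More than `threshold` retries of one instruction is treated as a
    persistent fault and escalated (e.g. to a spare CPU).
    """
    checkpoint = dict(state)
    for instr in instructions:
        retries = 0
        while True:
            a = unit_a(instr, dict(checkpoint))
            b = unit_b(instr, dict(checkpoint))
            if a == b:
                checkpoint = a      # no error: update the checkpoint
                break
            retries += 1            # error detected: retry from checkpoint
            if retries > threshold:
                raise RuntimeError(f"escalate: {instr!r} failed {retries} times")
    return checkpoint

def good_unit(instr, regs):
    regs["acc"] = regs.get("acc", 0) + instr
    return regs

def make_flaky_unit(bad_calls):
    """Unit that returns a wrong result on the given call numbers."""
    calls = {"n": 0}
    def unit(instr, regs):
        calls["n"] += 1
        regs = good_unit(instr, regs)
        if calls["n"] in bad_calls:
            regs["acc"] ^= 1        # injected single-bit error
        return regs
    return unit

final = run_with_retry([1, 2, 3], good_unit, make_flaky_unit({2}), {"acc": 0})
print(final)  # {'acc': 6}
```

A single injected error is absorbed by one retry, while a unit that keeps failing crosses the threshold and is escalated.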
HW Solutions: IBM G5/G6 CPU
- Pros
– Lower design complexity
– Shorter development and validation time
– No performance penalty (compare and detect cycles are overlapped)
- Cons
– Total circuit overhead about 40%
– It may not scale well with frequency
SW Solutions: AR-SMT
- Active-stream/Redundant-stream Simultaneous Multithreading (AR-SMT)
– Two copies of the same program run concurrently, using the SMT microarchitecture
– Results of the two threads are compared
– A-STREAM errors are detected with a delay
– R-STREAM errors are detected before commit
– Recovery from transient faults (e.g. particle induced soft error) is possible
  Use committed state of R-STREAM
[Diagram: A-STREAM, R-STREAM, FETCH, COMMIT, DELAY BUFFER]
Source: E. Rotenberg, FTCS, 1999
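The interplay of the two streams and the delay buffer can be sketched as a small simulation. This is a simplified model under stated assumptions, not Rotenberg's microarchitecture: `ar_smt`, the executor callables and the integer "state" are hypothetical, and the streams run in lockstep turns instead of as concurrent hardware threads.

```python
from collections import deque

def ar_smt(program, execute_a, execute_r, state, delay=2):
    """Run `program` twice, with the R-stream trailing the A-stream.

    A-stream results wait in a bounded delay buffer. The R-stream
    re-executes each instruction, compares, and commits only on a match.
    On a mismatch both streams restart from the R-stream committed state.
    Returns (committed state, number of recoveries).
    """
    committed = state        # R-stream committed state is the recovery point
    a_state = state
    buffer = deque()         # pending A-stream results, oldest first
    a_pc = r_pc = 0
    recoveries = 0
    while r_pc < len(program):
        # The A-stream runs ahead until the delay buffer is full
        while a_pc < len(program) and len(buffer) < delay:
            a_state = execute_a(program[a_pc], a_state)
            buffer.append(a_state)
            a_pc += 1
        expected = buffer.popleft()
        result = execute_r(program[r_pc], committed)
        if result == expected:
            committed = result           # match: commit
            r_pc += 1
        else:
            buffer.clear()               # mismatch: discard speculative work
            a_state, a_pc = committed, r_pc
            recoveries += 1
    return committed, recoveries

def add(instr, acc):
    return acc + instr

def make_flaky(bad_calls):
    """Executor that returns a wrong result on the given call numbers."""
    calls = {"n": 0}
    def execute(instr, acc):
        calls["n"] += 1
        result = acc + instr
        return result + 1 if calls["n"] in bad_calls else result
    return execute

print(ar_smt([1, 2, 3, 4], make_flaky({3}), add, 0))  # (10, 1)
```

The injected A-stream error is only noticed when the R-stream reaches the same instruction, which is the detection delay the slide refers to. A fault that never clears would make this sketch retry indefinitely.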
SW Solutions: AR-SMT
- Pros
– AR-SMT relies on existing micro-architectural features, e.g. SMT
– No HW overhead
- Cons
– Increased execution time, 10% to 30%
– Increased performance penalty or even failure in the case of bursts of high frequency errors
Comparing Fault/Error Handling Techniques
- HW implementations are fast (e.g. ECC) and can handle bursts of errors induced by intermittent faults
- SW detection and recovery is slower
– Performance penalty in the case of large bursts of errors
– Near coincident fault scenario, in the case of high rate bursts of errors => SW fault/error handling may fail before recovery is completed
- SW solutions are better suited for failure prediction and resource reconfiguration
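The near coincident fault scenario can be made concrete with a simple arrival model. The sketch below assumes Poisson error arrivals at a constant rate, which is a modeling assumption and not a result from the study; the two handling latencies are hypothetical round numbers, and the rate is taken from the FSB burst example earlier in the deck.

```python
import math

def p_error_during_recovery(error_rate_hz, recovery_s):
    """Probability that at least one more error arrives before recovery
    completes, assuming Poisson arrivals at a constant rate."""
    return -math.expm1(-error_rate_hz * recovery_s)

burst_rate = 7104 / 3   # errors per second, from the FSB burst example

# Hypothetical handling latencies, for illustration only
latencies_s = {"HW correction": 10e-9, "SW recovery": 10e-3}
for name, latency in latencies_s.items():
    p = p_error_during_recovery(burst_rate, latency)
    print(f"{name}: {p:.2e}")
```

With these assumed latencies, a second error during a hardware correction is rare, while a software recovery is almost certain to be interrupted by another error from the same burst.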
Summary
- Semiconductor technology is a double-edged sword
– Lower dimensions and voltages and higher frequencies have led to tremendous performance gains
– Intermittent and transient faults have become a serious challenge to developers and manufacturers
- Designing for particle induced soft errors is too narrowly focused
- Software only techniques cannot effectively handle bursts of errors occurring at a high rate
FAULT TOLERANT CHIPS ARE THE FUTURE