Approximate Computing on Unreliable Silicon
Georgios Karakonstantis², Jeremy Constantin, Andreas Burg¹, Adam Teman¹
¹ Swiss Federal Institute of Technology Lausanne (EPFL), Switzerland
² Queen’s University Belfast, U.K.
Dagstuhl 30-11/15
Objective: Improve energy efficiency
Classical main idea: Reduce the complexity
New main idea: Utilize the application’s error resiliency to address hardware-induced errors
Techniques
Metrics
Techniques
computations and variables
degradation Metrics
misbehavior
plane failure
Variability
True randomness
Lack of knowledge
Inability to model (chaotic behavior)
Need for overdesign to account for worst case assumptions
Static components … Dynamic/runtime factors … Wearout/aging …
Random dopant fluctuation, process variations, line-edge roughness, NBTI
Voltage (Vdd) variation, thermal, single event upsets
Data dependencies
Only errors that are truly random (intentionally not covered in this talk)
averaging requires great care
[Figure: failure types along a time axis: manufacturing failure, runtime/dynamic failure (time in seconds), wearout failure (time in years)]
Die to die and within die variations
manufacturing Behavior of each circuit mostly deterministic and on short time scale
data and model uncertainty
with true random input Aging is a slow process
is meaningless
Quality (SNR) degradation
frequency-over-scaling Some key observations:
quality degradation is small
more sensitive to errors (smaller transition region)
Objective: exploit timing margin in low-power processors
in all pipeline stages
clock period
Critical Range Optimization in OpenRISC
Opportunity +38% speedup
Execution time Quality
Task deadline
Approximate computing
Scalable algorithms
Stochastic computing
Application/algorithm-level fault tolerance
New paradigm: Allow for graceful performance degradation
[Figure: histograms of path delay (number of occurrences) at VDD=nominal and VDD=low, each with the target delay marked]
Consideration of the application level provides additional scalability: graceful performance degradation
Application to Communications
Iterative algorithms adjust to process variations
Application of unreliable memories to forward error correction decoders
Transmitter
HSPA+ System
System tolerates surprisingly high number of defects in costly memories
study of inherent fault-tolerance of wireless systems
Compact “better-than-worst-case” memory design for FT applications
50x
performance degradation
Controlled errors with a modified test criterion
Conventional yield criterion: accept only dies with no errors
Modified yield criterion: accept dies with less than N errors
[Figure: histograms of % of dies versus bit errors per die (bins <5, <100, >100) at nominal VDD and at reduced VDD. Yield labels: 80% yield (OK), 90% yield (high), 60% yield (too low), 80% yield (OK)]
toward low power operation Yield loss
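The two yield criteria can be compared with a small sketch. The die data below is hypothetical (chosen only for illustration, not taken from the talk), and the function name `yield_fraction` is an assumption of this example.

```python
def yield_fraction(errors_per_die, n):
    """Fraction of dies accepted when a die passes with fewer than n bit errors."""
    accepted = sum(1 for errors in errors_per_die if errors < n)
    return accepted / len(errors_per_die)

# Hypothetical bit-error counts for ten dies tested at a reduced supply voltage.
reduced_vdd_dies = [0, 0, 0, 0, 0, 0, 2, 4, 40, 300]

# Conventional criterion: only error-free dies pass (fewer than 1 error).
conventional = yield_fraction(reduced_vdd_dies, n=1)

# Modified criterion: dies with fewer than N = 5 errors pass.
modified = yield_fraction(reduced_vdd_dies, n=5)

print(f"conventional yield: {conventional:.0%}, modified yield: {modified:.0%}")
```

With this made-up population, relaxing the criterion recovers dies that have only a handful of failing cells, which is the effect the modified criterion is meant to capture.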
Problem:
error pattern (number of errors and error locations)
Non-ergodicity invalidates quality assessment across dies
Impact on quality distribution:
Different instances of the same memory: very different performance impact
[Figure: error maps of several memory instances, bit positions from LSB to MSB; panels "Few errors in MSBs" and "Many errors in LSBs"]
patterns (predicting the impact of each pattern on quality during test is impossible)
Proper test criteria are hard to define, and ensuring consistent quality is difficult
Solution: ensure that all chips with a given number of errors have the same average quality
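Why the error count alone says little about quality can be sketched numerically. The example assumes unsigned 8-bit words and models each failing cell as a single bit flip; both choices, and the two dies, are illustrative assumptions rather than data from the talk.

```python
WIDTH = 8  # hypothetical word width

def squared_error_of_flip(bit_position):
    """Squared error caused by flipping one bit of an unsigned integer word."""
    return (1 << bit_position) ** 2

# Die A: a single failing cell, located in the MSB position.
die_a_faults = [WIDTH - 1]
# Die B: three failing cells, all in the lowest bit positions.
die_b_faults = [0, 1, 2]

# Total squared error if every failing cell flips its stored bit once.
error_a = sum(squared_error_of_flip(b) for b in die_a_faults)
error_b = sum(squared_error_of_flip(b) for b in die_b_faults)

print(f"die A: {len(die_a_faults)} fault,  squared error {error_a}")
print(f"die B: {len(die_b_faults)} faults, squared error {error_b}")
```

In this sketch the die with fewer faults produces a far larger error, so a test criterion based only on the number of errors cannot guarantee a consistent quality level.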
Physical to logical bit/address mapping
[Figure: memory error maps, bit positions from LSB to MSB, over time/algorithm iterations]
Physical failures remain in the same location
Logical bit failures wander around in the memory
Quality changes with each application
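One way to picture a time-varying physical-to-logical mapping is a bit rotation that changes with every iteration. The rotation scheme, the 8-bit width, and the function names are assumptions made for this sketch; the talk does not specify the mapping it uses.

```python
WIDTH = 8  # hypothetical word width

def logical_bit(physical_bit, rotation, width=WIDTH):
    """Logical bit held by a physical cell when words are rotated left by `rotation`."""
    return (physical_bit - rotation) % width

def average_squared_error(physical_bit, width=WIDTH):
    """Mean squared error of one failing cell, averaged over all rotations."""
    total = sum((1 << logical_bit(physical_bit, r, width)) ** 2 for r in range(width))
    return total / width

# The physical fault never moves, but the logical bit it corrupts changes
# with each rotation, so every fault location yields the same average error.
averages = {p: average_squared_error(p) for p in range(WIDTH)}
for position, value in averages.items():
    print(f"physical bit {position}: average squared error {value}")
```

Under this assumed scheme, a die with a fault in the physical MSB cell and a die with a fault in the physical LSB cell see the same average error, which is the property the proposed solution asks for.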
Best-effort statistical data correction
Data representations for unreliable memories
Roth, Christian, et al. "Data mapping for unreliable memories." Annual Allerton Conference on Communication, Control, and Computing. IEEE, 2012.
Idea: Identify failing bit locations during runtime and store bits of lower significance (LSB) in those locations
at run-time
significance (LSB) in those locations
levels of granularity
always stores the LSB
are shifted
bit integer in 2’s complement mode
fm segments/word
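The idea of storing the least significant bit in a failing location can be sketched as a rotation of each word before it is written. This is a simplified illustration under stated assumptions: one known failing cell per word, unsigned 32-bit values, single-bit shift granularity, and a fault modelled as a bit flip. The helper names (`store`, `load`, `read_with_fault`) are hypothetical and do not come from the talk.

```python
WIDTH = 32  # hypothetical word width

def rotate_left(word, shift, width=WIDTH):
    """Rotate an unsigned `width`-bit word left by `shift` positions."""
    shift %= width
    mask = (1 << width) - 1
    return ((word << shift) | (word >> (width - shift))) & mask

def store(word, faulty_bit):
    """Rotate so that logical bit 0 (the LSB) lands in the failing cell."""
    return rotate_left(word, faulty_bit)

def load(stored, faulty_bit):
    """Undo the rotation applied by `store`."""
    return rotate_left(stored, WIDTH - faulty_bit)

def read_with_fault(stored, faulty_bit):
    """Model the failing cell as flipping the bit it holds."""
    return stored ^ (1 << faulty_bit)

value = 1_000_000
faulty_bit = 31  # the failing cell sits in the physical MSB position

unprotected = read_with_fault(value, faulty_bit)
protected = load(read_with_fault(store(value, faulty_bit), faulty_bit), faulty_bit)

print("error without shuffling:", abs(unprotected - value))
print("error with shuffling:   ", abs(protected - value))
```

In this sketch the fault still corrupts one bit, but the corrupted bit is always the LSB, so the error magnitude drops from 2^31 to 1. A coarser shift granularity would need fewer bits to record the shift per word, at the cost of a larger residual error.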
SECDED ECC.
SECDED ECC
protected in a 32-Bit word
latency overhead by as much as 83%, 89% and 77% respectively
(Elasticnet, PCA and KNN),
and 7% of fault-free memory with SECDED ECC.