SLIDE 1 Measurement of Timing Error Detection Performance
of Software-based Error Detection Mechanisms
and Its Correlation with Simulation
Yutaka Masuda, Masanori Hashimoto, Takao Onoye
Dept. of Information Systems Eng., Osaka University
{masuda.yutaka, hasimoto}@ist.osaka-u.ac.jp
SLIDE 2
Agenda
- Background and objective
- Silicon measurement
- Correlation between silicon measurement and simulation
- Conclusion
SLIDE 3
Challenges in Post-Silicon Validation
A number of tests are run, and unexpected behavior happens due to:
- Logic bug
- Electrical timing error (This work)
Figure (timeline): Test → Error → long detection latency (e.g. billions of cycles) → System crash, blue screen, etc.
Trace buffer depth is limited and older entries are discarded, so the error cannot be recorded in the trace buffer!
To localize errors w/ trace buffer, we need to quickly detect errors!!
SLIDE 4 EDM* Trans. for quick error detection
- E.g. EDM-L (EDM for short Latency) [1]
  – Duplicate all instructions
  – Check : when a variable is written
Original : a = b;
EDM-L : a0 = b0; a1 = b1; if (a0 != a1) error();
Figure (flow): Original program (a=b; ・・・) → EDM trans. → EDM program (a0=b0; a1=b1; Check; ・・・) → C/C++ compile → binary in RAM → Input & Run on processor. No HW modification.
(*) Error Detection Mechanisms, a class of SW-based error detection techniques.
EDM-L quickly detects 86 % of electrical timing errors that vary exec. results [1] (evaluated only in simulation).
[1] Y. Masuda, M. Hashimoto, and T. Onoye, “Performance Evaluation of Software-based Error Detection Mechanisms for Localizing Electrical Timing Failures under Dynamic Supply Noise,” Proc. ICCAD, 2015.
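The duplicate-and-check idea behind EDM-L can be sketched in C as below. This is a minimal illustration, not the transformation tool of [1]: the names `edm_error`, `edm_error_count` and `edm_assign` are hypothetical, the two source copies are passed in as separate arguments so that a mismatch can be injected, and `volatile` is used here only to keep a compiler from merging the two copies.

```c
/* Hypothetical error hook. A real flow might halt the processor or freeze
 * the trace buffer; this sketch only counts detections. */
int edm_error_count = 0;

void edm_error(void) { edm_error_count++; }

/* Original statement: a = b; */
int original_assign(int b) {
    int a = b;
    return a;
}

/* EDM-L style version of the same statement: the write is duplicated and
 * the two copies are compared right after the variable is written. */
int edm_assign(int b0, int b1) {
    volatile int a0 = b0;   /* primary copy */
    volatile int a1 = b1;   /* duplicated copy */
    if (a0 != a1) {
        edm_error();        /* the copies disagree: an error was detected */
    }
    return a0;
}
```

Calling `edm_assign(5, 5)` leaves `edm_error_count` unchanged, while `edm_assign(5, 4)` models a corrupted duplicate and increments it.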
SLIDE 5 Objective
- 1. To answer “How many electrical errors can EDM* localize?” based on silicon measurement!
  Scenario 1 : Localize electrical errors in original program.
  Scenario 2 : Localize electrical errors that vary exec. results.
- 2. To evaluate correlation between sim. and silicon results.
Figure labels: Original, EDM, reproduce, short latency.
SLIDE 6
Reproducibility and Detectability
For EDM to work well, two conditions should be satisfied.
COND1 : Reproducibility (necessary for Scenario1)
  The error in the original program is reproduced in the EDM program (original + duplicated + check).
COND2 : Detectability (necessary for Scenario1 and Scenario2)
  EDM detects the error quickly: error latency ≤ 1000 cycles → satisfied.
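The two conditions split every measured run into four groups. A minimal sketch of that classification follows. The struct fields and function names are hypothetical, and how a run is judged "reproduced" is not specified on the slide, so it is taken here as an input flag. The 1000-cycle limit is the one stated for COND2.

```c
#include <stdbool.h>

#define EDM_LATENCY_LIMIT_CYCLES 1000L   /* COND2 threshold from the slide */

enum edm_outcome { EDM_BOTH, EDM_ONLY_COND1, EDM_ONLY_COND2, EDM_NEITHER };

/* One measured run (hypothetical record layout). */
struct edm_run {
    bool reproduced;      /* COND1: original error also occurs in the EDM program */
    bool detected;        /* the EDM check fired */
    long latency_cycles;  /* cycles from error to detection, valid if detected */
};

bool edm_cond1(const struct edm_run *r) { return r->reproduced; }

bool edm_cond2(const struct edm_run *r) {
    return r->detected && r->latency_cycles <= EDM_LATENCY_LIMIT_CYCLES;
}

enum edm_outcome edm_classify(const struct edm_run *r) {
    bool c1 = edm_cond1(r);
    bool c2 = edm_cond2(r);
    if (c1 && c2) return EDM_BOTH;
    if (c1) return EDM_ONLY_COND1;
    if (c2) return EDM_ONLY_COND2;
    return EDM_NEITHER;
}
```

Under this reading, Scenario 1 needs `EDM_BOTH`, while Scenario 2 only needs `edm_cond2` to hold.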
SLIDE 7
Agenda
- Background and objective
- Silicon measurement
- Correlation between silicon measurement and simulation
- Conclusion
SLIDE 8 Preparation
Evaluate error occurrence border freq. for each workload and Vdd
Setup: PC, connected by USB to the test chip (MeP processor fabricated in 65nm); a DC voltage source supplies Vdd.
Figure: correct and false results plotted over voltage [V] (1.0 to 1.4) and frequency [MHz] (200 to 400); the boundary between them gives the border frequency for each Vdd.
SLIDE 9 Measurement
Evaluate error occurrence time for computing error detection latency.
Timeline: Initialization and user program run @10 MHz, except for Nfast cycles that run @ border freq.
– Repeat program execution by changing Nfast in binary search manner
User program : dijkstra, sha, crc (MiBench)
Supply voltage : 1.0 - 1.4 V with 0.1 V interval
Test chip : 5 chips
Total : 75 measurements (3 programs × 5 voltages × 5 chips)
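The binary search over Nfast can be sketched as follows. This is only an illustration under an assumption the slide does not state explicitly: if running the first n cycles at the border frequency produces an error, then any larger n does too. The callback type `edm_run_fn` and the stand-in `fake_chip_run` are hypothetical; on the real setup the callback would reprogram the clock, rerun the program and compare the result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: run the program with n_fast cycles at the border
 * frequency and return nonzero if the run showed an error. */
typedef int (*edm_run_fn)(long n_fast, void *ctx);

/* Return the smallest n_fast in [1, n_max] for which the run shows an error,
 * or -1 if even n_max fast cycles are error free. */
long find_error_cycle(edm_run_fn run, void *ctx, long n_max) {
    assert(run != NULL && n_max >= 1);
    if (!run(n_max, ctx)) {
        return -1;
    }
    long lo = 1;
    long hi = n_max;   /* invariant: run(hi) errors, every n < lo does not */
    while (lo < hi) {
        long mid = lo + (hi - lo) / 2;
        if (run(mid, ctx)) {
            hi = mid;
        } else {
            lo = mid + 1;
        }
    }
    return lo;
}

/* Stand-in for the test chip: the error appears once n_fast reaches the
 * cycle number stored in ctx. */
int fake_chip_run(long n_fast, void *ctx) {
    long first_error_cycle = *(const long *)ctx;
    return n_fast >= first_error_cycle;
}
```

With the error occurrence cycle found this way, the detection latency would be the cycle at which the EDM check fires minus this value. Each search needs about log2(n_max) program executions.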
SLIDE 10
Evaluation Result
COND1 : Reproducibility, COND2 : Detectability
Scenario1
Detect 25 % of original errors.
  Both COND1 and COND2 satisfied     : 25%
  Only COND1 satisfied               : 4%
  Only COND2 satisfied               : 31%
  Neither COND1 nor COND2 satisfied  : 40%
Scenario2
Detect 56 % of errors varying results.
  Detected & latency < 1000 cycles   : 56%
  Detected & latency > 1000 cycles   : 11%
  Not detected & correct results     : 0%
  Not detected & incorrect results   : 33%
SLIDE 11
Agenda
- Background and objective
- Silicon measurement
- Correlation between silicon measurement and simulation
- Conclusion
SLIDE 12 Simulation setup
Consider 2 simulation setups
- 1. Previous Sim.[1]
- 2. Sim. which updates PDN and definition of border freq.
Evaluation setup (PDN design; border freq.)
  Silicon           : low noise
  Previous Sim.[1]  : 3% - 7% Vdd drop; timing error occurs
  Updated Sim.      : zero noise
Figure: # of errors vs. freq., with the points "Error" and "Results vary" marked.
SLIDE 13 Correlation between silicon and sim. (Scenario1)
(Localize electrical timing errors in original program)
Consistent between updated sim. and silicon
– Detectability for original errors : 25% (Silicon), 23% (Updated Sim.)
COND1 : Reproducibility, COND2 : Detectability
                                     Silicon   Updated Sim.   Previous Sim.[1]
  Both COND1 and COND2 satisfied     25%       23%            0%
  Only COND1 satisfied               4%        7%             4%
  Only COND2 satisfied               31%       20%            20%
  Neither COND1 nor COND2 satisfied  40%       50%            76%
SLIDE 14 Correlation between silicon and sim. (Scenario2)
(Localize potential errors that vary results)
For errors varying results, EDM detects 56 % (Silicon), 44 % (Updated Sim.), 87 % (Previous Sim.[1])
Consistency improvement by simulation update
                                     Silicon   Updated Sim.   Previous Sim.[1]
  Detected & latency < 1000 cycles   56%       44%            20%
  Detected & latency > 1000 cycles   11%       43%            1%
  Not detected & correct results     0%        0%             77%
  Not detected & incorrect results   33%       13%            2%
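The three quoted detection ratios can be reproduced from the pie-chart shares under one reading, which is an assumption and not stated on the slide: the shares are listed in legend order, and runs that are not detected but still give correct results are left out of "errors varying results". The sketch below computes the ratio that way; the struct and function names are hypothetical.

```c
#include <assert.h>

/* Pie-chart shares in percent, in legend order (assumed mapping). */
struct scenario2_shares {
    double detected_fast;         /* detected, latency < 1000 cycles */
    double detected_slow;         /* detected, latency > 1000 cycles */
    double undetected_correct;    /* not detected, results still correct */
    double undetected_incorrect;  /* not detected, results incorrect */
};

/* Share of quickly detected errors among the errors assumed to vary results,
 * i.e. all runs except "not detected & correct results". */
double detection_ratio_percent(const struct scenario2_shares *s) {
    double varying = s->detected_fast + s->detected_slow + s->undetected_incorrect;
    assert(varying > 0.0);
    return 100.0 * s->detected_fast / varying;
}
```

With the shares {56, 11, 0, 33}, {44, 43, 0, 13} and {20, 1, 77, 2}, this gives 56 %, 44 % and about 87 % (20 / 23), matching the figures quoted for silicon, updated sim. and previous sim.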
SLIDE 15
Agenda
- Background and objective
- Silicon measurement
- Correlation between silicon measurement and simulation
- Conclusion
SLIDE 16 Conclusion
- Evaluated error detection performance of EDM transformation for supply-noise-induced timing errors based on silicon measurement.
  – Considered two EDM usage scenarios.
  – In scenario1, EDM detected 25% of original errors.
  – In scenario2, EDM detected 56% of errors varying results.
- Evaluated correlation of EDM performance between sim. and silicon.
  – Updated PDN design and definition of border frequency.
  – Results are consistent between updated sim. and silicon.
SLIDE 17 Backup Slide Difficulty of Electrical Error Localization
Can SW-based trans. debug the original error?
- Program transformation changes the inst. sequence.
- Supply voltage varies.
Figure (voltage vs. time): the original program runs inst. seq. #1, #2, ・・・ and hits an error; the transformed program runs inst. seq. #1 + #1’, Check, ・・・
Does the same error appear?
SLIDE 18 Backup Slide Why low reproduction ratio?
Even when the same instructions are executed, memory and register usage changes.
⇒ EDM changes inductive noise and this prevents the error reproduction.
Figure: supply voltage [mV] vs. time [ns] for "dijkstra_original" and "dijkstra_full-EDM".
Figure: histogram of ratio [%] vs. minimum voltage in the MOV instruction [mV].