

SLIDE 1

Post-Silicon Bug Diagnosis with Inconsistent Executions

Andrew DeOrio Daya Shanker Khudia Valeria Bertacco

9 November 2011

University of Michigan

ICCAD’11

SLIDE 2

Impact of errors

  • Functional bugs
  • Electrical failures
  • Transistor faults

17 Jan 1995

FDIV bug: Intel announces a pre-tax charge of $475M against earnings for replacement of flawed processors

Kris Kaspersky: Remote Code Execution Through Intel CPU Bugs

1024-bit RSA secret key extracted in 100 hours

Sandy Bridge bug: 2X as costly as the Pentium FDIV bug ($475M vs. $1B)

SLIDE 3

Post-silicon validation

+ Fast prototypes
+ High coverage
+ Test full system
+ Find deep bugs

[Timeline: Pre-Silicon → Post-Silicon → Product]

  • Poor observability
  • Slow off-chip transfer
  • Noisy
  • Intermittent bugs

Debug prototypes before shipment

SLIDE 4

Post-silicon bugs

  • Intermittent post-silicon bugs are challenging

– The same test does not expose the bug in every run
– Each run exhibits different behavior

  • Our goal: locate intermittent bugs

[Diagram: the same post-silicon test (pushl %ebp; movl %ebp, ...) run repeatedly on the post-si platform yields many different results, making it difficult to debug]

SLIDE 5

Post-silicon debugging

  • Scan chains, logic analyzers

[Whetsel 1991, Abramovici 2006, Dahlgren 2003]

– Limited observability
– Large manual effort

  • Processor-core specific debugging

[Park 2009]

– Limited areas of chip
– Limited time to catch bug

  • Deterministic replay

[Gao 2009, Li 2010, Yang 2008]

– HW/performance overhead
– Perturbation may prevent bug manifestation

SLIDE 6

[Flow diagram: a post-si test runs on the post-si platform with HW sensors; HW logging produces signatures; SW post-analysis separates passing from failing runs and, via the band model, reports the bug location and bug occurrence time]

BPS: “Bug Positioning System”

  • Localize failures

– Time (cycle) and space (signals)

  • Tolerate non-repeatable executions

– Statistical approach

  • Scalable, adaptable to many HW subsystems


SLIDE 7

Signatures


  • Goal: summarize signal value
  • Encodings (hamming, CRC, etc.)

– Large hardware
– Small change in input -> large change in output

  • Counting schemes (time@1, toggles)

[Example: signal A sampled over two windows: time@1 = 2 in the first window, time@1 = 1 in the second]
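The time@1 counting scheme can be sketched in a few lines of Python (a minimal illustration; the helper name `time_at_1` is hypothetical, not from the talk):

```python
def time_at_1(samples, window_size):
    """Count the cycles a signal spends at logic 1 within each
    fixed-size window of a sampled trace (the time@1 signature)."""
    return [sum(samples[i:i + window_size])
            for i in range(0, len(samples), window_size)]

# Signal A over two 4-cycle windows, as in the slide's example:
trace = [1, 0, 1, 0,  0, 0, 1, 0]
print(time_at_1(trace, 4))  # [2, 1]
```

Unlike CRC-style encodings, a small change in the input trace changes time@1 only slightly, which is what makes it usable statistically.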

SLIDE 8


Statistical approach

[Figure: traditional debugging compares one passing testcase against one failing testcase and asks whether their signatures match; statistical debugging instead compares the distributions of signature values (time@1 per window) across many passing and failing testcases, since the same test can yield different results]
SLIDE 9


Signatures for statistical approach

  • Characterize populations of signatures
  • Statistical separation between noise and bug

[Figure: distributions of signature values for passing vs. failing testcases, shown for two signature types. Example: time@1. Example: CRC]
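The separation between noise and bug can be illustrated with a toy score (my own sketch, not the paper's metric): the distance between the passing and failing population means in units of pooled standard deviation.

```python
import statistics

def separation(passing, failing):
    """Toy separation score between two populations of signature
    values: |mean difference| / pooled standard deviation."""
    mu_p, mu_f = statistics.mean(passing), statistics.mean(failing)
    sd = statistics.stdev(passing + failing)
    return abs(mu_p - mu_f) / sd

# time@1 signatures cluster tightly around different means...
t1_pass = [0.42, 0.45, 0.44, 0.43, 0.46]
t1_fail = [0.71, 0.69, 0.73, 0.70, 0.72]
# ...while CRC-like signatures scatter over the whole range
crc_pass = [0.12, 0.95, 0.33, 0.78, 0.51]
crc_fail = [0.88, 0.07, 0.64, 0.29, 0.41]

print(separation(t1_pass, t1_fail) > separation(crc_pass, crc_fail))  # True
```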

SLIDE 10

Signature hardware

  • Measure time@1
  • Use custom hardware or reuse existing debug infrastructure

[Diagram: enable-gated registers on the chip under test count time@1 into a memory buffer, read off-chip through the debug port; 11KB for 100 signals x 100 windows]

SLIDE 11

BPS: “Bug Positioning System”

  • 1. Hardware logging
  • 2. Software post-analysis

[Flow diagram repeated from Slide 6: HW logging produces signatures; SW post-analysis reports bug location and occurrence time via the band model]

SLIDE 12

Bug band model

[Figure: behavior of 1 signal from the MEM stage of a 5-stage pipeline processor. Per window, signatures from passing runs form a passing band (µ ± 2σ); failing-run signatures form a failing band. The bug is detected where the failing band leaves the passing band, shortly after the bug occurrence]
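A minimal sketch of the bug band check (the data layout and function names are my assumptions, not the paper's code):

```python
import statistics

def bug_band(passing_runs, k=2.0):
    """Per-window passing band (mu - k*sigma, mu + k*sigma) built from
    the signatures of many passing runs; k=2 gives the slide's
    mu +/- 2*sigma band."""
    bands = []
    for window_values in zip(*passing_runs):  # one tuple per window
        mu = statistics.mean(window_values)
        sigma = statistics.stdev(window_values)
        bands.append((mu - k * sigma, mu + k * sigma))
    return bands

def first_violation(failing_run, bands):
    """Earliest window whose failing signature leaves the passing
    band: an estimate of the bug occurrence time."""
    for w, (lo, hi) in enumerate(bands):
        if not lo <= failing_run[w] <= hi:
            return w
    return None

passing = [[0.50, 0.51, 0.49], [0.52, 0.50, 0.51], [0.49, 0.50, 0.50]]
failing = [0.50, 0.51, 0.80]            # diverges in the third window
print(first_violation(failing, bug_band(passing)))  # 2
```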

SLIDE 13

SW post-analysis

[Figure: for both the failing group and the passing group, signatures are organized as a matrix of signals (signalA, signalB, signalC, ...) by windows (1-4); the bug band is derived from the passing group's signatures and compared against the failing group's, per signal and per window]
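The post-analysis sweep over the signature matrices can be sketched signal by signal (a rough Python sketch; the data layout and names are my assumptions):

```python
import statistics

def localize(passing, failing, k=2.0):
    """For each signal, compare failing-run signatures against the
    passing mu +/- k*sigma band per window; report each suspect
    signal with its earliest violating window."""
    suspects = {}
    for sig, pass_runs in passing.items():
        for w, column in enumerate(zip(*pass_runs)):  # per-window values
            mu = statistics.mean(column)
            sigma = statistics.stdev(column)
            lo, hi = mu - k * sigma, mu + k * sigma
            if any(not lo <= run[w] <= hi for run in failing[sig]):
                suspects[sig] = w                     # earliest bad window
                break
    return suspects

passing = {"signalA": [[3, 3, 4], [4, 3, 3], [3, 4, 3]],
           "signalB": [[1, 2, 1], [2, 1, 2], [1, 1, 2]]}
failing = {"signalA": [[3, 3, 4], [4, 3, 3]],     # stays in band
           "signalB": [[1, 9, 1], [2, 8, 2]]}     # diverges in window 1
print(localize(passing, failing))  # {'signalB': 1}
```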

SLIDE 14

Experimental setup

  • 10 testcases, 10 random seeds: variable memory delay, random crossbar traffic
  • 10 bugs: e.g., functional bug in PCX, electrical error in Xbar
  • BPS HW + BPS SW monitored 41,744 top-level control signals
  • 1,000 buggy runs, 100 passing runs
  • Measured: detected signals, detection latency

SLIDE 15

Signal Localization

(rows: testcases; columns: bugs)

Testcase    | PCX gnt SA | Xbar elect | BR fxn | MMU fxn | PCX atm SA | PCX fxn | Xbar combo | MCU combo | MMU combo | EXU elect
blimp_rand  | √+         | √          | √      | √       | √+         | √+      | √+         | f.n.      | √+        | f.n.
fp_addsub   | n.b.       | f.p.       | √      | √       | √          | √+      | f.p.       | n.b.      | √+        | f.p.
fp_muldiv   | n.b.       | f.p.       | √      |         | √          | √+      | f.p.       | f.p.      | √+        | f.p.
isa2_basic  | n.b.       | f.n.       | √      | n.b.    | √+         | √+      | √+         | √+        | n.b.      | f.n.
isa3_asr_pr | n.b.       | √          | √      | f.n.    | √+         | √       | √+         | √+        | √         | √
isa3_window | n.b.       | √          | √      | n.b.    | √+         | √       | f.n.       | f.n.      | n.b.      | √
ldst_sync   | n.b.       | √+         | √      | √       | √+         | √+      | √+         | √+        | √+        | n.b.
mpgen_smc   | n.b.       | √+         | √      | √       | √+         | √+      | √+         | √+        | √+        | √+
n2_lsu_asi  | n.b.       | f.n.       | √      | f.n.    | √+         | √+      | √+         | √+        | √+        | n.b.
tlu_rand    | n.b.       | √+         | √      | √       | √+         | √+      | √+         | √+        | √+        | √+

Legend: n.b. = no bug found (bug signal not observable) · √ = found · √+ = exact signal · f.p. = false positive · f.n. = false negative

SLIDE 16

Signal Localization (table repeated from Slide 15)

3 noisy signals excited by floating point benchmarks

SLIDE 17

Signal Localization (table repeated from Slide 15)

wider effects, easier to catch

SLIDE 18

Time to detect bug

[Bar chart: Δ time from bug injection to detection (cycles) for each bug (PCX gnt SA, XBar elect, BR fxn, MMU fxn, PCX atm SA, PCX fxn, XBar combo, MCU combo, MMU combo, EXU elect); average: 1,273 cycles]

SLIDE 19

Number of signals detected

[Bar chart: number of signals detected for each bug; average: 75 signals (0.2% of the 41,744 monitored)]

SLIDE 20

Threshold selection

[Plot: false positives, false negatives, and their sum vs. the bug detection threshold (bug band width); the threshold trades false negatives against false positives, and their sum is minimized at an intermediate value]
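The trade-off can be illustrated by sweeping candidate thresholds over labeled runs (a toy sketch; the scoring and data are illustrative, not the paper's):

```python
def best_threshold(scores, is_buggy, candidates):
    """Count false positives (clean runs flagged) and false negatives
    (buggy runs missed) at each candidate threshold; return the
    threshold minimizing their sum."""
    best = None
    for t in candidates:
        fp = sum(1 for s, b in zip(scores, is_buggy) if s >= t and not b)
        fn = sum(1 for s, b in zip(scores, is_buggy) if s < t and b)
        if best is None or fp + fn < best[1]:
            best = (t, fp + fn)
    return best

# score = how far a run's signatures stray outside the bug band
scores   = [0.05, 0.10, 0.15, 0.60, 0.70, 0.90]
is_buggy = [False, False, True, True, True, True]
print(best_threshold(scores, is_buggy, [0.2, 0.4, 0.6, 0.8, 1.0]))  # (0.2, 1)
```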

SLIDE 21

Area overhead

  • Option 1: reuse existing debug structures
  • Option 2: add counters and memory buffer

– Record a few signals at a time
– 11KB for 100 signals x 100 windows @ 9-bit precision
– 1.35mm2 with 65nm library
– 0.4% of OpenSPARC

[Diagram: same logging hardware as Slide 10: enable-gated registers count into a memory buffer on the chip under test, read off-chip through the debug port]
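The 11KB figure follows directly from the buffer dimensions quoted above:

```python
# 100 signals x 100 windows, one 9-bit time@1 counter per entry
signals, windows, bits = 100, 100, 9
total_bytes = signals * windows * bits // 8
print(total_bytes)  # 11250 bytes, i.e. about 11KB
```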

SLIDE 22

Conclusions

  • BPS automatically localizes bug time and location
  • Leverages a statistical approach to tolerate noise
  • Effective for a variety of bugs: functional, electrical and manufacturing
    – 1,273 cycles, 75 signals on average