SLIDE 1

Machine Learning-based Anomaly Detection for Post-silicon Bug Diagnosis

Andrew DeOrio , Qingkun Li, Matthew Burgess and Valeria Bertacco

University of Michigan University of Illinois

SLIDE 2

Verification trends

Wilson Research Group and Mentor Graphics 2010 Functional Verification Study

Andrew DeOrio / University of Michigan 20-Mar-2013 2

SLIDE 3

Increasing post-silicon validation

Bob Barton, Intel. Invited talk at GSRC.


Design and pre-silicon verification effort Post-silicon validation effort

SLIDE 4

Post-silicon validation

Pre-silicon Post-silicon Product

Goal: locate bug


  + Fast prototypes
  + High coverage
  + Test full system
  + Find deep bugs

  • Poor observability
  • Slow off-chip transfer
  • Noisy
  • Intermittent bugs
SLIDE 5

Post-silicon and credit cards


pushl %ebp movl %ebp

same test

many different results

difficult to locate bug!

same card

many different transactions

difficult to locate fraud!

SLIDE 6

Post-silicon and credit cards


pushl %ebp movl %ebp

same card: compare new transaction -> anomaly?

same test: compare failing test -> anomalous time and location

SLIDE 7

Post-silicon and credit cards


pushl %ebp movl %ebp

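[Figure: many runs of the same test serve as training data (positive examples). The training data and an unknown example are passed to a clustering algorithm, which reports the anomalous time and location. Each signal contributes one feature per time window, e.g. signal A: time@1=2, time@1=1; signal B: time@1=1, time@1=2.]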
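The per-window feature in the figure can be sketched as follows. This is a minimal illustration, assuming that time@1 means the number of cycles a signal spends at logic 1 within a time window; the function name `time_at_1`, the window length of 4 cycles, and the two traces are hypothetical and chosen only to reproduce the feature values shown on the slide.

```python
def time_at_1(trace, window):
    """Count the cycles at logic 1 in each consecutive window of a 0/1 trace.

    Assumption: this is what the slide's "time@1" feature measures.
    """
    return [sum(trace[i:i + window]) for i in range(0, len(trace), window)]


# Hypothetical 8-cycle traces for two signals, split into two 4-cycle windows.
signal_a = [0, 1, 1, 0, 0, 1, 0, 0]
signal_b = [1, 0, 0, 0, 1, 1, 0, 0]

features = {
    "A": time_at_1(signal_a, 4),  # one feature value per time window
    "B": time_at_1(signal_b, 4),
}
print(features)
```

Each run of the test then yields one point per time window, with one coordinate per monitored signal.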

SLIDE 8

Learning clusters

[Figure: scatter plot of signal A feature value vs. signal B feature value (time@1) for the 1st time window of one test. The clustering algorithm groups the feature values of passing examples into clusters.]
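A rough sketch of this learning step is shown below. The slides do not name the clustering algorithm, so plain k-means (Lloyd's algorithm) is swapped in here as an assumption, and each cluster is summarized as a centroid plus the radius needed to cover its farthest member. The names `assign` and `learn_clusters`, the choice of `k`, and the sample points are all hypothetical.

```python
import math


def assign(points, centroids):
    """Group each point with its nearest centroid."""
    groups = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda c: math.dist(p, centroids[c]))
        groups[nearest].append(p)
    return groups


def learn_clusters(points, k, iters=20):
    """Return (centroid, radius) pairs learned from passing examples.

    Centroids start at the first k points so the result is deterministic.
    """
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = assign(points, centroids)
        # Move each centroid to the mean of its members; keep it if empty.
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
    groups = assign(points, centroids)
    return [(c, max(math.dist(c, p) for p in g))
            for c, g in zip(centroids, groups) if g]


# Hypothetical (signal A, signal B) feature values from passing runs,
# all taken from the same time window.
passing = [(2, 1), (8, 9), (3, 1), (9, 9), (2, 2), (8, 8)]
clusters = learn_clusters(passing, k=2)
for centroid, radius in clusters:
    print(centroid, round(radius, 3))
```

One such set of clusters would be learned for every time window, since the expected behavior of a signal changes as the test progresses.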

SLIDE 9

Searching for anomalies

[Figure: scatter plot of signal A feature value vs. signal B feature value for the 1st time window of one test. Feature values of unknown examples are added after clustering.]

Inside clusters: no bug

SLIDE 10

Searching for anomalies

[Figure: scatter plot of signal A feature value vs. signal B feature value for the 2nd time window of one test, with unknown examples plotted against the learned clusters.]

Outside clusters: bug found

# anomalies > threshold
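The search step on these two slides could look roughly like the sketch below. It assumes clusters are stored as (centroid, radius) pairs per time window and that a window is flagged when the number of unknown examples falling outside every cluster exceeds a threshold. The function names, the threshold value, and the data are hypothetical.

```python
import math


def is_anomaly(point, clusters):
    """True if the point lies outside every (centroid, radius) hyper-sphere."""
    return all(math.dist(point, c) > r for c, r in clusters)


def flag_windows(unknown_by_window, clusters_by_window, threshold):
    """Return the time windows whose anomaly count exceeds the threshold."""
    flagged = []
    for window, points in unknown_by_window.items():
        count = sum(is_anomaly(p, clusters_by_window[window]) for p in points)
        if count > threshold:
            flagged.append(window)
    return flagged


# Hypothetical clusters learned from passing runs, one list per time window.
clusters_by_window = {
    1: [((2.0, 1.0), 1.0)],
    2: [((1.0, 2.0), 1.0)],
}
# Hypothetical feature values of unknown (possibly failing) runs.
unknown_by_window = {
    1: [(2, 1), (2, 2), (3, 1)],   # all inside the cluster: no bug
    2: [(6, 7), (5, 8), (1, 2)],   # two points outside the cluster
}

flagged = flag_windows(unknown_by_window, clusters_by_window, threshold=1)
print(flagged)
```

The flagged window gives the anomalous time; the signals whose clustering set produced the anomalies give the location.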

SLIDE 11

Clustering in X,000 dimensions

  • Each signal is a dimension
    – Circular clusters become hyper-spheres
    – High dimensionality is a challenge

  • In practice:
    – Cap #signals in one clustering set (500)
    – Group signals by module(s) (100-500 signals)
    – Apply clustering to each group
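The grouping described above might be sketched as follows. This is only an illustration: it assumes hierarchical signal names of the form `module.signal`, which the slides do not specify, and it only splits oversized modules at the 500-signal cap rather than also merging small modules to reach the 100-500 range. The function name `group_signals` and the signal names are hypothetical.

```python
from collections import defaultdict


def group_signals(signal_names, cap=500):
    """Split signals into per-module clustering sets of at most `cap` signals.

    Assumption: the module is the part of the name before the first dot.
    """
    by_module = defaultdict(list)
    for name in signal_names:
        module = name.split(".", 1)[0]
        by_module[module].append(name)

    sets = []
    for module, names in by_module.items():
        for i in range(0, len(names), cap):
            sets.append((module, names[i:i + cap]))
    return sets


# Hypothetical signal list: 700 signals in one module, 120 in another.
signals = [f"pcx.sig{i}" for i in range(700)] + [f"mmu.sig{i}" for i in range(120)]
sets = group_signals(signals)
print([(module, len(names)) for module, names in sets])
```

Clustering is then applied to each set independently, so no single clustering problem has more than 500 dimensions.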

SLIDE 12

Experimental Setup

  • 10 testcases
  • 100 random seeds: variable memory delay, crossbar random traffic
  • 10 bugs: e.g., functional bug in PCX, electrical error in Xbar
  • Monitored 41,743 top-level control signals
  • 1000 buggy runs
  • 1000 passing runs

HW


training data unknown data

10 seeds

SLIDE 13

Bug injection

Bug           Description
PCX_gnt SA    Stuck-at in PCX grant
Xbar elect    Electrical error in crossbar
BR fxn        Functional bug in branch logic
MMU fxn       Functional bug in memory controller
PCX_atm SA    Stuck-at in PCX atomic grant
PCX fxn       Functional bug in PCX
XBar combo    Combined electrical errors in Xbar/PCX
MCU combo     Combined electrical errors in mem/PCX
MMU combo     Combined functional bugs in MMU/PCX
EXU elect     Electrical error in execute unit

SLIDE 14

Bug detection on OpenSPARC T2

[Figure: bar chart, axis "Percentage of Testcases" (0% to 100%). Legend entries: exact signal detected, other signals detected, no bug effect, false negative, false positive. Annotations: Bug detected; Bug not detected; Bug signal not observable.]

9/10 bugs caught

SLIDE 15

Bug signal vs. noise

More training data -> more accuracy

SLIDE 16

Conclusions

  • Machine learning automatically localizes bug time and location
  • Leverages a statistical approach to tolerate noise
  • Effective for a variety of bugs: functional, electrical and manufacturing
    – 336 cycles, 347 signals on average
