SLIDE 1

Applying Classification Techniques to Remotely-Collected Program Execution Data

Alessandro Orso
Georgia Institute of Technology

Murali Haran
Penn State University

Alan Karr, Ashish Sanil
National Institute of Statistical Sciences

Adam Porter
University of Maryland

Presented at ESEC-FSE, September 2005.

This work was supported in part by NSF awards CCF-0205118 to NISS, CCR-0098158 and CCR-0205265 to the University of Maryland, and CCR-0205422, CCR-0306372, and CCR-0209322 to Georgia Tech.
SLIDE 2

Testing & Analysis after Deployment

[Diagram: program P deployed to many users in the field, each sending field data back for SE tasks.]

| Work | SE Task | Field Data |
| --- | --- | --- |
| [Pavlopoulou99] | Test adequacy | Residual coverage data |
| [Hilbert00] | Usability testing | GUI interactions |
| [Dickinson01] | Failure classification | Caller/callee profiles |
| [Bowring02] | Coverage analysis | Partial coverage data |
| [Orso03] | Impact analysis | Dynamic slices |
| [Liblit05] | Fault localization | Various profiles (returns, …) |

SLIDE 3

Tradeoffs of T&A after Deployment

  • In-house

(+) Complete control (measurements, reruns, …)
(-) Small fraction of behaviors

  • In the field

(+) All (exercised) behaviors
(-) Little control

  • Only partial measures, no reruns, …
  • In particular, no oracles
  • Currently, mostly crashes
SLIDE 4

Our Goal

Provide a technique for automatically identifying failures

  • Mainly, in the field
  • Useful in-house too
  • e.g., with automatically generated test cases, which lack oracles
SLIDE 5

Overview

  • Motivation and Goal
  • General Approach
  • Empirical Studies
  • Conclusion and Future Work
SLIDE 6

Overview

  • Motivation and Goal
  • General Approach
  • Empirical Studies
  • Conclusion and Future Work
SLIDE 7

Background: Classification Techniques

Classification -> Supervised learning -> Machine learning

Many existing techniques (logistic regression, neural networks, tree-based classifiers, SVMs, …)

[Diagram: training and classification. Objects obj 1 … obj n with known labels (x, y, z) feed a learning algorithm, which builds a model; the model then predicts a label for a new object obj i. In this work the objects are executions, the predictors are execution data, the labels are pass/fail, and the learning algorithm is random forests.]

SLIDE 8

Background: Random Forests Classifiers

  • Tree-based classifiers
  • Partition the predictor space into hyper-rectangular regions
  • Regions are assigned a label

(+) Easy to interpret
(-) Unstable

  • Random forests [Breiman01]
  • Integrate many (500) tree classifiers
  • Classification via a voting scheme

(+) Easy to interpret
(+) Stable

[Diagram: example classification tree over predictors size and time; splits such as size ≥ 14.5, size ≥ 8.5, time ≤ 111, and time > 55 route an execution, e.g., (size=10, time=80), to a pass or fail leaf.]
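As a concrete illustration (mine, not from the talk), here is a minimal random-forest sketch in Python with scikit-learn, using 500 trees as on the slide; the predictors size and time come from the example tree above, and the toy data is made up.

```python
# Minimal sketch, assuming scikit-learn; not the authors' actual tooling.
from sklearn.ensemble import RandomForestClassifier

# Toy training data: one row per execution, columns are the example
# predictors from the slide's tree (size, time).
X_train = [[10, 80], [16, 120], [7, 40], [15, 60], [9, 115]]
y_train = ["pass", "fail", "pass", "fail", "fail"]  # known outcomes

# 500 trees, as on the slide; each tree partitions the predictor space
# into hyper-rectangles, and the forest classifies by majority vote.
forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X_train, y_train)

print(forest.predict([[10, 80]]))  # predicted label for a new execution
```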

SLIDE 9

Our Approach

Some critical open issues

  • What data should we collect?
  • What tradeoffs exist between different types of data?
  • How reliable/generalizable are the statistical analyses?

[Diagram: two phases. Training (in-house): an instrumentor produces P inst from P; running P inst on test cases yields runtime execution data plus pass/fail labels, which form the training set fed to the learning algorithm to build the model (random forest). Classification (in the field): users run P inst, and the classifier applies the model to their runtime execution data to produce predicted pass/fail labels.]
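To make the two phases concrete, here is a hedged sketch (all function and file names are hypothetical; assumes scikit-learn and joblib are available):

```python
# Hedged sketch of the two-phase pipeline; names are hypothetical.
from joblib import dump, load
from sklearn.ensemble import RandomForestClassifier

def train_in_house(execution_data, labels, model_path="model.joblib"):
    """Training (in-house): fit a random forest on labeled runs."""
    forest = RandomForestClassifier(n_estimators=500, random_state=0)
    forest.fit(execution_data, labels)  # labels are "pass"/"fail"
    dump(forest, model_path)            # ship the model alongside P inst

def classify_in_field(execution_data, model_path="model.joblib"):
    """Classification (in the field): predict labels for unlabeled runs."""
    forest = load(model_path)
    return forest.predict(execution_data)
```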

SLIDE 10

Specific Research Questions

RQ1: Can we reliably classify program outcomes using execution data?

RQ2: If so, what type of execution data should we collect?

RQ3: How can we reduce runtime data-collection overhead while still producing accurate and reliable classifications?

⇒ Set of exploratory studies

SLIDE 11

Overview

  • Motivation and Goal
  • General Approach
  • Empirical Studies
  • Conclusion and Future Work
SLIDE 12

Experimental Setup (I)

Subject program

  • JABA bytecode analysis library
  • 60 KLOC, 400 classes, 3000 methods
  • 19 single-fault versions (“golden version” + 1 real fault)

Training set

  • 707 test cases (7 drivers applied to 101 input programs)
  • Collected various kinds of execution data (e.g., counts for throws, catch blocks, basic blocks, branches, methods, call edges, …); a conceptual sketch of count collection follows
  • “Golden version” used to label passing/failing runs
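The slides do not show how the counts were collected (on JABA this was Java instrumentation); as a purely conceptual stand-in, this Python sketch counts method executions with the standard profiling hook:

```python
# Conceptual stand-in for the instrumentation (Python instead of the
# Java instrumentation actually used); counts method entries.
import sys
from collections import Counter

method_counts = Counter()

def _profiler(frame, event, arg):
    if event == "call":  # one increment per method (function) entry
        code = frame.f_code
        method_counts[f"{code.co_filename}:{code.co_name}"] += 1

sys.setprofile(_profiler)
# ... run the program under test here ...
sys.setprofile(None)

print(method_counts.most_common(5))  # most frequently executed methods
```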
SLIDE 13

Experimental Setup (II)

[Diagram: the ideal setting. In-house, the training set feeds the learning algorithm, which builds the model (random forest); in the field, the classifier applies the model to users’ runs to produce the predicted outcome (pass/fail).]

Ideal setting, but

  • Expensive
  • Difficult to get enough data points
  • Oracle problem

=> Simulate users’ runs

[Diagram: the simulation. The training set is split, with 2/3 of the runs used for training and the remaining 1/3 standing in for users’ runs; accuracy is measured as the classification error (misclassification rate) on the held-out third.]
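For concreteness, here is a sketch of that simulation (my reconstruction, not the authors’ code; assumes scikit-learn):

```python
# Sketch of the simulation; X holds execution-data vectors, y the
# pass/fail labels obtained from the golden version.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def misclassification_rate(X, y, seed=0):
    # 2/3 of the runs train the model; the held-out 1/3 simulates users' runs.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1 / 3, random_state=seed)
    forest = RandomForestClassifier(n_estimators=500, random_state=seed)
    forest.fit(X_train, y_train)
    predicted = forest.predict(X_test)
    # Fraction of held-out runs whose pass/fail label is predicted wrongly.
    return sum(p != t for p, t in zip(predicted, y_test)) / len(y_test)
```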

SLIDE 14

RQ1 & RQ2: Can We Classify at All? How?

  • RQ1: Can we reliably classify program outcomes using execution data?
  • RQ2: Assuming we can classify program outcomes, what type of execution data should we collect?
  • We first considered a specific kind of execution data: basic-block counts (~20K), a simple measure intuitively related to faults
  • Results: classification error estimates were always almost 0!
  • But time overhead was ~15%, and the data volume was not negligible

=> Consider other kinds of execution data

[Diagram: a matrix of basic-block counts, one row per execution (exec i, …), each row labeled pass/fail.]

SLIDE 15

RQ1 & RQ2: Can We Classify at All? How?

  • We considered other kinds of execution data:
  • Basic-block counts yielded almost perfect predictors => richer data not considered
  • Counts for throws, catch blocks, methods, and call edges
  • Results
  • Throw and catch-block counts are poor predictors
  • Method counts produced nearly perfect models
  • As accurate as block counts, but much cheaper to collect: 3,000 methods vs. 20,000 blocks (overhead < 5%)
  • Branch and call-edge counts are equally accurate, but more costly than method counts

Preliminary conclusion (1): Possible to classify program runs; method counts provided high accuracy at low cost

SLIDE 16

RQ3: Can We Collect Less Information?

  • Method-count models used between 2 and 7 method counts. Great for instrumentation, but…
  • Two alternative hypotheses:
  • Few methods are relevant -> must choose specific methods well
  • Many, redundant methods -> method selection less important
  • To investigate, we performed 100 random samplings (sketched after this slide’s conclusion)
  • Took 10% random samples of the method counts and rebuilt the models
  • Models were excellent 90% of the time
  • Evidence that many method counts are good predictors

Preliminary conclusion (2): the “failure signal” is spread across many entities rather than localized to a few => estimates can be based on very little data, collected with negligible overhead
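A hedged sketch of the sampling study (my reconstruction; the 0.05 error threshold for an “excellent” model is an assumption, and misclassification_rate is the helper sketched earlier):

```python
# Sketch of the 10% random-sampling study; assumes NumPy plus the
# misclassification_rate helper from the earlier sketch. The 0.05
# threshold for an "excellent" model is an assumption.
import numpy as np

def sampling_study(X, y, n_trials=100, fraction=0.10, threshold=0.05):
    X = np.asarray(X)
    n_methods = X.shape[1]
    rng = np.random.default_rng(0)
    excellent = 0
    for trial in range(n_trials):
        # Keep a random 10% of the method-count columns and rebuild.
        cols = rng.choice(n_methods, size=max(1, int(n_methods * fraction)),
                          replace=False)
        if misclassification_rate(X[:, cols], y, seed=trial) < threshold:
            excellent += 1
    return excellent / n_trials  # fraction of trials with an excellent model
```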

SLIDE 17

Validity of the Analysis

Two main issues to consider

  • Multiplicity
  • Generality
SLIDE 18

Statistical Issues -- Multiplicity

When the # of predictors far exceeds the # of data points, the likelihood of finding spurious relationships increases

  • i.e., random relationships confused for real ones

We took two steps to address the problem

  • Considered method counts (the smallest number of predictors)
  • Conducted a study (sketched after this slide) in which we
  • Randomly permuted the method counts
  • Took a 10% random sample of the method counts and rebuilt the models (100 times)

=> Never found good models based on this data

Preliminary conclusion (3): Results are unlikely to be due to random chance

[Diagram: two executions-by-methods count matrices, the original and one in which each execution’s counts have been randomly permuted (e.g., the row 21 8 69 4 … becomes 69 8 4 21 …).]
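A hedged sketch of that permutation check (my reconstruction; the slide’s figure suggests counts are shuffled within each execution’s row, so that detail is inferred):

```python
# Sketch of the permutation study; assumes NumPy plus the
# misclassification_rate helper from the earlier sketch. Shuffling each
# execution's counts destroys any real count-to-method relationship, so
# good models on permuted data would indicate spurious relationships.
import numpy as np

def permutation_study(X, y, n_trials=100, threshold=0.05):
    X = np.asarray(X)
    rng = np.random.default_rng(0)
    # Permute counts within each row, per the slide's figure.
    X_perm = np.array([rng.permutation(row) for row in X])
    good = 0
    for trial in range(n_trials):
        cols = rng.choice(X.shape[1], size=max(1, X.shape[1] // 10),
                          replace=False)
        if misclassification_rate(X_perm[:, cols], y, seed=trial) < threshold:
            good += 1
    return good  # the slide reports never finding good models on such data
```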

SLIDE 19

Statistical Issues -- Generality

Classifiers for 1 specific bug are useful, but…

  • We would like models that encode “correct behavior” for the application in general
  • Looked for predictors that worked in general

⇒ Found 11 excellent predictors for all versions

Programs typically contain more than 1 bug

  • Applied our approach to 6 multi-bug versions
  • Models had error rates below 2% in most cases

Preliminary conclusion (4): Results promising w.r.t. generality (but need to investigate further)

SLIDE 20

Overview

  • Motivation and Goal
  • General Approach
  • Empirical Studies
  • Conclusion and Future Work
SLIDE 21

Summary

  • Possible to classify program outcomes using execution data
  • Method counts gave high accuracy at low cost
  • Estimates can be computed from very little data, collected with negligible overhead
  • Our results are unlikely to be due to random chance and are promising in terms of generality
  • But these are still preliminary results, and we need to investigate further

SLIDE 22

Future Work

  • Multiple faults
  • Investigate the relationship between predictors and failures
  • Investigate the relationship between predictors and faults
  • Conduct further experiments with system(s) in actual use