SLIDE 1

Predicting Fix Locations from Bug Reports

Master Thesis by Markus Thiele

(Supervised by Rahul Premraj and Tom Zimmermann)

1

Deadline for thesis: 2009-03-31

SLIDE 2

2

Motivation.

SLIDE 3

3

SLIDE 4

4

A typical Bugzilla bug report.

SLIDE 5

Questions

  • Who should fix this bug?

(Anvik, Hiew, and Murphy 2006)

  • How long will it take to fix this bug?

(Weiss, Premraj, Zimmermann, and Zeller 2007)

  • Where should I fix this bug?

5

Previous and related work, and finally the question we focus on.

SLIDE 6

The Problem

  • More than 35400 files in Eclipse
  • More than 1390 packages in Eclipse
  • Developer time is expensive

6

Some impression of the size of the search space.

SLIDE 7

The Vision

7

SLIDE 8

8

SLIDE 9

Likely Fix Locations:

  • FontDialog.java
  • FontData.java
  • FontMetrics.java

9

What we would ultimately like to have: A tool (possibly integrated into a bug database) to automatically predict likely fix locations.

SLIDE 10

The Tool: SVM

Support Vector Machines

10

SVMs find the best-fitting boundary (a hyperplane) between two (or possibly more) classes in a feature space.

SLIDE 11

Our Choice: libsvm

  • Lightweight and easy to use
  • Easily parallelizable (with OpenMP)
  • Supports multiple classes
  • Supports predicting probabilities

11

Widely used SVM implementation.
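
To make this concrete, here is a minimal sketch (not from the thesis) of training a multi-class SVM with probability estimates. It uses scikit-learn's SVC class, which is built on top of libsvm, instead of driving libsvm directly, and the feature vectors and file names below are made-up toy values.

    # Sketch only: scikit-learn's SVC (built on libsvm) stands in for the libsvm API;
    # the feature vectors and fix locations below are toy values, not real data.
    from sklearn.svm import SVC

    X = [  # one already-extracted feature vector per bug report
        [0.5, 0.0, 0.5], [0.6, 0.1, 0.3], [0.4, 0.0, 0.6],
        [0.1, 0.8, 0.1], [0.0, 0.9, 0.1], [0.2, 0.7, 0.1],
        [0.3, 0.3, 0.4], [0.2, 0.4, 0.4], [0.3, 0.2, 0.5],
    ]
    y = [  # fix location (class label) for each report; multiple classes are supported
        "FontDialog.java", "FontDialog.java", "FontDialog.java",
        "FontData.java", "FontData.java", "FontData.java",
        "FontMetrics.java", "FontMetrics.java", "FontMetrics.java",
    ]

    clf = SVC(kernel="linear", probability=True)  # probability=True enables predict_proba
    clf.fit(X, y)

    new_report = [[0.4, 0.1, 0.5]]
    print(clf.predict(new_report))        # most likely fix location
    print(clf.predict_proba(new_report))  # one probability per location (ordered by clf.classes_)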

SLIDE 12

Training Data

  • Data points: Bug reports
  • Features: Extracted from bug reports
  • Classes: Locations

12

Application of SVMs to our problem.

SLIDE 13

Data Points

  • Bugs with known fix locations

(possibly several data points per bug)

  • No enhancements

(inherently hard to predict)

13

A simplification to improve prediction results; enhancements may involve new locations, etc.
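
A possible sketch (not the thesis code) of this selection step: each (bug, fix location) pair becomes one data point, and enhancements are skipped. The report fields used below ("severity", "text", "fixed_files") are hypothetical names chosen for illustration.

    # Sketch: turn bug reports into SVM data points.
    # The dictionary fields ("severity", "text", "fixed_files") are assumed names,
    # not the actual schema of the bug database.
    def to_data_points(bug_reports):
        """One (text, location) data point per known fix location; skip enhancements."""
        points = []
        for bug in bug_reports:
            if bug["severity"] == "enhancement":  # enhancements are inherently hard to predict
                continue
            for location in bug["fixed_files"]:   # possibly several data points per bug
                points.append((bug["text"], location))
        return points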

SLIDE 14

Features

  • Only unstructured data
  • Short description
  • Long description
  • Keywords
  • No structured data
  • No Priority, etc.

14

Structured data is hard to integrate into a single feature vector together with unstructured data (how should it be represented? how should it be weighted?). Also, different bug databases provide different types of metadata.

SLIDE 15

Locations

  • Files
  • Packages (for Java projects)
  • (Sub-)Directories

15

Finer-grained locations seem unlikely to work well; coarser-grained locations would probably be useless.

SLIDE 16

The Process

Old Bug Reports (with known locations) → Feature Extraction → Training → Model

New Bug Report → Feature Extraction → Prediction (using the Model) → Location(s)

16

Overview over the general process used by a possible tool.

SLIDE 17

Feature Extraction

Plain Text → Bag of Words (BOW) → Stop Words → Stemming → Scaling (TF) → Vector

Example: “The Program Crashes” → {The, Program, Crashes} → {Program, Crashes} → {Program, Crash} → [0.5, 0, 0, 0, 0.5, 0, 0, 0, ...]

(vector dimensions correspond to terms such as Program, Crash, GUI, ...)

17

Plain text needs to be converted into feature vectors for the SVM.
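
As an illustration (not the thesis code), the toy pipeline below goes from plain text to a TF-scaled vector. The stop-word list and the suffix-stripping "stemmer" are deliberately minimal stand-ins for a real stop-word list and a Porter-style stemmer, and the vocabulary is assumed to be built from the training corpus.

    # Sketch: plain text -> bag of words -> stop-word removal -> stemming -> TF scaling.
    import re
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "is", "it", "and", "or", "of", "to"}  # tiny illustrative list

    def stem(word):
        """Very naive stemmer, for illustration only (a real pipeline would use e.g. Porter stemming)."""
        for suffix in ("es", "s", "ing", "ed"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def extract_features(text, vocabulary):
        """Return a term-frequency vector over the given vocabulary."""
        words = re.findall(r"[a-z]+", text.lower())
        terms = [stem(w) for w in words if w not in STOP_WORDS]
        counts = Counter(terms)
        total = sum(counts.values()) or 1
        return [counts[term] / total for term in vocabulary]

    vocabulary = ["program", "crash", "gui"]  # in practice built from the training corpus
    print(extract_features("The program crashes", vocabulary))  # -> [0.5, 0.5, 0.0]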

SLIDE 18

Training: Kernels

  • Linear Kernel

Recommended for problems with many data points and many features

  • Radial Basis Function (RBF) Kernel

At least as good, if optimized (optimization expensive)

18

Two commonly used SVM “kernels” (functions to map data points into a different, possibly non-linear space, to find different kinds of boundaries).
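
A sketch of what this comparison could look like (the slides name libsvm; scikit-learn, which wraps libsvm, is assumed here for brevity): the linear kernel is trained essentially as-is, while the RBF kernel gets a grid search over C and gamma, which is exactly the expensive optimization step.

    # Sketch: linear kernel vs. RBF kernel with (expensive) parameter optimization.
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    def train_linear(X, y):
        # Linear kernel: usable essentially out of the box
        return SVC(kernel="linear").fit(X, y)

    def train_optimized_rbf(X, y):
        # RBF kernel: the grid search over C and gamma is what makes optimization expensive
        param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
        search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
        search.fit(X, y)
        return search.best_estimator_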

SLIDE 19

Kernel Comparison

(Graphs comparing: Unoptimized RBF Kernel, Optimized RBF Kernel, Linear Kernel)

19

Why we chose to use linear kernels. Intuitively, the vertical spread of these graphs is related to precision and the horizontal spread to recall. Notice that a linear kernel performs as well as an optimized RBF kernel (and note that the parameter optimization step is very computationally expensive).

SLIDE 20

Evaluation: Data

  • iBUGS & Co.
  • maybe also: JBoss, ALMA, ...

20

Data provided by the SE chair, part of which is available to the public as “iBUGS”. This talk shows results from Eclipse and AspectJ.

SLIDE 21

Experimental Setup

All Bug Reports (with known locations) → Feature Extraction → split into Training Set and Testing Set

Training Set → Training → Model

Testing Set → Prediction (using the Model) → Location(s) → Evaluator(s) → Results

21

Specific experimental setup (to generate results that can be evaluated).

SLIDE 22

Splitting

  • Random Splitting

May predict past from the future: unrealistic

  • Splitting along time axis

Always predict future from the past: realistic

22

Why we chose splitting along a time axis (this is usually the better choice).

SLIDE 23

Splitting

  • We split into 11 folds
  • Up to 10 for training
  • The following one for testing

23

11 folds, just so we have 10 test results.

SLIDE 24

Splitting: Full History

(Diagram: Fold 0, Fold 1, Fold 2, ...; each fold's model is trained on all preceding folds and tested on the fold that follows.)

24

One possibility: Always include all the previous data.

SLIDE 25

Splitting: Partial History

(Diagram: Fold 0, Fold 1, Fold 2, ...; each fold's model is trained only on a limited window of preceding folds and tested on the fold that follows.)

25

Another possibility: Only include part of the previous data (here the history length is 1 fold, but it may also be longer).
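
The sketch below (illustrative only; a "date" field on each bug report is assumed) builds the 11 chronological folds and yields train/test pairs for either a full or a partial history.

    # Sketch: chronological splitting into folds, with full or limited history.
    def make_folds(bug_reports, n_folds=11):
        """Split chronologically sorted bug reports into n_folds equal parts (oldest first)."""
        reports = sorted(bug_reports, key=lambda bug: bug["date"])  # "date" field is assumed
        size = len(reports) // n_folds
        return [reports[i * size:(i + 1) * size] for i in range(n_folds)]

    def train_test_pairs(folds, history=None):
        """Yield (train, test) pairs: train on preceding folds, test on the fold that follows.
        history=None means full history; history=k keeps only the k most recent folds."""
        for i in range(1, len(folds)):
            start = 0 if history is None else max(0, i - history)
            train = [bug for fold in folds[start:i] for bug in fold]
            yield train, folds[i]

With 11 folds this yields exactly 10 train/test pairs, matching the "10 test results" mentioned earlier.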

SLIDE 26

Evaluators

  • Precision and Recall

Precision = #correct predictions / #predictions
Recall = #correct predictions / #total correct

  • Accuracy

Accuracy is a Synonym for Recall when we don’t care about Precision

  • “At Least One Right”

How often do we get an Accuracy greater than 0?

26

Good precision is difficult to achieve, and just getting many of the actually correct locations (good recall) may well be enough for useful results. Control flow analysis, tools like eROSE (which predicts likely additional locations that need to be changed along with another location), etc. may provide some help.
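
For reference, a small sketch (not from the thesis) of the three evaluators, operating on lists of predicted and actual fix locations; the file names in the example calls are made up.

    # Sketch: the three evaluators used above.
    def precision(predicted, actual):
        """#correct predictions / #predictions"""
        return len(set(predicted) & set(actual)) / len(predicted) if predicted else 0.0

    def recall(predicted, actual):
        """#correct predictions / #total correct (i.e. accuracy when precision is ignored)"""
        return len(set(predicted) & set(actual)) / len(actual) if actual else 0.0

    def at_least_one_right(predicted, actual):
        """True if any predicted location is an actual fix location (accuracy > 0)."""
        return bool(set(predicted) & set(actual))

    print(precision(["FontData.java", "Font.java"], ["FontData.java", "FontDialog.java"]))  # 0.5
    print(recall(["FontData.java", "Font.java"], ["FontData.java", "FontDialog.java"]))     # 0.5
    print(at_least_one_right(["Font.java"], ["FontData.java"]))                             # False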

SLIDE 27

Benchmark: The Usual Suspects

  • Simply predict the locations where the most bugs were fixed in the past

  • Easy to implement
  • Proven to be useful

27

Based on a Pareto-type law: 80% of all bugs are in only 20% of all locations (or similar).
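
The baseline can be sketched in a few lines (illustrative only; the "fixed_files" field is an assumed name): count how often each location was fixed in the training data and always predict the top k, regardless of the new bug report's content.

    # Sketch: the "Usual Suspects" baseline.
    from collections import Counter

    def usual_suspects(training_bugs, k=10):
        """Predict the k locations fixed most often in the past."""
        counts = Counter(loc for bug in training_bugs for loc in bug["fixed_files"])
        return [location for location, _ in counts.most_common(k)]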

SLIDE 28

Missing Results

  • Files in Eclipse and Mozilla
  • Very many locations, thus very high memory usage and computation time
  • Results currently unavailable due to technical difficulties

28

This data may be recomputed later.

SLIDE 29

Packages in AspectJ

Total Locations: ~140
Bugs per Fold: ~60
Average Locations per Bug: 1.7 - 2.3

29

First example: relatively few locations, few bug reports in each training fold. Average Locations per Bug is the average number of locations that had to be touched to fix a bug (i.e. an average bug fix in AspectJ touches about 2 packages).

SLIDE 30

Packages in AspectJ

(Precision vs. Recall plot)

30

Just an overview, don’t panic! It looks like we’re doing better than the Usual Suspects. These graphs show the relationship between Precision and Recall when varying the number of predictions, for each fold (fainter colors mean earlier folds).

SLIDE 31

Packages in AspectJ

(Precision vs. Recall plot, Fold 1)

31

Relatively rare (in this case) example where the Usual Suspects get close.

SLIDE 32

Packages in AspectJ

(Precision vs. Recall plot, Fold 8)

32

More common case (again, for this example).

SLIDE 33

Packages in AspectJ

Top 10: Average Accuracy

(Bar chart per fold, in percent: Usual Suspects vs. SVM)

33

Average accuracy is influenced positively for the Usual Suspects, because there are many bugs in AspectJ (especially early on) with the exact same fix locations. Average accuracy is influenced negatively for the SVM, because identical vectors with different locations (i.e. generally having to predict more than one correct location) do not work well with SVMs (which are mostly designed to find one correct class for each data point).

SLIDE 34

Packages in AspectJ

Top 10: At Least One Right

(Bar chart per fold, in percent: Usual Suspects vs. SVM vs. Random Chance)

34

See the notes on the previous slide. If we just check how often we get at least one right, we do better than the Usual Suspects (because they mostly just perfectly predict the aforementioned special set of bugs, but not much else).

SLIDE 35

Files in AspectJ

Total Locations: ~2390
Bugs per Fold: ~60
Average Locations per Bug: 5.6 - 9

35

Second example: Many more locations than bug reports in each training fold. We can intuitively predict that this will not work well.

SLIDE 36

Files in AspectJ

(Precision vs. Recall plot)

36

This does not bode well either.

SLIDE 37

Files in AspectJ

(Precision vs. Recall plot)

37

A closer look.

SLIDE 38

Files in AspectJ

(Precision vs. Recall plot, Fold 6)

38

A relatively common case of the Usual Suspects doing better in this example.

SLIDE 39

Files in AspectJ

Top 10: Average Accuracy

(Bar chart per fold, in percent: Usual Suspects vs. SVM)

39

Average performance is poor all over (which is to be expected with so many possible locations).

SLIDE 40

Files in AspectJ

Top 10: At Least One Right

(Bar chart per fold, in percent: Usual Suspects vs. SVM vs. Random Chance)

40

The Usual Suspects still benefit from the aforementioned special bugs (which are not only fixed in the same packages, but often also in the same files); the SVM suffers from even more conflicting locations.

SLIDE 41

Files in AspectJ

Top 10: At Least One Right

(Bar chart per fold, in percent: Usual Suspects vs. SVM vs. Random Chance vs. SVM with shorter history)

41

We try to counteract the “pollution” of our model by the aforementioned special bugs by reducing the history size, with moderate success.

SLIDE 42

Packages in Eclipse

Total Locations: ~1390
Bugs per Fold: ~2300
Average Locations per Bug: 1.5 - 2

42

Third example: Many more bug reports in each training set than locations. We intuitively expect this to work well.

SLIDE 43

Packages in Eclipse

(Precision vs. Recall plot)

43

Nice, relatively linear relationship between Precision and Recall. We also clearly outdo the Usual Suspects.

SLIDE 44

Packages in Eclipse

Top 10: Average Accuracy

(Bar chart per fold, in percent: Usual Suspects vs. SVM)

44

Average performance looks relatively unimpressive (again, SVM is not very good at getting several locations right).

SLIDE 45

Packages in Eclipse

Top 10: At Least One Right

(Bar chart per fold, in percent: Usual Suspects vs. SVM vs. Random Chance)

45

However, for single locations, we do nearly twice as well as the Usual Suspects in every fold!

SLIDE 46

Lessons Learned

  • We need enough training data
  • Too many locations are very harmful
  • Limiting the training history can have a positive effect
  • Averaging may obscure useful results
  • SVM generally works better for predicting single locations

46

Things to keep in mind for future work.

SLIDE 47

Threats to Validity

... and Future Work

  • Scalability

How much time (and memory) can we afford to spend?

  • Applicability

How many locations are too many? How much training data is enough?

  • Program Design

What effects do different paradigms have?

47

How sure can we be of our results? What are possible problems?

SLIDE 48

Threats to Validity

... and Future Work

  • Language

Will it work with non-English bug reports?

  • Usefulness / Acceptance

Will a real user want to use this?

  • When to retrain?

When should we update our model? When shouldn’t we?

48

In a real-world scenario, reasonable cut-off points for training sets are not as easy to find. Experimental results may also improve with cut-off points that are meaningful on a higher level (e.g. training after a feature freeze).

SLIDE 49

Related Work

  • “Can Information Foraging Pick the Fix?”

by Margaret Burnett, et al. (2008)

  • Several papers on Impact Analysis by G. Canfora and L. Cerulo (2005/2006)

49

Just a small sample of closely related work.

  • The first method uses similarity measures between bug reports and source code.
  • The second method uses information retrieval methods with bug reports as “features” and locations as “documents”.

SLIDE 50

Predicting Fix Locations from Bug Reports

50

SLIDE 51

Predicting Fix Locations from Bug Reports

Thank you!

Questions? Suggestions?

51

Suggestions:

  • Some additional analysis (e.g. testing for statistical significance) is in order.
  • Slides could have used a summary box after each set of results.
  • Reasonable uses for the method should be explored, as well as combinations with other methods to improve prediction results.