Predicting Fix Locations from Bug Reports
Master Thesis by Markus Thiele
(Supervised by Rahul Premraj and Tom Zimmermann)
1
Deadline for thesis: 2009-03-31
2
Motivation.
3
4
A typical Bugzilla bug report.
(Anvik, Hiew, and Murphy 2006)
(Weiss, Premraj, Zimmermann, and Zeller 2007)
5
Previous, related work, and finally the question we focus on.
6
Some impression of the size of the search space.
7
8
9
What we would ultimately like to have: A tool (possibly integrated into a bug database) to automatically predict likely fix locations.
10
SVMs find the best fitting boundary (a hyper-plane) between two (or possibly more) classes in a feature space.
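As a minimal sketch: once an SVM is trained, classifying a point reduces to evaluating a linear decision function w·x + b and taking its sign. The weights below are made-up illustrations, not learned values:

```python
# Sketch of a linear SVM decision function. A real SVM learns the
# weight vector w and bias b by maximizing the margin between classes;
# here they are hypothetical values for illustration.
def decision(w, b, x):
    """Signed score: positive -> class +1, negative -> class -1."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    return 1 if decision(w, b, x) >= 0 else -1

# Two toy feature vectors separated by the hyperplane w.x + b = 0.
w, b = [1.0, -1.0], 0.0
print(classify(w, b, [2.0, 1.0]))  # 1
print(classify(w, b, [1.0, 2.0]))  # -1
```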
11
Widely used SVM implementation.
12
Application of SVMs to our problem.
13
A simplification to improve prediction results; enhancements may involve new locations, etc.
14
Structured data is hard to integrate into a single feature vector with unstructured data (how to represent it? how to weight it?). Also, different bug databases provide different types of metadata.
15
Finer grained locations seem unlikely to work well, coarser grained locations would probably be useless.
Old Bug Reports (with known locations) → Feature Extraction → Training → Model
New Bug Report → Feature Extraction → Model → Prediction → Location(s)
16
Overview over the general process used by a possible tool.
Plain Text (e.g. "Program Crash GUI") → Bag of Words (BOW) → Stemming → Stop Words → Scaling (TF) → Vector
17
Plain text needs to be converted into feature vectors for the SVM.
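The pipeline above could be sketched roughly as follows. The stop-word list, the crude suffix stripping, and the vocabulary are simplified placeholders; a real pipeline would use a proper stemmer (e.g. Porter) and typically IDF weighting as well:

```python
# Hypothetical sketch of the text-to-vector pipeline:
# tokenize -> drop stop words -> (crudely) stem -> count -> TF scaling.
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "when", "on", "in"}

def crude_stem(token):
    # Placeholder for a real stemmer: just strip common suffixes.
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def vectorize(text, vocabulary):
    tokens = [crude_stem(t) for t in text.lower().split()
              if t not in STOP_WORDS]
    counts = Counter(tokens)
    total = sum(counts.values()) or 1
    # Term frequency: raw count scaled by the number of kept tokens.
    return [counts[term] / total for term in vocabulary]

vocab = ["crash", "gui", "button"]
print(vectorize("The GUI crashes when clicking a button", vocab))
# [0.25, 0.25, 0.25]
```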
18
Two commonly used SVM “kernels” (functions to map data points into a different, possibly non-linear space, to find different kinds of boundaries).
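The two kernels might be written as follows; gamma is a tuning parameter of the RBF kernel, and the value below is an arbitrary example:

```python
# Sketch of the two kernels mentioned: linear and RBF (Gaussian).
import math

def linear_kernel(x, y):
    # Plain dot product: boundaries stay linear in the input space.
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=0.5):
    # Similarity decays with squared distance; allows curved boundaries.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

x, y = [1.0, 0.0], [0.0, 1.0]
print(linear_kernel(x, y))  # 0.0
print(rbf_kernel(x, y))     # exp(-1.0), about 0.368
```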
19
Why we chose to use linear kernels. Intuitively, the vertical spread of these graphs is related to precision, the horizontal spread to recall. Notice that a linear kernel performs as well as an optimized RBF kernel (and note that the parameter-optimization step is computationally very expensive).
20
Data provided by the SE chair, part of which is available to the public as “iBUGS”. This talk shows results from Eclipse and AspectJ.
All Bug Reports (with known locations) → Feature Extraction → split into Training Set and Testing Set
Training Set → Training → Model
Testing Set → Model → Prediction → Location(s) → Evaluator(s) → Results
21
Specific experimental setup (to generate results that can be evaluated).
22
Why we chose splitting along a time axis (this is usually the better choice).
23
11 folds, just so that we have 10 test results.
(Diagram: successive Train/Test splits along the time axis.)
24
One possibility: Always include all the previous data.
(Diagram: a fixed-length Train window sliding forward along the time axis, each followed by its Test fold.)
25
Another possibility: Only include part of the previous data (here the history length is 1 fold, but it may also be longer).
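Both splitting schemes could be sketched with a single helper; the folds are represented here only by their indices, and the 11-fold count follows the setup above:

```python
# Sketch of the two time-based splitting schemes over 11 chronological
# folds (yielding 10 test folds).
def splits(folds, history=None):
    """Yield (training folds, test fold) pairs.

    history=None -> train on all previous folds (growing window);
    history=k    -> train only on the k most recent folds.
    """
    for i in range(1, len(folds)):
        start = 0 if history is None else max(0, i - history)
        yield folds[start:i], folds[i]

folds = list(range(11))  # fold indices, ordered by time
print(list(splits(folds))[2])             # ([0, 1, 2], 3)
print(list(splits(folds, history=1))[2])  # ([2], 3)
```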
Precision = #correct predictions / #predictions
Recall = #correct predictions / #total correct locations
“Accuracy” is a synonym for Recall when we don’t care about Precision.
How often do we get an Accuracy greater than 0?
26
Good precision is difficult to achieve, and just getting many of the actually correct locations (good recall) may well be enough to produce useful results. Control-flow analysis and tools like eROSE (which predicts additional locations likely to be changed together with another location) may provide some help.
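Applied to a single bug report, the measures defined above might look like this; the package names are purely illustrative:

```python
# Sketch of the evaluation measures for one bug report:
# predicted fix locations vs. the locations actually fixed.
def precision(predicted, actual):
    # Fraction of predictions that were correct.
    return len(predicted & actual) / len(predicted) if predicted else 0.0

def recall(predicted, actual):
    # Fraction of the actually fixed locations that we predicted.
    return len(predicted & actual) / len(actual) if actual else 0.0

predicted = {"org.example.weaver", "org.example.ui"}  # hypothetical names
actual = {"org.example.weaver"}
print(precision(predicted, actual))  # 0.5
print(recall(predicted, actual))     # 1.0
```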
27
Based on a Pareto-type law: 80% of all bugs are in only 20% of all locations (or similar).
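A frequency baseline in this spirit (the details here are assumptions, not necessarily the talk’s exact definition of the “Usual Suspects”) could be sketched as: always predict the locations that have been fixed most often so far.

```python
# Hedged sketch of a frequency baseline: predict the k locations that
# appear most often in past fixes.
from collections import Counter

def usual_suspects(past_fix_locations, k=3):
    """past_fix_locations: list of location lists, one per fixed bug."""
    counts = Counter(loc for locs in past_fix_locations for loc in locs)
    return [loc for loc, _ in counts.most_common(k)]

history = [["core"], ["core", "ui"], ["ui"], ["core"], ["parser"]]
print(usual_suspects(history, k=2))  # ['core', 'ui']
```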
28
This data may be recomputed later.
Total Locations: ~140; Bugs per Fold: ~60; Average Locations per Bug: 1.7–2.3
29
First example: relatively few locations, few bug reports in each training fold. Average locations per Bug means the average number of locations that had to be touched to fix a bug (i.e. an average bug fix in AspectJ touches about 2 packages).
30
Just an overview, don’t panic! It looks like we’re doing better than the Usual Suspects. These graphs show the relationship between Precision and Recall when varying the number of locations predicted per bug report.
31
Relatively rare (in this case) example where the Usual Suspects get close.
32
More common case (again, for this example).
(Bar chart: Fold 0–9 on the x-axis, 25–100% on the y-axis; series: Usual Suspects, SVM.)
33
Average accuracy is positively influenced for the Usual Suspects, because there are many bugs in AspectJ (especially early on) with the exact same fix locations. Average accuracy is negatively influenced for the SVM, because identical feature vectors with different locations (i.e., bugs that generally require predicting more than one correct location) do not work well with SVMs, which are mostly designed to assign a single correct class to each data point.
(Bar chart: Fold 0–9 on the x-axis, 25–100% on the y-axis; series: Usual Suspects, SVM, Random Chance.)
34
See notes on the previous slide. If we just check how often we get at least one location right, we do better than the Usual Suspects (they mostly just perfectly predict the aforementioned special set of bugs, but not much else).
Total Locations: ~2390; Bugs per Fold: ~60; Average Locations per Bug: 5.6–9
35
Second example: Many more locations than bug reports in each training fold. We can intuitively predict that this will not work well.
36
This does not bode well either.
37
A closer look.
38
A relatively common case of the Usual Suspects doing better in this example.
(Bar chart: Fold 0–9 on the x-axis, 25–100% on the y-axis; series: Usual Suspects, SVM.)
39
Average performance is poor all over (which is to be expected with so many possible locations).
(Bar chart: Fold 0–9 on the x-axis, 25–100% on the y-axis; series: Usual Suspects, SVM, Random Chance.)
40
The Usual Suspects still benefit from the aforementioned special bugs (which are not only fixed in the same packages, but often also in the same files); the SVM suffers from even more conflicting locations.
(Bar chart: Fold 0–9 on the x-axis, 25–100% on the y-axis; series: Usual Suspects, SVM, Random Chance, SVM (Shorter History).)
41
We try to counteract the “pollution” of our model by the aforementioned special bugs by reducing the history size, with moderate success.
Total Locations: ~1390; Bugs per Fold: ~2300; Average Locations per Bug: 1.5–2
42
Third example: Many more bug reports in each training set than locations. We intuitively expect this to work well.
43
A nice, relatively linear relationship between Precision and Recall. We also clearly outdo the Usual Suspects.
(Bar chart: Fold 0–9 on the x-axis, 25–100% on the y-axis; series: Usual Suspects, SVM.)
44
Average performance looks relatively unimpressive (again, SVM is not very good at getting several locations right).
(Bar chart: Fold 0–9 on the x-axis, 25–100% on the y-axis; series: Usual Suspects, SVM, Random Chance.)
45
However, for single locations, we do nearly twice as well as the Usual Suspects in every fold!
46
Things to keep in mind for future work.
47
How sure can we be of our results? What are possible problems?
48
In a real-world scenario, reasonable cut-off points for training sets are not as easy to find. Experimental results may also improve with cut-off points that are meaningful on a higher level (e.g. training after a feature freeze).
49
Just a small sample of closely related work (e.g. treating locations as “documents”).
50
51
Suggestions: methods to improve prediction results.