6/17/2011
Information Retrieval Methods for Software Engineering
Andrian Marcus
with substantial contributions from Giuliano Antoniol
Why use information retrieval in software engineering?
[Equation in x_i and y_i with squared terms, summed over i; the remaining symbols were lost in extraction]
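Term counts per document over the terms Nova, Galaxy, Film, Role, Diet, Fur, Web, Tax, Fruit (empty cells were lost in extraction, so column positions are not preserved):
D1: 2 3 5
D2: 3 7 1
D3: 4 11 15
D4: 9 4 7
D5: 4 7 9 5 1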
      T1    T2    …    Tt
D1    w11   w21   …    wt1
D2    w12   w22   …    wt2
:     :     :          :
Dn    w1n   w2n   …    wtn
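A minimal sketch of how such a document-by-term matrix could be built, assuming raw term counts as the weights and whitespace tokenization. The three toy documents and the function name `term_document_matrix` are illustrative, not taken from the slides.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (terms, rows) where rows[d][t] is the count of terms[t] in docs[d]."""
    counts = [Counter(text.lower().split()) for text in docs]
    # The vocabulary is the sorted union of all terms seen in any document.
    terms = sorted(set().union(*counts))
    rows = [[c[term] for term in terms] for c in counts]
    return terms, rows

# Hypothetical toy collection (assumption, for illustration only).
docs = ["nova galaxy galaxy", "film role film", "galaxy film"]
terms, matrix = term_document_matrix(docs)

for name, row in zip(["D1", "D2", "D3"], matrix):
    print(name, dict(zip(terms, row)))
```

Each row corresponds to one document Dn and each column to one term Tt, so `matrix[n][t]` plays the role of the weight written as w with term and document subscripts above. Other weighting schemes would replace the raw counts.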
Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3
[Figure: D1, D2 and Q drawn as vectors in the space spanned by the axes T1, T2 and T3]
How should the similarity of D1 and D2 to Q be measured? Distance? Angle? Projection?
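As a rough illustration of two of these candidates, the sketch below computes the Euclidean distance and the scalar projection (inner product divided by the length of Q) for the example vectors. The function names are illustrative assumptions; the vectors are the D1, D2 and Q of the example.

```python
import math

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def scalar_projection(d, q):
    """Length of the projection of d onto the direction of q."""
    return inner_product(d, q) / math.sqrt(inner_product(q, q))

D1 = (2, 3, 5)
D2 = (3, 7, 1)
Q = (0, 0, 2)

for name, d in (("D1", D1), ("D2", D2)):
    print(name,
          "distance:", round(euclidean_distance(d, Q), 3),
          "inner product:", inner_product(d, Q),
          "projection:", scalar_projection(d, Q))
```

On this example both measures favour D1: it is closer to Q and has the larger projection onto Q. Distance and inner product are sensitive to vector length, which is one reason an angle-based measure is often preferred.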
[Equation indexed by document i and term k; the remaining symbols were lost in extraction]
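$$Q = (q_1, q_2, \ldots, q_t), \qquad D_i = (d_{i1}, d_{i2}, \ldots, d_{it})$$

$$\mathrm{sim}(Q, D_i) = \frac{\sum_{j=1}^{t} q_j \, d_{ij}}{\sqrt{\sum_{j=1}^{t} q_j^{2}} \; \sqrt{\sum_{j=1}^{t} d_{ij}^{2}}}$$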
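[Figure: Q, D1 and D2 drawn as vectors on the axes Term A and Term B (both from 0 to 1.0), with the angle between Q and each document marked]

$$\mathrm{sim}(Q, D_i) = \frac{\sum_{j=1}^{t} d_{ij} \, q_j}{\sqrt{\sum_{j=1}^{t} d_{ij}^{2}} \; \sqrt{\sum_{j=1}^{t} q_j^{2}}}$$

Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)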
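The cosine measure for this two-term example can be sketched as follows. The function name and the convention of returning 0.0 for a zero vector are assumptions made for the illustration.

```python
import math

def cosine_similarity(q, d):
    """Cosine of the angle between vectors q and d (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(b * b for b in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

Q = (0.4, 0.8)
docs = {"D1": (0.8, 0.3), "D2": (0.2, 0.7)}

# Rank documents by decreasing similarity to the query.
ranking = sorted(docs, key=lambda name: cosine_similarity(Q, docs[name]), reverse=True)
for name in ranking:
    print(name, round(cosine_similarity(Q, docs[name]), 3))
```

D2 ranks above D1 because its direction is closer to that of Q, even though D1 is the longer vector.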
the query are explicitly present in the relevant documents
to text to be retrieved
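Singular value decomposition of the word-document matrix X:

$$X_{M \times N} = U_{M \times k} \; \Sigma_{k \times k} \; V^{T}_{k \times N}$$

– where k is the rank of X

[Figure: X (rows: words w1 … wM; columns: documents d1 … dN) = U (rows u1 … uM) × Σ × V^T (columns v1^T … vN^T)]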
[Table: word-document count matrix; rows: DNA, genome, sheep, Dolly, database, Gentoo, Debian, released, Linux, software; columns: d1 … d10; cell positions lost in extraction]
[Table: real-valued matrix over the words DNA, genome, sheep, Dolly, database, Gentoo, Debian, released, Linux, software and the documents d1 … d10; cell positions lost in extraction]
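To show what rank reduction does to a word-document matrix, here is a small pure-Python sketch. It approximates the largest singular triplets by power iteration with deflation, a stand-in chosen to avoid external libraries rather than the method any particular LSI tool uses. The 4 x 5 matrix is hypothetical and is not the matrix from the slides.

```python
import math

def transpose(m):
    return [list(col) for col in zip(*m)]

def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def truncated_svd(x, k, iterations=200):
    """Approximate the k largest singular triplets (sigma, u, v) of x
    by power iteration on R^T R, deflating the residual R after each one."""
    residual = [row[:] for row in x]
    triplets = []
    for _ in range(k):
        residual_t = transpose(residual)
        v = [1.0 / (j + 1) for j in range(len(residual_t))]  # fixed start vector
        for _ in range(iterations):
            w = mat_vec(residual_t, mat_vec(residual, v))
            length = norm(w)
            if length == 0.0:
                break
            v = [c / length for c in w]
        u_scaled = mat_vec(residual, v)
        sigma = norm(u_scaled)
        if sigma == 0.0:
            break
        u = [c / sigma for c in u_scaled]
        triplets.append((sigma, u, v))
        residual = [[residual[i][j] - sigma * u[i] * v[j]
                     for j in range(len(v))] for i in range(len(u))]
    return triplets

def rank_k_approximation(x, k):
    """Sum of the k leading terms sigma * u * v^T."""
    approx = [[0.0] * len(x[0]) for _ in x]
    for sigma, u, v in truncated_svd(x, k):
        for i in range(len(u)):
            for j in range(len(v)):
                approx[i][j] += sigma * u[i] * v[j]
    return approx

# Hypothetical counts: rows are words, columns are documents d1 … d5.
words = ["DNA", "genome", "Linux", "Debian"]
x = [[1, 1, 1, 0, 0],
     [1, 1, 0, 0, 0],
     [0, 0, 0, 1, 1],
     [0, 0, 0, 0, 1]]

triplets = truncated_svd(x, 2)
approx = rank_k_approximation(x, 2)
for word, row in zip(words, approx):
    print(word, [round(value, 3) for value in row])
```

In the rank-2 approximation, "genome" receives a positive weight in d3 although it never occurs there, because d3 shares "DNA" with the documents that do contain it. Its weight in the Linux documents stays at essentially zero.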
+ However it works better with stemming!
[Equations with sums over t up to T and j up to K; the remaining symbols were lost in extraction]
[Figure: the entire document collection, with the set of relevant documents overlapping the set of retrieved documents]

              retrieved                 not retrieved
relevant      retrieved & relevant      not retrieved but relevant
irrelevant    retrieved & irrelevant    not retrieved & irrelevant
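[Figure: precision against recall, both axes from 0 to 1.0, with three regions marked]
– The ideal (high precision and high recall)
– Returns relevant documents but misses many useful ones too (high precision, low recall)
– Returns most relevant documents but includes lots of junk (high recall, low precision)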
[Figure: precision against recall, one curve each for NoStem and Stem]
Precision/Recall/F-Measure
[Figure: precision, recall and F-measure plotted against the number of returned documents, from 1 to 148]
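A short sketch of how the three measures can be computed at a given cutoff of a ranked result list, using the standard definitions (precision = relevant retrieved / retrieved, recall = relevant retrieved / relevant, F-measure = their harmonic mean). The ranked list and the relevant set are made-up data.

```python
def precision_recall_f(ranked, relevant, cutoff):
    """Precision, recall and F-measure for the top `cutoff` returned documents."""
    retrieved = ranked[:cutoff]
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical ranked output and relevance judgments (assumptions).
ranked = ["d3", "d1", "d7", "d2", "d5"]
relevant = {"d1", "d2", "d9"}

for cutoff in range(1, len(ranked) + 1):
    p, r, f = precision_recall_f(ranked, relevant, cutoff)
    print(cutoff, round(p, 2), round(r, 2), round(f, 2))
```

As the cutoff grows, recall can only stay level or rise, while precision typically falls. Plotting the three values per cutoff gives a curve of the kind described above.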
Good links Top
– position of the first item
– position of the last item
public void run IProgressMonitor monitor throws InvocationTargetException InterruptedException if m_iFlag processCorpus monitor checkUpdate else if m_iFlag processCorpus monitor UD_UPDATECORPUS else processQueryString monitor if monitor isCancelled throw new InterruptedException the long running
public void run IProgressMonitor monitor throws InvocationTargetException
InterruptedException if m_iFlag the processCorpus monitor checkUpdate else if m_iFlag processCorpus monitor UD_UPDATECORPUS else a processQueryString monitor if monitor isCancelled throw new InterruptedException the long running
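The word lists above look like source code reduced to its identifiers and comment words. A minimal sketch of such a preprocessing step is given below. The stop-word list, the optional keyword filter and the identifier splitting are assumptions for illustration, and stemming is not included.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "to"}  # tiny illustrative list (assumption)
JAVA_KEYWORDS = {"public", "void", "if", "else", "throw", "throws", "new"}

def split_identifier(token):
    """Split on underscores and camelCase boundaries, e.g. processCorpus -> process, Corpus."""
    parts = []
    for chunk in token.split("_"):
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return parts

def preprocess(source, drop_keywords=False):
    """Turn source text into a list of lowercase terms."""
    # Keep identifier-like tokens only; operators and punctuation are discarded.
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
    terms = []
    for token in tokens:
        for part in split_identifier(token):
            word = part.lower()
            if word in STOP_WORDS:
                continue
            if drop_keywords and word in JAVA_KEYWORDS:
                continue
            terms.append(word)
    return terms

sample = 'if (monitor.isCancelled()) throw new InterruptedException("the long running");'
terms = preprocess(sample, drop_keywords=True)
print(terms)
```

Whether to split identifiers and whether to drop language keywords are design choices that depend on the corpus. Either can be switched off to keep tokens such as `processCorpus` whole, as in the lists above.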
Table
createTable
tableViewer getTable tableValue, keyTable
Header
setHeaderVisible setLineVisible …
39.TableTreeEditor.resize
JFace Text Editor Leaves a Black Rectangle on Content Assist text
information popup causes a black rectangle to appear on top of the display.
* Wilde, N. and Scully, M., "Software Reconnaissance: Mapping Program Features to Code", S… Research and Practice, vol. 7, no. 1, Jan.-Feb. 1995, pp. 49-62.
** Antoniol, G. and Guéhéneuc, Y. G., "Feature Identification: An Epidemiological Metaphor", IEEE Trans. on S…
[Figure: feature invoked versus feature not invoked, over traces t1, t2, t3 with intervals labelled I1, R, I2 and events labelled mk]
TR Tracer Static Analysis
[Figure: techniques shown: Software Reconn., SPR, ASDGs, LSI, NLP, Cerberus, FLAT3, PROMESIR, SITIR, SNIAFL, DORA, FCA, SUADE]
– the degree to which the requirements and design of a given software component match;
– the degree to which each element in a graphical environment references the requirement that it satisfies.
– MIL-STD-498, IEEE/EIA 12207 (Military)
– ISO/IEC 12207
– DO178B, DO254 (Avionics)
– EN50128 (Railways)
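Similarity between each method of Class A (rows) and each method of Class B (columns); the last column is the row average:

              B.method1   B.method2   B.method3   average
A.method1     0.5         0.7         0.3         0.5
A.method2     0.6         0.4         0.2         0.4
A.method3     0.2         0.3         0.4         0.3

Conceptual coupling between A and B = 0.4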
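Similarity between each method of Class A (rows) and each method of Class B (columns); the last column is the row maximum:

              B.method1   B.method2   B.method3   maximum
A.method1     0.5         0.7         0.3         0.7
A.method2     0.6         0.4         0.2         0.6
A.method3     0.2         0.3         0.4         0.4

Conceptual coupling between A and B = 0.56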
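Both coupling values can be reproduced from the method-to-method similarities. The sketch assumes that rows are the methods of Class A and columns the methods of Class B, that the first variant averages each row and the second takes each row's maximum, and that the class-level value is the mean of the per-method values. The function names are illustrative.

```python
def coupling_average(similarities):
    """Mean over A's methods of the average similarity to B's methods."""
    per_method = [sum(row) / len(row) for row in similarities]
    return sum(per_method) / len(per_method)

def coupling_maximum(similarities):
    """Mean over A's methods of the strongest similarity to any of B's methods."""
    per_method = [max(row) for row in similarities]
    return sum(per_method) / len(per_method)

# Rows: A.method1..3, columns: B.method1..3 (orientation is an assumption).
similarities = [
    [0.5, 0.7, 0.3],
    [0.6, 0.4, 0.2],
    [0.2, 0.3, 0.4],
]

print(round(coupling_average(similarities), 3))
print(round(coupling_maximum(similarities), 3))
```

The average-based value comes out at 0.4. The maximum-based value is 1.7 / 3, about 0.567, which matches the reported 0.56 if that figure was truncated rather than rounded.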
Principal components of the coupling measures:

             PC1      PC2      PC3      PC4      PC5      PC6      PC7      PC8
Proportion   0.19%    9.49%    9.41%    12.19%   19.53%   9.45%    9.05%    9.14%
Cumulative   19.00%   28.80%   38.22%   50.41%   69.95%   79.40%   88.46%   97.61%

Values per measure, in extracted order (rows with fewer than eight values lost entries in extraction, so their column positions are not preserved):
CoCC    0.941 0.042 0.000 0.279 0.129
CoCCm   0.064 0.343 0.024 0.115 0.904 0.041 0.074
CBO     0.260 0.185 0.558 0.309 0.341 0.017 0.473
RFC     0.264 0.046 0.266 0.422 0.067 0.075 0.803
MPC     0.233 0.154 0.929 0.081 0.024 0.202
DAC     0.931 0.074 0.161 0.268 0.043 0.084 0.136
ICP     0.346 0.139 0.903 0.074 0.028 0.162
ACAIC   0.052 0.035 0.950 0.022 0.272 0.046
OCAIC   0.935 0.006 0.162 0.274 0.046 0.067 0.127
ACMIC   0.113 0.129 0.281 0.050 0.040 0.041 0.939 0.049
OCMIC   0.222 0.029 0.928 0.181 0.052 0.157
1. Lack of cohesion in methods
2. Information-flow based cohesion
3. Tight class cohesion
4. Loose class cohesion
5. Logical relatedness of methods
6. Semantic class definition entropy
7. Semantic cohesion of files
[Equation with indices i and j, bound N and symbol m; the remaining symbols were lost in extraction]
[Ferenc’04], C3 and LCSM – our tool
topics
Adrian Kuhn, Stéphane Ducasse, and Tudor Gîrba, “Semantic Clustering: Identifying Topics in Source Code”, Information and Software Technology 49(3), pp. 230-243, March 2007.
Agglomerative clustering treats each data point as a singleton cluster, and then successively merges clusters until all points have been merged into a single remaining cluster. Divisive clustering works the other way around.
Single link
In single-link hierarchical clustering, we merge in each step the two clusters whose two closest members have the smallest distance.
Complete link
In complete-link hierarchical clustering, we merge in each step the two clusters whose merging has the smallest diameter.
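The two merge criteria can be compared with a brute-force sketch of agglomerative clustering. The point set, the Euclidean distance and the function names are assumptions for illustration. The quadratic search over cluster pairs is kept for clarity, not efficiency.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def merge_cost(points, c1, c2, linkage):
    if linkage == "single":
        # Distance between the two closest members of the two clusters.
        return min(euclidean(points[i], points[j]) for i in c1 for j in c2)
    if linkage == "complete":
        # Diameter of the cluster that merging c1 and c2 would produce.
        return max(euclidean(points[i], points[j])
                   for i, j in combinations(c1 + c2, 2))
    raise ValueError("linkage must be 'single' or 'complete'")

def agglomerative(points, linkage, target_clusters=1):
    """Start from singleton clusters and merge the cheapest pair until
    only target_clusters remain. Clusters are lists of point indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > max(target_clusters, 1):
        _, a, b = min((merge_cost(points, clusters[a], clusters[b], linkage), a, b)
                      for a, b in combinations(range(len(clusters)), 2))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical points on a line with slowly growing gaps.
points = [(0,), (10,), (21,), (33,), (46,)]
single = agglomerative(points, "single", target_clusters=2)
complete = agglomerative(points, "complete", target_clusters=2)
print("single link:  ", single)
print("complete link:", complete)
```

On this data single link chains the first four points together and leaves the last one alone, while complete link produces two more compact groups. This reflects the usual tendency of single link to form elongated clusters.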