Improving IR-based Traceability Recovery Using Smoothing Filters


slide-1
SLIDE 1

Improving IR-based Traceability Recovery Using Smoothing Filters

Andrea De Lucia, Massimiliano Di Penta, Rocco Oliveto, Annibale Panichella, Sebastiano Panichella

slide-2
SLIDE 2

Software traceability

“The degree to which a relationship can be established between two products of a software development process” [IEEE Glossary for Software Terminology]

n Important for:

n program comprehension n requirement tracing n impact analysis n software reuse n …

Up-to-date traceability links rarely exist → need to recover them

Use case Source code Test case Use case Source code Test case

slide-3
SLIDE 3

IR-based traceability recovery

- Antoniol et al., 2002 (VSM + Probabilistic model)
- Marcus and Maletic, 2003 (LSI)

slide-4
SLIDE 4

Traditional IR vs. IR applied to Software Engineering

Traditional IR

- Deals with heterogeneous documents for what concerns:
  - Linguistic choices
  - Syntax
  - Semantics
- We just live with those differences

IR applied to SE

- We have sets of homogeneous documents for what concerns:
  - Syntax, linguistic choices
- Examples:
  - Use cases, test documents, design documents follow a common template and contain recurrent words

slide-5
SLIDE 5

Test case: Change the date for a visit: C51
Version: 0 02 000
Use case: Satisfies the request to modify a visit for a patient UcModVis
Priority: High
....
Test description
  Input: Select a visit: 26/09/2003 11:00 First visit
         Change: 03/10/2003 11:00
  Oracle: Invalid sequence: The system does not allow to change a booking
  Coverage: Valid classes: CE1 CE8 CE14 CE19 CE21
            Invalid classes: None

Problem

- Different kinds of software artifacts require specific preprocessing

slide-6
SLIDE 6


Problem

- Different kinds of software artifacts require specific preprocessing

Artifact-specific words do not bring useful information

slide-7
SLIDE 7

A similar problem: image processing

slide-8
SLIDE 8

Noisy images

Noise:

- Pixels with peaks of high color intensity
- Pixels with peaks of low color intensity

slide-9
SLIDE 9

Reducing noise using smoothing filters

Mean filter

$$
g(x, y) = \frac{1}{M} \sum_{(n, m) \in S} f(n, m)
$$
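As a minimal sketch of the mean filter formula, assuming that S is the square neighbourhood around pixel (x, y) and M is the number of pixels in it, the following pure-Python function smooths a small grayscale image. The function name `mean_filter`, the `radius` parameter, and the border handling (averaging only in-bounds pixels) are illustrative choices, not taken from the slides.

```python
def mean_filter(image, radius=1):
    """Return a smoothed copy of a 2D grayscale image given as a list of rows."""
    height, width = len(image), len(image[0])
    smoothed = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # S: the in-bounds pixels within `radius` of (x, y) (assumed neighbourhood)
            neighbourhood = [
                image[m][n]
                for m in range(max(0, y - radius), min(height, y + radius + 1))
                for n in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            # g(x, y) = (1 / M) * sum of f(n, m) over S, with M = |S|
            smoothed[y][x] = sum(neighbourhood) / len(neighbourhood)
    return smoothed


# A single bright pixel (a peak of high intensity) in a flat image.
noisy = [
    [10, 10, 10],
    [10, 100, 10],
    [10, 10, 10],
]
smoothed = mean_filter(noisy)
```

In this toy image the peak at the centre is pulled toward the value of its neighbours, which is the noise-reduction effect the slide refers to.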

slide-10
SLIDE 10

Image vs. traceability noise

Image noise:

- Pixels with high or low color intensity
- Pixels are position dependent

Traceability noise:

- Terms and linguistic patterns occurring in many artifacts of a given category
  - Use cases, test cases, ...
- Artifacts (columns) are position independent

[Figure: document columns d1, d2 shown in both orders (d1 d2 and d2 d1)]

slide-11
SLIDE 11

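Representing the noise

The term-by-document matrix has one row per word and one column per document. Columns are indexed j = 1 … m, with m = k + z: the first k columns are the source documents and the remaining z columns are the target documents.

$$
\begin{array}{c|ccccc|ccccc}
 & s_1 & s_2 & s_3 & \cdots & s_k & t_1 & t_2 & t_3 & \cdots & t_z \\
\hline
\text{word}_1 & v_{1,1} & v_{1,2} & v_{1,3} & \cdots & v_{1,k} & v_{1,k+1} & v_{1,k+2} & v_{1,k+3} & \cdots & v_{1,m} \\
\text{word}_2 & v_{2,1} & v_{2,2} & v_{2,3} & \cdots & v_{2,k} & v_{2,k+1} & v_{2,k+2} & v_{2,k+3} & \cdots & v_{2,m} \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
\text{word}_n & v_{n,1} & v_{n,2} & v_{n,3} & \cdots & v_{n,k} & v_{n,k+1} & v_{n,k+2} & v_{n,k+3} & \cdots & v_{n,m}
\end{array}
$$

- Source documents (s1 … sk):
  - Linguistic information strictly belonging to source documents
  - Common information for source documents
- Target documents (t1 … tz):
  - Linguistic information strictly belonging to target documents
  - Common information for target documents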

slide-12
SLIDE 12

Representing the noise

[Figure: the term-by-document matrix, with source document columns s1 … sk and target document columns t1 … tz]

Mean source vector (common information for source documents):

$$
S = \begin{bmatrix}
\frac{1}{k} \sum_{j=1}^{k} v_{1,j} \\
\frac{1}{k} \sum_{j=1}^{k} v_{2,j} \\
\vdots \\
\frac{1}{k} \sum_{j=1}^{k} v_{n,j}
\end{bmatrix}
$$

Mean target vector (common information for target documents):

$$
T = \begin{bmatrix}
\frac{1}{z} \sum_{j=k+1}^{m} v_{1,j} \\
\frac{1}{z} \sum_{j=k+1}^{m} v_{2,j} \\
\vdots \\
\frac{1}{z} \sum_{j=k+1}^{m} v_{n,j}
\end{bmatrix}
$$

The mean vectors are like the continuous component of a signal…
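The two mean vectors can be computed row by row from the term-by-document matrix. The sketch below assumes the matrix is a list of rows (one per word) whose first k columns are source documents and whose remaining columns are target documents. The function name `mean_vectors`, the sample weights, and the words in the comments are hypothetical.

```python
def mean_vectors(matrix, k):
    """Return (S, T) for an n x m term-by-document matrix.

    The first k columns are source documents; the other z = m - k
    columns are target documents.
    """
    m = len(matrix[0])
    z = m - k
    # S_i = (1 / k) * sum of v_{i,j} for j = 1..k
    mean_source = [sum(row[:k]) / k for row in matrix]
    # T_i = (1 / z) * sum of v_{i,j} for j = k+1..m
    mean_target = [sum(row[k:]) / z for row in matrix]
    return mean_source, mean_target


# Hypothetical weights: 3 words, 2 source documents, 3 target documents.
matrix = [
    [2.0, 4.0, 0.0, 0.0, 0.0],  # a word recurring only in the source artifacts
    [0.0, 0.0, 3.0, 3.0, 3.0],  # a template word present in every target artifact
    [1.0, 0.0, 0.0, 3.0, 0.0],  # a domain word used in a few documents
]
S, T = mean_vectors(matrix, k=2)
```

Words that recur across a whole category, such as template headings, get a high value in that category's mean vector, while words specific to a few documents get a low one.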

slide-13
SLIDE 13

Representing the noise

[Figure: the mean source vector S is applied to the source documents s1 … sk to obtain the filtered source set; the mean target vector T is applied to the target documents t1 … tz to obtain the filtered target set]
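The operator that combines each document set with its mean vector did not survive extraction. One plausible reading, by analogy with removing the continuous component of a signal, is that the mean vector is subtracted from every document column of its own category. The sketch below implements that assumption only. The function name `smoothing_filter` is hypothetical, and the slide does not say how negative weights should be treated, so they are left unchanged here.

```python
def smoothing_filter(matrix, k):
    """Subtract the category mean from each column (assumed filter definition).

    matrix: n x m term-by-document matrix as a list of rows;
    the first k columns are source documents, the rest are target documents.
    """
    m = len(matrix[0])
    z = m - k
    filtered = []
    for row in matrix:
        source_mean = sum(row[:k]) / k   # entry of S for this word
        target_mean = sum(row[k:]) / z   # entry of T for this word
        filtered.append(
            [v - source_mean for v in row[:k]]
            + [v - target_mean for v in row[k:]]
        )
    return filtered


# Hypothetical weights: 3 words, 2 source documents, 3 target documents.
matrix = [
    [2.0, 4.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 3.0, 3.0],  # occurs equally in every target document
    [1.0, 0.0, 0.0, 3.0, 0.0],
]
filtered = smoothing_filter(matrix, k=2)
```

Under this reading, a word that occurs equally in every document of a category is cancelled out completely (the second row), so it no longer contributes to the similarity between documents.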

slide-14
SLIDE 14

Empirical Study

n Goal: analyze the effect of smoothing filter n Purpose: investigating how the filter affects

traceability recovery

n Quality focus: traceability recovery performance n Perspective:

n Researchers: evaluating the novel technique

n Context: artifacts from two systems

n EasyClinic and Pine

slide-15
SLIDE 15

Context

                       EasyClinic                          Pine
Description            Medical doctor office management    Text-based email client
Language (code)        Java                                C
Files/Classes          37                                  31
KLOC                   20                                  130
Documents              113                                 100
Language (artifacts)   Italian                             English
Artifacts              Use cases, interaction diagrams,    Requirements, use cases
                       source code, test cases

slide-16
SLIDE 16

Research Questions and Factors

n RQ1: Does the smoothing filter improve the

recovery performances of traceability recovery?

n RQ2: How effective is the smoothing filter in

filtering out non-relevant words, as compared to stop word removal?

n Factors:

n Use of filter: YES, NO n Technique: VSM, LSI n Artifact: Req., UC, Int. Diagrams, Code, TC n System: Easyclinic, Pine

slide-17
SLIDE 17

Analysis Method – RQ1

n We statistically compare the #

  • f false positives of different

methods for each correct link identified

n Wilcoxon Rank Sum test n Cliff’s delta effect size

correct retrieved correct recall ∩ =

retrieved retrieved correct precision ∩ =

n Performances evaluated by precision and recall:

2 2 3

M1 M2
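The two metrics, and one possible reading of "false positives for each correct link identified", can be sketched as follows. The assumption is that a method produces a ranked list of candidate links and that, for each correct link in the list, we count the false positives ranked above it. The function names and the sample links are hypothetical.

```python
def precision_recall(retrieved, correct):
    """Precision and recall of a set of retrieved links against the oracle."""
    retrieved, correct = set(retrieved), set(correct)
    hits = len(retrieved & correct)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(correct) if correct else 0.0
    return precision, recall


def false_positives_per_correct_link(ranked, correct):
    """For each correct link in the ranked list, count the false positives above it."""
    correct = set(correct)
    counts, false_positives = [], 0
    for link in ranked:
        if link in correct:
            counts.append(false_positives)
        else:
            false_positives += 1
    return counts


# Hypothetical ranked list of (source, target) candidate links, best first.
ranked = [("UC1", "A"), ("UC1", "B"), ("UC2", "A"), ("UC2", "C"), ("UC3", "B")]
correct = {("UC1", "A"), ("UC2", "C"), ("UC3", "D")}

precision, recall = precision_recall(ranked[:4], correct)  # cut the list at the top 4
fp_counts = false_positives_per_correct_link(ranked, correct)
```

The per-link counts from two methods form two samples, which is the kind of data that a rank-based test such as the Wilcoxon Rank Sum test compares.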

slide-18
SLIDE 18

Analysis Method – RQ2

n We replace stop word filtering by one of the

following treatments:

1.

Standard stop word removal

2.

Manually customized stop word removal

3.

Smoothing filter

4.

Standard stop word removal + filter

5.

Customized stop word removal + filter

n …and compare the performances

slide-19
SLIDE 19

Results

slide-20
SLIDE 20

EasyClinic: Use cases into source (VSM)

[Figure: precision/recall curves, Filtered vs. Not Filtered]

Between 60% and 74% fewer false positives for recall < 80%

slide-21
SLIDE 21

EasyClinic: Use cases into source (LSI)

[Figure: precision/recall curves, Filtered vs. Not Filtered]

Between 60% and 77% fewer false positives for recall < 80%

slide-22
SLIDE 22

EasyClinic: Test cases into source (LSI)

[Figure: precision/recall curves, Filtered vs. Not Filtered]

Test cases are:

- Short documents
- Limited vocabulary
- Mostly consistent with source code

slide-23
SLIDE 23

Pine: Use cases into requirements (LSI)

[Figure: precision/recall curves, Filtered vs. Not Filtered]

Between 42% and 62% fewer false positives for recall < 80%

slide-24
SLIDE 24

Statistical Comparison

Data set    Traced artifacts     VSM p-value  VSM effect size  LSI p-value  LSI effect size
EasyClinic  UC → Code            <0.01        0.50 (large)     <0.01        0.50 (large)
EasyClinic  Int. Diag. → Code    <0.01        0.52 (large)     <0.01        0.34 (medium)
EasyClinic  TC → Code            1.00         - (negligible)   1.00         - (negligible)
Pine        Req. → UC            <0.01        0.58 (large)     <0.01        0.58 (large)
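The effect sizes on this slide are Cliff's delta values. A minimal sketch of the statistic follows. The magnitude thresholds (0.147, 0.33, 0.474) are a commonly used convention and are an assumption here, although they agree with the labels on the slide (0.34 is medium, 0.50 is large). The sample data are hypothetical.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: (#pairs with x > y - #pairs with x < y) / (len(xs) * len(ys))."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def magnitude(delta):
    """Label the effect size using commonly cited thresholds (assumed, not from the slides)."""
    d = abs(delta)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"


# Hypothetical false-positive counts per correct link for two methods.
not_filtered = [5, 8, 12, 20]
filtered = [2, 3, 8, 9]
delta = cliffs_delta(not_filtered, filtered)
label = magnitude(delta)
```

A positive delta here means the unfiltered method tends to produce more false positives per correct link than the filtered one.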

slide-25
SLIDE 25

RQ2 – Summary of results

                                                                      EasyClinic                              Pine
Comparison                                                            UC→CC        ID→CC        TC→CC         HLR→UC
Smoothing filter vs. Standard list                                    YES (small)  YES (small)  NO (large)    YES (large)
Smoothing filter vs. Cust. list                                       YES (small)  YES (small)  NO (large)    YES (large)
Standard list + Smoothing filter vs. Cust. list                       YES (large)  YES (large)  NO (medium)   YES (large)
Standard list + Smoothing filter vs. Cust. list + Smoothing filter    NO (small)   -            YES (medium)  YES (small)

slide-26
SLIDE 26

Link precision improvement

Login Patient vs. Person

Poor vocabulary overlap (10%)
slide-27
SLIDE 27

Threats to validity

n Construct validity

n Mainly related to our oracle n Provided by developers and for EasyClinic also peer-

reviewed

n Internal validity

n Improvements could be due to other reasons… n However, we compared different techniques (VSM, LSI) n The approach works well regardless of stop word removal,

stemming and use of tf-idf

n Conclusion validity

n Conclusions based on proper (non-parametric) statistics

n External validity

n We considered systems with different characteristics and

artifacts

n … but further studies are desirable

slide-28
SLIDE 28

Conclusions

[Figure: recap of four earlier slides: Representing the noise; EasyClinic: Use cases into source (LSI); EasyClinic: Test cases into source (LSI); RQ2 – Summary of results]

slide-29
SLIDE 29

Work-in-progress

n Study replication

n Different systems and artifacts n Use of relevance feedback

n More sophisticated smoothing technique

n Non-linear filters

n Use in other applications of IR to software

engineering

n impact analysis n feature location