Recovering traceability links via information retrieval



  1. Recovering Traceability Links via Information Retrieval Methods: Challenges and Opportunities
Dr. Rocco Oliveto, Ph.D.
Department of Mathematics and Informatics, University of Salerno
84084, Fisciano (SA), Italy
roliveto@unisa.it
École Polytechnique de Montréal, Montréal, Québec, Canada - September 3rd, 2009

  2. Agenda
• Traceability recovery: why?
  – Context and motivation
• IR-based traceability recovery: how?
  – Canonical IR-based traceability recovery process
  – A two-step process: incremental process and coverage link analysis
• IR-based traceability recovery in practice
  – Lessons learned from case studies and controlled experiments
• Conclusion and challenges in traceability recovery

  3. Traceability recovery: why?

  4. Context
• Traceability...
  – the ability to describe and follow the artefact life-cycle
  – Example: a use case is implemented by one or more classes that are tested by a set of test cases
• Maintaining traceability between software artefacts is important for software development and maintenance
  – program comprehension
  – requirement tracing
  – impact analysis
  – software reuse
  – …

  5. Motivations
• Maintaining traceability links during software evolution
  – Tedious and error-prone task
  – Often this information becomes out of date or is completely absent
  – Inadequate traceability contributes to project over-runs and failures
• Artefact management tools that support traceability do not provide adequate automatic or semi-automatic traceability link generation and maintenance
  – The traceability matrix has to be manually managed
  – Need for automatic (or semi-automatic) traceability link recovery
• Promising results have been achieved by using Information Retrieval methods
  – The approach was proposed in 1999 by Antoniol et al.

  6. IR-based Traceability Recovery
• Rationale...
  – Most software artefacts contain text
  – Requirement specifications, design documents, identifiers and comments in UML diagrams and source code, test case specifications, manual pages, maintenance reports, change logs
• Conjecture...
  – Artefacts having a high text similarity are likely good candidates to be traced onto each other
  – Artefacts with high similarity probably describe similar concepts
• Assumption...
  – Consistent use of domain terms in the software documents (e.g., programmers use meaningful names for program items, such as functions, variables, types, classes, and methods)

  7. IR-based traceability recovery: how?

  8. The traceability recovery process

  9. Indexer and classifier: two basic models
• Probabilistic model
  – The similarity between a source and a target artefact is based on the probability that the target artefact is related to the source artefact
  – Not discussed in detail in this talk…
• Vector space model
  – Source and target artefacts are represented in a vector space and the similarity is computed through vector operations, e.g. the cosine of the angle between the two vectors
• Many improvements of the basic models
  – Latent Semantic Indexing
  – Keyword list
  – Relevance feedback analysis

  10. Vector Space Model (VSM)
• Software artefacts are represented as vectors in the space of the terms (vocabulary)
  – Also possible to use a combination of terms (i.e., n-grams) as vector characteristics (…expensive)
  – The artefact space is represented by the term-by-document matrix

Term-by-document matrix:
        D1  D2  D3
    T1   1   4   0
    T2   2   1   3

[Figure: geometrical representation of the term-by-document matrix, with D1, D2 and D3 drawn as vectors on the axes T1 and T2]
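As an illustration of how a term-by-document matrix can be built, here is a minimal Python sketch. It assumes whitespace tokenisation and raw term counts; the function name `term_by_document_matrix` and the three toy artefact texts are hypothetical, chosen only so that the result matches the 2x3 example matrix on this slide.

```python
from collections import Counter


def term_by_document_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts term i in document j."""
    # One bag of words per artefact (assumption: plain whitespace tokenisation)
    counts = [Counter(text.lower().split()) for text in documents]
    # The vocabulary is the sorted set of all terms seen in any artefact
    vocabulary = sorted(set().union(*counts))
    # Rows are terms, columns are documents
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix


# Hypothetical artefact texts D1, D2, D3 over the two terms t1 and t2
docs = [
    "t1 t2 t2",        # D1
    "t1 t1 t1 t1 t2",  # D2
    "t2 t2 t2",        # D3
]
vocabulary, matrix = term_by_document_matrix(docs)
```

Each column of `matrix` is the vector of one artefact, so D1 is (1, 2), D2 is (4, 1) and D3 is (0, 3) in the plane spanned by T1 and T2.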

  11. Term weighting
• How to represent the importance (i.e., weight) of a term in a document?
  – Term occurrences
  – Boolean value (1 if the term occurs, 0 otherwise)
  – An advanced approach considers local and global weights
• Generally, a generic entry $a_{i,j}$ of the term-by-document matrix is calculated as follows:
  $a_{i,j} = L(i,j) \cdot G(i)$
• Tf-Idf term weighting:
  $tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}, \qquad idf_i = \log \frac{\mathit{ndocs}}{\mathit{docs}_i}$
  where $n_{i,j}$ is the number of occurrences of term $i$ in document $j$, $\mathit{ndocs}$ is the total number of documents and $\mathit{docs}_i$ is the number of documents containing term $i$
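The weighting scheme a(i,j) = L(i,j) · G(i) can be sketched in Python with tf as the local weight and idf as the global weight. This is a minimal sketch under assumptions: the natural logarithm is used (the slide does not fix the base), and the function name `tf_idf` and the guards against empty documents and unused terms are additions for illustration.

```python
import math


def tf_idf(matrix):
    """Weight a term-by-document count matrix: a[i][j] = tf[i][j] * idf[i]."""
    n_terms = len(matrix)
    n_docs = len(matrix[0])
    # Total number of term occurrences in each document (column sums)
    doc_lengths = [sum(matrix[i][j] for i in range(n_terms)) for j in range(n_docs)]
    weighted = []
    for row in matrix:
        # Global weight: log of (number of documents / documents containing the term)
        docs_with_term = sum(1 for n in row if n > 0)
        idf = math.log(n_docs / docs_with_term) if docs_with_term else 0.0
        # Local weight: relative frequency of the term inside each document
        weighted.append([
            (n / doc_lengths[j] if doc_lengths[j] else 0.0) * idf
            for j, n in enumerate(row)
        ])
    return weighted


# Counts for two terms (rows) in three documents (columns)
counts = [[1, 4, 0],
          [2, 1, 3]]
weights = tf_idf(counts)
```

In this toy input the second term occurs in every document, so its idf is log(1) = 0 and it no longer contributes to any document vector, which is the intended effect of the global weight.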

  12. Artefact similarity
• How to define the textual similarity between artefacts?
  – Using the corresponding vectors
  – Dot product or...
  – cosine of the angle between the two corresponding vectors (better)
  $sim(D,Q) = \frac{\vec{D} \cdot \vec{Q}}{\|\vec{D}\| \cdot \|\vec{Q}\|} = \frac{\sum_{t_i \in D,Q} w_{t_i D} \cdot w_{t_i Q}}{\sqrt{\sum_{t_i \in D} w_{t_i D}^2} \cdot \sqrt{\sum_{t_i \in Q} w_{t_i Q}^2}}$
• The cosine:
  – Has values in [0, 1]: term weights are non-negative, so the maximum angle is 90°
  – Increases as more terms are shared
• Thus, two artefacts are considered similar if their corresponding vectors point in the same direction (the angle is close to 0°)
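The cosine measure can be sketched in a few lines of Python. The vectors below are the columns D1, D2 and D3 of the example matrix from the Vector Space Model slide; the function name `cosine_similarity`, the zero-vector guard and the ranking step are illustrative assumptions rather than part of the talk.

```python
import math


def cosine_similarity(d, q):
    """Cosine of the angle between two term-weight vectors of equal length."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        # An artefact with no indexed terms is treated as similar to nothing
        return 0.0
    return dot / (norm_d * norm_q)


# Document vectors over the terms (T1, T2)
d1, d2, d3 = [1, 2], [4, 1], [0, 3]

# Rank candidate target artefacts by decreasing similarity to the source D1
ranked = sorted(
    [("D2", cosine_similarity(d1, d2)), ("D3", cosine_similarity(d1, d3))],
    key=lambda pair: pair[1],
    reverse=True,
)
```

Here D3 ranks above D2 for the source D1, because D1 and D3 point in closer directions even though D2 has the larger dot product contribution from T1.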

  13. Limitations of the VSM
• The vector space model does not take into account relations between terms
  – It suffers from the synonymy and polysemy problems
  – synonymy: different words with the same meaning
  – polysemy: same words with different meanings (depending on the context)
• For instance, having “automobile” in one artefact and “car” in another artefact does not contribute to the similarity measure between these two documents
• How to try to mitigate such problems
  – Using a dictionary
  – By using morphological analysis, like stemming
• Stemming aims at removing suffixes of words to extract their stems
• Example: working, worker, worked have the same stem work
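To make the stemming example concrete, here is a deliberately naive suffix-stripping sketch in Python. It is not the Porter stemmer or any other published algorithm: the suffix list, the minimum stem length and the name `naive_stem` are assumptions chosen only to reproduce the working/worker/worked example.

```python
# Illustrative suffix list only; real stemmers use far richer rule sets
SUFFIXES = ("ing", "er", "ed", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# All four forms collapse onto the single stem "work"
stems = {naive_stem(w) for w in ["working", "worker", "worked", "work"]}
```

Stemming of this kind merges morphological variants before indexing, but it does nothing for true synonyms: `naive_stem("car")` and `naive_stem("automobile")` still yield different terms.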

  14. Latent Semantic Indexing (LSI)
• Extension of the vector space model
  – Provides a way to automatically deal with synonymy and polysemy
  – Avoids preliminary morphological analysis
• How does LSI mitigate the synonymy and polysemy problems?
  – It analyses the co-occurrence of the terms by using the Singular Value Decomposition (SVD)
• SVD is used to decompose the term-by-document matrix into a set of k orthogonal factors from which the original matrix can be approximated by linear combination
  – The idea is to reduce the space of the terms
  – Reducing the term space also reduces the noise in word usage caused by synonymy and polysemy
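The decomposition described above can be written out as follows. The notation is an assumption, since the slides do not fix symbols: A is the m × n term-by-document matrix, the σ_i are its singular values in decreasing order, and u_i and v_i are the corresponding left and right singular vectors.

```latex
A = U \Sigma V^{T},
\qquad
A_k = U_k \Sigma_k V_k^{T} = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T},
\qquad k < \operatorname{rank}(A)
```

Keeping only the k largest singular values gives the rank-k matrix A_k, which is the linear combination of k orthogonal factors mentioned on the slide. Artefacts can then be compared with the cosine measure on the columns of Σ_k V_k^T instead of the columns of A.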
