recovering traceability links via information retrieval

Recovering Traceability Links via Information Retrieval Methods - PowerPoint PPT Presentation

Recovering Traceability Links via Information Retrieval Methods: Challenges and Opportunities. Dr. Rocco Oliveto, Ph.D., Department of Mathematics and Informatics, University of Salerno


  1. Recovering Traceability Links via Information Retrieval Methods - Challenges and Opportunities
Dr. Rocco Oliveto, Ph.D.
Department of Mathematics and Informatics, University of Salerno, 84084 Fisciano (SA), Italy
roliveto@unisa.it
École Polytechnique de Montréal, Montréal, Québec, Canada - September 3rd, 2009

  2. Agenda
• Traceability recovery: why?
– Context and motivation
• IR-based traceability recovery: how?
– Canonical IR-based traceability recovery process
– A two-step process: incremental process and coverage link analysis
• IR-based traceability recovery in practice
– Lessons learned from case studies and controlled experiments
• Conclusion and challenges in traceability recovery

  3. Traceability recovery: why?

  4. Context
• Traceability...
– the ability to describe and follow the artefact life-cycle
– Example: a use case is implemented by one or more classes that are tested by a set of test cases
• Maintaining traceability between software artefacts is important for software development and maintenance
– program comprehension
– requirement tracing
– impact analysis
– software reuse
– …

  5. Motivations
• Maintaining traceability links during software evolution
– Tedious and error-prone task
– Often this information becomes out of date or is completely absent
– Inadequate traceability contributes to project over-runs and failures
• Artefact management tools that support traceability do not provide adequate automatic or semi-automatic traceability link generation and maintenance
– The traceability matrix has to be managed manually
– Need for automatic (or semi-automatic) traceability link recovery
• Promising results have been achieved by using Information Retrieval methods
– The approach was proposed in 1999 by Antoniol et al.

  6. IR-based Traceability Recovery
• Rationale...
– Most software artefacts contain text
– Requirement specifications, design documents, identifiers and comments in UML diagrams and source code, test case specifications, manual pages, maintenance reports, change logs
• Conjecture...
– Artefacts having a high text similarity are likely good candidates to be traced onto each other
– Artefacts with high similarity probably describe similar concepts
• Assumption...
– Consistent use of domain terms in the software documents (e.g., programmers use meaningful names for program items, such as functions, variables, types, classes, and methods)

  7. IR-based traceability recovery: how?

  8. The traceability recovery process

  9. Indexer and classifier: two basic models
• Probabilistic model
– The similarity between a source and a target artefact is based on the probability that the target artefact is related to the source artefact
– Not discussed in detail in this talk…
• Vector space model
– Source and target artefacts are represented in a vector space and the similarity is computed through vector operations, e.g. the cosine of the angle between the two vectors
• Many improvements of the basic models
– Latent Semantic Indexing
– Keyword list
– Relevance feedback analysis

  10. Vector Space Model (VSM)
• Software artefacts are represented as vectors in the space of the terms (vocabulary)
– Also possible to use a combination of terms (i.e., n-grams) as vector characteristics (…expensive)
– The artefact space is represented by the term-by-document matrix

Term-by-document matrix:
     D1  D2  D3
T1    1   4   0
T2    2   1   3

[Figure: Geometrical representation of the term-by-document matrix, with D1, D2 and D3 drawn as vectors in the plane spanned by the term axes T1 and T2]
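
As an illustration of how such a term-by-document matrix can be built, here is a minimal Python sketch. The function name `term_by_document_matrix`, the whitespace tokenizer and the toy documents are assumptions made for this example (the toy documents are chosen only so that the counts match the slide's matrix); real artefacts would need proper text normalisation first.

```python
from collections import Counter

def term_by_document_matrix(documents):
    """Return (terms, matrix), where matrix[i][j] is the number of
    occurrences of terms[i] in documents[j].

    Tokenisation is a plain lowercase whitespace split (an assumption
    of this sketch, not something prescribed by the slides).
    """
    counts = [Counter(doc.lower().split()) for doc in documents]
    terms = sorted(set().union(*counts))  # vocabulary, in a fixed order
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix

# Hypothetical toy documents reproducing the slide's example counts.
docs = [
    "t1 t2 t2",        # D1: T1 once, T2 twice
    "t1 t1 t1 t1 t2",  # D2: T1 four times, T2 once
    "t2 t2 t2",        # D3: T2 three times
]
terms, matrix = term_by_document_matrix(docs)
# terms  == ['t1', 't2']
# matrix == [[1, 4, 0], [2, 1, 3]]
```

Each column of `matrix` is the vector of one document in the space of the terms, which is what the geometrical representation on the slide depicts.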

  11. Term weighting
• How to represent the importance (i.e., weight) of a term in a document?
– Term occurrences
– Boolean value (1 if the term occurs, 0 otherwise)
– An advanced approach considers local and global weights
• Generally, a generic entry $a_{i,j}$ of the term-by-document matrix is calculated as follows:
$a_{i,j} = L(i,j) \cdot G(i)$
where $L(i,j)$ is the local weight of term $i$ in document $j$ and $G(i)$ is the global weight of term $i$
• Tf-Idf term weighting:
$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}, \qquad idf_i = \log\left(\frac{N}{doc_i}\right)$
where $n_{i,j}$ is the number of occurrences of term $i$ in document $j$, $N$ is the total number of documents, and $doc_i$ is the number of documents that contain term $i$
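
The tf-idf weighting can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name `tf_idf` is hypothetical, the natural logarithm is used (the slide does not fix the base), and idf is taken as the log of the total number of documents over the number of documents containing the term.

```python
import math

def tf_idf(matrix):
    """Weight a term-by-document count matrix with tf-idf.

    matrix[i][j] is the raw count of term i in document j.
    Returns weights a[i][j] = tf[i][j] * idf[i], with
      tf[i][j] = count / (total number of term occurrences in document j)
      idf[i]   = log(number of documents / documents containing term i)
    """
    n_docs = len(matrix[0]) if matrix else 0
    doc_totals = [sum(row[j] for row in matrix) for j in range(n_docs)]
    weights = []
    for row in matrix:
        doc_freq = sum(1 for n in row if n > 0)
        idf = math.log(n_docs / doc_freq) if doc_freq else 0.0
        weights.append([
            (n / doc_totals[j] if doc_totals[j] else 0.0) * idf
            for j, n in enumerate(row)
        ])
    return weights

# Counts taken from the term-by-document matrix on the VSM slide.
counts = [[1, 4, 0],   # T1
          [2, 1, 3]]   # T2
weights = tf_idf(counts)
```

With these counts, T2 occurs in all three documents, so its idf is log(3/3) = 0 and its whole row is weighted to zero: a term that appears everywhere does not help to discriminate between documents.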

  12. Artefact similarity
• How to define the textual similarity between artefacts?
– Using the corresponding vectors
– Dot product or...
– cosine of the angle between the two corresponding vectors (better)
$sim(D, Q) = \frac{\vec{D} \cdot \vec{Q}}{\|\vec{D}\| \cdot \|\vec{Q}\|} = \frac{\sum_{t_i \in D,Q} w_{t_i D} \cdot w_{t_i Q}}{\sqrt{\sum_{t_i \in D} w_{t_i D}^2} \cdot \sqrt{\sum_{t_i \in Q} w_{t_i Q}^2}}$
• The cosine:
– Has values in [0, 1], since term weights are non-negative and the maximum angle is therefore 90°
– Increases as more terms are shared
• Thus, two artefacts are considered similar if their corresponding vectors point in the same direction (the angle is close to 0°)
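
The cosine measure is straightforward to compute. The sketch below is illustrative only: the function name `cosine_similarity` is an assumption, and the three vectors are the columns D1, D2 and D3 of the example term-by-document matrix from the VSM slide, used here with raw counts as weights.

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between two equally sized weight vectors.

    Returns 0.0 when either vector is all zeros (no terms), a convention
    chosen for this sketch to avoid dividing by zero.
    """
    dot = sum(x * y for x, y in zip(d, q))
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_q = math.sqrt(sum(y * y for y in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

# Document vectors (T1, T2) from the example matrix.
d1, d2, d3 = [1, 2], [4, 1], [0, 3]

sim_12 = cosine_similarity(d1, d2)
sim_13 = cosine_similarity(d1, d3)
sim_23 = cosine_similarity(d2, d3)
```

Here D1 comes out closer to D3 than to D2, because both D1 and D3 are dominated by T2, while D2 is dominated by T1. In a traceability recovery setting the pairs would then be ranked by this value.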

  13. Limitations of the VSM
• The vector space model does not take into account relations between terms
– It suffers from the synonymy and polysemy problems
– synonymy: different words with the same meaning
– polysemy: same word with different meanings (depending on the context)
• For instance, having “automobile” in one artefact and “car” in another artefact does not contribute to the similarity measure between these two documents
• How to try to mitigate such problems
– Using a dictionary
– Using morphological analysis, like stemming
• Stemming aims at removing suffixes of words to extract their stems
• Example: working, worker, worked have the same stem work
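
To make the stemming example concrete, here is a deliberately naive suffix stripper in Python. It is not the Porter algorithm or any other established stemmer; the suffix list, the minimum stem length and the name `naive_stem` are assumptions chosen only to reproduce the slide's example.

```python
# Hypothetical, very small suffix list; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "er", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ("working", "worker", "worked")]
# stems == ['work', 'work', 'work']
```

After stemming, the three word forms map to the same term and therefore contribute to the similarity between artefacts. Stemming alone does not help with synonyms: "car" and "automobile" remain distinct terms.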

  14. Latent Semantic Indexing (LSI)
• Extension of the vector space model
– Provides a way to automatically deal with synonymy and polysemy
– Avoids preliminary morphological analysis
• How does LSI mitigate the synonymy and polysemy problems?
– It analyses the co-occurrence of the terms by using the Singular Value Decomposition (SVD)
• SVD is used to decompose the term-by-document matrix into a set of k orthogonal factors from which the original matrix can be approximated by linear combination
– The idea is to reduce the space of the terms
– Reducing the term space also reduces the noise in word usage caused by synonymous and polysemous words
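
The decomposition described above can be written out as follows. The symbols are assumptions of this sketch and do not appear on the slides: $A$ is the term-by-document matrix, $\sigma_1 \ge \sigma_2 \ge \dots$ are its singular values, and $u_i$, $v_i$ are the corresponding left and right singular vectors.

```latex
A = U \Sigma V^{T},
\qquad
A_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T} = U_k \Sigma_k V_k^{T} \approx A,
\qquad
k < \operatorname{rank}(A)
```

Keeping only the $k$ largest singular values gives the rank-$k$ approximation $A_k$, the "linear combination of k orthogonal factors" mentioned on the slide. In the usual formulation, documents are then compared (for example with the cosine) through the columns of $\Sigma_k V_k^{T}$, that is, in the reduced $k$-dimensional space rather than in the full term space.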
