Recovering Traceability Links via Information Retrieval Methods ‐ Challenges and Opportunities
Dr. Rocco Oliveto, Ph.D.
Department of Mathematics and Informatics, University of Salerno, 84084 Fisciano (SA), Italy ‐ roliveto@unisa.it
École Polytechnique de Montréal, Montréal, Québec, Canada ‐ September 3rd, 2009
Agenda
• Traceability recovery: why?
  – Context and motivation
• IR-based traceability recovery: how?
  – Canonical IR-based traceability recovery process
  – A two-step process: incremental process and coverage link analysis
• IR-based traceability recovery in practice
  – Lessons learned from case studies and controlled experiments
• Conclusion and challenges in traceability recovery
Traceability recovery: why?
Context
• Traceability...
  – the ability to describe and follow the artefact life-cycle
  – Example: a use case is implemented by one or more classes that are tested by a set of test cases
• Maintaining traceability between software artefacts is important for software development and maintenance
  – program comprehension
  – requirement tracing
  – impact analysis
  – software reuse
  – ...
Motivations
• Maintaining traceability links during software evolution is a tedious and error-prone task
  – Often this information becomes out of date or is completely absent
  – Inadequate traceability contributes to project over-runs and failures
• Artefact management tools that support traceability do not provide adequate automatic or semi-automatic traceability link generation and maintenance
  – The traceability matrix has to be managed manually
  – Need for automatic (or semi-automatic) traceability link recovery
• Promising results have been achieved by using Information Retrieval methods
  – The approach was proposed in 1999 by Antoniol et al.
IR-based Traceability Recovery
• Rationale...
  – Most software artefacts contain text
  – Requirement specifications, design documents, identifiers and comments in UML diagrams and source code, test case specifications, manual pages, maintenance reports, change logs
• Conjecture...
  – Artefacts having a high textual similarity are likely good candidates to be traced onto each other
  – Artefacts with high similarity probably describe similar concepts
• Assumption...
  – Consistent use of domain terms in the software documents (e.g., programmers use meaningful names for program items, such as functions, variables, types, classes, and methods)
IR-based traceability recovery: how?
The traceability recovery process
Indexer and classifier: two basic models
• Probabilistic model
  – The similarity between a source and a target artefact is based on the probability that the target artefact is related to the source artefact
  – Not discussed in detail in this talk...
• Vector space model
  – Source and target artefacts are represented in a vector space and the similarity is computed through vector operations, e.g. the cosine of the angle between the two vectors
• Many improvements of the basic models
  – Latent Semantic Indexing
  – Keyword list
  – Relevance feedback analysis
Vector Space Model (VSM)
• Software artefacts are represented as vectors in the space of the terms (vocabulary)
  – Also possible to use a combination of terms (i.e., n-grams) as vector characteristics (...expensive)
  – The artefact space is represented by the term-by-document matrix
• Example term-by-document matrix (the slide also shows its geometrical representation, i.e. D1, D2, D3 plotted as vectors in the (T1, T2) plane):

        D1  D2  D3
    T1   1   4   0
    T2   2   1   3
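As an illustration (not part of the original slides), the term-by-document matrix above can be built with a few lines of plain Python; the corpus below is a hypothetical toy example chosen so the counts match the slide's matrix:

```python
# A minimal sketch of the VSM indexing step: rows are terms, columns
# are documents, and each entry counts term occurrences in a document.
def term_by_document(docs, vocabulary):
    """Return a {term: [count in D1, count in D2, ...]} matrix."""
    return {t: [d.count(t) for d in docs] for t in vocabulary}

# Toy corpus chosen so the counts reproduce the slide's example matrix.
docs = [
    ["t1", "t2", "t2"],              # D1
    ["t1", "t1", "t1", "t1", "t2"],  # D2
    ["t2", "t2", "t2"],              # D3
]
matrix = term_by_document(docs, ["t1", "t2"])
print(matrix)  # {'t1': [1, 4, 0], 't2': [2, 1, 3]}
```

Each document vector is then a column of this matrix; in practice the raw counts are replaced by the weighting schemes discussed next.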
Term weighting
• How to represent the importance (i.e., weight) of a term in a document?
  – Term occurrences
  – Boolean value (1 if the term occurs, 0 otherwise)
  – An advanced approach considers local and global weights
• Generally, a generic entry a_{i,j} of the term-by-document matrix is calculated as follows:

    a_{i,j} = L(i, j) · G(i)

• Tf-Idf term weighting:

    tf_{i,j} = n_{i,j} / Σ_k n_{k,j}        idf_i = log( n_doc / n_{doc_i} )

  where n_{i,j} is the number of occurrences of term i in document j, n_doc is the total number of documents, and n_{doc_i} is the number of documents containing term i
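A minimal sketch of the tf-idf entry a_{i,j} = L(i,j) · G(i), with L = tf and G = idf as in the formulas above (documents are hypothetical lists of terms):

```python
import math

def tf_idf(term, doc, docs):
    """One entry of the weighted term-by-document matrix.

    tf  = n_{i,j} / sum_k n_{k,j}  (occurrences of term in doc / doc length)
    idf = log(n_doc / n_doc_i)     (total docs / docs containing the term)
    """
    tf = doc.count(term) / len(doc)
    n_docs_with_term = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / n_docs_with_term)
    return tf * idf

# A term appearing in every document gets idf = log(1) = 0,
# so it carries no discriminating weight.
corpus = [["car", "engine"], ["car", "wheel"], ["car", "train"]]
print(tf_idf("car", corpus[0], corpus))  # 0.0
```

The global idf factor is what down-weights ubiquitous terms, which plain occurrence counts cannot do.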
Artefact similarity
• How to define the textual similarity between artefacts?
  – Using the corresponding vectors
  – Dot product or...
  – cosine of the angle between the two corresponding vectors (better)

    sim(D, Q) = (D⃗ · Q⃗) / (‖D⃗‖ · ‖Q⃗‖)
              = Σ_{t_i ∈ D ∩ Q} w_{t_i,D} · w_{t_i,Q}  /  ( √(Σ_{t_i ∈ D} w²_{t_i,D}) · √(Σ_{t_i ∈ Q} w²_{t_i,Q}) )

• The cosine:
  – Has values in [0, 1], since with non-negative weights the maximum angle is 90°
  – Increases as more terms are shared
• Thus, two artefacts are considered similar if their corresponding vectors point in the same direction (the angle is close to 0°)
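The cosine formula above can be sketched directly in Python, representing each artefact as a sparse {term: weight} dict (a hypothetical representation, not the author's implementation); only shared terms contribute to the dot product:

```python
import math

def cosine(w_d, w_q):
    """Cosine of the angle between two weight vectors given as
    {term: weight} dicts; terms absent from a dict have weight 0."""
    dot = sum(w_d[t] * w_q[t] for t in w_d.keys() & w_q.keys())
    norm_d = math.sqrt(sum(w * w for w in w_d.values()))
    norm_q = math.sqrt(sum(w * w for w in w_q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

# Same direction -> cosine ~1; no shared terms -> cosine 0.
print(cosine({"car": 1.0, "engine": 2.0}, {"car": 2.0, "engine": 4.0}))  # ~1.0
print(cosine({"car": 1.0}, {"wheel": 1.0}))                              # 0.0
```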
Limitations of the VSM
• The vector space model does not take into account relations between terms
  – It suffers from the synonymy and polysemy problems
  – synonymy: different words with the same meaning
  – polysemy: the same word with different meanings (depending on the context)
• For instance, having "automobile" in one artefact and "car" in another artefact does not contribute to the similarity measure between these two documents
• How to try to mitigate such problems
  – Using a dictionary
  – By using morphological analysis, like stemming
    • Stemming aims at removing suffixes of words to extract their stems
    • Example: working, worker, worked have the same stem work
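The stemming idea can be sketched with a naive suffix stripper; this is only an illustration of the principle, with a made-up suffix list — a real pipeline would use a proper algorithm such as Porter's stemmer:

```python
# Naive suffix-stripping stemmer (illustrative only, not Porter's algorithm).
# Suffixes are tried longest-first; a minimum stem length avoids over-stripping.
SUFFIXES = ("ing", "ers", "er", "ed", "s")

def stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word

print([stem(w) for w in ["working", "worker", "worked"]])  # ['work', 'work', 'work']
```

After stemming, the slide's example words all map to the same vocabulary entry "work", so occurrences of any of them contribute to the same vector component.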
Latent Semantic Indexing (LSI)
• Extension of the vector space model
  – Provides a way to automatically deal with synonymy and polysemy
  – Avoids preliminary morphological analysis
• How does LSI mitigate the synonymy and polysemy problems?
  – It analyses the co-occurrence of the terms by using Singular Value Decomposition (SVD)
    • SVD is used to decompose the term-by-document matrix into a set of k orthogonal factors from which the original matrix can be approximated by linear combination
  – The idea is to reduce the space of the terms
  – Reducing the term space also reduces the noise in word usage caused by synonymous and polysemous words
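The core SVD step can be sketched with NumPy; this is only a sketch of the rank-k truncation at the heart of LSI (the toy matrix and term labels are hypothetical), not a full LSI traceability pipeline:

```python
import numpy as np

def lsi_reduce(A, k):
    """Approximate the term-by-document matrix A by its rank-k SVD
    truncation A_k = U_k * S_k * V_k^T, the core step of LSI."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Toy term-by-document matrix: rows could stand for terms such as
# "automobile", "car", "engine" that partly co-occur across documents.
A = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 1.0],
])
A_k = lsi_reduce(A, k=2)  # documents now compared in a 2-factor space
```

Because co-occurring terms are merged into shared factors, documents that use different but related vocabulary end up closer in the reduced space than in the original term space.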