Using Data Fusion and Web Mining to Support Feature Location in - PowerPoint PPT Presentation

Using Data Fusion and Web Mining to Support Feature Location in Software SEMERU

Feature: a requirement that user can invoke and that has an observable behavior.

Feature Location Impact Analysis

Existing Feature Location Work Static ASDGs SUADE SNIAFL FCA DORA Cerberus Software LSI Textual Dynamic Reconn PROMESIR NLP SPR SITIR Meghan Revelle and Denys Poshyvanyk. “Feature Location in Source Code: A Taxonomy and Survey.” Submission to Journal of Software Maintenance and Evolution: Research and Practice .

Textual Feature Location • Information Retrieval (IR) – Searching for documents or within docs for relevant information • First used for feature location by Marcus et al. in 2004 * . – Latent Semantic Indexing ** (LSI) • Utilized by many existing approaches: PROMESIR, SITIR, HIPIKAT etc. * Marcus, A., Sergeyev, A., Rajlich, V., and Maletic, J., "An Information Retrieval Approach to Concept Location in Source Code", in Proc. of Working Conference on Reverse Engineering, 2004, pp. 214-223. ** Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R., "Indexing by Latent Semantic 5 Analysis", Journal of the American Society for Information Science , vol. 41, no. 6, Jan. 1990, pp. 391-407.

Applying LSI to Source Code • Corpus creation synchronized void print(TestResult print test result ... – Choose granularity result, long runTime) { • Preprocessing printHeader(runTime); m 1 5 1 3 ... printErrors(result); – Stop word removal, printFailures(result); m 2 ... ... ... ... splitting, stemming printFooter(result); • Indexing } – Term-by-document print test result result run time print header print test result result run time print header print test result result run time print head matrix run time print errors result print failure run time print errors result print failure run time print error result print fail – Singular Value result print footer result result print footer result result print foot result Decomposition • Querying – User-formulated print test result ... • Generate results m 1 5 1 3 ... – Ranked list m 2 ... ... ... ... 6

Dynamic Feature Location Software Scenario-based Reconnaissance * Probabilistic Ranking (SPR) ** I 1 R I 2 Feature t 1 m k Invoked R I 1 t 2 m k m k I 1 I 2 R Feature Not t 3 m k m k Invoked * Wilde, N. and Scully, M., "Software Reconnaissance: Mapping Program Features to Code", Software Maintenance: Research and Practice , vol. 7, no. 1, Jan.-Feb. 1995, pp. 49-62. ** Antoniol, G. and Guéhéneuc, Y. G., "Feature Identification: An Epidemiological Metaphor", IEEE Trans. on 7 Software Engineering , vol. 32, no. 9, Sept. 2006, pp. 627-641.

Hybrid Feature Location PROMESIR * SITIR ** LSI SPR PROMESIR LSI Execution score score Score score Trace m 15 0.91 m 52 0.80 m 6 0.715 m 15 0.91 main m 16 0.88 m 47 0.66 m 47 0.70 m 16 0.88 | m 1 m 2 0.85 m 6 0.64 m 52 0.70 m 2 0.85 | m 2 m 6 0.79 m 2 0.53 m 2 0.69 m 6 0.79 | | m 6 m 47 0.74 m 15 0.37 m 15 0.64 m 47 0.74 | | m 15 m 52 0.60 m 16 0.34 m 16 0.61 m 52 0.60 | m 3 ... ... ... ... ... ... ... ... | m 47 ... * P robabilistic R anking o f M ethods Based on E xecution ** SI ngle T race and I nformation R etrieval S cenarios and I nformation R etrieval Poshyvanyk, D., Guéhéneuc, Y. G., Marcus, A., Antoniol, G., Liu, D., Marcus, A., Poshyvanyk, D., and Rajlich, V., "Feature and Rajlich, V., "Feature Location using Probabilistic Ranking of Location via Information Retrieval based Filtering of a Single Methods based on Execution Scenarios and Information 8 Scenario Execution Trace", in Proc. of International Conference Retrieval", IEEE Trans. on Software Engineering , vol. 33, no. 6, on Automated Software Engineering, 2007, pp. 234-243. June 2007, pp. 420-432.

Data Fusion Example Global Positioning System (GPS) χ Actual Position - Discrete measurements χ - Meter accuracy χ - Noisy χ Position + No drift χ χ Inertial Navigation System (INS) + Continuous measurements χ χ + Centimeter accuracy χ + Low noise χ χ - Drifts over time χ χ Time 9

Data Fusion for Feature Location • Combining information from multiple sources will yield better results than if the data is used separately – Previous • Textual, Dynamic, and Static (i.e., Cerberus) – Current • Textual info from IR • Execution info from dynamic tracing • Web mining 10

Web Mining m 1 m 2 m 3 m 4 m 5 m 6 m 7 m 8 m 9 m 10 m 11 m 12 m 14 m 13 m 15 m 16 m 17 m 18 m 19 m 20 11

Web Mining Algorithms PageRank – Measure the relative importance of a web page – Used by the Google search engine – Link from X to Y means a vote by X for Y – A node’s PageRank depends on # incoming links and the PageRank of nodes that link to it Image source: http://en.wikipedia.org/wiki/Pagerank 12 Brin, S. and Page, L., "The Anatomy of a Large-Scale Hypertextual Web Search Engine", in Proc. of 7th International Conference on World Wide Web, Brisbane, Australia, 1998, pp. 107-117.

Web Mining Algorithms HITS – Hyperlinked-Induced Topic Search – Identifies hub and authority pages – Hubs point to many good authorities – Authorities are pointed to by many hubs Hubs Authorities Kleinberg, J. M., "Authoritative sources in a hyperlinked 13 environment", Journal of the ACM , vol. 46, no. 5, 1999, pp. 604-632.

Probabilistic Program 15 Dependence Graph* 16 20 1/7 1/6 m 1 1/7 1/6 1/6 1/7 13 2/7 1/6 1/7 1/6 1/7 1/6 17 m 2 m 3 m 4 m 5 m 6 2/8 1/4 3/8 1/4 18 2/8 1/4 1/8 1/4 19 m 7 m 8 m 9 PPDG 14 1/1 1/1 1/2 1/5 1/3 1/2 1/2 4/5 1/2 2/3 10 – Derived from m 10 m 11 m 12 12 feature- m 14 1/1 1/1 1/1 3/3 11 specific trace 1/1 1/1 m 13 7 1/1 1/1 – Binary 2/4 1/2 8 2/4 1/2 weights m 15 2/6 1/2 9 3/9 1/4 – Execution 3/9 1/4 1/4 1/9 2/9 1/4 2 m 16 frequency m 17 m 18 m 19 3 4/6 1/2 weights 4 m 20 5 *Baah, G. K., Podgurski, A., and Harrold, M. J. 2008. The probabilistic program dependence graph 6 14 and its application to fault diagnosis. In Proceedings of the 2008 International Symposium on 1 Software Testing and Analysis , 2008.

Incorporating Web Mining with Feature Location PR m 15 0.14 m 16 0.09 m 20 0.07 m 13 0.04 m 17 0.001 LSI score ... ... m 15 0.91 m 16 0.88 m 2 0.85 m 6 0.79 m 47 0.74 15 m 52 0.60

Feature Location Techniques Evaluated LSI & LSI, Dyn, & Dynamic Web Mining LSI, Dyn, & HITS PageRank Analysis LSI PR(bin) LSI+Dyn+PR(bin) top LSI+Dyn+HITS(h,bin) top LSI+Dyn+HITS(h,bin) bottom LSI+Dyn PR(freq) LSI+Dyn+PR(bin) bottom LSI+Dyn+HITS(h,freq) top LSI+Dyn+HITS(h,freq) bottom (baseline) HITS(h, bin) LSI+Dyn+PR(freq) top LSI+Dyn+HITS(a,bin) top LSI+Dyn+HITS(a,bin) bottom HITS(h, freq) LSI+Dyn+PR(freq) bottom LSI+Dyn+HITS(a,freq) top LSI+Dyn+HITS(a,freq) bottom HITS(a, bin) HITS(a, freq) Use LSI to Use web Use LSI to rank methods. Prune unexecuted. Use web mining algorithm to rank mining also rank methods and prune top- or bottom- ranked methods from LSI+Dyn’s methods, algorithm to results. prune rank 16 unexecuted methods.

Feature Location Techniques Explained m 1 + m 2 m 3 m 4 m 5 m 6 m 7 m 8 m 9 m 10 m 11 m 12 PR(bin) PR(bin) m 14 HITS(h, bin) bottom PR(bin) top LSI+Dyn LSI+Dyn m 13 Web Scenario Executed Tracer m 15 Mining Methods Query m 16 m 17 m 18 m 19 Ranked Source Ranked + LSI Methods Code Methods LSI m 20 Ranked, + Executed Methods 17 Final Results

Subject Systems • Eclipse 3.0 – 10K classes, 120K methods, and 1.6 million LOC – 45 features – Gold set: methods modified to fix bug – Queries: short description from bug report – Traces: steps to reproduce bug 18

Subject Systems • Rhino 1.5 – 138 classes, 1,870 methods, and 32,134 LOC – 241 features – Gold set: Eaddy et al.’s dataset * – Queries: description in specification – Traces: test cases 20 * http://www.cs.columbia.edu/~eaddy/concerntagger/

Size of Traces Min Max 25% Med 75% σ μ Methods 88K 1.5MM 312K 525K 1MM 666K 406K Unique Methods 1.9K 9.3K 3.9K 5K 6.3K 5.1K 2K Eclipse Size-MB 9.5 290 55 98 202 124 83 Threads 1 26 7 10 12 10 5 Methods 160K 12MM 612K 909K 1.8MM 1.8MM 2.3MM Unique Methods 777 1.1K 870 917 943 912 54 Rhino Size-MB 18 1,668 71 104 214 210 273 Threads 1 1 1 1 1 1 0 21

Research Questions • RQ1 – Does combining web mining algorithms with an existing approach to feature location improve its effectiveness? • RQ2 – Which web-mining algorithms, HITS or PageRank, produces better results? 22

Data Collection & Testing • Effectiveness measure LSI score – Descriptive statistics m 15 0.91 m 16 0.88 • 45 Eclipse features m 2 0.85 Effectiveness = 4 • 241 Rhino features m 6 0.79 m 47 0.74 m 52 0.60 • Statistical Testing – Wilcoxon rank sum test – Null hypothesis • There is no significant difference between the effectiveness of X and the baseline (LSI+Dyn). – Alternative hypothesis • The effectiveness of X is significantly better than the baseline (LSI+Dyn). 23

Using Data Fusion and Web Mining to Support Feature Location in - PowerPoint PPT Presentation

Using Data Fusion and Web Mining to Support Feature Location in Software SEMERU Feature: a requirement that user can invoke and that has an observable behavior. Feature Location Impact Analysis Existing Feature Location Work Static

Web Mining Web Mining Web Mining Web Mining Web mining is the use of data mining techniques

Web Mining Web Mining Web mining is the use of data mining techniques to automatically

Probabilistic and Model Fusion: . . . Model Fusion: . . . Interval Uncertainty Model Fusion:

Web MINING Web MINING Overview Overview Dr Ahmed Rafea Rafea Dr Ahmed 1 Web Mining Outline

Web Mining Web Mining to automatically discover and extract information from Web

Web Mining Web Mining to automatically discover and extract information from Web

High resolution image fusion via fusion frames Shidong Li San Francisco State University

Decision Tree Prof. Seungchul Lee Industrial AI Lab. Feature Test Feature 1 Feature 2 Feature

Web Mining Andreas Andersson Gustav Strmberg Sandra Stendahl Introduction Web mining o

Introduction What is data mining? to Data Mining: On what kind of data? Data Mining

Update on the Fusion Update on the Fusion Energy Sciences Program Energy Sciences Program Ed

October 2016 October 2016 WHAT IS FUSION? TWO FUSION TYPES NEUTRONIC ANEUTRONIC TWO

Modeling with MOSEK Fusion Ulf Worse INFORMS Minneapolis October 5 2013 http://www.mosek.com

Introduction to Web Mining What is Web Mining? Discovering useful information from the

Update of Magnetic Fusion Energy Research Brian A. Nelson for the UW Fusion Energy Research Group

A Distinctive Feature of A Distinctive Feature of A Distinctive Feature of A Distinctive Feature

Social Media Strategy Lee Frederiksen, Ph.D. Presenter Lee Frederiksen, Ph.D. Managing Partner,

Web 2.0 features Collective intelligence Chapter 6 Design for Collective Intelligence

1 Basic Definitions Below are some basic definitions and terminology that will be used throughout

Using CRIS to power research data discovery Alex Ball 1 Christopher Brown 2 Laura Molloy 3 Veerle

SEARCH RESULTS CLUSTERING (and its applications to the Polish language) Dawid Weiss Pozna

The Bandera Perspective This talk will focus on Bandera and Cadena and will give the

Interaction Design 9-12-2012 Overview of Interaction Design Understanding the Problem

Micro Content Its Kind of a Big Deal PRESENTED BY Paul Stoecklein MadCap Software Director