Concept Location in Source Code
Feature: a requirement that a user can invoke and that has an observable behavior.
Feature Location Impact Analysis
Concept Location
- “… discovering human oriented concepts
and assigning them to their implementation instances within a program …” [Biggerstaff’93]
- Concept location is needed whenever a
change is to be made
- Change requests are most often formulated
in terms of domain concepts
– the programmer must find the locations in the code where a concept such as “paste” is implemented
– this is the starting point of the change
Concept Location = Point of Change
Assumption
- The programmer understands domain
concepts, but not the code
– the knowledge of domain concepts is based
on program use and is easier to acquire
- a user of a word processor learns about cut-and-paste, fonts, and other concepts of the domain
- All domain concepts map onto the
fragments of the code
– finding that fragment is concept location
Partial Comprehension of the Code
- Large programs cannot be completely
comprehended
– programmers seek the minimum essential understanding for the particular software task
– they use an as-needed strategy
– they attempt to understand how certain specific concepts are reflected in the code
Existing Feature Location Work
Software Reconnaissance, SPR, ASDGs, LSI, NLP, Cerberus, PROMESIR, SITIR, SNIAFL, DORA, FCA, SUADE – spanning static, textual, and dynamic analyses
Dit, B., Revelle, M., Gethers, M., and Poshyvanyk, D., “Feature Location in Source Code: A Taxonomy and Survey”, submitted to Journal of Software Maintenance and Evolution: Research and Practice.
Concepts vs. Features vs. Concerns
- Features correspond to user-visible behavior of the system
– e.g., print, open file, copy, paste, etc.
– usually captured by the functional requirements of the system
- All features are concepts but not the other way
around
– e.g., linked list – part of the solution domain, not the problem domain
– dynamic techniques cannot be used to locate such concepts
- Concerns are synonymous with concepts
– aspects = crosscutting concerns
Concept Location as Text Search
- Source code is regarded as text data
- Techniques differ by:
– source code pre-processing
– query/search mechanism
– granularity and structure of the results
Grep-based Concept Location
- Source code is not processed
- Queries are regular expressions (i.e., a
formal language): [hc]at, .at, .*at, [^b]at, ^[hc]at, [hc]at$, etc.
- Search mechanism is regular expression
matching
- Results are unordered lines of text where
the query is matched
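A hedged sketch of this style of search in Java (the file name Editor.java and the query pattern are illustrative, not from the slides):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.regex.Pattern;

    // Grep-style concept location: print every line of a source file that
    // matches a regular-expression query such as "[hc]at".
    public class GrepSearch {
        public static void main(String[] args) throws IOException {
            Pattern query = Pattern.compile("[hc]at");   // the regular-expression query
            for (String line : Files.readAllLines(Paths.get("Editor.java"))) {
                if (query.matcher(line).find()) {        // report unordered matching lines
                    System.out.println(line);
                }
            }
        }
    }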
Grep-based Concept Location in an IDE
How Can We Do Better?
What is Information Retrieval?
- An Information Retrieval (IR) system is
capable of storage, retrieval, and maintenance of information (e.g., text, images, audio, video, and other multimedia objects)
- IR methods: signature files, inversion,
clustering, probabilistic classifiers, vector space models, etc.
What is Text Retrieval?
- TR = IR of textual data
– a.k.a document retrieval
- Basis for internet search engines
- Search space is a collection of documents
- The search engine creates a cache consisting of indexes of each document
– different techniques create different indexes
Terminology
- Document – unit of text – set of words
- Corpus – collection of documents
- Term vs. word – the basic unit of text; not all terms are words
- Query
- Index
- Rank
- Relevance
TR-based Concept Location
- Source code is processed into documents
- Queries are sets of terms/words
- Search mechanism based on the TR
technique used
- Results are documents and are ranked
w.r.t. the query
TR-based Concept Location - Process
1. Creating a corpus of a software system
2. Indexing the corpus with the TR method (we used LSI, Lucene, GDS, LDA)
3. Formulating a query
4. Ranking methods
5. Examining results
6. Go to 3 if needed
Creating a Corpus of a Software System
- Parsing source code and extracting documents
– corpus – collection of documents (e.g., methods)
- Removing non-literals and stop words
– common words in English, standard function library names, programming language keywords
- Preprocessing:
– split_identifiers and SplitIdentifiers
- NLP methods can be applied such as stemming
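A minimal sketch of the preprocessing steps above (the stop-word list is a tiny illustrative subset, not the full list used in the studies):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Splits snake_case and CamelCase identifiers into lower-case terms
    // and drops stop words (common English words, language keywords).
    public class Preprocess {
        private static final Set<String> STOP_WORDS =
                new HashSet<>(Arrays.asList("the", "a", "if", "else", "public", "void", "new"));

        static List<String> split(String identifier) {
            List<String> terms = new ArrayList<>();
            for (String part : identifier.split("_")) {  // underscores separate terms
                // split at lower-to-upper transitions and before the last capital of a run
                for (String t : part.split("(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")) {
                    if (!t.isEmpty() && !STOP_WORDS.contains(t.toLowerCase())) {
                        terms.add(t.toLowerCase());
                    }
                }
            }
            return terms;
        }

        public static void main(String[] args) {
            System.out.println(split("IProgressMonitor")); // [i, progress, monitor]
            System.out.println(split("m_iFlag"));          // [m, i, flag]
        }
    }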
Parsing Source Code and Extracting Documents
- Documents can be at different granularities
(e.g., methods, classes, files)
Source Code is Text Too
public void run IProgressMonitor monitor throws InvocationTargetException InterruptedException if m_iFlag processCorpus monitor checkUpdate else if m_iFlag processCorpus monitor UD_UPDATECORPUS else processQueryString monitor if monitor isCancelled throw new InterruptedException the long running
Splitting Identifiers
- IProgressMonitor = i progress monitor
- InvocationTargetException = invocation target exception
- m_iFlag = m i flag
- UD_UPDATECORPUS = ud updatecorpus
Removing Stop Words
public void run IProgressMonitor monitor throws InvocationTargetException InterruptedException if m_iFlag the processCorpus monitor checkUpdate else if m_iFlag processCorpus monitor UD_UPDATECORPUS else a processQueryString monitor if monitor isCancelled throw new InterruptedException the long running
- Common words in English
- Programming language keywords
More Processing
- NLP methods can be used such as
stemming, part of speech tagging, etc.
- Example:
– fishing, fished, fish, fishes, and fisher all reduce to the root word fish
Vector Space Model
- Documents and queries are represented as term vectors
– rows (j): documents; columns (i): terms
– [i, j]: weighted frequency of term i in document j
- Typical weight: TF-IDF (term frequency–inverse document frequency)
- Similarity measure: cosine of the angle between the vectors
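The weighting and similarity formulas did not survive extraction; in standard notation, with tf_{i,j} the frequency of term i in document j, N the number of documents, and df_i the number of documents containing term i:

    w_{i,j} = tf_{i,j} \times \log\frac{N}{df_i}
    \qquad
    \cos(d_j, q) = \frac{\sum_i w_{i,j}\, w_{i,q}}{\sqrt{\sum_i w_{i,j}^2}\,\sqrt{\sum_i w_{i,q}^2}}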
Query and Ranking of Results
- Any unit of text
– one word, one sentence, entire documents, piece of code, change request, etc.
- The query is interpreted as a pseudo-
document and represented in the VSM
- The results are documents, ranked based
on the similarity to the query (pseudo-)
document
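A minimal sketch of this ranking step, assuming documents and the query are already reduced to weight vectors over a shared vocabulary (the vectors are illustrative):

    import java.util.HashMap;
    import java.util.Map;

    // Ranks documents by cosine similarity to the query pseudo-document.
    public class VsmRanking {
        static double cosine(double[] a, double[] b) {
            double dot = 0, na = 0, nb = 0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                na += a[i] * a[i];
                nb += b[i] * b[i];
            }
            return dot / (Math.sqrt(na) * Math.sqrt(nb));
        }

        public static void main(String[] args) {
            double[] query = {1, 0, 1};                  // query as a pseudo-document
            Map<String, double[]> docs = new HashMap<>();
            docs.put("m1", new double[]{5, 1, 3});
            docs.put("m2", new double[]{0, 2, 0});
            docs.entrySet().stream()
                .sorted((x, y) -> Double.compare(cosine(y.getValue(), query),
                                                 cosine(x.getValue(), query)))
                .forEach(e -> System.out.printf("%s %.3f%n",
                                                e.getKey(), cosine(e.getValue(), query)));
        }
    }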
Evaluation Measures
- Precision - a measure of exactness or
fidelity
- Recall - a measure of completeness
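In the standard formulation over the sets of relevant and retrieved documents:

    \text{precision} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{retrieved}|}
    \qquad
    \text{recall} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|}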
Tools: IRiSS, JIRiSS, GES
Textual Feature Location
- Information Retrieval (IR)
– Searching for documents or within docs for relevant information
- First used for feature location by Marcus
et al. in 2004*.
– Latent Semantic Indexing** (LSI)
- Utilized by many existing approaches:
PROMESIR, SITIR, HIPIKAT etc.
* Marcus, A., Sergeyev, A., Rajlich, V., and Maletic, J., "An Information Retrieval Approach to Concept Location in Source Code", in Proc. of Working Conference on Reverse Engineering, 2004, pp. 214-223.
** Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R., "Indexing by Latent Semantic Analysis", Journal of the American Society for Information Science, vol. 41, no. 6, Jan. 1990, pp. 391-407.
Applying IR to Source Code
- Corpus creation
– Choose granularity
- Preprocessing
– Stop word removal, splitting, stemming
- Indexing
– Term-by-document matrix
– Singular Value Decomposition
- Querying
– User-formulated
- Generate results
– Ranked list
Example – a JUnit method and its corpus document:

    synchronized void print(TestResult result, long runTime) {
        printHeader(runTime);
        printErrors(result);
        printFailures(result);
        printFooter(result);
    }

Document after splitting and stop-word removal:
print test result result run time print header run time print errors result print failure result print footer result
Document after stemming:
print test result result run time print head run time print error result print fail result print foot result
Term-by-document matrix (fragment): the row for m1 holds the weighted frequencies of “print” (5), “test” (1), “result” (3), …
Feature Location with Software Reconnaissance - Dynamic Analysis
(Figure: two method-call traces over org.eclipse.swt.widgets – trace 1, a scenario NOT exercising the feature, contains calls such as readAndDispatch, checkDevice, isDisposed, drawMenuBars, runPopups, filterMessage, windowProc, and WM_TIMER; trace 2, a scenario exercising the feature, additionally contains runAsyncMessages (Display) and removeFirst (Synchronizer))
[Wilde’92]
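A minimal sketch of the reconnaissance comparison, assuming traces are reduced to sets of executed methods (the trace contents are abbreviated from the example above):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // Methods executed only by the scenario exercising the feature are
    // reported as candidate feature methods.
    public class Reconnaissance {
        public static void main(String[] args) {
            Set<String> trace1 = new HashSet<>(Arrays.asList(      // feature NOT exercised
                    "Display.readAndDispatch", "Display.checkDevice", "Display.runPopups"));
            Set<String> trace2 = new HashSet<>(Arrays.asList(      // feature exercised
                    "Display.readAndDispatch", "Display.checkDevice", "Display.runPopups",
                    "Display.runAsyncMessages", "Synchronizer.removeFirst"));

            Set<String> candidates = new HashSet<>(trace2);
            candidates.removeAll(trace1);                          // trace 2 minus trace 1
            System.out.println(candidates);
            // e.g. [Display.runAsyncMessages, Synchronizer.removeFirst]
        }
    }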
Feature Location with Scenario-based Probabilistic Ranking
(Figure: the same two traces – trace 1 NOT exercising the feature, trace 2 exercising it; trace 2 additionally contains WM_TIMER on org.eclipse.swt.widgets.ProgressBar, runAsyncMessages on Display, and removeFirst on Synchronizer)
[Antoniol’06]
Highlighted events: drawMenuBars, runPopups (org.eclipse.swt.widgets.Display), runAsyncMessages, removeFirst (org.eclipse.swt.widgets.Synchronizer)
Shortcomings of Dynamic Concept Location
Trace statistics (Min / Max / 25% / Median / 75% / µ / σ):
Eclipse – Methods:        88K / 1.5MM / 312K / 525K / 1MM / 666K / 406K
Eclipse – Unique methods: 1.9K / 9.3K / 3.9K / 5K / 6.3K / 5.1K / 2K
Eclipse – Size (MB):      9.5 / 290 / 55 / 98 / 202 / 124 / 83
Eclipse – Threads:        1 / 26 / 7 / 10 / 12 / 10 / 5
Rhino – Methods:          160K / 12MM / 612K / 909K / 1.8MM / 1.8MM / 2.3MM
Rhino – Unique methods:   777 / 1.1K / 870 / 917 / 943 / 912 / 54
Rhino – Size (MB):        18 / 1,668 / 71 / 104 / 214 / 210 / 273
Rhino – Threads:          1 / 1 / 1 / 1 / 1 / 1
- Execution traces are large even for small
systems
- Selecting multiple scenarios may be
difficult
- Filtering the traces is equally problematic
– the best filtering methods still return hundreds of methods
Probabilistic Ranking Of MEthodS and Information Retrieval (PROMESIR)
(Figure: two method-call traces over org.eclipse.swt.widgets – trace 1 from a scenario NOT exercising the feature, trace 2 from a scenario exercising the feature)
[Poshyvanyk’07]
Scenario-based Probabilistic Ranking
- Collecting Execution Traces
– multiple (ir)relevant scenarios are executed to collect traces
– processor emulation (VALGRIND for C++) to improve the precision of data collection
– bytecode instrumentation (JIKES for Java)
- Knowledge-based filtering to eliminate noisy
events
- Probabilistic ranking
– events are re-weighted (Wilde’s equation is renormalized)
Scenario-based Probabilistic Ranking
- A functionality, two scenarios => Two traces
- Comparison of the two traces to identify the
features related to the functionality
- Dynamic analysis
– traces = sequences of intervals
– intervals = sequences of events (method calls)
– events are (ir)relevant to the feature
Scenario-based Probabilistic Ranking
- Without noise, the comparison is simple
– set operations
- With noise, it is difficult
– imprecise locations of events
– imprecise beginning/end of intervals (C++ multi-threaded programs)
– imprecision of statistical profiling
– feature-relevant events are tangled or lost …
Scenario-based Probabilistic Ranking
- Knowledge-based filtering
– frequent (ir)relevant events
– application-specific events (middleware, code generators, external components …)
- Probabilistic ranking
– F( ) scenarios (not) exercising a feature
– intervals with (ir)relevant events
– Wilde’s relevance index for an event e_i
Combining the Experts
- SPR and LSI – “judgments” of the experts
- SPR – construct overlapping scenarios
- LSI – formulate a query that describes the
features
- Combined judgments:
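The combination formula did not survive extraction; as a hedged sketch, a PROMESIR-style combination is a weighted sum of the two experts’ scores for each method m, with λ expressing the relative confidence in each expert:

    score(m) = \lambda \cdot SPR(m) + (1 - \lambda) \cdot IR(m), \qquad \lambda \in [0, 1]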
Feature Location with PROMESIR
Example of using PROMESIR
- Locating a feature in JEdit
- Feature: “showing white-space as a visible
symbol in the text area”
- Steps:
– run two scenarios
– run the query
– explore the results
Scenario Exercising the Feature in JEdit
Start Tracing
Scenario Exercising the Feature in JEdit
Stop Tracing
Second Scenario NOT Exercising the Feature in JEdit
Start Tracing
Example of using PROMESIR – Results
(Figure: scenario-based probabilistic rankings and IR-based rankings combined into the PROMESIR ranking)
- Number of methods identified by SPR – 284
- Position of the first relevant method according to the IR ranking – 56
- Position of the first relevant method according to PROMESIR – 7
Case Studies
- Locating features associated with bugs:
- Case study objectives:
– Compare PROMESIR with stand-alone feature location approaches: LSI and SPR
           Mozilla   Eclipse
Classes      4,853     7,648
Methods     53,617    89,341
Words       85,439    56,861
SIngle Trace Information Retrieval (SITIR)
(Figure: a single execution trace – readAndDispatch, checkDevice, isDisposed, drawMenuBars, runPopups, filterMessage, windowProc, WM_TIMER over org.eclipse.swt.widgets classes – combined with the source code; only the executed methods are ranked against the query)
Collecting Execution Traces in SITIR
- Java Platform Debugger Architecture (JPDA)1
– Infrastructure to build end-user debugging applications for Java platform
- JPDA highlights:
– the debugger runs on a separate virtual machine
– minimal interference of the tracing tool with the subject program
– separate thread-based traces
– marked traces (start/stop recording)
1http://java.sun.com/javase/technologies/core/toolsapis/jpda/
Combining Structural, Dynamic & Textual Info
(Figure: software artifacts feed parsers, analysis tools, and a tracer producing structural information – AST, call graph, data flow, control flow – and an IR tool (LSI) producing a semantic space of semantic similarities; together these form the system representation)
Data Fusion Example
(Chart: estimated position vs. time – the INS estimate drifts away from the actual position over time, while GPS fixes are scattered around it)
Inertial Navigation System (INS)
+ continuous measurements
+ centimeter accuracy
+ low noise
– drifts over time
Global Positioning System (GPS)
- Discrete measurements
- Meter accuracy
- Noisy
+ No drift
Data Fusion for Feature Location
- Combining information from multiple
sources will yield better results than if the data is used separately
– Previous
- Textual, Dynamic, and Static (i.e., Cerberus)
– Current
- Textual info from IR
- Execution info from dynamic tracing
- Web mining
Web Mining
(Figure: a graph of methods m1–m20 connected by call edges, mined like a web graph)
Web Mining Algorithms
PageRank
– measures the relative importance of a web page
– used by the Google search engine
– a link from X to Y means a vote by X for Y
– a node’s PageRank depends on the number of incoming links and the PageRank of the nodes that link to it
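In the standard formulation, with damping factor d (commonly 0.85), N nodes, In(p) the pages linking to p, and |Out(q)| the number of outgoing links of q:

    PR(p) = \frac{1 - d}{N} + d \sum_{q \in In(p)} \frac{PR(q)}{|Out(q)|}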
Brin, S. and Page, L., "The Anatomy of a Large-Scale Hyper-textual Web Search Engine", in Proc. of 7th International Conference on World Wide Web, Brisbane, Australia, 1998, pp. 107-117.
Web Mining Algorithms
HITS
– Hyperlink-Induced Topic Search
– identifies hub and authority pages
– hubs point to many good authorities
– authorities are pointed to by many hubs
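In the standard formulation, hub and authority scores are iterated to a fixed point, normalizing after each step:

    a(p) = \sum_{q \to p} h(q) \qquad h(p) = \sum_{p \to q} a(q)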
Kleinberg, J. M., "Authoritative sources in a hyperlinked environment", Journal of the ACM, vol. 46, no. 5, 1999, pp. 604-632.
(Figure: hubs point to authorities; the method graph m1–m20 annotated with normalized edge weights such as 1/6, 1/4, 1/2)
Probabilistic Program Dependence Graph*
PPDG
– derived from a feature-specific trace
– binary weights
– execution frequency weights
(Figure: a PPDG over nodes 1–20 with execution-frequency edge weights such as 1/7, 2/7, 3/8)
* Baah, G. K., Podgurski, A., and Harrold, M. J., "The Probabilistic Program Dependence Graph and its Application to Fault Diagnosis", in Proc. of International Symposium on Software Testing and Analysis (ISSTA), 2008.
Incorporating Web Mining with Feature Location
LSI scores: m15 0.91, m16 0.88, m2 0.85, m6 0.79, m47 0.74, m52 0.60
PageRank scores: m15 0.14, m16 0.09, m20 0.07, m13 0.04, m17 0.001, …
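A minimal sketch of the fusion, assuming an LSI ranking, the set of executed methods, and web-mining scores are already computed (all scores and the cut-off are illustrative):

    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import java.util.stream.Collectors;

    // Keeps executed methods from the LSI ranking, then drops methods whose
    // PageRank/HITS score falls below a threshold (the "bottom" variants).
    public class FusedRanking {
        public static void main(String[] args) {
            Map<String, Double> lsi = Map.of("m15", 0.91, "m16", 0.88, "m2", 0.85, "m47", 0.74);
            Set<String> executed = Set.of("m15", "m16", "m2");      // from the single trace
            Map<String, Double> webScore = Map.of("m15", 0.14, "m16", 0.09, "m2", 0.001);
            double cutoff = 0.05;                                   // illustrative threshold

            List<String> ranked = lsi.entrySet().stream()
                    .filter(e -> executed.contains(e.getKey()))     // prune unexecuted
                    .filter(e -> webScore.getOrDefault(e.getKey(), 0.0) > cutoff) // prune bottom
                    .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());
            System.out.println(ranked);                             // [m15, m16]
        }
    }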
Feature Location Techniques Evaluated
- Baseline: LSI+Dyn (LSI & dynamic analysis) – use LSI to rank methods, then prune unexecuted methods
- Web mining alone: PR(bin), PR(freq), HITS(h, bin), HITS(h, freq), HITS(a, bin), HITS(a, freq) – use a web mining algorithm to rank methods
- Combined (LSI, Dyn, & PageRank/HITS): LSI+Dyn+PR(bin/freq)top/bottom and LSI+Dyn+HITS(h/a, bin/freq)top/bottom – use LSI to rank methods, prune unexecuted methods, then use a web mining algorithm to prune the top- or bottom-ranked methods from LSI+Dyn’s results
Feature Location Techniques Explained
(Figure: the method graph m1–m20 illustrating how PR(bin) scores prune the LSI and LSI+Dyn rankings into LSI+Dyn+PR(bin)top and LSI+Dyn+HITS(h, bin)bottom)
Subject Systems
- Eclipse 3.0
– 10K classes, 120K methods, and 1.6 million LOC
– 45 features
– gold set: methods modified to fix a bug
– queries: short description from the bug report
– traces: steps to reproduce the bug
Subject Systems
- Rhino 1.5
– 138 classes, 1,870 methods, and 32,134 LOC
– 241 features
– gold set: Eaddy et al.’s dataset*
– queries: description in the specification
– traces: test cases
* http://www.cs.columbia.edu/~eaddy/concerntagger/
64
Research Questions
- RQ1
– Does combining web mining algorithms with an existing approach to feature location improve its effectiveness?
- RQ2
– Which web-mining algorithm, HITS or PageRank, produces better results?
Data Collection & Testing
- Effectiveness measure
– Descriptive statistics
- 45 Eclipse features
- 241 Rhino features
- Statistical Testing
– Wilcoxon rank sum test
– null hypothesis
- There is no significant difference between the
effectiveness of X and the baseline (LSI+Dyn).
– Alternative hypothesis
- The effectiveness of X is significantly better than
the baseline (LSI+Dyn).
Example – LSI scores: m15 0.91, m16 0.88, m2 0.85, m6 0.79, m47 0.74, m52 0.60; the first relevant method appears at rank 4, so effectiveness = 4
Results: Web Mining Techniques
(Box plots of effectiveness for Eclipse and Rhino: LSI, LSI+Dyn, PR(freq), PR(bin), HITS(a, freq), HITS(a, bin), HITS(h, freq), HITS(h, bin); data labels: 574, 637, 680, 655, 560, 471, 568, 702)
Results: IR, Dyn, & Web Mining
(Box plots of effectiveness for Eclipse and Rhino for the techniques T1–T13 listed below)
- T1. LSI+Dyn
- T2. LSI+Dyn+PR(freq)top [40, 60]%
- T3. LSI+Dyn+PR(freq)bot [20, 70]%
- T4. LSI+Dyn+PR(bin)top [40, 60]%
- T5. LSI+Dyn+PR(bin)bot [10, 70]%
- T6. LSI+Dyn+HITS(a, freq)top [30, 70]%
- T7. LSI+Dyn+HITS(a, freq)bot [40, 60]%
- T8. LSI+Dyn+HITS(h, freq)top [10, 70]%
- T9. LSI+Dyn+HITS(h, freq)bot [60, 50]%
- T10. LSI+Dyn+HITS(a, bin)top [20, 70]%
- T11. LSI+Dyn+HITS(a, bin)bot [40, 40]%
- T12. LSI+Dyn+HITS(h, bin)top [10, 70]%
- T13. LSI+Dyn+HITS(h, bin) bot [70, 60]%
A Case in Point: Eclipse exclusion filter
LSI = 1,696
LSI+Dyn = 61
LSI+Dyn+HITS(h, bin)bottom = 24
Technique                          Eclipse     Rhino       Null Hypothesis
PR(freq)                           1           1           Not Rejected
PR(bin)                            1           1           Not Rejected
HITS(a, freq)                      1           1           Not Rejected
HITS(a, bin)                       1           1           Not Rejected
HITS(h, freq)                      1           1           Not Rejected
HITS(h, bin)                       1           1           Not Rejected
LSI+Dyn+PR(freq)top                < 0.0001    < 0.0001    Rejected
LSI+Dyn+PR(freq)bottom             0.004       —           Rejected
LSI+Dyn+PR(bin)top                 < 0.0001    < 0.0001    Rejected
LSI+Dyn+PR(bin)bottom              < 0.0001    0.74        Not Rejected
LSI+Dyn+HITS(a, freq)top           < 0.0001    —           Rejected
LSI+Dyn+HITS(a, freq)bottom        < 0.0001    0.99        Not Rejected
LSI+Dyn+HITS(h, freq)top           1           —           Not Rejected
LSI+Dyn+HITS(h, freq)bottom        < 0.0001    < 0.0001    Rejected
LSI+Dyn+HITS(a, bin)top            < 0.0001    < 0.0001    Rejected
LSI+Dyn+HITS(a, bin)bottom         < 0.0001    1           Not Rejected
LSI+Dyn+HITS(h, bin)top            1           —           Not Rejected
LSI+Dyn+HITS(h, bin)bottom         < 0.0001    < 0.0001    Rejected
Results of the Wilcoxon rank sum test comparing these techniques to the baseline, LSI+Dyn (α = 0.05). Null hypothesis: there is no significant difference between the effectiveness of X and the baseline, LSI+Dyn.
Research Questions Revisited
- RQ1: Does combining web mining
algorithms with an existing approach to feature location improve its effectiveness?
– Yes
- RQ2: Which web-mining algorithm, HITS or PageRank, produces better results?
– HITS
Best Techniques
- LSI+Dyn+HITS(h, freq)bottom
- LSI+Dyn+HITS(h, bin)bottom
- Methods with low HITS hub values are
getters and setters
Other Work
- HITS and PageRank on static vs. dynamic
info
- Evaluation of first relevant vs. all relevant methods
- Evaluation against fan-in/fan-out and heuristics based on getters and setters
- Impact of thresholds on the filtering power
Impact of the Selection of a Threshold
(Charts: Eclipse – effectiveness of LSI+Dyn+HITS(h, freq)bottom and LSI+Dyn+HITS(h, bin)bottom as the filtering threshold varies)
Impact of the Selection of a Threshold
(Charts: Rhino – effectiveness of LSI+Dyn+HITS(h, freq)bottom and LSI+Dyn+HITS(h, bin)bottom as the filtering threshold varies)
Results: All of the Feature’s Methods
(Charts: Eclipse and Rhino, for the techniques listed below)
- T1. LSI
- T2. LSI+Dyn
- T3. PR(freq)
- T4. PR(bin)
- T5. HITS(a, freq)
- T6. HITS(a, bin)
- T7. HITS(h, freq)
- T8. HITS(h, bin)
Results: All of the Feature’s Methods
(Charts: Eclipse and Rhino, for the techniques listed below)
- T1. LSI+Dyn
- T2. LSI+Dyn+PR(freq)top [50, 30]%
- T3. LSI+Dyn+PR(freq)bot [20, 30]%
- T4. LSI+Dyn+PR(bin)top [50, 30]%
- T5. LSI+Dyn+PR(bin)bot [20, 40]%
- T6. LSI+Dyn+HITS(a, freq)top [20, 30]%
- T7. LSI+Dyn+HITS(a, freq)bot [40, 30]%
- T8. LSI+Dyn+HITS(h, freq)top [10, 30]%
- T9. LSI+Dyn+HITS(h, freq)bot [60, 40]%
- T10. LSI+Dyn+HITS(a, bin)top [20, 40]%
- T11. LSI+Dyn+HITS(a, bin)bot [40, 30]%
- T12. LSI+Dyn+HITS(h, bin)top [10, 30]%
- T13. LSI+Dyn+HITS(h, bin) bot [60, 40]%
Using a Static Call Graph (Eclipse)
(Charts: filtering top vs. bottom PR results, top vs. bottom HITS authority results, and top vs. bottom HITS hub results)
Using a Static Call Graph (Rhino)
(Charts: filtering top vs. bottom PR results, top vs. bottom HITS authority results, and top vs. bottom HITS hub results)
Results: Getters and Setters (Best Ranks)
(Charts: Eclipse and Rhino, for the techniques listed below)
- T1. LSI+Dyn
- T2. LSI+Dyn with getters and setters filtered
- T3. LSI+Dyn+HITS(h, bin) bot
Results: Getters and Setters (All Ranks)
(Charts: Eclipse and Rhino, for the techniques listed below)
- T1. LSI+Dyn
- T2. LSI+Dyn with getters and setters filtered
- T3. LSI+Dyn+HITS(h, bin) bot
Results: Fan-In (Best Ranks)
(Charts: Eclipse and Rhino, for the techniques listed below)
- T1. LSI+Dyn+HITS(h, bin) bot
- T2. LSI+Dyn, filter fan-in ≤1
- T3. LSI+Dyn, filter fan-in ≤2
- T4. LSI+Dyn, filter fan-in ≤3
- T5. LSI+Dyn, filter fan-in ≤4
- T6. LSI+Dyn, filter fan-in ≤5
- T7. LSI+Dyn, filter fan-in ≤10
Results: Fan-In (All Ranks)
(Charts: Eclipse and Rhino, for the techniques listed below)
- T1. LSI+Dyn+HITS(h, bin) bot
- T2. LSI+Dyn, filter fan-in ≤1
- T3. LSI+Dyn, filter fan-in ≤2
- T4. LSI+Dyn, filter fan-in ≤3
- T5. LSI+Dyn, filter fan-in ≤4
- T6. LSI+Dyn, filter fan-in ≤5
- T7. LSI+Dyn, filter fan-in ≤10
Tool Support
- FLAT3
– Eclipse plug-in
– Lucene-based IR
– execution tracing
– integration
– tagging
– metrics
http://www.cs.wm.edu/semeru/flat3/
Trevor Savage, Meghan Revelle, and Denys Poshyvanyk. "FLAT3: Feature Location and Textual Tracing Tool." In the Proceedings of the 32nd International Conference on Software Engineering (ICSE'10), Formal Research Tool Demonstration, Cape Town, South Africa, May 2-8, 2010.