Concept Location in Source Code

  1. Concept Location in Source Code

  2. Feature: a requirement that user can invoke and that has an observable behavior.

  3. Feature Location Impact Analysis

  4. Concept Location
     • “… discovering human oriented concepts and assigning them to their implementation instances within a program …” [Biggerstaff’93]
     • Concept location is needed whenever a change is to be made
     • Change requests are most often formulated in terms of domain concepts
       – the programmer must find in the code the locations where the concept (e.g., “paste”) is implemented
       – this is the start of the change

  5. Concept Location = Point of Change

  6. Assumption
     • The programmer understands the domain concepts, but not the code
       – knowledge of domain concepts is based on program use and is easier to acquire
       – e.g., a user of a word processor learns about cut-and-paste, fonts, and other concepts of the domain
     • All domain concepts map onto fragments of the code
       – finding that fragment is concept location

  7. Partial Comprehension of the Code
     • Large programs cannot be completely comprehended
       – programmers seek the minimum essential understanding for the particular software task
       – they use an as-needed strategy
       – they attempt to understand how certain specific concepts are reflected in the code

  8. Existing Feature Location Work
     [Figure: taxonomy of existing feature location techniques — static, textual, and dynamic approaches, including ASDGs, SUADE, SNIAFL, FCA, DORA, Cerberus, LSI, Software Reconnaissance, PROMESIR, NLP, SPR, and SITIR]
     Dit, B., Revelle, M., Gethers, M., and Poshyvanyk, D., “Feature Location in Source Code: A Taxonomy and Survey”, submission to Journal of Software Maintenance and Evolution: Research and Practice.

  9. Concepts vs. Features vs. Concerns
     • Features correspond to user-visible behavior of the system
       – e.g., print, open file, copy, paste, etc.
       – usually captured by the functional requirements of the system
     • All features are concepts, but not the other way around
       – e.g., linked list – part of the solution domain, not the problem domain
       – dynamic techniques cannot be used to locate such concepts
     • Concerns are synonymous with concepts
       – aspects = crosscutting concerns

  10. Concept Location as Text Search
     • Source code is regarded as text data
     • Techniques differ by:
       – source code pre-processing
       – query/search mechanism
       – granularity and structure of the results

  11. Grep-based Concept Location
     • Source code is not processed
     • Queries are regular expressions (i.e., a formal language): [hc]at, .at, *at, [^b]at, ^[hc]at, [hc]at$, etc.
     • Search mechanism is regular expression matching
     • Results are unordered lines of text where the query is matched
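This style of search can be sketched in a few lines of Python using the standard `re` module (the source lines below are invented examples, not code from the slides):

```python
import re

def grep(pattern, lines):
    """Return the (unordered, unranked) lines matching a regular
    expression, mimicking grep-based concept location."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Hypothetical source lines to search
source = [
    "void pasteText(Clipboard c) {",
    "    buffer.insert(c.contents);",
    "}",
    "void catImpl() {}",
]

# '[hc]at' matches 'hat' or 'cat' anywhere in a line
print(grep(r"[hc]at", source))   # the 'catImpl' line
print(grep(r"paste", source))    # the 'pasteText' line
```

Note the limitation the later slides address: results are a flat, unordered set of matching lines with no ranking by relevance.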

  12. Grep-based Concept Location in an IDE

  13. How Can We Do Better?

  14. What is Information Retrieval?
     • An Information Retrieval (IR) system is capable of storage, retrieval, and maintenance of information (e.g., text, images, audio, video, and other multimedia objects)
     • IR methods: signature files, inversion, clustering, probabilistic classifiers, vector space models, etc.

  15. What is Text Retrieval?
     • TR = IR of textual data
       – a.k.a. document retrieval
     • Basis for internet search engines
     • Search space is a collection of documents
     • The search engine creates a cache consisting of indexes of each document
       – different techniques create different indexes

  16. Terminology
     • Document – unit of text – a set of words
     • Corpus – a collection of documents
     • Term vs. word – basic unit of text – not all terms are words
     • Query
     • Index
     • Rank
     • Relevance
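The index mentioned above is typically an inverted index mapping each term to the documents that contain it; a minimal Python sketch (the three-document corpus is an invented example):

```python
def build_inverted_index(corpus):
    """Map each term to the set of document ids containing it."""
    index = {}
    for doc_id, text in enumerate(corpus):
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)
    return index

# Hypothetical corpus: each string is one document
corpus = ["print header footer", "paste clipboard text", "print text"]
index = build_inverted_index(corpus)
print(index["print"])  # documents 0 and 2
print(index["paste"])  # document 1
```

Looking up a query term then costs a dictionary access instead of a scan of the whole corpus.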

  17. TR-based Concept Location
     • Source code is processed into documents
     • Queries are sets of terms/words
     • The search mechanism is based on the TR technique used
     • Results are documents, ranked w.r.t. the query

  18. TR-based Concept Location – Process
     1. Creating a corpus of the software system
     2. Indexing the corpus with the TR method (we used LSI, Lucene, GDS, LDA)
     3. Formulating a query
     4. Ranking the documents against the query
     5. Examining the results
     6. Go to 3 if needed

  19. Creating a Corpus of a Software System
     • Parsing source code and extracting documents
       – corpus – a collection of documents (e.g., methods)
     • Removing non-literals and stop words
       – common words in English, standard library function names, programming language keywords
     • Preprocessing:
       – splitting identifiers: split_identifiers and SplitIdentifiers
     • NLP methods can be applied, such as stemming

  20. Parsing Source Code and Extracting Documents • Documents can be at different granularities (e.g., methods, classes, files)


  22. Source Code is Text Too
     public void run IProgressMonitor monitor throws InvocationTargetException InterruptedException if m_iFlag processCorpus monitor checkUpdate else if m_iFlag processCorpus monitor UD_UPDATECORPUS else processQueryString monitor if monitor isCancelled throw new InterruptedException the long running
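Producing a token stream like the one above can be approximated with a single regular expression that keeps identifier-like tokens and drops operators, braces, and punctuation (the Java fragment is adapted from the slide; note that words inside string literals are extracted too, just as "the long running" appears in the slide's stream):

```python
import re

def tokenize(source):
    """Extract identifier-like tokens from source text, dropping
    operators, braces, parentheses, and punctuation."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)

# Fragment adapted from the slide's run() method
java = 'if (monitor.isCancelled()) { throw new InterruptedException("the long running"); }'
print(tokenize(java))
```

A real corpus builder would also track which method or class each token came from, so that tokens can be grouped into documents at the chosen granularity.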

  23. Splitting Identifiers
     public void run IProgressMonitor monitor throws InvocationTargetException InterruptedException if m_iFlag the processCorpus monitor checkUpdate else if m_iFlag processCorpus monitor UD_UPDATECORPUS else a processQueryString monitor if monitor isCancelled throw new InterruptedException the long running
     • IProgressMonitor = i progress monitor
     • InvocationTargetException = invocation target exception
     • m_iFlag = m i flag
     • UD_UPDATECORPUS = ud updatecorpus
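The splitting rules above (underscores plus CamelCase boundaries) can be sketched as a small Python heuristic; this is a simplified version, and production tools use more elaborate splitting:

```python
import re

# One word per match: an acronym followed by a capitalized word ("I" in
# "IProgressMonitor"), a capitalized or lowercase word, an all-caps run,
# or a digit run.
CAMEL = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")

def split_identifier(name):
    """Split snake_case and CamelCase identifiers into lowercase terms."""
    terms = []
    for part in name.split("_"):
        terms.extend(m.lower() for m in CAMEL.findall(part))
    return " ".join(terms)

print(split_identifier("IProgressMonitor"))   # i progress monitor
print(split_identifier("m_iFlag"))            # m i flag
print(split_identifier("UD_UPDATECORPUS"))    # ud updatecorpus
```

The outputs match the mappings on the slide, including the all-caps run UPDATECORPUS staying as a single term.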

  24. Removing Stop Words
     • Common words in English
     • Programming language keywords
     public void run IProgressMonitor monitor throws InvocationTargetException InterruptedException if m_iFlag the processCorpus monitor checkUpdate else if m_iFlag processCorpus monitor UD_UPDATECORPUS else a processQueryString monitor if monitor isCancelled throw new InterruptedException the long running
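Filtering both kinds of noise is a simple set-membership test; the stop-word and keyword lists below are tiny illustrative samples, and the term stream is adapted from the slide:

```python
# Small sample lists — real systems use much larger ones
STOP_WORDS = {"the", "a", "an", "is", "of", "to"}
JAVA_KEYWORDS = {"public", "void", "if", "else", "throw", "new", "throws", "long"}

def remove_stop_words(terms):
    """Drop common English words and programming language keywords."""
    noise = STOP_WORDS | JAVA_KEYWORDS
    return [t for t in terms if t.lower() not in noise]

terms = ("public void run the process corpus monitor if monitor "
         "is cancelled throw new the long running").split()
print(remove_stop_words(terms))
```

Only the domain-bearing terms survive, which is what the later indexing steps operate on.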

  25. More Processing
     • NLP methods can be used, such as stemming, part-of-speech tagging, etc.
     • Example: fishing, fished, fish, fishes, and fisher all reduce to the root word fish
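A toy suffix-stripping stemmer reproduces the slide's example; this is a deliberately naive sketch, and a real system would use a full algorithm such as Porter's:

```python
def naive_stem(word):
    """Strip one common suffix if the remaining stem is long enough.
    A toy heuristic, not a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("fishing", "fished", "fish", "fishes", "fisher"):
    print(w, "->", naive_stem(w))   # all reduce to 'fish'
```

Stemming shrinks the vocabulary so that a query containing "fishing" can match a document containing "fished".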

  26. Vector Space Model
     • Rows (j): documents; columns (i): terms
     • [i, j]: weighted frequency of term i within document j
     • Typical weight: tf-idf (term frequency–inverse document frequency)
     • Similarity measure: cosine of the angle between the vectors
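The tf-idf weighting and cosine similarity can be sketched with sparse dictionaries instead of a dense matrix (the three-document corpus is an invented example; the idf formula here is the plain log(N/df) variant):

```python
import math
from collections import Counter

def tf_idf_vectors(corpus):
    """Build one tf-idf vector (term -> weight) per document."""
    docs = [Counter(doc.split()) for doc in corpus]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)   # document frequency
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]

def cosine(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

corpus = ["print header print footer", "paste clipboard", "print errors"]
vecs = tf_idf_vectors(corpus)
print(cosine(vecs[0], vecs[1]))   # no shared terms -> 0
print(cosine(vecs[0], vecs[2]))   # share 'print'  -> positive
```

A term occurring in every document gets idf = log(1) = 0 and so contributes nothing to similarity, which is exactly the intended effect of the inverse-document-frequency factor.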

  27. Query and Ranking of Results
     • A query is any unit of text – one word, one sentence, an entire document, a piece of code, a change request, etc.
     • The query is interpreted as a pseudo-document and represented in the VSM
     • The results are documents, ranked by similarity to the query (pseudo-)document
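The pseudo-document idea is just "vectorize the query the same way as the documents, then sort by similarity"; a self-contained sketch using raw term frequencies for simplicity (the corpus is an invented example):

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank(query, corpus):
    """Treat the query as a pseudo-document; return document ids
    ordered from most to least similar."""
    q = Counter(query.split())
    scores = [(cosine(q, Counter(d.split())), i) for i, d in enumerate(corpus)]
    return [i for score, i in sorted(scores, reverse=True)]

corpus = [
    "print header print footer run time",    # doc 0
    "paste clipboard contents into buffer",  # doc 1
    "copy selection to clipboard",           # doc 2
]
print(rank("paste clipboard", corpus))   # doc 1 first, then 2, then 0
```

Unlike grep's unordered line matches, the programmer now examines documents in decreasing order of estimated relevance.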

  28. Evaluation Measures
     • Precision – a measure of exactness or fidelity: the fraction of retrieved documents that are relevant
     • Recall – a measure of completeness: the fraction of relevant documents that are retrieved
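Both measures follow directly from the overlap between the retrieved set and the relevant set; the method names m1–m7 below are hypothetical:

```python
def precision_recall(retrieved, relevant):
    """precision = |retrieved ∩ relevant| / |retrieved|
       recall    = |retrieved ∩ relevant| / |relevant|"""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)

# Hypothetical search: 4 methods returned, 5 actually implement the feature
p, r = precision_recall(retrieved=["m1", "m2", "m3", "m4"],
                        relevant=["m1", "m2", "m5", "m6", "m7"])
print(p, r)   # 0.5 0.4
```

Here 2 of the 4 returned methods are relevant (precision 0.5), but only 2 of the 5 relevant methods were found (recall 0.4), illustrating the usual tension between the two measures.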

  29. JIRiSS, GES, IRiSS

  30. Textual Feature Location
     • Information Retrieval (IR) – searching for documents, or within documents, for relevant information
     • First used for feature location by Marcus et al. in 2004*
       – Latent Semantic Indexing (LSI)**
     • Utilized by many existing approaches: PROMESIR, SITIR, HIPIKAT, etc.
     * Marcus, A., Sergeyev, A., Rajlich, V., and Maletic, J., “An Information Retrieval Approach to Concept Location in Source Code”, in Proc. of the Working Conference on Reverse Engineering, 2004, pp. 214-223.
     ** Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R., “Indexing by Latent Semantic Analysis”, Journal of the American Society for Information Science, vol. 41, no. 6, 1990, pp. 391-407.

  31. Applying IR to Source Code
     • Corpus creation
       – choose granularity
     • Preprocessing
       – stop word removal, splitting, stemming
     • Indexing
       – term-by-document matrix
       – Singular Value Decomposition
     • Querying
       – user-formulated query, e.g., “print test result”
     • Generate results
       – ranked list
     Example method from the slide:
       synchronized void print(TestResult result, long runTime) {
           printHeader(runTime);
           printErrors(result);
           printFailures(result);
           printFooter(result);
       }
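The term-by-document matrix that feeds the SVD step can be built directly from the preprocessed documents; a minimal sketch with a two-document toy corpus (the SVD/LSI step itself is omitted here):

```python
from collections import Counter

def term_document_matrix(corpus):
    """Build a term-by-document matrix: one row per term (sorted),
    one column per document, entries are raw term frequencies."""
    docs = [Counter(d.split()) for d in corpus]
    terms = sorted({t for d in docs for t in d})
    matrix = [[d[t] for d in docs] for t in terms]   # Counter returns 0 for absent terms
    return terms, matrix

corpus = ["print test result", "print header run time"]
terms, matrix = term_document_matrix(corpus)
for t, row in zip(terms, matrix):
    print(t, row)
```

In LSI this matrix is then factored with SVD and truncated to a low rank, so that documents sharing co-occurring (not necessarily identical) terms end up close together in the reduced space.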
