A Theory of Aspects as Latent Topics

Pierre Baldi, Cristina Lopes, Erik Linstead, Sushil Bajracharya
Donald Bren School of Information and Computer Science
University of California, Irvine
{pfbaldi,lopes,elinstea,sbajrach}@ics.uci.edu

OOPSLA 2008. Nashville, TN.

Overview

• Motivation
• Aspects as Latent Topics
• Machine Learning for Concern Extraction
  • Latent Dirichlet Allocation
• Data
  • Sourcerer
  • Vocabulary Selection
• Results
  • Scattering and Tangling in the Large
  • Scattering and Tangling in the Small
• Conclusions


Motivation

• AOP is still a controversial idea
• Hypotheses put forth by AOP have yet to be validated on the very large scale
  • Cross-cutting concerns exist and are subject to scattering and tangling
  • Excessive scattering and tangling are "bad" for software
  • Alternative composition mechanisms (e.g. AspectJ) alleviate problems caused by cross-cutting concerns
• Advances in machine learning provide the necessary tools for such a validation
• Here we focus on empirical validation of the first hypothesis
• Contributions
  • Unsupervised learning of cross-cutting concerns
  • An information-theoretic definition for scattering and tangling
  • Empirical validation across multiple scales


Learning Cross-Cutting Concerns

• Availability of open-source software facilitates large-scale empirical analysis of many software facets
• Recent advances in statistical text mining techniques offer new opportunities to mine Internet-scale software repositories
  • Unsupervised
  • Probabilistic
  • Proven to give better results than "traditional" methods
  • Scalable


Statistical Topic Models

• Statistical topic models represent documents as probability distributions over words and topics
• Benefits of working in a probabilistic framework
  • Robust: model documents directly
  • Finding patterns is intuitive and easily automated
• Active research area yielding exciting results
  • Traditional text
  • Source code (Linstead et al. ASE 2007, NIPS 2007)


Latent Dirichlet Allocation (LDA)

• Blei, Ng, Jordan (2003)
• Simple "bag of words" approach
• Models documents as mixtures of topics (multinomial)
• Topics are distributions over words (multinomial)
• Bayesian (symmetric Dirichlet priors)
• Well analyzed in the literature


Documents as “Bags of Words”

Bag of words: text words miner random matrix calc nearest cosine neighbor distance train collection bag

public class TextMiner {
    private List trainCollection;
    private Matrix bagOfWords;
    public void nearestNeighbor() {
        ...
        bagOfWords.calcCosineDistance();
        ...
        Random r = new Random();
    }
}
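The word list on this slide is what remains of the class once its identifiers are split on camelCase boundaries and lowercased. A minimal sketch of that bag-of-words extraction (the regex and the decision to count every token, keywords included, are assumptions about the pipeline, not the authors' exact rules):

```python
import re
from collections import Counter

def bag_of_words(source: str) -> Counter:
    """Extract identifier fragments: split on camelCase, lowercase, count."""
    words = []
    for ident in re.findall(r"[A-Za-z]+", source):
        # "bagOfWords" -> ["bag", "Of", "Words"]
        parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", ident)
        words.extend(p.lower() for p in parts)
    return Counter(words)

# The class from the slide (ellipsis lines dropped).
snippet = """
public class TextMiner {
    private List trainCollection;
    private Matrix bagOfWords;
    public void nearestNeighbor() {
        bagOfWords.calcCosineDistance();
        Random r = new Random();
    }
}
"""
bag = bag_of_words(snippet)
```

Running this yields counts such as two occurrences each of "bag", "of", and "words" (one per mention of `bagOfWords`), matching the slide's vocabulary.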


LDA – In a nutshell

• Given a document-word matrix
• Probabilistically determine X most likely topics
• For each topic determine Y most likely words
• Do it without human intervention
  • Humans do not supply hints for topic list
  • Humans do not tune algorithm on the fly
  • No need for iterative refinement
• Output
  • Document-Topic Matrix
  • Topic-Word Matrix
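The recipe above can be sketched as a toy collapsed Gibbs sampler. This is an illustrative assumption: the slides do not say which inference algorithm was used, and a real run over 38M LOC needs a proper LDA implementation. The sketch does produce the two advertised outputs, a document-topic matrix and a topic-word matrix:

```python
import random

def lda_gibbs(docs, num_topics, alpha=0.5, beta=0.1, iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA.
    docs: list of token lists. Returns (doc_topic, topic_word, vocab)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    w_id = {w: i for i, w in enumerate(vocab)}
    V, D, K = len(vocab), len(docs), num_topics

    # Count tables and random initial topic assignment per token.
    ndk = [[0] * K for _ in range(D)]   # doc-topic counts
    nkw = [[0] * V for _ in range(K)]   # topic-word counts
    nk = [0] * K                        # tokens per topic
    z = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(K)
            zs.append(k)
            ndk[d][k] += 1; nkw[k][w_id[w]] += 1; nk[k] += 1
        z.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = w_id[w], z[d][i]
                # Remove the token, then resample its topic from the conditional.
                ndk[d][k] -= 1; nkw[k][v] -= 1; nk[k] -= 1
                weights = [(ndk[d][j] + alpha) * (nkw[j][v] + beta) / (nk[j] + V * beta)
                           for j in range(K)]
                r, acc, k = rng.random() * sum(weights), 0.0, K - 1
                for j in range(K):
                    acc += weights[j]
                    if r < acc:
                        k = j
                        break
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][v] += 1; nk[k] += 1

    # Smoothed counts normalized into the two output matrices.
    doc_topic = [[(ndk[d][k] + alpha) / (len(docs[d]) + K * alpha) for k in range(K)]
                 for d in range(D)]
    topic_word = [[(nkw[k][v] + beta) / (nk[k] + V * beta) for v in range(V)]
                  for k in range(K)]
    return doc_topic, topic_word, vocab

# Tiny demo corpus: two "animal" files and two "database" files.
doc_topic, topic_word, vocab = lda_gibbs(
    [["cat", "dog", "cat"], ["dog", "cat"], ["sql", "jdbc", "sql"], ["jdbc", "sql"]],
    num_topics=2)
```

Each row of `doc_topic` and of `topic_word` is a probability distribution (sums to 1), which is exactly the form the scattering and tangling measures below consume.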


Aspects as Latent Topics

• Unification of "topics" in text with "concerns" in software
• A CONCERN IS A LATENT TOPIC
• Syntax and convention differentiate natural and programming languages, but:
  • At the most basic level a source file is still a document
  • Tokens in source code still define a vocabulary
• Probability distributions of topics over files and files over topics allow for precise measurement of scattering and tangling, respectively


Measuring Scattering

If the distribution of a topic t across modules m1…mn is given by p^t = (p^t_1, …, p^t_n), then scattering can be measured by the entropy

  H(p^t) = −∑_k p^t_k log(p^t_k)

Normalize by dividing by log(n):

  H(p^t) = 0 denotes a concern assigned to only one source file
  H(p^t) = 1 denotes a concern uniformly distributed across source files

AN ASPECT IS A LATENT TOPIC WITH HIGH SCATTERING ENTROPY

[Document-topic count matrix: rows are files d1…dn, columns are topics t1…tn]
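The scattering measure above is a one-liner; a minimal sketch in plain Python, normalizing by the log of the number of modules:

```python
import math

def normalized_entropy(dist):
    """Entropy of a probability distribution, normalized to [0, 1] by log(len(dist))."""
    h = -sum(p * math.log(p) for p in dist if p > 0)
    return h / math.log(len(dist))

# A concern living in one file only: no scattering.
concentrated = normalized_entropy([1.0, 0.0, 0.0, 0.0])
# A concern spread evenly across four files: maximal scattering.
uniform = normalized_entropy([0.25, 0.25, 0.25, 0.25])
```

The two extremes land at 0 and 1, matching the slide's interpretation; a topic column of the document-topic matrix, renormalized to sum to 1, is the `dist` argument.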


Measuring Tangling

If the distribution of a module m across concerns t1…tr is given by q^m = (q^m_1, …, q^m_r), then tangling can be measured by the entropy

  H(q^m) = −∑_k q^m_k log(q^m_k)

Normalize by dividing by log(r):

  H(q^m) = 0 denotes a file assigned to only one concern
  H(q^m) = 1 denotes a file uniformly distributed across concerns

[Document-topic count matrix: rows are files d1…dn, columns are topics t1…tn]
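Tangling is the same entropy applied to a row of the document-topic matrix rather than a column. A sketch that starts from raw per-file topic counts (the counts here are hypothetical), normalizes them into a distribution, and divides by log(r):

```python
import math

def tangling(doc_topic_counts):
    """Normalized entropy of one file's distribution over r concerns."""
    r = len(doc_topic_counts)
    total = sum(doc_topic_counts)
    probs = [c / total for c in doc_topic_counts if c > 0]
    return -sum(p * math.log(p) for p in probs) / math.log(r)

# Hypothetical rows of a document-topic count matrix (four concerns).
single_concern = tangling([9, 0, 0, 0])   # file implements one concern
mixed = tangling([3, 3, 3, 3])            # file uniformly tangled
```

As with scattering, 0 means the file touches a single concern and 1 means it is spread uniformly across all of them.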


Data

• We validate our technique at multiple scales
• Internet-Scale
  • 4,632 open source projects constituting 38 million LOC, 366k files, and 426k classes
  • Leverage Sourcerer infrastructure
• Individual Projects
  • JHotDraw
  • PDFBox
  • Jikes
  • JNode
  • CoffeeMud


Sourcerer

• UCI ICS project designed to:
  • Index publicly available source and provide fast search and mining
  • Leverage data to better understand code, facilitate reuse, and provide tools for real-world software development
  • Explore new avenues for mining software
• Current Version
  • ~12k open source projects (4,632 with source code)
  • Focused on the Java language as proof of concept
• Publicly Available
  • http://sourcerer.ics.uci.edu


Sourcerer Architecture


Vocabulary Selection

• Vocabulary size affects interpretability of topics extracted by LDA
• Code as plain text yields noisy results

public class TextMiner {
    private List trainCollection;
    private Matrix bagOfWords;
    public void nearestNeighbor() {
        ...
        bagOfWords.calcCosineDistance();
        ...
        Random r = new Random();
    }
}
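The slides motivate pruning the vocabulary but do not give the exact rules, so the keyword list, length threshold, and count threshold below are all assumptions. A plausible filter dropping Java keywords, very short fragments, and rare tokens:

```python
# Hypothetical vocabulary filter; the thresholds and the (partial) keyword
# list are illustrative assumptions, not the authors' pipeline.
JAVA_KEYWORDS = {"public", "private", "class", "void", "new", "return",
                 "static", "final", "import", "package", "if", "else",
                 "for", "while"}

def select_vocabulary(counts, min_len=3, min_count=1):
    """Keep tokens that are not keywords, long enough, and frequent enough."""
    return {w: c for w, c in counts.items()
            if w not in JAVA_KEYWORDS and len(w) >= min_len and c >= min_count}

# Raw counts such as bag_of_words() would produce for the TextMiner class.
raw = {"public": 2, "class": 1, "text": 1, "miner": 1, "bag": 2,
       "of": 2, "words": 2, "cosine": 1, "r": 1}
vocab = select_vocabulary(raw)
```

After filtering, the language noise (`public`, `class`, the loop variable `r`, the fragment `of`) is gone and the domain vocabulary (`miner`, `cosine`, `words`) remains, which is the interpretability gain the slide is after.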


Scattering in the Large

• Many prototypical examples for AOP
• Cross-cutting found at multiple magnitudes

Concern              Extracted Topic                              Entropy
String Processing    'string case length width substring'         .801
Exception Handling   'throwable trace stack print method'         .791
Concurrency          'thread run start stop wait'                 .767
XML                  'element document attribute schema child'    .749
Authentication       'user group role application permission'     .745
Web                  'request servlet http response session'      .723
Database             'sql object fields persistence jdbc'         .677
Plotting             'category range domain axis paint'           .641


Scattering Visualization


Scattering in the Small: JHotDraw

• Notable appearance of project-specific concerns
• In general appear to have lower scattering entropy
• Can be controlled in part by number of topics extracted by LDA
• In specific cases may require developer expertise to determine valid concerns versus noise


Scattering in the Small: Jikes


Scattering in the Small: JNode


Scattering in the Small: CoffeeMud


Scattering Visualization


Tangling in the Large

• Full matrix available from supplementary materials page
  • 366,287 x 125
  • 72MB (compressed)


Tangling in the Small

[Figures: JHotDraw, Jikes]


Tangling Visualization


A Parametric Model of Tangling?

• Inverse sigmoidal behavior noted in tangling
• Fit simple 2-parameter model to the data:

  f(x) = a · ln((1/x) − 1) + b

• R-squared of .947
• Standard deviation of .024
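Because f(x) = a·ln((1/x) − 1) + b is linear in the transformed variable u = ln((1/x) − 1), the two parameters can be fit by ordinary least squares with no iterative optimizer. A sketch with synthetic data (the slide does not publish the fitted a and b, so the values here are made up for the check):

```python
import math

def fit_inverse_sigmoid(xs, ys):
    """Least-squares fit of f(x) = a*ln((1/x) - 1) + b for 0 < x < 1.
    Linear regression of y on u = ln((1/x) - 1)."""
    us = [math.log(1.0 / x - 1.0) for x in xs]
    n = len(us)
    mu, my = sum(us) / n, sum(ys) / n
    a = (sum((u - mu) * (y - my) for u, y in zip(us, ys))
         / sum((u - mu) ** 2 for u in us))
    b = my - a * mu
    return a, b

# Noiseless synthetic data generated with a = 2.0, b = 0.5.
xs = [0.1, 0.2, 0.3, 0.5, 0.7, 0.9]
ys = [2.0 * math.log(1.0 / x - 1.0) + 0.5 for x in xs]
a, b = fit_inverse_sigmoid(xs, ys)
```

On noiseless data the regression recovers a ≈ 2.0 and b ≈ 0.5; on the real tangling curve the same fit gave the R-squared of .947 reported above.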


Comparison to Other Methods

• Validation for an Internet-scale repository is challenging
• Individual projects exist which make good baselines
• JHotDraw
  • Compared to fan-in/fan-out, identifier analysis, dynamic analysis, manual analysis, and mining code revisions
  • What aspects are identified?
  • To what degree are scattering and tangling observed?
• General agreement with our LDA-based technique in all cases


Conclusions

• Statistical machine learning techniques make additional progress in Aspect Mining
• LDA effectively extracts concerns from arbitrarily large repositories
  • Unsupervised
  • No pre-conceived notion of what an Aspect is
  • A Concern is a latent topic in source code
• Statistical techniques allow for precise measurement of scattering and tangling using information theory
  • An Aspect is a concern with high scattering entropy
• Significant agreement with other aspect mining methods


BACKUP


Current/Future Work

• Validate Second AOP Hypothesis
  • Are scattering and tangling truly "bad" for real-world software?
• Apply LDA to Software Evolution
  • Concern trends over release histories