 
Model-based Mining of Software Repositories
Markus Scheidgen
Saturday, 27. September 2014

Agenda
▶ Mining Software Repositories (MSR) and current approaches
▶ srcrepo – a model-based MSR system
  ■ srcrepo components and analysis process
  ■ a meta-model for source code repositories
  ■ gathering software metrics with an OCL-like internal Scala DSL
▶ work in progress: discussion of remaining problems and limitations

Relevant Research Fields

Mining Software Repositories (MSR)
The term mining software repositories (MSR) has been coined to describe a broad class of investigations into the examination of software repositories. The premise of MSR is that empirical and systematic investigations of repositories will shed new light on the process of software evolution. [1]

Software Metrics
A software metric is a mathematical definition mapping the entities of a software system to numeric metrics values. [...] to express features of software with numbers in order to facilitate software quality assessment. [2]

Reverse Engineering
Reverse engineering is the process of analyzing a subject system to (1) identify the system's components and their interrelationships and (2) create representations of the system in another form or at a higher level of abstraction. [3]

Software Evolution Research (SER)
■ (dis-)proving Lehman's Laws of software evolution
■ empirical investigations of software repositories through statistical analysis of software and software change metrics over the evolutionary course of many software systems

Model-based Mining Software Repositories (with srcrepo)
Overcoming heterogeneity and accessibility problems by raising the level of abstraction, while ensuring scalability and retaining meaningful information depth.

1. H. Kagdi, M.L. Collard, J.I. Maletic: A survey and taxonomy of approaches for mining software repositories in the context of software evolution; Journal of Software Maintenance and Evolution: Research and Practice; Vol.19/Nr.2/2007
2. R. Lincke, J. Lundberg, W. Löwe: Comparing Software Metrics Tools; 8th International Symposium on Software Testing and Analysis; 2008
3. E.J. Chikofsky, J.H. Cross: Reverse engineering and design recovery: A taxonomy; IEEE Software; Vol.7/Nr.1/1990

Contemporary Approaches to Large Scale MSR for SER

FLOSS Metrics [1]
■ database for over 3000 open source software projects
■ contains data about all revisions
■ Alitheia, multiple version control systems (VCS), but only text-based metrics
■ not only source code repositories (SCR) via VCS, also issue-tracking systems, mailing-lists, etc.

Sourcerer [2]
■ database and searchable index of declarations from over 4000 Java software projects
■ tracks only release revisions
■ metrics based on declarations (classes, methods, fields, etc., e.g. CK-metrics), but not based on actual implementations (e.g. McCabe, Halstead)

Boa [3]
■ domain specific language (DSL) for mining meta-data in ultra-large software repositories
■ only tracks VCS meta-data, e.g. “How many revisions are there in all Java projects using SVN?”

Scalability
■ a project
■ large scale: multiple related projects, e.g. Apache, Eclipse
■ ultra-large scale: 100k+ unrelated projects with varying quality [1,2,3]

Heterogeneity
■ abstraction from VCS [1,2,3]
■ abstraction from programming language: only meta-data [3] or text [1]

Accessibility
■ programming
■ database with index [1,2]
■ DSL [3]

Information Depth
■ all revisions [1,3], sample revisions [2]
■ meta-data [3]
■ text [1]
■ declarations [2]

1. G. Gousios, D. Spinellis: Alitheia core: An extensible software quality monitoring platform; Proceedings of the 31st International Conference on Software Engineering; 2009
2. E. Linstead, S. Bajracharya, T. Ngo, P. Rigor, C. Lopes, P. Baldi: Sourcerer: mining and searching internet-scale software repositories; Data Mining and Knowledge Discovery; Vol.18/Nr.2/2009
3. R. Dyer, H.A. Nguyen, H. Rajan, T.N. Nguyen: Boa: A Language and Infrastructure for Analyzing Ultra-Large-Scale Software Repositories; Proceedings of the 2013 International Conference on Software Engineering; 2013
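
The distinction between declaration-based and implementation-based metrics can be illustrated with a small sketch. It is not taken from any of the tools above: it uses Python's standard `ast` module as a stand-in for a Java AST, the sample class `Stack` is made up, and the complexity count is a simplified McCabe variant (1 plus the number of branching nodes).

```python
import ast

# Made-up sample source; stands in for a compilation unit from a repository.
SOURCE = '''
class Stack:
    def __init__(self):
        self.items = []

    def push(self, x):
        self.items.append(x)

    def pop(self):
        if not self.items:
            raise IndexError("empty")
        return self.items.pop()
'''

# Node types counted as decision points in this simplified McCabe variant.
DECISIONS = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def methods_per_class(tree):
    """Declaration-level metric: number of methods declared in each class."""
    return {
        cls.name: sum(isinstance(n, ast.FunctionDef) for n in cls.body)
        for cls in ast.walk(tree)
        if isinstance(cls, ast.ClassDef)
    }


def cyclomatic_complexity(func):
    """Implementation-level metric: 1 + decision points inside the function."""
    return 1 + sum(isinstance(n, DECISIONS) for n in ast.walk(func))


tree = ast.parse(SOURCE)
decl = methods_per_class(tree)
impl = {
    f.name: cyclomatic_complexity(f)
    for f in ast.walk(tree)
    if isinstance(f, ast.FunctionDef)
}
```

The first metric only needs class and method declarations; the second needs the statements inside each method body, which is the information that a declaration-only index does not retain.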

Goals and Hypothesis

Scalability
  approaches
  ■ a project
  ■ large scale: multiple related projects, e.g. Apache, Eclipse
  ■ ultra-large scale: 100k+ unrelated projects with varying quality [1,2,3]
  goals
  ■ cluster- (batching) and cloud- (Map/Reduce)-computing support
  ■ distributable databases
  hypothesis
  ■ distributable model persistence
  ■ distributed processing of models

Heterogeneity
  approaches
  ■ abstraction from VCS [1,2,3]
  ■ abstraction from programming language: only meta-data [3] or text [1]
  goals
  ■ common meta-model for VCSs
  ■ meta-models for programming languages
  ■ common meta-model for metrics
  hypothesis
  ■ abstraction for different VCSs exists
  ■ abstraction regarding metrics for different programming languages exists
  ■ abstraction for different languages exists

Accessibility
  approaches
  ■ programming
  ■ database with index [1,2]
  ■ DSL [3]
  goals
  ■ internal DSL: DSL + programming with models
  ■ common modeling framework
  ■ existing tools/frameworks
  hypothesis
  ■ is there a reasonable programming abstraction for gathering metrics/change metrics?

Information Depth
  approaches
  ■ all revisions [1,3], sample revisions [2]
  ■ meta-data [3]
  ■ text [1]
  ■ declarations [2]
  goals
  ■ all revisions
  ■ abstract syntax trees (AST)
  ■ differences between revisions (e.g. metrics on adaptations and refactorings)
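
To give a feel for what "internal DSL: DSL + programming with models" could mean, here is a minimal sketch of OCL-style collection operations embedded in a host language. It is written in Python rather than Scala and is not srcrepo's API: the class `Coll`, its method names, and the toy model (`Class`, `Method`, statement counts) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Method:
    name: str
    statements: int  # made-up size measure for the sketch


@dataclass
class Class:
    name: str
    methods: list = field(default_factory=list)


class Coll:
    """Tiny wrapper offering OCL-style collection operations (hypothetical)."""

    def __init__(self, items):
        self.items = list(items)

    def select(self, pred):
        # Keep only the elements that satisfy the predicate.
        return Coll(x for x in self.items if pred(x))

    def collect(self, fn):
        # Map each element to a single value.
        return Coll(fn(x) for x in self.items)

    def collect_all(self, fn):
        # Map each element to a collection and flatten one level.
        return Coll(y for x in self.items for y in fn(x))

    def size(self):
        return len(self.items)

    def sum(self):
        return sum(self.items)


model = Coll([
    Class("Parser", [Method("parse", 12), Method("reset", 2)]),
    Class("Lexer", [Method("next", 7)]),
])

# OCL flavour: classes->collect(methods)->select(m | m.statements > 5)->size()
large_methods = (
    model.collect_all(lambda c: c.methods)
    .select(lambda m: m.statements > 5)
    .size()
)
total_statements = (
    model.collect_all(lambda c: c.methods)
    .collect(lambda m: m.statements)
    .sum()
)
```

Because the queries are ordinary host-language expressions, they can be mixed freely with regular code (loops, helper functions, libraries), which is the usual argument for an internal over an external DSL.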

srcrepo – Components and Process

[Figure: srcrepo components and analysis process]
■ large scale software repositories (e.g. GitHub, SourceForge) contain software projects; a project comprises a source code repository (e.g. controlled by Git, SVN, CVS) with source code (e.g. Java, C++, eclipse*) as well as issue tracker, mailing lists, wiki
■ import: revisions are stored as a revision tree (1, 2, 3), sources as AST-models of new and changed CUs (A1, B2, A3, C3)
■ srcrepo storage: EMF-models via EMF-Fragments, e.g. on MongoDB
■ srcrepo runtime: headless Eclipse RCP
■ analysis: fully resolved snapshot models (S) are built from the revision tree and the stored CU models
■ OCL: metrics M1, M2, M3 on the snapshot models
■ EMF-Compare: differences D1-2, D2-3 between snapshot models
■ OCL: M1-2, M2-3 on the differences
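
The import and analysis steps in the figure can be sketched in a few lines. This is only an illustration under simplifying assumptions, not srcrepo's implementation: history is linear, CUs are never deleted, each CU model is reduced to a single made-up number, and the functions `snapshots`, `metric` and `diff` are hypothetical stand-ins for snapshot resolution, an OCL metric, and EMF-Compare.

```python
# Each revision stores only the compilation units (CUs) that are new or
# changed in it. Values stand in for AST models (here: a made-up line count).
revisions = [
    (1, {"A": 10}),
    (2, {"B": 4}),
    (3, {"A": 12, "C": 6}),
]


def snapshots(revs):
    """Yield (revision id, full snapshot) by overlaying each stored delta
    on the snapshot of the preceding revision."""
    current = {}
    for rev_id, changed in revs:
        current = {**current, **changed}  # new dict, earlier snapshots stay intact
        yield rev_id, current


def metric(snapshot):
    """Per-snapshot metric: total size over all CUs in the snapshot."""
    return sum(snapshot.values())


def diff(old, new):
    """Difference between two snapshots, by CU name."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "changed": sorted(k for k in new.keys() & old.keys() if new[k] != old[k]),
        "removed": sorted(old.keys() - new.keys()),
    }


snaps = list(snapshots(revisions))
metrics = {rev: metric(s) for rev, s in snaps}
changes = {
    f"{a}-{b}": diff(sa, sb)
    for (a, sa), (b, sb) in zip(snaps, snaps[1:])
}
```

Storing only new and changed CUs keeps the persisted data proportional to the amount of change, while the snapshot step restores the complete view that metrics need. Change metrics would then be computed on the entries of `changes` rather than on the snapshots themselves.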