
Getting the global picture

Jesús M. González Barahona, Gregorio Robles
GSyC, Universidad Rey Juan Carlos, Madrid, Spain

{jgb,grex}@gsyc.escet.urjc.es

Oxford Workshop on Libre Software 2004
Oxford, UK, June 25th


Overview

  • Available information about libre software projects
  • Open problems (large detailed studies, crossing information from different sources)
  • Some questions still to be answered
  • What we need to get there

© 2004 Jesús M. González Barahona. Getting the global picture


Sources of data about libre software projects

  • Version control systems: CVS, Subversion, BitKeeper, etc.
  • Software releases (both binary and source)
  • Project documentation: man, info, DocBook, LaTeX, plain text, etc.
  • Bug tracking systems: Bugzilla, SourceForge, Debian, etc.
  • Mailing lists: BSD mbox, MH mbox, Mailman, etc.
  • Forums: many, many kinds
  • Information about usage, e.g. Debian’s popularity contest
  • Impact on the Internet, e.g. some filtered Googling
  • Polls and surveys, e.g. FLOSS


One kind of data source, one project

Example: source code analysis
  • Data source: version control system
  • Metrics-based analysis (SLOC, McCabe, number of modules, etc.)
  • Classification of code (language, documentation, etc.)
  • Reuse study (comparison of source code)
  • Contribution (e.g. by author), including affiliation networks
  • Evolution (any of the previous over time)
  • Combined studies (within the same project)
  • What can be learned: structure of the source code, basic developer activity
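The metrics and classification steps above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the extension table, the comment markers and all function names are hypothetical, and block comments are not handled, so real counters such as SLOCCount will give different numbers.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical extension table; real tools use much richer rules.
EXTENSION_KIND = {
    ".c": "C", ".h": "C", ".py": "Python", ".sh": "Shell",
    ".txt": "documentation", ".sgml": "documentation", ".po": "translation",
}
# Line-comment markers only; C block comments (/* ... */) are NOT handled here.
LINE_COMMENT = {"C": "//", "Python": "#", "Shell": "#"}


def classify(path):
    """Classify a file by its extension (language, documentation, ...)."""
    return EXTENSION_KIND.get(PurePosixPath(path).suffix.lower(), "unknown")


def count_sloc(text, kind):
    """Count lines that are neither blank nor pure line comments."""
    marker = LINE_COMMENT.get(kind)
    count = 0
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if marker and stripped.startswith(marker):
            continue
        count += 1
    return count


def sloc_by_kind(files):
    """files maps a path to its text; returns total counted lines per kind."""
    totals = Counter()
    for path, text in files.items():
        kind = classify(path)
        totals[kind] += count_sloc(text, kind)
    return totals
```

Running the same function on successive checkouts of a project would give the "evolution" variant of the metric.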


Several kinds of data sources, one project

Example: tracking developer activities
  • Data sources: version control system, bug tracking system (BTS), mailing list
  • Identify all developers in the BTS (maybe with the help of heuristics)
  • Identify all BTS ids in mailing lists (maybe with the help of heuristics)
  • Track individual developers over time (evolution of their contribution to the project)
  • What can be learned: how activity evolves over time, who fixes bugs (and when), ratio of listers to reporters to developers
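Once identities have been matched across sources, the tracking step reduces to grouping events by developer and time slot. The sketch below assumes that matching has already been done and that each event is a hypothetical `(developer_id, source, date)` tuple; the source labels "cvs", "bts" and "list" are illustrative only.

```python
from collections import defaultdict
from datetime import date


def monthly_activity(events):
    """Return {developer: {(year, month): {source: number of events}}}."""
    activity = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for developer, source, day in events:
        activity[developer][(day.year, day.month)][source] += 1
    return activity


def people_per_source(events):
    """Count distinct people seen in each source.

    With sources "list", "bts" and "cvs" this gives the raw numbers behind
    a ratio of listers to reporters to developers.
    """
    people = defaultdict(set)
    for developer, source, _ in events:
        people[source].add(developer)
    return {source: len(ids) for source, ids in people.items()}
```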


One kind of data source, several projects

Example: source code analysis for a distribution
  • Data source: source packages in a distribution
  • Compare and correlate source analysis (already shown)
  • What can be learned: file size for different languages, correlations between metrics and developers (are they similar in similar projects?)


Several kinds of data sources, several projects

Example: relationship of bug fixing to patch size
  • Data sources: version control system, bug tracking system, mailing list
  • Look for patches in the BTS, identify them in CVS
  • Look for patches in the mailing list, identify them in CVS
  • Look for fixed bugs in the BTS, relate them to changes in CVS
  • What can be learned: time from bug report to bug fix; its relationship to patch size, to who takes the bug report, and to the existence of a patch; relationship of bugs to previous changes in code
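Relating fixed bugs to CVS changes can be sketched with a simple heuristic: look for bug numbers in commit messages and take the earliest matching commit as the fix. The regular expression, the data layout and the function names below are assumptions for illustration; real projects follow different conventions and would need project-specific patterns.

```python
import re
from datetime import datetime

# Hypothetical convention: messages such as "Fix bug #123" or "closes #45".
BUG_REF = re.compile(r"\b(?:bug|fixes|closes)\s*#?(\d+)", re.IGNORECASE)


def bugs_mentioned(commit_message):
    """Return the set of bug ids referenced in a commit message."""
    return {int(number) for number in BUG_REF.findall(commit_message)}


def days_to_fix(bug_reports, commits):
    """Link each bug to the earliest later commit that mentions it.

    bug_reports: {bug_id: datetime the bug was opened}
    commits: list of (datetime, message, lines_changed)
    Returns {bug_id: (days from report to fix, patch size in lines)}.
    """
    first_fix = {}
    for when, message, lines_changed in sorted(commits, key=lambda c: c[0]):
        for bug_id in bugs_mentioned(message):
            opened = bug_reports.get(bug_id)
            if opened is None or when < opened or bug_id in first_fix:
                continue
            first_fix[bug_id] = ((when - opened).days, lines_changed)
    return first_fix
```

The resulting pairs of days and patch size are the raw material for studying how fix time relates to patch size.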


Several kinds of data sources, thousands of projects

Example: tracking developer effort and activities
  • Data sources: as many as possible
  • Select some hundreds of developers
  • Track them in several projects, submit a poll to them
  • Use the combined information to estimate effort per developer over time, to look for shifts in effort from project to project, and to correlate effort in different activities (coding, bug fixing, mailing lists)
  • What can be learned: typical evolutions of developers, what they think they do compared to what they actually do, why some projects get developers and others do not, and how to model the project


In all cases...

  • All the downloading of data can be automated
  • Most of the analysis of data can be automated (maybe with the help of statistically valid heuristics, and some manual work)
  • Data can benefit a lot from well-designed polls answered by developers
  • Really large sets of data
  • Many privacy issues


Main problems to get the big picture

  • Different source systems (e.g. bug tracking systems: Bugzilla, SourceForge, GNATS, Debian)
  • Different levels of information and data representation for the same concept (e.g. user ids in CVS, BTS, mailing list, forum, etc.)
  • Different information for the same item (e.g. different mail addresses for the same developer, at the same time)
  • Different conventions in different projects (e.g. policy and uses of code uploads and releases)


Our plan (headlines)

  • Automate as many download processes as possible (modular architecture)
  • Automate as many analysis approaches as possible (modular architecture)
  • Build a huge database with all data collected (be it raw or the result of analysis)
  • Allow others to use and contribute code
  • Allow data from polls to be integrated
  • Run the machinery for many, many projects


Our plan (some details)

  • Unique (opaque) identifier for every developer
  • Unified data descriptions for the main sources of raw data
  • Clear data formats for exchange of information in the most common contexts
  • Let projects use the tools (“report on my project”)
  • Integrate the tools with usual development systems (such as GForge)
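One possible way to obtain a unique, opaque identifier per developer is a keyed hash over a canonical alias. The slides do not describe how the identifier is built, so everything here is an assumption: the use of HMAC-SHA256, the 16-character truncation, the choice of the alphabetically first alias as canonical, and the names `opaque_id` and `assign_ids`.

```python
import hashlib
import hmac


def opaque_id(canonical_identity, secret_key):
    """Derive a stable identifier that does not reveal the identity itself.

    A keyed hash is used so that nobody without the key can confirm a guess
    by hashing a known e-mail address.
    """
    normalized = canonical_identity.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()[:16]


def assign_ids(alias_groups, secret_key):
    """Map every alias to the opaque id of the person it belongs to.

    alias_groups: list of sets; each set holds the aliases (CVS login,
    e-mail addresses, BTS id...) already known to be one person.
    """
    table = {}
    for aliases in alias_groups:
        canonical = min(alias.strip().lower() for alias in aliases)
        developer_id = opaque_id(canonical, secret_key)
        for alias in aliases:
            table[alias] = developer_id
    return table
```

A drawback of this particular choice is that the identifier changes if a newly discovered alias sorts before the current canonical one; keeping a stored table of already assigned ids would avoid that.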


Where are we now

  • GlueTheos and CVSAnalY in good shape
  • Work in progress: integration with source analysis tools
  • Work in progress: integration of social network analysis tools
  • Work in progress: integration with statistical tools
  • To be done: integration of other data sources (BTSs, mailing lists, etc.)
  • To be done: framework for data interchange
  • To be done: integration of everything, and collaboration framework


CVSAnalY: analyzing CVS repositories

  • Based on the analysis of CVS logs
  • Three steps:
      • Preprocessing (data retrieval and extraction)
      • Intermediate format (SQL, XML...)
      • Postprocessing (manipulation, correlations, graphics, etc.)
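The three steps can be illustrated end to end on a tiny example. This is not CVSAnalY's code: the sample log is an abbreviated, made-up fragment in the style of `cvs log` output (newer CVS versions print dates differently), the table layout is hypothetical, and SQLite stands in for the intermediate SQL format.

```python
import re
import sqlite3

# Abbreviated, invented sample in the style of "cvs log" output.
SAMPLE_LOG = """\
RCS file: /cvsroot/proj/src/main.c,v
Working file: src/main.c
head: 1.2
----------------------------
revision 1.2
date: 2004/06/20 10:15:00;  author: jgb;  state: Exp;  lines: +10 -2
Fix crash on startup
----------------------------
revision 1.1
date: 2004/06/01 09:00:00;  author: grex;  state: Exp;
Initial revision
=============================================================================
"""

REVISION = re.compile(r"^revision (\S+)$")
DETAILS = re.compile(
    r"^date: ([^;]+);\s+author: ([^;]+);\s+state: [^;]+;"
    r"(?:\s+lines: \+(\d+) -(\d+))?"
)


def parse_log(text):
    """Preprocessing: extract (file, revision, date, author, added, removed)."""
    commits, working_file, revision = [], None, None
    for line in text.splitlines():
        if line.startswith("Working file: "):
            working_file = line[len("Working file: "):]
            continue
        match = REVISION.match(line)
        if match:
            revision = match.group(1)
            continue
        match = DETAILS.match(line) if revision else None
        if match:
            when, author, added, removed = match.groups()
            commits.append((working_file, revision, when, author,
                            int(added or 0), int(removed or 0)))
            revision = None
    return commits


def store(commits):
    """Intermediate format: load the parsed commits into an SQL table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE commits (file TEXT, revision TEXT, date TEXT, "
               "author TEXT, added INTEGER, removed INTEGER)")
    db.executemany("INSERT INTO commits VALUES (?, ?, ?, ?, ?, ?)", commits)
    return db


def commits_per_author(db):
    """Postprocessing: one simple per-developer statistic."""
    return db.execute("SELECT author, COUNT(*), SUM(added) FROM commits "
                      "GROUP BY author ORDER BY author").fetchall()
```

Keeping the intermediate data in SQL means that later analyses can be written as queries without parsing the logs again.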


Preprocessing

  • Downloading modules and removing aggregated ones
  • Log retrieval and parsing
  • Transformation into SQL and XML
  • Username merging
  • File type matching (source code, documentation, translation, etc.)
  • 1.1.1.1 version and files in the Attic
  • Commit comment parsed for external contributions and “silent” commits
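The item about revision 1.1.1.1 and the Attic refers to two CVS peculiarities: `cvs import` creates revision 1.1.1.1 on the vendor branch alongside 1.1, and files removed from the trunk (or living only on a branch) are kept under an `Attic/` subdirectory in the repository. The slides do not say how these cases are treated; the sketch below assumes one plausible policy, dropping 1.1.1.1 so the initial import is not counted twice and mapping Attic paths back to their original location. All names are hypothetical.

```python
def normalize_rcs_path(rcs_path):
    """Return (path, in_attic) for a repository path such as 'dir/Attic/f.c,v'."""
    parts = rcs_path.split("/")
    in_attic = len(parts) >= 2 and parts[-2] == "Attic"
    if in_attic:
        del parts[-2]  # put the file back in its original directory
    path = "/".join(parts)
    if path.endswith(",v"):
        path = path[:-2]  # strip the RCS file suffix
    return path, in_attic


def is_vendor_import(revision):
    """True for the first vendor-branch revision created by 'cvs import'."""
    return revision == "1.1.1.1"


def clean(revisions):
    """revisions: iterable of (rcs_path, revision, author) tuples."""
    cleaned = []
    for rcs_path, revision, author in revisions:
        if is_vendor_import(revision):
            continue  # assumed policy: avoid double-counting the import
        path, in_attic = normalize_rcs_path(rcs_path)
        cleaned.append({"path": path, "revision": revision,
                        "author": author, "in_attic": in_attic})
    return cleaned
```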


Postprocessing

  • Statistical information on the project
  • Graphs of software evolution, inequality, etc.
  • Heat maps for developer interaction
  • Social network analysis (for modules/directories/files and developers)
  • Developer statistics
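Inequality of contribution is commonly summarized with the Gini coefficient. As a minimal sketch, assuming the input is the number of commits per developer in one time slot, the standard formula over values sorted in ascending order, G = 2·Σ(i·x_i) / (n·Σx_i) − (n + 1)/n, can be computed as follows (the function name is illustrative).

```python
def gini(values):
    """Gini coefficient of non-negative values.

    0 means everybody contributes equally; with n contributors the maximum
    is (n - 1) / n, reached when a single one does all the work.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n
```

Computing this per time slot gives a series that can be plotted to see whether work concentrates in fewer hands as a project ages.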


GlueTheos: analysis of the evolution

  • Periodically retrieves the sources from CVS
  • Runs external programs:
      • Size measurement in SLOC (SLOCCount)
      • Authorship attribution (CODD)
      • Complexity measures (Halstead & McCabe)
      • Other (even language-specific) tools are also possible (wc, etc.)
  • Stores results in a database in order to make comparisons possible


Committers over time for ’Evolution’

[Figure: committers over time for ’Evolution’]


Commits over time for ’Evolution’

[Figure: commits over time for ’Evolution’]


Gini coefficient in ’Evolution’

[Figure: Gini coefficient in ’Evolution’]


’Generations’ in ’KOffice’

[Figure: “Generations” in KOffice]


File discrimination in KDE

[Figure: file discrimination in KDE]


File discrimination by developers correlated

[Figure: kdelibs heatmap (5th slot)]


File discrimination by developers correlated (II)

[Figure: kdelibs heatmap (9th slot)]


Different communities?

[Figure: translation vs development]


The Apache modules


Apache connection degree (modules network)

[Figure: four panels, each with a Degree axis, running from 2001 (top left) to 2004 (bottom right)]
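The slides do not state how the modules network is built. A common construction, assumed here, is an affiliation network in which two modules are connected when at least one developer has committed to both; the degree of a module is then the number of other modules it is connected to. The function names and the `(developer, module)` input layout are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def module_degrees(commits):
    """commits: iterable of (developer, module). Returns {module: degree}."""
    modules_by_developer = defaultdict(set)
    neighbours = defaultdict(set)
    for developer, module in commits:
        modules_by_developer[developer].add(module)
        neighbours[module]  # make sure isolated modules appear with degree 0
    for modules in modules_by_developer.values():
        # Every pair of modules sharing this developer gets an edge.
        for first, second in combinations(sorted(modules), 2):
            neighbours[first].add(second)
            neighbours[second].add(first)
    return {module: len(linked) for module, linked in neighbours.items()}


def degree_histogram(degrees):
    """Number of modules having each degree value."""
    return Counter(degrees.values())
```

Restricting the input to the commits of one year at a time would give one histogram per year.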


Structure of the Apache modules


Conclusions

  • The field of quantitative analysis of libre software projects is maturing
  • Large quantities of work can be automated
  • In the end, it is a problem of data mining
  • To advance with more complex studies, we need larger quantities of data from different sources, and tools to handle them
  • Integration with polls and surveys is fundamental

libresoft.dat.escet.urjc.es
