Using software trails to recover the evolution of software


3rd ELISA 2003. Daniel M. German, Software Engineering Group, University of Victoria, Canada. September 23, 2003. Version: 1.0.0


SLIDE 1

Using software trails to recover the evolution of software

3rd ELISA 2003 Daniel M. German Software Engineering Group University of Victoria, Canada

September 23, 2003 Version: 1.0.0

SLIDE 2

Introduction

  • As a project comes to depend on tools that are vital to its success, its history is recorded in software trails:
    – Configuration management systems (including version control and defect management systems)
    – Mailing lists
    – ChangeLogs

SLIDE 3

Evolution

  • The initial objective of this research was to try to recover the evolution of Evolution using its software trails
    – It is the Outlook of the GNOME project
    – Almost 4 years of development
    – It is becoming one of the free mail clients
    – Unlike many other OSS projects
      ∗ It started as a group project, with its software requirements drawn before the code was written
      ∗ It has been driven by one company: Ximian (recently bought by Novell)

SLIDE 4

Methodology

  • Define a schema that represents and correlates software trails
  • Gather the trails:
    – Recover the trails and map them to the schema
    – Trails are usually available as logs and history reports
  • Extend the information:
    – Combine the available information, creating new facts
    – It might require some heuristics
  • Analyze:
    – Using query languages and visualization tools
    – It is a time-consuming task
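
The four steps above can be sketched end to end with a small relational store. This is only an illustration under stated assumptions: the table names, column names, and sample rows are hypothetical and are not the schema used in the study. The "extend" step here derives a new fact (a CVS id paired with a ChangeLog email) whenever a revision and a ChangeLog entry touch the same file on the same day.

```python
import sqlite3

# Hypothetical schema: names and columns are illustrative assumptions.
SCHEMA = """
CREATE TABLE revision (
    file TEXT, rev TEXT, cvs_id TEXT, committed TEXT, log TEXT
);
CREATE TABLE changelog_entry (
    entry_date TEXT, email TEXT, file TEXT, text TEXT
);
"""

def build_db():
    """Step 1: define a schema that can hold more than one trail."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn

def load_sample(conn):
    """Step 2: gather the trails (made-up rows stand in for parsed logs)."""
    conn.executemany("INSERT INTO revision VALUES (?, ?, ?, ?, ?)", [
        ("mail/a.c", "1.2", "alice", "2003-01-10", "Fix crash"),
        ("mail/b.c", "1.5", "bob", "2003-01-11", "Refactor"),
    ])
    conn.executemany("INSERT INTO changelog_entry VALUES (?, ?, ?, ?)", [
        ("2003-01-10", "alice@example.org", "mail/a.c", "Fix crash on empty folder"),
    ])

def correlate(conn):
    """Steps 3 and 4: join two trails to derive a new fact, then query it."""
    return conn.execute("""
        SELECT DISTINCT r.cvs_id, c.email
        FROM revision r
        JOIN changelog_entry c
          ON r.file = c.file AND r.committed = c.entry_date
        ORDER BY r.cvs_id
    """).fetchall()
```

Calling `build_db()`, then `load_sample(conn)`, then `correlate(conn)` pairs the id `alice` with the one matching email. The same-file, same-day rule is itself a heuristic and would need tuning against real trails.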

SLIDE 5

Is this info useful?

  • The most important question: can we trust this information?
  • The answer: it depends
  • Some projects establish clear guidelines (and follow them) on how to use these tools.
    – IBM uses a Configuration Management System that tracks several trails
    – Many free/Open Source software projects use a toolkit based on CVS, Bugzilla, and mailman, following a set of de facto standards

SLIDE 6

Evolution Trails

  • This paper uses information from
    – ChangeLogs: “explain how earlier versions of software were different from the current version.”
    – CVS: Most popular version control system
      ∗ Keeps track of who modifies what, and when; supports branching
      ∗ It does not support transaction-oriented operations
    – Mailing lists
      ∗ For developers and for users
    – Source code releases
  • In several cases, it was necessary to reverse engineer their formats
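
As an illustration of what recovering one of these formats can involve, here is a minimal sketch of a ChangeLog parser. It assumes the common GNU-style layout (a header line with date, name, and email, followed by indented `* file (function): text` items); the slides do not say which layouts the project's ChangeLogs actually used, and the sample names in the usage note are invented.

```python
import re

# Header line, e.g. "2003-01-10  Alice Example  <alice@example.org>"
HEADER = re.compile(r"^(\d{4}-\d{2}-\d{2})\s+(.+?)\s+<([^>]+)>\s*$")
# Item line, e.g. "\t* mail/folder.c (open_folder): Handle empty folders."
ITEM = re.compile(r"^\s+\*\s+([^:(]+?)\s*(?:\(([^)]*)\))?:\s*(.*)$")

def parse_changelog(text):
    """Return a list of entries, each with its date, author, email, and changes.

    Wrapped continuation lines are ignored in this sketch.
    """
    entries = []
    current = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            current = {
                "date": header.group(1),
                "author": header.group(2),
                "email": header.group(3),
                "changes": [],
            }
            entries.append(current)
            continue
        item = ITEM.match(line)
        if item and current is not None:
            current["changes"].append({
                "file": item.group(1),
                "function": item.group(2),  # None when no "(function)" part
                "text": item.group(3),
            })
    return entries
```

For example, parsing a header for "Alice Example" followed by two indented items yields one entry with two changes. Real ChangeLogs tend to deviate from any single pattern, so a production parser would need fallbacks for the lines this one skips.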

SLIDE 7

The Challenge of Extending the Trails

  • It is difficult to correlate raw trails
  • For example, identifying developers:
    – CVS uses an id to record the developer
    – The ChangeLog lists his/her preferred email address
    – The mailing list might list his/her spam address or a commonly used address
    – Some changes come from non-CVS developers and are recorded only in the ChangeLogs
  • Nonetheless, they provide a gold mine of information to follow the evolution of a project
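
The identity problem described above can be approached with a simple heuristic plus a hand-maintained override table. The sketch below is an assumption, not the method used in the study: it matches an email to a CVS id when the local part of the address equals the id, consults a hypothetical `manual` mapping first, and reports everything else as unmatched (which is where contributors without CVS accounts would end up).

```python
def unify_identities(cvs_ids, emails, manual=None):
    """Map each CVS id to the set of email addresses believed to be the same person.

    manual: optional dict of lower-cased email -> CVS id, curated by hand.
    Returns (mapping, unmatched_emails).
    """
    manual = manual or {}
    known = {cvs_id.lower(): cvs_id for cvs_id in cvs_ids}
    mapping = {cvs_id: set() for cvs_id in cvs_ids}
    unmatched = set()
    for email in emails:
        key = email.lower()
        if key in manual:
            owner = manual[key]
        else:
            # Heuristic: the part before "@" equals the CVS id.
            owner = known.get(key.split("@", 1)[0])
        if owner in mapping:
            mapping[owner].add(email)
        else:
            unmatched.add(email)
    return mapping, unmatched
```

The local-part rule will both miss people (different handles) and occasionally merge distinct people who share a handle on different domains, so the unmatched set and the automatic matches both deserve manual review.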

SLIDE 8

Milestones of Evolution

Milestone                                        Date
Coding of camel starts                           1999-01-01
Evolution starts                                 1999-04-16
Ximian is established                            1999-10-01
Version 0.0                                      2000-05-10
Version 1.0                                      2001-11-21
Version 1.1.1                                    2002-09-09
Version 1.2.0                                    2002-11-07
LinuxWorld “Best Front Office Solution” award    2003-01-23
Version 1.3.1                                    2003-02-28

SLIDE 9

Size of the Distributions

[Figure: size (in MBytes) by month, 00/07 to 03/01. Series: size of version, size of source code, size of translations, size of ChangeLogs. Major releases marked.]

SLIDE 10

Size of the Distributions...

[Figure: LOCS and number of source files by month, 00/07 to 03/01. Series: LOCS, clean LOCS, total number of files. Major releases marked.]

SLIDE 11

How is the code base changing?

[Figure: new LOCS and new source files by month, 00/07 to 03/01. Series: new LOCS, new source files (right axis). Major releases marked.]

SLIDE 12

And the developers?

[Figure: MRs by date, 98/01 to 03/01. Series: MRs, code MRs. Marked events: Ximian starts operations; Release 0.0; Release 1.0; Release 1.1.1; Release 1.2; Release 1.3.1. Major and minor releases marked.]

SLIDE 13

Change in code base vs. contributors activity

[Figure: code MRs and LOCS added in release by date, 00/01 to 03/01. Series: MRs, new LOCS (right axis). Marked releases: 1.0, 1.1.1, 1.2, 1.3.1. Major releases marked.]

SLIDE 14

How many contributors?

[Figure: contributors' activity. Proportion of total MRs (log scale) against contributors (log scale, 1 to 128).]

SLIDE 15

Revisions per type of file

Extension    Prop.   Accum.   Number of files in CVS
.c           0.41    0.41     1195
ChangeLog    0.22    0.62     43
.h           0.13    0.75     1063
.am          0.05    0.81     174
.po          0.04    0.85     71
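
A table like the one above can be derived from a flat list of revised file paths. The following is a minimal sketch under assumptions: the input format (one path per revision) and the rule for classifying files (extension, or the bare file name when there is none, as with `ChangeLog`) are illustrative choices, not a description of how the original figures were produced.

```python
import os
from collections import Counter
from itertools import accumulate

def kind(path):
    """Classify a path by extension; files without one keep their base name."""
    base = os.path.basename(path)
    ext = os.path.splitext(base)[1]
    return ext or base

def revision_shares(revised_paths):
    """Return (kind, proportion, accumulated proportion) rows, largest share first.

    revised_paths: one entry per revision, so a file revised three times
    appears three times.
    """
    counts = Counter(kind(p) for p in revised_paths)
    total = sum(counts.values())
    ranked = counts.most_common()
    props = [n / total for _, n in ranked]
    return [(k, p, a) for (k, _), p, a in zip(ranked, props, accumulate(props))]
```

Because the accumulated column is summed from unrounded proportions, it can differ slightly from the sum of the rounded ones, which would explain why the table shows 0.62 rather than 0.41 + 0.22 = 0.63.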

SLIDE 16

Most files are rarely changed

[Figure: revisions to files. Prop. of rev. to a given code file (log scale) against files (log scale, 1 to 1000).]

SLIDE 17

Modules

[Figure: MRs per module (number of MRs for each module, axis 500 to 3000). Modules shown: mail, camel, calendar, addressbook, shell, widgets, composer, e-util, filter, my-evolution, tests, libical, libibex, executive-summary, wombat, importers, im, libversit, notes, tools, libwombat, cmdline, ebook.]

SLIDE 18

Evolution of the size of the modules

[Figure: LOCS by date, 00/07 to 03/01, for the modules camel, calendar, mail, addressbook, shell, libical, widgets. Major releases marked.]

SLIDE 19

Changes are usually localized in a given module

[Figure: number of modules in a codeMR. Number of codeMRs (log scale) against number of modules in a codeMR (0 to 16).]
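
Since CVS records revisions per file and has no transactions, counting modules per MR first requires grouping file revisions into MRs. The slides do not state how this was done; the sketch below assumes one common heuristic (same author, same log message, each revision within a time window of the previous one) and assumes that a module is the top-level directory of a path. The function names, the tuple layout, and the five-minute window are all illustrative.

```python
from collections import Counter
from datetime import datetime, timedelta

def group_into_mrs(revisions, window=timedelta(minutes=5)):
    """Group (author, log, timestamp, path) tuples into MRs.

    A revision joins the author's latest MR with the same log message if it
    falls within `window` of that MR's most recent revision (sliding window).
    """
    mrs = []
    latest = {}  # (author, log) -> index of that pair's most recent MR
    for author, log, ts, path in sorted(revisions, key=lambda r: r[2]):
        key = (author, log)
        idx = latest.get(key)
        if idx is not None and ts - mrs[idx]["last"] <= window:
            mrs[idx]["paths"].append(path)
            mrs[idx]["last"] = ts
        else:
            mrs.append({"author": author, "log": log, "last": ts, "paths": [path]})
            latest[key] = len(mrs) - 1
    return mrs

def modules_per_mr(mrs):
    """Histogram: number of distinct modules touched -> number of MRs.

    Assumption: the module is the first path component; a file at the
    repository root would count as its own module.
    """
    return Counter(len({p.split("/", 1)[0] for p in mr["paths"]}) for mr in mrs)
```

The window size trades off two errors: too short splits one logical change into several MRs, too long merges unrelated commits that happen to share a message.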

SLIDE 20

Developers tend to concentrate in one module

Mod        Developers   Id        Prop   Acc
shell      17           ettore    0.65   0.65
                        danw      0.11   0.76
                        toshok    0.05   0.81
                        clahey    0.04   0.84
                        zucchi    0.03   0.87
mail       19           fejj      0.52   0.52
                        rodo      0.13   0.65
                        zucchi    0.12   0.77
                        ettore    0.07   0.83
                        danw      0.06   0.89
calendar   17           jpr       0.40   0.40
                        rodrigo   0.32   0.72
                        ettore    0.07   0.79
                        danw      0.06   0.85
                        damon     0.03   0.88

SLIDE 21

Observations

  • One software trail does not tell the whole story
  • Schema evolution
  • Informal structure in trails
  • Information overload and the need for analysis and visualization tools
  • Quality of software trails

SLIDE 22

Quality of Trails

  • Some projects keep better trails than others.
  • One hypothesis: trail quality is a measure of:
    – The number of developers,
    – their dislocation,
    – and the maturity of the project.

SLIDE 23

Conclusions and Future Work

  • Extracting and correlating software trails can tell a detailed story of how a software project has evolved
  • But it comes at a cost: too much information to analyze
  • What is needed:
    – Creation of standardized schemas
    – More tools to recover and enhance the trails
    – Heuristics to automatically discover “interesting” facts
    – Metrics to quantify trails
