Mining Software Repositories, Session 1: Infrastructure and Extraction

SLIDE 1

Mining Software Repositories

Session 1

Infrastructure and extraction

Discussion Leader: Daniel M. German

SLIDE 2

The Stages

  • 1. Data Extraction
  • 2. Data Mining/Facts Finding/Change Patterns/System Understanding
  • 3. Integration and Presentation

SLIDE 3

The Extraction Stage

  • The dirty work, but somebody has to do it
  • Lots of raw data out there

    – Usually open source
    – Difficult to gain access to closed source data

SLIDE 4

The Issues

  • Why do we need to extract historical data?
  • Without a purpose, this data might have no value

SLIDE 5

The Issues...

  • What to extract? (software trails)

    – Code
        ∗ Releases
        ∗ Versioning history
    – Defects
    – Documentation
        ∗ Explicit (man pages, help system, design documents)
        ∗ Implicit (email messages)
        ∗ Web site

SLIDE 6

The Issues...

  • From where?
    – What projects to select?
    – The software process might have an impact on the way the historical data gets recorded
    – It is necessary to understand this process
    – Different projects store data in different ways

SLIDE 7

The Papers

  • The Perils and Pitfalls of Mining SourceForge, by James Howison and Kevin Crowston
  • Their experiences mining SourceForge
  • What they learnt spidering the site
  • Some potential mistakes in the analysis of the extracted data

SLIDE 8

The Papers...

  • Text is Software Too, by Alexander Dekhtyar, Jane Huffman Hayes and Tim Menzies
  • Mining of textual requirements documents
  • “Text mining from software engineering text is a high risk, high return adventure.”

SLIDE 9

The Papers...

  • Mining CVS Repositories, the softChange experience, by Daniel German
  • The revision history of the source code says a lot about the project:
    – it highlights the process, the architecture evolution, hidden relationships between files...
  • The Concurrent Versions System (CVS) is a major source of historical data
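
As a rough illustration of how "hidden relationships between files" might be surfaced from CVS history, the sketch below groups per-file revisions into commit-like transactions and counts which files change together. It is not taken from any of the papers: the revision records are invented sample data, the 200-second window is an assumed threshold, and the helper names `group_transactions` and `co_change_counts` are hypothetical. The grouping step is needed because CVS records revisions per file rather than per commit.

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-file revision records, as one might parse them from
# `cvs log` output: (file, author, log message, time in seconds).
revisions = [
    ("src/parser.c", "alice", "Fix token bug", 1000),
    ("src/lexer.c", "alice", "Fix token bug", 1012),
    ("doc/README", "bob", "Update docs", 5000),
    ("src/parser.c", "alice", "Refactor", 9000),
    ("src/lexer.c", "alice", "Refactor", 9030),
    ("src/ast.c", "alice", "Refactor", 9045),
]

# Assumed maximum gap (in seconds) between consecutive revisions that
# still belong to the same transaction.
WINDOW = 200


def group_transactions(revs, window=WINDOW):
    """Group revisions with the same author and log message, close in time."""
    transactions = []
    last = {}  # (author, message) -> (time of latest revision, its transaction)
    for file, author, msg, t in sorted(revs, key=lambda r: r[3]):
        key = (author, msg)
        if key in last and t - last[key][0] <= window:
            txn = last[key][1]  # extend the open transaction
        else:
            txn = set()  # start a new transaction
            transactions.append(txn)
        txn.add(file)
        last[key] = (t, txn)
    return transactions


def co_change_counts(transactions):
    """Count how often each pair of files appears in the same transaction."""
    counts = Counter()
    for files in transactions:
        for pair in combinations(sorted(files), 2):
            counts[pair] += 1
    return counts


transactions = group_transactions(revisions)
pairs = co_change_counts(transactions)
print(pairs.most_common())
```

File pairs with high counts are candidates for a hidden relationship. A real study would also need to filter out very large transactions, such as license header updates, which couple many unrelated files.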

SLIDE 10

The Papers...

  • Research Infrastructure for Empirical Science of F/OSS, by Les Gasser, Gabriel Ripoche and Robert Sandusky
  • Preprocessing CVS Data for Fine-Grained Analysis, by Thomas Zimmermann and Peter Weissgerber

SLIDE 11

Discussion: the Issues, revisited

  • Several people are working on the same problems
    – Comparison?
    – Collaboration? (Avoid reinventing the wheel)

  • Nomenclature?
  • Choosing projects for analysis?
  • Sharing data?
  • Sharing the extractors?
