
Mining Software Repositories, Session 1: Infrastructure and extraction



  1. Mining Software Repositories, Session 1: Infrastructure and extraction
     Discussion Leader: Daniel M. German

  2. The Stages
     1. Data Extraction
     2. Data Mining/Facts Finding/Change Patterns/System Understanding
     3. Integration and Presentation

  3. The Extraction Stage
     • The dirty work, but somebody has to do it
     • Lots of raw data out there
       – Usually Open Source
       – Difficult to gain access to Closed source data

  4. The Issues
     • Why do we need to extract historical data?
     • Without a purpose, this data might have no value

  5. The Issues...
     • What to extract? (software trails)
       – Code
         ∗ Releases
         ∗ Versioning history
       – Defects
       – Documentation
         ∗ Explicit (man pages, help system, design documents)
         ∗ Implicit (email messages)
         ∗ Web site

  6. The Issues...
     • From Where
       – What projects to select?
       – The software process might have an impact on the way the historical data gets recorded
       – It is necessary to understand this process
       – Different projects store data in different ways

  7. The Papers
     • The Perils and Pitfalls of Mining SourceForge by James Howison and Kevin Crowston
     • Their experiences mining SourceForge
     • What they learnt spidering the site
     • Some potential mistakes in the analysis of the extracted data

  8. The Papers...
     • Text is Software Too by Alexander Dekhtyar, Jane Huffman Hayes and Tim Menzies
     • Mining of textual requirements documents
     • “Text mining from software engineering text is a high risk, high return adventure.”

  9. The Papers...
     • Mining CVS Repositories, the softChange experience by Daniel German
     • The revision history of the source code says a lot about the project:
       – it highlights the process, the architecture evolution, hidden relationships between files...
     • The Concurrent Versions System (CVS) is a major source of historical data

  10. The Papers...
     • Research Infrastructure for Empirical Science of F/OSS by Les Gasser, Gabriel Ripoche and Robert Sandusky
     • Preprocessing CVS Data for Fine-Grained Analysis by Thomas Zimmermann and Peter Weißgerber

  11. Discussion: the Issues, revisited
     • Several people are working on the same problems
       – Comparison?
       – Collaboration? (Avoid reinventing the wheel)
     • Nomenclature?
     • Choosing projects for analysis?
     • Sharing data?
     • Sharing the extractors?
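
Slide 9 notes that the revision history can expose hidden relationships between files. CVS records revisions per file rather than as atomic multi-file commits, so one common way to recover such relationships is to first group file revisions into likely commits and then count which files change together. The following is only a minimal sketch of that idea, not the method of any of the papers listed above: the `Revision` record, the function names, and the 200-second window are all assumptions chosen for illustration, and the input is assumed to have already been parsed from the repository log.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Revision:
    """One per-file revision (hypothetical record, already parsed from a log)."""
    path: str       # file touched by this revision
    author: str     # committer name
    message: str    # log message
    timestamp: int  # seconds since the epoch


def group_transactions(revisions, window=200):
    """Group per-file revisions into likely commits.

    Revisions with the same author and log message are merged as long as
    each one falls within `window` seconds of the previous one in the group.
    The window size is an assumption and should be tuned per project.
    """
    transactions = []
    open_groups = {}  # (author, message) -> (last timestamp, group list)
    for rev in sorted(revisions, key=lambda r: r.timestamp):
        key = (rev.author, rev.message)
        last = open_groups.get(key)
        if last is not None and rev.timestamp - last[0] <= window:
            last[1].append(rev)
            open_groups[key] = (rev.timestamp, last[1])
        else:
            group = [rev]
            transactions.append(group)
            open_groups[key] = (rev.timestamp, group)
    return transactions


def co_change_counts(transactions):
    """Count how often each unordered pair of files changes in the same group."""
    counts = Counter()
    for group in transactions:
        paths = sorted({rev.path for rev in group})
        counts.update(combinations(paths, 2))
    return counts
```

Pairs with a high count are candidates for a hidden relationship. Whether a given count is meaningful depends on the project's process, which is the point slide 6 makes about understanding how the historical data gets recorded.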
