Sourcerer: An Infrastructure for Large-scale Collection and Analysis of Open-source Code

Sushil Bajracharya, Joel Ossher, Cristina Lopes. Donald Bren School of Information and Computer Sciences


  1. Sourcerer: An Infrastructure for Large-scale Collection and Analysis of Open-source Code
     Sushil Bajracharya, Joel Ossher, Cristina Lopes
     Donald Bren School of Information and Computer Sciences
     University of California, Irvine
     jossher@uci.edu

  2. SOURCERER

  3. Sourcerer’s Inception
     • Started in 2005
     • Motivation
       – Explore the use of structural information for code retrieval
       – Enable data mining on large quantities of source code
     • Target: Open-source Java code
       – Open Source movement provides a large quantity of high quality code
       – Java is popular, and amenable to static analysis

  4. Sourcerer Today
     • Collection of loosely coupled Java tools
       – www.github.com/sourcerer/Sourcerer
     • Aggregated repository of open source code
       – www.ics.uci.edu/~lopes/datasets/index.html
     • Services
       – http://sourcerer.ics.uci.edu/services/
     • Applications

  5. Layered Architecture
     [Diagram: layers labeled Models, Applications, Services, Stored Content, Tools]

  6. Tools and Stored Content
     [Diagram: Internet, Crawler, Repository Creator, File Repository, Repository Manager, SourcererDB, Code Indexer, Search Index]

  7. Code Crawler
     • Input
       – List of seed pages
     • Output
       – List of project pages
     • Plugin-based
       – Sourceforge
       – Java.net
       – Tigris
       – Google Code Hosting
       – Apache
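
The plugin-based crawler described on this slide could be sketched as follows. This is a minimal illustration, not Sourcerer's actual code: the `CrawlerPlugin` interface, the `FixedHostPlugin` stand-in, the `CrawlerSketch` class, and the `example` host names are all hypothetical, and the plugins return canned results instead of fetching pages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical plugin contract: one implementation per hosting site.
interface CrawlerPlugin {
    // True if this plugin knows how to crawl the given seed page.
    boolean handles(String seedUrl);

    // Returns the project pages reachable from the seed page.
    List<String> findProjectPages(String seedUrl);
}

// Stand-in plugin that returns canned results instead of fetching pages.
class FixedHostPlugin implements CrawlerPlugin {
    private final String host;
    private final List<String> projectPages;

    FixedHostPlugin(String host, List<String> projectPages) {
        this.host = host;
        this.projectPages = projectPages;
    }

    @Override
    public boolean handles(String seedUrl) {
        return seedUrl.contains(host);
    }

    @Override
    public List<String> findProjectPages(String seedUrl) {
        return projectPages;
    }
}

class CrawlerSketch {
    // Sends each seed page to the first plugin that accepts it.
    // Seed pages that no plugin accepts are skipped.
    static List<String> crawl(List<String> seedPages, List<CrawlerPlugin> plugins) {
        List<String> projectPages = new ArrayList<>();
        for (String seed : seedPages) {
            for (CrawlerPlugin plugin : plugins) {
                if (plugin.handles(seed)) {
                    projectPages.addAll(plugin.findProjectPages(seed));
                    break;
                }
            }
        }
        return projectPages;
    }

    public static void main(String[] args) {
        // Host names and project URLs are made up for the example.
        List<CrawlerPlugin> plugins = Arrays.<CrawlerPlugin>asList(
                new FixedHostPlugin("alpha.example",
                        Arrays.asList("http://alpha.example/projects/one")),
                new FixedHostPlugin("beta.example",
                        Arrays.asList("http://beta.example/projects/two")));
        List<String> seeds = Arrays.asList(
                "http://alpha.example/list", "http://beta.example/list");
        System.out.println(crawl(seeds, plugins));
    }
}
```

With this shape, supporting another hosting site means adding one more `CrawlerPlugin` implementation; the dispatch loop stays unchanged.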

  8. File Repository
     • Local aggregated repository
     • Repository Creator
       – Input
         • List of project pages
       – Output
         • Populated file repository
     • Repository Manager
       – Housekeeping tasks

  9. Feature Extractor
     • Input
       – File Repository
     • Output
       – Files containing entities and relations
     • Entity-relationship metamodel
     • Headless Eclipse plugin
       – Uses Eclipse Java development tools (JDT)
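
To make the entity-relationship idea concrete, here is a small sketch of how extracted entities and relations might be represented. The `EntityKind` and `RelationKind` values, the class names, and the `demo.Greeter` sample are illustrative assumptions, not Sourcerer's actual metamodel or file format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative kinds only; the real metamodel is not shown in the slides.
enum EntityKind { CLASS, METHOD }

enum RelationKind { INSIDE, CALLS }

// A code element identified by its kind and fully qualified name.
final class Entity {
    final EntityKind kind;
    final String fqn;

    Entity(EntityKind kind, String fqn) {
        this.kind = kind;
        this.fqn = fqn;
    }
}

// A directed, typed link from one entity to another.
final class Relation {
    final RelationKind kind;
    final Entity source;
    final Entity target;

    Relation(RelationKind kind, Entity source, Entity target) {
        this.kind = kind;
        this.source = source;
        this.target = target;
    }
}

class MetamodelSketch {
    // Returns the fully qualified names of all entities that `source`
    // points to through a relation of the given kind.
    static List<String> targetsOf(List<Relation> relations, Entity source, RelationKind kind) {
        List<String> result = new ArrayList<>();
        for (Relation r : relations) {
            if (r.source == source && r.kind == kind) {
                result.add(r.target.fqn);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Entity greeter = new Entity(EntityKind.CLASS, "demo.Greeter");
        Entity greet = new Entity(EntityKind.METHOD, "demo.Greeter.greet()");
        Entity format = new Entity(EntityKind.METHOD, "demo.Greeter.format()");
        List<Relation> relations = Arrays.asList(
                new Relation(RelationKind.INSIDE, greet, greeter),
                new Relation(RelationKind.INSIDE, format, greeter),
                new Relation(RelationKind.CALLS, greet, format));
        // Which methods does greet() call?
        System.out.println(targetsOf(relations, greet, RelationKind.CALLS));
    }
}
```

Storing relations as typed edges between entities is what lets structural questions (who calls what, what is declared where) be answered without re-parsing source files.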

  10. SourcererDB
     • MySQL database
     • Database importer
       – Incremental
       – Parallel
       – Input
         • Feature extractor output
       – Output
         • SourcererDB

  11. Search Index
     • Text search for code entities
     • Apache Solr
       – Search platform for Lucene
     • Code Indexer
       – Heavily parallel
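
As a rough illustration of text search over code entities, the sketch below builds a request URL for Solr's standard `select` handler. The core name `entities` and the field name `fqn` are assumptions made for the example, as is the per-core URL layout of current Solr releases; the slides do not describe Sourcerer's actual index schema or deployment.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SolrQuerySketch {
    // Builds a URL for Solr's select handler that searches one field for one term.
    // `core` and `field` are hypothetical names, not Sourcerer's schema.
    static String selectUrl(String baseUrl, String core, String field, String term, int rows) {
        String query = field + ":" + term;
        try {
            return baseUrl + "/" + core + "/select"
                    + "?q=" + URLEncoder.encode(query, "UTF-8")
                    + "&rows=" + rows
                    + "&wt=json";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // 8983 is Solr's default port; the host is a local placeholder.
        System.out.println(
                selectUrl("http://localhost:8983/solr", "entities", "fqn", "HashMap", 10));
    }
}
```

The term is URL-encoded before being placed in the query string, so characters such as `:` or spaces in a search term do not break the request.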

  12. Stored Contents Recap
     [Diagram: File Repository, SourcererDB, Search Index]

  13. Sourcerer Services
     • Repository Access
       – Look up text matching SourcererDB entities / relations
     • Relational Query
       – Direct access to SourcererDB
     • Code Search Service
       – Access the Lucene index
     • Dependency Slicing

  14. Applications
     • Sourcerer Code Search Engine
       – sourcerer.ics.uci.edu/sourcerer/search/index.jsp
     • CodeGenie
       – Test-driven code search
     • Sourcerer API Search
       – Demo!

  15. LESSONS LEARNED

  17. Lesson One: Reuse
     • Feature extractor 1.0
       – Corollary: javac
     • Code crawler woes

  18. Lesson Two: Performance & Scalability
     • Research prototype
     • Jars directory
     • Repository migration

  19. Lesson Three: Loose Coupling
     • Sourcerer M1
     • CASI

  20. Lesson Four: YCMEH
     • You can’t make everyone happy
       – Why only Java?
       – Why no X project or Y repository?
       – Why no versioning information?
       – …
       – If you try, no one will be happy (since your tool will never be released)

  21. Thank you!
     • Contact: jossher@uci.edu
