Sourcerer: An Infrastructure for Large-scale Collection and Analysis of Open-source Code

SLIDE 1

Sourcerer: An Infrastructure for Large-scale Collection and Analysis of Open-source Code

Sushil Bajracharya, Joel Ossher, Cristina Lopes

Donald Bren School of Information and Computer Sciences
University of California, Irvine

jossher@uci.edu

SLIDE 2

SOURCERER

SLIDE 3

Sourcerer’s Inception

  • Started in 2005
  • Motivation

– Explore the use of structural information for code retrieval
– Enable data mining on large quantities of source code

  • Target: Open-source Java code

– The open source movement provides a large quantity of high-quality code
– Java is popular and amenable to static analysis

SLIDE 4

Sourcerer Today

  • Collection of loosely coupled Java tools

– www.github.com/sourcerer/Sourcerer

  • Aggregated repository of open source code

– www.ics.uci.edu/~lopes/datasets/index.html

  • Services

– http://sourcerer.ics.uci.edu/services/

  • Applications

SLIDE 5

Layered Architecture

[Diagram: layers labeled Applications, Services, Tools, Stored Content, and Models]

SLIDE 6

Tools and Stored Content

[Diagram: Internet, Crawler, Repository Creator, Repository Manager, File Repository, Code Indexer, Search Index, SourcererDB]

SLIDE 7

Code Crawler

  • Input

– List of seed pages

  • Output

– List of project pages

  • Plugin-based

– Sourceforge
– Java.net
– Tigris
– Google Code Hosting
– Apache
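A minimal sketch of the plugin idea on this slide, assuming each plugin is a function that maps one seed page to a list of project pages. The names `crawl`, `PLUGINS`, the two plugin functions, and the `.example` hosts are hypothetical and not taken from Sourcerer; a real plugin would fetch and parse the seed page instead of building URLs.

```python
from typing import Callable, Dict, List
from urllib.parse import urlparse

# Hypothetical plugin type: takes a seed URL, returns project page URLs.
Plugin = Callable[[str], List[str]]


def sourceforge_plugin(seed: str) -> List[str]:
    # Placeholder logic standing in for fetching and parsing the seed page.
    base = seed.rstrip("/")
    return [f"{base}/projects/example-a", f"{base}/projects/example-b"]


def apache_plugin(seed: str) -> List[str]:
    # Placeholder logic, as above.
    return [f"{seed.rstrip('/')}/example-c"]


# Registry keyed by host name; the hosts are illustrative only.
PLUGINS: Dict[str, Plugin] = {
    "sourceforge.example": sourceforge_plugin,
    "apache.example": apache_plugin,
}


def crawl(seeds: List[str]) -> List[str]:
    """Map each seed page to project pages using the plugin for its host."""
    project_pages: List[str] = []
    for seed in seeds:
        plugin = PLUGINS.get(urlparse(seed).netloc)
        if plugin is None:
            continue  # no plugin registered for this host, so skip the seed
        project_pages.extend(plugin(seed))
    return project_pages
```

Supporting another hosting site in this sketch means registering one more function in `PLUGINS`; `crawl` itself stays unchanged.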

SLIDE 8

File Repository

  • Local aggregated repository
  • Repository Creator

– Input: list of project pages
– Output: populated file repository

  • Repository Manager

– Housekeeping tasks

SLIDE 9

Feature Extractor

  • Input

– File Repository

  • Output

– Files containing entities and relations

  • Entity-relationship metamodel
  • Headless Eclipse plugin

– Uses Eclipse Java development tools (JDT)
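To make "files containing entities and relations" concrete, here is a small sketch of an entity-relationship model and a line-oriented output format. The class names, the kind strings (`CLASS`, `METHOD`, `INSIDE`, `CALLS`), and the tab-separated layout are assumptions for illustration; the slides do not specify Sourcerer's actual metamodel or file format.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Entity:
    kind: str  # illustrative kinds, e.g. "CLASS" or "METHOD"
    fqn: str   # fully qualified name of the code element


@dataclass(frozen=True)
class Relation:
    kind: str    # illustrative kinds, e.g. "INSIDE" or "CALLS"
    source: str  # fqn of the entity the relation starts from
    target: str  # fqn of the entity the relation points to


def to_lines(entities: Iterable[Entity], relations: Iterable[Relation]) -> List[str]:
    """Serialize to tab-separated lines: entities first, then relations."""
    lines = [f"E\t{e.kind}\t{e.fqn}" for e in entities]
    lines += [f"R\t{r.kind}\t{r.source}\t{r.target}" for r in relations]
    return lines


entities = [Entity("CLASS", "a.Foo"), Entity("METHOD", "a.Foo.bar")]
relations = [Relation("INSIDE", "a.Foo.bar", "a.Foo")]
output = to_lines(entities, relations)
```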

SLIDE 10

SourcererDB

  • MySQL database
  • Database importer

– Incremental
– Parallel
– Input: feature extractor output
– Output: SourcererDB
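A rough sketch of what an incremental import could look like: each project is loaded in its own transaction and skipped if it was already imported. It uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs without a server, and the table and function names are hypothetical. The parallel aspect mentioned on the slide is not shown.

```python
import sqlite3
from typing import Iterable, Tuple


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with an assumed, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entities (project TEXT, kind TEXT, fqn TEXT, "
        "PRIMARY KEY (project, fqn))"
    )
    conn.execute("CREATE TABLE imported (project TEXT PRIMARY KEY)")
    return conn


def import_project(
    conn: sqlite3.Connection, project: str, rows: Iterable[Tuple[str, str]]
) -> bool:
    """Import (kind, fqn) rows for one project; return False if already imported."""
    already = conn.execute(
        "SELECT 1 FROM imported WHERE project = ?", (project,)
    ).fetchone()
    if already:
        return False
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany(
            "INSERT INTO entities VALUES (?, ?, ?)",
            [(project, kind, fqn) for kind, fqn in rows],
        )
        conn.execute("INSERT INTO imported VALUES (?)", (project,))
    return True
```

Recording finished projects in a separate table is one simple way to make reruns cheap: a second call for the same project returns without touching the entity table.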

SLIDE 11

Search Index

  • Text search for code entities
  • Apache Solr

– Search platform for Lucene

  • Code Indexer

– Heavily parallel
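The idea of text search over code entities can be illustrated with a toy inverted index. This is a plain-Python stand-in, not Solr or Lucene, and the tokenization rule (split fully qualified names on dots and camel-case boundaries) is an assumption rather than something stated on the slide.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Split on non-letters and camel-case boundaries, then lowercase."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", text)
    return [p.lower() for p in parts]


def build_index(fqns: List[str]) -> Dict[str, Set[str]]:
    """Map each term to the set of entity names that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for fqn in fqns:
        for term in tokenize(fqn):
            index[term].add(fqn)
    return index


def search(index: Dict[str, Set[str]], query: str) -> List[str]:
    """Return names containing every query term (AND semantics), sorted."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)


index = build_index([
    "org.example.HttpClient.sendRequest",
    "org.example.XMLParser.parse",
])
```

With this index, `search(index, "http request")` matches only the `sendRequest` method, because both terms must occur in the entity name.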

SLIDE 12

Stored Contents Recap

[Diagram: File Repository, SourcererDB, Search Index]

SLIDE 13

Sourcerer Services

  • Repository Access

– Look up text matching SourcererDB entities / relations

  • Relational Query

– Direct access to SourcererDB

  • Code Search Service

– Access the Lucene index

  • Dependency Slicing
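The slide names dependency slicing without describing it. One plausible reading, sketched here as an assumption, is computing everything a chosen entity transitively depends on by following stored relations from source to target. The function name and the pair-based relation format are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def dependency_slice(relations: Iterable[Tuple[str, str]], start: str) -> Set[str]:
    """Return start plus every entity reachable via source -> target edges."""
    graph: Dict[str, List[str]] = {}
    for source, target in relations:
        graph.setdefault(source, []).append(target)

    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first traversal; cycles are handled by `seen`
        for nxt in graph.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


relations = [("A", "B"), ("B", "C"), ("C", "A"), ("D", "A")]
slice_of_a = dependency_slice(relations, "A")
```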

SLIDE 14

Applications

  • Sourcerer Code Search Engine

– sourcerer.ics.uci.edu/sourcerer/search/index.jsp

  • CodeGenie

– Test-driven code search

  • Sourcerer API Search

– Demo!

SLIDE 15

LESSONS LEARNED

SLIDE 17

Lesson One: Reuse

  • Feature extractor 1.0

– Corollary: javac

  • Code crawler woes

SLIDE 18

Lesson Two: Performance & Scalability

  • Research prototype
  • Jars directory
  • Repository migration

SLIDE 19

Lesson Three: Loose Coupling

  • Sourcerer M1
  • CASI

SLIDE 20

Lesson Four: YCMEH

  • You can’t make everyone happy

– Why only Java?
– Why no X project or Y repository?
– Why no versioning information?
– …
– If you try, no one will be happy (since your tool will never be released)

SLIDE 21

Thank you!

  • Contact: jossher@uci.edu
