Evolution of GitHub Repositories



SLIDE 1

COMP 5900 Project Presentation

Evolution of GitHub Repositories

Aseel Awdeh

School of Computer Science University of Ottawa, Ottawa, Canada araed104@uottawa.ca

Kalonji Kalala

School of Computer Science University of Ottawa, Ottawa, Canada hkalo081@uottawa.ca

SLIDE 2

Outline

  • Background
  • Motivation
  • Research Questions
  • Challenges
  • Methodology
  • Results
  • Future Work and Implications
SLIDE 3

Background

  • GitHub:
      • Collaborative code hosting
      • Collaborative code review
      • Integrated issue tracking
      • Social features
  • Over 10 million Git repositories and 5 million developers.
  • Largest code hosting site in the world.
  • Important source of software artifacts on the Internet.

SLIDE 4

Background

Source: http://www.dataschool.io/

SLIDE 5

Motivation

  • Increasing number of projects and users on GitHub.
  • Has surpassed older forges (e.g., SourceForge) in size and popularity.
  • Prior research has focused on:
      • GitHub's event logs.
      • Effects of branching and pull-based software development.
      • Social nature of GitHub.
  • No studies on the evolution of GitHub repositories.
SLIDE 6

Research Questions

  • RQ1: How do the projects evolve?
  • RQ2: How does the popularity of projects change over time?
  • RQ3: What is the health of the projects?

SLIDE 7

Methodology

  • 1. Data Collection
  • 2. Data Extraction
  • 3. Data Analysis
SLIDE 8

Dataset Collection Challenges

  • MSR 2016 challenges faced.
  • GHTorrent dataset challenge.
  • GitHub allows access to its internal data through a REST API.
  • GHTorrent gathers event streams and data from GitHub.
  • Used the MSR 2014 dataset.
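
As a rough illustration of the REST API access mentioned above, the sketch below builds the URL of GitHub's repository endpoint and reduces a repository payload to a few fields. The helper names (`repo_url`, `summarize_repo`, `fetch_repo`) and the choice of fields are assumptions made for this example; they are not the collection code used in the project. Unauthenticated requests to the API are rate limited, so `fetch_repo` is only suitable for small experiments.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def repo_url(owner: str, name: str) -> str:
    """Return the REST endpoint that describes a single repository."""
    return f"{API_ROOT}/repos/{owner}/{name}"


def summarize_repo(payload: dict) -> dict:
    """Keep a few fields of a repository payload (field choice is illustrative)."""
    return {
        "full_name": payload.get("full_name"),
        "language": payload.get("language"),
        "stars": payload.get("stargazers_count"),
        "open_issues": payload.get("open_issues_count"),
        "created_at": payload.get("created_at"),
    }


def fetch_repo(owner: str, name: str) -> dict:
    """Download one repository description; requires network access."""
    request = urllib.request.Request(
        repo_url(owner, name),
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "repo-evolution-sketch",  # hypothetical client name
        },
    )
    with urllib.request.urlopen(request) as response:
        return summarize_repo(json.load(response))
```

For example, `fetch_repo("rails", "rails")` would request `https://api.github.com/repos/rails/rails` and return the summarized fields.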
SLIDE 9

MSR 2014 Dataset

  • Top 10 software projects for each of the top programming languages on GitHub.
  • Resulting in 90 projects.
  • Covers each project from its year of creation until 2013.
  • Some characteristics of each project:
      • issues
      • pull requests
      • followers
      • stars
      • commits
SLIDE 10

Dataset Challenges

  • Limited dataset.
  • Selection of projects.
  • Tables provided for each project (project_language table).
  • Incomplete attributes for some tables.
  • Not representative of GitHub's full historical dataset.

SLIDE 11

Factors

  • Each project:
      • Commits.
      • Issues.
      • Pull Requests.
      • Committers.
      • Watchers.
      • Language.
SLIDE 12

Overall Evolution

SLIDE 13

Project Growth

[Figure: Number of Projects (y-axis, ticks 10 to 100) by month, grouped by Year from 2008 to 2012 (x-axis).]

SLIDE 14

Commits Growth

[Figure: Number of Commits (y-axis, ticks 2000 to 18000) by month, grouped by Year from 2003 - 2004 to 2013 - 2014 (x-axis).]
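
A monthly commit series like the one plotted on this slide can be derived by bucketing commit timestamps by year and month. The following is a minimal sketch, assuming timestamps arrive as `YYYY-MM-DD HH:MM:SS` strings; the function name and the input format are illustrative assumptions, not taken from the dataset.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp layout; adjust to match the actual commit table.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def commits_per_month(timestamps):
    """Count commits per (year, month), returned in chronological order."""
    counts = Counter()
    for raw in timestamps:
        when = datetime.strptime(raw, TIMESTAMP_FORMAT)
        counts[(when.year, when.month)] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    sample = [
        "2012-02-01 10:00:00",
        "2012-02-15 09:30:00",
        "2011-11-03 12:00:00",
    ]
    for (year, month), total in commits_per_month(sample).items():
        print(f"{year}-{month:02d}: {total}")
```

The same bucketing applies to committers or watchers if the counter is fed distinct user identifiers per month instead of one entry per commit.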

SLIDE 15

Committers Growth

[Figure: Number of Committers (y-axis, ticks 200 to 2000) by month, grouped by Year from 2003 - 2004 to 2013 - 2014 (x-axis).]

SLIDE 16

Growth of projects in terms of...

SLIDE 17

Number of commits per project

[Figure: Number of Commits (y-axis, ticks 5000 to 50000) per project (x-axis, Projects from accessible-boilerplate to zf2).]

SLIDE 18

Number of issues per project

[Figure: Number of Issues (y-axis, ticks 5000 to 30000) per project (x-axis, Projects from ActionBarSherlock to zf2).]

SLIDE 19

Number of committers per project

[Figure: Number of Committers (y-axis, ticks 1000 to 6000) per project (x-axis, Projects from accessible-boilerplate to zf2).]

SLIDE 20

Popularity

SLIDE 21

Watchers Growth

[Figure: Number of Watchers (y-axis, ticks 5000 to 25000) by month, grouped by Year from 2008 - 2009 to 2013 - 2014 (x-axis).]

SLIDE 22

Number of watchers per project

[Figure: Number of watchers (y-axis, ticks 2000 to 20000) per project (x-axis, Projects).]

SLIDE 23

Initial Data Analysis

Pearson correlation:

  • Watchers and Commits: 0.12411
  • Watchers and Committers: 0.08331
  • Watchers and Issues: 0.0308
  • Commits and Issues: 0.765961
  • Commits and Committers: 0.79765
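
The coefficients on this slide are Pearson correlations between pairs of per-project counts. A minimal sketch of how such a coefficient is computed is given below; the function name and the toy inputs are assumptions for illustration, and the slide's values come from the dataset, not from this code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariation / math.sqrt(spread_x * spread_y)


if __name__ == "__main__":
    # Hypothetical per-project counts, one entry per project.
    commits = [120, 450, 900, 3000]
    committers = [4, 15, 40, 95]
    print(round(pearson(commits, committers), 3))
```

Values near 1, such as those for commits versus committers and commits versus issues, indicate a strong positive linear relationship; values near 0, such as those involving watchers, indicate little linear relationship.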

SLIDE 24

Anticipated Health Results

  • Assumption: Active projects are healthy.
  • A project with a large number of watchers is not necessarily healthy/active.
  • Projects usually die out after 3 years of development.
  • Health of most projects increases in 2011.
  • Most projects are there for storage, not development.

SLIDE 25

Example: Commits per project per year

[Figure: Commits (y-axis, ticks 5000 to 40000) per project (x-axis, Projects from SparkleShare to jekyll), with a legend for the years 2003 to 2012.]

SLIDE 26

Future Data Analysis

  • Relationships:
      • Number of issues/committers related to number of commits per project.
      • Number of commits related to programming languages used.
      • Number of commits related to number of watchers.
  • Kruskal-Wallis Test
      • Tests whether independent samples of projects come from the same distribution.
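
The Kruskal-Wallis test named above ranks all observations together and compares the rank sums of the groups. The sketch below computes the H statistic with the standard tie correction; the grouping of projects (for example, commit counts per language) and the function name are assumptions for illustration. With k groups, H is usually compared against a chi-squared distribution with k - 1 degrees of freedom.

```python
from collections import Counter


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (tie-corrected) for two or more samples."""
    if len(groups) < 2 or any(len(group) == 0 for group in groups):
        raise ValueError("need at least two non-empty groups")
    pooled = sorted(value for group in groups for value in group)
    total = len(pooled)

    # Assign each distinct value the average of the ranks it occupies.
    rank_of = {}
    start = 0
    while start < total:
        end = start
        while end < total and pooled[end] == pooled[start]:
            end += 1
        rank_of[pooled[start]] = (start + 1 + end) / 2  # mean of ranks start+1..end
        start = end

    rank_term = sum(
        sum(rank_of[value] for value in group) ** 2 / len(group)
        for group in groups
    )
    h = 12 / (total * (total + 1)) * rank_term - 3 * (total + 1)

    # Tie correction: 1 - sum(t^3 - t) / (N^3 - N) over tied value counts t.
    tie_sum = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - tie_sum / (total ** 3 - total)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction


if __name__ == "__main__":
    # Hypothetical commit counts for projects in three languages.
    by_language = [[120, 340, 560], [800, 950, 1200], [1500, 2100, 3000]]
    print(kruskal_wallis_h(by_language))
```

In practice a library routine such as `scipy.stats.kruskal` would also report the p-value; the pure-Python version here only shows what the statistic measures.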

SLIDE 27

Implications and Future Work

  • Sample and analyze projects from the larger GHTorrent dataset.
  • Analyze GitHub per user, instead of per project.
  • Help predict future growth patterns and requirements of GitHub.

SLIDE 28

Thank you