Boa Meets Python: A Boa Dataset of Data Science Software in Python

SLIDE 1

Boa Meets Python: A Boa Dataset of Data Science Software in Python Language

Sumon Biswas, Md Johirul Islam, Yijia Huang and Hridesh Rajan http://boa.cs.iastate.edu

Department of Computer Science

SLIDE 2

Data Science Everywhere

Trend of publications with topic “machine-learning”

https://app.dimensions.ai/discover/publication

Top 5 courses on GitHub in 2018*

1. Stanford TensorFlow Tutorials
2. Deep Learning Specialization on Coursera
3. Creative Applications of Deep Learning with Tensorflow
4. Practical RL: A course in reinforcement learning in the wild
5. Data Science Coursera

* Based on forks: https://github.blog/2018-03-20-top-10-courses-on-github

SLIDE 3

Data Science Everywhere

  • Data Science projects are growing very fast

Top topics on GitHub

1. react
2. android
3. nodejs
4. docker
5. ios
6. linux
7. angular
8. machine-learning
9. electron
10. api

Top growing topics on GitHub

1. hacktoberfest
2. pytorch
3. machine-learning
4. dapp
5. gatsby
6. cryptocurrency
7. terraform-provider
8. easy-to-use
9. smart-contracts
10. exchange

SLIDE 4

Python in Data Science

Top languages over time on GitHub (source: https://octoverse.github.com/projects)

Growth of programming languages on Stack Overflow (source: https://stackoverflow.blog/2017/09/06/incredible-growth-python/)

SLIDE 5

Motivation

  • Lots of Data Science (DS) software
  • Python is one of the most used languages in DS
  • Lots of packages, easy to learn
  • Mining software repositories (MSR) has been very successful in software engineering
  • Availability of benchmarks has historically accelerated research on a topic
  • e.g., Allamanis and Sutton's Java corpus, DaCapo [1], Qualitas [2], etc.

[1] S. M. Blackburn, R. Garner, C. Hoffmann, A. M. Khang, K. S. McKinley, R. Bentzur, A. Diwan, D. Feinberg, D. Frampton, S. Z. Guyer et al., “The DaCapo benchmarks: Java benchmarking development and analysis,” in ACM SIGPLAN Notices, vol. 41, no. 10. ACM, 2006.

[2] E. Tempero, C. Anslow, J. Dietrich, T. Han, J. Li, M. Lumpe, H. Melton, and J. Noble, “The Qualitas corpus: A curated collection of Java code for empirical studies,” in Software Engineering Conference (APSEC), 2010 17th Asia Pacific. IEEE, 2010.

SLIDE 6

Contributions

1. A large dataset for analyzing Python DS projects
2. Efficiently store the dataset in Hadoop sequence files
  • memory efficient and
  • accessible in parallel
3. Dataset is publicly available on Boa infrastructure

  • 1,558 Python projects for DS
  • Stored in Hadoop sequence file
  • Available on Boa infrastructure

SLIDE 7

Dataset Metrics

  • Top rated projects: TensorFlow, Keras, pandas, spaCy, Theano, etc.
  • Projects use at least 33 DS libraries, including PyTorch, Caffe, Keras, TensorFlow, XGBoost, NLTK, etc.

  • Project metadata
  • All the revisions
  • Parsed Python AST
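
To illustrate what "use DS libraries" can mean at the AST level, here is a minimal sketch that finds DS library imports in a Python source string. It uses Python's standard `ast` module, not the Boa AST schema, and `DS_LIBRARIES` is a hypothetical, abbreviated set built from the six libraries named on this slide (using their usual import names); the full list of 33 is not given here.

```python
import ast

# Hypothetical, abbreviated set: import names for the six libraries named on the slide.
DS_LIBRARIES = {"torch", "caffe", "keras", "tensorflow", "xgboost", "nltk"}


def imported_ds_libraries(source):
    """Return the set of DS libraries imported by a Python source string."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import tensorflow.keras as k" -> top-level package "tensorflow"
            roots = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # "from keras.models import Sequential" -> "keras"; relative imports are skipped
            roots = [node.module.split(".")[0]]
        else:
            continue
        found.update(root for root in roots if root in DS_LIBRARIES)
    return found


example = "import tensorflow as tf\nfrom keras.models import Sequential\nimport os\n"
print(sorted(imported_ds_libraries(example)))  # ['keras', 'tensorflow']
```

Matching on the top-level package name keeps the check simple; it will miss libraries loaded dynamically (for example through `importlib`).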

SLIDE 8

Methodology


Python repositories (count: 343,607)

  • Original (not forked)
  • Star > 1

Data science projects (count: 1,558)

  • Contain DS keywords
  • Use DS libraries
  • Star > 80
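
The selection criteria on this slide can be sketched as a two-stage filter. This is an illustration under stated assumptions, not the authors' pipeline: the `Repo` record, the keyword and library lists, and the choice to match keywords against the project description are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical keyword and library lists; the slides do not enumerate them.
DS_KEYWORDS = ("machine learning", "deep learning", "data science")
DS_LIBRARIES = {"tensorflow", "keras", "torch", "nltk"}


@dataclass
class Repo:
    """Hypothetical repository record used only for this sketch."""
    name: str
    language: str
    is_fork: bool
    stars: int
    description: str = ""
    imports: set = field(default_factory=set)


def is_candidate(repo):
    """Stage 1: original (not forked) Python repository with more than 1 star."""
    return repo.language == "Python" and not repo.is_fork and repo.stars > 1


def is_ds_project(repo):
    """Stage 2: contains a DS keyword, uses a DS library, and has more than 80 stars."""
    text = repo.description.lower()
    has_keyword = any(keyword in text for keyword in DS_KEYWORDS)
    uses_library = bool(repo.imports & DS_LIBRARIES)
    return has_keyword and uses_library and repo.stars > 80


def select(repos):
    """Apply both stages and keep the repositories that pass."""
    return [repo for repo in repos if is_candidate(repo) and is_ds_project(repo)]


repos = [
    Repo("a", "Python", False, 500, "Deep learning toolkit", {"torch"}),
    Repo("b", "Python", True, 900, "Deep learning toolkit", {"torch"}),   # fork
    Repo("c", "Python", False, 40, "Machine learning demos", {"keras"}),  # too few stars
    Repo("d", "Python", False, 300, "Web framework", {"flask"}),          # not DS
]
print([repo.name for repo in select(repos)])  # ['a']
```

Keeping each stage as a separate predicate makes it easy to report the intermediate count after stage 1 and the final count after stage 2.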

SLIDE 9

What to Do with the Dataset

  • Learn from past and guide future development
  • Improve software design and reuse
  • Manage software better
  • Automatic bug detection
  • Mining DS repositories
  • ...

SLIDE 10

Summary

SLIDE 11


Appendix

SLIDE 12

Boa - Mining Large Scale Software Repositories

  • Infrastructure
  • Domain-specific language

Robert Dyer, Hoan Anh Nguyen, Hridesh Rajan, and Tien N. Nguyen, “Boa: A Language and Infrastructure for Analyzing Ultra-Large-Scale Software Repositories,” in Proceedings of the 35th International Conference on Software Engineering (ICSE 2013), May 22, 2013, San Francisco, CA.

SLIDE 13

Boa Web Based Interface


http://boa.cs.iastate.edu

SLIDE 14

Data Schema

SLIDE 15

Applications - API usage study
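
As a rough local approximation of an API usage study, the sketch below counts call sites by dotted name in a single Python source string. It relies on Python's standard `ast` module rather than a Boa query, and the helper names `dotted_name` and `count_api_calls` are hypothetical. Aliases are not resolved, so `np.array` is counted under the name that appears in the source.

```python
import ast
from collections import Counter


def dotted_name(node):
    """Rebuild a dotted name such as 'np.array' from nested Name/Attribute nodes."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None  # e.g. a method called on a subscript or on another call's result


def count_api_calls(source):
    """Count call sites in a source string, keyed by the dotted name being called."""
    calls = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name is not None:
                calls[name] += 1
    return calls


example = (
    "import numpy as np\n"
    "a = np.array([1])\n"
    "b = np.array([2])\n"
    "print(np.sum(a))\n"
)
print(count_api_calls(example).most_common(1))  # [('np.array', 2)]
```

Running the same counter over every file and revision of a project, then summing the counters, would give per-project API frequencies.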
