DataLab: Introducing Software Engineering Thinking into Data Science Education at Scale


DataLab: Introducing Software Engineering Thinking into Data Science Education at Scale. Yang Zhang, Tingjian Zhang, Yongzheng Jia, Jiao Sun, Fangzhou Xu, and Wei Xu. Institute of Interdisciplinary Information Sciences, Tsinghua University


SLIDE 1

DataLab: Introducing Software Engineering Thinking into Data Science Education at Scale

Yang Zhang, Tingjian Zhang, Yongzheng Jia, Jiao Sun, Fangzhou Xu, and Wei Xu
Institute of Interdisciplinary Information Sciences, Tsinghua University
Department of Computer Science and Technology, Shandong University

SLIDE 2

Overview - Backgrounds

• Data science
  - Data scientist became the best job in the US in 2016 [1]
• Data science education
  - Ubiquitous in universities and online education

[1]: 25 Best Jobs in America. https://www.glassdoor.com/List/Best-Jobs-in-America-LST_KQ0,20.htm

SLIDE 3

Overview - Challenges

Students
• Lack formal computer science training
• Hard to set up coding tools
• Confused by data/code versions

Instructors
• Time-consuming to set up tools
• Hard to scale teaching methodologies

SLIDE 4

Differences between a DS and SE project

• Data science requires managing both data and source code together
• Many data science tasks are primarily concerned with tuning hyperparameters with many versions of code, data and results
• Even a simple data science assignment requires a large dataset
• Data science projects often require collaboration between students from different backgrounds

SLIDE 5

Our solution

• DataLab
  - Integrates code, data, and execution management into a single system
  - Creates links among code, data, parameters, and their revisions
  - Provides a scalable system
  - Allows students to share any version of their code, data, and results
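The slides do not show how DataLab stores the links among code, data, parameters, and their revisions. As a minimal sketch of the idea, assuming content hashes serve as version identifiers, one execution can be tied to its exact inputs like this. `RunRecord`, `record_run`, and `content_hash` are hypothetical names chosen for illustration, not part of DataLab.

```python
import hashlib
import json
from dataclasses import dataclass


def content_hash(payload: bytes) -> str:
    """Return a short, stable identifier for a blob of bytes."""
    return hashlib.sha256(payload).hexdigest()[:12]


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record tying one execution to its exact inputs."""
    code_version: str
    data_version: str
    params_version: str
    result: float


def record_run(code: bytes, data: bytes, params: dict, result: float) -> RunRecord:
    # Serialize parameters with sorted keys so equal dicts hash identically.
    params_blob = json.dumps(params, sort_keys=True).encode("utf-8")
    return RunRecord(
        code_version=content_hash(code),
        data_version=content_hash(data),
        params_version=content_hash(params_blob),
        result=result,
    )
```

With records like these, two runs that differ only in a hyperparameter share the same `code_version` and `data_version`, which makes it possible to tell which change produced a different result.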

SLIDE 6

Easy to set up a project

SLIDE 7

A project summary page

• Data
• Code
• Project push commit

SLIDE 8

Separate config and parameters from code
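The slide text gives only the principle, so the following is a hedged sketch of what separating parameters from code can look like in a student project. The parameter names (`learning_rate`, `max_depth`) and the functions `load_params` and `train` are assumptions made for illustration; DataLab's actual config format is not shown in the slides.

```python
import json

# Hypothetical defaults; a real project would define its own parameters.
DEFAULT_PARAMS = {"learning_rate": 0.1, "max_depth": 3}


def load_params(text: str) -> dict:
    """Merge parameters given as JSON text over the defaults."""
    overrides = json.loads(text)
    unknown = set(overrides) - set(DEFAULT_PARAMS)
    if unknown:
        # Reject misspelled keys instead of silently ignoring them.
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    return {**DEFAULT_PARAMS, **overrides}


def train(params: dict) -> str:
    # Stand-in for real training code: it only reads values from `params`,
    # so tuning a hyperparameter never requires editing this function.
    return f"lr={params['learning_rate']}, depth={params['max_depth']}"
```

For example, `train(load_params('{"max_depth": 5}'))` runs with the default learning rate and a depth of 5, and the code itself stays unchanged between experiments.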

SLIDE 9

Online development environment

SLIDE 10

Creating code/data versions and autograding

Versions Grades
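The slides do not describe the grading rule. As a rough sketch, assuming each submitted version is scored by classification accuracy against a hidden answer key, per-version grades could be computed as follows. `accuracy` and `grade_versions` are hypothetical helpers, not DataLab functions.

```python
def accuracy(predictions: dict, answers: dict) -> float:
    """Fraction of answer-key ids whose prediction matches the key."""
    if not answers:
        raise ValueError("answer key is empty")
    # A missing prediction counts as wrong because .get() returns None.
    correct = sum(
        1 for key, label in answers.items() if predictions.get(key) == label
    )
    return correct / len(answers)


def grade_versions(submissions: dict, answers: dict) -> dict:
    """Map each version id to its accuracy on the answer key."""
    return {
        version: accuracy(preds, answers)
        for version, preds in submissions.items()
    }
```

Keeping one grade per version, rather than one per student, is what lets a student compare revisions and see whether a change actually helped.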

SLIDE 11

Version management

Versions

SLIDE 12

Team collaboration

• Share data
• Share code, config, param
• Import data
• Import code, config, param

SLIDE 13

Instructor tools

SLIDE 14

DataLab is scalable

• Data management system
• Scalable execution environment
• Extensible APIs

SLIDE 15

Evaluation - Deployment

• DataLab: 3 machines
  - 8 cores
  - 16 GB memory
  - 80 GB of hard disk storage

[1]: Kaggle. https://www.kaggle.com

SLIDE 16

Evaluation - In-classroom experiment

• A graduate-level introductory data science course with 81 students and 20 volunteers
• Classic Kaggle [1] competition project: Titanic: Machine Learning from Disaster
  - Predict survivors from gender, age, cabin class, and other information
  - 1,979 different versions of code submissions

[1]: Kaggle. https://www.kaggle.com

SLIDE 17

Log analysis

Fig 1. Relation between number of submissions and accuracy
Fig 2. How many times did students push and submit their code given their ranks?
Fig 3. How many times did students check branches and reset their code given their ranks?

SLIDE 18

Survey results

• 18 subjective questions
• The survey has 3 parts
  - Students' coding experience
  - Students' opinions
  - Students' suggestions
• 92 out of 101 students indicated that they will continue to use DataLab for their future data science projects

Fig 1. Is DataLab helpful for learning data analysis techniques?

SLIDE 19

Conclusion

DataLab: Introducing SE thinking to DS education

• Manages data/code/execution automatically
• Saves instructors' time
• Improves students' development efficiency
• Can scale at low cost

SLIDE 20