Datalab: A Version Data Management and Analytics System



SLIDE 1

Datalab: A Version Data Management and Analytics System

Yang Zhang, Fangzhou Xu, Erwin Frise, Siqi Wu, Bin Yu, Wei Xu

SLIDE 2

Overview

• Problem: how do we manage code and data with versions?
    • Code version control, e.g. GitHub
    • Data version control, e.g. DataHub [1]
• But how do we combine them in a coherent system?

[1] Anant Bhardwaj, Souvik Bhattacherjee, Amit Chavan, Amol Deshpande, Aaron J. Elmore, Samuel Madden, and Aditya G. Parameswaran. DataHub: Collaborative Data Science & Dataset Version Management at Scale. arXiv preprint arXiv:1409.0798, 2014.

SLIDE 3

Our Solution

• Version control that combines code and datasets.
• Datasets are generated by executing code.
• Two data versions are connected by a code version.

[Diagram: Dataset version0001 → Experiment (commit_id = c41f29) → Dataset version0002]
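The link between two data versions and one code version can be written down as a small record. The following is a minimal sketch, not Datalab's actual data model: the class name `Experiment`, its field names, and the `describe` helper are assumptions made for illustration. Only the version labels and the commit id come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Experiment:
    """Hypothetical record: one code version linking two dataset versions."""
    commit_id: str       # code version, e.g. a Git commit hash
    input_version: str   # dataset version the code reads
    output_version: str  # dataset version the code produces


def describe(exp: Experiment) -> str:
    """Render the link as input --[commit]--> output."""
    return f"{exp.input_version} --[{exp.commit_id}]--> {exp.output_version}"


# Values taken from the slide's diagram.
edge = Experiment(commit_id="c41f29",
                  input_version="version0001",
                  output_version="version0002")
print(describe(edge))
```

Making the record immutable (`frozen=True`) reflects the idea that a recorded experiment should not change after the fact. That is a design choice of this sketch, not something the slides state.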

SLIDE 4

Data Work Flow (DWF)

• Pairs of data versions make up a data work flow (DWF).
• A dataset can be reconstructed by re-executing the version of the code that generated it.

[Diagram: example data work flow over Dataset 1 through Dataset 7]
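Reconstruction by re-execution can be illustrated with a toy in-memory graph. This is a sketch under stated assumptions: the class `DataWorkFlow`, its methods, the commit ids `a1` and `b2`, and the use of plain Python functions to stand in for checked-out code versions are all hypothetical. The edges between datasets are also invented, since the slide's diagram does not survive in the text.

```python
from typing import Any, Callable, Dict, Tuple


class DataWorkFlow:
    """Toy DWF: each derived dataset remembers its input and the code that made it."""

    def __init__(self) -> None:
        self.roots: Dict[str, Any] = {}                      # datasets stored directly
        self.producers: Dict[str, Tuple[str, str]] = {}      # output -> (input, commit_id)
        self.code: Dict[str, Callable[[Any], Any]] = {}      # commit_id -> code (stand-in)

    def add_root(self, name: str, data: Any) -> None:
        self.roots[name] = data

    def add_experiment(self, input_name: str, commit_id: str,
                       output_name: str, fn: Callable[[Any], Any]) -> None:
        self.producers[output_name] = (input_name, commit_id)
        self.code[commit_id] = fn

    def reconstruct(self, name: str) -> Any:
        """Rebuild a dataset by re-running the code versions along its lineage."""
        if name in self.roots:
            return self.roots[name]
        input_name, commit_id = self.producers[name]
        return self.code[commit_id](self.reconstruct(input_name))


dwf = DataWorkFlow()
dwf.add_root("Dataset 1", [3, 1, 2])
dwf.add_experiment("Dataset 1", "a1", "Dataset 2", sorted)
dwf.add_experiment("Dataset 2", "b2", "Dataset 3", lambda xs: [x * x for x in xs])
print(dwf.reconstruct("Dataset 3"))
```

Only the root dataset is stored here. Every derived dataset is recomputed on demand, which trades storage for compute.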

SLIDE 5

System Architecture

SLIDE 6

Case Study: A Biological Data Application

• Goal: find the best K principal patterns
• Procedure:
    • Data preprocessing
    • Feature extraction
    • Non-negative matrix factorization
    • Evaluate K with a stability function
    • Repeat until the best parameter is found
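The loop over K in the procedure above can be sketched as follows. The slides do not define the stability function, so this example substitutes a simple stand-in: the mean best-match cosine similarity between components obtained from different random initialisations. The NMF solver uses the standard Lee-Seung multiplicative updates with NumPy. The function names, iteration count, number of runs, and the synthetic data are all assumptions of this sketch.

```python
import numpy as np


def nmf(X, k, seed, n_iter=200, eps=1e-9):
    """Multiplicative-update NMF: X ~= W @ H with W, H non-negative."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def stability(X, k, n_runs=5):
    """Stand-in stability score: higher means runs agree more on the components."""
    comps = []
    for seed in range(n_runs):
        _, H = nmf(X, k, seed)
        # Normalise each component (row of H) to unit length.
        comps.append(H / (np.linalg.norm(H, axis=1, keepdims=True) + 1e-12))
    scores = []
    for i in range(n_runs):
        for j in range(i + 1, n_runs):
            C = comps[i] @ comps[j].T  # k x k cosine similarities
            scores.append(0.5 * (C.max(axis=1).mean() + C.max(axis=0).mean()))
    return float(np.mean(scores))


def best_k(X, candidates):
    """Evaluate every candidate K and return the most stable one with all scores."""
    scores = {k: stability(X, k) for k in candidates}
    return max(scores, key=scores.get), scores


# Synthetic non-negative data, used only to exercise the loop.
rng = np.random.default_rng(0)
X = rng.random((20, 3)) @ rng.random((3, 15))
k, scores = best_k(X, [2, 3, 4])
print(k, scores)
```

Each candidate K requires several full factorizations, which is why re-running this kind of search benefits from tracking which code version produced which intermediate dataset.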

SLIDE 7

Core APIs

SLIDE 8

Future Work

• Dataset caching
• Online development environment
• Multiple levels of interfaces

[Diagram: example data work flow over Dataset 1 through Dataset 7]

SLIDE 9

Conclusions

• We combine data and code version control
• We propose the data work flow
• We improve the efficiency of a data science procedure