
Datalab: A Version Data Management and Analytics System

  1. Datalab: A Version Data Management and Analytics System
     Yang Zhang, Fangzhou Xu, Erwin Frise, Siqi Wu, Bin Yu, Wei Xu

  2. Overview
     - Problem: how do we manage code and data with versions?
     - Code version control, e.g. GitHub
     - Data version control, e.g. DataHub [1]
     - But how do we combine them in a coherent system?

     [1] Anant Bhardwaj, Souvik Bhattacherjee, Amit Chavan, Amol Deshpande, Aaron J. Elmore, Samuel Madden, and Aditya G. Parameswaran. DataHub: Collaborative Data Science & Dataset Version Management at Scale. arXiv preprint arXiv:1409.0798, 2014.

  3. Our Solution
     - Version control that combines code and datasets.
     - Datasets are generated by executing code.
     - Two data versions are connected by a code version.

     Figure: an experiment (commit_id = c41f29) connects Dataset version0001 and Dataset version0002.

  4. Data Work Flow (DWF)
     - Pairs of data versions make up a data work flow (DWF)
     - Reconstruct a dataset by re-executing the version of code that generates it

     Figure: a data work flow graph over Dataset 1 through Dataset 7.
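
The reconstruction idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Datalab's actual implementation: the `DataWorkFlow` class, its method names, the `CODE_VERSIONS` table (a stand-in for checking out and running a commit), the commit id `9ab03e`, and the dataset `version0003` are all hypothetical. The sketch also assumes each derived dataset has exactly one parent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for checking out and running the code at a commit:
# each commit id maps to a plain function from input data to output data.
CODE_VERSIONS: Dict[str, Callable[[List[int]], List[int]]] = {
    "c41f29": lambda rows: [r * 2 for r in rows],  # commit id taken from the slides
    "9ab03e": lambda rows: [r + 1 for r in rows],  # made-up commit id
}


@dataclass(frozen=True)
class Edge:
    parent: str     # data version the code read
    commit_id: str  # code version that produced the child


class DataWorkFlow:
    """Stores root datasets plus (parent, commit) edges; derived data is not stored."""

    def __init__(self) -> None:
        self.roots: Dict[str, List[int]] = {}
        self.edges: Dict[str, Edge] = {}

    def add_root(self, version: str, data: List[int]) -> None:
        self.roots[version] = list(data)

    def add_edge(self, parent: str, commit_id: str, child: str) -> None:
        self.edges[child] = Edge(parent, commit_id)

    def reconstruct(self, version: str) -> List[int]:
        """Rebuild a dataset by re-executing the code versions on its path from a root."""
        if version in self.roots:
            return list(self.roots[version])
        edge = self.edges[version]
        return CODE_VERSIONS[edge.commit_id](self.reconstruct(edge.parent))


dwf = DataWorkFlow()
dwf.add_root("version0001", [1, 2, 3])
dwf.add_edge("version0001", "c41f29", "version0002")
dwf.add_edge("version0002", "9ab03e", "version0003")  # version0003 is made up
print(dwf.reconstruct("version0003"))  # [3, 5, 7]
```

Only the root data and the edges are kept, so any derived version can be dropped and rebuilt on demand, at the cost of re-running the code along its path.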

  5. System Architecture

  6. Case Study: A Biological Data Application
     - Goal: find the best K principal patterns
     - Procedure:
       - Data preprocessing
       - Feature extraction
       - Non-negative matrix factorization
       - Evaluate K by a stability function
       - Repeat until the best parameter is found
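
The last three steps of this procedure can be sketched with NumPy. The slides do not define the stability function, so the score below is an assumption chosen for illustration: NMF is run from several random initializations, and the score is the mean best-match cosine similarity between the learned patterns of each pair of runs. The function names `nmf`, `stability`, and `best_k` are hypothetical, the factorization uses standard multiplicative updates rather than whatever solver the authors used, and preprocessing and feature extraction are left out.

```python
import numpy as np


def nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Factor non-negative X (n x m) as W (n x k) @ H (k x m) via multiplicative updates."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def stability(X, k, n_runs=5):
    """Assumed score: mean best-match cosine similarity of patterns across pairs of runs."""
    patterns = []
    for seed in range(n_runs):
        _, H = nmf(X, k, seed=seed)
        # Normalize each pattern (row of H) to unit length.
        patterns.append(H / (np.linalg.norm(H, axis=1, keepdims=True) + 1e-12))
    scores = []
    for i in range(n_runs):
        for j in range(i + 1, n_runs):
            sim = patterns[i] @ patterns[j].T  # k x k cosine similarities
            scores.append(0.5 * (sim.max(axis=1).mean() + sim.max(axis=0).mean()))
    return float(np.mean(scores))


def best_k(X, candidates):
    """Score every candidate K and return the most stable one with all scores."""
    scores = {k: stability(X, k) for k in candidates}
    return max(scores, key=scores.get), scores
```

A call such as `best_k(X, range(2, 10))` on a non-negative feature matrix `X` returns the chosen K together with the score for each candidate. Each (K, seed) run here is the kind of experiment that a data work flow would record as one code version linking an input data version to an output data version.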

  7. Core APIs

  8. Future Work
     - Dataset caching
     - Online development environment
     - Multiple levels of interfaces

     Figure: the data work flow graph over Dataset 1 through Dataset 7.

  9. Conclusions
     - We combine data and code version control
     - We propose data work flow
     - We improve the efficiency of a data science procedure
