DataHub: Collaborative Data Science and Dataset Version Management at Scale
Aditya Parameswaran, U Illinois 1

Deep, Dark Secrets of Data Science
[Slide text partially clipped; surviving fragments:]
…ation…
…oint” is increasingly managing the pro…
…atasets are being used and where did…
…diting what or who generated which…
…pes of analyses have been conducted…
…id this “plot.png” file come from…
…do when I discover an error in a datas…
…today’s results compare to yesterday…
…atasets should I use to further my anal…
…c data management systems (e.g., Dr…
…f the data is unstructured so typically…
…cess of data science itself is quite ad h…
…ts/researchers/analysts are pretty mu…
Courtesy: XKCD 2

How bad could dataset management get? 3

The Investigator Team
Aaron Elmore (Chicago), Aditya Parameswaran (Illinois), Amol Deshpande (Maryland), Sam Madden (MIT)
Amit Chavan, Anant Bhardwaj, Shouvik Bhattacherjee 4

A True (Horror) Story of Dataset Management Before 5

What did we learn?
Research Scientist: We use about 100TB of data across 20-30 researchers. We spend a LOT of money on this. Everything is organized around shared folders, and everyone has access. Our dataset management scheme is so simple, it’s great! 6

What did we learn?
Us: So how do users work on datasets?
Research Scientist: They typically make a private copy.
Us: But wouldn’t that mean lots of redundant versions and duplication?
Research Scientist: Yes. That’s why our storage is 100TB.
I: Massive redundancy in stored datasets 7

What did we learn?
Us: Do you have datasets being analyzed by multiple users simultaneously?
Research Scientist: Sure, but we have no way of knowing or resolving modifications.
Us: But wouldn’t that mean you cannot combine work across users?
Research Scientist: True. The users will need to discuss.
II: True collaboration is near impossible! 8

What did we learn?
Us: Do you get rid of redundant datasets, given that you have space issues?
Research Scientist: All the time!
Us: What if the user had left, and if the dataset is crucial for reproducibility?
Research Scientist: We cross our fingers!
III: Unknown dependencies between datasets 9

What did we learn?
Us: Is there any way users can search for specific dataset versions of interest?
Research Scientist: Not really. They talk to me.
Us: What if you leave?
Research Scientist: Let’s pray for the group’s sake that that doesn’t happen!
IV: No organization or management of dataset versions. 10

What did we learn? The four 1. Massive redundancy in stored datasets 2. Truly collaborative data science is impossible 3. Unknown dependencies between dataset versions 4. No efficient organization or management of datasets 11

Happens all the time…
Every collaborative data science project ends up in dataset version management hell.
Surely, there must be a better way?
1. Massive redundancy in stored datasets
2. Truly collaborative data science is impossible
3. Unknown dependencies between dataset versions
4. No efficient organization or management of datasets 12

Have we seen this before?
Analogous to management of source code before source code version control!
How about: DataHub: a “GitHub for data”
Solving the “AYS” problems
1. Massive redundancy in stored datasets → Compact storage
2. Truly collaborative data science is impossible → “Branching” allowed
3. Unknown dependencies between versions → Explicit and implicit
4. No efficient organization or management → Rich retrieval methods 13

What about alternatives?
Many issues with directly using GitHub or source-code version control (SC-VC):
• Cannot handle large datasets or a large number of versions
• Querying and retrieval functionality is primitive
• Datasets have regular repeating structure
Many issues with temporal databases: similar issues, plus one major one:
• Only supports a linear chain of versions 14

The Vision for DataHub The for collaborative data science and dataset version management satisfying all your dataset book-keeping needs. 15

The Vision for DataHub
Basics:
• Efficient maintenance and management of dataset versions
DataHub will also have:
• A rich query language encompassing data and versions
• In-built essential data science functionality such as ingestion and integration, plus API hooks to external apps (MATLAB, R, …) 16

[Figure: DataHub workflow. Labels: Raw Files; Ingest (Import); Database System; Fork, Branch, Merge; Version Management; Sharing, Collaboration; Query Language; Integrate / Visualize / Other Apps] 17

DataHub Architecture
[Figure: DataHub: A Collaborative Dataset Management Platform. Labels: Client Applications (Ingest, Integrate, Other) providing Support for Data Science; Versioning API and Versioning QL; Dataset Versioning Manager; Metadata: Version Graphs, Indexes, Provenance; Data: Versioned Datasets] 18

Data Model and Basic API
Flexible “schema-later” data model: groups of records with different schemas in the same table
Key-value view:
  Key    Value
  Sam    (Berkeley, 2003, Hellerstein)
  Amol   (Berkeley, 2004, Hellerstein)
  Aaron  (UCSB, 2014, El Abbadi and Agrawal)
Table view:
  Key    School    Year  Advisor
  Sam    Berkeley  2003  Hellerstein
  Amol   Berkeley  2004  Hellerstein
  Aaron  UCSB      2014  El Abbadi and Agrawal
Metadata Versions
Standard git-style commands: branch, commit, fork, merge, rollback, checkout 19
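The “schema-later” idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not DataHub's implementation: the class name `SchemaLaterTable`, its methods, and the extra `Mike` record are hypothetical. It shows records with different fields sharing one table, with the column set derived afterwards.

```python
from typing import Any


class SchemaLaterTable:
    """Toy key-value table whose records need not share a schema (hypothetical)."""

    def __init__(self) -> None:
        self.rows: dict[str, dict[str, Any]] = {}

    def put(self, key: str, **fields: Any) -> None:
        # No schema is declared up front; each record brings its own fields.
        self.rows[key] = dict(fields)

    def columns(self) -> list[str]:
        # The "schema" is derived later as the union of all fields seen so far,
        # in first-seen order.
        seen: dict[str, None] = {}
        for record in self.rows.values():
            for name in record:
                seen.setdefault(name, None)
        return list(seen)

    def as_table(self) -> list[dict[str, Any]]:
        # Fields a record lacks surface as None in the materialized table view.
        cols = self.columns()
        return [{"Key": k, **{c: r.get(c) for c in cols}} for k, r in self.rows.items()]


table = SchemaLaterTable()
table.put("Sam", School="Berkeley", Year=2003, Advisor="Hellerstein")
table.put("Amol", School="Berkeley", Year=2004, Advisor="Hellerstein")
table.put("Aaron", School="UCSB", Year=2014, Advisor="El Abbadi and Agrawal")
# Hypothetical record with a different schema, stored in the same table.
table.put("Mike", Salary=150)
```

Calling `table.as_table()` pads each record to the derived column set, which is one simple way to present the key-value view and the table view from the same storage.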

Storing and Retrieving Versions
Simplest strawman approach:
Store: for every version, store the “delta” from the previous version
Retrieve: start from the version pointer, walk up to the root
[Figure: version DAG; the third column of each row is the visible bit]
  T1, Version 0:    Sam, $50, 1; Amol, $100, 1
  T4, Master:       + Mike, $150, 1
  T2, Version 1:    + Aditya, $80, 1
  T3, Version 1.1:  + Amol, $100, 0 (deletes Amol)
The Good:
• Somewhat compact
The Bad:
• Inefficient to construct versions: walk up entire chains
• Inefficient to look up all versions that contain a tuple
Q: Why store delta from the previous version?
Q: Why not materialize some versions completely?
Q: What kind of indexes should we use? 20
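The strawman above can be sketched as follows. This is a minimal illustration, not DataHub's storage engine: the `Version` class and both functions are hypothetical, and the parent links between versions are an assumption, since the edges of the slide's DAG did not survive extraction. It shows why reconstruction and tuple lookup are slow: both walk entire chains.

```python
from dataclasses import dataclass, field

Row = tuple[str, int, int]  # (name, amount, visible_bit)


@dataclass
class Version:
    name: str
    parent: "Version | None" = None
    delta: list[Row] = field(default_factory=list)


def checkout(version: Version) -> dict[str, int]:
    """Rebuild a version by walking up to the root, then replaying deltas."""
    chain = []
    node = version
    while node is not None:  # walk up the entire chain (the slow part)
        chain.append(node)
        node = node.parent
    state: dict[str, int] = {}
    for node in reversed(chain):  # replay from the root down
        for name, amount, visible in node.delta:
            if visible:
                state[name] = amount
            else:
                state.pop(name, None)  # visible bit 0 marks a delete
    return state


def versions_containing(name: str, versions: list[Version]) -> list[str]:
    # Without an index, every version must be reconstructed to answer this.
    return [v.name for v in versions if name in checkout(v)]


# Parent links are assumed for illustration only.
v0 = Version("Version 0", None, [("Sam", 50, 1), ("Amol", 100, 1)])
master = Version("Master", v0, [("Mike", 150, 1)])
v1 = Version("Version 1", v0, [("Aditya", 80, 1)])
v1_1 = Version("Version 1.1", v1, [("Amol", 100, 0)])
```

Materializing some versions completely, or keeping a tuple-to-version index, would trade storage for faster versions of `checkout` and `versions_containing`, which is the trade-off the slide's questions point at.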

Branching and Merging
More questions than answers!
• Q: How do we allow users to operate on servers and/or their local machines without missing updates?
• Q: What if the datasets are large? Can users work on samples?
• Q: How do we detect conflicts and allow users to merge conflicting branches with as little effort as possible? 21
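The slide leaves conflict detection open. As one possible starting point, and not a description of what DataHub does, a generic three-way merge over keyed records can be sketched like this. The function name and the example data are hypothetical; it assumes each record has a sortable key and that the common ancestor of the two branches is known.

```python
def three_way_merge(base: dict, ours: dict, theirs: dict):
    """Merge two branches of a keyed dataset against their common ancestor."""
    merged, conflicts = {}, []
    missing = object()  # sentinel for "key absent in this version"
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b = base.get(key, missing)
        o = ours.get(key, missing)
        t = theirs.get(key, missing)
        if o == t:      # both branches agree (including both deleted)
            winner = o
        elif o == b:    # only theirs changed this key
            winner = t
        elif t == b:    # only ours changed this key
            winner = o
        else:           # both changed it differently: needs a human
            conflicts.append(key)
            continue
        if winner is not missing:
            merged[key] = winner
    return merged, conflicts


# Hypothetical branches of a small dataset.
base = {"Sam": 50, "Amol": 100}
ours = {"Sam": 50, "Amol": 100, "Aditya": 80}   # adds Aditya
theirs = {"Sam": 60, "Mike": 150}               # updates Sam, deletes Amol, adds Mike
clean_merge, clean_conflicts = three_way_merge(base, ours, theirs)
_, sam_conflicts = three_way_merge(base, {"Sam": 70, "Amol": 100}, theirs)
```

Only keys changed differently on both sides are reported, which keeps the manual effort limited to genuine conflicts. Record-level granularity is a design choice here; field-level merging would report fewer conflicts at the cost of more bookkeeping.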

Rich Query Language
Can combine versions and data!
  SELECT * FROM R[V1], R[V4] WHERE R[V1].ID = R[V4].ID
  SELECT VNUM FROM VERSIONS(R) WHERE EXISTS (SELECT * FROM R[VNUM] WHERE NAME = 'AARON')
Still a work in progress!
Other examples: Find…
• All versions that are vastly different in size from a given version
• The first version where a certain tuple was introduced
• All tuples that were introduced in a given version and subsequently deleted 22
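To make the intended semantics of these queries concrete, here is a small in-memory emulation in Python. It is a sketch under assumptions, not the DataHub query engine: the version contents are invented, `ID` is assumed unique within a version, and dictionary order stands in for commit order (real versions form a DAG, not a line).

```python
# Hypothetical version contents; only the relation and version names echo the slide.
R = {
    "V1": [{"ID": 1, "NAME": "SAM"}, {"ID": 2, "NAME": "AMOL"}],
    "V2": [{"ID": 1, "NAME": "SAM"}, {"ID": 2, "NAME": "AMOL"}, {"ID": 3, "NAME": "AARON"}],
    "V3": [{"ID": 1, "NAME": "SAM"}, {"ID": 3, "NAME": "AARON"}],
    "V4": [{"ID": 1, "NAME": "SAM"}],
}


def join_versions(r, left, right):
    # SELECT * FROM R[left], R[right] WHERE R[left].ID = R[right].ID
    by_id = {row["ID"]: row for row in r[right]}
    return [(row, by_id[row["ID"]]) for row in r[left] if row["ID"] in by_id]


def versions_with_name(r, name):
    # SELECT VNUM FROM VERSIONS(R)
    # WHERE EXISTS (SELECT * FROM R[VNUM] WHERE NAME = name)
    return [vnum for vnum, rows in r.items() if any(row["NAME"] == name for row in rows)]


def first_version_with(r, row_id):
    # "The first version where a certain tuple was introduced",
    # treating dictionary order as commit order (an assumption).
    return next(
        (vnum for vnum, rows in r.items() if any(row["ID"] == row_id for row in rows)),
        None,
    )
```

For example, `versions_with_name(R, "AARON")` scans every version, which is exactly the kind of lookup that motivates version-aware indexes.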

Screenshots 23

App: Ingest by Example Example from Data Wrangler Paper 24

App: Automatic Visualization 25

Papers in the works…
• Fundamentals:
  • Blobs: Exploring the trade-off between storage and recreation/retrieval cost for blob stores
  • Relational: Exploring SQL-based versioning implementations and indexing
• Add-on functionality:
  • Ingest: Ingest by example
  • Viz: Automatically generating query visualizations 26

To Summarize
• Dataset management as of today is bad, bad, bad
• DataHub is “GitHub for data”; an essential prerequisite to collaborative data science
  • Tracking, managing, reasoning about, and retrieving versions
  • Fundamental building block for study of other problems
• DataHub has in-built data science functionality, plus hooks (Lots of related work!)
  • Ingestion: ingest by example
  • Integration: search, and auto-integrate
  • Provenance: explicit and implicit
  • Visualization: manual and automatic
  (Integrated with versioned storage) 27

To find out more and contribute… datahub.csail.mit.edu Aditya Parameswaran data-people.cs.illinois.edu 28
