DataHub: Collaborative Data Science and Dataset Version Management at Scale
Aditya Parameswaran U Illinois
Deep, Dark Secrets of Data Science
…ation …oint is increasingly managing the pro…
…atasets are being used and where did…
…diting what or who generated which…
…pes of analyses have been conducted
…id this “plot.png” file come from
…do when I discover an error in a datas…
…today’s results compare to yesterday…
…atasets should I use to further my anal…
…c data management systems (e.g., Dr…
…f the data is unstructured so typically…
…cess of data science itself is quite ad h…
…ts/researchers/analysts are pretty mu…
We use about 100TB of data across 20-30 researchers. We spend a LOT of money on this. Everything is organized around shared folders, and everyone has access. Our dataset management scheme is so simple, it’s great!
So how do users work on datasets?
They typically make a private copy.
But wouldn’t that mean lots of redundant versions and duplication?
I: Massive redundancy in stored datasets
Do you have datasets being analyzed by multiple users simultaneously?
Sure, but we have no way of knowing.
But wouldn’t that mean you cannot combine work across users?
II: True collaboration is near impossible!
Do you get rid of redundant datasets, given that you have space issues?
All the time!
What if the user has left, and the dataset is crucial for reproducibility?
We cross our fingers!
III: Unknown dependencies between datasets
Is there any way users can search for specific dataset versions of interest?
Not really. They talk to me.
What if you leave?
Let’s pray for the group’s sake that that doesn’t happen!
IV: No organization or management of dataset versions.
[Figure labels: Ingest (Import); Version Management; Sharing, Collaboration; Raw Files; Fork, Branch, Merge; Database System; Query Language; Integrate / Visualize / Other Apps]
[Figure labels: Data: Versioned Datasets; Metadata: Version Graphs, Indexes, Provenance; Dataset Versioning Manager; Versioning API; Versioning QL; Ingest; Integrate; Other; Client Applications]
DataHub: A Collaborative Dataset Management Platform
Support for Data Science
Key    Value
Sam    (Berkeley, 2003, Hellerstein)
Amol   (Berkeley, 2004, Hellerstein)
Aaron  (UCSB, 2014, El Abbadi and Agrawal)

Key    School    Year  Advisor
Sam    Berkeley  2003  Hellerstein
Amol   Berkeley  2004  Hellerstein
Aaron  UCSB      2014  El Abbadi and Agrawal
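The same three records are shown in two shapes: as opaque key-value pairs and as rows with named columns. A minimal Python sketch of moving from the first shape to the second follows; the names `key_value`, `COLUMNS` and `to_rows` are made up for illustration and are not part of DataHub.

```python
from typing import Dict, List, Tuple

# Key-value form: each key maps to one opaque tuple value.
key_value: Dict[str, Tuple[str, int, str]] = {
    "Sam": ("Berkeley", 2003, "Hellerstein"),
    "Amol": ("Berkeley", 2004, "Hellerstein"),
    "Aaron": ("UCSB", 2014, "El Abbadi and Agrawal"),
}

# Column names for the relational form (taken from the second table).
COLUMNS = ("School", "Year", "Advisor")


def to_rows(store: Dict[str, Tuple[str, int, str]]) -> List[Dict[str, object]]:
    """Turn key-value pairs into rows with named columns."""
    return [{"Key": key, **dict(zip(COLUMNS, value))} for key, value in store.items()]


rows = to_rows(key_value)
```

Here `rows[0]` is `{"Key": "Sam", "School": "Berkeley", "Year": 2003, "Advisor": "Hellerstein"}`. Once the columns have names, a query can filter on `Year` or `Advisor` directly instead of unpacking the value.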
Version 0: Sam, $50, 1; Amol, $100, 1
Master: + Mike, $150, 1
Version 1: + Aditya, $80, 1
Version 1.1: + Amol, $100, 0 (visible bit 0: deletes Amol)
T1 T2 T3 T4
The Good:
The Bad:
Walk up entire chains
that contain a tuple
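Storing each version as a delta with a visible bit means that reading a version requires walking up its chain of ancestors. A minimal Python sketch of that idea follows. The `Version` class and `materialize` function are hypothetical, and the parent links (Master and Version 1 derived from Version 0, Version 1.1 derived from Version 1) are an assumption made for illustration, not something the slide states.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# One delta row: (key, value, visible). visible == 0 marks a deletion.
Row = Tuple[str, str, int]


@dataclass
class Version:
    name: str
    parent: Optional["Version"] = None
    delta: List[Row] = field(default_factory=list)


def materialize(version: Version) -> Dict[str, str]:
    """Rebuild a version's records by walking up its entire chain."""
    chain: List[Version] = []
    node: Optional[Version] = version
    while node is not None:
        chain.append(node)
        node = node.parent
    records: Dict[str, str] = {}
    for ancestor in reversed(chain):  # apply the oldest delta first
        for key, value, visible in ancestor.delta:
            if visible:
                records[key] = value
            else:
                records.pop(key, None)  # visible bit 0 hides the tuple
    return records


# Parent links are assumed for illustration only.
v0 = Version("Version 0", None, [("Sam", "$50", 1), ("Amol", "$100", 1)])
master = Version("Master", v0, [("Mike", "$150", 1)])
v1 = Version("Version 1", v0, [("Aditya", "$80", 1)])
v1_1 = Version("Version 1.1", v1, [("Amol", "$100", 0)])
```

Under these assumed links, `materialize(v1_1)` gives `{"Sam": "$50", "Aditya": "$80"}`: Amol is inherited from Version 0 and then hidden by the row with visible bit 0. Each delta stays small, but every read has to touch every ancestor.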
SELECT * FROM R[V1], R[V4]
WHERE R[V1].ID = R[V4].ID

SELECT VNUM FROM VERSIONS(R)
WHERE EXISTS (SELECT * FROM R[VNUM] WHERE NAME = 'AARON')
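To make the intent of the two queries concrete, here is a rough Python emulation over an in-memory stand-in for a versioned relation. The data in `R` and the helpers `join_versions` and `versions_containing` are invented for illustration. This only mimics what the queries ask for, not how DataHub evaluates them.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]

# Hypothetical contents of relation R at three versions.
R: Dict[str, List[Row]] = {
    "V1": [{"ID": 1, "NAME": "SAM"}, {"ID": 2, "NAME": "AMOL"}],
    "V2": [{"ID": 1, "NAME": "SAM"}],
    "V4": [{"ID": 2, "NAME": "AMOL"}, {"ID": 3, "NAME": "AARON"}],
}


def join_versions(relation: Dict[str, List[Row]], left: str, right: str) -> List[Tuple[Row, Row]]:
    """First query: pair rows of two versions that share the same ID."""
    right_by_id: Dict[object, List[Row]] = {}
    for row in relation[right]:
        right_by_id.setdefault(row["ID"], []).append(row)
    return [(l, r) for l in relation[left] for r in right_by_id.get(l["ID"], [])]


def versions_containing(relation: Dict[str, List[Row]], name: str) -> List[str]:
    """Second query: versions in which some row has the given NAME."""
    return [v for v, rows in relation.items() if any(r["NAME"] == name for r in rows)]


pairs = join_versions(R, "V1", "V4")
hits = versions_containing(R, "AARON")
```

With this sample data, `pairs` holds the single match on `ID` 2 and `hits` is `["V4"]`.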
Integrated with versioned storage