FeatureHub: towards collaborative data science
Micah J. Smith, Roy Wedge, Kalyan Veeramachaneni
MIT
IEEE DSAA 2017, Tokyo, Japan
A tale of two systems
FeatureHub: towards collaborative data science (Smith, Wedge, Veeramachaneni) 2
ease of use; share results; no collaboration; not scalable; integrated solution; ecosystem of collaboration; wrong abstractions; difficult to use; ease of use; bookkeeping; not open; expensive; many competitors; many solutions; no additional structure
Current collaborative approaches Massive
science
id                       Closest traffic light (meters)
Beacon St @ Prentiss     470
Vassar St @ Main         25
Newbury St @ Mass Ave    …
Memorial Drive @ Ames    130
▫ Needs human intuition and domain expertise
▫ Automation difficult in many circumstances
▫ Collaboration can help uncover key ideas
def hi_lo_age(dataset):
    """Whether users are older than 30 years"""
    from sklearn.preprocessing import binarize
    threshold = 30
    # binarize expects 2-D input, and threshold is keyword-only in recent scikit-learn
    return binarize(dataset["users"][["age"]], threshold=threshold)
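To illustrate the feature abstraction on this slide, here is a minimal, dependency-free sketch. It assumes a dataset is a mapping from table names to columns. The toy data and the names `hi_lo_age_plain` and `check_feature` are hypothetical and are not part of FeatureHub's API.

```python
def hi_lo_age_plain(dataset):
    """Whether users are older than 30 years (plain-Python stand-in)."""
    threshold = 30
    return [1 if age > threshold else 0 for age in dataset["users"]["age"]]


def check_feature(feature, dataset, table="users"):
    """Hypothetical sanity check: one numeric value per row of `table`."""
    values = feature(dataset)
    n_rows = len(next(iter(dataset[table].values())))
    if len(values) != n_rows:
        raise ValueError("feature must return one value per row")
    if not all(isinstance(v, (int, float)) for v in values):
        raise ValueError("feature values must be numeric")
    return values


# Toy data, invented for illustration only.
toy_dataset = {"users": {"id": [1, 2, 3, 4], "age": [22, 30, 31, 58]}}
feature_values = check_feature(hi_lo_age_plain, toy_dataset)
```

The point of the sketch is that a feature is just a function from the dataset to one value per entity, which makes contributions from different people easy to check and combine.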
extract values on train and test sets
AutoML
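One way to picture extracting feature values on train and test sets is sketched below. It assumes each feature is a function of the dataset that returns one value per row. The function `extract_features`, the two feature functions, and the toy data are all hypothetical, and the automated modeling step is left out.

```python
def extract_features(features, dataset):
    """Apply each feature function and return a row-major feature matrix."""
    columns = [list(f(dataset)) for f in features]
    if not columns:
        return []
    n_rows = len(columns[0])
    if any(len(c) != n_rows for c in columns):
        raise ValueError("all features must return the same number of rows")
    # Transpose the list of columns into one row per entity.
    return [list(row) for row in zip(*columns)]


# Two hypothetical feature functions over a toy "users" table.
def older_than_30(dataset):
    return [1 if a > 30 else 0 for a in dataset["users"]["age"]]


def signup_hour_is_evening(dataset):
    return [1 if h >= 18 else 0 for h in dataset["users"]["signup_hour"]]


features = [older_than_30, signup_hour_is_evening]
train = {"users": {"age": [22, 45, 31], "signup_hour": [9, 20, 18]}}
test = {"users": {"age": [64, 19], "signup_hour": [23, 7]}}

# The same feature definitions run on both splits, so the columns line up.
X_train = extract_features(features, train)
X_test = extract_features(features, test)
```

The resulting matrices could then be handed to whatever automated model selection and tuning tool is in use.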
▫ Quality
▫ Security
▫ Metrics to reward good work
▫ Adversarial behavior
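As one possible illustration of a metric to reward good work, the sketch below scores each feature by how much a model's score drops when that feature is left out. This leave-one-feature-out scheme is an assumption chosen for illustration, not necessarily the metric FeatureHub uses. The names `marginal_gains`, `toy_score`, and `TOY_VALUE` are hypothetical.

```python
def marginal_gains(feature_names, score_fn):
    """Score drop when each feature is left out of the full feature set."""
    full_score = score_fn(frozenset(feature_names))
    gains = {}
    for name in feature_names:
        without = frozenset(n for n in feature_names if n != name)
        gains[name] = full_score - score_fn(without)
    return gains


# Toy per-feature contributions, invented for illustration only.
TOY_VALUE = {"hi_lo_age": 0.05, "signup_hour": 0.02, "noise": 0.0}


def toy_score(feature_subset):
    """Stand-in for a cross-validated model score on a feature subset."""
    return 0.5 + sum(TOY_VALUE[name] for name in feature_subset)


gains = marginal_gains(list(TOY_VALUE), toy_score)
# Rank contributors' features from most to least helpful.
ranking = sorted(gains, key=gains.get, reverse=True)
```

In practice `score_fn` would train and evaluate a model, and a metric like this would still need safeguards against adversarial behavior such as submitting near-duplicate features.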
Hired 41 crowd data scientist workers from Upwork
▫ airbnb: Predict the destination country of Airbnb users (Source: Kaggle)
▫ sberbank: Predict selling price for houses and apartments (Source: Kaggle)
Data collected
Combined model competes with expert data scientists
[Figure panels: airbnb, sberbank]
[Figure: competition timeline. Events: competition launches, competitor downloads materials, competitor submits solution 1, competitor submits solution N, competition ends. Time marks: t=0, 5 days, 2 weeks, 4 weeks, 10 weeks.]
What can we accomplish with FeatureHub?
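[Figure: competition timeline annotated with FeatureHub phases. Phases: LAUNCH, CREATE, COMBINE. Phase labels: +3 hours, +2.5 hours, +6 hours. Time marks: 12 hours, 5 days, 2 weeks. Events: competition launches, competitor downloads materials, competitor submits solution 1, competitor submits solution N, competition ends.]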
1. Focus creative effort of data scientists working in parallel on feature engineering
2. Integrate source code contributions into a single model
3. Automate everything else and produce output quickly
with automated modeling
using FeatureHub to generate competitive predictive models using limited resources
Micah J. Smith, Roy Wedge, Kalyan Veeramachaneni
MIT
Source code: https://github.com/HDI-Project/FeatureHub
Correspondence: Micah Smith (micahs@mit.edu, @micahjsmith)