SLIDE 1
Data Pipeline Selection and Optimization, DOLAP 2019, Alexandre Quemy
Alexandre Quemy
IBM, Data and AI
Politechnika Poznańska
[Figure: the usual workflow. Labels: data collection, raw data, data pipeline, model selection, algorithm.]
SLIDE 2
SLIDE 3
The hard life of data scientists
→ Dealing with missing values:
  → Discarding? Row? Column?
  → Imputation? What imputation? Mean? Median? Model-based? What model?
→ Imbalanced datasets:
  → Downsampling? Oversampling?
  → Nothing? What bias does it imply?
→ Data too large:
  → Dimensionality reduction: what algorithm? PCA? Normalization or not?
  → Subsampling: what technique? What bias?
→ Outlier detection and curation:
  → What threshold? What deviation measure?
  → Trimming? Truncating? Censoring? Winsorizing?
→ Encoding for method domain requirements:
  → Discretization? Grid? What step? Cluster? What method? What hyperparameter?
  → Categorical encoder? Binary? One-hot? Helmert? Backward difference?
→ NLP:
  → How many tokens?
  → Size of n-grams?
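To make two of the questions above concrete, here is a minimal sketch of how the choice of imputation (mean or median) and of outlier treatment (winsorizing) changes a toy column. The data, the function names and the clamping bounds are assumptions made for illustration; none of them come from the talk.

```python
from statistics import mean, median


def impute(values, strategy):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


def winsorize(values, low, high):
    """Clamp every value into [low, high] instead of dropping outliers."""
    return [min(max(v, low), high) for v in values]


# Toy feature with one missing value and one outlier (illustrative data).
column = [1.0, 2.0, None, 3.0, 100.0]

mean_filled = impute(column, "mean")      # the outlier drags the fill value up
median_filled = impute(column, "median")  # the median is robust to the outlier
clamped = winsorize(median_filled, 0.0, 10.0)
```

On this column the two imputations fill the gap with very different values, which is the kind of decision the slide says is usually left to the data scientist.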
SLIDE 4
The usual workflow
[Figure: the usual workflow. Stages: data collection (raw data), data pipeline p (Operation 1, Operation 2, Operation 3), model selection (algorithm; model 1, model 2, model 3; metric; metaoptimizer; best model). Model selection is potentially tool-assisted.]
SLIDE 5
The workflow proposed in the paper
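[Figure: the usual workflow (data collection, raw data, data pipeline, model selection with algorithm, model 1, model 2, model 3, metric, metaoptimizer, best model), extended with a second metaoptimizer that selects the best pipeline p among candidate operators (O11, O12, O21, O22, O23). Open question: which metric?]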
SLIDE 6
The workflow proposed in the paper
[Figure: same diagram as on the previous slide, with the second metaoptimizer selecting the best pipeline p among candidate operators (O11, O12, O21, O22, O23), now driven by a metric and with a feedback link added.]
SLIDE 7
Pipeline prototype
- Rebalance: 4 operators
- Normalize: 5 operators
- Features: 4 operators
- Configuration space: 4750 configurations
- Baseline: (Id, Id, Id)
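As a rough sketch of how such a prototype could be laid out, the three steps can be enumerated as a Cartesian product of operator choices. The operator names below are hypothetical placeholders (the slide gives only the counts), and it is an assumption that the identity operator Id is counted among the operators of each step.

```python
from itertools import product

# Hypothetical operator names: only the counts (4, 5, 4) come from the slide.
# "Id" is the identity operator used by the baseline (Id, Id, Id).
REBALANCE = ["Id", "rebalance_a", "rebalance_b", "rebalance_c"]
NORMALIZE = ["Id", "normalize_a", "normalize_b", "normalize_c", "normalize_d"]
FEATURES = ["Id", "features_a", "features_b", "features_c"]

# Every way of picking one operator per step, in pipeline order.
operator_choices = list(product(REBALANCE, NORMALIZE, FEATURES))

baseline = ("Id", "Id", "Id")
n_operator_choices = len(operator_choices)  # 4 * 5 * 4
```

Picking operators alone gives 80 combinations, far fewer than the 4750 configurations quoted on the slide. This suggests that the configuration space also counts operator parameters, but the slide does not give that breakdown.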
SLIDE 8
Protocol
- Datasets: Breast, Iris, Wine.
- Methods: SVM, Random Forest, Neural Network, Decision Tree.
- Dataset split: 60% for training set, 40% for test set.
- Pipeline configuration space size: 4750 configurations.
- Performance metric: cross-validation accuracy.
- Metaoptimizer: Tree-structured Parzen Estimator (hyperopt).
- Budget: 100 configurations (~2% of the space).
No algorithm hyperparameter tuning! We want to quantify the influence of the data pipeline alone.
An exhaustive search over the configuration space provides the maximum score, against which the baseline and the optimized pipeline are compared.
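The protocol above compares a budgeted search against an exhaustive one. The sketch below illustrates that comparison with plain random sampling standing in for the Tree-structured Parzen Estimator, so it is not the optimizer used in the talk. The scoring function is a hypothetical stand-in for the cross-validation accuracy of a pipeline; only the space size and the budget come from the slide.

```python
import random

SPACE_SIZE = 4750  # from the slide
BUDGET = 100       # from the slide


def toy_score(config_id):
    """Hypothetical stand-in for the cross-validation accuracy of one pipeline."""
    return 1.0 - abs(config_id - 3000) / SPACE_SIZE


def budgeted_search(score, space_size, budget, seed=0):
    """Evaluate `budget` distinct random configurations and keep the best one."""
    rng = random.Random(seed)
    candidates = rng.sample(range(space_size), budget)
    best = max(candidates, key=score)
    return best, score(best)


def exhaustive_search(score, space_size):
    """Evaluate every configuration; gives the reference maximum score."""
    best = max(range(space_size), key=score)
    return best, score(best)


best_id, best_score = budgeted_search(toy_score, SPACE_SIZE, BUDGET)
optimum_id, optimum_score = exhaustive_search(toy_score, SPACE_SIZE)
budget_fraction = BUDGET / SPACE_SIZE  # about 0.021, the "~2%" on the slide
```

A model-based optimizer such as TPE would replace `rng.sample` with proposals informed by past evaluations; the budget accounting and the comparison to the exhaustive maximum stay the same.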
SLIDE 9
Results
SLIDE 10
On average, with 20 iterations (0.42% of the search space):
1. The error decreases by 58% compared to the baseline.
2. The score reaches 98.92% in the normalized score space.
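The two figures above can be read with the following sketch. It assumes that the normalized score maps the baseline score to 0 and the exhaustive-search maximum to 1, and that the error is 1 minus the accuracy. Both definitions are assumptions, since the slide does not spell them out, and the accuracies used are made-up numbers, not results from the talk.

```python
def normalized_score(score, baseline, best):
    """Assumed definition: baseline maps to 0, exhaustive maximum maps to 1."""
    if best == baseline:
        raise ValueError("baseline already equals the maximum score")
    return (score - baseline) / (best - baseline)


def relative_error_decrease(score, baseline):
    """Fraction of the baseline error (1 - accuracy) removed by the pipeline."""
    baseline_error = 1.0 - baseline
    return (baseline_error - (1.0 - score)) / baseline_error


# Hypothetical accuracies for illustration only.
baseline_acc, found_acc, best_acc = 0.90, 0.95, 0.96

norm = normalized_score(found_acc, baseline_acc, best_acc)
decrease = relative_error_decrease(found_acc, baseline_acc)

# Share of the 4750-configuration space explored in 20 iterations.
explored = 20 / 4750  # about 0.0042, the 0.42% on the slide
```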
SLIDE 11
How close are we to the optimal pipeline?
SLIDE 12
A solution for Euclidean space
For each optimal configuration r:
1. Build the sample w.r.t. the algorithms:
   → For each algorithm, select the optimal point that is the closest to the reference point.
2. Express the sample in the normalized configuration space.
3. Calculate the NMAD on the sample.

N: number of algorithms
K: dimension of the configuration space
r: a reference point
p*: sample of optimal configurations
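The three steps above can be sketched as follows. The slide does not give the NMAD formula, so this sketch assumes it is the mean absolute deviation of the sample from the reference point, averaged over the N algorithms and the K dimensions. It also assumes Euclidean distance for "closest" and min-max scaling for the normalization. The algorithm names, bounds and points are illustrative.

```python
from math import dist


def build_sample(reference, optima_per_algorithm):
    """Step 1: for each algorithm, keep its optimal point closest to the reference."""
    return [
        min(points, key=lambda p: dist(p, reference))
        for points in optima_per_algorithm.values()
    ]


def normalize(point, lower, upper):
    """Step 2: min-max scale each coordinate into [0, 1] (assumed normalization)."""
    return [(x - lo) / (hi - lo) for x, lo, hi in zip(point, lower, upper)]


def nmad(reference, sample):
    """Step 3 (assumed formula): sum of |p_ik - r_k| divided by N * K."""
    n, k = len(sample), len(reference)
    total = sum(abs(x - r) for p in sample for x, r in zip(p, reference))
    return total / (n * k)


# Illustrative 2-dimensional configuration space with bounds [0, 4] x [0, 10].
lower, upper = [0.0, 0.0], [4.0, 10.0]
reference = (2.0, 5.0)  # r: one optimal configuration used as the reference
optima = {
    "svm": [(2.0, 5.0), (4.0, 10.0)],
    "tree": [(0.0, 5.0), (4.0, 0.0)],
}

sample = build_sample(reference, optima)
norm_ref = normalize(reference, lower, upper)
norm_sample = [normalize(p, lower, upper) for p in sample]
value = nmad(norm_ref, norm_sample)
```

Under these assumptions a value of 0 means every algorithm has an optimal configuration exactly at r, and larger values mean the algorithms' optima are spread further from it.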
SLIDE 13
Results on two datasets for text classification
SLIDE 14
Future work
Work in progress:
→ Tests on larger configuration spaces.
→ Online architecture.
→ Metric between pipelines.
SLIDE 15