Data Pipeline Selection and Optimization, DOLAP 2019, Alexandre Quemy

SLIDE 1

Data Pipeline Selection and Optimization DOLAP 2019

Alexandre Quemy
IBM, Data and AI
Politechnika Poznańska

SLIDE 2

The usual workflow

[Diagram: Data collection → Raw data → Data pipeline (Operation 1 → Operation 2 → Operation 3) → Model selection. In model selection, a metaoptimizer drives an algorithm that produces model 1, model 2 and model 3; a metric selects the best model. Label on the slide: "Potentially tool-assisted".]

SLIDE 3

The hard life of data scientists

→ Dealing with missing values:
    → Discarding? Row? Column?
    → Imputation? What imputation? Mean? Median? Model-based? What model?
→ Imbalanced datasets:
    → Downsampling? Oversampling?
    → Nothing? What bias does it imply?
→ Data too large:
    → Dimensionality reduction: what algorithm? PCA? Normalization or not?
    → Subsampling: what technique? What bias?
→ Outlier detection and curation:
    → What threshold? What deviation measure?
    → Trimming? Truncating? Censoring? Winsorizing?
→ Encoding for method domain requirements:
    → Discretization? Grid? What step? Cluster? What method? What hyperparameter?
    → Categorical encoder? Binary? One-Hot? Helmert? Backward Difference?
→ NLP:
    → How many tokens?
    → Size of n-grams?
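To make two of these choices concrete, here is a minimal sketch of how mean and median imputation diverge on a column with an outlier, and of how winsorizing clamps that outlier instead of dropping it. The column values and the clamping bounds are hypothetical and chosen only for illustration.

```python
import statistics

def impute(values, strategy):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

def winsorize(values, low, high):
    """Clamp every value into [low, high] instead of discarding outliers."""
    return [min(max(v, low), high) for v in values]

# Hypothetical column with one missing value and one outlier.
column = [1.0, 2.0, None, 3.0, 100.0]

mean_imputed = impute(column, "mean")        # the outlier pulls the fill value up
median_imputed = impute(column, "median")    # the median is robust to the outlier
clamped = winsorize(median_imputed, 1.0, 10.0)

print(mean_imputed[2], median_imputed[2], clamped[-1])
```

Each decision changes the data seen by the downstream algorithm, which is why the choices above cannot be made independently of model selection.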

SLIDE 4

The usual workflow

[Diagram: Data collection → Raw data → Data pipeline (Operation 1 → Operation 2 → Operation 3) → Model selection. In model selection, a metaoptimizer drives an algorithm that produces model 1, model 2 and model 3; a metric selects the best model. Label on the slide: "Potentially tool-assisted".]

SLIDE 5

The workflow proposed in the paper

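[Diagram: the usual workflow (Data collection → Raw data → Data pipeline → Model selection) with a second metaoptimizer attached to the data pipeline. The pipeline steps offer alternative operators (labelled O11, O12, O21, O22, O23 on the slide) and the search outputs the best pipeline. The metric that should guide this search is left as an open question: "metric?"]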

SLIDE 6

The workflow proposed in the paper

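[Diagram: the usual workflow (Data collection → Raw data → Data pipeline → Model selection) with a second metaoptimizer attached to the data pipeline. The pipeline steps offer alternative operators (labelled O11, O12, O21, O22, O23 on the slide) and the search outputs the best pipeline. The model selection metric is now fed back to the pipeline metaoptimizer (arrow labelled "Feedback").]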

SLIDE 7

Pipeline prototype

  • Rebalance: 4 operators
  • Normalize: 5 operators
  • Features: 4 operators
  • Configuration space: 4750 configurations
  • Baseline: (Id, Id, Id)
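A minimal sketch of how such a three-step space can be enumerated. The operator names are hypothetical (the slide gives only the number of operators per step), and treating the identity operator "Id" as one of the listed operators is also an assumption. The sketch counts operator combinations only; the 4750 configurations reported on the slide are presumably larger because operators carry their own parameters, a breakdown this transcript does not give.

```python
from itertools import product

# Hypothetical operator names; only the counts (4, 5, 4) come from the slide.
# "Id" stands for the identity operator, i.e. the step leaves the data unchanged.
STEPS = {
    "rebalance": ["Id", "rebalance_a", "rebalance_b", "rebalance_c"],
    "normalize": ["Id", "normalize_a", "normalize_b", "normalize_c", "normalize_d"],
    "features": ["Id", "features_a", "features_b", "features_c"],
}

# The baseline pipeline applies no transformation at any step.
BASELINE = ("Id", "Id", "Id")

def operator_choices(steps):
    """Enumerate every combination of one operator per pipeline step."""
    return list(product(*steps.values()))

choices = operator_choices(STEPS)
print(len(choices))          # 4 * 5 * 4 operator combinations
print(BASELINE in choices)   # the baseline is one point of the space
```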

SLIDE 8

Protocol

  • Datasets: Breast, Iris, Wine.
  • Methods: SVM, Random Forest, Neural Network, Decision Tree.
  • Dataset split: 60% for training set, 40% for test set.
  • Pipeline configuration space size: 4750 configurations.
  • Performance metric: cross-validation accuracy.
  • Metaoptimizer: Tree-structured Parzen Estimator (hyperopt).
  • Budget: 100 configurations (~2% of the space).

No algorithm hyperparameter tuning! We want to quantify the influence of the data pipeline.
An exhaustive search is used to compare the baseline with the maximum score.
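The protocol compares a budgeted search against an exhaustive one. The sketch below shows that comparison on a toy space, with two loud assumptions: plain random search is swapped in for the Tree-structured Parzen Estimator so that the example needs only the standard library, and `score` is a made-up stand-in for the cross-validation accuracy of a fixed algorithm.

```python
import random
from itertools import product

# Hypothetical toy space: three pipeline steps with integer-coded choices.
SPACE = list(product(range(4), range(5), range(4)))
BASELINE = (0, 0, 0)  # stands in for (Id, Id, Id)

def score(config):
    """Toy stand-in for cross-validation accuracy (not a real model)."""
    a, b, c = config
    return 0.70 + 0.03 * a + 0.02 * b - 0.01 * (c - 2) ** 2

def budgeted_search(space, budget, seed=0):
    """Evaluate `budget` randomly drawn configurations and keep the best.

    Random search replaces TPE here; only the budgeting logic is illustrated.
    """
    rng = random.Random(seed)
    sampled = rng.sample(space, budget)
    return max(sampled, key=score)

# Exhaustive search gives the reference maximum for the comparison.
best_exhaustive = max(SPACE, key=score)
best_budgeted = budgeted_search(SPACE, budget=20)

print("baseline  ", score(BASELINE))
print("budgeted  ", score(best_budgeted))
print("exhaustive", score(best_exhaustive))
```

Because the algorithm's hyperparameters stay fixed, any gap between the baseline and the other two scores is attributable to the pipeline alone.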

SLIDE 9

Results

SLIDE 10

On average, with 20 iterations (0.42% of the search space):
  1. a decrease of the error by 58% compared to the baseline;
  2. 98.92% in the normalized score space.
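The arithmetic behind these figures can be sketched as follows. The search-space fraction follows directly from the protocol (20 of 4750 configurations). The two other formulas are assumptions, since the slides do not define them: error reduction is taken relative to the baseline error, and the normalized score rescales accuracy so that the baseline maps to 0 and the exhaustive maximum to 1. The accuracies in the example are hypothetical, not results from the talk.

```python
def search_fraction(iterations, space_size):
    """Share of the configuration space explored by the search."""
    return iterations / space_size

def error_reduction(baseline_acc, acc):
    """Relative decrease of the error rate w.r.t. the baseline (assumed definition)."""
    baseline_err = 1.0 - baseline_acc
    return (baseline_err - (1.0 - acc)) / baseline_err

def normalized_score(acc, baseline_acc, best_acc):
    """Accuracy rescaled to [0, 1] between baseline and maximum (assumed definition)."""
    return (acc - baseline_acc) / (best_acc - baseline_acc)

# 20 iterations out of 4750 configurations, as on the slide.
print(round(100 * search_fraction(20, 4750), 2))

# Hypothetical accuracies for illustration only.
baseline_acc, found_acc, best_acc = 0.90, 0.95, 0.96
print(error_reduction(baseline_acc, found_acc))
print(normalized_score(found_acc, baseline_acc, best_acc))
```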

SLIDE 11

How close are we to the optimal pipeline?

SLIDE 12

A solution for Euclidean space

For each optimal configuration r:
  1. Build the sample w.r.t. the algorithms:
     → For each algorithm, select the optimal point that is the closest to the reference point.
  2. Express the sample in the normalized configuration space.
  3. Calculate the NMAD on the sample.

N: number of algorithms
K: dimension of the configuration space
r: a reference point
p*: sample of optimal configurations
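The three steps above can be sketched as below. The NMAD formula itself did not survive in this transcript, so the sketch assumes it is a normalized mean absolute deviation: the absolute coordinate-wise deviation from r, averaged over the N algorithms and the K dimensions. The per-dimension normalization (dividing by the number of choices minus one) and all data are likewise hypothetical.

```python
import math

def normalize(point, sizes):
    """Map each coordinate to [0, 1], given the number of choices per dimension."""
    return [x / (s - 1) for x, s in zip(point, sizes)]

def nmad(optima_per_algorithm, reference, sizes):
    """Assumed NMAD: mean absolute deviation from `reference` over N * K terms."""
    # Step 1: for each algorithm, keep the optimal point closest to the reference.
    sample = [
        min(optima, key=lambda p: math.dist(p, reference))
        for optima in optima_per_algorithm
    ]
    # Step 2: express the sample (and the reference) in the normalized space.
    r = normalize(reference, sizes)
    normalized_sample = [normalize(p, sizes) for p in sample]
    # Step 3: average the absolute deviations over algorithms and dimensions.
    n, k = len(normalized_sample), len(r)
    total = sum(abs(a - b) for p in normalized_sample for a, b in zip(p, r))
    return total / (n * k)

# Hypothetical data: K = 3 dimensions with 4, 5 and 4 choices, N = 3 algorithms.
sizes = (4, 5, 4)
reference = (3, 4, 2)
optima_per_algorithm = [
    [(3, 4, 2)],              # algorithm 1: a single optimum, equal to r
    [(0, 0, 0), (3, 4, 1)],   # algorithm 2: two optima, the second is closer to r
    [(3, 2, 2)],              # algorithm 3
]

print(nmad(optima_per_algorithm, reference, sizes))
```

Under this assumed definition, a value of 0 means every algorithm shares the reference configuration as an optimum, while larger values mean the optimal pipelines are more algorithm-specific.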

SLIDE 13

Results on two datasets for text classification

SLIDE 14

Future work

Work in progress:
→ Tests on larger configuration spaces.
→ Online architecture.
→ Metric between pipelines.

SLIDE 15

Thank you

Don’t forget the poster session!