Data Pipeline Selection and Optimization, DOLAP 2019, Alexandre Quemy


Jun 25, 2023


SLIDE 1

Data Pipeline Selection and Optimization DOLAP 2019

Alexandre Quemy
IBM, Data and AI
Politechnika Poznańska

SLIDE 2

The usual workflow

[Diagram: Data collection → Raw data → Data pipeline (Operation 1 → Operation 2 → Operation 3) → Model selection. Inside model selection: metaoptimizer, parameters p, algorithm, model 1, model 2, model 3, metric, best model. Annotation: potentially tool-assisted.]

SLIDE 3

The hard life of data scientists

→ Dealing with missing values:
  → Discarding? Row? Column?
  → Imputation? What imputation? Mean? Median? Model-based? What model?
→ Imbalanced datasets:
  → Downsampling? Oversampling?
  → Nothing? What bias does it imply?
→ Data too large:
  → Dimensionality reduction: what algorithm? PCA? Normalization or not?
  → Subsampling: what technique? What bias?
→ Outlier detection and curation:
  → What threshold? What deviation measure?
  → Trimming? Truncating? Censoring? Winsorizing?
→ Encoding for method domain requirements:
  → Discretization? Grid? What step? Cluster? What method? What hyperparameter?
  → Categorical encoder? Binary? One-Hot? Helmert? Backward Difference?
→ NLP:
  → How many tokens?
  → Size of n-grams?

SLIDE 4

The usual workflow

[Diagram, same as slide 2: Data collection → Raw data → Data pipeline (Operation 1 → Operation 2 → Operation 3) → Model selection. Inside model selection: metaoptimizer, parameters p, algorithm, model 1, model 2, model 3, metric, best model. Annotation: potentially tool-assisted.]

SLIDE 5

The workflow proposed in the paper

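[Diagram: the usual workflow (Data collection → Raw data → Data pipeline → Model selection) with a second metaoptimizer, with its own parameters p, attached to the data pipeline. Operator labels as they appear on the slide: O11; O21, O22; O12, O22, O23. Output of the pipeline stage: best pipeline. The metric feeding the pipeline metaoptimizer is marked as an open question: "metric?"]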

SLIDE 6

The workflow proposed in the paper

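[Diagram: the usual workflow (Data collection → Raw data → Data pipeline → Model selection) with a second metaoptimizer, with its own parameters p, attached to the data pipeline. Operator labels as they appear on the slide: O11; O21, O22; O12, O22, O23. Output of the pipeline stage: best pipeline. The metric from model selection is now sent back to the pipeline metaoptimizer, labeled "Feedback".]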

SLIDE 7

Pipeline prototype

Rebalance: 4 operators
Normalize: 5 operators
Features: 4 operators
Configuration space: 4750 configurations
Baseline: (Id, Id, Id)
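A minimal sketch of how such a three-step pipeline space could be enumerated. The operator names are hypothetical (the slide gives only the counts), and the sketch assumes that the identity operator Id is counted among the operators of each step. It only enumerates operator choices: 4 × 5 × 4 gives 80 combinations, so the 4750 configurations quoted on the slide presumably also include operator parameters, which the slide does not list.

```python
from itertools import product

# Hypothetical operator names: the slide gives only the counts (4, 5, 4).
# "Id" is the identity (no-op) operator used by the baseline.
STEPS = {
    "rebalance": ["Id", "rebalance_a", "rebalance_b", "rebalance_c"],
    "normalize": ["Id", "normalize_a", "normalize_b", "normalize_c", "normalize_d"],
    "features": ["Id", "features_a", "features_b", "features_c"],
}


def operator_combinations(steps):
    """Enumerate one operator choice per step, keeping the step order."""
    names = list(steps)
    return [dict(zip(names, combo)) for combo in product(*(steps[n] for n in names))]


combos = operator_combinations(STEPS)
baseline = {name: "Id" for name in STEPS}  # the (Id, Id, Id) pipeline

print(len(combos))         # 80
print(baseline in combos)  # True
```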

SLIDE 8

Protocol

Datasets: Breast, Iris, Wine.
Methods: SVM, Random Forest, Neural Network, Decision Tree.
Dataset split: 60% for training set, 40% for test set.
Pipeline configuration space size: 4750 configurations.
Performance metric: Cross-validation accuracy
Metaoptimizer: Tree-structured Parzen Estimator (hyperopt)
Budget: 100 configurations (~2% of the space)

No algorithm hyperparameter tuning! We want to quantify the influence of the data pipeline.
Exhaustive search is used to compare the baseline with the maximum score.
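The protocol above amounts to a budgeted search compared against an exhaustive one. The sketch below illustrates that loop under stated assumptions: plain random sampling stands in for the metaoptimizer (it is not TPE and does not use hyperopt), and the space and score function are toy placeholders rather than real pipelines or cross-validation accuracy.

```python
import random


def budgeted_search(space, score, budget, seed=0):
    """Evaluate `budget` distinct configurations and return the best one found.

    Random sampling is a stand-in for the metaoptimizer; it is NOT TPE.
    """
    rng = random.Random(seed)
    candidates = rng.sample(space, min(budget, len(space)))
    history = [(score(c), c) for c in candidates]
    best_score, best_config = max(history, key=lambda pair: pair[0])
    return best_config, best_score, history


# Toy stand-ins: 4750 integer "configurations" and a made-up score function.
space = list(range(4750))


def toy_score(config):
    return 1.0 - abs(config - 3000) / 4750  # peaks at configuration 3000


best_config, best_score, history = budgeted_search(space, toy_score, budget=100)

# The space is small enough to score every configuration for comparison.
exhaustive_best = max(toy_score(c) for c in space)
print(len(history), best_score <= exhaustive_best)
```

Because the space is small, the exhaustive maximum is cheap to compute here, which is what makes the comparison between the budgeted result and the true optimum possible.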

SLIDE 9

Results

SLIDE 10

On average, with 20 iterations (0.42% of the search space):
1. Decrease of error by 58% compared to the baseline.
2. 98.92% in the normalized score space.
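The percentages quoted for the budget can be checked against the 4750-configuration space. The sketch below does that, and also shows one plausible reading of "decrease of error": the drop in error (1 minus accuracy) relative to the baseline error. That reading is an assumption, and the accuracies used are hypothetical values chosen only to illustrate the formula.

```python
SPACE_SIZE = 4750  # configuration space size stated in the protocol


def budget_fraction(evaluations, space_size=SPACE_SIZE):
    """Share of the configuration space explored, as a percentage."""
    return 100.0 * evaluations / space_size


def relative_error_decrease(baseline_accuracy, pipeline_accuracy):
    """Percentage drop in error (1 - accuracy) relative to the baseline error."""
    baseline_error = 1.0 - baseline_accuracy
    pipeline_error = 1.0 - pipeline_accuracy
    return 100.0 * (baseline_error - pipeline_error) / baseline_error


print(round(budget_fraction(20), 2))   # 0.42
print(round(budget_fraction(100), 2))  # 2.11

# Hypothetical accuracies, not results from the presentation.
print(round(relative_error_decrease(0.90, 0.958), 1))  # 58.0
```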

SLIDE 11

How close are we to the optimal pipeline?

SLIDE 12

A solution for Euclidean space

For each optimal configuration r:

1. Build the sample w.r.t. the algorithms:
→ For each algorithm, select the optimal point that is closest to the reference point.
2. Express the sample in the normalized configuration space.
3. Calculate the NMAD on the sample.

N: number of algorithms
K: dimension of the configuration space
r: a reference point
p*: sample of optimal configurations
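The three steps can be sketched as follows. The formula on the slide did not survive extraction, so this sketch assumes that NMAD denotes a normalized mean absolute deviation: the absolute coordinate-wise deviation of each sampled optimum from r, averaged over the N algorithms and K dimensions after rescaling every dimension to [0, 1]. The function names, bounds and data points are hypothetical.

```python
def normalize(point, lower, upper):
    """Rescale each coordinate to [0, 1] given per-dimension bounds."""
    return [(x - lo) / (hi - lo) for x, lo, hi in zip(point, lower, upper)]


def closest_optimum(optima, reference):
    """Among one algorithm's optimal points, pick the nearest to the reference (Euclidean)."""
    return min(optima, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, reference)))


def nmad(sample, reference):
    """Mean absolute deviation from the reference over N points and K dimensions."""
    n, k = len(sample), len(reference)
    total = sum(abs(a - b) for point in sample for a, b in zip(point, reference))
    return total / (n * k)


# Hypothetical data: K = 2 dimensions, N = 3 algorithms.
lower, upper = [0, 0], [10, 4]
r = [5, 2]  # reference point
optima_per_algorithm = [
    [[5, 2]],          # algorithm A: a single optimum
    [[7, 2], [0, 0]],  # algorithm B: two optima, keep the closest to r
    [[5, 4]],          # algorithm C
]

# Step 1: build the sample p*, one point per algorithm.
p_star = [closest_optimum(optima, r) for optima in optima_per_algorithm]
# Step 2: express the sample and the reference in the normalized space.
p_star_norm = [normalize(p, lower, upper) for p in p_star]
r_norm = normalize(r, lower, upper)
# Step 3: compute the NMAD on the normalized sample.
print(round(nmad(p_star_norm, r_norm), 4))  # 0.1167
```

Under this reading, a value near 0 means that the algorithms share nearly the same optimal pipeline configuration, while larger values mean that the optimal pipeline depends on the algorithm.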

SLIDE 13

Results on two datasets for text classification

SLIDE 14

Future work

Work in progress:

→ Tests on larger configuration spaces.
→ Online architecture.
→ Metric between pipelines.

SLIDE 15


Thank you

Don’t forget the poster session!