SLIDE 1

Introduction Learning for optimization Optimization for learning

Machine learning and black-box expensive optimization

Sébastien Verel

Laboratoire d'Informatique, Signal et Image de la Côte d'Opale (LISIC)

Université du Littoral Côte d'Opale, Calais, France

http://www-lisic.univ-littoral.fr/~verel/

June 18th, 2018

SLIDE 2

AI: Machine Learning, Optimization, perception, etc.


SLIDE 7

AI: Machine Learning, Optimization, perception, etc.

Learning:
Minimize an error function.
Mθ: model to learn from the data.
Search θ⋆ = arg minθ Error(Mθ, data).
Depending on the model dimension, the variables, the error function, etc., there is a huge number of optimization algorithms.

Optimization:
Design an algorithm that finds good solutions.
Aθ: search algorithm for problems (X, f).
Learn Aθ such that Aθ(X, f) = arg minx∈X f(x).
Depending on the class of algorithms, the search spaces, the functions, etc., there is a huge number of learning algorithms.

Artificial: from paper to computer!
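As a toy illustration of the "learning is optimization" formulation above, a one-parameter model Mθ(x) = θ·x can be fit by gradient descent on the error function, i.e. θ⋆ = arg minθ Error(Mθ, data). The data and learning rate here are invented for the sketch, not from the talk:

```python
# Toy data, roughly y = 2x; purely illustrative.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]

def error(theta):
    """Mean squared error of the model M_theta(x) = theta * x on the data."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)

def fit(theta=0.0, lr=0.05, steps=200):
    """Plain gradient descent on the error: one way to search theta*."""
    for _ in range(steps):
        grad = sum(2 * (theta * x - y) * x for x, y in data) / len(data)
        theta -= lr * grad
    return theta

theta_star = fit()
print(round(theta_star, 2))  # prints 1.99, the least-squares slope
```

Any of the "huge number of optimization algorithms" mentioned above could replace the gradient-descent loop; the arg min formulation stays the same.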

SLIDE 8

Black-box (expensive) optimization

x → (black box) → f(x)

No information on the definition of the objective function f.

Objective function: can be irregular, non-continuous, non-differentiable, etc.; given by a computation or an (expensive) simulation.

A few examples from the team:
  • Mobility simulation (Florian Leprêtre)
  • Plant biology, plant growth (Amaury Dubois)
  • Logistics simulation (Brahim Aboutaib)
  • Cellular automata
  • Nuclear power plant (Valentin Drouet)
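The black-box setting can be sketched in a few lines: the optimizer may only query x → f(x), with no access to gradients or to f's definition. The sphere function below is a cheap stand-in for an expensive simulator, and random search is the simplest possible black-box optimizer; both are illustrative choices, not methods from the talk:

```python
import random

def black_box(x):
    """Stand-in for an expensive simulator: only x -> f(x) is observable."""
    return sum(xi ** 2 for xi in x)

def random_search(f, dim, budget=500, low=-5.0, high=5.0, seed=1):
    """Minimal black-box optimizer: uses nothing but evaluations of f."""
    rng = random.Random(seed)
    best_x, best_f = None, float("inf")
    for _ in range(budget):
        x = [rng.uniform(low, high) for _ in range(dim)]
        fx = f(x)  # the only interaction with the objective
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f

x_star, f_star = random_search(black_box, dim=3)
print(f_star)  # small, but rarely exactly 0 with this tiny budget
```

When each call to f is a real simulation, the evaluation budget dominates everything, which is what motivates the surrogate models discussed later.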
SLIDE 9

Real-world black-box expensive optimization

PhD theses of Mathieu Muniglia (2014-2017) and Valentin Drouet (2017-2020), CEA, Paris.

x → (multi-physics simulator) → f(x), e.g. (73, . . . , 8) → ∆zP

Expensive optimization: parallel computing and surrogate models.


SLIDE 12

Adaptive distributed optimization algorithms

Christopher Jankee, Bilel Derbel, Cyril Fonlupt

Methodology: use designed benchmark functions with controlled properties, and experimental analysis.

Portfolio of algorithms: control of the algorithm during optimization.

How to select an algorithm? Design reinforcement learning methods for distributed computing (ε-greedy, adaptive pursuit, UCB, ...).

How to compute a reward? An aggregation function of local rewards (mean, max, etc.) for a global selection.
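A minimal sketch of the portfolio idea, not the authors' exact method: two hypothetical algorithms are treated as bandit arms, each "run" yields local rewards from a few simulated nodes aggregated by their mean, and ε-greedy selects which algorithm to run next. Arm means, node count, and all parameters are invented:

```python
import random

def epsilon_greedy(mean_rewards, epsilon, rng):
    """With probability epsilon explore a random arm, else exploit the best."""
    if rng.random() < epsilon:
        return rng.randrange(len(mean_rewards))
    return max(range(len(mean_rewards)), key=lambda a: mean_rewards[a])

def run_portfolio(arm_means, steps=2000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(arm_means)
    sums = [0.0] * len(arm_means)
    for _ in range(steps):
        # Unvisited arms get +inf so that each one is tried at least once.
        est = [s / c if c else float("inf") for s, c in zip(sums, counts)]
        a = epsilon_greedy(est, epsilon, rng)
        local = [rng.gauss(arm_means[a], 0.1) for _ in range(4)]  # node rewards
        sums[a] += sum(local) / len(local)  # mean aggregation -> global reward
        counts[a] += 1
    return counts

counts = run_portfolio([0.2, 0.5])  # arm 1 has the higher true reward
print(counts)  # the better arm ends up selected far more often
```

Swapping `epsilon_greedy` for adaptive pursuit or UCB, or `mean` for `max` in the aggregation, gives the design space the slide refers to.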

SLIDE 13

Features to learn: multi-objective fitness landscapes

K. Tanaka, H. Aguirre (Univ. Shinshu), A. Liefooghe, B. Derbel (Univ. Lille), 2010-2018

Fitness landscape: (X, f, N), i.e. search space, objective function, and neighborhood relation.

[Figure: objective-space plots (Objective 1 vs. Objective 2) for conflicting, independent, and correlated objectives]

[Figure: heatmap of Kendall's tau correlations between landscape features (hypervolume, dominance counts, ρ, m, n, k/n, etc.)]

Performance prediction for GSEMO (cross-validation):

feature set       MAE       MSE       R2        rank
all               0.007781  0.000118  0.951609  1
enumeration       0.008411  0.000142  0.943046  2
sampling all      0.009113  0.000161  0.932975  3
sampling rws      0.009284  0.000167  0.930728  4
sampling aws      0.010241  0.000195  0.917563  5
{r, m, n, k/n}    0.010609  0.000215  0.911350  6
{r, m, n}         0.026974  0.001123  0.518505  7
{m, n}            0.032150  0.001545  0.340715  8
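The prediction step behind such a table can be sketched with one feature and ordinary least squares, scored by the same MAE, MSE, and R² metrics. The feature and performance values below are invented for illustration (the feature could be, e.g., the objective correlation ρ):

```python
# Invented toy data: one landscape feature vs. algorithm performance.
features = [0.1, 0.3, 0.5, 0.7, 0.9]
perf = [0.12, 0.28, 0.52, 0.69, 0.93]

n = len(features)
mx = sum(features) / n
my = sum(perf) / n
sxx = sum((x - mx) ** 2 for x in features)
sxy = sum((x - mx) * (y - my) for x, y in zip(features, perf))
slope = sxy / sxx            # ordinary least squares with a single feature
intercept = my - slope * mx
pred = [intercept + slope * x for x in features]

# The three scores used in the table above.
mae = sum(abs(p - y) for p, y in zip(pred, perf)) / n
mse = sum((p - y) ** 2 for p, y in zip(pred, perf)) / n
r2 = 1.0 - (mse * n) / sum((y - my) ** 2 for y in perf)

print(mae, mse, r2)  # r2 near 1 on this almost-linear toy data
```

The real study regresses on many features at once (and cross-validates); this sketch only shows how the three columns of the table are computed.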

SLIDE 14

Learning/tuning parameters according to features


SLIDE 16

Surrogate model for pseudo-Boolean functions

Goal: replace/learn the (expensive) objective function with a (cheap) meta-model during the optimization process.

Continuous optimization: NN, Gaussian process (kriging); EGO: sample the next solution with maximal expected improvement.

GP: random variables that have a joint Gaussian distribution.
mean: m(y(x)) = µ
covariance: cov(y(x), y(x′)) = exp(−θ d(x, x′)^p)

From: Rasmussen, Williams, Gaussian Processes for Machine Learning, MIT Press, 2006.
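A minimal sketch of the GP surrogate with the covariance above, assuming a 1-D input with d(x, x′) = |x − x′| and made-up values θ = 1, p = 2 (none of these choices come from the talk). With a zero prior mean, the posterior mean at a new point is k(x_new, X) K⁻¹ y:

```python
import math

THETA, P = 1.0, 2.0  # illustrative hyperparameters

def k(x, xp):
    """The slide's covariance: exp(-theta * d(x, x')^p), here d = |x - x'|."""
    return math.exp(-THETA * abs(x - xp) ** P)

def gp_posterior_mean(X, y, x_new, noise=1e-8):
    """Zero-mean GP prediction k(x_new, X) @ K^-1 @ y; K is inverted by
    Gauss-Jordan elimination, which is fine at this tiny size."""
    n = len(X)
    K = [[k(X[i], X[j]) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    A = [row[:] + [yi] for row, yi in zip(K, y)]  # augmented system K a = y
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    alpha = [A[i][n] / A[i][i] for i in range(n)]
    return sum(k(x_new, xi) * ai for xi, ai in zip(X, alpha))

X, y = [0.0, 1.0, 2.0], [0.0, 1.0, 4.0]  # three samples of f(x) = x^2
print(gp_posterior_mean(X, y, 1.0))      # reproduces the observed value 1.0
```

EGO would add the posterior variance on top of this mean to compute expected improvement and pick the next point to simulate.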

SLIDE 17

Surrogate model for pseudo-Boolean functions

Goal: replace/learn the (expensive) objective function with a (cheap) meta-model during the optimization process.

Continuous optimization: NN, Gaussian process (kriging); EGO: sample the next solution with maximal expected improvement.

Proposition

Walsh function basis: ∀x ∈ {0, 1}ⁿ, ϕk(x) = (−1)^(Σ_{j=0}^{n−1} kj xj)

f(x) = Σ_{k=0}^{2ⁿ−1} wk ϕk(x), with wk = (1/2ⁿ) Σ_{x∈{0,1}ⁿ} f(x) ϕk(x)

Surrogate model: f̂(x) = Σ_{k : o(ϕk) ≤ d} ŵk ϕk(x)

Estimate the coefficients ŵk with LARS.

[Figure: mean absolute error of fitness vs. sample size (100 to 500) for kriging and Walsh surrogates]
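A sketch of the Walsh-basis surrogate for a tiny n, using the exact averaging formula for the coefficients (the talk estimates them sparsely from samples via LARS, which this sketch does not do). The objective f below is an invented pseudo-Boolean function:

```python
from itertools import product

n = 4  # small enough to enumerate {0,1}^n

def phi(k, x):
    """Walsh function: (-1)^(sum_j k_j * x_j)."""
    return -1 if sum(kj * xj for kj, xj in zip(k, x)) % 2 else 1

def order(k):
    """o(phi_k): number of bits that phi_k depends on."""
    return sum(k)

def f(x):
    """Illustrative pseudo-Boolean objective (not from the talk)."""
    return sum(x) + 0.5 * x[0] * x[1]

space = list(product([0, 1], repeat=n))
ks = list(product([0, 1], repeat=n))
# Exact Walsh coefficients: w_k = (1/2^n) * sum_x f(x) * phi_k(x).
w = {k: sum(f(x) * phi(k, x) for x in space) / len(space) for k in ks}

def f_hat(x, d=2):
    """Surrogate truncated to Walsh functions of order at most d."""
    return sum(w[k] * phi(k, x) for k in ks if order(k) <= d)

errs = [abs(f_hat(x) - f(x)) for x in space]
print(max(errs))  # essentially 0: f has no interactions of order > 2
```

Truncating at order d is what makes the model cheap; LARS then lets the coefficients be estimated from far fewer than 2ⁿ evaluations when the spectrum is sparse.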


SLIDE 19

Energy surface of deep learning problems

To learn deep NNs: a high-dimensional space; minimize the error with variants of stochastic gradient descent. Why does it work?

Depth improves expressiveness but complicates optimization.

What is the shape of the energy surface?

  • A. Choromanska, M. Henaff, M. Mathieu, G. Ben Arous, and Y. LeCun. "The loss surfaces of multilayer networks." In Artificial Intelligence and Statistics, pp. 192-204 (2015).
  • P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina. "Entropy-SGD: Biasing gradient descent into wide valleys." ICLR (2017).
  • S. Arora, N. Cohen, and E. Hazan. "On the optimization of deep networks: Implicit acceleration by overparameterization." arXiv preprint arXiv:1802.06509 (2018).

Perspective: study the geometry of the fitness landscape... Any ideas and collaborations are welcome!
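The "minimize error with SGD" step can be sketched in miniature, with a one-parameter model standing in for a deep network; the data, noise level, and hyperparameters are invented for the sketch:

```python
import random

random.seed(0)
# Noisy samples of y = 3x on [0, 1.9]: a stand-in for a training set.
data = [(x / 10.0, 3.0 * (x / 10.0) + random.gauss(0.0, 0.05))
        for x in range(20)]

def sgd(theta=0.0, lr=0.05, epochs=50):
    """Stochastic gradient descent: one randomly ordered example per update."""
    for _ in range(epochs):
        for x, y in random.sample(data, len(data)):
            grad = 2.0 * (theta * x - y) * x  # d/dtheta of (theta*x - y)^2
            theta -= lr * grad
    return theta

theta = sgd()
print(theta)  # close to the true slope 3.0
```

The open question on the slide is why the same noisy updates, run on a loss surface with millions of parameters instead of one, still find good minima.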