SLIDE 1

Introduction Learning for optimization Optimization for learning

Machine learning and black-box expensive optimization

Sébastien Verel

Laboratoire d'Informatique, Signal et Image de la Côte d'Opale (LISIC)

Université du Littoral Côte d'Opale, Calais, France

http://www-lisic.univ-littoral.fr/~verel/

June 18th, 2018

SLIDE 2

AI: Machine Learning, Optimization, perception, etc.


SLIDE 7

AI: Machine Learning, Optimization, perception, etc.

Learning:
Minimize an error function.
Mθ: model to learn from the data.
Search θ⋆ = arg minθ Error(Mθ, data).
Depending on the model dimension, the variables, the error function, etc., there is a huge number of optimization algorithms.

Optimization:
Design an algorithm that finds good solutions.
Aθ: search algorithm for problems (X, f).
Learn Aθ such that Aθ(X, f) = arg minx∈X f(x).
Depending on the class of algorithms, the search spaces, the functions, etc., there is a huge number of learning algorithms.

Artificial: from paper to computer!
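As a toy illustration of the "learning is optimization" formulation above, a one-parameter model Mθ(x) = θ·x can be fit by gradient descent on the error function, i.e. θ⋆ = arg minθ Error(Mθ, data). The data and learning rate here are invented for the sketch, not from the talk:

```python
# Toy data, roughly y = 2x; purely illustrative.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]

def error(theta):
    """Mean squared error of the model M_theta(x) = theta * x on the data."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)

def fit(theta=0.0, lr=0.05, steps=200):
    """Plain gradient descent on the error: one way to search theta*."""
    for _ in range(steps):
        grad = sum(2 * (theta * x - y) * x for x, y in data) / len(data)
        theta -= lr * grad
    return theta

theta_star = fit()
print(round(theta_star, 2))  # prints 1.99, the least-squares slope
```

Any of the "huge number of optimization algorithms" mentioned above could replace the gradient-descent loop; the arg min formulation stays the same.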

SLIDE 8

Black-box (expensive) optimization

x → (black box) → f(x)

No information on the definition of the objective function f.

Objective function: can be irregular, non-continuous, non-differentiable, etc.; given by a computation or an (expensive) simulation.

A few examples from the team:
  • Mobility simulation (Florian Leprêtre)
  • Plant biology, plant growth (Amaury Dubois)
  • Logistics simulation (Brahim Aboutaib)
  • Cellular automata
  • Nuclear power plant (Valentin Drouet)
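The black-box setting can be sketched in a few lines: the optimizer may only query x → f(x), with no access to gradients or to f's definition. The sphere function below is a cheap stand-in for an expensive simulator, and random search is the simplest possible black-box optimizer; both are illustrative choices, not methods from the talk:

```python
import random

def black_box(x):
    """Stand-in for an expensive simulator: only x -> f(x) is observable."""
    return sum(xi ** 2 for xi in x)

def random_search(f, dim, budget=500, low=-5.0, high=5.0, seed=1):
    """Minimal black-box optimizer: uses nothing but evaluations of f."""
    rng = random.Random(seed)
    best_x, best_f = None, float("inf")
    for _ in range(budget):
        x = [rng.uniform(low, high) for _ in range(dim)]
        fx = f(x)  # the only interaction with the objective
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f

x_star, f_star = random_search(black_box, dim=3)
print(f_star)  # small, but rarely exactly 0 with this tiny budget
```

When each call to f is a real simulation, the evaluation budget dominates everything, which is what motivates the surrogate models discussed later.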
SLIDE 9

Real-world black-box expensive optimization

PhD theses of Mathieu Muniglia (2014-2017) and Valentin Drouet (2017-2020), CEA, Paris.

x → (multi-physics simulator) → f(x), e.g. (73, . . . , 8) → ∆zP

Expensive optimization: parallel computing and surrogate models.


SLIDE 12

Adaptive distributed optimization algorithms

Christopher Jankee, Bilel Derbel, Cyril Fonlupt

Methodology: use designed benchmark functions with controlled properties, and experimental analysis.

Portfolio of algorithms: control of the algorithm during optimization.

How to select an algorithm? Design reinforcement learning methods for distributed computing (ε-greedy, adaptive pursuit, UCB, ...).

How to compute a reward? An aggregation function of local rewards (mean, max, etc.) for a global selection.
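A minimal sketch of the portfolio idea, not the authors' exact method: two hypothetical algorithms are treated as bandit arms, each "run" yields local rewards from a few simulated nodes aggregated by their mean, and ε-greedy selects which algorithm to run next. Arm means, node count, and all parameters are invented:

```python
import random

def epsilon_greedy(mean_rewards, epsilon, rng):
    """With probability epsilon explore a random arm, else exploit the best."""
    if rng.random() < epsilon:
        return rng.randrange(len(mean_rewards))
    return max(range(len(mean_rewards)), key=lambda a: mean_rewards[a])

def run_portfolio(arm_means, steps=2000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(arm_means)
    sums = [0.0] * len(arm_means)
    for _ in range(steps):
        # Unvisited arms get +inf so that each one is tried at least once.
        est = [s / c if c else float("inf") for s, c in zip(sums, counts)]
        a = epsilon_greedy(est, epsilon, rng)
        local = [rng.gauss(arm_means[a], 0.1) for _ in range(4)]  # node rewards
        sums[a] += sum(local) / len(local)  # mean aggregation -> global reward
        counts[a] += 1
    return counts

counts = run_portfolio([0.2, 0.5])  # arm 1 has the higher true reward
print(counts)  # the better arm ends up selected far more often
```

Swapping `epsilon_greedy` for adaptive pursuit or UCB, or `mean` for `max` in the aggregation, gives the design space the slide refers to.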

SLIDE 13

Features to learn: multi-objective fitness landscapes

K. Tanaka, H. Aguirre (Univ. Shinshu), A. Liefooghe, B. Derbel (Univ. Lille), 2010-2018

Fitness landscape: (X, f, N), i.e. search space, objective function, and neighborhood relation.

[Figure: objective-space plots (Objective 1 vs. Objective 2) for conflicting, independent, and correlated objectives]

[Figure: heatmap of Kendall's tau correlations between landscape features (hypervolume, dominance counts, ρ, m, n, k/n, etc.)]

Performance prediction for GSEMO (cross-validation):

feature set       MAE       MSE       R2        rank
all               0.007781  0.000118  0.951609  1
enumeration       0.008411  0.000142  0.943046  2
sampling all      0.009113  0.000161  0.932975  3
sampling rws      0.009284  0.000167  0.930728  4
sampling aws      0.010241  0.000195  0.917563  5
{r, m, n, k/n}    0.010609  0.000215  0.911350  6
{r, m, n}         0.026974  0.001123  0.518505  7
{m, n}            0.032150  0.001545  0.340715  8
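The prediction step behind such a table can be sketched with one feature and ordinary least squares, scored by the same MAE, MSE, and R² metrics. The feature and performance values below are invented for illustration (the feature could be, e.g., the objective correlation ρ):

```python
# Invented toy data: one landscape feature vs. algorithm performance.
features = [0.1, 0.3, 0.5, 0.7, 0.9]
perf = [0.12, 0.28, 0.52, 0.69, 0.93]

n = len(features)
mx = sum(features) / n
my = sum(perf) / n
sxx = sum((x - mx) ** 2 for x in features)
sxy = sum((x - mx) * (y - my) for x, y in zip(features, perf))
slope = sxy / sxx            # ordinary least squares with a single feature
intercept = my - slope * mx
pred = [intercept + slope * x for x in features]

# The three scores used in the table above.
mae = sum(abs(p - y) for p, y in zip(pred, perf)) / n
mse = sum((p - y) ** 2 for p, y in zip(pred, perf)) / n
r2 = 1.0 - (mse * n) / sum((y - my) ** 2 for y in perf)

print(mae, mse, r2)  # r2 near 1 on this almost-linear toy data
```

The real study regresses on many features at once (and cross-validates); this sketch only shows how the three columns of the table are computed.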

SLIDE 14

Learning/tuning parameters according to features


SLIDE 16

Surrogate model for pseudo-Boolean functions

Goal: replace/learn the (expensive) objective function with a (cheap) meta-model during the optimization process.

Continuous optimization: NN, Gaussian process (kriging); EGO: sample the next solution with maximal expected improvement.

GP: random variables that have a joint Gaussian distribution.
mean: m(y(x)) = µ
covariance: cov(y(x), y(x′)) = exp(−θ d(x, x′)^p)

From: Rasmussen, Williams, Gaussian Processes for Machine Learning, MIT Press, 2006.
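A minimal sketch of the GP surrogate with the covariance above, assuming a 1-D input with d(x, x′) = |x − x′| and made-up values θ = 1, p = 2 (none of these choices come from the talk). With a zero prior mean, the posterior mean at a new point is k(x_new, X) K⁻¹ y:

```python
import math

THETA, P = 1.0, 2.0  # illustrative hyperparameters

def k(x, xp):
    """The slide's covariance: exp(-theta * d(x, x')^p), here d = |x - x'|."""
    return math.exp(-THETA * abs(x - xp) ** P)

def gp_posterior_mean(X, y, x_new, noise=1e-8):
    """Zero-mean GP prediction k(x_new, X) @ K^-1 @ y; K is inverted by
    Gauss-Jordan elimination, which is fine at this tiny size."""
    n = len(X)
    K = [[k(X[i], X[j]) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    A = [row[:] + [yi] for row, yi in zip(K, y)]  # augmented system K a = y
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    alpha = [A[i][n] / A[i][i] for i in range(n)]
    return sum(k(x_new, xi) * ai for xi, ai in zip(X, alpha))

X, y = [0.0, 1.0, 2.0], [0.0, 1.0, 4.0]  # three samples of f(x) = x^2
print(gp_posterior_mean(X, y, 1.0))      # reproduces the observed value 1.0
```

EGO would add the posterior variance on top of this mean to compute expected improvement and pick the next point to simulate.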

SLIDE 17

Surrogate model for pseudo-Boolean functions

Goal: replace/learn the (expensive) objective function with a (cheap) meta-model during the optimization process.

Continuous optimization: NN, Gaussian process (kriging); EGO: sample the next solution with maximal expected improvement.

Proposition

Walsh function basis: ∀x ∈ {0, 1}ⁿ, ϕk(x) = (−1)^(Σ_{j=0}^{n−1} kj xj)

f(x) = Σ_{k=0}^{2ⁿ−1} wk ϕk(x), with wk = (1/2ⁿ) Σ_{x∈{0,1}ⁿ} f(x) ϕk(x)

Surrogate model: f̂(x) = Σ_{k : o(ϕk) ≤ d} ŵk ϕk(x)

Estimate the coefficients ŵk with LARS.

[Figure: mean absolute error of fitness vs. sample size (100 to 500) for kriging and Walsh surrogates]
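A sketch of the Walsh-basis surrogate for a tiny n, using the exact averaging formula for the coefficients (the talk estimates them sparsely from samples via LARS, which this sketch does not do). The objective f below is an invented pseudo-Boolean function:

```python
from itertools import product

n = 4  # small enough to enumerate {0,1}^n

def phi(k, x):
    """Walsh function: (-1)^(sum_j k_j * x_j)."""
    return -1 if sum(kj * xj for kj, xj in zip(k, x)) % 2 else 1

def order(k):
    """o(phi_k): number of bits that phi_k depends on."""
    return sum(k)

def f(x):
    """Illustrative pseudo-Boolean objective (not from the talk)."""
    return sum(x) + 0.5 * x[0] * x[1]

space = list(product([0, 1], repeat=n))
ks = list(product([0, 1], repeat=n))
# Exact Walsh coefficients: w_k = (1/2^n) * sum_x f(x) * phi_k(x).
w = {k: sum(f(x) * phi(k, x) for x in space) / len(space) for k in ks}

def f_hat(x, d=2):
    """Surrogate truncated to Walsh functions of order at most d."""
    return sum(w[k] * phi(k, x) for k in ks if order(k) <= d)

errs = [abs(f_hat(x) - f(x)) for x in space]
print(max(errs))  # essentially 0: f has no interactions of order > 2
```

Truncating at order d is what makes the model cheap; LARS then lets the coefficients be estimated from far fewer than 2ⁿ evaluations when the spectrum is sparse.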


SLIDE 19

Energy surface of deep learning problems

To learn deep NNs: a high-dimensional space; minimize the error with variants of stochastic gradient descent. Why does it work?

Depth improves expressiveness but complicates optimization.

What is the shape of the energy surface?

  • A. Choromanska, M. Henaff, M. Mathieu, G. Ben Arous, and Y. LeCun. "The loss surfaces of multilayer networks." In Artificial Intelligence and Statistics, pp. 192-204 (2015).
  • P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina. "Entropy-SGD: Biasing gradient descent into wide valleys." ICLR (2017).
  • S. Arora, N. Cohen, and E. Hazan. "On the optimization of deep networks: Implicit acceleration by overparameterization." arXiv preprint arXiv:1802.06509 (2018).

Perspective: study the geometry of the fitness landscape... Any ideas and collaborations are welcome!
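The "minimize error with SGD" step can be sketched in miniature, with a one-parameter model standing in for a deep network; the data, noise level, and hyperparameters are invented for the sketch:

```python
import random

random.seed(0)
# Noisy samples of y = 3x on [0, 1.9]: a stand-in for a training set.
data = [(x / 10.0, 3.0 * (x / 10.0) + random.gauss(0.0, 0.05))
        for x in range(20)]

def sgd(theta=0.0, lr=0.05, epochs=50):
    """Stochastic gradient descent: one randomly ordered example per update."""
    for _ in range(epochs):
        for x, y in random.sample(data, len(data)):
            grad = 2.0 * (theta * x - y) * x  # d/dtheta of (theta*x - y)^2
            theta -= lr * grad
    return theta

theta = sgd()
print(theta)  # close to the true slope 3.0
```

The open question on the slide is why the same noisy updates, run on a loss surface with millions of parameters instead of one, still find good minima.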