Derivative Free Optimization Anne Auger (Inria and CMAP, Ecole - - PowerPoint PPT Presentation

derivative free optimization
SMART_READER_LITE
LIVE PREVIEW

Derivative Free Optimization Anne Auger (Inria and CMAP, Ecole - - PowerPoint PPT Presentation

Derivative Free Optimization Anne Auger (Inria and CMAP, Ecole Polytechnique) Laurent Dumas (U. Versailles) AMS Master - Optimization Paris-Saclay Master RandOpt Team Inria and CMAP (Ecole Polytechnique) anne.auger@inria.fr 2017 2018


slide-1
SLIDE 1

Derivative Free Optimization

Anne Auger (Inria and CMAP, Ecole Polytechnique) Laurent Dumas (U. Versailles) AMS Master - Optimization Paris-Saclay Master

RandOpt Team Inria and CMAP (Ecole Polytechnique)

∗anne.auger@inria.fr

2017 — 2018

slide-2
SLIDE 2

Course Organization

Part I - Stochastic/Randomized Methods

taught by Anne Auger

Friday 1/12, 08/12, 15/12, 22/12, 12/01 from 2pm to 5:15 Small project graded (grade is part of final grade)

Part II - Deterministic Methods

taught by Laurent Dumas

Friday 19/01, 26/01, 02/02, 09/02, 16/02 from 2pm to 5:15

Final Written Exam

Friday 23/02

slide-3
SLIDE 3

First Example of a Black-Box Continuous Optimization Problem

slide-4
SLIDE 4

Problem Statement

Continuous Domain Search/Optimization

◮ Task: minimize an objective function (fitness function, loss

function) in continuous domain f : X ⊆ Rn → R, x → f (x)

◮ Black Box scenario (direct search scenario)

f(x) x

◮ gradients are not available or not useful ◮ problem domain specific knowledge is used only within the

black box, e.g. within an appropriate encoding

◮ Search costs: number of function evaluations (often called

runtime of algorithms)

: this is not the "real" runtime (i.e. time you have to wait) but

this is typically proportional to the real runtime. This measurement is independent of the programming langage/ implementation "tricks" chosen for the implementation.

slide-5
SLIDE 5

Optimization of the Design of a Launcher

Example of a black-box problem

slide-6
SLIDE 6

Control of the Alignement of Molecules

Example of a black-box problem (II)

slide-7
SLIDE 7

Coffee Tasting Problem

Example of a black-box problem (III)

Coffee Tasting Problem

◮ Find a mixture of coffee in order to keep the coffee taste from

  • ne year to another

◮ Objective function = opinion of one expert

slide-8
SLIDE 8

What Makes a Function Difficult to Solve?

Why stochastic search?

◮ non-linear, non-quadratic, non-convex

  • n linear and quadratic functions

much better search policies are available

◮ ruggedness

non-smooth, discontinuous, multimodal, and/or noisy function

◮ dimensionality (size of search space)

(considerably) larger than three

◮ non-separability

dependencies between the

  • bjective variables

◮ ill-conditioning

1.0 0.5 0.0 0.5 1.0 0.0 0.2 0.4 0.6 0.8 1.0 −4 −3 −2 −1 1 2 3 4 10 20 30 40 50 60 70 80 90 100

−3 −2 −1 1 2 3 −3 −2 −1 1 2 3

gradient direction Newton direction

slide-9
SLIDE 9

Curse of Dimensionality

The term Curse of dimensionality (Richard Bellman) refers to problems caused by the rapid increase in volume associated with adding extra dimensions to a (mathematical) space. Example: Consider placing 100 points onto a real interval, say [0, 1]. How many points would you need to get a similar coverage (in terms of distance between adjacent points) in dimension 10?

slide-10
SLIDE 10

Curse of Dimensionality

The term Curse of dimensionality (Richard Bellman) refers to problems caused by the rapid increase in volume associated with adding extra dimensions to a (mathematical) space. Example: Consider placing 100 points onto a real interval, say [0, 1]. To get similar coverage, in terms of distance between adjacent points, of the 10-dimensional space [0, 1]10 would require 10010 = 1020 points. A 100 points appear now as isolated points in a vast empty space. Consequence: a search policy (e.g. exhaustive search) that is valuable in small dimensions might be useless in moderate or large dimensional search spaces.

slide-11
SLIDE 11

Curse of Dimensionality

How long would it take to evaluate 1020 points ? import timeit timeit.timeit(’import numpy as np ; np.sum(np.ones(10)*np.ones(10))’, number=1000000) > 7.0521080493927 7 seconds for 106 evaluations of f (x) = 10

i=1 x2 i

We would need more than 108 days for evaluating 1020 points [As a reference: origin of human species: roughly 6 108 days]

slide-12
SLIDE 12

Separable Problems

Definition (Separable Problem)

A function f is separable if arg min

(x1,...,xn) f (x1, . . . , xn) =

  • arg min

x1 f (x1, . . .), . . . , arg min xn f (. . . , xn)

  • ⇒ it follows that f can be optimized in a

sequence of n independent 1-D optimization processes

Example: Additively decomposable functions

f (x1, . . . , xn) =

n

  • i=1

fi(xi)

Rastrigin function f (x) = 10n+n

i=1(x2 i −10 cos(2πxi))

−3 −2 −1 1 2 3 −3 −2 −1 1 2 3

slide-13
SLIDE 13

Non-Separable Problems

Building a non-separable problem from a separable one (1,2)

Rotating the coordinate system

◮ f : x → f (x) separable ◮ f : x → f (Rx) non-separable

R rotation matrix

−3 −2 −1 1 2 3 −3 −2 −1 1 2 3

R − →

−3 −2 −1 1 2 3 −3 −2 −1 1 2 3

1Hansen, Ostermeier, Gawelczyk (1995). On the adaptation of arbitrary normal mutation distributions in evolution strategies: The generating set adaptation. Sixth ICGA, pp. 57-64, Morgan Kaufmann 2Salomon (1996). "Reevaluating Genetic Algorithm Performance under Coordinate Rotation of Benchmark Functions; A survey of some theoretical and practical aspects of genetic algorithms." BioSystems, 39(3):263-278

slide-14
SLIDE 14

Ill-Conditioned Problems

◮ If f is convex quadratic, f : x → 1

2xTHx = 1 2

  • i hi,i x2

i + 1 2

  • i=j hi,j xixj,

with H positive, definite, symmetric matrix H is the Hessian matrix of f

◮ ill-conditioned means a high condition number of Hessian Matrix H

cond(H) = λmax(H) λmin(H)

Example / exercice

The level-sets of a function are defined as Lc = {x ∈ Rn|f (x) = c}, c ∈ R . Consider the objective function f (x) = 1

2(x2 1 + 9x2 2)

  • 1. Plot the level sets of f .
  • 2. Compute the condition number of the Hessian matrix of f , relate it to

the axis ratio of the level sets of f .

  • 3. Generalize 1. and 2. to a general convex-quadratic function.
slide-15
SLIDE 15

Ill-conditionned Problems

consider the curvature of the level sets of a function ill-conditioned means “squeezed” lines of equal function value (high curvatures) gradient direction −f ′(x)T Newton direction −H−1f ′(x)T Condition number equals nine here. Condition numbers up to 1010 are not unusual in real world problems.

slide-16
SLIDE 16

Landscape of Derivative Free Optimization Algorithms

Deterministic Algorithms

Quasi-Newton with estimation of gradient (BFGS)

[Broyden et al. 1970]

Simplex downhill [Nelder and Mead 1965] Pattern search [Hooke and Jeeves 1961] Trust-region methods (NEWUOA, BOBYQA) [Powell 2006, 2009]

Stochastic (randomized) search methods

Evolutionary Algorithms (continuous domain)

◮ Differential Evolution

[Storn and Price 1997]

◮ Particle Swarm Optimization [Kennedy and Eberhart 1995] ◮ Evolution Strategies, CMA-ES [Rechenberg 1965, Hansen and Ostermeier 2001] ◮ Estimation of Distribution Algorithms (EDAs) [Larrañaga, Lozano, 2002] ◮ Cross Entropy Method (same as EDA) [Rubinstein, Kroese, 2004] ◮ Genetic Algorithms [Holland 1975, Goldberg 1989]

Simulated annealing [Kirkpatrick et al. 1983] Simultaneous perturbation stochastic approximation (SPSA) [Spall 2000]

slide-17
SLIDE 17

Stochastic Search

A black box search template to minimize f : Rn → R

Initialize distribution parameters θ, set population size λ ∈ N While not terminate

  • 1. Sample distribution P (x|θ) → x1, . . . , xλ ∈ Rn
  • 2. Evaluate x1, . . . , xλ on f
  • 3. Update parameters θ ← Fθ(θ, x1, . . . , xλ, f (x1), . . . , f (xλ))
slide-18
SLIDE 18

Stochastic Search

A black box search template to minimize f : Rn → R

Initialize distribution parameters θ, set population size λ ∈ N While not terminate

  • 1. Sample distribution P (x|θ) → x1, . . . , xλ ∈ Rn
  • 2. Evaluate x1, . . . , xλ on f
  • 3. Update parameters θ ← Fθ(θ, x1, . . . , xλ, f (x1), . . . , f (xλ))
slide-19
SLIDE 19

Stochastic Search

A black box search template to minimize f : Rn → R

Initialize distribution parameters θ, set population size λ ∈ N While not terminate

  • 1. Sample distribution P (x|θ) → x1, . . . , xλ ∈ Rn
  • 2. Evaluate x1, . . . , xλ on f
  • 3. Update parameters θ ← Fθ(θ, x1, . . . , xλ, f (x1), . . . , f (xλ))
slide-20
SLIDE 20

Stochastic Search

A black box search template to minimize f : Rn → R

Initialize distribution parameters θ, set population size λ ∈ N While not terminate

  • 1. Sample distribution P (x|θ) → x1, . . . , xλ ∈ Rn
  • 2. Evaluate x1, . . . , xλ on f
  • 3. Update parameters θ ← Fθ(θ, x1, . . . , xλ, f (x1), . . . , f (xλ))
slide-21
SLIDE 21

Stochastic Search

A black box search template to minimize f : Rn → R

Initialize distribution parameters θ, set population size λ ∈ N While not terminate

  • 1. Sample distribution P (x|θ) → x1, . . . , xλ ∈ Rn
  • 2. Evaluate x1, . . . , xλ on f
  • 3. Update parameters θ ← Fθ(θ, x1, . . . , xλ, f (x1), . . . , f (xλ))
slide-22
SLIDE 22

Stochastic Search

A black box search template to minimize f : Rn → R

Initialize distribution parameters θ, set population size λ ∈ N While not terminate

  • 1. Sample distribution P (x|θ) → x1, . . . , xλ ∈ Rn
  • 2. Evaluate x1, . . . , xλ on f
  • 3. Update parameters θ ← Fθ(θ, x1, . . . , xλ, f (x1), . . . , f (xλ))
slide-23
SLIDE 23

Stochastic Search

A black box search template to minimize f : Rn → R

Initialize distribution parameters θ, set population size λ ∈ N While not terminate

  • 1. Sample distribution P (x|θ) → x1, . . . , xλ ∈ Rn
  • 2. Evaluate x1, . . . , xλ on f
  • 3. Update parameters θ ← Fθ(θ, x1, . . . , xλ, f (x1), . . . , f (xλ))

Everything depends on the definition of P and Fθ

slide-24
SLIDE 24

Stochastic Search

A black box search template to minimize f : Rn → R

Initialize distribution parameters θ, set population size λ ∈ N While not terminate

  • 1. Sample distribution P (x|θ) → x1, . . . , xλ ∈ Rn
  • 2. Evaluate x1, . . . , xλ on f
  • 3. Update parameters θ ← Fθ(θ, x1, . . . , xλ, f (x1), . . . , f (xλ))

Everything depends on the definition of P and Fθ In Evolutionary Algorithms the distribution P is often implicitly defined via operators on a population, in particular, selection, recombination and mutation Natural template for Estimation of Distribution Algorithms

slide-25
SLIDE 25

A Simple Example: The Pure Random Search

Also an Ineffective Example

The Pure Random Search

◮ Sample uniformly at random a solution ◮ Return the best solution ever found

Exercice

See the exercice on the document "Exercices - class 1".

Non-adaptive Algorithm

For the pure random search P (x|θ) is independent of θ (i.e. no θ to be adapted): the algorithm is "blind"

In this class: present algorithms that are "much better" than that

slide-26
SLIDE 26

Evolution Strategies

New search points are sampled normally distributed

xi = m+σ y i for i = 1, . . . , λ with y i i.i.d. ∼ N (0, C)

as perturbations of m, where xi, m ∈ Rn, σ ∈ R+, C ∈ Rn×n

slide-27
SLIDE 27

Evolution Strategies

New search points are sampled normally distributed

xi = m+σ y i for i = 1, . . . , λ with y i i.i.d. ∼ N (0, C)

as perturbations of m, where xi, m ∈ Rn, σ ∈ R+, C ∈ Rn×n

where

◮ the mean vector m ∈ Rn represents the favorite solution ◮ the so-called step-size σ ∈ R+ controls the step length ◮ the covariance matrix C ∈ Rn×n determines the shape

  • f the distribution ellipsoid

here, all new points are sampled with the same parameters

slide-28
SLIDE 28

Evolution Strategies

New search points are sampled normally distributed

xi = m+σ y i for i = 1, . . . , λ with y i i.i.d. ∼ N (0, C)

as perturbations of m, where xi, m ∈ Rn, σ ∈ R+, C ∈ Rn×n

where

◮ the mean vector m ∈ Rn represents the favorite solution ◮ the so-called step-size σ ∈ R+ controls the step length ◮ the covariance matrix C ∈ Rn×n determines the shape

  • f the distribution ellipsoid

here, all new points are sampled with the same parameters The question remains how to update m, C, and σ.

slide-29
SLIDE 29

Normal Distribution

1-D case

−4 −2 2 4 0.1 0.2 0.3 0.4 Standard Normal Distribution probability density

probability density of the 1-D standard normal distribution N (0, 1) (expected (mean) value, variance) = (0,1) p(x) = 1 √ 2π exp

  • −x2

2

  • General case

◮ Normal distribution N

  • m, σ2

(expected value, variance) = (m, σ2) density: pm,σ(x) =

1 √ 2πσ exp

  • − (x−m)2

2σ2

  • ◮ A normal distribution is entirely determined by its mean value and

variance

◮ The family of normal distributions is closed under linear transformations:

if X is normally distributed then a linear transformation aX + b is also normally distributed

◮ Exercice: Show that m + σN (0, 1) = N

  • m, σ2
slide-30
SLIDE 30

Normal Distribution

General case A random variable following a 1-D normal distribution is determined by its mean value m and variance σ2. In the n-dimensional case it is determined by its mean vector and covariance matrix

Covariance Matrix

If the entries in a vector X = (X1, . . . , Xn)T are random variables, each with finite variance, then the covariance matrix Σ is the matrix whose (i, j) entries are the covariance of (Xi, Xj) Σij = cov(Xi, Xj) = E

  • (Xi − µi)(Xj − µj)
  • where µi = E(Xi). Considering the expectation of a matrix as the expectation
  • f each entry, we have

Σ = E[(X − µ)(X − µ)T] Σ is symmetric, positive definite

slide-31
SLIDE 31

The Multi-Variate (n-Dimensional) Normal Distribution

Any multi-variate normal distribution N (m, C) is uniquely determined by its mean value m ∈ Rn and its symmetric positive definite n × n covariance matrix C. density: pN(m,C)(x) =

1 (2π)n/2|C|1/2 exp

  • − 1

2(x − m)TC−1(x − m)

  • ,

The mean value m

◮ determines the displacement (translation) ◮ value with the largest density (modal value) ◮ the distribution is symmetric about the

distribution mean N (m, C) = m + N (0, C)

−5 5 −5 5 0.1 0.2 0.3 0.4 2−D Normal Distribution

slide-32
SLIDE 32

The Multi-Variate (n-Dimensional) Normal Distribution

Any multi-variate normal distribution N (m, C) is uniquely determined by its mean value m ∈ Rn and its symmetric positive definite n × n covariance matrix C. density: pN(m,C)(x) =

1 (2π)n/2|C|1/2 exp

  • − 1

2(x − m)TC−1(x − m)

  • ,

The mean value m

◮ determines the displacement (translation) ◮ value with the largest density (modal value) ◮ the distribution is symmetric about the

distribution mean N (m, C) = m + N (0, C)

−5 5 −5 5 0.1 0.2 0.3 0.4 2−D Normal Distribution

The covariance matrix C

◮ determines the shape ◮ geometrical interpretation: any covariance matrix can be uniquely

identified with the iso-density ellipsoid {x ∈ Rn | (x − m)TC−1(x − m) = 1}

slide-33
SLIDE 33

. . . any covariance matrix can be uniquely identified with the iso-density ellipsoid {x ∈ Rn | (x − m)TC−1(x − m) = 1} Lines of Equal Density

N

  • m, σ2I
  • ∼ m + σN (0, I)
  • ne degree of freedom σ

components are independent standard normally distributed where I is the identity matrix (isotropic case) and D is a diagonal matrix (reasonable for separable problems) and A × N (0, I) ∼ N

  • 0, AAT

holds for all A.

slide-34
SLIDE 34

. . . any covariance matrix can be uniquely identified with the iso-density ellipsoid {x ∈ Rn | (x − m)TC−1(x − m) = 1} Lines of Equal Density

N

  • m, σ2I
  • ∼ m + σN (0, I)
  • ne degree of freedom σ

components are independent standard normally distributed

N

  • m, D2

∼ m + D N (0, I)

n degrees of freedom components are independent, scaled where I is the identity matrix (isotropic case) and D is a diagonal matrix (reasonable for separable problems) and A × N (0, I) ∼ N

  • 0, AAT

holds for all A.

slide-35
SLIDE 35

. . . any covariance matrix can be uniquely identified with the iso-density ellipsoid {x ∈ Rn | (x − m)TC−1(x − m) = 1} Lines of Equal Density

N

  • m, σ2I
  • ∼ m + σN (0, I)
  • ne degree of freedom σ

components are independent standard normally distributed

N

  • m, D2

∼ m + D N (0, I)

n degrees of freedom components are independent, scaled

N (m, C) ∼ m + C

1 2 N (0, I)

(n2 + n)/2 degrees of freedom components are correlated where I is the identity matrix (isotropic case) and D is a diagonal matrix (reasonable for separable problems) and A × N (0, I) ∼ N

  • 0, AAT

holds for all A.

slide-36
SLIDE 36

Where are we?

Problem Statement Black Box Optimization and Its Difficulties Non-Separable Problems Ill-Conditioned Problems Stochastic search algorithms - basics A Search Template A Natural Search Distribution: the Normal Distribution Adaptation of Distribution Parameters: What to Achieve? Adaptive Evolution Strategies Mean Vector Adaptation Step-size control

Theory Algorithms

Covariance Matrix Adaptation

Rank-One Update Cumulation—the Evolution Path Rank-µ Update

slide-37
SLIDE 37

Adaptation: What do we want to achieve?

New search points are sampled normally distributed

xi = m + σ y i for i = 1, . . . , λ with y i i.i.d. ∼ N (0, C)

where xi, m ∈ Rn, σ ∈ R+, C ∈ Rn×n

◮ the mean vector should represent the favorite solution ◮ the step-size controls the step-length and thus convergence

rate

should allow to reach fastest convergence rate possible

◮ the covariance matrix C ∈ Rn×n determines the shape of the

distribution ellipsoid

adaptation should allow to learn the “topography” of the problem particulary important for ill-conditionned problems C ∝ H−1 on convex quadratic functions

slide-38
SLIDE 38

Problem Statement Black Box Optimization and Its Difficulties Non-Separable Problems Ill-Conditioned Problems Stochastic search algorithms - basics A Search Template A Natural Search Distribution: the Normal Distribution Adaptation of Distribution Parameters: What to Achieve? Adaptive Evolution Strategies Mean Vector Adaptation Step-size control

Theory Algorithms

Covariance Matrix Adaptation

Rank-One Update Cumulation—the Evolution Path Rank-µ Update

slide-39
SLIDE 39

Evolution Strategies (ES)

Simple Update for Mean Vector

Let µ: # parents, λ: # offspring

Plus (elitist) and comma (non-elitist) selection

(µ + λ)-ES: selection in {parents} ∪ {offspring} (µ, λ)-ES: selection in {offspring}

ES algorithms emerged in the community of bio-inspired methods where a parallel between

  • ptimization and evolution of species as described by Darwin served in the origin as inspiration

for the methods. Nowadays this parallel is mainly visible through the terminology used: candidate solutions are parents or offspring, the objective function is a fitness function, ...

(1 + 1)-ES

Sample one offspring from parent m x = m + σ N (0, C) If x better than m select m ← x

slide-40
SLIDE 40

The (µ/µ, λ)-ES - Update of the mean vector

Non-elitist selection and intermediate (weighted) recombination

Given the i-th solution point xi = m + σ y i

  • ∼N(0,C)

Let xi:λ the i-th ranked solution point, such that f (x1:λ) ≤ · · · ≤ f (xλ:λ).

Notation: we denote y i:λ the vector such that xi:λ = m + σy i:λ Exercice: realize that y i:λ is generally not distributed as N (0, C)

The best µ points are selected from the new solutions (non-elitistic) and weighted intermediate recombination is applied.

slide-41
SLIDE 41

The (µ/µ, λ)-ES - Update of the mean vector

Non-elitist selection and intermediate (weighted) recombination

Given the i-th solution point xi = m + σ y i

  • ∼N(0,C)

Let xi:λ the i-th ranked solution point, such that f (x1:λ) ≤ · · · ≤ f (xλ:λ).

Notation: we denote y i:λ the vector such that xi:λ = m + σy i:λ Exercice: realize that y i:λ is generally not distributed as N (0, C)

The new mean reads m ←

µ

  • i=1

wi xi:λ where w1 ≥ · · · ≥ wµ > 0, µ

i=1 wi = 1, 1 µ

i=1 wi 2 =: µw ≈ λ

4

The best µ points are selected from the new solutions (non-elitistic) and weighted intermediate recombination is applied.

slide-42
SLIDE 42

The (µ/µ, λ)-ES - Update of the mean vector

Non-elitist selection and intermediate (weighted) recombination

Given the i-th solution point xi = m + σ y i

  • ∼N(0,C)

Let xi:λ the i-th ranked solution point, such that f (x1:λ) ≤ · · · ≤ f (xλ:λ).

Notation: we denote y i:λ the vector such that xi:λ = m + σy i:λ Exercice: realize that y i:λ is generally not distributed as N (0, C)

The new mean reads m ←

µ

  • i=1

wi xi:λ = m + σ

µ

  • i=1

wi y i:λ

  • =: y w

where w1 ≥ · · · ≥ wµ > 0, µ

i=1 wi = 1, 1 µ

i=1 wi 2 =: µw ≈ λ

4

The best µ points are selected from the new solutions (non-elitistic) and weighted intermediate recombination is applied.

slide-43
SLIDE 43

Invariance Under Monotonically Increasing Functions

Rank-based algorithms

Update of all parameters uses only the ranks f (x1:λ) ≤ f (x2:λ) ≤ ... ≤ f (xλ:λ) g(f (x1:λ)) ≤ g(f (x2:λ)) ≤ ... ≤ g(f (xλ:λ)) ∀g g is strictly monotonically increasing g preserves ranks

slide-44
SLIDE 44

Problem Statement Black Box Optimization and Its Difficulties Non-Separable Problems Ill-Conditioned Problems Stochastic search algorithms - basics A Search Template A Natural Search Distribution: the Normal Distribution Adaptation of Distribution Parameters: What to Achieve? Adaptive Evolution Strategies Mean Vector Adaptation Step-size control

Theory Algorithms

Covariance Matrix Adaptation

Rank-One Update Cumulation—the Evolution Path Rank-µ Update

slide-45
SLIDE 45

Why Step-Size Control?

0.5 1 1.5 2 x 10

4

10

−9

10

−6

10

−3

10 function evaluations function value

step−size too small |

| step−size too large

constant step−size random search

  • ptimal step−size

(scale invariant)

(1+1)-ES (red & green) f (x) =

n

  • i=1

x2

i

in [−2.2, 0.8]n for n = 10

slide-46
SLIDE 46

Why Step-Size Control?

(5/5w, 10)-ES, 11 runs

m − x∗ =

  • f (x)

200 400 600 800 1000 1200 function evaluations 10-5 10-4 10-3 10-2 10-1 100

with optimal step-size with step-size control

f (x) =

n

  • i=1

x2

i

for n = 10 and x0 ∈ [−0.2, 0.8]n with optimal step-size σ

slide-47
SLIDE 47

Why Step-Size Control?

(5/5w, 10)-ES, 2×11 runs

m − x∗ =

  • f (x)

200 400 600 800 1000 1200 function evaluations 10-5 10-4 10-3 10-2 10-1 100

with optimal step-size with step-size control

f (x) =

n

  • i=1

x2

i

for n = 10 and x0 ∈ [−0.2, 0.8]n with optimal versus adaptive step-size σ with too small initial σ

slide-48
SLIDE 48

Why Step-Size Control?

(5/5w, 10)-ES

m − x∗ =

  • f (x)

200 400 600 800 1000 1200 function evaluations 10-5 10-4 10-3 10-2 10-1 100

with optimal step-size with step-size control respective step-size

f (x) =

n

  • i=1

x2

i

for n = 10 and x0 ∈ [−0.2, 0.8]n comparing number of f -evals to reach m = 10−5: 1100−100

650

≈ 1.5

slide-49
SLIDE 49

Why Step-Size Control?

(5/5w, 10)-ES

m − x∗ =

  • f (x)

200 400 600 800 1000 1200 1400 1600 function evaluations 10-5 10-4 10-3 10-2 10-1 100

with optimal step-size with step-size control respective step-size

f (x) =

n

  • i=1

x2

i

for n = 10 and x0 ∈ [−0.2, 0.8]n comparing optimal versus default damping parameter dσ:

1700 1100 ≈ 1.5

slide-50
SLIDE 50

Why Step-Size Control?

500 1000 1500 10

−9

10

−6

10

−3

10 function evaluations function value

adaptive step−size σ

  • ptimal step−size

(scale invariant) random search constant σ adaptive step−size σ

  • ptimal step−size

(scale invariant) random search constant σ adaptive step−size σ

  • ptimal step−size

(scale invariant) random search constant σ

10

−3

10

−2

10

−1

10 0.05 0.1 0.15 0.2 normalized progress normalized step size

ϕ∗ −ϕ∗ n σ∗

  • pt

σ ← σ∗

  • ptparent

evolution window refers to the step-size interval ( ) where reasonable performance is observed

slide-51
SLIDE 51

Step-size control

Theory

◮ On well conditioned problem (sphere function f (x) = x2) step-size

adaptation should allow to reach (close to) optimal convergence rates need to be able to solve optimally simple scenario (linear function, sphere function) that quite often (always?) need to be solved when addressing a real-world problem

◮ Is it possible to quantify optimal convergence rate for step-size adaptive

ESs?

slide-52
SLIDE 52

Lower bound for convergence

Exemplified on (1+1)-ES

Consider a (1+1)-ES with any step-size adaptation mechanism

(1+1)-ES with adaptive step-size

Iteration k: ˜ X k+1

  • ffspring

= X k

  • parent

+ σk

  • step−size

Nk+1 with (Nk)k i.i.d. ∼ N (0, I) X k+1 = ˜ X k+1 if f ( ˜ X k+1) ≤ f (X k) X k

  • therwise
slide-53
SLIDE 53

Lower bound for convergence (II)

Exemplify on (1+1)-ES

Theorem

For any objective function f : Rn → R, for any y∗ ∈ Rn E [ln X k+1 − y∗] ≥ E [ln X k − y∗] − τ

  • lower bound

where τ = maxσ∈R+ E[ln− e1

  • (1,0,...,0)

+σN]

  • =:ϕ(σ)
slide-54
SLIDE 54

"Tight" lower bound

Theorem

Lower bound reached on the sphere function f (x) = g(x − y∗), (with g : R → R, increasing mapping) for step-size proportional to the distance to the optimum where σk = σx − y∗ with σ := σopt such that ϕ(σopt) = τ.

slide-55
SLIDE 55

(Log)-Linear convergence of scale-invariant step-size ES

Theorem

The (1+1)-ES with step-size proportional to the distance to the

  • ptimum σk = σx converges (log)-linearly on the sphere function

f (x) = g(x), (with g : R → R, increasing mapping) in the sense 1 k ln X k X 0 − − − →

k→∞ −ϕ(σ) =: CR(1+1)(σ)

almost surely.

1000 2000 3000 4000 5000 10

−20

10

−10

10 function evaluations distance to optimum

2 4 6 8 10 −0.5 −0.45 −0.4 −0.35 −0.3 −0.25 −0.2 −0.15 −0.1 −0.05 sigma*dimension c(sigma)*dimension dim=2 min for dim=2 dim=3 min for dim=3 dim=5 min for dim=5 dim=10 min for dim=10 dim=20 min for dim=20 dim=160 min for dim=160

n = 20 and σ = 0.6/n

slide-56
SLIDE 56

Asymptotic results

When n → ∞

Theorem

Let σ > 0, the convergence rate of the (1+1)-ES with scale-invariant step-size on spherical functions satisfies at the limit lim

n→∞ n × CR(1+1)

σ n

  • = −σ

√ 2π exp

  • − σ2

8

  • + σ2

2 Φ

  • −σ

2

  • where Φ is the cumulative distribution of a normal distribution.
  • ptimal convergence rate decreases to zero like 1

n

2 4 6 8 10 −0.5 −0.45 −0.4 −0.35 −0.3 −0.25 −0.2 −0.15 −0.1 −0.05 sigma*dimension c(sigma)*dimension

slide-57
SLIDE 57

Summary of theory results

500 1000 1500 10

−9

10

−6

10

−3

10 function evaluations function value

adaptive step−size σ

  • ptimal step−size

(scale invariant) random search constant σ adaptive step−size σ

  • ptimal step−size

(scale invariant) random search constant σ adaptive step−size σ

  • ptimal step−size

(scale invariant) random search constant σ

10

−3

10

−2

10

−1

10 0.05 0.1 0.15 0.2 normalized progress normalized step size

ϕ∗ −ϕ∗ n σ∗

  • pt

σ ← σ∗

  • ptparent

evolution window refers to the step-size interval ( ) where reasonable performance is observed

slide-58
SLIDE 58

Problem Statement Black Box Optimization and Its Difficulties Non-Separable Problems Ill-Conditioned Problems Stochastic search algorithms - basics A Search Template A Natural Search Distribution: the Normal Distribution Adaptation of Distribution Parameters: What to Achieve? Adaptive Evolution Strategies Mean Vector Adaptation Step-size control

Theory Algorithms

Covariance Matrix Adaptation

Rank-One Update Cumulation—the Evolution Path Rank-µ Update

slide-59
SLIDE 59

Methods for Step-Size Control

◮ 1/5-th success ruleab, often applied with “+”-selection

increase step-size if more than 20% of the new solutions are successful, decrease otherwise

◮ σ-self-adaptationc, applied with “,”-selection

mutation is applied to the step-size and the better one, according to the objective function value, is selected simplified “global” self-adaptation

◮ path length controld (Cumulative Step-size Adaptation, CSA)e, applied

with “,”-selection

aRechenberg 1973, Evolutionsstrategie, Optimierung technischer Systeme nach Prinzipien der biologischen Evolution, Frommann-Holzboog bSchumer and Steiglitz 1968. Adaptive step size random search. IEEE TAC cSchwefel 1981, Numerical Optimization of Computer Models, Wiley dHansen & Ostermeier 2001, Completely Derandomized Self-Adaptation in Evolution Strategies,

  • Evol. Comput. 9(2)

eOstermeier et al 1994, Step-size adaptation based on non-local use of selection information, PPSN IV

slide-60
SLIDE 60

One-fifth success rule

increase σ ↓ decrease σ

slide-61
SLIDE 61

One-fifth success rule

  • Probability of success (ps)

1/2 1/5 Probability of success (ps) “too small”

slide-62
SLIDE 62

One-fifth success rule

ps: # of successful offspring / # offspring (per iteration) σ ← σ × exp 1 3 × ps − ptarget 1 − ptarget

  • Increase σ if ps > ptarget

Decrease σ if ps < ptarget

(1 + 1)-ES

ptarget = 1/5 IF offspring better parent ps = 1, σ ← σ × exp(1/3) ELSE ps = 0, σ ← σ/ exp(1/3)1/4

slide-63
SLIDE 63

Why 1/5?

Asymptotic convergence rate and probability of success of scale-invariant step-size (1+1)-ES

2 4 6 8 10 −0.3 −0.2 −0.1 0.1 0.2 0.3 0.4 0.5 sigma*dimension c(sigma)*dimension CR(1+1) min (CR(1+1)) proba of success

sphere - asymptotic results, i.e. n = ∞ (see slides before)

1/5 trade-off of optimal probability of success on the sphere and corridor

slide-64
SLIDE 64

Path Length Control (CSA)

The Concept of Cumulative Step-Size Adaptation xi = m + σ y i m ← m + σy w

Measure the length of the evolution path

the pathway of the mean vector m in the iteration sequence ⇓ decrease σ ⇓ increase σ

slide-65
SLIDE 65

Path Length Control (CSA)

The Equations

Sampling of solutions, notations as on slide “The (µ/µ, λ)-ES - Update of the mean vector” with C equal to the identity.

Initialize m ∈ Rn, σ ∈ R+, evolution path pσ = 0, set cσ ≈ 4/n, dσ ≈ 1.

slide-66
SLIDE 66

Path Length Control (CSA)

The Equations

Sampling of solutions, notations as on slide “The (µ/µ, λ)-ES - Update of the mean vector” with C equal to the identity.

Initialize m ∈ Rn, σ ∈ R+, evolution path pσ = 0, set cσ ≈ 4/n, dσ ≈ 1. m ← m + σy w where y w = µ

i=1 wi y i:λ

update mean pσ ← (1 − cσ) pσ +

  • 1 − (1 − cσ)2
  • accounts for 1−cσ

õw

  • accounts for wi

y w σ ← σ × exp cσ dσ

EN (0, I) − 1

  • >1 ⇐

⇒ pσ is greater than its expectation

update step-size

slide-67
SLIDE 67

Step-size adaptation

What is achieved

(1 + 1)-ES with one-fifth success rule (blue) 500 1000 1500 10

−9

10

−6

10

−3

10 function evaluations function value

adaptive step−size σ

  • ptimal step−size

(scale invariant) random search constant σ adaptive step−size σ

  • ptimal step−size

(scale invariant) random search constant σ adaptive step−size σ

  • ptimal step−size

(scale invariant) random search constant σ step−size σ

f (x) =

n

  • i=1

x2

i

in [−0.2, 0.8]n for n = 10

Linear convergence

slide-68
SLIDE 68

Step-size adaptation

What is achieved

(5/5, 10)-CSA-ES, default parameters m − x∗

500 1000 1500 2000 2500 3000 3500 4000 function evaluations 10-5 10-4 10-3 10-2 10-1 100

with optimal step-size with step-size control respective step-size

f (x) =

n

  • i=1

x2

i

in [−0.2, 0.8]n for n = 30

slide-69
SLIDE 69

Problem Statement Black Box Optimization and Its Difficulties Non-Separable Problems Ill-Conditioned Problems Stochastic search algorithms - basics A Search Template A Natural Search Distribution: the Normal Distribution Adaptation of Distribution Parameters: What to Achieve? Adaptive Evolution Strategies Mean Vector Adaptation Step-size control

Theory Algorithms

Covariance Matrix Adaptation

Rank-One Update Cumulation—the Evolution Path Rank-µ Update

slide-70
SLIDE 70

Evolution Strategies

Recalling

New search points are sampled normally distributed

xi ∼ m + σ Ni (0, C) for i = 1, . . . , λ

as perturbations of m, where xi, m ∈ Rn, σ ∈ R+, C ∈ Rn×n

where

◮ the mean vector m ∈ Rn represents the favorite solution ◮ the so-called step-size σ ∈ R+ controls the step length ◮ the covariance matrix C ∈ Rn×n determines the shape

  • f the distribution ellipsoid

The remaining question is how to update C.

slide-71
SLIDE 71

Covariance Matrix Adaptation

Rank-One Update

m ← m + σy w, y w = µ

i=1 wi y i:λ,

y i ∼ Ni (0, C) initial distribution, C = I

slide-72
SLIDE 72

Covariance Matrix Adaptation

Rank-One Update

m ← m + σy w, y w = µ

i=1 wi y i:λ,

y i ∼ Ni (0, C) initial distribution, C = I

slide-73
SLIDE 73

Covariance Matrix Adaptation

Rank-One Update

m ← m + σy w, y w = µ

i=1 wi y i:λ,

y i ∼ Ni (0, C) y w, movement of the population mean m (disregarding σ)

slide-74
SLIDE 74

Covariance Matrix Adaptation

Rank-One Update

m ← m + σy w, y w = µ

i=1 wi y i:λ,

y i ∼ Ni (0, C) mixture of distribution C and step y w, C ← 0.8 × C + 0.2 × y w y T

w

slide-75
SLIDE 75

Covariance Matrix Adaptation

Rank-One Update

m ← m + σy w, y w = µ

i=1 wi y i:λ,

y i ∼ Ni (0, C) new distribution (disregarding σ)

slide-76
SLIDE 76

Covariance Matrix Adaptation

Rank-One Update

m ← m + σy w, y w = µ

i=1 wi y i:λ,

y i ∼ Ni (0, C) new distribution (disregarding σ)

slide-77
SLIDE 77

Covariance Matrix Adaptation

Rank-One Update

m ← m + σy w, y w = µ

i=1 wi y i:λ,

y i ∼ Ni (0, C) movement of the population mean m

slide-78
SLIDE 78

Covariance Matrix Adaptation

Rank-One Update

m ← m + σy w, y w = µ

i=1 wi y i:λ,

y i ∼ Ni (0, C) mixture of distribution C and step y w, C ← 0.8 × C + 0.2 × y w y T

w

slide-79
SLIDE 79

Covariance Matrix Adaptation

Rank-One Update

m ← m + σy w, y w = µ

i=1 wi y i:λ,

y i ∼ Ni (0, C) new distribution, C ← 0.8 × C + 0.2 × y w y T

w

the ruling principle: the adaptation increases the likelihood of successful steps, y w, to appear again

slide-80
SLIDE 80

Covariance Matrix Adaptation

Rank-One Update

Initialize m ∈ Rn, and C = I, set σ = 1, learning rate ccov ≈ 2/n2 While not terminate xi = m + σ y i, y i ∼ Ni (0, C) , m ← m + σy w where y w =

µ

  • i=1

wi y i:λ C ← (1 − ccov)C + ccovµw y w y T

w rank-one

where µw = 1 µ

i=1 wi 2 ≥ 1

slide-81
SLIDE 81

Problem Statement Stochastic search algorithms - basics Adaptive Evolution Strategies Mean Vector Adaptation Step-size control Covariance Matrix Adaptation Rank-One Update Cumulation—the Evolution Path Rank-µ Update

slide-82
SLIDE 82

Cumulation

The Evolution Path

Evolution Path

Conceptually, the evolution path is the search path the strategy takes over a number of iteration steps. It can be expressed as a sum of consecutive steps of the mean m. An exponentially weighted sum

  • f steps y w is used

pc ∝

g

  • i=0

(1 − cc)g−i

  • exponentially

fading weights

y (i)

w

slide-83
SLIDE 83

Cumulation

The Evolution Path

Evolution Path

Conceptually, the evolution path is the search path the strategy takes over a number of iteration steps. It can be expressed as a sum of consecutive steps of the mean m. An exponentially weighted sum

  • f steps y w is used

pc ∝

g

  • i=0

(1 − cc)g−i

  • exponentially

fading weights

y (i)

w

The recursive construction of the evolution path (cumulation): pc ← (1 − cc)

  • decay factor

pc +

  • 1 − (1 − cc)2√µw
  • normalization factor

y w

  • input =

m−mold σ

where µw =

1 wi 2 , cc ≪ 1. History information is accumulated in the evolution

path.

slide-84
SLIDE 84

Cumulation

Utilizing the Evolution Path We used y w y T

w for updating C. Because y w y T w = −y w(−y w)T the sign of y w

is lost.

slide-85
SLIDE 85

Cumulation

Utilizing the Evolution Path We used y w y T

w for updating C. Because y w y T w = −y w(−y w)T the sign of y w

is lost.

slide-86
SLIDE 86

Cumulation

Utilizing the Evolution Path We used y w y T

w for updating C. Because y w y T w = −y w(−y w)T the sign of y w

is lost. The sign information is (re-)introduced by using the evolution path. pc ← (1 − cc)

  • decay factor

pc +

  • 1 − (1 − cc)2√µw
  • normalization factor

y w C ← (1 − ccov)C + ccov pc pc

T rank-one

where µw =

1 wi 2 , cc ≪ 1.

slide-87
SLIDE 87

Using an evolution path for the rank-one update of the covariance matrix reduces the number of function evaluations to adapt to a straight ridge from O(n2) to O(n).(3) The overall model complexity is n2 but important parts of the model can be learned in time of order n

3Hansen, Müller and Koumoutsakos 2003. Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation, 11(1),

  • pp. 1-18
slide-88
SLIDE 88

Rank-µ Update

xi = m + σ y i, y i ∼ Ni (0, C) , m ← m + σy w y w = µ

i=1 wi y i:λ

The rank-µ update extends the update rule for large population sizes λ using µ > 1 vectors to update C at each iteration step.

slide-89
SLIDE 89

Rank-µ Update

xi = m + σ y i, y i ∼ Ni (0, C) , m ← m + σy w y w = µ

i=1 wi y i:λ

The rank-µ update extends the update rule for large population sizes λ using µ > 1 vectors to update C at each iteration step. The matrix Cµ =

µ

  • i=1

wi y i:λy T

i:λ

computes a weighted mean of the outer products of the best µ steps and has rank min(µ, n) with probability one.

slide-90
SLIDE 90

Rank-µ Update

xi = m + σ y i, y i ∼ Ni (0, C) , m ← m + σy w y w = µ

i=1 wi y i:λ

The rank-µ update extends the update rule for large population sizes λ using µ > 1 vectors to update C at each iteration step. The matrix Cµ =

µ

  • i=1

wi y i:λy T

i:λ

computes a weighted mean of the outer products of the best µ steps and has rank min(µ, n) with probability one. The rank-µ update then reads C ← (1 − ccov) C + ccov Cµ where ccov ≈ µw/n2 and ccov ≤ 1.

slide-91
SLIDE 91

xi = m + σ yi , yi ∼ N (0, C)

sampling of λ = 150 solutions where C = I and σ = 1

Cµ =

1 µ

yi:λyT

i:λ

C ← (1 − 1) × C + 1 × Cµ

calculating C where µ = 50, w1 = · · · = wµ = 1

µ, and

ccov = 1

mnew ← m + 1

µ

yi:λ

new distribution

slide-92
SLIDE 92

The rank-µ update

◮ increases the possible learning rate in large populations

roughly from 2/n2 to µw/n2

◮ can reduce the number of necessary iterations roughly from

O(n2) to O(n) (4)

given µw ∝ λ ∝ n

Therefore the rank-µ update is the primary mechanism whenever a large population size is used

say λ ≥ 3 n + 10

4Hansen, Müller, and Koumoutsakos 2003. Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation, 11(1),

  • pp. 1-18
slide-93
SLIDE 93

The rank-µ update

◮ increases the possible learning rate in large populations

roughly from 2/n2 to µw/n2

◮ can reduce the number of necessary iterations roughly from

O(n2) to O(n) (4)

given µw ∝ λ ∝ n

Therefore the rank-µ update is the primary mechanism whenever a large population size is used

say λ ≥ 3 n + 10

The rank-one update

◮ uses the evolution path and reduces the number of necessary

function evaluations to learn straight ridges from O(n2) to O(n) .

4Hansen, Müller, and Koumoutsakos 2003. Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation, 11(1),

  • pp. 1-18
slide-94
SLIDE 94

The rank-µ update

◮ increases the possible learning rate in large populations

roughly from 2/n2 to µw/n2

◮ can reduce the number of necessary iterations roughly from

O(n2) to O(n) (4)

given µw ∝ λ ∝ n

Therefore the rank-µ update is the primary mechanism whenever a large population size is used

say λ ≥ 3 n + 10

The rank-one update

◮ uses the evolution path and reduces the number of necessary

function evaluations to learn straight ridges from O(n2) to O(n) . Rank-one update and rank-µ update can be combined

4Hansen, Müller, and Koumoutsakos 2003. Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation, 11(1),

  • pp. 1-18
slide-95
SLIDE 95

Summary of Equations

The Covariance Matrix Adaptation Evolution Strategy

Input: m ∈ Rn, σ ∈ R+, λ Initialize: C = I, and pc = 0, pσ = 0, Set: cc ≈ 4/n, cσ ≈ 4/n, c1 ≈ 2/n2, cµ ≈ µw/n2, c1 + cµ ≤ 1, dσ ≈ 1 + µw

n , and wi=1...λ such that µw = 1 µ

i=1 wi 2 ≈ 0.3 λ

While not terminate xi = m + σ y i, y i ∼ Ni (0, C) , for i = 1, . . . , λ sampling m ← µ

i=1 wi xi:λ = m + σy w

where y w = µ

i=1 wi y i:λ

update mean pc ← (1 − cc) pc + 1 I{pσ<1.5√n}

  • 1 − (1 − cc)2√µw y w

cumulation for C pσ ← (1 − cσ) pσ +

  • 1 − (1 − cσ)2√µw C− 1

2 y w

cumulation for σ C ← (1 − c1 − cµ) C + c1 pc pcT + cµ µ

i=1 wi y i:λy T i:λ

update C σ ← σ × exp

EN(0,I) − 1

  • update of σ

Not covered on this slide: termination, restarts, useful output, boundaries and encoding

slide-96
SLIDE 96

Experimentum Crucis (0)

What did we want to achieve?

◮ reduce any convex-quadratic function

f (x) = xTHx

e.g. f (x) = n

i=1 106 i−1

n−1 x2

i

to the sphere model f (x) = xTx

without use of derivatives

◮ lines of equal density align with lines of equal fitness

C ∝ H−1

in a stochastic sense

slide-97
SLIDE 97

Experimentum Crucis (1)

f convex quadratic, separable

2000 4000 6000 10

−10

10

−5

10 10

5

10

10

1e−05 1e−08 f=2.66178883753772e−10 blue:abs(f), cyan:f−min(f), green:sigma, red:axis ratio 2000 4000 6000 −5 5 10 15 x(3)=−6.9109e−07 x(4)=−3.8371e−07 x(5)=−1.0864e−07 x(9)=2.741e−09 x(8)=4.5138e−09 x(7)=2.7147e−08 x(6)=5.6127e−08 x(2)=2.2083e−06 x(1)=3.0931e−06 Object Variables (9−D) 2000 4000 6000 10

−4

10

−2

10 10

2

Principle Axes Lengths function evaluations 2000 4000 6000 10

−4

10

−2

10 10

2

9 8 7 6 5 4 3 2 1 Standard Deviations in Coordinates divided by sigma function evaluations

f (x) = n

i=1 10α i−1

n−1 x2

i , α = 6

slide-98
SLIDE 98

Experimentum Crucis (2)

f convex quadratic, as before but non-separable (rotated)

2000 4000 6000 10

−10

10

−5

10 10

5

10

10

8e−06 2e−06 f=7.91055728188042e−10 blue:abs(f), cyan:f−min(f), green:sigma, red:axis ratio 2000 4000 6000 −4 −2 2 4 x(8)=−2.6301e−06 x(2)=−2.1131e−06 x(3)=−2.0364e−06 x(7)=−8.3583e−07 x(4)=−2.9981e−07 x(9)=−7.3812e−08 x(6)=1.2468e−06 x(5)=1.2552e−06 x(1)=2.0052e−06 Object Variables (9−D) 2000 4000 6000 10

−4

10

−2

10 10

2

Principle Axes Lengths function evaluations 2000 4000 6000 10 4 9 6 5 7 2 8 1 3 Standard Deviations in Coordinates divided by sigma function evaluations

C ∝ H−1 for all g, H f (x) = g

  • xTHx
  • , g : R → R stricly increasing
slide-99
SLIDE 99

Comparison to BFGS, NEWUOA, PSO and DE

f convex quadratic, separable with varying condition number α

10

2

10

4

10

6

10

8

10

10

10

1

10

2

10

3

10

4

10

5

10

6

10

7

10 Ellipsoid dimension 20, 21 trials, tolerance 1e−09, eval max 1e+07

Condition number SP1

NEWUOA BFGS DE2 PSO CMAES

BFGS (Broyden et al 1970) NEWUAO (Powell 2004) DE (Storn & Price 1996) PSO (Kennedy & Eberhart 1995) CMA-ES (Hansen & Ostermeier 2001) f (x) = g(xTHx) with H diagonal g identity (for BFGS and NEWUOA) g any order-preserving = strictly increasing function (for all other) SP1 = average number of objective function evaluations5 to reach the target function value of g −1(10−9)

5Auger et.al. (2009): Experimental comparisons of derivative free optimization algorithms, SEA

slide-100
SLIDE 100

Comparison to BFGS, NEWUOA, PSO and DE

f convex quadratic, non-separable (rotated) with varying condition number α

10

2

10

4

10

6

10

8

10

10

10

1

10

2

10

3

10

4

10

5

10

6

10

7

10 Rotated Ellipsoid dimension 20, 21 trials, tolerance 1e−09, eval max 1e+07

Condition number SP1

NEWUOA BFGS DE2 PSO CMAES

BFGS (Broyden et al 1970) NEWUAO (Powell 2004) DE (Storn & Price 1996) PSO (Kennedy & Eberhart 1995) CMA-ES (Hansen & Ostermeier 2001) f (x) = g(xTHx) with H full g identity (for BFGS and NEWUOA) g any order-preserving = strictly increasing function (for all other) SP1 = average number of objective function evaluations6 to reach the target function value of g −1(10−9)

6Auger et.al. (2009): Experimental comparisons of derivative free optimization algorithms, SEA

slide-101
SLIDE 101

Comparison to BFGS, NEWUOA, PSO and DE

f non-convex, non-separable (rotated) with varying condition number α

10

2

10

4

10

6

10

8

10

10

10

1

10

2

10

3

10

4

10

5

10

6

10

7

10 Sqrt of sqrt of rotated ellipsoid dimension 20, 21 trials, tolerance 1e−09, eval max 1e+07

Condition number SP1

NEWUOA BFGS DE2 PSO CMAES

BFGS (Broyden et al 1970) NEWUAO (Powell 2004) DE (Storn & Price 1996) PSO (Kennedy & Eberhart 1995) CMA-ES (Hansen & Ostermeier 2001) f (x) = g(xTHx) with H full g : x → x1/4 (for BFGS and NEWUOA) g any order-preserving = strictly increasing function (for all other) SP1 = average number of objective function evaluations7 to reach the target function value of g −1(10−9)

7Auger et.al. (2009): Experimental comparisons of derivative free optimization algorithms, SEA

slide-102
SLIDE 102

Comparison during BBOB at GECCO 2009

24 functions and 31 algorithms in 20-D

1 2 3 4 5 6 7 8 Running length / dimension 0.0 0.2 0.4 0.6 0.8 1.0 Proportion of functions

Monte Carlo BayEDAcG DIRECT DEPSO simple GA LSfminbnd LSstep Rosenbrock MCS PSO POEMS EDA-PSO NELDER (Doe) NELDER (Han) full NEWUOA ALPS-GA GLOBAL PSO_Bounds BFGS (1+1)-ES Cauchy EDA (1+1)-CMA-ES NEWUOA G3-PCX DASA MA-LS-Chain VNS (Garcia) iAMaLGaM IDEA AMaLGaM IDEA BIPOP-CMA-ES best 2009

(24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24) = 1200 funcs

slide-103
SLIDE 103

Comparison during BBOB at GECCO 2010

24 functions and 20+ algorithms in 20-D

1 2 3 4 5 6 7 8 Running length / dimension 0.0 0.2 0.4 0.6 0.8 1.0 Proportion of functions

Monte Carlo SPSA Basic RCGA Artif Bee Colony

  • POEMS

GLOBAL (1,2s)-CMA-ES (1,2)-CMA-ES Cauchy EDA NBC-CMA NEWUOA (1,4s)-CMA-ES (1,4)-CMA-ES avg NEWUOA (1,4m)-CMA-ES (1,4ms)-CMA-ES (1,2ms)-CMA-ES (1+1)-CMA-ES (1,2m)-CMA-ES (1+2ms)-CMA-ES CMA-EGS (IPOP,r1) nPOEMS PM-AdapSS-DE DE (Uniform) Adap DE (F-AUC) IPOP-CMA-ES IPOP-aCMA-ES CMA+DE-MOS BIPOP-CMA-ES best 2009

(24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24) = 1200 funcs

slide-104
SLIDE 104

Comparison during BBOB at GECCO 2009

30 noisy functions and 20 algorithms in 20-D

1 2 3 4 5 6 7 8 Running length / dimension 0.0 0.2 0.4 0.6 0.8 1.0 Proportion of functions

Monte Carlo BFGS SNOBFIT MCS DEPSO PSO_Bounds PSO EDA-PSO (1+1)-CMA-ES GLOBAL DASA (1+1)-ES full NEWUOA BayEDAcG ALPS-GA MA-LS-Chain VNS (Garcia) iAMaLGaM IDEA AMaLGaM IDEA BIPOP-CMA-ES best 2009

(30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30) = 1500 funcs

slide-105
SLIDE 105

Comparison during BBOB at GECCO 2010

30 noisy functions and 10+ algorithms in 20-D

1 2 3 4 5 6 7 8 Running length / dimension 0.0 0.2 0.4 0.6 0.8 1.0 Proportion of functions

Monte Carlo SPSA NEWUOA avg NEWUOA GLOBAL (1,2s)-CMA-ES (1,2)-CMA-ES (1,4s)-CMA-ES (1,2m)-CMA-ES (1,4)-CMA-ES (1,4m)-CMA-ES (1,2ms)-CMA-ES (1,4ms)-CMA-ES Basic RCGA CMA-EGS (IPOP,r1) CMA+DE-MOS IPOP-CMA-ES BIPOP-CMA-ES IPOP-aCMA-ES best 2009

(30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30) = 1500 funcs

slide-106
SLIDE 106

Problem Statement Stochastic search algorithms - basics Adaptive Evolution Strategies Mean Vector Adaptation Step-size control Covariance Matrix Adaptation Rank-One Update Cumulation—the Evolution Path Rank-µ Update

slide-107
SLIDE 107

The Continuous Search Problem

Difficulties of a non-linear optimization problem are

◮ dimensionality and non-separabitity

demands to exploit problem structure, e.g. neighborhood

◮ ill-conditioning

demands to acquire a second order model

◮ ruggedness

demands a non-local (stochastic?) approach

Approach: population based stochastic search, coordinate system independent and with second order estimations (covariances)

slide-108
SLIDE 108

Main Features of (CMA) Evolution Strategies

  • 1. Multivariate normal distribution to generate new search points

follows the maximum entropy principle

  • 2. Rank-based selection

implies invariance, same performance on g(f (x)) for any increasing g more invariance properties are featured

  • 3. Step-size control facilitates fast (log-linear) convergence

based on an evolution path (a non-local trajectory)

  • 4. Covariance matrix adaptation (CMA) increases the likelihood
  • f previously successful steps and can improve performance by
  • rders of magnitude

C ∝ H−1 ⇐ ⇒ adapts a variable metric ⇐ ⇒ new (rotated) problem representation = ⇒ f (x) = g(xTHx) reduces to g(xTx)

slide-109
SLIDE 109

Limitations

  • f CMA Evolution Strategies

◮ internal CPU-time: 10−8n2 seconds per function evaluation on a

2GHz PC, tweaks are available

100 000 f -evaluations in 1000-D take 1/4 hours internal CPU-time

◮ better methods are presumably available in case of

◮ partly separable problems ◮ specific problems, for example with cheap gradients

specific methods

◮ small dimension (n ≪ 10)

for example Nelder-Mead

◮ small running times (number of f -evaluations ≪ 100n)

model-based methods