Covariance Matrix Adaptation Evolution Strategies - PowerPoint PPT Presentation


SLIDE 1

Covariance Matrix Adaptation

SLIDE 2

Evolution Strategies

Recalling

New search points are sampled normally distributed

xi ∼ m + σ Ni(0, C)   for i = 1, …, λ

as perturbations of m, where xi, m ∈ ℝn, σ ∈ ℝ+, C ∈ ℝn×n

where
• the mean vector m ∈ ℝn represents the favorite solution
• the so-called step-size σ ∈ ℝ+ controls the step length
• the covariance matrix C ∈ ℝn×n determines the shape of the distribution ellipsoid

The remaining question is how to update C.
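As a concrete illustration, here is a minimal numpy sketch of this sampling step (the function name sample_population and the parameter choices are illustrative, not from the slides):

```python
import numpy as np

def sample_population(m, sigma, C, lam, rng=None):
    """Draw lam candidates x_i ~ m + sigma * N_i(0, C)."""
    rng = rng or np.random.default_rng()
    A = np.linalg.cholesky(C)               # C = A A^T, so A z ~ N(0, C) for z ~ N(0, I)
    Z = rng.standard_normal((lam, len(m)))  # independent standard normal vectors
    Y = Z @ A.T                             # rows y_i ~ N(0, C)
    return m + sigma * Y, Y                 # x_i as perturbations of m

# example: n = 5, lambda = 10, isotropic start (C = I)
X, Y = sample_population(np.zeros(5), 1.0, np.eye(5), lam=10)
```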

SLIDE 3

Covariance Matrix Adaptation

Rank-One Update

m ← m + σ yw,   yw = Σi=1…µ wi yi:λ,   yi ∼ Ni(0, C)

initial distribution, C = I

SLIDE 4

Covariance Matrix Adaptation

Rank-One Update

m ← m + σ yw,   yw = Σi=1…µ wi yi:λ,   yi ∼ Ni(0, C)

initial distribution, C = I

SLIDE 5

Covariance Matrix Adaptation

Rank-One Update

m ← m + σ yw,   yw = Σi=1…µ wi yi:λ,   yi ∼ Ni(0, C)

yw: movement of the population mean m (disregarding σ)

SLIDE 6

Covariance Matrix Adaptation

Rank-One Update

m ← m + σ yw,   yw = Σi=1…µ wi yi:λ,   yi ∼ Ni(0, C)

mixture of distribution C and step yw:  C ← 0.8 × C + 0.2 × yw ywᵀ

SLIDE 7

Covariance Matrix Adaptation

Rank-One Update

m ← m + σ yw,   yw = Σi=1…µ wi yi:λ,   yi ∼ Ni(0, C)

new distribution (disregarding σ)

SLIDE 8

Covariance Matrix Adaptation

Rank-One Update

m ← m + σ yw,   yw = Σi=1…µ wi yi:λ,   yi ∼ Ni(0, C)

new distribution (disregarding σ)

SLIDE 9

Covariance Matrix Adaptation

Rank-One Update

m ← m + σ yw,   yw = Σi=1…µ wi yi:λ,   yi ∼ Ni(0, C)

movement of the population mean m

SLIDE 10

Covariance Matrix Adaptation

Rank-One Update

m ← m + σ yw,   yw = Σi=1…µ wi yi:λ,   yi ∼ Ni(0, C)

mixture of distribution C and step yw:  C ← 0.8 × C + 0.2 × yw ywᵀ

SLIDE 11

Covariance Matrix Adaptation

Rank-One Update

m ← m + σ yw,   yw = Σi=1…µ wi yi:λ,   yi ∼ Ni(0, C)

new distribution:  C ← 0.8 × C + 0.2 × yw ywᵀ

The ruling principle: the adaptation increases the likelihood of successful steps yw to appear again.

SLIDE 12

Covariance Matrix Adaptation

Rank-One Update

Initialize m ∈ ℝn and C = I, set σ = 1, learning rate ccov ≈ 2/n²

While not terminate
    xi = m + σ yi,  yi ∼ Ni(0, C)
    m ← m + σ yw   where  yw = Σi=1…µ wi yi:λ
    C ← (1 − ccov) C + ccov µw yw ywᵀ        (rank-one)

where µw = 1 / Σi=1…µ wi² ≥ 1
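The loop above as a hedged numpy sketch; the objective f, the log-rank weights, and the iteration budget are my placeholder choices, not prescribed by the slide:

```python
import numpy as np

def rank_one_cma(f, m, n_iter=500, lam=10, rng=np.random.default_rng(0)):
    """Rank-one update only: C <- (1 - ccov) C + ccov * muw * yw yw^T."""
    n = len(m)
    mu = lam // 2
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))   # assumed weight scheme
    w /= w.sum()                                          # weights sum to 1
    muw = 1.0 / np.sum(w**2)                              # variance effective selection mass
    sigma, C, ccov = 1.0, np.eye(n), 2.0 / n**2
    for _ in range(n_iter):
        A = np.linalg.cholesky(C)
        Y = rng.standard_normal((lam, n)) @ A.T           # y_i ~ N(0, C)
        order = np.argsort([f(m + sigma * y) for y in Y]) # rank by fitness
        yw = w @ Y[order[:mu]]                            # weighted best-mu step
        m = m + sigma * yw                                # update mean
        C = (1 - ccov) * C + ccov * muw * np.outer(yw, yw)  # rank-one update
    return m, C
```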

SLIDE 13

Problem Statement
Stochastic search algorithms - basics
Adaptive Evolution Strategies
    Mean Vector Adaptation
    Step-size control
    Covariance Matrix Adaptation
        Rank-One Update
        Cumulation—the Evolution Path
        Rank-µ Update

SLIDE 14

Cumulation

The Evolution Path

Evolution Path

Conceptually, the evolution path is the search path the strategy takes over a number of iteration steps. It can be expressed as a sum of consecutive steps of the mean m. An exponentially weighted sum of steps yw is used:

    pc ∝ Σi=0…g (1 − cc)^(g−i) yw^(i)

where the factors (1 − cc)^(g−i) are exponentially fading weights.

SLIDE 15

Cumulation

The Evolution Path

Evolution Path

Conceptually, the evolution path is the search path the strategy takes over a number of iteration steps. It can be expressed as a sum of consecutive steps of the mean m. An exponentially weighted sum of steps yw is used:

    pc ∝ Σi=0…g (1 − cc)^(g−i) yw^(i)

where the factors (1 − cc)^(g−i) are exponentially fading weights.

The recursive construction of the evolution path (cumulation):

    pc ← (1 − cc) pc + √(1 − (1 − cc)²) √µw yw

with decay factor (1 − cc), normalization factor √(1 − (1 − cc)²) √µw, and input yw = (m − mold)/σ, where µw = 1/Σ wi² and cc ≪ 1. History information is accumulated in the evolution path.
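In code the recursion is one line; a sketch with pc, cc, µw, yw named as above:

```python
import numpy as np

def update_evolution_path(pc, yw, cc, muw):
    """pc <- (1 - cc) pc + sqrt(1 - (1 - cc)^2) * sqrt(muw) * yw;
    the normalization factor keeps pc ~ N(0, C) under random selection."""
    return (1 - cc) * pc + np.sqrt(1 - (1 - cc)**2) * np.sqrt(muw) * yw
```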

SLIDE 16

Cumulation

Utilizing the Evolution Path

We used yw ywᵀ for updating C. Because yw ywᵀ = (−yw)(−yw)ᵀ, the sign of yw is lost.

SLIDE 17

Cumulation

Utilizing the Evolution Path

We used yw ywᵀ for updating C. Because yw ywᵀ = (−yw)(−yw)ᵀ, the sign of yw is lost.

SLIDE 18

Cumulation

Utilizing the Evolution Path

We used yw ywᵀ for updating C. Because yw ywᵀ = (−yw)(−yw)ᵀ, the sign of yw is lost. The sign information is (re-)introduced by using the evolution path:

    pc ← (1 − cc) pc + √(1 − (1 − cc)²) √µw yw
    C ← (1 − ccov) C + ccov pc pcᵀ        (rank-one)

with decay factor (1 − cc) and normalization factor √(1 − (1 − cc)²) √µw, where µw = 1/Σ wi² and cc ≪ 1.
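Dropped into the rank-one sketch from above, the covariance update then consumes pc instead of yw, so consecutive steps in the same direction reinforce each other while alternating signs cancel out (a sketch, same assumptions as before):

```python
# inside the iteration loop of rank_one_cma, after computing yw:
pc = update_evolution_path(pc, yw, cc, muw)       # cumulation, cc << 1; pc starts at 0
C = (1 - ccov) * C + ccov * np.outer(pc, pc)      # rank-one update from the path
```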

SLIDE 19

Using an evolution path for the rank-one update of the covariance matrix reduces the number of function evaluations to adapt to a straight ridge from O(n²) to O(n).³ The overall model complexity is n², but important parts of the model can be learned in time of order n.

³ Hansen, Müller and Koumoutsakos 2003. Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation, 11(1), pp. 1-18.
SLIDE 20

Rank-µ Update

xi = m + σ yi,  yi ∼ Ni(0, C),   m ← m + σ yw,   yw = Σi=1…µ wi yi:λ

The rank-µ update extends the update rule for large population sizes λ, using µ > 1 vectors to update C at each iteration step.

SLIDE 21

Rank-µ Update

xi = m + σ yi,  yi ∼ Ni(0, C),   m ← m + σ yw,   yw = Σi=1…µ wi yi:λ

The rank-µ update extends the update rule for large population sizes λ, using µ > 1 vectors to update C at each iteration step. The matrix

    Cµ = Σi=1…µ wi yi:λ yi:λᵀ

computes a weighted mean of the outer products of the best µ steps and has rank min(µ, n) with probability one.

SLIDE 22

Rank-µ Update

xi = m + σ yi,  yi ∼ Ni(0, C),   m ← m + σ yw,   yw = Σi=1…µ wi yi:λ

The rank-µ update extends the update rule for large population sizes λ, using µ > 1 vectors to update C at each iteration step. The matrix

    Cµ = Σi=1…µ wi yi:λ yi:λᵀ

computes a weighted mean of the outer products of the best µ steps and has rank min(µ, n) with probability one. The rank-µ update then reads

    C ← (1 − ccov) C + ccov Cµ

where ccov ≈ µw/n² and ccov ≤ 1.
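A numpy sketch of the rank-µ update; Y_sel is assumed to hold the best µ steps yi:λ as rows, already sorted by fitness:

```python
import numpy as np

def rank_mu_update(C, Y_sel, w, ccov):
    """C <- (1 - ccov) C + ccov * sum_i w_i y_{i:lam} y_{i:lam}^T."""
    C_mu = sum(wi * np.outer(y, y) for wi, y in zip(w, Y_sel))  # weighted outer products
    return (1 - ccov) * C + ccov * C_mu
```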

SLIDE 23

xi = m + σ yi,  yi ∼ N(0, C)
    sampling of λ = 150 solutions where C = I and σ = 1

Cµ = (1/µ) Σ yi:λ yi:λᵀ
C ← (1 − 1) × C + 1 × Cµ
    calculating C where µ = 50, w1 = · · · = wµ = 1/µ, and ccov = 1

mnew ← m + (1/µ) Σ yi:λ
    new distribution
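The same walk-through as runnable numpy code (the 2-D setting and the sphere objective are my assumptions for the demo; the slide only fixes λ = 150, µ = 50, equal weights, and ccov = 1):

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam, mu = 2, 150, 50
m, sigma, C = np.zeros(n), 1.0, np.eye(n)

Y = rng.standard_normal((lam, n))                      # y_i ~ N(0, I) since C = I
f = lambda x: x @ x                                    # placeholder objective
best = np.argsort([f(m + sigma * y) for y in Y])[:mu]  # indices of the mu best
C_mu = sum(np.outer(Y[i], Y[i]) for i in best) / mu    # equal weights w_i = 1/mu
C = (1 - 1) * C + 1 * C_mu                             # ccov = 1: C is replaced by C_mu
m = m + (sigma / mu) * Y[best].sum(axis=0)             # m <- m + (1/mu) sum y_{i:lam}
```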

SLIDE 24

The rank-µ update
• increases the possible learning rate in large populations
      roughly from 2/n² to µw/n²
• can reduce the number of necessary iterations roughly from O(n²) to O(n)⁴
      given µw ∝ λ ∝ n

Therefore the rank-µ update is the primary mechanism whenever a large population size is used, say λ ≥ 3n + 10.

⁴ Hansen, Müller, and Koumoutsakos 2003. Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation, 11(1), pp. 1-18.
SLIDE 25

The rank-µ update
• increases the possible learning rate in large populations
      roughly from 2/n² to µw/n²
• can reduce the number of necessary iterations roughly from O(n²) to O(n)⁴
      given µw ∝ λ ∝ n

Therefore the rank-µ update is the primary mechanism whenever a large population size is used, say λ ≥ 3n + 10.

The rank-one update
• uses the evolution path and reduces the number of necessary function evaluations to learn straight ridges from O(n²) to O(n).

⁴ Hansen, Müller, and Koumoutsakos 2003. Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation, 11(1), pp. 1-18.
SLIDE 26

The rank-µ update
• increases the possible learning rate in large populations
      roughly from 2/n² to µw/n²
• can reduce the number of necessary iterations roughly from O(n²) to O(n)⁴
      given µw ∝ λ ∝ n

Therefore the rank-µ update is the primary mechanism whenever a large population size is used, say λ ≥ 3n + 10.

The rank-one update
• uses the evolution path and reduces the number of necessary function evaluations to learn straight ridges from O(n²) to O(n).

Rank-one update and rank-µ update can be combined.

⁴ Hansen, Müller, and Koumoutsakos 2003. Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation, 11(1), pp. 1-18.
SLIDE 27

Summary of Equations

The Covariance Matrix Adaptation Evolution Strategy

Input: m ∈ ℝn, σ ∈ ℝ+, λ
Initialize: C = I, pc = 0, pσ = 0
Set: cc ≈ 4/n, cσ ≈ 4/n, c1 ≈ 2/n², cµ ≈ µw/n², c1 + cµ ≤ 1, dσ ≈ 1 + √(µw/n),
     and wi=1…λ such that µw = 1 / Σi=1…µ wi² ≈ 0.3 λ

While not terminate
    xi = m + σ yi,  yi ∼ Ni(0, C),  for i = 1, …, λ                        sampling
    m ← Σi=1…µ wi xi:λ = m + σ yw   where yw = Σi=1…µ wi yi:λ               update mean
    pc ← (1 − cc) pc + 1{‖pσ‖ < 1.5√n} √(1 − (1 − cc)²) √µw yw              cumulation for C
    pσ ← (1 − cσ) pσ + √(1 − (1 − cσ)²) √µw C^(−1/2) yw                     cumulation for σ
    C ← (1 − c1 − cµ) C + c1 pc pcᵀ + cµ Σi=1…µ wi yi:λ yi:λᵀ               update C
    σ ← σ × exp( (cσ/dσ) (‖pσ‖ / E‖N(0, I)‖ − 1) )                          update of σ

Not covered on this slide: termination, restarts, useful output, boundaries and encoding
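For reference, a compact numpy sketch of these summary equations. It follows the slide's update order literally; the log-rank weights, the default λ, and the E‖N(0, I)‖ approximation are common conventions assumed here, and a production implementation (e.g. Hansen's pycma) adds safeguards that are omitted:

```python
import numpy as np

def cma_es(f, m, sigma, lam=None, max_evals=10000, rng=None):
    """Sketch of the CMA-ES summary equations; not a production implementation."""
    rng = rng or np.random.default_rng()
    n = len(m)
    lam = lam or 4 + int(3 * np.log(n))                  # assumed default
    mu = lam // 2
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))  # assumed log-rank weights
    w /= w.sum()
    muw = 1.0 / np.sum(w**2)                             # ~0.3 lam for these weights
    cc, cs = 4.0 / n, 4.0 / n
    c1 = 2.0 / n**2
    cmu = min(muw / n**2, 1 - c1)                        # enforce c1 + cmu <= 1
    ds = 1 + np.sqrt(muw / n)
    chi_n = np.sqrt(n) * (1 - 1/(4*n) + 1/(21*n**2))     # approximates E||N(0, I)||
    C, pc, ps = np.eye(n), np.zeros(n), np.zeros(n)
    evals = 0
    while evals < max_evals:
        d2, B = np.linalg.eigh(C)                        # C = B diag(d^2) B^T
        d = np.sqrt(np.maximum(d2, 1e-20))
        Y = rng.standard_normal((lam, n)) @ np.diag(d) @ B.T   # y_i ~ N(0, C)
        X = m + sigma * Y                                # sampling
        idx = np.argsort([f(x) for x in X]); evals += lam
        yw = w @ Y[idx[:mu]]
        m = m + sigma * yw                               # update mean
        h = 1.0 if np.linalg.norm(ps) < 1.5 * np.sqrt(n) else 0.0
        pc = (1 - cc) * pc + h * np.sqrt(1 - (1 - cc)**2) * np.sqrt(muw) * yw   # cumulation for C
        Cinv_yw = B @ ((B.T @ yw) / d)                   # C^(-1/2) yw
        ps = (1 - cs) * ps + np.sqrt(1 - (1 - cs)**2) * np.sqrt(muw) * Cinv_yw  # cumulation for sigma
        C = ((1 - c1 - cmu) * C + c1 * np.outer(pc, pc)
             + cmu * sum(wi * np.outer(y, y) for wi, y in zip(w, Y[idx[:mu]]))) # update C
        sigma *= np.exp((cs / ds) * (np.linalg.norm(ps) / chi_n - 1))           # update sigma
    return m, C, sigma
```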

SLIDE 28

Rank-one and Rank-mu updates

SLIDE 29

Rank-one and Rank-mu update - default pop size

SLIDE 30

Rank-one and Rank-mu update - larger pop size

SLIDE 31

Experimentum Crucis (0)

What did we want to achieve?

• reduce any convex-quadratic function

      f(x) = xᵀHx,   e.g. f(x) = Σi=1…n 10^(6(i−1)/(n−1)) xi²

  to the sphere model f(x) = xᵀx, without use of derivatives

• lines of equal density align with lines of equal fitness

      C ∝ H⁻¹ in a stochastic sense
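One hedged way to run the experiment with the cma_es sketch from the summary slide: adapt on the ellipsoid and then check how close C is to a multiple of H⁻¹ (measuring the condition number of H^(1/2) C H^(1/2) is my choice of diagnostic):

```python
import numpy as np

n, alpha = 9, 6
H = np.diag([10**(alpha * i / (n - 1)) for i in range(n)])
felli = lambda x: x @ H @ x                     # f(x) = x^T H x

m, C, _ = cma_es(felli, 3 * np.ones(n), 1.0, max_evals=30000,
                 rng=np.random.default_rng(5))

# if C ∝ H^(-1), then H^(1/2) C H^(1/2) is a multiple of the identity,
# i.e. its condition number is close to 1
Hs = np.sqrt(H)                                 # H is diagonal, elementwise sqrt
ev = np.linalg.eigvalsh(Hs @ C @ Hs)
print("axis ratio of H^(1/2) C H^(1/2):", np.sqrt(ev.max() / ev.min()))
```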

SLIDE 32

Experimentum Crucis (1)

f convex quadratic, separable

[Figure: single run; panels show blue: abs(f), cyan: f − min(f), green: sigma, red: axis ratio (final f ≈ 2.66e−10); object variables (9-D); principal axes lengths; and standard deviations in coordinates divided by sigma, each plotted against function evaluations.]

f(x) = Σi=1…n 10^(α(i−1)/(n−1)) xi²,   α = 6

SLIDE 33

Experimentum Crucis (2)

f convex quadratic, as before but non-separable (rotated)

[Figure: single run on the rotated problem; same panels as before, against function evaluations (final f ≈ 7.91e−10).]

C ∝ H⁻¹ for all g and H, where f(x) = g(xᵀHx), g : ℝ → ℝ strictly increasing
SLIDE 34

On Invariances

SLIDE 35

Evolution Strategies (ES) Invariance

Invariance

The grand aim of all science is to cover the greatest number of empirical facts by logical deduction from the smallest number of hypotheses or axioms. — Albert Einstein

Empirical performance results
• from benchmark functions
• from solved real-world problems
are only useful if they generalize to other problems.

Invariance is a strong non-empirical statement about generalization: it generalizes (identical) performance from a single function to a whole class of functions. Consequently, invariance is important for the evaluation of search algorithms.

SLIDE 36

Evolution Strategies (ES) Invariance

Rotational Invariance in Search Space

Invariance to orthogonal (rigid) transformations R, where RRᵀ = I
    e.g. true for simple evolution strategies; recombination operators might jeopardize rotational invariance⁴ ⁵

f(x) ↔ f(Rx)

Identical behavior on f and fR:
    f : x ↦ f(x),    x(t=0) = x0
    fR : x ↦ f(Rx),  x(t=0) = R⁻¹(x0)

No difference can be observed w.r.t. the argument of f.

⁴ Salomon 1996. "Reevaluating Genetic Algorithm Performance under Coordinate Rotation of Benchmark Functions; A survey of some theoretical and practical aspects of genetic algorithms." BioSystems, 39(3):263-278
⁵ Hansen 2000. Invariance, Self-Adaptation and Correlated Mutations in Evolution Strategies. Parallel Problem Solving from Nature PPSN VI

SLIDE 37

Main Invariances in Optimization

Invariance to strictly increasing transformations of f:
    identical behavior when optimizing x ↦ f(x) and x ↦ g(f(x)), where g : Im(f) → ℝ is strictly increasing
Translation invariance:
    identical behavior when optimizing x ↦ f(x) and x ↦ f(x − a), for all a ∈ ℝn
Rotational invariance:
    identical behavior when optimizing x ↦ f(x) and x ↦ f(Rx), for all orthogonal matrices R
Affine invariance:
    identical behavior when optimizing x ↦ f(x) and x ↦ f(Ax + b), for all invertible A ∈ ℝn×n and all b ∈ ℝn
Scale invariance:
    identical behavior when optimizing x ↦ f(x) and x ↦ f(αx), for all α > 0

SLIDE 38

Main Invariances in Optimization

Invariance to strictly increasing transformations of f:
    identical behavior when optimizing x ↦ f(x) and x ↦ g(f(x)), where g : Im(f) → ℝ is strictly increasing
Translation invariance:
    identical behavior when optimizing x ↦ f(x) and x ↦ f(x − a), for all a ∈ ℝn
Rotational invariance:
    identical behavior when optimizing x ↦ f(x) and x ↦ f(Rx), for all orthogonal matrices R
Affine invariance:
    identical behavior when optimizing x ↦ f(x) and x ↦ f(Ax + b), for all invertible A ∈ ℝn×n and all b ∈ ℝn
Scale invariance:
    identical behavior when optimizing x ↦ f(x) and x ↦ f(αx), for all α > 0

provided the initial state is changed accordingly

SLIDE 39

Hierarchy of Invariance

Affine invariance encompasses rotational invariance, scale invariance, and translation invariance.

SLIDE 40

Exercise - Invariances of (1+1)-ES and CMA-ES

For CMA-ES and for the (1+1)-ES with one-fifth success rule, decide which of the following hold: translation invariance, scale invariance, rotational invariance, affine invariance.

SLIDE 41

Testing for invariances
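One concrete recipe, hedged: feed the same seeded generator to two runs, one on f and one on x ↦ f(Rx) with the start point mapped through R⁻¹ = Rᵀ, and compare the reached f-values; for a rotationally invariant algorithm they agree up to round-off (reusing the cma_es sketch from above):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 9
R, _ = np.linalg.qr(rng.standard_normal((n, n)))        # random orthogonal matrix

f = lambda x: sum(10**(6 * i / (n - 1)) * x[i]**2 for i in range(n))
fR = lambda x: f(R @ x)                                  # rotated problem

x0 = 3 * np.ones(n)
m1, _, _ = cma_es(f,  x0,       1.0, max_evals=20000, rng=np.random.default_rng(42))
m2, _, _ = cma_es(fR, R.T @ x0, 1.0, max_evals=20000, rng=np.random.default_rng(42))
print(f(m1), fR(m2))   # identical trajectories of f-values expected
```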

SLIDE 42

Comparison to BFGS, NEWUOA, PSO and DE

f convex quadratic, separable with varying condition number α

[Figure: SP1 vs. condition number, Ellipsoid function, dimension 20, 21 trials, tolerance 1e−09, eval max 1e+07; curves for NEWUOA, BFGS, DE2, PSO, CMA-ES.]

BFGS (Broyden et al. 1970), NEWUOA (Powell 2004), DE (Storn & Price 1996), PSO (Kennedy & Eberhart 1995), CMA-ES (Hansen & Ostermeier 2001)

f(x) = g(xᵀHx) with H diagonal
g identity (for BFGS and NEWUOA)
g any order-preserving, i.e. strictly increasing, function (for all others)

SP1 = average number of objective function evaluations⁵ to reach the target function value of g⁻¹(10⁻⁹)

⁵ Auger et al. (2009): Experimental comparisons of derivative free optimization algorithms, SEA

SLIDE 43

Comparison to BFGS, NEWUOA, PSO and DE

f convex quadratic, non-separable (rotated) with varying condition number α

[Figure: SP1 vs. condition number, Rotated Ellipsoid, dimension 20, 21 trials, tolerance 1e−09, eval max 1e+07; curves for NEWUOA, BFGS, DE2, PSO, CMA-ES.]

BFGS (Broyden et al. 1970), NEWUOA (Powell 2004), DE (Storn & Price 1996), PSO (Kennedy & Eberhart 1995), CMA-ES (Hansen & Ostermeier 2001)

f(x) = g(xᵀHx) with H full
g identity (for BFGS and NEWUOA)
g any order-preserving, i.e. strictly increasing, function (for all others)

SP1 = average number of objective function evaluations⁶ to reach the target function value of g⁻¹(10⁻⁹)

⁶ Auger et al. (2009): Experimental comparisons of derivative free optimization algorithms, SEA

SLIDE 44

Comparison to BFGS, NEWUOA, PSO and DE

f non-convex, non-separable (rotated) with varying condition number α

[Figure: SP1 vs. condition number, sqrt of sqrt of rotated ellipsoid, dimension 20, 21 trials, tolerance 1e−09, eval max 1e+07; curves for NEWUOA, BFGS, DE2, PSO, CMA-ES.]

BFGS (Broyden et al. 1970), NEWUOA (Powell 2004), DE (Storn & Price 1996), PSO (Kennedy & Eberhart 1995), CMA-ES (Hansen & Ostermeier 2001)

f(x) = g(xᵀHx) with H full
g : x ↦ x^(1/4) (for BFGS and NEWUOA)
g any order-preserving, i.e. strictly increasing, function (for all others)

SP1 = average number of objective function evaluations⁷ to reach the target function value of g⁻¹(10⁻⁹)

⁷ Auger et al. (2009): Experimental comparisons of derivative free optimization algorithms, SEA

SLIDE 45

Comparison during BBOB at GECCO 2009

24 functions and 31 algorithms in 20-D

SLIDE 46

Comparison during BBOB at GECCO 2010

24 functions and 20+ algorithms in 20-D

SLIDE 47

Comparison during BBOB at GECCO 2009

30 noisy functions and 20 algorithms in 20-D

SLIDE 48

Comparison during BBOB at GECCO 2010

30 noisy functions and 10+ algorithms in 20-D

SLIDE 49

Problem Statement
Stochastic search algorithms - basics
Adaptive Evolution Strategies
    Mean Vector Adaptation
    Step-size control
    Covariance Matrix Adaptation
        Rank-One Update
        Cumulation—the Evolution Path
        Rank-µ Update

SLIDE 50

The Continuous Search Problem

Difficulties of a non-linear optimization problem are

• dimensionality and non-separability
      demands to exploit problem structure, e.g. neighborhood
• ill-conditioning
      demands to acquire a second-order model
• ruggedness
      demands a non-local (stochastic?) approach

Approach: population-based stochastic search, coordinate-system independent and with second-order estimations (covariances)

SLIDE 51

Main Features of (CMA) Evolution Strategies

1. Multivariate normal distribution to generate new search points
       follows the maximum entropy principle
2. Rank-based selection
       implies invariance: same performance on g(f(x)) for any increasing g
       more invariance properties are featured
3. Step-size control facilitates fast (log-linear) convergence
       based on an evolution path (a non-local trajectory)
4. Covariance matrix adaptation (CMA) increases the likelihood of previously successful steps and can improve performance by orders of magnitude
       C ∝ H⁻¹ ⟺ adapts a variable metric ⟺ new (rotated) problem representation
       ⟹ f(x) = g(xᵀHx) reduces to g(xᵀx)

SLIDE 52

Limitations of CMA Evolution Strategies

• internal CPU-time: 10⁻⁸ n² seconds per function evaluation on a 2 GHz PC; tweaks are available
      100 000 f-evaluations in 1000-D take about a quarter of an hour of internal CPU-time
• better methods are presumably available in case of
      • partly separable problems
      • specific problems, for example with cheap gradients
            specific methods
      • small dimension (n ≪ 10)
            for example Nelder-Mead
      • small running times (number of f-evaluations ≪ 100n)
            model-based methods