

SLIDE 1

Evaluating the Population Size Adaptation Mechanism for CMA-ES on the BBOB Noiseless Testbed

Kouhei Nishida¹, Youhei Akimoto¹

¹Shinshu University, Japan

SLIDE 2

Introduction: CMA-ES

◮ The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) is a stochastic search algorithm using the multivariate normal distribution. It repeats:

1. Generate candidate solutions $(x_i^{(t)})_{i=1,\dots,\lambda}$ from $\mathcal{N}(m^{(t)}, C^{(t)})$.
2. Evaluate $f(x_i^{(t)})$ and sort them, $f(x_{1:\lambda}) < \cdots < f(x_{\lambda:\lambda})$.
3. Update the distribution parameters $\theta^{(t)} = (m^{(t)}, C^{(t)})$ using the ranking of the candidate solutions.

◮ The CMA-ES has default values for all strategy parameters (such as the population size λ and the learning rate η_c).

◮ A population size larger than the default improves performance in the following scenarios:
1. well-structured multimodal functions
2. noisy functions

◮ Tuning the population size in advance can easily become very expensive.

SLIDE 3

Introduction: Population Size Adaptation

◮ As a measure for the adaptation, we consider the randomness of the parameter update.

◮ To quantify the randomness of the parameter update, we introduce the evolution path in the parameter space.

◮ To keep the randomness of the parameter update at a certain level, the population size is adapted online.

Advantages of adapting the population size online:

◮ It does not require tuning the population size in advance.

◮ On rugged functions, it may accelerate the search by reducing the population size after converging to the basin of a local minimum.

SLIDE 4

Rank-µ update CMA-ES

◮ The rank-µ update CMA-ES, which is a component of the CMA-ES, repeats the following procedure:

1. Generate candidate solutions $(x_i^{(t)})_{i=1,\dots,\lambda}$ from $\mathcal{N}(m^{(t)}, C^{(t)})$.
2. Evaluate $f(x_i^{(t)})$ and sort them, $f(x_{1:\lambda}) < \cdots < f(x_{\lambda:\lambda})$.
3. Update the distribution parameters $\theta^{(t)} = (m^{(t)}, C^{(t)})$ using the ranking of the candidate solutions:

$$\theta^{(t+1)} = \theta^{(t)} + \Delta\theta^{(t)}$$
$$\Delta m^{(t)} = \eta_m \sum_{i=1}^{\lambda} w_i \left( x_{i:\lambda}^{(t)} - m^{(t)} \right)$$
$$\Delta C^{(t)} = \eta_c \sum_{i=1}^{\lambda} w_i \left( \left( x_{i:\lambda}^{(t)} - m^{(t)} \right) \left( x_{i:\lambda}^{(t)} - m^{(t)} \right)^{\mathsf T} - C^{(t)} \right)$$
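A minimal NumPy sketch of one iteration of this loop. The µ = λ/2 truncation, the log-linear recombination weights, and the learning-rate defaults are illustrative assumptions, not the exact settings used in the experiments:

```python
import numpy as np

def rank_mu_update_step(f, m, C, lam, eta_m=1.0, eta_c=0.1, rng=None):
    """One iteration of the rank-mu update CMA-ES (illustrative sketch)."""
    rng = rng or np.random.default_rng()
    # 1. Generate lambda candidate solutions from N(m, C).
    X = rng.multivariate_normal(m, C, size=lam)
    # 2. Evaluate f and sort the candidates by fitness (minimization).
    X = X[np.argsort([f(x) for x in X])]
    # Positive recombination weights on the best mu candidates,
    # normalized to sum to one (a common choice, assumed here).
    mu = lam // 2
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()
    # 3. Update theta = (m, C) from the ranked steps y_i = x_{i:lam} - m.
    Y = X[:mu] - m
    dm = eta_m * (w @ Y)
    dC = eta_c * (np.einsum("i,ij,ik->jk", w, Y, Y) - C)  # sum_i w_i (y_i y_i^T - C)
    return m + dm, C + dC
```

For example, `rank_mu_update_step(lambda x: float(np.sum(x**2)), np.zeros(5), np.eye(5), lam=20)` performs one update step on the 5-D sphere.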

SLIDE 5

Population Size Adaptation: Measurement

To quantify the randomness of the parameter update, we introduce the evolution path in the space Θ of the distribution parameter θ = (m, C):

$$p^{(t+1)} = (1 - \beta)\, p^{(t)} + \sqrt{\beta(2 - \beta)}\, \Delta\theta^{(t)}$$

The evolution path accumulates the successive steps in the parameter space Θ.

Figure: An image of the evolution path. (a) less tendency, (b) strong tendency.
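As a sketch of the accumulation, assuming θ is flattened into a single vector so that Δθ stacks Δm and the entries of ΔC:

```python
import numpy as np

def update_evolution_path(p, dm, dC, beta):
    """Accumulate the parameter-space step into the evolution path p.
    Sketch: theta = (m, C) is flattened so that p lives in one vector."""
    dtheta = np.concatenate([dm, dC.ravel()])
    return (1.0 - beta) * p + np.sqrt(beta * (2.0 - beta)) * dtheta
```

Note that the *length* of p is not the plain Euclidean norm of this vector; it is measured in the metric of Θ, as described on the next slide.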

SLIDE 6

Population Size Adaptation: Measurement

◮ We measure the length of the evolution path based on the KL divergence:

$$\|p\|_{\theta}^{2} = p^{\mathsf T} I(\theta)\, p \approx \mathrm{KL}(\theta \parallel \theta + p)$$

The KL divergence measures the difference between two probability distributions.

◮ We measure the randomness of the parameter update by the ratio between $\|p^{(t+1)}\|_{\theta}^{2}$ and its expected value $\gamma^{(t+1)} \approx \mathbb{E}\big[\|p^{(t+1)}\|_{\theta}^{2}\big]$ under a random function:

$$\gamma^{(t+1)} = (1 - \beta)^{2}\, \gamma^{(t)} + \beta(2 - \beta) \sum_{i=1}^{\lambda} w_i^{2} \left( d\,\eta_m^{2} + \frac{d(d+1)}{2}\,\eta_c^{2} \right)$$

◮ Two important cases:
  ◮ on a random function: $\|p\|_{\theta}^{2} / \gamma \approx 1$
  ◮ for too large λ: $\|p\|_{\theta}^{2} / \gamma \to \infty$
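A sketch of both quantities in code. It assumes the closed-form Fisher metric of the Gaussian family, under which $\|p\|_\theta^2 = p_m^{\mathsf T} C^{-1} p_m + \tfrac{1}{2}\operatorname{tr}\big((C^{-1} P_C)^2\big)$ for a path split into a mean part $p_m$ and a covariance part $P_C$; the split itself is bookkeeping assumed here:

```python
import numpy as np

def fisher_norm_sq(p_m, P_C, C):
    """||p||_theta^2 = p^T I(theta) p for theta = (m, C) of a Gaussian,
    with the path split into its mean part p_m and covariance part P_C."""
    Cinv = np.linalg.inv(C)
    A = Cinv @ P_C
    return float(p_m @ Cinv @ p_m + 0.5 * np.trace(A @ A))

def update_gamma(gamma, w, d, eta_m, eta_c, beta):
    """Recursion for gamma ~ E[||p||_theta^2] under a random function,
    as given on this slide."""
    s = np.sum(w**2) * (d * eta_m**2 + d * (d + 1) / 2 * eta_c**2)
    return (1 - beta) ** 2 * gamma + beta * (2 - beta) * s
```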

SLIDE 7

Population Size Adaptation: Adaptation

◮ If $\|p^{(t+1)}\|_{\theta^{(t)}}^{2} / \gamma^{(t+1)} < \alpha$, regarding the update as inaccurate, the population size is increased with

$$\lambda^{(t+1)} = \lambda^{(t)} \exp\!\left( \beta \left( \alpha - \frac{\|p^{(t+1)}\|_{\theta^{(t)}}^{2}}{\gamma^{(t+1)}} \right) \right) \vee \left( \lambda^{(t)} + 1 \right)$$

◮ If $\|p^{(t+1)}\|_{\theta^{(t)}}^{2} / \gamma^{(t+1)} > \alpha$, regarding the update as sufficiently accurate, the population size is decreased with

$$\lambda^{(t+1)} = \lambda^{(t)} \exp\!\left( \beta \left( \alpha - \frac{\|p^{(t+1)}\|_{\theta^{(t)}}^{2}}{\gamma^{(t+1)}} \right) \right) \vee \lambda_{\min}$$

(∨ denotes the maximum, so an increase grows λ by at least one and a decrease never falls below λ_min.)
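The rule fits in a few lines; this sketch transcribes the two cases above, with the integer rounding and the default λ_min as assumptions:

```python
import numpy as np

def adapt_population_size(lam, path_norm_sq, gamma, alpha, beta, lam_min=4):
    """Online population size adaptation (sketch of the rule above)."""
    ratio = path_norm_sq / gamma
    lam_new = lam * np.exp(beta * (alpha - ratio))
    if ratio < alpha:
        lam_new = max(lam_new, lam + 1)   # inaccurate update: grow by at least 1
    else:
        lam_new = max(lam_new, lam_min)   # accurate update: shrink, floored at lam_min
    return int(round(lam_new))
```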

SLIDE 8

Algorithm Variant

We use the default setting for most of the parameters. The modified parameters are the learning rate for the mean vector, c_m, and the threshold α that decides whether the parameter update is considered accurate or not.

Variant     α      c_m
PSAaLmC     √2     0.1
PSAaLmD     √2     1/D
PSAaSmC     1.1    0.1
PSAaSmD     1.1    1/D

◮ The greater α is, the larger the population size tends to stay.
◮ From our preliminary study, we set c_c = (√2/(D+1)) c_m.
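For concreteness, the four configurations could be resolved with a helper like the following (a hypothetical bit of bookkeeping; the aL/aS and mC/mD naming scheme is read off the table above):

```python
import math

def variant_params(name, D):
    """Resolve a variant name from the table above to its parameters."""
    alpha = math.sqrt(2) if "aL" in name else 1.1   # aL: alpha = sqrt(2); aS: alpha = 1.1
    c_m = 0.1 if name.endswith("mC") else 1.0 / D   # mC: c_m = 0.1; mD: c_m = 1/D
    c_c = math.sqrt(2) / (D + 1) * c_m              # from the preliminary study
    return {"alpha": alpha, "c_m": c_m, "c_c": c_c}

# Example: variant_params("PSAaSmD", D=10) -> {"alpha": 1.1, "c_m": 0.1, ...}
```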

SLIDE 9

Restart Strategy

For each (re-)start of the algorithm, we initialize the mean vector m ∼ U[−4, 4]^D and the covariance matrix C = 2²I. The maximum #f-call is set to 10⁵D.

Termination conditions:

tolf: median(fiqr_hist) < 10⁻¹² · abs(median(fmin_hist))
  ◮ the objective function value differences are too small to sort them without being affected by numerical errors.

tolx: median(xiqr_hist) < 10⁻¹² · min(abs(xmed_hist))
  ◮ the coordinate value differences are too small to update the parameters without being affected by numerical errors.

maxcond: cond(C) > 10¹⁴
  ◮ the matrix operations using C are not reliable due to numerical errors.

maxeval: #f-call ≥ 5 × 10⁴D (for noiseless) or 10⁵D (for noisy)
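A sketch of how the four checks might be evaluated together; how the history buffers fiqr_hist, fmin_hist, xiqr_hist, xmed_hist are filled, and their window length, are bookkeeping assumptions:

```python
import numpy as np

def should_terminate(fiqr_hist, fmin_hist, xiqr_hist, xmed_hist,
                     C, n_evals, D, noisy=False):
    """Evaluate the four termination conditions above (sketch)."""
    tolf = np.median(fiqr_hist) < 1e-12 * abs(np.median(fmin_hist))
    tolx = np.median(xiqr_hist) < 1e-12 * np.min(np.abs(xmed_hist))
    maxcond = np.linalg.cond(C) > 1e14
    maxeval = n_evals >= (1e5 if noisy else 5e4) * D
    return tolf or tolx or maxcond or maxeval
```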

SLIDE 10

BIPOP-CMA-ES

BIPOP restart strategy: a restart strategy with two budgets of function evaluations (see the sketch below).

◮ One budget is for runs with an incrementally increasing population size,
  ◮ to tackle well-structured multimodal functions or noisy functions.

◮ The other budget is for runs with a relatively small population size and a relatively small step-size,
  ◮ to tackle weakly-structured multimodal functions.
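A rough sketch of the budget bookkeeping. The doubling rule, the random small population size, and the step-size factor follow Hansen's published BIPOP description and are assumptions here, not details given on this slide:

```python
import numpy as np

def next_restart(lam_default, lam_large, evals_large, evals_small, rng):
    """Choose the regime with less consumed budget (BIPOP sketch)."""
    if evals_large <= evals_small:
        # First budget: double the population size at every large restart.
        return {"regime": "large", "lam": 2 * lam_large, "sigma_factor": 1.0}
    # Second budget: a small random population and a smaller step-size.
    u = rng.uniform()
    lam_small = int(lam_default * (0.5 * lam_large / lam_default) ** (u ** 2))
    return {"regime": "small", "lam": max(lam_small, lam_default),
            "sigma_factor": 10.0 ** (-2.0 * rng.uniform())}
```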

SLIDE 11

Noiseless: Unimodal Function

[Figure: aRT scaling plots over dimension (2–40), target Δf = 1e-8, for f1 Sphere, f5 Linear slope, f7 Step-ellipsoid, f8 Rosenbrock original. Legend: PSAaLmC, PSAaLmD, PSAaSmC, PSAaSmD, BIPOP-CMA-ES.]

The aRT is higher than the best 2009 portfolio for most of the unimodal functions due to the lack of step-size adaptation. On the Step-ellipsoid function, where step-size adaptation is less important, our algorithm performs well.

SLIDE 12

Noiseless: Well-structured Multimodal Function

[Figure: aRT scaling plots over dimension (2–40), target Δf = 1e-8, for f15 Rastrigin, f17 Schaffer F7 condition 10, f18 Schaffer F7 condition 1000, f19 Griewank-Rosenbrock F8F2.]

The performance of the tested algorithms is similar to that of the BIPOP-CMA-ES, despite the lack of step-size adaptation. In particular, on Griewank-Rosenbrock the tested algorithms are partly better than the best 2009 portfolio.

SLIDE 13

Noiseless: Weakly-structured Multimodal Function

[Figure: ECDFs of runtimes (proportion of function+target pairs vs. log10 of #f-evals / dimension) on f20–f24, in 5-D and 20-D. Legend: PSAaSmD, PSAaSmC, PSAaLmC, PSAaLmD, BIPOP-CMA, best 2009.]

The BIPOP-CMA-ES performs better than the tested algorithms because they do not have a mechanism to tackle weakly-structured multimodal functions.

SLIDE 14

Noiseless: Comparing the variants

[Figure: ECDF of runtimes (proportion of function+target pairs vs. log10 of #f-evals / dimension) on f15–f19, 10-D. Legend: PSAaLmC, PSAaLmD, PSAaSmD, PSAaSmC, BIPOP-CMA, best 2009.]

Variants with α = 1.1 are better than those with α = √2.

SLIDE 15

Noiseless Summary

◮ On well-structured multimodal functions, the tested algorithms perform well without the step-size adaptation.

◮ For lack of the step-size adaptation, the aRT is higher than the best 2009 portfolio for most of the unimodal functions.

◮ When the step-size is less important, the tested algorithms perform well.

◮ Variants with α = 1.1 tend to be better than those with α = √2.

SLIDE 16

Noisy: Unimodal Function

[Figure: aRT scaling plots over dimension (2–40), target Δf = 1e-8, for f101 Sphere moderate Gauss, f102 Sphere moderate unif, f103 Sphere moderate Cauchy, f104 Rosenbrock moderate Gauss, f105 Rosenbrock moderate unif, f106 Rosenbrock moderate Cauchy. Legend: PSAaLmC, PSAaLmD, PSAaSmC, PSAaSmD, BIPOP-CMA-ES.]

On the Sphere functions, the algorithm is slower than the BIPOP-CMA-ES for lack of the step-size adaptation. The failure on the Rosenbrock functions is mainly due to the same reason.

SLIDE 17

Noisy: Unimodal Function

[Figure: aRT scaling plots over dimension (2–40), target Δf = 1e-8, for f113 Step-ellipsoid Gauss, f114 Step-ellipsoid unif, f115 Step-ellipsoid Cauchy.]

On the Step-ellipsoid functions, where the step-size adaptation is less important, the algorithm performs well.

SLIDE 18

Noisy: Well-structured Multimodal Function

[Figure: aRT scaling plots over dimension (2–40), target Δf = 1e-8, for f122 Schaffer F7 Gauss, f123 Schaffer F7 unif, f124 Schaffer F7 Cauchy.]

On the Schaffer functions, the performance of the tested algorithms is similar to the best 2009 portfolio, and partly better than it.

SLIDE 19

Noisy: Comparing the variants

[Figure: aRT scaling plots over dimension (2–40), target Δf = 1e-8, for f113 Step-ellipsoid Gauss, f116 Ellipsoid Gauss, f119 Sum of different powers Gauss.]

The algorithms using c_m = 1/D sometimes get worse in low dimensions because the learning rate is too large.

SLIDE 20

Noisy: Comparing the variants

[Figure: ECDF of runtimes (proportion of function+target pairs vs. log10 of #f-evals / dimension) on f101–f130, 10-D. Legend: PSAaLmD, PSAaLmC, PSAaSmD, PSAaSmC, BIPOP-CMA, best 2009.]

Variants with α = 1.1 are better than those with α = √2.

SLIDE 21

Noisy Summary

◮ On well-structured multimodal functions, the performance of the tested algorithms is similar to the best 2009 portfolio.

◮ For lack of the step-size adaptation, the convergence speed scales worse on the Sphere function, and the aRT is higher than the best 2009 portfolio for most of the unimodal functions.

◮ Variants with α = 1.1 tend to be better than those with α = √2.

◮ c_m = 1/D is too large at low dimension.

SLIDE 22

Conclusion

Summary

◮ On well-structured multimodal functions, the performance of the tested algorithms is similar to the best 2009 portfolio.

◮ For lack of the step-size adaptation, the aRT is higher than the best 2009 portfolio for most of the unimodal functions and the weakly-structured functions.

◮ On noisy functions, c_m = 1/D is too large at low dimension.

Future Work

◮ We plan to incorporate the rank-one update and the step-size adaptation.

SLIDE 23

Using a small learning rate works as averaging the mean vector over successive iterations.

Figure: (a) with a larger learning rate, (b) with a smaller learning rate.
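To see the averaging effect, unroll the mean update from slide 4 with $\sum_i w_i = 1$ and $c_m$ in the role of $\eta_m$, writing $\bar{x}^{(t)} = \sum_i w_i\, x_{i:\lambda}^{(t)}$ for the weighted mean of the selected candidates (a sketch of the argument, not stated on the slide):

$$m^{(t+1)} = (1 - c_m)\, m^{(t)} + c_m\, \bar{x}^{(t)} = c_m \sum_{k=0}^{t} (1 - c_m)^{k}\, \bar{x}^{(t-k)} + (1 - c_m)^{t+1}\, m^{(0)}$$

The smaller c_m is, the more past iterations contribute, so the mean vector behaves like an exponentially weighted average that smooths out selection noise.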
