
SLIDE 1

Information Geometric Optimization

How information theory sheds new light on black-box optimization

Anne Auger, Inria and CMAP

SLIDE 2

Main reference: Y. Ollivier, L. Arnold, A. Auger, N. Hansen, Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles, JMLR (accepted)

SLIDE 3

Black-Box Optimization

Optimize f : Ω → ℝ

discrete optimization: Ω = {0, 1}^n
continuous optimization: Ω ⊂ ℝ^n

SLIDE 4

Black-Box Optimization

[diagram: x ∈ Ω → black box → f(x) ∈ ℝ]

Optimize f : Ω → ℝ

discrete optimization: Ω = {0, 1}^n
continuous optimization: Ω ⊂ ℝ^n

gradients not available or not useful

SLIDE 5

Adaptive Stochastic Black-Box Algorithm

Sample candidate solutions:
X^i_{t+1} = Sol(θ_t, U^i_{t+1}),  i = 1, …, λ

Evaluate the solutions on f and update the state of the algorithm:
θ_{t+1} = F(θ_t, (X^1_{t+1}, f(X^1_{t+1})), …, (X^λ_{t+1}, f(X^λ_{t+1})))

θ_t: state of the algorithm
{U_{t+1}, t ∈ ℕ} i.i.d.

SLIDE 6

Comparison-based Stochastic Algorithms

Sample candidate solutions:
X^i_{t+1} = Sol(θ_t, U^i_{t+1}),  i = 1, …, λ

Evaluate and rank the solutions:
f(X^{S(1)}_{t+1}) ≤ … ≤ f(X^{S(λ)}_{t+1}),  S the permutation giving the indices of the ordered solutions

Update the state of the algorithm:
θ_{t+1} = F(θ_t, U^{S(1)}_{t+1}, …, U^{S(λ)}_{t+1})

Invariance to strictly increasing transformations of f
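To make the template concrete, here is a minimal Python sketch (the helper names `sample` and `update`, the vector-valued state, and the Gaussian choice for U are illustrative assumptions, not part of the slides). Because the update sees only the ordering S of the candidates, never the f-values themselves, replacing f by g∘f for any strictly increasing g leaves the trajectory unchanged.

```python
import numpy as np

def comparison_based_step(theta, f, sample, update, lam, rng):
    """One iteration of a comparison-based stochastic algorithm.

    sample(theta, u) plays the role of Sol(theta_t, U^i_{t+1});
    update(theta, sorted_U) plays the role of F(theta_t, U^{S(1)}, ..., U^{S(lam)}).
    """
    U = [rng.standard_normal(len(theta)) for _ in range(lam)]  # U^i_{t+1} i.i.d.
    X = [sample(theta, u) for u in U]                          # candidate solutions
    S = np.argsort([f(x) for x in X])                          # f(X^{S(1)}) <= ... <= f(X^{S(lam)})
    return update(theta, [U[i] for i in S])                    # update uses only the ranking
```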

SLIDE 7

Overview

➊ Black-Box Optimization: typical difficulties
➋ Information Geometric Optimization
➌ Invariance
➍ Recovering well-known algorithms: CMA-ES; PBIL, cGA

SLIDE 8

Information Geometric Optimization: Setting

Family of probability distributions (P_θ)_{θ∈Θ} on Ω, with continuous multicomponent parameter θ ∈ Θ

SLIDE 9

Information Geometric Optimization: Setting

Family of probability distributions (P_θ)_{θ∈Θ} on Ω, with continuous multicomponent parameter θ ∈ Θ

Example: Ω = ℝ^n, P_θ multivariate normal distribution with θ = (m, C)

Θ: statistical manifold

SLIDE 10

Changing Viewpoint I

Transform the original optimization problem min_{x∈Ω} f(x) on Ω
into an optimization problem on Θ: minimize

F(θ) = ∫ f(x) P_θ(dx)

SLIDE 11

Changing Viewpoint I

Transform the original optimization problem min_{x∈Ω} f(x) on Ω
into an optimization problem on Θ: minimize

F(θ) = ∫ f(x) P_θ(dx)

Minimizing F: find the Dirac-delta distribution concentrated on argmin_x f(x)

[Wierstra et al. 2014]

SLIDE 12

Changing Viewpoint I

Transform the original optimization problem min_{x∈Ω} f(x) on Ω
into an optimization problem on Θ: minimize

F(θ) = ∫ f(x) P_θ(dx)

Minimizing F: find the Dirac-delta distribution concentrated on argmin_x f(x)

But F is not invariant to strictly increasing transformations of f
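A quick numerical illustration of the non-invariance (a hedged numpy sketch; the sphere function and the cube map are illustrative choices): estimating F(θ) = ∫ f(x) P_θ(dx) by Monte Carlo, the value changes under the strictly increasing transformation g(u) = u³, even though the ranking of the samples does not.

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma = np.zeros(2), 1.0                        # P_theta = N(m, sigma^2 I)
X = m + sigma * rng.standard_normal((100_000, 2))  # samples from P_theta

fX = np.sum(X**2, axis=1)                          # f = sphere function
print(fX.mean())       # Monte Carlo estimate of F(theta) for f (~2.0)
print((fX**3).mean())  # estimate of ∫ g(f(x)) dP, g(u) = u^3: a different
                       # function of theta (~48.0), although ranks are unchanged
```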

SLIDE 13

Changing Viewpoint II

Transform the original optimization problem min_{x∈Ω} f(x) on Ω
into an optimization problem on Θ, invariant under strictly increasing transformations of f: maximize

J_{θ_t}(θ) = ∫ W^f_{θ_t}(x) P_θ(dx),  where  W^f_{θ_t}(x) = w(P_{θ_t}[y : f(y) ≤ f(x)])

with w : [0, 1] → ℝ a decreasing weight function

Rationale: f(x) “small” ↔ W^f_{θ_t}(x) “large”

[Ollivier et al.]
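A small numpy sketch of this quantile rewriting (the truncation weight w(q) = 1 for q ≤ 1/4, 0 otherwise, is an assumed choice of decreasing weight function): W^f_{θt}(x) is estimated per sample from ranks alone, so it is identical for f and for g∘f with g strictly increasing.

```python
import numpy as np

rng = np.random.default_rng(1)
w = lambda q: (q <= 0.25).astype(float)    # assumed decreasing weight w : [0,1] -> R

X = rng.standard_normal((1000, 2))         # samples from P_{theta_t}
fX = np.sum(X**2, axis=1)                  # f = sphere function (illustrative)

q = (fX[:, None] >= fX[None, :]).mean(axis=1)   # estimates P_{theta_t}[y : f(y) <= f(x)]
W = w(q)                                        # preference W^f_{theta_t} at each sample

# g(u) = u^3 is strictly increasing, so the quantiles, hence W, are unchanged:
q3 = ((fX**3)[:, None] >= (fX**3)[None, :]).mean(axis=1)
assert np.array_equal(W, w(q3))
```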

SLIDE 14

Information Geometric Optimization

Maximize J_{θ_t}(θ) = ∫ W^f_{θ_t}(x) P_θ(dx) by performing a natural gradient step on Θ:

θ_{t+δt} = θ_t + δt ∇̃_θ J_{θ_t}(θ)|_{θ=θ_t}


SLIDE 16

Natural Gradient

Fisher Information Metric

Natural gradient ∇̃_θ: the gradient wrt the Fisher metric, defined via the Fisher matrix:

I_{ij}(θ) = ∫ (∂ ln P_θ(x)/∂θ_i) (∂ ln P_θ(x)/∂θ_j) P_θ(dx) = − ∫ (∂² ln P_θ(x)/∂θ_i ∂θ_j) P_θ(dx)

∇̃_θ = I(θ)⁻¹ ∂/∂θ
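A hedged numpy sketch for a concrete case, the 1-D Gaussian with θ = (m, σ), whose closed-form Fisher matrix is diag(1/σ², 2/σ²): the matrix is estimated as the second moment of the score, and a natural-gradient direction is obtained by solving I(θ) ∇̃ = ∇. The example gradient vector is an arbitrary illustrative value.

```python
import numpy as np

rng = np.random.default_rng(2)
m, sigma = 0.5, 2.0
x = m + sigma * rng.standard_normal(1_000_000)   # samples from P_theta = N(m, sigma^2)

# Score: partial derivatives of ln P_theta(x) wrt theta = (m, sigma)
score = np.stack([(x - m) / sigma**2,
                  ((x - m)**2 - sigma**2) / sigma**3])

I = score @ score.T / x.size                     # Monte Carlo Fisher matrix
print(I)                                         # ~ diag(1/sigma^2, 2/sigma^2) = diag(0.25, 0.5)

grad = np.array([0.3, -0.1])                     # some vanilla gradient (illustrative values)
print(np.linalg.solve(I, grad))                  # natural gradient: I^{-1} grad
```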

SLIDE 17

Fisher Information Metric

Kullback–Leibler divergence: a measure of “distance” between distributions:

KL(P_{θ′} ‖ P_θ) = ∫ ln( P_{θ′}(dx) / P_θ(dx) ) P_{θ′}(dx)

Relation between the KL divergence and the Fisher matrix: the metric is equivalently defined via the second-order expansion of KL:

KL(P_{θ+δθ} ‖ P_θ) = (1/2) Σ_{i,j} I_{ij}(θ) δθ_i δθ_j + O(‖δθ‖³)
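A quick numeric check of this expansion in the simplest family, Bernoulli(p), whose Fisher information has the standard closed form I(p) = 1/(p(1−p)):

```python
import numpy as np

def kl_bernoulli(p1, p0):
    """KL(Ber(p1) || Ber(p0))."""
    return p1 * np.log(p1 / p0) + (1 - p1) * np.log((1 - p1) / (1 - p0))

p, dp = 0.3, 1e-3
fisher = 1.0 / (p * (1 - p))              # I(p) for the Bernoulli family
print(kl_bernoulli(p + dp, p))            # exact KL
print(0.5 * fisher * dp**2)               # (1/2) I(p) dp^2: agrees up to O(dp^3)
```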

SLIDE 18

Natural Gradient

Natural gradient ∇̃_θ: the gradient wrt the Fisher metric, defined via the Fisher matrix:

I_{ij}(θ) = ∫ (∂ ln P_θ(x)/∂θ_i) (∂ ln P_θ(x)/∂θ_j) P_θ(dx) = − ∫ (∂² ln P_θ(x)/∂θ_i ∂θ_j) P_θ(dx)

Intrinsic: independent of the chosen parametrization θ of P_θ.
The Fisher metric is essentially the only way to obtain this property [Amari, Nagaoka 1993].

SLIDE 19

Information Geometric Optimization

Maximize J_{θ_t}(θ) by performing a natural gradient step on Θ:

θ_{t+δt} = θ_t + δt ∇̃_θ ∫ W^f_{θ_t}(x) P_θ(dx)
         = θ_t + δt ∫ W^f_{θ_t}(x) ∇̃_θ ln P_θ(x)|_{θ=θ_t} P_{θ_t}(dx)
         = θ_t + δt ∫ w(P_{θ_t}[y : f(y) ≤ f(x)]) ∇̃_θ ln P_θ(x)|_{θ=θ_t} P_{θ_t}(dx)

The update does not depend on ∇f.


SLIDE 21

Information Geometric Optimization

Maximize J_{θ_t}(θ) by performing a natural gradient step on Θ:

θ_{t+δt} = θ_t + δt ∇̃_θ ∫ W^f_{θ_t}(x) P_θ(dx)
         = θ_t + δt ∫ W^f_{θ_t}(x) (∇̃_θ P_θ(x)|_{θ=θ_t} / P_{θ_t}(x)) P_{θ_t}(x) dx
         = θ_t + δt ∫ W^f_{θ_t}(x) ∇̃_θ ln P_θ(x)|_{θ=θ_t} P_{θ_t}(dx)
         = θ_t + δt ∫ w(P_{θ_t}[y : f(y) ≤ f(x)]) ∇̃_θ ln P_θ(x)|_{θ=θ_t} P_{θ_t}(dx)

The update does not depend on ∇f.

SLIDE 22

Information Geometric Optimization

Maximize J_{θ_t}(θ) by performing a natural gradient step on Θ:

θ_{t+δt} = θ_t + δt ∇̃_θ ∫ W^f_{θ_t}(x) P_θ(dx)
         = θ_t + δt ∫ W^f_{θ_t}(x) (∇̃_θ P_θ(x)|_{θ=θ_t} / P_{θ_t}(x)) P_{θ_t}(x) dx
         = θ_t + δt ∫ W^f_{θ_t}(x) ∇̃_θ ln P_θ(x)|_{θ=θ_t} P_{θ_t}(dx)
         = θ_t + δt ∫ w(P_{θ_t}[y : f(y) ≤ f(x)]) ∇̃_θ ln P_θ(x)|_{θ=θ_t} P_{θ_t}(dx)

The update does not depend on ∇f.

➊ IGO flow: the continuous-time limit δt → 0
➋ IGO algorithms: discretization of the integrals
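The equality between the first and third lines is the log-likelihood trick, ∇_θ ∫ W dP_θ = ∫ W ∇_θ ln P_θ dP_θ (the natural gradient only multiplies both sides by I⁻¹). A numerical sanity check for the mean of a 1-D unit-variance Gaussian, with an arbitrary illustrative integrand standing in for W^f_{θt}:

```python
import numpy as np

rng = np.random.default_rng(3)
W = lambda x: np.exp(-x**2)               # stand-in for W^f_{theta_t} (illustrative)
u = rng.standard_normal(1_000_000)        # common random numbers; x = m + u, P_theta = N(m, 1)

def J(m):                                 # Monte Carlo estimate of ∫ W(x) P_theta(dx)
    return W(m + u).mean()

m, eps = 0.4, 1e-4
print((J(m + eps) - J(m - eps)) / (2 * eps))   # finite-difference d/dm of the integral
print((W(m + u) * u).mean())                   # ∫ W(x) (d ln P/dm) dP, since d ln P/dm = x - m
```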

SLIDE 23

IGO Gradient Flow

Information Geometric Optimization: the set of continuous-time trajectories in Θ-space defined by the ODE

dθ_t/dt = ∫ W^f_{θ_t}(x) ∇̃_θ ln P_θ(x)|_{θ=θ_t} P_{θ_t}(dx)

[Ollivier et al.]

SLIDE 24

Information Geometric Optimization Algorithm

Monte Carlo approximation of the integrals: sample X_i ∼ P_{θ_t}, i = 1, …, N.

w(P_{θ_t}[y : f(y) ≤ f(x)]) ≈ w((rk(X_i) + 1/2) / N),  where rk(X_i) = #{j | f(X_j) < f(X_i)}

IGO algorithm:

θ_{t+δt} = θ_t + δt (1/N) Σ_{i=1}^N w((rk(X_i) + 1/2)/N) ∇̃_θ ln P_θ(X_i)|_{θ=θ_t}
         = θ_t + δt Σ_{i=1}^N ŵ_i ∇̃_θ ln P_θ(X_i)|_{θ=θ_t}

SLIDE 25

IGO Algorithm

Monte Carlo approximation of the integrals: sample X_i ∼ P_{θ_t}, i = 1, …, N.

w(P_{θ_t}[y : f(y) ≤ f(x)]) ≈ w((rk(X_i) + 1/2) / N)

θ_{t+δt} = θ_t + δt (1/N) Σ_{i=1}^N w((rk(X_i) + 1/2)/N) ∇̃_θ ln P_θ(X_i)|_{θ=θ_t}
         = θ_t + δt Σ_{i=1}^N ŵ_i ∇̃_θ ln P_θ(X_i)|_{θ=θ_t},  with ŵ_i = (1/N) w((rk(X_i) + 1/2)/N)

a consistent estimator of the integral  [Ollivier et al.]
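In code, one IGO iteration needs only the samples, their ranks, and the natural gradient of the log-likelihood. A generic hedged sketch (the callables `nat_grad_log_p`, returning ∇̃_θ ln P_θ(x) for the chosen family, and `w` are supplied by the user; both names are mine, not the paper's):

```python
import numpy as np

def igo_weights(fvals, w):
    """w_hat_i = (1/N) w((rk(X_i) + 1/2) / N), rk(X_i) = #{j : f(X_j) < f(X_i)}."""
    fvals = np.asarray(fvals)
    N = len(fvals)
    rk = (fvals[:, None] > fvals[None, :]).sum(axis=1)   # strict-inequality ranks
    return w((rk + 0.5) / N) / N

def igo_step(theta, X, fvals, nat_grad_log_p, w, dt):
    """theta_{t+dt} = theta_t + dt * sum_i w_hat_i * nat-grad of ln P_theta(X_i)."""
    w_hat = igo_weights(fvals, w)
    return theta + dt * sum(wi * nat_grad_log_p(theta, x) for wi, x in zip(w_hat, X))
```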

SLIDE 26

Instantiation of IGO: Multivariate Normal Distributions

P_θ multivariate normal distribution, θ = (m, C). The IGO algorithm reads:

m_{t+δt} = m_t + δt Σ_{i=1}^N ŵ_i (X_i − m_t)

C_{t+δt} = C_t + δt Σ_{i=1}^N ŵ_i ((X_i − m_t)(X_i − m_t)^T − C_t)

Recovers the CMA-ES with rank-mu update [Akimoto et al. 2010],
with N = λ, δt the learning rate for the covariance matrix, and an additional learning rate for the mean.
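A self-contained toy run of this Gaussian instantiation on the sphere function (a hedged sketch: the weight function, learning rates, population size, and iteration count are illustrative choices; the full CMA-ES additionally uses step-size adaptation and cumulation):

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sum(x**2)                       # sphere function (illustrative test problem)
n, N = 5, 50                                     # dimension, population size (N = lambda)
dt_m, dt_c = 1.0, 0.1                            # larger learning rate for the mean, per the slide
w = lambda q: np.maximum(0.0, 1.0 - 2.0 * q)     # assumed decreasing weight on [0, 1]

m, C = rng.standard_normal(n), np.eye(n)         # theta = (m, C)
for _ in range(500):
    X = rng.multivariate_normal(m, C, size=N)    # X_i ~ N(m_t, C_t)
    fv = np.array([f(x) for x in X])
    rk = (fv[:, None] > fv[None, :]).sum(axis=1) # rk(X_i) = #{j : f(X_j) < f(X_i)}
    w_hat = w((rk + 0.5) / N) / N
    Y = X - m
    m = m + dt_m * (w_hat @ Y)                              # mean update
    C = C + dt_c * (Y.T * w_hat) @ Y - dt_c * w_hat.sum() * C   # rank-mu covariance update
print(f(m))                                      # should be far below its starting value
```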

SLIDE 27

Instantiation of IGO: Bernoulli Measures

Ω = {0, 1}^d,  P_θ(x) = p_{θ_1}(x_1) ⋯ p_{θ_d}(x_d): family of Bernoulli measures

Recovers PBIL (Population-Based Incremental Learning) [Baluja, Caruana 1995]
and the cGA (compact Genetic Algorithm) [Harik et al. 1999]
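For this family the Fisher matrix is diagonal with entries 1/(p_j(1−p_j)) and exactly cancels the score's denominator, so the natural gradient of ln P_θ at x is simply x − p componentwise, and the IGO step becomes a PBIL/cGA-style probability update. A toy sketch on OneMax (the weight function, rates, iteration count, and boundary clipping are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
d, N, dt = 20, 50, 0.2
f = lambda x: -x.sum()                           # OneMax, written as minimization
w = lambda q: (q <= 0.25).astype(float)          # assumed selection weight

p = np.full(d, 0.5)                              # theta = (p_1, ..., p_d)
for _ in range(200):
    X = (rng.random((N, d)) < p).astype(float)   # X_i ~ Ber(p_1) x ... x Ber(p_d)
    fv = np.array([f(x) for x in X])
    rk = (fv[:, None] > fv[None, :]).sum(axis=1) # strict-inequality ranks
    w_hat = w((rk + 0.5) / N) / N
    p += dt * (w_hat @ (X - p))                  # natural gradient of ln P_theta(x) is x - p
    p = np.clip(p, 0.01, 0.99)                   # keep away from the degenerate boundary
print(p.round(2))                                # probabilities should have drifted toward 1
```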

SLIDE 28

Conclusions

Information Geometric Optimization framework:
• a unified picture of discrete and continuous optimization
• theoretical foundations for existing algorithms
• some parts of the CMA-ES algorithm are not explained by the IGO framework

New algorithms: large-scale variant of CMA-ES based on IGO, …

CMA-ES is state of the art in continuous black-box optimization; its step-size adaptation and cumulation mechanisms are among the parts not derived from IGO.

SLIDE 29

References

[Ollivier et al.] Y. Ollivier, L. Arnold, A. Auger, N. Hansen. Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles. JMLR (accepted).
[Akimoto et al. 2010] Y. Akimoto, Y. Nagata, I. Ono, S. Kobayashi. Bidirectional relation between CMA evolution strategies and natural evolution strategies. PPSN 2010.
[Hansen et al. 2003] N. Hansen, S.D. Müller, P. Koumoutsakos. Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). ECJ 2003.
[Amari, Nagaoka 1993] S. Amari, H. Nagaoka. Methods of Information Geometry. 1993.
[Baluja, Caruana 1995] S. Baluja, R. Caruana. Removing the genetics from the standard genetic algorithm. ICML 1995.
[Harik et al. 1999] G.R. Harik, F.G. Lobo, D.E. Goldberg. The compact genetic algorithm. IEEE Transactions on Evolutionary Computation, 1999.
[Wierstra et al. 2014] D. Wierstra, T. Schaul, T. Glasmachers, Y. Sun, J. Peters, J. Schmidhuber. Natural evolution strategies. JMLR 2014.