Information Geometric Optimization: How information theory sheds new light on black-box optimization
Anne Auger, Inria and CMAP
Main reference: Y. Ollivier, L. Arnold, A. Auger, N. Hansen. Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles. JMLR (accepted)
Black-Box Optimization

Optimize f : Ω → R in a black-box scenario: the algorithm queries points x ∈ Ω and observes only the values f(x) ∈ R.
- discrete optimization: Ω = {0, 1}^n
- continuous optimization: Ω ⊂ R^n
Gradients are not available or not useful.
Adaptive Stochastic Black-Box Algorithm

Sample candidate solutions:
    X^i_{t+1} = Sol(θ_t, U^i_{t+1}),   i = 1, …, λ
Evaluate solutions: f(X^i_{t+1})
Update the state of the algorithm:
    θ_{t+1} = F(θ_t, (X^1_{t+1}, f(X^1_{t+1})), …, (X^λ_{t+1}, f(X^λ_{t+1})))
θ_t: state of the algorithm;  {U_{t+1}, t ∈ N} i.i.d.

Comparison-based Stochastic Algorithms

Sample candidate solutions:
    X^i_{t+1} = Sol(θ_t, U^i_{t+1}),   i = 1, …, λ
Evaluate and rank solutions:
    f(X^{S(1)}_{t+1}) ≤ … ≤ f(X^{S(λ)}_{t+1})
with S the permutation giving the indices of the ordered solutions.
Update the state of the algorithm from the ranking only:
    θ_{t+1} = F(θ_t, U^{S(1)}_{t+1}, …, U^{S(λ)}_{t+1})
Consequence: invariance to strictly increasing transformations of f.
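The comparison-based loop above can be sketched in a few lines. This is a minimal illustration, not a specific published algorithm: the state θ = (m, σ) and the update F (move the mean toward the better half of the samples) are assumptions for the example. With a fixed random seed, the trajectory is identical for f and for any strictly increasing transform g ∘ f, since only the ranking enters the update.

```python
import numpy as np

def comparison_based_step(theta, f, lam=10, rng=None):
    """One iteration t -> t+1 of a generic comparison-based stochastic search.

    Illustrative state theta = (m, sigma); Sol(theta, U) draws candidates as
    m + sigma * U. The update F here (move the mean toward the best half of
    the samples) is a placeholder, not a specific published method."""
    rng = np.random.default_rng() if rng is None else rng
    m, sigma = theta
    U = rng.standard_normal((lam, m.size))     # {U^i_{t+1}} i.i.d.
    X = m + sigma * U                          # X^i_{t+1} = Sol(theta_t, U^i_{t+1})
    S = np.argsort([f(x) for x in X])          # S: f(X^{S(1)}) <= ... <= f(X^{S(lam)})
    # The update uses only the ranked samples U^{S(i)}, never the f-values:
    mu = lam // 2
    m_new = m + sigma * U[S[:mu]].mean(axis=0)
    return (m_new, sigma)
```

Running this with the same seed on f and on exp ∘ f produces the same sequence of states, which is exactly the invariance property stated on the slide.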
Overview

➊ Black-Box Optimization: typical difficulties
➋ Information Geometric Optimization
➌ Invariance
➍ Recovering well-known algorithms: CMA-ES; PBIL, cGA
Information Geometric Optimization

Setting: a family of probability distributions (P_θ)_{θ∈Θ} on Ω, with a continuous multicomponent parameter θ ∈ Θ.
Example: Ω = R^n, P_θ multivariate normal distribution, θ = (m, C).
Θ: statistical manifold.
Changing Viewpoint I

Transform the original optimization problem min_{x∈Ω} f(x) on Ω into an optimization problem on Θ: minimize
    F(θ) = ∫ f(x) P_θ(dx)
Minimizing F ⇔ finding a Dirac-delta distribution concentrated on argmin_x f(x).  [Wierstra et al. 2014]
But F is not invariant to strictly increasing transformations of f.
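The non-invariance can be checked in closed form. This small example (one-dimensional Gaussian P_θ = N(m, σ²) and f(x) = x², chosen purely for illustration) shows that a strictly increasing transform of f can reverse which of two distributions has the smaller expected value, so the landscape of F genuinely changes:

```python
def F_f(m, sigma):
    """F(theta) = E[f(X)] for f(x) = x^2 and X ~ N(m, sigma^2):
    the Gaussian moment E[X^2] = m^2 + sigma^2."""
    return m**2 + sigma**2

def F_gf(m, sigma):
    """Same criterion after the transform g(y) = y^2, which is strictly
    increasing on the range [0, inf) of f:
    E[(X^2)^2] = E[X^4] = m^4 + 6 m^2 sigma^2 + 3 sigma^4."""
    return m**4 + 6 * m**2 * sigma**2 + 3 * sigma**4

theta1 = (1.0, 0.1)   # nearly concentrated, slightly off the optimum x* = 0
theta2 = (0.0, 1.0)   # centered on the optimum, but spread out
```

Under f, θ2 has the smaller expected value; under g ∘ f, θ1 does. Rank-based criteria like the one on the next slide cannot exhibit this reversal.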
Changing Viewpoint II

Transform the original optimization problem min_{x∈Ω} f(x) on Ω into an optimization problem on Θ: maximize
    J_{θ_t}(θ) = ∫ W^f_{θ_t}(x) P_θ(dx),   W^f_{θ_t}(x) = w(P_{θ_t}[y : f(y) ≤ f(x)])
with w : [0, 1] → R a decreasing weight function.
Rationale: f(x) "small" ↔ W^f_{θ_t}(x) "large".
Invariant under strictly increasing transformations of f.  [Ollivier et al.]
Information Geometric Optimization

Maximizing J_{θ_t}(θ): perform a natural gradient step on Θ:
    θ_{t+δt} = θ_t + δt ∇̃_θ ∫ W^f_{θ_t}(x) P_θ(dx)
Natural Gradient

The natural gradient ∇̃_θ is the gradient with respect to the Fisher metric, defined via the Fisher information matrix
    I_{ij}(θ) = ∫ (∂ ln P_θ(x)/∂θ_i)(∂ ln P_θ(x)/∂θ_j) P_θ(dx) = −∫ (∂² ln P_θ(x)/∂θ_i ∂θ_j) P_θ(dx)
    ∇̃ = I^{−1} ∂/∂θ
It is intrinsic: independent of the chosen parametrization θ of P_θ; the Fisher metric is essentially the only way to obtain this property [Amari, Nagaoka 1993].

Fisher Information Metric

The Kullback–Leibler divergence, a measure of "distance" between distributions:
    KL(P_{θ′} || P_θ) = ∫ ln(P_{θ′}(dx)/P_θ(dx)) P_θ(dx)
It is related to the Fisher matrix through the second-order expansion
    KL(P_{θ+δθ} || P_θ) = (1/2) Σ_{ij} I_{ij}(θ) δθ_i δθ_j + O(δθ³)
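The two characterizations can be checked against each other numerically. A sketch for the one-dimensional Gaussian N(m, σ²) in the (m, σ) parametrization, whose Fisher matrix has the standard closed form diag(1/σ², 2/σ²); both formulas below are textbook Gaussian identities, the particular numbers are arbitrary:

```python
import numpy as np

def fisher_gaussian_1d(m, sigma):
    """Fisher information matrix of N(m, sigma^2) in the (m, sigma)
    parametrization: the closed form diag(1/sigma^2, 2/sigma^2)."""
    return np.diag([1.0 / sigma**2, 2.0 / sigma**2])

def kl_gaussian_1d(m1, s1, m0, s0):
    """KL(N(m1, s1^2) || N(m0, s0^2)), closed form."""
    return np.log(s0 / s1) + (s1**2 + (m1 - m0)**2) / (2 * s0**2) - 0.5

# Check KL(P_{theta+dtheta} || P_theta) = 1/2 dtheta^T I(theta) dtheta + O(dtheta^3)
m, sigma = 0.3, 1.7
dtheta = np.array([1e-3, -2e-3])
kl = kl_gaussian_1d(m + dtheta[0], sigma + dtheta[1], m, sigma)
quad = 0.5 * dtheta @ fisher_gaussian_1d(m, sigma) @ dtheta
```

For small δθ the exact KL and the quadratic form agree up to the O(δθ³) remainder.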
Information Geometric Optimization

Maximizing J_{θ_t}(θ): perform a natural gradient step on Θ (note that the update does not depend on ∇f):
    θ_{t+δt} = θ_t + δt ∇̃_θ ∫ W^f_{θ_t}(x) P_θ(dx)
             = θ_t + δt ∫ W^f_{θ_t}(x) (∇̃_θ P_θ(x)|_{θ=θ_t} / P_{θ_t}(x)) P_{θ_t}(x) dx
             = θ_t + δt ∫ W^f_{θ_t}(x) ∇̃_θ ln P_θ(x)|_{θ=θ_t} P_{θ_t}(dx)
             = θ_t + δt ∫ w(P_{θ_t}[y : f(y) ≤ f(x)]) ∇̃_θ ln P_θ(x)|_{θ=θ_t} P_{θ_t}(dx)
➊ δt → 0: IGO flow
➋ IGO algorithms: discretization of the integrals
IGO Gradient Flow

The set of continuous-time trajectories in the space Θ defined by the ODE
    dθ_t/dt = ∫ W^f_{θ_t}(x) ∇̃_θ ln P_θ(x)|_{θ=θ_t} P_{θ_t}(dx)
[Ollivier et al.]
IGO Algorithm

Monte Carlo approximation of the integrals:
- Sample X_i ∼ P_{θ_t}, i = 1, …, N.
- Approximate the quantile weight, with rk(X_i) = #{j | f(X_j) < f(X_i)}:
    w(P_{θ_t}[y : f(y) ≤ f(X_i)]) ≈ w((rk(X_i) + 1/2)/N)
- Update:
    θ_{t+δt} = θ_t + δt (1/N) Σ_{i=1}^N w((rk(X_i) + 1/2)/N) ∇̃_θ ln P_θ(X_i)|_{θ=θ_t}
             = θ_t + δt Σ_{i=1}^N ŵ_i ∇̃_θ ln P_θ(X_i)|_{θ=θ_t},   ŵ_i = (1/N) w((rk(X_i) + 1/2)/N)
This is a consistent estimator of the integral.  [Ollivier et al.]
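The rank-based weights ŵ_i are straightforward to compute. A sketch with an illustrative decreasing w (a step function selecting the better half); this particular w is an assumption for the example, not a choice prescribed by IGO. Because the weights depend on f only through ranks, they are unchanged under strictly increasing transformations of f:

```python
import numpy as np

def igo_weights(fvals, w=lambda q: 2.0 * (q <= 0.5)):
    """Monte Carlo IGO weights w_hat_i = (1/N) w((rk(X_i) + 1/2) / N),
    with rk(X_i) = #{j : f(X_j) < f(X_i)} (strict <, as in the definition).
    The default w, a decreasing step function keeping the better half,
    is an illustrative assumption."""
    fvals = np.asarray(fvals, dtype=float)
    N = fvals.size
    rk = np.array([np.sum(fvals < fi) for fi in fvals])
    return np.array([w((r + 0.5) / N) for r in rk]) / N
```

For N = 4 values [3, 1, 2, 4], the ranks are [2, 0, 1, 3] and the two best samples receive weight 1/2 each.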
Instantiation of IGO: Multivariate Normal Distributions

P_θ multivariate normal distribution, θ = (m, C):
    m_{t+δt} = m_t + δt Σ_{i=1}^N ŵ_i (X_i − m_t)
    C_{t+δt} = C_t + δt Σ_{i=1}^N ŵ_i ((X_i − m_t)(X_i − m_t)^T − C_t)
Recovers the CMA-ES algorithm with rank-mu update [Akimoto et al. 2010], with N = λ, δt the learning rate for the covariance matrix, and an additional learning rate for the mean.
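These two updates can be run directly. A minimal numpy sketch of one IGO step for θ = (m, C), i.e. the rank-mu update above without step-size adaptation or cumulation; the weight function, the sample size N, and the two separate learning rates dt_m and dt_c are illustrative assumptions, not values from the slide:

```python
import numpy as np

def igo_gaussian_step(m, C, f, N=20, dt_m=1.0, dt_c=0.1, rng=None):
    """One IGO step for P_theta = N(m, C): the rank-mu CMA-ES update,
    without step-size adaptation or cumulation. The step-function weight
    (keep the best quarter) and learning rates are illustrative choices."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.multivariate_normal(m, C, size=N)            # X_i ~ P_{theta_t}
    fvals = np.array([f(x) for x in X])
    rk = np.array([np.sum(fvals < fi) for fi in fvals])  # rk(X_i) = #{j : f(X_j) < f(X_i)}
    w_hat = np.array([(r + 0.5) / N <= 0.25 for r in rk], dtype=float) / N
    Y = X - m
    m_new = m + dt_m * (w_hat @ Y)                       # sum_i w_hat_i (X_i - m)
    # sum_i w_hat_i ((X_i - m)(X_i - m)^T - C):
    C_new = C + dt_c * np.einsum('i,ij,ik->jk', w_hat, Y, Y) - dt_c * w_hat.sum() * C
    return m_new, C_new
```

Since dt_c · Σ ŵ_i < 1 here, C stays a convex combination of a positive definite matrix and a positive semidefinite one, so it remains positive definite.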
Instantiation of IGO: Bernoulli Measures

Ω = {0, 1}^d, P_θ(x) = p_{θ_1}(x_1) ⋯ p_{θ_d}(x_d): family of Bernoulli measures.
Recovers PBIL (Population-Based Incremental Learning) [Baluja, Caruana 1995] and cGA (compact Genetic Algorithm) [Harik et al. 1999].
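The Bernoulli instantiation is equally short. For a Bernoulli factor with mean parameter p, ∂ ln P/∂p = (x − p)/(p(1 − p)) and the Fisher information is 1/(p(1 − p)), so the natural gradient of ln P_θ(x) is simply x − p, giving a PBIL-style update of the probability vector. In this sketch the weight function, N, dt, and the probability clamp are illustrative assumptions:

```python
import numpy as np

def igo_bernoulli_step(p, f, N=50, dt=0.2, rng=None):
    """One IGO step for independent Bernoulli measures on {0,1}^d with mean
    parameters p. Natural gradient of ln P_theta(x) in this parametrization
    is x - p, so the update is p <- p + dt * sum_i w_hat_i (X_i - p)."""
    rng = np.random.default_rng() if rng is None else rng
    X = (rng.random((N, p.size)) < p).astype(float)      # X_i ~ P_{theta_t}
    fvals = np.array([f(x) for x in X])
    rk = np.array([np.sum(fvals < fi) for fi in fvals])
    w_hat = np.array([(r + 0.5) / N <= 0.25 for r in rk], dtype=float) / N
    p_new = p + dt * (w_hat @ (X - p))
    return np.clip(p_new, 0.05, 0.95)                    # keep away from degenerate 0/1
```

On a OneMax-style problem (minimize f(x) = −Σ x_j), the probability vector drifts toward the all-ones corner.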
Conclusions

- The Information Geometric Optimization framework provides a unified picture of discrete and continuous optimization and theoretical foundations for existing algorithms.
- CMA-ES is state-of-the-art in continuous black-box optimization; some parts of the CMA-ES algorithm (step-size adaptation, cumulation) are not explained by the IGO framework.
- New algorithms: large-scale variants of CMA-ES based on IGO, …
[Ollivier et al.] Y. Ollivier, L. Arnold, A. Auger, N. Hansen. Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles. JMLR (accepted).
[Akimoto et al. 2010] Y. Akimoto, Y. Nagata, I. Ono, S. Kobayashi. Bidirectional relation between CMA evolution strategies and natural evolution strategies. PPSN 2010.
[Hansen et al. 2003] N. Hansen, S.D. Müller, P. Koumoutsakos. Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). ECJ 2003.
[Amari, Nagaoka 1993] S. Amari, H. Nagaoka. Methods of Information Geometry. 1993.
[Baluja, Caruana 1995] S. Baluja, R. Caruana. Removing the genetics from the standard genetic algorithm. ICML 1995.
[Harik et al. 1999] G.R. Harik, F.G. Lobo, D.E. Goldberg. The compact genetic algorithm. IEEE Trans. EC 1999.
[Wierstra et al. 2014] D. Wierstra, T. Schaul, T. Glasmachers, Y. Sun, J. Peters, J. Schmidhuber. Natural evolution strategies. JMLR 2014.