  1. Information Geometric Optimization: How information theory sheds new light on black-box optimization. Anne Auger, Inria and CMAP.

  2. Main reference: Y. Ollivier, L. Arnold, A. Auger, N. Hansen, Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles, JMLR (accepted).

  3. Black-Box Optimization. Optimize $f : \Omega \to \mathbb{R}$. Discrete optimization: $\Omega = \{0,1\}^n$. Continuous optimization: $\Omega \subset \mathbb{R}^n$.

  4. Black-Box Optimization. Optimize $f : \Omega \to \mathbb{R}$ with $x \in \Omega$ and $f(x) \in \mathbb{R}$; gradients are not available or not useful. Discrete optimization: $\Omega = \{0,1\}^n$. Continuous optimization: $\Omega \subset \mathbb{R}^n$.
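
For concreteness, a minimal Python/NumPy sketch of two such objectives (hypothetical examples, not taken from the slides); the optimizer can only query the value $f(x)$, never a gradient:

```python
import numpy as np

# Two black-box objectives in the sense above (illustrative choices, not from
# the slides): the caller only sees the value f(x), never a gradient.
sphere = lambda x: float(np.sum(np.asarray(x, dtype=float) ** 2))  # Omega subset of R^n
onemax = lambda x: -int(np.sum(x))                                 # Omega = {0,1}^n (minimize)

print(sphere([0.5, -1.0]), onemax([1, 0, 1, 1]))   # 1.25  -3
```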

  5. Adaptive Stochastic Black-Box Algorithm. $\theta_t$: state of the algorithm. Sample candidate solutions $X_{t+1}^i = \mathrm{Sol}(\theta_t, U_{t+1}^i)$, $i = 1, \ldots, \lambda$, with $\{U_{t+1}, t \in \mathbb{N}\}$ i.i.d. Evaluate the solutions $f(X_{t+1}^i)$. Update the state of the algorithm: $\theta_{t+1} = F\big(\theta_t, (X_{t+1}^1, f(X_{t+1}^1)), \ldots, (X_{t+1}^\lambda, f(X_{t+1}^\lambda))\big)$.

  6. Comparison-based Stochastic Algorithms: invariance to strictly increasing transformations of $f$. Sample candidate solutions $X_{t+1}^i = \mathrm{Sol}(\theta_t, U_{t+1}^i)$, $i = 1, \ldots, \lambda$. Evaluate and rank the solutions: $f(X_{t+1}^{S(1)}) \le \ldots \le f(X_{t+1}^{S(\lambda)})$, where $S$ is the permutation giving the indices of the ordered solutions. Update the state of the algorithm: $\theta_{t+1} = F\big(\theta_t, U_{t+1}^{S(1)}, \ldots, U_{t+1}^{S(\lambda)}\big)$.
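
A minimal Python/NumPy sketch of this sample / evaluate-and-rank / update loop, assuming hypothetical helper names `sol` and `update`; it is not the deck's algorithm, only an illustration of the template and of why only the ranking of $f$-values matters:

```python
import numpy as np

def comparison_based_step(f, theta, sol, update, lam, dim, rng):
    """One iteration of the comparison-based scheme: sample, evaluate, rank, update.
    Only the ranking of f-values is used, so the step is unchanged if f is
    replaced by g(f) for any strictly increasing g."""
    U = rng.standard_normal((lam, dim))           # i.i.d. noise vectors U^i
    X = np.array([sol(theta, u) for u in U])      # candidates X^i = Sol(theta, U^i)
    order = np.argsort([f(x) for x in X])         # permutation S with f(X^{S(1)}) <= ...
    return update(theta, U[order])                # theta_{t+1} = F(theta, U^{S(1)}, ...)

# Hypothetical instantiation (not from the slides): an isotropic evolution
# strategy with state theta = (mean, sigma) on the sphere function.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f = lambda x: np.sum(x ** 2)
    sol = lambda theta, u: theta[0] + theta[1] * u
    def update(theta, U_ranked):
        mu = len(U_ranked) // 2                   # recombine the better half
        return (theta[0] + theta[1] * U_ranked[:mu].mean(axis=0), theta[1])
    theta = (np.full(5, 3.0), 1.0)
    for _ in range(200):
        theta = comparison_based_step(f, theta, sol, update, lam=10, dim=5, rng=rng)
    print(f(theta[0]))   # typically much smaller than the initial value of 45
```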

  7. Overview. ➊ Black-Box Optimization: typical difficulties. ➋ Information Geometric Optimization. ➌ Invariance. ➍ Recovering well-known algorithms: CMA-ES, PBIL, cGA.

  8. Information Geometric Optimization: Setting. Family of probability distributions $(P_\theta)_{\theta \in \Theta}$ on $\Omega$, with a continuous multicomponent parameter $\theta \in \Theta$.

  9. Information Geometric Optimization: Setting. Family of probability distributions $(P_\theta)_{\theta \in \Theta}$ on $\Omega$, with a continuous multicomponent parameter $\theta \in \Theta$; $\Theta$ is a statistical manifold. Example: $\Omega = \mathbb{R}^n$ and $P_\theta$ a multivariate normal distribution with $\theta = (m, C)$.

  10. Changing Viewpoint I. Transform the original optimization problem on $\Omega$, $\min_{x \in \Omega} f(x)$, onto an optimization problem on $\Theta$: minimize $F(\theta) = \int f(x)\, P_\theta(dx)$.

  11. Changing Viewpoint I. Transform the original optimization problem on $\Omega$, $\min_{x \in \Omega} f(x)$, onto an optimization problem on $\Theta$: minimize $F(\theta) = \int f(x)\, P_\theta(dx)$. Minimizing $F$ ⇔ finding a Dirac-delta distribution concentrated on $\operatorname{argmin}_x f(x)$ [Wierstra et al., 2014].

  12. Changing Viewpoint I. Transform the original optimization problem on $\Omega$, $\min_{x \in \Omega} f(x)$, onto an optimization problem on $\Theta$: minimize $F(\theta) = \int f(x)\, P_\theta(dx)$. Minimizing $F$ ⇔ finding a Dirac-delta distribution concentrated on $\operatorname{argmin}_x f(x)$, but this formulation is not invariant to strictly increasing transformations of $f$.
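
A small Monte Carlo sketch of this relaxation, assuming $P_\theta$ is an isotropic Gaussian and $f$ is the sphere function (both hypothetical choices); it also illustrates why the infimum of $F$ is approached only as $P_\theta$ concentrates on the minimizer:

```python
import numpy as np

# Monte Carlo view of the relaxed objective F(theta) = E_{P_theta}[f(x)],
# here with P_theta = N(m, sigma^2 I) and f the sphere function (illustrative
# choices). As P_theta concentrates on argmin f, F(theta) approaches min f.
rng = np.random.default_rng(0)
f = lambda X: np.sum(X ** 2, axis=-1)

def F(m, sigma, n=100_000):
    X = m + sigma * rng.standard_normal((n, len(m)))
    return float(np.mean(f(X)))

m = np.zeros(3)
print(F(m, 1.0), F(m, 0.1), F(m, 0.01))   # roughly 3.0, 0.03, 0.0003
```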

  13. Changing Viewpoint II: invariant under strictly increasing transformations of $f$. Transform the original optimization problem on $\Omega$, $\min_{x \in \Omega} f(x)$, onto an optimization problem on $\Theta$: maximize $J_{\theta_t}(\theta) = \int W^f_{\theta_t}(x)\, P_\theta(dx)$, where $W^f_{\theta_t}(x) = w\big(P_{\theta_t}[y : f(y) \le f(x)]\big)$ with $w : [0,1] \to \mathbb{R}$ a decreasing weight function. Rationale: $f$ "small" ↔ $W^f_{\theta_t}(x)$ "large". [Ollivier et al.]
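
A Monte Carlo sketch of the quantile-based weight $W^f_{\theta_t}$, with an illustrative (hypothetical) choice of $P_{\theta_t}$, $f$ and $w$:

```python
import numpy as np

# The quantile-based rewriting W^f_{theta_t}(x) = w(P_{theta_t}[y : f(y) <= f(x)]),
# estimated by Monte Carlo for P_{theta_t} = N(0, 1) and f(x) = x^2 (illustrative
# choices); w(q) = 1 - q is one possible decreasing weight function.
rng = np.random.default_rng(0)
f = lambda x: x ** 2
w = lambda q: 1.0 - q
y = rng.standard_normal(100_000)            # samples from P_{theta_t}

def W(x):
    return w(np.mean(f(y) <= f(x)))         # small f(x) -> small quantile -> large W

print(W(0.1), W(1.0), W(3.0))               # decreasing as f(x) grows
# Replacing f by g(f) for any strictly increasing g leaves these values unchanged.
```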

  14. Information Geometric Optimization: Maximizing $J_{\theta_t}(\theta)$. Perform a natural gradient step on $\Theta$: $\theta_{t+\delta t} = \theta_t + \delta t\, \widetilde{\nabla}_\theta \int W^f_{\theta_t}(x)\, P_\theta(dx)$.

  16. Natural Gradient and the Fisher Information Metric. The natural gradient $\widetilde{\nabla}_\theta$ is the gradient w.r.t. the Fisher metric, defined via the Fisher information matrix $I_{ij}(\theta) = \int_x \frac{\partial \ln P_\theta(x)}{\partial \theta_i} \frac{\partial \ln P_\theta(x)}{\partial \theta_j}\, P_\theta(dx) = -\int_x \frac{\partial^2 \ln P_\theta(x)}{\partial \theta_i \partial \theta_j}\, P_\theta(dx)$, so that $\widetilde{\nabla} = I^{-1} \frac{\partial}{\partial \theta}$.
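
A quick numerical check of the definition for a one-parameter family (Bernoulli; a hypothetical illustration, not from the slides):

```python
import numpy as np

# Fisher information of a Bernoulli(p) family, I(p) = 1/(p(1-p)), estimated from
# the score function as in the first integral above (illustrative 1-parameter case).
rng = np.random.default_rng(0)
p = 0.3
x = (rng.random(200_000) < p).astype(float)     # samples from P_p
score = x / p - (1 - x) / (1 - p)               # d/dp log P_p(x)
print(np.mean(score ** 2), 1 / (p * (1 - p)))   # MC estimate vs. closed form

# For a scalar parameter the natural gradient is just grad / I(p); in general it
# preconditions the vanilla gradient with the inverse Fisher matrix.
```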

  17. Fisher Information Metric, equivalently defined via a second-order expansion of the KL divergence. The Kullback-Leibler divergence, a measure of "distance" between distributions: $\mathrm{KL}(P_{\theta'} \,\|\, P_\theta) = \int \ln \frac{P_{\theta'}(dx)}{P_\theta(dx)}\, P_{\theta'}(dx)$. Relation between the KL divergence and the Fisher matrix: $\mathrm{KL}(P_{\theta+\delta\theta} \,\|\, P_\theta) = \frac{1}{2} \sum_{ij} I_{ij}(\theta)\, \delta\theta_i \delta\theta_j + O(\delta\theta^3)$.
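
The expansion can be checked directly for a Bernoulli family (a hypothetical one-parameter illustration):

```python
import numpy as np

# Second-order check of KL(P_{p+d} || P_p) ≈ 0.5 * I(p) * d^2 for a Bernoulli(p)
# family, where I(p) = 1/(p(1-p)) (illustrative 1-parameter case).
p, d = 0.3, 1e-3
q = p + d
kl = q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))
print(kl, 0.5 * d ** 2 / (p * (1 - p)))   # the two values agree up to O(d^3)
```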

  18. Natural Gradient and the Fisher Information Metric. The natural gradient $\widetilde{\nabla}_\theta$ is the gradient w.r.t. the Fisher metric, defined via the Fisher information matrix $I_{ij}(\theta) = \int_x \frac{\partial \ln P_\theta(x)}{\partial \theta_i} \frac{\partial \ln P_\theta(x)}{\partial \theta_j}\, P_\theta(dx) = -\int_x \frac{\partial^2 \ln P_\theta(x)}{\partial \theta_i \partial \theta_j}\, P_\theta(dx)$. The resulting gradient is intrinsic: independent of the chosen parametrization $\theta$ of $P_\theta$. The Fisher metric is essentially the only way to obtain this property [Amari, Nagaoka, 2001].
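
A toy numerical illustration of this intrinsicness for a Bernoulli family in two parametrizations ($p$ and its logit); a hypothetical example, not from the slides:

```python
import numpy as np

# Parametrization-invariance check (hypothetical toy example): natural-gradient
# steps on the same criterion J, computed in two parametrizations of a
# Bernoulli(p) family, move the distribution identically to first order.
def J(p):                                  # any smooth criterion on the distribution
    return (p - 0.8) ** 2

p, dt, eps = 0.3, 1e-3, 1e-6
dJ_dp = (J(p + eps) - J(p - eps)) / (2 * eps)

# step in the p-parametrization, where I(p) = 1/(p(1-p))
p_new = p - dt * dJ_dp * p * (1 - p)

# step in the logit parametrization eta = log(p/(1-p)), where I(eta) = p(1-p)
eta = np.log(p / (1 - p))
dJ_deta = dJ_dp * p * (1 - p)              # chain rule
eta_new = eta - dt * dJ_deta / (p * (1 - p))
p_from_eta = 1.0 / (1.0 + np.exp(-eta_new))

print(p_new, p_from_eta)                   # agree up to O(dt^2)
```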

  19. Information Geometric Optimization: Maximizing $J_{\theta_t}(\theta)$. Perform a natural gradient step on $\Theta$: $\theta_{t+\delta t} = \theta_t + \delta t\, \widetilde{\nabla}_\theta \int W^f_{\theta_t}(x)\, P_\theta(dx) = \theta_t + \delta t \int W^f_{\theta_t}(x)\, \widetilde{\nabla}_\theta \ln P_\theta(x)\big|_{\theta=\theta_t}\, P_{\theta_t}(dx) = \theta_t + \delta t \int w\big(P_{\theta_t}[y : f(y) \le f(x)]\big)\, \widetilde{\nabla}_\theta \ln P_\theta(x)\big|_{\theta=\theta_t}\, P_{\theta_t}(dx)$. This does not depend on $\nabla f$.

  21. Information Geometric Optimization: Maximizing $J_{\theta_t}(\theta)$. Perform a natural gradient step on $\Theta$: $\theta_{t+\delta t} = \theta_t + \delta t\, \widetilde{\nabla}_\theta \int W^f_{\theta_t}(x)\, P_\theta(dx) = \theta_t + \delta t \int W^f_{\theta_t}(x)\, \frac{\widetilde{\nabla}_\theta P_\theta(x)}{P_{\theta_t}(x)}\, P_{\theta_t}(x)\, dx = \theta_t + \delta t \int W^f_{\theta_t}(x)\, \widetilde{\nabla}_\theta \ln P_\theta(x)\big|_{\theta=\theta_t}\, P_{\theta_t}(dx) = \theta_t + \delta t \int w\big(P_{\theta_t}[y : f(y) \le f(x)]\big)\, \widetilde{\nabla}_\theta \ln P_\theta(x)\big|_{\theta=\theta_t}\, P_{\theta_t}(dx)$. This does not depend on $\nabla f$.

  22. Information Geometric Optimization: Maximizing $J_{\theta_t}(\theta)$. Perform a natural gradient step on $\Theta$: $\theta_{t+\delta t} = \theta_t + \delta t\, \widetilde{\nabla}_\theta \int W^f_{\theta_t}(x)\, P_\theta(dx) = \theta_t + \delta t \int W^f_{\theta_t}(x)\, \frac{\widetilde{\nabla}_\theta P_\theta(x)}{P_{\theta_t}(x)}\, P_{\theta_t}(x)\, dx = \theta_t + \delta t \int W^f_{\theta_t}(x)\, \widetilde{\nabla}_\theta \ln P_\theta(x)\big|_{\theta=\theta_t}\, P_{\theta_t}(dx) = \theta_t + \delta t \int w\big(P_{\theta_t}[y : f(y) \le f(x)]\big)\, \widetilde{\nabla}_\theta \ln P_\theta(x)\big|_{\theta=\theta_t}\, P_{\theta_t}(dx)$. This does not depend on $\nabla f$. ➊ IGO flow: take $\delta t \to 0$. ➋ IGO algorithms: discretize the integrals.
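
The middle equality (the log-likelihood trick) can be checked numerically; a sketch for a 1-D Gaussian with a stand-in weight function, both hypothetical choices:

```python
import numpy as np

# Check of the step used above: d/dm E_{P_theta}[W(x)] = E_{P_theta}[W(x) * d/dm log P_theta(x)],
# so only samples and W-values are needed, never the gradient of f. Illustrated
# for P_theta = N(m, 1) and a stand-in weight function W (hypothetical choices).
rng = np.random.default_rng(1)
W = lambda x: np.tanh(-x)            # stands in for the rank-based weight W^f_{theta_t}
m, n = 0.5, 1_000_000
z = rng.standard_normal(n)

lhs = np.mean(W(m + z) * z)          # E[W(x) * d/dm log N(x; m, 1)], with x = m + z
eps = 1e-3                           # finite-difference check of d/dm E[W(x)]
rhs = (np.mean(W(m + eps + z)) - np.mean(W(m - eps + z))) / (2 * eps)
print(lhs, rhs)                      # agree up to Monte Carlo noise
```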

  23. Information Geometric Optimization: IGO gradient flow. The set of continuous-time trajectories in $\Theta$-space defined by the ODE $\frac{d\theta_t}{dt} = \int W^f_{\theta_t}(x)\, \widetilde{\nabla}_\theta \ln P_\theta(x)\big|_{\theta=\theta_t}\, P_{\theta_t}(dx)$. [Ollivier et al.]

  24. Information Geometric Optimization (IGO) Algorithm. Monte Carlo approximation of the integrals: sample $X_i \sim P_{\theta_t}$, $i = 1, \ldots, N$, and approximate $w\big(P_{\theta_t}[y : f(y) \le f(X_i)]\big) \approx w\big(\frac{\mathrm{rk}(X_i)+1/2}{N}\big)$ with $\mathrm{rk}(X_i) = \#\{j \mid f(X_j) < f(X_i)\}$. IGO algorithm: $\theta_{t+\delta t} = \theta_t + \delta t\, \frac{1}{N} \sum_{i=1}^{N} w\big(\frac{\mathrm{rk}(X_i)+1/2}{N}\big)\, \widetilde{\nabla}_\theta \ln P_\theta(X_i)\big|_{\theta=\theta_t} = \theta_t + \delta t \sum_{i=1}^{N} \hat{w}_i\, \widetilde{\nabla}_\theta \ln P_\theta(X_i)\big|_{\theta=\theta_t}$.

  25. IGO Algorithm [Ollivier et al.]. Monte Carlo approximation of the integrals: sample $X_i \sim P_{\theta_t}$, $i = 1, \ldots, N$, and approximate $w\big(P_{\theta_t}[y : f(y) \le f(X_i)]\big) \approx w\big(\frac{\mathrm{rk}(X_i)+1/2}{N}\big)$. IGO algorithm: $\theta_{t+\delta t} = \theta_t + \delta t \sum_{i=1}^{N} \hat{w}_i\, \widetilde{\nabla}_\theta \ln P_\theta(X_i)\big|_{\theta=\theta_t}$ with $\hat{w}_i = \frac{1}{N} w\big(\frac{\mathrm{rk}(X_i)+1/2}{N}\big)$, a consistent estimator of the integral.
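
A sketch of the rank-based weights $\hat{w}_i$ in Python/NumPy, with a hypothetical choice of weight function $w$ (the slides do not prescribe one):

```python
import numpy as np

def igo_weights(fvals, w=lambda q: (q <= 0.5) * 2.0):
    """Rank-based IGO weights w_hat_i = w((rk(X_i) + 1/2) / N) / N (sketch).

    w is a decreasing weight function on [0, 1]; the default (select the better
    half) is a hypothetical choice, not prescribed by the slides."""
    fvals = np.asarray(fvals, dtype=float)
    N = len(fvals)
    rk = np.array([np.sum(fvals < fi) for fi in fvals])   # rk(X_i) = #{j : f(X_j) < f(X_i)}
    return w((rk + 0.5) / N) / N

print(igo_weights([3.0, 1.0, 2.0, 10.0]))         # better samples get larger weights
print(igo_weights([3.0, 1.0, 2.0, 10.0]).sum())   # close to 1, consistent with integral of w = 1
```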

  26. Instantiation of IGO: Multivariate Normal Distributions [Akimoto et al. 2010]. $P_\theta$ multivariate normal distribution, $\theta = (m, C)$. IGO algorithm: $m_{t+\delta t} = m_t + \delta t \sum_{i=1}^{N} \hat{w}_i (X_i - m_t)$ and $C_{t+\delta t} = C_t + \delta t \sum_{i=1}^{N} \hat{w}_i \big((X_i - m_t)(X_i - m_t)^T - C_t\big)$. This recovers the CMA-ES algorithm with rank-mu update, with $N = \lambda$, $\delta t$ the learning rate for the covariance matrix, and an additional learning rate for the mean.
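
A compact Python/NumPy sketch of this instantiation, with a hypothetical weight function and a single learning rate $\delta t$; the actual CMA-ES uses separate learning rates for $m$ and $C$ plus step-size adaptation and cumulation, which are outside the IGO picture:

```python
import numpy as np

def igo_gaussian_step(f, m, C, N=20, dt=0.1, rng=None):
    """One IGO step for the multivariate normal family theta = (m, C) (sketch).

    Implements the two updates above; with rank-based weights this is the
    rank-mu update of CMA-ES (no step-size adaptation, no cumulation). The
    weight function and the single learning rate dt are hypothetical choices."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.multivariate_normal(m, C, size=N)            # X_i ~ P_theta_t
    fvals = np.array([f(x) for x in X])
    rk = np.array([np.sum(fvals < fi) for fi in fvals])
    w = lambda q: (q <= 0.25) * 4.0                      # decreasing weight on [0, 1]
    w_hat = w((rk + 0.5) / N) / N
    Y = X - m
    m_new = m + dt * (w_hat @ Y)
    C_new = C + dt * sum(w_hat[i] * (np.outer(Y[i], Y[i]) - C) for i in range(N))
    return m_new, C_new

# Hypothetical usage on the sphere function in dimension 5:
rng = np.random.default_rng(2)
m, C = np.full(5, 3.0), np.eye(5)
for _ in range(500):
    m, C = igo_gaussian_step(lambda x: np.sum(x ** 2), m, C, rng=rng)
print(np.sum(m ** 2))   # typically far below the initial value of 45
```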

  27. Instantiation of IGO: Bernoulli measures. $\Omega = \{0,1\}^d$, $P_\theta(x) = p_{\theta_1}(x_1) \cdots p_{\theta_d}(x_d)$, a family of Bernoulli measures. This recovers PBIL (Population-Based Incremental Learning) [Baluja, Caruana 1995] and cGA (compact Genetic Algorithm) [Harik et al. 1999].
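
A sketch of the IGO step for this family in Python/NumPy; the weight function, learning rate, sample size and clipping bounds are hypothetical choices, so this matches PBIL/cGA only up to such constants:

```python
import numpy as np

def igo_bernoulli_step(f, p, N=50, dt=0.1, rng=None):
    """One IGO step for independent Bernoulli measures on {0,1}^d (sketch).

    For this family the natural gradient of log P_theta(x) w.r.t. p is simply
    (x - p), giving the PBIL/cGA-flavoured update p <- p + dt * sum_i w_hat_i (X_i - p).
    The weight function, dt, N and the clipping bounds are hypothetical choices."""
    rng = np.random.default_rng() if rng is None else rng
    X = (rng.random((N, len(p))) < p).astype(float)      # X_i ~ P_theta_t
    fvals = np.array([f(x) for x in X])
    rk = np.array([np.sum(fvals < fi) for fi in fvals])
    w = lambda q: (q <= 0.5) * 2.0
    w_hat = w((rk + 0.5) / N) / N
    p_new = p + dt * (w_hat @ (X - p))
    return np.clip(p_new, 0.05, 0.95)                    # keep p away from 0 and 1

# Hypothetical usage on OneMax (maximize the number of ones, i.e. minimize -sum(x)):
rng = np.random.default_rng(3)
p = np.full(20, 0.5)
for _ in range(200):
    p = igo_bernoulli_step(lambda x: -np.sum(x), p, rng=rng)
print(p.round(2))   # probabilities drift towards their upper bound on OneMax
```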

  28. Conclusions. The Information Geometric Optimization framework gives a unified picture of discrete and continuous optimization and provides theoretical foundations for existing algorithms. CMA-ES is state-of-the-art in continuous black-box optimization, but some parts of the CMA-ES algorithm (step-size adaptation, cumulation) are not explained by the IGO framework. New algorithms: a large-scale variant of CMA-ES based on IGO, …
