  1. Stochastic Methods for Continuous Optimization
Anne Auger and Dimo Brockhoff
Paris-Saclay Master - Master 2 Informatique - Parcours Apprentissage, Information et Contenu (AIC)
anne.auger@inria.fr, 2015

  2. Overview
Problem Statement
  ◮ Black Box Optimization and Its Difficulties
  ◮ Non-Separable Problems
  ◮ Ill-Conditioned Problems
Stochastic search algorithms - basics
  ◮ A Search Template
  ◮ A Natural Search Distribution: the Normal Distribution
  ◮ Adaptation of Distribution Parameters: What to Achieve?
Adaptive Evolution Strategies
  ◮ Mean Vector Adaptation
  ◮ Step-size Control: Theory and Algorithms
  ◮ Covariance Matrix Adaptation
      Rank-One Update
      Cumulation: the Evolution Path
      Rank-µ Update
Summary and Final Remarks

  3. Problem Statement
Continuous Domain Search/Optimization
◮ Task: minimize an objective function (fitness function, loss function) in continuous domain:
    f : X ⊆ R^n → R,  x ↦ f(x)
◮ Black Box scenario (direct search scenario):  x → [black box] → f(x)
    ◮ gradients are not available or not useful
    ◮ problem-domain-specific knowledge is used only within the black box, e.g. within an appropriate encoding
◮ Search costs: number of function evaluations
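
To make the black-box scenario concrete, here is a minimal Python sketch (an illustration, not from the slides): a hypothetical wrapper that exposes only x → f(x) to the optimizer and counts function evaluations, the search-cost measure named above. The sphere test function and the BlackBox class name are arbitrary choices.

```python
import numpy as np

def sphere(x):
    """A simple test objective; in the black-box scenario the optimizer
    never sees this definition, only the returned f-values."""
    return float(np.sum(np.asarray(x, dtype=float) ** 2))

class BlackBox:
    """Hypothetical wrapper: exposes x -> f(x) and counts evaluations."""
    def __init__(self, f):
        self.f = f
        self.evaluations = 0

    def __call__(self, x):
        self.evaluations += 1
        return self.f(x)

f = BlackBox(sphere)
print(f(np.ones(3)), f.evaluations)   # 3.0 1
```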

  4. What Makes a Function Difficult to Solve?
Why stochastic search?
◮ non-linear, non-quadratic, non-convex
    on linear and quadratic functions much better search policies are available
◮ ruggedness
    non-smooth, discontinuous, multimodal, and/or noisy functions
◮ dimensionality (size of search space)
    (considerably) larger than three
◮ non-separability
    dependencies between the objective variables
◮ ill-conditioning
(figures: a non-convex function, a rugged multimodal function, level sets of a non-separable function; gradient vs. Newton direction)

  5. Separable Problems
Definition (Separable Problem). A function f is separable if
    arg min_{(x_1, …, x_n)} f(x_1, …, x_n) = ( arg min_{x_1} f(x_1, …), …, arg min_{x_n} f(…, x_n) )
⇒ it follows that f can be optimized in a sequence of n independent 1-D optimization processes
Example: additively decomposable functions
    f(x_1, …, x_n) = Σ_{i=1}^n f_i(x_i)
e.g. the Rastrigin function  f(x) = 10 n + Σ_{i=1}^n (x_i^2 − 10 cos(2π x_i))
(figure: level sets of the Rastrigin function)
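
A small Python sketch (an illustration, not part of the slides) of the Rastrigin function defined above, together with a coordinate-wise 1-D grid search that exploits separability; the grid range and resolution are arbitrary choices.

```python
import numpy as np

def rastrigin(x):
    """Rastrigin: f(x) = 10 n + sum_i (x_i^2 - 10 cos(2 pi x_i));
    additively decomposable, hence separable."""
    x = np.asarray(x, dtype=float)
    return 10 * x.size + float(np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x)))

# Separability: n independent 1-D searches recover the global optimum.
n = 5
grid = np.linspace(-3.0, 3.0, 6001)              # 1-D grid for each coordinate
f1 = grid ** 2 - 10 * np.cos(2 * np.pi * grid)   # the common 1-D component f_i
x_star = np.full(n, grid[np.argmin(f1)])         # best coordinate value, repeated
print(x_star, rastrigin(x_star))                 # [0. ... 0.] and 0.0
```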

  6. Non-Separable Problems
Building a non-separable problem from a separable one [1, 2]: rotating the coordinate system
◮ f : x ↦ f(x) separable
◮ f : x ↦ f(Rx) non-separable, where R is a rotation matrix
(figure: level sets of the separable function and of its rotated, non-separable counterpart)

[1] Hansen, Ostermeier, Gawelczyk (1995). On the adaptation of arbitrary normal mutation distributions in evolution strategies: The generating set adaptation. Sixth ICGA, pp. 57-64, Morgan Kaufmann.
[2] Salomon (1996). Reevaluating genetic algorithm performance under coordinate rotation of benchmark functions: A survey of some theoretical and practical aspects of genetic algorithms. BioSystems, 39(3):263-278.
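
The construction is easy to reproduce. In the sketch below (an assumption of this illustration: any random orthogonal matrix, here obtained via QR decomposition of a Gaussian matrix, serves the purpose of R), the separable Rastrigin function becomes non-separable.

```python
import numpy as np

def rastrigin(x):
    x = np.asarray(x, dtype=float)
    return 10 * x.size + float(np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x)))

n = 5
rng = np.random.default_rng(42)
R, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal matrix

def rotated_rastrigin(x):
    """f(R x): identical level sets up to rotation, but the coordinates now
    interact, so n independent 1-D searches no longer solve the problem."""
    return rastrigin(R @ np.asarray(x, dtype=float))

print(rastrigin(np.zeros(n)), rotated_rastrigin(np.zeros(n)))  # both 0.0 at x = 0
```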

  7. Ill-Conditioned Problems
◮ If f is convex quadratic,
    f : x ↦ (1/2) x^T H x = (1/2) Σ_i h_{i,i} x_i^2 + (1/2) Σ_{i≠j} h_{i,j} x_i x_j
  with H a positive definite, symmetric matrix, then H is the Hessian matrix of f
◮ ill-conditioned means a high condition number of the Hessian matrix H:
    cond(H) = λ_max(H) / λ_min(H)
Example/exercise:  f(x) = (1/2)(x_1^2 + 9 x_2^2), whose condition number equals 9
(figure: shape of the iso-fitness lines)
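
The condition number of the example is quickly verified numerically; a minimal check via the eigenvalues, matching the definition above:

```python
import numpy as np

# Hessian of the slide's example f(x) = (1/2)(x_1^2 + 9 x_2^2).
H = np.diag([1.0, 9.0])
lam = np.linalg.eigvalsh(H)    # eigenvalues, ascending
print(lam.max() / lam.min())   # cond(H) = lambda_max / lambda_min = 9.0
```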

  8. Ill-Conditioned Problems
Consider the curvature of the iso-fitness lines: ill-conditioned means "squeezed" lines of equal function value (high curvature).
    gradient direction: −f′(x)^T
    Newton direction: −H^{−1} f′(x)^T
The condition number equals nine here; condition numbers up to 10^10 are not unusual in real-world problems.
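
For the same quadratic, the two directions can be compared directly; a short sketch (the starting point x = (1, 1) is an arbitrary choice):

```python
import numpy as np

H = np.diag([1.0, 9.0])            # Hessian of f(x) = (1/2)(x_1^2 + 9 x_2^2)
x = np.array([1.0, 1.0])
grad = H @ x                       # f'(x)^T = H x for this quadratic

print(-grad)                       # gradient direction: [-1. -9.], bent
                                   # toward the high-curvature axis
print(-np.linalg.solve(H, grad))   # Newton direction: [-1. -1.], pointing
                                   # straight at the minimizer x* = 0
```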

  9. Stochastic Search
A black box search template to minimize f : R^n → R
Initialize distribution parameters θ, set population size λ ∈ N
While not terminate:
  1. Sample distribution P(x | θ) → x_1, …, x_λ ∈ R^n
  2. Evaluate x_1, …, x_λ on f
  3. Update parameters θ ← F_θ(θ, x_1, …, x_λ, f(x_1), …, f(x_λ))
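
The template is directly executable. Below is a minimal, hypothetical instantiation (a sketch, not one of the algorithms discussed later): P(·|θ) is an isotropic Gaussian centered at θ, and F_θ moves θ to the mean of the best half of the sample; all parameter values are arbitrary.

```python
import numpy as np

def search(f, theta0, lam=20, sigma=0.3, iters=100, seed=1):
    """Minimal instantiation of the black-box search template:
    P(x|theta) = N(theta, sigma^2 I), and F_theta moves theta to the
    mean of the best half of the sampled points."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iters):
        X = theta + sigma * rng.standard_normal((lam, theta.size))  # 1. sample
        fX = np.array([f(x) for x in X])                            # 2. evaluate
        best = X[np.argsort(fX)[: lam // 2]]                        # select best half
        theta = best.mean(axis=0)                                   # 3. update theta
    return theta

m = search(lambda x: float(np.sum(x ** 2)), theta0=np.ones(5))
print(m)   # near the optimum 0 of the sphere function; with the fixed
           # step-size sigma it cannot converge further (see step-size control)
```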

  16. Stochastic Search
Everything depends on the definition of P and F_θ.
In Evolutionary Algorithms the distribution P is often implicitly defined via operators on a population, in particular selection, recombination, and mutation.
The template is the natural one for Estimation of Distribution Algorithms.

  17. Evolution Strategies
New search points are sampled normally distributed, as perturbations of m:
    x_i ∼ m + σ N_i(0, C)  for i = 1, …, λ
where x_i, m ∈ R^n, σ ∈ R^+, and C ∈ R^{n×n}.

  19. Evolution Strategies
In the sample x_i ∼ m + σ N_i(0, C):
◮ the mean vector m ∈ R^n represents the favorite solution
◮ the so-called step-size σ ∈ R^+ controls the step length
◮ the covariance matrix C ∈ R^{n×n} determines the shape of the distribution ellipsoid
Here, all new points are sampled with the same parameters.
The question remains how to update m, C, and σ.
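
Sampling from the equation above is straightforward via a Cholesky factor of C; a sketch with arbitrary example values for m, σ, and C:

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 2, 1000
m = np.array([1.0, 2.0])                 # mean vector: the favorite solution
sigma = 0.5                              # step-size: overall step length
C = np.array([[1.0, 0.8],                # covariance matrix: shape of the
              [0.8, 2.0]])               # distribution ellipsoid

# x_i ~ m + sigma * N_i(0, C): draw z ~ N(0, I) and map it with a
# Cholesky factor A of C, since A z ~ N(0, A A^T) = N(0, C).
A = np.linalg.cholesky(C)
X = m + sigma * rng.standard_normal((lam, n)) @ A.T

print(X.mean(axis=0))                        # ~ m
print(np.cov(X, rowvar=False) / sigma ** 2)  # ~ C
```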

  20. Normal Distribution
1-D case: Standard Normal Distribution
◮ probability density of the 1-D standard normal distribution N(0, 1), with (expected value, variance) = (0, 1):
    p(x) = (1/√(2π)) exp(−x^2/2)
    (figure: density of N(0, 1))
General case
◮ normal distribution N(m, σ^2) with (expected value, variance) = (m, σ^2) and density
    p_{m,σ}(x) = (1/(√(2π) σ)) exp(−(x − m)^2 / (2σ^2))
◮ a normal distribution is entirely determined by its mean value and variance
◮ the family of normal distributions is closed under linear transformations: if X is normally distributed, then aX + b is also normally distributed
◮ Exercise: show that m + σ N(0, 1) = N(m, σ^2)
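
The exercise follows from the closedness under linear transformations together with E[aX + b] = a E[X] + b and Var(aX + b) = a^2 Var(X); an empirical check (sample size and parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
m, sigma = 1.5, 0.5

z = rng.standard_normal(100_000)   # samples of N(0, 1)
x = m + sigma * z                  # claimed to follow N(m, sigma^2)
print(x.mean(), x.var())           # ~ 1.5 and ~ 0.25 = sigma^2
```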
