Derivative Free Optimization, Anne Auger (Inria and CMAP, Ecole Polytechnique)

Stochastic Search
A black-box search template to minimize $f : \mathbb{R}^n \to \mathbb{R}$
Initialize distribution parameters $\theta$, set population size $\lambda \in \mathbb{N}$
While not terminate
1. Sample distribution $P(x \mid \theta) \to x_1, \dots, x_\lambda \in \mathbb{R}^n$
2. Evaluate $x_1, \dots, x_\lambda$ on $f$
3. Update parameters $\theta \leftarrow F_\theta(\theta, x_1, \dots, x_\lambda, f(x_1), \dots, f(x_\lambda))$
Everything depends on the definition of $P$ and $F_\theta$.
In Evolutionary Algorithms the distribution $P$ is often implicitly defined via operators on a population, in particular selection, recombination and mutation.
This is a natural template for Estimation of Distribution Algorithms.
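
The template can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `stochastic_search`, `sample` and `update` are hypothetical, and the instantiation below (a fixed-width Gaussian around a mean, keeping the best sample as the new mean) is a toy choice for $P$ and $F_\theta$, not a method from the slides.

```python
import random

def stochastic_search(f, theta, sample, update, lam, iterations):
    """Generic loop: sample from P(x | theta), evaluate on f, update theta."""
    for _ in range(iterations):
        xs = [sample(theta) for _ in range(lam)]   # 1. sample x_1, ..., x_lambda
        fs = [f(x) for x in xs]                    # 2. evaluate on f
        theta = update(theta, xs, fs)              # 3. theta <- F_theta(...)
    return theta

# Hypothetical instantiation: theta is a mean vector, P is an isotropic
# Gaussian with fixed standard deviation 0.3, F_theta keeps the best sample.
rng = random.Random(1)

def sphere(x):
    return sum(xi * xi for xi in x)

def sample(mean):
    return [mi + 0.3 * rng.gauss(0.0, 1.0) for mi in mean]

def update(mean, xs, fs):
    return xs[fs.index(min(fs))]

final_mean = stochastic_search(sphere, [3.0, 3.0], sample, update,
                               lam=10, iterations=200)
```

Because the step length is fixed here, the mean only reaches a neighbourhood of the optimum; adapting the distribution parameters is what the rest of the slides address.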

A Simple Example: The Pure Random Search (Also an Ineffective Example)
The Pure Random Search
◮ Sample a solution uniformly at random
◮ Return the best solution ever found
Exercise: see the exercise in the document "Exercices - class 1".
Non-adaptive algorithm: for the pure random search, $P(x \mid \theta)$ is independent of $\theta$ (i.e. there is no $\theta$ to be adapted): the algorithm is "blind".
In this class: present algorithms that are "much better" than that.
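
A minimal sketch of the pure random search, assuming a box-shaped search domain `[lower, upper]^n` and a fixed evaluation budget (both are assumptions; the slide does not specify them). The function name and signature are hypothetical.

```python
import random

def pure_random_search(f, lower, upper, n, budget, seed=0):
    """Sample uniformly in [lower, upper]^n and keep the best point found."""
    rng = random.Random(seed)
    best_x, best_f = None, float("inf")
    for _ in range(budget):
        x = [rng.uniform(lower, upper) for _ in range(n)]
        fx = f(x)
        if fx < best_f:              # nothing is adapted: the sampling
            best_x, best_f = x, fx   # distribution never changes
    return best_x, best_f

def sphere(x):
    return sum(xi * xi for xi in x)

best_x, best_f = pure_random_search(sphere, -1.0, 1.0, n=10, budget=1000)
```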

Evolution Strategies
New search points are sampled normally distributed, as perturbations of $m$:
$$x_i = m + \sigma y_i \quad \text{for } i = 1, \dots, \lambda, \quad \text{with } y_i \text{ i.i.d.} \sim \mathcal{N}(0, C)$$
where $x_i, m \in \mathbb{R}^n$, $\sigma \in \mathbb{R}_+$, $C \in \mathbb{R}^{n \times n}$, and where
◮ the mean vector $m \in \mathbb{R}^n$ represents the favorite solution
◮ the so-called step-size $\sigma \in \mathbb{R}_+$ controls the step length
◮ the covariance matrix $C \in \mathbb{R}^{n \times n}$ determines the shape of the distribution ellipsoid
Here, all new points are sampled with the same parameters.
The question remains how to update $m$, $C$, and $\sigma$.

Normal Distribution: 1-D case
Standard normal distribution $\mathcal{N}(0, 1)$: (expected (mean) value, variance) = (0, 1), with density
$$p(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$$
[Figure: probability density of the 1-D standard normal distribution on $[-4, 4]$]
General case
◮ Normal distribution $\mathcal{N}(m, \sigma^2)$: (expected value, variance) = $(m, \sigma^2)$, with density
$$p_{m,\sigma}(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - m)^2}{2\sigma^2}\right)$$
◮ A normal distribution is entirely determined by its mean value and variance
◮ The family of normal distributions is closed under linear transformations: if $X$ is normally distributed then a linear transformation $aX + b$ is also normally distributed
◮ Exercise: show that $m + \sigma\,\mathcal{N}(0, 1) = \mathcal{N}(m, \sigma^2)$

Normal Distribution: general case
A random variable following a 1-D normal distribution is determined by its mean value $m$ and variance $\sigma^2$. In the $n$-dimensional case it is determined by its mean vector and covariance matrix.
Covariance Matrix
If the entries in a vector $X = (X_1, \dots, X_n)^T$ are random variables, each with finite variance, then the covariance matrix $\Sigma$ is the matrix whose $(i, j)$ entries are the covariances of $(X_i, X_j)$:
$$\Sigma_{ij} = \mathrm{cov}(X_i, X_j) = E\left[(X_i - \mu_i)(X_j - \mu_j)\right]$$
where $\mu_i = E(X_i)$. Considering the expectation of a matrix as the expectation of each entry, we have
$$\Sigma = E\left[(X - \mu)(X - \mu)^T\right]$$
$\Sigma$ is symmetric and positive semi-definite; it is positive definite when the distribution is non-degenerate, which is the case assumed for the search distributions in these slides.

The Multi-Variate ($n$-Dimensional) Normal Distribution
Any multi-variate normal distribution $\mathcal{N}(m, C)$ is uniquely determined by its mean value $m \in \mathbb{R}^n$ and its symmetric positive definite $n \times n$ covariance matrix $C$.
$$\text{density: } p_{\mathcal{N}(m, C)}(x) = \frac{1}{(2\pi)^{n/2} |C|^{1/2}} \exp\left(-\frac{1}{2} (x - m)^T C^{-1} (x - m)\right)$$
[Figure: density of a 2-D normal distribution]
The mean value $m$
◮ determines the displacement (translation)
◮ is the value with the largest density (modal value)
◮ the distribution is symmetric about the distribution mean: $\mathcal{N}(m, C) = m + \mathcal{N}(0, C)$
The covariance matrix $C$
◮ determines the shape
◮ geometrical interpretation: any covariance matrix can be uniquely identified with the iso-density ellipsoid $\{x \in \mathbb{R}^n \mid (x - m)^T C^{-1} (x - m) = 1\}$

... any covariance matrix can be uniquely identified with the iso-density ellipsoid $\{x \in \mathbb{R}^n \mid (x - m)^T C^{-1} (x - m) = 1\}$
Lines of Equal Density [Figure: three panels of iso-density lines]
◮ $\mathcal{N}(m, \sigma^2 I) \sim m + \sigma\,\mathcal{N}(0, I)$: one degree of freedom $\sigma$; components are independent, standard normally distributed
◮ $\mathcal{N}(m, D^2) \sim m + D\,\mathcal{N}(0, I)$: $n$ degrees of freedom; components are independent, scaled
◮ $\mathcal{N}(m, C) \sim m + C^{1/2}\,\mathcal{N}(0, I)$: $(n^2 + n)/2$ degrees of freedom; components are correlated
where $I$ is the identity matrix (isotropic case) and $D$ is a diagonal matrix (reasonable for separable problems), and $A \times \mathcal{N}(0, I) \sim \mathcal{N}(0, A A^T)$ holds for all $A$.
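
The identity $A \times \mathcal{N}(0, I) \sim \mathcal{N}(0, A A^T)$ is what makes sampling from $\mathcal{N}(m, C)$ practical. The sketch below assumes NumPy and uses a Cholesky factor for $A$; this is one admissible choice with $A A^T = C$, whereas the slide writes the symmetric square root $C^{1/2}$. The function name and the example values of `m` and `C` are hypothetical.

```python
import numpy as np

def sample_normal(m, C, num, rng):
    """Draw `num` samples of N(m, C) as m + A z with A A^T = C, z ~ N(0, I)."""
    A = np.linalg.cholesky(C)                # lower-triangular, A @ A.T == C
    z = rng.standard_normal((num, len(m)))   # rows are i.i.d. N(0, I) vectors
    return m + z @ A.T                       # row i is m + A z_i

rng = np.random.default_rng(0)
m = np.array([1.0, -2.0])
C = np.array([[4.0, 1.5],
              [1.5, 1.0]])
samples = sample_normal(m, C, 200_000, rng)
empirical_C = np.cov(samples, rowvar=False)  # should be close to C
```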

Where are we?
Problem Statement
  Black Box Optimization and Its Difficulties
  Non-Separable Problems
  Ill-Conditioned Problems
Stochastic search algorithms - basics
  A Search Template
  A Natural Search Distribution: the Normal Distribution
  Adaptation of Distribution Parameters: What to Achieve?
Adaptive Evolution Strategies
  Mean Vector Adaptation
  Step-size control
    Theory
    Algorithms
  Covariance Matrix Adaptation
    Rank-One Update
    Cumulation: the Evolution Path
    Rank-$\mu$ Update

Adaptation: What do we want to achieve?
New search points are sampled normally distributed:
$$x_i = m + \sigma y_i \quad \text{for } i = 1, \dots, \lambda, \quad \text{with } y_i \text{ i.i.d.} \sim \mathcal{N}(0, C)$$
where $x_i, m \in \mathbb{R}^n$, $\sigma \in \mathbb{R}_+$, $C \in \mathbb{R}^{n \times n}$
◮ the mean vector should represent the favorite solution
◮ the step-size controls the step length and thus the convergence rate; its adaptation should allow reaching the fastest possible convergence rate
◮ the covariance matrix $C \in \mathbb{R}^{n \times n}$ determines the shape of the distribution ellipsoid; its adaptation should allow learning the "topography" of the problem, which is particularly important for ill-conditioned problems: $C \propto H^{-1}$ on convex quadratic functions


Evolution Strategies (ES): Simple Update for Mean Vector
Let $\mu$: # parents, $\lambda$: # offspring
Plus (elitist) and comma (non-elitist) selection
◮ $(\mu + \lambda)$-ES: selection in {parents} ∪ {offspring}
◮ $(\mu, \lambda)$-ES: selection in {offspring}
ES algorithms emerged in the community of bio-inspired methods, where a parallel between optimization and the evolution of species as described by Darwin originally served as inspiration for the methods. Nowadays this parallel is mainly visible through the terminology used: candidate solutions are parents or offspring, the objective function is a fitness function, ...
$(1 + 1)$-ES
Sample one offspring from the parent $m$: $x = m + \sigma\,\mathcal{N}(0, C)$
If $x$ is better than $m$, select $m \leftarrow x$

The $(\mu/\mu, \lambda)$-ES - Update of the mean vector
Non-elitist selection and intermediate (weighted) recombination
Given the $i$-th solution point $x_i = m + \sigma y_i$ with $y_i \sim \mathcal{N}(0, C)$.
Let $x_{i:\lambda}$ be the $i$-th ranked solution point, such that $f(x_{1:\lambda}) \le \dots \le f(x_{\lambda:\lambda})$.
Notation: we denote by $y_{i:\lambda}$ the vector such that $x_{i:\lambda} = m + \sigma y_{i:\lambda}$.
Exercise: realize that $y_{i:\lambda}$ is generally not distributed as $\mathcal{N}(0, C)$.
The new mean reads
$$m \leftarrow \sum_{i=1}^{\mu} w_i x_{i:\lambda} = m + \sigma \underbrace{\sum_{i=1}^{\mu} w_i y_{i:\lambda}}_{=:\, y_w}$$
where
$$w_1 \ge \dots \ge w_\mu > 0, \quad \sum_{i=1}^{\mu} w_i = 1, \quad \frac{1}{\sum_{i=1}^{\mu} w_i^2} =: \mu_w \approx \frac{\lambda}{4}$$
The best $\mu$ points are selected from the new solutions (non-elitist) and weighted intermediate recombination is applied.
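
The mean update can be sketched as follows, assuming NumPy. The slide only constrains the weights (positive, decreasing, summing to one); the logarithmic weights in `log_weights` are one common choice and are an assumption here, as are the function names.

```python
import numpy as np

def log_weights(mu):
    """Positive, decreasing weights that sum to one (an assumed choice)."""
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    return w / w.sum()

def update_mean(m, sigma, ys, fs, w):
    """Select the best len(w) steps by f-value and recombine them with weights w."""
    order = np.argsort(fs)[: len(w)]   # indices of x_{1:lambda}, ..., x_{mu:lambda}
    y_w = w @ ys[order]                # y_w = sum_i w_i y_{i:lambda}
    return m + sigma * y_w, y_w        # m <- m + sigma * y_w

# Tiny worked example with four offspring and mu = 2.
m, sigma = np.zeros(2), 1.0
ys = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
f = lambda x: (x[0] - 2.0) ** 2 + (x[1] - 1.0) ** 2
fs = [f(m + sigma * y) for y in ys]    # f-values 2, 4, 10, 8
m_new, y_w = update_mean(m, sigma, ys, fs, np.array([0.75, 0.25]))

w = log_weights(5)
mu_w = 1.0 / np.sum(w ** 2)            # variance effective selection mass
```

In the worked example the two best steps are `[1, 0]` and `[0, 1]`, so the new mean is `0.75 * [1, 0] + 0.25 * [0, 1]`.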

Invariance Under Monotonically Increasing Functions
Rank-based algorithms: the update of all parameters uses only the ranks
$$f(x_{1:\lambda}) \le f(x_{2:\lambda}) \le \dots \le f(x_{\lambda:\lambda})$$
$$g(f(x_{1:\lambda})) \le g(f(x_{2:\lambda})) \le \dots \le g(f(x_{\lambda:\lambda})) \quad \forall g$$
where $g$ is strictly monotonically increasing: $g$ preserves ranks.
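
A small check of the rank-preservation argument, with a hypothetical helper `ranking` and an arbitrary strictly increasing `g` chosen for illustration:

```python
def ranking(values):
    """Indices of `values` sorted from best (smallest) to worst."""
    return sorted(range(len(values)), key=lambda i: values[i])

f_values = [3.2, -1.0, 0.5, 7.1, 2.0]

def g(t):
    return t ** 3 + 2.0 * t   # strictly increasing on the real line

g_values = [g(v) for v in f_values]
same_ranking = ranking(f_values) == ranking(g_values)
```

Since a rank-based algorithm only ever sees `ranking(...)`, it produces the same sequence of search points on $f$ and on $g \circ f$ when given the same random numbers.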

Why Step-Size Control?
[Figure: function value versus function evaluations (0 to $2 \times 10^4$) for a (1+1)-ES on $f(x) = \sum_{i=1}^{n} x_i^2$, in $[-2.2, 0.8]^n$ for $n = 10$. Curves: random search; constant step-size (red and green), one too small and one too large; optimal step-size (scale invariant).]

Why Step-Size Control?
[Figure series: $\|m - x^*\| = \sqrt{f(x)}$ versus function evaluations for a $(5/5_w, 10)$-ES on $f(x) = \sum_{i=1}^{n} x_i^2$, for $n = 10$ and $x_0 \in [-0.2, 0.8]^n$. Curves: with optimal step-size; with step-size control; respective step-size.]
◮ 11 runs with optimal step-size $\sigma$
◮ $2 \times 11$ runs with optimal versus adaptive step-size $\sigma$, with too small initial $\sigma$
◮ comparing the number of $f$-evaluations to reach $\|m\| = 10^{-5}$: $\frac{1100 - 100}{650} \approx 1.5$
◮ comparing optimal versus default damping parameter $d_\sigma$: $\frac{1700}{1100} \approx 1.5$

Why Step-Size Control?
[Figure: left, normalized progress $\varphi^*$ versus normalized step size $\sigma^*$, with the optimum at $(\sigma^*_{\mathrm{opt}}, \varphi^*_{\mathrm{opt}})$; right, function value versus function evaluations for constant $\sigma$, random search, adaptive step-size $\sigma$, and optimal step-size (scale invariant, proportional to $\|\text{parent}\|$).]
The evolution window refers to the step-size interval where reasonable performance is observed.

Step-size control: Theory
◮ On well-conditioned problems (sphere function $f(x) = \|x\|^2$), step-size adaptation should allow reaching (close to) optimal convergence rates. An algorithm needs to solve simple scenarios (linear function, sphere function) optimally, because these quite often (always?) have to be solved when addressing a real-world problem.
◮ Is it possible to quantify the optimal convergence rate for step-size adaptive ESs?

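Lower bound for convergence: exemplified on the (1+1)-ES
Consider a (1+1)-ES with any step-size adaptation mechanism.
(1+1)-ES with adaptive step-size, iteration $k$:
$$\tilde{X}_{k+1} = X_k + \sigma_k \mathcal{N}_{k+1} \quad \text{with } (\mathcal{N}_k)_k \text{ i.i.d.} \sim \mathcal{N}(0, I)$$
where $\tilde{X}_{k+1}$ is the offspring, $X_k$ the parent and $\sigma_k$ the step-size, and
$$X_{k+1} = \begin{cases} \tilde{X}_{k+1} & \text{if } f(\tilde{X}_{k+1}) \le f(X_k) \\ X_k & \text{otherwise} \end{cases}$$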

Lower bound for convergence (II): exemplified on the (1+1)-ES
Theorem: For any objective function $f : \mathbb{R}^n \to \mathbb{R}$ and for any $y^* \in \mathbb{R}^n$,
$$E[\ln \|X_{k+1} - y^*\|] \ge E[\ln \|X_k - y^*\|] - \tau \quad \text{(lower bound)}$$
where
$$\tau = \max_{\sigma \in \mathbb{R}_+} \varphi(\sigma), \quad \varphi(\sigma) := E\left[\ln^- \|e_1 + \sigma \mathcal{N}\|\right], \quad e_1 = (1, 0, \dots, 0)$$

"Tight" lower bound
Theorem: The lower bound is reached on the sphere function $f(x) = g(\|x - y^*\|)$ (with $g : \mathbb{R} \to \mathbb{R}$ an increasing mapping) for a step-size proportional to the distance to the optimum, $\sigma_k = \sigma \|X_k - y^*\|$ with $\sigma := \sigma_{\mathrm{opt}}$ such that $\varphi(\sigma_{\mathrm{opt}}) = \tau$.

(Log)-Linear convergence of scale-invariant step-size ES
Theorem: The (1+1)-ES with step-size proportional to the distance to the optimum, $\sigma_k = \sigma \|X_k\|$, converges (log)-linearly on the sphere function $f(x) = g(\|x\|)$ (with $g : \mathbb{R} \to \mathbb{R}$ an increasing mapping) in the sense
$$\frac{1}{k} \ln \frac{\|X_k\|}{\|X_0\|} \xrightarrow[k \to \infty]{} -\varphi(\sigma) =: \mathrm{CR}_{(1+1)}(\sigma) \quad \text{almost surely.}$$
[Figure: left, distance to optimum versus function evaluations for $n = 20$ and $\sigma = 0.6/n$; right, c(sigma)*dimension versus sigma*dimension for dim = 2, 3, 5, 10, 20, 160, with the minimum marked for each dimension.]

Asymptotic results: when $n \to \infty$
Theorem: Let $\sigma > 0$. The convergence rate of the (1+1)-ES with scale-invariant step-size on spherical functions satisfies in the limit
$$\lim_{n \to \infty} n \times \mathrm{CR}_{(1+1)}\left(\frac{\sigma}{n}\right) = -\frac{\sigma}{\sqrt{2\pi}} \exp\left(-\frac{\sigma^2}{8}\right) + \frac{\sigma^2}{2}\, \Phi\left(-\frac{\sigma}{2}\right)$$
where $\Phi$ is the cumulative distribution function of a standard normal distribution.
The optimal convergence rate decreases to zero like $1/n$.
[Figure: c(sigma)*dimension versus sigma*dimension]

Summary of theory results
[Figure: left, normalized progress $\varphi^*$ versus normalized step size $\sigma^*$, with the optimum at $(\sigma^*_{\mathrm{opt}}, \varphi^*_{\mathrm{opt}})$; right, function value versus function evaluations for constant $\sigma$, random search, adaptive step-size $\sigma$, and optimal step-size (scale invariant, proportional to $\|\text{parent}\|$).]
The evolution window refers to the step-size interval where reasonable performance is observed.

Methods for Step-Size Control
◮ 1/5-th success rule (a, b), often applied with "+"-selection: increase the step-size if more than 20% of the new solutions are successful, decrease otherwise
◮ $\sigma$-self-adaptation (c), applied with ","-selection: mutation is applied to the step-size and the better one, according to the objective function value, is selected (simplified "global" self-adaptation)
◮ path length control (d) (Cumulative Step-size Adaptation, CSA) (e), applied with ","-selection
References:
(a) Rechenberg 1973, Evolutionsstrategie, Optimierung technischer Systeme nach Prinzipien der biologischen Evolution, Frommann-Holzboog
(b) Schumer and Steiglitz 1968, Adaptive step size random search, IEEE TAC
(c) Schwefel 1981, Numerical Optimization of Computer Models, Wiley
(d) Hansen & Ostermeier 2001, Completely Derandomized Self-Adaptation in Evolution Strategies, Evol. Comput. 9(2)
(e) Ostermeier et al 1994, Step-size adaptation based on non-local use of selection information, PPSN IV

One-fifth success rule
[Figure: two situations, one leading to an increase of $\sigma$ and one to a decrease of $\sigma$; the probability of success $p_s$ is compared with the values 1/2 and 1/5, the case $p_s \approx 1/2$ being labelled "too small".]

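One-fifth success rule
$p_s$: # of successful offspring / # offspring (per iteration)
$$\sigma \leftarrow \sigma \times \exp\left(\frac{1}{3} \times \frac{p_s - p_{\mathrm{target}}}{1 - p_{\mathrm{target}}}\right)$$
Increase $\sigma$ if $p_s > p_{\mathrm{target}}$; decrease $\sigma$ if $p_s < p_{\mathrm{target}}$.
$(1 + 1)$-ES with $p_{\mathrm{target}} = 1/5$:
IF the offspring is better than the parent: $p_s = 1$, $\sigma \leftarrow \sigma \times \exp(1/3)$
ELSE: $p_s = 0$, $\sigma \leftarrow \sigma / \exp(1/3)^{1/4}$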
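
A minimal sketch of a (1+1)-ES with this rule, under the assumptions that $C = I$ and that an offspring at least as good as the parent counts as a success. The function name, the evaluation budget and the test function are illustrative choices, not taken from the slides.

```python
import math
import random

def one_plus_one_es(f, m, sigma, evaluations, seed=0):
    """(1+1)-ES with the one-fifth success rule and isotropic sampling (C = I)."""
    rng = random.Random(seed)
    m = list(m)
    fm = f(m)
    for _ in range(evaluations):
        x = [mi + sigma * rng.gauss(0.0, 1.0) for mi in m]
        fx = f(x)
        if fx <= fm:                               # success: p_s = 1
            m, fm = x, fx
            sigma *= math.exp(1.0 / 3.0)
        else:                                      # failure: p_s = 0
            sigma /= math.exp(1.0 / 3.0) ** 0.25
    return m, fm, sigma

def sphere(x):
    return sum(xi * xi for xi in x)

m_final, f_final, sigma_final = one_plus_one_es(sphere, [1.0] * 10, 1.0, 3000)
```

The two factors balance exactly when one offspring in five is successful: one increase by `exp(1/3)` cancels four decreases by `exp(1/12)`.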

Why 1/5?
Asymptotic convergence rate and probability of success of the scale-invariant step-size (1+1)-ES
[Figure: c(sigma)*dimension versus sigma*dimension, showing $\mathrm{CR}_{(1+1)}$, $\min(\mathrm{CR}_{(1+1)})$ and the probability of success; sphere, asymptotic results, i.e. $n = \infty$ (see slides before).]
1/5 is a trade-off between the optimal probabilities of success on the sphere and on the corridor.

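Path Length Control (CSA): The Concept of Cumulative Step-Size Adaptation
$$x_i = m + \sigma y_i, \quad m \leftarrow m + \sigma y_w$$
Measure the length of the evolution path, the pathway of the mean vector $m$ in the iteration sequence.
[Figure: two evolution paths, one leading to a decrease of $\sigma$ and one to an increase of $\sigma$.]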

Path Length Control (CSA): The Equations
Sampling of solutions, notations as on slide "The $(\mu/\mu, \lambda)$-ES - Update of the mean vector" with $C$ equal to the identity.
Initialize $m \in \mathbb{R}^n$, $\sigma \in \mathbb{R}_+$, evolution path $p_\sigma = 0$; set $c_\sigma \approx 4/n$, $d_\sigma \approx 1$.
Update mean:
$$m \leftarrow m + \sigma y_w \quad \text{where } y_w = \sum_{i=1}^{\mu} w_i y_{i:\lambda}$$
Update evolution path:
$$p_\sigma \leftarrow (1 - c_\sigma)\, p_\sigma + \sqrt{1 - (1 - c_\sigma)^2}\, \sqrt{\mu_w}\, y_w$$
where $\sqrt{1 - (1 - c_\sigma)^2}$ accounts for $1 - c_\sigma$ and $\sqrt{\mu_w}$ accounts for the $w_i$.
Update step-size:
$$\sigma \leftarrow \sigma \times \exp\left(\frac{c_\sigma}{d_\sigma} \left(\frac{\|p_\sigma\|}{E\|\mathcal{N}(0, I)\|} - 1\right)\right)$$
where the exponential factor is $> 1$ if and only if $\|p_\sigma\|$ is greater than its expectation.
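
The CSA equations translate into the following sketch, assuming NumPy, $C = I$, and $n \ge 4$ so that $c_\sigma = 4/n \le 1$. Two ingredients are assumptions not given on the slide: the logarithmic recombination weights, and the usual closed-form approximation of $E\|\mathcal{N}(0, I)\|$. The function name `csa_es` is hypothetical.

```python
import numpy as np

def csa_es(f, m, sigma, lam, iterations, seed=0):
    """(mu/mu_w, lambda)-ES with cumulative step-size adaptation, C = I."""
    rng = np.random.default_rng(seed)
    m = np.asarray(m, dtype=float)
    n = m.size
    mu = lam // 2
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))   # assumed weights
    w /= w.sum()
    mu_w = 1.0 / np.sum(w ** 2)
    c_sigma, d_sigma = 4.0 / n, 1.0                        # values from the slide
    # Approximation of E||N(0, I)|| (assumption, standard in CMA-ES codes).
    chi_n = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))
    p_sigma = np.zeros(n)
    for _ in range(iterations):
        y = rng.standard_normal((lam, n))                  # y_i ~ N(0, I)
        fs = [f(m + sigma * yi) for yi in y]
        y_w = w @ y[np.argsort(fs)[:mu]]                   # weighted recombination
        m = m + sigma * y_w                                # update mean
        p_sigma = ((1 - c_sigma) * p_sigma
                   + np.sqrt(1 - (1 - c_sigma) ** 2) * np.sqrt(mu_w) * y_w)
        sigma *= np.exp(c_sigma / d_sigma
                        * (np.linalg.norm(p_sigma) / chi_n - 1))
    return m, sigma

def sphere(x):
    return float(np.sum(np.asarray(x) ** 2))

m_final, sigma_final = csa_es(sphere, np.ones(10), 1.0, lam=10, iterations=400)
```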

Step-size adaptation: What is achieved
$(1 + 1)$-ES with one-fifth success rule (blue)
[Figure: function value and step-size $\sigma$ versus function evaluations (0 to 1500) on $f(x) = \sum_{i=1}^{n} x_i^2$, in $[-0.2, 0.8]^n$ for $n = 10$. Curves: random search; constant $\sigma$; adaptive step-size $\sigma$; optimal step-size (scale invariant).]
Linear convergence

Step-size adaptation: What is achieved
$(5/5, 10)$-CSA-ES, default parameters
[Figure: $\|m - x^*\|$ versus function evaluations (0 to 4000) on $f(x) = \sum_{i=1}^{n} x_i^2$, in $[-0.2, 0.8]^n$ for $n = 30$. Curves: with optimal step-size; with step-size control; respective step-size.]

Evolution Strategies: Recalling
New search points are sampled normally distributed, as perturbations of $m$:
$$x_i \sim m + \sigma\, \mathcal{N}_i(0, C) \quad \text{for } i = 1, \dots, \lambda$$
where $x_i, m \in \mathbb{R}^n$, $\sigma \in \mathbb{R}_+$, $C \in \mathbb{R}^{n \times n}$, and where
◮ the mean vector $m \in \mathbb{R}^n$ represents the favorite solution
◮ the so-called step-size $\sigma \in \mathbb{R}_+$ controls the step length
◮ the covariance matrix $C \in \mathbb{R}^{n \times n}$ determines the shape of the distribution ellipsoid
The remaining question is how to update $C$.

Covariance Matrix Adaptation: Rank-One Update
$$m \leftarrow m + \sigma y_w, \quad y_w = \sum_{i=1}^{\mu} w_i y_{i:\lambda}, \quad y_i \sim \mathcal{N}_i(0, C)$$
[Figure sequence over two iterations:]
1. initial distribution, $C = I$
2. $y_w$, movement of the population mean $m$ (disregarding $\sigma$)
3. mixture of distribution $C$ and step $y_w$: $C \leftarrow 0.8 \times C + 0.2 \times y_w y_w^T$
4. new distribution (disregarding $\sigma$)
5. movement of the population mean $m$
6. mixture of distribution $C$ and step $y_w$: $C \leftarrow 0.8 \times C + 0.2 \times y_w y_w^T$
7. new distribution, $C \leftarrow 0.8 \times C + 0.2 \times y_w y_w^T$
The ruling principle: the adaptation increases the likelihood of successful steps, $y_w$, to appear again.

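Covariance Matrix Adaptation: Rank-One Update
Initialize $m \in \mathbb{R}^n$ and $C = I$, set $\sigma = 1$, learning rate $c_{\mathrm{cov}} \approx 2/n^2$
While not terminate
$$x_i = m + \sigma y_i, \quad y_i \sim \mathcal{N}_i(0, C)$$
$$m \leftarrow m + \sigma y_w \quad \text{where } y_w = \sum_{i=1}^{\mu} w_i y_{i:\lambda}$$
$$C \leftarrow (1 - c_{\mathrm{cov}})\, C + c_{\mathrm{cov}}\, \mu_w \underbrace{y_w y_w^T}_{\text{rank-one}} \quad \text{where } \mu_w = \frac{1}{\sum_{i=1}^{\mu} w_i^2} \ge 1$$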
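
The covariance step on its own can be sketched as below, assuming NumPy. The learning rate 0.2 and the step `[2, 0]` are illustrative values (the slides use 0.8 and 0.2 in the figure sequence); the function name is hypothetical.

```python
import numpy as np

def rank_one_update(C, y_w, c_cov, mu_w=1.0):
    """C <- (1 - c_cov) C + c_cov * mu_w * y_w y_w^T."""
    return (1 - c_cov) * C + c_cov * mu_w * np.outer(y_w, y_w)

C = np.eye(2)
y_w = np.array([2.0, 0.0])          # a successful step along the first axis
C_new = rank_one_update(C, y_w, c_cov=0.2)
```

The variance along the successful step grows from 1 to 1.6 while the orthogonal direction shrinks to 0.8, so similar steps become more likely in the next iteration.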

Problem Statement
Stochastic search algorithms - basics
Adaptive Evolution Strategies
  Mean Vector Adaptation
  Step-size control
  Covariance Matrix Adaptation
    Rank-One Update
    Cumulation: the Evolution Path
    Rank-$\mu$ Update

Cumulation: The Evolution Path
Evolution Path
Conceptually, the evolution path is the search path the strategy takes over a number of iteration steps. It can be expressed as a sum of consecutive steps of the mean $m$.
An exponentially weighted sum of steps $y_w$ is used:
$$p_c \propto \sum_{i=0}^{g} \underbrace{(1 - c_c)^{g-i}}_{\text{exponentially fading weights}} y_w^{(i)}$$
The recursive construction of the evolution path (cumulation):
$$p_c \leftarrow \underbrace{(1 - c_c)}_{\text{decay factor}} p_c + \underbrace{\sqrt{1 - (1 - c_c)^2}\, \sqrt{\mu_w}}_{\text{normalization factor}} \underbrace{y_w}_{\text{input} \,=\, \frac{m - m_{\mathrm{old}}}{\sigma}}$$
where $\mu_w = \frac{1}{\sum w_i^2}$, $c_c \ll 1$. History information is accumulated in the evolution path.

Cumulation: Utilizing the Evolution Path
We used $y_w y_w^T$ for updating $C$. Because $y_w y_w^T = -y_w (-y_w)^T$, the sign of $y_w$ is lost.
The sign information is (re-)introduced by using the evolution path:
$$p_c \leftarrow \underbrace{(1 - c_c)}_{\text{decay factor}} p_c + \underbrace{\sqrt{1 - (1 - c_c)^2}\, \sqrt{\mu_w}}_{\text{normalization factor}}\, y_w$$
$$C \leftarrow (1 - c_{\mathrm{cov}})\, C + c_{\mathrm{cov}} \underbrace{p_c p_c^T}_{\text{rank-one}}$$
where $\mu_w = \frac{1}{\sum w_i^2}$, $c_c \ll 1$.
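
A short numerical illustration of why the path carries sign information, assuming NumPy. The value $c_c = 0.5$ is chosen only to keep the numbers readable (the slides assume $c_c \ll 1$), and `update_path` is a hypothetical name.

```python
import numpy as np

def update_path(p_c, y_w, c_c, mu_w):
    """p_c <- (1 - c_c) p_c + sqrt(1 - (1 - c_c)^2) sqrt(mu_w) y_w."""
    return ((1 - c_c) * p_c
            + np.sqrt(1 - (1 - c_c) ** 2) * np.sqrt(mu_w) * y_w)

y = np.array([1.0, 0.0])
# The outer product cannot distinguish y from -y ...
same_outer = np.allclose(np.outer(y, y), np.outer(-y, -y))

# ... but the path can: two steps in the same direction add up,
# two steps in opposite directions largely cancel.
p0 = np.zeros(2)
p_same = update_path(update_path(p0, y, 0.5, 1.0), y, 0.5, 1.0)
p_flip = update_path(update_path(p0, y, 0.5, 1.0), -y, 0.5, 1.0)
```

With these values `p_same` is three times as long as `p_flip`, so the rank-one term $p_c p_c^T$ reinforces directions in which consecutive steps agree.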

Using an evolution path for the rank-one update of the covariance matrix reduces the number of function evaluations to adapt to a straight ridge from $O(n^2)$ to $O(n)$ (3).
The overall model complexity is $n^2$, but important parts of the model can be learned in time of order $n$.
(3) Hansen, Müller and Koumoutsakos 2003. Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation, 11(1), pp. 1-18

Rank-$\mu$ Update
$$x_i = m + \sigma y_i, \quad y_i \sim \mathcal{N}_i(0, C), \qquad m \leftarrow m + \sigma y_w, \quad y_w = \sum_{i=1}^{\mu} w_i y_{i:\lambda}$$
The rank-$\mu$ update extends the update rule for large population sizes $\lambda$, using $\mu > 1$ vectors to update $C$ at each iteration step.
The matrix
$$C_\mu = \sum_{i=1}^{\mu} w_i y_{i:\lambda} y_{i:\lambda}^T$$
computes a weighted mean of the outer products of the best $\mu$ steps and has rank $\min(\mu, n)$ with probability one.
The rank-$\mu$ update then reads
$$C \leftarrow (1 - c_{\mathrm{cov}})\, C + c_{\mathrm{cov}}\, C_\mu$$
where $c_{\mathrm{cov}} \approx \mu_w / n^2$ and $c_{\mathrm{cov}} \le 1$.

[Figure: three panels illustrating one rank-$\mu$ update]
◮ sampling of $\lambda = 150$ solutions where $C = I$ and $\sigma = 1$: $x_i = m + \sigma y_i$, $y_i \sim \mathcal{N}(0, C)$
◮ calculating $C$ where $\mu = 50$, $w_1 = \dots = w_\mu = \frac{1}{\mu}$, and $c_{\mathrm{cov}} = 1$: $C_\mu = \frac{1}{\mu} \sum y_{i:\lambda} y_{i:\lambda}^T$, $C \leftarrow (1 - 1) \times C + 1 \times C_\mu$
◮ new distribution: $m_{\mathrm{new}} \leftarrow m + \frac{1}{\mu} \sum y_{i:\lambda}$

The rank-$\mu$ update
◮ increases the possible learning rate in large populations, roughly from $2/n^2$ to $\mu_w / n^2$
◮ can reduce the number of necessary iterations roughly from $O(n^2)$ to $O(n)$ (4), given $\mu_w \propto \lambda \propto n$
Therefore the rank-$\mu$ update is the primary mechanism whenever a large population size is used, say $\lambda \ge 3n + 10$.
The rank-one update
◮ uses the evolution path and reduces the number of necessary function evaluations to learn straight ridges from $O(n^2)$ to $O(n)$.
Rank-one update and rank-$\mu$ update can be combined.
(4) Hansen, Müller, and Koumoutsakos 2003. Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation, 11(1), pp. 1-18
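
The rank-$\mu$ step can be sketched as follows, assuming NumPy. The two selected steps and the equal weights are made-up values chosen so the result can be checked by hand; `rank_mu_update` is a hypothetical name.

```python
import numpy as np

def rank_mu_update(C, y_sel, w, c_cov):
    """C <- (1 - c_cov) C + c_cov C_mu with C_mu = sum_i w_i y_i y_i^T."""
    C_mu = (y_sel.T * w) @ y_sel      # weighted sum of outer products
    return (1 - c_cov) * C + c_cov * C_mu, C_mu

# mu = 2 selected steps in n = 3 dimensions, equal weights, c_cov = 1.
y_sel = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
w = np.array([0.5, 0.5])
C_new, C_mu = rank_mu_update(np.eye(3), y_sel, w, c_cov=1.0)
rank = np.linalg.matrix_rank(C_mu)    # min(mu, n) = 2 here
```

With $c_{\mathrm{cov}} = 1$ the old matrix is discarded entirely, which is why such a large learning rate is only reasonable when $\mu$ is large compared with $n$.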

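Summary of Equations: The Covariance Matrix Adaptation Evolution Strategy
Input: $m \in \mathbb{R}^n$, $\sigma \in \mathbb{R}_+$, $\lambda$
Initialize: $C = I$, and $p_c = 0$, $p_\sigma = 0$
Set: $c_c \approx 4/n$, $c_\sigma \approx 4/n$, $c_1 \approx 2/n^2$, $c_\mu \approx \mu_w / n^2$, $c_1 + c_\mu \le 1$, $d_\sigma \approx 1 + \sqrt{\mu_w / n}$, and $w_{i=1 \dots \lambda}$ such that $\mu_w = \frac{1}{\sum_{i=1}^{\mu} w_i^2} \approx 0.3\, \lambda$
While not terminate
Sampling:
$$x_i = m + \sigma y_i, \quad y_i \sim \mathcal{N}_i(0, C), \quad \text{for } i = 1, \dots, \lambda$$
Update mean:
$$m \leftarrow \sum_{i=1}^{\mu} w_i x_{i:\lambda} = m + \sigma y_w \quad \text{where } y_w = \sum_{i=1}^{\mu} w_i y_{i:\lambda}$$
Cumulation for $C$:
$$p_c \leftarrow (1 - c_c)\, p_c + \mathbb{1}_{\{\|p_\sigma\| < 1.5 \sqrt{n}\}} \sqrt{1 - (1 - c_c)^2}\, \sqrt{\mu_w}\, y_w$$
Cumulation for $\sigma$:
$$p_\sigma \leftarrow (1 - c_\sigma)\, p_\sigma + \sqrt{1 - (1 - c_\sigma)^2}\, \sqrt{\mu_w}\, C^{-\frac{1}{2}} y_w$$
Update $C$:
$$C \leftarrow (1 - c_1 - c_\mu)\, C + c_1\, p_c p_c^T + c_\mu \sum_{i=1}^{\mu} w_i y_{i:\lambda} y_{i:\lambda}^T$$
Update of $\sigma$:
$$\sigma \leftarrow \sigma \times \exp\left(\frac{c_\sigma}{d_\sigma} \left(\frac{\|p_\sigma\|}{E\|\mathcal{N}(0, I)\|} - 1\right)\right)$$
Not covered on this slide: termination, restarts, useful output, boundaries and encoding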
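
The summary can be turned into a compact sketch, assuming NumPy and $n \ge 4$ (so that $4/n \le 1$). It follows the update order of the slide and uses the slide's approximate parameter settings; the logarithmic weights, the approximation of $E\|\mathcal{N}(0, I)\|$ and the eigenvalue floor are assumptions added for the sketch. It omits termination, restarts and boundary handling, and is not a substitute for a reference implementation.

```python
import numpy as np

def cma_es(f, m, sigma, lam, iterations, seed=0):
    """Sketch of CMA-ES following the summary slide (no termination criteria)."""
    rng = np.random.default_rng(seed)
    m = np.asarray(m, dtype=float)
    n = m.size
    mu = lam // 2
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))    # assumed weights
    w /= w.sum()
    mu_w = 1.0 / np.sum(w ** 2)
    c_c, c_sigma = 4.0 / n, 4.0 / n
    c_1 = 2.0 / n ** 2
    c_mu = min(mu_w / n ** 2, 1.0 - c_1)                   # keeps c_1 + c_mu <= 1
    d_sigma = 1.0 + np.sqrt(mu_w / n)
    chi_n = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))  # ~ E||N(0, I)||
    C, p_c, p_sigma = np.eye(n), np.zeros(n), np.zeros(n)
    for _ in range(iterations):
        # Eigendecomposition C = B diag(d^2) B^T, used for sampling and C^(-1/2).
        eigvals, B = np.linalg.eigh(C)
        d = np.sqrt(np.maximum(eigvals, 1e-20))            # floor is an assumption
        z = rng.standard_normal((lam, n))
        y = (z * d) @ B.T                                  # y_i = B D z_i ~ N(0, C)
        fs = [f(m + sigma * yi) for yi in y]
        y_sel = y[np.argsort(fs)[:mu]]                     # best mu steps
        y_w = w @ y_sel
        m = m + sigma * y_w                                # update mean
        h = float(np.linalg.norm(p_sigma) < 1.5 * np.sqrt(n))
        p_c = ((1 - c_c) * p_c
               + h * np.sqrt(1 - (1 - c_c) ** 2) * np.sqrt(mu_w) * y_w)
        C_inv_sqrt_y = B @ ((B.T @ y_w) / d)               # C^(-1/2) y_w
        p_sigma = ((1 - c_sigma) * p_sigma
                   + np.sqrt(1 - (1 - c_sigma) ** 2) * np.sqrt(mu_w) * C_inv_sqrt_y)
        C = ((1 - c_1 - c_mu) * C
             + c_1 * np.outer(p_c, p_c)
             + c_mu * (y_sel.T * w) @ y_sel)               # rank-one + rank-mu
        sigma *= np.exp(c_sigma / d_sigma
                        * (np.linalg.norm(p_sigma) / chi_n - 1))
    return m, sigma, C

def ellipsoid(x):
    """Separable convex quadratic with condition number 100 (illustrative)."""
    x = np.asarray(x)
    n = x.size
    return float(np.sum(10.0 ** (2.0 * np.arange(n) / (n - 1)) * x ** 2))

m0 = np.ones(8)
m_final, sigma_final, C_final = cma_es(ellipsoid, m0, 0.5, lam=12, iterations=600)
```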

Experimentum Crucis (0)
What did we want to achieve?
◮ reduce any convex-quadratic function $f(x) = x^T H x$, e.g. $f(x) = \sum_{i=1}^{n} 10^{6 \frac{i-1}{n-1}} x_i^2$, to the sphere model $f(x) = x^T x$ without use of derivatives
◮ lines of equal density align with lines of equal fitness: $C \propto H^{-1}$ in a stochastic sense

Experimentum Crucis (1)
$f$ convex quadratic, separable: $f(x) = \sum_{i=1}^{n} 10^{\alpha \frac{i-1}{n-1}} x_i^2$, $\alpha = 6$
[Figure: four panels versus function evaluations (0 to 6000), 9-D run, final $f \approx 2.66 \times 10^{-10}$. Panels: "blue: abs(f), cyan: f-min(f), green: sigma, red: axis ratio"; "Object Variables (9-D)"; "Principal Axes Lengths"; "Standard Deviations in Coordinates divided by sigma".]

Experimentum Crucis (2)
$f$ convex quadratic, as before but non-separable (rotated): $f(x) = g(x^T H x)$, $g : \mathbb{R} \to \mathbb{R}$ strictly increasing
[Figure: four panels versus function evaluations (0 to 6000), 9-D run, final $f \approx 7.91 \times 10^{-10}$. Panels: "blue: abs(f), cyan: f-min(f), green: sigma, red: axis ratio"; "Object Variables (9-D)"; "Principal Axes Lengths"; "Standard Deviations in Coordinates divided by sigma".]
$C \propto H^{-1}$ for all $g$, $H$

Comparison to BFGS, NEWUOA, PSO and DE
$f$ convex quadratic, separable, with varying condition number $\alpha$
Algorithms: BFGS (Broyden et al 1970), NEWUOA (Powell 2004), DE (Storn & Price 1996), PSO (Kennedy & Eberhart 1995), CMA-ES (Hansen & Ostermeier 2001)
$f(x) = g(x^T H x)$ with $H$ diagonal; $g$ identity (for BFGS and NEWUOA); $g$ any order-preserving, i.e. strictly increasing, function (for all other)
[Figure: Ellipsoid, dimension 20, 21 trials, tolerance 1e-09, eval max 1e+07; SP1 versus condition number ($10^0$ to $10^{10}$) for BFGS, NEWUOA, DE2, PSO, CMAES.]
SP1 = average number of objective function evaluations (5) to reach the target function value of $g^{-1}(10^{-9})$
(5) Auger et al. (2009): Experimental comparisons of derivative free optimization algorithms, SEA

Comparison to BFGS, NEWUOA, PSO and DE
$f$ convex quadratic, non-separable (rotated), with varying condition number $\alpha$
Algorithms: BFGS (Broyden et al 1970), NEWUOA (Powell 2004), DE (Storn & Price 1996), PSO (Kennedy & Eberhart 1995), CMA-ES (Hansen & Ostermeier 2001)
$f(x) = g(x^T H x)$ with $H$ full; $g$ identity (for BFGS and NEWUOA); $g$ any order-preserving, i.e. strictly increasing, function (for all other)
[Figure: Rotated Ellipsoid, dimension 20, 21 trials, tolerance 1e-09, eval max 1e+07; SP1 versus condition number ($10^0$ to $10^{10}$) for BFGS, NEWUOA, DE2, PSO, CMAES.]
SP1 = average number of objective function evaluations (6) to reach the target function value of $g^{-1}(10^{-9})$
(6) Auger et al. (2009): Experimental comparisons of derivative free optimization algorithms, SEA


Derivative Free Optimization. Anne Auger (Inria and CMAP, Ecole Polytechnique), Laurent Dumas (U. Versailles). AMS Master - Optimization Paris-Saclay Master. RandOpt Team, Inria and CMAP (Ecole Polytechnique). anne.auger@inria.fr. 2017-2018
