1. Separability
Given $f : x = (x_1, \ldots, x_n) \in \mathbb{R}^n \mapsto f(x) \in \mathbb{R}$, let us define the 1-D functions that are cuts of $f$ along the different coordinates:
$$f^i_{(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)}(y) = f(x_1, \ldots, x_{i-1}, y, x_{i+1}, \ldots, x_n)$$
for $(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) \in \mathbb{R}^{n-1}$.
Definition: A function $f$ is separable if for all $i$ and for all $(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) \in \mathbb{R}^{n-1}$, $(\hat{x}_1, \ldots, \hat{x}_{i-1}, \hat{x}_{i+1}, \ldots, \hat{x}_n) \in \mathbb{R}^{n-1}$,
$$\operatorname*{argmin}_y f^i_{(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)}(y) = \operatorname*{argmin}_y f^i_{(\hat{x}_1, \ldots, \hat{x}_{i-1}, \hat{x}_{i+1}, \ldots, \hat{x}_n)}(y) .$$
This is a weak definition of separability.
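As a sanity check of this definition, here is a short worked example (not on the original slide) with the sphere function:

```latex
% Worked example: the sphere function is separable.
\[
  f(x) = \sum_{j=1}^n x_j^2
  \quad\Longrightarrow\quad
  f^i_{(x_1,\ldots,x_{i-1},x_{i+1},\ldots,x_n)}(y) = y^2 + \sum_{j \neq i} x_j^2 ,
\]
\[
  \operatorname*{argmin}_y \, f^i_{(x_1,\ldots,x_{i-1},x_{i+1},\ldots,x_n)}(y) = 0
  \quad \text{for every choice of the fixed coordinates.}
\]
% The argmin of each cut does not depend on the other coordinates, so f is separable.
```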

2. Separability (cont.)
Proposition: Let $f$ be separable. Then
$$\operatorname*{argmin} f(x_1, \ldots, x_n) = \left( \operatorname*{argmin}_{x_1} f^1_{(x_2, \ldots, x_n)}(x_1), \ldots, \operatorname*{argmin}_{x_n} f^n_{(x_1, \ldots, x_{n-1})}(x_n) \right)$$
and $f$ can be optimized using $n$ one-dimensional minimizations along the coordinates.
Exercise: prove the previous proposition.
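The proposition translates into a minimal sketch in code, assuming a simple grid search for each 1-D minimization (the function and helper names below are illustrative, not from the slides):

```python
import numpy as np

def coordinatewise_argmin(f, n, grid):
    """Minimize a separable function with n independent 1-D grid searches.

    For each coordinate i, the other coordinates are held at an arbitrary
    value (here 0); by separability the resulting 1-D argmin does not
    depend on that choice.
    """
    x_opt = np.zeros(n)
    for i in range(n):
        def cut(y, i=i):
            x = np.zeros(n)          # arbitrary fixed values for the other coordinates
            x[i] = y
            return f(x)
        values = [cut(y) for y in grid]
        x_opt[i] = grid[int(np.argmin(values))]
    return x_opt

# separable example: f(x) = sum_i (x_i - i)^2, optimum at (0, 1, 2)
f = lambda x: np.sum((x - np.arange(len(x)))**2)
grid = np.linspace(-5, 5, 1001)
print(coordinatewise_argmin(f, 3, grid))   # approximately [0. 1. 2.]
```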

3. Example: Additively Decomposable Functions
Exercise: Let $f(x_1, \ldots, x_n) = \sum_{i=1}^n h_i(x_i)$, with each $h_i$ having a unique argmin. Prove that $f$ is separable. We say in this case that $f$ is additively decomposable.
Example: the Rastrigin function
$$f(x) = 10 n + \sum_{i=1}^n \left( x_i^2 - 10 \cos(2 \pi x_i) \right)$$
[Figure: level sets of the 2-D Rastrigin function on $[-3, 3]^2$]
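A direct numpy implementation of the Rastrigin function above (a minimal sketch):

```python
import numpy as np

def rastrigin(x):
    """Rastrigin function: 10 n + sum_i (x_i^2 - 10 cos(2 pi x_i)).

    Additively decomposable, hence separable; global minimum f(0) = 0,
    surrounded by many local minima near the integer grid points.
    """
    x = np.asarray(x)
    return 10 * len(x) + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

print(rastrigin(np.zeros(5)))   # 0.0 (global minimum)
print(rastrigin(np.ones(5)))    # 5.0 (close to a local minimum)
```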

4. Non-separable Problems
Separable problems are typically easy to optimize. Yet difficult real-world problems are non-separable. When evaluating optimization algorithms, one needs to be careful that not too many test functions are separable and, if they are, that the algorithms do not exploit separability. Otherwise, good performance on the test problems will not reflect good performance of the algorithm on difficult problems.
Algorithms known to exploit separability: many Genetic Algorithms (GA), most Particle Swarm Optimization (PSO) algorithms.

5. Non-separable Problems
Building a non-separable problem from a separable one.
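The construction itself is not included in the transcript; a common way to build a non-separable problem from a separable one (an assumption on my part, not necessarily the construction shown on the original slide) is to compose the function with a rotation of the search space:

```python
import numpy as np

def rotated(f, n, seed=0):
    """Return x -> f(R x) for a random orthogonal (rotation) matrix R.

    If f is separable, f(R x) is in general no longer separable, because
    the coordinates of x are mixed together before f is evaluated.
    """
    rng = np.random.default_rng(seed)
    # QR factorization of a random Gaussian matrix yields a random orthogonal R
    R, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return lambda x: f(R @ np.asarray(x))

rastrigin = lambda x: 10 * len(x) + np.sum(np.asarray(x)**2
                                           - 10 * np.cos(2 * np.pi * np.asarray(x)))
rotated_rastrigin = rotated(rastrigin, 5)
print(rotated_rastrigin(np.zeros(5)))   # still 0.0: the optimum at 0 is preserved
```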

6. Ill-conditioned Problems: Case of Convex-Quadratic Functions
Exercise: Consider a convex-quadratic function $f(x) = \frac{1}{2}(x - x^\star)^\top H (x - x^\star)$ with $H$ a symmetric positive definite (SPD) matrix.
1. Why is it called a convex-quadratic function? What is the Hessian matrix of $f$?
The condition number of the matrix $H$ (with respect to the Euclidean norm) is defined as
$$\operatorname{cond}(H) = \frac{\lambda_{\max}(H)}{\lambda_{\min}(H)}$$
with $\lambda_{\max}(H)$ and $\lambda_{\min}(H)$ being respectively the largest and smallest eigenvalues of $H$.
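As a hint for question 1 (a standard computation, not spelled out in the transcript), the gradient and Hessian of $f$ are:

```latex
\[
  \nabla f(x) = H\,(x - x^\star),
  \qquad
  \nabla^2 f(x) = H .
\]
% f is a polynomial of degree 2 whose Hessian H is positive definite,
% hence f is convex: this is why it is called a convex-quadratic function.
```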

7. Ill-conditioned Problems
Ill-conditioned means a high condition number of the Hessian matrix $H$.
Consider now the specific case of the function $f(x) = \frac{1}{2}(x_1^2 + 9 x_2^2)$.
1. Compute its Hessian matrix and its condition number.
2. Plot the level sets of $f$; relate the condition number to the axis ratio of the level sets of $f$.
3. Generalize to a general convex-quadratic function.
Real-world problems are often ill-conditioned.
4. Why do you think this is the case?
5. Why are ill-conditioned problems difficult? (see also Exercise 2.5)
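A quick numerical check for questions 1 and 2 (a sketch; the finite-difference Hessian is an approximation, and `hessian_fd` is an illustrative helper, not from the slides):

```python
import numpy as np

f = lambda x: 0.5 * (x[0]**2 + 9 * x[1]**2)

def hessian_fd(f, x, h=1e-4):
    """Central finite-difference approximation of the Hessian of f at x."""
    n = len(x)
    I = np.eye(n)
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(x + h*I[i] + h*I[j]) - f(x + h*I[i] - h*I[j])
                       - f(x - h*I[i] + h*I[j]) + f(x - h*I[i] - h*I[j])) / (4 * h**2)
    return H

H = hessian_fd(f, np.zeros(2))
print(np.round(H, 6))              # [[1. 0.] [0. 9.]]
eigs = np.linalg.eigvalsh(H)
print(eigs[-1] / eigs[0])          # condition number ~ 9 (axis ratio of the level sets ~ 3)
```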

8. Ill-conditioned Problems

9. Part II: Algorithms

10. Landscape of Derivative-Free Optimization Algorithms
Deterministic algorithms:
- Quasi-Newton with estimation of the gradient (BFGS) [Broyden et al. 1970]
- Simplex downhill [Nelder and Mead 1965]
- Pattern search, Direct Search [Hooke and Jeeves 1961]
- Trust-region / model-based methods (NEWUOA, BOBYQA) [Powell 2006, 2009]
Stochastic (randomized) search methods:
- Evolutionary Algorithms (continuous domain):
  - Differential Evolution [Storn, Price 1997]
  - Particle Swarm Optimization [Kennedy and Eberhart 1995]
  - Evolution Strategies, CMA-ES [Rechenberg 1965; Hansen, Ostermeier 2001]
  - Estimation of Distribution Algorithms (EDAs) [Larrañaga, Lozano 2002]
  - Cross Entropy Method (same as EDAs) [Rubinstein, Kroese 2004]
  - Genetic Algorithms [Holland 1975, Goldberg 1989]
- Simulated Annealing [Kirkpatrick et al. 1983]

11. A Generic Template for Stochastic Search
Define $\{ P_\theta : \theta \in \Theta \}$, a family of probability distributions on $\mathbb{R}^n$.
Generic template to optimize $f : \mathbb{R}^n \to \mathbb{R}$:
Initialize the distribution parameter $\theta$, set the population size $\lambda \in \mathbb{N}$.
While not terminate:
1. Sample $x_1, \ldots, x_\lambda$ according to $P_\theta$
2. Evaluate $f$ on $x_1, \ldots, x_\lambda$
3. Update the parameters: $\theta \leftarrow F(\theta, x_1, \ldots, x_\lambda, f(x_1), \ldots, f(x_\lambda))$
The update of $\theta$ should drive $P_\theta$ to concentrate on the optima of $f$.
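The template translates almost literally into code. Here is a minimal sketch where $P_\theta$ is taken to be an isotropic Gaussian $\mathcal{N}(\theta, I)$ and the update $F$ simply moves $\theta$ to the best sample; both choices are placeholders, not the distributions and updates developed later in the slides:

```python
import numpy as np

def stochastic_search(f, theta0, lam=20, iterations=100, seed=0):
    """Generic stochastic-search template.

    P_theta is chosen here as the isotropic Gaussian N(theta, I) and the
    update F moves theta to the best of the lambda samples; both are
    illustrative placeholders.
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iterations):
        X = theta + rng.standard_normal((lam, len(theta)))   # 1. sample x_1, ..., x_lambda ~ P_theta
        fX = np.array([f(x) for x in X])                     # 2. evaluate f
        theta = X[np.argmin(fX)]                             # 3. update theta = F(theta, x, f(x))
    return theta

sphere = lambda x: np.sum(x**2)
print(stochastic_search(sphere, theta0=np.full(5, 3.0)))
```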

12. To obtain an optimization algorithm we need:
➊ to define $\{ P_\theta , \theta \in \Theta \}$
➋ to define the update function $F$ of $\theta$

13. Which probability distribution should we use to sample candidate solutions?

14. Normal distribution - 1D case

15. Generalization to n Variables: Independent Case
Assume $X_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and denote its density $p(x_1) = \frac{1}{Z_1} \exp\left( -\frac{1}{2} \frac{(x_1 - \mu_1)^2}{\sigma_1^2} \right)$.
Assume $X_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)$ and denote its density $p(x_2) = \frac{1}{Z_2} \exp\left( -\frac{1}{2} \frac{(x_2 - \mu_2)^2}{\sigma_2^2} \right)$.
Assume $X_1$ and $X_2$ are independent; then $(X_1, X_2)$ is a Gaussian vector with $p(x_1, x_2) = \; ?$

16. Generalization to n Variables: Independent Case
With $X_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $X_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)$ independent as on the previous slide, $(X_1, X_2)$ is a Gaussian vector with
$$p(x_1, x_2) = p(x_1) \, p(x_2) = \frac{1}{Z_1 Z_2} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)$$
with $x = (x_1, x_2)^\top$, $\mu = (\mu_1, \mu_2)^\top$ and $\Sigma = \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix}$.

17. Generalization to n Variables: Independent Case
Same setting as the previous slide.
[Figure: level sets of the density $p(x_1, x_2)$, centered at $(\mu_1, \mu_2)$, with $\sigma_1 > \sigma_2$]
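A quick numerical check of the factorization above, using scipy (a sketch with arbitrary example values):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

mu1, sigma1, mu2, sigma2 = 1.0, 2.0, -0.5, 0.5     # sigma1 > sigma2, as in the figure
x = np.array([0.3, -0.2])

# product of the two 1-D densities ...
p_product = norm.pdf(x[0], mu1, sigma1) * norm.pdf(x[1], mu2, sigma2)
# ... equals the 2-D Gaussian density with diagonal covariance matrix
Sigma = np.diag([sigma1**2, sigma2**2])
p_joint = multivariate_normal.pdf(x, mean=[mu1, mu2], cov=Sigma)

print(np.isclose(p_product, p_joint))   # True
```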

18. Generalization to n Variables: General Case
Gaussian Vector - Multivariate Normal Distribution
A random vector $X = (X_1, \ldots, X_n) \in \mathbb{R}^n$ is a Gaussian vector (or multivariate normal) if and only if for all real numbers $a_1, \ldots, a_n$, the random variable $a_1 X_1 + \ldots + a_n X_n$ has a normal distribution.

19. Gaussian Vector - Multivariate Normal Distribution

20. Density of an n-dimensional Gaussian vector $\mathcal{N}(m, C)$:
$$p_{\mathcal{N}(m, C)}(x) = \frac{1}{(2\pi)^{n/2} |C|^{1/2}} \exp\left( -\frac{1}{2} (x - m)^\top C^{-1} (x - m) \right)$$
The mean vector $m$:
- determines the displacement,
- is the value with the largest density,
- the distribution is symmetric around the mean: $\mathcal{N}(m, C) = m + \mathcal{N}(0, C)$.
The covariance matrix $C$ determines the geometrical shape (see next slides).
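A direct implementation of this density formula (a sketch with arbitrary example values for $m$ and $C$):

```python
import numpy as np

def gaussian_density(x, m, C):
    """Density of N(m, C) at x, implementing the formula on the slide."""
    x, m = np.asarray(x, float), np.asarray(m, float)
    n = len(m)
    diff = x - m
    quad = diff @ np.linalg.solve(C, diff)        # (x - m)^T C^{-1} (x - m)
    norm_const = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(C))
    return np.exp(-0.5 * quad) / norm_const

m = np.array([1.0, -1.0])
C = np.array([[2.0, 0.8], [0.8, 1.0]])            # symmetric positive definite
print(gaussian_density([0.5, 0.0], m, C))
# matches scipy.stats.multivariate_normal(m, C).pdf([0.5, 0.0])
```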

21. Geometry of a Gaussian Vector
Consider a Gaussian vector $\mathcal{N}(m, C)$; recall that the lines of equal density are given by
$$\{ x \mid \Delta^2 = (x - m)^\top C^{-1} (x - m) = \text{cst} \} .$$
Decompose $C = U \Lambda U^\top$ with $U$ orthogonal, i.e.
$$C = \begin{pmatrix} u_1 & u_2 \end{pmatrix} \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix} \begin{pmatrix} u_1^\top \\ u_2^\top \end{pmatrix} .$$
Let $Y = U^\top (x - m)$; then in the coordinate system $(u_1, u_2)$ the lines of equal density are given by
$$\{ x \mid \Delta^2 = \frac{Y_1^2}{\sigma_1^2} + \frac{Y_2^2}{\sigma_2^2} = \text{cst} \} ,$$
i.e. ellipses centered at $m = (\mu_1, \mu_2)$ with axes along $u_1$ and $u_2$.
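A numerical illustration of this decomposition (a sketch with an arbitrary $C$): sampling $x = m + U \operatorname{diag}(\sigma_1, \sigma_2)\, z$ with $z \sim \mathcal{N}(0, I)$ reproduces the covariance $C$:

```python
import numpy as np

m = np.array([0.0, 0.0])
C = np.array([[2.0, 0.8], [0.8, 1.0]])

# eigendecomposition C = U diag(sigma_1^2, sigma_2^2) U^T
eigvals, U = np.linalg.eigh(C)
sigmas = np.sqrt(eigvals)

# sample x = m + U diag(sigma) z with z ~ N(0, I); the level sets of the
# density are ellipses centered at m with axes along the eigenvectors u_i
# and half-axis lengths proportional to sigma_i
rng = np.random.default_rng(0)
Z = rng.standard_normal((100_000, 2))
X = m + (Z * sigmas) @ U.T
print(np.cov(X, rowvar=False))   # close to C
```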

22. [Figure-only slide]

23. Evolution Strategies

24. Evolution Strategies
In fact, the covariance matrix of the sampling distribution is $\sigma^2 C$, but it is convenient to refer to $C$ as the covariance matrix (it is a covariance matrix, but not the one of the sampling distribution).
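In code, sampling a candidate solution then looks as follows (a sketch with arbitrary values for $m$, $\sigma$, $C$); the empirical check confirms that the covariance of the samples is $\sigma^2 C$, not $C$:

```python
import numpy as np

rng = np.random.default_rng(0)
m = np.zeros(2)                           # mean of the search distribution
sigma = 0.5                               # step-size
C = np.array([[2.0, 0.8], [0.8, 1.0]])    # the "covariance matrix" of the ES

# one candidate solution: x = m + sigma * N(0, C)
x = m + sigma * rng.multivariate_normal(np.zeros(2), C)

# the covariance of the sampling distribution is sigma^2 * C, not C:
X = m + sigma * rng.multivariate_normal(np.zeros(2), C, size=100_000)
print(np.cov(X, rowvar=False))            # close to sigma**2 * C = 0.25 * C
```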

25. How to update the different parameters $m$, $\sigma$, $C$?

26. Update the Mean: a Simple Algorithm, the (1+1)-ES
Notation and terminology: in the (1+1)-ES, one solution is kept from one iteration to the next and one new solution is sampled at each iteration. The "+" means that we keep the best of the current solution and the new solution; we talk about elitist selection.
(1+1)-ES algorithm (update of the mean):
- sample one candidate solution from the mean $m$: $x = m + \sigma \mathcal{N}(0, C)$
- if $x$ is better than $m$ (i.e. if $f(x) \le f(m)$), select $x$: $m \leftarrow x$
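A minimal (1+1)-ES implementing the mean update above (a sketch: $\sigma$ is kept fixed and $C = I$, since the updates of $\sigma$ and $C$ are only discussed later):

```python
import numpy as np

def one_plus_one_es(f, m0, sigma=0.3, iterations=500, seed=0):
    """(1+1)-ES, mean update only: sigma and C (here C = I) stay fixed."""
    rng = np.random.default_rng(seed)
    m = np.asarray(m0, dtype=float)
    fm = f(m)
    for _ in range(iterations):
        x = m + sigma * rng.standard_normal(len(m))   # x = m + sigma * N(0, I)
        fx = f(x)
        if fx <= fm:                                  # elitist ("+") selection
            m, fm = x, fx
    return m, fm

sphere = lambda x: np.sum(x**2)
print(one_plus_one_es(sphere, m0=np.full(5, 3.0)))
```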

27. The (1+1)-ES algorithm is a simple algorithm, yet:
- the elitist selection is not robust to outliers: we cannot lose solutions accepted by "chance", for instance solutions that look good only because noise gave them a low function value;
- there is no population (just a single solution is sampled), which makes it less robust.
In practice, one should rather use a $(\mu/\mu, \lambda)$-ES: $\lambda$ solutions are sampled at each iteration, and the $\mu$ best solutions are selected and recombined (to form the new mean).

28. The $(\mu/\mu, \lambda)$-ES - Update of the Mean Vector
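The formula from this slide is not included in the transcript; a common variant of the update (an assumption here, not necessarily the one on the original slide) takes the mean of the $\mu$ best of the $\lambda$ offspring with equal weights:

```python
import numpy as np

def mu_lambda_es(f, m0, sigma=0.3, lam=20, mu=5, iterations=200, seed=0):
    """(mu/mu, lambda)-ES, mean update only, with equal recombination weights
    (an assumed variant: the exact formula of the slide is not in the transcript)."""
    rng = np.random.default_rng(seed)
    m = np.asarray(m0, dtype=float)
    for _ in range(iterations):
        X = m + sigma * rng.standard_normal((lam, len(m)))   # sample lambda offspring
        order = np.argsort([f(x) for x in X])                # rank them by f-value
        m = X[order[:mu]].mean(axis=0)                       # recombine the mu best into the new mean
    return m

sphere = lambda x: np.sum(x**2)
print(mu_lambda_es(sphere, m0=np.full(5, 3.0)))   # approaches the optimum at 0
```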

29. What changes in the previous slide if, instead of optimizing $f$, we optimize $g \circ f$ where $g : \operatorname{Im}(f) \to \mathbb{R}$ is strictly increasing?

30. Invariance Under Monotonically Increasing Functions
Comparison-based / ranking-based algorithms: the update of all parameters uses only the ranking:
$$f(x_{1:\lambda}) \le f(x_{2:\lambda}) \le \ldots \le f(x_{\lambda:\lambda})$$
which is equivalent to
$$g(f(x_{1:\lambda})) \le g(f(x_{2:\lambda})) \le \ldots \le g(f(x_{\lambda:\lambda}))$$
for all strictly increasing $g : \operatorname{Im}(f) \to \mathbb{R}$.
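A small numerical illustration of this invariance (a sketch): ranking sampled points by $f$ or by $g \circ f$ gives the same order, so any ranking-based update is unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sum(x**2)
g = lambda v: np.exp(v) - 7            # strictly increasing on Im(f)

X = rng.standard_normal((10, 3))       # lambda = 10 candidate solutions
ranks_f  = np.argsort([f(x) for x in X])
ranks_gf = np.argsort([g(f(x)) for x in X])
print(np.array_equal(ranks_f, ranks_gf))   # True: the ranking is invariant under g
```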
