Separability

a weak definition of separability

Given $f : x = (x_1, \dots, x_n) \in \mathbb{R}^n \mapsto f(x) \in \mathbb{R}$, let us define the 1-D functions that are cuts of $f$ along the different coordinates:

$$f^i_{(x^i_1, \dots, x^i_n)}(y) = f\left(x^i_1, \dots, x^i_{i-1}, y, x^i_{i+1}, \dots, x^i_n\right)$$

for $(x^i_1, \dots, x^i_n) \in \mathbb{R}^{n-1}$, with $(x^i_1, \dots, x^i_n) = (x^i_1, \dots, x^i_{i-1}, x^i_{i+1}, \dots, x^i_n)$ (the $i$-th coordinate is omitted).

Definition: A function $f$ is separable if for all $i$, for all $(x^i_1, \dots, x^i_n) \in \mathbb{R}^{n-1}$, for all $(\hat{x}^i_1, \dots, \hat{x}^i_n) \in \mathbb{R}^{n-1}$:

$$\operatorname{argmin}_y \, f^i_{(x^i_1, \dots, x^i_n)}(y) = \operatorname{argmin}_y \, f^i_{(\hat{x}^i_1, \dots, \hat{x}^i_n)}(y)$$


Separability (cont)

Proposition: Let $f$ be separable; then for all $x^j_i$,

$$\operatorname{argmin} f(x_1, \dots, x_n) = \left(\operatorname{argmin}_{x_1} f^1_{(x^1_2, \dots, x^1_n)}(x_1), \dots, \operatorname{argmin}_{x_n} f^n_{(x^n_1, \dots, x^n_{n-1})}(x_n)\right)$$

and $f$ can be optimized using minimization along the coordinates.

Exercise: prove the previous proposition.
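A minimal sketch (not from the slides) of the coordinate-wise minimization the proposition licenses; the test function and the use of scipy.optimize.minimize_scalar are illustrative choices:

```python
# Sketch: minimize a separable function coordinate by coordinate.
import numpy as np
from scipy.optimize import minimize_scalar

def f(x):
    # separable (additively decomposable): f(x) = sum_i (x_i - i)^2,
    # unique argmin at (0, 1, ..., n-1)
    return sum((xi - i) ** 2 for i, xi in enumerate(x))

def coordinate_minimize(f, n):
    x = np.zeros(n)
    for i in range(n):
        # the 1-D cut f^i: vary coordinate i, keep the others fixed
        cut = lambda y: f(np.concatenate([x[:i], [y], x[i + 1:]]))
        x[i] = minimize_scalar(cut).x
    return x

print(coordinate_minimize(f, 5))  # ~ [0. 1. 2. 3. 4.]
```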


Example: Additively Decomposable Functions

Exercise: Let $f(x_1, \dots, x_n) = \sum_{i=1}^{n} h_i(x_i)$ with each $h_i$ having a unique argmin. Prove that $f$ is separable. We say in this case that $f$ is additively decomposable.

Example: the Rastrigin function

$$f(x) = 10n + \sum_{i=1}^{n} \left(x_i^2 - 10 \cos(2\pi x_i)\right)$$

[Figure: level sets of the 2-D Rastrigin function on $[-3, 3]^2$]
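A short sketch (ours; the plotting choices are assumptions) reproducing a level-set plot like the one on the slide:

```python
# Sketch: level sets of the 2-D Rastrigin function on [-3, 3]^2.
import numpy as np
import matplotlib.pyplot as plt

def rastrigin2d(x1, x2):
    return 10 * 2 + (x1**2 - 10 * np.cos(2 * np.pi * x1)) \
                  + (x2**2 - 10 * np.cos(2 * np.pi * x2))

g = np.linspace(-3, 3, 400)
X1, X2 = np.meshgrid(g, g)
plt.contour(X1, X2, rastrigin2d(X1, X2), levels=30)
plt.gca().set_aspect("equal")
plt.show()
```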


Non-separable Problems

Separable problems are typically easy to optimize, yet difficult real-world problems are non-separable. When evaluating optimization algorithms, one needs to be careful that not too many test functions are separable and, if they are, that the algorithms do not exploit separability. Otherwise, good performance on test problems will not reflect good performance of the algorithm on difficult problems.

Algorithms known to exploit separability:

  • many Genetic Algorithms (GA)
  • most Particle Swarm Optimization (PSO) variants


Non-separable Problems

Building a non-separable problem from a separable one
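The construction itself is shown as a figure on the original slide; a standard approach, used in many benchmark suites, is to compose a separable $f$ with a rotation, $g(x) = f(Rx)$ with $R$ orthogonal and not axis-aligned. A sketch under that assumption:

```python
# Sketch: g(x) = f(R x) is in general non-separable even when f is separable.
# The Haar-random orthogonal R (via QR) is our illustrative choice.
import numpy as np

def rastrigin(x):
    # separable: additively decomposable (cf. previous slide)
    return 10 * len(x) + float(np.sum(x**2 - 10 * np.cos(2 * np.pi * x)))

rng = np.random.default_rng(0)
R, _ = np.linalg.qr(rng.standard_normal((5, 5)))   # random orthogonal matrix
rotated_rastrigin = lambda x: rastrigin(R @ x)      # non-separable in general
```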


Ill-conditioned Problems - Case of Convex-quadratic functions

Exercise: Consider a convex-quadratic function

$$f(x) = \frac{1}{2}(x - x^\star)^\top H (x - x^\star)$$

with $H$ a symmetric positive definite (SPD) matrix.

  • 1. Why is it called a convex-quadratic function? What is the Hessian matrix of $f$?

The condition number of the matrix $H$ (with respect to the Euclidean norm) is defined as

$$\operatorname{cond}(H) = \frac{\lambda_{\max}(H)}{\lambda_{\min}(H)}$$

with $\lambda_{\max}(H)$ and $\lambda_{\min}(H)$ being respectively the largest and smallest eigenvalues of $H$.


Ill-conditioned Problems

Ill-conditioned means a high condition number of the Hessian matrix $H$.

Consider now the specific case of the function $f(x) = \frac{1}{2}(x_1^2 + 9 x_2^2)$.

  • 1. Compute its Hessian matrix and its condition number.
  • 2. Plot the level sets of $f$; relate the condition number to the axis ratio of the level sets of $f$.
  • 3. Generalize to a general convex-quadratic function.

Real-world problems are often ill-conditioned.

  • 4. Why do you think this is the case?
  • 5. Why are ill-conditioned problems difficult?

(see also Exercise 2.5)
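A numerical companion to questions 1 and 2 (our sketch; the numpy/matplotlib usage is an illustrative choice):

```python
# Sketch: Hessian and condition number of f(x) = (x1^2 + 9 x2^2)/2,
# and its elliptic level sets (axis ratio = sqrt(cond(H)) = 3).
import numpy as np
import matplotlib.pyplot as plt

H = np.array([[1.0, 0.0],
              [0.0, 9.0]])                # Hessian of f (constant in x)
lam = np.linalg.eigvalsh(H)               # eigenvalues in ascending order
print("cond(H) =", lam[-1] / lam[0])      # 9.0

g = np.linspace(-3, 3, 300)
X1, X2 = np.meshgrid(g, g)
plt.contour(X1, X2, 0.5 * (X1**2 + 9 * X2**2), levels=15)
plt.gca().set_aspect("equal")
plt.show()
```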


Ill-conditioned Problems


Part II: Algorithms


Landscape of Derivative Free Optimization Algorithms

Deterministic Algorithms

  • Quasi-Newton with estimation of the gradient (BFGS) [Broyden et al. 1970]
  • Simplex downhill [Nelder and Mead 1965]
  • Pattern search, Direct Search [Hooke and Jeeves 1961]
  • Trust-region/model-based methods (NEWUOA, BOBYQA) [Powell 2006, 2009]

Stochastic (randomized) search methods

  • Evolutionary Algorithms (continuous domain)
    • Differential Evolution [Storn, Price 1997]
    • Particle Swarm Optimization [Kennedy and Eberhart 1995]
    • Evolution Strategies, CMA-ES [Rechenberg 1965; Hansen, Ostermeier 2001]
    • Estimation of Distribution Algorithms (EDAs) [Larrañaga, Lozano 2002]
    • Cross-Entropy Method (same as EDAs) [Rubinstein, Kroese 2004]
    • Genetic Algorithms [Holland 1975, Goldberg 1989]
  • Simulated Annealing [Kirkpatrick et al. 1983]


A Generic Template for Stochastic Search

Define $\{P_\theta : \theta \in \Theta\}$, a family of probability distributions on $\mathbb{R}^n$.

Generic template to optimize $f : \mathbb{R}^n \to \mathbb{R}$:

Initialize the distribution parameter $\theta$ and set the population size $\lambda \in \mathbb{N}$.

While not terminate:

  • 1. Sample $x_1, \dots, x_\lambda$ according to $P_\theta$
  • 2. Evaluate $x_1, \dots, x_\lambda$ on $f$
  • 3. Update parameters: $\theta \leftarrow F(\theta, x_1, \dots, x_\lambda, f(x_1), \dots, f(x_\lambda))$

The update of $\theta$ should drive $P_\theta$ to concentrate on the optima of $f$.
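A toy instantiation of this template (ours, not an algorithm from the slides): $P_\theta = \mathcal{N}(m, s^2 I)$ with $\theta = (m, s)$, updated in a cross-entropy-style fashion from the best samples.

```python
# Toy instantiation of the template: P_theta = N(m, s^2 I), theta = (m, s),
# with a cross-entropy-style update from the mu best samples.
import numpy as np

def stochastic_search(f, n, iters=100, lam=50, mu=10, seed=0):
    rng = np.random.default_rng(seed)
    m, s = np.zeros(n), 1.0                          # initialize theta
    for _ in range(iters):
        X = m + s * rng.standard_normal((lam, n))    # 1. sample from P_theta
        fX = np.array([f(x) for x in X])             # 2. evaluate on f
        best = X[np.argsort(fX)[:mu]]                # 3. update theta from
        m = best.mean(axis=0)                        #    the mu best samples
        s = float(best.std()) + 1e-12
    return m

sphere = lambda x: float(np.sum(x**2))
print(stochastic_search(sphere, n=5))                # concentrates near 0
```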


To obtain an optimization algorithm we need:

  ➊ to define $\{P_\theta, \theta \in \Theta\}$
  ➋ to define the update function $F$ of $\theta$


Which probability distribution to sample candidate solutions from?


Normal distribution - 1D case
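The slide content is a figure; for reference, the 1-D normal density it illustrates is

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$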



Generalization to n Variables: Independent Case

Assume $X_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and denote its density

$$p(x_1) = \frac{1}{Z_1} \exp\left(-\frac{1}{2\sigma_1^2}(x_1 - \mu_1)^2\right)$$

Assume $X_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)$ and denote its density

$$p(x_2) = \frac{1}{Z_2} \exp\left(-\frac{1}{2\sigma_2^2}(x_2 - \mu_2)^2\right)$$

Assume $X_1$ and $X_2$ are independent; then $(X_1, X_2)$ is a Gaussian vector with

$$p(x_1, x_2) = p(x_1)\,p(x_2) = \frac{1}{Z_1 Z_2} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1}(x - \mu)\right)$$

with $\mu = (\mu_1, \mu_2)^\top$, $x = (x_1, x_2)^\top$, and

$$\Sigma = \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix}$$

[Figure: level sets of $p(x_1, x_2)$, an axis-aligned ellipse centered at $(\mu_1, \mu_2)$ with radii proportional to $\sigma_1 > \sigma_2$]

Generalization to n Variables: General Case

Gaussian Vector - Multivariate Normal Distribution

A random vector $X = (X_1, \dots, X_n) \in \mathbb{R}^n$ is a Gaussian vector (or multivariate normal) if and only if for all real numbers $a_1, \dots, a_n$, the random variable $a_1 X_1 + \dots + a_n X_n$ has a normal distribution.


Gaussian Vector - Multivariate Normal Distribution


Density of an $n$-dimensional Gaussian vector $\mathcal{N}(m, C)$:

$$p_{\mathcal{N}(m,C)}(x) = \frac{1}{(2\pi)^{n/2}\,|C|^{1/2}} \exp\left(-\frac{1}{2}(x - m)^\top C^{-1}(x - m)\right)$$

The mean vector $m$:

  • determines the displacement: $\mathcal{N}(m, C) = m + \mathcal{N}(0, C)$
  • is the value with the largest density
  • the distribution is symmetric around the mean

The covariance matrix $C$ determines the geometrical shape (see next slides).
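A sampling sketch (ours): with $A A^\top = C$, e.g. a Cholesky factor, $m + Az$ for $z \sim \mathcal{N}(0, I)$ has distribution $\mathcal{N}(m, C)$, which also makes the displacement property above concrete.

```python
# Sketch: sampling N(m, C) as m + A z with A A^T = C (Cholesky factor).
import numpy as np

rng = np.random.default_rng(0)
m = np.array([1.0, -2.0])
C = np.array([[2.0, 0.8],
              [0.8, 1.0]])           # an arbitrary SPD example

A = np.linalg.cholesky(C)            # A @ A.T == C
z = rng.standard_normal((10000, 2))
X = m + z @ A.T                      # rows ~ N(m, C)

print(np.cov(X, rowvar=False))       # ~ C
print(X.mean(axis=0))                # ~ m
```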


Geometry of a Gaussian Vector

Consider a Gaussian vector $\mathcal{N}(m, C)$; recall that lines of equal density are given by:

$$\{x \mid \Delta^2 = (x - m)^\top C^{-1}(x - m) = \text{cst}\}$$

Decompose $C = U \Lambda U^\top$ with $U$ orthogonal; in the 2-D case:

$$C = \begin{pmatrix} | & | \\ u_1 & u_2 \\ | & | \end{pmatrix} \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix} \begin{pmatrix} - & u_1^\top & - \\ - & u_2^\top & - \end{pmatrix}$$

Let $Y = U^\top (x - m)$; then in the coordinate system $(u_1, u_2)$, the lines of equal density are given by

$$\left\{x \,\middle|\, \Delta^2 = \frac{Y_1^2}{\sigma_1^2} + \frac{Y_2^2}{\sigma_2^2} = \text{cst}\right\}$$

[Figure: equal-density ellipse centered at $(\mu_1, \mu_2)$ with principal axes $u_1, u_2$ and radii proportional to $\sigma_1, \sigma_2$]
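A numerical check of this decomposition (our sketch; the matrix $C$ is an arbitrary SPD example):

```python
# Sketch: recover principal axes u_i and radii ~ sigma_i of the
# equal-density ellipse of N(m, C) from the eigendecomposition of C.
import numpy as np

m = np.array([1.0, -2.0])
C = np.array([[2.0, 0.8],
              [0.8, 1.0]])
lam, U = np.linalg.eigh(C)                     # C = U diag(lam) U^T
sigma = np.sqrt(lam)                           # radii proportional to sigma_i
print("axes (columns of U):\n", U, "\nradii ~", sigma)

# Trace the curve Delta^2 = 1: x = m + U diag(sigma) (cos t, sin t)^T
t = np.linspace(0, 2 * np.pi, 200)
ellipse = m[:, None] + (U * sigma) @ np.vstack([np.cos(t), np.sin(t)])
d = ellipse - m[:, None]
# Check: (x - m)^T C^{-1} (x - m) == 1 along the curve
print(np.allclose(np.einsum("it,ij,jt->t", d, np.linalg.inv(C), d), 1.0))
```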


Evolution Strategies


Evolution Strategies
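(The sampling rule itself appears as a figure on the original slide; in the notation of the surrounding slides it reads $x_i = m + \sigma\,\mathcal{N}(0, C)$, our reconstruction, consistent with the (1+1)-ES slide below.)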

In fact, the covariance matrix of the sampling distribution is $\sigma^2 C$, but it is convenient to refer to $C$ as the covariance matrix (it is a covariance matrix, just not that of the sampling distribution).


How to update the different parameters $m$, $\sigma$, $C$?


Update the Mean: a Simple Algorithm, the (1+1)-ES

Notation and terminology:

  • one new solution is sampled at each iteration
  • one solution is kept from one iteration to the next

The "+" means that we keep the best of the current solution and the new solution; we talk about elitist selection.

(1+1)-ES algorithm (update of the mean):

  • sample one candidate solution from the mean $m$: $x = m + \sigma \mathcal{N}(0, C)$
  • if $x$ is better than $m$ (i.e. if $f(x) \leq f(m)$), select $x$: $m \leftarrow x$
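A minimal (1+1)-ES sketch (ours; fixed $\sigma$ and $C = I$ for simplicity, so no step-size or covariance adaptation):

```python
# Minimal (1+1)-ES with fixed step-size sigma and C = I (no adaptation).
import numpy as np

def one_plus_one_es(f, m, sigma=0.3, iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    fm = f(m)
    for _ in range(iters):
        x = m + sigma * rng.standard_normal(m.shape)  # sample around m
        fx = f(x)
        if fx <= fm:                                  # elitist selection
            m, fm = x, fx
    return m

sphere = lambda x: float(np.sum(x**2))
print(one_plus_one_es(sphere, m=np.ones(5)))          # approaches 0
```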


The (1+1)-ES algorithm is a simple algorithm, yet:

  • the elitist selection is not robust to outliers: we cannot lose solutions accepted by "chance", for instance solutions that only look good because noise gave them a low function value
  • there is no population (just a single solution is sampled), which makes it less robust

In practice, one should rather use a $(\mu/\mu, \lambda)$-ES:

  • $\lambda$ solutions are sampled at each iteration
  • the $\mu$ best solutions are selected and recombined (to form the new mean)


The $(\mu/\mu, \lambda)$-ES - Update of the Mean Vector
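The update itself is shown as an image on the original slide; a sketch (ours) of the standard equal-weight version, where the new mean is the average of the $\mu$ best of $\lambda$ samples:

```python
# Sketch: (mu/mu, lambda)-ES mean update with equal recombination weights.
import numpy as np

rng = np.random.default_rng(0)

def mean_update(f, m, sigma, lam=10, mu=3):
    X = m + sigma * rng.standard_normal((lam, m.size))  # sample lambda offspring
    order = np.argsort([f(x) for x in X])               # rank by f-value
    return X[order[:mu]].mean(axis=0)                   # recombine the mu best

sphere = lambda x: float(np.sum(x**2))
m = np.ones(4)
for _ in range(200):
    m = mean_update(sphere, m, sigma=0.2)
print(m)  # close to 0, up to sigma-scale sampling noise
```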


What changes in the previous slide if instead of optimizing $f$, we optimize $g \circ f$, where $g : \operatorname{Im}(f) \to \mathbb{R}$ is strictly increasing?


Invariance Under Monotonically Increasing Functions

Comparison-based/ranking-based algorithms: the update of all parameters uses only the ranking

$$f(x_{1:\lambda}) \leq f(x_{2:\lambda}) \leq \dots \leq f(x_{\lambda:\lambda})$$

and the ranking is invariant under composition with any strictly increasing $g : \operatorname{Im}(f) \to \mathbb{R}$:

$$g(f(x_{1:\lambda})) \leq g(f(x_{2:\lambda})) \leq \dots \leq g(f(x_{\lambda:\lambda}))$$
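A tiny numerical check (ours) that the ranking is unchanged under a strictly increasing $g$:

```python
# Tiny check: the ranking of solutions is invariant under strictly increasing g.
import numpy as np

fvals = np.array([3.2, -1.0, 0.5, 7.8])
g = lambda v: np.exp(v) + v**3        # strictly increasing on R
print(np.argsort(fvals))              # [1 2 0 3]
print(np.argsort(g(fvals)))           # same ranking
```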


A Template for Comparison-based Stochastic Search

Define $\{P_\theta : \theta \in \Theta\}$, a family of probability distributions on $\mathbb{R}^n$.

Generic template to optimize $f : \mathbb{R}^n \to \mathbb{R}$:

Initialize the distribution parameter $\theta$ and set the population size $\lambda \in \mathbb{N}$.

While not terminate:

  • 1. Sample $x_1, \dots, x_\lambda$ according to $P_\theta$
  • 2. Evaluate $x_1, \dots, x_\lambda$ on $f$
  • 3. Rank the solutions: find the permutation $\pi$ such that $f(x_{\pi(1)}) \leq f(x_{\pi(2)}) \leq \dots \leq f(x_{\pi(\lambda)})$
  • 4. Update parameters: $\theta \leftarrow F(\theta, x_1, \dots, x_\lambda, \pi)$
