
Stochastic / Randomized Derivative Free Optimization

Anne Auger (Inria and CMAP, Ecole Polytechnique, IP Paris) Class notes for Optimization Master and AMS Master, Paris Saclay December 7, 2019

Preamble

Those notes are intended for the students following the Derivative Free Optimization class of the Optimization Master of Paris Saclay and of the AMS (Analyse, Modélisation et Simulation) master. The material presented in the lecture does not follow a particular textbook and those notes are there to compensate for the lack of a textbook. I appreciate any feedback. Bonus points will be given to students who find mistakes or typos that will help me improve the notes.


Contents

1 A few Definitions, Reminders and Terminology
  1.1 Reminders and Terminology
  1.2 Definitions
    1.2.1 Argmin and argmax
    1.2.2 Level sets and sublevel sets
    1.2.3 Convex-quadratic functions
    1.2.4 Probability and Statistics
2 Introduction to Black-box Optimization
  2.1 Derivative-free and Black-box Optimization
    2.1.1 Examples
    2.1.2 What is the goal?
    2.1.3 Cost / Runtime of an algorithm
  2.2 What makes an optimization problem difficult?
    2.2.1 Curse of Dimensionality
    2.2.2 Ill-conditioning
    2.2.3 Non-separability
    2.2.4 Multi-modality
    2.2.5 Ruggedness
    2.2.6 Non-xxx

What is this class about?

This class presents derivative-free optimization or black-box methods to optimize numerical problems. We will assume in most of the class an unconstrained minimization problem:

min_x f(x)

where f : Ω ⊂ R^n → R. The class is divided into two parts. The first part is devoted to the presentation of important theoretical concepts and of algorithms that are randomized or stochastic, with a strong focus on algorithms that are often considered as state-of-the-art and belong to the family of Evolution Strategies. The second part is devoted to the presentation of deterministic algorithms. Those notes cover the first part of the class only.


Chapter 1

A few Definitions, Reminders and Terminology

1.1 Reminders and Terminology

1.2 Definitions

1.2.1 Argmin and argmax

Given a function f : R^n → R, we denote by argmin_x f(x) = A the set of points of R^n such that for all x⋆ in A, f(x⋆) ≤ f(x) for all x ∈ R^n. A similar definition holds for the argmax. We might use the somewhat ambiguous terminology of minimum to designate either the minimum function value min{f(x) : x ∈ Ω} or a point of R^n where this minimum is achieved, that is, a point belonging to argmin_x f(x).

1.2.2 Level sets and sublevel sets

Given a function f : x ∈ R^n → f(x) ∈ R, we define the level set of f as L_c := {x ∈ R^n | f(x) = c} for c ∈ R. We define the sublevel set of f as L_c := {x ∈ R^n | f(x) ≤ c} for c ∈ R.



1.2.3 Convex-quadratic functions

Let H be a symmetric positive definite (SPD) matrix. A convex-quadratic function is defined as

f(x) = (1/2) (x − x⋆)^T H (x − x⋆),  x ∈ R^n, x⋆ ∈ R^n .    (1.1)

Convex-quadratic functions play a central role in numerical optimization. They are simple to understand yet allow to model important difficulties like ill-conditioned problems (this notion will be formalized later), such that they often serve as test problems to evaluate and understand the behavior of algorithms. The exercises below, while not presenting any specific difficulty, are central for the understanding of this lecture.

Exercise 1.1 Consider a convex-quadratic function as given in (1.1). Show that

1. x⋆ is the minimum of f.
2. H is the Hessian matrix of f.
3. The function f is convex.
4. The function f has a unique optimum.

Exercise 1.2 Let f(x) = (1/2) x^T H x where x ∈ R^2 and H is the 2 × 2 matrix

H = ( 9  0
      0  1 ) .

1. Plot the level sets of f.
2. Relate the axis ratio of the level sets to the eigenvalues of H.
3. What is changing in the picture of the level sets you have drawn if H is instead H = P^T ( 9  0 ; 0  1 ) P, where P is an orthogonal matrix?
4. More generally, deduce the geometric shape of the level sets of a convex-quadratic function.
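For readers who want to experiment with questions 1 and 2, the following minimal Python sketch (not part of the original exercise; it assumes numpy and matplotlib are available) draws the level sets of f(x) = (1/2) x^T H x and prints the eigenvalues of H together with the resulting axis ratio.

```python
import numpy as np
import matplotlib.pyplot as plt

H = np.array([[9.0, 0.0], [0.0, 1.0]])   # Hessian of f(x) = 1/2 x^T H x

# Evaluate f on a grid of the plane.
xs = np.linspace(-3, 3, 400)
X1, X2 = np.meshgrid(xs, xs)
F = 0.5 * (H[0, 0] * X1**2 + 2 * H[0, 1] * X1 * X2 + H[1, 1] * X2**2)

# The level sets {x : f(x) = c} are ellipses whose axis ratio equals
# sqrt(lambda_max / lambda_min), the square root of the condition number of H.
eigenvalues = np.linalg.eigvalsh(H)
print("eigenvalues:", eigenvalues,
      "axis ratio:", np.sqrt(eigenvalues.max() / eigenvalues.min()))

plt.contour(X1, X2, F, levels=[0.5, 2.0, 4.5, 8.0])
plt.gca().set_aspect("equal")
plt.show()
```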

1.2.4 Probability and Statistics

Mean vector, variance, standard deviation, covariance matrix. Gaussian vectors. Global optimum, local optimum.


Chapter 2

Introduction to Black-box Optimization

2.1 Derivative-free and Black-box Optimization

This class is centered on derivative-free optimization methods where we are interested in minimizing a function f : Ω ⊂ R^n → R but we do not have access to its derivatives, or equivalently to the gradient of f. Those derivatives can either exist in the mathematical sense (i.e. the function is differentiable) but not be easily computable, or the function can be non-differentiable. We will more precisely assume a black-box scenario where the function f to be optimized is seen by the algorithm as a zeroth-order oracle that can be queried: we can give as input to the oracle a point x ∈ R^n and the oracle returns the function value f(x) (a first-order oracle returns f(x) and ∇f(x)). This modelization of the function was notably formalized for analyzing the (query) complexity of classes of algorithms depending on the information that they use and on the class of problems (convex, smooth, ...). We refer to Nesterov [?] and Bubeck [?] for further details. Remark 2.1 With this terminology of black-box optimization, quasi-Newton methods can then be seen as first-order black-box methods.
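As a small illustration of this terminology (a sketch only; the sphere function stands in for the black box and the function names are ours), a zeroth-order oracle returns only f(x), while a first-order oracle also returns the gradient:

```python
import numpy as np

def f(x):                       # the "black box": here the sphere function f(x) = sum_i x_i^2
    return float(np.sum(x**2))

def zeroth_order_oracle(x):     # what a derivative-free algorithm is allowed to query
    return f(x)

def first_order_oracle(x):      # what a gradient-based (e.g. quasi-Newton) method assumes
    return f(x), 2.0 * x        # the gradient of the sphere function is 2x

x = np.array([1.0, -2.0, 0.5])
print(zeroth_order_oracle(x))   # 5.25
print(first_order_oracle(x))    # (5.25, array([ 2., -4.,  1.]))
```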

2.1.1 Examples

Many examples of (real-world) optimization problems fall into this category of black-box problems. In particular, it is very common that the function to be optimized is the result of a simulation (this is sometimes referred to as simulation-based optimization) that can involve the numerical resolution of partial differential equations, etc. Those simulations are typically so complex that we do not want to look into the simulation to extract information to be used within the optimization algorithm; we rather handle the problem as a black-box problem.


The function can also be a real black box. For instance, in the context of industry collaborations, one might be asked to help optimize the design of an object, for instance a part of a car, of a plane, or of a launcher, so as to optimize a certain criterion which can be the production cost (while satisfying certain physical constraints) or the recurrent cost of a launcher [?]. In those cases the function to be optimized can be provided to us as an executable code but we do not have access to its source code. Hence we have to optimize a real black box.

2.1.2 What is the goal?

So far, we have talked about optimizing a numerical optimization problem. What does this precisely mean? First of all, we should keep in mind that locating the exact solution of the problem is typically impossible because of the continuous nature of the search space. Instead, an optimization algorithm will return a sequence of points {x_t : t ∈ N} that converges to the optimum of the problem, denoted x⋆, that is:

lim_{t→∞} x_t = x⋆ .

Equivalently, given a precision ε, the algorithm will aim at returning a solution which approximates x⋆ with precision ε. We can think for the moment of precision in terms of (Euclidean) distance to the optimum, that is, the algorithm tries to find a point x_T such that ‖x_T − x⋆‖ ≤ ε. We will see later on that we should be careful in how we define precision.

Essential optimum of a function We will see in this lecture algorithms that can handle well functions that have discontinuities. Hence, we do not assume that our underlying functions are continuous. Yet, in this context, talking about locating the optimum of a function can be rather meaningless. Consider for instance the 1-dimensional function f(x) = x^2. Its optimum is at 0. Consider now a function h which is equal to f everywhere except at x = 1, where we set h(1) = −2. Then the optimum of h is at 1. Yet, it is safe to say that every reasonable optimization algorithm will converge to x = 0 while optimizing h. We can say that the optimum of h is not robust. We can also say that if such a situation happens for a real-world problem, one will be interested in locating x = 0 and not x = 1, which is an outlier of the function.

Exercise 2.1 Think about an alternative definition of the optimum of a function that would give that the optimum of h is at x = 0.
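A quick numerical observation related to this example (a sketch only, not an answer to Exercise 2.1; the sampling range is arbitrary): if we evaluate h at random points, we essentially never hit x = 1 exactly, so the smallest observed value behaves like the minimum of f and concentrates near 0 rather than near −2.

```python
import numpy as np

def h(x):
    # h equals f(x) = x^2 everywhere except at the single point x = 1
    return -2.0 if x == 1.0 else x**2

rng = np.random.default_rng(0)
samples = rng.uniform(-5.0, 5.0, size=10**6)
values = np.array([h(x) for x in samples])
print(values.min())              # close to 0, never -2: the point x = 1 has measure zero
print(samples[values.argmin()])  # the corresponding sample point is close to 0
```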


In this context, a more reasonable notion of optimum of a function is called the essential optimum. Given a measure µ defined on the search space Ω ⊂ R^n and a measurable function f, the essential minimum of f is defined as the largest α such that the set {x : f(x) < α} has measure zero.

Exercise 2.2 Prove that the essential minimum of h defined above is 0.

We have formalized our optimization problem as a black-box problem. Does it mean that we are interested in a method that can possibly converge to the optimum of “any” black-box problem? The answer is no. The two reasons are that such a method will generally be much too slow in most cases, but also that optimization problems typically have some structure (see below). One such method is the pure random search algorithm. Assume we have a bounded search domain where we know that the optimum of the function we optimize lies. Pure random search consists in sampling independently and uniformly on this search domain and returning the smallest function value found.

Exercise 2.3 (Pure Random Search (PRS)) We consider the following optimization algorithm.

Objective: minimize f : [−1, 1]^n → R. X_t is the estimate of the optimum at iteration t. Input: (U_t)_{t≥1} independent identically distributed, each U_t ∼ U_{[−1,1]^n} (uniformly distributed in [−1, 1]^n).

1. Initialize t = 1, X_1 = U_1
2. while not terminate
3.   t = t + 1
4.   If f(U_t) ≤ f(X_{t−1})
5.     X_t = U_t
6.   Else
7.     X_t = X_{t−1}
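A direct Python transcription of this pseudo-code may help with experimenting on the questions below (a sketch; note that, written this way, it evaluates f twice per iteration, which is precisely what question 4 asks you to avoid).

```python
import numpy as np

def prs(f, n, num_iterations, seed=0):
    """Pure Random Search on [-1, 1]^n, following the pseudo-code above."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n)       # X_1 = U_1
    for _ in range(num_iterations - 1):
        u = rng.uniform(-1.0, 1.0, size=n)   # U_t uniform in [-1, 1]^n
        # Literal transcription: f is called twice per iteration here;
        # question 4 asks for a variant with a single call per iteration.
        if f(u) <= f(x):
            x = u
    return x

f = lambda x: float(np.max(np.abs(x)))       # f(x) = ||x||_inf, as in question 2
print(f(prs(f, n=3, num_iterations=10**4)))  # small, and tends to 0 as the budget grows
```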

1. Show that for all t ≥ 1,

f(X_t) = min{f(U_1), . . . , f(U_t)} .

2. We consider the simple case where f(x) = ‖x‖_∞ (where ‖x‖_∞ := max(|x_1|, . . . , |x_n|)). Show the convergence in probability of the PRS algorithm towards the optimum of f, that is, prove that for all ε > 0,

lim_{t→∞} Pr (‖X_t‖_∞ ≥ ε) = 0 .


Hint: Prove and use the equality {‖X_t‖_∞ ≥ ε} = ∩_{k=1}^{t} {‖U_k‖_∞ ≥ ε}.

3. Let T_ε = inf{t | X_t ∈ [−ε, ε]^n} (with ε > 0) be the first hitting time of [−ε, ε]^n. Show that T_ε follows a geometric distribution with a parameter p that we will determine. Deduce the expected value of T_ε, that is, the expected hitting time of the PRS algorithm.

4. When we implement an optimization algorithm (without derivatives), the cost of the algorithm is the number of calls to the objective function. Write a pseudo-code of the PRS algorithm where at each iteration the objective function f is called only once.

So what would be a more interesting method than pure random search? We will not answer this question immediately but just want to make a few remarks.

1. We have to realize that there is a trade-off between how universal or general the method is and how quickly it can converge:

• a very universal method is the PRS algorithm, which converges under very mild assumptions related to the neighborhood of the global optimum. Yet, as we have illustrated in Exercise 2.3, the convergence is slow.

• a very specific (and useless) method would be an algorithm which always returns the same search point, say x = 0. This algorithm would be the quickest possible on all functions having their optimum in zero and would be pretty useless on all other functions. This algorithm is not translation invariant; we will discuss the importance of invariance in optimization later on.

In the end, both algorithms are unsatisfactory: the first one because it is too universal and too slow, and the second one because it is too specific and gives the right answer on a very small class of functions. Of course those examples are extreme because both algorithms are blind to the problem: they do not try to use the information gained during the search to find the solution (quicker).


By gaining information during the search, we mean that during the course of the optimization, the algorithm is querying points for which it obtains f-values from the black box. The finite sequence of all points queried together with their function values up to iteration t, {(x_k, f(x_k)) : 1 ≤ k ≤ t} (we assume here for the sake of illustration that a single point is queried at each iteration, yet powerful algorithms typically query more than one point at each iteration, as we will see later on), is what is called the information gained about the objective function.

2. The second consequence of what we said above is that we need to be careful about convergence rates. Assessing the global convergence of a method can be rather meaningless with respect to practice if the algorithm is too slow. Additionally, there exist simple modifications of an algorithm to make it converge globally: consider any algorithm and modify it by adding every 10^5 iterations (this number is arbitrary) a pure random search step, that is, sampling uniformly on Ω. If we define x_k as the best solution found until iteration k, then under mild assumptions on f ensuring that the PRS algorithm converges, we obtain the convergence of x_k towards the optimum. This algorithm can typically be 10^5 times slower than PRS, but it does converge. This last example may look artificial, yet it is not uncommon to read publications where a similar modification of a Cool-New algorithm is made to ensure its global convergence. Often the modification does not obviously look practically useless because the authors do not write that they apply the modification every 10^5 iterations, but in effect their argumentation for the global convergence of the Cool-New algorithm relies only on the convergence of the PRS. Yet, they typically state that they prove the global convergence of Cool-New, thus ensuring that Cool-New is really a cool algorithm.

3. The last remark is that even if we assume a black-box setting, there is no reason to believe that we will encounter in real applications all possible black-box functions with the same probability. There are reasons to believe that real-world problems (if they are properly formulated) have some underlying structure that can be exploited by an algorithm to optimize the function. Typically we do not expect that real-world problems are needles in a haystack.

Overall, algorithm-wise, we can formulate the goal as:

Goal: Have algorithms whose runtime (to reach targets close to the optimum) is as small as possible on the class of problems that are relevant in practice.


2.1.3 Cost / Runtime of an algorithm

In the black-box setting, it is very common that the function to evaluate is relatively costly. This means that one function evaluation can take several seconds, minutes or even hours. In this context, the most prominent cost of an algorithm is the number of calls to the objective function. Consequently, when comparing different algorithms, we compare how many times each algorithm calls the black-box function. If an algorithm needs 150 calls to the black box (or equivalently 150 function evaluations) to reach a function value below 10^{-2} and another algorithm needs 10^3 function evaluations to reach the same target of 10^{-2}, then the first algorithm is said to be faster. We call the number of function evaluations to reach a certain target the runtime of the algorithm, though it does not directly relate to the (wall-clock) time you have to wait to reach the target.
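A minimal way to measure this notion of runtime in code (a sketch; the wrapper, the generator-based algorithm interface and all names below are ours, not a standard API) is to count the calls to the black box and record when a target value is first reached:

```python
import numpy as np

class CountingOracle:
    """Wraps a black-box function and counts how often it is queried."""
    def __init__(self, f):
        self.f = f
        self.evaluations = 0
    def __call__(self, x):
        self.evaluations += 1
        return self.f(x)

def runtime_to_target(algorithm, f, target, max_evals):
    """Number of function evaluations needed to reach f(x) <= target."""
    oracle = CountingOracle(f)
    for x in algorithm(oracle, max_evals):    # the algorithm yields its current best point
        if f(x) <= target:                    # checked outside the counter
            return oracle.evaluations
    return np.inf                             # target not reached within the budget

def pure_random_search(oracle, max_evals, n=3, seed=0):
    rng = np.random.default_rng(seed)
    best, best_value = None, np.inf
    for _ in range(max_evals):
        u = rng.uniform(-1.0, 1.0, size=n)
        value = oracle(u)                     # exactly one oracle call per iteration
        if value < best_value:
            best, best_value = u, value
        yield best

sphere = lambda x: float(np.sum(x**2))
print(runtime_to_target(pure_random_search, sphere, target=1e-2, max_evals=10**5))
```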

2.2 What makes an optimization problem difficult?

We discuss in this part well-known difficulties of optimization problems that occur frequently in practical problems. We will see in the rest of the class the strategies encoded into the algorithms to address those difficulties. To address some of these difficulties, it will be natural to use stochastic/randomized approaches; other difficulties need to be addressed by both deterministic and stochastic algorithms.

2.2.1 Curse of Dimensionality

Assume you optimize a 1-dimensional function f : [0, 1] → R. Which simple procedure can you think of to find an approximation of the optimum with precision ε? One possible answer is to discretize the segment [0, 1] and consider the points (x_1, x_2, x_3, . . .) = (ε, 2ε, 3ε, . . .), then evaluate all those points on f, that is, compute f(x_1), f(x_2), f(x_3), . . ., and simply return the smallest function value. This procedure is referred to as grid search or exhaustive search. If ε is small enough, under mild assumptions on f, this procedure will give an approximation of the optimum. How many function evaluations do you need with such a procedure to evaluate all the points of a grid of discretization step 10^{-i}? The answer is 10^i (convince yourself that this is true). This means that for a grid of discretization step 0.01 you will need to evaluate 100 points.
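A minimal sketch of this 1-D grid search (assuming an arbitrary objective f on [0, 1]; the names are ours):

```python
import numpy as np

def grid_search_1d(f, epsilon):
    """Evaluate f on the grid epsilon, 2*epsilon, ... and return the best point found."""
    grid = np.arange(epsilon, 1.0 + epsilon / 2, epsilon)  # about 1/epsilon points in [0, 1]
    values = [f(x) for x in grid]
    best = int(np.argmin(values))
    return grid[best], values[best]

f = lambda x: (x - 0.3)**2
print(grid_search_1d(f, 0.01))   # approximately (0.3, 0.0), using about 100 evaluations
```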


Consider now that you want to optimize a 10-dimensional function f : [0, 1]^{10} → R with the same procedure, that is, discretize the search space and evaluate all the points of the discretization. How many points do you need to place to get a coverage similar to the 1D case, in terms of distance between adjacent points, and find the optimum with precision 0.01? The answer is 100^{10} points (you discretize each coordinate 100 times and evaluate each point of the resulting grid).

Exercise 2.4 Imagine a procedure to find out how long it takes to evaluate 100^{10} points on a simple function on your computer. Find out whether it is realistic to evaluate 100^{10} points.

If it takes 7 seconds to evaluate the function f(x) = Σ_{i=1}^{10} x_i^2 10^6 times (experiment realized in Python on a Mac with several cores in 2017), then evaluating 10^{20} points would take roughly 7 × 10^{14} seconds, that is, on the order of 10^{10} days.
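One possible way to run the timing experiment of Exercise 2.4 on your own machine (a sketch; the 10-dimensional sphere function mirrors the experiment mentioned above, and the measured time will of course differ):

```python
import time
import numpy as np

f = lambda x: float(np.sum(x**2))           # 10-dimensional sphere function

rng = np.random.default_rng(0)
points = rng.uniform(0.0, 1.0, size=(10**6, 10))

start = time.perf_counter()
for x in points:                             # 10**6 evaluations, one point at a time
    f(x)
elapsed = time.perf_counter() - start

# Extrapolate to the 100**10 = 10**20 points of the 10-dimensional grid.
print(f"{elapsed:.1f} s for 1e6 evaluations")
print(f"about {elapsed * 10**14 / 86400:.1e} days for 1e20 evaluations")
```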

What are the consequences for optimization?

• If we place 100 points in the 10-dimensional search space, those points will appear very isolated or sparse in a vast empty space.

• A search procedure like exhaustive search, which is feasible in small dimension (say up to dimension 3), is totally useless in larger dimension (≥ 5). Hence optimization in dimension 5 can already be seen as optimization in large dimension.

What we have described above is related to what is called the curse of dimensionality, a term introduced by Richard Bellman which refers to the problems caused by the rapid (exponential) increase in volume that comes from adding dimensions to a mathematical space.

2.2.2 Ill-conditioning

An important difficulty in optimization is related to ill-conditioning. Consider a convex-quadratic function as defined in Section 1.2.3. We have seen that the level sets of such functions are (hyper-)ellipsoids and that the axis ratio, that is, the ratio between the largest and smallest axes of this ellipsoid, is equal to the square root of the ratio between the largest and smallest eigenvalues of H, the Hessian matrix of the function. This eigenvalue ratio is equal to the condition number of the matrix (with respect to the Euclidean norm).


The condition number of a matrix depends on the particular matrix norm ‖·‖ used. It is defined as

cond(A) = ‖A‖ · ‖A^{-1}‖    (2.1)

for ‖·‖ a matrix norm. If we use the matrix norm induced by the Euclidean norm ‖·‖_2 on R^n, then the condition number of a symmetric positive definite matrix equals the ratio between the largest and smallest eigenvalues of the matrix:

cond_2(A) = λ_max(A) / λ_min(A)    (2.2)

where λ_max and λ_min denote respectively the largest and smallest eigenvalues of a matrix. It can be seen as a measure of how close a matrix is to being singular.

Definition 2.1 An ill-conditioned convex-quadratic problem is associated with a convex-quadratic function whose Hessian matrix has a high condition number. We start talking about ill-conditioning when the condition number is larger than 10^4. Condition numbers of the order of 10^{10} are frequent in practical problems.

Hence an ill-conditioned convex-quadratic function will have level sets that are very elongated along some directions and very narrow along others. Think of the parallel between optimization and climbing/hiking in the mountains, where the goal is to reach the highest top of the surrounding mountains (a local maximum) as fast as possible. An ill-conditioned problem then corresponds to a valley that is very narrow in one direction and very flat in another. Typically, a valley where we need to do 10^5 steps in one direction to increase the altitude by the same amount as one step in an orthogonal direction corresponds to a problem with a condition number of 10^{10}.

Why are ill-conditioned problems difficult? With this parallel in mind between optimization and mountaineering, it becomes clear why ill-conditioned problems are difficult. In black-box optimization, the algorithm is “blind”, i.e. it does not see the full landscape like a hiker would by looking at a topographic map. What it can do is probe different steps and decide where to go. One strategy would be to try to go along the steepest ascent, i.e. follow the gradient or an approximation of the gradient. Yet we know that this strategy is bad on ill-conditioned problems (see the next exercise).

Exercise 2.5 Consider a convex-quadratic function f. Compute ∇f and explain why following −∇f will typically lead to a slow convergence rate on an ill-conditioned function. Give the expression, in terms of H and ∇f, of what would be the optimal direction. This direction is called the Newton direction.
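To see the effect numerically (a small sketch, not a solution of the exercise; the step size 2/(λ_min + λ_max) is the best possible fixed step size for a convex-quadratic function), compare fixed-step gradient descent on a well-conditioned and on an ill-conditioned quadratic:

```python
import numpy as np

def gradient_descent(H, x0, steps):
    """Fixed-step gradient descent on f(x) = 1/2 x^T H x, whose gradient is H x."""
    eigenvalues = np.linalg.eigvalsh(H)
    step_size = 2.0 / (eigenvalues.min() + eigenvalues.max())  # best fixed step size
    x = x0.copy()
    for _ in range(steps):
        x = x - step_size * (H @ x)
    return 0.5 * x @ H @ x                                      # final function value

x0 = np.ones(2)
print(gradient_descent(np.diag([1.0, 1.0]), x0, steps=100))   # condition number 1: reaches 0
print(gradient_descent(np.diag([1.0, 1e4]), x0, steps=100))   # condition number 1e4: barely decreases
```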


Hence, to solve ill-conditioned problems efficiently, an algorithm needs to learn “second-order” information, basically in which directions the problem is flat and in which directions it is steep. In the directions where the problem is flat, the algorithm will need to do steps with a much larger amplitude than in the directions where the problem is steep. This will be at the core of the CMA-ES algorithm that will be presented in the class.

Ill-conditioned problems beyond the convex-quadratic case When the function is not convex-quadratic, we can still talk about ill-conditioned problems by considering the level sets of the function. If the level sets are squeezed in some directions and flatter in others, we can talk about ill-conditioned problems.

Why do ill-conditioned problems occur frequently? Ill-conditioned problems appear frequently because optimization problems typically involve optimizing parameters that have different physical meanings with different orders of magnitude (units). Even when putting all the knowledge we have into encoding the function differently and trying to have variables with the same order of magnitude of change, we can be wrong by several orders of magnitude, leading to ill-conditioned problems.

2.2.3 Non-separability

Non-separability is a difficulty in optimization. To define what a non-separable function is, we first need to define separability. We present here a weak definition of separability. Informally speaking, a separable problem is a problem that can be optimized variable by variable (we can separate the variables to optimize it). Formally, given f : (x_1, . . . , x_n) ∈ R^n → f(x) ∈ R, let us define the 1-D functions

f^i_{(x^i_1, . . . , x^i_n)}(y) = f(x^i_1, . . . , x^i_{i−1}, y, x^i_{i+1}, . . . , x^i_n)    (2.3)

for (x^i_1, . . . , x^i_n) ∈ R^{n−1} (where we use the notation (x^i_1, . . . , x^i_n) = (x^i_1, . . . , x^i_{i−1}, x^i_{i+1}, . . . , x^i_n), i.e. the i-th coordinate is omitted).

Definition 2.2 A function f is separable if for all i, for all (x^i_1, . . . , x^i_n) in R^{n−1} and for all (x̂^i_1, . . . , x̂^i_n) in R^{n−1},

argmin_y f^i_{(x^i_1, . . . , x^i_n)}(y) = argmin_y f^i_{(x̂^i_1, . . . , x̂^i_n)}(y) .

In the case where the optimum is not unique, this equality is in terms of equal sets.

The previous definition means that the optimum of the 1-D functions (2.3) does not depend on which (n−1)-dimensional vector (x^i_1, . . . , x^i_n) is fixed, i.e. all the cuts of the function along the coordinate x_i have the same optimum. This definition implies that we can minimize a separable function by doing n coordinate-wise searches. More formally, the following proposition holds.

Proposition 2.1 Let f be a separable function. Then, for all x^j_i,

argmin f(x_1, . . . , x_n) = ( argmin f^1_{(x^1_2, . . . , x^1_n)}(x_1), . . . , argmin f^n_{(x^n_1, . . . , x^n_{n−1})}(x_n) )    (2.4)

and f can be optimized using n minimizations along the coordinates.

In practice, many problems are non-separable, that is, they do not satisfy Definition 2.2. Informally speaking, we can say that there are “correlations” between variables.

Exercise 2.6 Prove Proposition 2.1.

An important class of separable problems are the additively decomposable functions.

Exercise 2.7 (Additively decomposable functions) Let f(x_1, . . . , x_n) = Σ_{i=1}^{n} h_i(x_i) for h_i : R → R having a unique argmin. Prove that f is separable. We say in this case that f is additively decomposable. While we have defined separability in Definition 2.2, separability is often defined as additive decomposability. This is a much stronger definition of separability.

Exercise 2.8 The Rastrigin function is defined as f_Rastrigin(x) = 10n + Σ_{i=1}^{n} [x_i^2 − 10 cos(2πx_i)]. Show that the Rastrigin function is separable.

Exercise 2.9 Show that a convex-quadratic function is separable if and only if its Hessian matrix is diagonal.

Exercise 2.10 Let f : R^n → R be a separable function, and g : Im(f) → R be a strictly increasing function. Show that g ◦ f is separable.

Exercise 2.11 (Product of positive functions) Let f be the product of positive functions

f(x_1, . . . , x_n) = Π_{i=1}^{n} g_i(x_i)    (2.5)

with g_i : R → R_{≥0}. Show that f is separable.

Transforming a separable problem into a non-separable one If we assume that we have a separable function f, we can easily build a non-separable problem in the following way: take O an orthogonal matrix which is not equal to the identity. Then x → f(Ox) is in general non-separable.
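A small numerical illustration of these two points (a sketch; the coordinate-wise minimizer below is a crude grid search, only meant to illustrate the formula of Proposition 2.1): coordinate-wise minimization recovers the optimum of a separable function, while rotating the coordinate system with an orthogonal matrix O destroys this property in general.

```python
import numpy as np

def rastrigin(x):
    # Separable (Exercise 2.8): optimum at x = 0 with f(0) = 0.
    return 10 * len(x) + float(np.sum(x**2 - 10 * np.cos(2 * np.pi * x)))

def coordinate_wise_argmin(f, x_fixed, grid):
    """Apply the formula of Proposition 2.1: minimize each 1-D cut by grid search."""
    x = np.array(x_fixed, dtype=float)
    result = np.empty_like(x)
    for i in range(len(x)):
        trial = x.copy()
        values = []
        for y in grid:
            trial[i] = y
            values.append(f(trial))
        result[i] = grid[int(np.argmin(values))]
    return result

rng = np.random.default_rng(1)
n = 3
grid = np.linspace(-5.12, 5.12, 2049)          # contains 0 exactly
x_fixed = rng.uniform(-5.12, 5.12, size=n)     # arbitrary values for the fixed coordinates

# Separable case: the assembled coordinate-wise minimizers give the global optimum x = 0.
print(coordinate_wise_argmin(rastrigin, x_fixed, grid))

# Rotated (non-separable) case: the same procedure no longer returns the optimum in general.
O, _ = np.linalg.qr(rng.normal(size=(n, n)))   # a random orthogonal matrix
rotated = lambda x: rastrigin(O @ x)
print(coordinate_wise_argmin(rotated, x_fixed, grid))
```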


It is relatively easy to exploit the separability of a test function without fully realizing it (for instance if some search operators tend to generate solutions along coordinates). At the same time it is easy to build separable test functions. In the past, some classes of algorithms became popular (for instance genetic algorithms, particle swarm optimization algorithms, algorithms using Cauchy distributions) mainly because their good performance was coming from exploiting the separability of test functions. This issue of the bias of testbeds towards separable test problems is nowadays relatively well known.

2.2.4 Multi-modality

Another difficulty in optimization is related to multi-modality, that is (in the context of minimization) having more than one local optimum. Yet, multi-modality is often wrongly diagnosed. It is very common to read in scientific papers that an algorithm fails because it is trapped in a local optimum, while the problem might for instance come from ill-conditioning, which makes an algorithm not equipped with a mechanism to handle ill-conditioned problems so slow that it looks almost stuck.

2.2.5 Ruggedness

Another type of difficulty comes from what can be generically called “ruggedness” of the function. A function will informally be called rugged if it is, for instance, non-smooth, non-differentiable, or non-continuous. The function can also look very noisy (see illustrations during the class).

2.2.6 Non-xxx

Many of the difficulties we have defined are related to “non-” properties (like non-continuity, non-separability, non-differentiability, ...). We should keep in mind that it is much more difficult to address a function that has a “non-” property than one that has the corresponding property, mainly because the class of functions is typically much wider.