Performance Assessment in Optimization - Anne Auger, CMAP & Inria


SLIDE 1

Performance Assessment in Optimization
Anne Auger, CMAP & Inria

SLIDE 2

Visualization and presentation of single runs

SLIDE 3

Displaying 3 runs (three trials)

SLIDE 4

Displaying 3 runs (three trials)

SLIDE 5

Displaying 3 runs (three trials)

SLIDE 6

Displaying 51 runs

SLIDE 7

Which Statistics?

SLIDE 8

More problems with averages / expectations

from Hansen GECCO 2019 Experimentation tutorial

SLIDE 9

Which Statistics?

SLIDE 10

Implications

from Hansen GECCO 2019 Experimentation tutorial

SLIDE 11

Benchmarking Black-Box Optimizers

Benchmarking: running an algorithm on several test functions in order to evaluate the performance of the algorithm

SLIDE 12

Why Numerical Benchmarking?

• Evaluate the performance of optimization algorithms
• Compare the performance of different algorithms
• Understand strengths and weaknesses of algorithms
• Help in the design of new algorithms

SLIDE 13

On performance measures …

SLIDE 14

Performance measure - What to measure?

CPU time (to reach a given target). Drawbacks: it depends on the implementation, the language, and the machine; time is spent on code optimization instead of science.

Testing heuristics: we have it all wrong, J. N. Hooker, Journal of Heuristics, 1995

Prefer an "absolute" measure: the number of function evaluations to reach a given target. Assumption: the internal cost of the algorithm is negligible, or it is measured independently.
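As an illustration, a minimal sketch of this measure (the objective function, the random-search placeholder optimizer and all names below are hypothetical, not part of the tutorial): an evaluation-counting wrapper records the first evaluation at which the target is reached.

```python
import numpy as np

def sphere(x):
    """Hypothetical example objective: the sphere function."""
    return float(np.sum(np.asarray(x) ** 2))

def evals_to_target(optimize, f, target, budget):
    """Run `optimize` once on `f`; return the number of function evaluations
    needed to reach `target`, or None if the budget is exhausted first."""
    count = 0
    hit = None

    def counted_f(x):
        nonlocal count, hit
        count += 1
        value = f(x)
        if hit is None and value <= target:
            hit = count                      # first hitting time in #evaluations
        return value

    optimize(counted_f, budget)              # the optimizer only sees the wrapped function
    return hit

def random_search(f, budget, dim=3):
    """Trivial placeholder optimizer, for illustration only."""
    for _ in range(budget):
        f(np.random.uniform(-5, 5, dim))

print(evals_to_target(random_search, sphere, target=1.0, budget=10_000))
```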
SLIDE 15

On performance measures - Requirements

"Algorithm A is 10/100 times faster than Algorithm B to solve this type of problem"

SLIDE 16

On performance measures - Requirements

Quantitative measures: "Algorithm A is 10/100 times faster than Algorithm B to solve this type of problem"

As opposed to:

Displayed: mean f-value after 3·10^5 f-evaluations (51 runs). Bold: statistically significant. Concluded: "EFWA significantly better than EFWA-NG".

Source: Dynamic search in fireworks algorithm, Shaoqiu Zheng, Andreas Janecek, Junzhi Li and Ying Tan CEC 2014

SLIDE 17

On performance measures - Requirements

A performance measure should be:
• quantitative, with a ratio scale
• well-interpretable, with a meaning relevant in the "real world"
• simple

SLIDE 18

Fixed Cost versus Fixed Budget - Collecting Data

SLIDE 19

Fixed Cost versus Fixed Budget - Collecting Data

Collect, for a given target (or for several targets), the number of function evaluations needed to reach the target. Repeat several times: if the algorithm is stochastic, never draw a conclusion from a single run; if the algorithm is deterministic, repeat by changing (randomly) the initial conditions.
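A short sketch of this protocol, reusing the hypothetical `evals_to_target`, `random_search` and `sphere` helpers from the previous sketch; varying the seed plays the role of random initial conditions.

```python
import numpy as np

n_trials = 15
target = 1.0
budget = 10_000

runtimes = []                    # evaluations-to-target of the successful trials
n_unsuccessful = 0
for seed in range(n_trials):
    np.random.seed(seed)         # vary the (random) initial conditions per trial
    rt = evals_to_target(random_search, sphere, target, budget)
    if rt is None:
        n_unsuccessful += 1      # target not reached within the budget
    else:
        runtimes.append(rt)
```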

SLIDE 20

ECDF: Empirical Cumulative Distribution Function of the Runtime

SLIDE 21

Definition of an ECDF

Let $X_1, \ldots, X_n$ be real random variables. Then the empirical cumulative distribution function (ECDF) is defined as

$$\hat{F}(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{X_i \le t\}}$$
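A minimal sketch of how such an ECDF of runtimes can be computed and plotted with numpy/matplotlib (illustrative code, not the COCO post-processing; the runtime values are made-up example data):

```python
import numpy as np
import matplotlib.pyplot as plt

def ecdf(data, n_total=None):
    """Sorted data points and ECDF values; normalize by `n_total` when given
    (e.g. the total number of trials, if some runs never reached the target)."""
    x = np.sort(np.asarray(data, dtype=float))
    n = n_total if n_total is not None else len(x)
    y = np.arange(1, len(x) + 1) / n           # each data point adds a step of 1/n
    return x, y

runtimes = [1200, 1500, 2100, 2300, 3500, 4200, 8000, 9500]   # example data: 8 successes
x, y = ecdf(runtimes, n_total=10)                              # out of 10 trials
plt.step(x, y, where="post")
plt.xscale("log")
plt.xlabel("number of function evaluations")
plt.ylabel("proportion of runs that reached the target")
plt.show()
```

Normalizing by the total number of trials makes the curve saturate below 1 when not all runs reach the target.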

SLIDE 22

We display the ECDF of the runtime to reach target function values (see next slides for illustrations)

SLIDE 23

A Convergence Graph

SLIDE 24

First Hitting Time is Monotone
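The first hitting time is read off the best-so-far f-values of the convergence graph, which are non-increasing by construction; a small sketch, assuming a run is recorded as (evaluation count, f-value) pairs:

```python
def first_hitting_time(history, target):
    """history: list of (n_evals, f_value) pairs in recording order.
    Return the evaluation count at which the best-so-far f-value first
    reaches `target`, or None if the target is never reached in this run."""
    best = float("inf")
    for n_evals, f_value in history:
        best = min(best, f_value)      # best-so-far is non-increasing
        if best <= target:
            return n_evals             # hence harder targets are hit later (or never)
    return None
```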

SLIDE 25

15 Runs

SLIDE 26


15 Runs ≤ 15 Runtime Data Points

SLIDE 27

Empirical CDF

[Figure: Empirical Cumulative Distribution of the run lengths]

The ECDF of run lengths to reach the target:
• has for each data point a vertical step of constant size
• displays for each x-value (budget) the count of observations to the left (first hitting times)

SLIDE 28

Empirical CDF

[Figure: Empirical Cumulative Distribution of the run lengths]

Possible interpretations:
• 80% of the runs reached the target
• e.g. 60% of the runs need between 2000 and 4000 evaluations

SLIDE 29

Aggregation

15 runs

SLIDE 30

Aggregation

15 runs 50 targets

SLIDE 31

Aggregation

15 runs 50 targets

SLIDE 32

Aggregation

15 runs, 50 targets: ECDF with 750 steps

SLIDE 33

We can aggregate over:

  • different targets
  • different functions and targets

We should not aggregate over dimension, as functions of different dimensions typically have very different runtimes.
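With 15 runs and 50 targets, for example, the aggregated ECDF is built from up to 750 runtime points. A sketch of the aggregation over targets, assuming the hypothetical `first_hitting_time` helper from the earlier sketch and runs recorded as (evaluation count, f-value) histories:

```python
def aggregated_runtimes(histories, targets):
    """First hitting times over all (run, target) pairs, for an aggregated ECDF.
    histories: list of runs, each a list of (n_evals, f_value) pairs.
    targets:   list of target f-values.
    Returns the collected runtimes and the total number of (run, target) pairs,
    which is the normalization constant of the aggregated ECDF."""
    runtimes = []
    for history in histories:
        for target in targets:
            rt = first_hitting_time(history, target)
            if rt is not None:
                runtimes.append(rt)
    return runtimes, len(histories) * len(targets)
```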

SLIDE 34

ECDF aggregated over targets - single function

ECDF for 3 different algorithms

SLIDE 35

ECDF aggregated over targets - single function

ECDF for a single algorithm, different dimensions

SLIDE 36

ERT/ART: Average Runtime

SLIDE 37

Which performance measure?

SLIDE 38

Which performance measure?

SLIDE 39

Expected Running Time (restart algorithm)

$$\mathrm{ERT} = \mathbb{E}[\mathrm{RT}^r] = \frac{1 - p_s}{p_s}\, \mathbb{E}[\mathrm{RT}_{\mathrm{unsuccessful}}] + \mathbb{E}[\mathrm{RT}_{\mathrm{successful}}]$$

Estimator for ERT:

$$\hat{p}_s = \frac{\#\mathrm{succ}}{\#\mathrm{runs}}, \qquad \widehat{\mathrm{RT}}_{\mathrm{unsucc}} = \text{average evaluations of unsuccessful runs}, \qquad \widehat{\mathrm{RT}}_{\mathrm{succ}} = \text{average evaluations of successful runs}$$

$$\mathrm{ART} = \frac{\#\mathrm{evaluations}}{\#\mathrm{successes}}$$
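A minimal sketch of the ART estimator above, assuming the evaluation counts of successful and unsuccessful runs have been collected separately:

```python
def average_runtime(evals_successful, evals_unsuccessful):
    """ART = (total number of evaluations spent over all runs) / (number of successes);
    algebraically equal to mean(RT_succ) + (1 - p_s)/p_s * mean(RT_unsucc)."""
    n_succ = len(evals_successful)
    if n_succ == 0:
        return float("inf")                # no successful run: ART is infinite
    total_evals = sum(evals_successful) + sum(evals_unsuccessful)
    return total_evals / n_succ
```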

SLIDE 40

Example: scaling behavior

[Figure: ART]

SLIDE 41

On Test functions

SLIDE 42

Test functions

A function testbed (set of test functions) should "reflect reality": it should model the typical difficulties one is willing to solve.

Example: the BBOB testbed (implemented in the COCO platform)
• the test functions are mainly non-convex and non-separable
• scalable with the search space dimension
• not too easy to solve, but yet comprehensible

SLIDE 43

Test functions (cont.)

If results are aggregated over all functions of a testbed (through an ECDF):
• one needs to be careful that some difficulties are not over-represented and that not too many easy functions are present

SLIDE 44
The bbob Testbed
• 24 functions in 5 groups
• 6 dimensions: 2, 3, 5, 10, 20, (40 optional)

SLIDE 45

BBOB testbed

Black-Box Optimization Benchmarking test suite: noiseless testbed / noisy testbed

http://coco.gforge.inria.fr/doku.php?id=start

SLIDE 46

COCO platform: automating the benchmarking process
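In code, an experiment on the bbob suite looks roughly like the sketch below, modeled after the COCO example experiment. The module and method names (cocoex.Suite, Observer, observe_with, initial_solution, free) are assumptions to be checked against the COCO version you download, and scipy's Nelder-Mead is only a placeholder solver.

```python
import cocoex                    # COCO experimentation module (ships with the platform)
import scipy.optimize

suite = cocoex.Suite("bbob", "", "")                               # noiseless bbob suite
observer = cocoex.Observer("bbob", "result_folder: my-experiment") # logs data for post-processing

for problem in suite:            # iterates over functions, dimensions and instances
    problem.observe_with(observer)
    scipy.optimize.minimize(problem, problem.initial_solution,
                            method="Nelder-Mead",
                            options={"maxfev": 100 * problem.dimension})
    problem.free()
```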

SLIDE 47

Step 1: download COCO
https://github.com/numbbo/coco

SLIDE 48

Step 2: installation of the post-processing
https://github.com/numbbo/coco

SLIDE 49

Step 3: download data (for the moment: IPOP-CMA-ES)
http://coco.gforge.inria.fr/doku.php?id=algorithms

SLIDE 50

Postprocess: python -m bbob_pproc IPOP-CMA-ES
https://github.com/numbbo/coco