Josu Ceberio
Bayesian Analysis for Algorithm Performance Comparison
Is it possible to compare optimization algorithms without hypothesis testing?
Is there a reproducibility crisis?
Source: Monya Baker (2016), Nature, 533, 452-454
Idea for solving a set of problems efficiently.
Is my algorithm better than the state-of-the-art? On which problems is my algorithm better? Why is my algorithm better (or worse)?
Compare the performance against the state-of-the-art on some benchmark of problems. The analysis of the results should take into account the associated uncertainty.
What conclusions do we draw from the experimentation? How do we answer the formulated questions?
How likely is my proposal to be the best algorithm to solve a problem? How likely is my proposal to be the best algorithm from the compared ones?
STATISTICAL ANALYSIS OF EXPERIMENTAL RESULTS: NULL HYPOTHESIS STATISTICAL TESTING
WHAT NHST COMPUTES
Unknown Behaviour Observed Sample
We assume the null hypothesis: the average performance of the compared methods is the same. Then, the observed difference is computed from the data, and the probability of observing such a difference (or a bigger one) under the null is estimated: the p-value. The p-value refers to the probability of erroneously assuming that there are differences when actually there are none. It is used to measure the magnitude of the difference, as it decreases when the difference increases.
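The NHST workflow described above can be sketched in a few lines; the performance numbers below are hypothetical, and the Wilcoxon signed-rank test stands in for whatever paired test a study would use:

```python
# A minimal sketch of the NHST workflow, assuming paired results of two
# algorithms on the same 10 instances (hypothetical numbers).
from scipy.stats import wilcoxon

alg_a = [100, 95, 102, 98, 97, 101, 99, 96, 103, 94]
alg_b = [105, 99, 104, 101, 100, 106, 102, 98, 107, 97]

# Null hypothesis: the paired differences are symmetric around zero.
stat, p_value = wilcoxon(alg_a, alg_b)
print(f"p-value: {p_value:.4g}")
# A small p-value says the observed difference is unlikely under the
# null; it is NOT the probability that one algorithm beats the other.
```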
WHAT WE WOULD LIKE TO KNOW
Unknown Behaviour Observed Sample
Many alternatives exist to handle the uncertainty associated with empirical results:
Statistical Analysis Handbook: A Comprehensive Handbook of Statistical Concepts, Techniques and Software Tools. Dr Michael J de Smith.
BAYESIAN STATISTICAL ANALYSIS
Unknown Behaviour Observed Sample
The method focuses on estimating relevant information about the underlying performance distribution, a parametric distribution represented by a set of parameters θ. It assesses the distribution of θ conditioned on a sample s drawn from the performance distribution. Instead of having a single probability distribution to model the underlying performance, Bayesian statistics considers all possible distributions and assigns a probability to each.
Posterior distribution ∝ Likelihood function × Prior distribution:
P(θ | s) = P(s | θ) P(θ) / P(s)
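As a minimal illustration of "posterior ∝ likelihood × prior", here is a conjugate Beta-Bernoulli model (not the Plackett-Luce model used later, and with hypothetical data): θ is the unknown probability that algorithm A beats algorithm B in a single run.

```python
# Conjugate Beta-Bernoulli illustration of Bayes' rule (hypothetical data):
# posterior over theta = P(A beats B in one run).
from scipy.stats import beta

wins, losses = 7, 3        # hypothetical head-to-head outcomes
a0, b0 = 1.0, 1.0          # uniform Beta(1, 1) prior over theta
posterior = beta(a0 + wins, b0 + losses)  # conjugate update

print("posterior mean:", posterior.mean())        # (1+7)/(2+10) = 2/3
print("P(theta > 0.5):", 1 - posterior.cdf(0.5))
```

Unlike a p-value, the posterior directly answers the question posed earlier: how likely is my proposal to be the better algorithm.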
HOW DO WE COMPARE MULTIPLE ALGORITHMS?
Minimizing some instances of a problem / minimizing a given instance of a problem.
Performance (minimization) of each algorithm on instances f1–f5, and the ranking σi derived from each column:

Algorithm   f1    f2    f3    f4    f5
GA         100   130    37   566   256
PSO         90    80   352   756   125
ILP        135   135    19   101    89
SA         105    30   100    56   369
GP          95   300    10    57    36
...        ...   ...   ...   ...   ...

σ1 = (3, 1, 5, 4, 2)
σ2 = (3, 2, 4, 1, 5)
σ3 = (3, 5, 2, 4, 1)
σ4 = (4, 5, 3, 1, 2)
σ5 = (4, 3, 2, 5, 1)

The observed sample is the set of rankings (permutations) σ1, …, σ5.
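The step from performance columns to rankings can be sketched as follows; the numbers reuse the f1 and f2 columns above, and `scipy.stats.rankdata` is one possible (assumed) implementation choice:

```python
import numpy as np
from scipy.stats import rankdata

# Rows = instances, columns = algorithms (GA, PSO, ILP, SA, GP);
# lower is better, since we are minimizing.
perf = np.array([[100,  90, 135, 105,  95],   # instance f1
                 [130,  80, 135,  30, 300]])  # instance f2

# Rank 1 = best algorithm on that instance (no ties in this toy data),
# reproducing sigma_1 = (3,1,5,4,2) and sigma_2 = (3,2,4,1,5).
ranks = rankdata(perf, axis=1).astype(int)
print(ranks)
```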
The Plackett-Luce model assigns to a ranking σ (best first) the probability

P(σ | w) = ∏_{i=1}^{n} w_{σ(i)} / (∑_{j=i}^{n} w_{σ(j)})

Posterior distribution of the weights ∝ Likelihood of the sample × Prior distribution of the weights.

Likelihood of a sample of N rankings σ^(1), …, σ^(N):

P(σ^(1), …, σ^(N) | w) = ∏_{k=1}^{N} ∏_{i=1}^{n} w_{σ^(k)(i)} / (∑_{j=i}^{n} w_{σ^(k)(j)})

Dirichlet prior over the weight vector w:

P(w | α) = (Γ(∑_{i=1}^{n} α_i) / ∏_{i=1}^{n} Γ(α_i)) · ∏_{i=1}^{n} w_i^{α_i − 1}

There is no way to sample the posterior distribution exactly → MCMC.
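A direct evaluation of the Plackett-Luce probability of a single ranking can be sketched as follows (a minimal implementation, not the one used in the presentation):

```python
import numpy as np

def plackett_luce_prob(sigma, w):
    """P(sigma | w) = prod_i w[sigma[i]] / sum_{j >= i} w[sigma[j]],
    where sigma lists item indices from best to worst."""
    sigma = np.asarray(sigma)
    w = np.asarray(w, dtype=float)
    prob = 1.0
    for i in range(len(sigma)):
        # Probability that sigma[i] wins among the items not yet ranked.
        prob *= w[sigma[i]] / w[sigma[i:]].sum()
    return prob

# Two algorithms with equal weights: either order has probability 1/2.
print(plackett_luce_prob([0, 1], [0.5, 0.5]))  # 0.5
```

Summed over all n! permutations the probabilities add to 1, which is what makes this n-parameter model a proper distribution over rankings.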
Pipeline:
Run the algorithms Alg1, …, Algn on instances #1, …, #m → Performance Matrix.
Rank the algorithms on each instance → Ranking Matrix.
MCMC sampling of the weight vector (w1, …, wn).
Query the posterior.
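The MCMC step can be sketched with a toy random-walk Metropolis sampler; this is an illustrative stand-in (softmax parameterization, fixed step size, no burn-in handling), not the sampler used in the actual study:

```python
import numpy as np

rng = np.random.default_rng(0)

def pl_loglik(rankings, w):
    """Plackett-Luce log-likelihood of a list of rankings (best first)."""
    ll = 0.0
    for sigma in rankings:
        for i in range(len(sigma)):
            ll += np.log(w[sigma[i]]) - np.log(w[sigma[i:]].sum())
    return ll

def sample_posterior(rankings, n_algs, alpha=1.0, n_samples=2000):
    """Toy random-walk Metropolis over log-weights with a Dirichlet-style
    prior term; a rough stand-in for the MCMC step of the pipeline."""
    def log_post(theta):
        w = np.exp(theta - theta.max())
        w /= w.sum()                              # project onto the simplex
        return (alpha - 1.0) * np.log(w).sum() + pl_loglik(rankings, w)

    theta = np.zeros(n_algs)                      # start from uniform weights
    lp = log_post(theta)
    samples = []
    for _ in range(n_samples):
        proposal = theta + 0.3 * rng.standard_normal(n_algs)
        lp_prop = log_post(proposal)
        if np.log(rng.random()) < lp_prop - lp:   # Metropolis acceptance
            theta, lp = proposal, lp_prop
        w = np.exp(theta - theta.max())
        samples.append(w / w.sum())
    return np.array(samples)

# Five hypothetical rankings in which algorithm 0 always wins: its
# posterior mean weight (probability of being top-ranked) comes out largest.
rankings = [np.array([0, 1, 2])] * 5
post = sample_posterior(rankings, n_algs=3)
print(post.mean(axis=0))
```

Querying the posterior then amounts to summarizing these samples, e.g. the mean weight of each algorithm and percentile-based credible intervals.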
23 FUNCTIONS TO OPTIMIZE
Problem sizes: 4. Metaheuristic algorithms: 11.
Results of 11,132 runs are collected (23 × 4 × 11 × 11).
Estimate the probability of each algorithm being top-ranked.
Analyze the uncertainty about the probabilities.
QUALITATIVE SUMMARY
Similar performance: (1+(λ,λ)) GA, (1+1) EA, (1+1) EA_var., (1+10) EA_log-n., (1+10) EA_norm., (1+10) EA_r/2,2r and fGA.
Extreme performance: vGA and gHC.
Easily treated instances are F1–F6, F8, F11–F13 and F15–F16.
Best solutions found for n=625.
Fixed-target perspective – Record running-time
[Figure: probability of winning for the 11 algorithms ((1+(λ,λ)) GA, (1+1) EA, gHC, (1+10) EA_r/2,2r, (1+10) EA, (1+10) EA_log-n., (1+10) EA_norm., (1+1) EA_var., fGA, vGA, RLS); panels F17, n=625, φ=625 and F19, n=100, φ=100.]
Credible Intervals: only 11 samples to do inference → high uncertainty is expected! The more samples, the lower the uncertainty → credible intervals are tighter!
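The "more samples → tighter credible intervals" point can be illustrated with stand-in posterior samples (Beta draws chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in posterior samples of one algorithm's winning probability:
# a wide posterior (little data) vs a concentrated one (more data).
wide = rng.beta(3, 8, size=2000)
narrow = rng.beta(30, 80, size=2000)

for name, s in [("little data", wide), ("more data", narrow)]:
    lo, hi = np.percentile(s, [5, 95])   # 90% credible interval
    print(f"{name}: [{lo:.2f}, {hi:.2f}], width {hi - lo:.2f}")
```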
INTERPRETABILITY
Fixed-target perspective – Record running-time – Set of easy functions
[Figure: probability of winning for the 11 algorithms; left panel: n=625, all runs; right panel: n=625, median.]
Credible Intervals: set of functions, two paths → (1) take all the runs, (2) take the median of the runs on each instance. gHC is the best in both cases → with more samples the uncertainty is lower.
Fixed-target perspective – Record running-time – Set of non-easy functions
Credible Intervals: good estimations → credible intervals smaller than 0.05. Probabilities are similar → due to overlapping. Uncertainty about which is the best → not due to a limitation of the data, but due to equivalence among the algorithms.
[Figure: probability of winning for the 11 algorithms; n=625, all runs.]
Fixed-budget perspective – Evolution of the winning probability – 90% credibility intervals
[Figure: evolution of the winning probability of the 11 algorithms as the budget grows from 300 to 900 evaluations; F21, n=100.]
gHC is the best, but its winning probability decreases while the rest improve as the budget increases.
Algorithms ranked with average data; Wilcoxon test for pairwise comparisons, and Shaffer's method for p-value correction.
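The frequentist baseline on this slide can be sketched as follows; the data are hypothetical, and since Shaffer's correction is not available in SciPy, the simpler Holm step-down correction is hand-coded as a stand-in:

```python
from itertools import combinations
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(2)
# Hypothetical paired results of three algorithms on 20 instances.
results = {"A": rng.normal(100, 5, 20),
           "B": rng.normal(103, 5, 20),
           "C": rng.normal(110, 5, 20)}

pairs = list(combinations(results, 2))
pvals = [wilcoxon(results[a], results[b]).pvalue for a, b in pairs]

# Holm step-down correction (Shaffer's method additionally exploits the
# logical relations among hypotheses; Holm is a conservative stand-in).
order = np.argsort(pvals)
m = len(pvals)
adjusted, running_max = {}, 0.0
for rank, idx in enumerate(order):
    running_max = max(running_max, min(1.0, (m - rank) * pvals[idx]))
    adjusted[pairs[idx]] = running_max

for pair, p in adjusted.items():
    print(pair, f"adjusted p = {p:.4f}")
```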
BAYESIAN ANALYSIS: ESTIMATED PROBABILITY AND A NOTION OF UNCERTAINTY IN THE FORM OF CREDIBLE INTERVALS
Impact of the prior distribution – Comparison of three different priors
[Figure: winning probability of each algorithm under three priors (Uniform, Empirical, Deceptive); F9, n=100, φ=100.]
Empirical data favours the best-performing algorithms. Negligible effect (even when median values are considered).
Bayesian inference using Plackett-Luce for the analysis of algorithms' performance rankings.
Include it in the practical EC performance-comparison tool set → IOHProfiler.
Strong points: ability to handle multiple algorithms; interpretability; exact description of the uncertainty.
Weaknesses: by aggregating performances into rankings we lose information about the magnitude of the differences; limitations of the Plackett-Luce model → from n! to n parameters; how do we deal with ties?
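On the open question of ties: when two algorithms obtain identical performance on an instance, the raw ranks are no longer a strict permutation, which the Plackett-Luce model requires. A quick sketch of the problem, using `scipy.stats.rankdata` tie-breaking options as an assumed preprocessing choice:

```python
from scipy.stats import rankdata

# Algorithms 0 and 2 tie: the resulting ranks are not a permutation,
# so ties must be broken (e.g. at random) or modeled explicitly before
# fitting a Plackett-Luce model.
perf = [100, 95, 100, 90]
print(rankdata(perf, method="average"))  # ranks: 3.5, 2.0, 3.5, 1.0
print(rankdata(perf, method="min"))      # ranks: 3, 2, 3, 1
```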
scmamp: Statistical Comparison of Multiple Algorithms in Multiple Problems
Josu Ceberio
Thank you very much for your attention!