slide-1
SLIDE 1

Learning near-optimal hyperparameters with minimal overhead

Gellért Weisz, András György, Csaba Szepesvári

Workshop on Automated Algorithm Design (TTIC 2019), August 7, 2019

1 / 22

slide-4
SLIDE 4

Introduction

Problem: find good parameter settings (configurations) for general-purpose solvers.

◮ No structure assumed over the parameter space.

Zillions of practical algorithms ⇔ little theory.

Want theoretical guarantees on the runtime of

◮ the chosen configuration; and
◮ the configuration process.

Goal: find a near-optimal configuration solving a 1 − δ fraction of the problems in the least expected time.

◮ Since some instances (a δ fraction) are hopelessly hard, we don't want to solve those.

2 / 22

slide-10
SLIDE 10

Problem formulation

Given: n configurations, distribution Γ of problem instances.

[Figure: runtime distribution (pdf) of a configuration, showing the tail probability beyond the timeout and the expected capped runtime]

Runtime of the optimal capped configuration: OPTδ = min_i Rδ(i).

Configuration i is (ε, δ)-optimal if Rδ(i) ≤ (1 + ε) OPTδ/2.

Note that OPTδ ≤ OPTδ/2 ≤ OPT0; the gaps can be large!

3 / 22
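The quantities on this slide can be illustrated in code. A minimal sketch (our illustration, not from the talk: the function name and the synthetic runtimes are ours) of estimating the δ-quantile timeout τ and the expected capped runtime Rδ(i) for one configuration from sampled runtimes:

```python
import random

def capped_runtime_stats(runtimes, delta):
    """Estimate the delta-quantile timeout tau and the expected capped
    runtime R_delta(i) for one configuration from sampled runtimes."""
    xs = sorted(runtimes)
    # tau = empirical (1 - delta)-quantile of the runtime distribution
    k = min(len(xs) - 1, int((1 - delta) * len(xs)))
    tau = xs[k]
    # expected capped runtime: runs longer than tau are charged tau
    r_delta = sum(min(x, tau) for x in xs) / len(xs)
    return tau, r_delta

random.seed(0)
# heavy-tailed synthetic runtimes: mostly fast, a few hopeless instances
samples = [random.expovariate(1.0) for _ in range(950)] + [1e6] * 50
tau, r = capped_runtime_stats(samples, delta=0.1)
print(tau, r)
```

Capping at the δ-quantile removes the influence of the hopeless 5% of instances, which would otherwise dominate the uncapped mean.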

slide-11
SLIDE 11

Previous work (before ICML ’19)

4 / 22

slide-13
SLIDE 13

Structured Procrastination

(Kleinberg et al., 2017)

Relaxed goal: Find i with Rδ(i) ≤ (1 + ε)OPT0 Worst-case lower bound: runtime must be at least Ω

  • OPT0 n

ε2δ

  • With probability 1 − ζ, returns an (ε, δ)-optimal configuration in

worst-case time O

  • OPT0

n ε2δ log n log ¯ κ ζε2δ

  • ◮ κ: absolute upper bound on runtimes

Can we remove ¯ κ? Can we improve runtime when problem is easier?

5 / 22

slide-14
SLIDE 14

LEAPSANDBOUNDS

(Weisz et al., 2018)

1. Guess a value θ of OPT, starting from a low value.

2. Test whether Rδ(i) ≤ θ for some configuration i:

◮ For each i, run b = Õ(1/(δε²)) instances with instance-wise timeout τ = 4θ/(3δ); abort if the empirical average exceeds θ.

3. Return the configuration with the smallest mean amongst the successful configurations. If no test succeeded, double θ and continue from Step 2.

[Figure: average runtime budget and its use across different configurations and phases]

6 / 22
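The doubling scheme above can be sketched as a toy Python loop (our simplification: `run_config`, the sample count `b`, and the synthetic benchmark are illustrative stand-ins, not the paper's exact constants):

```python
import random

def leaps_and_bounds(configs, sample_instance, run_config,
                     delta=0.1, eps=0.2, theta0=0.01):
    """Toy sketch of the LEAPSANDBOUNDS doubling scheme.
    run_config(i, instance, timeout) -> runtime capped at timeout."""
    b = max(1, int(1.0 / (delta * eps * eps)))   # runs per configuration
    theta = theta0
    while True:
        means = {}
        for i in configs:
            tau = 4 * theta / (3 * delta)        # instance-wise timeout
            total = 0.0
            for _ in range(b):
                total += run_config(i, sample_instance(), tau)
                if total > theta * b:            # empirical mean exceeds theta
                    break                        # abort this configuration
            else:
                means[i] = total / b             # test succeeded
        if means:                                # some configuration passed
            return min(means, key=means.get)
        theta *= 2                               # no success: double the guess

# usage on a synthetic benchmark where configuration 0 is fastest
random.seed(1)
speeds = [1.0, 3.0, 10.0]
best = leaps_and_bounds(
    configs=range(3),
    sample_instance=lambda: random.expovariate(1.0),
    run_config=lambda i, inst, to: min(speeds[i] * inst, to),
)
print(best)
```

Note the key property: a configuration is abandoned as soon as its running total exceeds the budget θ·b for the current phase, so bad configurations never consume more than one phase's budget.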

slide-20
SLIDE 20

Why does this work?

w.h.p., for any configuration i:

◮ if the runs complete within θ average runtime:
(i) τ is above the δ-quantile for configuration i
(ii) the empirical mean R̄i is ε-close to Rτ(i) = E[X(i, J) ∧ τ], J ∼ Γ

◮ otherwise, Rδ(i) > θ, hence we can safely abandon i for this phase

Thus, if R̄i < θ for some configuration i, then for i∗ = argmin_i R̄i we have Rδ(i∗) ≤ (1 + ε) OPT0 w.h.p.

7 / 22

slide-24
SLIDE 24

Guarantees

Theorem

With high probability, (i) the algorithm finds an (ε, δ)-optimal configuration; (ii) the worst-case runtime is O( OPT0 · n/(ε²δ) · log( n · log(OPT0) / ζ ) ).

Improvement: empirical Bernstein stopping. Stop testing a configuration i when the confidence intervals already indicate that (a) i is not optimal with the given timeout; or (b) i is already estimated with ε accuracy.

Runtime:

O( OPT0 · Σ_{i=1..n} max{ σ²_{i,k} / (ε² R²_{τk}(i)), 1/(εδ), (1/δ) log(1/δ) } · ( log( n · log(OPT0) / ζ ) + log( 1/(ε R_{τk}(i)) ) ) )

Huge improvement if the variances are small: σ²_{i,k} / R²_{τk} ≪ 1/δ.

8 / 22
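The stopping rule relies on a variance-sensitive confidence radius. A minimal sketch in the style of the empirical Bernstein bound of Mnih et al. (2008) — the exact constants, the function name, and the sampling loop are our illustration, not the deck's:

```python
import math
import random

def bernstein_radius(var, rng, n, dlt):
    """Empirical-Bernstein-style confidence radius for the mean of n
    samples in [0, rng] with empirical variance var."""
    log_term = math.log(3.0 / dlt)
    return math.sqrt(2.0 * var * log_term / n) + 3.0 * rng * log_term / n

random.seed(0)
tau = 5.0                       # timeout: samples are capped runtimes in [0, tau]
mean, m2, n = 0.0, 0.0, 0
# keep sampling until the mean is known to 10% relative accuracy
while True:
    x = min(random.expovariate(1.0), tau)
    n += 1
    d = x - mean
    mean += d / n               # Welford's online mean/variance update
    m2 += d * (x - mean)
    if n >= 2:
        var = m2 / (n - 1)
        c = bernstein_radius(var, tau, n, dlt=0.05)
        if c <= 0.1 * max(mean - c, 1e-12):   # relative-accuracy stop
            break
print(n, mean)
```

Because the radius scales with the empirical standard deviation rather than only the range τ, low-variance configurations are certified with far fewer runs, which is exactly the source of the improvement claimed above.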

slide-26
SLIDE 26

Experiments

Configuring the minisat SAT solver (Sorensson and Een, 2005). 1K configurations, 20K nontrivial problem instances. Comparison with Structured Procrastination by Kleinberg et al. (2017). Code and data (83 CPU years' worth, on year-2018 CPUs): https://github.com/deepmind/leaps-and-bounds

[Figure: mean runtime below the δ-quantile (s) for each of the 1000 configurations, sorted, for δ ∈ {0, 0.05, 0.1, 0.2, 0.3, 0.5}]

9 / 22

slide-29
SLIDE 29

Results

ε = 0.2, δ = 0.2, ζ = 0.1. Instead of doubling, use θ := 1.25θ. Runs can be stopped and resumed (i.e., we can 'continue' running on an instance).

[Figure: total time spent running each configuration (s), configurations sorted by mean below the 0.2-quantile; LeapsAndBounds vs. Structured Procrastination]

A 3- to 20-fold improvement in total work (also across different choices of ε and δ)

10 / 22

slide-30
SLIDE 30

Effect of the multiplier of θ

[Figure: total runtime (days) as a function of the θ multiplier (1.0–2.2), for LeapsAndBounds and Structured Procrastination, each with and without resume]

(cost of pause/resume is not modeled)

11 / 22

slide-31
SLIDE 31

Current work (ICML ’19)

12 / 22

slide-33
SLIDE 33

CAPSANDRUNS algorithm

(Weisz et al., 2019)

For all configurations i, in parallel:

Phase I: find τi with tδ(i) ≤ τi ≤ tδ/2(i): run Θ(1/δ) instances in parallel until a 1 − (3/4)δ fraction of them finishes. Abort if this takes too much time.

Phase II: estimate Rτi(i) with ε relative accuracy: run sufficiently many instances with timeout τi until we get an ε-accurate estimate of Rτi(i) ('Bernstein stopping' à la Mnih et al., 2008). Adjust the best-runtime UCB and abort if LCB(i) > UCB.

Return: of the configurations not rejected, select the one with the smallest average capped runtime.

[Figure: parallel run schedule across configurations c1–c4; ✓ = accepted, ✗ = rejected]

13 / 22

slide-37
SLIDE 37

Global variables

1: Set N of n algorithm configurations
2: Precision parameter ε ∈ (0, 1/3)
3: Quantile parameter δ ∈ (0, 1)
4: Failure probability parameter ζ ∈ (0, 1/6)
5: Instance distribution Γ
6: b ← ⌈(481/δ) log(3n/ζ)⌉
7: T ← ∞  ⊲ Time limit, updated continuously by all parallel processes

Algorithm 1 CAPSANDRUNS

1: N′ ← N  ⊲ Pool of competing configurations
2: for configuration i ∈ N, in parallel do
3:   // Phase I:
4:   Run τi ← QUANTILEEST(i)
5:   // Phase II:
6:   if QUANTILEEST(i) aborted then
7:     Remove i from N′
8:   else
9:     Run RUNTIMEEST(i, τi), abort if |N′| = 1
10:    if RUNTIMEEST(i, τi) rejected i then
11:      Remove i from N′
12:    else
13:      Ȳ(i) ← return value of RUNTIMEEST(i, τi)
14:    end if
15:  end if
16: end for
17: return i∗ = argmin_{i∈N′} Ȳ(i) and τ_{i∗}

Algorithm 2 QUANTILEEST

1: Inputs: i
2: Initialize: m ← ⌈(1 − (3/4)δ) b⌉
3: Run configuration i on b instances, in parallel, until m of these complete. Abort if total work ≥ 2Tb.
4: τ ← runtime of the mth completed instance
5: return τ

Algorithm 3 RUNTIMEEST

1: Inputs: i, τ
2: Initialize: j ← 0
3: while True do
4:   Sample the jth instance J from Γ
5:   Let Y be the τ-capped runtime of i on J
6:   Update Ȳ and σ̄², the sample mean and variance
7:   C ← c(σ̄, n, j, ζ, τ)
8:   if Ȳ − C > T then
9:     return reject i
10:  end if
11:  T ← min{T, Ȳ + C}  ⊲ lowest upper confidence bound
12:  if C ≤ (ε/3)(2Ȳ − C) then
13:    return accept i with runtime estimate Ȳ
14:  end if
15:  j ← j + 1
16: end while

14 / 22
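A sequential (non-parallel) Python sketch of the two phases, under simplifying assumptions: `run` stands in for executing a configuration on a sampled instance, the batch size `b` is a toy constant rather than ⌈(481/δ)log(3n/ζ)⌉, and the confidence radius is a plain Hoeffding-style term in place of the Bernstein bound c(σ̄, n, j, ζ, τ):

```python
import math
import random

def quantile_est(runtimes, delta):
    """Phase I sketch: tau = runtime of the m-th fastest of b runs,
    m = ceil((1 - (3/4)*delta) * b)."""
    b = len(runtimes)
    m = math.ceil((1 - 0.75 * delta) * b)
    return sorted(runtimes)[m - 1]

def runtime_est(sample_capped, eps, zeta, tau, t_limit):
    """Phase II sketch: estimate the tau-capped mean runtime to eps
    relative accuracy; reject early if the lower confidence bound
    exceeds the current global limit t_limit."""
    total, j = 0.0, 0
    while True:
        total += sample_capped()
        j += 1
        mean = total / j
        # Hoeffding-style radius, a stand-in for the Bernstein c(...)
        c = tau * math.sqrt(math.log(2.0 / zeta) / (2.0 * j))
        if mean - c > t_limit:
            return None, t_limit                 # reject i
        t_limit = min(t_limit, mean + c)         # tighten the global UCB
        if c <= (eps / 3.0) * (2.0 * mean - c):
            return mean, t_limit                 # accept i

def caps_and_runs(n_configs, run, delta=0.25, eps=0.3, zeta=0.1, b=40):
    t_limit = float("inf")
    best, best_mean = None, float("inf")
    for i in range(n_configs):
        tau = quantile_est([run(i) for _ in range(b)], delta)
        mean, t_limit = runtime_est(lambda: min(run(i), tau),
                                    eps, zeta, tau, t_limit)
        if mean is not None and mean < best_mean:
            best, best_mean = i, mean
    return best

random.seed(2)
speeds = [2.0, 1.0, 4.0]   # configuration 1 is fastest in expectation
winner = caps_and_runs(3, lambda i: speeds[i] * random.expovariate(1.0))
print(winner)
```

The shared `t_limit` is what makes the algorithm competitive: once one configuration has a tight upper confidence bound, slower configurations are rejected after only a handful of runs.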

slide-38
SLIDE 38

CAPSANDRUNS theory

Theorem

With probability 1 − ζ, (i) the algorithm finds an (ε, δ)-optimal configuration; (ii) the total work is Õζ( n · OPTδ/2 · ( 1/δ + max{ σ² / max{ε², ∆²}, r / max{ε, ∆} } ) ).

15 / 22

slide-40
SLIDE 40

Refined result

Gap: ∆i = 1 − OPTδ/2 / Rδ(i).

Variance of R(i, j, τ), j ∼ Γ: σ²τ(i).

Maximum relative variance: σ̂²(i) = sup over τ ∈ [tδ(i), tδ/2(i)] of σ²τ(i) / R²τ(i).

Relative range: r(i) = sup over τ ∈ [tδ(i), tδ/2(i)] of τ / Rτ(i).

Among the set N1 of configurations not rejected by QUANTILEEST, let i∗ = argmin_{i∈N1} Rτi(i).

Theorem

With probability 1 − ζ, (i) the algorithm finds an (ε, δ)-optimal configuration; (ii) the total work is

Õζ( OPTδ/2 · ( n/δ + Σ_{i∈N} max{ max{σ̂²(i), σ̂²(i∗)} / max{ε², ∆i²}, max{r(i), r(i∗)} / max{ε, ∆i} } ) ).

16 / 22
slide-41
SLIDE 41

Experiments I

[Figure: total time spent running each configuration (s), configurations sorted by mean capped runtime; CapsAndRuns vs. LeapsAndBounds vs. Structured Procrastination]

STRUCTURED PROCRASTINATION: 20643 (±5) days
LEAPSANDBOUNDS: 1451 (±83) days
CAPSANDRUNS: 586 (±7) days

17 / 22

slide-42
SLIDE 42

Experiments II: Speedup compared to LEAPSANDBOUNDS

[Figure: heatmap of the speedup of CapsAndRuns over LeapsAndBounds as a function of ε and δ (both ranging over 0.02–0.10); observed speedups range from 0.75× to 65×]

18 / 22

slide-43
SLIDE 43

Recent work (after ICML ’19)

19 / 22

slide-56
SLIDE 56

Structured Procrastination with confidence

(Kleinberg et al., 2019)

Anytime guarantee: for some universal constants c, p > 0, for any (ε, δ) and any t ≥ c · OPT0 · n/(δε²), SPC returns with probability 1 − c·t^(−p) a configuration i such that Rδ(i) ≤ (1 + ε) OPT0.

Making CAPSANDRUNS (more) anytime: fix ε; add an outer loop that decreases δ.

◮ Works because OPT1/2 ≤ OPT1/4 ≤ OPT1/8 ≤ · · · ≤ OPT0
◮ A guarantee against OPT0 is easier to get
◮ Note: the optimal configuration can switch back and forth as δ is decreased!

Questions:

◮ Improving the guarantee from 1/(δε²) to a problem-dependent 1/δ + 1/ε²
◮ A simultaneous guarantee against OPTδ/2 for any (ε, δ)
◮ The algorithm takes ζ vs. it returns ζ ('fixed budget'?)
◮ What guarantees can we get?
◮ Does the new LCB used by SPC help?
◮ Does it make sense to decrease ζ?
◮ Continuous setting?

20 / 22
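The outer-loop idea can be sketched as a hypothetical wrapper (`configurator` stands for any fixed-(ε, δ) procedure such as CAPSANDRUNS; names and structure are ours, not the talk's):

```python
def anytime_caps_and_runs(configurator, eps, budget_steps):
    """Hypothetical anytime wrapper: rerun a fixed-(eps, delta)
    configurator with delta halved each round. Interrupting after any
    completed round yields the guarantee for that round's delta."""
    best = None
    delta = 0.5
    for _ in range(budget_steps):
        # since OPT_{1/2} <= OPT_{1/4} <= ... <= OPT_0, later rounds
        # give guarantees against ever-harder benchmarks
        best = configurator(eps, delta)
        delta /= 2.0
    return best

# toy configurator: just records which (eps, delta) it was called with
result = anytime_caps_and_runs(lambda eps, d: ("config-A", eps, d), 0.2, 4)
print(result)
```

Each round's returned configuration carries a guarantee against OPTδ/2 for that round's δ; because the optimum can switch as δ shrinks, the wrapper keeps only the most recent round's answer.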

slide-57
SLIDE 57

Thank you!

21 / 22

slide-58
SLIDE 58

References I

R. Kleinberg, K. Leyton-Brown, and B. Lucier. Efficiency through procrastination: Approximately optimal algorithm configuration with runtime guarantees. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2017.

R. Kleinberg, K. Leyton-Brown, B. Lucier, and D. Graham. Procrastinating with confidence: Near-optimal, anytime, adaptive algorithm configuration. arXiv preprint arXiv:1902.05454, 2019.

V. Mnih, C. Szepesvári, and J.-Y. Audibert. Empirical Bernstein stopping. In Proceedings of the 25th International Conference on Machine Learning, pages 672–679. ACM, 2008.

N. Sorensson and N. Een. MiniSat v1.13: a SAT solver with conflict-clause minimization. SAT, 2005(53):1–2, 2005.

G. Weisz, A. György, and C. Szepesvári. LeapsAndBounds: A method for approximately optimal algorithm configuration. In Proceedings of the International Conference on Machine Learning (ICML), 2018.

G. Weisz, A. György, and C. Szepesvári. CapsAndRuns: An improved method for approximately optimal algorithm configuration. In Proceedings of the International Conference on Machine Learning (ICML), pages 6707–6715, 2019.

22 / 22