SLIDE 1

Optimizer Benchmarking Needs to Account for Hyperparameter Tuning

Prabhu Teja S*¹,², Florian Mai*¹,², Thijs Vogels², Martin Jaggi², François Fleuret¹,²

¹Idiap Research Institute, ²EPFL, Switzerland. *Equal contribution. {prabhu.teja, florian.mai}@idiap.ch

SLIDE 2

The problem of optimizer evaluation

Figure: Expected loss L(θ) as a function of hyperparameter θ for two optimizers A & B, with their optima θ⋆_A and θ⋆_B marked. Which one do we prefer in practice?

SLIDE 3

The problem of optimizer evaluation

Figure: Expected loss L(θ) as a function of hyperparameter θ for two optimizers A & B, with their optima θ⋆_A and θ⋆_B marked. Which one do we prefer in practice?

  • 1. The absolute performance of the optimizer → L(θ⋆_A), L(θ⋆_B)

  • 2. The difficulty of finding a good hyperparameter configuration ≈ θ⋆_A, θ⋆_B

SLIDE 4

The Problem of Optimizer Evaluation: SGD vs Adam

  • 1. In previous literature, SGD often achieves better peak performance than Adam.

  • 2. We take into account the cost of automatic Hyperparameter Optimization (HPO), and find:

Figure: Probability of being the best optimizer as a function of the hyperparameter optimization budget (# models trained, 10–60), for Adam (only l.r. tuned), Adam (all params. tuned), SGD (tuned l.r., fixed mom. and w.d.), and SGD (l.r. schedule tuned, fixed mom. and w.d.); reported shares are 58%, 17%, 13%, and 12%.

Our method eliminates human biases arising from manual hyperparameter tuning.

SLIDE 5

Revisiting the notion of an Optimizer

Definition: An optimizer is a pair M = (UΘ, pΘ) that applies its update rule U(St; Θ) at each step t, depending on its current state St. Its hyperparameters Θ = (θ1, . . . , θN) have a prior probability distribution pΘ : Θ → ℝ. The prior pΘ should be specified by the optimizer designer; e.g., for Adam, ε > 0 and close to 0 ⟹ ε ∼ Log-uniform(−8, 0).
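To make the definition concrete, here is a minimal sketch (not from the slides) of how a prior pΘ could be encoded with scipy.stats distributions; the ranges below are illustrative placeholders, except for ε ∼ Log-uniform(−8, 0), which follows the slide's example.

```python
# Minimal sketch of an optimizer prior p_Theta; ranges are illustrative,
# only epsilon follows the slide's Log-uniform(-8, 0) example (base-10 exponents).
from scipy.stats import loguniform

adam_prior = {
    "learning_rate": loguniform(1e-4, 1e-1),      # placeholder range
    "one_minus_beta1": loguniform(1e-5, 1e-1),    # so beta1 stays close to 1
    "one_minus_beta2": loguniform(1e-5, 1e-1),
    "epsilon": loguniform(1e-8, 1e0),             # Log-uniform(-8, 0)
}

def sample_configuration(prior, rng=None):
    """Draw one hyperparameter configuration Theta ~ p_Theta."""
    return {name: dist.rvs(random_state=rng) for name, dist in prior.items()}
```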

SLIDE 6

HPO-aware optimizer benchmarking

Algorithm 1: Benchmark with 'expected quality at budget'
Input: optimizer O, cross-task hyperparameter prior pΘ, task T, tuning budget B
Initialize list ← [ ]
for R repetitions do
    Perform random search with budget B:
        S ← sample B elements from pΘ
        list ← [best(S), . . . list]
return mean(list), var(list), or other statistics
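A short Python sketch of Algorithm 1, under the assumption of a hypothetical train_and_evaluate(task, config) routine that trains one model and returns a score to maximise:

```python
import statistics

def expected_quality_at_budget(sample_config, train_and_evaluate, task,
                               budget, repetitions=50):
    """Algorithm 1: expected quality of the best model found by random
    search with `budget` trials, averaged over `repetitions` searches."""
    best_scores = []
    for _ in range(repetitions):
        # one random search: draw `budget` configurations from p_Theta
        scores = [train_and_evaluate(task, sample_config())
                  for _ in range(budget)]
        best_scores.append(max(scores))
    # mean(list), var(list) as in the algorithm
    return statistics.mean(best_scores), statistics.variance(best_scores)
```

In practice one would presumably train each sampled configuration only once and reuse its score across repetitions (bootstrapping over a fixed pool of trained models) rather than retraining inside every repetition.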

SLIDE 7

Calibrated task-independent priors pΘ

Optimizer   Tunable parameters   Cross-task prior
SGD         Learning rate        ??
            Momentum             ??
            Weight decay         ??
            Poly decay (p)       ??
Adagrad     Learning rate        ??
Adam        Learning rate        ??
            β1, β2               ??
            ε                    ??

SLIDE 8

Calibrated task-independent priors pΘ

Optimizer   Tunable parameters   Cross-task prior
SGD         Learning rate        ??
            Momentum             ??
            Weight decay         ??
            Poly decay (p)       ??
Adagrad     Learning rate        ??
Adam        Learning rate        ??
            β1, β2               ??
            ε                    ??

1. Sample a large number of points, and their performance, from a large range of admissible values.
2. Compute a Maximum Likelihood Estimate (MLE) of the prior's parameters using the top 20% best-performing values from the previous step.

SLIDE 9

Calibrated task-independent priors pΘ

Optimizer   Tunable parameters   Cross-task prior
SGD         Learning rate        Log-normal(−2.09, 1.312)
            Momentum             U[0, 1]
            Weight decay         Log-uniform(−5, −1)
            Poly decay (p)       U[0.5, 5]
Adagrad     Learning rate        Log-normal(−2.004, 1.20)
Adam        Learning rate        Log-normal(−2.69, 1.42)
            β1, β2               1 − Log-uniform(−5, −1)
            ε                    Log-uniform(−8, 0)

1. Sample a large number of points, and their performance, from a large range of admissible values.
2. Compute a Maximum Likelihood Estimate (MLE) of the prior's parameters using the top 20% best-performing values from the previous step.
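As an illustration of the calibration step (an implementation assumption, not code from the paper), fitting a log-normal prior to the top 20% of sampled values could look like this:

```python
import numpy as np
from scipy.stats import lognorm

def calibrate_lognormal_prior(values, scores, top_fraction=0.2):
    """Fit a log-normal prior to the best-performing sampled values.

    `values` are hyperparameter samples from a wide admissible range,
    `scores` their measured performance (higher is better)."""
    values, scores = np.asarray(values), np.asarray(scores)
    k = max(1, int(top_fraction * len(values)))
    top = values[np.argsort(scores)[-k:]]                # keep the top 20%
    mu, sigma = np.log(top).mean(), np.log(top).std()    # MLE for a log-normal
    return lognorm(s=sigma, scale=np.exp(mu))            # scipy's parameterisation
```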

SLIDE 10

The importance of Recipes

Optimizer label   Tunable parameters
SGD-MCWC          SGD(γ, µ=0.9, λ=10⁻⁵)
SGD-MCD           SGD(γ, µ=0.9, λ=10⁻⁵) + Poly Decay(p)
SGD-MW            SGD(γ, µ, λ)
Adam-LR           Adam(γ, β1=0.9, β2=0.999, ε=10⁻⁸)
Adam              Adam(γ, β1, β2, ε)

SGD(γ, µ, λ) denotes SGD with learning rate γ, momentum µ, and weight decay coefficient λ. Adagrad(γ) is Adagrad with learning rate γ. Adam(γ, β1, β2, ε) is Adam with learning rate γ, momentum parameters β1, β2, and normalization parameter ε.
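One possible way to encode these recipes for the benchmark (a hypothetical data layout, not from the paper): each label records which hyperparameters are drawn from the cross-task prior and which are held fixed.

```python
# Hypothetical encoding of the recipes above: "tuned" parameters are sampled
# from the cross-task prior, "fixed" parameters use the listed defaults.
RECIPES = {
    "SGD-MCWC": {"tuned": ["learning_rate"],
                 "fixed": {"momentum": 0.9, "weight_decay": 1e-5}},
    "SGD-MCD":  {"tuned": ["learning_rate", "poly_decay_p"],
                 "fixed": {"momentum": 0.9, "weight_decay": 1e-5}},
    "SGD-MW":   {"tuned": ["learning_rate", "momentum", "weight_decay"],
                 "fixed": {}},
    "Adam-LR":  {"tuned": ["learning_rate"],
                 "fixed": {"beta1": 0.9, "beta2": 0.999, "epsilon": 1e-8}},
    "Adam":     {"tuned": ["learning_rate", "beta1", "beta2", "epsilon"],
                 "fixed": {}},
}
```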

SLIDE 11

Performance at a budget

Figure: Test accuracy of Adam-LR, Adam, SGD-MCWC, SGD-MW, and SGD-MCD on CIFAR-10 and on IMDb LSTM, at hyperparameter search budgets of 1, 4, 16, and 64.

SLIDE 12

Summarizing our findings

Figure: Aggregated relative performance vs. # hyperparameter configurations (budget) for Adam, Adam-LR, SGD-MCWC, and SGD-Decay.

Summary statistic: S(o, k) = (1/|P|) Σ_{p∈P} o(k, p) / max_{o′∈O} o′(k, p),

where o(k, p) denotes the expected performance of optimizer o ∈ O on test problem p ∈ P after k iterations of hyperparameter search.
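A small sketch of how S(o, k) could be computed, assuming a nested mapping perf[o][p][k] (a hypothetical layout) holding the expected performance of optimizer o on problem p at budget k:

```python
def aggregated_relative_performance(perf, optimizer, k):
    """S(o, k): mean over test problems of o's expected performance at budget k,
    normalised by the best optimizer on each problem."""
    problems = next(iter(perf.values())).keys()   # assume all optimizers share P
    ratios = [perf[optimizer][p][k] / max(perf[o][p][k] for o in perf)
              for p in problems]
    return sum(ratios) / len(ratios)
```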

SLIDE 13

Our findings

  • 1. Support the hypothesis that adaptive gradient methods are easier to tune than non-adaptive methods.

The substantial value of adaptive gradient methods, specifically Adam, lies in their amenability to hyperparameter search.

SLIDE 14

Our findings

  • 1. Support the hypothesis that adaptive gradient methods are easier to tune than non-adaptive methods.

The substantial value of adaptive gradient methods, specifically Adam, lies in their amenability to hyperparameter search.

  • 2. Tuning optimizers' hyperparameters apart from the learning rate becomes more useful as the available tuning budget goes up.

Even with a relatively large tuning budget, tuning only the learning rate of Adam is the safer choice, as it achieves good results with high probability.

SLIDE 15

THANK YOU
