Safe Grid Search with Optimal Complexity. E. Ndiaye, Riken AIP (PowerPoint presentation)



SLIDE 1

Safe Grid Search with Optimal Complexity

  • E. Ndiaye

Riken AIP Joint work with: T. Le, O. Fercoq, J. Salmon, I. Takeuchi

1 / 7

SLIDE 2

Hyperparameter Tuning

Learning Task: $\hat\beta^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p} f(X_{\text{train}}\,\beta) + \lambda\,\Omega(\beta)$

Evaluation: $E_v(\hat\beta^{(\lambda)}) = \mathcal{L}\big(y_{\text{test}},\, X_{\text{test}}\,\hat\beta^{(\lambda)}\big)$

[Figure: validation curves $\|y_{\text{test}} - X_{\text{test}}\,\hat\beta^{(\lambda)}\|^2$ at machine precision, over the regularization hyperparameter $\lambda \in [\lambda_{\min}, \lambda_{\max}]$.]

How to approximate the best hyperparameter?
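To make the setup concrete, here is a minimal sketch (my own illustration, not the paper's code) of computing a validation curve over a grid of $\lambda$. It uses ridge regression as a simple instance of $f(X_{\text{train}}\beta) + \lambda\Omega(\beta)$, since its solution $\hat\beta^{(\lambda)}$ has a closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.standard_normal((50, 10))
beta_true = np.zeros(10)
beta_true[:3] = 1.0
y_train = X_train @ beta_true + 0.1 * rng.standard_normal(50)
X_test = rng.standard_normal((30, 10))
y_test = X_test @ beta_true + 0.1 * rng.standard_normal(30)

def ridge_solution(X, y, lam):
    # Closed-form minimizer of 0.5 * ||X beta - y||^2 + 0.5 * lam * ||beta||^2
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

lambdas = np.geomspace(1e-3, 1e3, 30)  # grid from lam_min to lam_max
val_curve = [np.linalg.norm(y_test - X_test @ ridge_solution(X_train, y_train, lam))
             for lam in lambdas]
best_lam = lambdas[int(np.argmin(val_curve))]
```

The grid spacing is exactly the issue the talk addresses: too coarse a grid can miss the minimum of the validation curve, too fine a grid wastes computation.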


SLIDE 3

Hyperparameter Tuning

The optimal hyperparameter is given by

$\arg\min_{\lambda \in [\lambda_{\min}, \lambda_{\max}]} E_v(\hat\beta^{(\lambda)}) = \mathcal{L}\big(y_{\text{test}},\, X_{\text{test}}\,\hat\beta^{(\lambda)}\big)$ s.t. $\hat\beta^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p} f(X_{\text{train}}\,\beta) + \lambda\,\Omega(\beta)$

Issues:

  • The objective $\lambda \mapsto E_v(\hat\beta^{(\lambda)})$ is non-smooth and non-convex.
  • Often, it is impractical to evaluate $E_v(\hat\beta^{(\lambda)})$, since $\hat\beta^{(\lambda)}$ can only be computed approximately.


SLIDE 4

Tracking the curve of solutions

$\hat\beta^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p} f(X\beta) + \lambda\,\Omega(\beta)$

Exact Path: for $(f, \Omega)$ = (piecewise quadratic, piecewise linear), the map $\lambda \mapsto \hat\beta^{(\lambda)}$ is piecewise linear (Lars¹ algorithm).
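The piecewise-linear structure is easiest to see in the orthogonal-design special case $X^\top X = I$, where the Lasso solution is given coordinate-wise by soft-thresholding (a standard fact, shown here as my own illustration rather than the Lars algorithm itself):

```python
import numpy as np

def soft_threshold(z, lam):
    # Closed-form Lasso solution when X^T X = I and z = X^T y:
    # each coordinate is piecewise linear in lam, with a kink at
    # lam = |z_j| and identically zero beyond it.
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([3.0, -1.5, 0.5])
path = {lam: soft_threshold(z, lam) for lam in (0.0, 1.0, 2.0)}
```

Following the kinks of such a path, rather than evaluating a fixed grid, is what exact path algorithms do in general.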

¹(Efron et al., 2004)  ²(Mairal and Yu, 2012)  ³(Bousquet and Bottou, 2008)

SLIDE 5

Tracking the curve of solutions

$\hat\beta^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p} f(X\beta) + \lambda\,\Omega(\beta)$

Exact Path: for $(f, \Omega)$ = (piecewise quadratic, piecewise linear), the map $\lambda \mapsto \hat\beta^{(\lambda)}$ is piecewise linear (Lars¹ algorithm).

Drawbacks:

  • Exponential² complexity for the Lasso: $O((3^p + 1)/2)$ linear segments in the worst case.
  • Numerical instabilities.
  • Hard to generalize to other (loss, regularization) pairs.
  • Cannot benefit from early stopping rules³.

¹(Efron et al., 2004)  ²(Mairal and Yu, 2012)  ³(Bousquet and Bottou, 2008)

SLIDE 6

Approximation of the solution path⁴

Training Task: $\hat\beta^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p} f(X\beta) + \lambda\,\Omega(\beta) =: P_\lambda(\beta)$

Suboptimality gap: $P_\lambda(\beta^{(\lambda_t)}) - P_\lambda(\hat\beta^{(\lambda)}) \leq Q_{t, \nabla f^*}\!\left(1 - \frac{\lambda}{\lambda_t}\right)$

[Figure: grid $\lambda_{\max}, \lambda_1, \lambda_2, \dots, \lambda_5, \lambda_{\min}$ with tolerances $\epsilon_c$ and $\epsilon$: upper bound of the duality gap.]

where $Q_{t, \nabla f^*}(\rho) :=$ optimization error at $\lambda_t$ + approximation error$(\lambda, \lambda_t)$.
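For the Lasso instance $P_\lambda(\beta) = \tfrac12\|y - X\beta\|^2 + \lambda\|\beta\|_1$, the duality gap that certifies such a suboptimality bound can be computed from any candidate $\beta$ by rescaling the residual into a dual-feasible point. This is a standard construction; the function below is my own sketch, not the paper's code:

```python
import numpy as np

def lasso_duality_gap(X, y, beta, lam):
    # Primal value: P_lam(beta) = 0.5 * ||y - X beta||^2 + lam * ||beta||_1
    rho = y - X @ beta
    primal = 0.5 * rho @ rho + lam * np.abs(beta).sum()
    # Rescale the residual so that ||X^T theta||_inf <= lam (dual feasibility)
    scale = min(1.0, lam / max(np.abs(X.T @ rho).max(), 1e-12))
    theta = scale * rho
    # Dual value: D_lam(theta) = 0.5 * ||y||^2 - 0.5 * ||y - theta||^2
    dual = 0.5 * y @ y - 0.5 * np.square(y - theta).sum()
    return primal - dual  # >= suboptimality gap of beta at lam
```

The gap is zero at the exact solution and upper-bounds the suboptimality everywhere else, so it can be monitored at any $\lambda$ without knowing $\hat\beta^{(\lambda)}$.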

⁴(Giesen et al., 2012)

SLIDE 7

Bound on the Validation Gap

$\left| E_v(\hat\beta^{(\lambda)}) - E_v(\beta^{(\lambda_t)}) \right| \leq \max_{\beta \in B_\lambda} \mathcal{L}\big(X'\beta,\, X'\beta^{(\lambda_t)}\big)$, where $B_\lambda = \mathrm{Ball}\big(\beta^{(\lambda_t)},\ \text{suboptimality gap on the training}\big) \ni \hat\beta^{(\lambda)}$

→ Approximate the validation path!
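When $P_\lambda$ is $\mu$-strongly convex and the validation loss is the quadratic one from the plots, the radius of $B_\lambda$ and the resulting validation-gap bound can be sketched as follows (my own simplified constants under a strong-convexity assumption, not the paper's exact bound):

```python
import numpy as np

def validation_gap_bound(X_test, train_gap, mu):
    # If P_lam is mu-strongly convex, the training suboptimality gap controls
    # the distance to the optimum: ||beta_t - beta_hat|| <= sqrt(2 * gap / mu),
    # so beta_hat lies in a ball around beta_t of that radius.
    radius = np.sqrt(2.0 * train_gap / mu)
    # For the quadratic loss ||y' - X' beta||, the reverse triangle inequality
    # gives |Ev(beta_hat) - Ev(beta_t)| <= ||X' (beta_hat - beta_t)||
    #                                   <= ||X'||_2 * radius.
    op_norm = np.linalg.norm(X_test, 2)  # largest singular value
    return op_norm * radius
```

Shrinking the training gap (solving more accurately) therefore tightens the certified band around the validation curve.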


SLIDE 8

Bound on the Validation Gap

$\left| E_v(\hat\beta^{(\lambda)}) - E_v(\beta^{(\lambda_t)}) \right| \leq \max_{\beta \in B_\lambda} \mathcal{L}\big(X'\beta,\, X'\beta^{(\lambda_t)}\big)$, where $B_\lambda = \mathrm{Ball}\big(\beta^{(\lambda_t)},\ \text{suboptimality gap on the training}\big) \ni \hat\beta^{(\lambda)}$

→ Approximate the validation path!

[Figure: validation curve $\|y' - X'\beta^{(\lambda)}\|^2$ at machine precision over $\lambda \in [\lambda_{\min}, \lambda_{\max}]$, with low-precision ($\delta_v \times 10$) and high-precision ($\delta_v / 10$) approximations and tolerance $\epsilon_v$.]


SLIDE 9

$\min_{\lambda_t \in \Lambda_{\text{val}}(\epsilon_v)} E_v(\beta^{(\lambda_t)}) - \min_{\lambda \in [\lambda_{\min}, \lambda_{\max}]} E_v(\hat\beta^{(\lambda)}) \leq \epsilon_v$

Code: https://github.com/EugeneNdiaye/safe_grid_search

Let's talk during the poster session ;-)
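The overall scheme can be sketched as an adaptive walk from $\lambda_{\max}$ down to $\lambda_{\min}$, where each step size is chosen just small enough that the certified validation gap stays below $\epsilon_v$ (my own simplified variant, not the paper's exact algorithm):

```python
import numpy as np

def adaptive_grid(lam_max, lam_min, step_ratio):
    # step_ratio(lam) should return the largest r < 1 such that the
    # validation gap between lam and r * lam is certified <= eps_v;
    # here it is left abstract and supplied by the caller.
    grid = [lam_max]
    while grid[-1] > lam_min:
        grid.append(max(lam_min, step_ratio(grid[-1]) * grid[-1]))
    return np.array(grid)

# Toy usage with a constant ratio standing in for the certified step:
grid = adaptive_grid(1.0, 1e-2, step_ratio=lambda lam: 0.5)
```

Every $\lambda \in [\lambda_{\min}, \lambda_{\max}]$ is then within one certified step of a grid point, which is what yields the $\epsilon_v$ guarantee above with an optimal number of grid points.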
