SLIDE 1

Safe Grid Search with Optimal Complexity

Joseph Salmon (http://josephsalmon.eu)
IMAG, Univ Montpellier, CNRS, Montpellier, France

Joint work with:

  • E. Ndiaye (RIKEN, Nagoya)
  • T. Le (RIKEN, Tokyo)
  • O. Fercoq (Institut Polytechnique de Paris)
  • I. Takeuchi (Nagoya Institute of Technology)

1 / 22

SLIDE 2

Simplest model: standard sparse regression

$y \in \mathbb{R}^n$: a signal
$X = [x_1, \dots, x_p] \in \mathbb{R}^{n \times p}$: dictionary of atoms/features

Assumption: the signal is well approximated by a sparse combination $\beta^\star \in \mathbb{R}^p$: $y \approx X\beta^\star$

Objective(s): find $\hat\beta$ such that
  • Estimation: $\hat\beta \approx \beta^\star$
  • Prediction: $X\hat\beta \approx X\beta^\star$
  • Support recovery: $\mathrm{supp}(\hat\beta) \approx \mathrm{supp}(\beta^\star)$

Constraints: large $p$, sparse $\beta^\star$

$$ \underbrace{y}_{y \in \mathbb{R}^n} \;\approx\; \underbrace{[x_1, \dots, x_p]}_{X \in \mathbb{R}^{n \times p}} \cdot \underbrace{\begin{bmatrix} \beta^\star_1 \\ \vdots \\ \beta^\star_p \end{bmatrix}}_{\beta^\star \in \mathbb{R}^p}, \qquad y \approx \sum_{j=1}^{p} \beta^\star_j x_j $$
2 / 22
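To make the setting concrete, here is a minimal NumPy sketch of this generative model; the dimensions, sparsity level, and noise scale are arbitrary choices for illustration, not values used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 50, 200, 5                     # n samples, p features, s non-zero coefficients (assumed)
X = rng.standard_normal((n, p))          # dictionary of atoms/features
beta_star = np.zeros(p)
support = rng.choice(p, size=s, replace=False)
beta_star[support] = rng.standard_normal(s)          # sparse ground truth
y = X @ beta_star + 0.1 * rng.standard_normal(n)     # y ~ X beta_star + noise
```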

SLIDE 3

The ℓ1 penalty: Lasso and variants

Vocabulary: the "Modern least squares" Candès et al. (2008)
  • Statistics: Lasso, Tibshirani (1996)
  • Signal processing variant: Basis Pursuit, Chen et al. (1998)

$$ \hat\beta^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p} \Big( \underbrace{\tfrac{1}{2}\|y - X\beta\|^2}_{\text{data fitting term}} + \underbrace{\lambda \|\beta\|_1}_{\text{sparsity-inducing penalty}} \Big) $$

  • Solutions are sparse (sparsity level controlled by λ)
  • Need to tune/choose λ (the standard is Cross-Validation)
  • Theoretical guarantees: Bickel et al. (2009)
  • Refinements: non-convex approaches, Adaptive Lasso Zou (2006), scale invariance Sun and Zhang (2012), etc.
3 / 22
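As a hedged illustration, the objective above can be solved for a single λ with scikit-learn's Lasso. Note that sklearn minimizes (1/(2·n_samples))·‖y − Xw‖² + α‖w‖₁, so α = λ/n_samples matches the formulation on this slide; X, y are reused from the simulation sketch above.

```python
import numpy as np
from sklearn.linear_model import Lasso

lam = 0.1 * np.max(np.abs(X.T @ y))      # an arbitrary fraction of lambda_max (see SLIDE 9)
clf = Lasso(alpha=lam / X.shape[0], fit_intercept=False, tol=1e-8)
clf.fit(X, y)
print("non-zero coefficients:", int(np.sum(clf.coef_ != 0)))
```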


SLIDE 8

Well... many Lassos are needed

$$ \hat\beta^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1 $$

In practice:
Step 1: compute T solutions on a grid, i.e., compute $\beta^{(\lambda_0)}, \dots, \beta^{(\lambda_{T-1})}$ approximating $\hat\beta^{(\lambda_0)}, \dots, \hat\beta^{(\lambda_{T-1})}$, for some $\lambda_0 > \dots > \lambda_{T-1}$
Step 2: pick the "best" parameter

Questions:
  • performance criterion: how to pick a "best" λ?
      • cross-validation (and variants)
      • SURE (Stein Unbiased Risk Estimation)
      • etc.
  • grid choice: how to design the grid itself?
4 / 22

SLIDE 9

In practice: who does what?

Standard grid (R-glmnet / Python-sklearn): geometric grid
  • $\lambda_0 = \lambda_{\max} := \|X^\top y\|_\infty = \max_{j \in [p]} |\langle x_j, y \rangle|$ (critical value)
  • $\lambda_t = \lambda_{\max} \times 10^{-\delta t/(T-1)}$, with T = 100 and δ = 3
  • $\lambda_{T-1} = \lambda_{\max}/10^{3} =: \lambda_{\min}$

Parameter choice:
  Python-sklearn: vanilla 5-fold Cross-Validation, pick the λ with the smallest mean squared error (averaged over folds)
  R-glmnet: vanilla 10-fold Cross-Validation, pick the largest λ whose error is smaller than the smallest mean squared error (averaged over folds) plus one standard deviation

5 / 22
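A small sketch of this default geometric grid, written directly from the formulas on this slide (the function name is ours):

```python
import numpy as np

def default_grid(X, y, T=100, delta=3):
    """Geometric grid lambda_t = lambda_max * 10**(-delta * t / (T - 1)), t = 0, ..., T-1."""
    lambda_max = np.max(np.abs(X.T @ y))   # ||X^T y||_inf, the critical value
    t = np.arange(T)
    return lambda_max * 10.0 ** (-delta * t / (T - 1))
```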

SLIDE 10

Hold-out cross-validation

From now on: hold-out cross-validation (a single split)
Standard choice: 80% train ($n_{\text{train}}$), 20% test ($n_{\text{test}}$)
  • $X = X_{\text{train}} \cup X_{\text{test}}$
  • $y = y_{\text{train}} \cup y_{\text{test}}$
  • Evaluate the error on the test (validation) set:
$$ E_{\text{test}}(\hat\beta^{(\lambda)}) = \mathcal{L}(y_{\text{test}}, X_{\text{test}} \hat\beta^{(\lambda)}) := \|y_{\text{test}} - X_{\text{test}} \hat\beta^{(\lambda)}\|^2 $$
6 / 22
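Putting Steps 1 and 2 of SLIDE 8 together with this hold-out protocol gives the following minimal sketch; it reuses X, y and default_grid from the earlier sketches, and sklearn's alpha scaling as before.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)  # 80/20 split
grid = default_grid(X_tr, y_tr)
errors = []
for lam in grid:
    beta = Lasso(alpha=lam / X_tr.shape[0], fit_intercept=False).fit(X_tr, y_tr).coef_
    errors.append(np.linalg.norm(y_te - X_te @ beta) ** 2)   # E_test for this lambda
best_lam = grid[int(np.argmin(errors))]                       # Step 2: pick the "best" parameter
```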

SLIDE 11

Some practical examples

  • leukemia(1): n = 72, p = 7129 (gene expression); y: (binary) measure of disease
  • diabetes(2): n = 442, p = 10 (Age, Sex, Body mass index, Average blood pressure, S1, S2, S3, S4, S5, S6); y: a quantitative measure of disease progression one year after baseline

(1) https://sklearn.org/modules/generated/sklearn.datasets.fetch_mldata.html
(2) https://scikit-learn.org/stable/datasets/index.html#diabetes-dataset

7 / 22

SLIDE 12

Example: Training / Testing (leukemia)

(Figure: leukemia data, λ ranging from λ_min to λ_max. Left, training: normalized objective $P_\lambda(\beta)/P_\lambda(0)$ for the exact solution $P_\lambda(\hat\beta^{(\lambda)})$, the shifted exact curve $P_\lambda(\hat\beta^{(\lambda)}) + \epsilon$, and the approximated solution $P_\lambda(\beta^{(\lambda)})$. Right, testing: $\|y_{\text{test}} - X_{\text{test}} \hat\beta^{(\lambda)}\|^2 / \|y_{\text{test}}\|^2$ for the exact and approximate solutions.)
8 / 22


SLIDE 14

Example: Training / Testing (diabetes)

(Figure: diabetes data, λ ranging from λ_min to λ_max. Left, training: normalized objective $P_\lambda(\beta)/P_\lambda(0)$ for the exact solution $P_\lambda(\hat\beta^{(\lambda)})$, the shifted exact curve $P_\lambda(\hat\beta^{(\lambda)}) + \epsilon$, and the approximated solution $P_\lambda(\beta^{(\lambda)})$. Right, testing: $\|y_{\text{test}} - X_{\text{test}} \hat\beta^{(\lambda)}\|^2 / \|y_{\text{test}}\|^2$ for the exact and approximate solutions.)
9 / 22


SLIDE 16

Hyperparameter tuning

  • Learning task:
$$ \hat\beta^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p} \underbrace{f(X_{\text{train}}\beta)}_{\frac{1}{2}\|X_{\text{train}}\beta - y_{\text{train}}\|^2} + \lambda\,\underbrace{\Omega(\beta)}_{\|\beta\|_1} $$
  • Evaluation: $E_{\text{test}}(\hat\beta^{(\lambda)}) = \mathcal{L}(y_{\text{test}}, X_{\text{test}} \hat\beta^{(\lambda)})$

(Figure: validation curves $\|y_{\text{test}} - X_{\text{test}} \beta^{(\lambda)}\|^2$ computed at machine precision on two datasets, as a function of λ ∈ [λ_min, λ_max].)

How to choose the grid of hyperparameters?
10 / 22


SLIDE 18

Hyperparameter tuning as bilevel optimization

The "optimal" hyperparameter is given by
$$ \hat\lambda \in \arg\min_{\lambda \in [\lambda_{\min}, \lambda_{\max}]} E_{\text{test}}(\hat\beta^{(\lambda)}) = \mathcal{L}(y_{\text{test}}, X_{\text{test}} \hat\beta^{(\lambda)}) \quad \text{s.t.} \quad \hat\beta^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p} f(X_{\text{train}}\beta) + \lambda\,\Omega(\beta) $$

Challenges:
  • non-smooth and non-convex objective function
  • costly to evaluate $E_{\text{test}}(\hat\beta^{(\lambda)})$ (e.g., on a dense/continuous grid)
11 / 22


SLIDE 20

Tracking the curve of solutions

$$ \hat\beta^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p} f(X\beta) + \lambda\,\Omega(\beta) =: P_\lambda(\beta) $$

Exact path: for (f, Ω) = (piecewise quadratic, piecewise linear), the map $\lambda \mapsto \hat\beta^{(\lambda)}$ is piecewise linear (LARS(3)).

Drawbacks:
  • Exponential(4) worst-case complexity for the Lasso: $O((3^p + 1)/2)$
  • Numerical instabilities(5)
  • Hard to generalize to other losses / regularizations
  • Cannot benefit from early stopping rules(6)

(3) B. Efron et al. "Least angle regression". In: Ann. Statist. 32.2 (2004). With discussion, and a rejoinder by the authors, pp. 407–499.
(4) J. Mairal and B. Yu. "Complexity analysis of the Lasso regularization path". In: ICML. 2012, pp. 353–360.
(5) Y. Li and Y. Singer. "The Well Tempered Lasso". In: ICML (2018), pp. 3030–3038.
(6) L. Bottou and O. Bousquet. "The tradeoffs of large scale learning". In: NIPS. 2008, pp. 161–168.
12 / 22
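For moderate p, the exact piecewise-linear path referred to above can be obtained with LARS in scikit-learn; a minimal sketch (reusing X, y from the earlier sketches):

```python
from sklearn.linear_model import lars_path

# Returns the kinks of the path (alphas), the order in which features enter (active),
# and the coefficients at each kink (coefs), cf. Efron et al. (2004).
alphas, active, coefs = lars_path(X, y, method="lasso")
```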


SLIDE 22

Aparté: Duality for the Lasso

$$ \hat\theta^{(\lambda)} = \arg\max_{\theta \in \Delta_X} \underbrace{\tfrac{1}{2}\|y\|^2 - \tfrac{\lambda^2}{2}\Big\|\tfrac{y}{\lambda} - \theta\Big\|^2}_{D_\lambda(\theta)} $$

$\Delta_X = \{\theta \in \mathbb{R}^n : \forall j \in [p],\ |x_j^\top \theta| \le 1\}$: dual feasible set

(Figure: toy visualization with n = 2, p = 3; the dual feasible set $\Delta_X$ is the polytope delimited by the hyperplanes $\{\theta : x_j^\top \theta = \pm 1\}$, and $\hat\theta$ is the projection of $y/\lambda$ onto it.)

Projection problem: $\hat\theta^{(\lambda)} = \Pi_{\Delta_X}(y/\lambda)$
13 / 22


SLIDE 25

Duality gap as a stopping criterion

For any primal-dual pair $(\beta, \theta) \in \mathbb{R}^p \times \Delta_X$:
$$ \text{(Dual)} \quad D_\lambda(\theta) \;\le\; D_\lambda(\hat\theta^{(\lambda)}) \;=\; P_\lambda(\hat\beta^{(\lambda)}) \;\le\; P_\lambda(\beta) \quad \text{(Primal)} $$
Duality gap: $\mathrm{gap}_\lambda(\beta, \theta) := P_\lambda(\beta) - D_\lambda(\theta)$ is an upper bound on the suboptimality gap $P_\lambda(\beta) - P_\lambda(\hat\beta^{(\lambda)})$:
$$ \forall \beta,\ \big(\exists\,\theta \in \Delta_X,\ \mathrm{gap}_\lambda(\beta, \theta) \le \epsilon\big) \;\Rightarrow\; P_\lambda(\beta) - P_\lambda(\hat\beta^{(\lambda)}) \le \epsilon $$
i.e., β is an ε-solution whenever $\mathrm{gap}_\lambda(\beta, \theta) \le \epsilon$.

14 / 22
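A hedged sketch of this stopping criterion for the Lasso: the primal value, a feasible dual point obtained by rescaling the residual (a standard construction, not spelled out on the slide), and the resulting gap.

```python
import numpy as np

def lasso_duality_gap(X, y, beta, lam):
    r = y - X @ beta                                         # residual
    theta = r / max(lam, np.max(np.abs(X.T @ r)))            # rescaled so that |x_j^T theta| <= 1
    primal = 0.5 * np.linalg.norm(r) ** 2 + lam * np.linalg.norm(beta, 1)
    dual = 0.5 * np.linalg.norm(y) ** 2 - 0.5 * lam ** 2 * np.linalg.norm(y / lam - theta) ** 2
    return primal - dual                                     # <= eps  =>  beta is an eps-solution
```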

SLIDE 26

Approximate path: adaptive grid(7)

Start: fix the grid's upper (λ_max) and lower (λ_min) bounds.
Quadratic bound: helps build an ε-accurate grid on [λ_min, λ_max]:
$$ P_\lambda(\beta^{(\lambda_t)}) - P_\lambda(\hat\beta^{(\lambda)}) \;\le\; \mathrm{gap}_\lambda(\beta^{(\lambda_t)}, \theta^{(\lambda_t)}) \;\le\; Q_{\lambda_t}\Big(1 - \frac{\lambda}{\lambda_t}\Big) $$
Rem: holds whenever f is strongly convex.

(Figure: adaptive grid $\lambda_{\max} > \lambda_1 > \dots > \lambda_5 > \lambda_{\min}$; the upper bound of the duality gap is kept below ε while each grid point is computed at accuracy $\epsilon_c$.)

(7) J. Giesen et al. "Approximating concavely parameterized optimization problems". In: NIPS. 2012, pp. 2105–2113.

15 / 22
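One way to read this slide as pseudocode: from the current grid point λ_t, move to the smallest λ for which the quadratic upper bound still stays below ε. The sketch below is only illustrative; the exact coefficients of $Q_{\lambda_t}$ depend on the current iterate and are not reproduced here, so `bound` is a placeholder for the map u ↦ $Q_{\lambda_t}(u)$.

```python
def next_lambda(lambda_t, eps, bound, iters=50):
    """Largest step u = 1 - lambda/lambda_t (by bisection) with bound(u) <= eps.

    Assumes bound is non-decreasing on [0, 1], which holds for the quadratic bound of the slide.
    """
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if bound(mid) <= eps:
            lo = mid
        else:
            hi = mid
    return lambda_t * (1.0 - lo)          # next (smaller) grid point
```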


SLIDE 28

Approximation of the validation path

$$ \arg\min_{\lambda \in [\lambda_{\min}, \lambda_{\max}]} E_{\text{test}}(\hat\beta^{(\lambda)}) = \mathcal{L}(y_{\text{test}}, X_{\text{test}} \hat\beta^{(\lambda)}) \quad \text{s.t.} \quad \hat\beta^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p} f(X_{\text{train}}\beta) + \lambda\,\Omega(\beta) $$

Bound on the validation gap(8),(9):
$$ \big| E_{\text{test}}(\hat\beta^{(\lambda)}) - E_{\text{test}}(\beta^{(\lambda_t)}) \big| \;\le\; \max_{\beta \in \mathcal{B}_\lambda} \mathcal{L}\big(X_{\text{test}}\beta,\ X_{\text{test}}\beta^{(\lambda_t)}\big), $$
where $\mathcal{B}_\lambda = \mathrm{Ball}\big(\beta^{(\lambda_t)}, r_t\big) \ni \hat\beta^{(\lambda)}$

Rem: $r_t = \sqrt{\tfrac{2}{\mu}\,\mathrm{gap}\big(\beta^{(\lambda_t)}, \theta^{(\lambda_t)}\big)}$ for a µ-strongly convex $P_\lambda$ (e.g., Enet)

(8) A. Shibagaki et al. "Regularization Path of Cross-Validation Error Lower Bounds". In: NIPS. 2015, pp. 1666–1674.
(9) E. Ndiaye et al. "Gap Safe screening rules for sparsity enforcing penalties". In: J. Mach. Learn. Res. 18.128 (2017), pp. 1–33.

16 / 22
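A hedged sketch of the quantities in this bound, assuming $\mathcal{L}$ is the squared Euclidean distance on predictions and bounding the maximum over the ball with the spectral norm of X_test (these are our assumptions, not the paper's exact constants):

```python
import numpy as np

def validation_gap_bound(X_test, gap, mu):
    r_t = np.sqrt(2.0 * gap / mu)                     # radius: ||beta_hat - beta_t|| <= r_t
    return (np.linalg.norm(X_test, 2) * r_t) ** 2     # bounds max over the ball of ||X_test (beta - beta_t)||^2
```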


SLIDE 31

Testing (Validation) control

Motivation: fix a precision level $\epsilon_v$ on the testing (or validation) set; then calibrate the optimization accuracy ε needed at training to reach this precision.

Theorem. When $P_\lambda$ is a µ-strongly convex function, with the grid construction provided before,
$$ \forall \lambda \in [\lambda_{\min}, \lambda_{\max}],\ \exists\,\lambda_t \in \text{grid}, \quad \big| E_{\text{test}}(\hat\beta^{(\lambda)}) - E_{\text{test}}(\beta^{(\lambda_t)}) \big| \le \epsilon_v, $$
provided the algorithm is run up to precision ε at training, with
$$ \epsilon = \frac{\mu}{2}\Big(\frac{\epsilon_v}{\|X_{\text{test}}\|}\Big)^2 $$

17 / 22
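The calibration in the theorem translates directly into code; here ‖X_test‖ is taken as the spectral norm, which is our reading of the slide's notation.

```python
import numpy as np

def training_precision(X_test, eps_v, mu):
    """Optimization accuracy eps for validation accuracy eps_v: eps = (mu/2) * (eps_v / ||X_test||)^2."""
    return 0.5 * mu * (eps_v / np.linalg.norm(X_test, 2)) ** 2
```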

SLIDE 32

Approximation of the optimal hyperparameter

(Figure: validation curve $\|y' - X'\beta^{(\lambda)}\|^2$ over [λ_min, λ_max] at machine precision, compared with the curves obtained at low precision ($\delta_v \times 10$) and high precision ($\delta_v / 10$); the tolerance $\epsilon_v$ is indicated.)
18 / 22

SLIDE 33

Conclusion

  • Extension to GLMs (more technical, but done)
  • Take-home message: more connections needed between optimization / statistics / learning
  • Future work: What about several parameters? How to handle vanilla CV & variants?

Code: https://github.com/EugeneNdiaye/safe_grid_search
ICML paper: https://arxiv.org/abs/1810.05471

19 / 22

Powered with MooseTeX

SLIDE 34

One last word

“All models are wrong but some come with good open source implementation and good documentation so use those.”

– A. Gramfort

20 / 22

SLIDE 35

References I

  • Bickel, P. J., Y. Ritov, and A. B. Tsybakov. "Simultaneous analysis of Lasso and Dantzig selector". In: Ann. Statist. 37.4 (2009), pp. 1705–1732.
  • Bottou, L. and O. Bousquet. "The tradeoffs of large scale learning". In: NIPS. 2008, pp. 161–168.
  • Candès, E. J., M. B. Wakin, and S. P. Boyd. "Enhancing Sparsity by Reweighted l1 Minimization". In: J. Fourier Anal. Applicat. 14.5-6 (2008), pp. 877–905.
  • Chen, S. S., D. L. Donoho, and M. A. Saunders. "Atomic decomposition by basis pursuit". In: SIAM J. Sci. Comput. 20.1 (1998), pp. 33–61.
  • Efron, B. et al. "Least angle regression". In: Ann. Statist. 32.2 (2004). With discussion, and a rejoinder by the authors, pp. 407–499.
  • Giesen, J. et al. "Approximating concavely parameterized optimization problems". In: NIPS. 2012, pp. 2105–2113.
21 / 22

SLIDE 36

References II

  • Li, Y. and Y. Singer. "The Well Tempered Lasso". In: ICML (2018), pp. 3030–3038.
  • Mairal, J. and B. Yu. "Complexity analysis of the Lasso regularization path". In: ICML. 2012, pp. 353–360.
  • Ndiaye, E. et al. "Gap Safe screening rules for sparsity enforcing penalties". In: J. Mach. Learn. Res. 18.128 (2017), pp. 1–33.
  • Shibagaki, A. et al. "Regularization Path of Cross-Validation Error Lower Bounds". In: NIPS. 2015, pp. 1666–1674.
  • Sun, T. and C.-H. Zhang. "Scaled sparse linear regression". In: Biometrika 99.4 (2012), pp. 879–898.
  • Tibshirani, R. "Regression Shrinkage and Selection via the Lasso". In: J. R. Stat. Soc. Ser. B Stat. Methodol. 58.1 (1996), pp. 267–288.
  • Zou, H. "The adaptive lasso and its oracle properties". In: J. Amer. Statist. Assoc. 101.476 (2006), pp. 1418–1429.

22 / 22