
Safe Grid Search with Optimal Complexity (Joseph Salmon)



  1. Safe Grid Search with Optimal Complexity
     Joseph Salmon, http://josephsalmon.eu, IMAG, Univ Montpellier, CNRS, Montpellier, France.
     Joint work with: E. Ndiaye (RIKEN, Nagoya), T. Le (RIKEN, Tokyo), O. Fercoq (Institut Polytechnique de Paris), I. Takeuchi (Nagoya Institute of Technology).

  2. Simplest model: standard sparse regression
     $y \in \mathbb{R}^n$: a signal; $X = [x_1, \dots, x_p] \in \mathbb{R}^{n \times p}$: dictionary of atoms/features.
     Assumption: the signal is well approximated by a sparse combination $\beta^\star \in \mathbb{R}^p$, i.e., $y \approx X\beta^\star = \sum_{j=1}^{p} \beta^\star_j x_j$.
     Objective(s): find $\hat\beta$ such that
     § Estimation: $\hat\beta \approx \beta^\star$
     § Prediction: $X\hat\beta \approx X\beta^\star$
     § Support recovery: $\mathrm{supp}(\hat\beta) \approx \mathrm{supp}(\beta^\star)$
     Constraints: large $p$, sparse $\beta^\star$.
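A minimal sketch of this setup (not from the slides; sizes and noise level are illustrative): a random design $X$, a sparse $\beta^\star$, and $y$ generated as a noisy sparse combination of the columns of $X$.

```python
# Sketch of the sparse regression setup: y is (approximately) a sparse
# combination of the columns of X. Sizes and noise level are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 100, 1000, 5                       # n samples, large p, k non-zeros
X = rng.standard_normal((n, p))              # dictionary of atoms/features
beta_star = np.zeros(p)
beta_star[rng.choice(p, size=k, replace=False)] = 1.0
y = X @ beta_star + 0.1 * rng.standard_normal(n)   # y ~ X beta_star + noise

print("true support:", np.flatnonzero(beta_star))
```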

  3–7. The ℓ1 penalty: Lasso and variants
     Vocabulary: the "modern least squares", Candès et al. (2008).
     § Statistics: Lasso, Tibshirani (1996)
     § Signal processing variant: Basis Pursuit, Chen et al. (1998)
     $\hat\beta^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p} \underbrace{\tfrac{1}{2}\|y - X\beta\|_2^2}_{\text{data-fitting term}} + \underbrace{\lambda\|\beta\|_1}_{\text{sparsity-inducing penalty}}$
     § Solutions are sparse (sparsity level controlled by $\lambda$)
     § Need to tune/choose $\lambda$ (the standard approach is cross-validation)
     § Theoretical guarantees: Bickel et al. (2009)
     § Refinements: non-convex approaches, Adaptive Lasso (Zou, 2006), scale invariance (Sun and Zhang, 2012), etc.
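As a concrete illustration (a sketch, not taken from the slides), the objective above can be minimized for one fixed $\lambda$ with scikit-learn. Note that sklearn.linear_model.Lasso divides the data-fitting term by the number of samples, so its alpha corresponds to $\lambda / n$ in the slide's notation.

```python
# Hedged sketch: solving the Lasso for a single lambda with scikit-learn.
# sklearn's Lasso minimizes (1/(2n))||y - X b||_2^2 + alpha * ||b||_1,
# hence alpha = lambda / n matches the objective written on the slide.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 1000
X = rng.standard_normal((n, p))
beta_star = np.zeros(p); beta_star[:5] = 1.0
y = X @ beta_star + 0.1 * rng.standard_normal(n)

lam = 0.1 * np.max(np.abs(X.T @ y))          # a fraction of lambda_max
beta_hat = Lasso(alpha=lam / n, fit_intercept=False).fit(X, y).coef_
print("estimated support:", np.flatnonzero(beta_hat))
```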

  8. Well... many Lassos are needed
     $\hat\beta^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$
     In practice:
     Step 1: compute $T$ solutions on a grid, i.e., compute $\beta^{(\lambda_0)}, \dots, \beta^{(\lambda_{T-1})}$ approximating $\hat\beta^{(\lambda_0)}, \dots, \hat\beta^{(\lambda_{T-1})}$, for some $\lambda_0 > \dots > \lambda_{T-1}$.
     Step 2: pick the "best" parameter.
     Questions:
     § Performance criterion: how to pick a "best" $\lambda$? Cross-validation (and variants), SURE (Stein Unbiased Risk Estimation), etc.
     § Grid choice: how to design the grid itself?
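Step 1 in code (a sketch under the same synthetic setup as above, not the authors' implementation): scikit-learn's lasso_path computes approximate solutions for a whole decreasing grid in one call; Step 2 then needs a performance criterion, discussed on the next slides.

```python
# Step 1: approximate Lasso solutions over a decreasing grid of lambdas.
# lasso_path uses the 1/(2n)-scaled data-fitting term, so we pass lambda/n.
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
n, p = 100, 1000
X = rng.standard_normal((n, p))
y = X[:, :5] @ np.ones(5) + 0.1 * rng.standard_normal(n)

lam_max = np.max(np.abs(X.T @ y))
lambdas = np.linspace(lam_max, lam_max / 100, num=10)     # lambda_0 > ... > lambda_{T-1}
_, coefs, _ = lasso_path(X, y, alphas=lambdas / n)        # coefs has shape (p, T)

# Step 2 (deferred): pick the "best" lambda with a criterion such as
# hold-out error, cross-validation, or SURE.
print(coefs.shape)
```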

  9. In practice: who does what?
     Standard grid (R glmnet / Python scikit-learn): geometric grid
     § $\lambda_0 = \lambda_{\max} := \|X^\top y\|_\infty = \max_{j=1,\dots,p} |\langle x_j, y\rangle|$ (critical value)
     § $\lambda_t = \lambda_{\max} \times 10^{-\delta t / (T-1)}$, with $T = 100$ and $\delta = 3$
     § $\lambda_{T-1} = \lambda_{\max} / 10^{3} =: \lambda_{\min}$
     Parameter choice:
     § Python scikit-learn: vanilla 5-fold cross-validation, pick the $\lambda$ with the smallest mean squared error (averaged over folds).
     § R glmnet: vanilla 10-fold cross-validation, pick the largest $\lambda$ whose error stays below the smallest mean squared error (averaged over folds) plus one standard deviation.
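The default grid from this slide, written out as code (a minimal sketch; the helper name geometric_grid is mine, not a library function).

```python
# Standard geometric grid, as described on the slide:
# lambda_max = ||X^T y||_inf, lambda_t = lambda_max * 10^(-delta * t / (T-1)),
# so lambda_{T-1} = lambda_max / 10^delta.
import numpy as np

def geometric_grid(X, y, T=100, delta=3.0):
    lam_max = np.max(np.abs(X.T @ y))        # critical value: beyond it, beta_hat = 0
    t = np.arange(T)
    return lam_max * 10.0 ** (-delta * t / (T - 1))

# lambdas = geometric_grid(X_train, y_train)  # descending from lambda_max to lambda_max/1000
```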

  10. Hold-out cross-validation
     From now on: hold-out cross-validation (one single split).
     Standard choice: 80% train ($n_{\text{train}}$), 20% test ($n_{\text{test}}$).
     § $X = X_{\text{train}} \cup X_{\text{test}}$
     § $y = y_{\text{train}} \cup y_{\text{test}}$
     § Error measured on the test (validation) set:
       $E_{\text{test}}(\hat\beta^{(\lambda)}) = \mathcal{L}(y_{\text{test}}, X_{\text{test}}\hat\beta^{(\lambda)}) := \|y_{\text{test}} - X_{\text{test}}\hat\beta^{(\lambda)}\|$ (or $\|y_{\text{test}} - X_{\text{test}}\hat\beta^{(\lambda)}\|^2$).
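A sketch of this single 80/20 split and of the validation error (synthetic data stand in for $X$ and $y$; train_test_split is scikit-learn's).

```python
# Single 80/20 hold-out split and hold-out error, as on the slide.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 1000))
y = X[:, :5] @ np.ones(5) + 0.1 * rng.standard_normal(100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

def test_error(beta):
    """E_test(beta) = ||y_test - X_test @ beta||_2 (the squared variant is also used)."""
    return np.linalg.norm(y_test - X_test @ beta)
```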

  11. Some practical examples
     § leukemia (1): $n = 72$, $p = 7129$ (gene expression); $y$ is a binary measure of disease.
     § diabetes (2): $n = 442$, $p = 10$ (age, sex, body mass index, average blood pressure, S1, S2, S3, S4, S5, S6); $y$ is a quantitative measure of disease progression one year after baseline.
     (1) https://sklearn.org/modules/generated/sklearn.datasets.fetch_mldata.html
     (2) https://scikit-learn.org/stable/datasets/index.html#diabetes-dataset
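The diabetes example loads directly from scikit-learn (a sketch; the leukemia footnote points to the older fetch_mldata interface, which has been removed from recent scikit-learn releases, so that dataset is not shown here).

```python
# Loading the diabetes dataset (n = 442, p = 10) with scikit-learn.
from sklearn.datasets import load_diabetes

X_diab, y_diab = load_diabetes(return_X_y=True)
print(X_diab.shape)   # (442, 10)
```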

  12–13. Example: Training / Testing (leukemia)
     [Figure: top panel "Training" plots the normalized objective $P_\lambda(\beta)/P_\lambda(0)$ over $[\lambda_{\min}, \lambda_{\max}]$ for the exact solution $P_\lambda(\hat\beta^{(\lambda)})$, the exact solution shifted by $\epsilon$, and the approximated solution $P_\lambda(\beta^{(\lambda)})$; bottom panel "Testing" plots $\|y_{\text{test}} - X_{\text{test}}\hat\beta^{(\lambda)}\|_2 / \|y_{\text{test}}\|_2$ for the exact and approximated solutions.]

  14–15. Example: Training / Testing (diabetes)
     [Figure: same two panels as above, on the diabetes data.]

  16–17. Hyperparameter tuning
     § Learning task: $\hat\beta^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p} f(X_{\text{train}}\beta) + \lambda\,\Omega(\beta)$, e.g., $f(X_{\text{train}}\beta) = \tfrac{1}{2}\|X_{\text{train}}\beta - y_{\text{train}}\|^2$ and $\Omega(\beta) = \|\beta\|_1$.
     § Evaluation: $E_{\text{test}}(\hat\beta^{(\lambda)}) = \mathcal{L}(y_{\text{test}}, X_{\text{test}}\hat\beta^{(\lambda)})$.
     [Figure: two validation curves at machine precision, $\|y_{\text{test}} - X_{\text{test}}\beta^{(\lambda)}\|_2$ as a function of the regularization hyperparameter $\lambda$ over $[\lambda_{\min}, \lambda_{\max}]$.]
     How to choose the grid of hyperparameters? (A sweep over the standard grid is sketched below.)
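A sketch of the evaluation step, tying together the grid, the split, and the Lasso solver from the earlier sketches: sweep $\lambda$ over the geometric grid, fit on the training part, and record the hold-out error. The solver tolerance is tightened here only to mimic the "machine precision" curves on the slide.

```python
# Validation curve sketch: hold-out error along the geometric grid,
# with a tight solver tolerance to approximate the "machine precision" curve.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 1000))
y = X[:, :5] @ np.ones(5) + 0.1 * rng.standard_normal(100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

n_tr = X_tr.shape[0]
lam_max = np.max(np.abs(X_tr.T @ y_tr))
lambdas = lam_max * 10.0 ** (-3.0 * np.arange(100) / 99)   # T = 100, delta = 3

lasso = Lasso(fit_intercept=False, tol=1e-10, max_iter=50_000, warm_start=True)
errors = []
for lam in lambdas:                      # warm start: reuse the previous solution
    lasso.set_params(alpha=lam / n_tr)
    lasso.fit(X_tr, y_tr)
    errors.append(np.linalg.norm(y_te - X_te @ lasso.coef_))

best_lam = lambdas[int(np.argmin(errors))]
print(f"best lambda on the grid: {best_lam:.4f}")
```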

  18–19. Hyperparameter tuning as bilevel optimization
     The "optimal" hyperparameter is given by
     $\hat\lambda \in \arg\min_{\lambda \in [\lambda_{\min}, \lambda_{\max}]} E_{\text{test}}(\hat\beta^{(\lambda)}) = \mathcal{L}(y_{\text{test}}, X_{\text{test}}\hat\beta^{(\lambda)})$
     s.t. $\hat\beta^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p} f(X_{\text{train}}\beta) + \lambda\,\Omega(\beta)$.
     Challenges:
     § non-smooth and non-convex objective function
     § costly to evaluate $E_{\text{test}}(\hat\beta^{(\lambda)})$ (e.g., on a dense/continuous grid)
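One way to read this bilevel problem in code (a conceptual sketch, not the method of the paper): treat the outer objective as a scalar function of $\lambda$, where each evaluation solves the inner Lasso problem, and hand it to a 1-D optimizer over $[\lambda_{\min}, \lambda_{\max}]$. This naive approach ignores the non-smoothness and non-convexity listed above, which is precisely why grid-based and approximation-path strategies are studied instead.

```python
# Conceptual sketch of the bilevel formulation: the outer objective evaluates
# the hold-out error of the inner Lasso solution for a given lambda.
# This is NOT the paper's algorithm; it is a naive 1-D search over lambda.
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 1000))
y = X[:, :5] @ np.ones(5) + 0.1 * rng.standard_normal(100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

n_tr = X_tr.shape[0]
lam_max = np.max(np.abs(X_tr.T @ y_tr))
lam_min = lam_max / 1000

def outer_objective(log_lam):
    """E_test(beta_hat(lambda)): solve the inner problem, return the hold-out error."""
    lam = 10.0 ** log_lam
    beta = Lasso(alpha=lam / n_tr, fit_intercept=False).fit(X_tr, y_tr).coef_
    return np.linalg.norm(y_te - X_te @ beta)

res = minimize_scalar(outer_objective, method="bounded",
                      bounds=(np.log10(lam_min), np.log10(lam_max)))
print("selected lambda:", 10.0 ** res.x)
```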
