High-dimensional regression with unknown variance


  1. High-dimensional regression with unknown variance
Christophe Giraud, Ecole Polytechnique, March 2012

  2. Setting
Gaussian regression with unknown variance:
◮ $Y_i = f_i + \varepsilon_i$ with $\varepsilon_i$ i.i.d. $\sim \mathcal{N}(0, \sigma^2)$
◮ $f = (f_1, \ldots, f_n)^*$ and $\sigma^2$ are unknown
◮ we want to estimate $f$
Ex 1: sparse linear regression
◮ $f = X\beta$ with $\beta$ "sparse" in some sense and $X \in \mathbb{R}^{n \times p}$, possibly with $p > n$
Ex 2: non-parametric regression
◮ $f_i = F(x_i)$ with $F : \mathcal{X} \to \mathbb{R}$
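A minimal simulation of this setting, for reference in the later slides (a sketch; the dimensions n, p, the sparsity k and the value of σ are arbitrary choices, not taken from the talk):

```python
# Y_i = f_i + eps_i with eps_i i.i.d. N(0, sigma^2); here f = X beta0 is sparse
# and sigma is unknown to the statistician (only Y and X are observed).
import numpy as np

rng = np.random.default_rng(0)
n, p, k, sigma = 100, 500, 5, 2.0          # p > n, k-sparse signal
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)             # columns normalized to 1 (used later for the Lasso)
beta0 = np.zeros(p)
beta0[:k] = 3.0                            # arbitrary sparse coefficients
f = X @ beta0
Y = f + sigma * rng.standard_normal(n)     # only (Y, X) are handed to the estimators
```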

  3. A plethora of estimators
Sparse linear regression
◮ Coordinate sparsity: Lasso, Dantzig, Elastic-Net, Exponential-Weighting, projection on subspaces {V_λ : λ ∈ Λ} given by PCA, Random Forest, etc.
◮ Structured sparsity: Group-Lasso, Fused-Lasso, Bayesian estimators, etc.
Non-parametric regression
◮ Spline smoothing, Nadaraya kernel smoothing, kernel ridge estimators, nearest neighbors, L²-basis projection, Sparse Additive Models, etc.

  4. Important practical issues
Which estimator should be used?
◮ Sparse regression: Lasso? Random Forest? Exponential-Weighting?
◮ Non-parametric regression: kernel regression (with which kernel)? spline smoothing?
Which "tuning" parameter?
◮ which penalty level for the Lasso?
◮ which bandwidth for kernel regression?
◮ etc.

  5. The objective
Difficulties
◮ No procedure is universally better than the others.
◮ A sensible choice of the tuning parameters depends on
  ◮ some unknown characteristics of f (sparsity, smoothness, etc.),
  ◮ the unknown variance σ².
Ideal objective
◮ Select the "best" estimator among a collection $\{\hat f_\lambda,\ \lambda \in \Lambda\}$.
(Alternative objective: combine the estimators as well as possible.)

  6. Impact of not knowing the variance

  7. Impact of the unknown variance?
Case of coordinate-sparse linear regression.
[Figure: minimax prediction risk over k-sparse signals as a function of k, comparing the case where σ or k is known with the case where both σ and k are unknown; the curves separate in the ultra-high-dimensional regime 2k log(p/k) ≥ n.]

  8. Ultra-high dimensional phenomenon
Theorem (N. Verzelen, EJS 2012)
When σ² is unknown, there exist designs X of size n × p such that any estimator $\hat\beta$ fulfills either
$\sup_{\sigma^2 > 0} \mathbb{E}\big[\|X(\hat\beta - 0_p)\|^2\big] > C_1\, n\, \sigma^2,$
or
$\sup_{\beta_0\ k\text{-sparse},\ \sigma^2 > 0} \mathbb{E}\big[\|X(\hat\beta - \beta_0)\|^2\big] > C_2\, k \log(p/k) \exp\!\Big(C_3 \tfrac{k}{n}\log(p/k)\Big)\, \sigma^2.$
Consequence
When σ² is unknown, the best we can expect is a bound of the form
$\mathbb{E}\big[\|X(\hat\beta - \beta_0)\|^2\big] \le C \inf_{\beta \neq 0}\Big\{ \|X(\beta - \beta_0)\|^2 + \|\beta\|_0 \log(p)\, \sigma^2 \Big\}$
for any σ² > 0 and any β₀ fulfilling 1 ≤ ‖β₀‖₀ ≤ C′ n / log(p).
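A quick numerical illustration of where the ultra-high-dimensional frontier 2k log(p/k) ≥ n kicks in, for a few arbitrary choices of (n, p) (a sketch; the only ingredient is the frontier quoted on the previous slide):

```python
# Smallest sparsity k at which 2 k log(p/k) >= n, for a few (n, p).
import numpy as np

for n, p in [(100, 500), (100, 5000), (50, 10000)]:
    ks = np.arange(1, n + 1)
    frontier = ks[2 * ks * np.log(p / ks) >= n]
    k_star = frontier.min() if frontier.size else None
    print(f"n={n:5d}, p={p:5d}: ultra-high-dimensional regime for k >= {k_star}")
```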

  9. Some generic selection schemes

  10. Cross-Validation
◮ Hold-out
◮ V-fold CV
◮ Leave-q-out
Penalized empirical loss
◮ Penalized log-likelihood (AIC, BIC, etc.)
◮ Plug-in criteria (with Mallows' C_p, etc.)
◮ Slope heuristic
Approximation versus complexity penalization
◮ LinSelect
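As an illustration of the first family, a minimal V-fold cross-validation sketch for choosing among a collection of estimators; `fit_predict` is a hypothetical callable standing for any of the procedures above, and the fold construction is the simplest possible one:

```python
# Pick, among a list of fitting procedures, the one with smallest V-fold CV error.
import numpy as np

def v_fold_select(estimators, X, Y, V=10, seed=0):
    """Each element of `estimators` is a callable (X_train, Y_train, X_test) -> predictions."""
    n = len(Y)
    folds = np.random.default_rng(seed).permutation(n) % V   # balanced random fold labels
    cv_err = np.zeros(len(estimators))
    for v in range(V):
        train, test = folds != v, folds == v
        for j, fit_predict in enumerate(estimators):
            pred = fit_predict(X[train], Y[train], X[test])
            cv_err[j] += np.sum((Y[test] - pred) ** 2)
    return int(np.argmin(cv_err)), cv_err / n                # selected index, mean CV errors
```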

  11. LinSelect (Y. Baraud, C. G. & S. Huet)
Ingredients
◮ a collection S of linear spaces (for approximation)
◮ a weight function Δ : S → R₊ (measure of complexity)
Criterion: residuals + approximation + complexity
$\mathrm{Crit}(\hat f_\lambda) = \inf_{S \in \hat{\mathcal S}} \Big\{ \|Y - \Pi_S \hat f_\lambda\|^2 + \tfrac{1}{2}\|\hat f_\lambda - \Pi_S \hat f_\lambda\|^2 + \mathrm{pen}_\Delta(S)\, \hat\sigma^2_S \Big\}$
where
◮ $\hat{\mathcal S} \subset$ S, possibly data-dependent,
◮ $\Pi_S$ is the orthogonal projector onto S,
◮ $\mathrm{pen}_\Delta(S) \asymp \dim(S) \vee 2\Delta(S)$ when $\dim(S) \vee 2\Delta(S) \le 2n/3$,
◮ $\hat\sigma^2_S = \|Y - \Pi_S Y\|^2 / (n - \dim(S))$.
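A schematic transcription of this criterion (a sketch under the notation above, not the authors' implementation; the collection of spaces, the weights Δ(S) and the exact calibration of pen_Δ are left as inputs):

```python
# `spaces` is a list of (P_S, dim_S, Delta_S), with P_S the n x n orthogonal
# projector onto S; `pen_Delta` is assumed to behave like dim(S) v 2*Delta(S).
import numpy as np

def linselect_crit(f_hat, Y, spaces, pen_Delta):
    """Crit(f_hat) = inf_S { ||Y - P_S f_hat||^2 + 0.5 ||f_hat - P_S f_hat||^2
                             + pen_Delta(S) * sigma_hat_S^2 }."""
    n = len(Y)
    best = np.inf
    for P_S, dim_S, Delta_S in spaces:
        proj = P_S @ f_hat
        sigma2_S = np.sum((Y - P_S @ Y) ** 2) / (n - dim_S)   # sigma_hat_S^2
        crit = (np.sum((Y - proj) ** 2)
                + 0.5 * np.sum((f_hat - proj) ** 2)
                + pen_Delta(dim_S, Delta_S) * sigma2_S)
        best = min(best, crit)
    return best

def select(estimates, Y, spaces, pen_Delta):
    """Pick the estimator f_hat_lambda with the smallest LinSelect criterion."""
    crits = [linselect_crit(f, Y, spaces, pen_Delta) for f in estimates]
    return int(np.argmin(crits))
```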

  12. Non-asymptotic risk bound
Assumptions
1. $1 \le \dim(S) \vee 2\Delta(S) \le 2n/3$ for all S ∈ S,
2. $\sum_{S \in \mathcal S} e^{-\Delta(S)} \le 1$.
Theorem (Y. Baraud, C. G., S. Huet)
$\mathbb{E}\big[\|f - \hat f_{\hat\lambda}\|^2\big] \le C\, \mathbb{E}\Big[\inf_{\lambda \in \Lambda}\Big\{ \|f - \hat f_\lambda\|^2 + \inf_{S \in \hat{\mathcal S}}\big( \|\hat f_\lambda - \Pi_S \hat f_\lambda\|^2 + [\dim(S) \vee \Delta(S)]\,\sigma^2 \big)\Big\}\Big]$
The bound also holds in deviation.

  13. Sparse linear regression

  14. Instantiation of LinSelect
Estimators
Linear regressors $\hat f_\lambda = X \hat\beta_\lambda$, λ ∈ Λ (e.g. Lasso, Exponential-Weighting, etc.)
Approximation and complexity
◮ $\mathcal{S} = \big\{ \mathrm{range}(X_J) : J \subset \{1, \ldots, p\},\ 1 \le |J| \le n/(3 \log p) \big\}$
◮ $\Delta(S) = \log\binom{p}{\dim(S)} + \log(\dim(S)) \approx \dim(S) \log(p)$.
Subcollection $\hat{\mathcal S}$
We set $\hat S_\lambda = \mathrm{range}\big(X_{\mathrm{supp}(\hat\beta_\lambda)}\big)$ and define $\hat{\mathcal S} = \{\hat S_\lambda,\ \lambda \in \hat\Lambda\}$, where $\hat\Lambda = \{\lambda \in \Lambda : \hat S_\lambda \in \mathcal{S}\}$.
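The weight Δ(S) has a one-line translation; a small sketch, assuming S = range(X_J) with dim(S) = |J| = d:

```python
# Delta(S) = log C(p, d) + log d, which is of order d*log(p) for small d.
from math import comb, log

def delta_weight(p, d):
    return log(comb(p, d)) + log(d)

# e.g. p = 500, d = 5: compare delta_weight(500, 5) with 5 * log(500)
```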

  15. Case of the Lasso estimators
Lasso estimators
$\hat\beta_\lambda = \underset{\beta}{\mathrm{argmin}}\ \big\{ \|Y - X\beta\|^2 + 2\lambda \|\beta\|_1 \big\}, \quad \lambda > 0$
Parameter tuning: theory
For X with columns normalized to 1, $\lambda \asymp \sigma \sqrt{2\log(p)}$.
Parameter tuning: practice
◮ V-fold CV
◮ BIC criterion
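For the "practice" bullet, a minimal sketch of 10-fold CV tuning of the Lasso with scikit-learn (an assumed toolchain, not the one used in the talk); note that scikit-learn normalises its objective by 1/(2n), so its `alpha` differs from the λ above by a constant factor:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n, p, sigma = 100, 200, 1.0
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)                  # columns normalized to 1
beta0 = np.zeros(p)
beta0[:5] = 2.0
Y = X @ beta0 + sigma * rng.standard_normal(n)

lasso_cv = LassoCV(cv=10, fit_intercept=False).fit(X, Y)
print("alpha selected by 10-fold CV:", lasso_cv.alpha_)
# Theoretical guideline of the slide (requires the unknown sigma):
print("lambda ~ sigma * sqrt(2 log p) =", sigma * np.sqrt(2 * np.log(p)))
```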

  16. Recent criteria pivotal with respect to the variance
◮ ℓ₁-penalized log-likelihood (Städler, Bühlmann, van de Geer)
$(\hat\beta^{LL}_\lambda, \hat\sigma^{LL}_\lambda) := \underset{\beta \in \mathbb{R}^p,\ \sigma' > 0}{\mathrm{argmin}} \Big\{ n \log(\sigma') + \frac{\|Y - X\beta\|^2}{2\sigma'^2} + \lambda \frac{\|\beta\|_1}{\sigma'} \Big\}$
◮ ℓ₁-penalized Huber's loss (Belloni et al., Antoniadis)
$(\hat\beta^{SR}_\lambda, \hat\sigma^{SR}_\lambda) := \underset{\beta \in \mathbb{R}^p,\ \sigma' > 0}{\mathrm{argmin}} \Big\{ \frac{n\sigma'}{2} + \frac{\|Y - X\beta\|^2}{2\sigma'} + \lambda \|\beta\|_1 \Big\}$
Equivalent to the Square-Root Lasso (introduced earlier):
$\hat\beta^{SR}_\lambda = \underset{\beta \in \mathbb{R}^p}{\mathrm{argmin}} \Big\{ \|Y - X\beta\|_2 + \frac{\lambda}{\sqrt{n}} \|\beta\|_1 \Big\}$
Sun & Zhang: optimization with a single LARS call.
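A naive alternating sketch of the joint minimisation above (this is not the single-LARS-call algorithm of Sun & Zhang; it simply alternates the Lasso step in β at a fixed noise level with the explicit minimisation in σ'):

```python
import numpy as np
from sklearn.linear_model import Lasso

def sqrt_lasso(X, Y, lam, n_iter=20):
    """Approximate argmin over (beta, sigma') of
       n*sigma'/2 + ||Y - X beta||^2 / (2 sigma') + lam * ||beta||_1."""
    n, p = X.shape
    sigma = np.std(Y)                      # crude initial noise level
    beta = np.zeros(p)
    for _ in range(n_iter):
        # beta-step: sklearn's Lasso minimises ||Y - Xb||^2/(2n) + alpha*||b||_1,
        # which matches the beta-step above with alpha = lam * sigma / n.
        beta = Lasso(alpha=lam * sigma / n, fit_intercept=False,
                     max_iter=10000).fit(X, Y).coef_
        # sigma-step: explicit minimiser sigma' = ||Y - X beta||_2 / sqrt(n)
        sigma = np.linalg.norm(Y - X @ beta) / np.sqrt(n)
    return beta, sigma

# e.g. with the lambda of the next slide:
# beta_hat, sigma_hat = sqrt_lasso(X, Y, 2 * np.sqrt(2 * np.log(p)))
```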

  17. The compatibility constant
$\kappa[\xi, T] = \min_{u \in \mathcal{C}(\xi, T)} \big\{ |T|^{1/2} \|Xu\|_2 / \|u_T\|_1 \big\}, \quad \text{where } \mathcal{C}(\xi, T) = \{ u : \|u_{T^c}\|_1 < \xi \|u_T\|_1 \}.$
Restricted eigenvalue
For $k_* = n/(3\log(p))$ we set $\phi_* = \sup\{ \|Xu\|_2 / \|u\|_2 : u\ k_*\text{-sparse} \}$.
Theorem for the Square-Root Lasso (Sun & Zhang)
For $\lambda = 2\sqrt{2\log(p)}$, if we assume that
◮ $\|\beta_0\|_0 \le C_1\, \kappa^2[4, \mathrm{supp}(\beta_0)] \times \frac{n}{\log(p)}$,
then, with high probability,
$\|X(\hat\beta - \beta_0)\|_2^2 \le \inf_{\beta \neq 0} \Big\{ \|X(\beta_0 - \beta)\|_2^2 + C_2 \frac{\|\beta\|_0 \log(p)}{\kappa^2[4, \mathrm{supp}(\beta)]}\, \sigma^2 \Big\}.$

  18. The compatibility constant
$\kappa[\xi, T] = \min_{u \in \mathcal{C}(\xi, T)} \big\{ |T|^{1/2} \|Xu\|_2 / \|u_T\|_1 \big\}, \quad \text{where } \mathcal{C}(\xi, T) = \{ u : \|u_{T^c}\|_1 < \xi \|u_T\|_1 \}.$
Restricted eigenvalue
For $k_* = n/(3\log(p))$ we set $\phi_* = \sup\{ \|Xu\|_2 / \|u\|_2 : u\ k_*\text{-sparse} \}$.
Theorem for the LinSelect Lasso
If we assume that
◮ $\|\beta_0\|_0 \le C_1\, \kappa^2[4, \mathrm{supp}(\beta_0)] \times \frac{n}{\phi_* \log(p)}$,
then, with high probability,
$\|X(\hat\beta - \beta_0)\|_2^2 \le C \inf_{\beta \neq 0} \Big\{ \|X(\beta_0 - \beta)\|_2^2 + C_2\, \phi_* \frac{\|\beta\|_0 \log(p)}{\kappa^2[4, \mathrm{supp}(\beta)]}\, \sigma^2 \Big\}.$
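The compatibility constant is a minimum over a cone and is not straightforward to compute exactly; a crude Monte Carlo search only gives an upper bound, as in this sketch (function name and sampling scheme are illustrative choices):

```python
# Crude Monte Carlo upper bound on kappa[xi, T]: random search over the cone
# C(xi, T) can only overestimate the minimum.
import numpy as np

def kappa_upper_bound(X, T, xi=4, n_draws=20000, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Tc = np.setdiff1d(np.arange(p), T)
    best = np.inf
    for _ in range(n_draws):
        u = np.zeros(p)
        u[T] = rng.standard_normal(len(T))
        v = rng.standard_normal(len(Tc))
        # rescale the off-support part so that ||u_{T^c}||_1 < xi * ||u_T||_1
        v *= rng.uniform(0, 1) * xi * np.sum(np.abs(u[T])) / np.sum(np.abs(v))
        u[Tc] = v
        ratio = np.sqrt(len(T)) * np.linalg.norm(X @ u) / np.sum(np.abs(u[T]))
        best = min(best, ratio)
    return best
```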

  19. Numerical experiments (1/2)
Tuning the Lasso
◮ 165 examples extracted from the literature
◮ each example e is evaluated on the basis of 400 runs
Comparison to the oracle $\hat\beta_{\lambda^*}$

  procedure           quantile:    0%    50%    75%    90%    95%
  Lasso 10-fold CV              1.03   1.11   1.15   1.19   1.24
  Lasso LinSelect               0.97   1.03   1.06   1.19   2.52
  Square-Root Lasso             1.32   2.61   3.37   11.2   17

For each procedure ℓ: quantiles, over the examples e = 1, ..., 165, of the ratio $R(\hat\beta_{\hat\lambda_\ell}; \beta_0) / R(\hat\beta_{\lambda^*}; \beta_0)$.

  20. Numerical experiments (2/2)
Computation time

    n     p   10-fold CV   LinSelect   Square-Root
  100   100      4 s         0.21 s       0.18 s
  100   500      4.8 s       0.43 s       0.4 s
  500   500      300 s       11 s         6.3 s

Packages:
◮ enet for 10-fold CV and LinSelect
◮ lars for the Square-Root Lasso (procedure of Sun & Zhang)

  21. Non-parametric regression

  22. An important class of estimators
Linear estimators: $\hat f_\lambda = A_\lambda Y$ with $A_\lambda \in \mathbb{R}^{n \times n}$
◮ spline smoothing or kernel ridge estimators with smoothing parameter λ ∈ R₊
◮ Nadaraya estimators A_λ with smoothing parameter λ ∈ R₊
◮ λ-nearest neighbors, λ ∈ {1, ..., k}
◮ L²-basis projection (on the λ first elements)
◮ etc.
Selection criteria (with σ² unknown)
◮ cross-validation schemes (including GCV)
◮ Mallows' C_L + plug-in / slope heuristic
◮ LinSelect
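As an example of the GCV bullet, a minimal sketch selecting the bandwidth of a Nadaraya-Watson smoother by generalized cross-validation; the Gaussian kernel and the bandwidth grid are arbitrary illustrative choices:

```python
import numpy as np

def nadaraya_matrix(x, h):
    """n x n smoothing matrix A_h of the Nadaraya-Watson estimator (Gaussian kernel)."""
    W = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return W / W.sum(axis=1, keepdims=True)

def gcv_select(x, Y, bandwidths):
    """Return the bandwidth minimising GCV(h) = (||Y - A_h Y||^2 / n) / (1 - tr(A_h)/n)^2."""
    n = len(Y)
    scores = []
    for h in bandwidths:
        A = nadaraya_matrix(x, h)
        rss = np.sum((Y - A @ Y) ** 2)
        scores.append((rss / n) / (1 - np.trace(A) / n) ** 2)
    return bandwidths[int(np.argmin(scores))]
```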

  24. Slope heuristic (Arlot & Bach)
Procedure for $\hat f_\lambda = A_\lambda Y$
1. compute $\hat\lambda_0(\sigma') = \mathrm{argmin}_\lambda \big\{ \|Y - \hat f_\lambda\|^2 + \sigma'^2\, \mathrm{Tr}(2A_\lambda - A_\lambda^* A_\lambda) \big\}$
2. select $\hat\sigma$ such that $\mathrm{Tr}\big(A_{\hat\lambda_0(\hat\sigma)}\big) \in [n/10, n/3]$
3. select $\hat\lambda = \mathrm{argmin}_\lambda \big\{ \|Y - \hat f_\lambda\|^2 + 2\hat\sigma^2\, \mathrm{Tr}(A_\lambda) \big\}$.
Main assumptions
◮ A_λ ≈ shrinkage or "averaging" matrix (covers all the classics)
◮ Bias assumption: there exists λ₁ with $\mathrm{Tr}(A_{\lambda_1}) \le \sqrt{n}$ and $\|(I - A_{\lambda_1}) f\|^2 \le \sigma^2 \sqrt{n \log(n)}$
Theorem (Arlot & Bach)
With high probability: $\|\hat f_{\hat\lambda} - f\|^2 \le (1 + \varepsilon) \inf_\lambda \|\hat f_\lambda - f\|^2 + C \varepsilon^{-1} \log(n)\, \sigma^2$
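A schematic transcription of the three steps above for a finite family of smoothing matrices (a sketch; the grid search for σ̂ in step 2 is a crude stand-in for the calibration actually used by Arlot & Bach):

```python
# `A_list` is a list of n x n smoothing matrices A_lambda indexed by the grid
# of smoothing parameters.
import numpy as np

def slope_heuristic_select(A_list, Y):
    n = len(Y)
    fits = [A @ Y for A in A_list]
    res = [np.sum((Y - f) ** 2) for f in fits]
    tr = [np.trace(A) for A in A_list]
    tr_min_pen = [np.trace(2 * A - A.T @ A) for A in A_list]      # Tr(2 A - A* A)

    # Step 1 for a candidate variance s2: minimal-penalty selection.
    def lam0(s2):
        return int(np.argmin([r + s2 * t for r, t in zip(res, tr_min_pen)]))

    # Step 2: tune sigma^2 so that the selected effective dimension Tr(A)
    # falls in [n/10, n/3] (crude grid search; assumes a rich enough family).
    sigma2_grid = np.geomspace(1e-6, 1e6, 200) * np.var(Y)
    candidates = [s2 for s2 in sigma2_grid if n / 10 <= tr[lam0(s2)] <= n / 3]
    sigma2_hat = candidates[0] if candidates else np.var(Y)       # fallback for the sketch

    # Step 3: final selection with the penalty 2 * sigma2_hat * Tr(A_lambda).
    crit = [r + 2 * sigma2_hat * t for r, t in zip(res, tr)]
    return int(np.argmin(crit)), sigma2_hat
```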
