On the performance of the Lasso in terms of prediction loss
  1. On the performance of the Lasso in terms of prediction loss. Joint work with M. Hebiri and J. Lederer. Van Dantzig seminar, Amsterdam, October 9, 2014. Arnak S. Dalalyan, ENSAE / CREST / GENES.

  2. I. Overcomplete dictionaries and Lasso

  3. Classical problem of regression
  ◮ Observations: feature-label pairs $\{(z_i, y_i);\ i = 1, \dots, n\}$
    • $z_i \in \mathbb{R}^d$ multidimensional feature vector;
    • $y_i \in \mathbb{R}$ real-valued label.
  ◮ Regression function: for some $f^* : \mathbb{R}^d \to \mathbb{R}$ it holds that $y_i = f^*(z_i) + \xi_i$, with i.i.d. noise $\{\xi_i\}$. We will always assume that $\mathbb{E}[\xi_1] = 0$ and $\mathrm{Var}[\xi_1] = \sigma^2$. The feature vectors $z_i$ are assumed deterministic.
  ◮ Dictionary approach: for a given family (called dictionary) of functions $\{\varphi_j\}_{j \in [p]}$, it is assumed that for some $\bar\beta \in \mathbb{R}^p$,
    $$f^* \approx f_{\bar\beta} := \sum_{j=1}^p \bar\beta_j \varphi_j.$$
  ◮ Sparsity: the dimensionality of $\bar\beta$ is large, possibly much larger than $n$, but $\bar\beta$ has only a few nonzero entries ($s = \|\bar\beta\|_0 \ll p$).

  4. Classical problem of regression
  ◮ Observations: feature-label pairs $\{(z_i, y_i);\ i = 1, \dots, n\}$
  ◮ Regression function: for some $f^* : \mathbb{R}^d \to \mathbb{R}$ it holds that $y_i = f^*(z_i) + \xi_i$, with i.i.d. noise $\{\xi_i\}$.
  ◮ Dictionary approach: for a dictionary $\{\varphi_j\}_{j \in [p]}$, $f^* \approx f_{\bar\beta} := \sum_{j=1}^p \bar\beta_j \varphi_j$.
  ◮ Sparsity: the dimensionality of $\bar\beta$ is large, possibly much larger than $n$, but $\bar\beta$ has only a few nonzero entries ($s = \|\bar\beta\|_0 \ll p$).
  ◮ Prediction loss: the quality of recovery is measured by the normalized Euclidean norm:
    $$\ell_n(\hat f, f^*) = \frac{1}{n} \sum_{i=1}^n \big(\hat f(z_i) - f^*(z_i)\big)^2.$$
    The goal is to propose an estimator $\hat\beta$ such that $\ell_n(f_{\hat\beta}, f^*)$ is small.

  5. Equivalence with multiple linear regression
  ◮ Set $y = (y_1, \dots, y_n)^\top$ and $\xi = (\xi_1, \dots, \xi_n)^\top$.
  ◮ Define the design matrix $X = [\varphi_j(z_i)]_{i \in [n],\, j \in [p]}$.
  ◮ Assume, for notational convenience, that $f^* = f_{\beta^*}$. We then get the regression model $y = X\beta^* + \xi$.
  ◮ The prediction loss of an estimator $\hat\beta$ is then
    $$\ell_n(\hat\beta, \beta^*) := \frac{1}{n} \|X(\hat\beta - \beta^*)\|_2^2.$$
  ◮ The columns of $X$ (dictionary elements) satisfy $\frac{1}{n}\|X_j\|_2^2 \le 1$.
  (Diagram: $y$ is $n \times 1$, $X$ is $n \times p$, $\beta^*$ is $p \times 1$, $\xi$ is $n \times 1$.)
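To make the notation concrete, here is a minimal Python sketch (an illustration, not the speaker's code): it builds a design matrix from a hypothetical polynomial dictionary $\varphi_j(z) = z^j$, generates $y = X\beta^* + \xi$ with a 2-sparse $\beta^*$, and evaluates the prediction loss $\ell_n$. All numerical choices ($n$, $p$, $\sigma$, the dictionary) are assumptions made for illustration.

```python
# Minimal sketch of the notation of slides 3-5 (illustrative assumptions throughout).
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 100, 20, 0.5

z = rng.uniform(-1.0, 1.0, size=n)                    # one-dimensional features z_i
X = np.column_stack([z**j for j in range(p)])          # dictionary phi_j(z) = z^j
X /= np.maximum(np.linalg.norm(X, axis=0) / np.sqrt(n), 1.0)   # enforce ||X_j||_2^2 / n <= 1

beta_star = np.zeros(p)
beta_star[[1, 3]] = [2.0, -1.0]                        # s = 2 nonzero entries
xi = sigma * rng.standard_normal(n)                    # centred noise with variance sigma^2
y = X @ beta_star + xi

def prediction_loss(beta_hat, beta_star, X):
    """ell_n(beta_hat, beta_star) = ||X (beta_hat - beta_star)||_2^2 / n."""
    return np.sum((X @ (beta_hat - beta_star)) ** 2) / X.shape[0]

print(prediction_loss(np.zeros(p), beta_star, X))      # loss of the trivial estimator beta = 0
```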

  6. Lasso and its prediction error
  ◮ Definition: given $\lambda > 0$, the Lasso estimator is
    $$\hat\beta^{\mathrm{Lasso}}_\lambda \in \arg\min_{\beta \in \mathbb{R}^p} \Big\{ \frac{1}{2n} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \Big\}.$$
  ◮ Risk bound with "slow" rate: if $\lambda \ge \sigma \big(\tfrac{2\log(p/\delta)}{n}\big)^{1/2}$, then
    $$\ell_n(\hat\beta^{\mathrm{Lasso}}_\lambda, \beta^*) \le \min_{\bar\beta} \big\{ \ell_n(\bar\beta, \beta^*) + 4\lambda \|\bar\beta\|_1 \big\}, \qquad (1)$$
    with probability at least $1 - \delta$ (see, for instance, [Rigollet and Tsybakov, 2011]).
  ◮ For fixed sparsity $s$, the remainder term is of order $n^{-1/2}$, up to a log factor. This is called the "slow" rate.
  ◮ The slow-rate bound holds even if the columns of $X$ are strongly correlated.
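Below is a hedged sketch of fitting the Lasso with the universal tuning parameter of the slow-rate bound. It relies on the fact that scikit-learn's `Lasso` minimizes exactly the objective displayed above, with `alpha` playing the role of $\lambda$; the synthetic design, noise level and $\delta$ are illustrative assumptions, not choices from the talk.

```python
# Lasso with the "universal" lambda = sigma * sqrt(2 log(p/delta) / n) on synthetic data.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, sigma, delta = 100, 200, 0.5, 0.05
X = rng.standard_normal((n, p))
X /= np.maximum(np.linalg.norm(X, axis=0) / np.sqrt(n), 1e-12)   # ||X_j||_2^2 / n = 1
beta_star = np.zeros(p); beta_star[:2] = [2.0, -1.0]
y = X @ beta_star + sigma * rng.standard_normal(n)

lam = sigma * np.sqrt(2.0 * np.log(p / delta) / n)

# scikit-learn's Lasso minimizes (1/(2n)) ||y - X beta||_2^2 + alpha ||beta||_1,
# i.e. the objective of this slide with alpha = lambda.
fit = Lasso(alpha=lam, fit_intercept=False, max_iter=50_000).fit(X, y)
loss = np.sum((X @ (fit.coef_ - beta_star)) ** 2) / n            # ell_n(beta_hat, beta_star)
bound = 4 * lam * np.abs(beta_star).sum()                        # RHS of (1) with beta_bar = beta_star
print(loss, bound)
```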

  7. Fast rates for the Lasso
  ◮ Recall the Restricted Eigenvalue condition $\mathrm{RE}(T, 5)$: for all $\delta \in \mathbb{R}^p$,
    $$\|\delta_{T^c}\|_1 \le 5 \|\delta_T\|_1 \ \Longrightarrow\ \frac{1}{n}\|X\delta\|_2^2 \ge \kappa^2_{T,5}\, \|\delta_T\|_2^2.$$
  ◮ Risk bound with "fast" rate: according to [Koltchinskii, Lounici and Tsybakov, AoS, 2011], if for some $T \subset [p]$ the matrix $X$ satisfies $\mathrm{RE}(T, 5)$ and the noise distribution is Gaussian, then $\lambda = 3\sigma \big(\tfrac{2\log(p/\delta)}{n}\big)^{1/2}$ leads to
    $$\ell_n(\hat\beta^{\mathrm{Lasso}}_\lambda, \beta^*) \le \inf_{\bar\beta \in \mathbb{R}^p} \Big\{ \ell_n(\bar\beta, \beta^*) + 4\lambda \|\bar\beta_{T^c}\|_1 + \frac{18\,\sigma^2 \|\bar\beta\|_0 \log(p/\delta)}{\kappa^2_{T,5}\, n} \Big\},$$
    with probability at least $1 - \delta$ (see also [Sun and Zhang, 2012]).
  ◮ The remainder term above is of order $s/n$, called the fast rate, if $\kappa_{T,5}$ is bounded away from zero. This constrains the correlations between the columns of $X$.
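Computing $\kappa_{T,5}$ exactly is a nonconvex problem, but the following rough sketch (my own illustration, with an arbitrary random design and an arbitrary $T$) probes the $\mathrm{RE}(T,5)$ condition by sampling directions in the cone $\{\|\delta_{T^c}\|_1 \le 5\|\delta_T\|_1\}$; the smallest sampled ratio only upper-bounds $\kappa^2_{T,5}$ and is not a certificate.

```python
import numpy as np

def probe_re_constant(X, T, c0=5.0, n_draws=20_000, seed=0):
    """Smallest sampled value of ||X delta||_2^2 / (n ||delta_T||_2^2) over random
    directions delta with ||delta_{T^c}||_1 <= c0 ||delta_T||_1.
    This only UPPER-bounds kappa_{T,c0}^2; the exact constant needs a global minimization."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Tc = np.setdiff1d(np.arange(p), T)
    best = np.inf
    for _ in range(n_draws):
        delta = rng.standard_normal(p)
        budget = c0 * np.abs(delta[T]).sum()
        excess = np.abs(delta[Tc]).sum()
        if excess > budget:                       # shrink the off-support part into the cone
            delta[Tc] *= budget / excess
        best = min(best, np.sum((X @ delta) ** 2) / (n * np.sum(delta[T] ** 2)))
    return best

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 300))               # a generic random design, for illustration only
print(probe_re_constant(X, T=np.array([0, 1, 2])))
```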

  8. II. Some questions

  9. Question 1
  ◮ For really sparse vectors (for example, $s$ fixed and $n \to \infty$), there are methods that satisfy fast-rate bounds for prediction irrespective of the correlations between the covariates [BTW07a, DT07, RT11].
  ◮ Fast-rate bounds for Lasso prediction, in contrast, usually rely on assumptions on the correlations of the covariates, such as low coherence [CP09], restricted eigenvalues [BRT09, RWY10], restricted isometry [CT07], compatibility [vdG07], etc.
  ◮ Question: is it possible to establish fast-rate bounds for the Lasso that are valid irrespective of the correlations between the covariates? This question is open even if we allow for oracle choices of the tuning parameter $\lambda$, that is, if we allow $\lambda$ to depend on the true regression vector $\beta^*$, the noise vector $\xi$, and the noise level $\sigma$.

  10. Question 2
  ◮ Known results imply fast rates for prediction with the Lasso in two extreme cases: first, when the covariates are mutually orthogonal, and second, when the covariates are all collinear.
  ◮ Question: how far from these two extreme cases can a design be while still permitting fast rates for prediction with the Lasso?
  ◮ The first case, that of mutually orthogonal covariates, has been thoroughly studied [BRT09, BTW07b, Zha09, vdGB09, Wai09, CWX10, JN11].
  ◮ The second case, that of collinear covariates, has received much less attention and is therefore one of our main topics.

  11. Question 3
  A particular case of the Lasso is the least squares estimator with the total variation penalty:
    $$\hat f^{\mathrm{TV}} \in \arg\min_{f \in \mathbb{R}^n} \Big\{ \frac{1}{n} \|y - f\|_2^2 + \lambda \|f\|_{\mathrm{TV}} \Big\}, \qquad (2)$$
  which corresponds to the Lasso estimator for the design matrix
    $$X = \begin{pmatrix} 1 & 0 & \dots & 0 \\ 1 & 1 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \dots & 1 \end{pmatrix}, \qquad f = X\beta, \qquad \|f\|_{\mathrm{TV}} = \|\beta\|_1.$$
  ◮ It is known that if $f^*$ is piecewise constant, then the minimax rate of estimation is parametric, $O(n^{-1})$.
  ◮ According to [MvdG97], the risk of the TV-estimator is $O(n^{-2/3})$.
  ◮ Question: is the TV-estimator indeed suboptimal for estimating piecewise-constant functions, or is this gap just an artifact of the proof?
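As a concrete illustration of the correspondence in (2), the sketch below builds the lower-triangular all-ones design and fits the TV-estimator as a Lasso with scikit-learn (whose objective carries an extra factor $1/2$ on the quadratic term). The piecewise-constant signal, $n$, $\sigma$ and $\lambda$ are assumptions made for illustration, not choices from the talk.

```python
# TV-penalized least squares written as a Lasso over the cumulative-sum design f = X beta.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, sigma, lam = 200, 0.5, 0.1

X_tv = np.tril(np.ones((n, n)))                       # lower-triangular matrix of ones
f_star = np.where(np.arange(n) < n // 2, 0.0, 1.0)    # piecewise-constant signal with one jump
y = f_star + sigma * rng.standard_normal(n)

# scikit-learn minimizes (1/(2n)) ||y - X beta||_2^2 + alpha ||beta||_1,
# which differs from (2) only by the factor 1/2 on the quadratic term.
tv = Lasso(alpha=lam, fit_intercept=False, max_iter=100_000).fit(X_tv, y)
f_hat = X_tv @ tv.coef_

print("prediction loss:", np.mean((f_hat - f_star) ** 2))
```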

  12. III. A counter-example

  13. Fast rates: a negative result
  ◮ Let $n \ge 2$ and $m = \lfloor \sqrt{2n} \rfloor$. Define the design matrix $X \in \mathbb{R}^{n \times 2m}$ by
    $$X = \sqrt{\tfrac{n}{2}} \begin{pmatrix} 1 & 1 & 1 & 1 & \dots & 1 & 1 \\ 1 & -1 & 0 & 0 & \dots & 0 & 0 \\ 0 & 0 & 1 & -1 & \dots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \dots & 1 & -1 \end{pmatrix}.$$
  ◮ We assume in this example that $\xi$ is composed of i.i.d. Rademacher random variables.
  ◮ Let $\beta^* \in \mathbb{R}^{2m}$ be such that $\beta^*_1 = \beta^*_2 = 1$ and $\beta^*_j = 0$ for every $j > 2$.
  Proposition. For any $\lambda > 0$, the prediction loss of $\hat\beta^{\mathrm{Lasso}}_\lambda$ satisfies
    $$\mathbb{P}\big( \ell_n(\hat\beta^{\mathrm{Lasso}}_\lambda, \beta^*) \ge (8n)^{-1/2} \big) \ge \tfrac{1}{2}.$$
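The following Monte Carlo sketch illustrates the Proposition numerically, under my reconstruction of the design above (an all-ones first row, shifted $(1,-1)$ pairs, prefactor $\sqrt{n/2}$); the zero-padding of the remaining rows and the grid of $\lambda$ values are assumptions, so this should be read as an illustration rather than a verification.

```python
# Monte Carlo illustration of the counter-example (reconstructed design; assumptions noted above).
import numpy as np
from sklearn.linear_model import Lasso

def counterexample_design(n):
    m = int(np.floor(np.sqrt(2 * n)))
    M = np.zeros((n, 2 * m))
    M[0, :] = 1.0                      # first row of ones
    for i in range(m):                 # shifted (1, -1) pairs, one per row
        M[i + 1, 2 * i] = 1.0
        M[i + 1, 2 * i + 1] = -1.0
    return np.sqrt(n / 2.0) * M

rng = np.random.default_rng(2)
n, n_rep = 100, 200
X = counterexample_design(n)
p = X.shape[1]
beta_star = np.zeros(p); beta_star[:2] = 1.0
threshold = (8 * n) ** -0.5

for lam in [0.01, 0.05, 0.1, 0.5, 1.0]:
    losses = []
    for _ in range(n_rep):
        xi = rng.choice([-1.0, 1.0], size=n)                   # Rademacher noise
        y = X @ beta_star + xi
        fit = Lasso(alpha=lam, fit_intercept=False, max_iter=50_000).fit(X, y)
        losses.append(np.sum((X @ (fit.coef_ - beta_star)) ** 2) / n)
    # per the Proposition, this empirical frequency should not drop below 1/2
    print(lam, np.mean(np.array(losses) >= threshold))
```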

  14. Fast rates: a negative result
  Other negative results can be found in [CP09], but the specificities of the last proposition are that:
  ◮ the sparsity is fixed and small: $s = 2$, while $p \approx \sqrt{8n}$;
  ◮ the correlations are fixed and bounded away from zero and one: $n^{-1}\langle X_j, X_{j'} \rangle = 1/2$ for most $j, j'$;
  ◮ the result is true for all values of $\lambda$.
  Conclusion. The statistical complexity of the Lasso is definitely worse than that of Exponential Screening [RT11] and of the Exponentially Weighted Aggregate with sparsity prior [DT07].

  15. IV. Taking advantage of correlations: intermediate rates

  16. A measure of (high) correlations and a sharp OI
  ◮ Recall the "slow" rate: if $\lambda \ge \sigma \big(\tfrac{2\log(p/\delta)}{n}\big)^{1/2}$, then with probability $\ge 1 - \delta$,
    $$\ell_n(\hat\beta^{\mathrm{Lasso}}_\lambda, \beta^*) \le \min_{\bar\beta} \big\{ \ell_n(\bar\beta, \beta^*) + 4\lambda \|\bar\beta\|_1 \big\}. \qquad (3)$$
  ◮ This bound can be substantially improved when some columns of $X$ are nearly collinear (very strongly correlated). For every set $T \subset [p]$, we introduce the quantity
    $$\rho_T = n^{-1/2} \max_{j \in [p]} \|(I_n - \Pi_T) X_j\|_2,$$
    where $\Pi_T$ is the projector onto $\mathrm{span}(X_T)$.
  Theorem 1. If $\lambda \ge \rho_T\, \sigma \big(\tfrac{2\log(p/\delta)}{n}\big)^{1/2}$, then with probability $\ge 1 - 2\delta$ the Lasso fulfills
    $$\ell_n(\hat\beta^{\mathrm{Lasso}}_\lambda, \beta^*) \le \inf_{\bar\beta \in \mathbb{R}^p} \big\{ \ell_n(\bar\beta, \beta^*) + 4\lambda \|\bar\beta\|_1 \big\} + \frac{2\sigma^2 \big(|T| + 2\log(1/\delta)\big)}{n}.$$
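The quantity $\rho_T$ is straightforward to compute numerically. Here is a small numpy sketch of my own; the choice of $T$ (a coarse grid of columns of the TV design from slide 11) is an illustrative assumption.

```python
# Numerical evaluation of rho_T = n^{-1/2} max_j ||(I - Pi_T) X_j||_2.
import numpy as np

def rho_T(X, T):
    n = X.shape[0]
    Q, _ = np.linalg.qr(X[:, T])           # orthonormal basis of span(X_T)
    residuals = X - Q @ (Q.T @ X)           # (I - Pi_T) X_j for every column j
    return np.linalg.norm(residuals, axis=0).max() / np.sqrt(n)

# For the TV design, neighbouring columns are strongly correlated, so even a coarse
# grid T already makes rho_T noticeably smaller than 1.
X_tv = np.tril(np.ones((200, 200)))
print(rho_T(X_tv, T=[0, 49, 99, 149]))
```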

  17. Discussion
  ◮ "Slow" rates meet "fast" rates when the quantity $\rho_T$ is $O(n^{-1/2})$.
  ◮ For designs containing highly correlated covariates (as in the case of the TV-estimator), choosing the tuning parameter substantially smaller than the universal value $\sigma \big(\tfrac{2\log(p/\delta)}{n}\big)^{1/2}$ may considerably improve the rate.
  ◮ Applying Theorem 1 in the case of the TV-estimator, we get sharp oracle inequalities (OIs) with a minimax-rate-optimal remainder term in the case of Hölder-continuous and monotone functions $f$.

  18. V. Fast rates and weighted compatibility
