Statistics for High-Dimensional Data: Selected Topics


  1. Statistics for High-Dimensional Data: Selected Topics. Peter Bühlmann, Seminar für Statistik, ETH Zürich, June 2014

  2. High-dimensional linear model: basic concepts in methodology, theory and computation

  3. High-dimensional data. Riboflavin production with Bacillus subtilis (in collaboration with DSM (Switzerland)). Goal: improve the riboflavin production rate of Bacillus subtilis using clever genetic engineering. Response variable Y ∈ R: riboflavin (log-)production rate; covariates X ∈ R^p: expressions from p = 4088 genes; sample size n = 115, p ≫ n. [Figure: gene expression data, scatter plots of Y versus 9 "reasonable" genes.]

  4. General framework: Z_1, ..., Z_n (with some "i.i.d. components"), dim(Z_i) ≫ n. For example: Z_i = (X_i, Y_i), X_i ∈ X ⊆ R^p, Y_i ∈ Y ⊆ R: regression with p ≫ n; Z_i = (X_i, Y_i), X_i ∈ X ⊆ R^p, Y_i ∈ {0, 1}: classification for p ≫ n. Numerous applications: biology, imaging, economics, environmental sciences, ...

  5. High-dimensional linear models: Y_i = Σ_{j=1}^p β0_j X_i^(j) + ε_i, i = 1, ..., n, with p ≫ n; in short: Y = Xβ + ε.
     Goals:
     ◮ prediction, e.g. w.r.t. squared prediction error
     ◮ estimation of β0, e.g. w.r.t. ‖β̂ − β0‖_q (q = 1, 2)
     ◮ variable selection, i.e. estimating the active set of effective variables (those with corresponding coefficient ≠ 0)
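
To fix ideas, here is a minimal simulation of this setup (a sketch with illustrative numbers, not data from the slides): it generates Y = Xβ0 + ε with p ≫ n and a sparse β0, the setting underlying the three goals above.

    import numpy as np

    rng = np.random.default_rng(7)
    n, p, s0, sigma = 100, 1000, 5, 1.0        # illustrative sizes with p >> n

    X = rng.standard_normal((n, p))            # design matrix
    beta0 = np.zeros(p)
    beta0[:s0] = [2.0, -1.5, 1.0, -0.5, 0.25]  # sparse true coefficient vector
    eps = sigma * rng.standard_normal(n)
    Y = X @ beta0 + eps                        # Y = X beta0 + eps

    active_set = np.flatnonzero(beta0)         # the target of variable selection
    print("n=%d, p=%d, |active set| = s0 = %d" % (n, p, active_set.size))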

  6. we need to regularize... and there are many proposals ◮ Bayesian methods for regularization ◮ greedy algorithms: aka forward selection or boosting ◮ preliminary dimension reduction ◮ ... e.g. 4’160’000 entries on Google Scholar for “high dimensional linear model” ...

  9. Penalty-based methods, if the true β0 is sparse w.r.t.
     ◮ ‖β0‖_0 = number of non-zero coefficients ❀ regularize with the ‖·‖_0-penalty: argmin_β ( n^{-1} ‖Y − Xβ‖^2 + λ ‖β‖_0 ), e.g. AIC, BIC ❀ computationally infeasible if p is large (2^p sub-models)
     ◮ ‖β0‖_1 = Σ_{j=1}^p |β0_j| ❀ penalize with the ‖·‖_1-norm, i.e. Lasso: argmin_β ( n^{-1} ‖Y − Xβ‖^2 + λ ‖β‖_1 ) ❀ convex optimization: computationally feasible and very fast for large p
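
The contrast can be seen directly in a small sketch of my own (BIC in its Gaussian-likelihood form stands in for the ℓ0-type criterion; penalty values are illustrative): exhaustive ℓ0 search must visit all 2^p sub-models, while the ℓ1 problem is a single convex fit even for large p.

    import itertools, time
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(4)

    # l0 route: exhaustive search over all 2^p sub-models (only feasible for tiny p)
    n, p = 50, 10
    X = rng.standard_normal((n, p))
    beta0 = np.zeros(p); beta0[:2] = 1.0
    y = X @ beta0 + rng.standard_normal(n)

    def bic(subset):
        if not subset:
            return n * np.log(np.mean(y ** 2))
        Xs = X[:, list(subset)]
        resid = y - Xs @ np.linalg.lstsq(Xs, y, rcond=None)[0]
        return n * np.log(np.mean(resid ** 2)) + np.log(n) * len(subset)

    t0 = time.time()
    best = min((s for k in range(p + 1) for s in itertools.combinations(range(p), k)), key=bic)
    print("best subset:", best, "(%.2f s for %d sub-models)" % (time.time() - t0, 2 ** p))

    # l1 route: one convex problem, fast even for large p
    n, p = 100, 5000
    X = rng.standard_normal((n, p))
    beta0 = np.zeros(p); beta0[:2] = 1.0
    y = X @ beta0 + rng.standard_normal(n)
    t0 = time.time()
    coef = Lasso(alpha=0.2, fit_intercept=False).fit(X, y).coef_
    print("Lasso: %d non-zeros out of p=%d (%.2f s)" % (np.sum(coef != 0), p, time.time() - t0))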

  10. As we will see: regularization with the ℓ1-norm is good (“nearly optimal”) even if the truth is sparse w.r.t. ‖β0‖_0 ❀ can exploit the computational advantage of ‖·‖_1-norm regularization even if the problem has a different sparsity structure

  11. The Lasso (Tibshirani, 1996). Lasso for linear models: β̂(λ) = argmin_β ( n^{-1} ‖Y − Xβ‖^2 + λ ‖β‖_1 ), with λ ≥ 0 and ‖β‖_1 = Σ_{j=1}^p |β_j| ❀ convex optimization problem
     ◮ the Lasso does variable selection: some of the β̂_j(λ) = 0 (because of the “ℓ1-geometry”)
     ◮ β̂(λ) is a shrunken LS-estimate
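
A quick illustration of both points on simulated data (a sketch; scikit-learn's lasso_path computes β̂ on a grid of penalty values, with its alpha playing the role of λ up to the package's normalization of the least-squares term):

    import numpy as np
    from sklearn.linear_model import lasso_path

    rng = np.random.default_rng(5)
    n, p = 100, 200
    X = rng.standard_normal((n, p))
    beta0 = np.zeros(p); beta0[:5] = [2, -1.5, 1, -0.5, 0.25]
    y = X @ beta0 + rng.standard_normal(n)

    # coefficient paths over a grid of penalty values, from large (heavy shrinkage,
    # many exact zeros) to small (little shrinkage, few zeros)
    alphas, coefs, _ = lasso_path(X, y, n_alphas=20)
    for a, c in zip(alphas, coefs.T):
        print("alpha=%.3f: %3d non-zero coefficients, max |beta_j| = %.2f"
              % (a, np.sum(c != 0), np.max(np.abs(c))))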

  12. More about the “ℓ1-geometry”: equivalence to the primal problem β̂_primal(R) = argmin_{β: ‖β‖_1 ≤ R} ‖Y − Xβ‖_2^2 / n, with a correspondence between λ and R which depends on the data (X_1, Y_1), ..., (X_n, Y_n). [Such an equivalence holds since ◮ ‖Y − Xβ‖_2^2 / n is convex in β ◮ the constraint ‖β‖_1 ≤ R is convex; see e.g. Bertsekas (1995)]

  13. [Figure: p = 2, left: the ℓ1-“world”.] The residual sum of squares reaches its minimal value (for certain constellations of the data) when its contour lines hit the ℓ1-ball in its corner ❀ β̂_1 = 0

  14. The ℓ2-“world” is different. Ridge regression: β̂_Ridge(λ) = argmin_β ( ‖Y − Xβ‖_2^2 / n + λ ‖β‖_2^2 ), with the equivalent primal solution β̂_Ridge;primal(R) = argmin_{β: ‖β‖_2 ≤ R} ‖Y − Xβ‖_2^2 / n, and a one-to-one correspondence between λ and R
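
The practical difference to the ℓ1-“world”: Ridge shrinks but essentially never produces exact zeros, while the Lasso does. A minimal comparison on simulated data (my own sketch; penalty values are illustrative and the two estimators' penalty conventions in scikit-learn differ by constants):

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(6)
    n, p = 100, 200
    X = rng.standard_normal((n, p))
    beta0 = np.zeros(p); beta0[:5] = 1.0
    y = X @ beta0 + rng.standard_normal(n)

    lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
    ridge = Ridge(alpha=10.0, fit_intercept=False).fit(X, y)
    print("Lasso: %d coefficients exactly 0 out of %d" % (np.sum(lasso.coef_ == 0.0), p))
    print("Ridge: %d coefficients exactly 0 out of %d" % (np.sum(ridge.coef_ == 0.0), p))
    # typical outcome: the Lasso sets most coefficients exactly to zero, Ridge none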

  15. A note on the Bayesian approach. Model: β_1, ..., β_p i.i.d. ∼ p(β) dβ; given β: Y ∼ N_n(Xβ, σ^2 I_n) with density f(y | β, σ^2). Posterior density: p(β | Y, σ^2) = f(Y | β, σ^2) p(β) / ∫ f(Y | β, σ^2) p(β) dβ ∝ f(Y | β, σ^2) p(β), and hence for the MAP (Maximum A-Posteriori) estimator: β̂_MAP = argmax_β p(β | Y, σ^2) = argmin_β ( − log[ f(Y | β, σ^2) p(β) ] ) = argmin_β ( (1/(2σ^2)) ‖Y − Xβ‖_2^2 − Σ_{j=1}^p log p(β_j) )

  16. Examples:
     1. Double-Exponential prior DExp(τ): p(β) = (τ/2) exp(−τ|β|) ❀ β̂_MAP equals the Lasso with penalty parameter λ = n^{-1} 2σ^2 τ
     2. Gaussian prior N(0, τ^2): p(β) = (1/(√(2π) τ)) exp(−β^2/(2τ^2)) ❀ β̂_MAP equals the Ridge estimator with penalty parameter λ = n^{-1} σ^2/τ^2
     but we will argue that the Lasso (i.e., the MAP estimator) is also good if the truth is sparse with respect to ‖β0‖_0, e.g. if the prior is (much) more spiky around zero than the Double-Exponential distribution
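
To fill in the step behind the first correspondence (a short calculation of my own, using the MAP objective from slide 15; the Gaussian case is analogous): for the DExp(τ) prior, −log p(β_j) = τ|β_j| + const, so

    \hat\beta_{\mathrm{MAP}}
      = \arg\min_\beta \Big( \tfrac{1}{2\sigma^2} \|Y - X\beta\|_2^2 + \tau \|\beta\|_1 \Big)
      = \arg\min_\beta \Big( n^{-1} \|Y - X\beta\|_2^2 + \underbrace{\tfrac{2\sigma^2\tau}{n}}_{=\,\lambda} \|\beta\|_1 \Big),

since multiplying the criterion by the constant 2σ^2/n does not change the minimizer. For the Gaussian prior, −log p(β_j) = β_j^2/(2τ^2) + const, and the same rescaling gives the Ridge estimator with λ = σ^2/(n τ^2).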

  17. Orthonormal design: Y = Xβ + ε with n^{-1} X^T X = I. Lasso = soft-thresholding estimator: β̂_j(λ) = sign(Z_j)(|Z_j| − λ/2)_+, where Z_j = (n^{-1} X^T Y)_j is the OLS estimator, i.e. β̂_j(λ) = g_soft(Z_j). [Exercise] [Figure: threshold functions g(z) for the Adaptive Lasso, Hard-thresholding and Soft-thresholding.]
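
A numerical check of this closed form (a sketch with simulated data; note that scikit-learn's Lasso minimizes (2n)^{-1}‖Y − Xβ‖_2^2 + α‖β‖_1, so its α corresponds to λ/2 in the notation above):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n, p, lam = 200, 10, 0.5

    # orthonormal design: for X = sqrt(n) * Q with Q having orthonormal columns, (1/n) X^T X = I
    Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
    X = np.sqrt(n) * Q
    beta0 = np.concatenate([np.array([2.0, -1.0, 0.5]), np.zeros(p - 3)])
    y = X @ beta0 + rng.standard_normal(n)

    Z = X.T @ y / n                                                # OLS under orthonormal design
    beta_soft = np.sign(Z) * np.maximum(np.abs(Z) - lam / 2, 0.0)  # soft-thresholding at lam/2

    beta_lasso = Lasso(alpha=lam / 2, fit_intercept=False,
                       tol=1e-10, max_iter=100000).fit(X, y).coef_
    print(np.allclose(beta_soft, beta_lasso, atol=1e-6))           # expected: True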

  18. Prediction. Goal: predict a new observation Y_new. Consider the expected (w.r.t. new data; and random X) squared error loss: E_{X_new, Y_new}[(Y_new − X_new β̂)^2] = σ^2 + E_{X_new}[(X_new(β0 − β̂))^2] = σ^2 + (β̂ − β0)^T Σ (β̂ − β0), where Σ = Cov(X) ❀ terminology “prediction error”: for random design X: (β̂ − β0)^T Σ (β̂ − β0) = E_{X_new}[(X_new(β̂ − β0))^2]; for fixed design X: (β̂ − β0)^T Σ̂ (β̂ − β0) = ‖X(β̂ − β0)‖_2^2 / n
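
These two notions of prediction error are easy to compute in a simulation (a sketch with simulated data and an illustrative penalty value; Σ is taken as the identity so the random-design quantity can be checked by Monte Carlo):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(1)
    n, p, s0 = 100, 200, 5
    Sigma = np.eye(p)                                 # Cov(X); identity for simplicity
    X = rng.standard_normal((n, p))
    beta0 = np.zeros(p); beta0[:s0] = 1.0
    y = X @ beta0 + rng.standard_normal(n)

    beta_hat = Lasso(alpha=0.1, fit_intercept=False).fit(X, y).coef_
    d = beta_hat - beta0

    pred_err_fixed = np.sum((X @ d) ** 2) / n         # ||X (beta_hat - beta0)||_2^2 / n
    pred_err_random = d @ Sigma @ d                   # (beta_hat - beta0)^T Sigma (beta_hat - beta0)
    X_new = rng.standard_normal((20000, p))           # Monte-Carlo check of E[(X_new d)^2]
    mc = np.mean((X_new @ d) ** 2)
    print(pred_err_fixed, pred_err_random, mc)        # the last two should nearly coincide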

  19. Binary lymph node classification using gene expressions: a high noise problem. n = 49 samples, p = 7130 gene expressions. Despite it being classification: estimate P[Y = 1 | X = x] = E[Y | X = x] by p̂(x) via a linear model; can then do classification ❀ cross-validated misclassification error (2/3 training; 1/3 test):
     Lasso    L2Boosting   FPLR     Pelora   1-NN     DLDA     SVM
     21.1%    17.7%        35.25%   27.8%    43.25%   36.12%   36.88%
     (Lasso, L2Boosting, FPLR, Pelora: with variable selection; 1-NN, DLDA, SVM: best 200 genes (Wilcoxon test), no additional variable selection)
     The Lasso selected on CV-average 13.12 out of p = 7130 genes. From a practical perspective: if you trust in cross-validation, we can validate how good we are, i.e. prediction may be a black box, but we can evaluate it!
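
The recipe of this slide, i.e. regress the 0/1 response on X with the Lasso and classify by thresholding the fitted value at 1/2, can be sketched as follows; the gene-expression data are not reproduced here, so the data below are simulated and all numbers are purely illustrative:

    import numpy as np
    from sklearn.linear_model import LassoCV
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(2)
    n, p, s0 = 100, 1000, 10                       # stand-in for n = 49, p = 7130
    X = rng.standard_normal((n, p))
    beta0 = np.zeros(p); beta0[:s0] = 1.0
    prob = 1 / (1 + np.exp(-(X @ beta0)))          # true P[Y = 1 | X = x]
    y = rng.binomial(1, prob)                      # binary response

    errs = []
    for rep in range(10):                          # repeated 2/3 training, 1/3 test splits
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=rep)
        fit = LassoCV(cv=5, n_alphas=50).fit(X_tr, y_tr)    # linear model for E[Y | X]
        y_pred = (fit.predict(X_te) > 0.5).astype(int)      # classify by thresholding at 1/2
        errs.append(np.mean(y_pred != y_te))
    print("cross-validated misclassification error: %.3f" % np.mean(errs))
    print("variables selected in the last fit:", int(np.sum(fit.coef_ != 0)))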

  21. and in fact: we will hear that ◮ Lasso is consistent for prediction assuming “essentially nothing” ◮ Lasso is optimal for prediction assuming the “compatibility condition” for X

  22. Estimation of regression coefficients: Y = Xβ0 + ε, p ≫ n, with fixed (deterministic) design X. Problem of identifiability: for p > n, Xβ0 = Xθ for any θ = β0 + ξ with ξ in the null space of X ❀ cannot say anything about ‖β̂ − β0‖ without further assumptions! ❀ we will work with the compatibility assumption (see later), and we will explain: under the compatibility condition, ‖β̂ − β0‖_1 ≤ C (s0/φ0^2) √(log(p)/n), where s0 = |supp(β0)| = |{j; β0_j ≠ 0}|
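
The rate s0 √(log(p)/n) can be eyeballed in a small simulation (a sketch of my own; λ is taken of order √(log(p)/n) rather than with the theoretical constants, and scikit-learn's penalty convention differs by a factor 2, so only the qualitative scaling is meant):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(3)

    def l1_error(n, p, s0, sigma=1.0):
        X = rng.standard_normal((n, p))
        beta0 = np.zeros(p); beta0[:s0] = 1.0
        y = X @ beta0 + sigma * rng.standard_normal(n)
        lam = sigma * np.sqrt(2 * np.log(p) / n)        # lambda of order sqrt(log(p)/n)
        beta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
        return np.sum(np.abs(beta_hat - beta0)), s0 * np.sqrt(np.log(p) / n)

    for n, p, s0 in [(100, 500, 3), (200, 1000, 3), (200, 1000, 6)]:
        err, rate = l1_error(n, p, s0)
        print("n=%4d p=%5d s0=%d: l1 error = %.2f, s0*sqrt(log(p)/n) = %.2f" % (n, p, s0, err, rate))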

  23. ⇒ let’s turn to the blackboard!

  24. Various conditions and their relations (van de Geer & PB, 2009): oracle inequalities for prediction and estimation. [Diagram of implications among conditions: RIP, weak (S,2s)-RIP, adaptive (S,2s)- and (S,s)-restricted regression, (S,2s)- and (S,s)-restricted eigenvalue, S-compatibility, coherence, and the (S,2s)-, weak (S,2s)- and (S,s)-uniform irrepresentable conditions.]
