
Statistics for high-dimensional data: p-values and confidence intervals - PowerPoint PPT Presentation

  1. Statistics for high-dimensional data: p-values and confidence intervals. Peter Bühlmann, Seminar für Statistik, ETH Zürich, June 2014

  2. High-dimensional data. Behavioral economics and genetics (with Ernst Fehr, U. Zurich): ◮ n = 1'525 persons ◮ genetic information (SNPs): p ≈ 10^6 ◮ 79 response variables, measuring "behavior"; p ≫ n. Goal: find significant associations between behavioral responses and genetic markers. [Figure: number of significant target SNPs per phenotype (y-axis: 0–700) against phenotype index (x-axis: 0–80)]

  3. In high-dimensional statistics, a lot of progress has been achieved over the last 8–10 years for ◮ point estimation ◮ rates of convergence, but very little work on assigning measures of uncertainty: p-values, confidence intervals.

  4. we need uncertainty quantification! (the core of statistics)

  5. Goal (regarding the title of the talk): p-values/confidence intervals for a high-dimensional linear model (and we can then generalize to other models).

  6. Motif regression and variable selection for finding HIF1α transcription factor binding sites in DNA sequences (Müller, Meier, PB & Ricci). For coarse DNA segments i = 1, ..., n: ◮ predictor X_i = (X_i^(1), ..., X_i^(p)) ∈ R^p: abundance scores of candidate motifs j = 1, ..., p in DNA segment i (using sequence data and computational biology algorithms, e.g. MDSCAN) ◮ univariate response Y_i ∈ R: binding intensity of HIF1α to the coarse DNA segment (from ChIP-chip experiments)

  7. Question: what is the relation between the binding intensity Y and the abundance of short candidate motifs? ❀ a linear model is often reasonable: "motif regression" (Conlon, X.S. Liu, Lieb & J.S. Liu, 2003):

Y_i = Σ_{j=1}^p β_j^0 X_i^(j) + ε_i,  i = 1, ..., n = 143,  p = 195.

Goal: variable selection and significance of variables ❀ find the relevant motifs among the p = 195 candidates.
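The code sketches below are in Python; this first one sets up a toy data set with the motif-regression dimensions n = 143, p = 195. The Gaussian design, the 5-variable support, and the noise level are illustrative assumptions, not the real motif data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 143, 195
X = rng.standard_normal((n, p))          # toy stand-in for motif abundance scores
beta0 = np.zeros(p)
beta0[:5] = 1.0                          # assumption: 5 truly relevant motifs
Y = X @ beta0 + rng.standard_normal(n)   # Y_i = sum_j beta0_j X_i^(j) + eps_i
```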

  8. Lasso for variable selection: Ŝ(λ) = {j : β̂_j(λ) ≠ 0}, an estimate for S_0 = {j : β_j^0 ≠ 0}. No significance testing involved; it's convex optimization only! And it's very popular (Meinshausen & PB, 2006; Zhao & Yu, 2006; Wainwright, 2009; ...)

  9–10. For motif regression (finding HIF1α transcription factor binding sites), n = 143, p = 195 ❀ the Lasso selects 26 covariates when choosing λ = λ̂_CV via cross-validation, with resulting R² ≈ 50%, i.e. 26 interesting candidate motifs. How significant are the findings?
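A minimal sketch of this selection step, reusing the toy X, Y from the snippet above: scikit-learn's LassoCV picks λ̂_CV, and the selected set Ŝ(λ̂_CV) is read off the nonzero coefficients. The selected count and R² will of course differ from the 26 and ≈ 50% reported for the real data.

```python
from sklearn.linear_model import LassoCV

lasso_cv = LassoCV(cv=10).fit(X, Y)      # lambda_CV via 10-fold cross-validation
S_hat = np.flatnonzero(lasso_cv.coef_)   # S_hat = {j : beta_hat_j(lambda_CV) != 0}
print(f"{S_hat.size} covariates selected, R^2 = {lasso_cv.score(X, Y):.2f}")
```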


  11. [Figure: estimated coefficients β̂(λ̂_CV), original data; x-axis: variables (0–200), y-axis: coefficient values (0.00–0.20)] p-values for H_{0,j}: β_j^0 = 0?

  12. P-values for high-dimensional linear models Y = Xβ^0 + ε. Goal: statistical hypothesis testing of H_{0,j}: β_j^0 = 0, or H_{0,G}: β_j^0 = 0 for all j ∈ G ⊆ {1, ..., p}. Background: if we could handle the asymptotic distribution of the Lasso β̂(λ) under the null hypothesis ❀ we could construct p-values. This is very difficult! The asymptotic distribution of β̂ has some point mass at zero, ... (Knight and Fu (2000) for p < ∞ and n → ∞).

  13. ❀ standard bootstrapping and subsampling cannot be used either, but there are recent proposals using adaptations of standard resampling methods (Chatterjee & Lahiri, 2013; Liu & Yu, 2013) ❀ non-uniformity/super-efficiency issues remain...

  14. Low-dimensional projections and bias correction, or: de-sparsifying the Lasso estimator (related work by Zhang and Zhang, 2011). Motivation: β̂_{OLS,j} = projection of Y onto the residuals (X_j − X_{−j} γ̂_{OLS}^{(j)}); this projection is not well defined if p > n ❀ use "regularized" residuals from the Lasso of X_j on the other X-variables:

Z_j = X_j − X_{−j} γ̂_{Lasso}^{(j)},  γ̂^{(j)} = argmin_γ ‖X_j − X_{−j} γ‖_2^2 / n + λ_j ‖γ‖_1
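A sketch of the regularized residuals Z_j via the nodewise Lasso of X_j on X_{−j}; choosing λ_j by cross-validation is one common choice assumed here, and all names carry over from the earlier snippets.

```python
from sklearn.linear_model import LassoCV

def nodewise_residual(X, j):
    """Regularized residual Z_j = X_j - X_{-j} gamma_hat_Lasso^{(j)}."""
    X_mj = np.delete(X, j, axis=1)                            # X_{-j}
    fit = LassoCV(cv=5, fit_intercept=False).fit(X_mj, X[:, j])
    return X[:, j] - X_mj @ fit.coef_

Z_1 = nodewise_residual(X, 0)   # residual vector for the first variable
```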

  15. Using Y = Xβ^0 + ε ❀

Z_j^T Y = Z_j^T X_j β_j^0 + Σ_{k≠j} Z_j^T X_k β_k^0 + Z_j^T ε

and hence

Z_j^T Y / (Z_j^T X_j) = β_j^0 + Σ_{k≠j} (Z_j^T X_k / Z_j^T X_j) β_k^0 + Z_j^T ε / (Z_j^T X_j),

where the middle sum is the bias and the last term the noise component. ❀ Lasso-estimated bias correction:

b̂_j = Z_j^T Y / (Z_j^T X_j) − Σ_{k≠j} (Z_j^T X_k / Z_j^T X_j) β̂_{Lasso;k}
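A direct transcription of the bias-corrected estimate b̂_j from the display above; beta_hat is the Lasso fit from the earlier snippet, and the loop over k ≠ j is written for readability rather than speed.

```python
def desparsified_coord(X, Y, j, beta_hat):
    """Bias-corrected estimate b_hat_j (slide 15 display)."""
    Z_j = nodewise_residual(X, j)
    denom = Z_j @ X[:, j]                                      # Z_j^T X_j
    others = [k for k in range(X.shape[1]) if k != j]
    bias = sum((Z_j @ X[:, k]) / denom * beta_hat[k] for k in others)
    return Z_j @ Y / denom - bias

b_1 = desparsified_coord(X, Y, 0, lasso_cv.coef_)
```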

  16–17. b̂_j is not sparse! ... and this is crucial to obtain the Gaussian limit. Nevertheless, it is "optimal" (see later). ◮ target: the low-dimensional component β_j^0 ◮ η := {β_k^0; k ≠ j} is a high-dimensional nuisance parameter ❀ exactly as in semiparametric modeling! And it is sparsely estimated (e.g. with the Lasso).


  18. =⇒ let's turn to the blackboard!

  19. A general principle: de-sparsifying via "inversion" of the KKT conditions. KKT conditions (sub-differential of the objective function ‖Y − Xβ‖_2^2 / n + λ‖β‖_1):

−X^T(Y − Xβ̂) / n + λτ̂ = 0,  ‖τ̂‖_∞ ≤ 1,  τ̂_j = sign(β̂_j) if β̂_j ≠ 0.

With Σ̂ = X^T X / n and Y = Xβ^0 + ε ❀ Σ̂(β̂ − β^0) + λτ̂ = X^T ε / n. Using a "regularized inverse" of Σ̂, denoted by Θ̂ (not e.g. GLasso):

β̂ − β^0 + Θ̂ λτ̂ = Θ̂ X^T ε / n − Δ,  Δ = (Θ̂Σ̂ − I)(β̂ − β^0).

New estimator: b̂ = β̂ + Θ̂ λτ̂ = β̂ + Θ̂ X^T(Y − Xβ̂) / n

  20. ❀ b̂ is exactly the same estimator as before (based on low-dimensional projections using the residual vectors Z_j) when taking Θ̂ (the "regularized inverse of Σ̂") to have rows built from the ("nodewise") Lasso-estimated coefficients of X_j versus X_{−j}:

γ̂_j = argmin_{γ ∈ R^{p−1}} ‖X_j − X_{−j} γ‖_2^2 / n + 2λ_j ‖γ‖_1.

Denote by Ĉ the p × p matrix with 1 on the diagonal and off-diagonal entries Ĉ_{j,k} = −γ̂_{j,k}, and let

T̂^2 = diag(τ̂_1^2, ..., τ̂_p^2),  τ̂_j^2 = ‖X_j − X_{−j} γ̂_j‖_2^2 / n + λ_j ‖γ̂_j‖_1.

Then Θ̂_Lasso = T̂^{−2} Ĉ. Not symmetric ... !
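A sketch of this construction: one nodewise Lasso per column gives Ĉ and T̂², hence Θ̂ = T̂^{−2}Ĉ and the full de-sparsified b̂. The per-node λ_j is again chosen by cross-validation, and sklearn's alpha_ stands in for λ_j up to scaling conventions; both are assumptions of this sketch, and the p nodewise fits make it slow for large p.

```python
def theta_hat_nodewise(X, cv=5):
    """Nodewise-Lasso Theta_hat = T_hat^{-2} C_hat (not symmetric)."""
    n, p = X.shape
    C = np.eye(p)
    tau2 = np.empty(p)
    for j in range(p):
        X_mj = np.delete(X, j, axis=1)
        fit = LassoCV(cv=cv, fit_intercept=False).fit(X_mj, X[:, j])
        gamma = fit.coef_
        resid = X[:, j] - X_mj @ gamma
        # tau_j^2 = ||X_j - X_{-j} gamma_hat_j||_2^2 / n + lambda_j ||gamma_hat_j||_1
        # (fit.alpha_ plays the role of lambda_j here, up to scaling conventions)
        tau2[j] = resid @ resid / n + fit.alpha_ * np.abs(gamma).sum()
        C[j, np.arange(p) != j] = -gamma        # row j collects -gamma_hat_{j,k}
    return C / tau2[:, None]                    # T_hat^{-2} C_hat: row j / tau_j^2

Theta = theta_hat_nodewise(X)
beta_hat = lasso_cv.coef_
b_hat = beta_hat + Theta @ X.T @ (Y - X @ beta_hat) / n   # de-sparsified Lasso
```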

  21. “inverting” the KKT conditions is a very general principle ❀ the principle can be used for GLMs and many other models

  22. Asymptotic pivot and optimality. Theorem (van de Geer, PB & Ritov, 2013):

√n (b̂_j − β_j^0) ⇒ N(0, σ_ε^2 Ω_jj)  (j = 1, ..., p),

with Ω_jj given by an explicit expression, ∼ (Σ^{−1})_jj. Optimal! Reaching the semiparametric information bound ❀ asymptotically optimal p-values and confidence intervals, if we assume: ◮ sub-Gaussian design (i.i.d. sub-Gaussian rows of X) whose population covariance Cov(X) = Σ has minimal eigenvalue ≥ M > 0 ◮ sparsity of the regression Y vs. X: s_0 = o(√n / log(p)) ("quite sparse") ◮ sparsity of the design: Σ^{−1} sparse, i.e. sparse regressions of X_j vs. X_{−j}: s_j = o(n / log(p)) ("maybe restrictive")
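A sketch of turning the pivot into p-values and confidence intervals; the plug-in Ω̂ = Θ̂Σ̂Θ̂^T and the crude residual-based σ̂_ε are illustrative choices of this sketch (the theorem's explicit expression for Ω_jj is not reproduced here).

```python
from scipy.stats import norm

Sigma_hat = X.T @ X / n
Omega = Theta @ Sigma_hat @ Theta.T            # plug-in for Omega (assumption)
resid = Y - X @ beta_hat
df = max(n - np.count_nonzero(beta_hat), 1)
sigma_eps = np.sqrt(resid @ resid / df)        # crude residual-based sigma_eps

se = sigma_eps * np.sqrt(np.diag(Omega) / n)   # std. error of b_hat_j
pvals = 2 * norm.sf(np.abs(b_hat / se))        # two-sided p-values for H_{0,j}
ci = np.column_stack([b_hat - 1.96 * se, b_hat + 1.96 * se])   # 95% CIs
```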

  23. It is optimal! (Cramér-Rao)

  24. For designs with Σ^{−1} non-sparse: ◮ Ridge projection (PB, 2013): good type I error control, but not optimal in terms of power ◮ a convex program instead of the Lasso for Z_j (Javanmard & Montanari, 2013; MSc thesis Dezeure, 2013); Javanmard & Montanari prove optimality. So far: no convincing empirical evidence that we can deal well with such scenarios (Σ^{−1} non-sparse).

  25. Uniform convergence: √n (b̂_j − β_j^0) ⇒ N(0, σ_ε^2 Ω_jj) (j = 1, ..., p); the convergence is uniform over B(s_0) = {β; ‖β‖_0 ≤ s_0} ❀ honest tests and confidence regions! And we can avoid post-model-selection inference (cf. Pötscher and Leeb).

  26. Simultaneous inference over all components: √n (b̂ − β^0) ≈ (W_1, ..., W_p) ∼ N_p(0, σ_ε^2 Ω) ❀ we can construct p-values for H_{0,G} with any G, using the test statistic max_{j ∈ G} |b̂_j|, since the covariance structure Ω is known; and since Ω is known, we can also easily do efficient multiple testing adjustment!
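A sketch of the max-statistic test for H_{0,G}: since σ_ε²Ω is (estimated as) known, the null distribution of max_{j∈G} |W_j| can be simulated directly; the group G and the number of draws are arbitrary illustrative choices, and the plug-ins Ω̂, σ̂_ε come from the previous snippet.

```python
G = np.arange(10)          # hypothetical group G of coordinates to test jointly
W = rng.multivariate_normal(np.zeros(p), sigma_eps**2 * Omega, size=10_000)
max_null = np.abs(W[:, G]).max(axis=1)        # simulated null of max_{j in G} |W_j|
t_obs = np.abs(np.sqrt(n) * b_hat[G]).max()   # observed max statistic
p_G = (max_null >= t_obs).mean()              # p-value for H_{0,G}
```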
