SLIDE 1

Statistics for high-dimensional data: p-values and confidence intervals

Peter Bühlmann

Seminar für Statistik, ETH Zürich

June 2014

SLIDE 2

High-dimensional data

Behavioral economics and genetics (with Ernst Fehr, U. Zurich)

◮ n = 1′525 persons
◮ genetic information (SNPs): p ≈ 10^6
◮ 79 response variables, measuring "behavior"

p ≫ n

goal: find significant associations between behavioral responses and genetic markers

[Figure: Number of significant target SNPs per phenotype, plotted against phenotype index]
SLIDE 3

in high-dimensional statistics: a lot of progress has been achieved over the last 8-10 years for

◮ point estimation
◮ rates of convergence

but very little work on assigning measures of uncertainty: p-values and confidence intervals

SLIDE 4

we need uncertainty quantification! (the core of statistics)

SLIDE 5

goal (regarding the title of the talk):

p-values/confidence intervals for a high-dimensional linear model

(and we can then generalize to other models)

SLIDE 6

Motif regression and variable selection

for finding HIF1α transcription factor binding sites in DNA sequences

Müller, Meier, PB & Ricci

for coarse DNA segments i = 1, ..., n :

◮ predictor X_i = (X_i^(1), …, X_i^(p)) ∈ R^p: abundance scores of candidate motifs j = 1, …, p in DNA segment i (using sequence data and computational biology algorithms, e.g. MDSCAN)

◮ univariate response Y_i ∈ R: binding intensity of HIF1α to coarse DNA segment i (from ChIP-chip experiments)

SLIDE 7

question: what is the relation between the binding intensity Y and the abundance of short candidate motifs?
❀ a linear model is often reasonable: "motif regression" (Conlon, X.S. Liu, Lieb & J.S. Liu, 2003)

Y_i = Σ_{j=1}^p β^0_j X_i^(j) + ε_i,  i = 1, …, n = 143;  p = 195

goal: variable selection and significance of variables
❀ find the relevant motifs among the p = 195 candidates

SLIDE 8

Lasso for variable selection:

Ŝ(λ) = {j : β̂_j(λ) ≠ 0}, an estimate of S_0 = {j : β^0_j ≠ 0}

no significance testing involved, it's convex optimization only! and it's very popular (Meinshausen & PB, 2006; Zhao & Yu, 2006; Wainwright, 2009; …)

SLIDE 9

for motif regression (finding HIF1α transcription factor binding sites), n = 143, p = 195:
❀ the Lasso selects 26 covariates when choosing λ = λ̂_CV via cross-validation, with resulting R² ≈ 50%, i.e. 26 interesting candidate motifs
how significant are the findings?
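Before turning to significance, the selection step itself can be made concrete with a small Python sketch; this is illustrative only, with synthetic data standing in for the motif-regression design.

```python
# Sketch: Lasso variable selection with cross-validated lambda.
# Synthetic data as a stand-in for the motif-regression design (n=143, p=195).
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p, s0 = 143, 195, 5
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:s0] = 1.0                       # a few truly relevant "motifs" (assumed)
y = X @ beta0 + rng.standard_normal(n)

fit = LassoCV(cv=10).fit(X, y)         # lambda_CV by 10-fold cross-validation
S_hat = np.flatnonzero(fit.coef_)      # selected set {j : beta_hat_j(lambda) != 0}
print("lambda_CV:", fit.alpha_)
print("selected variables:", S_hat)
print("in-sample R^2:", fit.score(X, y))
```

Everything here is one convex optimization plus cross-validation; no significance statement is attached to the selected set.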


SLIDE 11

estimated coefficients β̂(λ̂_CV)

[Figure: Lasso coefficient estimates for the original data, plotted against variable index (1-200); coefficients range from 0.00 to 0.20]

p-values for H_{0,j}: β^0_j = 0 ?

SLIDE 12

P-values for high-dimensional linear models

Y = Xβ^0 + ε

goal: statistical hypothesis testing of H_{0,j}: β^0_j = 0, or of H_{0,G}: β^0_j = 0 for all j ∈ G ⊆ {1, …, p}

background: if we could handle the asymptotic distribution of the Lasso β̂(λ) under the null hypothesis ❀ we could construct p-values
this is very difficult! the asymptotic distribution of β̂ has some point mass at zero, …
Knight and Fu (2000) for p < ∞ and n → ∞

SLIDE 13

❀ standard bootstrapping and subsampling cannot be used either, but there are recent proposals using adaptations of standard resampling methods (Chatterjee & Lahiri, 2013; Liu & Yu, 2013) ❀ non-uniformity/super-efficiency issues remain…

SLIDE 14

Low-dimensional projections and bias correction

or: de-sparsifying the Lasso estimator; related work by Zhang and Zhang (2011)

motivation: β̂_{OLS,j} = projection of Y onto the residuals X_j − X_{−j}γ̂^(j)_{OLS}

this projection is not well defined if p > n ❀ use "regularized" residuals from a Lasso of X_j on the remaining X-variables:

Z_j = X_j − X_{−j}γ̂^(j)_{Lasso},  γ̂^(j) = argmin_γ ‖X_j − X_{−j}γ‖²₂/n + λ_j‖γ‖₁

SLIDE 15

using Y = Xβ^0 + ε ❀

Z_j^T Y = Z_j^T X_j β^0_j + Σ_{k≠j} Z_j^T X_k β^0_k + Z_j^T ε

and hence

Z_j^T Y / (Z_j^T X_j) = β^0_j + Σ_{k≠j} (Z_j^T X_k / Z_j^T X_j) β^0_k  [bias]  +  Z_j^T ε / (Z_j^T X_j)  [noise component]

❀ b̂_j = Z_j^T Y / (Z_j^T X_j) − Σ_{k≠j} (Z_j^T X_k / Z_j^T X_j) β̂_{Lasso;k}  [Lasso-estimated bias correction]
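A minimal Python sketch of this coordinate-wise construction, assuming a Lasso estimate β̂ is already available; the function name, the fixed tuning parameter lam_j, and sklearn's internal scaling of the Lasso objective are illustrative assumptions, not the talk's exact recipe.

```python
# Sketch: de-sparsified Lasso estimate b_hat_j for one coordinate j.
# Z_j is the "regularized residual" from a nodewise Lasso of X_j on X_{-j}.
import numpy as np
from sklearn.linear_model import Lasso

def desparsified_coord(X, y, beta_hat, j, lam_j=0.1):
    p = X.shape[1]
    others = np.delete(np.arange(p), j)
    # nodewise Lasso of X_j on the remaining columns (sklearn's scaling of
    # the penalty differs from the slide's by constant factors)
    node = Lasso(alpha=lam_j).fit(X[:, others], X[:, j])
    Z_j = X[:, j] - node.predict(X[:, others])   # regularized residual
    denom = Z_j @ X[:, j]
    # Lasso-estimated bias correction: sum_{k != j} (Z_j'X_k / Z_j'X_j) beta_hat_k
    bias = (Z_j @ X[:, others] / denom) @ beta_hat[others]
    return Z_j @ y / denom - bias
```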
SLIDE 16

b̂_j is not sparse! … and this is crucial for obtaining the Gaussian limit; nevertheless, it is "optimal" (see later)

◮ target: the low-dimensional component β^0_j
◮ η := {β^0_k; k ≠ j} is a high-dimensional nuisance parameter

❀ exactly as in semiparametric modeling! the nuisance parameter is sparsely estimated (e.g. with the Lasso)


SLIDE 18

⇒ let's turn to the blackboard!

SLIDE 19

A general principle: de-sparsifying via "inversion" of the KKT conditions

KKT conditions (sub-differential of the objective function ‖Y − Xβ‖²₂/n + λ‖β‖₁):

−X^T(Y − Xβ̂)/n + λτ̂ = 0,  with ‖τ̂‖_∞ ≤ 1 and τ̂_j = sign(β̂_j) if β̂_j ≠ 0

with Σ̂ = X^T X/n and Y = Xβ^0 + ε ❀

Σ̂(β̂ − β^0) + λτ̂ = X^T ε/n

using a "regularized inverse" of Σ̂, denoted by Θ̂ (not e.g. GLasso):

β̂ − β^0 + Θ̂λτ̂ = Θ̂X^T ε/n − Δ,  Δ = (Θ̂Σ̂ − I)(β̂ − β^0)

new estimator: b̂ = β̂ + Θ̂λτ̂ = β̂ + Θ̂X^T(Y − Xβ̂)/n

SLIDE 20

❀ b̂ is exactly the same estimator as before (based on low-dimensional projections using the residual vectors Z_j) … when taking Θ̂ (the "regularized inverse of Σ̂") with rows built from the ("nodewise") Lasso-estimated coefficients of X_j versus X_{−j}:

γ̂_j = argmin_{γ∈R^{p−1}} ‖X_j − X_{−j}γ‖²₂/n + 2λ_j‖γ‖₁

Denote by Ĉ the p×p matrix with ones on the diagonal and entries −γ̂_{j,k} off the diagonal, i.e. row j is

Ĉ_j = (−γ̂_{j,1}, …, −γ̂_{j,j−1}, 1, −γ̂_{j,j+1}, …, −γ̂_{j,p})

and let T̂² = diag(τ̂²_1, …, τ̂²_p) with τ̂²_j = ‖X_j − X_{−j}γ̂_j‖²₂/n + λ_j‖γ̂_j‖₁. Then

Θ̂_Lasso = T̂^{−2}Ĉ   (not symmetric… !)
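A sketch of this matrix form under the same illustrative conventions: run the nodewise Lasso for every j, assemble Ĉ and T̂², set Θ̂ = T̂⁻²Ĉ, and de-sparsify β̂ in one line. The fixed tuning parameters (and sklearn's penalty scaling) are assumptions made to keep the sketch short.

```python
# Sketch: nodewise-Lasso Theta_hat = T_hat^{-2} C_hat and the de-sparsified
# estimator b_hat = beta_hat + Theta_hat X^T (y - X beta_hat) / n.
import numpy as np
from sklearn.linear_model import Lasso

def desparsified_lasso(X, y, lam=0.1, lam_node=0.1):
    n, p = X.shape
    beta_hat = Lasso(alpha=lam).fit(X, y).coef_
    C = np.eye(p)                                 # rows of C_hat: 1 and -gamma_hat_j
    tau2 = np.empty(p)
    for j in range(p):
        others = np.delete(np.arange(p), j)
        node = Lasso(alpha=lam_node).fit(X[:, others], X[:, j])
        C[j, others] = -node.coef_
        resid = X[:, j] - node.predict(X[:, others])
        tau2[j] = resid @ resid / n + lam_node * np.abs(node.coef_).sum()
    Theta_hat = C / tau2[:, None]                 # T_hat^{-2} C_hat (row scaling)
    b_hat = beta_hat + Theta_hat @ X.T @ (y - X @ beta_hat) / n
    return b_hat, Theta_hat, beta_hat
```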

SLIDE 21

“inverting” the KKT conditions is a very general principle ❀ the principle can be used for GLMs and many other models

SLIDE 22

Asymptotic pivot and optimality

Theorem (van de Geer, PB & Ritov, 2013):

√n (b̂_j − β^0_j) ⇒ N(0, σ²_ε Ω_jj)  (j = 1, …, p)

Ω_jj has an explicit expression, ∼ (Σ^{−1})_jj: optimal! reaching the semiparametric information bound
❀ asymptotically optimal p-values and confidence intervals (a numerical sketch follows the assumptions below), if we assume:

◮ sub-Gaussian design (i.i.d. rows of X sub-Gaussian) with population Cov(X) = Σ having minimal eigenvalue ≥ M > 0
◮ sparsity of the regression Y vs. X: s_0 = o(√n / log(p))  "quite sparse"
◮ sparsity of the design: Σ^{−1} sparse, i.e. sparse regressions X_j vs. X_{−j}: s_j = o(n / log(p))  "maybe restrictive"
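Given b̂ and Θ̂ from a construction like the one above, approximate p-values and confidence intervals follow directly from the pivot; Ω̂ = Θ̂Σ̂Θ̂^T and the residual-based σ̂_ε below are illustrative plug-in choices, not the paper's exact recipe.

```python
# Sketch: p-values and confidence intervals from the asymptotic pivot
#   sqrt(n) (b_hat_j - beta0_j)  ->  N(0, sigma_eps^2 * Omega_jj).
import numpy as np
from scipy import stats

def pivot_inference(X, y, b_hat, Theta_hat, beta_hat, level=0.95):
    n = X.shape[0]
    Sigma_hat = X.T @ X / n
    Omega_hat = Theta_hat @ Sigma_hat @ Theta_hat.T    # plug-in for Omega
    resid = y - X @ beta_hat
    df = max(n - np.count_nonzero(beta_hat), 1)
    sigma_eps = np.sqrt(resid @ resid / df)            # illustrative noise estimate
    se = sigma_eps * np.sqrt(np.diag(Omega_hat) / n)   # std. errors of b_hat_j
    pvals = 2 * stats.norm.sf(np.abs(b_hat / se))      # for H_{0,j}: beta0_j = 0
    q = stats.norm.ppf(0.5 + level / 2)
    ci = np.column_stack([b_hat - q * se, b_hat + q * se])
    return pvals, ci
```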

SLIDE 23

It is optimal! Cramér-Rao

SLIDE 24

for designs with Σ^{−1} non-sparse:

◮ Ridge projection (PB, 2013): good type I error control, but not optimal in terms of power
◮ a convex program instead of the Lasso for Z_j (Javanmard & Montanari, 2013; MSc. thesis Dezeure, 2013); Javanmard & Montanari prove optimality

so far: no convincing empirical evidence that we can deal well with such scenarios (Σ^{−1} non-sparse)

SLIDE 25

Uniform convergence:

√n (b̂_j − β^0_j) ⇒ N(0, σ²_ε Ω_jj)  (j = 1, …, p)

the convergence is uniform over B(s_0) = {β : ‖β‖_0 ≤ s_0} ❀ honest tests and confidence regions! and we can avoid post-model-selection inference (cf. Pötscher and Leeb)
SLIDE 26

Simultaneous inference over all components:

√n (b̂ − β^0) ≈ (W_1, …, W_p) ∼ N_p(0, σ²_ε Ω)

❀ can construct p-values for H_{0,G} with any G, using the test statistic max_{j∈G} |b̂_j|, since the covariance structure Ω is known; and, again because Ω is known, we can easily do efficient multiple-testing adjustment!

SLIDE 27

Alternatives?

◮ versions of bootstrapping (Chatterjee & Lahiri, 2013)
❀ super-efficiency phenomenon, i.e. non-uniform convergence (Joe Hodges):
  • good for estimating the zeroes (i.e., j ∈ S^c_0 with β^0_j = 0)
  • bad for estimating the non-zeroes (i.e., j ∈ S_0 with β^0_j ≠ 0)
◮ multiple sample splitting (Meinshausen, Meier & PB, 2009): split the sample repeatedly into two halves:
  • select variables on the first half
  • p-values using the second half, based on the selected variables
❀ avoids (because of sample splitting) over-optimistic p-values, but potentially suffers in terms of power; see the sketch after this list

◮ covariance test (Lockhart, Taylor, Tibshirani & Tibshirani, 2014)
◮ no sparsity assumption on Σ^{−1} (Javanmard and Montanari, 2014)
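A hedged sketch of multi sample splitting, with a deliberately crude aggregation rule (twice the median over splits) in place of the quantile-optimized aggregation of Meinshausen, Meier & PB (2009); the function name and tuning choices are illustrative.

```python
# Sketch: multi sample-splitting p-values. Per split: Lasso selection on one
# half, OLS t-test p-values on the other half, Bonferroni within the split;
# aggregate over splits by 2 * median (crude but valid).
import numpy as np
from scipy import stats
from sklearn.linear_model import LassoCV

def multi_split_pvals(X, y, B=50, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    P = np.ones((B, p))                      # unselected variables keep p-value 1
    for b in range(B):
        idx = rng.permutation(n)
        i1, i2 = idx[: n // 2], idx[n // 2:]
        sel = np.flatnonzero(LassoCV(cv=5).fit(X[i1], y[i1]).coef_)
        if sel.size == 0 or sel.size >= i2.size:
            continue                         # skip degenerate splits
        Xs = X[i2][:, sel]
        beta, *_ = np.linalg.lstsq(Xs, y[i2], rcond=None)
        resid = y[i2] - Xs @ beta
        df = i2.size - sel.size
        se = np.sqrt(resid @ resid / df * np.diag(np.linalg.inv(Xs.T @ Xs)))
        praw = 2 * stats.t.sf(np.abs(beta / se), df)
        P[b, sel] = np.minimum(praw * sel.size, 1.0)   # Bonferroni in the split
    return np.minimum(2 * np.median(P, axis=0), 1.0)
```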

SLIDE 28

Some empirical results (Dezeure, PB, Meier & Meinshausen, in progress): compare power and control of the familywise error rate (FWER); always p = 500, n = 100 and s_0 = 15

[Figure: FWER and power, each on a 0.0-1.0 scale, for the methods Covtest, JM, MS-Split, Ridge and Despars-Lasso]

SLIDE 29

confidence intervals

◮ for β^0_j (j ∈ S_0)
◮ for β^0_j = 0 (j ∈ S^c_0), where the intervals exhibit the worst coverage (for each method)

coverage values (in %) per method; scenario: Toeplitz design, s_0 = 3, coefficients ∼ U[0,2]:

Jm2013:        45 65 69 78 80 82 83 84 84 85 85 94 86 86 86 86 87 87 88
liuyu:         91 92 43 98 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99
Res-Boot:      92 80 53 96 97 98 98 98 99 99 99 99 99 99 99 99 99 99 99
MS-Split:      69 100 99 100 100 100 100 100 100 100 100 94 100 100 100 100 100 100 100
Ridge:         86 97 95 98 98 98 98 98 98 98 99 99 99 99 99 99 99 99 99
Lasso-Pro Z&Z: 82 86 91 86 87 88 88 88 88 89 89 95 89 89 89 90 90 90 90
Lasso-Pro:     78 82 80 83 84 86 87 87 87 87 87 95 88 88 88 88 89 89 89

SLIDE 30

Motif regression example

one significant variable with both the "de-sparsified Lasso" and multi sample splitting

[Figure: motif regression, estimated coefficients (0.00-0.20) plotted against variable index (1-200), with two variables highlighted]

• : variable/motif with FWER-adjusted p-value 0.006
• : p-value clearly larger than 0.05

(this variable corresponds to a known true motif)

SLIDE 31

for data-sets with p ≈ 4′000-10′000 and n ≈ 100 ❀ often no significant variable, because the ratio log(p)/n is too extreme

SLIDE 32

Behavioral economics and genomewide association (with Ernst Fehr, University of Zurich)

◮ n = 1′525 probands (all students!)
◮ m = 79 response variables measuring various behavioral characteristics (e.g. risk aversion) from well-designed experiments
◮ 460 Target SNPs (as a proxy for ≈ 10^6 SNPs): 1′380 parameters per response (but only 1′341 meaningful parameters)

model: multivariate linear model

Y (n×m responses) = X (n×p SNP data) β (p×m) + ε (n×m error)

although p < n, the design matrix X (with categorical values ∈ {1, 2, 3}) does not have full rank

SLIDE 33

Y (n×m) = X (n×p) β (p×m) + ε (n×m)

interested in p-values for
H_{0,jk}: β_jk = 0 versus H_{A,jk}: β_jk ≠ 0,
H_{0,G}: β_jk = 0 for all (j,k) ∈ G versus H_{A,G} = H^c_{0,G}

adjusted to control the familywise error rate (i.e. a conservative criterion)
in total: we consider 110'857 hypotheses
we test for non-marginal regression coefficients ❀ "predictive" GWAS

SLIDE 34

there is structure!

◮ 79 response experiments
◮ 23 chromosomes per response experiment
◮ 20 Target SNPs per chromosome (= 460 Target SNPs)

[Diagram: the testing hierarchy, from a global node over the 79 responses and 23 chromosomes per response down to the 20 Target SNPs per chromosome]

SLIDE 35

do hierarchical FWER adjustment (Meinshausen, 2008)

[Diagram: the same testing hierarchy, with significant and non-significant nodes marked]

  • 1. test the global hypothesis
  • 2. if significant: test all single-response hypotheses
  • 3. for the significant responses: test all single-chromosome hypotheses
  • 4. for the significant chromosomes: test all Target SNPs

❀ powerful multiple testing with data-dependent adaptation of the resolution level (our analysis with 20 Target SNPs per chromosome is ad-hoc); a sketch of this top-down traversal follows below

  • cf. the general sequential testing principle (Goeman & Solari, 2010)
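A schematic sketch of the top-down traversal; the tree representation and test_group (which should return a suitably FWER-adjusted p-value for H_{0,G}, e.g. from the max-type group test) are placeholders, and a faithful implementation of Meinshausen (2008) would also redistribute the level α across the tree.

```python
# Sketch: top-down hierarchical testing on the
# global -> response -> chromosome -> Target-SNP tree.
# A node is (group_of_parameters, list_of_child_nodes); children are only
# examined below significant nodes (data-dependent resolution).
def hierarchical_test(node, test_group, alpha=0.05, found=None):
    if found is None:
        found = []
    group, children = node
    if test_group(group) <= alpha:
        if not children:                 # leaf: e.g. a single Target SNP
            found.append(group)
        for child in children:
            hierarchical_test(child, test_group, alpha, found)
    return found
```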
SLIDE 36

testing a group hypothesis H_{0,G}: β^0_j ≡ 0 for all j ∈ G

test statistic: max_{j∈G} |b̂_j|

since under H_{0,G}:

√n b̂_G = N_G(0, σ²_ε Ω_G) + Δ_G,  Δ_G = (Δ_j; j ∈ G),  √n ‖Δ‖_∞ = o_P(1)

thus:

max_{j∈G} √n |b̂_j| ⇒ σ_ε max_{j∈G} |W_j|,  (W_1, …, W_p) ∼ N_p(0, Ω)

and we can easily simulate max_{j∈G} |W_j|
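Since Ω is known (or estimated, e.g. via Θ̂Σ̂Θ̂^T), the null distribution of the maximum can be simulated by plain Monte Carlo; a minimal sketch, assuming Ω restricted to G is positive definite:

```python
# Sketch: Monte Carlo p-value for H_{0,G} from the group statistic
#   sqrt(n) * max_{j in G} |b_hat_j|  vs.  sigma_eps * max_{j in G} |W_j|.
import numpy as np

def group_pvalue(b_hat, G, Omega, sigma_eps, n, n_sim=10_000, seed=0):
    rng = np.random.default_rng(seed)
    G = np.asarray(G)
    T_obs = np.sqrt(n) * np.max(np.abs(b_hat[G]))
    L = np.linalg.cholesky(Omega[np.ix_(G, G)])      # W_G ~ N(0, Omega_G)
    W = rng.standard_normal((n_sim, G.size)) @ L.T
    T_sim = sigma_eps * np.max(np.abs(W), axis=1)
    return (1 + np.sum(T_sim >= T_obs)) / (n_sim + 1)
```

In practice σ_ε and Ω would themselves be plug-in estimates, as in the pivot sketch above.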

SLIDE 37

number of significant SNP parameters per response

[Figure: Number of significant target SNPs per phenotype, plotted against phenotype index]

response 40 has the most significant (levels of) Target SNPs

SLIDE 38

Conclusions

we can construct asymptotically optimal p-values and confidence intervals for low-dimensional targets in high-dimensional models

R-package hdi: high-dimensional inference (Meier, 2013)

assuming/based on suitable conditions:
  • sparsity of Y vs. X: s_0 = o(√n / log(p))
  • sparsity of X_j vs. X_{−j} (j = 1, …, p): max_j s_j = o(n / log(p))
  • design matrix X is not too ill-posed (e.g. restricted eigenvalue assumption; or a nice population covariance)

these conditions are typically uncheckable… ❀ confirmatory high-dimensional inference remains challenging

Thank you!


SLIDE 41

R-package: hdi (Meier, 2013)

References:

◮ Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methodology, Theory and Applications. Springer.

◮ Meinshausen, N., Meier, L. and Bühlmann, P. (2009). P-values for high-dimensional regression. Journal of the American Statistical Association 104, 1671-1681.

◮ Bühlmann, P. (2013). Statistical significance in high-dimensional linear models. Bernoulli 19, 1212-1242.

◮ van de Geer, S., Bühlmann, P. and Ritov, Y. (2013). On asymptotically optimal confidence regions and tests for high-dimensional models. Preprint arXiv:1303.0518v1.

◮ Meier, L. (2013). hdi: High-dimensional inference. R package, available from R-Forge.