SLIDE 1

Statistics for high-dimensional data: p-values and confidence intervals

Peter Bühlmann

Seminar für Statistik, ETH Zürich

June 2014

SLIDE 2

High-dimensional data

Behavioral economics and genetics (with Ernst Fehr, U. Zurich)

◮ n = 1′525 persons
◮ genetic information (SNPs): p ≈ 10^6
◮ 79 response variables, measuring "behavior"

p ≫ n

goal: find significant associations between behavioral responses and genetic markers

[Figure: Number of significant target SNPs per phenotype, plotted against phenotype index]
SLIDE 3

in high-dimensional statistics: a lot of progress has been achieved over the last 8-10 years for

◮ point estimation
◮ rates of convergence

but very little work on assigning measures of uncertainty: p-values and confidence intervals

SLIDE 4

we need uncertainty quantification! (the core of statistics)

SLIDE 5

goal (regarding the title of the talk):

p-values/confidence intervals for a high-dimensional linear model

(and we can then generalize to other models)

SLIDE 6

Motif regression and variable selection

for finding HIF1α transcription factor binding sites in DNA sequences

Müller, Meier, PB & Ricci

for coarse DNA segments i = 1, ..., n :

◮ predictor X_i = (X_i^(1), …, X_i^(p)) ∈ R^p: abundance scores of candidate motifs j = 1, …, p in DNA segment i (using sequence data and computational biology algorithms, e.g. MDSCAN)

◮ univariate response Y_i ∈ R: binding intensity of HIF1α to coarse DNA segment i (from ChIP-chip experiments)

SLIDE 7

question: what is the relation between the binding intensity Y and the abundance of short candidate motifs?
❀ a linear model is often reasonable: "motif regression" (Conlon, X.S. Liu, Lieb & J.S. Liu, 2003)

Y_i = Σ_{j=1}^p β^0_j X_i^(j) + ε_i,  i = 1, …, n = 143;  p = 195

goal: variable selection and significance of variables
❀ find the relevant motifs among the p = 195 candidates

SLIDE 8

Lasso for variable selection:

Ŝ(λ) = {j : β̂_j(λ) ≠ 0}, an estimate of S_0 = {j : β^0_j ≠ 0}

no significance testing involved, it's convex optimization only! and it's very popular (Meinshausen & PB, 2006; Zhao & Yu, 2006; Wainwright, 2009; …)

SLIDE 9

for motif regression (finding HIF1α transcription factor binding sites), n = 143, p = 195:
❀ the Lasso selects 26 covariates when choosing λ = λ̂_CV via cross-validation, with resulting R² ≈ 50%, i.e. 26 interesting candidate motifs
how significant are the findings?
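Before turning to significance, the selection step itself can be made concrete with a small Python sketch; this is illustrative only, with synthetic data standing in for the motif-regression design.

```python
# Sketch: Lasso variable selection with cross-validated lambda.
# Synthetic data as a stand-in for the motif-regression design (n=143, p=195).
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p, s0 = 143, 195, 5
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:s0] = 1.0                       # a few truly relevant "motifs" (assumed)
y = X @ beta0 + rng.standard_normal(n)

fit = LassoCV(cv=10).fit(X, y)         # lambda_CV by 10-fold cross-validation
S_hat = np.flatnonzero(fit.coef_)      # selected set {j : beta_hat_j(lambda) != 0}
print("lambda_CV:", fit.alpha_)
print("selected variables:", S_hat)
print("in-sample R^2:", fit.score(X, y))
```

Everything here is one convex optimization plus cross-validation; no significance statement is attached to the selected set.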


SLIDE 11

estimated coefficients β̂(λ̂_CV)

[Figure: Lasso coefficient estimates for the original data, plotted against variable index (1-200); coefficients range from 0.00 to 0.20]

p-values for H_{0,j}: β^0_j = 0 ?

SLIDE 12

P-values for high-dimensional linear models

Y = Xβ^0 + ε

goal: statistical hypothesis testing of H_{0,j}: β^0_j = 0, or of H_{0,G}: β^0_j = 0 for all j ∈ G ⊆ {1, …, p}

background: if we could handle the asymptotic distribution of the Lasso β̂(λ) under the null hypothesis ❀ we could construct p-values
this is very difficult! the asymptotic distribution of β̂ has some point mass at zero, …
Knight and Fu (2000) for p < ∞ and n → ∞

SLIDE 13

❀ standard bootstrapping and subsampling cannot be used either, but there are recent proposals using adaptations of standard resampling methods (Chatterjee & Lahiri, 2013; Liu & Yu, 2013) ❀ non-uniformity/super-efficiency issues remain…

SLIDE 14

Low-dimensional projections and bias correction

or: de-sparsifying the Lasso estimator; related work by Zhang and Zhang (2011)

motivation: β̂_{OLS,j} = projection of Y onto the residuals X_j − X_{−j}γ̂^(j)_{OLS}

this projection is not well defined if p > n ❀ use "regularized" residuals from a Lasso of X_j on the remaining X-variables:

Z_j = X_j − X_{−j}γ̂^(j)_{Lasso},  γ̂^(j) = argmin_γ ‖X_j − X_{−j}γ‖²₂/n + λ_j‖γ‖₁

SLIDE 15

using Y = Xβ^0 + ε ❀

Z_j^T Y = Z_j^T X_j β^0_j + Σ_{k≠j} Z_j^T X_k β^0_k + Z_j^T ε

and hence

Z_j^T Y / (Z_j^T X_j) = β^0_j + Σ_{k≠j} (Z_j^T X_k / Z_j^T X_j) β^0_k  [bias]  +  Z_j^T ε / (Z_j^T X_j)  [noise component]

❀ b̂_j = Z_j^T Y / (Z_j^T X_j) − Σ_{k≠j} (Z_j^T X_k / Z_j^T X_j) β̂_{Lasso;k}  [Lasso-estimated bias correction]
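A minimal Python sketch of this coordinate-wise construction, assuming a Lasso estimate β̂ is already available; the function name, the fixed tuning parameter lam_j, and sklearn's internal scaling of the Lasso objective are illustrative assumptions, not the talk's exact recipe.

```python
# Sketch: de-sparsified Lasso estimate b_hat_j for one coordinate j.
# Z_j is the "regularized residual" from a nodewise Lasso of X_j on X_{-j}.
import numpy as np
from sklearn.linear_model import Lasso

def desparsified_coord(X, y, beta_hat, j, lam_j=0.1):
    p = X.shape[1]
    others = np.delete(np.arange(p), j)
    # nodewise Lasso of X_j on the remaining columns (sklearn's scaling of
    # the penalty differs from the slide's by constant factors)
    node = Lasso(alpha=lam_j).fit(X[:, others], X[:, j])
    Z_j = X[:, j] - node.predict(X[:, others])   # regularized residual
    denom = Z_j @ X[:, j]
    # Lasso-estimated bias correction: sum_{k != j} (Z_j'X_k / Z_j'X_j) beta_hat_k
    bias = (Z_j @ X[:, others] / denom) @ beta_hat[others]
    return Z_j @ y / denom - bias
```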
SLIDE 16

b̂_j is not sparse! … and this is crucial for obtaining the Gaussian limit; nevertheless, it is "optimal" (see later)

◮ target: the low-dimensional component β^0_j
◮ η := {β^0_k; k ≠ j} is a high-dimensional nuisance parameter

❀ exactly as in semiparametric modeling! the nuisance parameter is sparsely estimated (e.g. with the Lasso)


SLIDE 18

⇒ let's turn to the blackboard!

SLIDE 19

A general principle: de-sparsifying via "inversion" of the KKT conditions

KKT conditions (sub-differential of the objective function ‖Y − Xβ‖²₂/n + λ‖β‖₁):

−X^T(Y − Xβ̂)/n + λτ̂ = 0,  with ‖τ̂‖_∞ ≤ 1 and τ̂_j = sign(β̂_j) if β̂_j ≠ 0

with Σ̂ = X^T X/n and Y = Xβ^0 + ε ❀

Σ̂(β̂ − β^0) + λτ̂ = X^T ε/n

using a "regularized inverse" of Σ̂, denoted by Θ̂ (not e.g. GLasso):

β̂ − β^0 + Θ̂λτ̂ = Θ̂X^T ε/n − Δ,  Δ = (Θ̂Σ̂ − I)(β̂ − β^0)

new estimator: b̂ = β̂ + Θ̂λτ̂ = β̂ + Θ̂X^T(Y − Xβ̂)/n

SLIDE 20

❀ b̂ is exactly the same estimator as before (based on low-dimensional projections using the residual vectors Z_j) … when taking Θ̂ (the "regularized inverse of Σ̂") with rows built from the ("nodewise") Lasso-estimated coefficients of X_j versus X_{−j}:

γ̂_j = argmin_{γ∈R^{p−1}} ‖X_j − X_{−j}γ‖²₂/n + 2λ_j‖γ‖₁

Denote by Ĉ the p×p matrix with ones on the diagonal and entries −γ̂_{j,k} off the diagonal, i.e. row j is

Ĉ_j = (−γ̂_{j,1}, …, −γ̂_{j,j−1}, 1, −γ̂_{j,j+1}, …, −γ̂_{j,p})

and let T̂² = diag(τ̂²_1, …, τ̂²_p) with τ̂²_j = ‖X_j − X_{−j}γ̂_j‖²₂/n + λ_j‖γ̂_j‖₁. Then

Θ̂_Lasso = T̂^{−2}Ĉ   (not symmetric… !)
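A sketch of this matrix form under the same illustrative conventions: run the nodewise Lasso for every j, assemble Ĉ and T̂², set Θ̂ = T̂⁻²Ĉ, and de-sparsify β̂ in one line. The fixed tuning parameters (and sklearn's penalty scaling) are assumptions made to keep the sketch short.

```python
# Sketch: nodewise-Lasso Theta_hat = T_hat^{-2} C_hat and the de-sparsified
# estimator b_hat = beta_hat + Theta_hat X^T (y - X beta_hat) / n.
import numpy as np
from sklearn.linear_model import Lasso

def desparsified_lasso(X, y, lam=0.1, lam_node=0.1):
    n, p = X.shape
    beta_hat = Lasso(alpha=lam).fit(X, y).coef_
    C = np.eye(p)                                 # rows of C_hat: 1 and -gamma_hat_j
    tau2 = np.empty(p)
    for j in range(p):
        others = np.delete(np.arange(p), j)
        node = Lasso(alpha=lam_node).fit(X[:, others], X[:, j])
        C[j, others] = -node.coef_
        resid = X[:, j] - node.predict(X[:, others])
        tau2[j] = resid @ resid / n + lam_node * np.abs(node.coef_).sum()
    Theta_hat = C / tau2[:, None]                 # T_hat^{-2} C_hat (row scaling)
    b_hat = beta_hat + Theta_hat @ X.T @ (y - X @ beta_hat) / n
    return b_hat, Theta_hat, beta_hat
```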

SLIDE 21

“inverting” the KKT conditions is a very general principle ❀ the principle can be used for GLMs and many other models

SLIDE 22

Asymptotic pivot and optimality

Theorem (van de Geer, PB & Ritov, 2013):

√n (b̂_j − β^0_j) ⇒ N(0, σ²_ε Ω_jj)  (j = 1, …, p)

Ω_jj has an explicit expression, ∼ (Σ^{−1})_jj: optimal! reaching the semiparametric information bound
❀ asymptotically optimal p-values and confidence intervals (a numerical sketch follows the assumptions below), if we assume:

◮ sub-Gaussian design (i.i.d. rows of X sub-Gaussian) with population Cov(X) = Σ having minimal eigenvalue ≥ M > 0
◮ sparsity of the regression Y vs. X: s_0 = o(√n / log(p))  "quite sparse"
◮ sparsity of the design: Σ^{−1} sparse, i.e. sparse regressions X_j vs. X_{−j}: s_j = o(n / log(p))  "maybe restrictive"
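Given b̂ and Θ̂ from a construction like the one above, approximate p-values and confidence intervals follow directly from the pivot; Ω̂ = Θ̂Σ̂Θ̂^T and the residual-based σ̂_ε below are illustrative plug-in choices, not the paper's exact recipe.

```python
# Sketch: p-values and confidence intervals from the asymptotic pivot
#   sqrt(n) (b_hat_j - beta0_j)  ->  N(0, sigma_eps^2 * Omega_jj).
import numpy as np
from scipy import stats

def pivot_inference(X, y, b_hat, Theta_hat, beta_hat, level=0.95):
    n = X.shape[0]
    Sigma_hat = X.T @ X / n
    Omega_hat = Theta_hat @ Sigma_hat @ Theta_hat.T    # plug-in for Omega
    resid = y - X @ beta_hat
    df = max(n - np.count_nonzero(beta_hat), 1)
    sigma_eps = np.sqrt(resid @ resid / df)            # illustrative noise estimate
    se = sigma_eps * np.sqrt(np.diag(Omega_hat) / n)   # std. errors of b_hat_j
    pvals = 2 * stats.norm.sf(np.abs(b_hat / se))      # for H_{0,j}: beta0_j = 0
    q = stats.norm.ppf(0.5 + level / 2)
    ci = np.column_stack([b_hat - q * se, b_hat + q * se])
    return pvals, ci
```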

SLIDE 23

It is optimal! Cramér-Rao

SLIDE 24

for designs with Σ^{−1} non-sparse:

◮ Ridge projection (PB, 2013): good type I error control, but not optimal in terms of power
◮ a convex program instead of the Lasso for Z_j (Javanmard & Montanari, 2013; MSc. thesis Dezeure, 2013); Javanmard & Montanari prove optimality

so far: no convincing empirical evidence that we can deal well with such scenarios (Σ^{−1} non-sparse)

SLIDE 25

Uniform convergence:

√n (b̂_j − β^0_j) ⇒ N(0, σ²_ε Ω_jj)  (j = 1, …, p)

the convergence is uniform over B(s_0) = {β : ‖β‖_0 ≤ s_0} ❀ honest tests and confidence regions! and we can avoid post-model-selection inference (cf. Pötscher and Leeb)
SLIDE 26

Simultaneous inference over all components:

√n (b̂ − β^0) ≈ (W_1, …, W_p) ∼ N_p(0, σ²_ε Ω)

❀ can construct p-values for H_{0,G} with any G, using the test statistic max_{j∈G} |b̂_j|, since the covariance structure Ω is known; and, again because Ω is known, we can easily do efficient multiple-testing adjustment!

SLIDE 27

Alternatives?

◮ versions of bootstrapping (Chatterjee & Lahiri, 2013)
❀ super-efficiency phenomenon, i.e. non-uniform convergence (Joe Hodges):
  • good for estimating the zeroes (i.e., j ∈ S^c_0 with β^0_j = 0)
  • bad for estimating the non-zeroes (i.e., j ∈ S_0 with β^0_j ≠ 0)
◮ multiple sample splitting (Meinshausen, Meier & PB, 2009): split the sample repeatedly into two halves:
  • select variables on the first half
  • p-values using the second half, based on the selected variables
❀ avoids (because of sample splitting) over-optimistic p-values, but potentially suffers in terms of power; see the sketch after this list

◮ covariance test (Lockhart, Taylor, Tibshirani & Tibshirani, 2014)
◮ no sparsity assumption on Σ^{−1} (Javanmard and Montanari, 2014)
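A hedged sketch of multi sample splitting, with a deliberately crude aggregation rule (twice the median over splits) in place of the quantile-optimized aggregation of Meinshausen, Meier & PB (2009); the function name and tuning choices are illustrative.

```python
# Sketch: multi sample-splitting p-values. Per split: Lasso selection on one
# half, OLS t-test p-values on the other half, Bonferroni within the split;
# aggregate over splits by 2 * median (crude but valid).
import numpy as np
from scipy import stats
from sklearn.linear_model import LassoCV

def multi_split_pvals(X, y, B=50, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    P = np.ones((B, p))                      # unselected variables keep p-value 1
    for b in range(B):
        idx = rng.permutation(n)
        i1, i2 = idx[: n // 2], idx[n // 2:]
        sel = np.flatnonzero(LassoCV(cv=5).fit(X[i1], y[i1]).coef_)
        if sel.size == 0 or sel.size >= i2.size:
            continue                         # skip degenerate splits
        Xs = X[i2][:, sel]
        beta, *_ = np.linalg.lstsq(Xs, y[i2], rcond=None)
        resid = y[i2] - Xs @ beta
        df = i2.size - sel.size
        se = np.sqrt(resid @ resid / df * np.diag(np.linalg.inv(Xs.T @ Xs)))
        praw = 2 * stats.t.sf(np.abs(beta / se), df)
        P[b, sel] = np.minimum(praw * sel.size, 1.0)   # Bonferroni in the split
    return np.minimum(2 * np.median(P, axis=0), 1.0)
```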

SLIDE 28

Some empirical results (Dezeure, PB, Meier & Meinshausen, in progress): compare power and control of the familywise error rate (FWER); always p = 500, n = 100 and s_0 = 15

[Figure: FWER and power, each on a 0.0-1.0 scale, for the methods Covtest, JM, MS-Split, Ridge and Despars-Lasso]

SLIDE 29

confidence intervals

◮ for β^0_j (j ∈ S_0)
◮ for β^0_j = 0 (j ∈ S^c_0), where the intervals exhibit the worst coverage (for each method)

coverage values (in %) per method; scenario: Toeplitz design, s_0 = 3, coefficients ∼ U[0,2]:

Jm2013:        45 65 69 78 80 82 83 84 84 85 85 94 86 86 86 86 87 87 88
liuyu:         91 92 43 98 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99
Res-Boot:      92 80 53 96 97 98 98 98 99 99 99 99 99 99 99 99 99 99 99
MS-Split:      69 100 99 100 100 100 100 100 100 100 100 94 100 100 100 100 100 100 100
Ridge:         86 97 95 98 98 98 98 98 98 98 99 99 99 99 99 99 99 99 99
Lasso-Pro Z&Z: 82 86 91 86 87 88 88 88 88 89 89 95 89 89 89 90 90 90 90
Lasso-Pro:     78 82 80 83 84 86 87 87 87 87 87 95 88 88 88 88 89 89 89

SLIDE 30

Motif regression example

one significant variable with both the "de-sparsified Lasso" and multi sample splitting

[Figure: motif regression, estimated coefficients (0.00-0.20) plotted against variable index (1-200), with two variables highlighted]

• : variable/motif with FWER-adjusted p-value 0.006
• : p-value clearly larger than 0.05

(this variable corresponds to a known true motif)

SLIDE 31

for data-sets with p ≈ 4′000-10′000 and n ≈ 100 ❀ often no significant variable, because the ratio log(p)/n is too extreme

SLIDE 32

Behavioral economics and genomewide association (with Ernst Fehr, University of Zurich)

◮ n = 1′525 probands (all students!)
◮ m = 79 response variables measuring various behavioral characteristics (e.g. risk aversion) from well-designed experiments
◮ 460 Target SNPs (as a proxy for ≈ 10^6 SNPs): 1′380 parameters per response (but only 1′341 meaningful parameters)

model: multivariate linear model

Y (n×m responses) = X (n×p SNP data) β (p×m) + ε (n×m error)

although p < n, the design matrix X (with categorical values ∈ {1, 2, 3}) does not have full rank

SLIDE 33

Y (n×m) = X (n×p) β (p×m) + ε (n×m)

interested in p-values for
H_{0,jk}: β_jk = 0 versus H_{A,jk}: β_jk ≠ 0,
H_{0,G}: β_jk = 0 for all (j,k) ∈ G versus H_{A,G} = H^c_{0,G}

adjusted to control the familywise error rate (i.e. a conservative criterion)
in total: we consider 110'857 hypotheses
we test for non-marginal regression coefficients ❀ "predictive" GWAS

SLIDE 34

there is structure!

◮ 79 response experiments
◮ 23 chromosomes per response experiment
◮ 20 Target SNPs per chromosome (= 460 Target SNPs)

[Diagram: the testing hierarchy, from a global node over the 79 responses and 23 chromosomes per response down to the 20 Target SNPs per chromosome]

SLIDE 35

do hierarchical FWER adjustment (Meinshausen, 2008)

[Diagram: the same testing hierarchy, with significant and non-significant nodes marked]

  • 1. test the global hypothesis
  • 2. if significant: test all single-response hypotheses
  • 3. for the significant responses: test all single-chromosome hypotheses
  • 4. for the significant chromosomes: test all Target SNPs

❀ powerful multiple testing with data-dependent adaptation of the resolution level (our analysis with 20 Target SNPs per chromosome is ad-hoc); a sketch of this top-down traversal follows below

  • cf. the general sequential testing principle (Goeman & Solari, 2010)
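A schematic sketch of the top-down traversal; the tree representation and test_group (which should return a suitably FWER-adjusted p-value for H_{0,G}, e.g. from the max-type group test) are placeholders, and a faithful implementation of Meinshausen (2008) would also redistribute the level α across the tree.

```python
# Sketch: top-down hierarchical testing on the
# global -> response -> chromosome -> Target-SNP tree.
# A node is (group_of_parameters, list_of_child_nodes); children are only
# examined below significant nodes (data-dependent resolution).
def hierarchical_test(node, test_group, alpha=0.05, found=None):
    if found is None:
        found = []
    group, children = node
    if test_group(group) <= alpha:
        if not children:                 # leaf: e.g. a single Target SNP
            found.append(group)
        for child in children:
            hierarchical_test(child, test_group, alpha, found)
    return found
```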
SLIDE 36

testing a group hypothesis H_{0,G}: β^0_j ≡ 0 for all j ∈ G

test statistic: max_{j∈G} |b̂_j|

since under H_{0,G}:

√n b̂_G = N_G(0, σ²_ε Ω_G) + Δ_G,  Δ_G = (Δ_j; j ∈ G),  √n ‖Δ‖_∞ = o_P(1)

thus:

max_{j∈G} √n |b̂_j| ⇒ σ_ε max_{j∈G} |W_j|,  (W_1, …, W_p) ∼ N_p(0, Ω)

and we can easily simulate max_{j∈G} |W_j|
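Since Ω is known (or estimated, e.g. via Θ̂Σ̂Θ̂^T), the null distribution of the maximum can be simulated by plain Monte Carlo; a minimal sketch, assuming Ω restricted to G is positive definite:

```python
# Sketch: Monte Carlo p-value for H_{0,G} from the group statistic
#   sqrt(n) * max_{j in G} |b_hat_j|  vs.  sigma_eps * max_{j in G} |W_j|.
import numpy as np

def group_pvalue(b_hat, G, Omega, sigma_eps, n, n_sim=10_000, seed=0):
    rng = np.random.default_rng(seed)
    G = np.asarray(G)
    T_obs = np.sqrt(n) * np.max(np.abs(b_hat[G]))
    L = np.linalg.cholesky(Omega[np.ix_(G, G)])      # W_G ~ N(0, Omega_G)
    W = rng.standard_normal((n_sim, G.size)) @ L.T
    T_sim = sigma_eps * np.max(np.abs(W), axis=1)
    return (1 + np.sum(T_sim >= T_obs)) / (n_sim + 1)
```

In practice σ_ε and Ω would themselves be plug-in estimates, as in the pivot sketch above.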

SLIDE 37

number of significant SNP parameters per response

[Figure: Number of significant target SNPs per phenotype, plotted against phenotype index]

response 40 has the most significant (levels of) Target SNPs

SLIDE 38

Conclusions

we can construct asymptotically optimal p-values and confidence intervals for low-dimensional targets in high-dimensional models

R-package hdi: high-dimensional inference (Meier, 2013)

assuming/based on suitable conditions:
  • sparsity of Y vs. X: s_0 = o(√n / log(p))
  • sparsity of X_j vs. X_{−j} (j = 1, …, p): max_j s_j = o(n / log(p))
  • design matrix X is not too ill-posed (e.g. restricted eigenvalue assumption; or a nice population covariance)

these conditions are typically uncheckable… ❀ confirmatory high-dimensional inference remains challenging

Thank you!


SLIDE 41

R-package: hdi (Meier, 2013)

References:

◮ Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methodology, Theory and Applications. Springer.

◮ Meinshausen, N., Meier, L. and Bühlmann, P. (2009). P-values for high-dimensional regression. Journal of the American Statistical Association 104, 1671-1681.

◮ Bühlmann, P. (2013). Statistical significance in high-dimensional linear models. Bernoulli 19, 1212-1242.

◮ van de Geer, S., Bühlmann, P. and Ritov, Y. (2013). On asymptotically optimal confidence regions and tests for high-dimensional models. Preprint arXiv:1303.0518v1.

◮ Meier, L. (2013). hdi: High-dimensional inference. R package, available from R-Forge.