Statistics for high-dimensional data: p-values and confidence intervals
Peter Bühlmann
Seminar für Statistik, ETH Zürich
June 2014

High-dimensional data
Behavioral economics and genetics (with Ernst Fehr, U. Zurich)
◮ $n = 1'525$ persons
◮ genetic information (SNPs): $p \approx 10^6$
◮ 79 response variables, measuring “behavior”
◮ point estimation
◮ rates of convergence
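◮ predictor $X_i = (X_i^{(1)}, \ldots, X_i^{(p)})$
◮ univariate response $Y_i \in \mathbb{R}$: binding intensity of HIF1α to …

linear model: $Y_i = \sum_{j=1}^{p} \beta_j^0 X_i^{(j)} + \varepsilon_i$
active set: $S_0 = \{j;\ \beta_j^0 \neq 0\}$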
[Figure: coefficients (vertical axis, 0.00 to 0.20) plotted against variables (horizontal axis)]
$H_{0,j}: \beta_j^0 = 0$ or $H_{0,G}: \beta_j^0 = 0$ for all $j \in G \subseteq \{1, \ldots, p\}$
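motivation: $\hat\beta_{\mathrm{OLS},j}$ is the projection of $Y$ onto the residuals $(X_j - X_{-j}\hat\gamma_{j,\mathrm{OLS}})$
for $p > n$: use regularized residuals from a Lasso regression of $X_j$ on $X_{-j}$, $Z_j = X_j - X_{-j}\hat\gamma_{j,\mathrm{Lasso}}$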
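using $Y = X\beta^0 + \varepsilon$:

$$Z_j^T Y = Z_j^T X_j \beta_j^0 + \sum_{k \neq j} Z_j^T X_k \beta_k^0 + Z_j^T \varepsilon$$

and hence

$$\frac{Z_j^T Y}{Z_j^T X_j} = \beta_j^0 + \sum_{k \neq j} \frac{Z_j^T X_k}{Z_j^T X_j}\,\beta_k^0 + \frac{Z_j^T \varepsilon}{Z_j^T X_j}$$

(second term: bias; third term: noise). Plugging in the Lasso for the unknown $\beta_k^0$ gives the bias-corrected estimator

$$\hat b_j = \frac{Z_j^T Y}{Z_j^T X_j} - \sum_{k \neq j} \frac{Z_j^T X_k}{Z_j^T X_j}\,\hat\beta_{\mathrm{Lasso};k}$$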
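The bias correction above can be tried on simulated data. The following is a minimal Python sketch under stated assumptions, not the implementation used in the talk or in the hdi package: the coordinate-descent solver `lasso`, the helper `desparsified_lasso`, the toy design and the tuning values `lam = 0.4` and `lam_j = 0.3` are all hypothetical choices made for illustration.

```python
import random

def soft(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def lasso(cols, y, lam, sweeps=100):
    """Coordinate descent for min_b ||y - X b||_2^2 / n + lam * ||b||_1.
    `cols` is the list of columns of X (each a list of length n)."""
    n = len(y)
    b = [0.0] * len(cols)
    r = list(y)                                   # residual y - X b
    sq = [dot(c, c) / n for c in cols]
    for _ in range(sweeps):
        for k, c in enumerate(cols):
            rho = dot(c, r) / n + sq[k] * b[k]    # fit to the partial residual
            new = soft(rho, lam / 2) / sq[k]
            if new != b[k]:
                step = new - b[k]
                r = [ri - step * ci for ri, ci in zip(r, c)]
                b[k] = new
    return b

def desparsified_lasso(cols, y, j, lam, lam_j):
    """b_hat_j = Z_j'Y / Z_j'X_j - sum_{k != j} (Z_j'X_k / Z_j'X_j) * beta_hat_k."""
    beta_hat = lasso(cols, y, lam)
    others = [k for k in range(len(cols)) if k != j]
    # nodewise Lasso of X_j on X_{-j} with penalty 2 * lam_j * ||gamma||_1
    gamma = lasso([cols[k] for k in others], cols[j], 2 * lam_j)
    z = list(cols[j])                             # Z_j = X_j - X_{-j} gamma
    for g, k in zip(gamma, others):
        if g != 0.0:
            z = [zi - g * xi for zi, xi in zip(z, cols[k])]
    zx = dot(z, cols[j])
    bias = sum(dot(z, cols[k]) / zx * beta_hat[k] for k in others)
    return dot(z, y) / zx - bias, beta_hat[j], z

# toy data (assumed setup): n = 50 < p = 60, three active variables
rng = random.Random(1)
n, p = 50, 60
cols = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
beta0 = [2.0, -1.5, 1.0] + [0.0] * (p - 3)
y = [sum(beta0[k] * cols[k][i] for k in range(3)) + rng.gauss(0, 0.5)
     for i in range(n)]

b_hat_0, lasso_0, z0 = desparsified_lasso(cols, y, j=0, lam=0.4, lam_j=0.3)
print(f"Lasso: {lasso_0:.3f}   de-sparsified: {b_hat_0:.3f}   truth: {beta0[0]}")
```

The Lasso coefficient is shrunken towards zero, whereas the de-sparsified value should lie closer to the true coefficient; how close depends on the assumed tuning parameters and on the random design.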
◮ target: low-dimensional component $\beta_j^0$
◮ $\eta := \{\beta_k^0;\ k \neq j\}$ is a high-dimensional nuisance parameter
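$\hat\beta = \arg\min_\beta \|Y - X\beta\|_2^2/n + \lambda\|\beta\|_1$
$\hat\gamma_j = \arg\min_\gamma \|X_j - X_{-j}\gamma\|_2^2/n + 2\lambda_j\|\gamma\|_1$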
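$\hat T^2 = \mathrm{diag}(\hat\tau_1^2, \ldots, \hat\tau_p^2)$, $\hat\tau_j^2 = \|X_j - X_{-j}\hat\gamma_j\|_2^2/n + \lambda_j\|\hat\gamma_j\|_1$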
$\sqrt{n}\,(\hat b_j - \beta_j^0) \Rightarrow \mathcal{N}(0, \sigma_\varepsilon^2 \Omega_{jj})$ $(j = 1, \ldots, p)$
◮ sub-Gaussian design (i.i.d. rows of $X$ sub-Gaussian) with …
◮ sparsity for regr. $Y$ vs. $X$: $s_0 = o(\sqrt{n}/\log(p))$ (“quite sparse”)
◮ sparsity of design: $\Sigma^{-1}$ sparse
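The limiting normal distribution turns an estimate into a p-value and a confidence interval. Here is a minimal Python sketch, assuming that estimates of $\sigma_\varepsilon$ and $\Omega_{jj}$ are already available; the function name `normal_inference` and the numbers in the usage line are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def normal_inference(b_hat_j, sigma_eps, omega_jj, n, level=0.95):
    """Two-sided p-value for H_0: beta_j^0 = 0 and a confidence interval,
    both based on sqrt(n) * (b_hat_j - beta_j^0) ~ N(0, sigma_eps^2 * omega_jj)."""
    se = sigma_eps * sqrt(omega_jj / n)           # approximate standard error
    std_normal = NormalDist()
    p_value = 2.0 * (1.0 - std_normal.cdf(abs(b_hat_j) / se))
    q = std_normal.inv_cdf(0.5 + level / 2.0)     # e.g. about 1.96 for 95%
    return p_value, (b_hat_j - q * se, b_hat_j + q * se)

# hypothetical plug-in values, for illustration only
p_val, (lo, hi) = normal_inference(b_hat_j=0.42, sigma_eps=1.0, omega_jj=1.2, n=100)
print(f"p-value = {p_val:.4f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

With many coordinates $j$ tested at once, such raw p-values would still need a multiplicity adjustment.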
◮ Ridge projection (PB, 2013): good type I error control but
◮ convex program instead of Lasso for $Z_j$ (Javanmard & …
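$\sqrt{n}\,(\hat b_j - \beta_j^0) \Rightarrow \mathcal{N}(0, \sigma_\varepsilon^2 \Omega_{jj})$ $(j = 1, \ldots, p)$
… $\mathcal{N}(0, \sigma_\varepsilon^2 \Omega)$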
◮ versions of bootstrapping (Chatterjee & Lahiri, 2013)
… with $\beta_j^0 = 0$)
… $\beta_j^0 = 0$)
◮ multiple sample splitting (Meinshausen, Meier & PB, 2009)
◮ covariance test (Lockhart, Taylor, Tibshirani, Tibshirani, 2014)
◮ no sparsity assumption on $\Sigma^{-1}$ (Javanmard and Montanari, 2014)
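For the multiple sample splitting entry, the step that combines p-values across random splits can be sketched briefly. This minimal Python example assumes the fixed-quantile rule with $\gamma = 0.5$ (twice the median, capped at 1) as I understand it from Meinshausen, Meier & PB (2009), and assumes the per-split p-values have already been adjusted within each split; the function name and the input numbers are hypothetical.

```python
from statistics import median

def aggregate_split_pvalues(pvals_per_split):
    """Twice-the-median aggregation (gamma = 0.5) of the per-split
    p-values of one variable, capped at 1."""
    return min(1.0, 2.0 * median(pvals_per_split))

# hypothetical adjusted p-values of one variable from B = 5 random splits;
# a split in which the variable was not selected contributes the value 1.0
splits = [0.004, 0.020, 0.011, 1.0, 0.008]
print(aggregate_split_pvalues(splits))
```

Using a quantile rather than the mean keeps a single unlucky split from dominating the aggregated p-value.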
[Figure: FWER and power (both on a 0 to 1 scale) for the methods Covtest, JM, MS-Split, Ridge and Despars-Lasso]
◮ for $\beta_j^0$ $(j \in S_0)$
◮ for $\beta_j^0 = 0$ $(j \in S_0^c)$ where the intervals exhibit the worst …
Values per method (setting: Toeplitz, $s_0 = 3$, U[0,2]):

Jm2013:         45 65 69 78 80 82 83 84 84 85 85 94 86 86 86 86 87 87 88
liuyu:          91 92 43 98 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99
Res-Boot:       92 80 53 96 97 98 98 98 99 99 99 99 99 99 99 99 99 99 99
MS-Split:       69 100 99 100 100 100 100 100 100 100 100 94 100 100 100 100 100 100 100
Ridge:          86 97 95 98 98 98 98 98 98 98 99 99 99 99 99 99 99 99 99
Lasso-Pro Z&Z:  82 86 91 86 87 88 88 88 88 89 89 95 89 89 89 90 90 90 90
Lasso-Pro:      78 82 80 83 84 86 87 87 87 87 87 95 88 88 88 88 89 89 89
[Figure: motif regression; coefficients (vertical axis, 0.00 to 0.20) plotted against variables (horizontal axis)]
◮ $n = 1525$ probands (all students!)
◮ $m = 79$ response variables measuring various behavioral …
◮ 460 Target SNPs (as a proxy for $\approx 10^6$ SNPs):
$H_{0,G}$

◮ 79 response experiments
◮ 23 chromosomes per response experiment
◮ 20 Target SNPs per chromosome = 460 Target SNPs
[Figure: hierarchy of tests from the global level through 79 responses and 23 chromosomes down to 20 target SNPs; nodes marked significant or not significant]
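$H_{0,G}: \beta_j^0 \equiv 0$ for all $j \in G$
test statistic: $\max_{j \in G} |\hat b_j|$
under $H_{0,G}$: $\sqrt{n}\,\hat b_G = W_G + \Delta_G$, with $W_G \sim \mathcal{N}(0, \sigma_\varepsilon^2 \Omega_G)$ and $\max_{j \in G} |\Delta_j|$ asymptotically negligible,
so that $\max_{j \in G} \sqrt{n}\,|\hat b_j| \approx \max_{j \in G} |W_j|$, whose distribution can be simulated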
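The group test can be carried out by simulating the maximum of the Gaussian vector $W_G$. The following is a minimal Python sketch under stated assumptions: the covariance `cov_G` (standing in for an estimate of $\sigma_\varepsilon^2 \Omega_G$) is taken as given and positive definite, the remainder $\Delta_G$ is ignored, and the function names and the numbers in the usage lines are hypothetical.

```python
import random
from math import sqrt

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive definite a."""
    m = len(a)
    L = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for k in range(i + 1):
            s = sum(L[i][t] * L[k][t] for t in range(k))
            if i == k:
                L[i][i] = sqrt(a[i][i] - s)
            else:
                L[i][k] = (a[i][k] - s) / L[k][k]
    return L

def max_stat_pvalue(b_hat_G, cov_G, n, n_sim=2000, seed=0):
    """Monte Carlo p-value for H_{0,G}: compare sqrt(n) * max_j |b_hat_j|
    with simulated draws of max_j |W_j|, where W ~ N(0, cov_G)."""
    observed = sqrt(n) * max(abs(b) for b in b_hat_G)
    L = cholesky(cov_G)
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_sim):
        z = [rng.gauss(0.0, 1.0) for _ in cov_G]
        w = [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(len(z))]
        if max(abs(v) for v in w) >= observed:
            exceed += 1
    return (exceed + 1) / (n_sim + 1)             # never exactly zero

# hypothetical group of three variables with correlated estimates
cov_G = [[1.0, 0.5, 0.2],
         [0.5, 1.0, 0.5],
         [0.2, 0.5, 1.0]]
print(max_stat_pvalue([0.05, -0.31, 0.12], cov_G, n=100))
```

Because the maximum is taken over the whole group, no sparsity or independence within $G$ is needed for the simulation step itself; only the covariance estimate enters.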
[Figure: Number of significant target SNPs per phenotype (horizontal axis: phenotype index; vertical axis: number of significant target SNPs)]
◮ Bühlmann …
◮ Meinshausen, N., Meier, L. and Bühlmann …
◮ Bühlmann …
◮ van de Geer, S., Bühlmann …
◮ Meier, L. (2013). hdi: High-dimensional inference. R-package