Statistics for High-Dimensional Data: Selected Topics

Peter Bühlmann
Seminar für Statistik, ETH Zürich
June 2014

High-dimensional linear model: basic concepts in methodology, theory and computation

High-dimensional data
Riboflavin production with Bacillus Subtilis
(in collaboration with DSM (Switzerland))
goal: improve the riboflavin production rate of Bacillus Subtilis using clever genetic engineering

response variable $Y \in \mathbb{R}$: riboflavin (log-) production rate
covariates $X \in \mathbb{R}^p$: expressions from p = 4088 genes
sample size n = 115, p ≫ n

gene expression data: Y versus 9 "reasonable" genes
[Figure: nine scatter plots of the response y versus xsel, the expression of each of the 9 selected genes]
general framework: $Z_1, \ldots, Z_n$ (with some "i.i.d. components"), $\dim(Z_i) \gg n$

for example:
$Z_i = (X_i, Y_i)$, $X_i \in \mathcal{X} \subseteq \mathbb{R}^p$, $Y_i \in \mathcal{Y} \subseteq \mathbb{R}$: regression with p ≫ n
$Z_i = (X_i, Y_i)$, $X_i \in \mathcal{X} \subseteq \mathbb{R}^p$, $Y_i \in \{0, 1\}$: classification with p ≫ n

numerous applications: biology, imaging, economy, environmental sciences, ...
High-dimensional linear models
$$Y_i = \sum_{j=1}^p \beta^0_j X_i^{(j)} + \varepsilon_i,\quad i = 1, \ldots, n,\quad p \gg n$$

in short: $Y = X\beta^0 + \varepsilon$

goals:
◮ prediction, e.g. w.r.t. squared prediction error
◮ estimation of $\beta^0$, e.g. w.r.t. $\|\hat\beta - \beta^0\|_q$ ($q = 1, 2$)
◮ variable selection, i.e. estimating the active set of the effective variables (those with corresponding coefficient $\neq 0$)
we need to regularize... and there are many proposals
◮ Bayesian methods for regularization
◮ greedy algorithms: aka forward selection or boosting
◮ preliminary dimension reduction
◮ ...
e.g. 4’160’000 entries on Google Scholar for “high dimensional linear model” ...
Penalty-based methods

if the true $\beta^0$ is sparse w.r.t.
◮ $\|\beta^0\|_0$ = number of non-zero coefficients
❀ regularize with the $\|\cdot\|_0$-penalty: $\arg\min_\beta\big(n^{-1}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_0\big)$, e.g. AIC, BIC
❀ computationally infeasible if p is large ($2^p$ sub-models)
◮ $\|\beta^0\|_1 = \sum_{j=1}^p |\beta^0_j|$
❀ penalize with the $\|\cdot\|_1$-norm, i.e. Lasso: $\arg\min_\beta\big(n^{-1}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1\big)$
❀ convex optimization: computationally feasible and very fast for large p

as we will see: regularization with the ℓ1-norm is good ("nearly optimal") even if the truth is sparse w.r.t. $\|\beta^0\|_0$
❀ can exploit the computational advantage of $\|\cdot\|_1$-norm regularization even if the problem has a different sparsity structure
The Lasso (Tibshirani, 1996)
Lasso for linear models:
$$\hat\beta(\lambda) = \arg\min_\beta\Big(n^{-1}\|Y - X\beta\|_2^2 + \underbrace{\lambda}_{\ge 0}\underbrace{\|\beta\|_1}_{=\sum_{j=1}^p |\beta_j|}\Big)$$
❀ convex optimization problem
◮ Lasso does variable selection: some of the $\hat\beta_j(\lambda) = 0$ (because of the "ℓ1-geometry")
◮ $\hat\beta(\lambda)$ is a shrunken least-squares estimate
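A minimal sketch in R of fitting the Lasso in a p ≫ n setting, using the glmnet package (cited later in these slides for computation); the simulated data here are purely illustrative:

```r
# Illustrative sketch: Lasso on simulated p >> n data using glmnet.
library(glmnet)

set.seed(1)
n <- 100; p <- 1000; s0 <- 5
X <- matrix(rnorm(n * p), n, p)
beta0 <- c(rep(2, s0), rep(0, p - s0))   # sparse true coefficient vector
Y <- as.vector(X %*% beta0 + rnorm(n))

fit <- glmnet(X, Y)                      # whole Lasso path over a lambda grid
cvfit <- cv.glmnet(X, Y)                 # lambda chosen by cross-validation
bhat <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]  # drop intercept
sum(bhat != 0)                           # Lasso sets many coefficients exactly to 0
```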
more about the "ℓ1-geometry": equivalence to the primal problem
$$\hat\beta_{\mathrm{primal}}(R) = \arg\min_{\beta;\,\|\beta\|_1 \le R} \|Y - X\beta\|_2^2/n,$$
with a correspondence between λ and R which depends on the data $(X_1, Y_1), \ldots, (X_n, Y_n)$
[such an equivalence holds since
◮ $\|Y - X\beta\|_2^2/n$ is convex in β
◮ the constraint $\|\beta\|_1 \le R$ is convex
see e.g. Bertsekas (1995)]
p = 2, ℓ1-"world" (left): the residual sum of squares reaches its minimal value (for certain constellations of the data) when its contour lines hit the ℓ1-ball at a corner ❀ $\hat\beta_1 = 0$
the ℓ2-"world" is different: Ridge regression,
$$\hat\beta_{\mathrm{Ridge}}(\lambda) = \arg\min_\beta\big(\|Y - X\beta\|_2^2/n + \lambda\|\beta\|_2^2\big),$$
with equivalent primal solution
$$\hat\beta_{\mathrm{Ridge;primal}}(R) = \arg\min_{\beta;\,\|\beta\|_2 \le R} \|Y - X\beta\|_2^2/n,$$
with a one-to-one correspondence between λ and R
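For contrast, a short sketch of the ℓ2-world in the same tool: glmnet's elastic-net mixing parameter alpha gives Ridge at alpha = 0 (reusing X, Y from the sketch above):

```r
# Ridge vs. Lasso: alpha = 0 is the l2-penalty, alpha = 1 the l1-penalty.
fit_ridge <- glmnet(X, Y, alpha = 0)   # shrinks coefficients, but no exact zeros
fit_lasso <- glmnet(X, Y, alpha = 1)   # sparse solutions (the default)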
A note on the Bayesian approach

model: $\beta_1, \ldots, \beta_p$ i.i.d. $\sim p(\beta)d\beta$; given β: $Y \sim \mathcal{N}_n(X\beta, \sigma^2 I_n)$ with density $f(y|\sigma^2, \beta)$

posterior density:
$$p(\beta|Y, \sigma^2) = \frac{f(Y|\beta, \sigma^2)\,p(\beta)}{\int f(Y|\beta, \sigma^2)\,p(\beta)\,d\beta} \propto f(Y|\beta, \sigma^2)\,p(\beta)$$

and hence for the MAP (Maximum A-Posteriori) estimator:
$$\hat\beta_{\mathrm{MAP}} = \arg\max_\beta p(\beta|Y, \sigma^2) = \arg\min_\beta\Big(-\log\big(f(Y|\beta, \sigma^2)\,p(\beta)\big)\Big) = \arg\min_\beta\Big(\frac{1}{2\sigma^2}\|Y - X\beta\|_2^2 - \sum_{j=1}^p \log p(\beta_j)\Big)$$
examples:

1. Double-Exponential prior DExp(τ): $p(\beta) = \frac{\tau}{2}\exp(-\tau|\beta|)$
❀ $\hat\beta_{\mathrm{MAP}}$ equals the Lasso with penalty parameter $\lambda = n^{-1}2\sigma^2\tau$

2. Gaussian prior $\mathcal{N}(0, \tau^2)$: $p(\beta) = \frac{1}{\sqrt{2\pi}\,\tau}\exp(-\beta^2/(2\tau^2))$
❀ $\hat\beta_{\mathrm{MAP}}$ equals the Ridge estimator with penalty parameter $\lambda = n^{-1}\sigma^2/\tau^2$

but we will argue that the Lasso (i.e., the MAP estimator under the Double-Exponential prior) is also good if the truth is sparse with respect to $\|\beta^0\|_0$, e.g. if the prior is (much) more spiky around zero than the Double-Exponential distribution
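To see the first correspondence, plug the Double-Exponential prior into the MAP criterion above:
$$-\log\big(f(Y|\beta, \sigma^2)\,p(\beta)\big) = \frac{1}{2\sigma^2}\|Y - X\beta\|_2^2 + \tau\sum_{j=1}^p |\beta_j| + \text{const.}$$
Multiplying by $2\sigma^2/n$ (which does not change the minimizer) yields $n^{-1}\|Y - X\beta\|_2^2 + n^{-1}2\sigma^2\tau\,\|\beta\|_1$, i.e. the Lasso with $\lambda = n^{-1}2\sigma^2\tau$; the Ridge case is analogous.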
Orthonormal design

$Y = X\beta + \varepsilon$, $n^{-1}X^TX = I$: Lasso = soft-thresholding estimator
$$\hat\beta_j(\lambda) = \mathrm{sign}(Z_j)\,(|Z_j| - \lambda/2)_+,\quad Z_j = (n^{-1}X^TY)_j \ \text{(= OLS)},$$
i.e. $\hat\beta_j(\lambda) = g_{\mathrm{soft}}(Z_j)$ [Exercise]
[Figure: threshold functions g(z) for z ∈ [−3, 3] — Adaptive Lasso, Hard-thresholding, Soft-thresholding]
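A minimal sketch of the soft-thresholding function $g_{\mathrm{soft}}$ in R:

```r
# Soft-thresholding: the exact coordinatewise Lasso solution under
# orthonormal design, g_soft(z) = sign(z) * (|z| - lambda/2)_+.
soft_threshold <- function(z, lambda) {
  sign(z) * pmax(abs(z) - lambda / 2, 0)
}
soft_threshold(c(-2, -0.3, 0.1, 1.5), lambda = 1)  # -1.5  0.0  0.0  1.0
```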
Prediction
goal: predict a new observation $Y_{\mathrm{new}}$

consider the expected (w.r.t. new data; and random X) squared error loss:
$$E_{X_{\mathrm{new}},Y_{\mathrm{new}}}\big[(Y_{\mathrm{new}} - X_{\mathrm{new}}\hat\beta)^2\big] = \sigma^2 + E_{X_{\mathrm{new}}}\big[(X_{\mathrm{new}}(\beta^0 - \hat\beta))^2\big] = \sigma^2 + (\hat\beta - \beta^0)^T\underbrace{\Sigma}_{=\mathrm{Cov}(X)}(\hat\beta - \beta^0)$$

❀ terminology "prediction error":
for random design X: $(\hat\beta - \beta^0)^T\Sigma(\hat\beta - \beta^0) = E_{X_{\mathrm{new}}}[(X_{\mathrm{new}}(\hat\beta - \beta^0))^2]$
for fixed design X: $(\hat\beta - \beta^0)^T\hat\Sigma(\hat\beta - \beta^0) = \|X(\hat\beta - \beta^0)\|_2^2/n$
binary lymph node classification using gene expressions: a high-noise problem
n = 49 samples, p = 7130 gene expressions

despite it being classification: $P[Y = 1|X = x] = E[Y|X = x]$ ❀ estimate $\hat p(x)$ via a linear model; can then do classification
cross-validated misclassification error (2/3 training; 1/3 test):

Lasso    L2Boosting    FPLR      Pelora    1-NN      DLDA      SVM
21.1%    17.7%         35.25%    27.8%     43.25%    36.12%    36.88%

(some methods with pre-selection of the best 200 genes by Wilcoxon test; the others with no additional variable selection)
the Lasso selected on CV-average 13.12 out of p = 7130 genes

from a practical perspective: if you trust cross-validation, we can validate how good we are, i.e. prediction may be a black box, but we can evaluate it!
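A sketch of the validation scheme just described; the lymph node data are not reproduced here, so the hypothetical helper below would be applied to a binary response Y of your own:

```r
# Repeated 2/3-training / 1/3-test misclassification error, using a
# linear model for P[Y=1|X=x] fitted by Lasso and thresholded at 1/2.
cv_misclass <- function(X, Y, B = 50) {
  mean(replicate(B, {
    tr <- sample(nrow(X), floor(2/3 * nrow(X)))        # 2/3 training set
    cvfit <- glmnet::cv.glmnet(X[tr, ], Y[tr])         # linear model for p(x)
    phat <- predict(cvfit, X[-tr, ], s = "lambda.min")
    mean((phat > 0.5) != Y[-tr])                       # test misclassification
  }))
}
```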
and in fact, we will hear that:
◮ the Lasso is consistent for prediction assuming "essentially nothing"
◮ the Lasso is optimal for prediction assuming the "compatibility condition" for X
Estimation of regression coefficients
$Y = X\beta^0 + \varepsilon$, $p \gg n$, with fixed (deterministic) design X

problem of identifiability: for p > n, $X\beta^0 = X\theta$ for any $\theta = \beta^0 + \xi$ with ξ in the null-space of X
❀ cannot say anything about $\hat\beta - \beta^0$ without further assumptions!
❀ we will work with the compatibility assumption (see later)

and we will explain: under the compatibility condition,
$$\|\hat\beta - \beta^0\|_1 \le C\,\frac{s_0}{\phi^2}\sqrt{\log(p)/n},\qquad s_0 = |\mathrm{supp}(\beta^0)| = |\{j;\ \beta^0_j \neq 0\}|$$

⇒ let's turn to the blackboard!
various conditions and their relations (van de Geer & PB, 2009)
[Diagram: implications among the conditions — weak (S,2s)-RIP, RIP, adaptive (S,2s)- and (S,s)-restricted regression, (S,2s)- and (S,s)-restricted eigenvalue, S-compatibility, coherence, (S,s)-uniform irrepresentable, (S,2s)-irrepresentable and weak (S,2s)-irrepresentable conditions — leading to oracle inequalities for prediction and estimation and to variable selection/screening]
Variable selection
Example: motif regression for finding HIF1α transcription factor binding sites in DNA sequences
Müller, Meier, PB & Ricci (never published...)
$Y_i \in \mathbb{R}$: univariate response measuring binding intensity of HIF1α on coarse DNA segment i (from ChIP-chip experiments)
$X_i = (X_i^{(1)}, \ldots, X_i^{(p)}) \in \mathbb{R}^p$: $X_i^{(j)}$ = abundance score of candidate motif j in DNA segment i (using sequence data and computational biology algorithms, e.g. MDSCAN)
question: what is the relation between the binding intensity Y and the abundance of short candidate motifs?
❀ a linear model is often reasonable: "motif regression" (Conlon, X.S. Liu, Lieb & J.S. Liu, 2003)
$Y = X\beta + \varepsilon$, n = 287, p = 195
goal: variable selection ❀ find the relevant motifs among the p = 195 candidates
Lasso for variable selection:
$$\hat S(\lambda) = \{j;\ \hat\beta_j(\lambda) \neq 0\}\quad\text{for}\quad S_0 = \{j;\ \beta^0_j \neq 0\}$$
no significance testing involved; it's convex optimization only! (and that can be a problem... see later)
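In R, the estimated active set is just the support of the fitted coefficient vector; a sketch continuing the earlier glmnet example:

```r
# Variable selection = reading off the support of beta-hat(lambda).
bhat <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]  # drop intercept
Shat <- which(bhat != 0)                               # estimated active set
length(Shat)                                           # number of selected variables
```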
motif regression for finding HIF1α transcription factor binding sites in DNA sequences:
$Y_i \in \mathbb{R}$: univariate response measuring binding intensity on coarse DNA segment i (from ChIP-chip experiments)
$X_i^{(j)}$ = abundance score of candidate motif j in DNA segment i

variable selection in the linear model
$$Y_i = \beta_0 + \sum_{j=1}^p \beta_j X_i^{(j)} + \varepsilon_i,\quad i = 1, \ldots, n = 287,\ p = 195$$
❀ the Lasso selects 26 covariates and $R^2 \approx 50\%$, i.e. 26 interesting candidate motifs
motif regression: estimated coefficients $\hat\beta(\hat\lambda_{CV})$
[Figure: original data — estimated coefficients versus variable index (1 to 195), coefficients ranging up to ≈ 0.20]
"Theory" for variable selection with Lasso

for the (fixed design) linear model $Y = X\beta^0 + \varepsilon$ with active set $S_0 = \{j;\ \beta^0_j \neq 0\}$

two key assumptions:
1. neighborhood stability condition for the design X ⇔ irrepresentable condition for the design X
2. beta-min condition:
$$\min_{j \in S_0} |\beta^0_j| \ge C\sqrt{s_0\log(p)/n},\quad C \text{ suitably large}$$

both conditions are sufficient and "essentially" necessary for $\hat S(\lambda) = S_0$ with high probability, with $\lambda \gg \sqrt{\log(p)/n}$ (larger than for prediction)
already proved in Meinshausen & PB, 2004 (publ. 2006); and both assumptions are restrictive!
neighborhood stability condition ⇔ irrepresentable condition (Zhao & Yu, 2006)

$n^{-1}X^TX = \hat\Sigma$; the active set $S_0 = \{j;\ \beta^0_j \neq 0\} = \{1, \ldots, s_0\}$ consists of the first $s_0$ variables; partition
$$\hat\Sigma = \begin{pmatrix} \hat\Sigma_{S_0,S_0} & \hat\Sigma_{S_0,S_0^c} \\ \hat\Sigma_{S_0^c,S_0} & \hat\Sigma_{S_0^c,S_0^c} \end{pmatrix}$$

irrepresentable condition:
$$\big\|\hat\Sigma_{S_0^c,S_0}\,\hat\Sigma_{S_0,S_0}^{-1}\,\mathrm{sign}(\beta^0_1, \ldots, \beta^0_{s_0})^T\big\|_\infty < 1$$
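A sketch for checking this condition numerically for a given design matrix and active set (the function name and the convention that $S_0$ indexes columns are assumptions for illustration):

```r
# Numerically check the irrepresentable condition for design X,
# active set S0 (column indices) and the sign vector of beta^0 on S0.
irrepresentable_ok <- function(X, S0, sign_beta) {
  Sigma <- crossprod(X) / nrow(X)                   # n^{-1} X^T X
  v <- Sigma[-S0, S0, drop = FALSE] %*%
       solve(Sigma[S0, S0]) %*% sign_beta
  max(abs(v)) < 1                                   # sup-norm strictly below 1
}
# irrepresentable_ok(X, S0 = 1:5, sign_beta = rep(1, 5))
```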
not very realistic assumptions... what can we expect?

recall: under the compatibility condition,
$$\|\hat\beta - \beta^0\|_1 \le C\,\frac{s_0}{\phi^2}\sqrt{\log(p)/n}$$
consider the relevant active variables
$$S_{\mathrm{relev}} = \Big\{j;\ |\beta^0_j| > C\,\frac{s_0}{\phi^2}\sqrt{\log(p)/n}\Big\}$$
then, clearly, $\hat S \supseteq S_{\mathrm{relev}}$ with high probability (if some $j \in S_{\mathrm{relev}}$ had $\hat\beta_j = 0$, then $\|\hat\beta - \beta^0\|_1 \ge |\beta^0_j|$ would contradict the bound)
❀ screening for detecting the relevant variables is possible! without the beta-min condition, assuming the compatibility condition only
in addition, assuming the beta-min condition
$$\min_{j \in S_0} |\beta^0_j| > C\,\frac{s_0}{\phi^2}\sqrt{\log(p)/n}:$$
$\hat S \supseteq S_0$ with high probability ❀ screening for detecting the true variables
Tibshirani (1996):
LASSO = Least Absolute Shrinkage and Selection Operator
new translation: LASSO = Least Absolute Shrinkage and Screening Operator
Practical perspective

choice of λ: $\hat\lambda_{CV}$ from cross-validation
empirical and theoretical indications (Meinshausen & PB, 2006) that $\hat S(\hat\lambda_{CV}) \supseteq S_0$ (or $S_{\mathrm{relev}}$)
moreover, $|\hat S(\hat\lambda_{CV})| \le \min(n, p)$ $(= n$ if $p \gg n)$
❀ huge dimensionality reduction (in terms of the original covariates)
motif regression: estimated coefficients $\hat\beta(\hat\lambda_{CV})$
[Figure: original data — estimated coefficients versus variable index, as above]
which variables in $\hat S$ are false positives? (p-values would be very useful!)
recall: $\hat S(\hat\lambda_{CV}) \supseteq S_0$ (or $S_{\mathrm{relev}}$), and we would then use a second stage to reduce the number of false positive selections
❀ re-estimation on a much smaller model with the variables from $\hat S$, e.g. (a sketch of the first option follows this list):
◮ OLS on $\hat S$ with e.g. BIC variable selection
◮ thresholding coefficients and OLS re-estimation
◮ adaptive Lasso (Zou, 2006)
◮ ...
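A sketch of the simplest second stage, OLS re-estimation on the selected variables (continuing the earlier sketches, where Shat is the Lasso-selected set):

```r
# Second stage: low-dimensional OLS on the Lasso-selected variables.
refit <- lm(Y ~ X[, Shat, drop = FALSE])
summary(refit)$coefficients   # coefficients without l1-shrinkage bias
# (note: classical p-values from this re-fit ignore the selection step)
```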
Adaptive Lasso (Zou, 2006)
re-weighting the penalty function:
$$\hat\beta = \arg\min_\beta\Big(\|Y - X\beta\|_2^2/n + \lambda\sum_{j=1}^p \frac{|\beta_j|}{|\hat\beta_{\mathrm{init},j}|}\Big),$$
with $\hat\beta_{\mathrm{init},j}$ from the Lasso in a first stage (or from OLS if p < n) (Zou, 2006)

for orthogonal design, if $\hat\beta_{\mathrm{init}} =$ OLS: Adaptive Lasso = NN-garrote ❀ less bias than the Lasso
[Figure: threshold functions g(z) — Adaptive Lasso, Hard-thresholding, Soft-thresholding, as above]
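A sketch of the two-stage procedure via glmnet's penalty.factor argument (per-coefficient penalty weights), continuing the simulated example from earlier; the weight floor 1e-10 is an illustrative safeguard:

```r
# Adaptive Lasso: re-weight the l1-penalty by 1/|beta_init,j|.
binit <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]  # first-stage Lasso
w <- 1 / pmax(abs(binit), 1e-10)  # huge weight effectively excludes zeros
afit <- cv.glmnet(X, Y, penalty.factor = w)
badapt <- as.numeric(coef(afit, s = "lambda.min"))[-1]
sum(badapt != 0)                  # typically sparser than the first stage
```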
motif regression:
[Figure: estimated coefficients versus variable index — Lasso (left) and Adaptive Lasso (right)]
the Lasso selects 26 variables; the Adaptive Lasso selects 16 variables
KKT conditions and Computation
characterization of solution(s) $\hat\beta$ as minimizer of the criterion function
$$Q_\lambda(\beta) = \|Y - X\beta\|_2^2/n + \lambda\|\beta\|_1$$
since $Q_\lambda(\cdot)$ is a convex function: it is necessary and sufficient that the subdifferential of $Q_\lambda(\beta)$ at $\hat\beta$ contains the zero element

Lemma (first part)
Denote by $G(\beta) = -2X^T(Y - X\beta)/n$ the gradient vector of $\|Y - X\beta\|_2^2/n$. Then $\hat\beta$ is a solution if and only if
$$G_j(\hat\beta) = -\mathrm{sign}(\hat\beta_j)\,\lambda \ \text{ if } \hat\beta_j \neq 0,\qquad |G_j(\hat\beta)| \le \lambda \ \text{ if } \hat\beta_j = 0$$
Lemma (second part)
If the solution of $\arg\min_\beta Q_\lambda(\beta)$ is not unique (e.g. if p > n), and if $|G_j(\hat\beta)| < \lambda$ for some solution $\hat\beta$, then $\hat\beta_j = 0$ for all (other) solutions in $\arg\min_\beta Q_\lambda(\beta)$. The zeroes are "essentially" unique ("essentially" refers to the situation $\hat\beta_j = 0$ and $|G_j(\hat\beta)| = \lambda$).
Proof: Exercise (optional), or see the book by Bühlmann and van de Geer (2011; Lemma 2.1).
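A minimal sketch that checks the Lemma's conditions numerically for a candidate solution (the function name and tolerance are illustrative):

```r
# Check the KKT conditions of Q_lambda at a candidate bhat.
kkt_ok <- function(X, Y, bhat, lambda, tol = 1e-6) {
  G <- as.numeric(-2 * crossprod(X, Y - X %*% bhat) / nrow(X))  # gradient
  act <- abs(bhat) > tol
  all(abs(G[act] + sign(bhat[act]) * lambda) < tol) &&  # G_j = -sign(b_j)*lambda
    all(abs(G[!act]) <= lambda + tol)                   # |G_j| <= lambda if b_j = 0
}
```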
Coordinate descent algorithm for computation

general idea: compute a solution $\hat\beta(\lambda_{\mathrm{grid},k})$ and use it as a starting value for the computation of $\hat\beta(\lambda_{\mathrm{grid},k-1})$, where $\lambda_{\mathrm{grid},k-1} < \lambda_{\mathrm{grid},k}$

Let $\beta^{(0)} \in \mathbb{R}^p$ be an initial parameter vector; set m = 0.
REPEAT:
increase m by one: m ← m + 1
for j = 1, ..., p: if $|G_j(\beta^{(m-1)}_{-j})| \le \lambda$, set $\beta^{(m)}_j = 0$; otherwise $\beta^{(m)}_j = \arg\min_{\beta_j} Q_\lambda(\beta^{(m-1)}_{+j})$
UNTIL numerical convergence

here $\beta_{-j}$ denotes the parameter vector with the jth component set to zero, and $\beta^{(m-1)}_{+j}$ the parameter vector which equals $\beta^{(m-1)}$ except that the jth component equals $\beta_j$
for squared error loss: explicit updating formulae (Exercise):
$$G_j(\beta) = -2X_j^T(Y - X\beta)/n,$$
$$\beta^{(m)}_j = \frac{\mathrm{sign}(Z_j)\,(|Z_j| - \lambda/2)_+}{\hat\Sigma_{jj}},\quad Z_j = X_j^T(Y - X\beta_{-j})/n,\quad \hat\Sigma = n^{-1}X^TX$$
❀ componentwise soft-thresholding

this is very fast if the true problem is sparse
active set strategy: can do non-systematic cycling, visiting mainly the active (non-zero) components
riboflavin example, n = 71, p = 4088: 0.33 secs. CPU using the glmnet package in R (Friedman, Hastie & Tibshirani, 2008)
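A plain-R sketch of the algorithm with the explicit soft-thresholding update above (systematic cycling, no active-set strategy or warm starts, so far slower than glmnet):

```r
# Coordinate descent for the Lasso with the updates displayed above.
lasso_cd <- function(X, Y, lambda, maxit = 500, tol = 1e-8) {
  n <- nrow(X); p <- ncol(X)
  Sjj <- colSums(X^2) / n              # diagonal of Sigma-hat
  beta <- rep(0, p)
  for (m in seq_len(maxit)) {
    beta_old <- beta
    for (j in seq_len(p)) {
      # Z_j = X_j^T (Y - X beta_{-j}) / n, with beta_{-j} zeroing component j
      Zj <- sum(X[, j] * (Y - X[, -j, drop = FALSE] %*% beta[-j])) / n
      beta[j] <- sign(Zj) * max(abs(Zj) - lambda / 2, 0) / Sjj[j]
    }
    if (max(abs(beta - beta_old)) < tol) break   # numerical convergence
  }
  beta
}
```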
the coordinate descent algorithm converges to a stationary point (Paul Tseng, ≈ 2000)
❀ convergence to a global optimum, due to the convexity of the problem

main assumption: objective function = smooth function + separable penalty
here, "separable" means "additive", i.e., $\mathrm{pen}(\beta) = \sum_{j=1}^p p_j(\beta_j)$
failure of the coordinate descent algorithm: Fused Lasso
$$\hat\beta = \arg\min_\beta\Big(\|Y - X\beta\|_2^2/n + \lambda_1\sum_{j=2}^p |\beta_j - \beta_{j-1}| + \lambda_2\|\beta\|_1\Big)$$
but $\sum_{j=2}^p |\beta_j - \beta_{j-1}|$ is non-separable
contour lines of penalties for p = 2
[Figure: contour plots of the fused penalty $|\beta_1 - \beta_2|$ (left) and the ℓ1-penalty $|\beta_1| + |\beta_2|$ (right) in the $(\beta_1, \beta_2)$-plane, levels 0.5 to 3.5]