

slide-1
SLIDE 1

Statistics for High-Dimensional Data: Selected Topics

Peter Bühlmann

Seminar für Statistik, ETH Zürich

June 2014

slide-2
SLIDE 2

High-dimensional linear model: basic concepts in methodology, theory and computation

slide-3
SLIDE 3

High-dimensional data

Riboflavin production with Bacillus Subtilis

(in collaboration with DSM (Switzerland))

goal: improve the riboflavin production rate of Bacillus subtilis using clever genetic engineering

response variable Y ∈ R: riboflavin (log-) production rate
covariates X ∈ R^p: expressions from p = 4088 genes
sample size n = 115, p ≫ n

gene expression data: Y versus 9 "reasonable" genes

[Figure: nine scatterplots of the riboflavin (log-) production rate y versus the expression xsel of 9 "reasonable" genes]

slide-4
SLIDE 4

general framework: Z_1, ..., Z_n (with some "i.i.d. components"), dim(Z_i) ≫ n

for example:

◮ Z_i = (X_i, Y_i), X_i ∈ X ⊆ R^p, Y_i ∈ Y ⊆ R: regression with p ≫ n
◮ Z_i = (X_i, Y_i), X_i ∈ X ⊆ R^p, Y_i ∈ {0, 1}: classification with p ≫ n

numerous applications: biology, imaging, economics, environmental sciences, ...

slide-5
SLIDE 5

High-dimensional linear models

Y_i = ∑_{j=1}^p β⁰_j X_i^(j) + ε_i,  i = 1, ..., n,  p ≫ n

in short: Y = Xβ⁰ + ε

goals:

◮ prediction, e.g. w.r.t. squared prediction error
◮ estimation of β⁰, e.g. w.r.t. ‖β̂ − β⁰‖_q (q = 1, 2)
◮ variable selection, i.e. estimating the active set of the effective variables (those with coefficient ≠ 0)
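To make the setting concrete, here is a minimal R sketch that simulates such a sparse high-dimensional linear model with p ≫ n; the dimensions and signal strength are made up for illustration (this is not the riboflavin data), and the later sketches reuse this kind of setup so that each stays self-contained.

```r
## Toy sparse high-dimensional linear model Y = X beta0 + eps (hypothetical sizes)
set.seed(1)
n <- 100; p <- 500; s0 <- 5                # p >> n, s0 active variables
X <- matrix(rnorm(n * p), n, p)
beta0 <- c(rep(2, s0), rep(0, p - s0))     # sparse true coefficient vector
Y <- as.numeric(X %*% beta0 + rnorm(n))
```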

slide-6
SLIDE 6

we need to regularize... and there are many proposals

◮ Bayesian methods for regularization
◮ greedy algorithms: aka forward selection or boosting
◮ preliminary dimension reduction
◮ ...

e.g. 4’160’000 entries on Google Scholar for “high dimensional linear model” ...


slide-9
SLIDE 9

Penalty-based methods

if the true β⁰ is sparse w.r.t.

◮ ‖β⁰‖_0 = number of non-zero coefficients

❀ regularize with the ‖·‖_0-penalty: argmin_β( n⁻¹‖Y − Xβ‖²_2 + λ‖β‖_0 ), e.g. AIC, BIC

❀ computationally infeasible if p is large (2^p sub-models)

◮ ‖β⁰‖_1 = ∑_{j=1}^p |β⁰_j|

❀ penalize with the ‖·‖_1-norm, i.e. Lasso: argmin_β( n⁻¹‖Y − Xβ‖²_2 + λ‖β‖_1 )

❀ convex optimization: computationally feasible and very fast for large p

slide-10
SLIDE 10

as we will see: regularization with the ℓ1-norm is good ("nearly optimal") even if the truth is sparse w.r.t. ‖β⁰‖_0
❀ we can exploit the computational advantage of ‖·‖_1-norm regularization even if the problem has a different sparsity structure

slide-11
SLIDE 11

The Lasso (Tibshirani, 1996)

Lasso for linear models:

β̂(λ) = argmin_β( n⁻¹‖Y − Xβ‖²_2 + λ‖β‖_1 ),  λ ≥ 0,  ‖β‖_1 = ∑_{j=1}^p |β_j|

❀ convex optimization problem

◮ Lasso does variable selection: some of the β̂_j(λ) = 0 (because of the "ℓ1-geometry")

◮ β̂(λ) is a shrunken LS-estimate
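As a minimal illustration (not the presenter's code), the Lasso path can be computed with the glmnet package on data simulated as above; note that glmnet's λ is parametrized slightly differently from the slides (a factor 1/(2n) in front of the residual sum of squares, plus internal standardization), so λ values are not directly comparable.

```r
## Sketch: Lasso coefficient path with glmnet on simulated data
library(glmnet)
set.seed(1)
n <- 100; p <- 500
X <- matrix(rnorm(n * p), n, p)
Y <- as.numeric(X %*% c(rep(2, 5), rep(0, p - 5)) + rnorm(n))

fit <- glmnet(X, Y, alpha = 1)       # alpha = 1: pure ell_1 penalty (the Lasso)
plot(fit, xvar = "lambda")           # shrunken coefficients; many are exactly 0
coef(fit, s = 0.5)[1:10, ]           # coefficients at lambda = 0.5 (incl. intercept)
```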

slide-12
SLIDE 12

more about the "ℓ1-geometry": equivalence to the primal problem

β̂_primal(R) = argmin_{β: ‖β‖_1 ≤ R} ‖Y − Xβ‖²_2/n,

with a correspondence between λ and R which depends on the data (X_1, Y_1), ..., (X_n, Y_n)

[such an equivalence holds since

◮ ‖Y − Xβ‖²_2/n is convex in β
◮ the constraint ‖β‖_1 ≤ R is convex

see e.g. Bertsekas (1995)]

slide-13
SLIDE 13

p = 2, left: ℓ1-"world"
the residual sum of squares reaches its minimal value (for certain constellations of the data) where its contour lines hit the ℓ1-ball in a corner ❀ β̂_1 = 0

slide-14
SLIDE 14

the ℓ2-"world" is different: Ridge regression,

β̂_Ridge(λ) = argmin_β( ‖Y − Xβ‖²_2/n + λ‖β‖²_2 ),

has the equivalent primal solution

β̂_Ridge;primal(R) = argmin_{β: ‖β‖_2 ≤ R} ‖Y − Xβ‖²_2/n,

with a one-to-one correspondence between λ and R

slide-15
SLIDE 15

A note on the Bayesian approach

model: β_1, ..., β_p i.i.d. ∼ p(β)dβ; given β: Y ∼ N_n(Xβ, σ²I_n) with density f(y|β, σ²)

posterior density:

p(β|Y, σ²) = f(Y|β, σ²)p(β) / ∫ f(Y|β, σ²)p(β)dβ ∝ f(Y|β, σ²)p(β)

and hence for the MAP (Maximum A-Posteriori) estimator:

β̂_MAP = argmax_β p(β|Y, σ²) = argmin_β −log( f(Y|β, σ²)p(β) )
       = argmin_β ( (2σ²)⁻¹‖Y − Xβ‖²_2 − ∑_{j=1}^p log p(β_j) )

slide-16
SLIDE 16

examples:

1. Double-Exponential prior DExp(τ): p(β) = (τ/2) exp(−τ|β|)

❀ β̂_MAP equals the Lasso with penalty parameter λ = n⁻¹2σ²τ

2. Gaussian prior N(0, τ²): p(β) = (√(2π)τ)⁻¹ exp(−β²/(2τ²))

❀ β̂_MAP equals the Ridge estimator with penalty parameter λ = n⁻¹σ²/τ²

but we will argue that the Lasso (i.e., the MAP estimator under the Double-Exponential prior) is also good if the truth is sparse with respect to ‖β⁰‖_0, e.g. if the prior is (much) more spiky around zero than the Double-Exponential distribution

slide-17
SLIDE 17

Orthonormal design: Y = Xβ + ε, n⁻¹XᵀX = I

Lasso = soft-thresholding estimator:

β̂_j(λ) = sign(Z_j)(|Z_j| − λ/2)_+,  Z_j = (n⁻¹XᵀY)_j (= OLS),

i.e. β̂_j(λ) = g_soft(Z_j)  [Exercise]

[Figure: threshold functions g(z) for Adaptive Lasso, Hard-thresholding and Soft-thresholding]
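The soft-thresholding map from the exercise can be written down directly; this is a small sketch, not from the slides.

```r
## Soft-thresholding g_soft(z) = sign(z) * (|z| - lambda/2)_+ (cf. the Exercise)
g_soft <- function(z, lambda) sign(z) * pmax(abs(z) - lambda / 2, 0)

z <- seq(-3, 3, length.out = 201)
plot(z, g_soft(z, lambda = 1), type = "l",
     xlab = "z", ylab = "threshold function")   # soft-thresholding at lambda = 1
abline(0, 1, lty = 2)                           # identity (= OLS) for comparison
```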

slide-18
SLIDE 18

Prediction

goal: predict a new observation Y_new

consider the expected (w.r.t. new data; and random X) squared error loss:

E_{X_new, Y_new}[(Y_new − X_new β̂)²] = σ² + E_{X_new}[(X_new(β⁰ − β̂))²] = σ² + (β̂ − β⁰)ᵀ Σ (β̂ − β⁰),  Σ = Cov(X)

❀ terminology "prediction error":

for random design X: (β̂ − β⁰)ᵀ Σ (β̂ − β⁰) = E_{X_new}[(X_new(β̂ − β⁰))²]
for fixed design X: (β̂ − β⁰)ᵀ Σ̂ (β̂ − β⁰) = ‖X(β̂ − β⁰)‖²_2/n
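For the fixed-design case the "prediction error" ‖X(β̂ − β⁰)‖²_2/n can be computed directly once some estimate β̂ is available; a small sketch on simulated data, with β̂ taken from a glmnet fit at an arbitrarily chosen λ.

```r
## Sketch: fixed-design prediction error ||X (betahat - beta0)||_2^2 / n
library(glmnet)
set.seed(1)
n <- 100; p <- 500
beta0 <- c(rep(2, 5), rep(0, p - 5))
X <- matrix(rnorm(n * p), n, p)
Y <- as.numeric(X %*% beta0 + rnorm(n))

betahat <- as.numeric(coef(glmnet(X, Y), s = 0.2))[-1]   # drop the intercept
sum((X %*% (betahat - beta0))^2) / n                     # prediction error
```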

slide-19
SLIDE 19

binary lymph node classification using gene expressions: a high noise problem
n = 49 samples, p = 7130 gene expressions

despite that it is classification: P[Y = 1|X = x] = E[Y|X = x] ❀ estimate p̂(x) via a linear model; one can then classify

cross-validated misclassification error (2/3 training; 1/3 test):

Lasso   L2Boosting   FPLR     Pelora   1-NN     DLDA     SVM
21.1%   17.7%        35.25%   27.8%    43.25%   36.12%   36.88%

(some of the methods use preliminary variable selection of the best 200 genes by a Wilcoxon test, the others no additional variable selection)

Lasso selected on CV-average 13.12 out of p = 7130 genes

from a practical perspective: if you trust in cross-validation, we can validate how good we are, i.e. prediction may be a black box, but we can evaluate it!
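The lymph node data are not reproduced here, but the approach on the slide (fit E[Y|X = x] with a Lasso linear model and classify by thresholding p̂(x) at 1/2, with a 2/3 vs 1/3 split) can be sketched on simulated binary data; all numbers below are made up for illustration.

```r
## Sketch: linear-model Lasso for a binary response, classification by thresholding
library(glmnet)
set.seed(1)
n <- 60; p <- 2000
X <- matrix(rnorm(n * p), n, p)
Y <- rbinom(n, 1, plogis(X[, 1] - X[, 2]))     # only two informative covariates

test  <- sample(n, n %/% 3)                    # 2/3 training, 1/3 test
cvfit <- cv.glmnet(X[-test, ], Y[-test])       # squared-error loss on the 0/1 response
phat  <- predict(cvfit, X[test, ], s = "lambda.min")
mean((phat > 0.5) != Y[test])                  # test-set misclassification error
```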


slide-21
SLIDE 21

and in fact: we will hear that

◮ Lasso is consistent for prediction assuming "essentially nothing"

◮ Lasso is optimal for prediction assuming the "compatibility condition" for X

slide-22
SLIDE 22

Estimation of regression coefficients

Y = Xβ⁰ + ε, p ≫ n, with fixed (deterministic) design X

problem of identifiability: for p > n, Xβ⁰ = Xθ for any θ = β⁰ + ξ with ξ in the null space of X
❀ we cannot say anything about β̂ − β⁰ without further assumptions!
❀ we will work with the compatibility assumption (see later) and we will explain: under the compatibility condition,

‖β̂ − β⁰‖_1 ≤ C (s_0/φ²) √(log(p)/n),  s_0 = |supp(β⁰)| = |{j; β⁰_j ≠ 0}|

slide-23
SLIDE 23

⇒ let's turn to the blackboard!

slide-24
SLIDE 24

various conditions and their relations (van de Geer & PB, 2009)

[Diagram: implication graph relating the weak (S,2s)-RIP, RIP, adaptive (S,2s)- and (S,s)-restricted regression, (S,2s)- and (S,s)-restricted eigenvalue, S-compatibility, coherence, (S,s)-uniform irrepresentable, (S,2s)-irrepresentable and weak (S,2s)-irrepresentable conditions, and indicating which of them imply oracle inequalities for prediction and estimation]

slide-25
SLIDE 25

Variable selection

Example: Motif regression for finding HIF1α transcription factor binding sites in DNA sequences

Müller, Meier, PB & Ricci (never published...)

Y_i ∈ R: univariate response measuring the binding intensity of HIF1α on coarse DNA segment i (from ChIP-chip experiments)

X_i = (X_i^(1), ..., X_i^(p)) ∈ R^p: X_i^(j) = abundance score of candidate motif j in DNA segment i (using sequence data and computational biology algorithms, e.g. MDSCAN)

slide-26
SLIDE 26

question: what is the relation between the binding intensity Y and the abundance of short candidate motifs?
❀ a linear model is often reasonable: "motif regression" (Conlon, X.S. Liu, Lieb & J.S. Liu, 2003)

Y = Xβ + ε,  n = 287,  p = 195

goal: variable selection ❀ find the relevant motifs among the p = 195 candidates

slide-27
SLIDE 27

Lasso for variable selection:

Ŝ(λ) = {j; β̂_j(λ) ≠ 0}  as an estimate of  S_0 = {j; β⁰_j ≠ 0}

no significance testing involved; it's convex optimization only! (and that can be a problem... see later)
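In code, Ŝ(λ) is simply the set of non-zero coefficients of a fit, with no testing involved; a minimal sketch on simulated data:

```r
## Sketch: Shat(lambda) = { j : betahat_j(lambda) != 0 }
library(glmnet)
set.seed(1)
n <- 100; p <- 200
X <- matrix(rnorm(n * p), n, p)
Y <- as.numeric(X %*% c(rep(1, 5), rep(0, p - 5)) + rnorm(n))

fit  <- glmnet(X, Y)
bhat <- as.numeric(coef(fit, s = 0.1))[-1]   # coefficients at lambda = 0.1, no intercept
Shat <- which(bhat != 0)                     # selected set, from convex optimization only
Shat
```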

slide-28
SLIDE 28

Motif regression for finding HIF1α transcription factor binding sites in DNA sequences

Y_i ∈ R: univariate response measuring binding intensity on coarse DNA segment i (from ChIP-chip experiments)
X_i^(j) = abundance score of candidate motif j in DNA segment i

variable selection in the linear model

Y_i = β_0 + ∑_{j=1}^p β_j X_i^(j) + ε_i,  i = 1, ..., n = 287,  p = 195

❀ Lasso selects 26 covariates and R² ≈ 50%, i.e. 26 interesting candidate motifs

slide-29
SLIDE 29

motif regression: estimated coefficients β̂(λ̂_CV)

[Figure: estimated coefficients (about 0 to 0.20) versus variable index 1 to 195, original data]

slide-30
SLIDE 30

"Theory" for variable selection with Lasso

for the (fixed design) linear model Y = Xβ⁰ + ε with active set S_0 = {j; β⁰_j ≠ 0}

two key assumptions:

1. neighborhood stability condition for the design X ⇔ irrepresentable condition for the design X

2. beta-min condition:

min_{j∈S_0} |β⁰_j| ≥ C √(s_0 log(p)/n),  C suitably large

both conditions are sufficient and "essentially" necessary for Ŝ(λ) = S_0 with high probability, with λ ≫ √(log(p)/n) (larger than for prediction)

already proved in Meinshausen & PB, 2004 (publ: 2006)

and both assumptions are restrictive!


slide-32
SLIDE 32

neighborhood stability condition ⇔ irrepresentable condition (Zhao & Yu, 2006)

n⁻¹XᵀX = Σ̂; the active set S_0 = {j; β⁰_j ≠ 0} = {1, ..., s_0} consists of the first s_0 variables; partition

Σ̂ = [ Σ̂_{S_0,S_0}   Σ̂_{S_0,S_0^c} ;  Σ̂_{S_0^c,S_0}   Σ̂_{S_0^c,S_0^c} ]

irrepresentable condition:

‖Σ̂_{S_0^c,S_0} Σ̂_{S_0,S_0}⁻¹ sign(β⁰_1, ..., β⁰_{s_0})ᵀ‖_∞ < 1
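The irrepresentable condition can be checked numerically for a given design matrix and sign pattern; a small sketch with a Gaussian design and an assumed all-positive sign vector, both chosen only for illustration.

```r
## Sketch: checking the irrepresentable condition for a design X and active set S0
set.seed(1)
n <- 200; p <- 50; s0 <- 5
X <- matrix(rnorm(n * p), n, p)
Sigma_hat <- crossprod(X) / n            # hat(Sigma) = X^T X / n
S0  <- 1:s0                              # active set: first s0 variables
sgn <- rep(1, s0)                        # assumed sign(beta0_1, ..., beta0_s0)

irrep <- Sigma_hat[-S0, S0] %*% solve(Sigma_hat[S0, S0]) %*% sgn
max(abs(irrep))                          # irrepresentable condition holds if < 1
```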

slide-33
SLIDE 33

not very realistic assumptions... what can we expect?

recall: under the compatibility condition, ‖β̂ − β⁰‖_1 ≤ C (s_0/φ²) √(log(p)/n)

consider the relevant active variables

S_relev = {j; |β⁰_j| > C (s_0/φ²) √(log(p)/n)}

then, clearly, Ŝ ⊇ S_relev with high probability

screening for detecting the relevant variables is possible! without the beta-min condition, assuming the compatibility condition only

slide-34
SLIDE 34

in addition: assuming the beta-min condition

min_{j∈S_0} |β⁰_j| > C (s_0/φ²) √(log(p)/n),

we get Ŝ ⊇ S_0 with high probability: screening for detecting the true variables

slide-35
SLIDE 35

Tibshirani (1996):

LASSO = Least Absolute Shrinkage and Selection Operator

new translation: LASSO = Least Absolute Shrinkage and Screening Operator

slide-36
SLIDE 36

Practical perspective

choice of λ: λ̂_CV from cross-validation

empirical and theoretical indications (Meinshausen & PB, 2006) that Ŝ(λ̂_CV) ⊇ S_0 (or S_relev)

moreover |Ŝ(λ̂_CV)| ≤ min(n, p) (= n if p ≫ n)
❀ huge dimensionality reduction (in the original covariates)
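In practice λ̂_CV is typically obtained with cv.glmnet; a minimal sketch on simulated data with the package's default 10-fold cross-validation.

```r
## Sketch: lambda_hat_CV and the size of Shat(lambda_hat_CV)
library(glmnet)
set.seed(1)
n <- 100; p <- 500
X <- matrix(rnorm(n * p), n, p)
Y <- as.numeric(X %*% c(rep(2, 5), rep(0, p - 5)) + rnorm(n))

cvfit <- cv.glmnet(X, Y)                    # 10-fold CV over a lambda grid
cvfit$lambda.min                            # lambda_hat_CV
Shat <- which(as.numeric(coef(cvfit, s = "lambda.min"))[-1] != 0)
length(Shat)                                # |Shat(lambda_hat_CV)| <= min(n, p)
```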

slide-37
SLIDE 37

motif regression: estimated coefficients β̂(λ̂_CV)

[Figure: estimated coefficients versus variable index, original data]

which variables in Ŝ are false positives? (p-values would be very useful!)

slide-38
SLIDE 38

recall: Ŝ(λ̂_CV) ⊇ S_0 (or S_relev), and we would then use a second stage to reduce the number of false positive selections

❀ re-estimation on a much smaller model with the variables from Ŝ, for example (a small sketch of the first option follows this list):

◮ OLS on Ŝ with e.g. BIC variable selection
◮ thresholding coefficients and OLS re-estimation
◮ adaptive Lasso (Zou, 2006)
◮ ...
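One possible second stage, OLS restricted to the Lasso-selected variables, can be sketched as follows on simulated data; the BIC-based pruning of the refit mentioned on the slide is omitted here.

```r
## Sketch: OLS re-estimation on the Lasso-selected set Shat
library(glmnet)
set.seed(1)
n <- 100; p <- 500
X <- matrix(rnorm(n * p), n, p)
Y <- as.numeric(X %*% c(rep(2, 5), rep(0, p - 5)) + rnorm(n))

cvfit <- cv.glmnet(X, Y)
Shat  <- which(as.numeric(coef(cvfit, s = "lambda.min"))[-1] != 0)

refit <- lm(Y ~ X[, Shat])       # low-dimensional OLS on the screened variables
summary(refit)                   # could be followed by BIC variable selection
```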


slide-40
SLIDE 40

Adaptive Lasso (Zou, 2006)

re-weighting the penalty function:

β̂ = argmin_β( ‖Y − Xβ‖²_2/n + λ ∑_{j=1}^p |β_j|/|β̂_init,j| ),

with β̂_init,j from the Lasso in a first stage (or from OLS if p < n)

Zou (2006): for orthogonal design, if β̂_init = OLS: Adaptive Lasso = NN-garrote
❀ less bias than Lasso

[Figure: threshold functions for Adaptive Lasso, Hard-thresholding and Soft-thresholding]
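A common way to compute the adaptive Lasso is to pass the weights 1/|β̂_init,j| to glmnet's penalty.factor argument; this is a sketch of that two-stage recipe on simulated data, with a small cap on the weights to avoid division by zero.

```r
## Sketch: two-stage adaptive Lasso via glmnet's penalty.factor
library(glmnet)
set.seed(1)
n <- 100; p <- 500
X <- matrix(rnorm(n * p), n, p)
Y <- as.numeric(X %*% c(rep(2, 5), rep(0, p - 5)) + rnorm(n))

init  <- cv.glmnet(X, Y)                                   # first-stage Lasso
binit <- abs(as.numeric(coef(init, s = "lambda.min"))[-1])
w     <- 1 / pmax(binit, 1e-6)                             # weights 1/|betahat_init,j|

adafit <- cv.glmnet(X, Y, penalty.factor = w)              # re-weighted ell_1 penalty
sum(as.numeric(coef(adafit, s = "lambda.min"))[-1] != 0)   # usually fewer selected variables
```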

slide-41
SLIDE 41

motif regression

[Figure: estimated coefficients versus variable index, Lasso (left) and Adaptive Lasso (right)]

Lasso selects 26 variables; Adaptive Lasso selects 16 variables

slide-42
SLIDE 42

KKT conditions and Computation

characterization of the solution(s) β̂ as minimizer(s) of the criterion function

Q_λ(β) = ‖Y − Xβ‖²_2/n + λ‖β‖_1

since Q_λ(·) is a convex function: it is necessary and sufficient that the subdifferential of Q_λ(β) at β̂ contains the zero element

Lemma (first part)
Denote by G(β) = −2Xᵀ(Y − Xβ)/n the gradient vector of ‖Y − Xβ‖²_2/n.
Then: β̂ is a solution if and only if G_j(β̂) = −sign(β̂_j)λ if β̂_j ≠ 0, and |G_j(β̂)| ≤ λ if β̂_j = 0

slide-43
SLIDE 43

Lemma (second part)
If the solution of argmin_β Q_λ(β) is not unique (e.g. if p > n), and if |G_j(β̂)| < λ for some solution β̂, then β̂_j = 0 for all (other) solutions β̂ in argmin_β Q_λ(β). The zeroes are "essentially" unique ("essentially" refers to the situation β̂_j = 0 and |G_j(β̂)| = λ).
Proof: Exercise (optional), or see the book by Bühlmann and van de Geer (2011; Lemma 2.1)
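The KKT characterization can be verified numerically in a case where the Lasso solution is available in closed form, namely for an orthonormal design as on the earlier soft-thresholding slide; this keeps the sketch free of any solver-specific λ scaling.

```r
## Sketch: checking the KKT conditions where the Lasso solution is explicit
## (orthonormal design: Lasso = soft-thresholding of the componentwise OLS)
set.seed(1)
n <- 100; p <- 20
X <- sqrt(n) * qr.Q(qr(matrix(rnorm(n * p), n, p)))   # columns with n^{-1} X^T X = I
Y <- as.numeric(X %*% c(rep(1, 3), rep(0, p - 3)) + rnorm(n))

lambda <- 0.8
Z    <- as.numeric(t(X) %*% Y) / n                    # componentwise OLS
bhat <- sign(Z) * pmax(abs(Z) - lambda / 2, 0)        # Lasso solution

G <- as.numeric(-2 * t(X) %*% (Y - X %*% bhat) / n)   # gradient of ||Y - X beta||^2 / n
max(abs(G[bhat == 0]))                                # <= lambda for zero coefficients
range(G[bhat != 0] + sign(bhat[bhat != 0]) * lambda)  # approximately 0 for active ones
```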

slide-44
SLIDE 44

Coordinate descent algorithm for computation

general idea: compute a solution β̂(λ_{grid,k}) and use it as a starting value for the computation of β̂(λ_{grid,k−1}), where λ_{grid,k−1} < λ_{grid,k}

Let β^(0) ∈ R^p be an initial parameter vector; set m = 0.
REPEAT:
  increase m by one: m ← m + 1
  for j = 1, ..., p:
    if |G_j(β^(m−1)_{−j})| ≤ λ: set β^(m)_j = 0,
    otherwise: β^(m)_j = argmin_{β_j} Q_λ(β^(m−1)_{+j})
UNTIL numerical convergence

here β_{−j} denotes the parameter vector with its jth component set to zero, and β^(m−1)_{+j} equals β^(m−1) except for the jth component, which equals β_j

slide-45
SLIDE 45

for squared error loss: explicit updating formulae (Exercise)

G_j(β) = −2X_jᵀ(Y − Xβ)/n

β^(m)_j = sign(Z_j)(|Z_j| − λ/2)_+ / Σ̂_jj,  Z_j = X_jᵀ(Y − Xβ_{−j})/n,  Σ̂ = n⁻¹XᵀX

❀ componentwise soft-thresholding

this is very fast if the true problem is sparse

active set strategy: one can cycle non-systematically, visiting mainly the active (non-zero) components

riboflavin example, n = 71, p = 4088: 0.33 secs CPU using the glmnet package in R (Friedman, Hastie & Tibshirani, 2008)
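A direct (unoptimized) implementation of these updates is sketched below; it uses a fixed number of sweeps instead of a convergence check and no active set strategy, so it is far slower than glmnet, but it follows the formulae on the slide.

```r
## Sketch: coordinate descent for the Lasso with the explicit soft-thresholding update
set.seed(1)
n <- 100; p <- 200
X <- matrix(rnorm(n * p), n, p)
Y <- as.numeric(X %*% c(rep(2, 5), rep(0, p - 5)) + rnorm(n))

lasso_cd <- function(X, Y, lambda, n_sweeps = 50) {
  n <- nrow(X); p <- ncol(X)
  Sigma_jj <- colSums(X^2) / n                 # diagonal of hat(Sigma) = X^T X / n
  beta <- rep(0, p)                            # initial parameter vector beta^(0)
  for (m in seq_len(n_sweeps)) {
    for (j in seq_len(p)) {
      r_j <- Y - X[, -j, drop = FALSE] %*% beta[-j]   # residual using beta_{-j}
      Z_j <- sum(X[, j] * r_j) / n
      beta[j] <- sign(Z_j) * max(abs(Z_j) - lambda / 2, 0) / Sigma_jj[j]
    }
  }
  beta
}

bhat <- lasso_cd(X, Y, lambda = 0.5)
which(bhat != 0)                               # active set of the computed solution
```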

slide-46
SLIDE 46

the coordinate descent algorithm converges to a stationary point (Paul Tseng, ≈ 2000)
❀ convergence to a global optimum, due to the convexity of the problem

main assumption: objective function = smooth function + separable penalty

here "separable" means "additive", i.e. pen(β) = ∑_{j=1}^p p_j(β_j)

slide-47
SLIDE 47

failure of the coordinate descent algorithm: Fused Lasso

β̂ = argmin_β( ‖Y − Xβ‖²_2/n + λ ∑_{j=2}^p |β_j − β_{j−1}| + λ_2 ‖β‖_1 )

but ∑_{j=2}^p |β_j − β_{j−1}| is non-separable

[Figure: contour lines of the penalties for p = 2, |β_1 − β_2| (left) and |β_1| + |β_2| (right)]

slide-48
SLIDE 48

Thank you!

References: most of the material is covered in Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methodology, Theory and Applications. Springer.