SLIDE 1 High-dimensional statistics: Some progress and challenges ahead
Martin Wainwright
UC Berkeley, Departments of Statistics and EECS
University College London, Master Class: Lecture 2
Joint work with: Alekh Agarwal, Arash Amini, Po-Ling Loh, Sahand Negahban, Garvesh Raskutti, Pradeep Ravikumar, Bin Yu
SLIDES 2-3 High-level overview
Last lecture: least-squares loss and ℓ1-regularization. The big picture: lots of other estimators have the same basic form:
$$\widehat{\theta} \in \arg\min_{\theta \in \Omega} \Big\{ \underbrace{\mathcal{L}(\theta; Z_1^n)}_{\text{Loss function}} + \lambda_n \underbrace{\mathcal{R}(\theta)}_{\text{Regularizer}} \Big\}$$
Past years have witnessed an explosion of results (compressed sensing, covariance estimation, block-sparsity, graphical models, matrix completion...). Question: Is there a common set of underlying principles?
SLIDE 4 Last lecture: Sparse linear regression
[Figure: observation model y = Xθ* + w, with design matrix X ∈ R^{n×p}, support S and complement S^c]
Set-up: noisy observations y = Xθ* + w with sparse θ*.
Estimator: Lasso program
$$\widehat{\theta} \in \arg\min_{\theta} \Big\{ \frac{1}{n} \sum_{i=1}^n (y_i - x_i^T \theta)^2 + \lambda_n \sum_{j=1}^p |\theta_j| \Big\}$$
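To make the Lasso program concrete, here is a minimal proximal-gradient (ISTA) sketch in Python/NumPy. The problem sizes, noise level, and the specific choice of λ_n are illustrative, not taken from the slides:

```python
import numpy as np

def soft_threshold(v, tau):
    """Elementwise soft-thresholding: prox operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize (1/n)||y - X theta||_2^2 + lam * ||theta||_1 via ISTA."""
    n, p = X.shape
    theta = np.zeros(p)
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n     # Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ theta - y) / n
        theta = soft_threshold(theta - grad / L, lam / L)
    return theta

# Toy instance: s-sparse theta* in p dimensions, n noisy observations.
rng = np.random.default_rng(0)
n, p, s, sigma = 200, 500, 10, 0.1
X = rng.standard_normal((n, p))
theta_star = np.zeros(p); theta_star[:s] = 1.0
y = X @ theta_star + sigma * rng.standard_normal(n)
theta_hat = lasso_ista(X, y, lam=2 * sigma * np.sqrt(np.log(p) / n))
print("estimation error:", np.linalg.norm(theta_hat - theta_star))
```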
SLIDES 5-7 Block-structured extension
[Figure: observation model Y = XΘ* + W, with X ∈ R^{n×p}, Θ* ∈ R^{p×r}, non-zero rows S and zero rows S^c]
Signal Θ* is a p × r matrix, partitioned into non-zero rows S and zero rows S^c.
Various applications: multiple-view imaging, gene array prediction, graphical model fitting.
Row-wise ℓ1/ℓ2-norm:
$$|\!|\!|\Theta|\!|\!|_{1,2} = \sum_{j=1}^p \|\Theta_j\|_2$$
More complicated group structure (Obozinski et al., 2009):
$$|\!|\!|\Theta^*|\!|\!|_{\mathcal{G}} = \sum_{g \in \mathcal{G}} \|\Theta_g\|_2$$
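A small NumPy sketch of these two norms, with arbitrary toy dimensions and a made-up grouping for illustration:

```python
import numpy as np

def l1_l2_norm(Theta):
    """Row-wise l1/l2 norm: sum over rows j of ||Theta_j||_2."""
    return np.linalg.norm(Theta, axis=1).sum()

def group_norm(theta, groups):
    """More general group norm: sum of l2 norms over index groups."""
    return sum(np.linalg.norm(theta[g]) for g in groups)

rng = np.random.default_rng(1)
Theta = rng.standard_normal((6, 3))
print(l1_l2_norm(Theta))                                              # |||Theta|||_{1,2}
print(group_norm(Theta.ravel(), [np.arange(0, 9), np.arange(9, 18)]))  # |||.|||_G
```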
SLIDE 8 Example: Low-rank matrix approximation
[Figure: factorization Θ* = U D V^T, with U ∈ R^{p1×r}, D ∈ R^{r×r}, V^T ∈ R^{r×p2}]
Set-up: matrix Θ* ∈ R^{p1×p2} with rank r ≪ min{p1, p2}.
Estimator:
$$\widehat{\Theta} \in \arg\min_{\Theta} \Big\{ \frac{1}{n}\sum_{i=1}^n \big(y_i - \langle\!\langle X_i, \Theta \rangle\!\rangle\big)^2 + \lambda_n \sum_{j=1}^{\min\{p_1, p_2\}} \sigma_j(\Theta) \Big\}$$
Some past work: Fazel, 2001; Srebro et al., 2004; Recht, Fazel & Parrilo, 2007; Bach, 2008; Candès & Recht, 2008; Keshavan et al., 2009; Rohde & Tsybakov, 2010; Recht, 2009; Negahban & W., 2010...
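The regularizer here is the nuclear norm, the sum of singular values. A short NumPy sketch of the norm and of its prox operator (singular-value soft-thresholding), which is the key subroutine when solving such programs by proximal methods; the test matrix is arbitrary:

```python
import numpy as np

def nuclear_norm(Theta):
    """|||Theta|||_1: sum of the singular values of Theta."""
    return np.linalg.svd(Theta, compute_uv=False).sum()

def svt(Theta, tau):
    """Singular-value soft-thresholding: prox operator of tau * |||.|||_1."""
    U, s, Vt = np.linalg.svd(Theta, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))
print(nuclear_norm(A), nuclear_norm(svt(A, 0.5)))   # thresholding shrinks the norm
```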
SLIDE 9 Application: Collaborative filtering
[Figure: partially observed ratings matrix; entries marked ∗ are unobserved]
Universe of p1 individuals and p2 films. Observe n ≪ p1 p2 ratings. (e.g., Srebro, Alon & Jaakkola, 2004)
SLIDES 10-11 Security and robustness issues
[Figure: a "spiritual guide" item recommended alongside a sex manual]
Break-down of Amazon recommendation system, 2002.
SLIDES 12-13 Matrix decomposition: Low-rank plus sparse
Matrix Y ∈ R^{p1×p2} can be (approximately) decomposed into a sum:
$$Y \;\approx\; \underbrace{U D V^T}_{\Theta^*} + \;\Gamma^*,$$
where Θ* = U D V^T is low-rank (U ∈ R^{p1×r}, D ∈ R^{r×r}, V^T ∈ R^{r×p2}) and Γ* is sparse.
Exact decomposition: initially studied by Chandrasekaran, Sanghavi, Parrilo & Willsky, 2009. Subsequent work: Candès et al., 2010; Xu et al., 2010; Hsu et al., 2010; Agarwal et al., 2011.
Various applications:
◮ robust collaborative filtering
◮ robust PCA
◮ graphical model selection with hidden variables
SLIDES 14-15 Gauss-Markov models with hidden variables
[Figure: star graph with hidden variable Z connected to observed X1, X2, X3, X4]
Problems with hidden variables: conditioned on hidden Z, the vector X = (X1, X2, X3, X4) is Gauss-Markov.
Inverse covariance of X satisfies a {sparse, low-rank} decomposition:
$$\begin{bmatrix} 1-\mu & -\mu & -\mu & -\mu \\ -\mu & 1-\mu & -\mu & -\mu \\ -\mu & -\mu & 1-\mu & -\mu \\ -\mu & -\mu & -\mu & 1-\mu \end{bmatrix} \;=\; I_{4\times 4} - \mu\, \mathbf{1}\mathbf{1}^T.$$
(Chandrasekaran, Parrilo & Willsky, 2010)
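A quick numeric check of this structure: marginalizing out Z via the Schur complement of a star-graph joint precision matrix yields exactly the diagonal-minus-rank-one form above. The edge weight a and the Z-precision k are invented for illustration:

```python
import numpy as np

# Star graph: hidden Z attached to X1..X4; joint precision [[I, a*1], [a*1^T, k]].
a, k = 0.4, 1.0
K_XX, K_XZ, K_ZZ = np.eye(4), a * np.ones((4, 1)), np.array([[k]])
# Marginal precision of X = Schur complement of the K_ZZ block:
prec_X = K_XX - K_XZ @ np.linalg.inv(K_ZZ) @ K_XZ.T
mu = a ** 2 / k
assert np.allclose(prec_X, np.eye(4) - mu * np.ones((4, 4)))
print(prec_X)   # dense, but = sparse part (I) plus rank-one part (-mu * 11^T)
```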
SLIDE 16 Example: Sparse principal components analysis
[Figure: decomposition Σ = ZZ^T + D]
Set-up: covariance matrix Σ = ZZ^T + D, where the leading eigenspace Z has sparse columns.
Estimator (SDP relaxation):
$$\widehat{\Theta} \in \arg\min_{\Theta \succeq 0,\; \mathrm{tr}(\Theta) = 1} \Big\{ -\langle\!\langle \Theta, \widehat{\Sigma} \rangle\!\rangle + \lambda_n \sum_{j,k} |\Theta_{jk}| \Big\}$$
Some past work: Johnstone, 2001; Jolliffe et al., 2003; Johnstone & Lu, 2004; Zou et al., 2004; d'Aspremont et al., 2007; Johnstone & Paul, 2008; Amini & Wainwright, 2008
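The SDP is one route; a lighter-weight alternative cited on the slide is diagonal thresholding (Johnstone & Lu, 2004). A NumPy sketch on a toy spiked-covariance model, with all sizes and the spike strength invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
p, s, n_samp = 100, 5, 50
z = np.zeros(p); z[:s] = 1.0 / np.sqrt(s)          # sparse leading eigenvector
Sigma = 4.0 * np.outer(z, z) + np.eye(p)           # spiked covariance ZZ^T + D
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n_samp)
Sigma_hat = X.T @ X / n_samp

# Classical PCA: leading eigenvector of the sample covariance (dense, noisy when p > n).
v_dense = np.linalg.eigh(Sigma_hat)[1][:, -1]

# Diagonal thresholding: keep the s highest-variance coordinates,
# then do PCA on the corresponding submatrix.
idx = np.argsort(np.diag(Sigma_hat))[-s:]
sub = Sigma_hat[np.ix_(idx, idx)]
v_sparse = np.zeros(p); v_sparse[idx] = np.linalg.eigh(sub)[1][:, -1]

print("dense PCA overlap: ", abs(v_dense @ z))
print("sparse PCA overlap:", abs(v_sparse @ z))
```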
SLIDES 17-19 Motivation and roadmap
Many results on different high-dimensional models, all based on estimators of the type:
$$\widehat{\theta} \in \arg\min_{\theta \in \Omega} \Big\{ \underbrace{\mathcal{L}(\theta; Z_1^n)}_{\text{Loss function}} + \lambda_n \underbrace{\mathcal{R}(\theta)}_{\text{Regularizer}} \Big\}$$
Question: Is there a common set of underlying principles?
Answer: Yes, two essential ingredients:
(I) Restricted strong convexity of the loss function
(II) Decomposability of the regularizer
SLIDES 20-21 (I) Role of curvature
1. Curvature controls difficulty of estimation:
[Figure: loss difference δL versus error ∆; (a) high curvature: easy to estimate, (b) low curvature: harder]
2. Captured by a lower bound on the Taylor-series error T_L(∆; θ*):
$$\mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) - \langle \nabla \mathcal{L}(\theta^*), \Delta \rangle \;\ge\; \gamma^2 \|\Delta\|^2 \qquad \text{for all } \Delta \text{ around } \theta^*.$$
SLIDE 22 High dimensions: no strong convexity!
[Figure: loss surface with a flat (zero-curvature) direction]
When p > n, the Hessian ∇²L(θ; Z_1^n) has a nullspace of dimension at least p − n.
SLIDES 23-25 Restricted strong convexity
Definition. The loss function L_n satisfies restricted strong convexity (RSC) with respect to the regularizer R if
$$\underbrace{\mathcal{L}_n(\theta^* + \Delta) - \big\{ \mathcal{L}_n(\theta^*) + \langle \nabla \mathcal{L}_n(\theta^*), \Delta \rangle \big\}}_{\text{Taylor error } T_{\mathcal{L}}(\Delta;\, \theta^*)} \;\ge\; \underbrace{\gamma_\ell^2\, \|\Delta\|_e^2}_{\text{Lower curvature}} \;-\; \tau_\ell^2\, \mathcal{R}^2(\Delta)$$
for all ∆ in a suitable neighborhood of θ*.
Ordinary strong convexity:
◮ special case with tolerance τ_ℓ = 0
◮ does not hold for most loss functions when p > n
RSC enforces a lower bound on curvature, but only when R²(∆) ≪ ‖∆‖²_e.
SLIDES 26-29 Least-squares: RSC ≡ restricted eigenvalue
For the least-squares loss $\mathcal{L}(\theta) = \frac{1}{2n}\|y - X\theta\|_2^2$:
$$T_{\mathcal{L}}(\Delta; \theta^*) = \mathcal{L}_n(\theta^* + \Delta) - \mathcal{L}_n(\theta^*) - \langle \nabla \mathcal{L}_n(\theta^*), \Delta \rangle = \frac{1}{2n}\|X\Delta\|_2^2.$$
Restricted eigenvalue (RE) condition (van de Geer, 2007; Bickel et al., 2009):
$$\frac{\|X\Delta\|_2^2}{2n} \;\ge\; \gamma^2\, \|\Delta\|_2^2 \qquad \text{for all } \Delta \text{ with } \|\Delta_{S^c}\|_1 \le \|\Delta_S\|_1,$$
or, in relaxed form, for all ∆ ∈ R^p with ‖∆‖₁ ≤ 2√s ‖∆‖₂.
Holds with high probability for various sub-Gaussian designs when n ≳ s log(p/s); fairly strong dependency between covariates is possible.
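A numeric illustration of why RE-type conditions are plausible even though the full Hessian is degenerate; the sampling of sparse directions is a crude stand-in for a supremum over the cone, and all sizes are invented:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, s = 100, 300, 5
X = rng.standard_normal((n, p))

# The Hessian X^T X / n has a nullspace of dimension >= p - n, yet over sparse
# directions (which satisfy ||D||_1 <= sqrt(s) ||D||_2, inside the RE cone) the
# curvature ||X D||_2^2 / (2n ||D||_2^2) stays bounded away from zero.
ratios = []
for _ in range(2000):
    delta = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)
    delta[support] = rng.standard_normal(s)
    ratios.append(np.linalg.norm(X @ delta) ** 2 / (2 * n * np.linalg.norm(delta) ** 2))
print("min curvature over sampled s-sparse directions:", min(ratios))
```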
SLIDES 30-32 Restricted strong convexity for GLMs
Generalized linear model linking covariates x ∈ R^p to output y ∈ R:
$$\mathbb{P}(y \mid x, \theta^*) \propto \exp\big\{ y\, \langle \theta^*, x \rangle - \Phi(\langle \theta^*, x \rangle) \big\}$$
Taylor series expansion involves the random Hessian
$$H(\theta) = \frac{1}{n} \sum_{i=1}^n \Phi''\big(\langle \theta, X_i \rangle\big)\, X_i X_i^T \;\in\; \mathbb{R}^{p \times p}.$$
Proposition (Negahban, W., Ravikumar & Yu, 2010). For zero-mean sub-Gaussian covariates X_i with covariance Σ and any GLM,
$$T_{\mathcal{L}}(\Delta; \theta^*) \;\ge\; c_3\, \|\Sigma^{1/2}\Delta\|_2 \Big\{ \|\Sigma^{1/2}\Delta\|_2 - c_4\, \kappa(\Sigma) \Big(\frac{\log p}{n}\Big)^{1/2} \|\Delta\|_1 \Big\}$$
with probability at least 1 − c₁ exp(−c₂ n). Here κ(Σ) = max_j Σ_jj.
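For a concrete GLM, here is a sketch of the Taylor error for logistic regression (Φ(t) = log(1 + e^t)); the data-generating choices are arbitrary, and by convexity the printed value must be nonnegative:

```python
import numpy as np

def logistic_loss(theta, X, y):
    """(1/n) sum_i [ log(1 + exp(<x_i, theta>)) - y_i <x_i, theta> ]."""
    z = X @ theta
    return np.mean(np.logaddexp(0.0, z) - y * z)

def taylor_error(theta_star, delta, X, y):
    """T_L(delta; theta*) = L(theta* + delta) - L(theta*) - <grad L(theta*), delta>."""
    z = X @ theta_star
    grad = X.T @ (1.0 / (1.0 + np.exp(-z)) - y) / len(y)
    return (logistic_loss(theta_star + delta, X, y)
            - logistic_loss(theta_star, X, y) - grad @ delta)

rng = np.random.default_rng(4)
n, p = 200, 50
X = rng.standard_normal((n, p))
theta_star = rng.standard_normal(p) / np.sqrt(p)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ theta_star))).astype(float)
delta = rng.standard_normal(p) / np.sqrt(p)
print(taylor_error(theta_star, delta, X, y))   # nonnegative, by convexity
```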
SLIDES 33-35 (II) Decomposable regularizers
[Figure: subspace pair (M, M^⊥)]
Subspace M: approximation to model parameters. Complementary subspace M^⊥: undesirable deviations.
The regularizer R decomposes across (M, M^⊥) if
R(α + β) = R(α) + R(β) for all α ∈ M and β ∈ M^⊥.
Includes:
- (weighted) ℓ1-norms
- nuclear norm
- group-sparse norms
- sums of decomposable norms
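For the ℓ1 norm and a support-based subspace pair, decomposability is just the fact that the ℓ1 norm adds over disjoint supports; a one-line check with an arbitrary support set:

```python
import numpy as np

rng = np.random.default_rng(7)
p, S = 10, np.array([0, 1, 2])
alpha = np.zeros(p); alpha[S] = rng.standard_normal(S.size)   # alpha in M
beta = rng.standard_normal(p); beta[S] = 0.0                   # beta in M-perp
# l1 decomposes across this subspace pair: R(alpha + beta) = R(alpha) + R(beta)
assert np.isclose(np.abs(alpha + beta).sum(),
                  np.abs(alpha).sum() + np.abs(beta).sum())
```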
SLIDES 36-37 Main theorem
Estimator:
$$\widehat{\theta} \in \arg\min_{\theta \in \mathbb{R}^p} \big\{ \mathcal{L}(\theta; Z_1^n) + \lambda_n \mathcal{R}(\theta) \big\}$$
where L satisfies RSC(γ, τ) w.r.t. the regularizer R.
Theorem (Negahban, Ravikumar, W. & Yu, 2012). Suppose that θ* ∈ M. For any regularization parameter λ_n ≥ 2R*(∇L(θ*; Z_1^n)), the estimate θ̂_{λn} satisfies
$$\|\widehat{\theta}_{\lambda_n} - \theta^*\|^2 \;\lesssim\; \frac{1}{\gamma^2(\mathcal{L})} \Big[ \lambda_n^2\, \Psi^2(\mathcal{M}) + \tau^2(\mathcal{L}) \Big]$$
Quantities that control rates:
- curvature in RSC: γ_ℓ
- tolerance in RSC: τ
- dual norm of regularizer: R*(v) := sup_{R(u) ≤ 1} ⟨v, u⟩
- optimal subspace constant: Ψ(M) = sup_{θ ∈ M \ {0}} R(θ)/‖θ‖
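For R = ℓ1 and M the subspace of vectors supported on a fixed set of size s, the dual norm R* is ℓ∞ and Ψ(M) = √s. A quick numeric sanity check of the subspace constant, with arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(5)
p, s = 100, 9
ratios = []
for _ in range(10000):
    theta = np.zeros(p)
    theta[:s] = rng.standard_normal(s)           # theta in M: supported on first s coords
    ratios.append(np.abs(theta).sum() / np.linalg.norm(theta))
print(max(ratios), "<= sqrt(s) =", np.sqrt(s))   # Psi(M) = sqrt(s), attained at sign vectors
```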
SLIDE 38 Main theorem (oracle version)
Estimator:
$$\widehat{\theta} \in \arg\min_{\theta \in \mathbb{R}^p} \big\{ \mathcal{L}(\theta; Z_1^n) + \lambda_n \mathcal{R}(\theta) \big\}$$
Theorem (oracle version). For any regularization parameter λ_n ≥ 2R*(∇L(θ*; Z_1^n)), the estimate θ̂ satisfies
$$\|\widehat{\theta} - \theta^*\|^2 \;\lesssim\; \frac{(\lambda'_n)^2}{\gamma^2(\mathcal{L})}\, \Psi^2(\mathcal{M}) + \frac{\lambda'_n}{\gamma(\mathcal{L})}\, \mathcal{R}\big(\Pi_{\mathcal{M}^\perp}(\theta^*)\big),$$
where λ'_n = max{λ_n, τ(L)}.
Quantities that control rates:
- curvature in RSC: γ_ℓ
- tolerance in RSC: τ
- dual norm of regularizer: R*(v) := sup_{R(u) ≤ 1} ⟨v, u⟩
- optimal subspace constant: Ψ(M) = sup_{θ ∈ M \ {0}} R(θ)/‖θ‖
SLIDES 39-40 Example: Linear regression (exact sparsity)
Lasso program:
$$\min_{\theta \in \mathbb{R}^p} \Big\{ \frac{1}{2n}\|y - X\theta\|_2^2 + \lambda_n \|\theta\|_1 \Big\}$$
RSC corresponds to a lower bound on restricted eigenvalues of X^T X ∈ R^{p×p}; for an s-sparse vector, we have ‖θ‖₁ ≤ √s ‖θ‖₂.
Corollary. Suppose that the true parameter θ* is exactly s-sparse. Under RSC, and with λ_n ≥ 2‖X^T w/n‖_∞, any Lasso solution satisfies
$$\|\widehat{\theta} - \theta^*\|_2^2 \;\le\; \frac{4}{\gamma^2}\, s\, \lambda_n^2.$$
Some stochastic instances recover known results:
- Compressed sensing: X_ij ∼ N(0, 1) and bounded noise ‖w‖₂ ≤ σ√n.
- Deterministic design: X with bounded columns and w_i ∼ N(0, σ²).
In both cases ‖X^T w/n‖_∞ ≲ σ √(log p / n) w.h.p., so that
$$\|\widehat{\theta} - \theta^*\|_2^2 \;\le\; \frac{12\, \sigma^2}{\gamma^2}\, \frac{s \log p}{n}.$$
(e.g., Candès & Tao, 2007; Huang & Zhang, 2008; Meinshausen & Yu, 2008; Bickel et al., 2008)
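A small simulation of the s log p / n scaling, using scikit-learn's Lasso (whose objective matches the (1/2n) normalization above); all problem sizes are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
p, s, sigma = 500, 10, 0.5
for n in [200, 400, 800, 1600]:
    X = rng.standard_normal((n, p))
    theta_star = np.zeros(p); theta_star[:s] = 1.0
    y = X @ theta_star + sigma * rng.standard_normal(n)
    lam = 2 * sigma * np.sqrt(np.log(p) / n)          # lambda_n ~ sigma sqrt(log p / n)
    theta_hat = Lasso(alpha=lam, fit_intercept=False, max_iter=5000).fit(X, y).coef_
    err2 = np.sum((theta_hat - theta_star) ** 2)
    print(n, err2, err2 * n / (s * np.log(p)))        # rescaled error: roughly constant
```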
SLIDE 41 Example: Linear regression (weak sparsity)
For some q ∈ [0, 1], say θ* belongs to the ℓq-"ball"
$$\mathbb{B}_q(R_q) := \Big\{ \theta \in \mathbb{R}^p : \sum_{j=1}^p |\theta_j|^q \le R_q \Big\}$$
Corollary. For θ* ∈ B_q(R_q), any Lasso solution satisfies (w.h.p.)
$$\|\widehat{\theta} - \theta^*\|_2^2 \;\lesssim\; R_q \Big( \frac{\sigma^2 \log p}{n} \Big)^{1 - q/2}.$$
This rate is known to be minimax optimal (Raskutti, W. & Yu, 2011).
SLIDE 42 Example: Group-structured regularizers
Many applications exhibit sparsity with more structure...
[Figure: coefficient vector partitioned into groups G1, G2, G3]
Divide the index set {1, 2, ..., p} into groups G = {G1, G2, ..., GT}. For parameters ν_t ∈ [1, ∞], define the block norm
$$\|\theta\|_{\nu, \mathcal{G}} := \sum_{t=1}^T \|\theta_{G_t}\|_{\nu_t}$$
Group/block Lasso program:
$$\min_{\theta \in \mathbb{R}^p} \Big\{ \frac{1}{2n}\|y - X\theta\|_2^2 + \lambda_n \|\theta\|_{\nu, \mathcal{G}} \Big\}$$
Different versions studied by various authors (Wright et al., 2005; Tropp et al., 2006; Yuan & Li, 2006; Baraniuk, 2008; Obozinski et al., 2008; Zhao et al., 2008; Bach et al., 2009; Lounici et al., 2009).
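A minimal proximal-gradient sketch for the ν = 2 case, assuming disjoint groups that cover all coordinates (so the prox of the block norm decomposes group by group); the example sizes are invented:

```python
import numpy as np

def block_soft_threshold(v, tau):
    """Prox of tau * ||.||_2 on a single group: shrink the whole block toward zero."""
    nrm = np.linalg.norm(v)
    return np.zeros_like(v) if nrm <= tau else (1.0 - tau / nrm) * v

def group_lasso(X, y, groups, lam, n_iter=500):
    """Minimize (1/2n)||y - X theta||_2^2 + lam * sum_t ||theta_{G_t}||_2."""
    n, p = X.shape
    theta = np.zeros(p)
    L = np.linalg.norm(X, 2) ** 2 / n          # Lipschitz constant of the gradient
    for _ in range(n_iter):
        z = theta - X.T @ (X @ theta - y) / (n * L)
        for g in groups:                       # prox decomposes across disjoint groups
            theta[g] = block_soft_threshold(z[g], lam / L)
    return theta

# Example: 10 disjoint groups of size 3, one active group.
rng = np.random.default_rng(10)
n, T, m = 100, 10, 3
groups = [np.arange(t * m, (t + 1) * m) for t in range(T)]
X = rng.standard_normal((n, T * m))
theta_star = np.zeros(T * m); theta_star[groups[0]] = 1.0
y = X @ theta_star + 0.1 * rng.standard_normal(n)
theta_hat = group_lasso(X, y, groups, lam=0.05)
print("active groups:", [t for t in range(T)
                         if np.linalg.norm(theta_hat[groups[t]]) > 1e-8])
```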
SLIDES 43-45 Convergence rates for general group Lasso
Corollary. Say Θ* is supported on s_G groups, and X satisfies RSC. Then for regularization parameter
$$\lambda_n \;\ge\; 2 \max_{t = 1, 2, \ldots, T} \Big\| \Big( \frac{X^T w}{n} \Big)_{G_t} \Big\|_{\nu_t^*}, \qquad \text{where } \frac{1}{\nu_t^*} = 1 - \frac{1}{\nu_t},$$
any solution θ̂_{λn} satisfies
$$\|\widehat{\theta}_{\lambda_n} - \theta^*\|_2 \;\lesssim\; \frac{\Psi_\nu(S_{\mathcal{G}})}{\gamma_\ell}\, \lambda_n, \qquad \text{where } \Psi_\nu(S_{\mathcal{G}}) = \sup_{\theta \in \mathcal{M}(S_{\mathcal{G}}) \setminus \{0\}} \frac{\|\theta\|_{\nu, \mathcal{G}}}{\|\theta\|_2}.$$
Some special cases, with m ≡ max. group size:
1. ℓ1/ℓ2 regularization (group norm with ν = 2):
$$\|\widehat{\theta} - \theta^*\|_2^2 = O\Big( \frac{s_{\mathcal{G}}\, m}{n} + \frac{s_{\mathcal{G}} \log T}{n} \Big)$$
2. ℓ1/ℓ∞ regularization (group norm with ν = ∞):
$$\|\widehat{\theta} - \theta^*\|_2^2 = O\Big( \frac{s_{\mathcal{G}}\, m^2}{n} + \frac{s_{\mathcal{G}} \log T}{n} \Big)$$
SLIDES 46-48 Example: Low-rank matrices and nuclear norm
Matrix Θ* ∈ R^{p1×p2} that is exactly (or approximately) low-rank; noisy/partial observations of the form
$$y_i = \langle\!\langle X_i, \Theta^* \rangle\!\rangle + w_i, \quad i = 1, \ldots, n, \qquad w_i \text{ i.i.d. noise.}$$
Estimate by solving the semi-definite program (SDP):
$$\widehat{\Theta} \in \arg\min_{\Theta} \Big\{ \frac{1}{n}\sum_{i=1}^n \big(y_i - \langle\!\langle X_i, \Theta \rangle\!\rangle\big)^2 + \lambda_n \underbrace{\sum_{j=1}^{\min\{p_1, p_2\}} \sigma_j(\Theta)}_{|\!|\!|\Theta|\!|\!|_1} \Big\}$$
Applications:
◮ matrix completion
◮ rank-reduced multivariate regression
◮ time-series modeling (vector autoregressions)
◮ ...
Some past work: Fazel, 2001; Srebro et al., 2004; Recht et al., 2007; Candès & Recht, 2008; Recht, 2009; Negahban & W., 2010; Rohde & Tsybakov, 2010.
SLIDES 49-50 Rates for (near) low-rank estimation
For a parameter q ∈ [0, 1], define the set of near low-rank matrices:
$$\mathbb{B}_q(R_q) = \Big\{ \Theta^* : \sum_{j=1}^{\min\{p_1, p_2\}} |\sigma_j(\Theta^*)|^q \le R_q \Big\}$$
Corollary (Negahban & W., 2011). Under the RSC condition, with regularization parameter
$$\lambda_n \ge 16\sigma \Big( \sqrt{\frac{p_1}{n}} + \sqrt{\frac{p_2}{n}} \Big),$$
we have w.h.p.
$$|\!|\!|\widehat{\Theta} - \Theta^*|\!|\!|_F^2 \;\le\; c_0\, \frac{R_q}{\gamma_\ell^2} \Big( \frac{\sigma^2 (p_1 + p_2)}{n} \Big)^{1 - \frac{q}{2}}$$
For a rank-r matrix M:
$$|\!|\!|M|\!|\!|_1 = \sum_{j=1}^r \sigma_j(M) \;\le\; \sqrt{r}\, \sqrt{\sum_j \sigma_j^2(M)} \;=\; \sqrt{r}\, |\!|\!|M|\!|\!|_F.$$
Solve the nuclear-norm regularized program with
$$\lambda_n \;\ge\; 2\, \Big|\!\Big|\!\Big| \frac{1}{n} \sum_{i=1}^n w_i X_i \Big|\!\Big|\!\Big|_2 \qquad \text{(operator norm)}.$$
SLIDE 51 Noisy matrix completion (unrescaled)
[Plot: MSE versus raw sample size, q = 0, for matrix dimensions d² ∈ {40², 60², 80², 100²}]
SLIDE 52 Noisy matrix completion (rescaled)
[Plot: MSE versus rescaled sample size, q = 0, same dimensions]
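A toy noisy matrix-completion experiment in the spirit of these plots, solved by proximal gradient with singular-value thresholding; the dimensions, observation probability, noise level, and λ are invented for illustration:

```python
import numpy as np

def svt(M, tau):
    """Singular-value soft-thresholding: prox of tau * |||.|||_1 (nuclear norm)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def complete(Y_obs, mask, lam, n_iter=300):
    """Proximal gradient on (1/2)||mask * (Theta - Y)||_F^2 + lam * |||Theta|||_1."""
    Theta = np.zeros_like(Y_obs)
    for _ in range(n_iter):
        Theta = svt(Theta - mask * (Theta - Y_obs), lam)   # gradient is 1-Lipschitz
    return Theta

rng = np.random.default_rng(8)
d, r = 60, 2
Theta_star = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))
mask = rng.random((d, d)) < 0.3                            # observe ~30% of entries
Y_obs = mask * (Theta_star + 0.1 * rng.standard_normal((d, d)))
Theta_hat = complete(Y_obs, mask, lam=0.1)
print("relative error:",
      np.linalg.norm(Theta_hat - Theta_star) / np.linalg.norm(Theta_star))
```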
SLIDES 53-54 Summary
Unified framework for high-dimensional M-estimators:
◮ decomposability of the regularizer R
◮ restricted strong convexity of the loss function
Actual rates determined by:
◮ noise measured in the dual function R*
◮ subspace constant Ψ in moving from R to the error norm ‖·‖
◮ restricted strong convexity constant
Looking ahead to tomorrow: from parametric to non-parametric problems, using kernel-based methods in high dimensions.
SLIDE 55 Some papers (www.eecs.berkeley.edu/wainwrig)
1. S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, December 2012.
2. S. Negahban and M. J. Wainwright (2011). Estimation rates of (near) low-rank matrices with noise and high-dimensional scaling. Annals of Statistics, 39(1):1069-1097.
3. S. Negahban and M. J. Wainwright (2012). Restricted strong convexity and (weighted) matrix completion: Optimal bounds with noise. Journal of Machine Learning Research, May 2012.
4. G. Raskutti, M. J. Wainwright, and B. Yu (2011). Minimax rates for linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57(10):6976-6994.
SLIDE 56 Significance of decomposability
[Figure: (a) the set C for an exact model (a cone); (b) the set C for an approximate model (star-shaped)]
Lemma. Suppose that L is convex, and R is decomposable w.r.t. M. Then as long as λ_n ≥ 2R*(∇L(θ*; Z_1^n)), the error ∆̂ = θ̂_{λn} − θ* belongs to
$$\mathcal{C}(\mathcal{M}, \mathcal{M}^\perp; \theta^*) := \big\{ \Delta : \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\Delta)) \le 3\, \mathcal{R}(\Pi_{\mathcal{M}}(\Delta)) + 4\, \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)) \big\}$$
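A numeric check of cone membership for the Lasso with an exactly sparse θ* (so the second term vanishes and the lemma reduces to ‖∆_{S^c}‖₁ ≤ 3‖∆_S‖₁); scikit-learn's Lasso matches the (1/2n)-normalized objective, and all sizes are invented:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(9)
n, p, s = 200, 500, 10
X = rng.standard_normal((n, p))
theta_star = np.zeros(p); theta_star[:s] = 1.0   # theta* in M: exactly sparse
w = 0.5 * rng.standard_normal(n)
y = X @ theta_star + w
# For R = l1, R*(grad L(theta*)) = ||X^T w / n||_inf; take lambda_n = twice that.
lam = 2 * np.max(np.abs(X.T @ w)) / n
delta = Lasso(alpha=lam, fit_intercept=False, max_iter=5000).fit(X, y).coef_ - theta_star
print(np.abs(delta[s:]).sum(), "<=", 3 * np.abs(delta[:s]).sum())   # cone check
```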