SLIDE 1 High-dimensional statistics: Some progress and challenges ahead
Martin Wainwright
UC Berkeley, Departments of Statistics and EECS
University College London, Master Class: Lecture 1. Joint work with: Alekh Agarwal, Arash Amini, Po-Ling Loh, Sahand Negahban, Garvesh Raskutti, Pradeep Ravikumar, and Bin Yu.
SLIDES 2–5 Introduction
classical asymptotic theory: sample size n → +∞ with number of parameters p fixed
◮ law of large numbers, central limit theory
◮ consistency of maximum likelihood estimation
modern applications in science and engineering:
◮ large-scale problems: both p and n may be large (possibly p ≫ n)
◮ need for high-dimensional theory that provides non-asymptotic results for (n, p)
curses and blessings of high dimensionality
◮ exponential explosions in computational complexity
◮ statistical curses (sample complexity)
◮ concentration of measure
Key ideas: what embedded low-dimensional structures are present in data? how can they be exploited algorithmically?
SLIDE 6
Vignette I: High-dimensional matrix estimation
want to estimate a covariance matrix Σ ∈ R^{p×p} given i.i.d. samples Xi ∼ N(0, Σ), for i = 1, 2, . . . , n
SLIDES 7–9 Vignette I: High-dimensional matrix estimation
want to estimate a covariance matrix Σ ∈ R^{p×p} given i.i.d. samples Xi ∼ N(0, Σ), for i = 1, 2, . . . , n
Classical approach: Estimate Σ via the sample covariance matrix

\hat{\Sigma}_n = \frac{1}{n} \sum_{i=1}^n X_i X_i^T,

an average of p × p rank-one matrices.
Reasonable properties (p fixed, n increasing):
Unbiased: E[Σ̂_n] = Σ
Consistent: Σ̂_n → Σ almost surely as n → +∞
Asymptotic distributional properties available
An alternative experiment: fix some α > 0 and study behavior over sequences with p/n = α. Does Σ̂_n(p) converge to anything reasonable?
SLIDE 10
[Figure: empirical eigenvalue density vs. the Marchenko–Pastur law (α = 0.5); histogram of rescaled eigenvalues against the theoretical density.]
(Marchenko & Pastur, 1967)
SLIDE 11
[Figure: empirical eigenvalue density vs. the Marchenko–Pastur law (α = 0.2); histogram of rescaled eigenvalues against the theoretical density.]
(Marchenko & Pastur, 1967)
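The following minimal simulation sketch (not from the slides; it assumes Σ = I_p, with sizes chosen only for illustration) reproduces the phenomenon behind these figures: when p/n = α stays bounded away from zero, the spectrum of the sample covariance spreads over the Marchenko–Pastur bulk instead of concentrating at 1.

```python
import numpy as np

def mp_density(x, alpha):
    """Marchenko-Pastur density for Sigma = I and aspect ratio alpha = p/n <= 1."""
    a, b = (1 - np.sqrt(alpha)) ** 2, (1 + np.sqrt(alpha)) ** 2
    out = np.zeros_like(x)
    inside = (x > a) & (x < b)
    out[inside] = np.sqrt((b - x[inside]) * (x[inside] - a)) / (2 * np.pi * alpha * x[inside])
    return out

rng = np.random.default_rng(0)
alpha, n = 0.5, 4000
p = int(alpha * n)
X = rng.standard_normal((n, p))      # rows X_i ~ N(0, I_p)
Sigma_hat = X.T @ X / n              # sample covariance matrix
eigs = np.linalg.eigvalsh(Sigma_hat)

# Empirical spectrum vs. the MP prediction: the eigenvalue mass spreads over
# [(1 - sqrt(alpha))^2, (1 + sqrt(alpha))^2] rather than piling up at 1.
hist, edges = np.histogram(eigs, bins=40, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - mp_density(centers, alpha))))  # small for large n
```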
SLIDE 12 Low-dimensional structure: Gaussian graphical models
[Figure: graph on vertices {1, . . . , 5} and the matching zero pattern of the inverse covariance Θ*.]

P(x_1, x_2, \ldots, x_p) \propto \exp\left( -\tfrac{1}{2} x^T \Theta^* x \right)
SLIDES 13–14 Maximum likelihood with ℓ1-regularization
[Figure: graph and zero pattern of inverse covariance, as on the previous slide.]
Set-up: Samples from a random vector with sparse covariance Σ or sparse inverse covariance Θ* ∈ R^{p×p}.
Estimator (for sparse inverse covariance):

\hat{\Theta} \in \arg\min_{\Theta \succ 0} \left\{ \langle\!\langle \hat{\Sigma}, \Theta \rangle\!\rangle - \log\det(\Theta) + \lambda_n \sum_{j \neq k} |\Theta_{jk}| \right\}, \qquad \hat{\Sigma} = \frac{1}{n} \sum_{i=1}^n x_i x_i^T.

Some past work: Yuan & Lin, 2006; d'Aspremont et al., 2007; Bickel & Levina, 2007; El Karoui, 2007; Rothman et al., 2007; Zhou et al., 2007; Friedman et al., 2008; Lam & Fan, 2008; Ravikumar et al., 2008; Zhou, Cai & Huang, 2009.
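As one concrete instantiation, here is a minimal sketch of this estimator using scikit-learn's GraphicalLasso, one off-the-shelf solver among many (the chain-graph example and the penalty weight alpha, which plays the role of λ_n, are illustrative choices, not from the slides):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
p = 5
# Sparse inverse covariance for a chain graph 1-2-3-4-5 (positive definite).
Theta_star = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
Sigma_star = np.linalg.inv(Theta_star)
X = rng.multivariate_normal(np.zeros(p), Sigma_star, size=500)

# l1-regularized Gaussian MLE; alpha plays the role of lambda_n.
model = GraphicalLasso(alpha=0.05).fit(X)
Theta_hat = model.precision_
# Entries (j, k) with Theta_hat[j, k] near 0 are estimated non-edges.
print(np.round(Theta_hat, 2))
```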
SLIDES 15–16 Gauss–Markov models with hidden variables
[Figure: four observed nodes X1, X2, X3, X4, all connected to a hidden node Z.]
Problems with hidden variables: conditioned on the hidden Z, the vector X = (X1, X2, X3, X4) is Gauss–Markov.
Inverse covariance of X satisfies a {sparse, low-rank} decomposition:

\begin{pmatrix} 1-\mu & -\mu & -\mu & -\mu \\ -\mu & 1-\mu & -\mu & -\mu \\ -\mu & -\mu & 1-\mu & -\mu \\ -\mu & -\mu & -\mu & 1-\mu \end{pmatrix} = I_{4 \times 4} - \mu \mathbf{1}\mathbf{1}^T.

(Chandrasekaran, Parrilo & Willsky, 2010)
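The decomposition comes from the Schur complement: if (X, Z) is jointly Gaussian with precision matrix K, the marginal precision of X is K_XX − K_XZ K_ZZ^{-1} K_ZX, i.e. sparse minus low-rank. A small numerical check, with illustrative numbers and a single hidden variable attached to all four observed ones:

```python
import numpy as np

# Joint precision K of (X_1..X_4, Z): identity plus edges X_j -- Z.
K = np.eye(5)
K[:4, 4] = K[4, :4] = 0.4        # illustrative edge weights; K is positive definite
Sigma = np.linalg.inv(K)          # joint covariance of (X, Z)

Theta_X = np.linalg.inv(Sigma[:4, :4])              # observed (marginal) precision
low_rank = np.outer(K[:4, 4], K[4, :4]) / K[4, 4]   # rank-one Schur correction
print(np.allclose(Theta_X, K[:4, :4] - low_rank))   # True: sparse minus low-rank
```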
SLIDE 17 Vignette II: High-dimensional sparse linear regression
Set-up: noisy observations y = Xθ* + w, with X ∈ R^{n×p} and θ* ∈ R^p sparse, supported on S ⊂ {1, . . . , p}.
Estimator: Lasso program

\hat{\theta} \in \arg\min_{\theta \in \mathbb{R}^p} \left\{ \frac{1}{2n} \sum_{i=1}^n (y_i - x_i^T \theta)^2 + \lambda_n \sum_{j=1}^p |\theta_j| \right\}

Some past work: Tibshirani, 1996; Chen et al., 1998; Donoho & Huo, 2001; Tropp, 2004; Fuchs, 2004; Efron et al., 2004; Meinshausen & Buhlmann, 2005; Candes & Tao, 2005; Donoho, 2005; Haupt & Nowak, 2005; Zhou & Yu, 2006; Zou, 2006; Koltchinskii, 2007; van de Geer, 2007.
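A minimal sketch of this estimator with scikit-learn, whose Lasso objective (1/(2n))‖y − Xθ‖₂² + α‖θ‖₁ matches the display with α = λ_n; the dimensions and the theory-inspired choice of λ_n below are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, s, sigma = 200, 500, 10, 0.5
X = rng.standard_normal((n, p))
theta_star = np.zeros(p)
theta_star[:s] = 1.0                        # s-sparse truth, support S = {0,...,9}
y = X @ theta_star + sigma * rng.standard_normal(n)

lam = 2 * sigma * np.sqrt(np.log(p) / n)    # theory-inspired choice of lambda_n
theta_hat = Lasso(alpha=lam).fit(X, y).coef_
print("l2 error:", np.linalg.norm(theta_hat - theta_star))
```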
SLIDE 18 Application A: Compressed sensing
(Donoho, 2005; Candes & Tao, 2005)
[Figure: y = Xβ*, with random X ∈ R^{n×p} and vectorized image β* ∈ R^p.]
(a) Image: vectorize to β* ∈ R^p (b) Compute n random projections y = Xβ*
SLIDE 19 Application A: Compressed sensing
(Donoho, 2005; Candes & Tao, 2005)
In practice, signals are sparse in a transform domain: θ∗ := Ψβ∗ is a sparse signal, where Ψ is an orthonormal matrix.
[Figure: y = X̃θ*, with X̃ = XΨᵀ an n × p matrix applied to the s-sparse θ*.]
Reconstruct θ* (and hence the image β* = Ψᵀθ*) by finding a sparse solution to the under-constrained linear system y = X̃θ, where X̃ = XΨᵀ is another random matrix.
SLIDE 20 Noiseless ℓ1 recovery: Unrescaled sample size
[Figure: probability of exact recovery vs. raw sample size n (µ = 0), for p = 128, 256, 512.]
Probability of recovery versus sample size n.
SLIDE 21 Application B: Graph structure estimation
let G = (V, E) be an undirected graph on p = |V| vertices
pairwise graphical model factorizes over the edges of the graph:

P(x_1, \ldots, x_p; \theta) \propto \exp\left\{ \sum_{(s,t) \in E} \theta_{st}(x_s, x_t) \right\}
given n independent and identically distributed (i.i.d.) samples of X = (X1, . . . , Xp), identify the underlying graph structure
SLIDE 22 Pseudolikelihood and neighborhood regression
Markov properties encode neighborhood structure:

(X_s \mid X_{V \setminus \{s\}}) \stackrel{d}{=} (X_s \mid X_{N(s)})

[Figure: condition on the Markov blanket N(s) = {t, u, v, w} of node s.]
basis of the pseudolikelihood method (Besag, 1974)
basis of many graph learning algorithms (Friedman et al., 1999; Csiszar & Talata, 2005; Abbeel et al., 2006; Meinshausen & Buhlmann, 2006)
SLIDES 23–25 Graph selection via neighborhood regression
[Figure: binary data matrix; column X_s is regressed on the remaining columns X_{\s}.]
Predict X_s based on X_{\s} := {X_t, t ≠ s}.
1 For each node s ∈ V, compute the (regularized) maximum likelihood estimate

\hat{\theta}[s] := \arg\min_{\theta \in \mathbb{R}^{p-1}} \Big\{ \frac{1}{n} \sum_{i=1}^n \mathcal{L}(\theta; X_i^{\setminus s}) + \underbrace{\lambda_n \|\theta\|_1}_{\text{regularization}} \Big\}

2 Estimate the local neighborhood \hat{N}(s) as the support of the regression vector \hat{\theta}[s] (see the code sketch below).
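A sketch of steps 1–2 for binary data, using ℓ1-penalized logistic regression as the node-conditional likelihood (Meinshausen–Buhlmann / Ravikumar-style; the helper name, the mapping C = 1/(nλ_n), and the smoke-test data are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_neighborhoods(X, lam):
    """For each node s, l1-penalized logistic regression of X_s on the rest;
    the estimated neighborhood is the support of the coefficient vector."""
    n, p = X.shape
    nbrs = {}
    for s in range(p):
        mask = np.arange(p) != s
        # scikit-learn's C is an inverse penalty strength, roughly 1/(n * lambda_n)
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0 / (n * lam))
        clf.fit(X[:, mask], X[:, s])
        others = np.flatnonzero(mask)
        support = np.abs(clf.coef_.ravel()) > 1e-6
        nbrs[s] = set(others[support].tolist())
    return nbrs

# Smoke test on independent bits: estimated neighborhoods should be (mostly) empty.
rng = np.random.default_rng(0)
Xb = (rng.random((500, 8)) < 0.5).astype(int)
print(estimate_neighborhoods(Xb, lam=0.1))
```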
SLIDE 26
[Figure: US Senate network (2004–2006 voting).]
SLIDE 27 Outline
1 Lecture 1 (Today): Basics of sparse recovery
◮ Sparse linear systems: ℓ0/ℓ1 equivalence ◮ Noisy case: Lasso, ℓ2-bounds and variable selection
2 Lecture 2 (Tuesday): A more general theory
◮ A range of structured regularizers ⋆ Group sparsity ⋆ Low-rank matrices and nuclear norm regularization ⋆ Matrix decomposition and robust PCA ◮ Ingredients of a general understanding
3 Lecture 3 (Wednesday): High-dimensional kernel methods
◮ Curse-of-dimensionality for non-parametric regression ◮ Reproducing kernel Hilbert spaces ◮ A simple but optimal estimator
SLIDE 28 Noiseless linear models and basis pursuit
[Figure: y = Xθ* for an under-determined n × p system, with θ* supported on S.]
under-determined linear system: unidentifiable without constraints
say θ* ∈ R^p is sparse: supported on S ⊂ {1, 2, . . . , p}.
ℓ0-optimization:

\theta^* = \arg\min_{\theta \in \mathbb{R}^p} \|\theta\|_0 \quad \text{such that } X\theta = y

Computationally intractable: NP-hard.
ℓ1-relaxation (basis pursuit):

\hat{\theta} \in \arg\min_{\theta \in \mathbb{R}^p} \|\theta\|_1 \quad \text{such that } X\theta = y

Linear program (easy to solve).
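Since basis pursuit is a linear program, a generic LP solver suffices. A minimal sketch via scipy.optimize.linprog, using the standard split θ = u − v with u, v ≥ 0 (sizes are illustrative; specialized solvers are far faster at scale):

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    """min ||theta||_1 s.t. X theta = y, as the LP
    min 1^T(u + v) s.t. [X, -X][u; v] = y, u, v >= 0."""
    n, p = X.shape
    c = np.ones(2 * p)
    A_eq = np.hstack([X, -X])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * p))
    return res.x[:p] - res.x[p:]

rng = np.random.default_rng(0)
n, p, s = 80, 200, 5
X = rng.standard_normal((n, p))
theta_star = np.zeros(p)
theta_star[:s] = rng.standard_normal(s)
theta_hat = basis_pursuit(X, X @ theta_star)
print("exact recovery:", np.allclose(theta_hat, theta_star, atol=1e-6))
```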
SLIDE 29 Noiseless ℓ1 recovery: Unrescaled sample size
[Figure (repeated from Slide 20): probability of exact recovery vs. raw sample size n (µ = 0), for p = 128, 256, 512.]
Probability of recovery versus sample size n.
SLIDE 30 Noiseless ℓ1 recovery: Rescaled
[Figure: probability of exact recovery vs. rescaled sample size α (µ = 0), for p = 128, 256, 512.]
Probability of recovery versus rescaled sample size α := n / (s log(p/s)).
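A sketch of the Monte Carlo experiment behind such plots (self-contained, with small sizes and trial counts so that it runs quickly; the success tolerance is an arbitrary choice):

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    """min ||theta||_1 s.t. X theta = y, via the split theta = u - v."""
    n, p = X.shape
    res = linprog(np.ones(2 * p), A_eq=np.hstack([X, -X]), b_eq=y,
                  bounds=[(0, None)] * (2 * p))
    return res.x[:p] - res.x[p:]

rng = np.random.default_rng(0)
p, s, trials = 128, 8, 20
for alpha in (0.5, 1.0, 1.5, 2.0, 2.5):
    n = int(alpha * s * np.log(p / s))      # rescaled sample size
    hits = 0
    for _ in range(trials):
        X = rng.standard_normal((n, p))
        theta = np.zeros(p)
        theta[:s] = rng.standard_normal(s)
        hits += np.allclose(basis_pursuit(X, X @ theta), theta, atol=1e-5)
    print(f"alpha={alpha}: recovery rate {hits / trials:.2f}")
```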
SLIDES 31–33 Restricted nullspace: necessary and sufficient
Definition. For a fixed S ⊂ {1, 2, . . . , p}, the matrix X ∈ R^{n×p} satisfies the restricted nullspace property w.r.t. S, or RN(S) for short, if

\mathcal{N}(X) \cap \mathcal{C}(S) = \{0\}, \quad \text{where } \mathcal{N}(X) = \{\Delta \in \mathbb{R}^p \mid X\Delta = 0\} \text{ and } \mathcal{C}(S) = \{\Delta \in \mathbb{R}^p \mid \|\Delta_{S^c}\|_1 \le \|\Delta_S\|_1\}.

(Donoho & Huo, 2001; Feuer & Nemirovski, 2003; Cohen et al., 2009)
Proposition. Basis pursuit ℓ1-relaxation is exact for all S-sparse vectors ⟺ X satisfies RN(S).
Proof (sufficiency):
(1) The error vector ∆ = θ̂ − θ* satisfies X∆ = 0, and hence ∆ ∈ N(X).
(2) Show that ∆ ∈ C(S):
Optimality of θ̂: \|\hat{\theta}\|_1 \le \|\theta^*\|_1 = \|\theta^*_S\|_1.
Sparsity of θ*: \|\hat{\theta}\|_1 = \|\theta^*_S + \Delta_S\|_1 + \|\Delta_{S^c}\|_1.
Triangle inequality: \|\theta^*_S + \Delta_S\|_1 + \|\Delta_{S^c}\|_1 \ge \|\theta^*_S\|_1 - \|\Delta_S\|_1 + \|\Delta_{S^c}\|_1.
Combining the three displays yields \|\Delta_{S^c}\|_1 \le \|\Delta_S\|_1, i.e. ∆ ∈ C(S).
(3) Hence ∆ ∈ N(X) ∩ C(S), and (RN) ⟹ ∆ = 0, so θ̂ = θ*.
SLIDE 34 Illustration of restricted nullspace property
[Figure: the cone C(S; 1) drawn in (∆1, ∆2, ∆3)-coordinates.]
consider θ* = (0, 0, θ*_3), so that S = {3}.
the error vector ∆ = θ̂ − θ* belongs to the set

\mathcal{C}(S; 1) := \left\{ (\Delta_1, \Delta_2, \Delta_3) \in \mathbb{R}^3 \mid |\Delta_1| + |\Delta_2| \le |\Delta_3| \right\}.
SLIDES 35–38 Some sufficient conditions
How to verify the RN property for a given sparsity s?
1 Elementwise incoherence condition (Donoho & Huo, 2001; Feuer & Nemirovski, 2003):

\max_{j,k = 1, \ldots, p} \left| \left( \frac{X^T X}{n} - I_{p \times p} \right)_{jk} \right| \le \frac{c}{s} \quad \text{for a sufficiently small constant } c > 0.

Matrices with i.i.d. sub-Gaussian entries: holds w.h.p. for n = Ω(s² log p).
2 Restricted isometry, or submatrix incoherence (Candes & Tao, 2005):

\max_{|U| \le 2s} \left\| \frac{X_U^T X_U}{n} - I_{|U| \times |U|} \right\|_2 \le \delta_{2s}.

Matrices with i.i.d. sub-Gaussian entries: holds w.h.p. for n = Ω(s log(p/s)).
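A quick numerical sanity check of condition 1 for an i.i.d. Gaussian design (illustrative sizes): the maximal entry of XᵀX/n − I decays at the √(log p / n) scale, consistent with needing n on the order of s² log p before it falls below a constant multiple of 1/s:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 300
for n in (200, 800, 3200):
    X = rng.standard_normal((n, p))
    G = X.T @ X / n - np.eye(p)                  # X^T X / n - I
    print(n, np.max(np.abs(G)), np.sqrt(np.log(p) / n))  # comparable magnitudes
```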
SLIDES 39–42 Violating matrix incoherence (elementwise/RIP)
Important: Incoherence/RIP conditions imply RN, but are far from necessary. Very easy to violate them.....
Form a random design matrix X ∈ R^{n×p} with columns x1, x2, . . . , xp and rows X_1^T, . . . , X_n^T, where each row X_i ∼ N(0, Σ) i.i.d.
Example: For some µ ∈ (0, 1), consider the covariance matrix Σ = (1 − µ) I_{p×p} + µ 1 1^T.
Elementwise incoherence violated: for any j ≠ k,

\mathbb{P}\left[ \frac{\langle x_j, x_k \rangle}{n} \ge \mu - \epsilon \right] \ge 1 - c_1 \exp(-c_2 n \epsilon^2).

RIP constants tend to infinity as (n, |S|) increases:

\mathbb{P}\left[ \left\| \frac{X_S^T X_S}{n} - I_{s \times s} \right\|_2 \ge \mu(s - 1) - 1 - \epsilon \right] \ge 1 - c_1 \exp(-c_2 n \epsilon^2).
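Both failures are easy to see numerically. The sketch below (illustrative sizes) draws X with the constant-correlation Σ above and checks that pairwise column inner products sit near µ while the operator norm of X_SᵀX_S/n − I grows linearly in s:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, mu = 2000, 500, 0.5
Sigma = (1 - mu) * np.eye(p) + mu * np.ones((p, p))
L = np.linalg.cholesky(Sigma)               # L L^T = Sigma
X = rng.standard_normal((n, p)) @ L.T       # rows ~ N(0, Sigma)

G = X.T @ X / n
# Elementwise incoherence fails: max off-diagonal is near mu, not near 0.
print("max off-diagonal:", np.max(np.abs(G - np.diag(np.diag(G)))))

# RIP fails: the operator norm of X_S^T X_S / n - I scales like mu * (s - 1).
for s in (10, 50, 200):
    M = G[:s, :s] - np.eye(s)
    print(s, np.linalg.norm(M, 2))
```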
SLIDE 43 Noiseless ℓ1 recovery for µ = 0.5
[Figure: probability of exact recovery vs. rescaled sample size α (µ = 0.5), for p = 128, 256, 512.]
Probability of recovery versus rescaled sample size α := n / (s log(p/s)).
SLIDES 44–45 Direct result for restricted nullspace/eigenvalues
Theorem (Raskutti, W., & Yu, 2010). Consider a random design X ∈ R^{n×p} with each row X_i ∼ N(0, Σ) i.i.d., and define κ(Σ) = max_{j=1,2,...,p} Σ_jj. Then for universal constants c1, c2,

\frac{\|X\theta\|_2}{\sqrt{n}} \ge \frac{1}{2} \|\Sigma^{1/2}\theta\|_2 - 9\,\kappa(\Sigma) \sqrt{\frac{\log p}{n}}\, \|\theta\|_1 \quad \text{for all } \theta \in \mathbb{R}^p

with probability greater than 1 − c1 exp(−c2 n).
much less restrictive than incoherence/RIP conditions
many interesting matrix families are covered:
◮ Toeplitz dependency
◮ constant µ-correlation (previous example)
◮ covariance matrix Σ can even be degenerate
◮ extensions to sub-Gaussian matrices (Rudelson & Zhou, 2012)
related results hold for generalized linear models
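A spot-check of the theorem for the constant-correlation design of the previous slides (the test vectors, sizes, and the slide's constants 1/2 and 9 are taken at face value; this is a sketch, not a proof of anything):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, mu = 400, 1000, 0.5
Sigma = (1 - mu) * np.eye(p) + mu * np.ones((p, p))
L = np.linalg.cholesky(Sigma)               # L L^T = Sigma
X = rng.standard_normal((n, p)) @ L.T       # rows ~ N(0, Sigma)
kappa = np.max(np.diag(Sigma))              # slide's kappa(Sigma) = max_j Sigma_jj

# Check the lower bound on ||X theta||_2 / sqrt(n) for sparse and dense draws.
for s in (5, 50, 1000):
    theta = np.zeros(p)
    theta[:s] = rng.standard_normal(s)
    lhs = np.linalg.norm(X @ theta) / np.sqrt(n)
    rhs = (0.5 * np.linalg.norm(L.T @ theta)      # ||Sigma^{1/2} theta||_2
           - 9 * kappa * np.sqrt(np.log(p) / n) * np.linalg.norm(theta, 1))
    print(s, lhs >= rhs)                           # should print True each time
```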
SLIDES 46–48 Easy verification of restricted nullspace
for any ∆ ∈ C(S), we have

\|\Delta\|_1 = \|\Delta_S\|_1 + \|\Delta_{S^c}\|_1 \le 2\|\Delta_S\|_1 \le 2\sqrt{s}\,\|\Delta\|_2

applying the previous result:

\frac{\|X\Delta\|_2}{\sqrt{n}} \ge \left\{ \frac{1}{2} \lambda_{\min}(\Sigma^{1/2}) - 18\,\kappa(\Sigma) \sqrt{\frac{s \log p}{n}} \right\} \|\Delta\|_2.

In particular, the right-hand side is strictly positive once n ≳ s log p, which establishes RN(S).
have actually proven much more than restricted nullspace....
Definition. A design matrix X ∈ R^{n×p} satisfies the restricted eigenvalue (RE) condition over S (denoted RE(S)) with parameters α ≥ 1 and γ > 0 if

\frac{\|X\Delta\|_2}{\sqrt{n}} \ge \gamma \|\Delta\|_2 \quad \text{for all } \Delta \in \mathbb{R}^p \text{ such that } \|\Delta_{S^c}\|_1 \le \alpha \|\Delta_S\|_1.

(van de Geer, 2007; Bickel, Ritov & Tsybakov, 2008)
SLIDE 49 Lasso and restricted eigenvalues
Turning to noisy observations...
[Figure: y = Xθ* + w, with X ∈ R^{n×p} and θ* supported on S.]
Estimator: Lasso program

\hat{\theta}_{\lambda_n} \in \arg\min_{\theta \in \mathbb{R}^p} \left\{ \frac{1}{2n} \|y - X\theta\|_2^2 + \lambda_n \|\theta\|_1 \right\}

Goal: Obtain bounds on \|\hat{\theta}_{\lambda_n} - \theta^*\|_2 that hold with high probability.
SLIDES 50–54 Lasso bounds: Four simple steps
Let's analyze the constrained version:

\min_{\theta \in \mathbb{R}^p} \frac{1}{2n} \|y - X\theta\|_2^2 \quad \text{such that } \|\theta\|_1 \le R = \|\theta^*\|_1.

(1) By optimality of θ̂ and feasibility of θ*:

\frac{1}{2n} \|y - X\hat{\theta}\|_2^2 \le \frac{1}{2n} \|y - X\theta^*\|_2^2.

(2) Derive a basic inequality: re-arranging in terms of ∆̂ = θ̂ − θ*:

\frac{1}{n} \|X\hat{\Delta}\|_2^2 \le \frac{2}{n} \langle \hat{\Delta}, X^T w \rangle.

(3) Restricted eigenvalue for the LHS; Hölder's inequality for the RHS:

\gamma \|\hat{\Delta}\|_2^2 \le \frac{1}{n} \|X\hat{\Delta}\|_2^2 \le \frac{2}{n} \langle \hat{\Delta}, X^T w \rangle \le 2 \|\hat{\Delta}\|_1 \left\| \frac{X^T w}{n} \right\|_\infty.

(4) As before, ∆̂ ∈ C(S), so that \|\hat{\Delta}\|_1 \le 2\sqrt{s}\,\|\hat{\Delta}\|_2, and hence

\|\hat{\Delta}\|_2 \le \frac{4\sqrt{s}}{\gamma} \left\| \frac{X^T w}{n} \right\|_\infty.
SLIDES 55–57 Lasso error bounds for different models
Proposition. Suppose that the vector θ* has support S, with cardinality s, and the design matrix X satisfies RE(S) with parameter γ > 0. For the constrained Lasso with R = ‖θ*‖₁, or the regularized Lasso with λn = 2‖Xᵀw/n‖∞, any optimal solution θ̂ satisfies the bound

\|\hat{\theta} - \theta^*\|_2 \le \frac{4\sqrt{s}}{\gamma} \left\| \frac{X^T w}{n} \right\|_\infty.

this is a deterministic result on the set of optimizers
various corollaries for specific statistical models:
◮ Compressed sensing: X_{ij} ∼ N(0, 1) and bounded noise ‖w‖₂ ≤ σ√n
◮ Deterministic design: X with bounded columns and w_i ∼ N(0, σ²)
In either case,

\left\| \frac{X^T w}{n} \right\|_\infty \le \sigma \sqrt{\frac{\log p}{n}} \ \text{w.h.p.} \quad \Longrightarrow \quad \|\hat{\theta} - \theta^*\|_2 \le \frac{4\sigma}{\gamma} \sqrt{\frac{s \log p}{n}}.
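To connect the corollary to practice, a rough empirical check (assumed Gaussian design and noise, scikit-learn's Lasso, and illustrative constants) that the ℓ2 error tracks the √(s log p / n) rate as n grows:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
p, s, sigma = 400, 8, 1.0
for n in (100, 200, 400, 800):
    X = rng.standard_normal((n, p))
    theta_star = np.zeros(p)
    theta_star[:s] = 1.0
    y = X @ theta_star + sigma * rng.standard_normal(n)
    lam = 2 * sigma * np.sqrt(np.log(p) / n)          # theory-inspired lambda_n
    err = np.linalg.norm(Lasso(alpha=lam).fit(X, y).coef_ - theta_star)
    print(n, err, np.sqrt(s * np.log(p) / n))         # error should track the rate
```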
SLIDES 58–60 Look-ahead to Lecture 2: A more general theory
Recap: Thus far.....
Derived error bounds for basis pursuit and Lasso (ℓ1-relaxation)
Seen the importance of restricted nullspace and restricted eigenvalues
The big picture: Lots of other estimators have the same basic form:

\hat{\theta} \in \arg\min_{\theta \in \Omega} \left\{ \underbrace{\mathcal{L}(\theta; Z_1^n)}_{\text{Loss function}} + \lambda_n \underbrace{\mathcal{R}(\theta)}_{\text{Regularizer}} \right\}

Past years have witnessed an explosion of results (compressed sensing, covariance estimation, block-sparsity, graphical models, matrix completion...)
Question: Is there a common set of underlying principles?
SLIDE 61 Some papers (www.eecs.berkeley.edu/wainwrig)
1 M. J. Wainwright (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Trans. Information Theory, May 2009.
2 S. Negahban, P. Ravikumar, M. J. Wainwright and B. Yu (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, December 2012.
3 G. Raskutti, M. J. Wainwright and B. Yu (2011). Minimax rates for linear regression over ℓq-balls. IEEE Trans. Information Theory, October 2011.
4 G. Raskutti, M. J. Wainwright and B. Yu (2010). Restricted nullspace and eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, August 2010.