SLIDE 1

High-dimensional statistics: Some progress and challenges ahead

Martin Wainwright
UC Berkeley, Departments of Statistics and EECS

University College London, Master Class: Lecture 1

Joint work with: Alekh Agarwal, Arash Amini, Po-Ling Loh, Sahand Negahban, Garvesh Raskutti, Pradeep Ravikumar, Bin Yu

SLIDES 2-5

Introduction

Classical asymptotic theory: sample size n → +∞ with the number of parameters p fixed
◮ law of large numbers, central limit theorem
◮ consistency of maximum likelihood estimation

Modern applications in science and engineering:
◮ large-scale problems: both p and n may be large (possibly p ≫ n)
◮ need for a high-dimensional theory that provides non-asymptotic results in (n, p)

Curses and blessings of high dimensionality:
◮ exponential explosions in computational complexity
◮ statistical curses (sample complexity)
◮ concentration of measure

Key questions: What embedded low-dimensional structures are present in the data? How can they be exploited algorithmically?

SLIDES 6-9

Vignette I: High-dimensional matrix estimation

Goal: estimate a covariance matrix Σ ∈ R^{p×p} given i.i.d. samples X_i ∼ N(0, Σ), for i = 1, 2, ..., n.

Classical approach: estimate Σ via the sample covariance matrix

    \widehat{\Sigma}_n := \frac{1}{n} \sum_{i=1}^{n} X_i X_i^T,

an average of p × p rank-one matrices.

Reasonable properties (p fixed, n increasing):
◮ Unbiased: E[\widehat{\Sigma}_n] = Σ
◮ Consistent: \widehat{\Sigma}_n \xrightarrow{a.s.} Σ as n → +∞
◮ Asymptotic distributional properties available

An alternative experiment: fix some α > 0 and study behavior over sequences with p/n = α. Does \widehat{\Sigma}_n(p) converge to anything reasonable?

SLIDES 10-11

[Figure: empirical eigenvalue density of the (rescaled) sample covariance matrix versus the Marchenko-Pastur law, for aspect ratios α = 0.5 and α = 0.2. Axes: eigenvalue vs. density; curves: empirical histogram and theory.]

(Marchenko & Pastur, 1967)
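As a quick numerical companion (a sketch of my own, not from the slides), the following NumPy snippet samples the eigenvalues of \widehat{\Sigma}_n under p/n = α and compares their histogram against the Marchenko-Pastur density:

```python
import numpy as np

# Minimal sketch: eigenvalues of the sample covariance matrix when p/n = alpha,
# compared against the Marchenko-Pastur density on its support.
rng = np.random.default_rng(0)
n, alpha = 4000, 0.5
p = int(alpha * n)

X = rng.standard_normal((n, p))            # rows X_i ~ N(0, I_p)
eigs = np.linalg.eigvalsh(X.T @ X / n)     # spectrum of Sigma_hat_n

# MP density is supported on [(1 - sqrt(alpha))^2, (1 + sqrt(alpha))^2]
lo, hi = (1 - np.sqrt(alpha)) ** 2, (1 + np.sqrt(alpha)) ** 2
dens, edges = np.histogram(eigs, bins=50, density=True)
mids = 0.5 * (edges[:-1] + edges[1:])
mp = np.sqrt(np.maximum((hi - mids) * (mids - lo), 0)) / (2 * np.pi * alpha * mids)

print("eigenvalue range:", eigs.min(), eigs.max())
print("max deviation from MP density:", np.abs(dens - mp).max())
```

Even though each entry of \widehat{\Sigma}_n is a consistent estimate, the spectrum does not concentrate: the eigenvalues spread over the entire MP support instead of collapsing to 1.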

SLIDE 12

Low-dimensional structure: Gaussian graphical models

[Figure: a five-node graph alongside the zero pattern of its inverse covariance matrix.]

    P(x_1, x_2, \ldots, x_p) \propto \exp\Big( -\frac{1}{2} x^T \Theta^* x \Big).

The edges of the graph correspond to the non-zero off-diagonal entries of the inverse covariance Θ^*.

SLIDES 13-14

Maximum likelihood with ℓ1-regularization

Set-up: samples from a random vector with sparse covariance Σ or sparse inverse covariance Θ^* ∈ R^{p×p}.

Estimator (for the inverse covariance):

    \widehat{\Theta} \in \arg\min_{\Theta} \Big\{ \Big\langle \frac{1}{n} \sum_{i=1}^{n} x_i x_i^T, \; \Theta \Big\rangle - \log\det(\Theta) + \lambda_n \sum_{j \neq k} |\Theta_{jk}| \Big\}

Some past work: Yuan & Lin, 2006; d'Aspremont et al., 2007; Bickel & Levina, 2007; El Karoui, 2007; Rothman et al., 2007; Zhou et al., 2007; Friedman et al., 2008; Lam & Fan, 2008; Ravikumar et al., 2008; Zhou, Cai & Huang, 2009
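This estimator is implemented, for example, in scikit-learn. A minimal sketch (assuming scikit-learn is available; the chain-graph example is my own):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# GraphicalLasso minimizes
#   <Sigma_hat, Theta> - log det(Theta) + alpha * sum_{j != k} |Theta_jk|,
# the l1-regularized Gaussian MLE displayed above (alpha plays the role of lambda_n).
rng = np.random.default_rng(0)
p, n = 20, 500

# Sparse ground truth: precision matrix of a chain graph
Theta_star = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
Sigma = np.linalg.inv(Theta_star)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

Theta_hat = GraphicalLasso(alpha=0.05).fit(X).precision_

# Compare the recovered zero pattern against the truth
support_hat = np.abs(Theta_hat) > 1e-3
support_true = Theta_star != 0
print("fraction of entries with correct pattern:", (support_hat == support_true).mean())
```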

SLIDES 15-16

Gauss-Markov models with hidden variables

[Figure: observed variables X_1, X_2, X_3, X_4, all connected to a hidden variable Z.]

Problems with hidden variables: conditioned on the hidden variable Z, the vector X = (X_1, X_2, X_3, X_4) is Gauss-Markov.

The inverse covariance of X then satisfies a {sparse, low-rank} decomposition:

    \begin{pmatrix} 1-\mu & -\mu & -\mu & -\mu \\ -\mu & 1-\mu & -\mu & -\mu \\ -\mu & -\mu & 1-\mu & -\mu \\ -\mu & -\mu & -\mu & 1-\mu \end{pmatrix} = I_{4 \times 4} - \mu \mathbf{1}\mathbf{1}^T,

i.e. a sparse (here diagonal) matrix minus a rank-one term. (Chandrasekaran, Parrilo & Willsky, 2010)

SLIDE 17

Vignette II: High-dimensional sparse linear regression

[Figure: observation model y = Xθ^* + w, with X an n × p design matrix and θ^* supported on S, zero on S^c.]

Set-up: noisy observations y = Xθ^* + w with sparse θ^*.

Estimator: Lasso program

    \widehat{\theta} \in \arg\min_{\theta} \Big\{ \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i^T \theta)^2 + \lambda_n \sum_{j=1}^{p} |\theta_j| \Big\}

Some past work: Tibshirani, 1996; Chen et al., 1998; Donoho & Huo, 2001; Tropp, 2004; Fuchs, 2004; Efron et al., 2004; Meinshausen & Buhlmann, 2005; Candes & Tao, 2005; Donoho, 2005; Haupt & Nowak, 2005; Zhao & Yu, 2006; Zou, 2006; Koltchinskii, 2007; van de Geer, 2007; ...
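A minimal sketch of the Lasso in scikit-learn (my own toy example; note that sklearn's Lasso uses the 1/(2n) normalization of the squared error, matching the regularized program analyzed later in this lecture):

```python
import numpy as np
from sklearn.linear_model import Lasso

# sklearn's Lasso minimizes  (1/(2n)) ||y - X theta||_2^2 + alpha ||theta||_1.
rng = np.random.default_rng(1)
n, p, s = 200, 500, 10

theta_star = np.zeros(p)
theta_star[:s] = 1.0                        # s-sparse regression vector
X = rng.standard_normal((n, p))
w = 0.5 * rng.standard_normal(n)
y = X @ theta_star + w

lam = 2 * np.sqrt(np.log(p) / n)            # theory-suggested scale for lambda_n
theta_hat = Lasso(alpha=lam).fit(X, y).coef_

print("l2 error:", np.linalg.norm(theta_hat - theta_star))
print("estimated support size:", int((np.abs(theta_hat) > 1e-3).sum()))
```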

SLIDES 18-19

Application A: Compressed sensing (Donoho, 2005; Candes & Tao, 2005)

[Figure: measurement model y = Xβ^*, with X an n × p random projection matrix.]

(a) Image: vectorize to β^* ∈ R^p.
(b) Compute n random projections y = Xβ^*.

In practice, signals are sparse in a transform domain: θ^* := Ψβ^* is a sparse signal, where Ψ is an orthonormal matrix.

Reconstruct θ^* (and hence the image β^* = Ψ^T θ^*) by finding a sparse solution to the under-constrained linear system y = \tilde{X}θ, where \tilde{X} = XΨ^T is another random matrix.

SLIDE 20

Noiseless ℓ1 recovery: Unrescaled sample size

[Figure: probability of exact recovery versus raw sample size n (µ = 0), for p = 128, 256, 512.]

SLIDE 21

Application B: Graph structure estimation

Let G = (V, E) be an undirected graph on p = |V| vertices. A pairwise graphical model factorizes over the edges of the graph:

    P(x_1, \ldots, x_p; \theta) \propto \exp\Big( \sum_{(s,t) \in E} \theta_{st}(x_s, x_t) \Big).

Given n independent and identically distributed (i.i.d.) samples of X = (X_1, ..., X_p), identify the underlying graph structure.

SLIDE 22

Pseudolikelihood and neighborhood regression

Markov properties encode neighborhood structure:

    (X_s \mid X_{V \setminus s}) \stackrel{d}{=} (X_s \mid X_{N(s)}),

i.e. conditioning on the full graph is equivalent in distribution to conditioning on the Markov blanket, e.g. N(s) = {t, u, v, w}.

[Figure: node X_s shown conditioned on the full graph vs. conditioned only on its Markov blanket X_t, X_u, X_v, X_w.]

◮ basis of the pseudolikelihood method (Besag, 1974)
◮ basis of many graph learning algorithms (Friedman et al., 1999; Csiszar & Talata, 2005; Abbeel et al., 2006; Meinshausen & Buhlmann, 2006)

SLIDES 23-25

Graph selection via neighborhood regression

[Figure: binary data matrix, with the column X_s highlighted against the remaining columns X_{\setminus s}.]

Predict X_s based on X_{\setminus s} := {X_t, t ≠ s}.

1. For each node s ∈ V, compute the (regularized) maximum likelihood estimate

    \widehat{\theta}[s] := \arg\min_{\theta \in \mathbb{R}^{p-1}} \Big\{ \underbrace{-\frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(\theta; X_{i,\setminus s})}_{\text{local log-likelihood}} + \underbrace{\lambda_n \|\theta\|_1}_{\text{regularization}} \Big\}.

2. Estimate the local neighborhood \widehat{N}(s) as the support of the regression vector \widehat{\theta}[s] ∈ R^{p-1}.
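A minimal sketch of this two-step procedure for binary data, using ℓ1-regularized logistic regression as the node-wise estimator (my own toy wrapper around scikit-learn; the C = 1/(n λ_n) mapping reflects sklearn's parameterization of the penalty):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def neighborhood_selection(X, lam, tol=1e-4):
    """X: (n, p) binary data matrix; lam: regularization weight lambda_n.
    Returns the edge set, combining node neighborhoods by the OR rule."""
    n, p = X.shape
    edges = set()
    for s in range(p):
        rest = np.delete(np.arange(p), s)
        # Step 1: l1-regularized local (logistic) likelihood at node s
        clf = LogisticRegression(penalty="l1", solver="liblinear",
                                 C=1.0 / (n * lam)).fit(X[:, rest], X[:, s])
        # Step 2: neighborhood = support of the regression vector
        for t in rest[np.abs(clf.coef_[0]) > tol]:
            edges.add((min(s, t), max(s, t)))
    return edges

# Usage on pure-noise data: with a reasonably strong lambda, few or no edges survive.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 10))
print(neighborhood_selection(X, lam=0.1))
```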
SLIDE 26

US Senate network (2004-2006 voting)

[Figure: network over senators estimated from roll-call voting data.]

SLIDE 27

Outline

1. Lecture 1 (Today): Basics of sparse recovery
   ◮ Sparse linear systems: ℓ0/ℓ1 equivalence
   ◮ Noisy case: Lasso, ℓ2-bounds and variable selection

2. Lecture 2 (Tuesday): A more general theory
   ◮ A range of structured regularizers
     ⋆ Group sparsity
     ⋆ Low-rank matrices and nuclear norm regularization
     ⋆ Matrix decomposition and robust PCA
   ◮ Ingredients of a general understanding

3. Lecture 3 (Wednesday): High-dimensional kernel methods
   ◮ Curse of dimensionality for non-parametric regression
   ◮ Reproducing kernel Hilbert spaces
   ◮ A simple but optimal estimator

SLIDE 28

Noiseless linear models and basis pursuit

[Figure: under-determined n × p linear system y = Xθ^*, with support S and complement S^c.]

Under-determined linear system: unidentifiable without constraints. Say θ^* ∈ R^p is sparse: supported on S ⊂ {1, 2, ..., p}.

ℓ0-optimization (computationally intractable, NP-hard):

    \theta^* = \arg\min_{\theta \in \mathbb{R}^p} \|\theta\|_0 \quad \text{such that } X\theta = y

ℓ1-relaxation, known as basis pursuit (a linear program, easy to solve):

    \widehat{\theta} \in \arg\min_{\theta \in \mathbb{R}^p} \|\theta\|_1 \quad \text{such that } X\theta = y
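Basis pursuit can be handed to any LP solver. A minimal sketch (my own, assuming SciPy): splitting θ = u − v with u, v ≥ 0 turns min ‖θ‖_1 subject to Xθ = y into a standard-form linear program.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    """Solve min ||theta||_1 s.t. X theta = y via the LP reformulation
    min 1^T (u + v)  s.t.  X u - X v = y,  u >= 0, v >= 0,  theta = u - v."""
    n, p = X.shape
    c = np.ones(2 * p)                       # objective: sum(u) + sum(v)
    A_eq = np.hstack([X, -X])                # X u - X v = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * p))
    assert res.status == 0, "LP solver failed"
    return res.x[:p] - res.x[p:]

# Usage: recover a 5-sparse vector from n = 60 random projections (p = 128)
rng = np.random.default_rng(0)
n, p, s = 60, 128, 5
theta_star = np.zeros(p)
theta_star[rng.choice(p, size=s, replace=False)] = rng.standard_normal(s)
X = rng.standard_normal((n, p))
theta_hat = basis_pursuit(X, X @ theta_star)
print("exact recovery:", np.allclose(theta_hat, theta_star, atol=1e-4))
```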

SLIDES 29-30

Noiseless ℓ1 recovery: Unrescaled vs. rescaled sample size

[Figure: probability of exact recovery (µ = 0) for p = 128, 256, 512, plotted against (a) the raw sample size n and (b) the rescaled sample size α := n / (s log(p/s)). After rescaling, the curves for different p align.]

SLIDES 31-33

Restricted nullspace: necessary and sufficient

Definition. For a fixed S ⊂ {1, 2, ..., p}, the matrix X ∈ R^{n×p} satisfies the restricted nullspace property w.r.t. S, or RN(S) for short, if

    \underbrace{\{ \Delta \in \mathbb{R}^p \mid X\Delta = 0 \}}_{N(X)} \cap \underbrace{\{ \Delta \in \mathbb{R}^p \mid \|\Delta_{S^c}\|_1 \le \|\Delta_S\|_1 \}}_{C(S)} = \{0\}.

(Donoho & Huo, 2001; Feuer & Nemirovski, 2003; Cohen et al., 2009)

Proposition. Basis pursuit ℓ1-relaxation is exact for all S-sparse vectors ⟺ X satisfies RN(S).

Proof (sufficiency):
(1) The error vector \Delta = \widehat{\theta} - \theta^* satisfies X\Delta = 0, and hence \Delta \in N(X).
(2) Show that \Delta \in C(S):
◮ Optimality of \widehat{\theta}:  \|\widehat{\theta}\|_1 \le \|\theta^*\|_1 = \|\theta^*_S\|_1.
◮ Sparsity of \theta^*:  \|\widehat{\theta}\|_1 = \|\theta^* + \Delta\|_1 = \|\theta^*_S + \Delta_S\|_1 + \|\Delta_{S^c}\|_1.
◮ Triangle inequality:  \|\theta^*_S + \Delta_S\|_1 + \|\Delta_{S^c}\|_1 \ge \|\theta^*_S\|_1 - \|\Delta_S\|_1 + \|\Delta_{S^c}\|_1.
Combining the three displays gives \|\Delta_{S^c}\|_1 \le \|\Delta_S\|_1, so \Delta \in C(S).
(3) Hence \Delta \in N(X) \cap C(S), and (RN) ⟹ \Delta = 0.
SLIDE 34

Illustration of restricted nullspace property

[Figure: the cone C(S) in R^3, drawn in the (∆_1, ∆_2) plane against the ∆_3 axis.]

Consider θ^* = (0, 0, θ^*_3), so that S = {3}. The error vector ∆ = \widehat{θ} − θ^* belongs to the set

    C(S; 1) := \{ (\Delta_1, \Delta_2, \Delta_3) \in \mathbb{R}^3 \mid |\Delta_1| + |\Delta_2| \le |\Delta_3| \}.

SLIDES 35-38

Some sufficient conditions

How can one verify the RN property for a given sparsity s? Two classical sufficient conditions (see the numerical sketch after this list):

1. Elementwise incoherence condition (Donoho & Huo, 2001; Feuer & Nemirovski, 2003):

    \max_{j,k = 1, \ldots, p} \Big| \Big[ \frac{X^T X}{n} - I_{p \times p} \Big]_{jk} \Big| \le \frac{\delta_1}{s}.

   [Figure: the columns x_1, ..., x_p of the n × p matrix X, with a pair (x_j, x_k) highlighted.]

   For matrices with i.i.d. sub-Gaussian entries: holds w.h.p. for n = Ω(s^2 log p).

2. Restricted isometry, or submatrix incoherence (Candes & Tao, 2005):

    \max_{|U| \le 2s} \Big\| \Big[ \frac{X^T X}{n} - I_{p \times p} \Big]_{UU} \Big\|_{\mathrm{op}} \le \delta_{2s}.

   [Figure: a subset U of the columns of X.]

   For matrices with i.i.d. sub-Gaussian entries: holds w.h.p. for n = Ω(s log(p/s)).
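A quick numerical check of the elementwise incoherence quantity (my own sketch), which also previews the correlated design of the next slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 200

# i.i.d. Gaussian design: max_{j,k} |[X^T X / n - I]_{jk}| is small (~ sqrt(log p / n))
X = rng.standard_normal((n, p))
G = X.T @ X / n - np.eye(p)
print("incoherence, i.i.d. design:", np.abs(G).max())

# mu-correlated design Sigma = (1 - mu) I + mu 1 1^T: the same quantity
# concentrates around mu, no matter how large n is.
mu = 0.5
Sigma = (1 - mu) * np.eye(p) + mu * np.ones((p, p))
Xc = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Gc = Xc.T @ Xc / n - np.eye(p)
print("incoherence, correlated design:", np.abs(Gc).max())
```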

SLIDES 39-42

Violating matrix incoherence (elementwise/RIP)

Important: incoherence/RIP conditions imply RN, but they are far from necessary, and very easy to violate.

Form a random design matrix with i.i.d. rows X_i ∼ N(0, Σ):

    X = \begin{pmatrix} x_1 & x_2 & \cdots & x_p \end{pmatrix} = \begin{pmatrix} X_1^T \\ X_2^T \\ \vdots \\ X_n^T \end{pmatrix} \in \mathbb{R}^{n \times p}

(p columns x_j, n rows X_i^T).

Example: for some µ ∈ (0, 1), consider the covariance matrix Σ = (1 − µ) I_{p×p} + µ 1 1^T.

Elementwise incoherence is violated: for any j ≠ k,

    \mathbb{P}\Big[ \frac{\langle x_j, x_k \rangle}{n} \ge \mu - \epsilon \Big] \ge 1 - c_1 \exp(-c_2 n \epsilon^2).

RIP constants tend to infinity as (n, |S|) increases:

    \mathbb{P}\Big[ \Big\| \frac{X_S^T X_S}{n} - I_{s \times s} \Big\|_2 \ge \mu(s - 1) - 1 - \epsilon \Big] \ge 1 - c_1 \exp(-c_2 n \epsilon^2).
SLIDE 43

Noiseless ℓ1 recovery for µ = 0.5

[Figure: probability of exact recovery versus rescaled sample size α := n / (s log(p/s)), for p = 128, 256, 512 under the µ = 0.5 design. The curves behave just as in the µ = 0 case, even though incoherence/RIP fail for this design.]

SLIDES 44-45

Direct result for restricted nullspace/eigenvalues

Theorem (Raskutti, Wainwright & Yu, 2010). Consider a random design X ∈ R^{n×p} with each row X_i ∼ N(0, Σ) i.i.d., and define κ(Σ) = \max_{j=1,2,\ldots,p} \Sigma_{jj}. Then for universal constants c_1, c_2,

    \frac{\|X\theta\|_2}{\sqrt{n}} \ge \frac{1}{2} \|\Sigma^{1/2}\theta\|_2 - 9\,\kappa(\Sigma) \sqrt{\frac{\log p}{n}}\, \|\theta\|_1 \quad \text{for all } \theta \in \mathbb{R}^p,

with probability greater than 1 − c_1 exp(−c_2 n).

Much less restrictive than incoherence/RIP conditions; many interesting matrix families are covered:
◮ Toeplitz dependency
◮ constant µ-correlation (the previous example)
◮ the covariance matrix Σ can even be degenerate
◮ extensions to sub-Gaussian matrices (Rudelson & Zhou, 2012)

Related results hold for generalized linear models.

SLIDES 46-48

Easy verification of restricted nullspace

For any ∆ ∈ C(S), we have

    \|\Delta\|_1 = \|\Delta_S\|_1 + \|\Delta_{S^c}\|_1 \le 2\|\Delta_S\|_1 \le 2\sqrt{s}\, \|\Delta\|_2.

Applying the previous result:

    \frac{\|X\Delta\|_2}{\sqrt{n}} \ge \underbrace{\Big( \lambda_{\min}(\sqrt{\Sigma}) - 18\,\kappa(\Sigma) \sqrt{\frac{s \log p}{n}} \Big)}_{\gamma(\Sigma)} \|\Delta\|_2.

The right-hand side is strictly positive once n ≳ s log p, which establishes restricted nullspace; in fact, we have proven much more than restricted nullspace...

Definition. A design matrix X ∈ R^{n×p} satisfies the restricted eigenvalue (RE) condition over S (denoted RE(S)) with parameters α ≥ 1 and γ > 0 if

    \frac{\|X\Delta\|_2}{\sqrt{n}} \ge \gamma \|\Delta\|_2 \quad \text{for all } \Delta \in \mathbb{R}^p \text{ such that } \|\Delta_{S^c}\|_1 \le \alpha \|\Delta_S\|_1.

(van de Geer, 2007; Bickel, Ritov & Tsybakov, 2008)

SLIDE 49

Lasso and restricted eigenvalues

Turning to noisy observations...

[Figure: observation model y = Xθ^* + w, with X an n × p design and θ^* supported on S, zero on S^c.]

Estimator: Lasso program

    \widehat{\theta}_{\lambda_n} \in \arg\min_{\theta \in \mathbb{R}^p} \Big\{ \frac{1}{2n} \|y - X\theta\|_2^2 + \lambda_n \|\theta\|_1 \Big\}.

Goal: obtain bounds on \|\widehat{\theta}_{\lambda_n} - \theta^*\|_2 that hold with high probability.

SLIDES 50-54

Lasso bounds: Four simple steps

Let's analyze the constrained version:

    \min_{\theta \in \mathbb{R}^p} \frac{1}{2n} \|y - X\theta\|_2^2 \quad \text{such that } \|\theta\|_1 \le R = \|\theta^*\|_1.

(1) By optimality of \widehat{\theta} and feasibility of \theta^*:

    \frac{1}{2n} \|y - X\widehat{\theta}\|_2^2 \le \frac{1}{2n} \|y - X\theta^*\|_2^2.

(2) Derive a basic inequality: re-arranging in terms of \widehat{\Delta} = \widehat{\theta} - \theta^*:

    \frac{1}{n} \|X\widehat{\Delta}\|_2^2 \le \frac{2}{n} \langle \widehat{\Delta}, X^T w \rangle.

(3) Restricted eigenvalue for the LHS; Hölder's inequality for the RHS:

    \gamma \|\widehat{\Delta}\|_2^2 \le \frac{1}{n} \|X\widehat{\Delta}\|_2^2 \le \frac{2}{n} \langle \widehat{\Delta}, X^T w \rangle \le 2 \|\widehat{\Delta}\|_1 \Big\| \frac{X^T w}{n} \Big\|_\infty.

(4) As before, \widehat{\Delta} \in C(S), so that \|\widehat{\Delta}\|_1 \le 2\sqrt{s} \|\widehat{\Delta}\|_2, and hence

    \|\widehat{\Delta}\|_2 \le \frac{4\sqrt{s}}{\gamma} \Big\| \frac{X^T w}{n} \Big\|_\infty.
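A minimal simulation (my own) checking the conclusion of step (4): with the oracle choice λ_n = 2‖X^T w / n‖_∞ (computable here because the simulation knows the noise vector w), the error should track √(s log p / n):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
p, s, sigma = 400, 8, 1.0

for n in [200, 400, 800, 1600]:
    theta_star = np.zeros(p)
    theta_star[:s] = 1.0
    X = rng.standard_normal((n, p))
    w = sigma * rng.standard_normal(n)
    y = X @ theta_star + w

    lam = 2 * np.abs(X.T @ w / n).max()       # oracle lambda_n from the analysis
    theta_hat = Lasso(alpha=lam).fit(X, y).coef_
    err = np.linalg.norm(theta_hat - theta_star)
    rate = np.sqrt(s * np.log(p) / n)
    print(f"n={n:5d}  error={err:.3f}  sqrt(s log p / n)={rate:.3f}")
```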
SLIDES 55-57

Lasso error bounds for different models

Proposition. Suppose that the vector θ^* has support S, with cardinality s, and the design matrix X satisfies RE(S) with parameter γ > 0. For the constrained Lasso with R = \|θ^*\|_1, or the regularized Lasso with λ_n = 2\|X^T w / n\|_\infty, any optimal solution \widehat{θ} satisfies the bound

    \|\widehat{\theta} - \theta^*\|_2 \le \frac{4\sqrt{s}}{\gamma} \Big\| \frac{X^T w}{n} \Big\|_\infty.

◮ this is a deterministic result on the set of optimizers
◮ various corollaries for specific statistical models:
  ◮ Compressed sensing: X_{ij} ∼ N(0, 1) and bounded noise \|w\|_2 ≤ σ√n
  ◮ Deterministic design: X with bounded columns and w_i ∼ N(0, σ^2)

In either case,

    \Big\| \frac{X^T w}{n} \Big\|_\infty \le \sqrt{\frac{3\sigma^2 \log p}{n}} \; \text{w.h.p.} \implies \|\widehat{\theta} - \theta^*\|_2 \le \frac{4\sigma}{\gamma} \sqrt{\frac{3 s \log p}{n}}.

SLIDES 58-60

Look-ahead to Lecture 2: A more general theory

Recap, thus far:
◮ derived error bounds for basis pursuit and the Lasso (ℓ1-relaxation)
◮ seen the importance of restricted nullspace and restricted eigenvalues

The big picture: lots of other estimators have the same basic form:

    \underbrace{\widehat{\theta}_{\lambda_n}}_{\text{estimate}} \in \arg\min_{\theta \in \Omega} \Big\{ \underbrace{\mathcal{L}(\theta; Z_1^n)}_{\text{loss function}} + \lambda_n \underbrace{\mathcal{R}(\theta)}_{\text{regularizer}} \Big\}.

Past years have witnessed an explosion of results (compressed sensing, covariance estimation, block-sparsity, graphical models, matrix completion, ...).

Question: Is there a common set of underlying principles?

SLIDE 61

Some papers (www.eecs.berkeley.edu/wainwrig)

1. M. J. Wainwright (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, May 2009.

2. S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, December 2012.

3. G. Raskutti, M. J. Wainwright, and B. Yu (2011). Minimax rates for linear regression over ℓq-balls. IEEE Transactions on Information Theory, October 2011.

4. G. Raskutti, M. J. Wainwright, and B. Yu (2010). Restricted nullspace and eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, August 2010.