

SLIDE 1

High-dimensional statistics: Some progress and challenges ahead

Martin Wainwright

UC Berkeley, Departments of Statistics and EECS

University College London, Master Class: Lecture 2

Joint work with: Alekh Agarwal, Arash Amini, Po-Ling Loh, Sahand Negahban, Garvesh Raskutti, Pradeep Ravikumar, and Bin Yu.

SLIDE 2

High-level overview

Last lecture: least-squares loss and ℓ1-regularization. The big picture: many other estimators share the same basic form:

$$\widehat{\theta}_{\lambda_n} \in \arg\min_{\theta \in \Omega} \Big\{ \underbrace{\mathcal{L}(\theta; Z_1^n)}_{\text{Loss function}} + \lambda_n \underbrace{\mathcal{R}(\theta)}_{\text{Regularizer}} \Big\}.$$

SLIDE 3

High-level overview

Last lecture: least-squares loss and ℓ1-regularization. The big picture: many other estimators share the same basic form:

$$\widehat{\theta}_{\lambda_n} \in \arg\min_{\theta \in \Omega} \Big\{ \underbrace{\mathcal{L}(\theta; Z_1^n)}_{\text{Loss function}} + \lambda_n \underbrace{\mathcal{R}(\theta)}_{\text{Regularizer}} \Big\}.$$

Past years have witnessed an explosion of results (compressed sensing, covariance estimation, block sparsity, graphical models, matrix completion, ...).

Question: Is there a common set of underlying principles?

SLIDE 4

Last lecture: Sparse linear regression

[Figure: y = Xθ* + w, with X an n × p design matrix and θ* supported on S and zero on Sc.]

Set-up: noisy observations y = Xθ* + w with sparse θ*.

Estimator: Lasso program

$$\widehat{\theta} \in \arg\min_{\theta} \Big\{ \frac{1}{n}\sum_{i=1}^{n} (y_i - x_i^T\theta)^2 + \lambda_n \sum_{j=1}^{p} |\theta_j| \Big\}.$$
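For readers who want to experiment, here is a small Python sketch of this setup (not code from the talk): it simulates y = Xθ* + w with an s-sparse θ* and fits the Lasso with scikit-learn, whose objective (1/(2n))‖y − Xθ‖²₂ + α‖θ‖₁ matches the program above up to the constant in front of the quadratic term. The regularization level of order σ√(log p / n) anticipates the corollaries later in the deck; all dimensions and constants here are made up for illustration.

```python
# Illustrative simulation of the sparse linear regression setup y = X theta* + w.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s, sigma = 200, 1000, 10, 0.5

theta_star = np.zeros(p)
support = rng.choice(p, size=s, replace=False)
theta_star[support] = rng.choice([-1.0, 1.0], size=s)

X = rng.standard_normal((n, p))
y = X @ theta_star + sigma * rng.standard_normal(n)

# Regularization level of order sigma * sqrt(log p / n), as suggested by the theory.
lam = 2.0 * sigma * np.sqrt(np.log(p) / n)
# sklearn's Lasso minimizes (1/(2n)) ||y - X theta||_2^2 + alpha * ||theta||_1.
fit = Lasso(alpha=lam, fit_intercept=False, max_iter=50_000).fit(X, y)

err = np.linalg.norm(fit.coef_ - theta_star)
print(f"||theta_hat - theta*||_2 = {err:.3f}, nonzeros = {np.sum(fit.coef_ != 0)}")
```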

SLIDE 5

Block-structured extension

[Figure: Y = XΘ* + W, with X an n × p design matrix and Θ* a p × r matrix with non-zero rows S and zero rows Sc.]

Signal Θ* is a p × r matrix, partitioned into non-zero rows S and zero rows Sc.

Various applications: multiple-view imaging, gene array prediction, graphical model fitting.

SLIDE 6

Block-structured extension

[Figure: Y = XΘ* + W, as above.]

Row-wise ℓ1/ℓ2-norm:

$$|||\Theta|||_{1,2} = \sum_{j=1}^{p} \|\Theta_j\|_2.$$

SLIDE 7

Block-structured extension

[Figure: Y = XΘ* + W, as above.]

Row-wise ℓ1/ℓ2-norm:

$$|||\Theta|||_{1,2} = \sum_{j=1}^{p} \|\Theta_j\|_2.$$

More complicated group structure (Obozinski et al., 2009):

$$|||\Theta^*|||_{\mathcal{G}} = \sum_{g \in \mathcal{G}} \|\Theta_g\|_2.$$
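A quick numpy illustration (made-up dimensions) of how these block norms are evaluated: the ℓ1/ℓ2 norm sums the ℓ2 norms of the rows of Θ, and the more general group norm sums ℓ2 norms over an arbitrary partition G of the coordinates.

```python
import numpy as np

rng = np.random.default_rng(1)
p, r = 8, 3
Theta = rng.standard_normal((p, r))

# Row-wise l1/l2 norm: sum over rows j of ||Theta_j||_2.
l1l2 = np.sum(np.linalg.norm(Theta, axis=1))

# More general group norm on a vector theta: sum over groups g of ||theta_g||_2.
theta = rng.standard_normal(p)
groups = [[0, 1, 2], [3, 4], [5, 6, 7]]          # an arbitrary partition of {0,...,p-1}
group_norm = sum(np.linalg.norm(theta[g]) for g in groups)

print(f"|||Theta|||_{{1,2}} = {l1l2:.3f},  |||theta|||_G = {group_norm:.3f}")
```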

SLIDE 8

Example: Low-rank matrix approximation

[Figure: Θ* = U D V^T, with Θ* of size p1 × p2, U of size p1 × r, D of size r × r, and V^T of size r × p2.]

Set-up: matrix Θ* ∈ R^{p1 × p2} with rank r ≪ min{p1, p2}.

Estimator:

$$\widehat{\Theta} \in \arg\min_{\Theta} \Big\{ \frac{1}{n}\sum_{i=1}^{n} \big(y_i - \langle\!\langle X_i, \Theta\rangle\!\rangle\big)^2 + \lambda_n \sum_{j=1}^{\min\{p_1, p_2\}} \sigma_j(\Theta) \Big\}.$$

Some past work: Fazel, 2001; Srebro et al., 2004; Recht, Fazel & Parrilo, 2007; Bach, 2008; Candes & Recht, 2008; Keshavan et al., 2009; Rohde & Tsybakov, 2010; Recht, 2009; Negahban & W., 2010; ...
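The nuclear-norm penalty ∑ⱼ σⱼ(Θ) has a convenient proximal operator: singular-value soft-thresholding. The sketch below is illustrative only (it is not the solver behind any results quoted here); it uses the simplest observation model, in which every entry of Θ* is observed with noise, where a single soft-thresholding of the singular values of Y exactly solves the nuclear-norm-penalized least-squares problem.

```python
import numpy as np

def svt(M, tau):
    """Singular-value soft-thresholding: prox of tau * (nuclear norm) at M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# Tiny demo: recover a rank-2 matrix from a fully observed noisy matrix Y = Theta* + W.
rng = np.random.default_rng(2)
p1, p2, r = 50, 40, 2
Theta_star = rng.standard_normal((p1, r)) @ rng.standard_normal((r, p2))
Y = Theta_star + 0.5 * rng.standard_normal((p1, p2))

# Threshold of order sigma * (sqrt(p1) + sqrt(p2)), the size of the noise operator norm.
Theta_hat = svt(Y, tau=0.5 * (np.sqrt(p1) + np.sqrt(p2)))
rel_err = np.linalg.norm(Theta_hat - Theta_star) / np.linalg.norm(Theta_star)
print(f"rank(Theta_hat) = {np.linalg.matrix_rank(Theta_hat)}, relative error = {rel_err:.3f}")
```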

SLIDE 9

Application: Collaborative filtering

[Figure: a p1 × p2 matrix of movie ratings, with some observed entries (e.g., ratings 1-5) and many missing entries marked ∗.]

Universe of p1 individuals and p2 films; observe n ≪ p1 p2 ratings. (e.g., Srebro, Alon & Jaakkola, 2004)

SLIDE 10

Security and robustness issues

[Figure: "Spiritual guide" product page.]

Breakdown of the Amazon recommendation system, 2002.

SLIDE 11

Security and robustness issues

[Figure: "Spiritual guide" and "Sex manual" product pages.]

Breakdown of the Amazon recommendation system, 2002.

SLIDE 12

Matrix decomposition: Low-rank plus sparse

Matrix Y ∈ R^{p1 × p2} can be (approximately) decomposed into a sum:

$$Y \approx \underbrace{\Theta^*}_{\text{Low-rank component}} + \underbrace{\Gamma^*}_{\text{Sparse component}}, \qquad \Theta^* = U D V^T,$$

with U of size p1 × r, D of size r × r, and V^T of size r × p2.
SLIDE 13

Matrix decomposition: Low-rank plus sparse

Matrix Y ∈ R^{p1 × p2} can be (approximately) decomposed into a sum:

$$Y \approx \underbrace{\Theta^*}_{\text{Low-rank component}} + \underbrace{\Gamma^*}_{\text{Sparse component}}, \qquad \Theta^* = U D V^T.$$

Exact decomposition: initially studied by Chandrasekaran, Sanghavi, Parrilo & Willsky, 2009; subsequent work: Candes et al., 2010; Xu et al., 2010; Hsu et al., 2010; Agarwal et al., 2011.

Various applications (see the sketch below):
  ◮ robust collaborative filtering
  ◮ robust PCA
  ◮ graphical model selection with hidden variables
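For intuition, one simple and purely illustrative way to compute such a decomposition is block coordinate descent on (1/2)‖Y − Θ − Γ‖²_F + λ₁|||Θ|||₁ + λ₂‖Γ‖₁: each block update has a closed form, singular-value soft-thresholding for the low-rank part and entrywise soft-thresholding for the sparse part. This is a generic heuristic sketch, not the specific estimators or algorithms analyzed in the papers cited above, and the penalty levels are arbitrary.

```python
import numpy as np

def svt(M, tau):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft(M, tau):
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def low_rank_plus_sparse(Y, lam_lr, lam_sp, iters=50):
    """Alternately update Theta (low rank) and Gamma (sparse) so that Y ~ Theta + Gamma."""
    Theta = np.zeros_like(Y)
    Gamma = np.zeros_like(Y)
    for _ in range(iters):
        Theta = svt(Y - Gamma, lam_lr)    # exact minimization over the nuclear-norm block
        Gamma = soft(Y - Theta, lam_sp)   # exact minimization over the elementwise l1 block
    return Theta, Gamma

# Demo: rank-1 background plus a few large sparse corruptions.
rng = np.random.default_rng(3)
u, v = rng.standard_normal(30), rng.standard_normal(20)
Theta_star = np.outer(u, v)
Gamma_star = np.zeros((30, 20))
idx = rng.choice(30 * 20, size=15, replace=False)
Gamma_star.flat[idx] = 10.0
Y = Theta_star + Gamma_star

Theta_hat, Gamma_hat = low_rank_plus_sparse(Y, lam_lr=2.0, lam_sp=1.0)
print("rank:", np.linalg.matrix_rank(Theta_hat), " nonzeros:", int(np.sum(Gamma_hat != 0)))
```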

SLIDE 14

Gauss-Markov models with hidden variables

[Figure: observed variables X1, X2, X3, X4 all connected to a hidden variable Z.]

Problems with hidden variables: conditioned on the hidden variable Z, the vector X = (X1, X2, X3, X4) is Gauss-Markov.

SLIDE 15

Gauss-Markov models with hidden variables

[Figure: observed variables X1, X2, X3, X4 all connected to a hidden variable Z.]

Problems with hidden variables: conditioned on the hidden variable Z, the vector X = (X1, X2, X3, X4) is Gauss-Markov. The inverse covariance of X satisfies a {sparse, low-rank} decomposition:

$$\begin{bmatrix} 1-\mu & -\mu & -\mu & -\mu \\ -\mu & 1-\mu & -\mu & -\mu \\ -\mu & -\mu & 1-\mu & -\mu \\ -\mu & -\mu & -\mu & 1-\mu \end{bmatrix} = I_{4\times 4} - \mu\,\mathbf{1}\mathbf{1}^T. \qquad \text{(Chandrasekaran, Parrilo \& Willsky, 2010)}$$
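A small numpy check of where this structure comes from (with arbitrary variance parameters): if each Xⱼ = Z + εⱼ for a single hidden scalar Z, then the marginal covariance of X is σ²I + σ_z²11ᵀ, and the Sherman-Morrison formula shows that its inverse is a diagonal (sparse) matrix minus a rank-one matrix, exactly the {sparse, low-rank} form above.

```python
import numpy as np

# Hidden-variable model: X_j = Z + eps_j with Z ~ N(0, sz2), eps_j ~ N(0, s2) independent.
p, s2, sz2 = 4, 1.0, 0.5
Sigma = s2 * np.eye(p) + sz2 * np.ones((p, p))         # marginal covariance of X

Prec = np.linalg.inv(Sigma)                            # inverse covariance (precision matrix)

# Sherman-Morrison: Prec = (1/s2) I - c * 11^T  with  c = sz2 / (s2 * (s2 + p * sz2)).
c = sz2 / (s2 * (s2 + p * sz2))
Prec_formula = np.eye(p) / s2 - c * np.ones((p, p))

print(np.allclose(Prec, Prec_formula))                 # True: diagonal (sparse) minus rank-one
```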

SLIDE 16

Example: Sparse principal components analysis

[Figure: Σ = ZZ^T + D.]

Set-up: covariance matrix Σ = ZZ^T + D, where the leading eigenspace Z has sparse columns.

Estimator:

$$\widehat{\Theta} \in \arg\min_{\Theta} \Big\{ -\langle\!\langle \Theta, \widehat{\Sigma}\rangle\!\rangle + \lambda_n \sum_{(j,k)} |\Theta_{jk}| \Big\}.$$

Some past work: Johnstone, 2001; Joliffe et al., 2003; Johnstone & Lu, 2004; Zou et al., 2004; d'Aspremont et al., 2007; Johnstone & Paul, 2008; Amini & Wainwright, 2008.

SLIDE 17

Motivation and roadmap

Many results on different high-dimensional models, all based on estimators of the type

$$\widehat{\theta}_{\lambda_n} \in \arg\min_{\theta \in \Omega} \Big\{ \underbrace{\mathcal{L}(\theta; Z_1^n)}_{\text{Loss function}} + \lambda_n \underbrace{\mathcal{R}(\theta)}_{\text{Regularizer}} \Big\}.$$

SLIDE 18

Motivation and roadmap

Many results on different high-dimensional models, all based on estimators of the type shown above.

Question: Is there a common set of underlying principles?

SLIDE 19

Motivation and roadmap

Question: Is there a common set of underlying principles?

Answer: Yes, two essential ingredients:
  (I) Restricted strong convexity of the loss function
  (II) Decomposability of the regularizer

SLIDE 20

(I) Role of curvature

1. Curvature controls the difficulty of estimation:

[Figure: loss difference δL plotted against the error Δ between the estimate and θ*. (a) High curvature: easy to estimate. (b) Low curvature: harder to estimate.]

SLIDE 21

(I) Role of curvature

1. Curvature controls the difficulty of estimation:

[Figure: (a) High curvature: easy to estimate. (b) Low curvature: harder to estimate.]

2. Curvature is captured by a lower bound on the Taylor-series error:

$$\underbrace{\mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) - \langle \nabla\mathcal{L}(\theta^*), \Delta\rangle}_{\mathcal{T}_{\mathcal{L}}(\Delta;\,\theta^*)} \;\ge\; \gamma\,\|\Delta\|^2 \qquad \text{for all } \Delta \text{ around } \theta^*.$$

SLIDE 22

High dimensions: no strong convexity!

[Figure: a loss surface over two coordinates that is curved in one direction and exactly flat in another.]

When p > n, the Hessian ∇²L(θ; Z_1^n) has a nullspace of dimension at least p − n.
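This degeneracy is easy to see numerically. For the least-squares loss the Hessian is XᵀX/n, whose rank is at most n, so for p > n the loss is exactly flat along nullspace directions. A short illustrative check (arbitrary n and p):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 50, 200
X = rng.standard_normal((n, p))

H = X.T @ X / n                        # Hessian of the least-squares loss
rank = np.linalg.matrix_rank(H)
print(f"rank(H) = {rank}, nullspace dimension >= {p - rank}")

# The loss (1/2n)||y - X theta||_2^2 does not change along nullspace directions:
theta = rng.standard_normal(p)
_, _, Vt = np.linalg.svd(X)
null_dir = Vt[-1]                      # a direction with X @ null_dir ~ 0
y = rng.standard_normal(n)
loss = lambda t: 0.5 * np.mean((y - X @ t) ** 2)
print(abs(loss(theta) - loss(theta + 5.0 * null_dir)))   # ~ 0
```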

SLIDE 23

Restricted strong convexity

Definition. The loss function Ln satisfies restricted strong convexity (RSC) with respect to the regularizer R if

$$\underbrace{\mathcal{L}_n(\theta^* + \Delta) - \big\{\mathcal{L}_n(\theta^*) + \langle \nabla\mathcal{L}_n(\theta^*), \Delta\rangle\big\}}_{\text{Taylor error } \mathcal{T}_{\mathcal{L}}(\Delta;\,\theta^*)} \;\ge\; \underbrace{\gamma_{\ell}\,\|\Delta\|_e^2}_{\text{Lower curvature}} \;-\; \underbrace{\tau_{\ell}^2\,\mathcal{R}^2(\Delta)}_{\text{Tolerance}}$$

for all Δ in a suitable neighborhood of θ*.

SLIDE 24

Restricted strong convexity

Definition as above. Ordinary strong convexity:
  ◮ special case with tolerance τℓ = 0
  ◮ does not hold for most loss functions when p > n

SLIDE 25

Restricted strong convexity

RSC enforces a lower bound on curvature, but only when R²(Δ) ≪ ‖Δ‖²_e.

SLIDE 26

Least-squares: RSC ≡ restricted eigenvalue

For the least-squares loss L(θ) = (1/2n)‖y − Xθ‖²₂:

$$\mathcal{T}_{\mathcal{L}}(\Delta;\theta^*) = \mathcal{L}_n(\theta^* + \Delta) - \mathcal{L}_n(\theta^*) - \langle \nabla\mathcal{L}_n(\theta^*), \Delta\rangle = \frac{1}{2n}\|X\Delta\|_2^2.$$

SLIDE 27

Least-squares: RSC ≡ restricted eigenvalue

Restricted eigenvalue (RE) condition (van de Geer, 2007; Bickel et al., 2009):

$$\frac{\|X\Delta\|_2^2}{2n} \;\ge\; \gamma\,\|\Delta\|_2^2 \qquad \text{for all } \|\Delta_{S^c}\|_1 \le \|\Delta_S\|_1.$$

SLIDE 28

Least-squares: RSC ≡ restricted eigenvalue

Restricted eigenvalue (RE) condition (van de Geer, 2007; Bickel et al., 2009):

$$\frac{\|X\Delta\|_2^2}{2n} \;\ge\; \gamma\,\|\Delta\|_2^2 \qquad \text{for all } \Delta \in \mathbb{R}^p \text{ with } \|\Delta\|_1 \le 2\sqrt{s}\,\|\Delta\|_2.$$

SLIDE 29

Least-squares: RSC ≡ restricted eigenvalue

Restricted eigenvalue (RE) condition (van de Geer, 2007; Bickel et al., 2009):

$$\frac{\|X\Delta\|_2^2}{2n} \;\ge\; \gamma\,\|\Delta\|_2^2 \qquad \text{for all } \Delta \in \mathbb{R}^p \text{ with } \|\Delta\|_1 \le 2\sqrt{s}\,\|\Delta\|_2.$$

This condition holds with high probability for various sub-Gaussian designs when n ≳ s log(p/s); fairly strong dependency between covariates is possible.
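The following Monte Carlo sketch is an illustration of this phenomenon, not a proof: for a standard Gaussian design with n a comfortable multiple of s log(p/s), the restricted ratio ‖XΔ‖₂²/(2n‖Δ‖₂²) stays bounded away from zero over randomly drawn s-sparse directions (all of which satisfy ‖Δ‖₁ ≤ 2√s‖Δ‖₂), even though the unrestricted minimum eigenvalue of XᵀX/(2n) is zero because p > n. All sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, s = 120, 500, 5                  # n well above s * log(p/s)
X = rng.standard_normal((n, p))

def ratio(delta):
    return np.sum((X @ delta) ** 2) / (2 * n * np.sum(delta ** 2))

# Random s-sparse directions satisfy ||delta||_1 <= sqrt(s) ||delta||_2 <= 2 sqrt(s) ||delta||_2.
vals = []
for _ in range(2000):
    delta = np.zeros(p)
    S = rng.choice(p, size=s, replace=False)
    delta[S] = rng.standard_normal(s)
    vals.append(ratio(delta))

print(f"min restricted ratio ~ {min(vals):.3f}")                                  # bounded away from 0
print(f"unrestricted min eig = {min(np.linalg.eigvalsh(X.T @ X / (2 * n))):.3f}") # ~ 0 since p > n
```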

SLIDE 30

Restricted strong convexity for GLMs

Generalized linear model linking covariates x ∈ R^p to output y ∈ R:

$$\mathbb{P}(y \mid x, \theta^*) \;\propto\; \exp\big\{ y\,\langle x, \theta^*\rangle - \Phi(\langle x, \theta^*\rangle) \big\}.$$

SLIDE 31

Restricted strong convexity for GLMs

The Taylor-series expansion involves the random Hessian

$$H(\theta) = \frac{1}{n}\sum_{i=1}^{n} \Phi''\big(\langle \theta, X_i\rangle\big)\, X_i X_i^T \;\in\; \mathbb{R}^{p \times p}.$$

SLIDE 32

Restricted strong convexity for GLMs

Generalized linear model linking covariates x ∈ R^p to output y ∈ R, with random Hessian H(θ) as above.

Proposition (Negahban, W., Ravikumar & Yu, 2010)
For zero-mean sub-Gaussian covariates Xi with covariance Σ and any GLM,

$$\mathcal{T}_{\mathcal{L}}(\Delta; \theta^*) \;\ge\; c_3\,\|\Sigma^{1/2}\Delta\|_2 \Big\{ \|\Sigma^{1/2}\Delta\|_2 - c_4\,\kappa(\Sigma)\Big(\frac{\log p}{n}\Big)^{1/2}\|\Delta\|_1 \Big\} \qquad \text{for all } \|\Delta\|_2 \le 1,$$

with probability at least 1 − c1 exp(−c2 n). Here κ(Σ) = max_j Σ_jj.
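For concreteness, logistic regression is one such GLM, with Φ(t) = log(1 + eᵗ) and therefore Φ''(t) = σ(t)(1 − σ(t)) for the logistic function σ. A short numpy sketch (arbitrary dimensions) of the random Hessian that enters the Taylor expansion, showing in passing that its rank is at most n:

```python
import numpy as np

def logistic_hessian(theta, X):
    """Random Hessian H(theta) = (1/n) sum_i Phi''(<theta, x_i>) x_i x_i^T for logistic regression."""
    t = X @ theta
    sig = 1.0 / (1.0 + np.exp(-t))
    w = sig * (1.0 - sig)                      # Phi''(t) for Phi(t) = log(1 + e^t)
    n = X.shape[0]
    return (X * w[:, None]).T @ X / n          # p x p matrix

rng = np.random.default_rng(6)
n, p = 100, 300
X = rng.standard_normal((n, p))
theta = rng.standard_normal(p) / np.sqrt(p)

H = logistic_hessian(theta, X)
print(H.shape, np.linalg.matrix_rank(H))       # (300, 300), rank at most n = 100
```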

SLIDE 33

(II) Decomposable regularizers

[Figure: a subspace M and its complementary subspace M⊥.]

Subspace M: approximation to the model parameters. Complementary subspace M⊥: undesirable deviations.

SLIDE 34

(II) Decomposable regularizers

The regularizer R decomposes across (M, M⊥) if

$$\mathcal{R}(\alpha + \beta) = \mathcal{R}(\alpha) + \mathcal{R}(\beta) \qquad \text{for all } \alpha \in \mathcal{M} \text{ and } \beta \in \mathcal{M}^{\perp}.$$

SLIDE 35

(II) Decomposable regularizers

Decomposable regularizers include:
  • (weighted) ℓ1-norms
  • nuclear norm
  • group-sparse norms
  • sums of decomposable norms

SLIDE 36

Main theorem

Estimator:

$$\widehat{\theta}_{\lambda_n} \in \arg\min_{\theta \in \mathbb{R}^p} \Big\{ \mathcal{L}_n(\theta; Z_1^n) + \lambda_n \mathcal{R}(\theta) \Big\},$$

where L satisfies RSC(γ, τ) with respect to the regularizer R.

SLIDE 37

Main theorem

Estimator:

$$\widehat{\theta}_{\lambda_n} \in \arg\min_{\theta \in \mathbb{R}^p} \Big\{ \mathcal{L}_n(\theta; Z_1^n) + \lambda_n \mathcal{R}(\theta) \Big\},$$

where L satisfies RSC(γ, τ) with respect to the regularizer R.

Theorem (Negahban, Ravikumar, W. & Yu, 2012)
Suppose that θ* ∈ M. For any regularization parameter λn ≥ 2 R*(∇L(θ*; Z_1^n)), any solution θ̂_{λn} satisfies

$$\|\widehat{\theta}_{\lambda_n} - \theta^*\|^2 \;\lesssim\; \frac{1}{\gamma^2(\mathcal{L})}\Big\{ \lambda_n^2 + \tau^2(\mathcal{L}) \Big\}\,\Psi^2(\mathcal{M}).$$

Quantities that control the rates:
  • curvature in RSC: γℓ
  • tolerance in RSC: τℓ
  • dual norm of the regularizer: R*(v) := sup_{R(u) ≤ 1} ⟨v, u⟩
  • optimal subspace constant: Ψ(M) = sup_{θ ∈ M \ {0}} R(θ)/‖θ‖
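In the sparse-vector case these quantities take familiar forms: with R the ℓ1 norm, the dual norm R* is the ℓ∞ norm, and for M the subspace of vectors supported on a fixed set S of size s (error measured in ℓ2), Ψ(M) = √s. A small numerical illustration with arbitrary p and s:

```python
import numpy as np

rng = np.random.default_rng(7)
p, s = 50, 5
S = rng.choice(p, size=s, replace=False)

# Dual of the l1 norm: R*(v) = sup_{||u||_1 <= 1} <v, u> = ||v||_inf,
# attained at a signed standard basis vector.
v = rng.standard_normal(p)
j = np.argmax(np.abs(v))
u_star = np.zeros(p)
u_star[j] = np.sign(v[j])                            # ||u_star||_1 = 1
print(np.isclose(v @ u_star, np.max(np.abs(v))))     # True

# Subspace constant for M = {theta supported on S}: sup ||theta||_1 / ||theta||_2 = sqrt(s).
thetas = np.zeros((20000, p))
thetas[:, S] = rng.standard_normal((20000, s))
ratios = np.linalg.norm(thetas, 1, axis=1) / np.linalg.norm(thetas, 2, axis=1)
print(f"max sampled ratio = {ratios.max():.3f}  vs  sqrt(s) = {np.sqrt(s):.3f}")
```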

SLIDE 38

Main theorem

Estimator:

$$\widehat{\theta}_{\lambda_n} \in \arg\min_{\theta \in \mathbb{R}^p} \Big\{ \mathcal{L}_n(\theta; Z_1^n) + \lambda_n \mathcal{R}(\theta) \Big\}.$$

Theorem (Oracle version)
For any regularization parameter λn ≥ 2 R*(∇L(θ*; Z_1^n)), any solution θ̂_{λn} satisfies

$$\|\widehat{\theta}_{\lambda_n} - \theta^*\|^2 \;\lesssim\; \underbrace{\frac{(\lambda_n')^2}{\gamma^2(\mathcal{L})}\,\Psi^2(\mathcal{M})}_{\text{Estimation error}} \;+\; \underbrace{\frac{\lambda_n'}{\gamma(\mathcal{L})}\,\mathcal{R}\big(\Pi_{\mathcal{M}^{\perp}}(\theta^*)\big)}_{\text{Approximation error}},$$

where λ'_n = max{λn, τ(L)}.

Quantities that control the rates:
  • curvature in RSC: γℓ
  • tolerance in RSC: τℓ
  • dual norm of the regularizer: R*(v) := sup_{R(u) ≤ 1} ⟨v, u⟩
  • optimal subspace constant: Ψ(M) = sup_{θ ∈ M \ {0}} R(θ)/‖θ‖

SLIDE 39

Example: Linear regression (exact sparsity)

Lasso program:

$$\min_{\theta \in \mathbb{R}^p} \Big\{ \frac{1}{2n}\|y - X\theta\|_2^2 + \lambda_n \|\theta\|_1 \Big\}.$$

RSC corresponds to a lower bound on restricted eigenvalues of X^T X ∈ R^{p×p}; for an s-sparse vector, we have ‖θ‖₁ ≤ √s ‖θ‖₂.

Corollary
Suppose that the true parameter θ* is exactly s-sparse. Under RSC, and with λn ≥ 2‖X^T w / n‖_∞, any Lasso solution satisfies

$$\|\widehat{\theta} - \theta^*\|_2^2 \;\le\; \frac{4}{\gamma^2}\, s\, \lambda_n^2.$$

SLIDE 40

Example: Linear regression (exact sparsity)

Corollary
Suppose that the true parameter θ* is exactly s-sparse. Under RSC, and with λn ≥ 2‖X^T w / n‖_∞, any Lasso solution satisfies

$$\|\widehat{\theta} - \theta^*\|_2^2 \;\le\; \frac{4}{\gamma^2}\, s\, \lambda_n^2.$$

Some stochastic instances recover known results:
  • Compressed sensing: Xij ~ N(0, 1) and bounded noise ‖w‖₂ ≤ σ√n
  • Deterministic design: X with bounded columns and wi ~ N(0, σ²)

In these cases,

$$\Big\|\frac{X^T w}{n}\Big\|_{\infty} \le \sqrt{\frac{3\sigma^2 \log p}{n}} \;\; \text{w.h.p.} \qquad \Longrightarrow \qquad \|\widehat{\theta} - \theta^*\|_2^2 \le \frac{12\sigma^2}{\gamma^2}\,\frac{s \log p}{n}.$$

(e.g., Candes & Tao, 2007; Huang & Zhang, 2008; Meinshausen & Yu, 2008; Bickel et al., 2008)
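The driver of these rates is the event ‖Xᵀw/n‖∞ ≲ σ√(log p / n). The simulation below (illustrative constants and sizes, standard Gaussian design and noise) shows the observed maxima tracking the σ√(2 log p / n) order of magnitude, consistent with the √(3σ² log p / n) bound quoted above.

```python
import numpy as np

rng = np.random.default_rng(8)
sigma = 1.0

for n, p in [(200, 100), (200, 1000), (800, 1000)]:
    vals = []
    for _ in range(50):
        X = rng.standard_normal((n, p))          # columns have norm ~ sqrt(n)
        w = sigma * rng.standard_normal(n)
        vals.append(np.max(np.abs(X.T @ w) / n))
    pred = sigma * np.sqrt(2 * np.log(p) / n)    # predicted order of magnitude
    print(f"n={n:4d}, p={p:5d}: mean ||X^T w / n||_inf = {np.mean(vals):.3f}, "
          f"sigma*sqrt(2 log p / n) = {pred:.3f}")
```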

SLIDE 41

Example: Linear regression (weak sparsity)

For some q ∈ [0, 1], say θ* belongs to the ℓq-"ball"

$$\mathbb{B}_q(R_q) := \Big\{ \theta \in \mathbb{R}^p \;\Big|\; \sum_{j=1}^{p} |\theta_j|^q \le R_q \Big\}.$$

Corollary
For θ* ∈ Bq(Rq), any Lasso solution satisfies (w.h.p.)

$$\|\widehat{\theta} - \theta^*\|_2^2 \;\lesssim\; R_q \Big(\frac{\sigma^2 \log p}{n}\Big)^{1 - q/2}.$$

This rate is known to be minimax optimal (Raskutti, W. & Yu, 2011).

SLIDE 42

Example: Group-structured regularizers

Many applications exhibit sparsity with more structure...

[Figure: coordinates partitioned into groups G1, G2, G3.]

Divide the index set {1, 2, ..., p} into groups G = {G1, G2, ..., GT}. For parameters νt ∈ [1, ∞], define the block norm

$$\|\theta\|_{\nu,\mathcal{G}} := \sum_{t=1}^{T} \|\theta_{G_t}\|_{\nu_t}.$$

Group/block Lasso program:

$$\widehat{\theta}_{\lambda_n} \in \arg\min_{\theta \in \mathbb{R}^p} \Big\{ \frac{1}{2n}\|y - X\theta\|_2^2 + \lambda_n \|\theta\|_{\nu,\mathcal{G}} \Big\}.$$

Different versions have been studied by various authors (Wright et al., 2005; Tropp et al., 2006; Yuan & Li, 2006; Baraniuk, 2008; Obozinski et al., 2008; Zhao et al., 2008; Bach et al., 2009; Lounici et al., 2009).
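One standard way to solve the ν = 2 (ℓ1/ℓ2) version is proximal gradient descent: the gradient step uses the least-squares loss, and the proximal step soft-thresholds the ℓ2 norm of each group. The sketch below is a minimal illustration with made-up data and penalty level, not the solver used in any of the cited work.

```python
import numpy as np

def group_soft_threshold(theta, groups, tau):
    """Prox of tau * sum_g ||theta_g||_2: shrink each group's l2 norm by tau."""
    out = theta.copy()
    for g in groups:
        nrm = np.linalg.norm(theta[g])
        out[g] = 0.0 if nrm <= tau else (1 - tau / nrm) * theta[g]
    return out

def group_lasso(X, y, groups, lam, iters=500):
    n, p = X.shape
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)       # 1 / Lipschitz constant of the gradient
    theta = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / n               # gradient of (1/2n)||y - X theta||_2^2
        theta = group_soft_threshold(theta - step * grad, groups, step * lam)
    return theta

# Tiny demo with 3 active groups out of 20.
rng = np.random.default_rng(9)
n, T, m = 100, 20, 5                                   # T groups of size m, so p = T * m
groups = [list(range(t * m, (t + 1) * m)) for t in range(T)]
theta_star = np.zeros(T * m)
for t in [2, 7, 11]:
    theta_star[groups[t]] = rng.standard_normal(m)
X = rng.standard_normal((n, T * m))
y = X @ theta_star + 0.1 * rng.standard_normal(n)

theta_hat = group_lasso(X, y, groups, lam=0.1)
active = [t for t in range(T) if np.linalg.norm(theta_hat[groups[t]]) > 1e-6]
print("recovered active groups:", active)
```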

SLIDE 43

Convergence rates for general group Lasso

Corollary
Say θ* is supported on sG groups, and X satisfies RSC. Then for regularization parameter

$$\lambda_n \ge 2 \max_{t = 1, \ldots, T} \Big\|\frac{X^T w}{n}\Big\|_{\nu_t^*}, \qquad \text{where } \frac{1}{\nu_t^*} = 1 - \frac{1}{\nu_t},$$

any solution θ̂_{λn} satisfies

$$\|\widehat{\theta}_{\lambda_n} - \theta^*\|_2 \;\le\; \frac{2}{\gamma_{\ell}}\,\Psi_{\nu}(S_{\mathcal{G}})\,\lambda_n, \qquad \text{where } \Psi_{\nu}(S_{\mathcal{G}}) = \sup_{\theta \in \mathcal{M}(S_{\mathcal{G}}) \setminus \{0\}} \frac{\|\theta\|_{\nu,\mathcal{G}}}{\|\theta\|_2}.$$

SLIDE 44

Convergence rates for general group Lasso

Some special cases, with m ≡ maximum group size:

1. ℓ1/ℓ2 regularization (group norm with ν = 2):

$$\|\widehat{\theta}_{\lambda_n} - \theta^*\|_2^2 = O\Big( \frac{s_{\mathcal{G}}\, m}{n} + \frac{s_{\mathcal{G}} \log T}{n} \Big).$$

SLIDE 45

Convergence rates for general group Lasso

Some special cases, with m ≡ maximum group size:

2. ℓ1/ℓ∞ regularization (group norm with ν = ∞):

$$\|\widehat{\theta}_{\lambda_n} - \theta^*\|_2^2 = O\Big( \frac{s_{\mathcal{G}}\, m^2}{n} + \frac{s_{\mathcal{G}} \log T}{n} \Big).$$
SLIDE 46

Example: Low-rank matrices and nuclear norm

Low-rank matrix Θ* ∈ R^{p1 × p2} that is exactly (or approximately) low-rank; noisy/partial observations of the form

$$y_i = \langle\!\langle X_i, \Theta^*\rangle\!\rangle + w_i, \qquad i = 1, \ldots, n, \qquad w_i \text{ i.i.d. noise.}$$

Estimate by solving the semidefinite program (SDP)

$$\widehat{\Theta} \in \arg\min_{\Theta} \Big\{ \frac{1}{n}\sum_{i=1}^{n} \big(y_i - \langle\!\langle X_i, \Theta\rangle\!\rangle\big)^2 + \lambda_n \underbrace{\sum_{j=1}^{\min\{p_1, p_2\}} \sigma_j(\Theta)}_{|||\Theta|||_1} \Big\}.$$

SLIDE 47

Example: Low-rank matrices and nuclear norm

Various applications:
  ◮ matrix completion
  ◮ rank-reduced multivariate regression
  ◮ time-series modeling (vector autoregressions)
  ◮ ...

SLIDE 48

Example: Low-rank matrices and nuclear norm

Some past work: Fazel, 2001; Srebro et al., 2004; Recht et al., 2007; Candes & Recht, 2008; Recht, 2009; Negahban & W., 2010; Rohde & Tsybakov, 2010.

SLIDE 49

Rates for (near) low-rank estimation

For a parameter q ∈ [0, 1], the set of near low-rank matrices:

$$\mathbb{B}_q(R_q) = \Big\{ \Theta^* \in \mathbb{R}^{p_1 \times p_2} \;\Big|\; \sum_{j=1}^{\min\{p_1, p_2\}} |\sigma_j(\Theta^*)|^q \le R_q \Big\}.$$

Corollary (Negahban & W., 2011)
Under the RSC condition, with regularization parameter

$$\lambda_n \ge 16\sigma\Big( \sqrt{\frac{p_1}{n}} + \sqrt{\frac{p_2}{n}} \Big),$$

we have with high probability

$$|||\widehat{\Theta} - \Theta^*|||_F^2 \;\le\; c_0\,\frac{R_q}{\gamma^2}\Big( \frac{\sigma^2 (p_1 + p_2)}{n} \Big)^{1 - \frac{q}{2}}.$$

SLIDE 50

Rates for (near) low-rank estimation

For a rank-r matrix M,

$$|||M|||_1 = \sum_{j=1}^{r} \sigma_j(M) \;\le\; \sqrt{r}\,\sqrt{\sum_{j=1}^{r} \sigma_j^2(M)} = \sqrt{r}\,|||M|||_F.$$

Solve the nuclear-norm regularized program with

$$\lambda_n \ge \frac{2}{n}\,\Big|\Big|\Big|\sum_{i=1}^{n} w_i X_i\Big|\Big|\Big|_2.$$
SLIDE 51

Noisy matrix completion (unrescaled)

[Figure: Frobenius-norm MSE versus raw sample size (q = 0), for matrix sizes d² = 40², 60², 80², 100².]

SLIDE 52

Noisy matrix completion (rescaled)

[Figure: Frobenius-norm MSE versus rescaled sample size (q = 0), for matrix sizes d² = 40², 60², 80², 100².]

SLIDE 53

Summary

A unified framework for high-dimensional M-estimators:
  ◮ decomposability of the regularizer R
  ◮ restricted strong convexity of the loss function

Actual rates are determined by:
  ◮ the noise, measured in the dual norm R*
  ◮ the subspace constant Ψ, in moving from R to the error norm ‖·‖
  ◮ the restricted strong convexity constant

SLIDE 54

Summary

Looking ahead to tomorrow: from parametric to non-parametric problems, using kernel-based methods in high dimensions.

SLIDE 55

Some papers (www.eecs.berkeley.edu/wainwrig)

1. S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, December 2012.

2. S. Negahban and M. J. Wainwright (2011). Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Annals of Statistics, 39(2):1069-1097.

3. S. Negahban and M. J. Wainwright (2012). Restricted strong convexity and (weighted) matrix completion: Optimal bounds with noise. Journal of Machine Learning Research, May 2012.

4. G. Raskutti, M. J. Wainwright, and B. Yu (2011). Minimax rates for linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57(10):6976-6994.

SLIDE 56

Significance of decomposability

[Figure: (a) the set C for an exact model (a cone); (b) the set C for an approximate model (star-shaped).]

Lemma
Suppose that L is convex and R is decomposable with respect to M. Then as long as λn ≥ 2 R*(∇L(θ*; Z_1^n)), the error vector Δ̂ = θ̂_{λn} − θ* belongs to

$$\mathcal{C}(\mathcal{M}, \mathcal{M}^{\perp}; \theta^*) := \Big\{ \Delta \in \Omega \;\Big|\; \mathcal{R}\big(\Pi_{\mathcal{M}^{\perp}}(\Delta)\big) \le 3\,\mathcal{R}\big(\Pi_{\mathcal{M}}(\Delta)\big) + 4\,\mathcal{R}\big(\Pi_{\mathcal{M}^{\perp}}(\theta^*)\big) \Big\}.$$
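For intuition about decomposability in the simplest case: take R to be the ℓ1 norm, M the vectors supported on a set S, and M⊥ the vectors supported on S^c. Then α ∈ M and β ∈ M⊥ occupy disjoint coordinates, so ‖α + β‖₁ = ‖α‖₁ + ‖β‖₁. A short numpy check with arbitrary p and S:

```python
import numpy as np

rng = np.random.default_rng(10)
p, s = 20, 6
S = rng.choice(p, size=s, replace=False)
Sc = np.setdiff1d(np.arange(p), S)

alpha, beta = np.zeros(p), np.zeros(p)
alpha[S] = rng.standard_normal(s)            # alpha in the model subspace M (supported on S)
beta[Sc] = rng.standard_normal(p - s)        # beta in the complementary subspace M-perp

lhs = np.linalg.norm(alpha + beta, 1)
rhs = np.linalg.norm(alpha, 1) + np.linalg.norm(beta, 1)
print(np.isclose(lhs, rhs))                  # True: R(alpha + beta) = R(alpha) + R(beta)
```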