SLIDE 1

Estimating Sparse Principal Components and Subspaces Jing Lei

Department of Statistics, CMU Joint work with V. Q. Vu (OSU), J. Cho, and K. Rohe (U. of Wisc.)

July 1, 2013

SLIDE 2

Outline

  • PCA in high dimensions.
  • Sparsity of principal components.
  • Consistent estimation and minimax theory.
  • Feasible algorithms using convex relaxation.
SLIDE 3

Principal Components Analysis

  • We observe i.i.d. data points X1, ..., Xn on p variables.
  • p may be large, so we want to use principal components analysis (PCA) for dimension reduction.

SLIDE 4

Principal Components Analysis

[Figure: scatter plot of the data in the (x, y) plane]

SLIDE 5

Principal Components Analysis

[Figure: the same scatter plot in the (x, y) plane]

SLIDE 6

Principal Components Analysis

[Figure: the scatter plot with the first principal component direction, labeled pc1]

SLIDE 7

Principal Components Analysis

  • Σ = E(XX^T) is the population covariance matrix (say EX = 0).
  • Eigen-decomposition:

Σ = VDV^T = λ1 v1v1^T + λ2 v2v2^T + ... + λp vpvp^T

D = diag(λ1, λ2, ..., λp), λ1 ≥ λ2 ≥ ... ≥ λp ≥ 0 (eigenvalues)
VV^T = Ip, V = (v1, v2, ..., vp) (eigenvectors)

  • “Optimal” d-dimensional projection: X → ΠdX, where

Πd = VdVd^T (d-dimensional projection matrix), Vd = (v1, ..., vd).
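As a concrete illustration of the decomposition and projection above, here is a minimal numpy sketch; the 4×4 toy covariance is an arbitrary choice for illustration, not from the talk:

```python
import numpy as np

# Toy covariance with eigenvalues 4, 2, 1, 1 in a random orthogonal
# basis (an illustrative assumption).
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
Sigma = Q @ np.diag([4.0, 2.0, 1.0, 1.0]) @ Q.T

# Eigen-decomposition Sigma = V D V^T, eigenvalues sorted decreasingly.
evals, V = np.linalg.eigh(Sigma)        # eigh returns ascending order
order = np.argsort(evals)[::-1]
evals, V = evals[order], V[:, order]

# "Optimal" d-dimensional projection Pi_d = V_d V_d^T.
d = 2
V_d = V[:, :d]
Pi_d = V_d @ V_d.T

# Pi_d is a rank-d orthogonal projection: symmetric and idempotent.
assert np.allclose(Pi_d, Pi_d.T)
assert np.allclose(Pi_d @ Pi_d, Pi_d)
```

The two assertions check the defining properties of an orthogonal projection matrix.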

SLIDE 8

Classical Estimator

  • Sample covariance matrix: Σ̂ = n^(−1)(X1X1^T + ... + XnXn^T).
  • Estimate (λ̂j, v̂j) by eigen-decomposition of Σ̂:

V̂d = (v̂1, ..., v̂d), Π̂d = V̂dV̂d^T.

  • Standard theory for p fixed and n → ∞:

Π̂d → Πd a.s. if λd − λd+1 > 0.
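A minimal sketch of this classical estimator, assuming the observations are already centered (EX = 0, as on the previous slide):

```python
import numpy as np

def classical_pca(X, d):
    """Classical estimator: eigendecomposition of the sample covariance.
    Assumes the rows of X (shape (n, p)) are centered observations."""
    n = X.shape[0]
    S = X.T @ X / n                      # Sigma_hat = n^{-1} sum_i X_i X_i^T
    evals, V = np.linalg.eigh(S)         # ascending eigenvalues
    order = np.argsort(evals)[::-1]
    evals, V = evals[order], V[:, order]
    V_d = V[:, :d]                       # V_hat_d = (v_hat_1, ..., v_hat_d)
    return evals[:d], V_d, V_d @ V_d.T   # (lambda_hats, V_hat_d, Pi_hat_d)
```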

SLIDE 9

High-Dimensional PCA: Challenges

  • Estimation accuracy. Classical theory fails when p/n → c > 0: λ̂1 → c′ > 1 and v̂1^T v1 ≈ 0 under a simple model (Johnstone & Lu 2009).
  • Interpretability. Π̂dX may be hard to interpret when it involves linear combinations of many variables.

  • Sparsity is a possible solution.
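The breakdown described in the first bullet is easy to reproduce numerically. Below is an illustrative simulation (the dimensions and spike strength are assumptions, chosen so that p/n = 2 and the spike sits below the detection threshold): even though the population eigengap is positive, the leading sample eigenvector is nearly orthogonal to the truth.

```python
import numpy as np

# Illustrative simulation (sizes and spike strength are assumptions,
# not from the talk): Sigma = I_p + theta * v1 v1^T with p/n = 2.
rng = np.random.default_rng(1)
n, p, theta = 200, 400, 1.0
v1 = np.zeros(p)
v1[0] = 1.0                                  # true leading eigenvector

X = rng.standard_normal((n, p))              # isotropic noise
X += np.sqrt(theta) * rng.standard_normal((n, 1)) * v1  # rank-one spike

S = X.T @ X / n                              # sample covariance
evals, V = np.linalg.eigh(S)
v1_hat = V[:, -1]                            # leading sample eigenvector
print(abs(v1_hat @ v1))  # near 0: almost orthogonal to the true v1
```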
SLIDE 10

Sparsity for Principal Subspaces [Vu & L 2012b]

  • Identifiability. If λ1 = λ2 = ... = λd, then one cannot distinguish Vd and VdQ from observed data for any orthogonal Q.
  • Intuition: a good notion of sparsity must be rotation invariant.
  • Matrix (2,0) norm: for any matrix V ∈ R^(p×d),

‖V‖2,0 = # of non-zero rows in V.

  • Row sparsity: ‖Vd‖2,0 ≤ R0 ≪ p, where Vd = (v1, v2, ..., vd).
  • Loss function: ‖Π̂d − Πd‖F^2 (‖·‖F: the Frobenius norm).

Recall: Πd = VdVd^T and Π̂d = V̂dV̂d^T.
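These two definitions translate directly into code. A literal numpy transcription (the tolerance defining a "non-zero row" is an implementation choice):

```python
import numpy as np

def norm_2_0(V, tol=1e-12):
    """Matrix (2,0) norm: the number of non-zero rows of V.
    Rotation invariant: row norms of V and VQ agree for orthogonal Q."""
    return int(np.sum(np.linalg.norm(V, axis=1) > tol))

def subspace_loss(V_hat, V_d):
    """Loss ||Pi_hat_d - Pi_d||_F^2 with Pi_d = V_d V_d^T."""
    Pi_hat = V_hat @ V_hat.T
    Pi = V_d @ V_d.T
    return np.linalg.norm(Pi_hat - Pi, "fro") ** 2
```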

SLIDE 11

Two Sparse PCA Models

  • 1. Spiked model:

Σ = (λ1 − λd+1) v1v1^T + ... + (λd − λd+1) vdvd^T + λd+1 Ip.

  • 2. General model:

Σ = λ1 v1v1^T + ... + λd vdvd^T + λd+1 Σ′,

where Σ′ ⪰ 0, ‖Σ′‖ = 1, and Σ′vj = 0 for all 1 ≤ j ≤ d.
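A direct numpy construction of the spiked model (case 1), useful for simulations; the general model would additionally require a Σ′ with unit operator norm that annihilates the vj's:

```python
import numpy as np

def spiked_covariance(V_d, lambdas, lam_rest):
    """Spiked model (case 1 above):
    Sigma = sum_j (lambda_j - lambda_{d+1}) v_j v_j^T + lambda_{d+1} I_p.

    V_d: (p, d) with orthonormal columns; lambdas: (lambda_1, ..., lambda_d)
    in decreasing order; lam_rest: lambda_{d+1}."""
    p = V_d.shape[0]
    Sigma = lam_rest * np.eye(p)
    for j, lam_j in enumerate(lambdas):
        v = V_d[:, j:j + 1]                    # j-th leading eigenvector
        Sigma += (lam_j - lam_rest) * (v @ v.T)
    return Sigma
```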

SLIDE 12

Spiked Model is a Special Case of General Model

Black cell: |Σ(i,j)| ≤ 0.01; white cell: |Σ(i,j)| > 0.01. In the spiked model, all black cells outside the upper-left 20×20 block are exactly 0.

[Figure: two 100×100 heatmaps, “Covariance Pattern of Spiked Model” and “Covariance Pattern of General Model”]

SLIDE 13

How Does Sparsity Help?

  • Question: how does sparsity help with the estimation?
  • 1. How well can we do if sparsity is assumed?
  • 2. How to estimate under sparsity assumption?
  • Intuition: Estimation is easy if
  • 1. n is large.
  • 2. p is small.
  • 3. λd+1 is close to 0.
  • 4. λd −λd+1 is away from 0.
  • 5. R0 is small.
  • Under the spiked model, [Johnstone & Lu 2009] gives a consistent estimator of v1 when p/n → c > 0, with all other parameters fixed.

SLIDE 14

A Minimax Framework

Find f(n, p, R0, λ) such that

sup_Σ E‖Π̂d − Πd‖F^2 ≳ f(n, p, R0, λ) for every estimator Π̂d,

and a particular estimator Π̂d such that

E‖Π̂d − Πd‖F^2 ≲ f(n, p, R0, λ) for all Σ.

The supremum over Σ is taken over all matrices in the sparse PCA model.

SLIDE 15

Answer to the Minimax Question

Theorem: Minimax Error Rate of Estimating Vd (Vu & Lei 2012b)

Under the general model, the minimax rate of estimating VdVd^T is

fd(n, p, R0, λ) ≍ R0 · λ1λd+1/(λd − λd+1)^2 · (d + log p)/n,

and can be achieved by

V̂d = argmax { Tr(Vd^T Σ̂ Vd) : Vd^T Vd = Id, ‖Vd‖2,0 ≤ R0 }.
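The estimator in the theorem is a combinatorial optimization over row supports: for a fixed support, the inner maximization is solved by the top-d eigenvectors of the corresponding principal submatrix. A brute-force sketch, feasible only for tiny p (this exponential cost is exactly the computational burden the next slide mentions):

```python
import numpy as np
from itertools import combinations

def row_sparse_pca(S, d, R0):
    """Brute-force version of the theorem's estimator (requires d <= R0):
    maximize Tr(V^T S V) over orthonormal (p, d) V with <= R0 nonzero rows."""
    p = S.shape[0]
    best_val, best_V = -np.inf, None
    for support in combinations(range(p), R0):
        idx = list(support)
        # For a fixed support, the optimum is given by the top-d
        # eigenvectors of the principal submatrix S[idx, idx].
        evals, U = np.linalg.eigh(S[np.ix_(idx, idx)])
        val = evals[-d:].sum()                # sum of top-d eigenvalues
        if val > best_val:
            V = np.zeros((p, d))
            V[idx, :] = U[:, -d:]             # embed back into R^p
            best_val, best_V = val, V
    return best_V
```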

SLIDE 16

About This Result

  • Good news
  • Exact minimax error rate in (n, p, d, R0, λ) for general models.
  • First consistency result for ℓ1-constrained/penalized PCA (Jolliffe et al 2003, Zou et al 2006).
  • Price to pay
  • Finding the global maximizer is computationally demanding.
  • Extensions
  • Soft sparsity: ℓq-ball with q ∈ [0,1] [Vu & L 2012a,b].
  • Feasible algorithms [Vu, Cho, L, Rohe 2013].
SLIDE 17

Related Work

  • When d = 1, [Birnbaum et al 2012] and [Ma 2013] established the minimax rate under the spiked model, where the estimator is obtained by the power method and thresholding.
  • For subspace estimation, the minimax rate was independently obtained by [Cai et al 2012] under a Gaussian spiked model.
SLIDE 18

Feasible Algorithm Via Convex Relaxation

  • For d = 1, the optimal estimator (consider Z = v1v1^T) is

Ẑ = argmax_Z Tr(Σ̂Z) − λ‖Z‖0, s.t. rank(Z) = 1, Z ⪰ 0, Tr(Z) = 1.

  • [d’Aspremont et al 2004] proposed an SDP relaxation:

Ẑ = argmax_Z Tr(Σ̂Z) − λ‖Z‖1, s.t. Z ⪰ 0, Tr(Z) = 1.

  • Ẑ gives consistent variable selection with the optimal rate under a stringent spiked model, provided that Ẑ is rank 1 [Amini & Wainwright 2009].
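The relaxation above is a standard SDP and can be prototyped in a few lines. A sketch using cvxpy (an assumption of this example, not a tool mentioned in the talk; it needs an SDP-capable solver such as SCS installed):

```python
import numpy as np
import cvxpy as cp

def sparse_pca_sdp(S, lam):
    """SDP relaxation of [d'Aspremont et al 2004] as written above:
    maximize Tr(S Z) - lam * ||Z||_1  s.t.  Z PSD, Tr(Z) = 1."""
    p = S.shape[0]
    Z = cp.Variable((p, p), symmetric=True)
    l1_penalty = cp.sum(cp.abs(Z))            # entrywise ||Z||_1
    objective = cp.Maximize(cp.trace(S @ Z) - lam * l1_penalty)
    constraints = [Z >> 0, cp.trace(Z) == 1]  # Z >> 0 means Z is PSD
    cp.Problem(objective, constraints).solve()
    # If Z_hat is (near) rank one, its leading eigenvector estimates v1.
    _, V = np.linalg.eigh(Z.value)
    return V[:, -1]
```

The theorem on the next slide suggests taking lam proportional to (λ1/(λ1 − λ2))·√(log p / n).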

SLIDE 19

Preliminary Results for SDP Relaxation

Theorem: Error Bound for SDP Relaxation [VCLR 2013]

When d = 1 under the general model, assume ‖v1‖0 ≤ R0 and choose λ ≍ (λ1/(λ1 − λ2)) √(log p / n) in the SDP relaxation. Then w.h.p. the global optimizer Ẑ satisfies

‖Ẑ − v1v1^T‖2^2 ≲ R0^2 · λ1^2/(λ1 − λ2)^2 · (log p)/n.

SLIDE 20

SDP Relaxation is *Near* Optimal

  • Recall the SDP rate and the minimax rate (d = 1, q = 0):

R0^2 λ1^2/(λ1 − λ2)^2 · (log p)/n   vs.   R0 λ1λ2/(λ1 − λ2)^2 · (log p)/n.

  • These are off by a factor of R0^2 λ1^2 / (R0 λ1λ2) = R0 λ1/λ2.
  • The R0 factor is unavoidable for polynomial-time algorithms in a hypothesis testing context [Berthet & Rigollet 2013].
  • The λ1/λ2 factor may be removable using a finer analysis.
SLIDE 21

Summary

  • Sparsity helps improve both estimation accuracy and interpretability of PCA in high dimensions.
  • Sparsity can be defined for principal subspaces.
  • Minimax error rates are established for general covariance models.

  • Convex relaxation using SDP is near-optimal.
SLIDE 22

Ongoing Work

  • Statistical properties for SDP relaxation under soft sparsity.
  • SDP relaxation for subspaces (d > 1).
  • Penalties other than ℓ1, such as the group lasso penalty.
SLIDE 23

Main References

  • 1. V. Q. Vu and J. Lei (2012), “Minimax rates of estimation for sparse PCA in high dimensions”, AISTATS 2012.
  • 2. V. Q. Vu and J. Lei (2013), “Minimax Sparse Principal Subspace Estimation in High Dimensions”, revision submitted.
  • 3. V. Q. Vu, J. Cho, J. Lei, and K. Rohe (2013), ongoing work.