SLIDE 1

Sparse Canonical Correlation Analysis: Minimaxity, Algorithm, and Computational Barrier

Harrison H. Zhou
Department of Statistics, Yale University

SLIDE 2

Mengjie Chen, Chao Gao, Zongming Ma, Zhao Ren

SLIDE 3

Outline

  • Introduction to Sparse CCA
  • An Elementary Reparametrization of CCA
  • A Naive Methodology and Its Theoretical Justification
  • Minimaxity, Algorithm, and Computational Barrier

SLIDE 4

Introduction to Sparse CCA

SLIDE 5

What Is CCA?

Find $\theta$ and $\eta$:
$$\max \ \mathrm{Cov}(\theta^T X, \eta^T Y), \quad \text{s.t. } \mathrm{Var}(\theta^T X) = \mathrm{Var}(\eta^T Y) = 1,$$
where
$$\mathrm{Cov}\begin{pmatrix} X \\ Y \end{pmatrix} = \begin{pmatrix} \Sigma_x & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_y \end{pmatrix}.$$
(Hotelling, 1936)

SLIDE 6

Oracle Solution

Find $\theta$ and $\eta$:
$$\max \ \theta^T \Sigma_{xy} \eta, \quad \text{s.t. } \theta^T \Sigma_x \theta = \eta^T \Sigma_y \eta = 1.$$
Solution: $\Sigma_x^{1/2}\theta$ and $\Sigma_y^{1/2}\eta$ are the first singular pair of $\Sigma_x^{-1/2}\Sigma_{xy}\Sigma_y^{-1/2}$.
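A minimal numerical sketch of this oracle solution (illustrative, not the authors' code), assuming the population covariances are known and positive definite:

```python
# Oracle CCA: first canonical pair from known population covariances,
# via the SVD of the whitened cross-covariance Sx^{-1/2} Sxy Sy^{-1/2}.
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

def oracle_cca(Sx, Sxy, Sy):
    Sx_ih, Sy_ih = inv_sqrt(Sx), inv_sqrt(Sy)
    U, s, Vt = np.linalg.svd(Sx_ih @ Sxy @ Sy_ih)
    theta = Sx_ih @ U[:, 0]   # un-whiten: Sx^{1/2} theta is the left singular vector
    eta = Sy_ih @ Vt[0, :]
    return theta, eta, s[0]   # s[0] is the first canonical correlation
```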

SLIDE 7

Sample Version

Find $\theta$ and $\eta$:
$$\max \ \theta^T \hat\Sigma_{xy} \eta, \quad \text{s.t. } \theta^T \hat\Sigma_x \theta = \eta^T \hat\Sigma_y \eta = 1.$$
Solution: $\hat\Sigma_x^{1/2}\theta$ and $\hat\Sigma_y^{1/2}\eta$ are the first singular pair of $\hat\Sigma_x^{-1/2}\hat\Sigma_{xy}\hat\Sigma_y^{-1/2}$.

Concerns: Let $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}^m$. When $p \wedge m \gg n$,

  • Estimation may not be consistent.
  • The performance of $\hat\Sigma_x^{-1/2}$ and $\hat\Sigma_y^{-1/2}$ can be poor.

SLIDE 8

Sparse CCA

Impose sparsity on θ and η.

SLIDE 9

An Attempt at Sparse CCA

PMD (Penalized Matrix Decomposition), Witten, Tibshirani & Hastie (2009). Find $\theta$ and $\eta$:
$$\max \ \theta^T \hat\Sigma_{xy} \eta, \quad \text{s.t. } \theta^T\theta \le 1, \ \eta^T\eta \le 1, \ \|\theta\|_1 \le c_1, \ \|\eta\|_1 \le c_2.$$
Main Ideas:

  • Impose sparsity.
  • “Estimate” Σx and Σy by identity matrices.
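A rough sketch of the resulting alternating (bi-convex) updates, assuming a fixed soft-threshold level `delta` in place of the binary search Witten et al. use to meet the exact $\ell_1$ budgets:

```python
# PMD-style rank-one updates on the sample cross-covariance: alternate
# soft-thresholding and renormalization of theta and eta.
import numpy as np

def soft(x, delta):
    return np.sign(x) * np.maximum(np.abs(x) - delta, 0.0)

def pmd_rank1(Sxy_hat, delta=0.1, iters=100):
    v = np.linalg.svd(Sxy_hat)[2][0]          # start at leading right singular vector
    for _ in range(iters):
        u = soft(Sxy_hat @ v, delta)
        u /= max(np.linalg.norm(u), 1e-12)    # enforce theta' theta <= 1
        v = soft(Sxy_hat.T @ u, delta)
        v /= max(np.linalg.norm(v), 1e-12)    # enforce eta' eta <= 1
    return u, v
```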

SLIDE 10

An Attempt at Sparse CCA - Cont.

Some concerns:

  • Computation: the problem is not convex (only bi-convex).
  • Theory: there is no theoretical guarantee for the global maximizer.
  • Bias: the consequence of replacing Σx and Σy by identities is unclear.

SLIDE 11

Simulation result when Σx and Σy are not identities (n=500)

[Figure: three panels of estimated canonical vector coefficients across 400 coordinates, labeled Truth, CAPIT, and PMD.]

SLIDE 12

An Elementary Reparametrization of CCA

SLIDE 13

Reparametrization

Find $\theta$ and $\eta$:
$$\max \ \theta^T \Sigma_{xy} \eta, \quad \text{s.t. } \theta^T \Sigma_x \theta = \eta^T \Sigma_y \eta = 1.$$
Reparametrization: $\Sigma_{xy} = \Sigma_x A \Sigma_y$. SVD of $A$ w.r.t. $\Sigma_x$ and $\Sigma_y$:
$$A = \Theta \Lambda H^T, \quad \Theta^T \Sigma_x \Theta = H^T \Sigma_y H = I,$$
for some $\Theta = [\theta_1, \theta_2, \dots, \theta_r]$, $H = [\eta_1, \eta_2, \dots, \eta_r]$, and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_r)$ with the $\lambda_i$ decreasing.
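A one-line check, not on the slide, that this reparametrized SVD is the ordinary SVD after whitening:
$$\Sigma_x^{-1/2}\Sigma_{xy}\Sigma_y^{-1/2} = \Sigma_x^{1/2} A\, \Sigma_y^{1/2} = \big(\Sigma_x^{1/2}\Theta\big)\,\Lambda\,\big(\Sigma_y^{1/2}H\big)^T,$$
and the constraints $\Theta^T\Sigma_x\Theta = H^T\Sigma_yH = I$ say exactly that $\Sigma_x^{1/2}\Theta$ and $\Sigma_y^{1/2}H$ have orthonormal columns, matching the oracle solution of Slide 6.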

SLIDE 14

An Explicit Solution

Find $\theta$ and $\eta$:
$$\max \ \theta^T \Sigma_{xy} \eta, \quad \text{s.t. } \theta^T \Sigma_x \theta = \eta^T \Sigma_y \eta = 1.$$
Solution: $\theta = \sigma\theta_1$, $\eta = \sigma\eta_1$, where $\sigma = \pm 1$.

SLIDE 15

The Single Canonical Pair (SCP) Model

With $r = 1$, we have
$$\Sigma_{xy} = \lambda \Sigma_x \theta \eta^T \Sigma_y, \quad \theta^T \Sigma_x \theta = \eta^T \Sigma_y \eta = 1.$$
Sparse CCA for $r = 1$:

  • The rank-one matrix $\Omega_x \Sigma_{xy} \Omega_y$ has a sparse decomposition $\lambda\theta\eta^T$, where $\Omega_x = \Sigma_x^{-1}$ and $\Omega_y = \Sigma_y^{-1}$.
  • We assume the vectors $\theta$ and $\eta$ have at most $s$ nonzero coordinates.
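A sketch for drawing samples from the SCP model, assuming $\theta$ and $\eta$ are normalized so that $\theta^T\Sigma_x\theta = \eta^T\Sigma_y\eta = 1$ and $\lambda < 1$, which keeps the joint covariance positive semidefinite (the normalizations are requirements of the model, not extra structure):

```python
# Simulate n i.i.d. pairs (X_i, Y_i) from the single canonical pair model.
import numpy as np

def scp_sample(n, Sx, Sy, theta, eta, lam, seed=0):
    rng = np.random.default_rng(seed)
    p, m = len(theta), len(eta)
    cross = lam * Sx @ np.outer(theta, eta) @ Sy     # Sigma_xy under the SCP model
    cov = np.block([[Sx, cross], [cross.T, Sy]])
    Z = rng.multivariate_normal(np.zeros(p + m), cov, size=n)
    return Z[:, :p], Z[:, p:]                        # X: n x p, Y: n x m
```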

SLIDE 16

Comparison: The Single Spike Model (Johnstone & Lu, 2009)

$$\Sigma = \lambda\theta\theta^T + I, \quad \theta^T\theta = 1.$$

  • Sparse CCA is harder: there are extra nuisance parameters $\Sigma_x$ and $\Sigma_y$.
  • Sparsity of $\theta, \eta$ may not imply sparsity of $\Sigma_{xy}$.

SLIDE 17

A Naive Methodology and Its Theoretical Justification

SLIDE 18

Known Covariance

Observations: $\{(X_i, Y_i)\}_{i=1}^n$ i.i.d. with
$$\mathrm{Cov}\begin{pmatrix} X_i \\ Y_i \end{pmatrix} = \begin{pmatrix} \Sigma_x & \lambda \Sigma_x \theta \eta^T \Sigma_y \\ \lambda \Sigma_y \eta \theta^T \Sigma_x & \Sigma_y \end{pmatrix}.$$
Transformation:
$$\mathrm{Cov}\begin{pmatrix} \Omega_x X_i \\ \Omega_y Y_i \end{pmatrix} = \begin{pmatrix} \Omega_x & \lambda \theta\eta^T \\ \lambda \eta\theta^T & \Omega_y \end{pmatrix}.$$
Unbiased estimator of $A = \lambda\theta\eta^T$:
$$\hat A = \frac{1}{n} \sum_{i=1}^n \Omega_x X_i Y_i^T \Omega_y.$$
Apply sparse SVD to $\hat A$.
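A sketch of this known-covariance estimator; the hard-thresholded SVD below is a crude stand-in for a proper sparse SVD routine:

```python
# Form A_hat = (1/n) sum_i Omega_x X_i Y_i' Omega_y and extract a sparse
# rank-one fit from its top singular pair.
import numpy as np

def a_hat(X, Y, Omega_x, Omega_y):
    """X: n x p, Y: n x m. Equals (1/n) * Omega_x X'Y Omega_y."""
    return Omega_x @ (X.T @ Y) @ Omega_y / X.shape[0]

def crude_sparse_svd(A, thresh=0.05):
    U, s, Vt = np.linalg.svd(A)
    theta = U[:, 0] * (np.abs(U[:, 0]) > thresh)   # zero out small coordinates
    eta = Vt[0, :] * (np.abs(Vt[0, :]) > thresh)
    return theta, eta, s[0]
```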

SLIDE 19

CCA via Precision-Adjusted Iterative Thresholding (CAPIT)

Step 1: Split the data into two halves. Use the first half to form $\hat\Omega_x$, $\hat\Omega_y$.
Step 2: Apply coordinate thresholding to the matrix
$$\hat A = \frac{2}{n} \sum_{i=n/2+1}^{n} \hat\Omega_x X_i Y_i^T \hat\Omega_y$$
to get an initializer $u^{(0)}$ or $v^{(0)}$.
Step 3: Apply iterative thresholding on $\hat A$ with the initializer to get $u^{(k)}$ and $v^{(k)}$.
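One possible reading of Step 3 as code; the fixed threshold and iteration count are illustrative assumptions, not the paper's schedule:

```python
# Iterative thresholding on A_hat starting from an initializer u0.
import numpy as np

def capit_iterate(A_hat, u0, thresh, k=50):
    u = u0 / np.linalg.norm(u0)
    v = None
    for _ in range(k):
        v = A_hat.T @ u
        v = v * (np.abs(v) > thresh)          # hard-threshold small coordinates
        v /= max(np.linalg.norm(v), 1e-12)
        u = A_hat @ v
        u = u * (np.abs(u) > thresh)
        u /= max(np.linalg.norm(u), 1e-12)
    return u, v
```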

SLIDE 20

Convergence Rate of CAPIT - Assumptions

Assumption A: $s = o\big( (n / \log p)^{1/2} \big)$.
Assumption B: $\|(\hat\Omega_x \Sigma_x - I)\theta\| \vee \|(\hat\Omega_y \Sigma_y - I)\eta\| = o_P(1)$.
In addition, we assume that $\lambda \ge M^{-1}$ and $\|\Sigma_x\| \vee \|\Sigma_y\| \vee \|\Omega_x\| \vee \|\Omega_y\| \le M$.

SLIDE 21

Convergence Rate of CAPIT - Loss Function

Consider the joint loss
$$L^2(\hat\theta, \theta) + L^2(\hat\eta, \eta) = |\sin\angle(\hat\theta, \theta)|^2 + |\sin\angle(\hat\eta, \eta)|^2.$$
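Written out for vectors, a direct transcription of the loss above:

```python
# Joint sin-angle loss between estimated and true canonical directions.
import numpy as np

def sin2(a, b):
    cos = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - cos ** 2                     # |sin angle(a, b)|^2

def joint_loss(theta_hat, theta, eta_hat, eta):
    return sin2(theta_hat, theta) + sin2(eta_hat, eta)
```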

SLIDE 22

Convergence Rate of CAPIT

Theorem 1. Under the assumptions, we have, with high probability,
$$L^2(\hat\theta, \theta) + L^2(\hat\eta, \eta) \lesssim \frac{s \log p}{n} + \|(\hat\Omega_x \Sigma_x - I)\theta\|^2 + \|(\hat\Omega_y \Sigma_y - I)\eta\|^2.$$

SLIDE 23

Remark

The convergence rate depends on $\|(\hat\Omega_x \Sigma_x - I)\theta\| + \|(\hat\Omega_y \Sigma_y - I)\eta\|$, which is determined by the covariance class $\mathcal{F}_p$. Examples of $\mathcal{F}_p$: bandable, sparse, Toeplitz, graphical model, spiked covariance...

SLIDE 24

Minimaxity

SLIDE 25

Questions on Fundamental Limits

  • Go beyond r = 1?
  • Allow residual canonical correlation directions?
  • Avoid the ugly terms in Theorem 1?

SLIDE 26

General Sparse CCA Model:
$$\Sigma_{xy} = \Sigma_x \big( U_1 \Lambda_1 V_1^T + U_2 \Lambda_2 V_2^T \big) \Sigma_y, \quad U^T \Sigma_x U = V^T \Sigma_y V = I,$$
where $U = [U_1, U_2]$, $V = [V_1, V_2]$.
Goal: estimate the sparse $U_1$ and $V_1$ (at most $s$ nonzero rows). No structural assumptions on $U_2$, $V_2$, $\Sigma_x$, $\Sigma_y$.

SLIDE 27

Procedure

The estimator $(\hat U_1, \hat V_1)$ is a solution to the following optimization problem:
$$\max_{(A, B)} \ \mathrm{tr}(A^T \hat\Sigma_{xy} B) \quad \text{s.t. } A^T \hat\Sigma_x A = B^T \hat\Sigma_y B = I_r$$
with exactly $s$ nonzero rows for both $A$ and $B$, where $\hat\Sigma_x$, $\hat\Sigma_y$, and $\hat\Sigma_{xy}$ are sample covariance matrices.

SLIDE 28

Assumptions

  • $U_1 \in \mathbb{R}^{p \times r}$ and $V_1 \in \mathbb{R}^{m \times r}$ have at most $s$ nonzero rows;
  • $1 > \kappa\lambda \ge \lambda_1 \ge \dots \ge \lambda_r \ge \lambda > 0$;
  • $\lambda_{r+1} \le c\lambda_r$ for some small $c$;
  • $\|\Sigma_x^l\| \vee \|\Sigma_y^l\| \le M$ for $l = \pm 1$.

SLIDE 29

Minimaxity

Theorem 2. Under the assumptions, we have
$$\inf_{\hat U_1 \hat V_1^T} \ \sup_{\Sigma \in \mathcal{F}_0(s, p, m, r, \lambda)} \mathbb{E}\|\hat U_1 \hat V_1^T - U_1 V_1^T\|_F^2 \asymp \frac{1}{n\lambda^2}\Big[ rs + s\Big(\log\frac{p}{s} + \log\frac{m}{s}\Big) \Big].$$

SLIDE 30

Remarks:

  • We allow arbitrary $r$, as in results for sparse PCA.
  • The presence of residual canonical correlation directions does not influence the minimax rates, under a mild condition on the eigengap.
  • The minimax rates are not affected by estimation of $\Sigma_x^{-1}$ and $\Sigma_y^{-1}$.

SLIDE 31

Algorithm

SLIDE 32

Questions on Computational Feasibility

A polynomial-time algorithm to

  • Go beyond $r = 1$?
  • Allow residual canonical correlation directions?
  • Avoid the ugly terms?

Answer: Not yet. We need to assume the residual canonical correlation directions are zero, i.e., $U_2 = 0$ or $V_2 = 0$.

SLIDE 33

Two-Stage Procedure: I

Initialization by Convex Programming:
$$\text{maximize } \ \mathrm{tr}(\hat\Sigma_{xy} F^T) - \rho\|F\|_1, \quad \text{subject to } \ \|\hat\Sigma_x^{1/2} F \hat\Sigma_y^{1/2}\|_* \le r, \ \ \|\hat\Sigma_x^{1/2} F \hat\Sigma_y^{1/2}\|_{\mathrm{spectral}} \le 1.$$
Motivation: the exhaustive search procedure
$$\max_{(A, B)} \ \mathrm{tr}(A^T \hat\Sigma_{xy} B) \quad \text{s.t. } A^T \hat\Sigma_x A = B^T \hat\Sigma_y B = I_r$$
with exactly $s$ nonzero rows for both $A$ and $B$.
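A cvxpy sketch of this convex program (illustrative; the matrix square roots, ρ, and r are taken as given inputs, and `convex_init` is a hypothetical name):

```python
# Convex initialization: trade off tr(Sxy F') against an l1 penalty,
# with nuclear- and spectral-norm constraints on Sx^{1/2} F Sy^{1/2}.
import cvxpy as cp

def convex_init(Sxy, Sx_half, Sy_half, rho, r):
    p, m = Sxy.shape
    F = cp.Variable((p, m))
    M = Sx_half @ F @ Sy_half
    # tr(Sxy F') equals the elementwise inner product sum(Sxy * F).
    obj = cp.Maximize(cp.sum(cp.multiply(Sxy, F)) - rho * cp.norm1(F))
    cons = [cp.normNuc(M) <= r, cp.sigma_max(M) <= 1]
    cp.Problem(obj, cons).solve()
    return F.value
```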

SLIDE 34

Two-Stage Procedure: II

Refinement by Sparse Regression – Group Lasso
$$\hat U = \mathop{\arg\min}_{L \in \mathbb{R}^{p \times r}} \Big\{ \mathrm{tr}(L^T \hat\Sigma_x L) - 2\,\mathrm{tr}(L^T \hat\Sigma_{xy} V^{(0)}) + \rho_u \sum_{j=1}^p \|L_{j\cdot}\| \Big\},$$
$$\hat V = \mathop{\arg\min}_{R \in \mathbb{R}^{m \times r}} \Big\{ \mathrm{tr}(R^T \hat\Sigma_y R) - 2\,\mathrm{tr}(R^T \hat\Sigma_{yx} U^{(0)}) + \rho_v \sum_{j=1}^m \|R_{j\cdot}\| \Big\}.$$
Motivation: the least squares problems
$$\min_{L \in \mathbb{R}^{p \times r}} \mathbb{E}\|L^T X - V^T Y\|_F^2, \quad \min_{R \in \mathbb{R}^{m \times r}} \mathbb{E}\|R^T Y - U^T X\|_F^2,$$
whose minimizers are $U\Lambda$ and $V\Lambda$.
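A proximal-gradient sketch of the first (Û) update; the step size and iteration count are assumptions for illustration, not the paper's choices:

```python
# Group lasso refinement: minimize tr(L' Sx L) - 2 tr(L' Sxy V0)
# + rho_u * sum_j ||L_j.|| by gradient steps plus row-wise shrinkage.
import numpy as np

def refine_u(Sx, Sxy, V0, rho_u, step=0.01, iters=500):
    L = np.zeros((Sx.shape[0], V0.shape[1]))
    for _ in range(iters):
        grad = 2 * (Sx @ L - Sxy @ V0)                    # gradient of smooth part
        Z = L - step * grad
        norms = np.linalg.norm(Z, axis=1, keepdims=True)
        # Row-wise (group) soft-threshold: prox of step * rho_u * ||row||.
        L = np.maximum(1 - step * rho_u / np.maximum(norms, 1e-12), 0) * Z
    return L
```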

SLIDE 35

Statistical Optimality

Assume that
$$\frac{s^2 \log(p + m)}{n\lambda_r^2} \le c, \quad \text{for some sufficiently small } c \in (0, 1).$$
We can show that, with high probability,
$$\|P_{\hat U} - P_U\|_F^2 \le C\, \frac{s(r + \log p)}{n\lambda_r^2}, \qquad \|P_{\hat V} - P_V\|_F^2 \le C\, \frac{s(r + \log m)}{n\lambda_r^2}.$$
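The subspace distance appearing in these bounds, computed directly (a small helper, not from the slides):

```python
# ||P_Uhat - P_U||_F^2 between the column spaces of two matrices.
import numpy as np

def proj_dist2(U_hat, U):
    Qh = np.linalg.qr(U_hat)[0]               # orthonormal basis of span(U_hat)
    Q = np.linalg.qr(U)[0]
    return np.linalg.norm(Qh @ Qh.T - Q @ Q.T, 'fro') ** 2
```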

SLIDE 36

Computational Barrier

SLIDE 37

Computational Barrier

Consider $r = 1$. There is a set of Gaussian distributions $\mathcal{G}$ such that, for some $\delta \in (0, 1)$ with
$$\lim_{n \to \infty} \frac{s^{2-\delta} \log(p + m)}{n\lambda^2} > 0,$$
any randomized polynomial-time estimator $\hat U$ satisfies
$$\lim_{n \to \infty} \sup_{\mathcal{G}} \mathbb{E}\|P_{\hat U} - P_U\|_F^2 > c,$$
for some constant $c > 0$, under the assumption that the Planted Clique Hypothesis holds.

SLIDE 38

Summary

  • An elementary characterization of sparse CCA is provided.
  • A preliminary adaptive procedure (CAPIT) is proposed, but it needs to take advantage of the covariance structure.
  • The minimax rate for sparse CCA is nailed down, but the upper bound is achieved through exhaustively searching the support.
  • A new computationally feasible algorithm attains the minimax rate, but it needs to assume the residual canonical correlation directions are zero.
  • A computational barrier.