  1. Sparse Canonical Correlation Analysis: Minimaxity, Algorithm, and Computational Barrier. Harrison H. Zhou, Department of Statistics, Yale University

  2. Chao Gao, Mengjie Chen, Zongming Ma, Zhao Ren

  3. Outline • Introduction to Sparse CCA • An Elementary Reparametrization of CCA • A Naive Methodology and Its Theoretical Justification • Minimaxity, Algorithm, and Computational Barrier

  4. Introduction to Sparse CCA

  5. What Is CCA? Find θ and η:
        max Cov(θ^T X, η^T Y),  s.t.  Var(θ^T X) = Var(η^T Y) = 1,
     where
        Cov(X, Y) = [ Σ_x   Σ_xy ]
                    [ Σ_yx  Σ_y  ].
     (Hotelling 1936)

  6. Oracle Solution Find θ and η:
        max θ^T Σ_xy η,  s.t.  θ^T Σ_x θ = η^T Σ_y η = 1.
     Solution: Σ_x^{1/2} θ and Σ_y^{1/2} η are the first singular pair of Σ_x^{-1/2} Σ_xy Σ_y^{-1/2}.
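As a concrete illustration of the oracle recipe above, here is a minimal numpy sketch (not from the slides; the function names and the eigendecomposition-based inverse square root are my own choices):

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root of a positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

def oracle_cca(Sx, Sxy, Sy):
    """First canonical pair from the *population* covariances:
    the top singular pair of Sx^{-1/2} Sxy Sy^{-1/2}, mapped back."""
    Kx, Ky = inv_sqrt(Sx), inv_sqrt(Sy)
    U, d, Vt = np.linalg.svd(Kx @ Sxy @ Ky)
    theta = Kx @ U[:, 0]      # satisfies theta' Sx theta = 1
    eta = Ky @ Vt[0, :]       # satisfies eta' Sy eta = 1
    return theta, eta, d[0]   # d[0] is the first canonical correlation
```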

  7. Sample Version Find θ and η:
        max θ^T Σ̂_xy η,  s.t.  θ^T Σ̂_x θ = η^T Σ̂_y η = 1.
     Solution: Σ̂_x^{1/2} θ and Σ̂_y^{1/2} η are the first singular pair of Σ̂_x^{-1/2} Σ̂_xy Σ̂_y^{-1/2}.
     Concerns: Let X ∈ R^p and Y ∈ R^m. When p ∧ m ≫ n,
     • Estimation may not be consistent.
     • The performance of Σ̂_x^{-1/2} and Σ̂_y^{-1/2} can be poor.

  8. Sparse CCA Impose sparsity on θ and η.

  9. An Attempt at Sparse CCA PMD (Penalized Matrix Decomposition), Witten, Tibshirani & Hastie (2009). Find θ and η:
        max θ^T Σ̂_xy η,  s.t.  θ^T θ ≤ 1, η^T η ≤ 1, ||θ||_1 ≤ c_1, ||η||_1 ≤ c_2.
     Main Ideas:
     • Impose sparsity.
     • “Estimate” Σ_x and Σ_y by identity matrices.
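For intuition, a minimal sketch of a rank-one PMD-style update follows. It is only a sketch: the actual PMD algorithm binary-searches the soft-threshold levels so that the ℓ1 budgets c_1, c_2 are met, whereas here they are fixed inputs.

```python
import numpy as np

def soft(x, t):
    """Elementwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def pmd_rank1(Sxy_hat, t1=0.1, t2=0.1, n_iter=100):
    """Alternating soft-thresholded power iterations on the sample
    cross-covariance. The problem is only bi-convex, so this converges
    to a local (not necessarily global) maximizer."""
    v = np.linalg.svd(Sxy_hat)[2][0]          # start at the top right singular vector
    for _ in range(n_iter):
        u = soft(Sxy_hat @ v, t1)
        u /= max(np.linalg.norm(u), 1e-12)    # project back to the unit ball
        v = soft(Sxy_hat.T @ u, t2)
        v /= max(np.linalg.norm(v), 1e-12)
    return u, v
```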

  10. An Attempt at Sparse CCA (Cont.) Some concerns:
      • Computation: the problem is not convex, only bi-convex.
      • Theory: there is no theoretical guarantee for the global maximizer.
      • Bias: the consequence of using identities in place of Σ_x and Σ_y is unclear.

  11. Simulation result when Σ_x and Σ_y are not identities (n = 500).
      [Figure: three panels (Truth, CAPIT, PMD) plotting the coordinates of the estimated canonical vector against indices 0–400.]

  12. An Elementary Reparametrization of CCA

  13. Reparametrization Find θ and η:
        max θ^T Σ_xy η,  s.t.  θ^T Σ_x θ = η^T Σ_y η = 1.
      Reparametrization: Σ_xy = Σ_x A Σ_y.
      SVD w.r.t. Σ_x and Σ_y: A = Θ Λ H^T, Θ^T Σ_x Θ = H^T Σ_y H = I,
      for some Θ = [θ_1, θ_2, ..., θ_r], H = [η_1, η_2, ..., η_r], and Λ = diag(λ_1, ..., λ_r) with λ_i decreasing.

  14. An Explicit Solution Find θ and η:
        max θ^T Σ_xy η,  s.t.  θ^T Σ_x θ = η^T Σ_y η = 1.
      Solution: θ = σθ_1, η = ση_1, where σ = ±1.

  15. The Single Canonical Pair (SCP) Model Setting r = 1, we have
        Σ_xy = λ Σ_x θη^T Σ_y,  θ^T Σ_x θ = η^T Σ_y η = 1.
      Sparse CCA for r = 1:
      • The rank-one matrix Ω_x Σ_xy Ω_y has a sparse decomposition λθη^T, where Ω_x = Σ_x^{-1} and Ω_y = Σ_y^{-1}.
      • We assume the vectors θ and η have at most s non-zero coordinates.
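A short simulation sketch for the SCP model (my own helper, not from the slides; it assumes θ and η are already normalized so that θ^T Σ_x θ = η^T Σ_y η = 1):

```python
import numpy as np

def simulate_scp(n, Sx, Sy, theta, eta, lam, seed=0):
    """Draw n i.i.d. pairs (X_i, Y_i) from a joint Gaussian with
    Cov(X) = Sx, Cov(Y) = Sy, Cov(X, Y) = lam * Sx theta eta' Sy.
    The joint covariance is positive semidefinite when 0 <= lam <= 1."""
    p, m = len(theta), len(eta)
    Sxy = lam * np.outer(Sx @ theta, Sy @ eta)
    Sigma = np.block([[Sx, Sxy], [Sxy.T, Sy]])
    rng = np.random.default_rng(seed)
    Z = rng.multivariate_normal(np.zeros(p + m), Sigma, size=n)
    return Z[:, :p], Z[:, p:]                 # X is n x p, Y is n x m
```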

  16. Comparison: The Single Spike Model (Johnstone & Lu, 2009)
        Σ = λθθ^T + I,  θ^T θ = 1.
      • Sparse CCA is harder: extra nuisance parameters Σ_x and Σ_y.
      • Sparsity of θ, η may not imply sparsity of Σ_xy.

  17. A Naive Methodology and Its Theoretical Justification

  18. Known Covariance Observations: {(X_i, Y_i)}_{i=1}^n i.i.d. with
        Cov(X_i, Y_i) = [ Σ_x              λ Σ_x θη^T Σ_y ]
                        [ λ Σ_y ηθ^T Σ_x   Σ_y            ].
      Transformation:
        Cov(Ω_x X_i, Ω_y Y_i) = [ Ω_x     λθη^T ]
                                [ ληθ^T   Ω_y   ].
      Unbiased estimator of A = λθη^T:
        Â = (1/n) ∑_{i=1}^n Ω_x X_i Y_i^T Ω_y.
      Apply sparse SVD to Â.
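In matrix form the estimator is Â = Ω_x (X^T Y / n) Ω_y, which the following short sketch computes (the names are mine; X and Y hold the observations as rows):

```python
import numpy as np

def A_hat(X, Y, Ox, Oy):
    """Unbiased estimator (1/n) sum_i Ox X_i Y_i' Oy of A = lam theta eta',
    assuming the precision matrices Ox, Oy are known."""
    n = X.shape[0]
    Ahat = Ox @ (X.T @ Y) @ Oy / n
    # A sparse SVD of Ahat (e.g. thresholded singular vectors)
    # then recovers theta and eta up to sign.
    return Ahat
```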

  19. CCA via Precision-adjusted Iterative Thresholding (CAPIT)
      Step 1: Split the data into two halves. Use the first half to form Ω̂_x, Ω̂_y.
      Step 2: Apply coordinate thresholding to the matrix
          Â = (2/n) ∑_{i=n/2+1}^{n} Ω̂_x X_i Y_i^T Ω̂_y
      to get an initializer u^(0) or v^(0).
      Step 3: Apply iterative thresholding to Â with the initializer to get u^(k) and v^(k).
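A schematic of the three steps follows. This is only a sketch: the paper's thresholding levels are data-driven and its precision estimators depend on the covariance class, neither of which is reproduced here; `precision_est` is a hypothetical user-supplied estimator and the threshold levels are placeholders.

```python
import numpy as np

def capit_sketch(X, Y, precision_est, t_coord, t_iter, n_steps=50):
    """Schematic CAPIT: sample splitting, coordinate thresholding,
    then iterative thresholding."""
    n = X.shape[0]
    # Step 1: first half of the data -> precision matrix estimates.
    Ox, Oy = precision_est(X[:n // 2]), precision_est(Y[:n // 2])
    # Step 2: plug-in A_hat on the second half, then coordinate thresholding.
    X2, Y2 = X[n // 2:], Y[n // 2:]
    A = Ox @ (X2.T @ Y2) @ Oy * (2.0 / n)
    A0 = A * (np.abs(A) > t_coord)
    u = np.linalg.svd(A0)[0][:, 0]            # initializer u^(0)
    # Step 3: iterative thresholding starting from the initializer.
    for _ in range(n_steps):
        v = A.T @ u
        v *= np.abs(v) > t_iter
        v /= max(np.linalg.norm(v), 1e-12)
        u = A @ v
        u *= np.abs(u) > t_iter
        u /= max(np.linalg.norm(u), 1e-12)
    return u, v
```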

  20. Convergence Rate of CAPIT - Assumptions
      Assumption A: s = o((n / log p)^{1/2}).
      Assumption B: ||(Ω̂_x Σ_x − I)θ|| ∨ ||(Ω̂_y Σ_y − I)η|| = o_P(1).
      In addition, we assume that λ ≥ M^{-1} and ||Σ_x|| ∨ ||Σ_y|| ∨ ||Ω_x|| ∨ ||Ω_y|| ≤ M.

  21. Convergence Rate of CAPIT - Loss Function Consider the joint loss
        L^2(θ̂, θ) + L^2(η̂, η) = |sin ∠(θ̂, θ)|^2 + |sin ∠(η̂, η)|^2.
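In code, the sin-angle loss for a single pair of vectors is essentially a one-liner (a minimal sketch; the names are mine):

```python
import numpy as np

def sin2_angle(a, b):
    """|sin angle(a, b)|^2 = 1 - cos^2(a, b); invariant to signs and scales."""
    c = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - c ** 2
```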

  22. Convergence Rate of CAPIT Theorem 1 Under the assumptions, we have
        L^2(θ̂, θ) + L^2(η̂, η) ≲ (s log p)/n + ||(Ω̂_x Σ_x − I)θ||^2 + ||(Ω̂_y Σ_y − I)η||^2
      with high probability.

  23. Remark The convergence rate depends on ||(Ω̂_x Σ_x − I)θ|| + ||(Ω̂_y Σ_y − I)η||, which is determined by the covariance class F_p. Examples of F_p: bandable, sparse, Toeplitz, graphical model, spiked covariance...

  24. Minimaxity

  25. Questions on Fundamental Limits
      • Go beyond r = 1?
      • Allow residual canonical correlation directions?
      • Avoid the ugly terms in Theorem 1?

  26. General Sparse CCA Model:
        Σ_xy = Σ_x (U_1 Λ_1 V_1^T + U_2 Λ_2 V_2^T) Σ_y,  U^T Σ_x U = V^T Σ_y V = I,
      where U = [U_1, U_2], V = [V_1, V_2].
      Goal: estimate sparse U_1 and V_1 (at most s nonzero rows). No structural assumption on U_2, V_2, Σ_x, Σ_y.

  27. Procedure The estimator (Û_1, V̂_1) is a solution to the following optimization problem:
        max_{(A,B)} tr(A′ Σ̂_xy B)
        s.t. A′ Σ̂_x A = B′ Σ̂_y B = I_r and exactly s nonzero rows for both A and B,
      where Σ̂_x, Σ̂_y, and Σ̂_xy are sample covariance matrices.
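Since the program requires exactly s nonzero rows, solving it exactly amounts to a combinatorial search over supports. The toy sketch below (my own, exponential time, feasible only for very small p and m) makes that explicit: each support pair is scored by the sum of the top r singular values of the whitened cross-covariance restricted to that support.

```python
import numpy as np
from itertools import combinations

def inv_sqrt(S):
    """Inverse symmetric square root of a positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

def exhaustive_scca(Sx, Sxy, Sy, s, r=1):
    """Brute-force support search for the sparse CCA program
    (pass in the sample covariances for Sx, Sxy, Sy)."""
    p, m = Sxy.shape
    best, best_val = None, -np.inf
    for I in map(list, combinations(range(p), s)):
        for J in map(list, combinations(range(m), s)):
            K = inv_sqrt(Sx[np.ix_(I, I)]) @ Sxy[np.ix_(I, J)] @ inv_sqrt(Sy[np.ix_(J, J)])
            val = np.linalg.svd(K, compute_uv=False)[:r].sum()
            if val > best_val:
                best_val, best = val, (I, J)
    return best, best_val                     # best support pair and its objective
```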

  28. Assumptions
      • U_1 ∈ R^{p×r} and V_1 ∈ R^{m×r} have at most s nonzero rows;
      • 1 > κλ ≥ λ_1 ≥ ... ≥ λ_r ≥ λ > 0;
      • λ_{r+1} ≤ cλ_r for some small c;
      • ||Σ_x^l||_op ∨ ||Σ_y^l||_op ≤ M for l = ±1.

  29. Minimaxity Theorem 2 Under the assumptions, we have
        inf_{Û_1 V̂_1′} sup_{Σ ∈ F_0(s,p,m,r,λ)} E ||Û_1 V̂_1′ − U_1 V_1′||_F^2 ≍ (1/(nλ^2)) [rs + s(log(p/s) + log(m/s))].

  30. Remarks:
      • We allow arbitrary r, as in the results for sparse PCA.
      • The presence of residual canonical correlation directions does not influence the minimax rates, under a mild condition on the eigengap.
      • The minimax rates are not affected by estimation of Σ_x^{-1} and Σ_y^{-1}.

  31. Algorithm

  32. Questions on Computational Feasibility Is there a polynomial-time algorithm to
      • Go beyond r = 1?
      • Allow residual canonical correlation directions?
      • Avoid the ugly terms?
      Answer: Not yet. We need to assume the residual canonical correlation directions are zero, i.e., U_2 = 0 or V_2 = 0.

  33. Two-Stage Procedure: I Initialization by Convex Programming
        maximize  tr(Σ̂_xy F′) − ρ ||F||_1,
        subject to  ||Σ̂_x^{1/2} F Σ̂_y^{1/2}||_* ≤ r,  ||Σ̂_x^{1/2} F Σ̂_y^{1/2}||_spectral ≤ 1.
      Motivation: the exhaustive search procedure
        max_{(A,B)} tr(A′ Σ̂_xy B)
        s.t. A′ Σ̂_x A = B′ Σ̂_y B = I_r and exactly s nonzero rows for both A and B.
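A sketch of Stage I using cvxpy; the modeling choices here (solver defaults, the elementwise ℓ1 penalty written as cp.sum(cp.abs(F)), and how the initializer is extracted) are my own, with ρ and r supplied by the user:

```python
import numpy as np
import cvxpy as cp

def sqrt_psd(S):
    """Symmetric square root of a positive semidefinite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

def convex_init(Sx_hat, Sxy_hat, Sy_hat, r, rho):
    """Stage I: l1-penalized objective with nuclear-norm and
    spectral-norm constraints, a convex relaxation of the
    exhaustive-search program."""
    Sx_h, Sy_h = sqrt_psd(Sx_hat), sqrt_psd(Sy_hat)
    F = cp.Variable(Sxy_hat.shape)
    M = Sx_h @ F @ Sy_h                       # the constrained matrix
    prob = cp.Problem(
        cp.Maximize(cp.trace(Sxy_hat.T @ F) - rho * cp.sum(cp.abs(F))),
        [cp.normNuc(M) <= r, cp.sigma_max(M) <= 1])
    prob.solve()
    return F.value   # its top-r singular vectors can serve as the initializer
```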

  34. Two-Stage Procedure: II Refinement by Sparse Regression – Group Lasso
        Û = argmin_{L ∈ R^{p×r}} { tr(L′ Σ̂_x L) − 2 tr(L′ Σ̂_xy V^(0)) + ρ_u ∑_{j=1}^p ||L_j·|| },
        V̂ = argmin_{R ∈ R^{m×r}} { tr(R′ Σ̂_y R) − 2 tr(R′ Σ̂_yx U^(0)) + ρ_v ∑_{j=1}^m ||R_j·|| }.
      Motivation: the least squares problems
        min_{L ∈ R^{p×r}} E ||L′X − V′Y||_F^2,  min_{R ∈ R^{m×r}} E ||R′Y − U′X||_F^2,
      whose minimizers are UΛ and VΛ.
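A proximal-gradient sketch for the Û update (the slide states the program, not a solver; this particular iteration and step-size rule are my own):

```python
import numpy as np

def group_lasso_refine(Sx_hat, Sxy_hat, V0, rho, n_iter=500):
    """Minimize tr(L' Sx L) - 2 tr(L' Sxy V0) + rho * sum_j ||L_j.||
    by proximal gradient with row-wise soft-thresholding."""
    p, r = Sx_hat.shape[0], V0.shape[1]
    step = 1.0 / (2.0 * np.linalg.norm(Sx_hat, 2))   # 1 / Lipschitz constant
    L = np.zeros((p, r))
    for _ in range(n_iter):
        G = 2.0 * (Sx_hat @ L - Sxy_hat @ V0)        # gradient of the smooth part
        Z = L - step * G
        norms = np.linalg.norm(Z, axis=1, keepdims=True)
        # Row-wise group soft-thresholding (prox of the group lasso penalty).
        L = np.maximum(0.0, 1.0 - step * rho / np.maximum(norms, 1e-12)) * Z
    return L
```

The V̂ update is symmetric, with Σ̂_y, Σ̂_yx, and U^(0) in place of Σ̂_x, Σ̂_xy, and V^(0).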

  35. Statistical Optimality Assume that
        s^2 log(p + m) / (nλ_r^2) ≤ c
      for some sufficiently small c ∈ (0, 1). We can show that
        ||P_Û − P_U||_F^2 ≤ C s(r + log p) / (nλ_r^2),
        ||P_V̂ − P_V||_F^2 ≤ C s(r + log m) / (nλ_r^2),
      with high probability.

  36. Computational Barrier

  37. Computational Barrier Consider r = 1. There is a set of Gaussian distributions G such that, for some δ ∈ (0, 1), if
        lim_{n→∞} s^{2−δ} log(p + m) / (nλ^2) > 0,
      then for any randomized polynomial-time estimator Û,
        lim_{n→∞} sup_G E ||P_Û − P_U||_F^2 > c
      for some constant c > 0, under the assumption that the Planted Clique Hypothesis holds.

  38. Summary
      • An elementary characterization of Sparse CCA is provided.
      • A preliminary adaptive procedure (CAPIT) is proposed, but it needs to take advantage of the covariance structure.
      • The minimax rate for Sparse CCA is nailed down, but the upper bound is achieved by exhaustively searching over the support.
      • A new computationally feasible algorithm attains the minimax rate, but it needs the residual canonical correlation directions to be zero.
      • A computational barrier holds under the Planted Clique Hypothesis.
