SLIDE 1
Sparse Canonical Correlation Analysis: Minimaxity, Algorithm, and Computational Barrier
Harrison H. Zhou
Department of Statistics, Yale University
with Chao Gao, Mengjie Chen, Zongming Ma, Zhao Ren
SLIDE 2
SLIDE 3
Outline
- Introduction to Sparse CCA
- An Elementary Reparametrization of CCA
- A Naive Methodology and Its Theoretical Justification
- Minimaxity, Algorithm, and Computational Barrier
SLIDE 4
Introduction to Sparse CCA
SLIDE 5
What Is CCA?
Find θ and η:
max Cov(θ^T X, η^T Y), s.t. Var(θ^T X) = Var(η^T Y) = 1,
where Cov(X, Y) = [Σx, Σxy; Σyx, Σy]. (Hotelling 1936)
SLIDE 6
Oracle Solution
Find θ and η:
max θ^T Σxy η, s.t. θ^T Σx θ = η^T Σy η = 1.
Solution: Σx^{1/2} θ and Σy^{1/2} η are the first singular pair of Σx^{-1/2} Σxy Σy^{-1/2}.
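As a sanity check, the oracle solution can be computed directly from the SVD of the whitened cross-covariance. A minimal numpy sketch (the toy covariances and variable names are mine, not from the talk):

```python
import numpy as np

def inv_sqrt(S):
    # Inverse square root of a symmetric positive definite matrix.
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

def cca_first_pair(Sx, Sxy, Sy):
    # First canonical pair from the SVD of Sx^{-1/2} Sxy Sy^{-1/2}.
    Sx_is, Sy_is = inv_sqrt(Sx), inv_sqrt(Sy)
    U, svals, Vt = np.linalg.svd(Sx_is @ Sxy @ Sy_is)
    # Map the leading singular pair back to the canonical vectors.
    return Sx_is @ U[:, 0], Sy_is @ Vt[0, :], svals[0]

# Toy model with one canonical pair and canonical correlation 0.8.
Sx, Sy = np.eye(3), np.eye(2)
theta0, eta0 = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0])
Sxy = 0.8 * np.outer(theta0, eta0)
theta, eta, lam = cca_first_pair(Sx, Sxy, Sy)
```

Here `lam` recovers the canonical correlation and `theta` the first canonical direction, up to sign.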
SLIDE 7
Sample Version
Find θ and η:
max θ^T Σ̂xy η, s.t. θ^T Σ̂x θ = η^T Σ̂y η = 1.
Solution: Σ̂x^{1/2} θ and Σ̂y^{1/2} η are the first singular pair of Σ̂x^{-1/2} Σ̂xy Σ̂y^{-1/2}.
Concerns: Let X ∈ R^p and Y ∈ R^m. When p ∧ m ≫ n,
- Estimation may not be consistent.
- The performance of Σ̂x^{-1/2} and Σ̂y^{-1/2} can be poor.
SLIDE 8
Sparse CCA
Impose sparsity on θ and η.
SLIDE 9
An Attempt at Sparse CCA
PMD (Penalized Matrix Decomposition), Witten, Tibshirani & Hastie (2009).
Find θ and η:
max θ^T Σ̂xy η, s.t. θ^T θ ≤ 1, η^T η ≤ 1, ||θ||_1 ≤ c1, ||η||_1 ≤ c2.
Main Ideas:
- Impose sparsity.
- “Estimate” Σx and Σy by identity matrices.
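PMD's alternating updates can be sketched as soft-thresholded power iterations. The sketch below substitutes a fixed threshold for PMD's data-driven search over the l1-constraint level, so it illustrates the idea rather than the published algorithm:

```python
import numpy as np

def soft(x, t):
    # Entrywise soft-thresholding.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def pmd_rank1(M, t_u=0.1, t_v=0.1, iters=100):
    # Alternating soft-thresholded power iterations for max u' M v with
    # ||u||_2 <= 1, ||v||_2 <= 1.  A fixed threshold stands in for PMD's
    # line search that enforces ||u||_1 <= c1, ||v||_1 <= c2.
    v = np.linalg.svd(M)[2][0]          # warm start: top right singular vector
    for _ in range(iters):
        u = soft(M @ v, t_u)
        u /= max(np.linalg.norm(u), 1e-12)
        v = soft(M.T @ u, t_v)
        v /= max(np.linalg.norm(v), 1e-12)
    return u, v

# Sparse rank-one signal plus small noise.
rng = np.random.default_rng(0)
u_true = np.zeros(20); u_true[:2] = [0.8, 0.6]
v_true = np.zeros(15); v_true[:2] = [0.6, 0.8]
M = 2.0 * np.outer(u_true, v_true) + 0.01 * rng.standard_normal((20, 15))
u, v = pmd_rank1(M)
```

On this toy matrix the iterations recover the sparse singular pair up to sign; note the updates are only bi-convex, matching the concern on the next slide.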
SLIDE 10
An Attempt at Sparse CCA - Cont.
Some concerns:
- Computation: the problem is non-convex (only bi-convex).
- Theory: no theoretical guarantee for the global maximizer.
- Bias: the consequence of replacing Σx and Σy by identities is unclear.
SLIDE 11
Simulation result when Σx and Σy are not identities (n=500)
[Figure: three panels, labeled Truth, CAPIT, and PMD, plotting the estimated canonical vector over coordinates 100-400]
SLIDE 12
An Elementary Reparametrization of CCA
SLIDE 13
Reparametrization
Find θ and η:
max θ^T Σxy η, s.t. θ^T Σx θ = η^T Σy η = 1.
Reparametrization: Σxy = Σx A Σy.
SVD w.r.t. Σx and Σy: A = Θ Λ H^T, with Θ^T Σx Θ = H^T Σy H = I, for some Θ = [θ1, θ2, ..., θr], H = [η1, η2, ..., ηr], and Λ = diag(λ1, ..., λr) with λi decreasing.
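The reparametrization can be verified numerically: build A = ΘΛH^T with Θ, H orthonormal with respect to Σx, Σy, set Σxy = Σx A Σy, and recover A as Ωx Σxy Ωy. A sketch with random toy covariances (all names mine):

```python
import numpy as np

rng = np.random.default_rng(1)

def rand_spd(d):
    # A well-conditioned random symmetric positive definite matrix.
    G = rng.standard_normal((d, d))
    return G @ G.T + d * np.eye(d)

def inv_sqrt(S):
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

p, m, r = 5, 4, 2
Sx, Sy = rand_spd(p), rand_spd(m)
# Columns orthonormal w.r.t. Sx (resp. Sy): Theta' Sx Theta = H' Sy H = I.
Theta = inv_sqrt(Sx) @ np.linalg.qr(rng.standard_normal((p, r)))[0]
H = inv_sqrt(Sy) @ np.linalg.qr(rng.standard_normal((m, r)))[0]
Lam = np.diag([0.9, 0.5])
A = Theta @ Lam @ H.T
Sxy = Sx @ A @ Sy                                     # Sigma_xy = Sigma_x A Sigma_y
A_rec = np.linalg.solve(Sx, Sxy) @ np.linalg.inv(Sy)  # Omega_x Sigma_xy Omega_y
```

`A_rec` matches `A`, so the sparse object of interest is exactly Ωx Σxy Ωy.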
SLIDE 14
An Explicit Solution
Find θ and η:
max θ^T Σxy η, s.t. θ^T Σx θ = η^T Σy η = 1.
Solution: θ = σ θ1, η = σ η1, where σ = ±1.
SLIDE 15
The Single Canonical Pair (SCP) Model
When r = 1, we have Σxy = λ Σx θ η^T Σy, with θ^T Σx θ = η^T Σy η = 1.
Sparse CCA for r = 1:
- The rank-one matrix Ωx Σxy Ωy has a sparse decomposition λ θ η^T, where Ωx = Σx^{-1} and Ωy = Σy^{-1}.
- We assume the vectors θ and η have at most s non-zero coordinates.
SLIDE 16
Comparison: The Single Spike Model (Johnstone & Lu, 2009): Σ = λ θ θ^T + I, θ^T θ = 1.
- Sparse CCA is harder: extra nuisance parameters Σx and Σy.
- Sparsity of θ, η may not imply sparsity of Σxy.
SLIDE 17
A Naive Methodology and Its Theoretical Justification
SLIDE 18
Known Covariance
Observations: {(Xi, Yi)}_{i=1}^n i.i.d. with
Cov(Xi, Yi) = [Σx, λΣxθη^TΣy; λΣyηθ^TΣx, Σy].
Transformation: Cov(ΩxXi, ΩyYi) = [Ωx, λθη^T; ληθ^T, Ωy].
Unbiased estimator of A = λθη^T:
Â = (1/n) ∑_{i=1}^n Ωx Xi Yi^T Ωy.
Apply sparse SVD to Â.
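A quick simulation illustrates the unbiasedness: under the single-pair model with known precision matrices, the average of Ωx Xi Yi^T Ωy approaches λθη^T. A sketch with toy diagonal covariances (chosen by me only for simplicity):

```python
import numpy as np

rng = np.random.default_rng(2)
p, m, n, lam = 4, 3, 20000, 0.5
# Toy diagonal covariances (an assumption made here, not from the talk).
Sx = np.diag([1.0, 2.0, 0.5, 1.5])
Sy = np.diag([1.0, 0.5, 2.0])
# Normalize so that theta' Sx theta = eta' Sy eta = 1.
theta = np.array([1.0, 1.0, 0.0, 0.0]); theta /= np.sqrt(theta @ Sx @ theta)
eta = np.array([0.0, 1.0, 1.0]); eta /= np.sqrt(eta @ Sy @ eta)
Sxy = lam * Sx @ np.outer(theta, eta) @ Sy
Sigma = np.block([[Sx, Sxy], [Sxy.T, Sy]])
Z = rng.multivariate_normal(np.zeros(p + m), Sigma, size=n)
X, Y = Z[:, :p], Z[:, p:]
Ox, Oy = np.linalg.inv(Sx), np.linalg.inv(Sy)
# A_hat = (1/n) * sum_i Omega_x X_i Y_i' Omega_y, as one matrix product.
A_hat = (X @ Ox).T @ (Y @ Oy) / n
A_true = lam * np.outer(theta, eta)
```

With n = 20000 the entrywise error of `A_hat` is of order 1/sqrt(n), so the estimate sits close to the rank-one target λθη^T.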
SLIDE 19
CCA via Precision-Adjusted Iterative Thresholding (CAPIT)
Step 1: Split the data into two halves; use the first half to form Ω̂x, Ω̂y.
Step 2: Apply coordinate thresholding to the matrix
Â = (2/n) ∑_{i=n/2+1}^n Ω̂x Xi Yi^T Ω̂y
to get an initializer u^(0) or v^(0).
Step 3: Apply iterative thresholding on Â with the initializer to get u^(k) and v^(k).
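Step 2 can be sketched as entrywise thresholding of Â followed by a plain SVD of the thresholded matrix. The threshold below is an arbitrary constant I chose for the toy example, standing in for a calibrated choice that scales like sqrt(log(p ∨ m)/n):

```python
import numpy as np

def capit_init(A_hat, tau):
    # Step 2 sketch: zero every entry of A_hat below tau in absolute value,
    # then take the top singular pair of the thresholded matrix as the
    # initializer (u0, v0).
    A_thr = A_hat * (np.abs(A_hat) > tau)
    U, s, Vt = np.linalg.svd(A_thr)
    return U[:, 0], Vt[0, :]

rng = np.random.default_rng(3)
p, m = 30, 25
theta = np.zeros(p); theta[:3] = 1 / np.sqrt(3)
eta = np.zeros(m); eta[:3] = 1 / np.sqrt(3)
# Stand-in for A_hat: rank-one signal plus small noise.
A_hat = 0.9 * np.outer(theta, eta) + 0.02 * rng.standard_normal((p, m))
u0, v0 = capit_init(A_hat, tau=0.05)
```

Thresholding kills most pure-noise entries, so the top singular pair of the cleaned matrix already points close to (θ, η) and can seed Step 3's iterative thresholding.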
SLIDE 20
Convergence Rate of CAPIT - Assumptions
Assumption A: s = o((n / log p)^{1/2}).
Assumption B: ||(Ω̂xΣx − I)θ|| ∨ ||(Ω̂yΣy − I)η|| = o_P(1).
In addition, we assume that λ ≥ M^{-1} and ||Σx|| ∨ ||Σy|| ∨ ||Ωx|| ∨ ||Ωy|| ≤ M.
SLIDE 21
Convergence Rate of CAPIT - Loss Function
Consider the joint loss L^2(θ̂, θ) + L^2(η̂, η) = |sin ∠(θ̂, θ)|^2 + |sin ∠(η̂, η)|^2.
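The loss is the squared sine of the angle between the estimated and true directions, so it is invariant to sign flips. A one-function sketch:

```python
import numpy as np

def sin_angle_loss(a_hat, a):
    # |sin angle(a_hat, a)|^2 = 1 - cos^2; invariant to the sign of a_hat.
    c = (a_hat @ a) / (np.linalg.norm(a_hat) * np.linalg.norm(a))
    return 1.0 - c ** 2

theta = np.array([1.0, 0.0])
```

For example, `sin_angle_loss(theta, theta)` and `sin_angle_loss(-theta, theta)` are both 0, while an orthogonal direction gives loss 1, the worst case.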
SLIDE 22
Convergence Rate of CAPIT
Theorem 1. Under the assumptions, with high probability,
L^2(θ̂, θ) + L^2(η̂, η) ≲ s log p / n + ||(Ω̂xΣx − I)θ||^2 + ||(Ω̂yΣy − I)η||^2.
SLIDE 23
Remark
The convergence rate depends on ||(Ω̂xΣx − I)θ|| + ||(Ω̂yΣy − I)η||, which is determined by the covariance class Fp. Examples of Fp: bandable, sparse, Toeplitz, graphical model, spiked covariance...
SLIDE 24
Minimaxity
SLIDE 25
Questions on Fundamental Limits
- Go beyond r = 1?
- Allow residual canonical correlation directions?
- Avoid the ugly terms in Theorem 1?
SLIDE 26
General Sparse CCA Model:
Σxy = Σx (U1 Λ1 V1^T + U2 Λ2 V2^T) Σy, U^T Σx U = V^T Σy V = I, where U = [U1, U2], V = [V1, V2].
Goal: estimating sparse U1 and V1 (at most s nonzero rows). No structural assumption on U2, V2, Σx, Σy.
SLIDE 27
Procedure
The estimator (Û1, V̂1) is a solution to the optimization problem
max_{(A,B)} tr(A^T Σ̂xy B) s.t. A^T Σ̂x A = B^T Σ̂y B = Ir and exactly s nonzero rows for both A and B,
where Σ̂x, Σ̂y, and Σ̂xy are sample covariance matrices.
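For tiny dimensions this combinatorial program can be run literally: enumerate all size-s supports and keep the pair maximizing the leading restricted canonical correlation. A brute-force sketch for r = 1, run at the population level (population covariances stand in for the sample ones; the toy instance is mine):

```python
import numpy as np
from itertools import combinations

def inv_sqrt(S):
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

def exhaustive_scca(Sx, Sxy, Sy, s):
    # Enumerate all size-s support pairs; score each by the top canonical
    # correlation of the problem restricted to those coordinates.
    p, m = Sxy.shape
    best = (-1.0, None, None)
    for Ssub in combinations(range(p), s):
        for Tsub in combinations(range(m), s):
            ii, jj = list(Ssub), list(Tsub)
            K = inv_sqrt(Sx[np.ix_(ii, ii)]) @ Sxy[np.ix_(ii, jj)] @ inv_sqrt(Sy[np.ix_(jj, jj)])
            val = np.linalg.svd(K, compute_uv=False)[0]
            if val > best[0]:
                best = (val, Ssub, Tsub)
    return best

# Toy instance: the true supports are {0, 1} on both sides.
p = m = 6
theta = np.zeros(p); theta[:2] = [0.8, 0.6]
eta = np.zeros(m); eta[:2] = [0.6, 0.8]
Sx, Sy = np.eye(p), np.eye(m)
Sxy = 0.7 * np.outer(theta, eta)
val, S_best, T_best = exhaustive_scca(Sx, Sxy, Sy, 2)
```

The search recovers the true supports, but it visits C(p,s)·C(m,s) pairs, which is exactly the exponential cost the later slides try to avoid.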
SLIDE 28
Assumptions
- U1 ∈ R^{p×r} and V1 ∈ R^{m×r} have at most s nonzero rows;
- 1 > κλ ≥ λ1 ≥ ... ≥ λr ≥ λ > 0;
- λ_{r+1} ≤ c λ_r for some small c;
- ||Σx^l||_op ∨ ||Σy^l||_op ≤ M for l = ±1.
SLIDE 29
Minimaxity
Theorem 2. Under the assumptions,
inf_{Û1V̂1^T} sup_{Σ ∈ F0(s,p,m,r,λ)} E ||Û1V̂1^T − U1V1^T||_F^2 ≍ (1 / (nλ^2)) [ rs + s( log(p/s) + log(m/s) ) ].
SLIDE 30
Remarks:
- We allow arbitrary r, as in the results for sparse PCA.
- The presence of residual canonical correlation directions does not influence the minimax rates, under a mild condition on the eigengap.
- The minimax rates are not affected by estimation of Σx^{-1} and Σy^{-1}.
SLIDE 31
Algorithm
SLIDE 32
Questions on Computational Feasibility
A polynomial time algorithm to
- Go beyond r = 1?
- Allow residual canonical correlation directions?
- Avoid the ugly terms?
Answer: Not yet. We need to assume the residual canonical correlation directions are zero, i.e., U2 = 0 or V2 = 0.
SLIDE 33
Two-Stage Procedure: I
Initialization by Convex Programming:
maximize tr(Σ̂xy F^T) − ρ ||F||_1,
subject to ||Σ̂x^{1/2} F Σ̂y^{1/2}||_* ≤ r, ||Σ̂x^{1/2} F Σ̂y^{1/2}||_spectral ≤ 1.
Motivation: the exhaustive search procedure
max_{(A,B)} tr(A^T Σ̂xy B) s.t. A^T Σ̂x A = B^T Σ̂y B = Ir and exactly s nonzero rows for both A and B.
SLIDE 34
Two-Stage Procedure: II
Refinement by Sparse Regression – Group Lasso
Û = argmin_{L ∈ R^{p×r}} { tr(L^T Σ̂x L) − 2 tr(L^T Σ̂xy V^(0)) + ρu ∑_{j=1}^p ||L_j·|| },
V̂ = argmin_{R ∈ R^{m×r}} { tr(R^T Σ̂y R) − 2 tr(R^T Σ̂yx U^(0)) + ρv ∑_{j=1}^m ||R_j·|| }.
Motivation: the least squares problems
min_{L ∈ R^{p×r}} E ||L^T X − V^T Y||_F^2, min_{R ∈ R^{m×r}} E ||R^T Y − U^T X||_F^2,
whose minimizers are UΛ and VΛ.
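Each group-lasso subproblem is convex and can be solved by proximal gradient descent with row-wise group soft-thresholding. A sketch for the Û update (the step size, toy data, and names are mine):

```python
import numpy as np

def group_row_prox(L, t):
    # Row-wise group soft-thresholding: shrink each row's norm by t.
    norms = np.linalg.norm(L, axis=1, keepdims=True)
    return L * np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)

def refine_U(Sx, Sxy, V0, rho, iters=200):
    # Proximal gradient for
    #   min_L tr(L' Sx L) - 2 tr(L' Sxy V0) + rho * sum_j ||L_j.||.
    p, r = Sx.shape[0], V0.shape[1]
    step = 1.0 / (2.0 * np.linalg.eigvalsh(Sx)[-1])  # 1 / Lipschitz constant
    L = np.zeros((p, r))
    for _ in range(iters):
        grad = 2.0 * (Sx @ L - Sxy @ V0)
        L = group_row_prox(L - step * grad, step * rho)
    return L

# Toy check with Sx = I: the solution is Sxy @ V0 with each row's norm
# shrunk by rho / 2, so the output stays row-sparse.
p, m = 8, 6
theta = np.zeros((p, 1)); theta[:2, 0] = [0.8, 0.6]
eta = np.zeros((m, 1)); eta[:2, 0] = [0.6, 0.8]
Sx = np.eye(p)
Sxy = 0.9 * theta @ eta.T
L = refine_U(Sx, Sxy, eta, rho=0.1)
```

The row-wise penalty zeroes entire rows at once, which is what enforces the shared sparsity pattern across the r columns of Û.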
SLIDE 35
Statistical Optimality
Assume that s^2 log(p + m) / (n λ_r^2) ≤ c for some sufficiently small c ∈ (0, 1). Then, with high probability,
||P_Û − P_U||_F^2 ≤ C s (r + log p) / (n λ_r^2),
||P_V̂ − P_V||_F^2 ≤ C s (r + log m) / (n λ_r^2).
SLIDE 36
Computational Barrier
SLIDE 37
Computational Barrier
Consider r = 1. There is a set of Gaussian distributions G such that, if for some δ ∈ (0, 1)
lim_{n→∞} s^{2−δ} log(p + m) / (n λ^2) > 0,
then for any randomized polynomial-time estimator Û,
lim_{n→∞} sup_G E ||P_Û − P_U||_F^2 > c
for some constant c > 0, under the assumption that the Planted Clique Hypothesis holds.
SLIDE 38
Summary
- An elementary characterization of Sparse CCA is provided.
- A preliminary adaptive procedure (CAPIT) is proposed, but it needs to take advantage of the covariance structure.
- The minimax rate for Sparse CCA is nailed down, but the upper bound is achieved by exhaustively searching over the support.
- A new computationally feasible algorithm attains the minimax rate, but needs to assume the residual canonical correlation directions are zero.
- A computational barrier is established.