Information-theoretically Optimal Sparse PCA

Yash Deshpande and Andrea Montanari
Stanford University
July 3rd, 2014
Problem Definition

Y_λ = √(λ/n) xxᵀ + Z,   Y_λ, Z ∈ ℝ^{n×n}.

- Z symmetric: Z_ij = Z_ji, with Z_ij ∼ Normal(0, 1) independent for i ≤ j
- x_i ∼ Bernoulli(ε), i.i.d.
- Estimate X = xxᵀ from Y_λ
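To make the setup concrete, here is a minimal NumPy sketch of this observation model (the parameter values and the name sample_instance are illustrative choices of mine, not from the talk):

```python
import numpy as np

def sample_instance(n=2000, eps=0.2, lam=50.0, rng=None):
    """Sample (x, Y) from Y = sqrt(lam/n) * x x^T + Z, Z symmetric Gaussian."""
    rng = np.random.default_rng(rng)
    x = rng.binomial(1, eps, size=n).astype(float)  # x_i ~ Bernoulli(eps), i.i.d.
    G = rng.normal(size=(n, n))
    Z = np.triu(G) + np.triu(G, 1).T                # Z_ij = Z_ji, N(0,1) for i <= j
    Y = np.sqrt(lam / n) * np.outer(x, x) + Z
    return x, Y
```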
An example: gene expression data

[Baechler et al, 2003 PNAS]
- Genes × patients matrix
- Blue: lupus patients, aqua: healthy controls
- Black: a subset of immune-system-specific genes

This motivates a simple probabilistic model.
Related work

Detection and estimation: Y = X + noise.
- X ∈ S ⊂ {0, 1}ⁿ, a known set
- Goal: hypothesis testing, support recovery
- [Donoho, Jin 2004], [Addario-Berry et al. 2010], [Arias-Castro et al. 2011] …

Machine learning: maximize ⟨v, Y_λ v⟩ subject to ‖v‖₂ ≤ 1, v sparse.
- Goal: maximize "variance", support recovery
- [d'Aspremont et al. 2004], [Moghaddam et al. 2005], [Zou et al. 2006], [Amini, Wainwright 2009], [Papailiopoulos et al. 2013] …

Information theory: minimize ‖Y_λ − vvᵀ‖²_F + f(v).
- Probabilistic model for x, Y_λ
- Approximate message passing algorithms proposed
- [Rangan, Fletcher 2012], [Kabashima et al. 2014]
A first try: simple PCA

Y_λ = √(λ/n) xxᵀ + Z.

Estimate x using the scaled principal eigenvector x₁(Y_λ).
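A sketch of this baseline, reusing sample_instance from the first snippet (the √(nε) normalization of the overlap matches the next slide):

```python
import numpy as np

x, Y = sample_instance(n=2000, eps=0.2, lam=50.0, rng=0)   # lam * eps^2 = 2 > 1
vals, vecs = np.linalg.eigh(Y)
v1 = vecs[:, -1]                         # principal eigenvector x_1(Y_lambda)
overlap = abs(v1 @ x) / np.sqrt(len(x) * 0.2)
print(overlap)                           # bounded away from 0 iff lam * eps^2 > 1
```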
Limitations of PCA

If λε² > 1:   lim_{n→∞} |⟨x₁(Y_λ), x⟩| / √(nε) > 0 a.s.

If λε² < 1:   lim_{n→∞} |⟨x₁(Y_λ), x⟩| / √(nε) = 0 a.s.

[Knowles, Yin 2011]

[Figure: limiting spectral density of Y_λ/√n, supported approximately on [−2, 2], with and without an outlying spike eigenvalue.]
Our contributions

- Poly-time algorithm that exploits sparsity
- Provably optimal in terms of MSE when ε > εc
- "Single-letter" characterization of MMSE
Single letter characterization

Original high-dimensional problem:

Y_λ = √(λ/n) xxᵀ + Z,   M-mmse(λ, n) ≡ (1/n²) E{‖X − E{X|Y_λ}‖²_F}.

Scalar problem:

Y_λ = √λ X₀ + Z,   S-mmse(λ) ≡ E{(X₀ − E{X₀|Y_λ})²}.

Here X₀ ∼ Bernoulli(ε), Z ∼ Normal(0, 1).
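S-mmse(λ) reduces to a one-dimensional integral. The posterior-mean formula below is my own elementary derivation for the Bernoulli(ε) prior, used by the later sketches:

```python
import numpy as np

def s_mmse(lam, eps, grid=np.linspace(-15.0, 15.0, 6001)):
    """S-mmse(lam) = E{(X0 - E{X0 | sqrt(lam) X0 + Z})^2}, X0 ~ Bernoulli(eps)."""
    phi = lambda u: np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)
    a = eps * phi(grid - np.sqrt(lam))       # joint weight of {X0 = 1, Y = y}
    b = (1 - eps) * phi(grid)                # joint weight of {X0 = 0, Y = y}
    post = a / (a + b)                       # posterior mean E{X0 | Y = y}
    return eps - np.trapz(post ** 2 * (a + b), grid)   # eps - E{E{X0|Y}^2}
```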
Main result

Theorem (Deshpande, Montanari 2014)
There exists an εc < 1 such that the following happens. For every ε > εc,

lim_{n→∞} M-mmse(λ, n) = ε² − τ∗²,

where τ∗ = ε − S-mmse(λτ∗). Further, there exists a polynomial-time algorithm that achieves this MSE.

εc ≈ 0.05 (solution to a scalar non-linear equation)
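Given s_mmse from the previous sketch, the fixed point and the predicted limiting MMSE are easy to compute numerically. For ε > εc the fixed point is unique, so plain iteration suffices (a shortcut of mine, not the paper's method):

```python
def tau_star(lam, eps, iters=100):
    """Solve tau = eps - S-mmse(lam * tau) by fixed-point iteration."""
    tau = eps                                # start at the largest possible value
    for _ in range(iters):
        tau = eps - s_mmse(lam * tau, eps)
    return tau

t = tau_star(lam=50.0, eps=0.2)
print("limiting M-mmse:", 0.2 ** 2 - t ** 2)   # = eps^2 - tau_*^2
```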
Making use of sparsity

The power iteration with A = Y_λ/√n:   x^{t+1} = A x^t.

Improvement:   x^{t+1} = A F_t(x^t), where F_t(x^t) = (f_t(x^t_1), …, f_t(x^t_n))ᵀ.

Choose f_t to exploit sparsity; one concrete choice is sketched below.
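The talk leaves f_t open at this point. One natural sparsity-exploiting choice, anticipating the Bayes-optimal analysis that follows, is the scalar posterior mean for the effective channel y = μ x_i + √τ z (my instantiation; names and the closed form are mine):

```python
import numpy as np

def posterior_mean_denoiser(y, mu, tau, eps):
    """E{X0 | mu * X0 + sqrt(tau) * Z = y} for X0 ~ Bernoulli(eps).
    A smooth sparsity-aware shrinker: entries with weak evidence go to ~0."""
    log_odds = np.log(eps / (1 - eps)) + (mu * y - mu ** 2 / 2) / tau
    return 1.0 / (1.0 + np.exp(-log_odds))
```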
A heuristic analysis

Expanding the ith entry of x^{t+1}:

x^{t+1}_i = [√λ ⟨x, F_t(x^t)⟩ / n] x_i + (1/√n) Σ_j Z_ij f_t(x^t_j),

where the bracketed coefficient ≈ μ_t and the noise term ≈ Normal(0, τ_t).

Thus x^{t+1} ≈d μ_t x + √τ_t z, where z ∼ Normal(0, I_n).
Approximate Message Passing (AMP)

This analysis is obviously wrong, but… it is asymptotically exact for the modified iteration:

x^{t+1} = A x̃^t − b_t x̃^{t−1},   x̃^t = F_t(x^t).

[Donoho, Maleki, Montanari 2009], [Bayati, Montanari 2011], [Rangan, Fletcher 2012].
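A compact AMP sketch for this model, reusing the denoiser above. The Onsager coefficient b_t = (1/n) Σ_i f_t′(x^t_i) is the standard choice from the cited AMP papers; tracking (μ_t, τ_t) via τ = mean(f²) and μ = √λ·τ is a simplification justified by the state-evolution slides that follow, not the authors' reference code:

```python
import numpy as np

def amp(Y, eps, lam, iters=25):
    """AMP iteration x^{t+1} = A f_t(x^t) - b_t f_{t-1}(x^{t-1}), A = Y / sqrt(n)."""
    n = Y.shape[0]
    A = Y / np.sqrt(n)
    f = np.full(n, eps)                      # f_0: prior mean of x (informative since x >= 0)
    f_old, b = np.zeros(n), 0.0
    for _ in range(iters):
        xt = A @ f - b * f_old               # effective observation ~ mu_t * x + sqrt(tau_t) * z
        tau = np.mean(f ** 2)                # empirical estimate of tau_t
        mu = np.sqrt(lam) * tau              # holds for the posterior-mean f_t
        f_old, f = f, posterior_mean_denoiser(xt, mu, tau, eps)
        b = np.mean(f * (1 - f)) * mu / tau  # Onsager term: average of f_t'
    return f                                 # entrywise estimate of x
```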
Asymptotic behavior

[Figures: histograms of x^t_i − μ_t x_i at t = 2, 4, 8, 12, 16, for the power method (left) and AMP (right). The power-method residuals spread out as t grows, from a range of roughly [−2, 3] at t = 2 to hundreds by t = 16, while the AMP residuals stay Gaussian-shaped and concentrate, shrinking to roughly [−0.1, 0.15] by t = 16.]
Asymptotic behavior: a lemma

Lemma
Let f_t be a sequence of Lipschitz functions. For every fixed t and uniformly random i, the empirical distribution of (x_i, x^t_i) converges to that of (X₀, μ_t X₀ + √τ_t Z) almost surely.
State evolution

Deterministic recursions:

μ_{t+1} = √λ E{X₀ f_t(μ_t X₀ + √τ_t Z)},
τ_{t+1} = E{f_t(μ_t X₀ + √τ_t Z)²}.

With the optimal (posterior-mean) f_t:

μ_{t+1} = √λ τ_{t+1},
τ_{t+1} = ε − S-mmse(λτ_t).
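With s_mmse from before, the optimal-f_t state evolution is a two-line recursion. The start τ₀ = ε² below matches the all-ε initialization used in the AMP sketch; it is my choice, not stated on the slide:

```python
def state_evolution(lam, eps, t_max=30):
    """Track tau_t under tau_{t+1} = eps - S-mmse(lam * tau_t)."""
    taus = [eps ** 2]                        # tau_0 from initializing at the prior mean
    for _ in range(t_max):
        taus.append(eps - s_mmse(lam * taus[-1], eps))
    return taus                              # taus[-1] approximates tau_*
```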
State evolution: an illustration

[Figure: cobweb plot of the map τ_t ↦ ε − S-mmse(λτ_t) against the diagonal τ_{t+1} = τ_t; the iterates τ₁, τ₂, τ₃, … converge to the fixed point τ∗, where M-mmse(λ) = ε² − τ∗².]
Proof sketch: MSE expression

Using the estimator X̂^t = x̃^t (x̃^t)ᵀ:

mse(X̂^t, λ) = (1/n²) E{‖x̃^t (x̃^t)ᵀ − xxᵀ‖²_F}
            = (1/n²) E{‖x‖⁴} + (1/n²) E{‖x̃^t‖⁴} − (2/n²) E{⟨x̃^t, x⟩²}
            → ε² − τ²_{t+1}.

Thus mse_AMP(λ) = lim_{t→∞} lim_{n→∞} mse(X̂^t, λ) = ε² − τ∗².
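As a sanity check, the empirical matrix MSE of the AMP estimator can be compared against the prediction ε² − τ∗², tying the earlier sketches together (finite-n agreement is only approximate, and this inherits all the simplifications above):

```python
import numpy as np

eps, lam = 0.2, 50.0
x, Y = sample_instance(n=3000, eps=eps, lam=lam, rng=1)
f = amp(Y, eps, lam)
emp = np.sum((np.outer(f, f) - np.outer(x, x)) ** 2) / len(x) ** 2
print("empirical:", emp, "predicted:", eps ** 2 - tau_star(lam, eps) ** 2)
```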
Proof sketch: I-MMSE identity

Since AMP is one particular estimator, M-mmse(λ) ≤ mse_AMP(λ) for every λ. Integrating:

(1/4) ∫₀^∞ M-mmse(λ) dλ ≤ (1/4) ∫₀^∞ mse_AMP(λ) dλ.

By the I-MMSE identity, the left-hand side equals I(X; Y_∞) − I(X; Y_0) = h(ε) (the binary entropy), while the right-hand side is

(1/4) ∫₀^∞ (ε² − τ∗(ε, λ)²) dλ = h(ε).

The two integrals coincide, so the pointwise inequality is an equality for almost every λ: M-mmse(λ) = ε² − τ∗².
Conclusion

Some open problems…
- MMSE characterization with multiple fixed points
- General distributions for x