SLIDE 1

Information-theoretically Optimal Sparse PCA

Yash Deshpande and Andrea Montanari

Stanford University

July 3rd, 2014

SLIDES 2-3

Problem Definition

Observe the symmetric n × n matrix

Y_λ = √(λ/n) x xᵀ + Z,

where Z_ij = Z_ji, with x_i ∼ Bernoulli(ε) and Z_ij ∼ Normal(0, 1) independent. Estimate X = x xᵀ from Y_λ.
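
A minimal numpy sketch of this observation model (not from the talk; the function name and all parameter values are illustrative):

```python
import numpy as np

def sample_instance(n, eps, lam, rng):
    """Draw (x, Y) from Y = sqrt(lam/n) * x x^T + Z with symmetric Gaussian Z."""
    x = rng.binomial(1, eps, size=n).astype(float)             # x_i ~ Bernoulli(eps)
    g = rng.standard_normal((n, n))
    Z = np.triu(g, 1) + np.triu(g, 1).T + np.diag(np.diag(g))  # Z_ij = Z_ji ~ N(0, 1)
    Y = np.sqrt(lam / n) * np.outer(x, x) + Z
    return x, Y

rng = np.random.default_rng(0)
x, Y = sample_instance(n=2000, eps=0.1, lam=150.0, rng=rng)    # lam * eps^2 = 1.5 > 1
```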

SLIDES 4-5

An example: gene expression data

[Baechler et al., 2003, PNAS]

  • Genes × patients matrix
  • Blue: lupus patients; aqua: healthy controls
  • Black: a subset of immune-system-specific genes

A simple probabilistic model.

SLIDE 6

Related work

Detection and estimation: Y = X + noise.

  • X ∈ S ⊂ {0, 1}^n, a known set
  • Goal: hypothesis testing, support recovery
  • [Donoho, Jin 2004], [Addario-Berry et al. 2010], [Arias-Castro et al. 2011], ...

SLIDE 7

Related work

Machine learning: maximize ⟨v, Y_λ v⟩ subject to ‖v‖_2 ≤ 1, v sparse.

  • Goal: maximize "variance", support recovery
  • [d'Aspremont et al. 2004], [Moghaddam et al. 2005], [Zou et al. 2006], [Amini, Wainwright 2009], [Papailiopoulos et al. 2013], ...

SLIDE 8

Related work

Information theory: minimize ‖Y_λ − v vᵀ‖²_F + f(v).

  • Probabilistic model for x, Y_λ
  • Propose approximate message passing algorithms
  • [Rangan, Fletcher 2012], [Kabashima et al. 2014]

SLIDES 9-10

A first try: simple PCA

Y_λ = √(λ/n) x xᵀ + Z.

Estimate x using the scaled principal eigenvector x_1(Y_λ).
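
A sketch of this estimator, reusing x and Y from the sampler above (the overlap normalization follows the next slide):

```python
import numpy as np

# PCA estimate: principal eigenvector of Y.
n, eps = len(x), 0.1                        # same eps as in the sampler call
evals, evecs = np.linalg.eigh(Y)            # eigenvalues in ascending order
v1 = evecs[:, -1]                           # principal eigenvector x_1(Y)
overlap = abs(v1 @ x) / np.sqrt(n * eps)    # <x_1(Y), x> / sqrt(n * eps)
print(f"normalized overlap: {overlap:.3f}")
```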

SLIDES 11-14

Limitations of PCA

[Figure: limiting spectral density, bulk between −2 and 2.]

If λε² > 1:
  lim_{n→∞} ⟨x_1(Y_λ), x⟩ / √(nε) > 0 a.s.

If λε² < 1:
  lim_{n→∞} ⟨x_1(Y_λ), x⟩ / √(nε) = 0 a.s.

[Knowles, Yin 2011]
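
A quick simulation sketch of this threshold, reusing sample_instance from above (the grid of λ values is illustrative, and at finite n the transition is blurred near λε² = 1):

```python
import numpy as np

# Sweep lambda across the predicted threshold lam * eps^2 = 1.
n, eps = 3000, 0.1
rng = np.random.default_rng(1)
for lam in [50.0, 100.0, 200.0]:            # lam * eps^2 = 0.5, 1.0, 2.0
    x, Y = sample_instance(n, eps, lam, rng)
    v1 = np.linalg.eigh(Y)[1][:, -1]
    print(lam, abs(v1 @ x) / np.sqrt(n * eps))
```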

SLIDES 15-17

Our contributions

  • Poly-time algorithm that exploits sparsity
  • Provably optimal in terms of MSE when ε > ε_c
  • "Single-letter" characterization of the MMSE

SLIDES 18-22

Single-letter characterization

Original high-dimensional problem:

Y_λ = √(λ/n) x xᵀ + Z,   M-mmse(λ, n) ≡ (1/n²) E{‖X − E{X | Y_λ}‖²_F}.

Scalar problem:

Y_λ = √λ X_0 + Z,   S-mmse(λ) ≡ E{(X_0 − E{X_0 | Y_λ})²}.

Here X_0 ∼ Bernoulli(ε), Z ∼ Normal(0, 1).
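
For the scalar channel, the posterior mean has a closed form, and S-mmse can be computed numerically. A sketch (not from the talk), using the identity E[(X_0 − g(Y))²] = ε − E[g(Y)²], which holds because g is the conditional mean:

```python
import numpy as np
from scipy.stats import norm

def posterior_mean(y, s, eps):
    """g(y) = E[X_0 | sqrt(s) X_0 + Z = y], X_0 ~ Bernoulli(eps), Z ~ N(0, 1)."""
    a = eps * norm.pdf(y - np.sqrt(s))
    b = (1.0 - eps) * norm.pdf(y)
    return a / (a + b)

def s_mmse(s, eps):
    """S-mmse(s) by quadrature, via eps - E[g(Y)^2]."""
    y = np.linspace(-12.0, 12.0 + np.sqrt(s), 4001)   # covers both mixture components
    p = eps * norm.pdf(y - np.sqrt(s)) + (1.0 - eps) * norm.pdf(y)
    g = posterior_mean(y, s, eps)
    return eps - np.sum(g**2 * p) * (y[1] - y[0])
```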

SLIDES 23-24

Main result

Theorem (Deshpande, Montanari 2014)

There exists ε_c < 1 such that the following happens. For every ε > ε_c,

lim_{n→∞} M-mmse(λ, n) = ε² − τ_*²,

where τ_* = ε − S-mmse(λτ_*). Further, there exists a polynomial-time algorithm that achieves this MSE.

ε_c ≈ 0.05 (solution of a scalar non-linear equation)
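
The fixed point τ_* can be found by iterating the map τ ↦ ε − S-mmse(λτ). A sketch reusing s_mmse above; the starting point and iteration count are arbitrary choices, and with multiple fixed points (see the open problems at the end) the iteration would only find one of them:

```python
# Solve the fixed-point equation tau_* = eps - S-mmse(lam * tau_*) by iteration.
def tau_star(lam, eps, iters=100):
    tau = eps                       # start from tau = eps (the zero-error end)
    for _ in range(iters):
        tau = eps - s_mmse(lam * tau, eps)
    return tau

lam, eps = 150.0, 0.1
t_star = tau_star(lam, eps)
print("tau_*:", t_star, " predicted lim M-mmse:", eps**2 - t_star**2)
```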

SLIDES 25-27

Making use of sparsity

The power iteration with A = Y_λ/√n:

x^{t+1} = A x^t.

Improvement:

x^{t+1} = A F_t(x^t),   where F_t(x^t) = (f_t(x^t_1), ..., f_t(x^t_n))ᵀ.

Choose f_t to exploit sparsity.

SLIDES 28-29

A heuristic analysis

Expanding the i-th entry of x^{t+1}:

x^{t+1}_i = [ √λ ⟨x, F_t(x^t)⟩ / n ] x_i + (1/√n) Σ_j Z_ij f_t(x^t_j),

where the bracketed coefficient concentrates around μ_t and the noise term is approximately Normal(0, τ_t).

Thus x^{t+1} ≈ μ_t x + √τ_t z in distribution, where z ∼ Normal(0, I_n).

SLIDES 30-31

Approximate Message Passing (AMP)

This analysis is obviously wrong, but... it is asymptotically exact for the modified iteration:

x^{t+1} = A x̂^t − b_t x̂^{t−1},   x̂^t = F_t(x^t).

[Donoho, Maleki, Montanari 2009], [Bayati, Montanari 2011], [Rangan, Fletcher 2012]
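
A runnable sketch of this iteration with the posterior-mean denoiser. The slide does not spell out b_t or the initialization; the sketch assumes the standard Onsager coefficient b_t = (1/n) Σ_i f_t'(x^t_i) from the AMP literature, starts from the prior mean x̂^0 = ε·1, and tracks (μ_t, τ_t) by Monte Carlo state evolution:

```python
import numpy as np

def denoise(y, mu, tau, eps):
    """Posterior mean of X_0 given mu*X_0 + sqrt(tau)*Z = y, X_0 ~ Bernoulli(eps)."""
    u = (mu / tau) * y - mu**2 / (2.0 * tau) + np.log(eps / (1.0 - eps))
    return 1.0 / (1.0 + np.exp(-u))            # a sigmoid, so f' = (mu/tau) f (1 - f)

def amp(Y, lam, eps, iters=25, mc=200_000, seed=0):
    """AMP sketch: x^{t+1} = A f_t(x^t) - b_t f_{t-1}(x^{t-1})."""
    n = Y.shape[0]
    A = Y / np.sqrt(n)
    rng = np.random.default_rng(seed)
    X0 = rng.binomial(1, eps, mc).astype(float)   # Monte Carlo sample for state evolution
    W = rng.standard_normal(mc)
    xhat_prev = eps * np.ones(n)                  # xhat^0 = prior mean
    xt = A @ xhat_prev                            # x^1 (f_0 is constant, so b_0 = 0)
    mu, tau = np.sqrt(lam) * eps**2, eps**2       # SE parameters describing x^1
    for _ in range(iters):
        xhat = denoise(xt, mu, tau, eps)                 # xhat^t = F_t(x^t)
        b = np.mean((mu / tau) * xhat * (1.0 - xhat))    # b_t = (1/n) sum_i f_t'(x^t_i)
        xt = A @ xhat - b * xhat_prev                    # AMP update with Onsager term
        xhat_prev = xhat
        f = denoise(mu * X0 + np.sqrt(tau) * W, mu, tau, eps)
        tau = np.mean(f**2)                              # tau_{t+1} = E[f_t(...)^2]
        mu = np.sqrt(lam) * tau                          # mu_{t+1} = sqrt(lam) tau_{t+1}
    return xhat

xhat = amp(Y, lam=150.0, eps=0.1)                 # Y from the sampler sketch above
```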

SLIDES 32-36

Asymptotic behavior

[Figures for t = 2, 4, 8, 12, 16: histograms of x^t_i − μ_t x_i under the power method (left panels) and AMP (right panels). As t grows, the power-method residuals spread over an increasingly wide range, while the AMP residuals stay bell-shaped with shrinking width.]

SLIDE 37

Asymptotic behavior: a lemma

Lemma

Let (f_t) be a sequence of Lipschitz functions. For every fixed t and a uniformly random coordinate i:

(x_i, x^t_i) → (X_0, μ_t X_0 + √τ_t Z) in distribution, almost surely

(that is, the empirical distribution of the pairs converges, almost surely).

SLIDES 38-39

State evolution

Deterministic recursions:

μ_{t+1} = √λ E{X_0 f_t(μ_t X_0 + √τ_t Z)},
τ_{t+1} = E{f_t(μ_t X_0 + √τ_t Z)²}.

With the optimal f_t:

μ_{t+1} = √λ τ_{t+1},
τ_{t+1} = ε − S-mmse(λτ_t).
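
With s_mmse from the scalar slide, the optimal-f_t recursion is two lines. A sketch printing the trajectory, starting from τ_1 = ε², which matches the prior-mean initialization used in the AMP sketch:

```python
import numpy as np

# State-evolution trajectory with the optimal f_t (reusing s_mmse from earlier).
lam, eps = 150.0, 0.1
tau = eps**2                            # tau_1 under the prior-mean initialization
for t in range(1, 16):
    print(t, tau, np.sqrt(lam) * tau)   # t, tau_t, mu_t = sqrt(lam) * tau_t
    tau = eps - s_mmse(lam * tau, eps)
```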

SLIDES 40-48

State evolution: an illustration

[Figure: cobweb plot of the map τ_t ↦ ε − S-mmse(λτ_t) against the diagonal τ_{t+1} = τ_t. Successive iterates τ_1, τ_2, τ_3, ... climb toward the fixed point τ_*, at which M-mmse(λ) = ε² − τ_*².]

SLIDES 49-50

Proof sketch: MSE expression

Using the estimator X̂^t = x̂^t (x̂^t)ᵀ:

mse(X̂^t, λ) = (1/n²) E{‖x̂^t (x̂^t)ᵀ − x xᵀ‖²_F}
            = (1/n²) E{‖x‖⁴} + (1/n²) E{‖x̂^t‖⁴} − (2/n²) E{⟨x̂^t, x⟩²}
            → ε² − τ_{t+1}².

Thus

mse_AMP(λ) = lim_{t→∞} lim_{n→∞} mse(X̂^t, λ) = ε² − τ_*².
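
An end-to-end check, reusing the earlier sketches (sample_instance, amp, tau_star); with ε = 0.1 > ε_c, the empirical MSE should approach the predicted limit ε² − τ_*² (values are illustrative):

```python
import numpy as np

# Compare the empirical AMP MSE with the predicted limit.
rng = np.random.default_rng(2)
n, eps, lam = 3000, 0.1, 150.0
x, Y = sample_instance(n, eps, lam, rng)
xhat = amp(Y, lam, eps)
mse = np.sum((np.outer(xhat, xhat) - np.outer(x, x))**2) / n**2
t_star = tau_star(lam, eps)
print("empirical:", mse, " predicted:", eps**2 - t_star**2)
```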

SLIDES 51-56

Proof sketch: I-MMSE identity

By definition of the MMSE, M-mmse(λ) ≤ mse_AMP(λ). Integrating over λ,

(1/4) ∫₀^∞ M-mmse(λ) dλ ≤ (1/4) ∫₀^∞ mse_AMP(λ) dλ.

By the I-MMSE identity, the left-hand side equals I(X; Y_∞) − I(X; Y_0) = h(ε), where h(·) is the binary entropy function. Evaluating the right-hand side,

(1/4) ∫₀^∞ (ε² − τ_*(ε, λ)²) dλ = h(ε).

The two sides agree, so the inequality holds with equality for (almost) every λ: M-mmse(λ) = mse_AMP(λ).

SLIDES 57-58

Conclusion

Some open problems...

  • MMSE characterization with multiple fixed points
  • General distributions for x

Thanks!