Spectral Methods for Latent Variable Models


SLIDE 1

Spectral Methods for Latent Variable Models

Kaizheng Wang
Department of ORFE, Princeton University
March 20th, 2020

SLIDE 2

Data Diversity

Unstructured, heterogeneous and incomplete information:

Credit: https://www.mathworks.com/help/textanalytics/gs/getting-started-with-topic-modeling.html, https://www.alliance-scotland.org.uk/alliance-homepage-holding-people-networking-2017-01-3/, https://medicalxpress.com/news/2015-04-tumor-only-genetic-sequencing-misguide-cancer.html, https://www.nature.com/articles/nature21386/figures/1, https://viterbi-web.usc.edu/soltanol/RSC.pdf, Dzenan Hamzic


SLIDE 3

Matrix Representations

Credit (upper right): https://viterbi-web.usc.edu/soltanol/RSC.pdf.

Object-by-feature ($n \times d$):

  • Texts: document-term;
  • Genetics: individual-marker;
  • Recomm. systems: user-item.

Object-by-object ($n \times n$):

  • Networks: adjacency matrices.
SLIDE 4

Matrix Representations

Common belief: high ambient dim. but low intrinsic dim. Low-rank approximation:

$$X \;=\; \underbrace{L}_{\text{low rank}} \;+\; \underbrace{E}_{\text{residual}}$$

SLIDE 5

Matrix Representations

Principal Component Analysis (PCA) — truncated SVD:

$$\underbrace{\begin{pmatrix} X_1^\top \\ \vdots \\ X_n^\top \end{pmatrix}}_{n \times d \ (\text{samples})} = \underbrace{\begin{pmatrix} f_1^\top \\ \vdots \\ f_n^\top \end{pmatrix}}_{n \times r \ (\text{latent coordinates})} \underbrace{B}_{r \times d \ (\text{latent bases})} + \underbrace{\begin{pmatrix} E_1^\top \\ \vdots \\ E_n^\top \end{pmatrix}}_{n \times d \ (\text{noises})}.$$

Low-dimensional embedding via latent variables.
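The factor model above is easy to make concrete. Below is a minimal numerical sketch (not from the talk; the sizes $n = 500$, $d = 200$, $r = 3$ and the noise scale are illustrative assumptions) that simulates $X = FB + E$ and recovers an $r$-dimensional embedding with a truncated SVD:

```python
# Minimal sketch: simulate the factor model X = F B + E and embed via
# truncated SVD. All sizes and the noise scale are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 500, 200, 3

F = rng.normal(size=(n, r))        # latent coordinates (n x r)
B = rng.normal(size=(r, d))        # latent bases       (r x d)
E = 0.5 * rng.normal(size=(n, d))  # noises             (n x d)
X = F @ B + E                      # samples            (n x d)

# Truncated SVD: keep only the top-r singular triplets.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
embedding = U[:, :r] * s[:r]       # PCA-style scores, one row per sample

print(embedding.shape)             # (500, 3)
```

The rows of `embedding` estimate the latent coordinates $f_i$ up to an invertible linear transform, which is all that downstream tasks such as visualization typically need.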

SLIDE 6

Example: Genes Mirror Geography within Europe

Novembre et al. (2008), Nature. n = 1387 individuals and d = 197146 SNPs; Figure 1a: 2-dim. embedding vs. labels.

[Figure 1a: two-dimensional PCA embedding; axes PC1 vs. PC2.]

SLIDE 7

Outline

  • Distributed PCA and linearization of eigenvectors
  • An ℓp theory for spectral methods
  • Summary and future directions
SLIDE 8

Distributed PCA and linearization of eigenvectors

SLIDE 9

Principal Component Analysis

Data: $\{X_i\}_{i=1}^n \subseteq \mathbb{R}^d$ i.i.d., $\mathbb{E} X_i = 0$, $\mathbb{E}(X_i X_i^\top) = \Sigma$.

Goal: estimate the principal subspace spanned by the $K$ leading eigenvectors of $\Sigma$.

PCA:

$$X = (X_1, \cdots, X_n)^\top \xrightarrow{\text{SVD}} \hat{U} = (\hat{u}_1, \cdots, \hat{u}_K) \in \mathbb{R}^{d \times K}.$$

SLIDE 10

Distributed PCA

[Figure: a central server connected to local machines Data1–Data5.]

SLIDE 11

Distributed PCA

$m$ local machines in total, each has $n$ samples.

  • 1. PCA in parallel: the ℓ-th machine conducts
    $$X^{(\ell)} \in \mathbb{R}^{n \times d} \xrightarrow{\text{SVD}} \hat{U}^{(\ell)} \in \mathbb{R}^{d \times K}$$
    and sends $\hat{U}^{(\ell)}$ to the central server;
  • 2. Aggregation: $\{\hat{U}^{(\ell)}\}_{\ell=1}^m \to \hat{U} \in \mathbb{R}^{d \times K}$.

Related works: McDonald et al. 2009; Zhang et al. 2013; Lee et al. 2015; Battey et al. 2015; Qu et al. 2002; El Karoui and d'Aspremont 2010; Liang et al. 2014.

SLIDE 12

Center of Subspaces

How to find $\hat{U} \in \mathbb{O}^{d \times K}$ that best summarizes $\{\hat{U}^{(\ell)}\}_{\ell=1}^m$?

SLIDE 13

Center of Subspaces

How to find $\hat{U} \in \mathbb{O}^{d \times K}$ that best summarizes $\{\hat{U}^{(\ell)}\}_{\ell=1}^m$?

  • Subspace distance: $\rho(V, W) = \| V V^\top - W W^\top \|_F$.
  • Least squares: $\hat{U} = \operatorname{argmin}_{V \in \mathbb{O}^{d \times K}} \sum_{\ell=1}^m \rho^2(V, \hat{U}^{(\ell)})$.
  • Algorithm: $(\hat{U}^{(1)}, \cdots, \hat{U}^{(m)}) \in \mathbb{R}^{d \times mK} \xrightarrow{\text{SVD}} \hat{U} \in \mathbb{O}^{d \times K}$.
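A minimal sketch of the whole distributed pipeline (illustrative only: the spiked covariance, the sizes, and the seed are assumptions, not the paper's experiment). It aggregates by taking the $K$ leading eigenvectors of the averaged projector $\frac{1}{m}\sum_\ell \hat{U}^{(\ell)} \hat{U}^{(\ell)\top}$, which solves the least-squares problem above:

```python
# Minimal sketch of distributed PCA: local top-K eigenvectors, then
# aggregation via the averaged projection matrix. Illustrative parameters.
import numpy as np

rng = np.random.default_rng(1)
d, K, m, n = 50, 3, 10, 200

# Spiked population covariance: top-K eigenvalues 10, the rest 1.
evals = np.concatenate([np.full(K, 10.0), np.ones(d - K)])
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
Sigma = (Q * evals) @ Q.T

def local_pca(X, K):
    """Top-K eigenvectors of the local sample covariance."""
    S = X.T @ X / X.shape[0]
    w, V = np.linalg.eigh(S)           # ascending eigenvalues
    return V[:, -K:]

# 1. PCA in parallel on m machines with n samples each.
locals_ = [local_pca(rng.multivariate_normal(np.zeros(d), Sigma, size=n), K)
           for _ in range(m)]

# 2. Aggregation: top-K eigenvectors of (1/m) * sum_l U_l U_l^T.
P_avg = sum(U @ U.T for U in locals_) / m
w, V = np.linalg.eigh(P_avg)
U_hat = V[:, -K:]

# Subspace distance rho(V, W) = ||V V^T - W W^T||_F to the truth.
U_true = Q[:, :K]
rho = np.linalg.norm(U_hat @ U_hat.T - U_true @ U_true.T)
print(f"subspace error rho(U_hat, U): {rho:.3f}")
```

Taking the $K$ leading eigenvectors of the averaged projector is equivalent to taking the $K$ leading left singular vectors of the concatenation $(\hat{U}^{(1)}, \cdots, \hat{U}^{(m)})$, since that concatenation times its transpose equals $m$ times the averaged projector.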

SLIDE 14

Theoretical Results

Assume that $X_i$ is sub-Gaussian, i.e. $\|\Sigma^{-1/2} X_i\|_{\psi_2} \lesssim 1$. Define the effective rank and condition number as

$$r = \operatorname{Tr}(\Sigma)/\lambda_1, \qquad \kappa = \lambda_1/(\lambda_K - \lambda_{K+1}).$$

SLIDE 15

Theoretical Results

Assume that $X_i$ is sub-Gaussian, i.e. $\|\Sigma^{-1/2} X_i\|_{\psi_2} \lesssim 1$. Define the effective rank and condition number as

$$r = \operatorname{Tr}(\Sigma)/\lambda_1, \qquad \kappa = \lambda_1/(\lambda_K - \lambda_{K+1}).$$

Theorem (FWWZ, AoS 2019)

There exists a constant $C$ such that when $n \geq C \kappa^2 \sqrt{Kr}$,

$$\| \hat{U} \hat{U}^\top - U U^\top \|_F \lesssim \underbrace{\kappa \sqrt{\frac{Kr}{mn}}}_{\text{variance}} + \underbrace{\kappa^2 \frac{\sqrt{Kr}}{n}}_{\text{bias}}.$$

SLIDE 16

Theoretical Results

Assume that $X_i$ is sub-Gaussian, i.e. $\|\Sigma^{-1/2} X_i\|_{\psi_2} \lesssim 1$. Define the effective rank and condition number as

$$r = \operatorname{Tr}(\Sigma)/\lambda_1, \qquad \kappa = \lambda_1/(\lambda_K - \lambda_{K+1}).$$

Theorem (FWWZ, AoS 2019)

There exists a constant $C$ such that when $n \geq C \kappa^2 \sqrt{Kr}$,

$$\| \hat{U} \hat{U}^\top - U U^\top \|_F \lesssim \underbrace{\kappa \sqrt{\frac{Kr}{mn}}}_{\text{variance}} + \underbrace{\kappa^2 \frac{\sqrt{Kr}}{n}}_{\text{bias}}.$$

  • If $m \lesssim n/(\kappa^2 r)$, distributed PCA is optimal.
  • The condition $n \geq C \kappa^2 \sqrt{Kr}$ cannot be improved.

SLIDE 17

Analysis of Aggregation

$$X^{(1)} \in \mathbb{R}^{n \times d} \xrightarrow{\text{SVD}} \hat{U}^{(1)} \in \mathbb{O}^{d \times K}, \quad \ldots, \quad X^{(m)} \in \mathbb{R}^{n \times d} \xrightarrow{\text{SVD}} \hat{U}^{(m)} \in \mathbb{O}^{d \times K}.$$

$\hat{U} \in \mathbb{O}^{d \times K}$: eigenvectors of $\frac{1}{m} \sum_{\ell=1}^m \hat{U}^{(\ell)} \hat{U}^{(\ell)\top}$.

SLIDE 18

Analysis of Aggregation

$\hat{U}$: eigenvectors of $\frac{1}{m} \sum_{\ell=1}^m \hat{U}^{(\ell)} \hat{U}^{(\ell)\top}$. Averaging reduces variance but retains bias.

  • Variance: controlled by Davis-Kahan:
    $$\| \hat{U}^{(\ell)} \hat{U}^{(\ell)\top} - U U^\top \|_F \lesssim \| (\hat{\Sigma}^{(\ell)} - \Sigma) U \|_F / \Delta.$$
  • Bias: how large is it?

SLIDE 19

Linearization of Eigenvectors

Theorem (FWWZ, AoS 2019)

$$\big\| \hat{U}^{(\ell)} \hat{U}^{(\ell)\top} - [U U^\top + f(\hat{\Sigma}^{(\ell)} - \Sigma)] \big\|_F \lesssim \big[ \| (\hat{\Sigma}^{(\ell)} - \Sigma) U \|_F / \Delta \big]^2,$$

where $f: \mathbb{R}^{d \times d} \to \mathbb{R}^{d \times d}$ is a linear functional determined by $\Sigma$.

SLIDE 20

Linearization of Eigenvectors

Theorem (FWWZ, AoS 2019)

$$\big\| \hat{U}^{(\ell)} \hat{U}^{(\ell)\top} - [U U^\top + f(\hat{\Sigma}^{(\ell)} - \Sigma)] \big\|_F \lesssim \big[ \| (\hat{\Sigma}^{(\ell)} - \Sigma) U \|_F / \Delta \big]^2,$$

where $f: \mathbb{R}^{d \times d} \to \mathbb{R}^{d \times d}$ is a linear functional determined by $\Sigma$.

More precise than Davis-Kahan: $\| \hat{U}^{(\ell)} \hat{U}^{(\ell)\top} - U U^\top \|_F \lesssim \| (\hat{\Sigma}^{(\ell)} - \Sigma) U \|_F / \Delta$.

PCA has small bias: since $f$ is linear and $\mathbb{E} \hat{\Sigma}^{(\ell)} = \Sigma$, the linear term vanishes in expectation, so

$$\big\| \mathbb{E}( \hat{U}^{(\ell)} \hat{U}^{(\ell)\top} ) - U U^\top \big\|_F \lesssim \big[ \| (\hat{\Sigma}^{(\ell)} - \Sigma) U \|_F / \Delta \big]^2.$$
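A small Monte-Carlo illustration of the small-bias claim (an assumption-laden sketch with a diagonal spiked covariance, not the paper's experiment): averaging many independent projectors $\hat{u} \hat{u}^\top$ lands much closer to $u u^\top$ than any single one, because the first-order fluctuation averages out:

```python
# Monte-Carlo sketch: the bias of the averaged PCA projector is an order of
# magnitude smaller than the one-shot error. Parameters are assumptions.
import numpy as np

rng = np.random.default_rng(7)
d, n, trials = 20, 200, 500
evals = np.concatenate([[5.0], np.ones(d - 1)])
u = np.eye(d)[:, :1]                 # true top eigenvector of Sigma = diag(evals)

projs, errs = [], []
for _ in range(trials):
    X = rng.normal(size=(n, d)) * np.sqrt(evals)   # rows X_i ~ N(0, Sigma)
    w, V = np.linalg.eigh(X.T @ X / n)
    uh = V[:, -1:]                                  # top sample eigenvector
    projs.append(uh @ uh.T)
    errs.append(np.linalg.norm(uh @ uh.T - u @ u.T))

bias = np.linalg.norm(np.mean(projs, axis=0) - u @ u.T)
print(f"typical one-shot error: {np.mean(errs):.3f}")  # first order
print(f"bias of the average:    {bias:.4f}")           # second order, much smaller
```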

SLIDE 21

Summary

Theoretical guarantees for distributed PCA:

  • Bias and variance of PCA;
  • Linearization of eigenvectors, high-order Davis-Kahan.

Paper (alphabetical order):

  • Fan, Wang, Wang and Zhu. Distributed estimation of principal eigenspaces. The Annals of Statistics, 2019.

SLIDE 22

Example: Genes Mirror Geography within Europe

Novembre et al. (2008), Nature. n = 1387 individuals and d = 197146 SNPs; Figure 1a: 2-dim. embedding vs. labels.

[Figure 1a: two-dimensional PCA embedding; axes PC1 vs. PC2.]

SLIDE 23

A Pipeline for Spectral Methods

  • 1. Similarity matrix construction: e.g. Gram $X X^\top$, adjacency $A$;
  • 2. Spectral decomposition: get $r$ eigen-pairs $\{\lambda_j, u_j\}_{j=1}^r$;
  • 3. $r$-dim. embedding: e.g. using the rows of $(u_1, u_2, \ldots, u_r)$;
  • 4. Downstream tasks: e.g. visualization.

Ext.: { robust, probabilistic, sparse, nonnegative } PCA.

Pearson (1901), Hotelling (1933), Schölkopf (1997), Tipping and Bishop (1999), Shi and Malik (2000), Ng et al. (2002), Belkin and Niyogi (2003), Von Luxburg (2007).
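The four steps, spelled out on a toy two-cluster dataset (a hedged sketch: the Gram similarity, the sizes, and the sign-based downstream step are illustrative choices, not the talk's):

```python
# Minimal sketch of the four-step pipeline on a toy two-cluster dataset.
import numpy as np

rng = np.random.default_rng(2)
n, d, r = 200, 30, 2
labels = np.repeat([1.0, -1.0], n // 2)
mu = rng.normal(size=d)
X = np.outer(labels, mu) + rng.normal(size=(n, d))

# 1. Similarity matrix construction: here the Gram matrix X X^T.
S = X @ X.T

# 2. Spectral decomposition: the r leading eigen-pairs.
w, V = np.linalg.eigh(S)              # ascending eigenvalues
lambdas, U = w[-r:], V[:, -r:]

# 3. r-dimensional embedding: row i of U represents object i.
embedding = U

# 4. Downstream task: cluster by the sign of the leading eigenvector.
clusters = np.sign(U[:, -1])
accuracy = max(np.mean(clusters == labels), np.mean(clusters == -labels))
print(f"clustering accuracy: {accuracy:.2f}")
```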

SLIDE 24

An ℓp theory for spectral methods

  • Network analysis and Wigner-type matrices
  • Mixture model and Wishart-type matrices
SLIDE 25

Community Detection and SBM


Community detection in networks:

Credit: Yuxin Chen.

SLIDE 26

Community Detection and SBM

Community detection in networks:

Credit: Yuxin Chen.

Stochastic Block Model (Holland et al., 1983)

Symmetric adjacency matrix $A \in \{0, 1\}^{n \times n}$, $|J| = |J^c| = n/2$:

$$\mathbb{P}(A_{ij} = 1) = \begin{cases} p, & \text{if } i, j \in J \text{ or } i, j \in J^c, \\ q, & \text{if } i \in J, j \in J^c \text{ or } i \in J^c, j \in J. \end{cases}$$

McSherry (2001), Coja-Oghlan (2006), Rohe et al. (2011), Mossel et al. (2013), Massoulie (2014), Lelarge et al. (2015), Chin et al. (2015), Abbe et al. (2016), Zhang and Zhou (2016).

SLIDE 27

Community Detection and SBM

Signal + noise decomposition: $A = \mathbb{E}A + (A - \mathbb{E}A)$, where

$$\mathbb{E}A = \begin{pmatrix} p\,\mathbf{1}_{J,J} & q\,\mathbf{1}_{J,J^c} \\ q\,\mathbf{1}_{J^c,J} & p\,\mathbf{1}_{J^c,J^c} \end{pmatrix} = \frac{p+q}{2}\,\mathbf{1}\mathbf{1}^\top + \frac{p-q}{2} \begin{pmatrix} \mathbf{1}_J \\ -\mathbf{1}_{J^c} \end{pmatrix} \begin{pmatrix} \mathbf{1}_J^\top & -\mathbf{1}_{J^c}^\top \end{pmatrix}.$$

The 2nd eigenvector $\bar{u} = \frac{1}{\sqrt{n}} (\mathbf{1}_J - \mathbf{1}_{J^c})$ of $\mathbb{E}A$ reveals $(J, J^c)$.

Credit: Yuxin Chen.

SLIDE 28

Community Detection and SBM

$$\mathbb{E}A = \frac{p+q}{2}\,\mathbf{1}\mathbf{1}^\top + \frac{p-q}{2} \begin{pmatrix} \mathbf{1}_J \\ -\mathbf{1}_{J^c} \end{pmatrix} \begin{pmatrix} \mathbf{1}_J^\top & -\mathbf{1}_{J^c}^\top \end{pmatrix}; \qquad \text{the 2nd eigenvector } \bar{u} = \tfrac{1}{\sqrt{n}} (\mathbf{1}_J - \mathbf{1}_{J^c}) \text{ reveals } (J, J^c).$$

Spectral method: $A \xrightarrow{\text{SVD}}$ the 2nd eigenvector $u$; read off $\operatorname{sgn}(u)$.

To recover $(J, J^c)$, we need $u \approx \bar{u}$ in a uniform (entrywise) way. Classical ℓ2 bounds (Davis and Kahan, 1970) are too loose!
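A minimal simulation of this spectral method ($n$, $p$, $q$ are illustrative assumptions, not values from the talk):

```python
# Minimal sketch: spectral community detection in a two-community SBM.
import numpy as np

rng = np.random.default_rng(3)
n, p, q = 400, 0.10, 0.02
J = np.arange(n) < n // 2                      # membership indicator of community J
P = np.where(J[:, None] == J[None, :], p, q)   # p within, q across communities

# Symmetric 0/1 adjacency matrix, independent entries above the diagonal.
upper = np.triu(rng.random((n, n)) < P, 1)
A = (upper + upper.T).astype(float)

# Spectral method: the sign pattern of the 2nd eigenvector of A.
w, V = np.linalg.eigh(A)                       # ascending eigenvalues
u = V[:, -2]                                   # eigenvector of the 2nd largest one
est = np.sign(u)

truth = np.where(J, 1.0, -1.0)
err = min(np.mean(est != truth), np.mean(est != -truth))
print(f"misclassification rate: {err:.3f}")
```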
SLIDE 29

Optimality of Spectral Method

Let

$$\mathbb{P}(A_{ij} = 1) = \begin{cases} \frac{a \log n}{n}, & \text{if } i, j \in J \text{ or } i, j \in J^c, \\ \frac{b \log n}{n}, & \text{otherwise}, \end{cases}$$

and $a \neq b$.

Theorem (AFWZ, AoS 2020+)

  • Exact recovery w.h.p. when $(\sqrt{a} - \sqrt{b})^2 > 2$;
  • Error rate $n^{-(\sqrt{a} - \sqrt{b})^2 / 2}$ when $(\sqrt{a} - \sqrt{b})^2 \leq 2$.
  • Optimality (Abbe et al., 2016; Zhang and Zhou, 2016).
SLIDE 30

Optimality of Spectral Method

Let

$$\mathbb{P}(A_{ij} = 1) = \begin{cases} \frac{a \log n}{n}, & \text{if } i, j \in J \text{ or } i, j \in J^c, \\ \frac{b \log n}{n}, & \text{otherwise}, \end{cases}$$

and $a \neq b$.

Theorem (AFWZ, AoS 2020+)

  • Exact recovery w.h.p. when $(\sqrt{a} - \sqrt{b})^2 > 2$;
  • Error rate $n^{-(\sqrt{a} - \sqrt{b})^2 / 2}$ when $(\sqrt{a} - \sqrt{b})^2 \leq 2$.
  • Optimality (Abbe et al., 2016; Zhang and Zhou, 2016).

Key ingredients:

  • Entrywise linear approximation $u = A u / \lambda \approx A \bar{u} / \bar{\lambda}$;
  • Weighted sum of independent Bernoulli variables.

SLIDE 31

ℓ∞ Analysis: Linearization

Theorem (Linear approximation)

$$\mathbb{P}\big( \sqrt{n}\, \| u - A \bar{u} / \bar{\lambda} \|_\infty < \varepsilon_n \big) > 1 - n^{-3}.$$
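A quick numerical check of the theorem's scaling (illustrative; the parameters are assumptions): the entrywise gap between $u$ and $A \bar{u} / \bar{\lambda}$, inflated by $\sqrt{n}$, should come out small:

```python
# Sketch: numerically check the linear approximation u ~ A u_bar / lam_bar
# for a two-community SBM. n, p, q are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(4)
n, p, q = 1000, 0.10, 0.02
sign = np.repeat([1.0, -1.0], n // 2)
P = np.where(np.outer(sign, sign) > 0, p, q)
upper = np.triu(rng.random((n, n)) < P, 1)
A = (upper + upper.T).astype(float)

u_bar = sign / np.sqrt(n)            # 2nd eigenvector of E[A]
lam_bar = n * (p - q) / 2            # the corresponding eigenvalue

w, V = np.linalg.eigh(A)
u = V[:, -2]
u = u * np.sign(u @ u_bar)           # resolve the global sign ambiguity

# The theorem predicts sqrt(n) * ||u - A u_bar / lam_bar||_inf = o(1).
print(np.sqrt(n) * np.max(np.abs(u - A @ u_bar / lam_bar)))
```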

SLIDE 32

ℓ∞ Analysis: Linearization

Theorem (Linear approximation)

$$\mathbb{P}\big( \sqrt{n}\, \| u - A \bar{u} / \bar{\lambda} \|_\infty < \varepsilon_n \big) > 1 - n^{-3}.$$

General results (AFWZ, AoS 2020+)

  • Singular vectors of Wigner-type matrices:
  • Symmetric, independent entries above the diagonal;
  • Rectangular, independent entries;
  • $\min_{O \in \mathbb{O}^{r \times r}} \| U O - A \bar{U} \bar{\Lambda}^{-1} \|_{2,\infty} \ll \| \bar{U} \|_{2,\infty}$;
  • Applications: synchronization, matrix completion, (inference).

SLIDE 33

A General ℓ∞ Theory

Merits and demerits of ℓ∞ analysis:

  • Characterizes individual objects precisely;
  • Results and tools apply to ncvx opt. (MWCC, FoCM 2019).

Requires strong signals for uniform control:

  • e.g. degree ≳ log n for SBM.

Successor: ℓp analysis with p < ∞

  • Controls a vast majority of the entries.

SLIDE 34

An ℓp theory for spectral methods

  • Network analysis and Wigner-type matrices
  • Mixture model and Wishart-type matrices
SLIDE 35

Dimensionality Reduction


SLIDE 36

Gaussian Mixture Model

Heteroscedastic model:

$$\ell_i \in \{\pm 1\}, \qquad X_i \mid \ell_i \sim N(\ell_i \mu, \Sigma_i), \qquad \Sigma_i \preceq \Sigma.$$

Low-rank signal: for $X = (X_1, \cdots, X_n)^\top \in \mathbb{R}^{n \times d}$ and $\ell = (\ell_1, \cdots, \ell_n)^\top$, $\mathbb{E}(X \mid \ell) = \ell \mu^\top$.

SLIDE 37

Spectral Method

Recall $\mathbb{E}(X \mid \ell) = \ell \mu^\top$.

Spectral method:

  • 1. Get the hollowed Gram matrix $G = \mathcal{H}(X X^\top) \in \mathbb{R}^{n \times n}$ (zero out the diagonal);
  • 2. $G \xrightarrow{\text{SVD}}$ the 1st eigenvector $u$; read off $\operatorname{sgn}(u)$.

Hollowing

  • improves concentration;
  • helps tackle heteroscedasticity.

Related works: Montanari and Sun 2018, Ndaoud 2018, Cai et al. 2019.
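A minimal sketch of the two-step method (the sizes and the signal strength $\|\mu\|_2 \asymp (d/n)^{1/4}$ are illustrative assumptions chosen to sit in the high-dimensional regime $d \gg n$):

```python
# Sketch: hollowed-Gram spectral clustering of a two-component Gaussian
# mixture in high dimension. Sizes and signal strength are assumptions.
import numpy as np

rng = np.random.default_rng(5)
n, d = 300, 2000                                   # high dimension: d >> n
labels = np.repeat([1.0, -1.0], n // 2)
mu = rng.normal(size=d)
mu *= 3.0 * (d / n) ** 0.25 / np.linalg.norm(mu)   # ||mu||_2 ~ (d/n)^{1/4}
X = np.outer(labels, mu) + rng.normal(size=(n, d))

# 1. Hollowed Gram matrix: zero out the diagonal of X X^T.
G = X @ X.T
np.fill_diagonal(G, 0.0)

# 2. Sign of the 1st eigenvector of G.
w, V = np.linalg.eigh(G)                           # ascending eigenvalues
u = V[:, -1]
est = np.sign(u)

err = min(np.mean(est != labels), np.mean(est != -labels))
print(f"misclassification rate: {err:.3f}")
```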

SLIDE 38

Theoretical Challenges

  • Dependency: $\mathcal{H}(X X^\top)$ has a Wishart-type distribution.
  • SBM: Wigner-type adjacency matrix.
  • High dimensionality: most existing results require $d \lesssim n$.
  • Clustering vs. parameter est.: $X_i \sim \frac{1}{2} N(\mu, I_d) + \frac{1}{2} N(-\mu, I_d)$.

SLIDE 39

Theoretical Challenges

  • Dependency: $\mathcal{H}(X X^\top)$ has a Wishart-type distribution.
  • SBM: Wigner-type adjacency matrix.
  • High dimensionality: most existing results require $d \lesssim n$.
  • Clustering vs. parameter est.: $X_i \sim \frac{1}{2} N(\mu, I_d) + \frac{1}{2} N(-\mu, I_d)$.

[Phase diagram in $\log \|\mu\|_2$ vs. $\log(d/n)$: clustering (estimating $\ell$) becomes possible at $\|\mu\|_2 \asymp (d/n)^{1/4}$; estimating the parameters $(\ell, \mu)$ requires $\|\mu\|_2 \asymp (d/n)^{1/2}$.]

SLIDE 40

ℓp Analysis: Linear Approximation

$u$ is the 1st eigenvector of $G = \mathcal{H}(X X^\top) \in \mathbb{R}^{n \times n}$; $(\bar{\lambda}, \bar{u})$ is the 1st eigen-pair of $\bar{X} \bar{X}^\top$, where $\bar{X} = \ell \mu^\top$.

$$\mathrm{SNR} = \frac{\|\mu\|_2^4}{\|\Sigma\|_2 \|\mu\|_2^2 + \|\Sigma\|_F^2 / n}.$$

SLIDE 41

ℓp Analysis: Linear Approximation

$u$ is the 1st eigenvector of $G = \mathcal{H}(X X^\top) \in \mathbb{R}^{n \times n}$; $(\bar{\lambda}, \bar{u})$ is the 1st eigen-pair of $\bar{X} \bar{X}^\top$, where $\bar{X} = \ell \mu^\top$.

$$\mathrm{SNR} = \frac{\|\mu\|_2^4}{\|\Sigma\|_2 \|\mu\|_2^2 + \|\Sigma\|_F^2 / n}.$$

Theorem

If $2 \leq p \lesssim \mathrm{SNR}$, there exists $\varepsilon_n \to 0$ s.t.

$$\mathbb{P}\big( \| u - G \bar{u} / \bar{\lambda} \|_p < \varepsilon_n \|\bar{u}\|_p \big) > 1 - e^{-p}.$$

  • Applies to RKHS (dim. = ∞): kernel PCA;
  • Adaptivity: large SNR ⇒ large p ⇒ strong result.

SLIDE 42

ℓp Analysis: Corollaries

When $p \asymp \mathrm{SNR} \gtrsim \log n$: $\| \cdot \|_p \asymp \| \cdot \|_\infty$ and

$$\mathbb{P}\Big( \| u - G \bar{u} / \bar{\lambda} \|_\infty < \frac{\varepsilon_n}{\sqrt{n}} \Big) > 1 - \frac{1}{n^3}.$$

SLIDE 43

ℓp Analysis: Corollaries

When $p \asymp \mathrm{SNR} \gtrsim \log n$: $\| \cdot \|_p \asymp \| \cdot \|_\infty$ and

$$\mathbb{P}\Big( \| u - G \bar{u} / \bar{\lambda} \|_\infty < \frac{\varepsilon_n}{\sqrt{n}} \Big) > 1 - \frac{1}{n^3}.$$

Corollary (optimal clustering, $\Sigma = I_d$)

  • Exact recovery w.h.p. when $\mathrm{SNR} > 2 \log n$;
  • Error rate $e^{-\mathrm{SNR} / [2 + o(1)]}$ when $1 \ll \mathrm{SNR} \leq 2 \log n$.

  • K-means (Lu and Zhou 2016);
  • Spectral (Vempala and Wang 2004, Jin et al. 2017, Ndaoud 2018, Löffler et al. 2019);
  • SDP (Mixon et al. 2016, Royer 2017, Fei and Chen 2018, Giraud and Verzelen 2018, Chen and Yang 2018).

SLIDE 44

Summary

Sharpness of spectral methods. Papers (authors in alphabetical order):

  • Abbe, Fan, Wang and Zhong. Entrywise eigenvector analysis of random matrices with low expected rank. The Annals of Statistics, 2020+.
  • Abbe, Fan and Wang. An ℓp analysis of kernel PCA and contextual network analysis. Manuscript.

Extensions:

  • Ranking (CFMW, AoS 2019), topic models, ncvx optimization;
  • Statistical inference based on linear representation.

SLIDE 45

Finding a Needle in a Haystack

Spectral methods are powerful but not omnipotent. For the mixture $\frac{1}{2} N(\mu, \Sigma) + \frac{1}{2} N(-\mu, \Sigma)$, the covariance is $\mu \mu^\top + \Sigma$.

  • Max variance ≠ useful direction;
  • PCA works if $\Sigma \approx I$ or $\|\mu\|_2^2 / \|\Sigma\|_2 \gg 1$.

Seek clustering-friendly projections!
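An illustrative demo of the failure mode (all parameters are assumptions): when one noise direction is stretched, the top principal component aligns with that direction rather than with the discriminative mean direction $\mu$:

```python
# Sketch: PCA picks the stretched noise direction, not the cluster mean.
import numpy as np

rng = np.random.default_rng(6)
n, d = 2000, 20
mu = np.zeros(d)
mu[0] = 1.0                                  # discriminative direction: e_0
Sig_sqrt = np.eye(d)
Sig_sqrt[1, 1] = 5.0                         # stretched noise direction: e_1

labels = rng.choice([1.0, -1.0], size=n)
X = np.outer(labels, mu) + rng.normal(size=(n, d)) @ Sig_sqrt

# Top principal component of the sample covariance (~ mu mu^T + Sigma).
w, V = np.linalg.eigh(np.cov(X.T))
pc1 = V[:, -1]

print(f"alignment with mean direction mu: {abs(pc1[0]):.2f}")  # ~0: misses signal
print(f"alignment with noise direction:   {abs(pc1[1]):.2f}")  # ~1: follows noise
```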

SLIDE 46

Example: Fashion-MNIST

70000 fashion products, 10 categories (Xiao et al. 2017).

Visualization by PCA:

  • T-shirts/tops
  • Pullovers

SLIDE 47

A CURE for Clustering Problems

Clustering via Uncoupled REgression (CURE): Wang, Yan and Diaz. Efficient clustering for stretched mixtures: landscape and optimality. Submitted.

  • Clustering → classification;
  • Stat. and comp. guarantees under mixture models.

SLIDE 48

Q & A

SLIDE 49

Thank you!