Spectral Methods for Latent Variable Models Kaizheng Wang - PowerPoint PPT Presentation

Spectral Methods for Latent Variable Models Kaizheng Wang Department of ORFE   Princeton University March 20 th 2020

Data Diversity Unstructured, heterogeneous and incomplete information: ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? Credit: https://www.mathworks.com/help/textanalytics/gs/getting-started-with-topic-modeling.html, https://www.alliance-scotland.org.uk/alliance-homepage-holding-people-networking-2017-01-3/, 2 https://medicalxpress.com/news/2015-04-tumor-only-genetic-sequencing-misguide-cancer .html, https://www.nature.com/articles/nature21386/figures/1, https://viterbi-web.usc.edu/ soltanol/RSC.pdf, Dzenan Hamzic

Matrix Representations Object-by-feature ( ):   n × d ? ? ? ? • Texts: document-term; ? ? ? ? • Genetics: individual-marker; ? ? ? ? • Recomm. systems: user-item. ? ? ? Object-by-object ( ): n × n • Networks: adjacency matrices. 3 Credit (upper right): https://viterbi-web.usc.edu/ soltanol/RSC.pdf.

Matrix Representations Common belief: high ambient dim. but low intrinsic dim. Low-rank approximation: = + 4

  Matrix Representations Low-dimensional embedding via latent variables :       X 1 f 1 E 1 . . . . . . B  = +       . . .  .     X n f n E n n × d r × d n × d n × r latent   latent   samples noises coordinates bases Principal Component Analysis (PCA)   — truncated SVD 5 0 2

Example: Genes Mirror Geography within Europe Novembre et al. (2008), Nature. n = 1387 individuals and d = 197146 SNPs; Figure 1a: 2-dim. embedding vs. labels. PC1 2 C P 6

Outline • Distributed PCA and linearization of eigenvectors • An ℓ p theory for spectral methods • Summary and future directions

Distributed PCA and linearization of eigenvectors

Principal Component Analysis { X i } n i =1 ✓ R d E ( X i X > E X i = 0 i ) = Σ Data : i.i.d., , . n Goal : estimate the principal subspace spanned by the K leading eigenvectors of . ) = Σ SVD ˆ X = ( X 1 , · · · , X n ) > u K ) 2 R d ⇥ K PCA : U = (ˆ u 1 , · · · , ˆ 0 9 2

Distributed PCA Data 5 Data 1 Center Data 4 Data 2 Data 3 10

Distributed PCA m local machines in total, each has n samples. 1. PCA in parallel : the ℓ - th machine conducts X ( ` ) 2 R n ⇥ d U ( ` ) 2 R d ⇥ K SVD ˆ and sends to the central server; ˆ U ( ` ) 2. Aggregation : . { ˆ ˆ U ( ` ) } m U 2 R d ⇥ K ` =1 Related works: Mcdonald et al. 2009; Zhang et al. 2013; Lee et al., 2015; Battey et al., 2015; Qu et al. 2002; El Karoui and d’Aspremont 2010; Liang et al. 2014. 11

Center of Subspaces ˆ { ˆ U ( ` ) } m How to find that best summarizes ? U 2 O d ⇥ K ` =1 12

Center of Subspaces ˆ { ˆ U ( ` ) } m How to find that best summarizes ? U 2 O d ⇥ K ` =1 • Subspace distance : ⇢ ( V , W ) = k V V > � W W > k F . • Least squares : m X ˆ ⇢ 2 ( V , ˆ U ( ` ) ) . U = argmin V 2 O d × K ` =1 • Algorithm : SVD ( ˆ U (1) , · · · , ˆ ˆ U ( m ) ) 2 R d ⇥ mK U 2 O d ⇥ K . 13

Theoretical Results k Σ � 1 / 2 X i k 2 . 1 Assume that is sub-Gaussian, i.e. . X i Define the effective rank and condition number as r = Tr( Σ ) / � 1 ,  = � 1 / ( � K � � K +1 ) . 14

Theoretical Results k Σ � 1 / 2 X i k 2 . 1 Assume that is sub-Gaussian, i.e. . X i Define the effective rank and condition number as r = Tr( Σ ) / � 1 ,  = � 1 / ( � K � � K +1 ) . Theorem (FW W Z, AoS 2019 ) n � C  2 p There exists constant C such that when , Kr p r � � Kr Kr U > � UU > k F � k ˆ U ˆ +  2 � � . 1 .  � mn n | {z } | {z } bias variance 15

Theoretical Results k Σ � 1 / 2 X i k 2 . 1 Assume that is sub-Gaussian, i.e. . X i Define the effective rank and condition number as r = Tr( Σ ) / � 1 ,  = � 1 / ( � K � � K +1 ) . Theorem (FW W Z, AoS 2019 ) n � C  2 p There exists constant C such that when , Kr p r � � Kr Kr U > � UU > k F � k ˆ U ˆ +  2 � � . 1 .  � mn n | {z } | {z } bias variance • If , distributed PCA is optimal. m . n/ (  2 r ) n � C  2 p • The condition cannot be improved. Kr 16

Analysis of Aggregation X (1) 2 R n ⇥ d U (1) 2 O d ⇥ K ˆ . . . . ˆ SVD SVD . . U 2 O d ⇥ K X ( m ) 2 R n ⇥ d U ( m ) 2 O d ⇥ K ˆ P m U ( ` ) ˆ ` =1 ˆ ˆ 1 U ( ` ) > : eigenvectors of . U m 17

Analysis of Aggregation X (1) 2 R n ⇥ d U (1) 2 O d ⇥ K ˆ . . . . ˆ SVD SVD . . U 2 O d ⇥ K X ( m ) 2 R n ⇥ d U ( m ) 2 O d ⇥ K ˆ P m U ( ` ) ˆ ` =1 ˆ ˆ 1 U ( ` ) > : eigenvectors of . U m Averaging reduces variance but retains bias . • Variance: controlled by Davis-Kahan: U ( ` ) ˆ k ˆ U ( ` ) > � UU > k F . k ( ˆ Σ ( ` ) � Σ ) U k F / ∆ . • Bias: how large it is? 18

  Linearization of Eigenvectors Theorem (FW W Z, AoS 2019 ) U ( ` ) ˆ U ( ` ) > � [ UU > + f ( ˆ Σ ( ` ) � Σ )] k F . [ k ( ˆ Σ ( ` ) � Σ ) U k F / ∆ ] 2 , k ˆ > � k f : R d ⇥ d ! R d ⇥ d is a linear functional determined by . ) = Σ 19

  Linearization of Eigenvectors Theorem (FW W Z, AoS 2019 ) U ( ` ) ˆ U ( ` ) > � [ UU > + f ( ˆ Σ ( ` ) � Σ )] k F . [ k ( ˆ Σ ( ` ) � Σ ) U k F / ∆ ] 2 , k ˆ > � k f : R d ⇥ d ! R d ⇥ d is a linear functional determined by . ) = Σ More precise than Davis-Kahan : U ( ` ) ˆ k ˆ U ( ` ) > � UU > k F . k ( ˆ Σ ( ` ) � Σ ) U k F / ∆ . PCA has small bias: U ( ` ) ˆ Σ ( ` ) � Σ ) U k F / ∆ ] 2 . k E ( ˆ U ( ` ) > ) � UU > k F . [ k ( ˆ 20

Summary Theoretical guarantees for distributed PCA : • Bias and variance of PCA; • Linearization of eigenvectors, high-order Davis-Kahan. Paper (alphabetical order): • Fan, Wang, Wang and Zhu. Distributed estimation of principal eigenspaces. The Annals of Statistics , 2019. 21

Example: Genes Mirror Geography within Europe Novembre et al. (2008), Nature. n = 1387 individuals and d = 197146 SNPs; Figure 1a: 2-dim. embedding vs. labels. PC1 2 C P 22

A Pipeline for Spectral Methods 1. Similarity matrix construction   e.g. Gram , adjacency ; XX > A 2. Spectral decomposition   r get r eigen-pairs ; � λ j , u j j =1 3. r -dim. embedding   e.g. using the rows of ; ( u 1 , u 2 , . . . , u r ) 4. Downstream tasks   e.g. visualization. Ext.: { robust, probabilistic, sparse, nonnegative } PCA. Pearson (1901), Hotelling (1933), Schölkopf (1997), Tipping and Bishop (1999), Shi and Malik (2000), Ng et al. (2002), Belkin and Niyogi (2003), Von Luxburg (2007) 23

An ℓ p theory for spectral methods • Network analysis and Wigner-type matrices • Mixture model and Wishart-type matrices

Community Detection and SBM Community detection in networks: 25 Credit: Yuxin Chen.

Community Detection and SBM Community detection in networks: Stochastic Block Model (Holland et al., 1983) Symmetric adjacency matrix , : | J | = | J c | = n A ∈ { 0 , 1 } n × n 2 ( p, if i, j or i, j P ( A ij = 1) = if i, j or i, j . q, McSherry (2001), Coja-Oghlan (2006), Rohe et al. (2011), Mossel et al. (2013), Massoulie (2014), Lelarge et al. (2015), Chin et al. (2015), Abbe et al. (2016), Zhang and Zhou (2016). 26 Credit: Yuxin Chen.

Community Detection and SBM ✓ 1 J ✓ p 1 J,J ◆ ◆ � = p + q q 1 J,J c 2 n 11 > + p − q 1 > − 1 > � E A = . J J c q 1 J c ,J p 1 J c ,J c − 1 J c 2 n 1 The 2 nd eigenvector reveals . u = √ n ( 1 J − 1 J c ) ( J, J c ) ¯ A = E A + A = E A + + A − E A . A = E A 27 Credit: Yuxin Chen.

  Community Detection and SBM ✓ 1 J ✓ p 1 J,J ◆ ◆ � = p + q q 1 J,J c 2 n 11 > + p − q 1 > − 1 > � E A = . J J c q 1 J c ,J p 1 J c ,J c − 1 J c 2 n 1 The 2 nd eigenvector reveals . u = √ n ( 1 J − 1 J c ) ( J, J c ) ¯ Spectral method: SVD the 2 nd eigenvector .   sgn( u ) A u To recover , we need in a uniform way. ( J, J c ) u ≈ ¯ u Classical ℓ 2 bounds (Davis and Kahan, 1970) are too loose! 28

  Optimality of Spectral Method Theorem (AF W Z, AoS 2020+ ) ( a log n , if i, j or i, j Let and . n P ( A ij = 1) = a 6 = b b log n , if i, j or i, j n � 2 > 2 • Exact recovery w.h.p. when ; √ � √ a − b √ � √ a − • Error rate when . √ � 2 ≤ 2 n − ( √ a − b ) 2 / 2 b • Optimality (Abbe et al., 2016; Zhang and Zhou, 2016) . 29

Spectral Methods for Latent Variable Models Kaizheng Wang - PowerPoint PPT Presentation

Spectral Methods for Latent Variable Models Kaizheng Wang Department of ORFE Princeton University March 20 th 2020 Data Diversity Unstructured, heterogeneous and incomplete information: ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? Credit:

Part III: Latent Tree Models Le Song ICML 2012 Tutorial on Spectral Algorithms for Latent

1 Latent variable models In the next section we will discuss latent variable models for

Spectral Clustering Spectral Clustering? Spectral methods Methods using eigenvectors of

Learning Overcomplete Latent Variable Models through Tensor Methods Anima Anandkumar UC Irvine

Latent Variable Models CS3750 Xiaoting Li 1 Out utli line Latent Variable Models

Variable selection bias Bias in Ensemble Bias in Ensemble Methods Methods Variable selection

Learning Latent Variable Models through Tensor Methods Anima Anandkumar U.C. Irvine Challenges

Guaranteed Learning of Latent Variable Models through Tensor Methods Furong Huang University of

Pengtao Xie Joint work with Yuntian Deng and Eric Xing Carnegie Mellon University 1 Latent

Latent Variable Models Stefano Ermon, Aditya Grover Stanford University Lecture 6 Stefano

Guaranteed Learning of Latent Variable Models through Spectral and Tensor Methods Anima

Learning Overcomplete Latent Variable Models through Tensor Methods Majid Janzamin UC Irvine

Numberjack User Guide May 27, 2013 1 Variables Constructor for the class Variable : Constructor

Discrete Latent Variable Models Stefano Ermon, Aditya Grover Stanford University Lecture 15

Outline Latent Variable Generative Models Cooperative Vector Quantizer Model Model

Maximum Reconstruction Estimation for Generative Latent-Variable Models Yong Cheng joint work

1. (Three-player Cournot competition) Consider a three-player Cournot competition, in which three

Voronoi Games on Cycle Graphs Marios Mavronicolas Burkhard Monien Vicky G. Papadopoulou Florian

Random projections, reweighting and half-sampling for high-dimensional statistical inference Art

Repeated Measures ANOVA Rick Balkin, Ph.D., LPC-S, NCC Department of Counseling Texas A & M

Advanced Microeconomics: Game Theory P . v. Mouche Wageningen University 2017 Motivation

sss r t t

Spatial Price Discrimination with Heterogeneous Firms Jonathan Vogel Columbia and NBER August

Individual decision-making under certainty Objects of inquiry Our study of microeconomics begins

Spectral Methods for Latent Variable Models Kaizheng Wang - PowerPoint PPT Presentation

Spectral Methods for Latent Variable Models Kaizheng Wang Department of ORFE Princeton University March 20 th 2020 Data Diversity Unstructured, heterogeneous and incomplete information: ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? Credit:

Part III: Latent Tree Models Le Song ICML 2012 Tutorial on Spectral Algorithms for Latent

1 Latent variable models In the next section we will discuss latent variable models for

Spectral Clustering Spectral Clustering? Spectral methods Methods using eigenvectors of

Learning Overcomplete Latent Variable Models through Tensor Methods Anima Anandkumar UC Irvine

Latent Variable Models CS3750 Xiaoting Li 1 Out utli line Latent Variable Models

Variable selection bias Bias in Ensemble Bias in Ensemble Methods Methods Variable selection

Learning Latent Variable Models through Tensor Methods Anima Anandkumar U.C. Irvine Challenges

Guaranteed Learning of Latent Variable Models through Tensor Methods Furong Huang University of

Pengtao Xie Joint work with Yuntian Deng and Eric Xing Carnegie Mellon University 1 Latent

Latent Variable Models Stefano Ermon, Aditya Grover Stanford University Lecture 6 Stefano

Guaranteed Learning of Latent Variable Models through Spectral and Tensor Methods Anima

Learning Overcomplete Latent Variable Models through Tensor Methods Majid Janzamin UC Irvine

Numberjack User Guide May 27, 2013 1 Variables Constructor for the class Variable : Constructor

Discrete Latent Variable Models Stefano Ermon, Aditya Grover Stanford University Lecture 15

Outline Latent Variable Generative Models Cooperative Vector Quantizer Model Model

Maximum Reconstruction Estimation for Generative Latent-Variable Models Yong Cheng joint work

1. (Three-player Cournot competition) Consider a three-player Cournot competition, in which three

Voronoi Games on Cycle Graphs Marios Mavronicolas Burkhard Monien Vicky G. Papadopoulou Florian

Random projections, reweighting and half-sampling for high-dimensional statistical inference Art

Repeated Measures ANOVA Rick Balkin, Ph.D., LPC-S, NCC Department of Counseling Texas A &amp; M

Advanced Microeconomics: Game Theory P . v. Mouche Wageningen University 2017 Motivation

sss r t t

Spatial Price Discrimination with Heterogeneous Firms Jonathan Vogel Columbia and NBER August

Individual decision-making under certainty Objects of inquiry Our study of microeconomics begins

Repeated Measures ANOVA Rick Balkin, Ph.D., LPC-S, NCC Department of Counseling Texas A & M