 
An Introduction to Spectral Learning
Hanxiao Liu
November 8, 2013
Outline
1. Method of Moments
2. Learning topic models using spectral properties
3. Anchor words
Preliminaries
$X_1, \dots, X_n \sim p(x; \theta)$, $\theta = (\theta_1, \dots, \theta_m)^\top$
$\hat\theta = \hat\theta_n = w(X_1, \dots, X_n)$
Maximum Likelihood Estimator (MLE)
  $\hat\theta = \operatorname*{argmax}_{\theta} \log L(\theta)$
Bayes Estimator (BE)
  $\hat\theta = E(\theta \mid X) = \dfrac{\int \theta \, p(x \mid \theta) \, \pi(\theta) \, d\theta}{\int p(x \mid \theta) \, \pi(\theta) \, d\theta}$
Preliminaries
Question: What makes a good estimator?
MLE is consistent
Both the MLE and BE have asymptotic normality under mild (regularity) conditions:
  $\sqrt{n} \, (\hat\theta_n - \theta) \rightsquigarrow N\!\left(0, \dfrac{1}{I(\theta)}\right)$
Can be computationally expensive
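Preliminaries
Example (Gamma distribution)
  $p(x_i; \alpha, \theta) = \dfrac{1}{\Gamma(\alpha) \theta^{\alpha}} \, x_i^{\alpha - 1} \exp\!\left(-\dfrac{x_i}{\theta}\right)$
  $L(\alpha, \theta) = \left(\dfrac{1}{\Gamma(\alpha) \theta^{\alpha}}\right)^{n} \left(\prod_{i=1}^{n} x_i\right)^{\alpha - 1} \exp\!\left(-\dfrac{\sum_{i=1}^{n} x_i}{\theta}\right)$
The MLE is hard to compute because of the $\Gamma(\alpha)$ term: maximizing over $\alpha$ has no closed-form solution.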
Method of Moments
$j$-th theoretical moment, $j \in [k]$
  $\mu_j(\theta) := E_\theta\!\left[X^j\right]$
$j$-th sample moment, $j \in [k]$
  $M_j := \dfrac{1}{n} \sum_{i=1}^{n} X_i^j$
Plug in and solve the multivariate polynomial equations
  $M_j = \mu_j(\theta)$, $j \in [k]$
Sometimes this can be recast as a spectral decomposition
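Method of Moments
Example (Gamma distribution)
  $p(x_i; \alpha, \theta) = \dfrac{1}{\Gamma(\alpha) \theta^{\alpha}} \, x_i^{\alpha - 1} \exp\!\left(-\dfrac{x_i}{\theta}\right)$
Match the first two moments:
  $\bar X = E(X_i) = \alpha\theta$
  $\dfrac{1}{n} \sum_{i=1}^{n} \left(X_i - \bar X\right)^2 = \operatorname{Var}(X_i) = \alpha\theta^2$
  $\Rightarrow \quad \hat\alpha = \dfrac{n \bar X^2}{\sum_{i=1}^{n} \left(X_i - \bar X\right)^2}, \qquad \hat\theta = \dfrac{1}{n \bar X} \sum_{i=1}^{n} \left(X_i - \bar X\right)^2$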
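The Gamma example can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `gamma_mom` and the simulated sample are assumptions made for the example, and the variance divides by $n$ to match the formula above.

```python
import random

def gamma_mom(xs):
    """Method-of-moments estimates (alpha_hat, theta_hat) for a Gamma sample.

    Hypothetical helper for illustration: matches the sample mean to
    alpha * theta and the sample variance to alpha * theta**2.
    """
    n = len(xs)
    mean = sum(xs) / n
    # Second central sample moment (divides by n, not n - 1).
    var = sum((x - mean) ** 2 for x in xs) / n
    alpha_hat = mean ** 2 / var   # = n * mean^2 / sum (x_i - mean)^2
    theta_hat = var / mean        # = sum (x_i - mean)^2 / (n * mean)
    return alpha_hat, theta_hat

# Simulated check: random.gammavariate(alpha, beta) takes beta as the scale.
random.seed(0)
sample = [random.gammavariate(2.0, 3.0) for _ in range(100_000)]
alpha_hat, theta_hat = gamma_mom(sample)
```

With a large sample the estimates should land close to the true shape 2 and scale 3, and no evaluation of $\Gamma(\alpha)$ is needed.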
Method of Moments
No guarantee about the solution
High-order sample moments are hard to estimate
To reach a specified accuracy, the required sample size and computational cost are exponential in k (or n)!
Question: Could we recover the true $\theta$ from only low-order moments?
Question: Could we lower the sample requirement and computational complexity under some (hopefully mild) assumptions?
Learning the Topic Models
Papadimitriou et al. (2000)
  Non-overlapping separation condition (strong)
Anandkumar et al. (2012), MoM+SD
  Full rank assumption (weak)
  Multinomial Mixture, LDA
Arora et al. (2012), MoM+NMF+LP
  Anchor words (mild)
  LDA, Correlated Topic Model
  A more practical algorithm proposed in 2013
Learning the Topic Models
Suppose there are $n$ documents, $k$ hidden topics, $d$ features
  $M = [\mu_1 \mid \mu_2 \mid \dots \mid \mu_k] \in \mathbb{R}^{d \times k}$, $\mu_j \in \Delta^{d-1}$ $\forall j \in [k]$
  $w = (w_1, \dots, w_k)$, $w \in \Delta^{k-1}$
  $P(h = j) = w_j$, $j \in [k]$
For the $v$-th word in a document, $x_v \in \{e_1, \dots, e_d\}$
  $P(x_v = e_i \mid h = j) = \mu_j^i$, $j \in [k]$, $i \in [d]$ (the $i$-th entry of $\mu_j$)
Goal: Recover $M$ using low-order moments
Learning the Topic Models
Construct moment statistics
  $\text{Pairs}_{ij} := P(x_1 = e_i, x_2 = e_j)$
  $\text{Triples}_{ijt} := P(x_1 = e_i, x_2 = e_j, x_3 = e_t)$
  $\text{Pairs} = E[x_1 \otimes x_2] \in \mathbb{R}^{d \times d}$
  $\text{Triples} = E[x_1 \otimes x_2 \otimes x_3] \in \mathbb{R}^{d \times d \times d}$
The empirical plug-ins, i.e. $\widehat{\text{Pairs}}$ and $\widehat{\text{Triples}}$, can be obtained from data in a straightforward manner
We want to establish an equivalence between the empirical moments and the parameters of interest
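One straightforward way to form the empirical plug-ins is to count word co-occurrences. The sketch below is an assumption-laden illustration, not the slides' procedure: it assumes each document is a list of integer word ids in `range(d)` with at least three words, and it uses only the first three words of each document. The name `empirical_moments` is hypothetical.

```python
import numpy as np

def empirical_moments(docs, d):
    """Estimate Pairs (d x d) and Triples (d x d x d) from word-id documents.

    Assumption for this sketch: only the first three words of each
    document are used as (x1, x2, x3).
    """
    pairs = np.zeros((d, d))
    triples = np.zeros((d, d, d))
    for doc in docs:
        i, j, t = doc[0], doc[1], doc[2]
        pairs[i, j] += 1.0       # count of (x1 = e_i, x2 = e_j)
        triples[i, j, t] += 1.0  # count of (x1 = e_i, x2 = e_j, x3 = e_t)
    n = len(docs)
    return pairs / n, triples / n

# Tiny usage example with d = 3 words and four documents.
docs = [[0, 1, 2], [0, 1, 0], [1, 1, 2], [2, 0, 1]]
pairs_hat, triples_hat = empirical_moments(docs, d=3)
```

Both arrays are relative frequencies, so each sums to one; a dense $d \times d \times d$ array is only practical for small vocabularies.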
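Learning the Topic Models
$\text{Triples}(\eta) := E[x_1 \otimes x_2 \, \langle x_3, \eta \rangle] \in \mathbb{R}^{d \times d}$
$\text{Triples}(\cdot) : \mathbb{R}^d \to \mathbb{R}^{d \times d}$
Lemma
  $\text{Pairs} = M \operatorname{diag}(w) M^\top$
  $\text{Triples}(\eta) = M \operatorname{diag}(w) \operatorname{diag}(M^\top \eta) M^\top$
The unknowns $M$ and $w$ are entangled in these expressions.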
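A short derivation of the lemma may help. It is a sketch that assumes, as in the single-topic model, that the words $x_1, x_2, x_3$ are conditionally independent given the topic $h$, so that $E[x_v \mid h = j] = \mu_j$.

```latex
\begin{align*}
\mathrm{Pairs}
  &= \sum_{j=1}^{k} P(h = j)\, E[x_1 \mid h = j] \otimes E[x_2 \mid h = j]
   = \sum_{j=1}^{k} w_j\, \mu_j \mu_j^\top
   = M \operatorname{diag}(w) M^\top, \\
\mathrm{Triples}(\eta)
  &= \sum_{j=1}^{k} w_j\, \langle \mu_j, \eta \rangle\, \mu_j \mu_j^\top
   = M \operatorname{diag}(w) \operatorname{diag}(M^\top \eta) M^\top .
\end{align*}
```

The $j$-th diagonal entry of $\operatorname{diag}(M^\top \eta)$ is $\langle \mu_j, \eta \rangle$, which is where the third word enters.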
Learning the Topic Models
Assumption (Non-degeneracy)
  $M$ has full column rank $k$
1. Find $U, V \in \mathbb{R}^{d \times k}$ s.t. $(U^\top M)^{-1}$ and $(V^\top M)^{-1}$ exist.
2. $\forall \eta \in \mathbb{R}^d$, define $B(\eta) \in \mathbb{R}^{k \times k}$
  $B(\eta) := \left(U^\top \text{Triples}(\eta) V\right) \left(U^\top \text{Pairs} \, V\right)^{-1}$
Lemma (Observable Operator)
  $B(\eta) = (U^\top M) \operatorname{diag}(M^\top \eta) (U^\top M)^{-1}$
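The observable operator lemma follows by substituting the factorizations of Pairs and Triples($\eta$). The sketch below additionally assumes $w_j > 0$ for every $j$, so that $\operatorname{diag}(w)$ is invertible; this assumption is not stated on the slide.

```latex
\begin{align*}
B(\eta)
  &= \bigl(U^\top M \operatorname{diag}(w) \operatorname{diag}(M^\top \eta) M^\top V\bigr)
     \bigl(U^\top M \operatorname{diag}(w) M^\top V\bigr)^{-1} \\
  &= (U^\top M) \operatorname{diag}(w) \operatorname{diag}(M^\top \eta) (M^\top V)\,
     (M^\top V)^{-1} \operatorname{diag}(w)^{-1} (U^\top M)^{-1} \\
  &= (U^\top M) \operatorname{diag}(M^\top \eta) (U^\top M)^{-1}.
\end{align*}
```

The last step uses the fact that diagonal matrices commute, so $\operatorname{diag}(w)$ cancels against its inverse. The columns of $U^\top M$ are therefore right eigenvectors of $B(\eta)$ with eigenvalues $\langle \mu_j, \eta \rangle$.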
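Learning the Topic Models
Input: $\widehat{\text{Pairs}}$ and $\widehat{\text{Triples}}$
Output: topic-word distributions $\hat M$
  $\hat U, \hat V \leftarrow$ top $k$ left and right singular vectors of $\widehat{\text{Pairs}}$ [a]
  $\eta \leftarrow$ random sample from $\operatorname{range}(\hat U)$
  $\{\hat\xi_1, \hat\xi_2, \dots, \hat\xi_k\} \leftarrow$ right eigenvectors of $B(\eta)$ [b]
  for $j \leftarrow 1$ to $k$ do
    $\hat\mu_j \leftarrow \hat U \hat\xi_j / \langle \mathbf{1}, \hat U \hat\xi_j \rangle$
  end
  return $\hat M = [\hat\mu_1 \mid \hat\mu_2 \mid \dots \mid \hat\mu_k]$
[a] $\text{Pairs} = M \operatorname{diag}(w) M^\top$
[b] $B(\eta) = (U^\top M) \operatorname{diag}(M^\top \eta) (U^\top M)^{-1}$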
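The algorithm can be sketched with NumPy as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the function name `spectral_topics` and the toy parameters `M_true` and `w_true` are made up for the example, the moments are exact rather than estimated from data, and $\eta$ is drawn as $U\theta$ with $\theta$ uniform on the unit sphere.

```python
import numpy as np

def spectral_topics(pairs, triples, k, rng):
    """Sketch of the spectral algorithm; returns a d x k estimate of M."""
    # Top-k left and right singular vectors of Pairs.
    U, _, Vt = np.linalg.svd(pairs)
    U, V = U[:, :k], Vt[:k, :].T
    # eta = U theta, with theta uniform on the unit sphere S^{k-1}.
    theta = rng.standard_normal(k)
    theta /= np.linalg.norm(theta)
    eta = U @ theta
    # Triples(eta)[a, b] = sum_c Triples[a, b, c] * eta[c].
    triples_eta = np.einsum("abc,c->ab", triples, eta)
    # Observable operator B(eta) = (U^T Triples(eta) V)(U^T Pairs V)^{-1}.
    B = (U.T @ triples_eta @ V) @ np.linalg.inv(U.T @ pairs @ V)
    # Right eigenvectors of B(eta). The exact operator has real eigenvalues;
    # with noisy moments small imaginary parts may appear and are dropped.
    _, xi = np.linalg.eig(B)
    cols = U @ xi.real
    # Normalise each column to sum to one: mu_j = U xi_j / <1, U xi_j>.
    return cols / cols.sum(axis=0)

# Toy check with exact moments for d = 4 words and k = 2 topics.
M_true = np.array([[0.5, 0.1], [0.3, 0.1], [0.1, 0.3], [0.1, 0.5]])
w_true = np.array([0.4, 0.6])
pairs = M_true @ np.diag(w_true) @ M_true.T
triples = np.einsum("j,aj,bj,cj->abc", w_true, M_true, M_true, M_true)
M_hat = spectral_topics(pairs, triples, k=2, rng=np.random.default_rng(0))
# Topics are identified only up to a permutation of the columns.
M_hat = M_hat[:, np.argsort(-M_hat[0])]
```

Dividing by the column sum also removes the arbitrary sign and scale of each eigenvector. With empirical moments the recovered columns are only approximate and may contain small negative entries.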
Learning the Topic Models
Lemma (Observable Operator)
  $B(\eta) = (U^\top M) \operatorname{diag}(M^\top \eta) (U^\top M)^{-1}$
We hope $M^\top \eta$ has distinct entries, so that the eigenvalues of $B(\eta)$ are distinct. How to pick $\eta$?
  $\eta \leftarrow e_i \Rightarrow M^\top \eta$ is the $i$-th word's distribution over topics
  Prior knowledge required!
  Otherwise, $\eta \leftarrow U\theta$, $\theta \sim \text{Uniform}(S^{k-1})$
Learning the Topic Models
SVD is carried out on $\mathbb{R}^{k \times k}$, $k \ll d$
Only involves trigram statistics, i.e. low-order moments
Guaranteed to recover the parameters
Parameters of more complicated models like LDA can be recovered in the same manner
Tensor Decomposition
Recall
  $\text{Pairs} = M \operatorname{diag}(w) M^\top$
  $\text{Triples}(\eta) = M \operatorname{diag}(M^\top \eta) \operatorname{diag}(w) M^\top$
Equivalently,
  $\text{Pairs} = \sum_{j=1}^{k} w_j \cdot \mu_j \otimes \mu_j$
  $\text{Triples} = \sum_{j=1}^{k} w_j \cdot \mu_j \otimes \mu_j \otimes \mu_j$
Symmetric tensor decomposition?
  The $\mu_j$ need to be orthogonal
Tensor Decomposition
Whiten Pairs, where $\text{Pairs} = U D U^\top$ is its rank-$k$ eigendecomposition:
  $W := U D^{-1/2} \Rightarrow W^\top \text{Pairs} \, W = I$
  $\mu'_j := \sqrt{w_j} \, W^\top \mu_j$
We can check that $\mu'_j$, $j \in [k]$, are orthonormal vectors
Do orthogonal tensor decomposition on
  $\text{Triples}(W, W, W) = \sum_{j=1}^{k} w_j \left(W^\top \mu_j\right)^{\otimes 3} = \sum_{j=1}^{k} \dfrac{1}{\sqrt{w_j}} \, {\mu'_j}^{\otimes 3}$
Then recover $\mu_j$ from $\mu'_j$
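The whitening step can be checked numerically. The sketch below assumes exact moments and takes $U$, $D$ from the top-$k$ eigendecomposition of the symmetric matrix Pairs; the function name `whiten` and the toy parameters are hypothetical.

```python
import numpy as np

def whiten(pairs, k):
    """Return W = U D^{-1/2} from the top-k eigenpairs of a symmetric Pairs."""
    evals, evecs = np.linalg.eigh(pairs)
    top = np.argsort(evals)[::-1][:k]   # indices of the k largest eigenvalues
    U, D = evecs[:, top], evals[top]
    return U / np.sqrt(D)               # scales column j of U by D_j^{-1/2}

# Toy parameters (assumed for illustration): d = 4 words, k = 2 topics.
M_toy = np.array([[0.5, 0.1], [0.3, 0.1], [0.1, 0.3], [0.1, 0.5]])
w_toy = np.array([0.4, 0.6])
pairs_toy = M_toy @ np.diag(w_toy) @ M_toy.T

W = whiten(pairs_toy, k=2)
# Column j of mu_prime is sqrt(w_j) * W^T mu_j.
mu_prime = (W.T @ M_toy) * np.sqrt(w_toy)
gram = mu_prime.T @ mu_prime            # identity if the columns are orthonormal
```

Here `W.T @ pairs_toy @ W` and `gram` should both equal the $2 \times 2$ identity up to rounding, which is the orthonormality claim above.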
Anchor Words
Drawbacks of previous algorithms
  Topics cannot be correlated
  The bound is weak (comparatively speaking)
  Empirical runtime performance is not satisfactory
Alternative assumptions?
Anchor Words
Definition ($p$-separable)
  $M$ is $p$-separable if $\forall j$, $\exists i$ s.t. $M_{ij} \geq p$ and $M_{ij'} = 0$ for $j' \neq j$
Documents do not necessarily contain anchor words
Two-fold algorithm
1. Selection: find the anchor word for each topic
2. Recover: recover $M$ based on the anchor words
Good theoretical guarantees and empirical results
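The definition can be made concrete with a small check. This sketch only tests $p$-separability of a known matrix $M$; it is not the selection step of the algorithm, which has to find anchor words from data without access to $M$. The function name `anchor_words` is hypothetical.

```python
import numpy as np

def anchor_words(M, p):
    """Return one anchor word index per topic, or None if M is not p-separable.

    Word i is an anchor for topic j when M[i, j] >= p and M[i, j'] == 0
    for every other topic j'.
    """
    d, k = M.shape
    anchors = []
    for j in range(k):
        others = np.delete(M, j, axis=1)  # columns of all other topics
        is_anchor = (M[:, j] >= p) & np.all(others == 0, axis=1)
        idx = np.flatnonzero(is_anchor)
        if idx.size == 0:
            return None                   # topic j has no anchor word
        anchors.append(int(idx[0]))
    return anchors

# Word 0 appears only in topic 0 and word 1 only in topic 1.
M_sep = np.array([[0.4, 0.0], [0.0, 0.5], [0.6, 0.5]])
found = anchor_words(M_sep, p=0.3)
```

In this example the same matrix stops being separable once $p$ exceeds 0.4, because the only anchor word of topic 0 has probability 0.4.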
Anchor Words
[Figure: illustration taken from Ankur Moitra's slides, http://people.csail.mit.edu/moitra/docs/IASM.pdf]
Discussion
Summary
  A brief introduction to MoM
  Learning topic models by spectral decomposition
  Anchor words assumption
Connections with our work?