An Introduction to Spectral Learning
Hanxiao Liu
November 8, 2013


  1. An Introduction to Spectral Learning — Hanxiao Liu, November 8, 2013

  2. Outline
     1 Method of Moments
     2 Learning topic models using spectral properties
     3 Anchor words

  3. Preliminaries
     Data: $X_1, \cdots, X_n \sim p(x; \theta)$, with parameter vector $\theta = (\theta_1, \cdots, \theta_m)^\top$.
     An estimator: $\hat{\theta} = \hat{\theta}_n = w(X_1, \cdots, X_n)$.
     Maximum Likelihood Estimator (MLE): $\hat{\theta} = \arg\max_\theta \log L(\theta)$
     Bayes Estimator (BE): $\hat{\theta} = E(\theta \mid X) = \frac{\int \theta\, p(x \mid \theta)\, \pi(\theta)\, d\theta}{\int p(x \mid \theta)\, \pi(\theta)\, d\theta}$

  4. Preliminaries
     Question: What makes a good estimator?
     The MLE is consistent.
     Both the MLE and the BE have asymptotic normality, $\sqrt{n}\,(\hat{\theta}_n - \theta) \to N\!\left(0,\, I(\theta)^{-1}\right)$, under mild (regularity) conditions.
     Both can be computationally expensive.

  5. Preliminaries
     Example (Gamma distribution):
     $p(x_i; \alpha, \theta) = \frac{1}{\Gamma(\alpha)\,\theta^{\alpha}}\, x_i^{\alpha-1} \exp\!\left(-\frac{x_i}{\theta}\right)$
     $L(\alpha, \theta) = \left(\frac{1}{\Gamma(\alpha)\,\theta^{\alpha}}\right)^{n} \left(\prod_{i=1}^{n} x_i\right)^{\alpha-1} \exp\!\left(-\frac{\sum_{i=1}^{n} x_i}{\theta}\right)$
     The MLE is hard to compute due to the presence of $\Gamma(\alpha)$.
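
To make the difficulty concrete, here is a minimal sketch (not from the slides) of fitting the Gamma MLE numerically with scipy: because of $\Gamma(\alpha)$ there is no closed form, so an iterative optimizer is needed. The sample, true parameters, and starting point are all illustrative assumptions.

```python
# Minimal sketch: the Gamma MLE has no closed form because of the Gamma
# function, so it is found by numerical optimization (L-BFGS-B here).
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=3.0, size=10_000)   # true alpha=2, theta=3

def neg_log_likelihood(params):
    alpha, theta = params
    # log p(x; alpha, theta) = -log Gamma(alpha) - alpha*log(theta)
    #                          + (alpha - 1)*log(x) - x/theta
    return -np.sum(-gammaln(alpha) - alpha * np.log(theta)
                   + (alpha - 1) * np.log(x) - x / theta)

res = minimize(neg_log_likelihood, x0=[1.0, 1.0],
               bounds=[(1e-6, None), (1e-6, None)])
print(res.x)   # should be close to (2, 3)
```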

  6. Method of Moments
     $j$-th theoretical moment, $j \in [k]$: $\mu_j(\theta) := E_\theta\!\left[X^j\right]$
     $j$-th sample moment, $j \in [k]$: $M_j := \frac{1}{n}\sum_{i=1}^{n} X_i^j$
     Plug in and solve the multivariate polynomial equations $M_j = \mu_j(\theta)$, $j \in [k]$; this can sometimes be recast as a spectral decomposition.

  7. Method of Moments
     Example (Gamma distribution): $p(x_i; \alpha, \theta) = \frac{1}{\Gamma(\alpha)\,\theta^{\alpha}}\, x_i^{\alpha-1} \exp(-x_i/\theta)$
     $\bar{X} = E(X_i) = \alpha\theta$
     $\frac{1}{n}\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2 = \mathrm{Var}(X_i) = \alpha\theta^2$
     $\Rightarrow\; \hat{\alpha} = \frac{n\bar{X}^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}, \quad \hat{\theta} = \frac{1}{n\bar{X}}\sum_{i=1}^{n}(X_i - \bar{X})^2$
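
By contrast, the method-of-moments estimates above are available in closed form. A minimal sketch, using the same illustrative Gamma sample as before:

```python
# Minimal sketch: closed-form method-of-moments estimates for the Gamma
# distribution -- no optimization required.
import numpy as np

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=3.0, size=10_000)   # true alpha=2, theta=3

xbar = x.mean()
s2 = ((x - xbar) ** 2).mean()    # biased sample variance, as on the slide

alpha_hat = xbar ** 2 / s2       # = n*xbar^2 / sum((x_i - xbar)^2)
theta_hat = s2 / xbar            # = sum((x_i - xbar)^2) / (n*xbar)
print(alpha_hat, theta_hat)      # close to (2, 3)
```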

  8. Method of Moments
     Drawbacks: the approach lacks guarantees about the solution, and high-order sample moments are hard to estimate. To reach a specified accuracy, the required sample size and computational cost are exponential in $k$ (or $n$)!
     Question: Could we recover the true $\theta$ from only low-order moments?
     Question: Could we lower the sample requirement and computational complexity based on some (hopefully mild) assumptions?

  9. Learning Topic Models
     Papadimitriou et al. (2000): non-overlapping separation condition (strong).
     Anandkumar et al. (2012), MoM + SD (spectral decomposition): full-rank assumption (weak); covers the multinomial mixture and LDA.
     Arora et al. (2012), MoM + NMF + LP: anchor words assumption (mild); covers LDA and the Correlated Topic Model. A more practical algorithm was proposed in 2013.

  10. Learning Topic Models
      Suppose there are $n$ documents, $k$ hidden topics, and $d$ features (vocabulary words).
      Topic-word matrix: $M = [\mu_1 \mid \mu_2 \mid \ldots \mid \mu_k] \in \mathbb{R}^{d \times k}$, with $\mu_j \in \Delta^{d-1}$ for all $j \in [k]$.
      Topic proportions: $w = (w_1, \ldots, w_k) \in \Delta^{k-1}$, with $P(h = j) = w_j$, $j \in [k]$.
      For the $v$-th word in a document, $x_v \in \{e_1, \ldots, e_d\}$ and $P(x_v = e_i \mid h = j) = M_{ij}$, $j \in [k]$, $i \in [d]$.
      Goal: recover $M$ using low-order moments.
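
A minimal sketch of sampling from this single-topic (multinomial mixture) model. The sizes $d$, $k$, $n$ and the randomly drawn $M$ and $w$ are illustrative assumptions; the later sketches reuse these variables.

```python
# Minimal sketch: draw a hidden topic per document, then three i.i.d. words.
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 50, 3, 100_000                  # vocabulary, topics, documents (illustrative)

w = rng.dirichlet(np.ones(k))             # topic proportions, w in the simplex
M = rng.dirichlet(np.ones(d), size=k).T   # d x k; each column mu_j in the simplex

topics = rng.choice(k, size=n, p=w)       # h ~ w, one hidden topic per document
docs = np.empty((n, 3), dtype=int)        # three exchangeable words per document
for j in range(k):
    idx = np.flatnonzero(topics == j)
    docs[idx] = rng.choice(d, size=(len(idx), 3), p=M[:, j])   # x_v ~ mu_j, i.i.d.
```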

  11. Learning Topic Models
      Construct moment statistics:
      $\mathrm{Pairs}_{ij} := P(x_1 = e_i,\, x_2 = e_j)$, i.e. $\mathrm{Pairs} = E[x_1 \otimes x_2] \in \mathbb{R}^{d \times d}$
      $\mathrm{Triples}_{ijt} := P(x_1 = e_i,\, x_2 = e_j,\, x_3 = e_t)$, i.e. $\mathrm{Triples} = E[x_1 \otimes x_2 \otimes x_3] \in \mathbb{R}^{d \times d \times d}$
      The empirical plug-ins $\widehat{\mathrm{Pairs}}$ and $\widehat{\mathrm{Triples}}$ can be obtained from data in a straightforward manner.
      We want to establish an equivalence between the empirical moments and the parameters of interest.
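
A minimal sketch of those empirical plug-ins, continuing from the synthetic `docs` above: the estimates are just (co-)occurrence frequencies of the first three words of each document.

```python
# Minimal sketch: empirical Pairs and Triples as co-occurrence frequencies.
import numpy as np

def empirical_moments(docs, d):
    n = len(docs)
    pairs = np.zeros((d, d))
    triples = np.zeros((d, d, d))
    for x1, x2, x3 in docs:
        pairs[x1, x2] += 1.0 / n       # Pairs_ij   = P(x1 = e_i, x2 = e_j)
        triples[x1, x2, x3] += 1.0 / n # Triples_ijt = P(x1=e_i, x2=e_j, x3=e_t)
    return pairs, triples

pairs_hat, triples_hat = empirical_moments(docs, d)
```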

  12. Learning Topic Models
      Projected triples: $\mathrm{Triples}(\eta) := E[\, x_1 \otimes x_2\, \langle x_3, \eta \rangle\,] \in \mathbb{R}^{d \times d}$, so $\mathrm{Triples}(\cdot): \mathbb{R}^d \to \mathbb{R}^{d \times d}$.
      Lemma:
      $\mathrm{Pairs} = M\, \mathrm{diag}(w)\, M^\top$
      $\mathrm{Triples}(\eta) = M\, \mathrm{diag}(w)\, \mathrm{diag}\!\left(M^\top \eta\right) M^\top$
      The unknown $M$ and $w$ are entangled in these products.
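
A quick numerical check of the lemma, using the population parameters `M` and `w` from the sampling sketch above; the empirical moments should agree with the population formulas up to sampling error.

```python
# Minimal sketch: verify Pairs = M diag(w) M^T and the Triples(eta) formula.
import numpy as np

pairs_pop = M @ np.diag(w) @ M.T                      # population Pairs
eta = np.random.default_rng(1).normal(size=d)         # arbitrary projection
triples_eta_pop = M @ np.diag(w) @ np.diag(M.T @ eta) @ M.T

# Triples(eta)_ij = sum_t Triples_ijt * eta_t, contracted via einsum:
triples_eta_hat = np.einsum('ijt,t->ij', triples_hat, eta)

print(np.abs(pairs_hat - pairs_pop).max())            # small (sampling error)
print(np.abs(triples_eta_hat - triples_eta_pop).max())
```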

  13. Learning Topic Models
      Assumption (Non-degeneracy): $M$ has full column rank $k$.
      1 Find $U, V \in \mathbb{R}^{d \times k}$ s.t. $\left(U^\top M\right)^{-1}$ and $\left(V^\top M\right)^{-1}$ exist.
      2 For all $\eta \in \mathbb{R}^d$, define $B(\eta) \in \mathbb{R}^{k \times k}$:
      $B(\eta) := \left(U^\top\, \mathrm{Triples}(\eta)\, V\right) \left(U^\top\, \mathrm{Pairs}\, V\right)^{-1}$
      Lemma (Observable Operator): $B(\eta) = \left(U^\top M\right) \mathrm{diag}\!\left(M^\top \eta\right) \left(U^\top M\right)^{-1}$

  14. Learning Topic Models
      Input: $\widehat{\mathrm{Pairs}}$ and $\widehat{\mathrm{Triples}}$. Output: topic-word distributions $\hat{M}$.
      $\hat{U}, \hat{V} \leftarrow$ top-$k$ left and right singular vectors of $\widehat{\mathrm{Pairs}}$ (a)
      $\eta \leftarrow$ random sample from $\mathrm{range}(\hat{U})$
      $\hat{\xi}_1, \hat{\xi}_2, \ldots, \hat{\xi}_k \leftarrow$ right eigenvectors of $\hat{B}(\eta)$ (b)
      for $j \leftarrow 1$ to $k$ do $\hat{\mu}_j \leftarrow \hat{U}\hat{\xi}_j / \langle \mathbf{1}, \hat{U}\hat{\xi}_j \rangle$ end
      return $\hat{M} = [\hat{\mu}_1 \mid \hat{\mu}_2 \mid \ldots \mid \hat{\mu}_k]$
      (a) $\mathrm{Pairs} = M\, \mathrm{diag}(w)\, M^\top$
      (b) $B(\eta) = \left(U^\top M\right) \mathrm{diag}\!\left(M^\top \eta\right) \left(U^\top M\right)^{-1}$
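
A minimal numpy sketch of this algorithm, run on the empirical moments from the earlier sketches; the variable names (`pairs_hat`, `triples_hat`, `k`) are assumptions carried over from them.

```python
# Minimal sketch of the slide's spectral algorithm.
import numpy as np

rng = np.random.default_rng(2)

# 1. Top-k left/right singular vectors of empirical Pairs.
U_full, _, Vt_full = np.linalg.svd(pairs_hat)
U, V = U_full[:, :k], Vt_full[:k, :].T

# 2. Random eta from range(U).
eta = U @ rng.normal(size=k)

# 3. Observable operator B(eta) and its right eigenvectors.
T_eta = np.einsum('ijt,t->ij', triples_hat, eta)          # Triples(eta)
B = (U.T @ T_eta @ V) @ np.linalg.inv(U.T @ pairs_hat @ V)
_, xi = np.linalg.eig(B)                                  # columns = eigenvectors
xi = xi.real    # eigenvectors are real up to sampling noise in this model

# 4. Map eigenvectors back to word space and normalize onto the simplex.
M_hat = U @ xi
M_hat = M_hat / M_hat.sum(axis=0, keepdims=True)
print(M_hat)    # columns approximate the mu_j, up to permutation
```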

  15. Learning Topic Models
      Lemma (Observable Operator): $B(\eta) = \left(U^\top M\right) \mathrm{diag}\!\left(M^\top \eta\right) \left(U^\top M\right)^{-1}$
      We hope $M^\top \eta$ has distinct entries. How do we pick $\eta$?
      $\eta \leftarrow e_i \;\Rightarrow\; M^\top \eta$ is the $i$-th word's distribution over topics. Prior knowledge required!
      Otherwise, $\eta \leftarrow U\theta$, $\theta \sim \mathrm{Uniform}(S^{k-1})$.
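
A small sketch of the two choices; the word index `i` is a hypothetical example, and `U`, `d`, `k` come from the earlier sketches.

```python
# Minimal sketch of the two ways to choose eta discussed above.
import numpy as np

rng = np.random.default_rng(3)

# (a) With prior knowledge: eta = e_i makes M^T eta the i-th word's
#     distribution over topics.
i = 7                               # hypothetical word index
eta_known = np.eye(d)[i]

# (b) Without prior knowledge: eta = U theta with theta uniform on S^{k-1},
#     so M^T eta has distinct entries with high probability.
theta = rng.normal(size=k)
theta /= np.linalg.norm(theta)      # uniform direction on the sphere
eta_random = U @ theta
```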

  16. Learning Topic Models
      The SVD is carried out on $\mathbb{R}^{k \times k}$ matrices, with $k \ll d$.
      Only trigram statistics, i.e. low-order moments, are involved.
      The parameters are guaranteed to be recovered.
      Parameters of more complicated models, such as LDA, can be recovered in the same manner.

  17. Tensor Decomposition
      Recall:
      $\mathrm{Pairs} = M\, \mathrm{diag}(w)\, M^\top = \sum_{j=1}^{k} w_j \cdot \mu_j \otimes \mu_j$
      $\mathrm{Triples}(\eta) = M\, \mathrm{diag}\!\left(M^\top \eta\right) \mathrm{diag}(w)\, M^\top$, and $\mathrm{Triples} = \sum_{j=1}^{k} w_j \cdot \mu_j \otimes \mu_j \otimes \mu_j$
      Symmetric tensor decomposition? The $\mu_j$ would need to be orthogonal.

  18. Tensor Decomposition
      Whiten Pairs: $W := U D^{-1/2} \;\Rightarrow\; W^\top\, \mathrm{Pairs}\, W = I$ (where $\mathrm{Pairs} = U D U^\top$).
      Let $\mu'_j := \sqrt{w_j}\, W^\top \mu_j$. We can check that the $\mu'_j$, $j \in [k]$, are orthonormal vectors.
      Do orthogonal tensor decomposition on
      $\mathrm{Triples}(W, W, W) = \sum_{j=1}^{k} w_j \left(W^\top \mu_j\right)^{\otimes 3} = \sum_{j=1}^{k} \frac{1}{\sqrt{w_j}} \left(\mu'_j\right)^{\otimes 3}$
      Then recover $\mu_j$ from $\mu'_j$.
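
A minimal sketch of the whitening step plus a basic tensor power iteration, using population moments built from the `M` and `w` of the earlier sketches. The power method is one standard way to do the orthogonal decomposition, not necessarily the one intended on the slide.

```python
# Minimal sketch: whiten Pairs, whiten Triples, run a tensor power iteration
# to extract one (w_j, mu_j) pair.
import numpy as np

rng = np.random.default_rng(4)

pairs = M @ np.diag(w) @ M.T
triples = np.einsum('j,aj,bj,cj->abc', w, M, M, M)    # sum_j w_j mu_j^(x3)

# Whitening: Pairs = U D U^T (rank k), W = U D^{-1/2}, so W^T Pairs W = I.
evals, evecs = np.linalg.eigh(pairs)
U_, D_ = evecs[:, -k:], evals[-k:]                    # top-k eigenpairs
W = U_ / np.sqrt(D_)

T = np.einsum('abc,ai,bj,ck->ijk', triples, W, W, W)  # Triples(W, W, W)

# Power iteration: v <- T(I, v, v), normalized; converges to some mu'_j.
v = rng.normal(size=k)
v /= np.linalg.norm(v)
for _ in range(100):
    v = np.einsum('ijk,j,k->i', T, v, v)
    lam = np.linalg.norm(v)
    v /= lam

# At convergence, lam = 1/sqrt(w_j) and v = mu'_j = sqrt(w_j) W^T mu_j.
w_j = 1.0 / lam ** 2
mu_j = np.linalg.pinv(W.T) @ v / np.sqrt(w_j)         # undo the whitening
print(w_j, mu_j)                                      # matches some (w_j, mu_j)
```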

  19. Anchor Words
      Drawbacks of the previous algorithms:
      Topics cannot be correlated.
      The bound is weak (comparatively speaking).
      Empirical runtime performance is not satisfactory.
      Alternative assumptions?

  20. Anchor Words
      Definition ($p$-separable): $M$ is $p$-separable if $\forall j\, \exists i$ s.t. $M_{ij} \geq p$ and $M_{ij'} = 0$ for $j' \neq j$.
      Documents do not necessarily contain anchor words.
      Two-fold algorithm:
      1 Selection: find the anchor word for each topic.
      2 Recovery: recover $M$ based on the anchor words.
      Good theoretical guarantees and empirical results.
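
A simplified sketch of the selection step, in the spirit of Arora et al.'s greedy anchor finding but not their exact algorithm: treat each word's normalized co-occurrence row as a point, and repeatedly take the point farthest from the span of the anchors chosen so far. It reuses `pairs_hat` and `k` from the earlier sketches.

```python
# Simplified sketch: greedy, Gram-Schmidt-style anchor word selection.
import numpy as np

def find_anchors(pairs, k):
    # Row-normalize: Q[i] ~ P(x2 | x1 = word i); anchors are extreme points.
    Q = pairs / (pairs.sum(axis=1, keepdims=True) + 1e-12)
    anchors = []
    R = Q.copy()
    for _ in range(k):
        norms = np.linalg.norm(R, axis=1)
        i = int(np.argmax(norms))        # farthest remaining point
        anchors.append(i)
        b = R[i] / norms[i]
        R = R - np.outer(R @ b, b)       # project out the chosen direction
    return anchors

print(find_anchors(pairs_hat, k))        # one candidate anchor word per topic
```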

  21. Anchor Words
      [Illustration of anchor words, taken from Ankur Moitra's slides: http://people.csail.mit.edu/moitra/docs/IASM.pdf]

  22. Discussion
      Summary:
      A brief introduction to MoM.
      Learning topic models by spectral decomposition.
      The anchor words assumption.
      Connections with our work?
