Learning Latent Variable Models through Tensor Methods


  1. Learning Latent Variable Models through Tensor Methods Anima Anandkumar U.C. Irvine

  2. Challenges in Unsupervised Learning. Learn a latent variable model without labeled examples, e.g. topic models, hidden Markov models, Gaussian mixtures, community detection. Maximum likelihood is NP-hard in most scenarios. In practice, EM and Variational Bayes have no consistency guarantees. Can we achieve efficient computational and sample complexities? In this talk: guaranteed and efficient learning through tensor methods.

  3. How to model hidden effects? Basic approach: mixtures/clusters, where the hidden variable h is categorical. Advanced: probabilistic models, where the hidden variable h has a more general distribution and can model mixed memberships. [Diagram: hidden variables h connected to observed variables x1, ..., x5.]

  4. Moment Based Approaches. Multivariate moments: M1 := E[x], M2 := E[x ⊗ x], M3 := E[x ⊗ x ⊗ x]. The matrix E[x ⊗ x] ∈ R^(d×d) is a second-order tensor with entries E[x ⊗ x]_{i1,i2} = E[x_{i1} x_{i2}]; for matrices, E[x ⊗ x] = E[x x^⊤]. The tensor E[x ⊗ x ⊗ x] ∈ R^(d×d×d) is a third-order tensor with entries E[x ⊗ x ⊗ x]_{i1,i2,i3} = E[x_{i1} x_{i2} x_{i3}].
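
As a concrete illustration of these moment definitions (not part of the original slides), here is a minimal NumPy sketch with made-up data: each moment is an average of outer products over i.i.d. samples.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 10_000
x = rng.normal(size=(n, d))              # n i.i.d. samples of a d-dimensional vector (hypothetical data)

# M1 = E[x]: a d-dimensional vector.
M1 = x.mean(axis=0)

# M2 = E[x ⊗ x]: a d x d matrix (second-order tensor), the averaged outer product.
M2 = np.einsum('ni,nj->ij', x, x) / n

# M3 = E[x ⊗ x ⊗ x]: a d x d x d third-order tensor.
M3 = np.einsum('ni,nj,nk->ijk', x, x, x) / n

print(M1.shape, M2.shape, M3.shape)      # (5,) (5, 5) (5, 5, 5)
```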

  5. Outline: 1. Introduction; 2. Spectral Methods: Matrices to Tensors; 3. Tensor Forms for Different Models; 4. Experimental Results; 5. Overcomplete Tensors; 6. Conclusion.

  6. Classical Spectral Methods: Matrix PCA. Learning through spectral clustering: dimension reduction through PCA (on the data matrix), then clustering of the projected vectors (e.g. k-means).

  7. Classical Spectral Methods: Matrix PCA. Learning through spectral clustering: dimension reduction through PCA (on the data matrix), then clustering of the projected vectors (e.g. k-means). The basic method works only for single memberships, fails to cluster under small separation, and requires long documents for good concentration bounds.

  8. Classical Spectral Methods: Matrix PCA. Learning through spectral clustering: dimension reduction through PCA (on the data matrix), then clustering of the projected vectors (e.g. k-means). The basic method works only for single memberships, fails to cluster under small separation, and requires long documents for good concentration bounds. Efficient learning without separation constraints?

  9. Beyond SVD: Spectral Methods on Tensors. How to learn the mixture components without separation constraints? Are higher-order moments helpful? Is there a unified framework for moment-based estimation of probabilistic latent variable models? SVD gives the spectral decomposition of matrices; what are the analogues for tensors?

  10. Spectral Decomposition. M2 = Σ_i λ_i u_i ⊗ v_i: the matrix M2 decomposes as λ1 u1 ⊗ v1 + λ2 u2 ⊗ v2 + ....

  11. Spectral Decomposition. M2 = Σ_i λ_i u_i ⊗ v_i: the matrix M2 decomposes as λ1 u1 ⊗ v1 + λ2 u2 ⊗ v2 + .... Similarly, M3 = Σ_i λ_i u_i ⊗ v_i ⊗ w_i: the tensor M3 decomposes as λ1 u1 ⊗ v1 ⊗ w1 + λ2 u2 ⊗ v2 ⊗ w2 + .... Here u ⊗ v ⊗ w is a rank-1 tensor since its (i1, i2, i3)-th entry is u_{i1} v_{i2} w_{i3}.
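
To make the rank-1 statement concrete, here is a small check with arbitrary made-up vectors: build u ⊗ v ⊗ w as a three-way outer product and confirm that its (i1, i2, i3)-th entry is u_{i1} v_{i2} w_{i3}.

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([0.5, -1.0, 2.0])
w = np.array([2.0, 0.0, 1.0])

# Rank-1 third-order tensor u ⊗ v ⊗ w as a three-way outer product.
T = np.einsum('i,j,k->ijk', u, v, w)

# Its (i1, i2, i3)-th entry equals u[i1] * v[i2] * w[i3].
assert np.isclose(T[0, 2, 1], u[0] * v[2] * w[1])

# A weighted sum of such rank-1 terms gives the decomposition M3 = sum_i λ_i u_i ⊗ v_i ⊗ w_i.
```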

  12. Decomposition of Orthogonal Tensors. Suppose A has orthogonal columns and M3 = Σ_i w_i a_i ⊗ a_i ⊗ a_i.

  13. Decomposition of Orthogonal Tensors. Suppose A has orthogonal columns and M3 = Σ_i w_i a_i ⊗ a_i ⊗ a_i. Then M3(I, a1, a1) = Σ_i w_i ⟨a_i, a1⟩² a_i = w1 a1.

  14. Decomposition of Orthogonal Tensors. Suppose A has orthogonal columns and M3 = Σ_i w_i a_i ⊗ a_i ⊗ a_i. Then M3(I, a1, a1) = Σ_i w_i ⟨a_i, a1⟩² a_i = w1 a1, so the a_i are eigenvectors of the tensor M3, analogous to matrix eigenvectors: Mv = M(I, v) = λv.
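
The contraction M3(I, v, v) is just a sum over the last two modes. A minimal sketch, assuming orthonormal components a_i (columns of the identity, for simplicity) and made-up weights, checks the eigenvector property M3(I, a1, a1) = w1 a1.

```python
import numpy as np

# Hypothetical orthonormal components (columns of the identity) and positive weights.
k = 3
A = np.eye(k)
w = np.array([2.0, 1.0, 0.5])

# M3 = sum_i w_i a_i ⊗ a_i ⊗ a_i
M3 = np.einsum('r,ir,jr,kr->ijk', w, A, A, A)

def contract(T, v):
    """M3(I, v, v): contract the last two modes of T against v."""
    return np.einsum('ijk,j,k->i', T, v, v)

a1 = A[:, 0]
assert np.allclose(contract(M3, a1), w[0] * a1)   # a_1 is a tensor eigenvector with eigenvalue w_1
```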

  15. Decomposition of Orthogonal Tensors. Suppose A has orthogonal columns and M3 = Σ_i w_i a_i ⊗ a_i ⊗ a_i. Then M3(I, a1, a1) = Σ_i w_i ⟨a_i, a1⟩² a_i = w1 a1, so the a_i are eigenvectors of the tensor M3, analogous to matrix eigenvectors: Mv = M(I, v) = λv. Two problems: how do we find the eigenvectors of a tensor, and A is not orthogonal in general.

  16. Whitening. M3 = Σ_i w_i a_i ⊗ a_i ⊗ a_i, M2 = Σ_i w_i a_i ⊗ a_i. Find a whitening matrix W such that W^⊤ A = V is an orthogonal matrix; when A ∈ R^(d×k) has full column rank, this is an invertible transformation. [Diagram: W maps a1, a2, a3 to orthogonal v1, v2, v3.] Use the pairwise moments M2 to find W such that W^⊤ M2 W = I: take the eigen-decomposition M2 = U Diag(λ̃) U^⊤, then W = U Diag(λ̃^(-1/2)).
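
A minimal sketch of the whitening step, assuming M2 is built from hypothetical full-column-rank components and positive weights: W is formed from the top-k eigenpairs of M2, so that W^⊤ M2 W = I and the whitened components become orthonormal.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 3
A = rng.normal(size=(d, k))              # hypothetical non-orthogonal components (full column rank)
w = np.array([3.0, 2.0, 1.0])            # positive weights

M2 = (A * w) @ A.T                       # M2 = sum_i w_i a_i ⊗ a_i

# Whitening: eigen-decompose M2, keep the top-k eigenpairs, set W = U Diag(λ^{-1/2}).
eigvals, eigvecs = np.linalg.eigh(M2)
idx = np.argsort(eigvals)[::-1][:k]
U, lam = eigvecs[:, idx], eigvals[idx]
W = U / np.sqrt(lam)

assert np.allclose(W.T @ M2 @ W, np.eye(k))   # W^T M2 W = I
V = (W.T @ A) * np.sqrt(w)                    # columns sqrt(w_i) * W^T a_i are orthonormal
assert np.allclose(V.T @ V, np.eye(k))
```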

  17. Using Whitening to Obtain an Orthogonal Tensor. Multi-linear transform from M3 ∈ R^(d×d×d) to T ∈ R^(k×k×k): T = M3(W, W, W) = Σ_i w_i (W^⊤ a_i)^⊗3. Then T = Σ_{i ∈ [k]} λ_i v_i ⊗ v_i ⊗ v_i is orthogonal. Dimensionality reduction when k ≪ d.
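
The multi-linear transform T = M3(W, W, W) is simply a contraction of each mode of M3 with W; a one-line helper (a hypothetical name, consistent with the shapes above) is sketched below.

```python
import numpy as np

def multilinear(M3, W):
    """T = M3(W, W, W): contract each of the three modes of M3 (d x d x d) with W (d x k)."""
    return np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)

# If M3 = sum_i w_i a_i ⊗ a_i ⊗ a_i and W whitens M2, then T = sum_i λ_i v_i ⊗ v_i ⊗ v_i
# with orthonormal v_i = sqrt(w_i) * W^T a_i and λ_i = 1 / sqrt(w_i), so T lives in R^(k x k x k).
```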

  18. Putting it together. M2 = Σ_i w_i a_i ⊗ a_i, M3 = Σ_i w_i a_i ⊗ a_i ⊗ a_i. Obtain the whitening matrix W from the SVD of M2. Use W for the multi-linear transform T = M3(W, W, W). Find the eigenvectors of T through the power method and deflation. For what models can we obtain these M2 and M3 forms?
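
Putting the pipeline together, here is a minimal end-to-end sketch on synthetic exact moments (hypothetical data; a sketch, not the robust algorithm from the paper): whiten M2, transform M3, run the tensor power method with deflation, then un-whiten to recover the components up to permutation. The recovered eigenvalues also give back the weights, since w_i = 1/λ_i².

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 8, 3
A = rng.normal(size=(d, k))                    # hypothetical ground-truth components (full column rank)
w = np.array([3.0, 2.0, 1.0])                  # positive weights

# Exact population moments for this synthetic model.
M2 = (A * w) @ A.T                             # sum_i w_i a_i ⊗ a_i
M3 = np.einsum('r,ir,jr,kr->ijk', w, A, A, A)  # sum_i w_i a_i ⊗ a_i ⊗ a_i

# 1) Whitening matrix W from the top-k eigenpairs of M2, so W^T M2 W = I.
lam, U = np.linalg.eigh(M2)
idx = np.argsort(lam)[::-1][:k]
W = U[:, idx] / np.sqrt(lam[idx])

# 2) Multi-linear transform: T = M3(W, W, W) is an orthogonally decomposable k x k x k tensor.
T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)

# 3) Tensor power method with deflation to extract the eigenpairs of T.
eigvals, eigvecs = [], []
for _ in range(k):
    v = rng.normal(size=k)
    v /= np.linalg.norm(v)
    for _ in range(100):                       # v <- T(I, v, v) / ||T(I, v, v)||
        v = np.einsum('ijk,j,k->i', T, v, v)
        v /= np.linalg.norm(v)
    lam_v = np.einsum('ijk,i,j,k->', T, v, v, v)
    eigvals.append(lam_v)
    eigvecs.append(v)
    T = T - lam_v * np.einsum('i,j,k->ijk', v, v, v)   # deflate the recovered rank-1 term

# 4) Un-whiten: a_i = lambda_i * (W^T)^+ v_i (up to permutation); weights are w_i = 1 / lambda_i^2.
B = np.linalg.pinv(W.T)
A_hat = np.column_stack([lv * B @ v for lv, v in zip(eigvals, eigvecs)])
print(max(min(np.linalg.norm(A - col[:, None], axis=0)) for col in A_hat.T))  # ~0 if recovery succeeded
```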

  19. Outline: 1. Introduction; 2. Spectral Methods: Matrices to Tensors; 3. Tensor Forms for Different Models; 4. Experimental Results; 5. Overcomplete Tensors; 6. Conclusion.

  20. Topic Modeling

  21. Geometric Picture for Topic Models. [Figure: a document and its topic proportions vector h.]

  22. Geometric Picture for Topic Models. Single topic (h).

  23. Geometric Picture for Topic Models. Single topic (h); word generation (x1, x2, ...) through the topic-word matrix A. [Figure: h connected to x1, x2, x3 via A.]

  24. Geometric Picture for Topic Models. Single topic (h); word generation (x1, x2, ...) through the topic-word matrix A. Linear model: E[x_i | h] = A h.

  25. Moments for Single Topic Models. E[x_i | h] = A h, with w := E[h]. Goal: learn the topic-word matrix A and the vector w. [Graphical model: h connected to x1, ..., x5 through A.]

  26. Moments for Single Topic Models. E[x_i | h] = A h, with w := E[h]. Goal: learn the topic-word matrix A and the vector w. Pairwise co-occurrence matrix: M2 := E[x1 ⊗ x2] = E[E[x1 ⊗ x2 | h]] = Σ_{i=1}^{k} w_i a_i ⊗ a_i. Triples tensor: M3 := E[x1 ⊗ x2 ⊗ x3] = E[E[x1 ⊗ x2 ⊗ x3 | h]] = Σ_{i=1}^{k} w_i a_i ⊗ a_i ⊗ a_i.
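
Under the single-topic model the observed words are exchangeable one-hot vectors, so M2 and M3 can be estimated by counting word pairs and triples. A sketch with a made-up toy corpus and a hypothetical helper name (not from the slides):

```python
import numpy as np

def estimate_topic_moments(docs, vocab_size):
    """Estimate M2 = E[x1 ⊗ x2] and M3 = E[x1 ⊗ x2 ⊗ x3] from documents.

    docs: list of word-id sequences (each with at least 3 words); since words are
    exchangeable given the topic, the first three words serve as (x1, x2, x3).
    """
    M2 = np.zeros((vocab_size, vocab_size))
    M3 = np.zeros((vocab_size, vocab_size, vocab_size))
    n = 0
    for doc in docs:
        if len(doc) < 3:
            continue
        w1, w2, w3 = doc[0], doc[1], doc[2]
        M2[w1, w2] += 1.0                 # outer product of one-hot vectors: a single indicator entry
        M3[w1, w2, w3] += 1.0
        n += 1
    return M2 / n, M3 / n

# Toy usage with a made-up 4-word vocabulary.
docs = [[0, 1, 0, 2], [3, 3, 1], [2, 0, 0]]
M2, M3 = estimate_topic_moments(docs, vocab_size=4)
```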

  27. Moments under LDA. M̃2 := E[x1 ⊗ x2] − (α0/(α0 + 1)) E[x1] ⊗ E[x1]. M̃3 := E[x1 ⊗ x2 ⊗ x3] − (α0/(α0 + 2)) E[x1 ⊗ x2 ⊗ E[x1]] − ... (further correction terms). Then M̃2 = Σ_i w_i a_i ⊗ a_i and M̃3 = Σ_i w_i a_i ⊗ a_i ⊗ a_i. Three words per document suffice for learning LDA. Similar forms hold for HMMs, ICA, etc.
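
For the second-order LDA adjustment, whose formula is fully stated above, here is a small hypothetical helper; the third-order correction is abbreviated on the slide, so it is omitted here.

```python
import numpy as np

def lda_adjusted_m2(M2_raw, m1, alpha0):
    """Adjusted LDA second moment: M2~ = E[x1 ⊗ x2] - alpha0/(alpha0 + 1) * E[x1] ⊗ E[x1].

    M2_raw: empirical E[x1 ⊗ x2]; m1: empirical E[x1]; alpha0: Dirichlet concentration.
    (The analogous third-order correction is abbreviated on the slide and omitted here.)
    """
    return M2_raw - (alpha0 / (alpha0 + 1.0)) * np.outer(m1, m1)
```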

  28. Network Community Models

  29-33. Network Community Models. [Figures: example networks in which each node carries a vector of community membership probabilities, e.g. (0.1, 0.8, 0.1).]

  34. Subgraph Counts as Graph Moments

  35. Subgraph Counts as Graph Moments

  36. Subgraph Counts as Graph Moments. 3-star counts are sufficient for identifiability and learning of the mixed membership stochastic block model (MMSB).

  37. Subgraph Counts as Graph Moments. 3-star counts are sufficient for identifiability and learning of MMSB. 3-Star Count Tensor: M̃3(a, b, c) = (1/|X|) × (number of common neighbors of a, b, c in X) = (1/|X|) Σ_{x ∈ X} G(x, a) G(x, b) G(x, c). Equivalently, M̃3 = (1/|X|) Σ_{x ∈ X} [G^⊤_{x,A} ⊗ G^⊤_{x,B} ⊗ G^⊤_{x,C}]. [Figure: a 3-star from x ∈ X to a ∈ A, b ∈ B, c ∈ C.]
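
The 3-star count tensor has a direct array formulation: restrict the adjacency matrix to the rows in X and the columns in A, B, and C, and average the outer products of each x's three neighborhood indicator vectors. The random graph and partition sizes below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
G = (rng.random((n, n)) < 0.2).astype(float)     # hypothetical adjacency matrix

# Disjoint node partitions X, A, B, C (a made-up split of the n nodes).
X, A, B, C = np.arange(0, 10), np.arange(10, 20), np.arange(20, 30), np.arange(30, 40)

# M3~(a, b, c) = (1/|X|) * sum_{x in X} G(x, a) G(x, b) G(x, c):
# the average number of common neighbors in X, arranged as a |A| x |B| x |C| tensor.
GA, GB, GC = G[np.ix_(X, A)], G[np.ix_(X, B)], G[np.ix_(X, C)]
M3_star = np.einsum('xa,xb,xc->abc', GA, GB, GC) / len(X)
```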

  38. Multi-view Representation. Conditional independence of the three views given π_x, the community membership vector of node x. [Graphical model: for the 3-stars, π_x generates the neighborhood vectors G^⊤_{x,A}, G^⊤_{x,B}, G^⊤_{x,C}.] Similar form as M2 and M3 for topic models.

  39. Main Results. k communities, n nodes, uniform communities. α0: sparsity level of community memberships (Dirichlet parameter). p, q: intra-/inter-community edge density. Scaling requirements: n = Ω̃(k² (α0 + 1)³) and (p − q)/√p = Ω̃((α0 + 1)^1.5 k / √n). "A Tensor Spectral Approach to Learning Mixed Membership Community Models" by A. Anandkumar, R. Ge, D. Hsu, and S.M. Kakade, COLT 2013.

  40. Main Results. k communities, n nodes, uniform communities. α0: sparsity level of community memberships (Dirichlet parameter). p, q: intra-/inter-community edge density. Scaling requirements: n = Ω̃(k² (α0 + 1)³) and (p − q)/√p = Ω̃((α0 + 1)^1.5 k / √n). For the stochastic block model (α0 = 0), the results are tight: tight guarantees for sparse graphs (scaling of p, q), tight guarantees on community size (communities of size at least √n are required), and efficient scaling with respect to the sparsity level of memberships α0. "A Tensor Spectral Approach to Learning Mixed Membership Community Models" by A. Anandkumar, R. Ge, D. Hsu, and S.M. Kakade, COLT 2013.

  41. Main Results (Contd). α0: sparsity level of community memberships (Dirichlet parameter). Π: community membership matrix, with Π_i the i-th community. Ŝ: estimated supports, with Ŝ(i, j) the support for node j in community i. Norm guarantees: max_i (1/n) ‖Π̂_i − Π_i‖₁ = Õ((α0 + 1)^(3/2) √p / ((p − q) √n)).
