

SLIDE 1

Learning Latent Variable Models through Tensor Methods

Anima Anandkumar

U.C. Irvine

SLIDE 2

Challenges in Unsupervised Learning

Learn a latent variable model without labeled examples, e.g. topic models, hidden Markov models, Gaussian mixtures, community detection. Maximum likelihood estimation is NP-hard in most scenarios. In practice, EM and Variational Bayes have no consistency guarantees. Can we obtain efficient computational and sample complexities? In this talk: guaranteed and efficient learning through tensor methods.

SLIDE 3

How to model hidden effects?

Basic Approach: mixtures/clusters

Hidden variable h is categorical.

Advanced: Probabilistic models

Hidden variable h has more general distributions and can model mixed memberships. [Figure: hidden variables h1, h2, h3 connected to observations x1, . . . , x5.]

SLIDE 4

Moment Based Approaches

Multivariate Moments

M1 := E[x], M2 := E[x ⊗ x], M3 := E[x ⊗ x ⊗ x].

Matrix

E[x ⊗ x] ∈ R^(d×d) is a second-order tensor with entries E[x ⊗ x]i1,i2 = E[xi1 xi2]. For matrices: E[x ⊗ x] = E[xx⊤].

Tensor

E[x ⊗ x ⊗ x] ∈ R^(d×d×d) is a third-order tensor with entries E[x ⊗ x ⊗ x]i1,i2,i3 = E[xi1 xi2 xi3].
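As a quick illustration (a hedged numpy sketch with made-up placeholder data, not part of the talk), the empirical versions of these moments can be formed directly from samples:

```python
import numpy as np

# Empirical M1, M2, M3 from n samples of x in R^d, stacked as rows of X.
# The (1000, 5) standard-normal data here is only a placeholder.
X = np.random.default_rng(0).normal(size=(1000, 5))
M1 = X.mean(axis=0)                                 # E[x]
M2 = np.einsum('na,nb->ab', X, X) / len(X)          # E[x ⊗ x] = E[xx^T]
M3 = np.einsum('na,nb,nc->abc', X, X, X) / len(X)   # E[x ⊗ x ⊗ x]
```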

SLIDE 5

Outline

1. Introduction
2. Spectral Methods: Matrices to Tensors
3. Tensor Forms for Different Models
4. Experimental Results
5. Overcomplete Tensors
6. Conclusion

SLIDE 6

Classical Spectral Methods: Matrix PCA

Learning through Spectral Clustering

Dimension reduction through PCA (on the data matrix), then clustering of the projected vectors (e.g. k-means).

SLIDE 7

Classical Spectral Methods: Matrix PCA

Learning through Spectral Clustering

Dimension reduction through PCA (on the data matrix), then clustering of the projected vectors (e.g. k-means). The basic method works only for single memberships: it fails to cluster under small separation and requires long documents for good concentration bounds.

SLIDE 8

Classical Spectral Methods: Matrix PCA

Learning through Spectral Clustering

Dimension reduction through PCA (on the data matrix), then clustering of the projected vectors (e.g. k-means). The basic method works only for single memberships: it fails to cluster under small separation and requires long documents for good concentration bounds. Efficient learning without separation constraints?

SLIDE 9

Beyond SVD: Spectral Methods on Tensors

How to learn the mixture components without separation constraints?

◮ Are higher order moments helpful?

Unified framework?

◮ Moment-based Estimation of probabilistic latent variable models?

SVD gives spectral decomposition of matrices.

◮ What are the analogues for tensors?

SLIDE 10

Spectral Decomposition

Matrix: M2 = Σ_i λi ui ⊗ vi = λ1 u1 ⊗ v1 + λ2 u2 ⊗ v2 + · · ·

SLIDE 11

Spectral Decomposition

Matrix: M2 = Σ_i λi ui ⊗ vi = λ1 u1 ⊗ v1 + λ2 u2 ⊗ v2 + · · ·

Tensor: M3 = Σ_i λi ui ⊗ vi ⊗ wi = λ1 u1 ⊗ v1 ⊗ w1 + λ2 u2 ⊗ v2 ⊗ w2 + · · ·

u ⊗ v ⊗ w is a rank-1 tensor since its (i1, i2, i3)-th entry is ui1 vi2 wi3.
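For concreteness, a small numpy check (an illustrative sketch, not from the talk) that the rank-1 tensor u ⊗ v ⊗ w has entries ui1 vi2 wi3:

```python
import numpy as np

# Build u ⊗ v ⊗ w with einsum and verify one entry against the definition.
rng = np.random.default_rng(0)
u, v, w = rng.normal(size=4), rng.normal(size=4), rng.normal(size=4)
T = np.einsum('a,b,c->abc', u, v, w)    # rank-1 tensor, shape (4, 4, 4)
assert np.isclose(T[1, 2, 3], u[1] * v[2] * w[3])
```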

SLIDE 12

Decomposition of Orthogonal Tensors

A has orthogonal columns: M3 = Σ_i wi ai ⊗ ai ⊗ ai.

SLIDE 13

Decomposition of Orthogonal Tensors

A has orthogonal columns: M3 = Σ_i wi ai ⊗ ai ⊗ ai.

M3(I, a1, a1) = Σ_i wi ⟨ai, a1⟩^2 ai = w1 a1.

SLIDE 14

Decomposition of Orthogonal Tensors

A has orthogonal columns: M3 = Σ_i wi ai ⊗ ai ⊗ ai.

M3(I, a1, a1) = Σ_i wi ⟨ai, a1⟩^2 ai = w1 a1.

ai are eigenvectors of tensor M3. Analogous to matrix eigenvectors: Mv = M(I, v) = λv.

SLIDE 15

Decomposition of Orthogonal Tensors

A has orthogonal columns: M3 = Σ_i wi ai ⊗ ai ⊗ ai.

M3(I, a1, a1) = Σ_i wi ⟨ai, a1⟩^2 ai = w1 a1.

ai are eigenvectors of tensor M3. Analogous to matrix eigenvectors: Mv = M(I, v) = λv.
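The contraction M3(I, v, v) used here reduces to a single einsum; a minimal sketch (illustrative, not the talk's code):

```python
import numpy as np

# M3(I, v, v): contract the last two modes of M3 with v, returning the vector
# whose a-th entry is the sum over b, c of M3[a, b, c] * v[b] * v[c].
def contract(M3, v):
    return np.einsum('abc,b,c->a', M3, v, v)
```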

Two Problems

How to find eigenvectors of a tensor? A is not orthogonal in general.

SLIDE 16

Whitening

M3 = Σ_i wi ai ⊗ ai ⊗ ai,   M2 = Σ_i wi ai ⊗ ai.

Find a whitening matrix W s.t. W⊤A = V is an orthogonal matrix. When A ∈ R^(d×k) has full column rank, this is an invertible transformation.

[Figure: W maps a1, a2, a3 to orthonormal v1, v2, v3.]

Use the pairwise moments M2 to find W s.t. W⊤M2W = I: from the eigen-decomposition M2 = U Diag(λ̃) U⊤, set W = U Diag(λ̃^(−1/2)).

SLIDE 17

Using Whitening to Obtain Orthogonal Tensor

Multi-linear transform: tensor M3 → tensor T

M3 ∈ R^(d×d×d) and T ∈ R^(k×k×k):
T = M3(W, W, W) = Σ_i wi (W⊤ai)^⊗3 = Σ_{i∈[k]} λi · vi ⊗ vi ⊗ vi is orthogonal.

Dimensionality reduction when k ≪ d.
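The multi-linear transform itself is one contraction per mode; a hedged sketch:

```python
import numpy as np

# T = M3(W, W, W): contract each mode of M3 (d x d x d) with W (d x k),
# producing the k x k x k tensor T.
def multilinear_transform(M3, W):
    return np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)
```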

SLIDE 18

Putting it together

M2 = Σ_i wi ai ⊗ ai,   M3 = Σ_i wi ai ⊗ ai ⊗ ai.

Obtain the whitening matrix W from the SVD of M2. Use W for the multi-linear transform: T = M3(W, W, W). Find the eigenvectors of T through the power method and deflation. For what models can we obtain M2 and M3 in these forms?
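Putting these pieces together, a compact sketch of the tensor power method with deflation on the whitened tensor T (an illustrative implementation under the orthogonal-decomposition assumption, not the talk's code):

```python
import numpy as np

# Recover the eigenpairs of an orthogonally decomposable k x k x k tensor T
# by repeated power iterations v <- T(I, v, v)/||T(I, v, v)|| and deflation.
def tensor_power_method(T, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    k = T.shape[0]
    eigenpairs = []
    for _ in range(k):
        v = rng.normal(size=k)
        v /= np.linalg.norm(v)
        for _ in range(n_iters):
            Tv = np.einsum('abc,b,c->a', T, v, v)    # power update T(I, v, v)
            v = Tv / np.linalg.norm(Tv)
        lam = np.einsum('abc,a,b,c->', T, v, v, v)   # eigenvalue T(v, v, v)
        eigenpairs.append((lam, v))
        T = T - lam * np.einsum('a,b,c->abc', v, v, v)  # deflate this component
    return eigenpairs
```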

SLIDE 19

Outline

1. Introduction
2. Spectral Methods: Matrices to Tensors
3. Tensor Forms for Different Models
4. Experimental Results
5. Overcomplete Tensors
6. Conclusion

SLIDE 20

Topic Modeling

SLIDE 21

Geometric Picture for Topic Models

Topic proportions vector (h)

Document

SLIDE 22

Geometric Picture for Topic Models

Single topic (h)

SLIDE 23

Geometric Picture for Topic Models

Single topic (h); word generation (x1, x2, . . .). [Figure: topic h emits words x1, x2, x3, each through the topic-word matrix A.]

SLIDE 24

Geometric Picture for Topic Models

Single topic (h); word generation (x1, x2, . . .). [Figure: topic h emits words x1, x2, x3, each through the topic-word matrix A.] Linear model: E[xi|h] = Ah.

SLIDE 25

Moments for Single Topic Models

E[xi|h] = Ah, w := E[h]. Goal: learn the topic-word matrix A and the vector w.

[Figure: h generates words x1, . . . , x5, each through A.]

SLIDE 26

Moments for Single Topic Models

E[xi|h] = Ah, w := E[h]. Goal: learn the topic-word matrix A and the vector w.

[Figure: h generates words x1, . . . , x5, each through A.]

Pairwise Co-occurrence Matrix M2

M2 := E[x1 ⊗ x2] = E[E[x1 ⊗ x2 | h]] = Σ_{i=1}^k wi ai ⊗ ai

Triples Tensor M3

M3 := E[x1 ⊗ x2 ⊗ x3] = E[E[x1 ⊗ x2 ⊗ x3 | h]] = Σ_{i=1}^k wi ai ⊗ ai ⊗ ai
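As a hedged sketch (the (n, 3) word-triple layout and vocabulary size d are assumptions for illustration, not the talk's data format), the empirical moments follow directly from one-hot encodings of the first three words of each document:

```python
import numpy as np

# Empirical M2 = E[x1 ⊗ x2] and M3 = E[x1 ⊗ x2 ⊗ x3] from word triples.
# docs: integer array of shape (n, 3) with the first three word ids per document.
def empirical_topic_moments(docs, d):
    X1, X2, X3 = (np.eye(d)[docs[:, j]] for j in range(3))   # one-hot, (n, d)
    n = docs.shape[0]
    M2 = np.einsum('na,nb->ab', X1, X2) / n
    M3 = np.einsum('na,nb,nc->abc', X1, X2, X3) / n
    return M2, M3
```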

SLIDE 27

Moments under LDA

M2 := E[x1 ⊗ x2] − (α0/(α0 + 1)) E[x1] ⊗ E[x1]
M3 := E[x1 ⊗ x2 ⊗ x3] − (α0/(α0 + 2)) E[x1 ⊗ x2 ⊗ E[x1]] − more stuff...

Then M2 = Σ_i w̃i ai ⊗ ai and M3 = Σ_i w̃i ai ⊗ ai ⊗ ai.

Three words per document suffice for learning LDA. Similar forms for HMM, ICA, etc.
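For the M2 adjustment, which the slide states in full, a one-line sketch (E12 and m1 are assumed to be precomputed empirical estimates of E[x1 ⊗ x2] and E[x1]; the M3 adjustment has the additional symmetrized terms elided above):

```python
import numpy as np

# LDA-adjusted second moment: M2 = E[x1 ⊗ x2] - alpha0/(alpha0 + 1) E[x1] ⊗ E[x1].
def lda_m2(E12, m1, alpha0):
    return E12 - (alpha0 / (alpha0 + 1.0)) * np.outer(m1, m1)
```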

SLIDE 28

Network Community Models

SLIDE 29

Network Community Models

[Figure: network where each node has a mixed community membership vector, e.g. (0.4, 0.3, 0.3), (0.7, 0.2, 0.1), (0.1, 0.8, 0.1).]

SLIDE 30

Network Community Models

[Figure: network where each node has a mixed community membership vector, e.g. (0.4, 0.3, 0.3), (0.7, 0.2, 0.1), (0.1, 0.8, 0.1).]

SLIDE 31

Network Community Models

[Figure: same network with membership vectors (0.4, 0.3, 0.3), (0.7, 0.2, 0.1), (0.1, 0.8, 0.1); a highlighted node pair connects with probability 0.9.]

SLIDE 32

Network Community Models

[Figure: same network with membership vectors (0.4, 0.3, 0.3), (0.7, 0.2, 0.1), (0.1, 0.8, 0.1); a highlighted node pair connects with probability 0.1.]

SLIDE 33

Network Community Models

[Figure: network where each node has a mixed community membership vector, e.g. (0.4, 0.3, 0.3), (0.7, 0.2, 0.1), (0.1, 0.8, 0.1).]

SLIDE 34

Subgraph Counts as Graph Moments

SLIDE 35

Subgraph Counts as Graph Moments

SLIDE 36

Subgraph Counts as Graph Moments

3-star counts are sufficient for identifiability and learning of MMSB (the mixed membership stochastic blockmodel).

SLIDE 37

Subgraph Counts as Graph Moments

3-star counts sufficient for identifiability and learning of MMSB

3-Star Count Tensor

M̃3(a, b, c) = (1/|X|) · #{common neighbors of a, b, c in X} = (1/|X|) Σ_{x∈X} G(x, a) G(x, b) G(x, c).

M̃3 = (1/|X|) Σ_{x∈X} [G⊤x,A ⊗ G⊤x,B ⊗ G⊤x,C]

[Figure: 3-star with center x ∈ X and leaves a ∈ A, b ∈ B, c ∈ C.]
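In code, the 3-star count tensor is a single contraction over the partition X; a hedged sketch (the index-array layout is an assumption, not the released implementation):

```python
import numpy as np

# 3-star counts from adjacency matrix G: average over x in X of the rank-1
# tensor G[x, A] ⊗ G[x, B] ⊗ G[x, C]. X, A, B, C are integer index arrays.
def three_star_tensor(G, X, A, B, C):
    GA, GB, GC = G[np.ix_(X, A)], G[np.ix_(X, B)], G[np.ix_(X, C)]
    return np.einsum('xa,xb,xc->abc', GA, GB, GC) / len(X)
```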

SLIDE 38

Multi-view Representation

Conditional independence of the three views: given πx, the community membership vector of node x, the neighborhood vectors G⊤x,A, G⊤x,B, G⊤x,C are conditionally independent.

[Figure: 3-stars centered at x ∈ X; graphical model with πx as the parent of G⊤x,A, G⊤x,B, and G⊤x,C.]

Similar form as M2 and M3 for topic models.

SLIDE 39

Main Results

k communities, n nodes. Uniform communities. α0: Sparsity level of community memberships (Dirichlet parameter). p, q: intra/inter-community edge density.

Scaling Requirements

n = Ω̃(k^2 (α0 + 1)^3),   (p − q)/√p = Ω̃((α0 + 1)^(1.5) k / √n).

“A Tensor Spectral Approach to Learning Mixed Membership Community Models” by A. Anandkumar, R. Ge, D. Hsu, and S.M. Kakade. COLT 2013.

SLIDE 40

Main Results

k communities, n nodes. Uniform communities. α0: Sparsity level of community memberships (Dirichlet parameter). p, q: intra/inter-community edge density.

Scaling Requirements

n = Ω̃(k^2 (α0 + 1)^3),   (p − q)/√p = Ω̃((α0 + 1)^(1.5) k / √n).

For the stochastic block model (α0 = 0): tight results. Tight guarantees for sparse graphs (scaling of p, q). Tight guarantees on community size: requires communities of size at least √n. Efficient scaling w.r.t. the sparsity level of memberships α0.

“A Tensor Spectral Approach to Learning Mixed Membership Community Models” by A. Anandkumar, R. Ge, D. Hsu, and S.M. Kakade. COLT 2013.

SLIDE 41

Main Results (Contd)

α0: Sparsity level of community memberships (Dirichlet parameter). Π: community membership matrix, Π(i): ith community. S: estimated supports, S(i, j): support for node j in community i.

Norm Guarantees

(1/n) · max_i ‖Π̂(i) − Π(i)‖1 = Õ((α0 + 1)^(3/2) √p / ((p − q) √n))

SLIDE 42

Main Results (Contd)

α0: Sparsity level of community memberships (Dirichlet parameter). Π: community membership matrix, Π(i): ith community. S: estimated supports, S(i, j): support for node j in community i.

Norm Guarantees

(1/n) · max_i ‖Π̂(i) − Π(i)‖1 = Õ((α0 + 1)^(3/2) √p / ((p − q) √n))

Support Recovery

∃ ξ s.t. for all nodes j ∈ [n] and all communities i ∈ [k], w.h.p.: Π(i, j) ≥ ξ ⇒ S(i, j) = 1 and Π(i, j) ≤ ξ/2 ⇒ S(i, j) = 0. Zero-error support recovery of significant memberships of all nodes.

SLIDE 43

Outline

1. Introduction
2. Spectral Methods: Matrices to Tensors
3. Tensor Forms for Different Models
4. Experimental Results
5. Overcomplete Tensors
6. Conclusion

SLIDE 44

Computational Complexity (k ≪ n)

n = # of nodes, N = # of iterations, k = # of communities, c = # of cores.

          Whiten           STGD        Unwhiten
Space     O(nk)            O(k^2)      O(nk)
Time      O(nsk/c + k^3)   O(Nk^3/c)   O(nsk/c)

Whiten: matrix/vector products and SVD. STGD: Stochastic Tensor Gradient Descent. Unwhiten: matrix/vector products. Our approach: O(nsk/c + k^3) overall.

Embarrassingly parallel and fast!

SLIDE 45

Scaling Of The Stochastic Iterations

[Log-log plot: running time (secs) vs. number of communities k, comparing MATLAB Tensor Toolbox (CPU), CULA Standard Interface (GPU), CULA Device Interface (GPU), and Eigen Sparse (CPU).]

SLIDE 46

Summary of Results

Datasets: Facebook friendship network (n ∼ 20k), Yelp user–business review network (n ∼ 40k), DBLP coauthorship network (n ∼ 1 million; subgraph ∼ 100k). Error (E) and recovery ratio (R).

Dataset             k̂     Method       Running Time   E       R
Facebook (k=360)    500    ours         468            0.0175  100%
Facebook (k=360)    500    variational  86,808         0.0308  100%
Yelp (k=159)        100    ours         287            0.046   86%
Yelp (k=159)        100    variational  N.A.           –       –
DBLP sub (k=250)    500    ours         10,157         0.139   89%
DBLP sub (k=250)    500    variational  558,723        16.38   99%
DBLP (k=6000)       100    ours         5,407          0.105   95%

Thanks to Prem Gopalan and David Mimno for providing variational code.

SLIDE 47

Experimental Results on Yelp

Lowest error business categories & largest weight businesses

Rank  Category        Business                    Stars  Review Counts
1     Latin American  Salvadoreno Restaurant      4.0    36
2     Gluten Free     P.F. Chang's China Bistro   3.5    55
3     Hobby Shops     Make Meaning                4.5    14
4     Mass Media      KJZZ 91.5FM                 4.0    13
5     Yoga            Sutra Midtown               4.5    31

SLIDE 48

Experimental Results on Yelp

Lowest error business categories & largest weight businesses

Rank  Category        Business                    Stars  Review Counts
1     Latin American  Salvadoreno Restaurant      4.0    36
2     Gluten Free     P.F. Chang's China Bistro   3.5    55
3     Hobby Shops     Make Meaning                4.5    14
4     Mass Media      KJZZ 91.5FM                 4.0    13
5     Yoga            Sutra Midtown               4.5    31

Bridgeness: distance from the uniform vector [1/k̂, . . . , 1/k̂]⊤

Top-5 bridging nodes (businesses)

Business               Categories
Four Peaks Brewing     Restaurants, Bars, American, Nightlife, Food, Pubs, Tempe
Pizzeria Bianco        Restaurants, Pizza, Phoenix
FEZ                    Restaurants, Bars, American, Nightlife, Mediterranean, Lounges, Phoenix
Matt's Big Breakfast   Restaurants, Phoenix, Breakfast & Brunch
Cornish Pasty Co       Restaurants, Bars, Nightlife, Pubs, Tempe

SLIDE 49

Outline

1. Introduction
2. Spectral Methods: Matrices to Tensors
3. Tensor Forms for Different Models
4. Experimental Results
5. Overcomplete Tensors
6. Conclusion

SLIDE 50

Beyond Orthogonal Tensor Decomposition

T = Σ_{j∈[k]} wj aj ⊗ aj ⊗ aj. k: tensor rank, d: ambient dimension. k > d: overcomplete.

SLIDE 51

Beyond Orthogonal Tensor Decomposition

T = Σ_{j∈[k]} wj aj ⊗ aj ⊗ aj. k: tensor rank, d: ambient dimension. k > d: overcomplete.

A is incoherent: ⟨ai, aj⟩ ∼ 1/√d for i ≠ j.

SLIDE 52

Beyond Orthogonal Tensor Decomposition

T = Σ_{j∈[k]} wj aj ⊗ aj ⊗ aj. k: tensor rank, d: ambient dimension. k > d: overcomplete.

A is incoherent: ⟨ai, aj⟩ ∼ 1/√d for i ≠ j.

Guaranteed recovery when k = o(d^(1.5)).

SLIDE 53

Beyond Orthogonal Tensor Decomposition

T = Σ_{j∈[k]} wj aj ⊗ aj ⊗ aj. k: tensor rank, d: ambient dimension. k > d: overcomplete.

A is incoherent: ⟨ai, aj⟩ ∼ 1/√d for i ≠ j.

Guaranteed recovery when k = o(d^(1.5)). Tight sample complexity bounds.

“Guaranteed Non-Orthogonal Tensor Decomposition via Alternating Rank-1 Updates” by A. Anandkumar, R. Ge, and M. Janzamin. Preprint, Feb. 2014. “Provable Learning of Overcomplete Latent Variable Models: Semi-supervised & Unsupervised”.
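A quick numerical illustration of the incoherence condition (a sketch with random unit-norm columns, not tied to any particular model here): pairwise inner products of random directions in R^d concentrate at the ∼1/√d scale.

```python
import numpy as np

# Random unit-norm columns of an overcomplete dictionary (k > d) are
# incoherent: off-diagonal inner products are on the order of 1/sqrt(d).
rng = np.random.default_rng(0)
d, k = 100, 300
A = rng.normal(size=(d, k))
A /= np.linalg.norm(A, axis=0)             # normalize each column
G = A.T @ A
off_diag = np.abs(G[~np.eye(k, dtype=bool)])
print(off_diag.mean(), 1 / np.sqrt(d))     # comparable magnitudes
```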

SLIDE 54

High-level Intuition for Sample Bounds

Gaussian mixture model: x = Ah + z, where z is noise. Exact moment: T = Σ_i wi ai ⊗ ai ⊗ ai. Sample moment: T̂ = (1/n) Σ_i xi ⊗ xi ⊗ xi − . . ..

Naive idea: ‖T̂ − T‖ ≤ ‖mat(T̂) − mat(T)‖, then apply the matrix Bernstein inequality. Our idea: careful ε-net covering for ‖T̂ − T‖. T̂ − T has many terms, e.g. (1/n) Σ_i zi ⊗ zi ⊗ zi: need to bound (1/n) Σ_i ⟨zi, u⟩^3 for all u ∈ S^(d−1). Classify the inner products into buckets and bound each bucket separately.

SLIDE 55

High-level Intuition for Sample Bounds

Gaussian mixture model: x = Ah + z, where z is noise. Exact moment: T = Σ_i wi ai ⊗ ai ⊗ ai. Sample moment: T̂ = (1/n) Σ_i xi ⊗ xi ⊗ xi − . . ..

Naive idea: ‖T̂ − T‖ ≤ ‖mat(T̂) − mat(T)‖, then apply the matrix Bernstein inequality. Our idea: careful ε-net covering for ‖T̂ − T‖. T̂ − T has many terms, e.g. (1/n) Σ_i zi ⊗ zi ⊗ zi: need to bound (1/n) Σ_i ⟨zi, u⟩^3 for all u ∈ S^(d−1). Classify the inner products into buckets and bound each bucket separately.

Tight sample bounds for a range of latent variable models. E.g. require Ω̃(k) samples for k-Gaussian mixtures in the low-noise regime.

SLIDE 56

Main Result: Local Convergence

Initialization: ‖a1 − a^(0)‖ ≤ ε0, with ε0 < const. Noise: T̂ := T + E, with ‖E‖ ≤ 1/polylog(d). Error: εT := ‖E‖ + Õ(√k / d).

Theorem (Local Convergence)

After O(log(1/εT)) steps of alternating rank-1 updates, ‖a1 − a^(t)‖ = O(εT). Linear convergence, up to the approximation error. Guarantees for overcomplete tensors: k = o(d^(1.5)), and k = o(d^(p/2)) for pth-order tensors. Requires good initialization. What about global convergence?
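One plausible form of a single alternating rank-1 update (a hedged, asymmetric power-iteration-style sketch; the paper's exact update rule may differ):

```python
import numpy as np

# One alternating rank-1 update: refresh each factor by contracting T with
# the current estimates of the other two factors, then renormalize.
def rank1_update(T, u, v, w):
    u = np.einsum('abc,b,c->a', T, v, w); u /= np.linalg.norm(u)
    v = np.einsum('abc,a,c->b', T, u, w); v /= np.linalg.norm(v)
    w = np.einsum('abc,a,b->c', T, u, v); w /= np.linalg.norm(w)
    return u, v, w
```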

slide-57
SLIDE 57

Global Convergence k = O(d)

SVD Initialization

Find the top singular vector of T(I, I, θ) for θ ∼ N(0, I), and use it as the initialization; repeat over L trials.

Conditions for global convergence

Number of initializations: L ≥ k^(Ω((k/d)^2)). Tensor rank: k = O(d). Number of iterations: N = Θ(log(1/εT)). Recall εT: approx. error.
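A minimal sketch of one SVD-initialization trial as described above (illustrative, not the paper's code):

```python
import numpy as np

# One trial: draw theta ~ N(0, I), form the matrix T(I, I, theta), and take
# its top left singular vector as a candidate starting point.
def svd_init(T, rng):
    theta = rng.normal(size=T.shape[2])
    M = np.einsum('abc,c->ab', T, theta)   # T(I, I, theta)
    U, _, _ = np.linalg.svd(M)
    return U[:, 0]                         # top left singular vector
```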
SLIDE 58

Global Convergence k = O(d)

SVD Initialization

Find the top singular vector of T(I, I, θ) for θ ∼ N(0, I), and use it as the initialization; repeat over L trials.

Conditions for global convergence

Number of initializations: L ≥ k^(Ω((k/d)^2)). Tensor rank: k = O(d). Number of iterations: N = Θ(log(1/εT)). Recall εT: approx. error.

Latest Improvement (Assuming Gaussian aj’s)

Improved initialization requirement for convergence: |⟨x^(0), aj⟩| ≥ d^β · √k / d.

SLIDE 59

Global Convergence k = O(d)

SVD Initialization

Find the top singular vector of T(I, I, θ) for θ ∼ N(0, I), and use it as the initialization; repeat over L trials.

Conditions for global convergence

Number of initializations: L ≥ k^(Ω((k/d)^2)). Tensor rank: k = O(d). Number of iterations: N = Θ(log(1/εT)). Recall εT: approx. error.

Latest Improvement (Assuming Gaussian aj’s)

Improved initialization requirement for convergence: |⟨x^(0), aj⟩| ≥ d^β · √k / d. Initialize with samples whose noise variance dσ^2 satisfies σ = o(√d / √k).

SLIDE 60

Outline

1. Introduction
2. Spectral Methods: Matrices to Tensors
3. Tensor Forms for Different Models
4. Experimental Results
5. Overcomplete Tensors
6. Conclusion

SLIDE 61

Conclusion

Guaranteed Learning of Latent Variable Models

Efficient sample and computational complexities. Better performance compared to EM, Variational Bayes, etc.

In practice

Scalable and embarrassingly parallel: handles large datasets. Efficient performance: validated via perplexity or against ground truth.

Software Code

Topic modeling: https://github.com/FurongHuang/TopicModeling
Community detection: https://github.com/FurongHuang/Fast-Detection-of-Overlappi
Youtube videos and slides from the ML summer school: http://newport.eecs.uci.edu/anandkumar/MLSS.html