
slide-1
SLIDE 1

Guaranteed Learning of Latent Variable Models through Tensor Methods

Furong Huang

University of Maryland

furongh@cs.umd.edu ACM SIGMETRICS Tutorial 2018

1 / 75

slide-2
SLIDE 2

Tutorial Topic

Learning algorithms for latent variable models based on decompositions of moment tensors.

[Figure: latent variable model — observed words (life, gene, data, DNA, RNA) generated from hidden topics k1–k5 via matrix A, with hidden choice variable h; pipeline: unlabeled data → latent variable model → tensor decomposition → inference, with the moment tensor written as a sum of rank-1 terms]

“Method-of-moments” (Pearson, 1894)

2 / 75

slide-3
SLIDE 3

Tutorial Topic

Learning algorithms (parameter estimation) for latent variable models based on decompositions of moment tensors.

[Figure: latent variable model — observed words (life, gene, data, DNA, RNA) generated from hidden topics k1–k5 via matrix A, with hidden choice variable h; pipeline: unlabeled data → latent variable model → tensor decomposition → inference, with the moment tensor written as a sum of rank-1 terms]

“Method-of-moments” (Pearson, 1894)

2 / 75

slide-4
SLIDE 4

Application 1: Clustering

Basic operation of grouping data points. Hypothesis: each data point belongs to an unknown group.

3 / 75

slide-5
SLIDE 5

Application 1: Clustering

Basic operation of grouping data points. Hypothesis: each data point belongs to an unknown group.

Probabilistic/latent variable viewpoint

The groups represent different distributions. (e.g. Gaussian). Each data point is drawn from one of the given distributions. (e.g. Gaussian mixtures).

3 / 75

slide-6
SLIDE 6

Application 2: Topic Modeling

Document modeling

Observed: words in document corpus. Hidden: topics. Goal: carry out document summarization.

4 / 75

slide-7
SLIDE 7

Application 3: Understanding Human Communities

Social Networks

Observed: network of social ties, e.g. friendships, co-authorships Hidden: groups/communities of social actors.

5 / 75

slide-8
SLIDE 8

Application 4: Recommender Systems

Recommender System

Observed: Ratings of users for various products, e.g. yelp reviews. Goal: Predict new recommendations. Modeling: Find groups/communities of users and products.

6 / 75

slide-9
SLIDE 9

Application 5: Feature Learning

Feature Engineering

Learn good features/representations for classification tasks, e.g. image and speech recognition. Sparse representations, low dimensional hidden structures.

7 / 75

slide-10
SLIDE 10

Application 6: Computational Biology

Observed: gene expression levels Goal: discover gene groups Hidden variables: regulators controlling gene groups

8 / 75

slide-11
SLIDE 11

Application 7: Human Disease Hierarchy Discovery

CMS: 1.6 million patients, 168 million diagnostic events, 11 k diseases.

“Scalable Latent Tree Model and its Application to Health Analytics” by F. Huang, N. U. Niranjan, I. Perros, R. Chen, J. Sun, A. Anandkumar, NIPS 2015 MLHC workshop.

9 / 75

slide-12
SLIDE 12

How to model hidden effects?

Basic Approach: mixtures/clusters

Hidden variable h is categorical.

Advanced: Probabilistic models

Hidden variable h has more general distributions. Can model mixed memberships. [Diagram: observed variables x1–x5 with hidden variables h1–h3.] This talk: basic mixture model and some advanced models.

10 / 75

slide-13
SLIDE 13

Challenges in Learning

Basic goal in all mentioned applications

Discover hidden structure in data: unsupervised learning.

[Figure: unlabeled data → latent variable model → learning algorithm → inference]

11 / 75

slide-14
SLIDE 14

Challenges in Learning – find hidden structure in data

[Figure: unlabeled data → latent variable model → learning algorithm → inference]

Challenge: Conditions for Identifiability

Can the model be identified given infinite computation and data? Are there tractable algorithms under identifiability?

11 / 75

slide-15
SLIDE 15

Challenges in Learning – find hidden structure in data

[Figure: unlabeled data → latent variable model → MCMC → inference]

Challenge: Conditions for Identifiability

Can the model be identified given infinite computation and data? Are there tractable algorithms under identifiability?

Challenge: Efficient Learning of Latent Variable Models

MCMC: random sampling, slow

◮ Exponential mixing time 11 / 75

slide-16
SLIDE 16

Challenges in Learning – find hidden structure in data

[Figure: unlabeled data → latent variable model → learning algorithm → inference]

Challenge: Conditions for Identifiability

Can the model be identified given infinite computation and data? Are there tractable algorithms under identifiability?

Challenge: Efficient Learning of Latent Variable Models

MCMC: random sampling, slow

◮ Exponential mixing time

Likelihood: non-convex, not scalable

◮ Exponential critical points 11 / 75

slide-17
SLIDE 17

Challenges in Learning – find hidden structure in data

[Figure: unlabeled data → latent variable model → learning algorithm → inference]

Challenge: Conditions for Identifiability

Can the model be identified given infinite computation and data? Are there tractable algorithms under identifiability?

Challenge: Efficient Learning of Latent Variable Models

MCMC: random sampling, slow

◮ Exponential mixing time

Likelihood: non-convex, not scalable

◮ Exponential critical points

Efficient computational and sample complexities?

11 / 75

slide-18
SLIDE 18

Challenges in Learning – find hidden structure in data

[Figure: unlabeled data → latent variable model → tensor decomposition → inference, with the moment tensor written as a sum of rank-1 terms]

Challenge: Conditions for Identifiability

Can the model be identified given infinite computation and data? Are there tractable algorithms under identifiability?

Challenge: Efficient Learning of Latent Variable Models

MCMC: random sampling, slow

◮ Exponential mixing time

Likelihood: non-convex, not scalable

◮ Exponential critical points

Efficient computational and sample complexities? Guaranteed and efficient learning through spectral methods

11 / 75

slide-19
SLIDE 19

What this tutorial will cover

Outline

1

Introduction

2

Motivation: Challenges of MLE for Gaussian Mixtures

12 / 75

slide-20
SLIDE 20

What this tutorial will cover

Outline

1

Introduction

2

Motivation: Challenges of MLE for Gaussian Mixtures

3

Introduction of Method of Moments and Tensor Notations

12 / 75

slide-21
SLIDE 21

What this tutorial will cover

Outline

1

Introduction

2

Motivation: Challenges of MLE for Gaussian Mixtures

3

Introduction of Method of Moments and Tensor Notations

4

Topic Model for Single-topic Documents

12 / 75

slide-22
SLIDE 22

What this tutorial will cover

Outline

1

Introduction

2

Motivation: Challenges of MLE for Gaussian Mixtures

3

Introduction of Method of Moments and Tensor Notations

4

Topic Model for Single-topic Documents

◮ Identifiability ◮ Parameter recovery via decomposition of exact moments 12 / 75

slide-23
SLIDE 23

What this tutorial will cover

Outline

1

Introduction

2

Motivation: Challenges of MLE for Gaussian Mixtures

3

Introduction of Method of Moments and Tensor Notations

4

Topic Model for Single-topic Documents

◮ Identifiability ◮ Parameter recovery via decomposition of exact moments 5

Error-tolerant Algorithms for Tensor Decompositions

12 / 75

slide-24
SLIDE 24

What this tutorial will cover

Outline

1

Introduction

2

Motivation: Challenges of MLE for Gaussian Mixtures

3

Introduction of Method of Moments and Tensor Notations

4

Topic Model for Single-topic Documents

◮ Identifiability ◮ Parameter recovery via decomposition of exact moments 5

Error-tolerant Algorithms for Tensor Decompositions

◮ Decomposition for tensors with linearly independent components ◮ Decomposition for tensors with orthogonal components 12 / 75

slide-25
SLIDE 25

What this tutorial will cover

Outline

1

Introduction

2

Motivation: Challenges of MLE for Gaussian Mixtures

3

Introduction of Method of Moments and Tensor Notations

4

Topic Model for Single-topic Documents

◮ Identifiability ◮ Parameter recovery via decomposition of exact moments 5

Error-tolerant Algorithms for Tensor Decompositions

◮ Decomposition for tensors with linearly independent components ◮ Decomposition for tensors with orthogonal components 6

Tensor Decomposition for Neural Network Compression

7

Conclusion

12 / 75

slide-26
SLIDE 26

Outline

1

Introduction

2

Motivation: Challenges of MLE for Gaussian Mixtures

3

Introduction of Method of Moments and Tensor Notations

4

Topic Model for Single-topic Documents

5

Algorithms for Tensor Decompositions

6

Tensor Decomposition for Neural Network Compression

7

Conclusion

13 / 75

slide-27
SLIDE 27

Gaussian Mixture Model

Generative Model

Samples come from a mixture of K Gaussians with mixing weights Cat(π1, π2, . . . , πK). Each sample is drawn from one of the K Gaussians N(µh, Σh), h ∈ [K]: H ∼ Cat(π1, π2, . . . , πK), X | H = h ∼ N(µh, Σh), ∀h ∈ [K]

14 / 75

slide-28
SLIDE 28

Gaussian Mixture Model

Generative Model

Samples come from a mixture of K Gaussians with mixing weights Cat(π1, π2, . . . , πK). Each sample is drawn from one of the K Gaussians N(µh, Σh), h ∈ [K]: H ∼ Cat(π1, π2, . . . , πK), X | H = h ∼ N(µh, Σh), ∀h ∈ [K]

Learning Problem

Estimate the mean vector µh, covariance matrix Σh, and mixing weight πh of each subpopulation from unlabeled data.

14 / 75
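To make the generative model concrete, here is a minimal NumPy sketch (illustrative only; the mixing weights, means, and covariances below are made-up toy values) that samples exactly as described: first a hidden label H from Cat(π), then X | H = h from N(µ_h, Σ_h).

```python
import numpy as np

def sample_gmm(n, pis, mus, sigmas, rng=np.random.default_rng(0)):
    """Draw n samples: H ~ Cat(pi), X | H = h ~ N(mu_h, Sigma_h)."""
    K = len(pis)
    h = rng.choice(K, size=n, p=pis)                       # hidden component labels
    x = np.stack([rng.multivariate_normal(mus[k], sigmas[k]) for k in h])
    return x, h

# Toy example with K = 3 spherical components in R^2 (made-up parameters).
pis = np.array([0.5, 0.3, 0.2])
mus = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
sigmas = np.array([np.eye(2)] * 3)
x, h = sample_gmm(1000, pis, mus, sigmas)
print(x.shape, np.bincount(h) / len(h))                    # empirical label frequencies ≈ pis
```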

slide-29
SLIDE 29

Maximum Likelihood Estimator (MLE)

Data: {x_i}_{i=1}^n

Likelihood (iid): Pr_θ(data) = ∏_{i=1}^n Pr_θ(x_i)

Model parameter estimation: θ_mle := argmax_{θ∈Θ} log Pr_θ(data)

Latent variable models: some variables are hidden
◮ No “direct” estimators when some variables are hidden
◮ Local optimization via Expectation-Maximization (EM) (Dempster, Laird, & Rubin, 1977)

15 / 75

slide-30
SLIDE 30

MLE for Gaussian Mixture Models

Given data {x_i}_{i=1}^n and the number of Gaussian components K, the model parameters to be estimated are θ = {(µh, Σh, πh)}_{h=1}^K.

θ_mle for Gaussian Mixture Models:
θ_mle := argmax_θ Σ_{i=1}^n log Σ_{h=1}^K [ πh / det(Σh)^{1/2} ] exp( −(1/2) (x_i − µh)^⊤ Σh^{−1} (x_i − µh) )

Solving the MLE is NP-hard (Dasgupta, 2008; Aloise, Deshpande, Hansen, & Popat, 2009; Mahajan, Nimbhorkar, & Varadarajan, 2009; Vattani, 2009; Awasthi, Charikar, Krishnaswamy, & Sinop, 2015).

16 / 75

slide-31
SLIDE 31

Consistent Estimator

Definition

Suppose iid samples {x_i}_{i=1}^n are generated by the distribution Pr_θ, where the model parameters θ ∈ Θ are unknown. An estimator θ̂ is consistent if E‖θ̂ − θ‖ → 0 as n → ∞.

Spherical Gaussian Mixtures (Σh = I), as n → ∞:

For K = 2 and πh = 1/2: EM is consistent (Xu, H., & Maleki, 2016; Daskalakis, Tzamos, & Zampetakis, 2016).

Larger K: EM is easily trapped in local maxima far from the global maximum (Jin, Zhang, Balakrishnan, Wainwright, & Jordan, 2016).

Practitioners often run EM with many (random) restarts, but it may take a long time to get near the global maximum.

17 / 75

slide-32
SLIDE 32

Hardness of Parameter Estimation

Exponentially difficult computationally or statistically to learn model parameters, even under the parametric setting.

Cryptographic hardness

E.g., Mossel & Roch, 2006

Information-theoretic hardness

E.g., Moitra & Valiant, 2010. May require 2^{Ω(K)} running time or 2^{Ω(K)} sample size.

18 / 75

slide-33
SLIDE 33

Ways Around the Hardness

Separation conditions.

◮ E.g., assume min_{i≠j} ‖µi − µj‖² / (σi² + σj²) is sufficiently large.
◮ (Dasgupta, 1999; Arora & Kannan, 2001; Vempala & Wang, 2002; . . . )

Structural assumptions.

◮ E.g., assume sparsity, separable (anchor words). ◮ (Spielman, Wang & Wright, 2012; Arora, Ge & Moitra, 2012; . . . )

Non-degeneracy conditions.

◮ E.g., assume µ1, . . ., µK span a K-dimensional space.

This tutorial: statistically and computationally efficient learning algorithms for non-degenerate instances via method-of-moments.

19 / 75

slide-34
SLIDE 34

Outline

1

Introduction

2

Motivation: Challenges of MLE for Gaussian Mixtures

3

Introduction of Method of Moments and Tensor Notations

4

Topic Model for Single-topic Documents

5

Algorithms for Tensor Decompositions

6

Tensor Decomposition for Neural Network Compression

7

Conclusion

20 / 75

slide-35
SLIDE 35

Method-of-Moments At A Glance

1. Determine functions of the model parameters θ that are estimable from observable data:
◮ Moments E_θ[f(X)]

2. Form estimates of the moments using data (iid samples {x_i}_{i=1}^n):
◮ Empirical moments Ê[f(X)]

3. Solve the (approximate) moment-matching equations for the parameters θ:
◮ Moment matching E_θ[f(X)] = Ê[f(X)] as n → ∞

Toy Example

How do we estimate a Gaussian, i.e., (µ, Σ), given iid samples {x_i}_{i=1}^n ∼ N(µ, Σ)?

21 / 75
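As a sanity check on the three-step recipe, the sketch below (an assumption-laden toy: one multivariate Gaussian, synthetic NumPy data) matches the first two moments to recover (µ, Σ), using E[x] = µ and E[(x − µ)(x − µ)^⊤] = Σ.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([1.0, -2.0])
sigma_true = np.array([[2.0, 0.5], [0.5, 1.0]])
x = rng.multivariate_normal(mu_true, sigma_true, size=50_000)   # iid samples

# Step 2: empirical moments; Step 3: moment matching gives the estimators directly.
mu_hat = x.mean(axis=0)                          # matches E[x] = mu
sigma_hat = np.cov(x, rowvar=False, bias=True)   # matches E[(x - mu)(x - mu)^T] = Sigma

print(np.round(mu_hat, 2), np.round(sigma_hat, 2))
```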

slide-36
SLIDE 36

What is a tensor?

Multi-dimensional Array

Tensor: a higher-order analogue of a matrix. The number of dimensions (modes) is called the tensor order.

22 / 75

slide-37
SLIDE 37

Tensor Product

[a ⊗ b]_{i1,i2} = a_{i1} b_{i2}: rank-1 matrix
[a ⊗ b ⊗ c]_{i1,i2,i3} = a_{i1} b_{i2} c_{i3}: rank-1 tensor

23 / 75
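A quick NumPy illustration of these definitions (a sketch with made-up vectors, not part of the slides): `np.einsum` builds the rank-1 matrix a ⊗ b and the rank-1 tensor a ⊗ b ⊗ c entrywise.

```python
import numpy as np

a, b, c = np.array([1.0, 2.0]), np.array([3.0, 4.0, 5.0]), np.array([6.0, 7.0])

M = np.einsum('i,j->ij', a, b)        # [a ⊗ b]_{i1,i2} = a_{i1} b_{i2}, a rank-1 matrix
T = np.einsum('i,j,k->ijk', a, b, c)  # [a ⊗ b ⊗ c]_{i1,i2,i3} = a_{i1} b_{i2} c_{i3}

print(M.shape, np.linalg.matrix_rank(M))          # (2, 3), rank 1
print(T.shape, T[1, 2, 0] == a[1] * b[2] * c[0])  # (2, 3, 2), True
```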

slide-38
SLIDE 38

Slices

Horizontal slices Lateral slices Frontal slices

24 / 75

slide-39
SLIDE 39

Fiber

Mode-1 (column) fibers Mode-2 (row) fibers Mode-3 (tube) fibers

25 / 75

slide-40
SLIDE 40

CP decomposition

X = Σ_{h=1}^R a_h ⊗ b_h ⊗ c_h

Rank: the minimum number of rank-1 tensors whose sum generates the tensor.

26 / 75

slide-41
SLIDE 41

Multi-linear Transform

Multi-linear Operation

If T = Σ_{h=1}^R a_h ⊗ b_h ⊗ c_h, the multi-linear operation using matrices (X, Y, Z) is
T(X, Y, Z) := Σ_{h=1}^R (X^⊤ a_h) ⊗ (Y^⊤ b_h) ⊗ (Z^⊤ c_h).
Similarly, the multi-linear operation using vectors (x, y, z) is
T(x, y, z) := Σ_{h=1}^R (x^⊤ a_h) ⊗ (y^⊤ b_h) ⊗ (z^⊤ c_h).

27 / 75
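The multi-linear operation is just a contraction of each mode with the corresponding matrix. The sketch below (illustrative, NumPy only, random made-up factors) builds a CP tensor T = Σ_h a_h ⊗ b_h ⊗ c_h and checks that contracting the modes with (X, Y, Z) equals Σ_h (X^⊤a_h) ⊗ (Y^⊤b_h) ⊗ (Z^⊤c_h).

```python
import numpy as np

rng = np.random.default_rng(0)
d, R = 5, 3
A, B, C = rng.normal(size=(d, R)), rng.normal(size=(d, R)), rng.normal(size=(d, R))

# CP tensor T = sum_h a_h ⊗ b_h ⊗ c_h
T = np.einsum('ih,jh,kh->ijk', A, B, C)

# Multi-linear operation T(X, Y, Z): contract mode 1 with X, mode 2 with Y, mode 3 with Z.
X, Y, Z = rng.normal(size=(d, 4)), rng.normal(size=(d, 4)), rng.normal(size=(d, 4))
lhs = np.einsum('ijk,ia,jb,kc->abc', T, X, Y, Z)
rhs = np.einsum('ih,jh,kh->ijk', X.T @ A, Y.T @ B, Z.T @ C)  # sum_h (X^T a_h)⊗(Y^T b_h)⊗(Z^T c_h)

print(np.allclose(lhs, rhs))  # True
```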

slide-42
SLIDE 42

Tensors in Method of Moments

Matrix: pair-wise relationships. Observed signal or data x ∈ R^d. Rank-1 matrix: [x ⊗ x]_{i,j} = x_i x_j. Aggregated pair-wise relationship: M2 = E[x ⊗ x].

Tensor: triple-wise (or higher-order) relationships. Observed signal or data x ∈ R^d. Rank-1 tensor: [x ⊗ x ⊗ x]_{i,j,k} = x_i x_j x_k. Aggregated triple-wise relationship: M3 = E[x ⊗ x ⊗ x] = E[x^{⊗3}].

28 / 75
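Forming the empirical moments from data is a one-liner per moment; the sketch below (NumPy, synthetic data only) averages the rank-1 outer products x_i ⊗ x_i and x_i ⊗ x_i ⊗ x_i.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 4
x = rng.normal(size=(n, d))                       # n iid observations in R^d

M2_hat = np.einsum('ni,nj->ij', x, x) / n         # empirical E[x ⊗ x]
M3_hat = np.einsum('ni,nj,nk->ijk', x, x, x) / n  # empirical E[x ⊗ x ⊗ x]

print(M2_hat.shape, M3_hat.shape)                 # (4, 4) and (4, 4, 4)
```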

slide-43
SLIDE 43

Why are tensors powerful?

Matrix Orthogonal Decomposition

Not unique without eigenvalue gap

I = e1 e1^⊤ + e2 e2^⊤ = u1 u1^⊤ + u2 u2^⊤, with u1 = [√2/2, −√2/2], u2 = [√2/2, √2/2]
29 / 75

slide-44
SLIDE 44

Why are tensors powerful?

Matrix Orthogonal Decomposition

Not unique without eigenvalue gap

I = e1 e1^⊤ + e2 e2^⊤ = u1 u1^⊤ + u2 u2^⊤, with u1 = [√2/2, −√2/2], u2 = [√2/2, √2/2]

Unique with eigenvalue gap

29 / 75

slide-45
SLIDE 45

Why are tensors powerful?

Matrix Orthogonal Decomposition

Not unique without eigenvalue gap

I = e1 e1^⊤ + e2 e2^⊤ = u1 u1^⊤ + u2 u2^⊤, with u1 = [√2/2, −√2/2], u2 = [√2/2, √2/2]

Unique with eigenvalue gap

Tensor Orthogonal Decomposition (Harshman, 1970)

Unique: eigenvalue gap not needed

+ = ≠

29 / 75

slide-46
SLIDE 46

Why are tensors powerful?

Matrix Orthogonal Decomposition

Not unique without eigenvalue gap

I = e1 e1^⊤ + e2 e2^⊤ = u1 u1^⊤ + u2 u2^⊤, with u1 = [√2/2, −√2/2], u2 = [√2/2, √2/2]

Unique with eigenvalue gap

Tensor Orthogonal Decomposition (Harshman, 1970)

Unique: eigenvalue gap not needed Slice of tensor has eigenvalue gap

+ =

29 / 75

slide-47
SLIDE 47

Why are tensors powerful?

Matrix Orthogonal Decomposition

Not unique without eigenvalue gap

I = e1 e1^⊤ + e2 e2^⊤ = u1 u1^⊤ + u2 u2^⊤, with u1 = [√2/2, −√2/2], u2 = [√2/2, √2/2]

Unique with eigenvalue gap

Tensor Orthogonal Decomposition (Harshman, 1970)

Unique: eigenvalue gap not needed Slice of tensor has eigenvalue gap

+ = ≠

29 / 75

slide-48
SLIDE 48

Outline

1

Introduction

2

Motivation: Challenges of MLE for Gaussian Mixtures

3

Introduction of Method of Moments and Tensor Notations

4

Topic Model for Single-topic Documents

5

Algorithms for Tensor Decompositions

6

Tensor Decomposition for Neural Network Compression

7

Conclusion

30 / 75

slide-49
SLIDE 49

Topic Modeling

General Topic Model (e.g., Latent Dirichlet Allocation)

K topics
◮ each associated with a distribution over vocabulary words {a_h}_{h=1}^K

Hidden topic proportion w
◮ per document i, w^(i) ∈ ∆^{K−1}

Documents iid ∼ mixture of topics

[Figure: word counts per document and topic-word matrix, with topics Politics, Science, Sports, Business and words such as game, season, play]

31 / 75

slide-50
SLIDE 50

Topic Modeling

Topic Model for Single-topic Documents

K topics
◮ each associated with a distribution over vocabulary words {a_h}_{h=1}^K

Hidden topic proportion w
◮ per document i, w^(i) ∈ {e1, . . . , eK}

Documents iid ∼ a_h

[Figure: word counts per document (a single topic with weight 1.0) and topic-word matrix, with topics Politics, Science, Sports, Business and words such as game, season, play]

31 / 75

slide-51
SLIDE 51

Model Parameters of Topic Model for Single-topic Documents

Estimate Topic Proportion
Topic proportion w = [w1, . . . , wK], w_h = P[topic of word = h]

Estimate Topic Word Matrix
Topic-word matrix A = [a1, . . . , aK], A_{jh} = P[word = e_j | topic = h]

Goal: estimate the model parameters {(a_h, w_h)}_{h=1}^K, given iid samples of n documents (word counts {c^(i)}_{i=1}^n)

Frequency vector x^(i) = c^(i)/L, where the document length is L = Σ_j c^(i)_j

32 / 75

slide-52
SLIDE 52

Moment Matching

Nondegenerate model (linearly independent topic-word matrix)

[Figure: topic-word matrix with topics Politics, Science, Sports, Business]

Generative process:
◮ Choose h ∼ Cat(w1, . . . , wK)
◮ Generate L words ∼ a_h

E[x] = Σ_{h=1}^K P[topic = h] E[x | topic = h]
E[x | topic = h] = Σ_j P[word = e_j | topic = h] e_j = a_h

33 / 75

slide-53
SLIDE 53

Moment Matching

Nondegenerate model (linearly independent topic-word matrix)

[Figure: topic-word matrix with topics Politics, Science, Sports, Business]

Generative process:
◮ Choose h ∼ Cat(w1, . . . , wK)
◮ Generate L words ∼ a_h

E[x] = Σ_{h=1}^K P[topic = h] E[x | topic = h] = Σ_{h=1}^K w_h a_h
E[x | topic = h] = Σ_j P[word = e_j | topic = h] e_j = a_h

33 / 75

slide-54
SLIDE 54

Identifiability: how long must the documents be?

Nondegenerate model (linearly independent topic-word matrix)

[Figure: topic-word matrix with topics Politics, Science, Sports, Business]

Generative process:
◮ Choose h ∼ Cat(w1, . . . , wK)
◮ Generate L words ∼ a_h

E[x] = Σ_{h=1}^K P[topic = h] E[x | topic = h] = Σ_{h=1}^K w_h a_h
E[x | topic = h] = Σ_j P[word = e_j | topic = h] e_j = a_h

M1: distribution of words (M̂1: occurrence frequency of words)
M1 = E[x] = Σ_h w_h a_h;  M̂1 = (1/n) Σ_{i=1}^n x^(i)

[Figure: M1 as a weighted sum of topic vectors (Crime, Sports, Education) over the words campus, police, witness]

33 / 75

slide-55
SLIDE 55

Identifiability: how long must the documents be?

Nondegenerate model (linearly independent topic-word matrix)

[Figure: topic-word matrix with topics Politics, Science, Sports, Business]

Generative process:
◮ Choose h ∼ Cat(w1, . . . , wK)
◮ Generate L words ∼ a_h

E[x] = Σ_{h=1}^K P[topic = h] E[x | topic = h] = Σ_{h=1}^K w_h a_h
E[x | topic = h] = Σ_j P[word = e_j | topic = h] e_j = a_h

M1: distribution of words (M̂1: occurrence frequency of words)
M1 = E[x] = Σ_h w_h a_h;  M̂1 = (1/n) Σ_{i=1}^n x^(i)

[Figure: M1 as a weighted sum of topic vectors (Crime, Sports, Education) over the words campus, police, witness]

No unique decomposition of vectors

33 / 75

slide-56
SLIDE 56

Identifiability: how long must the documents be?

Nondegenerate model (linearly independent topic-word matrix)

[Figure: topic-word matrix with topics Politics, Science, Sports, Business]

Generative process:
◮ Choose h ∼ Cat(w1, . . . , wK)
◮ Generate L words ∼ a_h

E[x] = Σ_{h=1}^K P[topic = h] E[x | topic = h] = Σ_{h=1}^K w_h a_h
E[x | topic = h] = Σ_j P[word = e_j | topic = h] e_j = a_h

M2: distribution of word pairs (M̂2: co-occurrence of word pairs)
M2 = E[x ⊗ x] = Σ_h w_h a_h ⊗ a_h;  M̂2 = (1/n) Σ_{i=1}^n x^(i) ⊗ x^(i)

[Figure: M2 as a weighted sum of rank-1 matrices for topics Crime, Sports, Education over the words campus, police, witness]

33 / 75

slide-57
SLIDE 57

Identifiability: how long must the documents be?

Nondegenerate model (linearly independent topic-word matrix)

[Figure: topic-word matrix with topics Politics, Science, Sports, Business]

Generative process:
◮ Choose h ∼ Cat(w1, . . . , wK)
◮ Generate L words ∼ a_h

E[x] = Σ_{h=1}^K P[topic = h] E[x | topic = h] = Σ_{h=1}^K w_h a_h
E[x | topic = h] = Σ_j P[word = e_j | topic = h] e_j = a_h

M2: distribution of word pairs (M̂2: co-occurrence of word pairs)
M2 = E[x ⊗ x] = Σ_h w_h a_h ⊗ a_h;  M̂2 = (1/n) Σ_{i=1}^n x^(i) ⊗ x^(i)

[Figure: M2 as a weighted sum of rank-1 matrices for topics Crime, Sports, Education over the words campus, police, witness]

Matrix decomposition recovers subspace, not actual model

33 / 75

slide-58
SLIDE 58

Identifiability: how long must the documents be?

Nondegenerate model (linearly independent topic-word matrix)

Find a W (applied to each mode: W^⊤, W^⊤, W^⊤) such that

M2: distribution of word pairs (M̂2: co-occurrence of word pairs)
M2 = E[x ⊗ x] = Σ_h w_h a_h ⊗ a_h;  M̂2 = (1/n) Σ_{i=1}^n x^(i) ⊗ x^(i)

[Figure: M2 as a weighted sum of rank-1 matrices for topics Crime, Sports, Education over the words campus, police, witness]

Many such W’s exist; find one such that the v_h = W^⊤ a_h are orthogonal

33 / 75

slide-59
SLIDE 59

Identifiability: how long must the documents be?

Nondegenerate model (linearly independent topic-word matrix)

Know a W (applied to each mode: W^⊤, W^⊤, W^⊤) such that the components become orthogonal.

M3: distribution of word triples (M̂3: co-occurrence of word triples)
M3 = E[x^{⊗3}] = Σ_h w_h a_h^{⊗3};  M̂3 = (1/n) Σ_{i=1}^n (x^(i))^{⊗3}

[Figure: M3 as a weighted sum of rank-1 tensors for topics Crime, Sports, Education over the words campus, police, witness]

Orthogonalize the tensor, project data with W : M3(W , W , W )

33 / 75

slide-60
SLIDE 60

Identifiability: how long must the documents be?

Nondegenerate model (linearly independent topic-word matrix)

Know a W (applied to each mode: W^⊤, W^⊤, W^⊤) such that the components become orthogonal.

M3: distribution of word triples (M̂3: co-occurrence of word triples)
M3(W, W, W) = E[(W^⊤ x)^{⊗3}] = Σ_h w_h (W^⊤ a_h)^{⊗3};  M̂3(W, W, W) = (1/n) Σ_{i=1}^n (W^⊤ x^(i))^{⊗3}

Unique orthogonal tensor decomposition {v̂_h}_{h=1}^K

33 / 75

slide-61
SLIDE 61

Identifiability: how long must the documents be?

Nondegenerate model (linearly independent topic-word matrix)

Know a W (applied to each mode: W^⊤, W^⊤, W^⊤) such that the components become orthogonal.

M3: distribution of word triples (M̂3: co-occurrence of word triples)
M3(W, W, W) = E[(W^⊤ x)^{⊗3}] = Σ_h w_h (W^⊤ a_h)^{⊗3};  M̂3(W, W, W) = (1/n) Σ_{i=1}^n (W^⊤ x^(i))^{⊗3}

Model parameter estimation: a_h = (W^⊤)^† v̂_h

33 / 75

slide-62
SLIDE 62

Identifiability: how long must the documents be?

Nondegenerate model (linearly independent topic-word matrix)

Know a W (applied to each mode: W^⊤, W^⊤, W^⊤) such that the components become orthogonal.

M3: distribution of word triples (M̂3: co-occurrence of word triples)
M3(W, W, W) = E[(W^⊤ x)^{⊗3}] = Σ_h w_h (W^⊤ a_h)^{⊗3};  M̂3(W, W, W) = (1/n) Σ_{i=1}^n (W^⊤ x^(i))^{⊗3}

L ≥ 3: Learning Topic Models through Matrix/Tensor Decomposition

33 / 75
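To tie the recipe together, here is a compact NumPy sketch (illustrative only: it works on exact population moments M2 and M3 built from synthetic made-up parameters, and uses an eigendecomposition of a random slice combination of the whitened tensor in place of the tensor power method covered later). It whitens with M2, decomposes the whitened M3, and un-whitens to recover {(a_h, w_h)}.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 8, 3
A = rng.dirichlet(np.ones(d), size=K).T          # topic-word matrix, columns a_h (d x K)
w = np.array([0.5, 0.3, 0.2])                    # topic proportions

# Exact population moments of the single-topic model.
M2 = np.einsum('h,ih,jh->ij', w, A, A)           # sum_h w_h a_h ⊗ a_h
M3 = np.einsum('h,ih,jh,kh->ijk', w, A, A, A)    # sum_h w_h a_h^{⊗3}

# Whitening: W^T M2 W = I using the top-K eigenpairs of M2.
evals, U = np.linalg.eigh(M2)
evals, U = evals[-K:], U[:, -K:]
W = U / np.sqrt(evals)                           # d x K
T = np.einsum('ijk,ia,jb,kc->abc', M3, W, W, W)  # orthogonal K x K x K tensor

# Components of T: eigenvectors of a random slice combination T(I, I, c).
c = rng.normal(size=K)
_, V = np.linalg.eigh(np.einsum('ijk,k->ij', T, c))
lam = np.einsum('ijk,ih,jh,kh->h', T, V, V, V)   # lambda_h = T(v_h, v_h, v_h)
V, lam = V * np.sign(lam), np.abs(lam)           # fix signs so lambda_h > 0

# Un-whiten: a_h = (W^T)^† (lambda_h v_h), w_h = lambda_h^{-2}.
A_hat = np.linalg.pinv(W.T) @ (V * lam)
w_hat = lam ** -2
print(np.round(np.sort(w_hat), 3))               # ≈ sorted w (up to permutation)
```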

slide-63
SLIDE 63

Take Away Message

Consider topic models whose word distributions under different topics are linearly independent. The parameters of the topic model for single-topic documents can be efficiently recovered from the distribution of three-word documents.

◮ Distribution of three-word documents (word triples): M3 = E[x ⊗ x ⊗ x] = Σ_h w_h a_h ⊗ a_h ⊗ a_h (M̂3: co-occurrence of word triples)

Two-word documents are not sufficient for identifiability.

34 / 75

slide-64
SLIDE 64

Tensor Methods Compared with Variational Inference

Learning Topics from PubMed on Spark: 8 million docs

[Plot: perplexity vs. running time (s) for tensor vs. variational methods]

35 / 75

slide-65
SLIDE 65

Tensor Methods Compared with Variational Inference

Learning Topics from PubMed on Spark: 8 million docs

[Plot: perplexity vs. running time (s) for tensor vs. variational methods]

Learning Communities from Graph Connectivity

Facebook: n ∼ 20k, Yelp: n ∼ 40k, DBLPsub: n ∼ 0.1m, DBLP: n ∼ 1m

[Plot: error per group and running time (s) for FB, YP, DBLPsub, DBLP]

35 / 75

slide-66
SLIDE 66

Tensor Methods Compared with Variational Inference

Learning Topics from PubMed on Spark: 8 million docs

[Plot: perplexity vs. running time (s) for tensor vs. variational methods]

Learning Communities from Graph Connectivity

Facebook: n ∼ 20k, Yelp: n ∼ 40k, DBLPsub: n ∼ 0.1m, DBLP: n ∼ 1m

[Plot: error per group and running time (s) for FB, YP, DBLPsub, DBLP]

Orders of Magnitude Faster & More Accurate

“Online Tensor Methods for Learning Latent Variable Models”, F. Huang, U. Niranjan, M. Hakeem, A. Anandkumar, JMLR 2014. “Tensor Methods on Apache Spark”, F. Huang, A. Anandkumar, Oct. 2015.

35 / 75

slide-67
SLIDE 67

Outline

1

Introduction

2

Motivation: Challenges of MLE for Gaussian Mixtures

3

Introduction of Method of Moments and Tensor Notations

4

Topic Model for Single-topic Documents

5

Algorithms for Tensor Decompositions

6

Tensor Decomposition for Neural Network Compression

7

Conclusion

36 / 75

slide-68
SLIDE 68

Jennrich’s Algorithm (Simplified)

Task: Given tensor T = Σ_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling).

37 / 75

slide-69
SLIDE 69

Jennrich’s Algorithm (Simplified)

Task: Given tensor T = Σ_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling).

Properties of Tensor Slices

Linear combination of slices: T(I, I, c) = Σ_h ⟨µ_h, c⟩ µ_h ⊗ µ_h

37 / 75

slide-70
SLIDE 70

Jennrich’s Algorithm (Simplified)

Task: Given tensor T = Σ_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling).

Properties of Tensor Slices

Linear combination of slices: T(I, I, c) = Σ_h ⟨µ_h, c⟩ µ_h ⊗ µ_h

Intuitions for Jennrich’s Algorithm

37 / 75

slide-71
SLIDE 71

Jennrich’s Algorithm (Simplified)

Task: Given tensor T = Σ_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling).

Properties of Tensor Slices

Linear combination of slices: T(I, I, c) = Σ_h ⟨µ_h, c⟩ µ_h ⊗ µ_h

Intuitions for Jennrich’s Algorithm

Linear comb. of slices of a tensor share the same set of eigenvectors

37 / 75

slide-72
SLIDE 72

Jennrich’s Algorithm (Simplified)

Task: Given tensor T = Σ_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling).

Properties of Tensor Slices

Linear combination of slices: T(I, I, c) = Σ_h ⟨µ_h, c⟩ µ_h ⊗ µ_h

Intuitions for Jennrich’s Algorithm

Linear combinations of slices of a tensor share the same set of eigenvectors. The shared eigenvectors are the tensor components {µ_h}_{h=1}^K

37 / 75

slide-73
SLIDE 73

Jennrich’s Algorithm (Simplified)

Task: Given tensor T = Σ_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling).

Algorithm: Jennrich’s Algorithm
Require: Tensor T ∈ R^{d×d×d}
Ensure: Components {µ̂_h}_{h=1}^K a.s. = {µ_h}_{h=1}^K
1: Sample c and c′ independently and uniformly at random from S^{d−1}
2: Return {µ̂_h}_{h=1}^K ← eigenvectors of T(I, I, c) T(I, I, c′)^†

38 / 75
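A minimal NumPy sketch of the two steps above (an illustration, assuming an exactly rank-K symmetric tensor as input; component scaling and ordering are not recovered, matching the “up to scaling” caveat):

```python
import numpy as np

def jennrich(T, K, rng=np.random.default_rng(0)):
    """Recover the K linearly independent components of T = sum_h mu_h^{⊗3}, up to scaling."""
    d = T.shape[0]
    c1, c2 = rng.normal(size=d), rng.normal(size=d)
    c1, c2 = c1 / np.linalg.norm(c1), c2 / np.linalg.norm(c2)   # random directions on S^{d-1}
    S1 = np.einsum('ijk,k->ij', T, c1)          # T(I, I, c)
    S2 = np.einsum('ijk,k->ij', T, c2)          # T(I, I, c')
    evals, vecs = np.linalg.eig(S1 @ np.linalg.pinv(S2))
    idx = np.argsort(-np.abs(evals))[:K]        # keep K eigenvectors with largest |eigenvalue|
    return np.real(vecs[:, idx])                # columns are mu_h up to scaling (and order)

# Check on a synthetic rank-3 symmetric tensor.
rng = np.random.default_rng(1)
d, K = 6, 3
U = rng.normal(size=(d, K))
T = np.einsum('ih,jh,kh->ijk', U, U, U)
M = jennrich(T, K)
# Each recovered column should be parallel to some true mu_h.
cos = np.abs((M / np.linalg.norm(M, axis=0)).T @ (U / np.linalg.norm(U, axis=0)))
print(np.round(cos.max(axis=1), 4))             # ≈ [1, 1, 1]
```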

slide-74
SLIDE 74

Jennrich’s Algorithm (Simplified)

Task: Given tensor T = Σ_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling).

Algorithm: Jennrich’s Algorithm
Require: Tensor T ∈ R^{d×d×d}
Ensure: Components {µ̂_h}_{h=1}^K a.s. = {µ_h}_{h=1}^K
1: Sample c and c′ independently and uniformly at random from S^{d−1}
2: Return {µ̂_h}_{h=1}^K ← eigenvectors of T(I, I, c) T(I, I, c′)^†

Consistency of Jennrich’s Algorithm?

Estimators {µ̂_h}_{h=1}^K ≡ unknown components {µ_h}_{h=1}^K (up to scaling)?

38 / 75

slide-75
SLIDE 75

Analysis of Consistency of Jennrich’s algorithm

Recall: linear combinations of slices share the eigenvectors {µ_h}_{h=1}^K, i.e.,
T(I, I, c) T(I, I, c′)^† a.s.= U D_c U^⊤ (U^⊤)^† D_{c′}^{−1} U^† a.s.= U (D_c D_{c′}^{−1}) U^†,
where U = [µ1 | . . . | µK] collects the linearly independent tensor components and D_c = Diag(⟨µ1, c⟩, . . . , ⟨µK, c⟩) is diagonal.

39 / 75

slide-76
SLIDE 76

Analysis of Consistency of Jennrich’s algorithm

Recall: linear combinations of slices share the eigenvectors {µ_h}_{h=1}^K, i.e.,
T(I, I, c) T(I, I, c′)^† a.s.= U D_c U^⊤ (U^⊤)^† D_{c′}^{−1} U^† a.s.= U (D_c D_{c′}^{−1}) U^†,
where U = [µ1 | . . . | µK] collects the linearly independent tensor components and D_c = Diag(⟨µ1, c⟩, . . . , ⟨µK, c⟩) is diagonal.

By linear independence of {µ_h}_{h=1}^K and the random choice of c and c′:

1. U has rank K;

39 / 75

slide-77
SLIDE 77

Analysis of Consistency of Jennrich’s algorithm

Recall: linear combinations of slices share the eigenvectors {µ_h}_{h=1}^K, i.e.,
T(I, I, c) T(I, I, c′)^† a.s.= U D_c U^⊤ (U^⊤)^† D_{c′}^{−1} U^† a.s.= U (D_c D_{c′}^{−1}) U^†,
where U = [µ1 | . . . | µK] collects the linearly independent tensor components and D_c = Diag(⟨µ1, c⟩, . . . , ⟨µK, c⟩) is diagonal.

By linear independence of {µ_h}_{h=1}^K and the random choice of c and c′:

1. U has rank K;
2. D_c and D_{c′} are invertible (a.s.);

39 / 75

slide-78
SLIDE 78

Analysis of Consistency of Jennrich’s algorithm

Recall: linear combinations of slices share the eigenvectors {µ_h}_{h=1}^K, i.e.,
T(I, I, c) T(I, I, c′)^† a.s.= U D_c U^⊤ (U^⊤)^† D_{c′}^{−1} U^† a.s.= U (D_c D_{c′}^{−1}) U^†,
where U = [µ1 | . . . | µK] collects the linearly independent tensor components and D_c = Diag(⟨µ1, c⟩, . . . , ⟨µK, c⟩) is diagonal.

By linear independence of {µ_h}_{h=1}^K and the random choice of c and c′:

1. U has rank K;
2. D_c and D_{c′} are invertible (a.s.);
3. Diagonal entries of D_c D_{c′}^{−1} are distinct (a.s.);

39 / 75

slide-79
SLIDE 79

Analysis of Consistency of Jennrich’s algorithm

Recall: linear combinations of slices share the eigenvectors {µ_h}_{h=1}^K, i.e.,
T(I, I, c) T(I, I, c′)^† a.s.= U D_c U^⊤ (U^⊤)^† D_{c′}^{−1} U^† a.s.= U (D_c D_{c′}^{−1}) U^†,
where U = [µ1 | . . . | µK] collects the linearly independent tensor components and D_c = Diag(⟨µ1, c⟩, . . . , ⟨µK, c⟩) is diagonal.

By linear independence of {µ_h}_{h=1}^K and the random choice of c and c′:

1. U has rank K;
2. D_c and D_{c′} are invertible (a.s.);
3. Diagonal entries of D_c D_{c′}^{−1} are distinct (a.s.).

So {µ_h}_{h=1}^K are the eigenvectors of T(I, I, c) T(I, I, c′)^† with distinct non-zero eigenvalues. Jennrich’s algorithm is consistent.

39 / 75

slide-80
SLIDE 80

Error-tolerant algorithms for tensor decompositions

40 / 75

slide-81
SLIDE 81

Moment Estimator: Empirical Moments

41 / 75

slide-82
SLIDE 82

Moment Estimator: Empirical Moments

Moments E_θ[f(X)] are functions of the model parameters θ. Empirical moments Ê[f(X)] are computed using only the iid samples {x_i}_{i=1}^n.

41 / 75

slide-83
SLIDE 83

Moment Estimator: Empirical Moments

Moments E_θ[f(X)] are functions of the model parameters θ. Empirical moments Ê[f(X)] are computed using only the iid samples {x_i}_{i=1}^n.

Example

Third-order moment: distribution of word triples
◮ E[x ⊗ x ⊗ x] = Σ_h w_h a_h ⊗ a_h ⊗ a_h

Empirical third-order moment: co-occurrence frequency of word triples
Ê[x ⊗ x ⊗ x] = (1/n) Σ_{i=1}^n x_i ⊗ x_i ⊗ x_i

41 / 75

slide-84
SLIDE 84

Moment Estimator: Empirical Moments

Moments E_θ[f(X)] are functions of the model parameters θ. Empirical moments Ê[f(X)] are computed using only the iid samples {x_i}_{i=1}^n.

Example

Third-order moment: distribution of word triples
◮ E[x ⊗ x ⊗ x] = Σ_h w_h a_h ⊗ a_h ⊗ a_h

Empirical third-order moment: co-occurrence frequency of word triples
Ê[x ⊗ x ⊗ x] = (1/n) Σ_{i=1}^n x_i ⊗ x_i ⊗ x_i

Inevitably expect an error of order n^{−1/2} in some norm, e.g.,
◮ Operator norm: ‖E[x ⊗ x ⊗ x] − Ê[x ⊗ x ⊗ x]‖ ∼ n^{−1/2}, where ‖T‖ := sup_{x,y,z ∈ S^{d−1}} T(x, y, z)
◮ Frobenius norm: ‖E[x ⊗ x ⊗ x] − Ê[x ⊗ x ⊗ x]‖_F ∼ n^{−1/2}, where ‖T‖_F := ( Σ_{i,j,k} T_{i,j,k}² )^{1/2}

41 / 75

slide-85
SLIDE 85

Stability of Jennrich’s Algorithm

Recall Jennrich’s algorithm

Given tensor T = Σ_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling).

Algorithm: Jennrich’s Algorithm
Require: Tensor T ∈ R^{d×d×d}
Ensure: Components {µ̂_h}_{h=1}^K a.s. = {µ_h}_{h=1}^K
1: Sample c and c′ independently and uniformly at random from S^{d−1}
2: Return {µ̂_h}_{h=1}^K ← eigenvectors of T(I, I, c) T(I, I, c′)^†

42 / 75

slide-86
SLIDE 86

Stability of Jennrich’s Algorithm

Recall Jennrich’s algorithm

Given tensor T = Σ_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling).

Algorithm: Jennrich’s Algorithm
Require: Tensor T ∈ R^{d×d×d}
Ensure: Components {µ̂_h}_{h=1}^K a.s. = {µ_h}_{h=1}^K
1: Sample c and c′ independently and uniformly at random from S^{d−1}
2: Return {µ̂_h}_{h=1}^K ← eigenvectors of T(I, I, c) T(I, I, c′)^†

Challenge: we only have access to T̂ such that ‖T̂ − T‖ ∼ n^{−1/2}

42 / 75

slide-87
SLIDE 87

Stability of Jennrich’s Algorithm

Recall Jennrich’s algorithm

Given tensor T = Σ_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling).

Algorithm: Jennrich’s Algorithm (run on the empirical tensor)
Require: Tensor T̂ ∈ R^{d×d×d}
Ensure: Components {µ̂_h}_{h=1}^K a.s. = {µ_h}_{h=1}^K ?
1: Sample c and c′ independently and uniformly at random from S^{d−1}
2: Return {µ̂_h}_{h=1}^K ← eigenvectors of T̂(I, I, c) T̂(I, I, c′)^†

Challenge: we only have access to T̂ such that ‖T̂ − T‖ ∼ n^{−1/2}

42 / 75

slide-88
SLIDE 88

Stability of Jennrich’s Algorithm

Recall Jennrich’s algorithm

Given tensor T = Σ_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling).

Algorithm: Jennrich’s Algorithm (run on the empirical tensor)
Require: Tensor T̂ ∈ R^{d×d×d}
Ensure: Components {µ̂_h}_{h=1}^K a.s. = {µ_h}_{h=1}^K ?
1: Sample c and c′ independently and uniformly at random from S^{d−1}
2: Return {µ̂_h}_{h=1}^K ← eigenvectors of T̂(I, I, c) T̂(I, I, c′)^†

Stability of eigenvectors requires eigenvalue gaps

42 / 75

slide-89
SLIDE 89

Stability of Jennrich’s Algorithm

Recall Jennrich’s algorithm

Given tensor T = Σ_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling).

Algorithm: Jennrich’s Algorithm (run on the empirical tensor)
Require: Tensor T̂ ∈ R^{d×d×d}
Ensure: Components {µ̂_h}_{h=1}^K a.s. = {µ_h}_{h=1}^K ?
1: Sample c and c′ independently and uniformly at random from S^{d−1}
2: Return {µ̂_h}_{h=1}^K ← eigenvectors of T̂(I, I, c) T̂(I, I, c′)^†

Stability of eigenvectors requires eigenvalue gaps

To ensure eigenvalue gaps for T̂(·, ·, c) T̂(·, ·, c′)^†, we need ‖T̂(·, ·, c) T̂(·, ·, c′)^† − T(·, ·, c) T(·, ·, c′)^†‖ ≪ ∆.

42 / 75

slide-90
SLIDE 90

Stability of Jennrich’s Algorithm

Recall Jennrich’s algorithm

Given tensor T = Σ_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling).

Algorithm: Jennrich’s Algorithm (run on the empirical tensor)
Require: Tensor T̂ ∈ R^{d×d×d}
Ensure: Components {µ̂_h}_{h=1}^K a.s. = {µ_h}_{h=1}^K ?
1: Sample c and c′ independently and uniformly at random from S^{d−1}
2: Return {µ̂_h}_{h=1}^K ← eigenvectors of T̂(I, I, c) T̂(I, I, c′)^†

Stability of eigenvectors requires eigenvalue gaps

To ensure eigenvalue gaps for T̂(·, ·, c) T̂(·, ·, c′)^†, we need ‖T̂(·, ·, c) T̂(·, ·, c′)^† − T(·, ·, c) T(·, ·, c′)^†‖ ≪ ∆. Ultimately, ‖T̂ − T‖_F ≪ 1/poly(d) is required.

42 / 75

slide-91
SLIDE 91

Stability of Jennrich’s Algorithm

Recall Jennrich’s algorithm

Given tensor T = Σ_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling).

Algorithm: Jennrich’s Algorithm (run on the empirical tensor)
Require: Tensor T̂ ∈ R^{d×d×d}
Ensure: Components {µ̂_h}_{h=1}^K a.s. = {µ_h}_{h=1}^K ?
1: Sample c and c′ independently and uniformly at random from S^{d−1}
2: Return {µ̂_h}_{h=1}^K ← eigenvectors of T̂(I, I, c) T̂(I, I, c′)^†

Stability of eigenvectors requires eigenvalue gaps

To ensure eigenvalue gaps for T̂(·, ·, c) T̂(·, ·, c′)^†, we need ‖T̂(·, ·, c) T̂(·, ·, c′)^† − T(·, ·, c) T(·, ·, c′)^†‖ ≪ ∆. Ultimately, ‖T̂ − T‖_F ≪ 1/poly(d) is required.

A different approach?

42 / 75

slide-92
SLIDE 92

Initial Ideas

In many applications, we estimate moments of the form M3 = Σ_{h=1}^K w_h a_h^{⊗3}, where {a_h}_{h=1}^K are assumed to be linearly independent.

What if {a_h}_{h=1}^K are orthonormal?

43 / 75

slide-93
SLIDE 93

Initial Ideas

In many applications, we estimate moments of the form M3 = Σ_{h=1}^K w_h a_h^{⊗3}, where {a_h}_{h=1}^K are assumed to be linearly independent.

What if {a_h}_{h=1}^K are orthonormal?

M3(I, a_i, a_i) = Σ_h w_h ⟨a_h, a_i⟩² a_h = w_i a_i, ∀i.

43 / 75

slide-94
SLIDE 94

Initial Ideas

In many applications, we estimate moments of the form M3 = Σ_{h=1}^K w_h a_h^{⊗3}, where {a_h}_{h=1}^K are assumed to be linearly independent.

What if {a_h}_{h=1}^K are orthonormal?

M3(I, a_i, a_i) = Σ_h w_h ⟨a_h, a_i⟩² a_h = w_i a_i, ∀i.

Analogous to matrix eigenvectors: Mv = M(I, v) = λv.

43 / 75

slide-95
SLIDE 95

Initial Ideas

In many applications, we estimate moments of the form M3 = Σ_{h=1}^K w_h a_h^{⊗3}, where {a_h}_{h=1}^K are assumed to be linearly independent.

What if {a_h}_{h=1}^K are orthonormal?

M3(I, a_i, a_i) = Σ_h w_h ⟨a_h, a_i⟩² a_h = w_i a_i, ∀i.

Analogous to matrix eigenvectors: Mv = M(I, v) = λv.

Define orthonormal {a_h}_{h=1}^K as eigenvectors of the tensor M3.

43 / 75

slide-96
SLIDE 96

Initial Ideas

In many applications, we estimate moments of the form M3 = Σ_{h=1}^K w_h a_h^{⊗3}, where {a_h}_{h=1}^K are assumed to be linearly independent.

What if {a_h}_{h=1}^K are orthonormal?

M3(I, a_i, a_i) = Σ_h w_h ⟨a_h, a_i⟩² a_h = w_i a_i, ∀i.

Analogous to matrix eigenvectors: Mv = M(I, v) = λv.

Define orthonormal {a_h}_{h=1}^K as eigenvectors of the tensor M3.

Two Problems

{a_h}_{h=1}^K is not orthogonal in general.

How to find eigenvectors of a tensor?

43 / 75

slide-97
SLIDE 97

Initial Ideas

In many applications, we estimate moments of the form M3 = Σ_{h=1}^K w_h a_h^{⊗3}, where {a_h}_{h=1}^K are assumed to be linearly independent.

What if {a_h}_{h=1}^K are orthonormal?

M3(I, a_i, a_i) = Σ_h w_h ⟨a_h, a_i⟩² a_h = w_i a_i, ∀i.

Analogous to matrix eigenvectors: Mv = M(I, v) = λv.

Define orthonormal {a_h}_{h=1}^K as eigenvectors of the tensor M3.

Two Problems

{a_h}_{h=1}^K is not orthogonal in general.

How to find eigenvectors of a tensor?

43 / 75

slide-98
SLIDE 98

Whitening is the process of finding a whitening matrix W such that the multi-linear operation (using W) on M3 orthogonalizes its components:
M3(W, W, W) = Σ_h w_h (W^⊤ a_h)^{⊗3} = Σ_h w_h v_h^{⊗3}, with v_h ⊥ v_{h′}, ∀ h ≠ h′

44 / 75

slide-99
SLIDE 99

Whitening

Given M3 = Σ_h w_h a_h^{⊗3} and M2 = Σ_h w_h a_h ⊗ a_h,

45 / 75

slide-100
SLIDE 100

Whitening

Given M3 = Σ_h w_h a_h^{⊗3} and M2 = Σ_h w_h a_h ⊗ a_h, find a whitening matrix W such that the v_h = W^⊤ a_h are orthogonal.

45 / 75

slide-101
SLIDE 101

Whitening

Given M3 = Σ_h w_h a_h^{⊗3} and M2 = Σ_h w_h a_h ⊗ a_h, find a whitening matrix W such that the v_h = W^⊤ a_h are orthogonal. When A = [a1, . . . , aK] ∈ R^{d×K} has full column rank, this is an invertible transformation.

[Figure: W maps components a1, a2, a3 to orthogonal vectors v1, v2, v3]

45 / 75

slide-102
SLIDE 102

Using Whitening to Obtain Orthogonal Tensor

[Figure: tensor M3 (d × d × d) mapped to an orthogonal tensor T (K × K × K)]

46 / 75

slide-103
SLIDE 103

Using Whitening to Obtain Orthogonal Tensor

[Figure: tensor M3 mapped to an orthogonal tensor T]

Multi-linear transform: T = M3(W, W, W) = Σ_h w_h (W^⊤ a_h)^{⊗3}.

46 / 75

slide-104
SLIDE 104

Using Whitening to Obtain Orthogonal Tensor

[Figure: tensor M3 mapped to an orthogonal tensor T]

Multi-linear transform: T = M3(W, W, W) = Σ_h w_h (W^⊤ a_h)^{⊗3}.

T = Σ_{h∈[K]} w_h · v_h^{⊗3} has orthogonal components.

46 / 75

slide-105
SLIDE 105

Using Whitening to Obtain Orthogonal Tensor

[Figure: tensor M3 mapped to an orthogonal tensor T]

Multi-linear transform: T = M3(W, W, W) = Σ_h w_h (W^⊤ a_h)^{⊗3}.

T = Σ_{h∈[K]} w_h · v_h^{⊗3} has orthogonal components.

Dimensionality reduction when K ≪ d, as M3 ∈ R^{d×d×d} and T ∈ R^{K×K×K}.

46 / 75

slide-106
SLIDE 106

How to Find Whitening Matrix?

Given M3 = Σ_h w_h a_h^{⊗3} and M2 = Σ_h w_h a_h ⊗ a_h. Goal: find W such that the v_h = W^⊤ a_h are orthogonal.

47 / 75

slide-107
SLIDE 107

How to Find Whitening Matrix?

Given M3 = Σ_h w_h a_h^{⊗3} and M2 = Σ_h w_h a_h ⊗ a_h. Goal: find W such that the v_h = W^⊤ a_h are orthogonal.

Use the pairwise moments M2 to find W s.t. W^⊤ M2 W = I.

47 / 75

slide-108
SLIDE 108

How to Find Whitening Matrix?

Given M3 = Σ_h w_h a_h^{⊗3} and M2 = Σ_h w_h a_h ⊗ a_h. Goal: find W such that the v_h = W^⊤ a_h are orthogonal.

Use the pairwise moments M2 to find W s.t. W^⊤ M2 W = I.

W = U Diag(λ̃^{−1/2}), where M2 = U Diag(λ̃) U^⊤ is the eigendecomposition of M2.

47 / 75

slide-109
SLIDE 109

How to Find Whitening Matrix?

Given M3 = Σ_h w_h a_h^{⊗3} and M2 = Σ_h w_h a_h ⊗ a_h. Goal: find W such that the v_h = W^⊤ a_h are orthogonal.

Use the pairwise moments M2 to find W s.t. W^⊤ M2 W = I.

W = U Diag(λ̃^{−1/2}), where M2 = U Diag(λ̃) U^⊤ is the eigendecomposition of M2.

V := W^⊤ A Diag(w)^{1/2} is an orthogonal matrix.
T = M3(W, W, W) = Σ_h w_h^{−1/2} (W^⊤ a_h √w_h)^{⊗3} = Σ_h λ_h v_h^{⊗3}, with λ_h := w_h^{−1/2}. T is an orthogonal tensor.

47 / 75
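A small NumPy sketch of this construction (illustrative only: exact population moments and made-up synthetic parameters): W is built from the top-K eigenpairs of M2, and one can check both W^⊤ M2 W = I and that the whitened components √w_h W^⊤ a_h are orthonormal.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 10, 4
A = rng.normal(size=(d, K))                      # linearly independent components a_h
w = rng.dirichlet(np.ones(K))                    # positive weights

M2 = np.einsum('h,ih,jh->ij', w, A, A)           # sum_h w_h a_h ⊗ a_h

evals, U = np.linalg.eigh(M2)                    # eigendecomposition M2 = U diag(evals) U^T
evals, U = evals[-K:], U[:, -K:]                 # keep the top-K (nonzero) eigenpairs
W = U / np.sqrt(evals)                           # W = U diag(evals)^{-1/2}

V = W.T @ A * np.sqrt(w)                         # columns v_h = sqrt(w_h) W^T a_h
print(np.allclose(W.T @ M2 @ W, np.eye(K)))      # whitening: True
print(np.allclose(V.T @ V, np.eye(K)))           # orthonormal components: True
```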

slide-110
SLIDE 110

Initial Ideas

In many applications, we estimate moments of the form M3 = Σ_h w_h a_h^{⊗3}, where {a_h}_{h=1}^K are assumed to be linearly independent.

What if {a_h}_{h=1}^K are orthonormal?

M3(I, a_i, a_i) = Σ_h w_h ⟨a_h, a_i⟩² a_h = w_i a_i, ∀i.

Analogous to matrix eigenvectors: Mv = M(I, v) = λv.

Define orthonormal {a_h}_{h=1}^K as eigenvectors of the tensor M3.

Two Problems

{a_h}_{h=1}^K is not orthogonal in general.

How to find eigenvectors of a tensor?

48 / 75

slide-111
SLIDE 111

Review: Orthogonal Matrix Eigen Decomposition

Task: Given matrix M = Σ_{h=1}^K λ_h v_h ⊗ v_h with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, ∀ h ≠ h′), find the components/eigenvectors.

49 / 75

slide-112
SLIDE 112

Review: Orthogonal Matrix Eigen Decomposition

Task: Given matrix M = Σ_{h=1}^K λ_h v_h ⊗ v_h with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, ∀ h ≠ h′), find the components/eigenvectors.

Properties of Matrix Eigenvectors

Fixed point of the linear transform: M(I, v_i) = Σ_h λ_h ⟨v_i, v_h⟩ v_h = λ_i v_i

49 / 75

slide-113
SLIDE 113

Review: Orthogonal Matrix Eigen Decomposition

Task: Given matrix M = Σ_{h=1}^K λ_h v_h ⊗ v_h with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, ∀ h ≠ h′), find the components/eigenvectors.

Properties of Matrix Eigenvectors

Fixed point of the linear transform: M(I, v_i) = Σ_h λ_h ⟨v_i, v_h⟩ v_h = λ_i v_i

Intuitions for Matrix Power Method

49 / 75

slide-114
SLIDE 114

Review: Orthogonal Matrix Eigen Decomposition

Task: Given matrix M = Σ_{h=1}^K λ_h v_h ⊗ v_h with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, ∀ h ≠ h′), find the components/eigenvectors.

Properties of Matrix Eigenvectors

Fixed point of the linear transform: M(I, v_i) = Σ_h λ_h ⟨v_i, v_h⟩ v_h = λ_i v_i

Intuitions for Matrix Power Method

The linear transform preserves the directions of the eigenvectors {v_h}_{h=1}^K

49 / 75

slide-115
SLIDE 115

Orthogonal Tensor Eigen Decomposition

Task: Given tensor T = Σ_{h=1}^K λ_h v_h^{⊗3} with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, ∀ h ≠ h′), find the components/eigenvectors.

50 / 75

slide-116
SLIDE 116

Orthogonal Tensor Eigen Decomposition

Task: Given tensor T = Σ_{h=1}^K λ_h v_h^{⊗3} with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, ∀ h ≠ h′), find the components/eigenvectors.

Properties of Tensor Eigenvectors

Fixed point of the bi-linear transform: T(I, v_i, v_i) = Σ_h λ_h ⟨v_i, v_h⟩² v_h = λ_i v_i

50 / 75

slide-117
SLIDE 117

Orthogonal Tensor Eigen Decomposition

Task: Given tensor T = Σ_{h=1}^K λ_h v_h^{⊗3} with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, ∀ h ≠ h′), find the components/eigenvectors.

Properties of Tensor Eigenvectors

Fixed point of the bi-linear transform: T(I, v_i, v_i) = Σ_h λ_h ⟨v_i, v_h⟩² v_h = λ_i v_i

Intuitions for Tensor Power Method

50 / 75

slide-118
SLIDE 118

Orthogonal Tensor Eigen Decomposition

Task: Given tensor T = Σ_{h=1}^K λ_h v_h^{⊗3} with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, ∀ h ≠ h′), find the components/eigenvectors.

Properties of Tensor Eigenvectors

Fixed point of the bi-linear transform: T(I, v_i, v_i) = Σ_h λ_h ⟨v_i, v_h⟩² v_h = λ_i v_i

Intuitions for Tensor Power Method

The bi-linear transform preserves the directions of the eigenvectors {v_h}_{h=1}^K

50 / 75

slide-119
SLIDE 119

Orthogonal Matrix Eigen Decomposition

Task: Given matrix M = Σ_{h=1}^K λ_h v_h^{⊗2} with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, ∀ h ≠ h′), find the components/eigenvectors.

Algorithm: Matrix Power Method
Require: Matrix M ∈ R^{K×K}
Ensure: Components {v̂_h}_{h=1}^K w.h.p. = {v_h}_{h=1}^K
1: for h = 1 : K do
2:   Sample u_0 uniformly at random from S^{K−1}
3:   for i = 1 : T do
4:     u_i ← M(I, u_{i−1}) / ‖M(I, u_{i−1})‖
5:   end for
6:   v̂_h ← u_T, λ̂_h ← M(v̂_h, v̂_h)
7:   Deflate: M ← M − λ̂_h v̂_h^{⊗2}
8: end for

51 / 75
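A direct NumPy transcription of the algorithm above (a sketch: fixed iteration count, no restarts, synthetic test matrix):

```python
import numpy as np

def matrix_power_method(M, K, n_iter=100, rng=np.random.default_rng(0)):
    """Recover eigenpairs of M = sum_h lambda_h v_h^{⊗2} by power iteration with deflation."""
    M = M.copy()
    vals, vecs = [], []
    for _ in range(K):
        u = rng.normal(size=M.shape[0])
        u /= np.linalg.norm(u)                   # random start on the sphere
        for _ in range(n_iter):
            u = M @ u
            u /= np.linalg.norm(u)               # u <- M(I, u) / ||M(I, u)||
        lam = u @ M @ u                          # lambda = M(v, v)
        vals.append(lam)
        vecs.append(u)
        M -= lam * np.outer(u, u)                # deflate: M <- M - lambda v v^T
    return np.array(vals), np.column_stack(vecs)

# Synthetic check: random orthonormal components with distinct positive weights.
rng = np.random.default_rng(1)
K = 4
V, _ = np.linalg.qr(rng.normal(size=(K, K)))
lam_true = np.array([4.0, 3.0, 2.0, 1.0])
M = (V * lam_true) @ V.T
vals, vecs = matrix_power_method(M, K)
print(np.round(np.sort(vals)[::-1], 3))          # ≈ [4, 3, 2, 1]
```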

slide-120
SLIDE 120

Orthogonal Matrix Eigen Decomposition

Task: Given matrix M = Σ_{h=1}^K λ_h v_h^{⊗2} with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, ∀ h ≠ h′), find the components/eigenvectors.

Algorithm: Matrix Power Method
Require: Matrix M ∈ R^{K×K}
Ensure: Components {v̂_h}_{h=1}^K w.h.p. = {v_h}_{h=1}^K
1: for h = 1 : K do
2:   Sample u_0 uniformly at random from S^{K−1}
3:   for i = 1 : T do
4:     u_i ← M(I, u_{i−1}) / ‖M(I, u_{i−1})‖
5:   end for
6:   v̂_h ← u_T, λ̂_h ← M(v̂_h, v̂_h)
7:   Deflate: M ← M − λ̂_h v̂_h^{⊗2}
8: end for

Consistency of Matrix Power Method?

Is there convergence? {v̂_h}_{h=1}^K ≡ {v_h}_{h=1}^K w.h.p.?

51 / 75

slide-121
SLIDE 121

Orthogonal Matrix Eigen Decomposition

Task: Given matrix M = Σ_{h=1}^K λ_h v_h^{⊗2} with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, ∀ h ≠ h′), find the components/eigenvectors.

Algorithm: Matrix Power Method
Require: Matrix M ∈ R^{K×K}
Ensure: Components {v̂_h}_{h=1}^K w.h.p. = {v_h}_{h=1}^K
1: for h = 1 : K do
2:   Sample u_0 uniformly at random from S^{K−1}
3:   for i = 1 : T do
4:     u_i ← M(I, u_{i−1}) / ‖M(I, u_{i−1})‖
5:   end for
6:   v̂_h ← u_T, λ̂_h ← M(v̂_h, v̂_h)
7:   Deflate: M ← M − λ̂_h v̂_h^{⊗2}
8: end for

Consistency of Matrix Power Method?

Is there convergence? {v̂_h}_{h=1}^K ≡ {v_h}_{h=1}^K w.h.p.?

Does the convergence depend on initialization?

51 / 75

slide-122
SLIDE 122

Orthogonal Tensor Eigen Decomposition

Task: Given tensor T = Σ_{h=1}^K λ_h v_h^{⊗3} with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, ∀ h ≠ h′), find the components/eigenvectors.

Algorithm: Tensor Power Method
Require: Tensor T ∈ R^{K×K×K}
Ensure: Components {v̂_h}_{h=1}^K w.h.p. = {v_h}_{h=1}^K
1: for h = 1 : K do
2:   Sample u_0 uniformly at random from S^{K−1}
3:   for i = 1 : T do
4:     u_i ← T(I, u_{i−1}, u_{i−1}) / ‖T(I, u_{i−1}, u_{i−1})‖
5:   end for
6:   v̂_h ← u_T, λ̂_h ← T(v̂_h, v̂_h, v̂_h)
7:   Deflate: T ← T − λ̂_h v̂_h^{⊗3}
8: end for

52 / 75
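And the tensor analogue, again as a sketch with a fixed number of iterations and deflation (assumes a clean orthogonal input tensor; the robust version with restarts is what the perturbation analysis below addresses):

```python
import numpy as np

def tensor_power_method(T, K, n_iter=30, rng=np.random.default_rng(0)):
    """Recover eigenpairs of T = sum_h lambda_h v_h^{⊗3} (orthonormal v_h) by power iteration."""
    T = T.copy()
    vals, vecs = [], []
    for _ in range(K):
        u = rng.normal(size=T.shape[0])
        u /= np.linalg.norm(u)
        for _ in range(n_iter):
            u = np.einsum('ijk,j,k->i', T, u, u)     # u <- T(I, u, u)
            u /= np.linalg.norm(u)
        lam = np.einsum('ijk,i,j,k->', T, u, u, u)   # lambda = T(v, v, v)
        vals.append(lam)
        vecs.append(u)
        T -= lam * np.einsum('i,j,k->ijk', u, u, u)  # deflate: T <- T - lambda v^{⊗3}
    return np.array(vals), np.column_stack(vecs)

# Synthetic check: orthonormal components with distinct weights.
rng = np.random.default_rng(1)
K = 4
V, _ = np.linalg.qr(rng.normal(size=(K, K)))
lam_true = np.array([4.0, 3.0, 2.0, 1.0])
T = np.einsum('h,ih,jh,kh->ijk', lam_true, V, V, V)
vals, vecs = tensor_power_method(T, K)
print(np.round(np.sort(vals)[::-1], 3))              # ≈ [4, 3, 2, 1]
```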

slide-123
SLIDE 123

Orthogonal Tensor Eigen Decomposition

Task: Given tensor T = Σ_{h=1}^K λ_h v_h^{⊗3} with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, ∀ h ≠ h′), find the components/eigenvectors.

Algorithm: Tensor Power Method
Require: Tensor T ∈ R^{K×K×K}
Ensure: Components {v̂_h}_{h=1}^K w.h.p. = {v_h}_{h=1}^K
1: for h = 1 : K do
2:   Sample u_0 uniformly at random from S^{K−1}
3:   for i = 1 : T do
4:     u_i ← T(I, u_{i−1}, u_{i−1}) / ‖T(I, u_{i−1}, u_{i−1})‖
5:   end for
6:   v̂_h ← u_T, λ̂_h ← T(v̂_h, v̂_h, v̂_h)
7:   Deflate: T ← T − λ̂_h v̂_h^{⊗3}
8: end for

Consistency of Tensor Power Method?

Is there convergence? {v̂_h}_{h=1}^K ≡ {v_h}_{h=1}^K w.h.p.?

Does the convergence depend on initialization?

52 / 75

slide-124
SLIDE 124

Analysis of Consistency of Matrix Power Method

Order the eigenvectors {v_h}_{h=1}^K so that the corresponding eigenvalues satisfy λ1 ≥ λ2 ≥ . . . ≥ λK. Project the initial point u_0 onto the eigenvectors {v_h}_{h=1}^K: c_h = ⟨u_0, v_h⟩, ∀h.

Convergence properties

Unique (identifiable) iff {λ_h}_{h=1}^K are distinct.

If the gap λ2/λ1 < 1 and c1 ≠ 0, the matrix power method converges to v1.

Converges linearly to v1 assuming the gap λ2/λ1 < 1.
◮ The linear transform gives M(I, u_0) = Σ_h λ_h (v_h^⊤ u_0) v_h = Σ_h λ_h c_h v_h, i.e., the projection in the v_h direction is scaled by λ_h.
◮ In t iterations, (v1^⊤ v̂)² / Σ_i (v_i^⊤ v̂)² ≥ 1 − K (λ2/λ1)^{2t}.

53 / 75

slide-125
SLIDE 125

Analysis of Consistency of Tensor Power Method

Project initial point u0 onto eigenvectors ch = u0, vh, ∀h. Order eigenvectors {vh}K

h=1 such that

λ1|c1| > λ2|c2| ≥ · · · ≥ λK|cK|.

Convergence properties

Identifiable i.f.f. {λh|ch|}K

h=1 are distinct. Initialization dependent.

If λ2|c2|

λ1|c1| < 1 and λ1|c1| = 0, tensor power method converges to v1.

Note v1 is NOT necessarily the largest eigenvector. Converges quadraticly to v1 assuming gap λ2|c2|

λ1|c1| < 1.

◮ Bi-linear transform permits T (I, u0, u0) = h λh

v⊤

h u0

2vh =

h λhc2 h vh

i.e., projection in vh direction is squared then scaled by λh.

◮ In t iterations,

  • v⊤

1 v2

  • i
  • v⊤

i v2 ≥ 1 − k

  • λ1

maxi=1 λi

2

  • v2c2

v1c1

  • 2t+1

.

54 / 75

slide-126
SLIDE 126

Matrix vs. tensor power iteration

Matrix power iteration: Tensor power iteration:

55 / 75

slide-127
SLIDE 127

Matrix vs. tensor power iteration

Matrix power iteration:

1

Requires gap between largest and second-largest eigenvalue. Property of the matrix only. Tensor power iteration:

1

Requires gap between largest and second-largest λh|ch|. Property of the tensor and initialization u0.

55 / 75

slide-128
SLIDE 128

Matrix vs. tensor power iteration

Matrix power iteration:

1

Requires gap between largest and second-largest eigenvalue. Property of the matrix only.

2

Converges to top eigenvector. Tensor power iteration:

1

Requires gap between largest and second-largest λh|ch|. Property of the tensor and initialization u0.

2

Converges to the v_i with the largest λ_i|c_i|. Not necessarily the largest eigenvector.

55 / 75

slide-129
SLIDE 129

Matrix vs. tensor power iteration

Matrix power iteration:
1. Requires a gap between the largest and second-largest eigenvalue; a property of the matrix only.
2. Converges to the top eigenvector.
3. Linear convergence: needs O(log(1/ǫ)) iterations.

Tensor power iteration:
1. Requires a gap between the largest and second-largest λ_h|c_h|; a property of the tensor and the initialization u_0.
2. Converges to the v_i with the largest λ_i|c_i|; not necessarily the largest eigenvector.
3. Quadratic convergence: needs O(log log(1/ǫ)) iterations.

55 / 75

slide-130
SLIDE 130

Spurious Eigenvectors for Tensor Eigen Decomposition

T = Σ_{h∈[K]} λ_h v_h^{⊗3}.

Characterization of eigenvectors: T(I, v, v) = λv?

{v_h}_{h=1}^K are eigenvectors since T(I, v_h, v_h) = λ_h v_h.

56 / 75

slide-131
SLIDE 131

Spurious Eigenvectors for Tensor Eigen Decomposition

T = Σ_{h∈[K]} λ_h v_h^{⊗3}.

Characterization of eigenvectors: T(I, v, v) = λv?

{v_h}_{h=1}^K are eigenvectors since T(I, v_h, v_h) = λ_h v_h.

Bad news: there can be other eigenvectors (unlike the matrix case).
◮ E.g., when {λ_h}_{h=1}^K ≡ 1, v = (v1 + v2)/√2 satisfies T(I, v, v) = (1/√2) v.

56 / 75

slide-132
SLIDE 132

Spurious Eigenvectors for Tensor Eigen Decomposition

T = Σ_{h∈[K]} λ_h v_h^{⊗3}.

Characterization of eigenvectors: T(I, v, v) = λv?

{v_h}_{h=1}^K are eigenvectors since T(I, v_h, v_h) = λ_h v_h.

Bad news: there can be other eigenvectors (unlike the matrix case).
◮ E.g., when {λ_h}_{h=1}^K ≡ 1, v = (v1 + v2)/√2 satisfies T(I, v, v) = (1/√2) v.

How do we avoid spurious solutions (those that are not components {v_h}_{h=1}^K)?

56 / 75

slide-133
SLIDE 133

Spurious Eigenvectors for Tensor Eigen Decomposition

T = Σ_{h∈[K]} λ_h v_h^{⊗3}.

Characterization of eigenvectors: T(I, v, v) = λv?

{v_h}_{h=1}^K are eigenvectors since T(I, v_h, v_h) = λ_h v_h.

Bad news: there can be other eigenvectors (unlike the matrix case).
◮ E.g., when {λ_h}_{h=1}^K ≡ 1, v = (v1 + v2)/√2 satisfies T(I, v, v) = (1/√2) v.

How do we avoid spurious solutions (those that are not components {v_h}_{h=1}^K)?

The optimization viewpoint of tensor eigen decomposition will help.

56 / 75

slide-134
SLIDE 134

Spurious Eigenvectors for Tensor Eigen Decomposition

T = Σ_{h∈[K]} λ_h v_h^{⊗3}.

Characterization of eigenvectors: T(I, v, v) = λv?

{v_h}_{h=1}^K are eigenvectors since T(I, v_h, v_h) = λ_h v_h.

Bad news: there can be other eigenvectors (unlike the matrix case).
◮ E.g., when {λ_h}_{h=1}^K ≡ 1, v = (v1 + v2)/√2 satisfies T(I, v, v) = (1/√2) v.

How do we avoid spurious solutions (those that are not components {v_h}_{h=1}^K)?

The optimization viewpoint of tensor eigen decomposition will help. All spurious eigenvectors are saddle points.

56 / 75

slide-135
SLIDE 135

Optimization Viewpoint of Matrix/Tensor Eigen Decomposition

57 / 75

slide-136
SLIDE 136

Optimization Viewpoint of Matrix/Tensor Eigen Decomposition

Optimization Problem

Matrix: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v^⊤v − 1).

Tensor: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − 1.5 λ(v^⊤v − 1).

58 / 75

slide-137
SLIDE 137

Optimization Viewpoint of Matrix/Tensor Eigen Decomposition

Optimization Problem

Matrix: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v^⊤v − 1).

Tensor: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − 1.5 λ(v^⊤v − 1).

Non-convex: stationary points = {global optima, local optima, saddle points}

58 / 75

slide-138
SLIDE 138

Optimization Viewpoint of Matrix/Tensor Eigen Decomposition

Optimization Problem

Matrix: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v^⊤v − 1).

Tensor: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − 1.5 λ(v^⊤v − 1).

Non-convex: stationary points = {global optima, local optima, saddle points}

Stationary points: first derivative ∇L(v, λ) = 0

Matrix: ∇L(v, λ) = 2(M(I, v) − λv) = 0. Eigenvectors are stationary points. The power method v ← M(I, v)/‖M(I, v)‖ is a version of gradient ascent.

Tensor: ∇L(v, λ) = 3(T(I, v, v) − λv) = 0. Eigenvectors are stationary points. The power method v ← T(I, v, v)/‖T(I, v, v)‖ is a version of gradient ascent.

58 / 75

slide-139
SLIDE 139

Optimization Viewpoint of Matrix/Tensor Eigen Decomposition

Optimization Problem

Matrix: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v^⊤v − 1).

Tensor: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − 1.5 λ(v^⊤v − 1).

Non-convex: stationary points = {global optima, local optima, saddle points}

Stationary points: first derivative ∇L(v, λ) = 0

Matrix: ∇L(v, λ) = 2(M(I, v) − λv) = 0. Eigenvectors are stationary points. The power method v ← M(I, v)/‖M(I, v)‖ is a version of gradient ascent.

Tensor: ∇L(v, λ) = 3(T(I, v, v) − λv) = 0. Eigenvectors are stationary points. The power method v ← T(I, v, v)/‖T(I, v, v)‖ is a version of gradient ascent.

Local optima: w^⊤ ∇²L(v, λ) w < 0 for all w ⊥ v, at a stationary point v

Matrix: v1 is the only local optimum; all other eigenvectors are saddle points.

Tensor: {v_h}_{h=1}^K are the only local optima; all spurious eigenvectors are saddle points.

58 / 75

slide-140
SLIDE 140

Question: What about performance under noise?

59 / 75

slide-141
SLIDE 141

Tensor Perturbation Analysis

T̂ = T + E, T = Σ_h λ_h v_h^{⊗3}, ‖E‖ := max_{x:‖x‖=1} |E(x, x, x)| ≤ ǫ.

60 / 75

slide-142
SLIDE 142

Tensor Perturbation Analysis

T̂ = T + E, T = Σ_h λ_h v_h^{⊗3}, ‖E‖ := max_{x:‖x‖=1} |E(x, x, x)| ≤ ǫ.

Theorem: Let T be the number of iterations. If T ≥ log K + log log(λ_max/ǫ) and ǫ < λ_min/K, then the output (v̂, λ̂) (after polynomially many restarts) satisfies
‖v̂ − v1‖ ≤ O(ǫ/λ1), |λ̂ − λ1| ≤ O(ǫ),
where v1 is such that λ1|c1| > λ2|c2| ≥ . . . , c_i := ⟨v_i, u_0⟩, and u_0 is the (successful) initializer.

60 / 75

slide-143
SLIDE 143

Tensor Perturbation Analysis

T̂ = T + E, T = Σ_h λ_h v_h^{⊗3}, ‖E‖ := max_{x:‖x‖=1} |E(x, x, x)| ≤ ǫ.

Theorem: Let T be the number of iterations. If T ≥ log K + log log(λ_max/ǫ) and ǫ < λ_min/K, then the output (v̂, λ̂) (after polynomially many restarts) satisfies
‖v̂ − v1‖ ≤ O(ǫ/λ1), |λ̂ − λ1| ≤ O(ǫ),
where v1 is such that λ1|c1| > λ2|c2| ≥ . . . , c_i := ⟨v_i, u_0⟩, and u_0 is the (successful) initializer.

Careful analysis of deflation avoids the buildup of errors. This implies polynomial sample complexity for learning.

60 / 75

slide-144
SLIDE 144

Other tensor decomposition techniques

61 / 75

slide-145
SLIDE 145

Orthogonal Tensor Decomposition

Simultaneous Power Method

(Wang & Lu, 2017)

◮ Simultaneous recovery of eigenvectors ◮ Initialization is not optimal

Orthogonalized Simultaneous Alternating Least Square

(Sharan & Valiant, 2017)

◮ Random initialization ◮ Proved convergence for symmetric tensor

Initialization

SVD based initialization (Anandkumar & Janzamin, 2014). State-of-the-art (trace based) initialization (Li & Huang, 2018).

62 / 75

slide-146
SLIDE 146

Outline

1

Introduction

2

Motivation: Challenges of MLE for Gaussian Mixtures

3

Introduction of Method of Moments and Tensor Notations

4

Topic Model for Single-topic Documents

5

Algorithms for Tensor Decompositions

6

Tensor Decomposition for Neural Network Compression

7

Conclusion

63 / 75

slide-147
SLIDE 147

Neural Network - Nonlinear Function Approximation

Image classification Speech recognition Text processing

Success of Deep Neural Networks

computation power growth enormous labeled data

64 / 75

slide-148
SLIDE 148

Neural Network - Nonlinear Function Approximation

Image classification Speech recognition Text processing

Success of Deep Neural Networks

computation power growth enormous labeled data

Expressive Power

linear composition vs. nonlinear composition; shallow network vs. deep structure

64 / 75

slide-149
SLIDE 149

Revolution of Depth

[Figure: ImageNet classification top-5 error (%): ILSVRC'10 (shallow) 28.2; ILSVRC'11 (shallow) 25.8; ILSVRC'12 AlexNet (8 layers) 16.4; ILSVRC'13 (8 layers) 11.7; ILSVRC'14 VGG (19 layers) 7.3; ILSVRC'14 GoogleNet (22 layers) 6.7; ILSVRC'15 ResNet (152 layers) 3.57]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

65 / 75

slide-150
SLIDE 150

Revolution of Depth

[Figure: AlexNet, 8 layers (ILSVRC 2012): 11x11 conv, 96, /4, pool/2; 5x5 conv, 256, pool/2; 3x3 conv, 384; 3x3 conv, 384; 3x3 conv, 256, pool/2; fc, 4096; fc, 4096; fc, 1000]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

65 / 75

slide-151
SLIDE 151

Revolution of Depth

[Figure: layer-by-layer comparison of AlexNet, 8 layers (ILSVRC 2012) with VGG, 19 layers (ILSVRC 2014) and GoogleNet, 22 layers (ILSVRC 2014)]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

65 / 75

slide-152
SLIDE 152

Revolution of Depth

[Figure: layer-by-layer comparison of AlexNet, 8 layers (ILSVRC 2012), VGG, 19 layers (ILSVRC 2014), and ResNet, 152 layers (ILSVRC 2015)]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

65 / 75

slide-153
SLIDE 153

Revolution of Depth

[Figure: PASCAL VOC 2007 object detection mAP (%): HOG, DPM (shallow) 34; AlexNet, RCNN (8 layers) 58; VGG, RCNN (16 layers) 66; ResNet, Faster RCNN* (101 layers) 86. *w/ other improvements & more data]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

Engines of visual recognition

65 / 75

slide-154
SLIDE 154

Challenges for Large Deep Neural Networks

Learning

Learning takes longer, might not converge, and is susceptible to vanishing/exploding gradients. One-time cost.

66 / 75

slide-155
SLIDE 155

Challenges for Large Deep Neural Networks

Learning

Learning takes longer, might not converge, and is susceptible to vanishing/exploding gradients. One-time cost.

Test

Requires a large amount of computation and memory storage.

◮ Ill-suited for smartphones or IoT devices.

Repeated cost.

66 / 75

slide-156
SLIDE 156

Challenges for Large Deep Neural Networks

Learning

Learning takes longer, might not converge, and is susceptible to vanishing/exploding gradients. One-time cost.

Test

Requires a large amount of computation and memory storage.

◮ Ill-suited for smartphones or IoT devices.

Repeated cost. How to compress the neural network without much performance loss?

66 / 75

slide-157
SLIDE 157

Common Types of Tensor Decompositions

m-order tensor T ∈ R^{I_0×I_1×···×I_{m−1}}

67 / 75

slide-158
SLIDE 158

Common Types of Tensor Decompositions

m-order tensor T ∈ R^{I_0×I_1×···×I_{m−1}}

CANDECOMP/PARAFAC (CP) Decomposition
Factorize a tensor into a sum of rank-1 tensors; a rank-1 tensor is the outer product of multiple vectors:
T_{i_0,···,i_{m−1}} = Σ_{r=0}^{R−1} M^{(0)}_{r,i_0} · · · M^{(m−1)}_{r,i_{m−1}}

67 / 75

slide-159
SLIDE 159

Common Types of Tensor Decompositions

m-order tensor T ∈ R^{I_0×I_1×···×I_{m−1}}

CANDECOMP/PARAFAC (CP) Decomposition
Factorize a tensor into a sum of rank-1 tensors; a rank-1 tensor is the outer product of multiple vectors:
T_{i_0,···,i_{m−1}} = Σ_{r=0}^{R−1} M^{(0)}_{r,i_0} · · · M^{(m−1)}_{r,i_{m−1}}

Tucker (TK) Decomposition
More general than CP decomposition; a multilinear operation on a core tensor C: C(M^{(0)}, . . . , M^{(m−1)})
T_{i_0,···,i_{m−1}} = Σ_{r_0=0}^{R_0−1} · · · Σ_{r_{m−1}=0}^{R_{m−1}−1} C_{r_0,...,r_{m−1}} M^{(0)}_{r_0,i_0} · · · M^{(m−1)}_{r_{m−1},i_{m−1}}

67 / 75

slide-160
SLIDE 160

Common Types of Tensor Decompositions

m-order tensor T ∈ R^{I_0×I_1×···×I_{m−1}}

CANDECOMP/PARAFAC (CP) Decomposition
Factorize a tensor into a sum of rank-1 tensors; a rank-1 tensor is the outer product of multiple vectors:
T_{i_0,···,i_{m−1}} = Σ_{r=0}^{R−1} M^{(0)}_{r,i_0} · · · M^{(m−1)}_{r,i_{m−1}}

Tucker (TK) Decomposition
More general than CP decomposition; a multilinear operation on a core tensor C: C(M^{(0)}, . . . , M^{(m−1)})
T_{i_0,···,i_{m−1}} = Σ_{r_0=0}^{R_0−1} · · · Σ_{r_{m−1}=0}^{R_{m−1}−1} C_{r_0,...,r_{m−1}} M^{(0)}_{r_0,i_0} · · · M^{(m−1)}_{r_{m−1},i_{m−1}}

Tensor-Train (TT) Decomposition
Factorize a tensor into a number of interconnected lower-order tensors:
T_{i_0,...,i_{m−1}} = Σ_{r_0=0}^{R_0−1} · · · Σ_{r_{m−2}=0}^{R_{m−2}−1} T^{(0)}_{i_0,r_0} T^{(1)}_{r_0,i_1,r_1} · · · T^{(m−1)}_{r_{m−2},i_{m−1}}

67 / 75
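The three factorizations differ only in how an entry of T is assembled from the factors. The numpy sketch below mirrors the index formulas above for a 3rd-order example; all shapes and ranks here are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
I0, I1, I2 = 4, 5, 6

# CP: T[i0,i1,i2] = sum_r M0[r,i0] * M1[r,i1] * M2[r,i2]
R = 3
M0, M1, M2 = (rng.standard_normal((R, n)) for n in (I0, I1, I2))
T_cp = np.einsum('ri,rj,rk->ijk', M0, M1, M2)

# Tucker: T[i0,i1,i2] = sum_{r0,r1,r2} C[r0,r1,r2] * A0[r0,i0] * A1[r1,i1] * A2[r2,i2]
R0, R1, R2 = 2, 3, 2
C = rng.standard_normal((R0, R1, R2))
A0, A1, A2 = (rng.standard_normal((r, n)) for r, n in ((R0, I0), (R1, I1), (R2, I2)))
T_tk = np.einsum('abc,ai,bj,ck->ijk', C, A0, A1, A2)

# Tensor-train: T[i0,i1,i2] = sum_{r0,r1} G0[i0,r0] * G1[r0,i1,r1] * G2[r1,i2]
G0 = rng.standard_normal((I0, R0))
G1 = rng.standard_normal((R0, I1, R1))
G2 = rng.standard_normal((R1, I2))
T_tt = np.einsum('ia,ajb,bk->ijk', G0, G1, G2)

print(T_cp.shape, T_tk.shape, T_tt.shape)   # each is (4, 5, 6)
```

Libraries such as TensorLy (listed on the software slide at the end) provide these decompositions directly; the snippet only illustrates the index patterns.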

slide-161
SLIDE 161

Compression of Convolutional Layer w/ Tensor Decompositions

Convolutional Kernel: tensor K ∈ R^{H×W×S×T}

68 / 75

slide-162
SLIDE 162

Compression of Convolutional Layer w/ Tensor Decompositions

Convolutional Kernel: tensor K ∈ R^{H×W×S×T}

Filter height/width H/W; No. of input/output channels S/T.

68 / 75

slide-163
SLIDE 163

Compression of Convolutional Layer w/ Tensor Decompositions

Convolutional Kernel: tensor K ∈ R^{H×W×S×T}

Filter height/width H/W; No. of input/output channels S/T.
Map an input tensor U ∈ R^{X×Y×S} to an output tensor V ∈ R^{X′×Y′×T}.

68 / 75

slide-164
SLIDE 164

Compression of Convolutional Layer w/ Tensor Decompositions

Convolutional Kernel: tensor K ∈ R^{H×W×S×T}

Filter height/width H/W; No. of input/output channels S/T.
Map an input tensor U ∈ R^{X×Y×S} to an output tensor V ∈ R^{X′×Y′×T}.

Kernel CP Decomposition

CP: Decompose kernel K into 3 factor tensors:
K_{i,j,s,t} = Σ_{r=0}^{R−1} K^{(0)}_{s,r} K^{(1)}_{i,j,r} K^{(2)}_{r,t}

No. of param.: HWST → (HW + S + T)R

[Figure: CP decomposition of the kernel, with factor dimensions H, W, R, S, T]

68 / 75
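As a sanity check on the parameter count above, the sketch below builds a CP-factored kernel and compares stored parameters against the dense H×W×S×T kernel; the sizes and rank are illustrative assumptions, not values from the slides' experiments.

```python
import numpy as np

H, W, S, T, R = 3, 3, 64, 128, 16          # illustrative kernel size and CP rank
rng = np.random.default_rng(0)
K0 = rng.standard_normal((S, R))            # input-channel factor  K^(0)_{s,r}
K1 = rng.standard_normal((H, W, R))         # spatial factor        K^(1)_{i,j,r}
K2 = rng.standard_normal((R, T))            # output-channel factor K^(2)_{r,t}

# Dense kernel recovered by contracting the factors over the shared rank index r.
K = np.einsum('sr,ijr,rt->ijst', K0, K1, K2)        # shape (H, W, S, T)

dense_params = H * W * S * T                 # HWST
cp_params = (H * W + S + T) * R              # (HW + S + T) R
print(K.shape, dense_params, cp_params)      # (3, 3, 64, 128) 73728 3216
```

In a network, the factored kernel can be applied without re-assembling K: a 1×1 convolution (S → R channels), a small H×W spatial convolution on the R channels, and another 1×1 convolution (R → T).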

slide-165
SLIDE 165

Compression of Convolutional Layer w/ Tensor Decompositions

Convolutional Kernel: tensor K ∈ R^{H×W×S×T}

Filter height/width H/W; No. of input/output channels S/T.
Map an input tensor U ∈ R^{X×Y×S} to an output tensor V ∈ R^{X′×Y′×T}.

Kernel TK Decomposition

TK: Decompose K into 1 core tensor and 2 factor tensors:
K_{i,j,s,t} = Σ_{r_s=0}^{R_s−1} Σ_{r_t=0}^{R_t−1} K^{(0)}_{s,r_s} K^{(1)}_{i,j,r_s,r_t} K^{(2)}_{r_t,t}

No. of param.: HWST → SR_s + HWR_sR_t + R_tT

[Figure: TK decomposition of the kernel, with factor dimensions H, W, R_s, R_t, S, T]

68 / 75

slide-166
SLIDE 166

Compression of Convolutional Layer w/ Tensor Decompositions

Convolutional Kernel: tensor K ∈ R^{H×W×S×T}

Filter height/width H/W; No. of input/output channels S/T.
Map an input tensor U ∈ R^{X×Y×S} to an output tensor V ∈ R^{X′×Y′×T}.

Kernel TT Decomposition

TT: Decompose K into 4 factor tensors:
K_{i,j,s,t} = Σ_{r_s=0}^{R_s−1} Σ_{r=0}^{R−1} Σ_{r_t=0}^{R_t−1} K^{(0)}_{s,r_s} K^{(1)}_{r_s,i,r} K^{(2)}_{r,j,r_t} K^{(3)}_{r_t,t}

No. of param.: HWST → SR_s + HR_sR + WR_tR + R_tT

[Figure: TT decomposition of the kernel, with factor dimensions H, W, R_s, R, R_t, S, T]

68 / 75
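Putting the three kernel factorizations side by side, a small helper (a sketch with illustrative ranks, assumed equal here for simplicity) evaluates the parameter-count formulas from the last three slides.

```python
def conv_kernel_params(H, W, S, T, R, Rs, Rt):
    """Parameter counts for a dense kernel and its CP / TK / TT factorizations."""
    return {
        "dense": H * W * S * T,
        "CP": (H * W + S + T) * R,
        "TK": S * Rs + H * W * Rs * Rt + Rt * T,
        "TT": S * Rs + H * Rs * R + W * Rt * R + Rt * T,
    }

print(conv_kernel_params(H=3, W=3, S=64, T=128, R=16, Rs=16, Rt=16))
# {'dense': 73728, 'CP': 3216, 'TK': 5376, 'TT': 4608}
```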

slide-167
SLIDE 167

Tensorized Spectrum Preserving Compression of Neural Networks

Convolutional Kernel: K ∈ R^{H×W×S×T} tensorized to K′ ∈ R^{H×W×S_0×···×S_{m−1}×T_0×···×T_{m−1}}

69 / 75

slide-168
SLIDE 168

Tensorized Spectrum Preserving Compression of Neural Networks

Convolutional Kernel: K ∈ R^{H×W×S×T} tensorized to K′ ∈ R^{H×W×S_0×···×S_{m−1}×T_0×···×T_{m−1}}

Tensorization: kernel reshaped to a higher-order tensor.

69 / 75

slide-169
SLIDE 169

Tensorized Spectrum Preserving Compression of Neural Networks

Convolutional Kernel: K ∈ R^{H×W×S×T} tensorized to K′ ∈ R^{H×W×S_0×···×S_{m−1}×T_0×···×T_{m−1}}

Tensorization: kernel reshaped to a higher-order tensor. S = ∏_{i=0}^{m−1} S_i and T = ∏_{i=0}^{m−1} T_i.

69 / 75

slide-170
SLIDE 170

Tensorized Spectrum Preserving Compression of Neural Networks

Convolutional Kernel: K ∈ R^{H×W×S×T} tensorized to K′ ∈ R^{H×W×S_0×···×S_{m−1}×T_0×···×T_{m−1}}

Tensorization: kernel reshaped to a higher-order tensor. S = ∏_{i=0}^{m−1} S_i and T = ∏_{i=0}^{m−1} T_i.

Input tensor U ∈ R^{X×Y×S} tensorized to U′ ∈ R^{X×Y×S_0×···×S_{m−1}}.

69 / 75

slide-171
SLIDE 171

Tensorized Spectrum Preserving Compression of Neural Networks

Convolutional Kernel: K ∈ R^{H×W×S×T} tensorized to K′ ∈ R^{H×W×S_0×···×S_{m−1}×T_0×···×T_{m−1}}

Tensorization: kernel reshaped to a higher-order tensor. S = ∏_{i=0}^{m−1} S_i and T = ∏_{i=0}^{m−1} T_i.

Input tensor U ∈ R^{X×Y×S} tensorized to U′ ∈ R^{X×Y×S_0×···×S_{m−1}}. Output V ∈ R^{X′×Y′×T} reshaped to V′ ∈ R^{X′×Y′×T_0×···×T_{m−1}}.

69 / 75
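The tensorization step itself is just a reshape. The sketch below (with assumed factorizations of S and T; all sizes are illustrative) shows the kernel and the input feature map being reshaped to the higher-order forms used on these slides.

```python
import numpy as np

H, W = 3, 3
S_dims, T_dims = (4, 4, 4), (4, 4, 8)       # assumed factorizations: S = 64, T = 128, m = 3
S, T = int(np.prod(S_dims)), int(np.prod(T_dims))

rng = np.random.default_rng(0)
K = rng.standard_normal((H, W, S, T))                       # original kernel
K_prime = K.reshape((H, W) + S_dims + T_dims)               # tensorized kernel K'

X, Y = 32, 32
U = rng.standard_normal((X, Y, S))                          # input feature map
U_prime = U.reshape((X, Y) + S_dims)                        # tensorized input U'

print(K_prime.shape, U_prime.shape)
# (3, 3, 4, 4, 4, 4, 4, 8) (32, 32, 4, 4, 4)
```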

slide-172
SLIDE 172

Tensorized Spectrum Preserving Compression of Neural Networks

Convolutional Kernel: K ∈ R^{H×W×S×T} tensorized to K′ ∈ R^{H×W×S_0×···×S_{m−1}×T_0×···×T_{m−1}}

Tensorization: kernel reshaped to a higher-order tensor. S = ∏_{i=0}^{m−1} S_i and T = ∏_{i=0}^{m−1} T_i.

Input tensor U ∈ R^{X×Y×S} tensorized to U′ ∈ R^{X×Y×S_0×···×S_{m−1}}. Output V ∈ R^{X′×Y′×T} reshaped to V′ ∈ R^{X′×Y′×T_0×···×T_{m−1}}.

Tensorized Kernel CP Decomposition

[Figure: CP decomposition of the kernel vs. CP decomposition of the tensorized kernel, with rank R and mode sizes H, W, S_0, …, S_{m−1}, T_0, …, T_{m−1}]

Param. No.: HWST → (HW + S + T)R → (m(ST)^{1/m} + HW)R

69 / 75

slide-173
SLIDE 173

Tensorized Spectrum Preserving Compression of Neural Networks

Convolutional Kernel: K ∈ R^{H×W×S×T} tensorized to K′ ∈ R^{H×W×S_0×···×S_{m−1}×T_0×···×T_{m−1}}

Tensorization: kernel reshaped to a higher-order tensor. S = ∏_{i=0}^{m−1} S_i and T = ∏_{i=0}^{m−1} T_i.

Input tensor U ∈ R^{X×Y×S} tensorized to U′ ∈ R^{X×Y×S_0×···×S_{m−1}}. Output V ∈ R^{X′×Y′×T} reshaped to V′ ∈ R^{X′×Y′×T_0×···×T_{m−1}}.

Tensorized Kernel TK Decomposition

[Figure: TK decomposition of the kernel vs. TK decomposition of the tensorized kernel, with ranks R_s, R_t, R and mode sizes H, W, S_0, …, S_{m−1}, T_0, …, T_{m−1}]

Param. No.: HWST → SR_s + HWR_sR_t + R_tT → m(S^{1/m} + T^{1/m})R + HWR^{2m}

69 / 75

slide-174
SLIDE 174

Tensorized Spectrum Preserving Compression of Neural Networks

Convolutional Kernel: K ∈ R^{H×W×S×T} tensorized to K′ ∈ R^{H×W×S_0×···×S_{m−1}×T_0×···×T_{m−1}}

Tensorization: kernel reshaped to a higher-order tensor. S = ∏_{i=0}^{m−1} S_i and T = ∏_{i=0}^{m−1} T_i.

Input tensor U ∈ R^{X×Y×S} tensorized to U′ ∈ R^{X×Y×S_0×···×S_{m−1}}. Output V ∈ R^{X′×Y′×T} reshaped to V′ ∈ R^{X′×Y′×T_0×···×T_{m−1}}.

Tensorized Kernel TT Decomposition

[Figure: TT decomposition of the kernel vs. TT decomposition of the tensorized kernel, with ranks R_s, R, R_t and mode sizes H, W, S_0, …, S_{m−1}, T_0, …, T_{m−1}]

Param. No.: HWST → SR_s + HR_sR + WR_tR + R_tT → (m(ST)^{1/m}R + HW)R

69 / 75
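The tensorized counts can be evaluated the same way as the non-tensorized ones. The helper below simply codes the formulas from the last three slides as written; the sizes and the single rank R are illustrative assumptions (the formulas implicitly treat the factors as roughly balanced, S_i ≈ S^{1/m} and T_i ≈ T^{1/m}), so the numbers are not directly comparable to any accuracy result.

```python
def tensorized_kernel_params(H, W, S, T, R, m):
    """Tensorized CP / TK / TT parameter counts, following the formulas on the slides."""
    st_root = (S * T) ** (1.0 / m)
    return {
        "t-CP": (m * st_root + H * W) * R,
        "t-TK": m * (S ** (1.0 / m) + T ** (1.0 / m)) * R + H * W * R ** (2 * m),
        "t-TT": (m * st_root * R + H * W) * R,
    }

print(tensorized_kernel_params(H=3, W=3, S=64, T=128, R=2, m=3))
```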

slide-175
SLIDE 175

Experiments - Compress CIFAR10 Resnet-34

Successful Compression of CIFAR10 Resnet-34 Network (Su, Li, Bhattacharjee & Huang, 2018)

Testing accuracies (%) of the tensor methods under different compression rates:

Method   SPC, E2E                        t-SPC, Seq.
         5%     10%    20%    40%        2%     5%     10%    20%
CP       84.02  86.93  88.75  88.75      85.7   89.86  91.28  -
TK       83.57  86.00  88.03  89.35      61.06  71.34  81.59  87.11
TT       77.44  82.92  84.13  86.64      78.95  84.26  87.89  -

The uncompressed network achieves 93.2% accuracy. CIFAR10 Resnet-34 has 4 × 10^5 parameters that have to be trained and retained during testing.

70 / 75

slide-176
SLIDE 176

Experiments - Compress ImageNet Resnet-50

Successful Compression of ImageNet Resnet-50 Network (Su, Li, Bhattacharjee & Huang, 2018)

# Epochs   Uncompressed   SPC-TT (E2E)   t-SPC-TT (Seq.)
0.2        4.22           0.66x          10.51x
0.3        6.23           0.64x          7.54x
0.5        9.01           0.83x          5.54x
1.0        17.3           0.74x          3.04x
2.0        30.8           0.59x          1.75x

Testing accuracy of the tensor methods compared to the uncompressed ImageNet Resnet-50. The accuracies of the tensor-method results (both non-tensorized and tensorized) are shown normalized to the uncompressed network's accuracy.

71 / 75

slide-177
SLIDE 177

Outline

1 Introduction
2 Motivation: Challenges of MLE for Gaussian Mixtures
3 Introduction of Method of Moments and Tensor Notations
4 Topic Model for Single-topic Documents
5 Algorithms for Tensor Decompositions
6 Tensor Decomposition for Neural Network Compression
7 Conclusion

72 / 75

slide-178
SLIDE 178

Conclusion

Method-of-moments can efficiently estimate parameters for many latent variable models.

◮ Exploit distributional properties, multi-view structure, and other structure to determine usable moment tensors.
◮ Some efficient algorithms for carrying out the tensor decomposition to obtain parameter estimates.

Tensor decomposition of neural network kernels/weights effectively compresses the network.

Many issues to resolve:

◮ Handle model misspecification, increase robustness.
◮ Learning deep neural network parameters using tensor decomposition?

73 / 75

slide-179
SLIDE 179

A Short List of Related Papers to Today’s Talk

“A Method of Moments for Mixture Models and Hidden Markov Models”, by Anima Anandkumar, Daniel Hsu and Sham Kakade. In Conference on Learning Theory, 2012.
“Tensor Decompositions for Learning Latent Variable Models”, by Anima Anandkumar, Rong Ge, Daniel Hsu, Sham Kakade and Matus Telgarsky. In Journal of Machine Learning Research, 2014.
“Escaping from Saddle Points: Online Stochastic Gradient for Tensor Decomposition”, by Rong Ge, Furong Huang, Chi Jin and Yang Yuan. In Conference on Learning Theory, 2015.
“Online Tensor Methods for Learning Latent Variable Models”, by Furong Huang, Niranjan U. N., Mohammad Umar Hakeem and Anima Anandkumar. In Journal of Machine Learning Research, 2016.
“Guaranteed Simultaneous Asymmetric Tensor Decomposition via Orthogonalized Alternating Least Squares”, by Jialin Li and Furong Huang, 2018.
“Tensorized Spectrum Preserving Compression for Neural Networks”, by Jiahao Su, Jingling Li, Bobby Bhattacharjee and Furong Huang, 2018.

74 / 75

slide-180
SLIDE 180

Tensor Software

Spark implementation of the method of moments to learn Latent Dirichlet Allocation, available at https://github.com/FurongHuang/spectrallda-tensorspark.
Tensorly: Simple and Fast Tensor Learning in Python, available at http://tensorly.org/stable/home.html.
A general library with higher-order tensor operations is coming soon.

75 / 75