SLIDE 1
Learning Overcomplete Latent Variable Models through Tensor Methods
Majid Janzamin (UC Irvine)
Joint work with Anima Anandkumar (UC Irvine) and Rong Ge (Microsoft Research)
SLIDE 2 Latent Variable Modeling
Goal: Discover hidden effects from observed measurements
Document modeling
Observed: words. Hidden: topics.
[Example documents: New York Times article excerpts (Hurricane Sandy's aftermath, Syrian refugees fleeing the war, the tax debate); the words are observed while the underlying topics are hidden.]
Social Network Modeling
Observed: social interactions. Hidden: communities, relationships.
Recommendation Systems
Observed: recommendations (e.g., reviews). Hidden: user and business attributes.
Applications in Speech, Vision, . . .
SLIDE 3
Latent Variable Modeling
Feature Learning
Learn good features/representations for classification tasks, e.g., image and speech recognition.
SLIDE 4 Latent Variable Modeling
Feature Learning
Learn good features/representations for classification tasks, e.g., image and speech recognition.
Sparse Coding, Dictionary Learning
Sparse representations, low dimensional hidden structures. A few dictionary elements make complicated shapes.
(Image from Sanjeev Arora’s slides.)
SLIDE 5
Learning Latent Variable Models
Goal: Discover hidden effects from observed measurements. Unsupervised learning: no labeled samples.
SLIDE 6
Learning Latent Variable Models
Goal: Discover hidden effects from observed measurements. Unsupervised learning: no labeled samples. Semi-supervised learning: few labeled samples.
SLIDE 7
Learning Latent Variable Models
Goal: Discover hidden effects from observed measurements. Unsupervised learning: no labeled samples. Semi-supervised learning: few labeled samples.
Challenge: Conditions for Identifiability
When can model be identified (given infinite computation and data)? Does identifiability also lead to tractable algorithms?
SLIDE 8
Learning Latent Variable Models
Goal: Discover hidden effects from observed measurements. Unsupervised learning: no labeled samples. Semi-supervised learning: few labeled samples.
Challenge: Conditions for Identifiability
When can model be identified (given infinite computation and data)? Does identifiability also lead to tractable algorithms?
Challenge: Efficient Learning of Latent Variable Models
Maximum likelihood is NP-hard in most cases. In practice: EM and Variational Bayes, but these have no consistency guarantees. Scalable guaranteed learning algorithms?
⋆ Low computational and statistical complexity
SLIDE 9
Learning Latent Variable Models
Goal: Discover hidden effects from observed measurements. Unsupervised learning: no labeled samples. Semi-supervised learning: few labeled samples.
Challenge: Conditions for Identifiability
When can model be identified (given infinite computation and data)? Does identifiability also lead to tractable algorithms?
Challenge: Efficient Learning of Latent Variable Models
Maximum likelihood is NP-hard in most cases. In practice: EM and Variational Bayes, but these have no consistency guarantees. Scalable guaranteed learning algorithms?
⋆ Low computational and statistical complexity
This talk: guaranteed and efficient learning through spectral methods.
SLIDE 10
LVMs as Probabilistic Models
Latent (hidden) variable h ∈ Rk, observed variable x ∈ Rd.
SLIDE 11
LVMs as Probabilistic Models
Latent (hidden) variable h ∈ Rk, observed variable x ∈ Rd. Multiview linear mixture models Categorical hidden variable h. Views: conditionally indep. given h. Linear model: E[x1|h] = ah, E[x2|h] = bh, E[x3|h] = ch.
h x1 x2 x3
· · ·
SLIDE 12
LVMs as Probabilistic Models
Latent (hidden) variable h ∈ Rk, observed variable x ∈ Rd. Multiview linear mixture models Categorical hidden variable h. Views: conditionally indep. given h. Linear model: E[x1|h] = ah, E[x2|h] = bh, E[x3|h] = ch.
h x1 x2 x3
· · · Gaussian Mixture Categorical hidden variable h. x|h ∼ N(µh, Σh).
SLIDE 13
LVMs as Probabilistic Models
Latent (hidden) variable h ∈ Rk, observed variable x ∈ Rd. Multiview linear mixture models Categorical hidden variable h. Views: conditionally indep. given h. Linear model: E[x1|h] = ah, E[x2|h] = bh, E[x3|h] = ch.
h x1 x2 x3
· · · Gaussian Mixture Categorical hidden variable h. x|h ∼ N(µh, Σh). ICA, Sparse Coding, HMM, Topic modeling, . . .
SLIDE 14
LVMs as Probabilistic Models
Latent (hidden) variable h ∈ Rk, observed variable x ∈ Rd. Multiview linear mixture models Categorical hidden variable h. Views: conditionally indep. given h. Linear model: E[x1|h] = ah, E[x2|h] = bh, E[x3|h] = ch.
h x1 x2 x3
· · · Gaussian Mixture Categorical hidden variable h. x|h ∼ N(µh, Σh). ICA, Sparse Coding, HMM, Topic modeling, . . . Efficient Learning of the parameters ah, µh, . . . ?
SLIDE 15
Method-of-Moments (Spectral methods)
Multi-variate observed moments
M1 := E[x], M2 := E[x ⊗ x], M3 := E[x ⊗ x ⊗ x].
SLIDE 16
Method-of-Moments (Spectral methods)
Multi-variate observed moments
M1 := E[x], M2 := E[x ⊗ x], M3 := E[x ⊗ x ⊗ x].
Matrix
E[x ⊗ x] ∈ Rd×d is a second order tensor. E[x ⊗ x]i1,i2 = E[xi1xi2]. For matrices: E[x ⊗ x] = E[xx⊤].
SLIDE 17
Method-of-Moments (Spectral methods)
Multi-variate observed moments
M1 := E[x], M2 := E[x ⊗ x], M3 := E[x ⊗ x ⊗ x].
Matrix
E[x ⊗ x] ∈ Rd×d is a second order tensor. E[x ⊗ x]i1,i2 = E[xi1xi2]. For matrices: E[x ⊗ x] = E[xx⊤].
Tensor
E[x ⊗ x ⊗ x] ∈ Rd×d×d is a third order tensor. E[x ⊗ x ⊗ x]i1,i2,i3 = E[xi1xi2xi3].
SLIDE 18
Method-of-Moments (Spectral methods)
Multi-variate observed moments
M1 := E[x], M2 := E[x ⊗ x], M3 := E[x ⊗ x ⊗ x].
Matrix
E[x ⊗ x] ∈ Rd×d is a second order tensor. E[x ⊗ x]i1,i2 = E[xi1xi2]. For matrices: E[x ⊗ x] = E[xx⊤].
Tensor
E[x ⊗ x ⊗ x] ∈ Rd×d×d is a third order tensor. E[x ⊗ x ⊗ x]i1,i2,i3 = E[xi1xi2xi3]. Information in moments for learning LVMs?
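As a concrete illustration (not part of the original slides), a minimal numpy sketch of estimating these moment tensors from samples; the function name and the random data are only illustrative.

```python
import numpy as np

def empirical_moments(X):
    """Estimate M1, M2, M3 from n i.i.d. samples stacked in X of shape (n, d)."""
    n = X.shape[0]
    M1 = X.mean(axis=0)                           # E[x], shape (d,)
    M2 = np.einsum('ni,nj->ij', X, X) / n         # E[x ⊗ x], shape (d, d)
    M3 = np.einsum('ni,nj,nl->ijl', X, X, X) / n  # E[x ⊗ x ⊗ x], shape (d, d, d)
    return M1, M2, M3

# Illustrative usage on random data.
X = np.random.randn(1000, 5)
M1, M2, M3 = empirical_moments(X)
print(M2.shape, M3.shape)  # (5, 5) (5, 5, 5)
```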
SLIDE 19
Multiview Mixture Model
[k] := {1, . . . , k}. Multiview linear mixture models Categorical hidden variable h ∈ [k]. wj := Pr[h = j] Views: conditionally indep. given h. Linear model: E[x1|h] = ah, E[x2|h] = bh, E[x3|h] = ch.
h x1 x2 x3
· · ·
SLIDE 20 Multiview Mixture Model
[k] := {1, . . . , k}. Multiview linear mixture models Categorical hidden variable h ∈ [k]. wj := Pr[h = j] Views: conditionally indep. given h. Linear model: E[x1|h] = ah, E[x2|h] = bh, E[x3|h] = ch.
h x1 x2 x3
· · ·
E[x1 ⊗ x2] = Eh[E[x1 ⊗ x2 | h]] = Eh[ah ⊗ bh] = ∑_{j∈[k]} wj · aj ⊗ bj.   (Here x1 ⊗ x2 = x1 x2⊤.)
SLIDE 21 Multiview Mixture Model
[k] := {1, . . . , k}. Multiview linear mixture models Categorical hidden variable h ∈ [k]. wj := Pr[h = j] Views: conditionally indep. given h. Linear model: E[x1|h] = ah, E[x2|h] = bh, E[x3|h] = ch.
h x1 x2 x3
· · ·
E[x1 ⊗ x2] = ∑_{j∈[k]} wj · aj ⊗ bj,
SLIDE 22 Multiview Mixture Model
[k] := {1, . . . , k}. Multiview linear mixture models Categorical hidden variable h ∈ [k]. wj := Pr[h = j] Views: conditionally indep. given h. Linear model: E[x1|h] = ah, E[x2|h] = bh, E[x3|h] = ch.
h x1 x2 x3
· · ·
E[x1 ⊗ x2] = ∑_{j∈[k]} wj · aj ⊗ bj,  E[x1 ⊗ x2 ⊗ x3] = ∑_{j∈[k]} wj · aj ⊗ bj ⊗ cj.
SLIDE 23 Multiview Mixture Model
[k] := {1, . . . , k}. Multiview linear mixture models Categorical hidden variable h ∈ [k]. wj := Pr[h = j] Views: conditionally indep. given h. Linear model: E[x1|h] = ah, E[x2|h] = bh, E[x3|h] = ch.
h x1 x2 x3
· · ·
E[x1 ⊗ x2] = ∑_{j∈[k]} wj · aj ⊗ bj,  E[x1 ⊗ x2 ⊗ x3] = ∑_{j∈[k]} wj · aj ⊗ bj ⊗ cj.
Tensor (matrix) factorization for learning LVMs.
SLIDE 24
Matrix vs. Tensor Decomposition
Uniqueness of decomposition.
Matrix Decomposition
Distinct weights. Orthogonal components, i.e., ⟨ai, aj⟩ = 0 for i ≠ j. Too limiting. Otherwise, only learning up to a subspace is possible.
SLIDE 25
Matrix vs. Tensor Decomposition
Uniqueness of decomposition.
Matrix Decomposition
Distinct weights. Orthogonal components, i.e., ⟨ai, aj⟩ = 0 for i ≠ j. Too limiting. Otherwise, only learning up to a subspace is possible.
Tensor Decomposition
Same weights. Non-Orthogonal components ⇒ Overcomplete models. More general models.
SLIDE 26
Matrix vs. Tensor Decomposition
Uniqueness of decomposition.
Matrix Decomposition
Distinct weights. Orthogonal components, i.e., ⟨ai, aj⟩ = 0 for i ≠ j. Too limiting. Otherwise, only learning up to a subspace is possible.
Tensor Decomposition
Same weights. Non-Orthogonal components ⇒ Overcomplete models. More general models. Focus on tensor decomposition for learning LVMs.
SLIDE 27
Overcomplete Latent Variable Models
Overcomplete Latent Representations
Latent dimensionality > observed dimensionality, i.e., k > d. Flexible modeling, robust to noise. Applicable in speech and image modeling. Large amount of unlabeled samples.
SLIDE 28
Overcomplete Latent Variable Models
Overcomplete Latent Representations
Latent dimensionality > observed dimensionality, i.e., k > d. Flexible modeling, robust to noise. Applicable in speech and image modeling. Large amount of unlabeled samples. Possible to learn when using higher (e.g., 3rd) order tensor moment.
SLIDE 29 Overcomplete Latent Variable Models
Overcomplete Latent Representations
Latent dimensionality > observed dimensionality, i.e., k > d. Flexible modeling, robust to noise. Applicable in speech and image modeling. Large amount of unlabeled samples. Possible to learn when using higher (e.g., 3rd) order tensor moment.
Example: a tensor T ∈ R2×2×2 with rank 3 (d = 2, k = 3), given by its two frontal slices T(:, :, 1) and T(:, :, 2) and written as a sum of three rank-1 terms with entries in {1, −1}; so the rank can exceed the dimension even in the smallest case.
SLIDE 30
Overcomplete Latent Variable Models
Overcomplete Latent Representations
Latent dimensionality > observed dimensionality, i.e., k > d. Flexible modeling, robust to noise. Applicable in speech and image modeling. Large amount of unlabeled samples. Possible to learn when using higher (e.g., 3rd) order tensor moment.
So far
Learning LVMs. Spectral methods (method-of-moments). Overcomplete LVMs. This work: theoretical guarantees for above.
SLIDE 31
Outline
1. Introduction
2. Summary of Results
3. Recap of Orthogonal Matrix and Tensor Decomposition
4. Overcomplete (Non-Orthogonal) Tensor Decomposition
5. Sample Complexity Analysis
6. Numerical Results
7. Conclusion
SLIDE 32
Spherical Gaussian Mixtures
Assumptions
k components, d: observed dimension. Component means ai incoherent: randomly drawn from the sphere. Spherical variance (σ²/d)·I (assumed known).
SLIDE 33
Spherical Gaussian Mixtures
Assumptions
k components, d: observed dimension. Component means ai incoherent: randomly drawn from the sphere. Spherical variance (σ²/d)·I (assumed known).
In this talk: special case
Noise norm σ² = 1: same as the signal. Uniform probability of components.
SLIDE 34 Spherical Gaussian Mixtures
Assumptions
k components, d: observed dimension. Component means ai incoherent: randomly drawn from the sphere. Spherical variance (σ²/d)·I (assumed known).
In this talk: special case
Noise norm σ² = 1: same as the signal. Uniform probability of components.
Tensor For Learning (Hsu, Kakade 2012)
M3 := E[x⊗3] − σ² ∑_{i∈[d]} (E[x] ⊗ ei ⊗ ei + · · · ) ⇒ M3 = ∑_{j∈[k]} wj · aj ⊗ aj ⊗ aj.
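A minimal numpy sketch of this adjusted moment, assuming samples are stacked in an (n, d) array and the per-coordinate noise variance is passed as sigma2; the function name is illustrative and this is not the authors' implementation.

```python
import numpy as np

def gaussian_mixture_M3(X, sigma2):
    """Adjusted third moment:
    M3 = E[x^{⊗3}] - sigma2 * sum_i (mean⊗e_i⊗e_i + e_i⊗mean⊗e_i + e_i⊗e_i⊗mean)."""
    n, d = X.shape
    mean = X.mean(axis=0)
    M3 = np.einsum('ni,nj,nl->ijl', X, X, X) / n
    eye = np.eye(d)
    M3 -= sigma2 * (np.einsum('i,jl->ijl', mean, eye)
                    + np.einsum('j,il->ijl', mean, eye)
                    + np.einsum('l,ij->ijl', mean, eye))
    return M3  # under the model, approximately sum_j w_j a_j ⊗ a_j ⊗ a_j
```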
SLIDE 35 Semi-supervised Learning of Gaussian Mixtures
n unlabeled samples, mj: labeled samples for component j.
- No. of mixture components: k = o(d^1.5)
- No. of labeled samples: mj = ˜Ω(1).
- No. of unlabeled samples: n = ˜Ω(k).
Our result: with n unlabeled samples, the achieved error max_j ‖âj − aj‖ is a term decaying in n plus an approximation error of ˜O(√k/d).
Can handle (polynomially) overcomplete mixtures. Extremely small number of labeled samples: polylog(d). Sample complexity is tight: ˜Ω(k) samples are needed! Approximation error: decaying in high dimensions.
SLIDE 36 Unsupervised Learning of Gaussian Mixtures
- No. of mixture components: k = C · d
- No. of unlabeled samples: n = ˜Ω(k · d). Computational complexity: ˜O(…).
Our result: with n unlabeled samples, the achieved error max_j ‖âj − aj‖ is the same as in the semi-supervised setting: a term decaying in n plus an approximation error of ˜O(√k/d).
Error: same as before, for the semi-supervised setting. Sample complexity: worse than semi-supervised, but better than previous works (no dependence on the condition number of A). Computational complexity: polynomial when k = Θ(d).
SLIDE 37
Multi-view Mixture Models
h x1 x2 x3
· · · A = [a1 a2 · · · ak] ∈ Rd×k, similarly B and C. Linear model: x1 = Ah + z1, x2 = Bh + z2, x3 = Ch + z3.
SLIDE 38
Multi-view Mixture Models
h x1 x2 x3
· · · A = [a1 a2 · · · ak] ∈ Rd×k, similarly B and C. Linear model: x1 = Ah + z1, x2 = Bh + z2, x3 = Ch + z3. Incoherence: Component means ai’s are incoherent (randomly drawn from unit sphere). Similarly bi’s and ci’s. The zero-mean noise zl’s satisfy RIP, e.g., Gaussian, Bernoulli. Same results as Gaussian mixtures.
SLIDE 39
Independent Component Analysis
x = Ah, independent sources, unknown mixing. Blind source separation of speech, image, video. h1 h2 hk x1 x2 xd A
SLIDE 40
Independent Component Analysis
x = Ah, independent sources, unknown mixing. Blind source separation of speech, image, video. Sources h are sub-Gaussian (but not Gaussian). Columns of A are incoherent. Form cumulant tensor M4 := E[x⊗4] − · · · n samples. k sources. d dimensions. h1 h2 hk x1 x2 xd A
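A minimal numpy sketch of one standard way to form a fourth-order cumulant tensor, assuming zero-mean observations; the exact correction terms are elided on the slide, so this follows the textbook formula for zero-mean cumulants and is not necessarily the construction used in the talk.

```python
import numpy as np

def fourth_cumulant(X):
    """Fourth-order cumulant tensor of zero-mean data X (shape (n, d)):
    M4[i,j,k,l] = E[x_i x_j x_k x_l] - E[x_i x_j]E[x_k x_l]
                  - E[x_i x_k]E[x_j x_l] - E[x_i x_l]E[x_j x_k]."""
    n = X.shape[0]
    m4 = np.einsum('ni,nj,nk,nl->ijkl', X, X, X, X) / n
    m2 = np.einsum('ni,nj->ij', X, X) / n
    return (m4
            - np.einsum('ij,kl->ijkl', m2, m2)
            - np.einsum('ik,jl->ijkl', m2, m2)
            - np.einsum('il,jk->ijkl', m2, m2))
```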
SLIDE 41 Independent Component Analysis
x = Ah, independent sources, unknown mixing. Blind source separation of speech, image, video. Sources h are sub-Gaussian (but not Gaussian). Columns of A are incoherent. Form cumulant tensor M4 := E[x⊗4] − · · · n samples. k sources. d dimensions. h1 h2 hk x1 x2 xd A
Learning Result
Semi-supervised: k = o(d²), n ≥ ˜Ω(max(k², k⁴/d³)). Unsupervised: k = O(d), n ≥ ˜Ω(k³).
max_j min_{f∈{−1,1}} ‖f·âj − aj‖ = ˜O(k²/√(d³·n)) + ˜O(√k/d^1.5)
SLIDE 42
Sparse Coding
Sparse representations, low dimensional hidden structures. A few dictionary elements make complicated shapes.
SLIDE 43
Sparse Coding
x = Ah, sparse coefficients, unknown dictionary. Image compression, feature learning, ...
SLIDE 44
Sparse Coding
x = Ah, sparse coefficients, unknown dictionary. Image compression, feature learning, ... Coefficients h are independent Bernoulli Gaussian: Sparse ICA. Columns of A are incoherent. Form cumulant tensor M4 := E[x⊗4] − · · · n samples. k dictionary elements. d dimensions. s avg. sparsity.
SLIDE 45 Sparse Coding
x = Ah, sparse coefficients, unknown dictionary. Image compression, feature learning, ... Coefficients h are independent Bernoulli Gaussian: Sparse ICA. Columns of A are incoherent. Form cumulant tensor M4 := E[x⊗4] − · · · n samples. k dictionary elements. d dimensions. s avg. sparsity.
Learning Result
Semi-supervised: k = o(d²), n ≥ ˜Ω(max(sk, s²k²/d³)). Unsupervised: k = O(d), n ≥ ˜Ω(sk²).
max_j min_{f∈{−1,1}} ‖f·âj − aj‖ = ˜O(sk/√(d³·n)) + ˜O(√k/d^1.5)
SLIDE 46
Outline
1. Introduction
2. Summary of Results
3. Recap of Orthogonal Matrix and Tensor Decomposition
4. Overcomplete (Non-Orthogonal) Tensor Decomposition
5. Sample Complexity Analysis
6. Numerical Results
7. Conclusion
SLIDE 47
Recap of Orthogonal Matrix Eigen Analysis
Symmetric M ∈ Rd×d. Eigenvectors are fixed points: Mv = λv. Eigen decomposition: M = ∑_i λi vi vi⊤. Orthogonal: ⟨vi, vj⟩ = 0, i ≠ j.
SLIDE 48 Recap of Orthogonal Matrix Eigen Analysis
Symmetric M ∈ Rd×d. Eigenvectors are fixed points: Mv = λv. Eigen decomposition: M = ∑_i λi vi vi⊤. Orthogonal: ⟨vi, vj⟩ = 0, i ≠ j.
Uniqueness (Identifiability):
SLIDE 49 Recap of Orthogonal Matrix Eigen Analysis
Symmetric M ∈ Rd×d. Eigenvectors are fixed points: Mv = λv. Eigen decomposition: M = ∑_i λi vi vi⊤. Orthogonal: ⟨vi, vj⟩ = 0, i ≠ j.
Uniqueness (Identifiability):
Algorithm: Power method: v → Mv/‖Mv‖.
SLIDE 50 Recap of Orthogonal Matrix Eigen Analysis
Symmetric M ∈ Rd×d. Eigenvectors are fixed points: Mv = λv. Eigen decomposition: M = ∑_i λi vi vi⊤. Orthogonal: ⟨vi, vj⟩ = 0, i ≠ j.
Uniqueness (Identifiability):
Algorithm: Power method: v → Mv/‖Mv‖.
Convergence properties
Let λ1 > λ2 > · · · > λd. Only vi’s are fixed points of power iteration. Mvi = λivi.
SLIDE 51 Recap of Orthogonal Matrix Eigen Analysis
Symmetric M ∈ Rd×d. Eigenvectors are fixed points: Mv = λv. Eigen decomposition: M = ∑_i λi vi vi⊤. Orthogonal: ⟨vi, vj⟩ = 0, i ≠ j.
Uniqueness (Identifiability):
Algorithm: Power method: v → Mv/‖Mv‖.
Convergence properties
Let λ1 > λ2 > · · · > λd. Only vi’s are fixed points of power iteration. Mvi = λivi.
- v1 is the only robust fixed point.
SLIDE 52 Recap of Orthogonal Matrix Eigen Analysis
Symmetric M ∈ Rd×d. Eigenvectors are fixed points: Mv = λv. Eigen decomposition: M = ∑_i λi vi vi⊤. Orthogonal: ⟨vi, vj⟩ = 0, i ≠ j.
Uniqueness (Identifiability):
Algorithm: Power method: v → Mv/‖Mv‖.
Convergence properties
Let λ1 > λ2 > · · · > λd. Only vi’s are fixed points of power iteration. Mvi = λivi.
- v1 is the only robust fixed point.
- All other vi’s are saddle points.
SLIDE 53 Recap of Orthogonal Matrix Eigen Analysis
Symmetric M ∈ Rd×d. Eigenvectors are fixed points: Mv = λv. Eigen decomposition: M = ∑_i λi vi vi⊤. Orthogonal: ⟨vi, vj⟩ = 0, i ≠ j.
Uniqueness (Identifiability):
Algorithm: Power method: v → Mv/‖Mv‖.
Convergence properties
Let λ1 > λ2 > · · · > λd. Only vi’s are fixed points of power iteration. Mvi = λivi.
- v1 is the only robust fixed point.
- All other vi’s are saddle points.
Power method recovers v1 when the initialization v satisfies ⟨v, v1⟩ ≠ 0.
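For reference, a minimal numpy sketch of the matrix power method described above; the function name and the random initialization are illustrative.

```python
import numpy as np

def matrix_power_method(M, num_iters=100, seed=0):
    """Power iteration v -> Mv / ||Mv||; converges to the top eigenvector
    when lambda_1 > lambda_2 and the start satisfies <v, v_1> != 0."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(M.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = M @ v
        v /= np.linalg.norm(v)
    return v
```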
SLIDE 54
Tensor Rank and Tensor Decomposition
Rank-1 tensor: T = w · a ⊗ b ⊗ c ⇔ T(i, j, l) = w · a(i) · b(j) · c(l).
SLIDE 55 Tensor Rank and Tensor Decomposition
Rank-1 tensor: T = w · a ⊗ b ⊗ c ⇔ T(i, j, l) = w · a(i) · b(j) · c(l).
CANDECOMP/PARAFAC (CP) Decomposition
T = ∑_{j∈[k]} wj · aj ⊗ bj ⊗ cj ∈ Rd×d×d, aj, bj, cj ∈ Sd−1.
Tensor T = w1 · a1 ⊗ b1 ⊗ c1 + w2 · a2 ⊗ b2 ⊗ c2 + · · ·
SLIDE 56 Tensor Rank and Tensor Decomposition
Rank-1 tensor: T = w · a ⊗ b ⊗ c ⇔ T(i, j, l) = w · a(i) · b(j) · c(l).
CANDECOMP/PARAFAC (CP) Decomposition
T = ∑_{j∈[k]} wj · aj ⊗ bj ⊗ cj ∈ Rd×d×d, aj, bj, cj ∈ Sd−1.
Tensor T = w1 · a1 ⊗ b1 ⊗ c1 + w2 · a2 ⊗ b2 ⊗ c2 + · · ·
k: tensor rank, d: ambient dimension. k ≤ d: undercomplete and k > d: overcomplete.
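A minimal numpy sketch, assuming random unit-norm components, of how such a CP tensor can be assembled; the names and dimensions are illustrative.

```python
import numpy as np

def cp_tensor(w, A, B, C):
    """T = sum_j w_j * a_j ⊗ b_j ⊗ c_j for the columns a_j, b_j, c_j of A, B, C."""
    return np.einsum('j,ij,lj,mj->ilm', w, A, B, C)

# Illustrative overcomplete example: rank k > dimension d.
d, k = 10, 15
A, B, C = (np.random.randn(d, k) for _ in range(3))
A, B, C = (M / np.linalg.norm(M, axis=0) for M in (A, B, C))  # unit-norm columns
T = cp_tensor(np.ones(k), A, B, C)  # shape (d, d, d)
```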
SLIDE 57 Tensor Rank and Tensor Decomposition
Rank-1 tensor: T = w · a ⊗ b ⊗ c ⇔ T(i, j, l) = w · a(i) · b(j) · c(l).
CANDECOMP/PARAFAC (CP) Decomposition
T = ∑_{j∈[k]} wj · aj ⊗ bj ⊗ cj ∈ Rd×d×d, aj, bj, cj ∈ Sd−1.
Tensor T = w1 · a1 ⊗ b1 ⊗ c1 + w2 · a2 ⊗ b2 ⊗ c2 + · · ·
k: tensor rank, d: ambient dimension. k ≤ d: undercomplete and k > d: overcomplete. This talk: guarantees for overcomplete tensor decomposition
SLIDE 58 Background on Tensor Decomposition
T = ∑_{i∈[k]} wi · ai ⊗ bi ⊗ ci, ai, bi, ci ∈ Sd−1.
Theoretical Guarantees
Tensor decompositions in psychometrics (Cattell ‘44). CP tensor decomposition (Harshman ‘70, Carroll & Chang ‘70). Identifiability of CP tensor decomposition (Kruskal ‘76). Orthogonal decomposition: (Zhang & Golub ‘01, Kolda ‘01, Anandkumar et al. ‘12). Tensor decomposition through (lifted) linear equations (De Lathauwer ‘07): works for overcomplete tensors. Tensor decomposition through simultaneous diagonalization: perturbation analysis (Goyal et al. ‘13, Bhaskara ‘13).
SLIDE 59 Background on Tensor Decompositions (contd.)
T = ∑_{i∈[k]} wi · ai ⊗ bi ⊗ ci, ai, bi, ci ∈ Sd−1.
Practice: Alternating least squares (ALS)
Let A = [a1|a2 · · · ak] and similarly B, C. Fix estimates of two of the modes (say for A and B) and re-estimate the third. Iterative updates, low computational complexity. No theoretical guarantees. In this talk: analysis of alternating minimization
SLIDE 60
Tensors as Multilinear Transformations
Tensor T ∈ Rd×d×d. Vectors v, w ∈ Rd.
SLIDE 61 Tensors as Multilinear Transformations
Tensor T ∈ Rd×d×d. Vectors v, w ∈ Rd. T(I, v, w) := ∑_{j,l} vj wl T(:, j, l) ∈ Rd.
SLIDE 62 Tensors as Multilinear Transformations
Tensor T ∈ Rd×d×d. Vectors v, w ∈ Rd. T(I, v, w) := ∑_{j,l} vj wl T(:, j, l) ∈ Rd. For a matrix M ∈ Rd×d: M(I, w) = Mw = ∑_l wl M(:, l) ∈ Rd.
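A minimal numpy sketch of this multilinear contraction; the function name is illustrative.

```python
import numpy as np

def multilinear(T, v, w):
    """T(I, v, w) = sum_{j,l} v_j w_l T(:, j, l), a vector in R^d."""
    return np.einsum('ijl,j,l->i', T, v, w)

d = 4
T = np.random.randn(d, d, d)
v, w = np.random.randn(d), np.random.randn(d)
print(multilinear(T, v, w).shape)  # (4,)
```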
SLIDE 63 Challenges in Tensor Decomposition
Symmetric tensor T ∈ Rd×d×d: T = ∑_i λi vi ⊗ vi ⊗ vi.
Challenges in tensors
Decomposition may not always exist for general tensors. Finding the decomposition is NP-hard in general.
SLIDE 64 Challenges in Tensor Decomposition
Symmetric tensor T ∈ Rd×d×d: T = ∑_i λi vi ⊗ vi ⊗ vi.
Challenges in tensors
Decomposition may not always exist for general tensors. Finding the decomposition is NP-hard in general.
Tractable case: orthogonal tensor decomposition (⟨vi, vj⟩ = 0, i ≠ j)
{vi} are eigenvectors: T(I, vi, vi) = λivi.
SLIDE 65 Challenges in Tensor Decomposition
Symmetric tensor T ∈ Rd×d×d: T = ∑_i λi vi ⊗ vi ⊗ vi.
Challenges in tensors
Decomposition may not always exist for general tensors. Finding the decomposition is NP-hard in general.
Tractable case: orthogonal tensor decomposition (⟨vi, vj⟩ = 0, i ≠ j)
{vi} are eigenvectors: T(I, vi, vi) = λi vi. Bad news: there can be other eigenvectors (unlike the matrix case). With λi ≡ 1, v = (v1 + v2)/√2 satisfies T(I, v, v) = (1/√2)·v.
SLIDE 66 Challenges in Tensor Decomposition
Symmetric tensor T ∈ Rd×d×d: T = ∑_i λi vi ⊗ vi ⊗ vi.
Challenges in tensors
Decomposition may not always exist for general tensors. Finding the decomposition is NP-hard in general.
Tractable case: orthogonal tensor decomposition (⟨vi, vj⟩ = 0, i ≠ j)
{vi} are eigenvectors: T(I, vi, vi) = λi vi. Bad news: there can be other eigenvectors (unlike the matrix case). With λi ≡ 1, v = (v1 + v2)/√2 satisfies T(I, v, v) = (1/√2)·v. How do we avoid spurious solutions (not part of the decomposition)?
SLIDE 67 Orthogonal Tensor Power Method
Symmetric orthogonal tensor T ∈ Rd×d×d: T = ∑_i λi vi ⊗ vi ⊗ vi.
SLIDE 68 Orthogonal Tensor Power Method
Symmetric orthogonal tensor T ∈ Rd×d×d: T = ∑_i λi vi ⊗ vi ⊗ vi. Recall matrix power method: v → M(I, v)/‖M(I, v)‖.
SLIDE 69 Orthogonal Tensor Power Method
Symmetric orthogonal tensor T ∈ Rd×d×d: T = ∑_i λi vi ⊗ vi ⊗ vi. Recall matrix power method: v → M(I, v)/‖M(I, v)‖. Algorithm: tensor power method: v → T(I, v, v)/‖T(I, v, v)‖.
SLIDE 70 Orthogonal Tensor Power Method
Symmetric orthogonal tensor T ∈ Rd×d×d: T = ∑_i λi vi ⊗ vi ⊗ vi. Recall matrix power method: v → M(I, v)/‖M(I, v)‖. Algorithm: tensor power method: v → T(I, v, v)/‖T(I, v, v)‖.
- {vi}’s are the only robust fixed points.
SLIDE 71 Orthogonal Tensor Power Method
Symmetric orthogonal tensor T ∈ Rd×d×d: T = ∑_i λi vi ⊗ vi ⊗ vi. Recall matrix power method: v → M(I, v)/‖M(I, v)‖. Algorithm: tensor power method: v → T(I, v, v)/‖T(I, v, v)‖.
- {vi}’s are the only robust fixed points.
- All other eigenvectors are saddle points.
SLIDE 72 Orthogonal Tensor Power Method
Symmetric orthogonal tensor T ∈ Rd×d×d: T = ∑_i λi vi ⊗ vi ⊗ vi. Recall matrix power method: v → M(I, v)/‖M(I, v)‖. Algorithm: tensor power method: v → T(I, v, v)/‖T(I, v, v)‖.
- {vi}’s are the only robust fixed points.
- All other eigenvectors are saddle points.
For an orthogonal tensor, no spurious local optima!
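A minimal numpy sketch of a single run of the tensor power iteration above; in practice one restarts from several random initializations and deflates recovered components, which this sketch omits. The function name is illustrative.

```python
import numpy as np

def tensor_power_iteration(T, num_iters=30, seed=0):
    """One run of v -> T(I, v, v) / ||T(I, v, v)|| from a random start."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = np.einsum('ijl,j,l->i', T, v, v)    # T(I, v, v)
        v /= np.linalg.norm(v)
    lam = np.einsum('ijl,i,j,l->', T, v, v, v)  # eigenvalue estimate T(v, v, v)
    return lam, v
```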
SLIDE 73
Matrix vs. tensor power iteration
Matrix power iteration: Tensor power iteration:
SLIDE 74 Matrix vs. tensor power iteration
Matrix power iteration:
1. Requires gap between largest and second-largest eigenvalue. Property of the matrix only.
Tensor power iteration:
1. Requires gap between largest and second-largest λi|ci|, where the initialization vector is v = ∑_i ci vi. Property of the tensor and the initialization v.
SLIDE 75 Matrix vs. tensor power iteration
Matrix power iteration:
1. Requires gap between largest and second-largest eigenvalue. Property of the matrix only.
2. Converges to the top eigenvector.
Tensor power iteration:
1. Requires gap between largest and second-largest λi|ci|, where the initialization vector is v = ∑_i ci vi. Property of the tensor and the initialization v.
2. Converges to the vi for which λi|ci| is largest; could be any of them.
SLIDE 76 Matrix vs. tensor power iteration
Matrix power iteration:
1. Requires gap between largest and second-largest eigenvalue. Property of the matrix only.
2. Converges to the top eigenvector.
3. Linear convergence. Need O(log(1/ǫ)) iterations.
Tensor power iteration:
1. Requires gap between largest and second-largest λi|ci|, where the initialization vector is v = ∑_i ci vi. Property of the tensor and the initialization v.
2. Converges to the vi for which λi|ci| is largest; could be any of them.
3. Quadratic convergence. Need O(log log(1/ǫ)) iterations.
SLIDE 77
Beyond Orthogonal Tensor Decomposition
Limitations
Not ALL tensors have orthogonal decomposition (unlike matrices). Orthogonal forms: cannot handle overcomplete tensors (k > d). Overcomplete representations: redundancy leads to flexible modeling, noise resistant, no domain knowledge.
SLIDE 78
Beyond Orthogonal Tensor Decomposition
Limitations
Not ALL tensors have orthogonal decomposition (unlike matrices). Orthogonal forms: cannot handle overcomplete tensors (k > d). Overcomplete representations: redundancy leads to flexible modeling, noise resistant, no domain knowledge.
Undercomplete tensors (k ≤ d) with full rank components
Non-orthogonal decomposition T1 = ∑_i wi · ai ⊗ ai ⊗ ai.
Whitening matrix W; multilinear transform: T2 = T1(W, W, W). Limitations: depends on the condition number, sensitive to noise.
[Figure: W maps the components a1, a2, a3 of tensor T1 to orthogonal components v1, v2, v3 of tensor T2.]
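A minimal numpy sketch of the whitening step, assuming the second moment M2 = ∑_i wi ai ai⊤ has rank k ≤ d; the function name is illustrative.

```python
import numpy as np

def whiten(T1, M2, k):
    """With M2 = sum_i w_i a_i a_i^T of rank k, take W = U_k diag(s_k)^{-1/2},
    so W^T M2 W = I_k, and return the orthogonalized tensor T2 = T1(W, W, W)."""
    U, s, _ = np.linalg.svd(M2)
    W = U[:, :k] / np.sqrt(s[:k])                     # d x k whitening matrix
    T2 = np.einsum('abc,ai,bj,cl->ijl', T1, W, W, W)  # k x k x k tensor
    return T2, W
```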
SLIDE 79
Beyond Orthogonal Tensor Decomposition
Limitations
Not ALL tensors have orthogonal decomposition (unlike matrices). Orthogonal forms: cannot handle overcomplete tensors (k > d). Overcomplete representations: redundancy leads to flexible modeling, noise resistant, no domain knowledge.
Undercomplete tensors (k ≤ d) with full rank components
Non-orthogonal decomposition T1 = ∑_i wi · ai ⊗ ai ⊗ ai.
Whitening matrix W; multilinear transform: T2 = T1(W, W, W). Limitations: depends on the condition number, sensitive to noise.
[Figure: W maps the components a1, a2, a3 of tensor T1 to orthogonal components v1, v2, v3 of tensor T2.]
This talk: guarantees for overcomplete tensor decomposition
SLIDE 80
Outline
1. Introduction
2. Summary of Results
3. Recap of Orthogonal Matrix and Tensor Decomposition
4. Overcomplete (Non-Orthogonal) Tensor Decomposition
5. Sample Complexity Analysis
6. Numerical Results
7. Conclusion
SLIDE 81
Non-orthogonal Tensor Decomposition
Multiview linear mixture model. Linear model: E[x1|h] = ah, E[x2|h] = bh, E[x3|h] = ch. E[x1 ⊗ x2 ⊗ x3] = ∑_{i∈[k]} wi · ai ⊗ bi ⊗ ci.
h x1 x2 x3
· · ·
SLIDE 82 Non-orthogonal Tensor Decomposition
T = ∑_{i∈[k]} wi · ai ⊗ bi ⊗ ci, ai, bi, ci ∈ Sd−1.
Practice: Alternating least squares (ALS)
Many spurious local optima. No theoretical guarantee.
SLIDE 83 Non-orthogonal Tensor Decomposition
T = ∑_{i∈[k]} wi · ai ⊗ bi ⊗ ci, ai, bi, ci ∈ Sd−1.
Practice: Alternating least squares (ALS)
Many spurious local optima. No theoretical guarantee.
Rank-1 ALS (Best Rank-1 Approximation)
min_{a,b,c∈Sd−1, w∈R} ‖T − w · a ⊗ b ⊗ c‖_F.
SLIDE 84 Non-orthogonal Tensor Decomposition
T = ∑_{i∈[k]} wi · ai ⊗ bi ⊗ ci, ai, bi, ci ∈ Sd−1.
Practice: Alternating least squares (ALS)
Many spurious local optima. No theoretical guarantee.
Rank-1 ALS (Best Rank-1 Approximation)
min_{a,b,c∈Sd−1, w∈R} ‖T − w · a ⊗ b ⊗ c‖_F.
Fix a(t), b(t) and update c(t+1) ⇒ c(t+1) ∝ T(a(t), b(t), I). Rank-1 ALS iteration ≡ asymmetric power iteration.
SLIDE 85
Alternating minimization
Rank-1 ALS iteration (power iteration)
Initialization: a(0), b(0), c(0). Update in tth step: fix a(t), b(t) and c(t+1) ∝ T(a(t), b(t), I). After (approx.) convergence, restart.
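A minimal numpy sketch of one run of these rank-1 alternating updates; cycling through all three modes per iteration is one common choice, and the names are illustrative rather than the authors' implementation.

```python
import numpy as np

def rank1_als(T, num_iters=50, seed=0):
    """One run of rank-1 alternating updates (asymmetric power iteration):
    each mode is re-estimated from T contracted against the other two."""
    rng = np.random.default_rng(seed)
    a, b, c = (rng.standard_normal(T.shape[i]) for i in range(3))
    a, b, c = (v / np.linalg.norm(v) for v in (a, b, c))
    for _ in range(num_iters):
        c = np.einsum('ijl,i,j->l', T, a, b); c /= np.linalg.norm(c)  # c ∝ T(a, b, I)
        b = np.einsum('ijl,i,l->j', T, a, c); b /= np.linalg.norm(b)  # b ∝ T(a, I, c)
        a = np.einsum('ijl,j,l->i', T, b, c); a /= np.linalg.norm(a)  # a ∝ T(I, b, c)
    w = np.einsum('ijl,i,j,l->', T, a, b, c)                          # recovered weight
    return w, a, b, c
```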
SLIDE 86
Alternating minimization
Rank-1 ALS iteration (power iteration)
Initialization: a(0), b(0), c(0). Update in tth step: fix a(t), b(t) and c(t+1) ∝ T(a(t), b(t), I). After (approx.) convergence, restart. Simple update: trivially parallelizable and hence scalable. Linear computation in dimension, rank, number of different runs.
SLIDE 87
Alternating minimization
Rank-1 ALS iteration (power iteration)
Initialization: a(0), b(0), c(0). Update in tth step: fix a(t), b(t) and c(t+1) ∝ T(a(t), b(t), I). After (approx.) convergence, restart. Simple update: trivially parallelizable and hence scalable. Linear computation in dimension, rank, number of different runs.
Challenges
Optimization problem: non-convex, multiple local optima. Alternating minimization: improves the objective in each step? Recovery of ai, bi, ci’s? Not true in general. Noisy tensor decomposition.
SLIDE 88
Alternating minimization
Rank-1 ALS iteration (power iteration)
Initialization: a(0), b(0), c(0). Update in tth step: fix a(t), b(t) and c(t+1) ∝ T(a(t), b(t), I). After (approx.) convergence, restart. Simple update: trivially parallelizable and hence scalable. Linear computation in dimension, rank, number of different runs.
Challenges
Optimization problem: non-convex, multiple local optima. Alternating minimization: improves the objective in each step? Recovery of ai, bi, ci’s? Not true in general. Noisy tensor decomposition. Natural conditions under which Alt-Min has guarantees?
SLIDE 89 Special case: Orthogonal Setting
T = ∑_{i∈[k]} wi · ai ⊗ bi ⊗ ci, ai, bi, ci ∈ Sd−1. ⟨ai, aj⟩ = 0 for i ≠ j. Similarly for b, c. Alternating updates: c(t+1) ∝ T(a(t), b(t), I) = ∑_i wi ⟨ai, a(t)⟩⟨bi, b(t)⟩ ci. ai, bi, ci are stationary points.
SLIDE 90 Special case: Orthogonal Setting
T = ∑_{i∈[k]} wi · ai ⊗ bi ⊗ ci, ai, bi, ci ∈ Sd−1. ⟨ai, aj⟩ = 0 for i ≠ j. Similarly for b, c. Alternating updates: c(t+1) ∝ T(a(t), b(t), I) = ∑_i wi ⟨ai, a(t)⟩⟨bi, b(t)⟩ ci. ai, bi, ci are stationary points. ONLY local optima for the best rank-1 approximation problem. Guaranteed recovery through alternating minimization.
SLIDE 91 Special case: Orthogonal Setting
T = ∑_{i∈[k]} wi · ai ⊗ bi ⊗ ci, ai, bi, ci ∈ Sd−1. ⟨ai, aj⟩ = 0 for i ≠ j. Similarly for b, c. Alternating updates: c(t+1) ∝ T(a(t), b(t), I) = ∑_i wi ⟨ai, a(t)⟩⟨bi, b(t)⟩ ci. ai, bi, ci are stationary points. ONLY local optima for the best rank-1 approximation problem. Guaranteed recovery through alternating minimization. Perturbation Analysis [AGH+2012]: under poly(d) random initializations and bounded noise conditions.
SLIDE 92
Our Setup
So far
General tensor decomposition: NP-hard. Orthogonal tensors: too limiting. Tractable cases? Covers overcomplete tensors?
“Guaranteed Non-Orthogonal Tensor Decomposition via Alternating Rank-1 Updates” by A. Anandkumar, R. Ge and M. Janzamin, Feb. 2014.
SLIDE 93 Our Setup
So far
General tensor decomposition: NP-hard. Orthogonal tensors: too limiting. Tractable cases? Covers overcomplete tensors?
Our framework: Incoherent Components
|⟨ai, aj⟩| = O(1/√d) for i ≠ j. Similarly for b, c.
Can handle overcomplete tensors. Satisfied by random (generic) vectors. Guaranteed recovery for alternating minimization?
“Guaranteed Non-Orthogonal Tensor Decomposition via Alternating Rank-1 Updates” by A. Anandkumar, R. Ge and M. Janzamin, Feb. 2014.
SLIDE 94 Analysis of One Step Update
T = ∑_{i∈[k]} wi · ai ⊗ bi ⊗ ci, ai, bi, ci ∈ Sd−1.
Basic Intuition
Let â, b̂ be “close to” a1, b1. Alternating update: ĉ ∝ T(â, b̂, I) = ∑_i wi ⟨ai, â⟩⟨bi, b̂⟩ ci = w1 ⟨a1, â⟩⟨b1, b̂⟩ c1 + T−1(â, b̂, I).
SLIDE 95 Analysis of One Step Update
T = ∑_{i∈[k]} wi · ai ⊗ bi ⊗ ci, ai, bi, ci ∈ Sd−1.
Basic Intuition
Let â, b̂ be “close to” a1, b1. Alternating update: ĉ ∝ T(â, b̂, I) = ∑_i wi ⟨ai, â⟩⟨bi, b̂⟩ ci = w1 ⟨a1, â⟩⟨b1, b̂⟩ c1 + T−1(â, b̂, I). T−1(â, b̂, I) = 0 in the orthogonal case when â = a1, b̂ = b1.
SLIDE 96 Analysis of One Step Update
T = ∑_{i∈[k]} wi · ai ⊗ bi ⊗ ci, ai, bi, ci ∈ Sd−1.
Basic Intuition
Let â, b̂ be “close to” a1, b1. Alternating update: ĉ ∝ T(â, b̂, I) = ∑_i wi ⟨ai, â⟩⟨bi, b̂⟩ ci = w1 ⟨a1, â⟩⟨b1, b̂⟩ c1 + T−1(â, b̂, I). T−1(â, b̂, I) = 0 in the orthogonal case when â = a1, b̂ = b1. Can it be controlled for incoherent (random) vectors?
SLIDE 97 Results for one step update
Incoherence: |⟨ai, aj⟩| = O(1/√d) for i ≠ j. Similarly for b, c.
Spectral norm: ‖A‖, ‖B‖, ‖C‖ ≤ 1 + O(√(k/d)).
Tensor rank: k = o(d^1.5). Weights: for simplicity, wi ≡ 1.
SLIDE 98 Results for one step update
Incoherence: |⟨ai, aj⟩| = O(1/√d) for i ≠ j. Similarly for b, c.
Spectral norm: ‖A‖, ‖B‖, ‖C‖ ≤ 1 + O(√(k/d)).
Tensor rank: k = o(d^1.5). Weights: for simplicity, wi ≡ 1.
Lemma [AGJ2014]
For small enough ǫ such that max{‖a1 − â‖, ‖b1 − b̂‖} ≤ ǫ, after one step
‖c1 − ĉ‖ ≤ O(√k/d) + max{1/√d, k/d^1.5} · ǫ.
√k/d: approximation error; the rest: error contraction.
SLIDE 99
Main Result: Local Convergence
Initialization: max{‖a1 − â(0)‖, ‖b1 − b̂(0)‖} ≤ ǫ0, and ǫ0 < constant. Noise: T̂ := T + E, and ‖E‖ ≤ 1/polylog(d). Rank: k = o(d^1.5). Recovery error: ǫR := ‖E‖ + ˜O(√k/d).
SLIDE 100 Main Result: Local Convergence
Initialization: max{‖a1 − â(0)‖, ‖b1 − b̂(0)‖} ≤ ǫ0, and ǫ0 < constant. Noise: T̂ := T + E, and ‖E‖ ≤ 1/polylog(d). Rank: k = o(d^1.5). Recovery error: ǫR := ‖E‖ + ˜O(√k/d).
- Theorem (Local Convergence) [AGJ2014]
After N = O(log(1/ǫR)) steps of alternating rank-1 updates, ‖a1 − â(N)‖ = O(ǫR).
SLIDE 101 Main Result: Local Convergence
Initialization: max{‖a1 − â(0)‖, ‖b1 − b̂(0)‖} ≤ ǫ0, and ǫ0 < constant. Noise: T̂ := T + E, and ‖E‖ ≤ 1/polylog(d). Rank: k = o(d^1.5). Recovery error: ǫR := ‖E‖ + ˜O(√k/d).
- Theorem (Local Convergence) [AGJ2014]
After N = O(log(1/ǫR)) steps of alternating rank-1 updates, ‖a1 − â(N)‖ = O(ǫR). Linear convergence: up to the approximation error. Guarantees for overcomplete tensors: k = o(d^1.5), and k = o(d^{p/2}) for pth-order tensors. Requires good initialization. What about global convergence?
SLIDE 102
Global Convergence k = O(d)
SVD Initialization
Find the top singular vectors of T(I, I, θ) for θ ∼ N(0, I). Use them for initialization. L trials.
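A minimal numpy sketch of this SVD initialization; each trial returns one candidate pair of initialization vectors, and the names are illustrative.

```python
import numpy as np

def svd_init(T, num_trials, seed=0):
    """For each trial draw theta ~ N(0, I), form the matrix T(I, I, theta),
    and return its top left/right singular vectors as a candidate (a0, b0)."""
    rng = np.random.default_rng(seed)
    inits = []
    for _ in range(num_trials):
        theta = rng.standard_normal(T.shape[2])
        M = np.einsum('ijl,l->ij', T, theta)  # T(I, I, theta)
        U, _, Vt = np.linalg.svd(M)
        inits.append((U[:, 0], Vt[0]))
    return inits
```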
SLIDE 103 Global Convergence k = O(d)
SVD Initialization
Find the top singular vectors of T(I, I, θ) for θ ∼ N(0, I). Use them for initialization. L trials.
Assumptions
Number of initializations: L ≥ k^{Ω((k/d)²)}. Tensor rank: k = O(d).
- No. of iterations: N = Θ(log(1/ǫR)). Recall ǫR: recovery error.
SLIDE 104 Global Convergence k = O(d)
SVD Initialization
Find the top singular vectors of T(I, I, θ) for θ ∼ N(0, I). Use them for initialization. L trials.
Assumptions
Number of initializations: L ≥ k^{Ω((k/d)²)}. Tensor rank: k = O(d).
- No. of iterations: N = Θ(log(1/ǫR)). Recall ǫR: recovery error.
Theorem (Global Convergence) [AGJ2014]: ‖a1 − â(N)‖ ≤ O(ǫR).
SLIDE 105 Global Convergence k = O(d)
SVD Initialization
Find the top singular vectors of T(I, I, θ) for θ ∼ N(0, I). Use them for initialization. L trials.
Assumptions
Number of initializations: L ≥ k^{Ω((k/d)²)}. Tensor rank: k = O(d).
- No. of iterations: N = Θ(log(1/ǫR)). Recall ǫR: recovery error.
Theorem (Global Convergence) [AGJ2014]: ‖a1 − â(N)‖ ≤ O(ǫR). Corollary: Differing Dimensions
If ai, bi ∈ Rdu and ci ∈ Rdo with du ≥ k ≥ do: k = O(√(du·do)) for incoherent vectors; k = O(du) if A, B are orthogonal. Same guarantees. Can handle one overcomplete mode.
SLIDE 106 Latest Result: Global Convergence
Assume Gaussian means ai. Improved initialization requirement for convergence of third-order tensor power iteration: |⟨a1, â(0)⟩| ≥ d^β · √k/d, for β > (log d)^{−c}.
Spherical Gaussian Mixture or Multiview Mixture Model
Initialize with samples whose noise norm is bounded by √d · σ, for σ small enough as a function of k.
“Analyzing Tensor Power Method Dynamics: Applications to Learning Overcomplete Latent Variable Models” by A. Anandkumar, R. Ge and M. Janzamin, Nov. 2014.
SLIDE 107
Outline
1. Introduction
2. Summary of Results
3. Recap of Orthogonal Matrix and Tensor Decomposition
4. Overcomplete (Non-Orthogonal) Tensor Decomposition
5. Sample Complexity Analysis
6. Numerical Results
7. Conclusion
SLIDE 108 High-level Intuition for Sample Bounds
Multi-view Model: x1 = Ah + z1, where z1 is noise. Exact moment: T = ∑_i wi · ai ⊗ bi ⊗ ci.
Sample moment: T̂ = (1/n) ∑_{i∈[n]} x1^i ⊗ x2^i ⊗ x3^i.
Naïve idea: ‖T̂ − T‖ ≤ ‖mat(T̂) − mat(T)‖, apply matrix Bernstein's inequality.
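A minimal numpy sketch of the sample moment T̂, assuming the three views are stacked row-wise in arrays X1, X2, X3 of shape (n, d); the function name is illustrative.

```python
import numpy as np

def empirical_cross_moment(X1, X2, X3):
    """hat{T} = (1/n) * sum_i x1_i ⊗ x2_i ⊗ x3_i for views stacked in (n, d) arrays."""
    n = X1.shape[0]
    return np.einsum('ni,nj,nl->ijl', X1, X2, X3) / n
```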
SLIDE 109 High-level Intuition for Sample Bounds
Multi-view Model: x1 = Ah + z1, where z1 is noise. Exact moment: T = ∑_i wi · ai ⊗ bi ⊗ ci.
Sample moment: T̂ = (1/n) ∑_{i∈[n]} x1^i ⊗ x2^i ⊗ x3^i.
Naïve idea: ‖T̂ − T‖ ≤ ‖mat(T̂) − mat(T)‖, apply matrix Bernstein's inequality.
Our idea: careful ǫ-net covering for ‖T̂ − T‖. T̂ − T has many terms, e.g., the all-noise term (1/n) ∑_{i∈[n]} z1^i ⊗ z2^i ⊗ z3^i and signal-noise terms. Need to bound (1/n) ∑_{i∈[n]} ⟨z1^i, u⟩⟨z2^i, v⟩⟨z3^i, w⟩ for all u, v, w ∈ Sd−1.
Classify inner products into buckets and bound them separately. Tight sample bounds for a range of latent variable models.
“Provable Learning of Overcomplete Latent Variable Models: Semi-supervised and Unsupervised Settings” by A. Anandkumar, R. Ge and M. Janzamin, Aug. 2014.
SLIDE 110
Outline
1. Introduction
2. Summary of Results
3. Recap of Orthogonal Matrix and Tensor Decomposition
4. Overcomplete (Non-Orthogonal) Tensor Decomposition
5. Sample Complexity Analysis
6. Numerical Results
7. Conclusion
SLIDE 111 Synthetic experiments
Learning a multiview Gaussian mixture. Random mixture components. d = 100, k ∈ {10, 20, 50, 100, 200, 500}, n = 1000. Random initialization.
[Plot: recovery rate of the algorithm, showing the ratio of recovered components vs. the number of initializations (log scales), for d = 100 and k ∈ {10, 20, 50, 100, 200, 500}.]
SLIDE 112
Outline
1. Introduction
2. Summary of Results
3. Recap of Orthogonal Matrix and Tensor Decomposition
4. Overcomplete (Non-Orthogonal) Tensor Decomposition
5. Sample Complexity Analysis
6. Numerical Results
7. Conclusion
SLIDE 113
Conclusion
Learning overcomplete latent variable models.
⋆ Method-of-moments. ⋆ Tensor power iteration.
Robustness to noise. Sample complexity bounds for a range of LVMs.
⋆ Unsupervised setting. ⋆ Semi-supervised setting.
SLIDE 114
Conclusion
Learning overcomplete latent variable models.
⋆ Method-of-moments. ⋆ Tensor power iteration.
Robustness to noise. Sample complexity bounds for a range of LVMs.
⋆ Unsupervised setting. ⋆ Semi-supervised setting.
Coming: removing the approximation error √k/d.
SLIDE 115
Conclusion
Learning overcomplete latent variable models.
⋆ Method-of-moments. ⋆ Tensor power iteration.
Robustness to noise. Sample complexity bounds for a range of LVMs.
⋆ Unsupervised setting. ⋆ Semi-supervised setting.
Coming: removing the approximation error √k/d.
Thank you!