
Learning Overcomplete Latent Variable Models through Tensor Methods - PowerPoint PPT Presentation

Learning Overcomplete Latent Variable Models through Tensor Methods. Majid Janzamin (UC Irvine). Joint work with Anima Anandkumar (UC Irvine) and Rong Ge (Microsoft Research). Latent Variable Modeling. Goal: discover hidden effects from observed variables.


  1. Matrix vs. Tensor Decomposition. Uniqueness of decomposition. Matrix decomposition: distinct weights; orthogonal components, i.e., ⟨a_i, a_j⟩ = 0 for i ≠ j. Too limiting; otherwise, only learning up to a subspace is possible. Tensor decomposition: same weights; non-orthogonal components ⇒ overcomplete models. More general models.

  2. Matrix vs. Tensor Decomposition. Uniqueness of decomposition. Matrix decomposition: distinct weights; orthogonal components, i.e., ⟨a_i, a_j⟩ = 0 for i ≠ j. Too limiting; otherwise, only learning up to a subspace is possible. Tensor decomposition: same weights; non-orthogonal components ⇒ overcomplete models. More general models. Focus on tensor decomposition for learning LVMs.

  3. Overcomplete Latent Variable Models Overcomplete Latent Representations Latent dimensionality > observed dimensionality, i.e., k > d . Flexible modeling, robust to noise. Applicable in speech and image modeling. Large amount of unlabeled samples.

  4. Overcomplete Latent Variable Models Overcomplete Latent Representations Latent dimensionality > observed dimensionality, i.e., k > d . Flexible modeling, robust to noise. Applicable in speech and image modeling. Large amount of unlabeled samples. Possible to learn when using higher (e.g., 3rd) order tensor moment.

  5. Overcomplete Latent Variable Models. Overcomplete Latent Representations: latent dimensionality > observed dimensionality, i.e., k > d. Flexible modeling, robust to noise. Applicable in speech and image modeling. Large amount of unlabeled samples. Possible to learn when using higher (e.g., 3rd) order tensor moment. Example: T ∈ R^(2×2×2) with rank 3 (d = 2, k = 3): T(:,:,1) = [1 0; 0 −1], T(:,:,2) = [0 1; 1 0], which can be written as a sum of three rank-1 terms u_j ⊗ v_j ⊗ w_j but not fewer (one such decomposition is verified in the sketch below).
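A small numpy check of this 2×2×2 example; the helper rank1 and the three specific rank-1 terms below are one valid choice of mine, not necessarily the terms used on the slide.

```python
# A minimal check: build T from its two slices and verify a 3-term rank-1 decomposition.
import numpy as np

e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])

def rank1(a, b, c):
    """Outer product a (x) b (x) c as a d x d x d array."""
    return np.einsum('i,j,l->ijl', a, b, c)

# Target tensor: T(:,:,1) = [[1, 0], [0, -1]], T(:,:,2) = [[0, 1], [1, 0]].
T = np.zeros((2, 2, 2))
T[:, :, 0] = [[1, 0], [0, -1]]
T[:, :, 1] = [[0, 1], [1, 0]]

# Three real rank-1 terms whose sum reproduces T exactly (k = 3 > d = 2).
T_hat = (rank1(e1, e1, np.array([1.0, -1.0]))
         + rank1(e2, e2, np.array([-1.0, -1.0]))
         + rank1(e1 + e2, e1 + e2, np.array([0.0, 1.0])))

assert np.allclose(T, T_hat)
```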

  6. Overcomplete Latent Variable Models. Overcomplete Latent Representations: latent dimensionality > observed dimensionality, i.e., k > d. Flexible modeling, robust to noise. Applicable in speech and image modeling. Large amount of unlabeled samples. Possible to learn when using higher (e.g., 3rd) order tensor moment. So far: learning LVMs; spectral methods (method of moments); overcomplete LVMs. This work: theoretical guarantees for the above.

  7. Outline: 1. Introduction; 2. Summary of Results; 3. Recap of Orthogonal Matrix and Tensor Decomposition; 4. Overcomplete (Non-Orthogonal) Tensor Decomposition; 5. Sample Complexity Analysis; 6. Numerical Results; 7. Conclusion.

  8. Spherical Gaussian Mixtures. Assumptions: k components, d: observed dimension. Component means a_i incoherent: randomly drawn from the sphere. Spherical variance (σ²/d)·I (assumed known).

  9. Spherical Gaussian Mixtures. Assumptions: k components, d: observed dimension. Component means a_i incoherent: randomly drawn from the sphere. Spherical variance (σ²/d)·I (assumed known). In this talk: special case. Noise norm σ² = 1: same as the signal. Uniform probability of components.

  10. Spherical Gaussian Mixtures. Assumptions: k components, d: observed dimension. Component means a_i incoherent: randomly drawn from the sphere. Spherical variance (σ²/d)·I (assumed known). In this talk: special case. Noise norm σ² = 1: same as the signal. Uniform probability of components. Tensor for learning (Hsu, Kakade 2012): M3 := E[x^⊗3] − σ² Σ_{i∈[d]} (E[x] ⊗ e_i ⊗ e_i + ⋯) ⇒ M3 = Σ_{j∈[k]} w_j a_j ⊗ a_j ⊗ a_j.
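A rough numerical sketch of this moment construction, assuming (as above) a uniform mixture with spherical noise covariance (σ²/d)·I; the estimated M3 should approach Σ_j w_j a_j ⊗ a_j ⊗ a_j as the number of samples grows. The parameter choices are arbitrary.

```python
# Estimate M3 from samples of a spherical Gaussian mixture and compare it with
# sum_j w_j a_j (x) a_j (x) a_j. Noise covariance (sigma^2/d) I is my assumption,
# matching the slide's normalization.
import numpy as np

rng = np.random.default_rng(0)
d, k, n, sigma2 = 15, 5, 100_000, 1.0
A = rng.standard_normal((d, k))
A /= np.linalg.norm(A, axis=0)                 # unit-norm component means a_j
w = np.full(k, 1.0 / k)                        # uniform mixing weights

labels = rng.integers(k, size=n)
x = A[:, labels].T + np.sqrt(sigma2 / d) * rng.standard_normal((n, d))

m = x.mean(axis=0)                             # E[x]
E3 = np.einsum('ni,nj,nl->ijl', x, x, x) / n   # empirical E[x (x) x (x) x]
I = np.eye(d)
corr = (np.einsum('i,jl->ijl', m, I)           # sum_i of E[x] (x) e_i (x) e_i
        + np.einsum('j,il->ijl', m, I)         # ... and its two permutations
        + np.einsum('l,ij->ijl', m, I))
M3 = E3 - (sigma2 / d) * corr

M3_exact = np.einsum('j,ij,kj,lj->ikl', w, A, A, A)   # sum_j w_j a_j^{(x)3}
print(np.linalg.norm(M3 - M3_exact))           # small, and shrinks as n grows
```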

  11. Semi-supervised Learning of Gaussian Mixtures. n unlabeled samples; m_j: labeled samples for component j. No. of mixture components: k = o(d^1.5). No. of labeled samples: m_j = Ω̃(1). No. of unlabeled samples: n = Ω̃(k). Our result: achieved error with n unlabeled samples: max_j ‖â_j − a_j‖ = Õ(√(k/n)) + Õ(√k / d). Linear convergence. Can handle (polynomially) overcomplete mixtures. Extremely small number of labeled samples: polylog(d). Sample complexity is tight: need Ω̃(k) samples! Approximation error: decaying in high dimensions.

  12. Unsupervised Learning of Gaussian Mixtures. No. of mixture components: k = C · d. No. of unlabeled samples: n = Ω̃(k · d). Computational complexity: Õ(k^(C²)). Our result: achieved error with n unlabeled samples: max_j ‖â_j − a_j‖ = Õ(√(k/n)) + Õ(√k / d). Linear convergence. Error: same as before, for the semi-supervised setting. Sample complexity: worse than semi-supervised, but better than previous works (no dependence on the condition number of A). Computational complexity: polynomial when k = Θ(d).

  13. Multi-view Mixture Models. (Graphical model: hidden h with observed views x_1, x_2, x_3.) A = [a_1 a_2 ⋯ a_k] ∈ R^(d×k); similarly B and C. Linear model: x_1 = Ah + z_1, x_2 = Bh + z_2, x_3 = Ch + z_3.

  14. Multi-view Mixture Models. (Graphical model: hidden h with observed views x_1, x_2, x_3.) A = [a_1 a_2 ⋯ a_k] ∈ R^(d×k); similarly B and C. Linear model: x_1 = Ah + z_1, x_2 = Bh + z_2, x_3 = Ch + z_3. Incoherence: component means a_i are incoherent (randomly drawn from the unit sphere); similarly the b_i's and c_i's. The zero-mean noise terms z_l satisfy RIP, e.g., Gaussian, Bernoulli. Same results as for Gaussian mixtures.

  15. Independent Component Analysis. (Sources h_1, h_2, …, h_k are mixed through A into observations x_1, x_2, …, x_d.) x = Ah: independent sources, unknown mixing. Blind source separation of speech, image, video.

  16. Independent Component Analysis. (Sources h_1, h_2, …, h_k are mixed through A into observations x_1, x_2, …, x_d.) x = Ah: independent sources, unknown mixing. Blind source separation of speech, image, video. Sources h are sub-Gaussian (but not Gaussian). Columns of A are incoherent. Form the cumulant tensor M4 := E[x^⊗4] − ⋯. n samples, k sources, d dimensions.

  17. Independent Component Analysis. (Sources h_1, h_2, …, h_k are mixed through A into observations x_1, x_2, …, x_d.) x = Ah: independent sources, unknown mixing. Blind source separation of speech, image, video. Sources h are sub-Gaussian (but not Gaussian). Columns of A are incoherent. Form the cumulant tensor M4 := E[x^⊗4] − ⋯. n samples, k sources, d dimensions. Learning Result. Semi-supervised: k = o(d²), n ≥ Ω̃(max(k², k⁴/d³)). Unsupervised: k = O(d), n ≥ Ω̃(k³). Error: max_j min_{f∈{−1,1}} ‖f·â_j − a_j‖ = Õ(k² / √(n · min(n, d³))) + Õ(√k / d^1.5).

  18. Sparse Coding Sparse representations, low dimensional hidden structures. A few dictionary elements make complicated shapes.

  19. Sparse Coding x = Ah , sparse coefficients, unknown dictionary. Image compression, feature learning, ...

  20. Sparse Coding. x = Ah: sparse coefficients, unknown dictionary. Image compression, feature learning, ... Coefficients h are independent Bernoulli-Gaussian: sparse ICA. Columns of A are incoherent. Form the cumulant tensor M4 := E[x^⊗4] − ⋯. n samples, k dictionary elements, d dimensions, s avg. sparsity.

  21. Sparse Coding. x = Ah: sparse coefficients, unknown dictionary. Image compression, feature learning, ... Coefficients h are independent Bernoulli-Gaussian: sparse ICA. Columns of A are incoherent. Form the cumulant tensor M4 := E[x^⊗4] − ⋯. n samples, k dictionary elements, d dimensions, s avg. sparsity. Learning Result. Semi-supervised: k = o(d²), n ≥ Ω̃(max(sk, s²k²/d³)). Unsupervised: k = O(d), n ≥ Ω̃(sk²). Error: max_j min_{f∈{−1,1}} ‖f·â_j − a_j‖ = Õ(sk / √(n · min(n, d³))) + Õ(√k / d^1.5).

  22. Outline: 1. Introduction; 2. Summary of Results; 3. Recap of Orthogonal Matrix and Tensor Decomposition; 4. Overcomplete (Non-Orthogonal) Tensor Decomposition; 5. Sample Complexity Analysis; 6. Numerical Results; 7. Conclusion.

  23. Recap of Orthogonal Matrix Eigen Analysis. Symmetric M ∈ R^(d×d). Eigenvectors are fixed points: Mv = λv. Eigen decomposition: M = Σ_i λ_i v_i v_i^⊤. Orthogonal: ⟨v_i, v_j⟩ = 0, i ≠ j.

  24. Recap of Orthogonal Matrix Eigen Analysis. Symmetric M ∈ R^(d×d). Eigenvectors are fixed points: Mv = λv. Eigen decomposition: M = Σ_i λ_i v_i v_i^⊤. Orthogonal: ⟨v_i, v_j⟩ = 0, i ≠ j. Uniqueness (identifiability): iff the λ_i's are distinct.

  25. Recap of Orthogonal Matrix Eigen Analysis. Symmetric M ∈ R^(d×d). Eigenvectors are fixed points: Mv = λv. Eigen decomposition: M = Σ_i λ_i v_i v_i^⊤. Orthogonal: ⟨v_i, v_j⟩ = 0, i ≠ j. Uniqueness (identifiability): iff the λ_i's are distinct. Algorithm: power method: v ↦ Mv / ‖Mv‖.

  26. Recap of Orthogonal Matrix Eigen Analysis. Symmetric M ∈ R^(d×d). Eigenvectors are fixed points: Mv = λv. Eigen decomposition: M = Σ_i λ_i v_i v_i^⊤. Orthogonal: ⟨v_i, v_j⟩ = 0, i ≠ j. Uniqueness (identifiability): iff the λ_i's are distinct. Algorithm: power method: v ↦ Mv / ‖Mv‖. Convergence properties: let λ_1 > λ_2 > ⋯ > λ_d. Only the v_i's are fixed points of the power iteration: Mv_i = λ_i v_i.

  27. Recap of Orthogonal Matrix Eigen Analysis. Symmetric M ∈ R^(d×d). Eigenvectors are fixed points: Mv = λv. Eigen decomposition: M = Σ_i λ_i v_i v_i^⊤. Orthogonal: ⟨v_i, v_j⟩ = 0, i ≠ j. Uniqueness (identifiability): iff the λ_i's are distinct. Algorithm: power method: v ↦ Mv / ‖Mv‖. Convergence properties: let λ_1 > λ_2 > ⋯ > λ_d. Only the v_i's are fixed points of the power iteration: Mv_i = λ_i v_i. • v_1 is the only robust fixed point.

  28. Recap of Orthogonal Matrix Eigen Analysis. Symmetric M ∈ R^(d×d). Eigenvectors are fixed points: Mv = λv. Eigen decomposition: M = Σ_i λ_i v_i v_i^⊤. Orthogonal: ⟨v_i, v_j⟩ = 0, i ≠ j. Uniqueness (identifiability): iff the λ_i's are distinct. Algorithm: power method: v ↦ Mv / ‖Mv‖. Convergence properties: let λ_1 > λ_2 > ⋯ > λ_d. Only the v_i's are fixed points of the power iteration: Mv_i = λ_i v_i. • v_1 is the only robust fixed point. • All other v_i's are saddle points.

  29. Recap of Orthogonal Matrix Eigen Analysis. Symmetric M ∈ R^(d×d). Eigenvectors are fixed points: Mv = λv. Eigen decomposition: M = Σ_i λ_i v_i v_i^⊤. Orthogonal: ⟨v_i, v_j⟩ = 0, i ≠ j. Uniqueness (identifiability): iff the λ_i's are distinct. Algorithm: power method: v ↦ Mv / ‖Mv‖. Convergence properties: let λ_1 > λ_2 > ⋯ > λ_d. Only the v_i's are fixed points of the power iteration: Mv_i = λ_i v_i. • v_1 is the only robust fixed point. • All other v_i's are saddle points. Power method recovers v_1 when the initialization v satisfies ⟨v, v_1⟩ ≠ 0.
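A minimal power-method sketch matching this recap; M is built with a known, well-separated spectrum so the top eigenvector is unambiguous (the dimension and iteration count are arbitrary choices of mine).

```python
# Matrix power iteration v -> Mv / ||Mv|| converging to the top eigenvector.
import numpy as np

rng = np.random.default_rng(0)
d = 50
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthonormal basis
lam = np.linspace(1.0, 2.0, d)                     # distinct eigenvalues, top = 2.0
M = (Q * lam) @ Q.T                                # M = sum_i lam_i v_i v_i^T

v = rng.standard_normal(d)
v /= np.linalg.norm(v)                             # generic start: <v, v_1> != 0
for _ in range(500):                               # v -> Mv / ||Mv||
    v = M @ v
    v /= np.linalg.norm(v)

print(abs(Q[:, -1] @ v))                           # ~1: aligned with the top eigenvector
```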

  30. Tensor Rank and Tensor Decomposition. Rank-1 tensor: T = w · a ⊗ b ⊗ c ⇔ T(i, j, l) = w · a(i) · b(j) · c(l).

  31. Tensor Rank and Tensor Decomposition. Rank-1 tensor: T = w · a ⊗ b ⊗ c ⇔ T(i, j, l) = w · a(i) · b(j) · c(l). CANDECOMP/PARAFAC (CP) Decomposition: T = Σ_{j∈[k]} w_j a_j ⊗ b_j ⊗ c_j ∈ R^(d×d×d), with a_j, b_j, c_j ∈ S^(d−1). (Figure: tensor T drawn as w_1 · a_1 ⊗ b_1 ⊗ c_1 + w_2 · a_2 ⊗ b_2 ⊗ c_2 + ⋯.)

  32. Tensor Rank and Tensor Decomposition. Rank-1 tensor: T = w · a ⊗ b ⊗ c ⇔ T(i, j, l) = w · a(i) · b(j) · c(l). CANDECOMP/PARAFAC (CP) Decomposition: T = Σ_{j∈[k]} w_j a_j ⊗ b_j ⊗ c_j ∈ R^(d×d×d), with a_j, b_j, c_j ∈ S^(d−1). k: tensor rank, d: ambient dimension. k ≤ d: undercomplete; k > d: overcomplete.

  33. Tensor Rank and Tensor Decomposition. Rank-1 tensor: T = w · a ⊗ b ⊗ c ⇔ T(i, j, l) = w · a(i) · b(j) · c(l). CANDECOMP/PARAFAC (CP) Decomposition: T = Σ_{j∈[k]} w_j a_j ⊗ b_j ⊗ c_j ∈ R^(d×d×d), with a_j, b_j, c_j ∈ S^(d−1). k: tensor rank, d: ambient dimension. k ≤ d: undercomplete; k > d: overcomplete. This talk: guarantees for overcomplete tensor decomposition.
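For reference, a CP tensor like the one above can be formed directly from its factor matrices; the sketch below builds an overcomplete (k > d) example with unit-norm random components (the sizes are arbitrary).

```python
# Form T = sum_j w_j a_j (x) b_j (x) c_j from factor matrices A, B, C.
import numpy as np

rng = np.random.default_rng(0)
d, k = 10, 15                                    # k > d: overcomplete
A, B, C = (rng.standard_normal((d, k)) for _ in range(3))
for M in (A, B, C):
    M /= np.linalg.norm(M, axis=0)               # columns on the unit sphere S^{d-1}
w = np.ones(k)

T = np.einsum('j,ij,kj,lj->ikl', w, A, B, C)     # d x d x d tensor of CP rank <= k
print(T.shape)
```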

  34. Background on Tensor Decomposition. T = Σ_{i∈[k]} w_i a_i ⊗ b_i ⊗ c_i, a_i, b_i, c_i ∈ S^(d−1). Theoretical guarantees: Tensor decompositions in psychometrics (Cattell '44). CP tensor decomposition (Harshman '70, Carroll & Chang '70). Identifiability of CP tensor decomposition (Kruskal '76). Orthogonal decomposition (Zhang & Golub '01, Kolda '01, Anandkumar et al. '12). Tensor decomposition through (lifted) linear equations (De Lathauwer '07): works for overcomplete tensors. Tensor decomposition through simultaneous diagonalization: perturbation analysis (Goyal et al. '13, Bhaskara et al. '13).

  35. Background on Tensor Decompositions (contd.). T = Σ_{i∈[k]} w_i a_i ⊗ b_i ⊗ c_i, a_i, b_i, c_i ∈ S^(d−1). Practice: Alternating least squares (ALS). Let A = [a_1 | a_2 | ⋯ | a_k] and similarly B, C. Fix estimates of two of the modes (say A and B) and re-estimate the third. Iterative updates, low computational complexity. No theoretical guarantees. In this talk: analysis of alternating minimization.

  36. Tensors as Multilinear Transformations. Tensor T ∈ R^(d×d×d); vectors v, w ∈ R^d.

  37. Tensors as Multilinear Transformations. Tensor T ∈ R^(d×d×d); vectors v, w ∈ R^d. T(I, v, w) := Σ_{j,l∈[d]} v_j w_l T(:, j, l) ∈ R^d.

  38. Tensors as Multilinear Transformations. Tensor T ∈ R^(d×d×d); vectors v, w ∈ R^d. T(I, v, w) := Σ_{j,l∈[d]} v_j w_l T(:, j, l) ∈ R^d. For a matrix M ∈ R^(d×d): M(I, w) = Mw = Σ_{l∈[d]} w_l M(:, l) ∈ R^d.
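A quick sketch of this multilinear contraction with einsum, including the matrix special case (all sizes are arbitrary).

```python
# T(I, v, w) = sum_{j,l} v_j w_l T(:, j, l), and M(I, w) = M w for a matrix.
import numpy as np

rng = np.random.default_rng(0)
d = 8
T = rng.standard_normal((d, d, d))
v, w = rng.standard_normal(d), rng.standard_normal(d)

Tvw = np.einsum('ijl,j,l->i', T, v, w)           # contract the 2nd and 3rd modes
brute = sum(v[j] * w[l] * T[:, j, l] for j in range(d) for l in range(d))
assert np.allclose(Tvw, brute)

M = rng.standard_normal((d, d))
assert np.allclose(np.einsum('il,l->i', M, w), M @ w)   # matrix case: M(I, w) = M w
```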

  39. Challenges in Tensor Decomposition. Symmetric tensor T ∈ R^(d×d×d): T = Σ_{i∈[k]} λ_i v_i ⊗ v_i ⊗ v_i. Challenges in tensors: a decomposition may not always exist for general tensors; finding the decomposition is NP-hard in general.

  40. Challenges in Tensor Decomposition. Symmetric tensor T ∈ R^(d×d×d): T = Σ_{i∈[k]} λ_i v_i ⊗ v_i ⊗ v_i. Challenges in tensors: a decomposition may not always exist for general tensors; finding the decomposition is NP-hard in general. Tractable case: orthogonal tensor decomposition (⟨v_i, v_j⟩ = 0, i ≠ j). The {v_i} are eigenvectors: T(I, v_i, v_i) = λ_i v_i.

  41. Challenges in Tensor Decomposition. Symmetric tensor T ∈ R^(d×d×d): T = Σ_{i∈[k]} λ_i v_i ⊗ v_i ⊗ v_i. Challenges in tensors: a decomposition may not always exist for general tensors; finding the decomposition is NP-hard in general. Tractable case: orthogonal tensor decomposition (⟨v_i, v_j⟩ = 0, i ≠ j). The {v_i} are eigenvectors: T(I, v_i, v_i) = λ_i v_i. Bad news: there can be other eigenvectors (unlike the matrix case). When λ_i ≡ 1, v = (v_1 + v_2)/√2 satisfies T(I, v, v) = (1/√2) v.

  42. Challenges in Tensor Decomposition. Symmetric tensor T ∈ R^(d×d×d): T = Σ_{i∈[k]} λ_i v_i ⊗ v_i ⊗ v_i. Challenges in tensors: a decomposition may not always exist for general tensors; finding the decomposition is NP-hard in general. Tractable case: orthogonal tensor decomposition (⟨v_i, v_j⟩ = 0, i ≠ j). The {v_i} are eigenvectors: T(I, v_i, v_i) = λ_i v_i. Bad news: there can be other eigenvectors (unlike the matrix case). When λ_i ≡ 1, v = (v_1 + v_2)/√2 satisfies T(I, v, v) = (1/√2) v. How do we avoid spurious solutions (not part of the decomposition)?
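The spurious eigenvector above is easy to verify numerically; the sketch below uses random orthonormal components and λ_i ≡ 1 (the sizes are arbitrary).

```python
# With orthonormal v_i and lambda_i = 1, v = (v_1 + v_2)/sqrt(2) satisfies
# T(I, v, v) = v/sqrt(2) even though v is not one of the components.
import numpy as np

rng = np.random.default_rng(0)
d, k = 10, 4
V, _ = np.linalg.qr(rng.standard_normal((d, k)))   # orthonormal v_1, ..., v_k
T = np.einsum('ij,kj,lj->ikl', V, V, V)            # sum_i v_i (x) v_i (x) v_i

v = (V[:, 0] + V[:, 1]) / np.sqrt(2)
Tvv = np.einsum('ijl,j,l->i', T, v, v)
assert np.allclose(Tvv, v / np.sqrt(2))            # eigenvector with eigenvalue 1/sqrt(2)
```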

  43. Orthogonal Tensor Power Method. Symmetric orthogonal tensor T ∈ R^(d×d×d): T = Σ_{i∈[k]} λ_i v_i ⊗ v_i ⊗ v_i.

  44. Orthogonal Tensor Power Method. Symmetric orthogonal tensor T ∈ R^(d×d×d): T = Σ_{i∈[k]} λ_i v_i ⊗ v_i ⊗ v_i. Recall the matrix power method: v ↦ M(I, v) / ‖M(I, v)‖.

  45. Orthogonal Tensor Power Method. Symmetric orthogonal tensor T ∈ R^(d×d×d): T = Σ_{i∈[k]} λ_i v_i ⊗ v_i ⊗ v_i. Recall the matrix power method: v ↦ M(I, v) / ‖M(I, v)‖. Algorithm: tensor power method: v ↦ T(I, v, v) / ‖T(I, v, v)‖.

  46. Orthogonal Tensor Power Method. Symmetric orthogonal tensor T ∈ R^(d×d×d): T = Σ_{i∈[k]} λ_i v_i ⊗ v_i ⊗ v_i. Recall the matrix power method: v ↦ M(I, v) / ‖M(I, v)‖. Algorithm: tensor power method: v ↦ T(I, v, v) / ‖T(I, v, v)‖. • The {v_i}'s are the only robust fixed points.

  47. Orthogonal Tensor Power Method. Symmetric orthogonal tensor T ∈ R^(d×d×d): T = Σ_{i∈[k]} λ_i v_i ⊗ v_i ⊗ v_i. Recall the matrix power method: v ↦ M(I, v) / ‖M(I, v)‖. Algorithm: tensor power method: v ↦ T(I, v, v) / ‖T(I, v, v)‖. • The {v_i}'s are the only robust fixed points. • All other eigenvectors are saddle points.

  48. Orthogonal Tensor Power Method. Symmetric orthogonal tensor T ∈ R^(d×d×d): T = Σ_{i∈[k]} λ_i v_i ⊗ v_i ⊗ v_i. Recall the matrix power method: v ↦ M(I, v) / ‖M(I, v)‖. Algorithm: tensor power method: v ↦ T(I, v, v) / ‖T(I, v, v)‖. • The {v_i}'s are the only robust fixed points. • All other eigenvectors are saddle points. For an orthogonal tensor, no spurious local optima!
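A minimal sketch of this tensor power iteration on a randomly generated orthogonally decomposable tensor; a random start should end up aligned with one of the components (which one depends on the initialization). Parameter choices are mine.

```python
# Tensor power iteration v -> T(I, v, v)/||T(I, v, v)|| on an orthogonal tensor.
import numpy as np

rng = np.random.default_rng(0)
d, k = 20, 5
V, _ = np.linalg.qr(rng.standard_normal((d, k)))   # orthonormal components v_i
lam = rng.uniform(1.0, 2.0, size=k)
T = np.einsum('j,ij,kj,lj->ikl', lam, V, V, V)     # sum_i lambda_i v_i^{(x)3}

v = rng.standard_normal(d)
v /= np.linalg.norm(v)
for _ in range(30):                                # quadratic convergence: few steps
    v = np.einsum('ijl,j,l->i', T, v, v)
    v /= np.linalg.norm(v)

print(np.max(np.abs(V.T @ v)))                     # ~1: converged to some v_i
```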

  49. Matrix vs. tensor power iteration Matrix power iteration : Tensor power iteration :

  50. Matrix vs. tensor power iteration. Matrix power iteration: (1) requires a gap between the largest and second-largest eigenvalue; a property of the matrix only. Tensor power iteration: (1) requires a gap between the largest and second-largest λ_i |c_i|, where the initialization vector is v = Σ_i c_i v_i; a property of the tensor and of the initialization v.

  51. Matrix vs. tensor power iteration. Matrix power iteration: (1) requires a gap between the largest and second-largest eigenvalue; a property of the matrix only. (2) Converges to the top eigenvector. Tensor power iteration: (1) requires a gap between the largest and second-largest λ_i |c_i|, where the initialization vector is v = Σ_i c_i v_i; a property of the tensor and of the initialization v. (2) Converges to the v_i for which λ_i |c_i| is maximal; could be any of them.

  52. Matrix vs. tensor power iteration. Matrix power iteration: (1) requires a gap between the largest and second-largest eigenvalue; a property of the matrix only. (2) Converges to the top eigenvector. (3) Linear convergence: needs O(log(1/ε)) iterations. Tensor power iteration: (1) requires a gap between the largest and second-largest λ_i |c_i|, where the initialization vector is v = Σ_i c_i v_i; a property of the tensor and of the initialization v. (2) Converges to the v_i for which λ_i |c_i| is maximal; could be any of them. (3) Quadratic convergence: needs O(log log(1/ε)) iterations.

  53. Beyond Orthogonal Tensor Decomposition Limitations Not ALL tensors have orthogonal decomposition (unlike matrices). Orthogonal forms: cannot handle overcomplete tensors ( k > d ) . Overcomplete representations: redundancy leads to flexible modeling, noise resistant, no domain knowledge.

  54. Beyond Orthogonal Tensor Decomposition. Limitations: Not ALL tensors have orthogonal decomposition (unlike matrices). Orthogonal forms: cannot handle overcomplete tensors (k > d). Overcomplete representations: redundancy leads to flexible modeling, noise resistant, no domain knowledge. Undercomplete tensors (k ≤ d) with full-rank components: non-orthogonal decomposition T_1 = Σ_i w_i a_i ⊗ a_i ⊗ a_i. Whitening matrix W; multilinear transform: T_2 = T_1(W, W, W). (Figure: W maps the components a_1, a_2, a_3 of tensor T_1 to orthogonal components v_1, v_2, v_3 of tensor T_2.) Limitations: depends on the condition number, sensitive to noise.

  55. Beyond Orthogonal Tensor Decomposition. Limitations: Not ALL tensors have orthogonal decomposition (unlike matrices). Orthogonal forms: cannot handle overcomplete tensors (k > d). Overcomplete representations: redundancy leads to flexible modeling, noise resistant, no domain knowledge. Undercomplete tensors (k ≤ d) with full-rank components: non-orthogonal decomposition T_1 = Σ_i w_i a_i ⊗ a_i ⊗ a_i. Whitening matrix W; multilinear transform: T_2 = T_1(W, W, W). (Figure: W maps the components a_1, a_2, a_3 of tensor T_1 to orthogonal components v_1, v_2, v_3 of tensor T_2.) Limitations: depends on the condition number, sensitive to noise. This talk: guarantees for overcomplete tensor decomposition.
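A sketch of the whitening reduction described above for the undercomplete case, under the assumption that the second moment M2 = Σ_i w_i a_i a_i^⊤ is available exactly; after the multilinear transform the components become orthonormal. Variable names and sizes are mine.

```python
# Whitening: W built from the rank-k eigendecomposition of M2 orthogonalizes T1.
import numpy as np

rng = np.random.default_rng(0)
d, k = 12, 5
A = rng.standard_normal((d, k))
A /= np.linalg.norm(A, axis=0)                     # unit-norm, non-orthogonal a_i
w = rng.uniform(1.0, 2.0, size=k)

M2 = (A * w) @ A.T                                 # sum_i w_i a_i a_i^T
T1 = np.einsum('j,ij,kj,lj->ikl', w, A, A, A)      # sum_i w_i a_i^{(x)3}

vals, vecs = np.linalg.eigh(M2)
U, D = vecs[:, -k:], vals[-k:]                     # top-k eigenpairs (rank of M2)
W = U / np.sqrt(D)                                 # whitening matrix: W^T M2 W = I

T2 = np.einsum('ijl,ia,jb,lc->abc', T1, W, W, W)   # multilinear transform T1(W, W, W)
V = (W.T @ A) * np.sqrt(w)                         # whitened components sqrt(w_i) W^T a_i
assert np.allclose(V.T @ V, np.eye(k))             # they are orthonormal
assert np.allclose(T2, np.einsum('j,ij,kj,lj->ikl', 1 / np.sqrt(w), V, V, V))
```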

  56. Outline: 1. Introduction; 2. Summary of Results; 3. Recap of Orthogonal Matrix and Tensor Decomposition; 4. Overcomplete (Non-Orthogonal) Tensor Decomposition; 5. Sample Complexity Analysis; 6. Numerical Results; 7. Conclusion.

  57. Non-orthogonal Tensor Decomposition. Multiview linear mixture model (hidden h with observed views x_1, x_2, x_3). Linear model: E[x_1 | h] = a_h, E[x_2 | h] = b_h, E[x_3 | h] = c_h. E[x_1 ⊗ x_2 ⊗ x_3] = Σ_{i∈[k]} w_i a_i ⊗ b_i ⊗ c_i.

  58. Non-orthogonal Tensor Decomposition. T = Σ_{i∈[k]} w_i a_i ⊗ b_i ⊗ c_i, a_i, b_i, c_i ∈ S^(d−1). Practice: Alternating least squares (ALS). Many spurious local optima; no theoretical guarantee.

  59. Non-orthogonal Tensor Decomposition. T = Σ_{i∈[k]} w_i a_i ⊗ b_i ⊗ c_i, a_i, b_i, c_i ∈ S^(d−1). Practice: Alternating least squares (ALS). Many spurious local optima; no theoretical guarantee. Rank-1 ALS (best rank-1 approximation): min_{a,b,c ∈ S^(d−1), w ∈ R} ‖T − w · a ⊗ b ⊗ c‖_F.

  60. Non-orthogonal Tensor Decomposition. T = Σ_{i∈[k]} w_i a_i ⊗ b_i ⊗ c_i, a_i, b_i, c_i ∈ S^(d−1). Practice: Alternating least squares (ALS). Many spurious local optima; no theoretical guarantee. Rank-1 ALS (best rank-1 approximation): min_{a,b,c ∈ S^(d−1), w ∈ R} ‖T − w · a ⊗ b ⊗ c‖_F. Fix a^(t), b^(t) and update c^(t+1) ⇒ c^(t+1) ∝ T(a^(t), b^(t), I). Rank-1 ALS iteration ≡ asymmetric power iteration.

  61. Alternating minimization. Rank-1 ALS iteration (power iteration): Initialization: a^(0), b^(0), c^(0). Update in the t-th step: fix a^(t), b^(t) and set c^(t+1) ∝ T(a^(t), b^(t), I). After (approximate) convergence, restart.

  62. Alternating minimization. Rank-1 ALS iteration (power iteration): Initialization: a^(0), b^(0), c^(0). Update in the t-th step: fix a^(t), b^(t) and set c^(t+1) ∝ T(a^(t), b^(t), I). After (approximate) convergence, restart. Simple update: trivially parallelizable and hence scalable. Computation is linear in the dimension, the rank, and the number of different runs.

  63. Alternating minimization. Rank-1 ALS iteration (power iteration): Initialization: a^(0), b^(0), c^(0). Update in the t-th step: fix a^(t), b^(t) and set c^(t+1) ∝ T(a^(t), b^(t), I). After (approximate) convergence, restart. Simple update: trivially parallelizable and hence scalable. Computation is linear in the dimension, the rank, and the number of different runs. Challenges: The optimization problem is non-convex with multiple local optima. Does alternating minimization improve the objective in each step? Does it recover the a_i, b_i, c_i's? Not true in general. Noisy tensor decomposition.

  64. Alternating minimization. Rank-1 ALS iteration (power iteration): Initialization: a^(0), b^(0), c^(0). Update in the t-th step: fix a^(t), b^(t) and set c^(t+1) ∝ T(a^(t), b^(t), I). After (approximate) convergence, restart. Simple update: trivially parallelizable and hence scalable. Computation is linear in the dimension, the rank, and the number of different runs. Challenges: The optimization problem is non-convex with multiple local optima. Does alternating minimization improve the objective in each step? Does it recover the a_i, b_i, c_i's? Not true in general. Noisy tensor decomposition. Natural conditions under which Alt-Min has guarantees?
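A minimal sketch of the rank-1 alternating updates described above, with random restarts. This illustrates the update rule only, not the paper's full procedure (its initialization scheme and noise handling are omitted); the function name als_rank1 and all parameter choices are mine.

```python
# Rank-1 ALS: fix two modes, update the third via T(., ., I), cycle, restart.
import numpy as np

def als_rank1(T, n_restarts=20, n_iters=100, rng=None):
    rng = rng or np.random.default_rng(0)
    d = T.shape[0]
    best = None
    for _ in range(n_restarts):
        a, b, c = (x / np.linalg.norm(x) for x in rng.standard_normal((3, d)))
        for _ in range(n_iters):
            c = np.einsum('ijl,i,j->l', T, a, b); c /= np.linalg.norm(c)   # c ~ T(a, b, I)
            a = np.einsum('ijl,j,l->i', T, b, c); a /= np.linalg.norm(a)   # a ~ T(I, b, c)
            b = np.einsum('ijl,i,l->j', T, a, c); b /= np.linalg.norm(b)   # b ~ T(a, I, c)
        w = np.einsum('ijl,i,j,l->', T, a, b, c)      # weight of the rank-1 fit
        if best is None or w > best[0]:
            best = (w, a, b, c)
    return best

# Overcomplete test tensor (k > d) with incoherent random unit-norm components.
rng = np.random.default_rng(1)
d, k = 25, 30
A, B, C = (rng.standard_normal((d, k)) for _ in range(3))
for M in (A, B, C):
    M /= np.linalg.norm(M, axis=0)
T = np.einsum('ij,kj,lj->ikl', A, B, C)               # w_i = 1 for all components

w, a, b, c = als_rank1(T)
print(np.max(np.abs(A.T @ a)))    # ~1 if this run latched onto some component a_j
```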

  65. Special case: Orthogonal Setting. T = Σ_{i∈[k]} w_i a_i ⊗ b_i ⊗ c_i, a_i, b_i, c_i ∈ S^(d−1), with ⟨a_i, a_j⟩ = 0 for i ≠ j; similarly for b, c. Alternating updates: c^(t+1) ∝ T(a^(t), b^(t), I) = Σ_{i∈[k]} w_i ⟨a_i, a^(t)⟩ ⟨b_i, b^(t)⟩ c_i. The a_i, b_i, c_i are stationary points.

  66. Special case: Orthogonal Setting. T = Σ_{i∈[k]} w_i a_i ⊗ b_i ⊗ c_i, a_i, b_i, c_i ∈ S^(d−1), with ⟨a_i, a_j⟩ = 0 for i ≠ j; similarly for b, c. Alternating updates: c^(t+1) ∝ T(a^(t), b^(t), I) = Σ_{i∈[k]} w_i ⟨a_i, a^(t)⟩ ⟨b_i, b^(t)⟩ c_i. The a_i, b_i, c_i are stationary points. They are the ONLY local optima of the best rank-1 approximation problem. Guaranteed recovery through alternating minimization.

  67. Special case: Orthogonal Setting. T = Σ_{i∈[k]} w_i a_i ⊗ b_i ⊗ c_i, a_i, b_i, c_i ∈ S^(d−1), with ⟨a_i, a_j⟩ = 0 for i ≠ j; similarly for b, c. Alternating updates: c^(t+1) ∝ T(a^(t), b^(t), I) = Σ_{i∈[k]} w_i ⟨a_i, a^(t)⟩ ⟨b_i, b^(t)⟩ c_i. The a_i, b_i, c_i are stationary points. They are the ONLY local optima of the best rank-1 approximation problem. Guaranteed recovery through alternating minimization. Perturbation analysis [AGH+ 2012]: under poly(d) random initializations and bounded noise conditions.

  68. Our Setup. So far: General tensor decomposition: NP-hard. Orthogonal tensors: too limiting. Tractable cases? Covers overcomplete tensors? "Guaranteed Non-Orthogonal Tensor Decomposition via Alternating Rank-1 Updates" by A. Anandkumar, R. Ge, and M. Janzamin, Feb. 2014.

  69. Our Setup. So far: General tensor decomposition: NP-hard. Orthogonal tensors: too limiting. Tractable cases? Covers overcomplete tensors? Our framework: Incoherent Components: |⟨a_i, a_j⟩| = O(1/√d) for i ≠ j; similarly for b, c. Can handle overcomplete tensors. Satisfied by random (generic) vectors. Guaranteed recovery for alternating minimization? "Guaranteed Non-Orthogonal Tensor Decomposition via Alternating Rank-1 Updates" by A. Anandkumar, R. Ge, and M. Janzamin, Feb. 2014.

  70. Analysis of One Step Update. T = Σ_{i∈[k]} w_i a_i ⊗ b_i ⊗ c_i, a_i, b_i, c_i ∈ S^(d−1). Basic intuition: let â, b̂ be "close to" a_1, b_1. Alternating update: ĉ ∝ T(â, b̂, I) = Σ_{i∈[k]} w_i ⟨a_i, â⟩ ⟨b_i, b̂⟩ c_i = w_1 ⟨a_1, â⟩ ⟨b_1, b̂⟩ c_1 + T_{−1}(â, b̂, I), where T_{−1} := Σ_{i≠1} w_i a_i ⊗ b_i ⊗ c_i.

  71. Analysis of One Step Update. T = Σ_{i∈[k]} w_i a_i ⊗ b_i ⊗ c_i, a_i, b_i, c_i ∈ S^(d−1). Basic intuition: let â, b̂ be "close to" a_1, b_1. Alternating update: ĉ ∝ T(â, b̂, I) = Σ_{i∈[k]} w_i ⟨a_i, â⟩ ⟨b_i, b̂⟩ c_i = w_1 ⟨a_1, â⟩ ⟨b_1, b̂⟩ c_1 + T_{−1}(â, b̂, I). T_{−1}(â, b̂, I) = 0 in the orthogonal case when â = a_1, b̂ = b_1.

  72. Analysis of One Step Update. T = Σ_{i∈[k]} w_i a_i ⊗ b_i ⊗ c_i, a_i, b_i, c_i ∈ S^(d−1). Basic intuition: let â, b̂ be "close to" a_1, b_1. Alternating update: ĉ ∝ T(â, b̂, I) = Σ_{i∈[k]} w_i ⟨a_i, â⟩ ⟨b_i, b̂⟩ c_i = w_1 ⟨a_1, â⟩ ⟨b_1, b̂⟩ c_1 + T_{−1}(â, b̂, I). T_{−1}(â, b̂, I) = 0 in the orthogonal case when â = a_1, b̂ = b_1. Can it be controlled for incoherent (random) vectors?
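To get a feel for this, the sketch below builds a random incoherent overcomplete tensor (with w_i ≡ 1), perturbs a_1, b_1 by ε, and measures the residual term T_{−1}(â, b̂, I) directly; the dimensions and ε are arbitrary choices of mine.

```python
# The residual T_{-1}(a_hat, b_hat, I) stays well below ||c_1|| = 1 for incoherent
# random components, and it shrinks as d grows.
import numpy as np

rng = np.random.default_rng(0)
d, k, eps = 100, 200, 0.1                       # overcomplete: k > d, k = o(d^1.5)
A, B, C = (rng.standard_normal((d, k)) for _ in range(3))
for M in (A, B, C):
    M /= np.linalg.norm(M, axis=0)              # incoherent unit-norm components
T = np.einsum('ij,kj,lj->ikl', A, B, C)

def perturb(v, eps):
    g = rng.standard_normal(d)
    u = v + eps * g / np.linalg.norm(g)         # ||u - v|| = eps before renormalizing
    return u / np.linalg.norm(u)

a_hat, b_hat = perturb(A[:, 0], eps), perturb(B[:, 0], eps)
update = np.einsum('ijl,i,j->l', T, a_hat, b_hat)            # T(a_hat, b_hat, I)
signal = (A[:, 0] @ a_hat) * (B[:, 0] @ b_hat) * C[:, 0]     # leading w_1 term
print(np.linalg.norm(update - signal))          # norm of T_{-1}(a_hat, b_hat, I)
```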

  73. Results for one step update. Incoherence: |⟨a_i, a_j⟩| = O(1/√d) for i ≠ j; similarly for b, c. Spectral norm: ‖A‖, ‖B‖, ‖C‖ ≤ 1 + O(√(k/d)); ‖T‖ ≤ 1 + o(1). Tensor rank: k = o(d^1.5). Weights: for simplicity, w_i ≡ 1.

  74. Results for one step update. Incoherence: |⟨a_i, a_j⟩| = O(1/√d) for i ≠ j; similarly for b, c. Spectral norm: ‖A‖, ‖B‖, ‖C‖ ≤ 1 + O(√(k/d)); ‖T‖ ≤ 1 + o(1). Tensor rank: k = o(d^1.5). Weights: for simplicity, w_i ≡ 1. Lemma [AGJ 2014]: For small enough ε such that max{‖a_1 − â‖, ‖b_1 − b̂‖} ≤ ε, after one step ‖c_1 − ĉ‖ ≤ O(√k / d) + max{k/d^1.5, 1/√d} · (ε + ε²). Here √k/d is the approximation error; the rest is the error contraction.

  75. Main Result: Local Convergence. Initialization: max{‖a_1 − â^(0)‖, ‖b_1 − b̂^(0)‖} ≤ ε_0, with ε_0 < constant. Noise: T̂ := T + E, with ‖E‖ ≤ 1/polylog(d). Rank: k = o(d^1.5). Recovery error: ε_R := ‖E‖ + Õ(√k / d).

  76. Main Result: Local Convergence. Initialization: max{‖a_1 − â^(0)‖, ‖b_1 − b̂^(0)‖} ≤ ε_0, with ε_0 < constant. Noise: T̂ := T + E, with ‖E‖ ≤ 1/polylog(d). Rank: k = o(d^1.5). Recovery error: ε_R := ‖E‖ + Õ(√k / d). Theorem (Local Convergence) [AGJ 2014]: After N = O(log(1/ε_R)) steps of alternating rank-1 updates, ‖a_1 − â^(N)‖ = O(ε_R).
