SLIDE 1
Learning Overcomplete Latent Variable Models through Tensor Methods
Majid Janzamin (UC Irvine)
Joint work with Anima Anandkumar (UC Irvine) and Rong Ge (Microsoft Research)
SLIDE 2 Latent Variable Modeling
Goal: Discover hidden effects from observed measurements
Document modeling
Observed: words. Hidden: topics.
[Example documents: New York Times article excerpts (Hurricane Sandy's aftermath, Syrian refugees fleeing the war, the tax debate); the words are observed while the underlying topics are hidden.]
Social Network Modeling
Observed: social interactions. Hidden: communities, relationships.
Recommendation Systems
Observed: recommendations (e.g., reviews). Hidden: user and business attributes.
Applications in Speech, Vision, . . .
SLIDE 3
Latent Variable Modeling
Feature Learning
Learn good features/representations for classification tasks, e.g., image and speech recognition.
SLIDE 4 Latent Variable Modeling
Feature Learning
Learn good features/representations for classification tasks, e.g., image and speech recognition.
Sparse Coding, Dictionary Learning
Sparse representations, low dimensional hidden structures. A few dictionary elements make complicated shapes.
(Image from Sanjeev Arora’s slides.)
SLIDE 5
Learning Latent Variable Models
Goal: Discover hidden effects from observed measurements. Unsupervised learning: no labeled samples.
SLIDE 6
Learning Latent Variable Models
Goal: Discover hidden effects from observed measurements. Unsupervised learning: no labeled samples. Semi-supervised learning: few labeled samples.
SLIDE 7
Learning Latent Variable Models
Goal: Discover hidden effects from observed measurements. Unsupervised learning: no labeled samples. Semi-supervised learning: few labeled samples.
Challenge: Conditions for Identifiability
When can model be identified (given infinite computation and data)? Does identifiability also lead to tractable algorithms?
SLIDE 8
Learning Latent Variable Models
Goal: Discover hidden effects from observed measurements. Unsupervised learning: no labeled samples. Semi-supervised learning: few labeled samples.
Challenge: Conditions for Identifiability
When can model be identified (given infinite computation and data)? Does identifiability also lead to tractable algorithms?
Challenge: Efficient Learning of Latent Variable Models
Maximum likelihood is NP-hard in most cases. In practice: EM and Variational Bayes, but these have no consistency guarantees. Scalable guaranteed learning algorithms?
⋆ Low computational and statistical complexity
SLIDE 9
Learning Latent Variable Models
Goal: Discover hidden effects from observed measurements. Unsupervised learning: no labeled samples. Semi-supervised learning: few labeled samples.
Challenge: Conditions for Identifiability
When can model be identified (given infinite computation and data)? Does identifiability also lead to tractable algorithms?
Challenge: Efficient Learning of Latent Variable Models
Maximum likelihood is NP-hard in most cases. In practice: EM and Variational Bayes, but these have no consistency guarantees. Scalable guaranteed learning algorithms?
⋆ Low computational and statistical complexity
This talk: guaranteed and efficient learning through spectral methods.
SLIDE 10
LVMs as Probabilistic Models
Latent (hidden) variable h ∈ Rk, observed variable x ∈ Rd.
SLIDE 11
LVMs as Probabilistic Models
Latent (hidden) variable h ∈ Rk, observed variable x ∈ Rd. Multiview linear mixture models Categorical hidden variable h. Views: conditionally indep. given h. Linear model: E[x1|h] = ah, E[x2|h] = bh, E[x3|h] = ch.
h x1 x2 x3
· · ·
SLIDE 12
LVMs as Probabilistic Models
Latent (hidden) variable h ∈ Rk, observed variable x ∈ Rd. Multiview linear mixture models Categorical hidden variable h. Views: conditionally indep. given h. Linear model: E[x1|h] = ah, E[x2|h] = bh, E[x3|h] = ch.
h x1 x2 x3
· · · Gaussian Mixture Categorical hidden variable h. x|h ∼ N(µh, Σh).
SLIDE 13
LVMs as Probabilistic Models
Latent (hidden) variable h ∈ Rk, observed variable x ∈ Rd. Multiview linear mixture models Categorical hidden variable h. Views: conditionally indep. given h. Linear model: E[x1|h] = ah, E[x2|h] = bh, E[x3|h] = ch.
h x1 x2 x3
· · · Gaussian Mixture Categorical hidden variable h. x|h ∼ N(µh, Σh). ICA, Sparse Coding, HMM, Topic modeling, . . .
SLIDE 14
LVMs as Probabilistic Models
Latent (hidden) variable h ∈ Rk, observed variable x ∈ Rd. Multiview linear mixture models Categorical hidden variable h. Views: conditionally indep. given h. Linear model: E[x1|h] = ah, E[x2|h] = bh, E[x3|h] = ch.
h x1 x2 x3
· · · Gaussian Mixture Categorical hidden variable h. x|h ∼ N(µh, Σh). ICA, Sparse Coding, HMM, Topic modeling, . . . Efficient Learning of the parameters ah, µh, . . . ?
SLIDE 15
Method-of-Moments (Spectral methods)
Multi-variate observed moments
M1 := E[x], M2 := E[x ⊗ x], M3 := E[x ⊗ x ⊗ x].
SLIDE 16
Method-of-Moments (Spectral methods)
Multi-variate observed moments
M1 := E[x], M2 := E[x ⊗ x], M3 := E[x ⊗ x ⊗ x].
Matrix
E[x ⊗ x] ∈ Rd×d is a second order tensor. E[x ⊗ x]i1,i2 = E[xi1xi2]. For matrices: E[x ⊗ x] = E[xx⊤].
SLIDE 17
Method-of-Moments (Spectral methods)
Multi-variate observed moments
M1 := E[x], M2 := E[x ⊗ x], M3 := E[x ⊗ x ⊗ x].
Matrix
E[x ⊗ x] ∈ Rd×d is a second order tensor. E[x ⊗ x]i1,i2 = E[xi1xi2]. For matrices: E[x ⊗ x] = E[xx⊤].
Tensor
E[x ⊗ x ⊗ x] ∈ Rd×d×d is a third order tensor. E[x ⊗ x ⊗ x]i1,i2,i3 = E[xi1xi2xi3].
SLIDE 18
Method-of-Moments (Spectral methods)
Multi-variate observed moments
M1 := E[x], M2 := E[x ⊗ x], M3 := E[x ⊗ x ⊗ x].
Matrix
E[x ⊗ x] ∈ Rd×d is a second order tensor. E[x ⊗ x]i1,i2 = E[xi1xi2]. For matrices: E[x ⊗ x] = E[xx⊤].
Tensor
E[x ⊗ x ⊗ x] ∈ Rd×d×d is a third order tensor. E[x ⊗ x ⊗ x]i1,i2,i3 = E[xi1xi2xi3]. Information in moments for learning LVMs?
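As a concrete illustration (not part of the original slides), a minimal numpy sketch of estimating these moment tensors from samples; the function name and the random data are only illustrative.

```python
import numpy as np

def empirical_moments(X):
    """Estimate M1, M2, M3 from n i.i.d. samples stacked in X of shape (n, d)."""
    n = X.shape[0]
    M1 = X.mean(axis=0)                           # E[x], shape (d,)
    M2 = np.einsum('ni,nj->ij', X, X) / n         # E[x ⊗ x], shape (d, d)
    M3 = np.einsum('ni,nj,nl->ijl', X, X, X) / n  # E[x ⊗ x ⊗ x], shape (d, d, d)
    return M1, M2, M3

# Illustrative usage on random data.
X = np.random.randn(1000, 5)
M1, M2, M3 = empirical_moments(X)
print(M2.shape, M3.shape)  # (5, 5) (5, 5, 5)
```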
SLIDE 19
Multiview Mixture Model
[k] := {1, . . . , k}. Multiview linear mixture models Categorical hidden variable h ∈ [k]. wj := Pr[h = j] Views: conditionally indep. given h. Linear model: E[x1|h] = ah, E[x2|h] = bh, E[x3|h] = ch.
h x1 x2 x3
· · ·
SLIDE 20 Multiview Mixture Model
[k] := {1, . . . , k}. Multiview linear mixture models Categorical hidden variable h ∈ [k]. wj := Pr[h = j] Views: conditionally indep. given h. Linear model: E[x1|h] = ah, E[x2|h] = bh, E[x3|h] = ch.
h x1 x2 x3
· · ·
E[x1 ⊗ x2] = Eh[E[x1 ⊗ x2 | h]] = Eh[ah ⊗ bh] = ∑_{j∈[k]} wj · aj ⊗ bj.   (Here x1 ⊗ x2 = x1 x2⊤.)
SLIDE 21 Multiview Mixture Model
[k] := {1, . . . , k}. Multiview linear mixture models Categorical hidden variable h ∈ [k]. wj := Pr[h = j] Views: conditionally indep. given h. Linear model: E[x1|h] = ah, E[x2|h] = bh, E[x3|h] = ch.
h x1 x2 x3
· · ·
E[x1 ⊗ x2] = ∑_{j∈[k]} wj · aj ⊗ bj,
SLIDE 22 Multiview Mixture Model
[k] := {1, . . . , k}. Multiview linear mixture models Categorical hidden variable h ∈ [k]. wj := Pr[h = j] Views: conditionally indep. given h. Linear model: E[x1|h] = ah, E[x2|h] = bh, E[x3|h] = ch.
h x1 x2 x3
· · ·
E[x1 ⊗ x2] = ∑_{j∈[k]} wj · aj ⊗ bj,  E[x1 ⊗ x2 ⊗ x3] = ∑_{j∈[k]} wj · aj ⊗ bj ⊗ cj.
SLIDE 23 Multiview Mixture Model
[k] := {1, . . . , k}. Multiview linear mixture models Categorical hidden variable h ∈ [k]. wj := Pr[h = j] Views: conditionally indep. given h. Linear model: E[x1|h] = ah, E[x2|h] = bh, E[x3|h] = ch.
h x1 x2 x3
· · ·
E[x1 ⊗ x2] = ∑_{j∈[k]} wj · aj ⊗ bj,  E[x1 ⊗ x2 ⊗ x3] = ∑_{j∈[k]} wj · aj ⊗ bj ⊗ cj.
Tensor (matrix) factorization for learning LVMs.
SLIDE 24
Matrix vs. Tensor Decomposition
Uniqueness of decomposition.
Matrix Decomposition
Distinct weights. Orthogonal components, i.e., ⟨ai, aj⟩ = 0 for i ≠ j. Too limiting. Otherwise, only learning up to a subspace is possible.
SLIDE 25
Matrix vs. Tensor Decomposition
Uniqueness of decomposition.
Matrix Decomposition
Distinct weights. Orthogonal components, i.e., ⟨ai, aj⟩ = 0 for i ≠ j. Too limiting. Otherwise, only learning up to a subspace is possible.
Tensor Decomposition
Same weights. Non-Orthogonal components ⇒ Overcomplete models. More general models.
SLIDE 26
Matrix vs. Tensor Decomposition
Uniqueness of decomposition.
Matrix Decomposition
Distinct weights. Orthogonal components, i.e., ⟨ai, aj⟩ = 0 for i ≠ j. Too limiting. Otherwise, only learning up to a subspace is possible.
Tensor Decomposition
Same weights. Non-Orthogonal components ⇒ Overcomplete models. More general models. Focus on tensor decomposition for learning LVMs.
SLIDE 27
Overcomplete Latent Variable Models
Overcomplete Latent Representations
Latent dimensionality > observed dimensionality, i.e., k > d. Flexible modeling, robust to noise. Applicable in speech and image modeling. Large amount of unlabeled samples.
SLIDE 28
Overcomplete Latent Variable Models
Overcomplete Latent Representations
Latent dimensionality > observed dimensionality, i.e., k > d. Flexible modeling, robust to noise. Applicable in speech and image modeling. Large amount of unlabeled samples. Possible to learn when using higher (e.g., 3rd) order tensor moment.
SLIDE 29 Overcomplete Latent Variable Models
Overcomplete Latent Representations
Latent dimensionality > observed dimensionality, i.e., k > d. Flexible modeling, robust to noise. Applicable in speech and image modeling. Large amount of unlabeled samples. Possible to learn when using higher (e.g., 3rd) order tensor moment.
Example: a tensor T ∈ R2×2×2 with rank 3 (d = 2, k = 3), given by its two frontal slices T(:, :, 1) and T(:, :, 2) and written as a sum of three rank-1 terms with entries in {1, −1}; so the rank can exceed the dimension even in the smallest case.
SLIDE 30
Overcomplete Latent Variable Models
Overcomplete Latent Representations
Latent dimensionality > observed dimensionality, i.e., k > d. Flexible modeling, robust to noise. Applicable in speech and image modeling. Large amount of unlabeled samples. Possible to learn when using higher (e.g., 3rd) order tensor moment.
So far
Learning LVMs. Spectral methods (method-of-moments). Overcomplete LVMs. This work: theoretical guarantees for above.
SLIDE 31
Outline
1. Introduction
2. Summary of Results
3. Recap of Orthogonal Matrix and Tensor Decomposition
4. Overcomplete (Non-Orthogonal) Tensor Decomposition
5. Sample Complexity Analysis
6. Numerical Results
7. Conclusion
SLIDE 32
Spherical Gaussian Mixtures
Assumptions
k components, d: observed dimension. Component means ai incoherent: randomly drawn from the sphere. Spherical variance (σ²/d)·I (assumed known).
SLIDE 33
Spherical Gaussian Mixtures
Assumptions
k components, d: observed dimension. Component means ai incoherent: randomly drawn from the sphere. Spherical variance (σ²/d)·I (assumed known).
In this talk: special case
Noise norm σ² = 1: same as the signal. Uniform probability of components.
SLIDE 34 Spherical Gaussian Mixtures
Assumptions
k components, d: observed dimension. Component means ai incoherent: randomly drawn from the sphere. Spherical variance (σ²/d)·I (assumed known).
In this talk: special case
Noise norm σ² = 1: same as the signal. Uniform probability of components.
Tensor For Learning (Hsu, Kakade 2012)
M3 := E[x⊗3] − σ² ∑_{i∈[d]} (E[x] ⊗ ei ⊗ ei + · · · ) ⇒ M3 = ∑_{j∈[k]} wj · aj ⊗ aj ⊗ aj.
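A minimal numpy sketch of this adjusted moment, assuming samples are stacked in an (n, d) array and the per-coordinate noise variance is passed as sigma2; the function name is illustrative and this is not the authors' implementation.

```python
import numpy as np

def gaussian_mixture_M3(X, sigma2):
    """Adjusted third moment:
    M3 = E[x^{⊗3}] - sigma2 * sum_i (mean⊗e_i⊗e_i + e_i⊗mean⊗e_i + e_i⊗e_i⊗mean)."""
    n, d = X.shape
    mean = X.mean(axis=0)
    M3 = np.einsum('ni,nj,nl->ijl', X, X, X) / n
    eye = np.eye(d)
    M3 -= sigma2 * (np.einsum('i,jl->ijl', mean, eye)
                    + np.einsum('j,il->ijl', mean, eye)
                    + np.einsum('l,ij->ijl', mean, eye))
    return M3  # under the model, approximately sum_j w_j a_j ⊗ a_j ⊗ a_j
```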
SLIDE 35 Semi-supervised Learning of Gaussian Mixtures
n unlabeled samples, mj: labeled samples for component j.
- No. of mixture components: k = o(d^1.5)
- No. of labeled samples: mj = ˜Ω(1).
- No. of unlabeled samples: n = ˜Ω(k).
Our result: with n unlabeled samples, the achieved error max_j ‖âj − aj‖ is a term decaying in n plus an approximation error of ˜O(√k/d).
Can handle (polynomially) overcomplete mixtures. Extremely small number of labeled samples: polylog(d). Sample complexity is tight: ˜Ω(k) samples are needed! Approximation error: decaying in high dimensions.
SLIDE 36 Unsupervised Learning of Gaussian Mixtures
- No. of mixture components: k = C · d
- No. of unlabeled samples: n = ˜Ω(k · d). Computational complexity: ˜O(…).
Our result: with n unlabeled samples, the achieved error max_j ‖âj − aj‖ is the same as in the semi-supervised setting: a term decaying in n plus an approximation error of ˜O(√k/d).
Error: same as before, for the semi-supervised setting. Sample complexity: worse than semi-supervised, but better than previous works (no dependence on the condition number of A). Computational complexity: polynomial when k = Θ(d).
SLIDE 37
Multi-view Mixture Models
h x1 x2 x3
· · · A = [a1 a2 · · · ak] ∈ Rd×k, similarly B and C. Linear model: x1 = Ah + z1, x2 = Bh + z2, x3 = Ch + z3.
SLIDE 38
Multi-view Mixture Models
h x1 x2 x3
· · · A = [a1 a2 · · · ak] ∈ Rd×k, similarly B and C. Linear model: x1 = Ah + z1, x2 = Bh + z2, x3 = Ch + z3. Incoherence: Component means ai’s are incoherent (randomly drawn from unit sphere). Similarly bi’s and ci’s. The zero-mean noise zl’s satisfy RIP, e.g., Gaussian, Bernoulli. Same results as Gaussian mixtures.
SLIDE 39
Independent Component Analysis
x = Ah, independent sources, unknown mixing. Blind source separation of speech, image, video. h1 h2 hk x1 x2 xd A
SLIDE 40
Independent Component Analysis
x = Ah, independent sources, unknown mixing. Blind source separation of speech, image, video. Sources h are sub-Gaussian (but not Gaussian). Columns of A are incoherent. Form cumulant tensor M4 := E[x⊗4] − · · · n samples. k sources. d dimensions. h1 h2 hk x1 x2 xd A
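A minimal numpy sketch of one standard way to form a fourth-order cumulant tensor, assuming zero-mean observations; the exact correction terms are elided on the slide, so this follows the textbook formula for zero-mean cumulants and is not necessarily the construction used in the talk.

```python
import numpy as np

def fourth_cumulant(X):
    """Fourth-order cumulant tensor of zero-mean data X (shape (n, d)):
    M4[i,j,k,l] = E[x_i x_j x_k x_l] - E[x_i x_j]E[x_k x_l]
                  - E[x_i x_k]E[x_j x_l] - E[x_i x_l]E[x_j x_k]."""
    n = X.shape[0]
    m4 = np.einsum('ni,nj,nk,nl->ijkl', X, X, X, X) / n
    m2 = np.einsum('ni,nj->ij', X, X) / n
    return (m4
            - np.einsum('ij,kl->ijkl', m2, m2)
            - np.einsum('ik,jl->ijkl', m2, m2)
            - np.einsum('il,jk->ijkl', m2, m2))
```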
SLIDE 41 Independent Component Analysis
x = Ah, independent sources, unknown mixing. Blind source separation of speech, image, video. Sources h are sub-Gaussian (but not Gaussian). Columns of A are incoherent. Form cumulant tensor M4 := E[x⊗4] − · · · n samples. k sources. d dimensions. h1 h2 hk x1 x2 xd A
Learning Result
Semi-supervised: k = o(d²), n ≥ ˜Ω(max(k², k⁴/d³)). Unsupervised: k = O(d), n ≥ ˜Ω(k³).
max_j min_{f∈{−1,1}} ‖f·âj − aj‖ = ˜O(k²/√(d³·n)) + ˜O(√k/d^1.5)
SLIDE 42
Sparse Coding
Sparse representations, low dimensional hidden structures. A few dictionary elements make complicated shapes.
SLIDE 43
Sparse Coding
x = Ah, sparse coefficients, unknown dictionary. Image compression, feature learning, ...
SLIDE 44
Sparse Coding
x = Ah, sparse coefficients, unknown dictionary. Image compression, feature learning, ... Coefficients h are independent Bernoulli Gaussian: Sparse ICA. Columns of A are incoherent. Form cumulant tensor M4 := E[x⊗4] − · · · n samples. k dictionary elements. d dimensions. s avg. sparsity.
SLIDE 45 Sparse Coding
x = Ah, sparse coefficients, unknown dictionary. Image compression, feature learning, ... Coefficients h are independent Bernoulli Gaussian: Sparse ICA. Columns of A are incoherent. Form cumulant tensor M4 := E[x⊗4] − · · · n samples. k dictionary elements. d dimensions. s avg. sparsity.
Learning Result
Semi-supervised: k = o(d²), n ≥ ˜Ω(max(sk, s²k²/d³)). Unsupervised: k = O(d), n ≥ ˜Ω(sk²).
max_j min_{f∈{−1,1}} ‖f·âj − aj‖ = ˜O(sk/√(d³·n)) + ˜O(√k/d^1.5)
SLIDE 46
Outline
1. Introduction
2. Summary of Results
3. Recap of Orthogonal Matrix and Tensor Decomposition
4. Overcomplete (Non-Orthogonal) Tensor Decomposition
5. Sample Complexity Analysis
6. Numerical Results
7. Conclusion
SLIDE 47
Recap of Orthogonal Matrix Eigen Analysis
Symmetric M ∈ Rd×d. Eigenvectors are fixed points: Mv = λv. Eigen decomposition: M = ∑_i λi vi vi⊤. Orthogonal: ⟨vi, vj⟩ = 0, i ≠ j.
SLIDE 48 Recap of Orthogonal Matrix Eigen Analysis
Symmetric M ∈ Rd×d. Eigenvectors are fixed points: Mv = λv. Eigen decomposition: M = ∑_i λi vi vi⊤. Orthogonal: ⟨vi, vj⟩ = 0, i ≠ j.
Uniqueness (Identifiability):
SLIDE 49 Recap of Orthogonal Matrix Eigen Analysis
Symmetric M ∈ Rd×d. Eigenvectors are fixed points: Mv = λv. Eigen decomposition: M = ∑_i λi vi vi⊤. Orthogonal: ⟨vi, vj⟩ = 0, i ≠ j.
Uniqueness (Identifiability):
Algorithm: Power method: v → Mv/‖Mv‖.
SLIDE 50 Recap of Orthogonal Matrix Eigen Analysis
Symmetric M ∈ Rd×d. Eigenvectors are fixed points: Mv = λv. Eigen decomposition: M = ∑_i λi vi vi⊤. Orthogonal: ⟨vi, vj⟩ = 0, i ≠ j.
Uniqueness (Identifiability):
Algorithm: Power method: v → Mv/‖Mv‖.
Convergence properties
Let λ1 > λ2 > · · · > λd. Only vi’s are fixed points of power iteration. Mvi = λivi.
SLIDE 51 Recap of Orthogonal Matrix Eigen Analysis
Symmetric M ∈ Rd×d. Eigenvectors are fixed points: Mv = λv. Eigen decomposition: M = ∑_i λi vi vi⊤. Orthogonal: ⟨vi, vj⟩ = 0, i ≠ j.
Uniqueness (Identifiability):
Algorithm: Power method: v → Mv/‖Mv‖.
Convergence properties
Let λ1 > λ2 > · · · > λd. Only vi’s are fixed points of power iteration. Mvi = λivi.
- v1 is the only robust fixed point.
SLIDE 52 Recap of Orthogonal Matrix Eigen Analysis
Symmetric M ∈ Rd×d. Eigenvectors are fixed points: Mv = λv. Eigen decomposition: M = ∑_i λi vi vi⊤. Orthogonal: ⟨vi, vj⟩ = 0, i ≠ j.
Uniqueness (Identifiability):
Algorithm: Power method: v → Mv/‖Mv‖.
Convergence properties
Let λ1 > λ2 > · · · > λd. Only vi’s are fixed points of power iteration. Mvi = λivi.
- v1 is the only robust fixed point.
- All other vi’s are saddle points.
SLIDE 53 Recap of Orthogonal Matrix Eigen Analysis
Symmetric M ∈ Rd×d. Eigenvectors are fixed points: Mv = λv. Eigen decomposition: M = ∑_i λi vi vi⊤. Orthogonal: ⟨vi, vj⟩ = 0, i ≠ j.
Uniqueness (Identifiability):
Algorithm: Power method: v → Mv/‖Mv‖.
Convergence properties
Let λ1 > λ2 > · · · > λd. Only vi’s are fixed points of power iteration. Mvi = λivi.
- v1 is the only robust fixed point.
- All other vi’s are saddle points.
Power method recovers v1 when the initialization v satisfies ⟨v, v1⟩ ≠ 0.
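For reference, a minimal numpy sketch of the matrix power method described above; the function name and the random initialization are illustrative.

```python
import numpy as np

def matrix_power_method(M, num_iters=100, seed=0):
    """Power iteration v -> Mv / ||Mv||; converges to the top eigenvector
    when lambda_1 > lambda_2 and the start satisfies <v, v_1> != 0."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(M.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = M @ v
        v /= np.linalg.norm(v)
    return v
```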
SLIDE 54
Tensor Rank and Tensor Decomposition
Rank-1 tensor: T = w · a ⊗ b ⊗ c ⇔ T(i, j, l) = w · a(i) · b(j) · c(l).
SLIDE 55 Tensor Rank and Tensor Decomposition
Rank-1 tensor: T = w · a ⊗ b ⊗ c ⇔ T(i, j, l) = w · a(i) · b(j) · c(l).
CANDECOMP/PARAFAC (CP) Decomposition
T = ∑_{j∈[k]} wj · aj ⊗ bj ⊗ cj ∈ Rd×d×d, aj, bj, cj ∈ Sd−1.
Tensor T = w1 · a1 ⊗ b1 ⊗ c1 + w2 · a2 ⊗ b2 ⊗ c2 + · · ·
SLIDE 56 Tensor Rank and Tensor Decomposition
Rank-1 tensor: T = w · a ⊗ b ⊗ c ⇔ T(i, j, l) = w · a(i) · b(j) · c(l).
CANDECOMP/PARAFAC (CP) Decomposition
T = ∑_{j∈[k]} wj · aj ⊗ bj ⊗ cj ∈ Rd×d×d, aj, bj, cj ∈ Sd−1.
Tensor T = w1 · a1 ⊗ b1 ⊗ c1 + w2 · a2 ⊗ b2 ⊗ c2 + · · ·
k: tensor rank, d: ambient dimension. k ≤ d: undercomplete and k > d: overcomplete.
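A minimal numpy sketch, assuming random unit-norm components, of how such a CP tensor can be assembled; the names and dimensions are illustrative.

```python
import numpy as np

def cp_tensor(w, A, B, C):
    """T = sum_j w_j * a_j ⊗ b_j ⊗ c_j for the columns a_j, b_j, c_j of A, B, C."""
    return np.einsum('j,ij,lj,mj->ilm', w, A, B, C)

# Illustrative overcomplete example: rank k > dimension d.
d, k = 10, 15
A, B, C = (np.random.randn(d, k) for _ in range(3))
A, B, C = (M / np.linalg.norm(M, axis=0) for M in (A, B, C))  # unit-norm columns
T = cp_tensor(np.ones(k), A, B, C)  # shape (d, d, d)
```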
SLIDE 57 Tensor Rank and Tensor Decomposition
Rank-1 tensor: T = w · a ⊗ b ⊗ c ⇔ T(i, j, l) = w · a(i) · b(j) · c(l).
CANDECOMP/PARAFAC (CP) Decomposition
T = ∑_{j∈[k]} wj · aj ⊗ bj ⊗ cj ∈ Rd×d×d, aj, bj, cj ∈ Sd−1.
Tensor T = w1 · a1 ⊗ b1 ⊗ c1 + w2 · a2 ⊗ b2 ⊗ c2 + · · ·
k: tensor rank, d: ambient dimension. k ≤ d: undercomplete and k > d: overcomplete. This talk: guarantees for overcomplete tensor decomposition
SLIDE 58 Background on Tensor Decomposition
T = ∑_{i∈[k]} wi · ai ⊗ bi ⊗ ci, ai, bi, ci ∈ Sd−1.
Theoretical Guarantees
Tensor decompositions in psychometrics (Cattell ‘44). CP tensor decomposition (Harshman ‘70, Carroll & Chang ‘70). Identifiability of CP tensor decomposition (Kruskal ‘76). Orthogonal decomposition: (Zhang & Golub ‘01, Kolda ‘01, Anandkumar et al. ‘12). Tensor decomposition through (lifted) linear equations (De Lathauwer ‘07): works for overcomplete tensors. Tensor decomposition through simultaneous diagonalization: perturbation analysis (Goyal et al. ‘13, Bhaskara ‘13).
SLIDE 59 Background on Tensor Decompositions (contd.)
T = ∑_{i∈[k]} wi · ai ⊗ bi ⊗ ci, ai, bi, ci ∈ Sd−1.
Practice: Alternating least squares (ALS)
Let A = [a1|a2 · · · ak] and similarly B, C. Fix estimates of two of the modes (say for A and B) and re-estimate the third. Iterative updates, low computational complexity. No theoretical guarantees. In this talk: analysis of alternating minimization
SLIDE 60
Tensors as Multilinear Transformations
Tensor T ∈ Rd×d×d. Vectors v, w ∈ Rd.
SLIDE 61 Tensors as Multilinear Transformations
Tensor T ∈ Rd×d×d. Vectors v, w ∈ Rd. T(I, v, w) := ∑_{j,l} vj wl T(:, j, l) ∈ Rd.
SLIDE 62 Tensors as Multilinear Transformations
Tensor T ∈ Rd×d×d. Vectors v, w ∈ Rd. T(I, v, w) := ∑_{j,l} vj wl T(:, j, l) ∈ Rd. For a matrix M ∈ Rd×d: M(I, w) = Mw = ∑_l wl M(:, l) ∈ Rd.
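A minimal numpy sketch of this multilinear contraction; the function name is illustrative.

```python
import numpy as np

def multilinear(T, v, w):
    """T(I, v, w) = sum_{j,l} v_j w_l T(:, j, l), a vector in R^d."""
    return np.einsum('ijl,j,l->i', T, v, w)

d = 4
T = np.random.randn(d, d, d)
v, w = np.random.randn(d), np.random.randn(d)
print(multilinear(T, v, w).shape)  # (4,)
```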
SLIDE 63 Challenges in Tensor Decomposition
Symmetric tensor T ∈ Rd×d×d: T = ∑_i λi vi ⊗ vi ⊗ vi.
Challenges in tensors
Decomposition may not always exist for general tensors. Finding the decomposition is NP-hard in general.
SLIDE 64 Challenges in Tensor Decomposition
Symmetric tensor T ∈ Rd×d×d: T = ∑_i λi vi ⊗ vi ⊗ vi.
Challenges in tensors
Decomposition may not always exist for general tensors. Finding the decomposition is NP-hard in general.
Tractable case: orthogonal tensor decomposition (⟨vi, vj⟩ = 0, i ≠ j)
{vi} are eigenvectors: T(I, vi, vi) = λivi.
SLIDE 65 Challenges in Tensor Decomposition
Symmetric tensor T ∈ Rd×d×d: T = ∑_i λi vi ⊗ vi ⊗ vi.
Challenges in tensors
Decomposition may not always exist for general tensors. Finding the decomposition is NP-hard in general.
Tractable case: orthogonal tensor decomposition (⟨vi, vj⟩ = 0, i ≠ j)
{vi} are eigenvectors: T(I, vi, vi) = λi vi. Bad news: there can be other eigenvectors (unlike the matrix case). With λi ≡ 1, v = (v1 + v2)/√2 satisfies T(I, v, v) = (1/√2)·v.
SLIDE 66 Challenges in Tensor Decomposition
Symmetric tensor T ∈ Rd×d×d: T = ∑_i λi vi ⊗ vi ⊗ vi.
Challenges in tensors
Decomposition may not always exist for general tensors. Finding the decomposition is NP-hard in general.
Tractable case: orthogonal tensor decomposition (⟨vi, vj⟩ = 0, i ≠ j)
{vi} are eigenvectors: T(I, vi, vi) = λi vi. Bad news: there can be other eigenvectors (unlike the matrix case). With λi ≡ 1, v = (v1 + v2)/√2 satisfies T(I, v, v) = (1/√2)·v. How do we avoid spurious solutions (not part of the decomposition)?
SLIDE 67 Orthogonal Tensor Power Method
Symmetric orthogonal tensor T ∈ Rd×d×d: T = ∑_i λi vi ⊗ vi ⊗ vi.
SLIDE 68 Orthogonal Tensor Power Method
Symmetric orthogonal tensor T ∈ Rd×d×d: T = ∑_i λi vi ⊗ vi ⊗ vi. Recall matrix power method: v → M(I, v)/‖M(I, v)‖.
SLIDE 69 Orthogonal Tensor Power Method
Symmetric orthogonal tensor T ∈ Rd×d×d: T = ∑_i λi vi ⊗ vi ⊗ vi. Recall matrix power method: v → M(I, v)/‖M(I, v)‖. Algorithm: tensor power method: v → T(I, v, v)/‖T(I, v, v)‖.
SLIDE 70 Orthogonal Tensor Power Method
Symmetric orthogonal tensor T ∈ Rd×d×d: T = ∑_i λi vi ⊗ vi ⊗ vi. Recall matrix power method: v → M(I, v)/‖M(I, v)‖. Algorithm: tensor power method: v → T(I, v, v)/‖T(I, v, v)‖.
- {vi}’s are the only robust fixed points.
SLIDE 71 Orthogonal Tensor Power Method
Symmetric orthogonal tensor T ∈ Rd×d×d: T = ∑_i λi vi ⊗ vi ⊗ vi. Recall matrix power method: v → M(I, v)/‖M(I, v)‖. Algorithm: tensor power method: v → T(I, v, v)/‖T(I, v, v)‖.
- {vi}’s are the only robust fixed points.
- All other eigenvectors are saddle points.
SLIDE 72 Orthogonal Tensor Power Method
Symmetric orthogonal tensor T ∈ Rd×d×d: T = ∑_i λi vi ⊗ vi ⊗ vi. Recall matrix power method: v → M(I, v)/‖M(I, v)‖. Algorithm: tensor power method: v → T(I, v, v)/‖T(I, v, v)‖.
- {vi}’s are the only robust fixed points.
- All other eigenvectors are saddle points.
For an orthogonal tensor, no spurious local optima!
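A minimal numpy sketch of a single run of the tensor power iteration above; in practice one restarts from several random initializations and deflates recovered components, which this sketch omits. The function name is illustrative.

```python
import numpy as np

def tensor_power_iteration(T, num_iters=30, seed=0):
    """One run of v -> T(I, v, v) / ||T(I, v, v)|| from a random start."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = np.einsum('ijl,j,l->i', T, v, v)    # T(I, v, v)
        v /= np.linalg.norm(v)
    lam = np.einsum('ijl,i,j,l->', T, v, v, v)  # eigenvalue estimate T(v, v, v)
    return lam, v
```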
SLIDE 73
Matrix vs. tensor power iteration
Matrix power iteration: Tensor power iteration:
SLIDE 74 Matrix vs. tensor power iteration
Matrix power iteration:
1. Requires gap between largest and second-largest eigenvalue. Property of the matrix only.
Tensor power iteration:
1. Requires gap between largest and second-largest λi|ci|, where the initialization vector is v = ∑_i ci vi. Property of the tensor and the initialization v.
SLIDE 75 Matrix vs. tensor power iteration
Matrix power iteration:
1. Requires gap between largest and second-largest eigenvalue. Property of the matrix only.
2. Converges to the top eigenvector.
Tensor power iteration:
1. Requires gap between largest and second-largest λi|ci|, where the initialization vector is v = ∑_i ci vi. Property of the tensor and the initialization v.
2. Converges to the vi for which λi|ci| is largest; could be any of them.
SLIDE 76 Matrix vs. tensor power iteration
Matrix power iteration:
1. Requires gap between largest and second-largest eigenvalue. Property of the matrix only.
2. Converges to the top eigenvector.
3. Linear convergence. Need O(log(1/ǫ)) iterations.
Tensor power iteration:
1. Requires gap between largest and second-largest λi|ci|, where the initialization vector is v = ∑_i ci vi. Property of the tensor and the initialization v.
2. Converges to the vi for which λi|ci| is largest; could be any of them.
3. Quadratic convergence. Need O(log log(1/ǫ)) iterations.
SLIDE 77
Beyond Orthogonal Tensor Decomposition
Limitations
Not ALL tensors have orthogonal decomposition (unlike matrices). Orthogonal forms: cannot handle overcomplete tensors (k > d). Overcomplete representations: redundancy leads to flexible modeling, noise resistant, no domain knowledge.
SLIDE 78
Beyond Orthogonal Tensor Decomposition
Limitations
Not ALL tensors have orthogonal decomposition (unlike matrices). Orthogonal forms: cannot handle overcomplete tensors (k > d). Overcomplete representations: redundancy leads to flexible modeling, noise resistant, no domain knowledge.
Undercomplete tensors (k ≤ d) with full rank components
Non-orthogonal decomposition T1 = ∑_i wi · ai ⊗ ai ⊗ ai.
Whitening matrix W; multilinear transform: T2 = T1(W, W, W). Limitations: depends on the condition number, sensitive to noise.
[Figure: W maps the components a1, a2, a3 of tensor T1 to orthogonal components v1, v2, v3 of tensor T2.]
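A minimal numpy sketch of the whitening step, assuming the second moment M2 = ∑_i wi ai ai⊤ has rank k ≤ d; the function name is illustrative.

```python
import numpy as np

def whiten(T1, M2, k):
    """With M2 = sum_i w_i a_i a_i^T of rank k, take W = U_k diag(s_k)^{-1/2},
    so W^T M2 W = I_k, and return the orthogonalized tensor T2 = T1(W, W, W)."""
    U, s, _ = np.linalg.svd(M2)
    W = U[:, :k] / np.sqrt(s[:k])                     # d x k whitening matrix
    T2 = np.einsum('abc,ai,bj,cl->ijl', T1, W, W, W)  # k x k x k tensor
    return T2, W
```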
SLIDE 79
Beyond Orthogonal Tensor Decomposition
Limitations
Not ALL tensors have orthogonal decomposition (unlike matrices). Orthogonal forms: cannot handle overcomplete tensors (k > d). Overcomplete representations: redundancy leads to flexible modeling, noise resistant, no domain knowledge.
Undercomplete tensors (k ≤ d) with full rank components
Non-orthogonal decomposition T1 = ∑_i wi · ai ⊗ ai ⊗ ai.
Whitening matrix W; multilinear transform: T2 = T1(W, W, W). Limitations: depends on the condition number, sensitive to noise.
[Figure: W maps the components a1, a2, a3 of tensor T1 to orthogonal components v1, v2, v3 of tensor T2.]
This talk: guarantees for overcomplete tensor decomposition
SLIDE 80
Outline
1. Introduction
2. Summary of Results
3. Recap of Orthogonal Matrix and Tensor Decomposition
4. Overcomplete (Non-Orthogonal) Tensor Decomposition
5. Sample Complexity Analysis
6. Numerical Results
7. Conclusion
SLIDE 81
Non-orthogonal Tensor Decomposition
Multiview linear mixture model. Linear model: E[x1|h] = ah, E[x2|h] = bh, E[x3|h] = ch. E[x1 ⊗ x2 ⊗ x3] = ∑_{i∈[k]} wi · ai ⊗ bi ⊗ ci.
h x1 x2 x3
· · ·
SLIDE 82 Non-orthogonal Tensor Decomposition
T = ∑_{i∈[k]} wi · ai ⊗ bi ⊗ ci, ai, bi, ci ∈ Sd−1.
Practice: Alternating least squares (ALS)
Many spurious local optima. No theoretical guarantee.
SLIDE 83 Non-orthogonal Tensor Decomposition
T = ∑_{i∈[k]} wi · ai ⊗ bi ⊗ ci, ai, bi, ci ∈ Sd−1.
Practice: Alternating least squares (ALS)
Many spurious local optima. No theoretical guarantee.
Rank-1 ALS (Best Rank-1 Approximation)
min_{a,b,c∈Sd−1, w∈R} ‖T − w · a ⊗ b ⊗ c‖_F.
SLIDE 84 Non-orthogonal Tensor Decomposition
T = ∑_{i∈[k]} wi · ai ⊗ bi ⊗ ci, ai, bi, ci ∈ Sd−1.
Practice: Alternating least squares (ALS)
Many spurious local optima. No theoretical guarantee.
Rank-1 ALS (Best Rank-1 Approximation)
min_{a,b,c∈Sd−1, w∈R} ‖T − w · a ⊗ b ⊗ c‖_F.
Fix a(t), b(t) and update c(t+1) ⇒ c(t+1) ∝ T(a(t), b(t), I). Rank-1 ALS iteration ≡ asymmetric power iteration.
SLIDE 85
Alternating minimization
Rank-1 ALS iteration (power iteration)
Initialization: a(0), b(0), c(0). Update in tth step: fix a(t), b(t) and c(t+1) ∝ T(a(t), b(t), I). After (approx.) convergence, restart.
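A minimal numpy sketch of one run of these rank-1 alternating updates; cycling through all three modes per iteration is one common choice, and the names are illustrative rather than the authors' implementation.

```python
import numpy as np

def rank1_als(T, num_iters=50, seed=0):
    """One run of rank-1 alternating updates (asymmetric power iteration):
    each mode is re-estimated from T contracted against the other two."""
    rng = np.random.default_rng(seed)
    a, b, c = (rng.standard_normal(T.shape[i]) for i in range(3))
    a, b, c = (v / np.linalg.norm(v) for v in (a, b, c))
    for _ in range(num_iters):
        c = np.einsum('ijl,i,j->l', T, a, b); c /= np.linalg.norm(c)  # c ∝ T(a, b, I)
        b = np.einsum('ijl,i,l->j', T, a, c); b /= np.linalg.norm(b)  # b ∝ T(a, I, c)
        a = np.einsum('ijl,j,l->i', T, b, c); a /= np.linalg.norm(a)  # a ∝ T(I, b, c)
    w = np.einsum('ijl,i,j,l->', T, a, b, c)                          # recovered weight
    return w, a, b, c
```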
SLIDE 86
Alternating minimization
Rank-1 ALS iteration (power iteration)
Initialization: a(0), b(0), c(0). Update in tth step: fix a(t), b(t) and c(t+1) ∝ T(a(t), b(t), I). After (approx.) convergence, restart. Simple update: trivially parallelizable and hence scalable. Linear computation in dimension, rank, number of different runs.
SLIDE 87
Alternating minimization
Rank-1 ALS iteration (power iteration)
Initialization: a(0), b(0), c(0). Update in tth step: fix a(t), b(t) and c(t+1) ∝ T(a(t), b(t), I). After (approx.) convergence, restart. Simple update: trivially parallelizable and hence scalable. Linear computation in dimension, rank, number of different runs.
Challenges
Optimization problem: non-convex, multiple local optima. Alternating minimization: improves the objective in each step? Recovery of ai, bi, ci’s? Not true in general. Noisy tensor decomposition.
SLIDE 88
Alternating minimization
Rank-1 ALS iteration (power iteration)
Initialization: a(0), b(0), c(0). Update in tth step: fix a(t), b(t) and c(t+1) ∝ T(a(t), b(t), I). After (approx.) convergence, restart. Simple update: trivially parallelizable and hence scalable. Linear computation in dimension, rank, number of different runs.
Challenges
Optimization problem: non-convex, multiple local optima. Alternating minimization: improves the objective in each step? Recovery of ai, bi, ci’s? Not true in general. Noisy tensor decomposition. Natural conditions under which Alt-Min has guarantees?
SLIDE 89 Special case: Orthogonal Setting
T = ∑_{i∈[k]} wi · ai ⊗ bi ⊗ ci, ai, bi, ci ∈ Sd−1. ⟨ai, aj⟩ = 0 for i ≠ j. Similarly for b, c. Alternating updates: c(t+1) ∝ T(a(t), b(t), I) = ∑_i wi ⟨ai, a(t)⟩⟨bi, b(t)⟩ ci. ai, bi, ci are stationary points.
SLIDE 90 Special case: Orthogonal Setting
T = ∑_{i∈[k]} wi · ai ⊗ bi ⊗ ci, ai, bi, ci ∈ Sd−1. ⟨ai, aj⟩ = 0 for i ≠ j. Similarly for b, c. Alternating updates: c(t+1) ∝ T(a(t), b(t), I) = ∑_i wi ⟨ai, a(t)⟩⟨bi, b(t)⟩ ci. ai, bi, ci are stationary points. ONLY local optima for the best rank-1 approximation problem. Guaranteed recovery through alternating minimization.
SLIDE 91 Special case: Orthogonal Setting
T = ∑_{i∈[k]} wi · ai ⊗ bi ⊗ ci, ai, bi, ci ∈ Sd−1. ⟨ai, aj⟩ = 0 for i ≠ j. Similarly for b, c. Alternating updates: c(t+1) ∝ T(a(t), b(t), I) = ∑_i wi ⟨ai, a(t)⟩⟨bi, b(t)⟩ ci. ai, bi, ci are stationary points. ONLY local optima for the best rank-1 approximation problem. Guaranteed recovery through alternating minimization. Perturbation Analysis [AGH+2012]: under poly(d) random initializations and bounded noise conditions.
SLIDE 92
Our Setup
So far
General tensor decomposition: NP-hard. Orthogonal tensors: too limiting. Tractable cases? Covers overcomplete tensors?
“Guaranteed Non-Orthogonal Tensor Decomposition via Alternating Rank-1 Updates” by A. Anandkumar, R. Ge and M. Janzamin, Feb. 2014.
SLIDE 93 Our Setup
So far
General tensor decomposition: NP-hard. Orthogonal tensors: too limiting. Tractable cases? Covers overcomplete tensors?
Our framework: Incoherent Components
|⟨ai, aj⟩| = O(1/√d) for i ≠ j. Similarly for b, c.
Can handle overcomplete tensors. Satisfied by random (generic) vectors. Guaranteed recovery for alternating minimization?
“Guaranteed Non-Orthogonal Tensor Decomposition via Alternating Rank-1 Updates” by A. Anandkumar, R. Ge and M. Janzamin, Feb. 2014.
SLIDE 94 Analysis of One Step Update
T = ∑_{i∈[k]} wi · ai ⊗ bi ⊗ ci, ai, bi, ci ∈ Sd−1.
Basic Intuition
Let â, b̂ be “close to” a1, b1. Alternating update: ĉ ∝ T(â, b̂, I) = ∑_i wi ⟨ai, â⟩⟨bi, b̂⟩ ci = w1 ⟨a1, â⟩⟨b1, b̂⟩ c1 + T−1(â, b̂, I).
SLIDE 95 Analysis of One Step Update
T = ∑_{i∈[k]} wi · ai ⊗ bi ⊗ ci, ai, bi, ci ∈ Sd−1.
Basic Intuition
Let â, b̂ be “close to” a1, b1. Alternating update: ĉ ∝ T(â, b̂, I) = ∑_i wi ⟨ai, â⟩⟨bi, b̂⟩ ci = w1 ⟨a1, â⟩⟨b1, b̂⟩ c1 + T−1(â, b̂, I). T−1(â, b̂, I) = 0 in the orthogonal case when â = a1, b̂ = b1.
SLIDE 96 Analysis of One Step Update
T = ∑_{i∈[k]} wi · ai ⊗ bi ⊗ ci, ai, bi, ci ∈ Sd−1.
Basic Intuition
Let â, b̂ be “close to” a1, b1. Alternating update: ĉ ∝ T(â, b̂, I) = ∑_i wi ⟨ai, â⟩⟨bi, b̂⟩ ci = w1 ⟨a1, â⟩⟨b1, b̂⟩ c1 + T−1(â, b̂, I). T−1(â, b̂, I) = 0 in the orthogonal case when â = a1, b̂ = b1. Can it be controlled for incoherent (random) vectors?
SLIDE 97 Results for one step update
Incoherence: |⟨ai, aj⟩| = O(1/√d) for i ≠ j. Similarly for b, c.
Spectral norm: ‖A‖, ‖B‖, ‖C‖ ≤ 1 + O(√(k/d)).
Tensor rank: k = o(d^1.5). Weights: for simplicity, wi ≡ 1.
SLIDE 98 Results for one step update
Incoherence: |⟨ai, aj⟩| = O(1/√d) for i ≠ j. Similarly for b, c.
Spectral norm: ‖A‖, ‖B‖, ‖C‖ ≤ 1 + O(√(k/d)).
Tensor rank: k = o(d^1.5). Weights: for simplicity, wi ≡ 1.
Lemma [AGJ2014]
For small enough ǫ such that max{‖a1 − â‖, ‖b1 − b̂‖} ≤ ǫ, after one step
‖c1 − ĉ‖ ≤ O(√k/d) + max{1/√d, k/d^1.5} · ǫ.
√k/d: approximation error; the rest: error contraction.
SLIDE 99
Main Result: Local Convergence
Initialization: max{‖a1 − â(0)‖, ‖b1 − b̂(0)‖} ≤ ǫ0, and ǫ0 < constant. Noise: T̂ := T + E, and ‖E‖ ≤ 1/polylog(d). Rank: k = o(d^1.5). Recovery error: ǫR := ‖E‖ + ˜O(√k/d).
SLIDE 100 Main Result: Local Convergence
Initialization: max{‖a1 − â(0)‖, ‖b1 − b̂(0)‖} ≤ ǫ0, and ǫ0 < constant. Noise: T̂ := T + E, and ‖E‖ ≤ 1/polylog(d). Rank: k = o(d^1.5). Recovery error: ǫR := ‖E‖ + ˜O(√k/d).
- Theorem (Local Convergence) [AGJ2014]
After N = O(log(1/ǫR)) steps of alternating rank-1 updates, ‖a1 − â(N)‖ = O(ǫR).
SLIDE 101 Main Result: Local Convergence
Initialization: max{‖a1 − â(0)‖, ‖b1 − b̂(0)‖} ≤ ǫ0, and ǫ0 < constant. Noise: T̂ := T + E, and ‖E‖ ≤ 1/polylog(d). Rank: k = o(d^1.5). Recovery error: ǫR := ‖E‖ + ˜O(√k/d).
- Theorem (Local Convergence) [AGJ2014]
After N = O(log(1/ǫR)) steps of alternating rank-1 updates, ‖a1 − â(N)‖ = O(ǫR). Linear convergence: up to the approximation error. Guarantees for overcomplete tensors: k = o(d^1.5), and k = o(d^{p/2}) for pth-order tensors. Requires good initialization. What about global convergence?
SLIDE 102
Global Convergence k = O(d)
SVD Initialization
Find the top singular vectors of T(I, I, θ) for θ ∼ N(0, I). Use them for initialization. L trials.
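A minimal numpy sketch of this SVD initialization; each trial returns one candidate pair of initialization vectors, and the names are illustrative.

```python
import numpy as np

def svd_init(T, num_trials, seed=0):
    """For each trial draw theta ~ N(0, I), form the matrix T(I, I, theta),
    and return its top left/right singular vectors as a candidate (a0, b0)."""
    rng = np.random.default_rng(seed)
    inits = []
    for _ in range(num_trials):
        theta = rng.standard_normal(T.shape[2])
        M = np.einsum('ijl,l->ij', T, theta)  # T(I, I, theta)
        U, _, Vt = np.linalg.svd(M)
        inits.append((U[:, 0], Vt[0]))
    return inits
```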
SLIDE 103 Global Convergence k = O(d)
SVD Initialization
Find the top singular vectors of T(I, I, θ) for θ ∼ N(0, I). Use them for initialization. L trials.
Assumptions
Number of initializations: L ≥ k^{Ω((k/d)²)}. Tensor rank: k = O(d).
- No. of iterations: N = Θ(log(1/ǫR)). Recall ǫR: recovery error.
SLIDE 104 Global Convergence k = O(d)
SVD Initialization
Find the top singular vectors of T(I, I, θ) for θ ∼ N(0, I). Use them for initialization. L trials.
Assumptions
Number of initializations: L ≥ k^{Ω((k/d)²)}. Tensor rank: k = O(d).
- No. of iterations: N = Θ(log(1/ǫR)). Recall ǫR: recovery error.
Theorem (Global Convergence) [AGJ2014]: ‖a1 − â(N)‖ ≤ O(ǫR).
SLIDE 105 Global Convergence k = O(d)
SVD Initialization
Find the top singular vectors of T(I, I, θ) for θ ∼ N(0, I). Use them for initialization. L trials.
Assumptions
Number of initializations: L ≥ k^{Ω((k/d)²)}. Tensor rank: k = O(d).
- No. of iterations: N = Θ(log(1/ǫR)). Recall ǫR: recovery error.
Theorem (Global Convergence) [AGJ2014]: ‖a1 − â(N)‖ ≤ O(ǫR). Corollary: Differing Dimensions
If ai, bi ∈ Rdu and ci ∈ Rdo with du ≥ k ≥ do: k = O(√(du·do)) for incoherent vectors; k = O(du) if A, B are orthogonal. Same guarantees. Can handle one overcomplete mode.
SLIDE 106 Latest Result: Global Convergence
Assume Gaussian means ai. Improved initialization requirement for convergence of third-order tensor power iteration: |⟨a1, â(0)⟩| ≥ d^β · √k/d, for β > (log d)^{−c}.
Spherical Gaussian Mixture or Multiview Mixture Model
Initialize with samples whose noise norm is bounded by √d · σ, for σ small enough as a function of k.
“Analyzing Tensor Power Method Dynamics: Applications to Learning Overcomplete Latent Variable Models” by A. Anandkumar, R. Ge and M. Janzamin, Nov. 2014.
SLIDE 107
Outline
1. Introduction
2. Summary of Results
3. Recap of Orthogonal Matrix and Tensor Decomposition
4. Overcomplete (Non-Orthogonal) Tensor Decomposition
5. Sample Complexity Analysis
6. Numerical Results
7. Conclusion
SLIDE 108 High-level Intuition for Sample Bounds
Multi-view Model: x1 = Ah + z1, where z1 is noise. Exact moment: T = ∑_i wi · ai ⊗ bi ⊗ ci.
Sample moment: T̂ = (1/n) ∑_{i∈[n]} x1^i ⊗ x2^i ⊗ x3^i.
Naïve idea: ‖T̂ − T‖ ≤ ‖mat(T̂) − mat(T)‖, apply matrix Bernstein's inequality.
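A minimal numpy sketch of the sample moment T̂, assuming the three views are stacked row-wise in arrays X1, X2, X3 of shape (n, d); the function name is illustrative.

```python
import numpy as np

def empirical_cross_moment(X1, X2, X3):
    """hat{T} = (1/n) * sum_i x1_i ⊗ x2_i ⊗ x3_i for views stacked in (n, d) arrays."""
    n = X1.shape[0]
    return np.einsum('ni,nj,nl->ijl', X1, X2, X3) / n
```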
SLIDE 109 High-level Intuition for Sample Bounds
Multi-view Model: x1 = Ah + z1, where z1 is noise. Exact moment: T = ∑_i wi · ai ⊗ bi ⊗ ci.
Sample moment: T̂ = (1/n) ∑_{i∈[n]} x1^i ⊗ x2^i ⊗ x3^i.
Naïve idea: ‖T̂ − T‖ ≤ ‖mat(T̂) − mat(T)‖, apply matrix Bernstein's inequality.
Our idea: careful ǫ-net covering for ‖T̂ − T‖. T̂ − T has many terms, e.g., the all-noise term (1/n) ∑_{i∈[n]} z1^i ⊗ z2^i ⊗ z3^i and signal-noise terms. Need to bound (1/n) ∑_{i∈[n]} ⟨z1^i, u⟩⟨z2^i, v⟩⟨z3^i, w⟩ for all u, v, w ∈ Sd−1.
Classify inner products into buckets and bound them separately. Tight sample bounds for a range of latent variable models.
“Provable Learning of Overcomplete Latent Variable Models: Semi-supervised and Unsupervised Settings” by A. Anandkumar, R. Ge and M. Janzamin, Aug. 2014.
SLIDE 110
Outline
1. Introduction
2. Summary of Results
3. Recap of Orthogonal Matrix and Tensor Decomposition
4. Overcomplete (Non-Orthogonal) Tensor Decomposition
5. Sample Complexity Analysis
6. Numerical Results
7. Conclusion
SLIDE 111 Synthetic experiments
Learning a multiview Gaussian mixture. Random mixture components. d = 100, k ∈ {10, 20, 50, 100, 200, 500}, n = 1000. Random initialization.
[Plot: recovery rate of the algorithm, showing the ratio of recovered components vs. the number of initializations (log scales), for d = 100 and k ∈ {10, 20, 50, 100, 200, 500}.]
SLIDE 112
Outline
1. Introduction
2. Summary of Results
3. Recap of Orthogonal Matrix and Tensor Decomposition
4. Overcomplete (Non-Orthogonal) Tensor Decomposition
5. Sample Complexity Analysis
6. Numerical Results
7. Conclusion
SLIDE 113
Conclusion
Learning overcomplete latent variable models.
⋆ Method-of-moments. ⋆ Tensor power iteration.
Robustness to noise. Sample complexity bounds for a range of LVMs.
⋆ Unsupervised setting. ⋆ Semi-supervised setting.
SLIDE 114
Conclusion
Learning overcomplete latent variable models.
⋆ Method-of-moments. ⋆ Tensor power iteration.
Robustness to noise. Sample complexity bounds for a range of LVMs.
⋆ Unsupervised setting. ⋆ Semi-supervised setting.
Coming: removing the approximation error √k/d.
SLIDE 115
Conclusion
Learning overcomplete latent variable models.
⋆ Method-of-moments. ⋆ Tensor power iteration.
Robustness to noise. Sample complexity bounds for a range of LVMs.
⋆ Unsupervised setting. ⋆ Semi-supervised setting.
Coming: removing the approximation error √k/d.
Thank you!