SLIDE 1
Learning Overcomplete Latent Variable Models through Tensor Methods
Anima Anandkumar (UC Irvine)
Joint work with Majid Janzamin (UC Irvine) and Rong Ge (Microsoft Research)
SLIDES 2–6
Latent Variable Probabilistic Models
Latent (hidden) variable h ∈ R^k, observed variable x ∈ R^d.
Multiview linear mixture models: categorical hidden variable h; views x_1, x_2, x_3 conditionally independent given h; linear model E[x_1|h] = a_h, E[x_2|h] = b_h, E[x_3|h] = c_h.
[Figure: graphical model with h pointing to views x_1, x_2, x_3, · · ·]
Gaussian mixture: categorical hidden variable h; x|h ∼ N(µ_h, Σ_h).
Further examples: ICA, sparse coding, HMMs, topic modeling, . . .
Efficient learning of the parameters a_h, µ_h, . . . ?
SLIDES 7–10
Method-of-Moments (Spectral Methods)
Multivariate observed moments:
M_1 := E[x], M_2 := E[x ⊗ x], M_3 := E[x ⊗ x ⊗ x].
Matrix: E[x ⊗ x] ∈ R^{d×d} is a second-order tensor with entries E[x ⊗ x]_{i_1,i_2} = E[x_{i_1} x_{i_2}]; in matrix notation, E[x ⊗ x] = E[xx⊤].
Tensor: E[x ⊗ x ⊗ x] ∈ R^{d×d×d} is a third-order tensor with entries E[x ⊗ x ⊗ x]_{i_1,i_2,i_3} = E[x_{i_1} x_{i_2} x_{i_3}].
How much information do these moments carry for learning LVMs?
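As a concrete illustration (mine, not from the slides), the empirical counterparts of these moments can be formed directly from an n × d sample matrix; the function name and the einsum formulation are choices of this sketch:

```python
import numpy as np

def empirical_moments(X):
    """Empirical moments M1, M2, M3 from samples X of shape (n, d).

    A minimal sketch: note that materializing the third moment densely
    costs O(d^3) memory, so this is only practical for small d.
    """
    n, _ = X.shape
    M1 = X.mean(axis=0)                           # E[x], shape (d,)
    M2 = np.einsum('ni,nj->ij', X, X) / n         # E[x ⊗ x], shape (d, d)
    M3 = np.einsum('ni,nj,nl->ijl', X, X, X) / n  # E[x ⊗ x ⊗ x], shape (d, d, d)
    return M1, M2, M3
```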
SLIDES 11–15
Multiview Mixture Model
[k] := {1, . . . , k}. Categorical hidden variable h ∈ [k] with mixing weights w_j := Pr[h = j]; views x_1, x_2, x_3 conditionally independent given h; linear model E[x_1|h] = a_h, E[x_2|h] = b_h, E[x_3|h] = c_h.
[Figure: graphical model with h pointing to views x_1, x_2, x_3, · · ·]
E[x_1 ⊗ x_2] = E_h[E_x[x_1 ⊗ x_2 | h]] = E_h[a_h ⊗ b_h] = ∑_{j∈[k]} w_j a_j ⊗ b_j,
E[x_1 ⊗ x_2 ⊗ x_3] = ∑_{j∈[k]} w_j a_j ⊗ b_j ⊗ c_j.
Learning the LVM thus reduces to tensor (matrix) factorization.
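To illustrate this reduction (my own snippet, with made-up parameter names), the population third moment can be assembled from mixing weights w and factor matrices A, B, C whose columns are the component means:

```python
import numpy as np

def multiview_third_moment(w, A, B, C):
    """Population moment E[x1 ⊗ x2 ⊗ x3] = sum_j w_j a_j ⊗ b_j ⊗ c_j.

    w: mixing weights of shape (k,); A, B, C: factor matrices of shape
    (d, k) whose j-th columns are a_j, b_j, c_j. A sketch for illustration.
    """
    return np.einsum('j,ij,kj,lj->ikl', w, A, B, C)
```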
SLIDES 16–19
Tensor Rank and Tensor Decomposition
Rank-1 tensor: T = w · a ⊗ b ⊗ c ⇔ T(i, j, l) = w · a(i) · b(j) · c(l) (checked numerically below).
CANDECOMP/PARAFAC (CP) decomposition:
T = ∑_{j∈[k]} w_j a_j ⊗ b_j ⊗ c_j ∈ R^{d×d×d}, with a_j, b_j, c_j ∈ S^{d−1}.
[Figure: tensor T = w_1 · a_1 ⊗ b_1 ⊗ c_1 + w_2 · a_2 ⊗ b_2 ⊗ c_2 + · · ·]
k: tensor rank; d: ambient dimension. k ≤ d: undercomplete, k > d: overcomplete.
This talk: guarantees for overcomplete tensor decomposition.
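A quick numerical check of the rank-1 entry formula (a toy snippet of mine, not from the slides):

```python
import numpy as np

# T = w * a ⊗ b ⊗ c should satisfy T[i, j, l] = w * a[i] * b[j] * c[l].
rng = np.random.default_rng(0)
d = 5
a, b, c = (v / np.linalg.norm(v) for v in rng.standard_normal((3, d)))
w = 2.0
T = w * np.einsum('i,j,l->ijl', a, b, c)
assert np.isclose(T[1, 2, 3], w * a[1] * b[2] * c[3])
```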
SLIDES 20–23
Challenges in Tensor Decomposition
Symmetric tensor T ∈ R^{d×d×d}: T = ∑_{i∈[k]} λ_i v_i ⊗ v_i ⊗ v_i.
Challenges in tensors:
- A decomposition may not exist for a general tensor.
- Finding the decomposition is NP-hard in general.
Tractable case: orthogonal tensor decomposition (⟨v_i, v_j⟩ = 0 for i ≠ j).
Algorithm: tensor power method, v ← T(I, v, v) / ‖T(I, v, v)‖.
- The {v_i} are the only robust fixed points.
- All other eigenvectors are saddle points.
For an orthogonal tensor, there are no spurious local optima!
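A minimal sketch of the tensor power method (my own code; the slides only give the update rule). Here T(I, v, v) denotes contracting the last two modes with v:

```python
import numpy as np

def tensor_power_method(T, n_iter=50, seed=0):
    """Power iteration v <- T(I, v, v) / ||T(I, v, v)|| from a random start.

    For an orthogonally decomposable symmetric T, the iterates converge to
    one of the robust fixed points v_i; restarts (or deflation) recover the
    remaining components.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = np.einsum('ijl,j,l->i', T, v, v)  # T(I, v, v)
        v /= np.linalg.norm(v)
    lam = np.einsum('ijl,i,j,l->', T, v, v, v)  # eigenvalue T(v, v, v)
    return lam, v
```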
SLIDES 24–26
Beyond Orthogonal Tensor Decomposition
Limitation: not ALL tensors have an orthogonal decomposition (unlike matrices).
Undercomplete tensors (k ≤ d) with full-column-rank components:
- Non-orthogonal decomposition T_1 = ∑_i w_i a_i ⊗ a_i ⊗ a_i.
- Whitening matrix W; the multilinear transform T_2 = T_1(W, W, W) orthogonalizes the components.
[Figure: W maps tensor T_1 with components a_1, a_2, a_3 to tensor T_2 with orthogonal components v_1, v_2, v_3]
Whitening requires k ≤ d, so it cannot reach the overcomplete regime.
This talk: guarantees for overcomplete tensor decomposition.
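A sketch of the whitening step (my own code, assuming access to the exact second moment M2 = ∑_i w_i a_i a_i⊤ of rank k):

```python
import numpy as np

def whiten(T1, M2, k):
    """Orthogonalize an undercomplete tensor via the multilinear transform
    T2 = T1(W, W, W), where W satisfies W^T M2 W = I_k.

    Assumes M2 has rank k <= d; a sketch with no handling of noise.
    """
    vals, vecs = np.linalg.eigh(M2)
    vals, vecs = vals[-k:], vecs[:, -k:]  # top-k eigenpairs of M2
    W = vecs / np.sqrt(vals)              # d x k whitening matrix
    return np.einsum('abc,ai,bj,cl->ijl', T1, W, W, W)
```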
SLIDE 27
Outline
1. Introduction
2. Overcomplete tensor decomposition
3. Sample Complexity Analysis
4. Conclusion
SLIDES 28–29
Our Setup
So far: general tensor decomposition is NP-hard, and orthogonal tensors are too limiting. Are there tractable cases that cover overcomplete tensors?
Our framework: incoherent components
|⟨a_i, a_j⟩| = O(1/√d) for i ≠ j, and similarly for the b's and c's.
- Can handle overcomplete tensors.
- Satisfied by random vectors (see the demo below).
- Does alternating minimization recover such components with guarantees?
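A quick demo (mine) that random unit vectors are incoherent: the maximum pairwise inner product concentrates at the 1/√d scale, up to logarithmic factors:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 100, 300                 # overcomplete: k > d
A = rng.standard_normal((d, k))
A /= np.linalg.norm(A, axis=0)  # unit-norm random columns
G = A.T @ A
np.fill_diagonal(G, 0.0)        # zero out the diagonal ⟨a_i, a_i⟩ = 1
print(np.abs(G).max())          # ~ sqrt(log k) / sqrt(d)
print(1 / np.sqrt(d))           # the 1/sqrt(d) reference scale
```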
SLIDES 30–32
Alternating Minimization
min_{a,b,c ∈ S^{d−1}, w ∈ R} ‖T − w · a ⊗ b ⊗ c‖_F.
Rank-1 ALS iteration (power iteration):
- Initialization: a^(0), b^(0), c^(0).
- Update in step t: fix a^(t), b^(t) and set c^(t+1) ∝ T(a^(t), b^(t), I); cycle through the three modes. After (approximate) convergence, restart.
- Simple update: trivially parallelizable and hence scalable. Computation is linear in the dimension, the rank, and the number of runs.
Rank-1 ALS iteration ≡ asymmetric power iteration (a sketch follows).
SLIDES 33–35
Main Result: Local Convergence
Assumptions:
- Initialization: max{‖a_1 − â^(0)‖, ‖b_1 − b̂^(0)‖} ≤ ε_0, with ε_0 below a constant.
- Noise: T̂ := T + E, with ‖E‖ ≤ 1/polylog(d).
- Rank: k = o(d^{1.5}).
Theorem (Local Convergence) [AGJ 2014]: after N = O(log(1/‖E‖)) steps of alternating rank-1 updates, ‖a_1 − â^(N)‖ = O(‖E‖).
- Linear convergence, up to the approximation error.
- Guarantees for overcomplete tensors: k = o(d^{1.5}), and k = o(d^{p/2}) for pth-order tensors.
- Requires a good initialization. What about global convergence?
SLIDES 36–38
Global Convergence: k = O(d)
SVD initialization: compute the top singular vectors of T(I, I, θ) for θ ∼ N(0, I) and use them as initializers; repeat for L trials (a sketch follows below).
Assumptions:
- Number of initializations: L ≥ k^{Ω((k/d)²)}.
- Tensor rank: k = O(d).
- Number of iterations: N = Θ(log(1/ε_R)), where ε_R is the recovery error.
Theorem (Global Convergence) [AGJ 2014]: ‖a_1 − â^(N)‖ ≤ O(ε_R).
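A sketch of the SVD initialization (my own code): contract the third mode against a Gaussian vector and take the top singular vector pair of the resulting matrix:

```python
import numpy as np

def svd_initializations(T, L, seed=0):
    """L candidate initializations (a0, b0) from random slices T(I, I, θ).

    Each trial draws θ ~ N(0, I), forms the d x d matrix T(I, I, θ), and
    returns its top left and right singular vectors.
    """
    rng = np.random.default_rng(seed)
    inits = []
    for _ in range(L):
        theta = rng.standard_normal(T.shape[2])
        M = np.einsum('ijl,l->ij', T, theta)  # T(I, I, θ)
        U, _, Vt = np.linalg.svd(M)
        inits.append((U[:, 0], Vt[0]))        # top singular vector pair
    return inits
```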
SLIDE 39
Outline
1. Introduction
2. Overcomplete tensor decomposition
3. Sample Complexity Analysis
4. Conclusion
SLIDES 40–41
High-level Intuition for Sample Bounds
Multiview model: x_1 = Ah + z_1, where z_1 is noise.
Exact moment: T = ∑_i w_i a_i ⊗ b_i ⊗ c_i. Sample moment: T̂ = (1/n) ∑_i x_1^i ⊗ x_2^i ⊗ x_3^i.
Naive idea: ‖T̂ − T‖ ≤ ‖mat(T̂) − mat(T)‖, then apply the matrix Bernstein inequality (mat(·) is sketched below).
Our idea: a careful ε-net covering argument for ‖T̂ − T‖.
- T̂ − T has many terms, e.g., the all-noise term (1/n) ∑_i z_1^i ⊗ z_2^i ⊗ z_3^i and signal-noise cross terms.
- Need to bound (1/n) ∑_i ⟨z_1^i, u⟩⟨z_2^i, v⟩⟨z_3^i, w⟩ uniformly over all u, v, w ∈ S^{d−1}.
- Classify the inner products into buckets and bound each bucket separately.
Tight sample bounds for a range of latent variable models.
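For concreteness (my own snippet), mat(·) here can be read as a mode-1 matricization; its spectral norm upper-bounds the tensor spectral norm but is loose in general, which is why the ε-net argument gives tighter bounds:

```python
import numpy as np

def matricize(T):
    """Mode-1 matricization: reshape a d x d x d tensor into a d x d^2 matrix."""
    d = T.shape[0]
    return T.reshape(d, d * d)

# The matrix spectral norm of mat(T) upper-bounds the tensor spectral norm
# sup over unit u, v, w of T(u, v, w); matrix Bernstein controls the former.
rng = np.random.default_rng(0)
T = rng.standard_normal((10, 10, 10))
print(np.linalg.norm(matricize(T), 2))
```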
SLIDE 42
Unsupervised Learning of Gaussian Mixtures
- No. of mixture components: k = C · d.
- No. of unlabeled samples: n = Ω̃(k · d).
- Computational complexity: Õ(k^{C²}).
Our result: with n unlabeled samples, max_j ‖â_j − a_j‖ = Õ(√(k/n)), with linear convergence.
- Error: the same rate as in the semi-supervised setting (next slide).
- Computational complexity: polynomial when k = Θ(d).
SLIDE 43
Semi-supervised Learning of Gaussian Mixtures
Setup: n unlabeled samples; m_j labeled samples for component j.
- No. of mixture components: k = o(d^{1.5}).
- No. of labeled samples: m_j = Ω̃(1).
- No. of unlabeled samples: n = Ω̃(k).
Our result: with n unlabeled samples, max_j ‖â_j − a_j‖ = Õ(√(k/n)), with linear convergence.
- Can handle (polynomially) overcomplete mixtures.
- Extremely small number of labeled samples: polylog(d).
- The sample complexity is tight: Ω̃(k) samples are needed!
SLIDE 44
Outline
1. Introduction
2. Overcomplete tensor decomposition
3. Sample Complexity Analysis
4. Conclusion
SLIDES 45–46
Conclusion
Learning overcomplete latent variable models:
⋆ Method-of-moments. ⋆ Tensor power iteration.
Robustness to noise. Sample complexity bounds for a range of LVMs:
⋆ Unsupervised setting. ⋆ Semi-supervised setting.
Latest result: improved initialization for tensors with Gaussian components.