SLIDE 1
Guaranteed Learning of Latent Variable Models through Spectral and Tensor Methods
Anima Anandkumar, U.C. Irvine
SLIDE 2
Guaranteed Unsupervised Learning
Unsupervised Learning: no labeled samples available for training.
SLIDE 3
Guaranteed Unsupervised Learning
Unsupervised Learning: no labeled samples available for training.
Challenge: Conditions for Identifiability
When can the model be identified (given infinite computation and data)? Does identifiability also lead to tractable algorithms?
SLIDE 4
Guaranteed Unsupervised Learning
Unsupervised Learning: no labeled samples available for training.
Challenge: Conditions for Identifiability
When can the model be identified (given infinite computation and data)? Does identifiability also lead to tractable algorithms?
Challenge: Efficient Learning of Latent Variable Models
Maximum likelihood is NP-hard. In practice, EM and Variational Bayes have no consistency guarantees. Efficient computational and sample complexities?
SLIDE 5
Guaranteed Unsupervised Learning
Unsupervised Learning: no labeled samples available for training.
Challenge: Conditions for Identifiability
When can the model be identified (given infinite computation and data)? Does identifiability also lead to tractable algorithms?
Challenge: Efficient Learning of Latent Variable Models
Maximum likelihood is NP-hard. In practice, EM and Variational Bayes have no consistency guarantees. Efficient computational and sample complexities? In this series: guaranteed and efficient learning through spectral methods.
SLIDE 6
Probabilistic Models
Latent Variable Models
Concise statistical description through graphical modeling: conditional independence relationships or a hierarchy of variables.
[Figure: observed variable x and hidden variable h.]
SLIDE 7
Probabilistic Models
Latent Variable Models
Concise statistical description through graphical modeling: conditional independence relationships or a hierarchy of variables.
[Figure: observed variables x1, . . . , x5 and hidden variable h.]
SLIDE 8
Probabilistic Models
Latent Variable Models
Concise statistical description through graphical modeling: conditional independence relationships or a hierarchy of variables.
[Figure: observed variables x1, . . . , x5 and hidden variables h1, h2, h3.]
SLIDE 9
Probabilistic Models
Latent Variable Models
Concise statistical description through graphical modeling: conditional independence relationships or a hierarchy of variables.
[Figure: observed variables x1, . . . , x5 and hidden variables h1, h2, h3.]
Maximum Likelihood vs. Moment method
Finding MLE is NP-hard in general. Expectation maximization (EM) converges to a local optimum.
SLIDE 10
Probabilistic Models
Latent Variable Models
Concise statistical description through graphical modeling: conditional independence relationships or a hierarchy of variables.
[Figure: observed variables x1, . . . , x5 and hidden variables h1, h2, h3.]
Maximum Likelihood vs. Moment method
Finding MLE is NP-hard in general. Expectation maximization (EM) converges to a local optimum. Moment estimate: polynomial computational & sample complexity. Le Cam theory: Newton-Raphson on the moment estimate leads to an efficient estimator asymptotically. Scalable implementation: linear and multilinear algebraic operations.
SLIDE 11
Game Plan: In this talk
Recall Yesterday’s Talk
Gaussian mixtures and (single) topic models. Analysis of third order moments. Tensor decomposition method: whitening and power method.
SLIDE 12
Game Plan: In this talk
Recall Yesterday’s Talk
Gaussian mixtures and (single) topic models. Analysis of third order moments. Tensor decomposition method: whitening and power method.
Today’s talk
Moments for various latent variable models. Analysis of tensor power method.
SLIDE 13
Game Plan: In this talk
Recall Yesterday’s Talk
Gaussian mixtures and (single) topic models. Analysis of third order moments. Tensor decomposition method: whitening and power method.
Today’s talk
Moments for various latent variable models. Analysis of tensor power method.
Tomorrow’s talk
Implementation of tensor method.
SLIDE 14
Outline
1. Introduction
2. Latent Variable Models and Moments
3. Community Detection in Graphs
4. Analysis of Tensor Power Method
5. Advanced Topics
6. Conclusion
SLIDE 15
Recap: Gaussian Mixtures and (single) Topic Models
(Spherical) mixture of Gaussians: k means a1, . . . , ak. Component h = i with probability wi.
Observe x with spherical noise: x = ai + z, z ∼ N(0, σi² I).
(Single) topic models: k topics a1, . . . , ak. Topic h = i with probability wi.
Observe l (exchangeable) words x1, x2, . . . , xl drawn i.i.d. from ai.
Unified linear model: E[x|h] = Ah. Gaussian mixture: single view, spherical noise. Topic model: multi-view, heteroskedastic noise.
M3 = Σ_i wi ai ⊗ ai ⊗ ai, M2 = Σ_i wi ai ⊗ ai.
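To make the moment structure concrete, here is a minimal NumPy sketch (my own illustration, with made-up dimensions and weights) that builds M2 and M3 directly from the definitions above for synthetic components a1, . . . , ak.

import numpy as np

rng = np.random.default_rng(0)
d, k = 10, 3                       # ambient dimension, number of components (illustrative)
A = rng.normal(size=(d, k))        # columns a_1, ..., a_k (component means / topic vectors)
w = np.full(k, 1.0 / k)            # mixing weights w_i

# Population moments from the slide: M2 = sum_i w_i a_i (x) a_i,  M3 = sum_i w_i a_i (x) a_i (x) a_i
M2 = sum(w[i] * np.outer(A[:, i], A[:, i]) for i in range(k))
M3 = sum(w[i] * np.einsum('a,b,c->abc', A[:, i], A[:, i], A[:, i]) for i in range(k))
print(M2.shape, M3.shape)          # (10, 10) (10, 10, 10)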
SLIDE 16
Recap: Geometric Picture for Topic Models
Topic models are exchangeable multiview models. M2 = E[x1 ⊗ x2]. M3 = E[x1 ⊗ x2 ⊗ x3]. Topic proportions vector (h)
Document
SLIDE 17
Recap: Geometric Picture for Topic Models
Topic models are exchangeable multiview models. M2 = E[x1 ⊗ x2]. M3 = E[x1 ⊗ x2 ⊗ x3]. Single topic (h)
SLIDE 18
Recap: Geometric Picture for Topic Models
Topic models are exchangeable multiview models. M2 = E[x1 ⊗ x2]. M3 = E[x1 ⊗ x2 ⊗ x3]. Topic proportions vector (h)
SLIDE 19
Recap: Geometric Picture for Topic Models
Topic models are exchangeable multiview models. M2 = E[x1 ⊗ x2]. M3 = E[x1 ⊗ x2 ⊗ x3]. Topic proportions vector (h). [Figure: word generation x1, x2, x3 from h through the topic-word matrix A.]
SLIDE 20
Latent Dirichlet Allocation
l words in a document: x1, . . . , xl. Word xi generated from topic yi. Exchangeability: x1 ⊥⊥ x2 ⊥⊥ . . . | h. A(i, j) := P[xm = i | ym = j]: topic-word matrix.
[Figure: words x1, . . . , x5 drawn from topics y1, . . . , y5 through A, with topic mixture h.]
If there are k topics, h is distributed over the simplex ∆^{k−1} := {h ∈ R^k : h_i ∈ [0, 1], Σ_i h_i = 1}.
Latent Dirichlet Allocation: h is drawn from a Dirichlet distribution.
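A sketch of the word-generation process just described, with illustrative sizes and a made-up Dirichlet parameter; h ∼ Dir(α), topics y_m ∼ h, and words x_m ∼ A[:, y_m].

import numpy as np

rng = np.random.default_rng(0)
d, k, l = 1000, 5, 3                     # vocabulary size, topics, words per document (illustrative)
alpha = np.full(k, 0.1)                  # Dirichlet parameter, so alpha_0 = 0.5
A = rng.dirichlet(np.ones(d), size=k).T  # topic-word matrix: column j = P[word | topic j]

def sample_document():
    h = rng.dirichlet(alpha)             # topic proportions for this document
    y = rng.choice(k, size=l, p=h)       # topic y_m for each word
    return [rng.choice(d, p=A[:, ym]) for ym in y]   # word x_m ~ A[:, y_m]

print(sample_document())                 # e.g. three word ids from the vocabulary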
SLIDE 21
Dirichlet Distribution
P[h] ∝ ∏_{j=1}^{k} h(j)^{αj−1}, with Σ_{j=1}^{k} h(j) = 1.
SLIDE 22
Dirichlet Distribution
P[h] ∝ ∏_{j=1}^{k} h(j)^{αj−1}, with Σ_{j=1}^{k} h(j) = 1.
[Plot: Dirichlet density Dir(α).]
SLIDE 23
Dirichlet Distribution
P[h] ∝ ∏_{j=1}^{k} h(j)^{αj−1}, with Σ_{j=1}^{k} h(j) = 1.
[Plot: Dirichlet density Dir(α).]
Dirichlet concentration parameter α0 := Σ_j αj.
Sparsity level in h is O(α0).
SLIDE 24
Dirichlet Distribution
P[h] ∝ ∏_{j=1}^{k} h(j)^{αj−1}, with Σ_{j=1}^{k} h(j) = 1.
[Plot: Dirichlet density for αj → 0.]
Dirichlet concentration parameter α0 := Σ_j αj.
Sparsity level in h is O(α0).
SLIDE 25
Dirichlet Distribution
P[h] ∝ ∏_{j=1}^{k} h(j)^{αj−1}, with Σ_{j=1}^{k} h(j) = 1.
[Plot: Dirichlet density for αj < 1.]
Dirichlet concentration parameter α0 := Σ_j αj.
Sparsity level in h is O(α0).
SLIDE 26
Dirichlet Distribution
P[h] ∝ ∏_{j=1}^{k} h(j)^{αj−1}, with Σ_{j=1}^{k} h(j) = 1.
[Plot: Dirichlet density for large αj.]
Dirichlet concentration parameter α0 := Σ_j αj.
Sparsity level in h is O(α0).
SLIDE 27
Dirichlet Distribution
P[h] ∝ ∏_{j=1}^{k} h(j)^{αj−1}, with Σ_{j=1}^{k} h(j) = 1.
[Plot: Dirichlet density for αj → ∞.]
Dirichlet concentration parameter α0 := Σ_j αj.
Sparsity level in h is O(α0).
SLIDE 28
Dirichlet Distribution
P[h] ∝ ∏_{j=1}^{k} h(j)^{αj−1}, with Σ_{j=1}^{k} h(j) = 1.
[Plot: Dirichlet density Dir(α).]
Dirichlet concentration parameter α0 := Σ_j αj.
Sparsity level in h is O(α0).
SLIDE 29
Moments under LDA
M2 := E[x1 ⊗ x2] − (α0/(α0 + 1)) E[x1] ⊗ E[x1].
M3 := E[x1 ⊗ x2 ⊗ x3] − (α0/(α0 + 2)) E[x1 ⊗ x2 ⊗ E[x1]] − (similar correction terms) . . .
Then M2 = Σ_i w̃i ai ⊗ ai and M3 = Σ_i w̃i ai ⊗ ai ⊗ ai.
Three words per document suffice for learning LDA.
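A sketch (illustrative, not the authors' code) of how the adjusted second moment M2 above could be estimated from documents, with words encoded as one-hot vectors so that empirical averages stand in for the expectations.

import numpy as np

def lda_m2(docs, d, alpha0):
    # Empirical M2 := E[x1 (x) x2] - alpha0/(alpha0+1) E[x1] (x) E[x1],
    # with each word represented as a standard basis (one-hot) vector in R^d.
    E_x1 = np.zeros(d)
    E_x1x2 = np.zeros((d, d))
    for doc in docs:                     # doc = list of word ids with at least two words
        x1, x2 = doc[0], doc[1]
        E_x1[x1] += 1.0
        E_x1x2[x1, x2] += 1.0
    n = len(docs)
    return E_x1x2 / n - alpha0 / (alpha0 + 1.0) * np.outer(E_x1 / n, E_x1 / n)

# toy usage with made-up documents over a vocabulary of size 4
docs = [[0, 1, 2], [0, 0, 3], [2, 1, 1]]
print(lda_m2(docs, d=4, alpha0=0.5))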
SLIDE 30
General Multiview Mixtures (Naive Bayes)
E[xi|h] = Aih and multiple views.
SLIDE 31
General Multiview Mixtures (Naive Bayes)
E[xi|h] = Ai h, with multiple views.
x̃1 := E[x3 ⊗ x2] E[x1 ⊗ x2]† x1, x̃2 := E[x3 ⊗ x1] E[x2 ⊗ x1]† x2.
M2 := E[x̃1 ⊗ x̃2], M3 := E[x̃1 ⊗ x̃2 ⊗ x3].
SLIDE 32
General Multiview Mixtures (Naive Bayes)
E[xi|h] = Ai h, with multiple views.
x̃1 := E[x3 ⊗ x2] E[x1 ⊗ x2]† x1, x̃2 := E[x3 ⊗ x1] E[x2 ⊗ x1]† x2.
M2 := E[x̃1 ⊗ x̃2], M3 := E[x̃1 ⊗ x̃2 ⊗ x3].
Then M2 = Σ_i wi a3,i ⊗ a3,i and M3 = Σ_i wi a3,i ⊗ a3,i ⊗ a3,i.
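A sketch of the symmetrization above: estimate the cross-moment operators from samples, apply them to the first two views, and leave the third view untouched. The latent structure and noise level in the toy usage are made up.

import numpy as np

def symmetrize_views(X1, X2, X3):
    # X_i: (n, d) samples of view i. Returns the modified views from the slide:
    # x~1 = E[x3 (x) x2] E[x1 (x) x2]^+ x1,   x~2 = E[x3 (x) x1] E[x2 (x) x1]^+ x2.
    n = X1.shape[0]
    P32, P12 = X3.T @ X2 / n, X1.T @ X2 / n      # empirical cross moments E[xi (x) xj]
    P31, P21 = X3.T @ X1 / n, X2.T @ X1 / n
    X1t = X1 @ (P32 @ np.linalg.pinv(P12)).T
    X2t = X2 @ (P31 @ np.linalg.pinv(P21)).T
    return X1t, X2t

# toy usage: three noisy views x_i = A_i h + z_i of a shared latent vector h
rng = np.random.default_rng(0)
n, d, k = 5000, 6, 3
A = [rng.normal(size=(d, k)) for _ in range(3)]
H = rng.normal(size=(n, k))
X1, X2, X3 = (H @ Ai.T + 0.1 * rng.normal(size=(n, d)) for Ai in A)
X1t, X2t = symmetrize_views(X1, X2, X3)
print(X1t.shape, X2t.shape)                      # (5000, 6) (5000, 6)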
SLIDE 33
Hidden Markov Models
P[h_{t+1} = i | h_t = j] = T_{i,j}. E[x_t | h_t = j] = O e_j. π: initial distribution of h1. [Figure: HMM with hidden states h1, h2, h3, observations x1, x2, x3, transition matrix T, and observation matrix O.]
SLIDE 34
Hidden Markov Models
P[h_{t+1} = i | h_t = j] = T_{i,j}. E[x_t | h_t = j] = O e_j. π: initial distribution of h1. Three-view model: w := Tπ. [Figure: HMM with hidden states h1, h2, h3 and observations x1, x2, x3.]
SLIDE 35
Hidden Markov Models
P[h_{t+1} = i | h_t = j] = T_{i,j}. E[x_t | h_t = j] = O e_j. π: initial distribution of h1. Three-view model: w := Tπ. [Figure: HMM with hidden states h1, h2, h3 and observations x1, x2, x3.]
E[x1 | h2] = O Diag(π) T⊤ Diag(w)^{−1} h2, E[x2 | h2] = O h2, E[x3 | h2] = O T h2.
SLIDE 36
Hidden Markov Models
P[h_{t+1} = i | h_t = j] = T_{i,j}. E[x_t | h_t = j] = O e_j. π: initial distribution of h1. Three-view model: w := Tπ. [Figure: HMM with hidden states h1, h2, h3 and observations x1, x2, x3.]
E[x1 | h2] = O Diag(π) T⊤ Diag(w)^{−1} h2, E[x2 | h2] = O h2, E[x3 | h2] = O T h2.
Condition for non-degeneracy
O ∈ Rd×k has full column rank. T is invertible, π and Tπ have positive entries.
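A small sketch of the three-view structure: sample (x1, x2, x3) from the HMM with made-up stochastic matrices T, O and initial distribution π; the three observations are conditionally independent given h2.

import numpy as np

rng = np.random.default_rng(0)
k, d = 3, 6                                    # hidden states, observation symbols (illustrative)
T = rng.dirichlet(np.ones(k), size=k).T        # T[i, j] = P[h_{t+1} = i | h_t = j]
O = rng.dirichlet(np.ones(d), size=k).T        # O[i, j] = P[x_t = i | h_t = j]
pi = rng.dirichlet(np.ones(k))                 # initial distribution of h1

def sample_triple():
    h1 = rng.choice(k, p=pi)
    h2 = rng.choice(k, p=T[:, h1])
    h3 = rng.choice(k, p=T[:, h2])
    return tuple(rng.choice(d, p=O[:, h]) for h in (h1, h2, h3))

print([sample_triple() for _ in range(3)])     # three (x1, x2, x3) triples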
SLIDE 37
Independent Component Analysis
Independent sources, unknown mixing. Blind source separation. Applications: speech, image, video.
k sources, d dimensions. [Figure: sources h1, . . . , hk mixed into observations x1, . . . , xd through A.]
x = Ah + z, z ∼ N(0, σ² I). Sources hi are independent.
Form the fourth-order cumulant tensor M4 := E[x⊗4] − E[x_{i1} x_{i2}] E[x_{i3} x_{i4}] − . . . = Σ_i κi ai ⊗ ai ⊗ ai ⊗ ai.
Kurtosis: κi := E[hi⁴] − 3.
Assumption: sources have non-zero kurtosis (κi ≠ 0).
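A sketch of the fourth-order cumulant above for zero-mean data, subtracting the three Gaussian pairings of the covariance; on Gaussian input the result should vanish (up to sampling error). This is my own illustration of the formula, not the authors' implementation.

import numpy as np

def fourth_cumulant(X):
    # X: (n, d) zero-mean samples. Returns the d x d x d x d tensor
    # M4 = E[x^(x)4] - E[x_i1 x_i2]E[x_i3 x_i4] - E[x_i1 x_i3]E[x_i2 x_i4] - E[x_i1 x_i4]E[x_i2 x_i3].
    n = X.shape[0]
    M4 = np.einsum('na,nb,nc,nd->abcd', X, X, X, X) / n
    C = X.T @ X / n                               # covariance E[x (x) x]
    M4 -= np.einsum('ab,cd->abcd', C, C)
    M4 -= np.einsum('ac,bd->abcd', C, C)
    M4 -= np.einsum('ad,bc->abcd', C, C)
    return M4

# sanity check: for Gaussian data the cumulant is close to zero
rng = np.random.default_rng(0)
X = rng.normal(size=(20000, 3))
print(np.abs(fourth_cumulant(X)).max())           # small, shrinking with more samples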
SLIDE 38
Outline
1. Introduction
2. Latent Variable Models and Moments
3. Community Detection in Graphs
4. Analysis of Tensor Power Method
5. Advanced Topics
6. Conclusion
SLIDE 39
Social Networks & Recommender Systems
Social Networks
Network of social ties, e.g. friendships or co-authorships. Hidden: communities of actors.
Recommender Systems
Observed: Ratings of users for various products. Goal: New recommendations. Modeling: User/product groups.
SLIDE 40
Network Community Models
How are communities formed? How do communities interact?
SLIDE 41
Network Community Models
How are communities formed? How do communities interact?
[Figure: nodes with fractional community memberships, e.g. (0.4, 0.3, 0.3), (0.7, 0.2, 0.1), (0.1, 0.8, 0.1).]
SLIDE 42
Network Community Models
How are communities formed? How do communities interact?
SLIDE 43
Network Community Models
How are communities formed? How do communities interact?
SLIDE 44
Network Community Models
How are communities formed? How do communities interact?
SLIDE 45
Network Community Models
How are communities formed? How do communities interact?
SLIDE 46
Mixed Membership Model (Airoldi et al)
k communities and n nodes. Graph G ∈ R^{n×n} (adjacency matrix). Fractional memberships: πx ∈ ∆^{k−1} is the membership vector of node x, where ∆^{k−1} := {π ∈ R^k : π(i) ∈ [0, 1], Σ_i π(i) = 1}, for all x ∈ [n]. Node memberships {πu} drawn from a Dirichlet distribution.
SLIDE 47
Mixed Membership Model (Airoldi et al)
k communities and n nodes. Graph G ∈ R^{n×n} (adjacency matrix). Fractional memberships: πx ∈ ∆^{k−1} is the membership vector of node x, where ∆^{k−1} := {π ∈ R^k : π(i) ∈ [0, 1], Σ_i π(i) = 1}, for all x ∈ [n]. Node memberships {πu} drawn from a Dirichlet distribution.
Edges conditionally independent given community memberships: G_{i,j} ⊥⊥ G_{a,b} | πi, πj, πa, πb.
Edge probability averaged over community memberships: P[G_{i,j} = 1 | πi, πj] = E[G_{i,j} | πi, πj] = πi⊤ P πj.
P ∈ R^{k×k}: average edge connectivity for pure communities.
Airoldi, Blei, Fienberg, and Xing. Mixed membership stochastic blockmodels. J. of Machine Learning Research, June 2008.
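A sketch of sampling a graph from this model with illustrative parameters: memberships π_u ∼ Dir(α) and edges G_ij ∼ Bernoulli(π_i⊤ P π_j); the connectivity matrix P below is made up.

import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
alpha = np.full(k, 0.1)                  # Dirichlet parameter for memberships (near block model)
P = 0.02 + 0.4 * np.eye(k)               # community connectivity matrix (illustrative)
Pi = rng.dirichlet(alpha, size=n)        # row u = membership vector pi_u

probs = Pi @ P @ Pi.T                    # P[G_ij = 1 | pi_i, pi_j] = pi_i^T P pi_j
G = (rng.random((n, n)) < probs).astype(int)
np.fill_diagonal(G, 0)                   # no self-loops (directed edges kept for simplicity)
print(G.sum(), "edges")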
SLIDE 48
Networks under Community Models
SLIDE 49
Networks under Community Models
Stochastic Block Model
α0 = 0
SLIDE 50
Networks under Community Models
Stochastic Block Model
α0 = 0
Mixed Membership Model
α0 = 1
SLIDE 51
Networks under Community Models
Stochastic Block Model
α0 = 0
Mixed Membership Model
α0 = 10
SLIDE 52
Networks under Community Models
Stochastic Block Model
α0 = 0
Mixed Membership Model
α0 = 10
Unifying Assumption
Edges conditionally independent given community memberships
SLIDE 53
Subgraph Counts as Graph Moments
SLIDE 54
Subgraph Counts as Graph Moments
SLIDE 55
Subgraph Counts as Graph Moments
3-star counts sufficient for identifiability and learning of MMSB
SLIDE 56
Subgraph Counts as Graph Moments
3-star counts sufficient for identifiability and learning of MMSB
3-Star Count Tensor
M̃3(a, b, c) = (1/|X|) · (# of common neighbors of a, b, c in X) = (1/|X|) Σ_{x∈X} G(x, a) G(x, b) G(x, c).
M̃3 = (1/|X|) Σ_{x∈X} [G⊤_{x,A} ⊗ G⊤_{x,B} ⊗ G⊤_{x,C}].
[Figure: a 3-star from x ∈ X to a ∈ A, b ∈ B, c ∈ C.]
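A sketch computing the 3-star count tensor above from an adjacency matrix, given index sets for the partitions X, A, B, C; the random graph in the toy usage is only to show the shapes.

import numpy as np

def three_star_tensor(G, X, A, B, C):
    # M3~(a, b, c) = (1/|X|) sum_{x in X} G[x, a] G[x, b] G[x, c]
    # for a in A, b in B, c in C (index arrays into the adjacency matrix G).
    GA, GB, GC = G[np.ix_(X, A)], G[np.ix_(X, B)], G[np.ix_(X, C)]
    return np.einsum('xa,xb,xc->abc', GA, GB, GC) / len(X)

# toy usage on a random graph split into four node sets
rng = np.random.default_rng(0)
n = 40
G = (rng.random((n, n)) < 0.2).astype(float)
X, A, B, C = np.split(np.arange(n), 4)
print(three_star_tensor(G, X, A, B, C).shape)    # (10, 10, 10)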
SLIDE 57
Multi-view Representation
Conditional independence of the three views. πx: community membership vector of node x.
[Figure: 3-stars from x ∈ X into the partitions A, B, C; graphical model in which πx generates G⊤_{x,A}, G⊤_{x,B}, G⊤_{x,C} through the matrices U, V, W.]
Linear multiview model: E[G⊤_{x,A} | Π] = Π⊤_A P⊤ πx = U πx.
SLIDE 58
Subgraph Counts as Graph Moments
Second and Third Order Moments
M̂2 := (1/|X|) Σ_x Z_C G⊤_{x,C} G_{x,B} Z⊤_B − shift,
M̂3 := (1/|X|) Σ_x [G⊤_{x,A} ⊗ Z_B G⊤_{x,B} ⊗ Z_C G⊤_{x,C}] − shift.
Symmetrize with transition matrices: Pairs_{C,B} := G⊤_{X,C} ⊗ G⊤_{X,B},
Z_B := Pairs(A, C) (Pairs(B, C))†, Z_C := Pairs(A, B) (Pairs(C, B))†.
[Figure: 3-star from x ∈ X to a ∈ A, b ∈ B, c ∈ C.]
Linear multiview model: E[G⊤_{x,A} | Π] = U πx.
E[M̂2 | Π_{A,B,C}] = Σ_i (αi/α0) ui ⊗ ui, E[M̂3 | Π_{A,B,C}] = Σ_i (αi/α0) ui ⊗ ui ⊗ ui.
SLIDE 59
Outline
1. Introduction
2. Latent Variable Models and Moments
3. Community Detection in Graphs
4. Analysis of Tensor Power Method
5. Advanced Topics
6. Conclusion
SLIDE 60
Recap of Tensor Method
M2 = Σ_i wi ai ⊗ ai, M3 = Σ_i wi ai ⊗ ai ⊗ ai.
Whitening matrix W from the SVD of M2. Multilinear transform: T = M3(W, W, W). [Figure: whitening maps components a1, a2, a3 to orthonormal v1, v2, v3; tensor M3 becomes tensor T.]
Eigenvectors of T through the power method and deflation: v → T(I, v, v) / ‖T(I, v, v)‖.
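A sketch of the whole recap pipeline on synthetic data (my own illustration, with orthonormal components and made-up weights): build M2 and M3, whiten with W obtained from the eigendecomposition of M2, run the tensor power iteration with deflation, and undo the whitening.

import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 3
A = np.linalg.qr(rng.normal(size=(d, k)))[0]        # components a_i (orthonormal for simplicity)
w = np.array([0.5, 0.3, 0.2])                       # weights w_i

M2 = (A * w) @ A.T                                  # sum_i w_i a_i a_i^T
M3 = np.einsum('i,ai,bi,ci->abc', w, A, A, A)       # sum_i w_i a_i (x) a_i (x) a_i

# Whitening: W such that W^T M2 W = I, from the top-k eigenpairs of M2
vals, vecs = np.linalg.eigh(M2)
vals, vecs = vals[-k:], vecs[:, -k:]
W = vecs / np.sqrt(vals)                            # d x k
T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)     # T = M3(W, W, W): orthogonal tensor in R^k

# Tensor power method with deflation
for _ in range(k):
    v = rng.normal(size=k); v /= np.linalg.norm(v)
    for _ in range(30):
        v = np.einsum('ijk,j,k->i', T, v, v)        # v <- T(I, v, v), then normalize
        v /= np.linalg.norm(v)
    lam = np.einsum('ijk,i,j,k->', T, v, v, v)      # eigenvalue T(v, v, v)
    T = T - lam * np.einsum('i,j,k->ijk', v, v, v)  # deflate
    print(np.round(lam * (np.linalg.pinv(W.T) @ v), 2))   # recovers a_i up to sign and order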
SLIDE 61
Orthogonal Tensor Eigen Decomposition
T = Σ_{i∈[k]} λi vi ⊗ vi ⊗ vi, ⟨vi, vj⟩ = δ_{i,j} for all i, j.
T(I, v1, v1) = Σ_i λi ⟨vi, v1⟩² vi = λ1 v1.
The vi are eigenvectors of the tensor T.
SLIDE 62
Orthogonal Tensor Eigen Decomposition
T = Σ_{i∈[k]} λi vi ⊗ vi ⊗ vi, ⟨vi, vj⟩ = δ_{i,j} for all i, j.
T(I, v1, v1) = Σ_i λi ⟨vi, v1⟩² vi = λ1 v1.
The vi are eigenvectors of the tensor T.
Tensor Power Method
Start from an initial vector v. Iterate v → T(I, v, v) / ‖T(I, v, v)‖.
SLIDE 63
Orthogonal Tensor Eigen Decomposition
T = Σ_{i∈[k]} λi vi ⊗ vi ⊗ vi, ⟨vi, vj⟩ = δ_{i,j} for all i, j.
T(I, v1, v1) = Σ_i λi ⟨vi, v1⟩² vi = λ1 v1.
The vi are eigenvectors of the tensor T.
Tensor Power Method
Start from an initial vector v. Iterate v → T(I, v, v) / ‖T(I, v, v)‖.
Questions
Is there convergence? Does the convergence depend on initialization? What about performance under noise?
SLIDE 64
Recap of Matrix Eigen Analysis
For symmetric M ∈ R^{k×k}, eigendecomposition: M = Σ_i λi vi vi⊤.
Eigenvectors are fixed points: Mv = λv.
◮ In our notation: M(I, v) = λv.
Uniqueness (identifiability): iff the λi are distinct.
SLIDE 65
Recap of Matrix Eigen Analysis
For symmetric M ∈ R^{k×k}, eigendecomposition: M = Σ_i λi vi vi⊤.
Eigenvectors are fixed points: Mv = λv.
◮ In our notation: M(I, v) = λv.
Uniqueness (identifiability): iff the λi are distinct.
Power method: v → M(I, v) / ‖M(I, v)‖.
SLIDE 66
Recap of Matrix Eigen Analysis
For symmetric M ∈ R^{k×k}, eigendecomposition: M = Σ_i λi vi vi⊤.
Eigenvectors are fixed points: Mv = λv.
◮ In our notation: M(I, v) = λv.
Uniqueness (identifiability): iff the λi are distinct.
Power method: v → M(I, v) / ‖M(I, v)‖.
Convergence properties
Let λ1 > λ2 > · · · > λd. {vi} form a basis. Let the initialization be v = Σ_i ci vi.
If c1 ≠ 0, the power method converges to v1.
SLIDE 67
Recap of Matrix Eigen Analysis
For symmetric M ∈ R^{k×k}, eigendecomposition: M = Σ_i λi vi vi⊤.
Eigenvectors are fixed points: Mv = λv.
◮ In our notation: M(I, v) = λv.
Uniqueness (identifiability): iff the λi are distinct.
Power method: v → M(I, v) / ‖M(I, v)‖.
Convergence properties
Let λ1 > λ2 > · · · > λd. {vi} form a basis. Let the initialization be v = Σ_i ci vi.
If c1 ≠ 0, the power method converges to v1.
Perturbation analysis (Davis-Kahan): M + E.
Requires ‖E‖ < min_{i≠j} |λi − λj|.
SLIDE 68
Optimization viewpoint of matrix analysis
M = Σ_{i∈[k]} λi vi ⊗ vi, λ1 > λ2 > · · · .
Rayleigh quotient at v: M(v, v) = v⊤ M v = Σ_i λi ⟨vi, v⟩².
Optimization problem: max_v M(v, v) s.t. ‖v‖ = 1.
SLIDE 69
Optimization viewpoint of matrix analysis
M = Σ_{i∈[k]} λi vi ⊗ vi, λ1 > λ2 > · · · .
Rayleigh quotient at v: M(v, v) = v⊤ M v = Σ_i λi ⟨vi, v⟩².
Optimization problem: max_v M(v, v) s.t. ‖v‖ = 1.
Non-convex problem. Global maximizer is v1 (the top eigenvector).
SLIDE 70
Optimization viewpoint of matrix analysis
M = Σ_{i∈[k]} λi vi ⊗ vi, λ1 > λ2 > · · · .
Rayleigh quotient at v: M(v, v) = v⊤ M v = Σ_i λi ⟨vi, v⟩².
Optimization problem: max_v M(v, v) s.t. ‖v‖ = 1.
Non-convex problem. Global maximizer is v1 (the top eigenvector). What are the local optimizers?
SLIDE 71
Optimization viewpoint of matrix analysis
Optimization: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v⊤v − 1).
SLIDE 72
Optimization viewpoint of matrix analysis
Optimization: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v⊤v − 1).
First derivative: ∇L(v, λ) = 2(M(I, v) − λv).
SLIDE 73
Optimization viewpoint of matrix analysis
Optimization: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v⊤v − 1).
First derivative: ∇L(v, λ) = 2(M(I, v) − λv).
Stationary points are eigenvectors: ∇L(v, λ) = 0.
SLIDE 74
Optimization viewpoint of matrix analysis
Optimization: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v⊤v − 1).
First derivative: ∇L(v, λ) = 2(M(I, v) − λv).
Stationary points are eigenvectors: ∇L(v, λ) = 0.
Power method v → M(I, v)/‖M(I, v)‖ is a version of gradient ascent.
SLIDE 75
Optimization viewpoint of matrix analysis
Optimization: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v⊤v − 1).
First derivative: ∇L(v, λ) = 2(M(I, v) − λv).
Stationary points are eigenvectors: ∇L(v, λ) = 0.
Power method v → M(I, v)/‖M(I, v)‖ is a version of gradient ascent.
Second derivative: ∇²L(v, λ) = 2(M − λI).
SLIDE 76
Optimization viewpoint of matrix analysis
Optimization: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v⊤v − 1).
First derivative: ∇L(v, λ) = 2(M(I, v) − λv).
Stationary points are eigenvectors: ∇L(v, λ) = 0.
Power method v → M(I, v)/‖M(I, v)‖ is a version of gradient ascent.
Second derivative: ∇²L(v, λ) = 2(M − λI).
Local optimality condition for constrained optimization
w⊤ ∇²L(v, λ) w < 0 for all w ⊥ v, at a stationary point v.
SLIDE 77
Optimization viewpoint of matrix analysis
Optimization: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v⊤v − 1).
First derivative: ∇L(v, λ) = 2(M(I, v) − λv).
Stationary points are eigenvectors: ∇L(v, λ) = 0.
Power method v → M(I, v)/‖M(I, v)‖ is a version of gradient ascent.
Second derivative: ∇²L(v, λ) = 2(M − λI).
Local optimality condition for constrained optimization
w⊤ ∇²L(v, λ) w < 0 for all w ⊥ v, at a stationary point v. Verify: v1 is the only local optimum. Verify: all other eigenvectors are saddle points.
SLIDE 78
Optimization viewpoint of matrix analysis
Optimization: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v⊤v − 1).
First derivative: ∇L(v, λ) = 2(M(I, v) − λv).
Stationary points are eigenvectors: ∇L(v, λ) = 0.
Power method v → M(I, v)/‖M(I, v)‖ is a version of gradient ascent.
Second derivative: ∇²L(v, λ) = 2(M − λI).
Local optimality condition for constrained optimization
w⊤ ∇²L(v, λ) w < 0 for all w ⊥ v, at a stationary point v. Verify: v1 is the only local optimum. Verify: all other eigenvectors are saddle points. Power method recovers v1 when the initialization v satisfies ⟨v, v1⟩ ≠ 0.
SLIDE 79
Analysis of Tensor Power Method
T = Σ_{i∈[k]} λi vi ⊗ vi ⊗ vi.
Bad news about tensors
Decomposition may not always exist for general tensors. Finding the decomposition is NP-hard in general.
SLIDE 80
Analysis of Tensor Power Method
T = Σ_{i∈[k]} λi vi ⊗ vi ⊗ vi.
Bad news about tensors
Decomposition may not always exist for general tensors. Finding the decomposition is NP-hard in general. We will see that a tractable case is when we are promised that an orthogonal decomposition exists.
SLIDE 81
Analysis of Tensor Power Method
T = Σ_{i∈[k]} λi vi ⊗ vi ⊗ vi.
Bad news about tensors
Decomposition may not always exist for general tensors. Finding the decomposition is NP-hard in general. We will see that a tractable case is when we are promised that an orthogonal decomposition exists.
Characterization of components {vi}
{vi} are eigenvectors: T(I, vi, vi) = λivi.
SLIDE 82
Analysis of Tensor Power Method
T = Σ_{i∈[k]} λi vi ⊗ vi ⊗ vi.
Bad news about tensors
Decomposition may not always exist for general tensors. Finding the decomposition is NP-hard in general. We will see that a tractable case is when we are promised that an orthogonal decomposition exists.
Characterization of components {vi}
{vi} are eigenvectors: T(I, vi, vi) = λi vi. Bad news: there can be other eigenvectors (unlike the matrix case). With λi ≡ 1, the vector v = (v1 + v2)/√2 satisfies T(I, v, v) = (1/√2) v.
SLIDE 83
Analysis of Tensor Power Method
T = Σ_{i∈[k]} λi vi ⊗ vi ⊗ vi.
Bad news about tensors
Decomposition may not always exist for general tensors. Finding the decomposition is NP-hard in general. We will see that a tractable case is when we are promised that an orthogonal decomposition exists.
Characterization of components {vi}
{vi} are eigenvectors: T(I, vi, vi) = λi vi. Bad news: there can be other eigenvectors (unlike the matrix case). With λi ≡ 1, the vector v = (v1 + v2)/√2 satisfies T(I, v, v) = (1/√2) v. How do we avoid spurious solutions (not part of the decomposition)?
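A tiny numeric check (my own) of the example above: with λi ≡ 1 and the standard basis as components, v = (v1 + v2)/√2 indeed satisfies T(I, v, v) = v/√2, so it is a tensor eigenvector that is not one of the components.

import numpy as np

e1, e2 = np.eye(2)
T = np.einsum('i,j,k->ijk', e1, e1, e1) + np.einsum('i,j,k->ijk', e2, e2, e2)

v = (e1 + e2) / np.sqrt(2)
Tvv = np.einsum('ijk,j,k->i', T, v, v)
print(Tvv, v / np.sqrt(2))        # both are (0.5, 0.5): T(I, v, v) = v / sqrt(2)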
SLIDE 84
Optimization viewpoint of tensor analysis
Optimization: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − (3/2) λ (v⊤v − 1).
SLIDE 85
Optimization viewpoint of tensor analysis
Optimization: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − (3/2) λ (v⊤v − 1).
First derivative: ∇L(v, λ) = 3(T(I, v, v) − λv).
SLIDE 86
Optimization viewpoint of tensor analysis
Optimization: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − (3/2) λ (v⊤v − 1).
First derivative: ∇L(v, λ) = 3(T(I, v, v) − λv).
Stationary points are eigenvectors: ∇L(v, λ) = 0.
SLIDE 87
Optimization viewpoint of tensor analysis
Optimization: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − (3/2) λ (v⊤v − 1).
First derivative: ∇L(v, λ) = 3(T(I, v, v) − λv).
Stationary points are eigenvectors: ∇L(v, λ) = 0.
Power method v → T(I, v, v)/‖T(I, v, v)‖ is a version of gradient ascent.
SLIDE 88
Optimization viewpoint of tensor analysis
Optimization: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − (3/2) λ (v⊤v − 1).
First derivative: ∇L(v, λ) = 3(T(I, v, v) − λv).
Stationary points are eigenvectors: ∇L(v, λ) = 0.
Power method v → T(I, v, v)/‖T(I, v, v)‖ is a version of gradient ascent.
Second derivative: ∇²L(v, λ) = 3(2 T(I, I, v) − λI).
SLIDE 89
Optimization viewpoint of tensor analysis
Optimization: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − (3/2) λ (v⊤v − 1).
First derivative: ∇L(v, λ) = 3(T(I, v, v) − λv).
Stationary points are eigenvectors: ∇L(v, λ) = 0.
Power method v → T(I, v, v)/‖T(I, v, v)‖ is a version of gradient ascent.
Second derivative: ∇²L(v, λ) = 3(2 T(I, I, v) − λI).
Local optimality condition for constrained optimization
w⊤ ∇²L(v, λ) w < 0 for all w ⊥ v, at a stationary point v.
SLIDE 90
Optimization viewpoint of tensor analysis
Optimization: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − (3/2) λ (v⊤v − 1).
First derivative: ∇L(v, λ) = 3(T(I, v, v) − λv).
Stationary points are eigenvectors: ∇L(v, λ) = 0.
Power method v → T(I, v, v)/‖T(I, v, v)‖ is a version of gradient ascent.
Second derivative: ∇²L(v, λ) = 3(2 T(I, I, v) − λI).
Local optimality condition for constrained optimization
w⊤ ∇²L(v, λ) w < 0 for all w ⊥ v, at a stationary point v. Verify: {vi} are the only local optima. Verify: all other eigenvectors are saddle points.
SLIDE 91
Optimization viewpoint of tensor analysis
Optimization: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − (3/2) λ (v⊤v − 1).
First derivative: ∇L(v, λ) = 3(T(I, v, v) − λv).
Stationary points are eigenvectors: ∇L(v, λ) = 0.
Power method v → T(I, v, v)/‖T(I, v, v)‖ is a version of gradient ascent.
Second derivative: ∇²L(v, λ) = 3(2 T(I, I, v) − λI).
Local optimality condition for constrained optimization
w⊤ ∇²L(v, λ) w < 0 for all w ⊥ v, at a stationary point v. Verify: {vi} are the only local optima. Verify: all other eigenvectors are saddle points. For an orthogonal tensor, no spurious local optima!
SLIDE 92
Review: matrix power iteration
Recall matrix power iteration for the matrix M := Σ_i λi vi vi⊤:
Start with some v, and for j = 1, 2, . . . : v → Mv = Σ_i λi (vi⊤ v) vi,
i.e., the component in the vi direction is scaled by λi.
SLIDE 93
Review: matrix power iteration
Recall matrix power iteration for the matrix M := Σ_i λi vi vi⊤:
Start with some v, and for j = 1, 2, . . . : v → Mv = Σ_i λi (vi⊤ v) vi,
i.e., the component in the vi direction is scaled by λi.
If λ1 > λ2 ≥ · · · , then in t iterations,
(v1⊤ v)² / Σ_i (vi⊤ v)² ≥ 1 − k (λ2/λ1)^{2t}.
Converges linearly to v1, assuming the gap λ2/λ1 < 1.
SLIDE 94
Tensor power iteration convergence analysis
Let ci := vi⊤ v be the initial component in the vi direction; assume WLOG λ1|c1| > λ2|c2| ≥ λ3|c3| ≥ · · · .
SLIDE 95
Tensor power iteration convergence analysis
Let ci := vi⊤ v be the initial component in the vi direction; assume WLOG λ1|c1| > λ2|c2| ≥ λ3|c3| ≥ · · · .
Then v → Σ_i λi (vi⊤ v)² vi = Σ_i λi ci² vi,
i.e., the component in the vi direction is squared and then scaled by λi.
SLIDE 96
Tensor power iteration convergence analysis
Let ci := vi⊤ v be the initial component in the vi direction; assume WLOG λ1|c1| > λ2|c2| ≥ λ3|c3| ≥ · · · .
Then v → Σ_i λi (vi⊤ v)² vi = Σ_i λi ci² vi,
i.e., the component in the vi direction is squared and then scaled by λi.
By induction, in t iterations v ∝ Σ_i λi^{2^t − 1} ci^{2^t} vi, so
(v1⊤ v)² / Σ_i (vi⊤ v)² ≥ 1 − k · max_{i≠1}(λ1/λi)² · (λ2|c2| / (λ1|c1|))^{2^{t+1}}.
SLIDE 97
Matrix vs. tensor power iteration
Matrix power iteration: Tensor power iteration:
SLIDE 98
Matrix vs. tensor power iteration
Matrix power iteration:
1. Requires a gap between the largest and second-largest eigenvalue. Property of the matrix only.
Tensor power iteration:
1. Requires a gap between the largest and second-largest λi|ci|. Property of the tensor and the initialization v.
SLIDE 99
Matrix vs. tensor power iteration
Matrix power iteration:
1. Requires a gap between the largest and second-largest eigenvalue. Property of the matrix only.
2. Converges to the top eigenvector.
Tensor power iteration:
1. Requires a gap between the largest and second-largest λi|ci|. Property of the tensor and the initialization v.
2. Converges to the vi for which λi|ci| is largest (could be any of them).
SLIDE 100
Matrix vs. tensor power iteration
Matrix power iteration:
1. Requires a gap between the largest and second-largest eigenvalue. Property of the matrix only.
2. Converges to the top eigenvector.
3. Linear convergence. Needs O(log(1/ε)) iterations.
Tensor power iteration:
1. Requires a gap between the largest and second-largest λi|ci|. Property of the tensor and the initialization v.
2. Converges to the vi for which λi|ci| is largest (could be any of them).
3. Quadratic convergence. Needs O(log log(1/ε)) iterations (see the sketch below).
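A small experiment (illustrative, with made-up eigenvalues) contrasting the two rates: on a diagonal spectrum the matrix step scales each coefficient by λi, while the tensor step squares it first, so the error collapses much faster.

import numpy as np

k = 5
lam = np.array([1.0, 0.9, 0.8, 0.7, 0.6])     # eigenvalues; eigenvectors = standard basis e_i
rng = np.random.default_rng(1)
v0 = rng.normal(size=k); v0 /= np.linalg.norm(v0)
jstar = np.argmax(lam * np.abs(v0))           # tensor iteration converges to the largest lambda_i |c_i|

vm, vt = v0.copy(), v0.copy()
for t in range(1, 8):
    vm = lam * vm;     vm /= np.linalg.norm(vm)    # matrix step: c_i -> lambda_i c_i
    vt = lam * vt**2;  vt /= np.linalg.norm(vt)    # tensor step: c_i -> lambda_i c_i^2
    # matrix error decays geometrically (linear convergence); tensor error decays quadratically
    print(t, 1 - vm[0]**2, 1 - vt[jstar]**2)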
SLIDE 101
Perturbation Analysis
T̂ = T + E, T = Σ_i λi vi ⊗ vi ⊗ vi, ‖E‖ := max_{‖x‖=1} |E(x, x, x)| ≤ ε.
Theorem: Let N be the number of iterations. If N ≥ log k + log log(λmax/ε) and ε < λmin/k, then the output (v, λ) (after polynomially many restarts) satisfies ‖v − v1‖ ≤ O(ε/λ1) and |λ − λ1| ≤ O(ε), where v1 is such that λ1|c1| > λ2|c2| > · · · , ci := ⟨vi, v⟩, and v is the (successful) initializer.
Careful analysis of deflation: avoid buildup of errors. Implies polynomial sample complexity for learning.
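A sketch (illustrative only) of the restart-and-deflate strategy the theorem alludes to: run the power iteration from several random initializations on the noisy tensor, keep the candidate with the largest value of T(v, v, v), subtract the recovered rank-1 term, and repeat.

import numpy as np

def power_iter(T, v, n_iters):
    for _ in range(n_iters):
        v = np.einsum('ijk,j,k->i', T, v, v)
        v /= np.linalg.norm(v)
    return v

def top_eigenpair(T, n_restarts=10, n_iters=25, seed=0):
    # Best of several random restarts, scored by lambda = T(v, v, v).
    rng = np.random.default_rng(seed)
    k = T.shape[0]
    best = None
    for _ in range(n_restarts):
        v = rng.normal(size=k); v /= np.linalg.norm(v)
        v = power_iter(T, v, n_iters)
        lam = np.einsum('ijk,i,j,k->', T, v, v, v)
        if best is None or lam > best[0]:
            best = (lam, v)
    return best

# toy usage: a diagonal (orthogonal) tensor plus a small perturbation E
rng = np.random.default_rng(0)
k = 4
lam_true = np.array([1.0, 0.8, 0.6, 0.4])
T_hat = np.einsum('i,ij,ik,il->jkl', lam_true, np.eye(k), np.eye(k), np.eye(k))
T_hat = T_hat + rng.normal(scale=1e-3, size=(k, k, k))       # (not exactly symmetric) noise

for _ in range(k):                                           # deflate to recover all eigenpairs
    lam, v = top_eigenpair(T_hat)
    print(round(lam, 3), np.round(v, 2))
    T_hat = T_hat - lam * np.einsum('i,j,k->ijk', v, v, v)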
SLIDE 102
Outline
1. Introduction
2. Latent Variable Models and Moments
3. Community Detection in Graphs
4. Analysis of Tensor Power Method
5. Advanced Topics
6. Conclusion
SLIDE 103
Beyond Orthogonal Tensor Decomposition
a ⊗ a ⊗ a is a rank-1 tensor whose (i1, i2, i3)-th entry is a(i1) · a(i2) · a(i3).
For a tensor T, find a decomposition into rank-one terms: T = Σ_{j∈[k]} wj aj ⊗ aj ⊗ aj, aj ∈ S^{d−1}.
[Figure: Tensor T = w1 · a1 ⊗ a1 ⊗ a1 + w2 · a2 ⊗ a2 ⊗ a2 + · · · .]
k: tensor rank, d: ambient dimension. k > d: overcomplete. A is incoherent: ⟨ai, aj⟩ ∼ 1/√d for i ≠ j.
Guaranteed recovery when k = o(d^{1.5}).
"Guaranteed Non-Orthogonal Tensor Decomposition via Alternating Rank-1 Updates" by A., R. Ge, M. Janzamin. Preprint, Feb. 2014. "Provable Learning of Overcomplete Latent Variable Models: Semi-supervised & Unsupervised".
SLIDE 104
Semi-supervised Learning of Gaussian Mixtures
n unlabeled samples; mj: labeled samples for component j.
No. of mixture components: k = o(d^{1.5}).
No. of labeled samples: mj = Ω̃(1).
No. of unlabeled samples: n = Ω̃(k).
Our result: error achieved with n unlabeled samples: max_i ‖âi − ai‖ = Õ(√(k/n)) + Õ(√k/d).
Can handle (polynomially) overcomplete mixtures. Extremely small number of labeled samples: polylog(d). Sample complexity is tight: need Ω̃(k) samples! Approximation error: decaying in high dimensions.
SLIDE 105
Unsupervised Learning of Gaussian Mixtures
Conditions for recovery
No. of mixture components: k = C · d.
No. of unlabeled samples: n = Ω̃(k · d). Computational complexity: Õ(e^{C²}).
Our result: error achieved with n unlabeled samples: max_i ‖âi − ai‖ = Õ(√(k/n)) + Õ(√k/d).
Error: same as before, for the semi-supervised setting. Sample complexity: worse than semi-supervised, but better than previous works (no dependence on the condition number of A). Computational complexity: polynomial when k = Θ(d).
SLIDE 106
Learning Overcomplete Dictionaries
[Figure: Y = AX with Y ∈ R^{d×n}, A ∈ R^{d×k}, X ∈ R^{k×n}.]
Linear model: Y = AX, with both A and X unknown. Sparse X: each column is randomly s-sparse. Overcomplete dictionary A ∈ R^{d×k}: k ≥ d. Incoherence: max_{i≠j} |⟨ai, aj⟩| ≈ 0 (satisfied by random vectors).
"Learning Sparsely Used Overcomplete Dictionaries" by A. Agarwal, A., P. Jain, P. Netrapalli, and R. Tandon. COLT 2014.
SLIDE 107
Experiments on MNIST
Original Reconstruction Learnt Representation
SLIDE 108
Outline
1. Introduction
2. Latent Variable Models and Moments
3. Community Detection in Graphs
4. Analysis of Tensor Power Method
5. Advanced Topics
6. Conclusion
SLIDE 109
Conclusion
Guaranteed Learning of Latent Variable Models
Guaranteed to recover the correct model. Efficient sample and computational complexities. Better performance compared to EM, Variational Bayes, etc.
Tensor approach: mixed membership communities, topic models, latent trees...
Sparsity-based approach: overcomplete models, e.g. sparse coding and topic models.