

slide-1
SLIDE 1

Guaranteed Learning of Latent Variable Models through Spectral and Tensor Methods

Anima Anandkumar

U.C. Irvine

slide-2
SLIDE 2

Guaranteed Unsupervised Learning

Unsupervised Learning: no labeled samples available for training.

slide-3
SLIDE 3

Guaranteed Unsupervised Learning

Unsupervised Learning: no labeled samples available for training.

Challenge: Conditions for Identifiability

When can the model be identified (given infinite computation and data)? Does identifiability also lead to tractable algorithms?

slide-4
SLIDE 4

Guaranteed Unsupervised Learning

Unsupervised Learning: no labeled samples available for training.

Challenge: Conditions for Identifiability

When can the model be identified (given infinite computation and data)? Does identifiability also lead to tractable algorithms?

Challenge: Efficient Learning of Latent Variable Models

Maximum likelihood is NP-hard. In practice, EM and Variational Bayes have no consistency guarantees. Efficient computational and sample complexities?

slide-5
SLIDE 5

Guaranteed Unsupervised Learning

Unsupervised Learning: no labeled samples available for training.

Challenge: Conditions for Identifiability

When can the model be identified (given infinite computation and data)? Does identifiability also lead to tractable algorithms?

Challenge: Efficient Learning of Latent Variable Models

Maximum likelihood is NP-hard. In practice, EM and Variational Bayes have no consistency guarantees. Efficient computational and sample complexities? In this series: guaranteed and efficient learning through spectral methods.

slide-6
SLIDE 6

Probabilistic Models

Latent Variable Models

Concise statistical description through graphical modeling: conditional independence relationships or a hierarchy of variables.

[Figure: observed variable x and hidden variable h.]

slide-7
SLIDE 7

Probabilistic Models

Latent Variable Models

Concise statistical description through graphical modeling: conditional independence relationships or a hierarchy of variables.

[Figure: observed variables x1, …, x5 with a single hidden variable h.]

slide-8
SLIDE 8

Probabilistic Models

Latent Variable Models

Concise statistical description through graphical modeling: conditional independence relationships or a hierarchy of variables.

[Figure: observed variables x1, …, x5 with hidden variables h1, h2, h3.]

slide-9
SLIDE 9

Probabilistic Models

Latent Variable Models

Concise statistical description through graphical modeling: conditional independence relationships or a hierarchy of variables.

[Figure: observed variables x1, …, x5 with hidden variables h1, h2, h3.]

Maximum Likelihood vs. Moment method

Finding MLE is NP-hard in general. Expectation maximization (EM) converges to a local optimum.

slide-10
SLIDE 10

Probabilistic Models

Latent Variable Models

Concise statistical description through graphical modeling: conditional independence relationships or a hierarchy of variables.

[Figure: observed variables x1, …, x5 with hidden variables h1, h2, h3.]

Maximum Likelihood vs. Moment method

Finding MLE is NP-hard in general. Expectation maximization (EM) converges to a local optimum. Moment estimate: polynomial computational & sample complexity. Le Cam theory: Newton-Raphson on the moment estimate leads to an asymptotically efficient estimator. Scalable implementation: linear and multilinear algebraic operations.

slide-11
SLIDE 11

Game Plan: In this talk

Recall Yesterday’s Talk

Gaussian mixtures and (single) topic models. Analysis of third order moments. Tensor decomposition method: whitening and power method.

slide-12
SLIDE 12

Game Plan: In this talk

Recall Yesterday’s Talk

Gaussian mixtures and (single) topic models. Analysis of third order moments. Tensor decomposition method: whitening and power method.

Today’s talk

Moments for various latent variable models. Analysis of tensor power method.

slide-13
SLIDE 13

Game Plan: In this talk

Recall Yesterday’s Talk

Gaussian mixtures and (single) topic models. Analysis of third order moments. Tensor decomposition method: whitening and power method.

Today’s talk

Moments for various latent variable models. Analysis of tensor power method.

Tomorrow’s talk

Implementation of tensor method.

slide-14
SLIDE 14

Outline

1. Introduction
2. Latent Variable Models and Moments
3. Community Detection in Graphs
4. Analysis of Tensor Power Method
5. Advanced Topics
6. Conclusion

slide-15
SLIDE 15

Recap: Gaussian Mixtures and (single) Topic Models

(spherical) Mixture of Gaussians: k means a1, …, ak. Component h = i with prob. wi. Observe x with spherical noise: x = ai + z, z ∼ N(0, σi² I).

(single) Topic Models: k topics a1, …, ak. Topic h = i with prob. wi. Observe ℓ (exchangeable) words x1, x2, …, xℓ drawn i.i.d. from ai.

Unified Linear Model: E[x|h] = Ah. Gaussian mixture: single view, spherical noise. Topic model: multi-view, heteroskedastic noise.

M3 = Σ_i wi ai ⊗ ai ⊗ ai,   M2 = Σ_i wi ai ⊗ ai.
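To make the notation concrete, here is a small numpy sketch (not from the talk; the dimensions, weights w, and component means A are made-up placeholders) that forms the population moments M2 and M3 from the columns a_i of A:

import numpy as np

d, k = 10, 3
rng = np.random.default_rng(0)
A = rng.normal(size=(d, k))                    # hypothetical component means a_1, ..., a_k (columns)
w = np.ones(k) / k                             # hypothetical mixing weights w_i

M2 = np.einsum('i,ai,bi->ab', w, A, A)         # sum_i w_i a_i (x) a_i
M3 = np.einsum('i,ai,bi,ci->abc', w, A, A, A)  # sum_i w_i a_i (x) a_i (x) a_i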

slide-16
SLIDE 16

Recap: Geometric Picture for Topic Models

Topic models are exchangeable multiview models. M2 = E[x1 ⊗ x2]. M3 = E[x1 ⊗ x2 ⊗ x3].

[Figure: document generated from the topic proportions vector h.]

slide-17
SLIDE 17

Recap: Geometric Picture for Topic Models

Topic models are exchangeable multiview models. M2 = E[x1 ⊗ x2]. M3 = E[x1 ⊗ x2 ⊗ x3]. [Figure: single topic h.]

slide-18
SLIDE 18

Recap: Geometric Picture for Topic Models

Topic models are exchangeable multiview models. M2 = E[x1 ⊗ x2]. M3 = E[x1 ⊗ x2 ⊗ x3]. [Figure: topic proportions vector h.]

slide-19
SLIDE 19

Recap: Geometric Picture for Topic Models

Topic models are exchangeable multiview models. M2 = E[x1 ⊗ x2]. M3 = E[x1 ⊗ x2 ⊗ x3]. [Figure: topic proportions vector h; words x1, x2, x3 generated through A.]

slide-20
SLIDE 20

Latent Dirichlet Allocation

ℓ words in a document: x1, …, xℓ. Word xi is generated from topic yi. Exchangeability: x1 ⊥⊥ x2 ⊥⊥ … | h. A(i, j) := P[xm = i | ym = j]: topic-word matrix.

[Figure: words x1, …, x5 drawn from topics y1, …, y5 through A, with topic mixture h.]

If there are k topics, h is distributed over the simplex ∆k−1 := {h ∈ Rk : hi ∈ [0, 1], Σ_i hi = 1}. Latent Dirichlet Allocation: h is drawn from a Dirichlet distribution.

slide-21
SLIDE 21

Dirichlet Distribution

P[h] ∝ Π_{j=1}^k h(j)^(αj − 1),  with Σ_{j=1}^k h(j) = 1.

slide-22
SLIDE 22

Dirichlet Distribution

P[h] ∝ Π_{j=1}^k h(j)^(αj − 1),  with Σ_{j=1}^k h(j) = 1.

[Figure: Dir(α).]

slide-23
SLIDE 23

Dirichlet Distribution

P[h] ∝ Π_{j=1}^k h(j)^(αj − 1),  with Σ_{j=1}^k h(j) = 1.

[Figure: Dir(α).]

Dirichlet concentration parameter α0 := Σ_j αj. Sparsity level in h is O(α0).

slide-24
SLIDE 24

Dirichlet Distribution

P[h] ∝ Π_{j=1}^k h(j)^(αj − 1),  with Σ_{j=1}^k h(j) = 1.

[Figure: αj → 0.]

Dirichlet concentration parameter α0 := Σ_j αj. Sparsity level in h is O(α0).

slide-25
SLIDE 25

Dirichlet Distribution

P[h] ∝ Π_{j=1}^k h(j)^(αj − 1),  with Σ_{j=1}^k h(j) = 1.

[Figure: αj < 1.]

Dirichlet concentration parameter α0 := Σ_j αj. Sparsity level in h is O(α0).

slide-26
SLIDE 26

Dirichlet Distribution

P[h] ∝ Π_{j=1}^k h(j)^(αj − 1),  with Σ_{j=1}^k h(j) = 1.

[Figure: large αj.]

Dirichlet concentration parameter α0 := Σ_j αj. Sparsity level in h is O(α0).

slide-27
SLIDE 27

Dirichlet Distribution

P[h] ∝ Π_{j=1}^k h(j)^(αj − 1),  with Σ_{j=1}^k h(j) = 1.

[Figure: αj → ∞.]

Dirichlet concentration parameter α0 := Σ_j αj. Sparsity level in h is O(α0).

slide-28
SLIDE 28

Dirichlet Distribution

P[h] ∝ Π_{j=1}^k h(j)^(αj − 1),  with Σ_{j=1}^k h(j) = 1.

[Figure: Dir(α).]

Dirichlet concentration parameter α0 := Σ_j αj. Sparsity level in h is O(α0).
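As an illustration of the sparsity comment above, a short numpy sketch (values chosen arbitrarily) draws h ∼ Dir(α) for a few concentration levels α0; small α0 yields a nearly one-hot h, large α0 a nearly uniform h:

import numpy as np

rng = np.random.default_rng(0)
k = 5
for alpha0 in (0.1, 1.0, 100.0):
    alpha = np.full(k, alpha0 / k)     # symmetric Dirichlet parameters summing to alpha0
    h = rng.dirichlet(alpha)
    print(alpha0, np.round(h, 3))      # small alpha0 -> mass concentrated on few coordinates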

slide-29
SLIDE 29

Moments under LDA

M2 := E[x1 ⊗ x2] − (α0/(α0 + 1)) E[x1] ⊗ E[x1]

M3 := E[x1 ⊗ x2 ⊗ x3] − (α0/(α0 + 2)) E[x1 ⊗ x2 ⊗ E[x1]] − more stuff…

Then M2 = Σ_i w̃i ai ⊗ ai and M3 = Σ_i w̃i ai ⊗ ai ⊗ ai. Three words per document suffice for learning LDA.
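A hedged sketch of the empirical version of the adjusted second moment above (assuming the first two words of each document are available as one-hot row vectors X1, X2 and that α0 is known; the function name is mine, not from the talk):

import numpy as np

def lda_m2(X1, X2, alpha0):
    """X1, X2: (n, d) one-hot encodings of the 1st and 2nd word of n documents."""
    n = X1.shape[0]
    cross = X1.T @ X2 / n              # empirical E[x1 (x) x2]
    mean1 = X1.mean(axis=0)            # empirical E[x1]
    return cross - alpha0 / (alpha0 + 1.0) * np.outer(mean1, mean1)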

slide-30
SLIDE 30

General Multiview Mixtures (Naive Bayes)

E[xi|h] = Aih and multiple views.

slide-31
SLIDE 31

General Multiview Mixtures (Naive Bayes)

E[xi | h] = Ai h, with multiple views. Symmetrize the views: x̃1 := E[x3 ⊗ x2] E[x1 ⊗ x2]† x1,  x̃2 := E[x3 ⊗ x1] E[x2 ⊗ x1]† x2. Then M2 := E[x̃1 ⊗ x̃2],  M3 := E[x̃1 ⊗ x̃2 ⊗ x3].

slide-32
SLIDE 32

General Multiview Mixtures (Naive Bayes)

E[xi | h] = Ai h, with multiple views. Symmetrize the views: x̃1 := E[x3 ⊗ x2] E[x1 ⊗ x2]† x1,  x̃2 := E[x3 ⊗ x1] E[x2 ⊗ x1]† x2. Then M2 := E[x̃1 ⊗ x̃2],  M3 := E[x̃1 ⊗ x̃2 ⊗ x3].

M2 = Σ_i wi a3,i ⊗ a3,i,   M3 = Σ_i wi a3,i ⊗ a3,i ⊗ a3,i.
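A minimal numpy sketch of the symmetrization as reconstructed above (my own illustration: X1, X2, X3 hold n samples of the three views as rows, and empirical averages and pseudo-inverses stand in for the population quantities):

import numpy as np

def symmetrize_views(X1, X2, X3):
    n = X1.shape[0]
    P12, P21 = X1.T @ X2 / n, X2.T @ X1 / n          # E[x1 (x) x2], E[x2 (x) x1]
    P31, P32 = X3.T @ X1 / n, X3.T @ X2 / n          # E[x3 (x) x1], E[x3 (x) x2]
    X1t = X1 @ (P32 @ np.linalg.pinv(P12)).T         # x~1 = E[x3(x)x2] E[x1(x)x2]^+ x1
    X2t = X2 @ (P31 @ np.linalg.pinv(P21)).T         # x~2 = E[x3(x)x1] E[x2(x)x1]^+ x2
    M2 = X1t.T @ X2t / n                              # E[x~1 (x) x~2]
    M3 = np.einsum('na,nb,nc->abc', X1t, X2t, X3) / n # E[x~1 (x) x~2 (x) x3]
    return M2, M3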

slide-33
SLIDE 33

Hidden Markov Models

P[ht+1 = i | ht = j] = Ti,j. E[xt | ht = j] = O ej. π: initial distribution (of h1).

[Figure: Markov chain h1 → h2 → h3 with transitions T and observations x1, x2, x3 emitted through O.]

slide-34
SLIDE 34

Hidden Markov Models

P[ht+1 = i | ht = j] = Ti,j. E[xt | ht = j] = O ej. π: initial distribution (of h1). Three-view model: w := Tπ.

[Figure: Markov chain h1 → h2 → h3 with transitions T and observations x1, x2, x3 emitted through O.]

slide-35
SLIDE 35

Hidden Markov Models

P[ht+1 = i | ht = j] = Ti,j. E[xt | ht = j] = O ej. π: initial distribution (of h1). Three-view model: w := Tπ.

[Figure: Markov chain h1 → h2 → h3 with observations x1, x2, x3.]

E[x1 | h2] = O Diag(π) T⊤ Diag(w)⁻¹ h2,  E[x2 | h2] = O h2,  E[x3 | h2] = O T h2.

slide-36
SLIDE 36

Hidden Markov Models

P[ht+1 = i | ht = j] = Ti,j. E[xt | ht = j] = O ej. π: initial distribution (of h1). Three-view model: w := Tπ.

[Figure: Markov chain h1 → h2 → h3 with observations x1, x2, x3.]

E[x1 | h2] = O Diag(π) T⊤ Diag(w)⁻¹ h2,  E[x2 | h2] = O h2,  E[x3 | h2] = O T h2.

Condition for non-degeneracy

O ∈ Rd×k has full column rank. T is invertible, π and Tπ have positive entries.

slide-37
SLIDE 37

Independent Component Analysis

Independent sources, unknown mixing. Blind source separation. Applications: speech, image, video… k sources, d dimensions.

[Figure: sources h1, …, hk mixed through A into observations x1, …, xd.]

x = Ah + z, z ∼ N(0, σ²I). The sources hi are independent. Form the fourth-order cumulant tensor

M4 := E[x⊗4] − E[xi1 xi2] E[xi3 xi4] − … = Σ_i κi ai ⊗ ai ⊗ ai ⊗ ai.   Kurtosis: κi := E[hi⁴] − 3.

Assumption: sources have non-zero kurtosis (κi ≠ 0).
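For reference, a sketch of the empirical fourth-order cumulant for zero-mean data (zero mean is an assumption here); the Gaussian part E[xi1 xi2] E[xi3 xi4] and its two other pairings are subtracted from the raw fourth moment:

import numpy as np

def fourth_order_cumulant(X):
    """Empirical 4th-order cumulant tensor of zero-mean data X with shape (n, d)."""
    n, d = X.shape
    M4 = np.einsum('ni,nj,nk,nl->ijkl', X, X, X, X) / n   # E[x (x) x (x) x (x) x]
    C = X.T @ X / n                                        # E[x (x) x]
    # subtract the Gaussian (pairwise) part: three pairings of second moments
    M4 -= (np.einsum('ij,kl->ijkl', C, C)
           + np.einsum('ik,jl->ijkl', C, C)
           + np.einsum('il,jk->ijkl', C, C))
    return M4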

slide-38
SLIDE 38

Outline

1. Introduction
2. Latent Variable Models and Moments
3. Community Detection in Graphs
4. Analysis of Tensor Power Method
5. Advanced Topics
6. Conclusion

slide-39
SLIDE 39

Social Networks & Recommender Systems

Social Networks

Network of social ties, e.g. friendships, co-authorships. Hidden: communities of actors.

Recommender Systems

Observed: Ratings of users for various products. Goal: New recommendations. Modeling: User/product groups.

slide-40
SLIDE 40

Network Community Models

How are communities formed? How do communities interact?

slide-41
SLIDE 41

Network Community Models

How are communities formed? How do communities interact?

[Figure: communities with interaction probabilities 0.4, 0.3, 0.3, 0.7, 0.2, 0.1, 0.1, 0.8, 0.1.]

slide-42
SLIDE 42

Network Community Models

How are communities formed? How do communities interact?

[Figure: communities with interaction probabilities 0.4, 0.3, 0.3, 0.7, 0.2, 0.1, 0.1, 0.8, 0.1.]

slide-43
SLIDE 43

Network Community Models

How are communities formed? How do communities interact?

[Figure: communities with interaction probabilities 0.4, 0.3, 0.3, 0.7, 0.2, 0.1, 0.1, 0.8, 0.1; highlighted edge probability 0.9.]

slide-44
SLIDE 44

Network Community Models

How are communities formed? How do communities interact?

[Figure: communities with interaction probabilities 0.4, 0.3, 0.3, 0.7, 0.2, 0.1, 0.1, 0.8, 0.1; highlighted edge probability 0.1.]

slide-45
SLIDE 45

Network Community Models

How are communities formed? How do communities interact?

[Figure: communities with interaction probabilities 0.4, 0.3, 0.3, 0.7, 0.2, 0.1, 0.1, 0.8, 0.1.]

slide-46
SLIDE 46

Mixed Membership Model (Airoldi et al)

k communities and n nodes. Graph G ∈ Rn×n (adjacency matrix). Fractional memberships: πx ∈ ∆k−1 is the membership vector of node x, where ∆k−1 := {π ∈ Rk : π(i) ∈ [0, 1], Σ_i π(i) = 1}, for all x ∈ [n]. Node memberships {πu} are drawn from a Dirichlet distribution.

slide-47
SLIDE 47

Mixed Membership Model (Airoldi et al)

k communities and n nodes. Graph G ∈ Rn×n (adjacency matrix). Fractional memberships: πx ∈ ∆k−1 is the membership vector of node x, where ∆k−1 := {π ∈ Rk : π(i) ∈ [0, 1], Σ_i π(i) = 1}, for all x ∈ [n]. Node memberships {πu} are drawn from a Dirichlet distribution.

Edges are conditionally independent given community memberships: Gi,j ⊥⊥ Ga,b | πi, πj, πa, πb. Edge probability averaged over community memberships: P[Gi,j = 1 | πi, πj] = E[Gi,j | πi, πj] = πi⊤ P πj.

P ∈ Rk×k: average edge connectivity for pure communities.

Airoldi, Blei, Fienberg, and Xing. Mixed membership stochastic blockmodels. J. of Machine Learning Research, June 2008.

slide-48
SLIDE 48

Networks under Community Models

slide-49
SLIDE 49

Networks under Community Models

Stochastic Block Model

α0 = 0

slide-50
SLIDE 50

Networks under Community Models

Stochastic Block Model

α0 = 0

Mixed Membership Model

α0 = 1

slide-51
SLIDE 51

Networks under Community Models

Stochastic Block Model

α0 = 0

Mixed Membership Model

α0 = 10

slide-52
SLIDE 52

Networks under Community Models

Stochastic Block Model

α0 = 0

Mixed Membership Model

α0 = 10

Unifying Assumption

Edges conditionally independent given community memberships

slide-53
SLIDE 53

Subgraph Counts as Graph Moments

slide-54
SLIDE 54

Subgraph Counts as Graph Moments

slide-55
SLIDE 55

Subgraph Counts as Graph Moments

3-star counts sufficient for identifiability and learning of MMSB

slide-56
SLIDE 56

Subgraph Counts as Graph Moments

3-star counts sufficient for identifiability and learning of MMSB

3-Star Count Tensor

M̃3(a, b, c) = (1/|X|) · (# of common neighbors of a, b, c in X) = (1/|X|) Σ_{x∈X} G(x, a) G(x, b) G(x, c).

M̃3 = (1/|X|) Σ_{x∈X} [G⊤x,A ⊗ G⊤x,B ⊗ G⊤x,C]

[Figure: node x ∈ X adjacent to a ∈ A, b ∈ B, c ∈ C.]
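A small numpy sketch of this 3-star count tensor (assuming G is the n × n adjacency matrix and X, A, B, C are disjoint index arrays chosen by the user):

import numpy as np

def three_star_tensor(G, X, A, B, C):
    GA, GB, GC = G[np.ix_(X, A)], G[np.ix_(X, B)], G[np.ix_(X, C)]
    # M3(a, b, c) = (1/|X|) * sum_{x in X} G(x, a) G(x, b) G(x, c)
    return np.einsum('xa,xb,xc->abc', GA, GB, GC) / len(X)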

slide-57
SLIDE 57

Multi-view Representation

Conditional independence of the three views. πx: community membership vector of node x.

[Figure: 3-stars from x ∈ X into the sets A, B, C; graphical model with πx generating G⊤x,A, G⊤x,B, G⊤x,C through U, V, W.]

Linear Multiview Model: E[G⊤x,A | Π] = Π⊤A P⊤ πx = U πx.

slide-58
SLIDE 58

Subgraph Counts as Graph Moments

Second and Third Order Moments

M̂2 := (1/|X|) Σ_x ZC G⊤x,C Gx,B Z⊤B − shift

M̂3 := (1/|X|) Σ_x [G⊤x,A ⊗ ZB G⊤x,B ⊗ ZC G⊤x,C] − shift

Symmetrization matrices: PairsC,B := G⊤X,C ⊗ G⊤X,B,  ZB := Pairs(A, C) (Pairs(B, C))†,  ZC := Pairs(A, B) (Pairs(C, B))†.

[Figure: node x ∈ X with neighbors a ∈ A, b ∈ B, c ∈ C.] Linear Multiview Model: E[G⊤x,A | Π] = U πx.

E[M̂2 | ΠA,B,C] = Σ_i (αi/α0) ui ⊗ ui,   E[M̂3 | ΠA,B,C] = Σ_i (αi/α0) ui ⊗ ui ⊗ ui.

slide-59
SLIDE 59

Outline

1. Introduction
2. Latent Variable Models and Moments
3. Community Detection in Graphs
4. Analysis of Tensor Power Method
5. Advanced Topics
6. Conclusion

slide-60
SLIDE 60

Recap of Tensor Method

M2 = Σ_i wi ai ⊗ ai,   M3 = Σ_i wi ai ⊗ ai ⊗ ai.

Whitening matrix W from the SVD of M2. [Figure: W maps a1, a2, a3 to orthonormal v1, v2, v3.] Multilinear transform: T = M3(W, W, W). [Figure: tensor M3 → tensor T.]

Eigenvectors of T through the power method and deflation: v → T(I, v, v)/‖T(I, v, v)‖.
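A hedged numpy sketch of the whitening step (assuming M2 is positive semidefinite with at least k strictly positive eigenvalues; an eigendecomposition is used in place of the SVD mentioned above):

import numpy as np

def whiten(M2, M3, k):
    evals, evecs = np.linalg.eigh(M2)
    idx = np.argsort(evals)[::-1][:k]                 # top-k eigenpairs of M2
    W = evecs[:, idx] / np.sqrt(evals[idx])           # whitening matrix W, so W.T @ M2 @ W = I
    T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)   # multilinear transform T = M3(W, W, W)
    return W, T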

slide-61
SLIDE 61

Orthogonal Tensor Eigen Decomposition

T = Σ_{i∈[k]} λi vi ⊗ vi ⊗ vi,   ⟨vi, vj⟩ = δi,j ∀ i, j.

T(I, v1, v1) = Σ_i λi ⟨vi, v1⟩² vi = λ1 v1.

vi are eigenvectors of the tensor T.

slide-62
SLIDE 62

Orthogonal Tensor Eigen Decomposition

T = Σ_{i∈[k]} λi vi ⊗ vi ⊗ vi,   ⟨vi, vj⟩ = δi,j ∀ i, j.

T(I, v1, v1) = Σ_i λi ⟨vi, v1⟩² vi = λ1 v1.

vi are eigenvectors of the tensor T.

Tensor Power Method

Start from an initial vector v and iterate v → T(I, v, v)/‖T(I, v, v)‖.
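A minimal numpy sketch of this iteration (single random start; the full method discussed in the talk uses multiple restarts and deflation):

import numpy as np

def power_update(T, v):
    Tv = np.einsum('ijk,j,k->i', T, v, v)       # T(I, v, v)
    return Tv / np.linalg.norm(Tv)

def tensor_power_method(T, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.normal(size=T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = power_update(T, v)
    lam = np.einsum('ijk,i,j,k->', T, v, v, v)  # estimated eigenvalue T(v, v, v)
    return lam, v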

slide-63
SLIDE 63

Orthogonal Tensor Eigen Decomposition

T = Σ_{i∈[k]} λi vi ⊗ vi ⊗ vi,   ⟨vi, vj⟩ = δi,j ∀ i, j.

T(I, v1, v1) = Σ_i λi ⟨vi, v1⟩² vi = λ1 v1.

vi are eigenvectors of the tensor T.

Tensor Power Method

Start from an initial vector v and iterate v → T(I, v, v)/‖T(I, v, v)‖.

Questions

Is there convergence? Does the convergence depend on initialization? What about performance under noise?

slide-64
SLIDE 64

Recap of Matrix Eigen Analysis

For symmetric M ∈ Rk×k, eigen decomposition: M = Σ_i λi vi vi⊤.

Eigenvectors are fixed points: Mv = λv.

◮ In our notation: M(I, v) = λv.

Uniqueness (Identifiability): holds iff the λi are distinct.
slide-65
SLIDE 65

Recap of Matrix Eigen Analysis

For symmetric M ∈ Rk×k, eigen decomposition: M = Σ_i λi vi vi⊤.

Eigenvectors are fixed points: Mv = λv.

◮ In our notation: M(I, v) = λv.

Uniqueness (Identifiability): holds iff the λi are distinct.

Power method: v → M(I, v)/‖M(I, v)‖.

slide-66
SLIDE 66

Recap of Matrix Eigen Analysis

For symmetric M ∈ Rk×k, eigen decomposition: M = Σ_i λi vi vi⊤.

Eigenvectors are fixed points: Mv = λv.

◮ In our notation: M(I, v) = λv.

Uniqueness (Identifiability): holds iff the λi are distinct.

Power method: v → M(I, v)/‖M(I, v)‖.

Convergence properties

Let λ1 > λ2 > · · · > λd and let {vi} form a basis. Let the initialization be v = Σ_i ci vi. If c1 ≠ 0, the power method converges to v1.

slide-67
SLIDE 67

Recap of Matrix Eigen Analysis

For symmetric M ∈ Rk×k, eigen decomposition: M = Σ_i λi vi vi⊤.

Eigenvectors are fixed points: Mv = λv.

◮ In our notation: M(I, v) = λv.

Uniqueness (Identifiability): holds iff the λi are distinct.

Power method: v → M(I, v)/‖M(I, v)‖.

Convergence properties

Let λ1 > λ2 > · · · > λd and let {vi} form a basis. Let the initialization be v = Σ_i ci vi. If c1 ≠ 0, the power method converges to v1.

Perturbation analysis (Davis-Kahan): M + E. Require ‖E‖ < min_{i≠j} |λi − λj|.

slide-68
SLIDE 68

Optimization viewpoint of matrix analysis

M = Σ_{i∈[k]} λi vi ⊗ vi, with λ1 > λ2 > · · · .

Rayleigh quotient at v: M(v, v) = v⊤Mv = Σ_i λi ⟨vi, v⟩².

Optimization problem: max_v M(v, v) s.t. ‖v‖ = 1.

slide-69
SLIDE 69

Optimization viewpoint of matrix analysis

M = Σ_{i∈[k]} λi vi ⊗ vi, with λ1 > λ2 > · · · .

Rayleigh quotient at v: M(v, v) = v⊤Mv = Σ_i λi ⟨vi, v⟩².

Optimization problem: max_v M(v, v) s.t. ‖v‖ = 1. Non-convex problem. The global maximizer is v1 (the top eigenvector).

slide-70
SLIDE 70

Optimization viewpoint of matrix analysis

M = Σ_{i∈[k]} λi vi ⊗ vi, with λ1 > λ2 > · · · .

Rayleigh quotient at v: M(v, v) = v⊤Mv = Σ_i λi ⟨vi, v⟩².

Optimization problem: max_v M(v, v) s.t. ‖v‖ = 1. Non-convex problem. The global maximizer is v1 (the top eigenvector). What are the local optimizers?

slide-71
SLIDE 71

Optimization viewpoint of matrix analysis

Optimization: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v⊤v − 1).

slide-72
SLIDE 72

Optimization viewpoint of matrix analysis

Optimization: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v⊤v − 1). First derivative: ∇L(v, λ) = 2(M(I, v) − λv).

slide-73
SLIDE 73

Optimization viewpoint of matrix analysis

Optimization: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v⊤v − 1). First derivative: ∇L(v, λ) = 2(M(I, v) − λv). Stationary points are eigenvectors: ∇L(v, λ) = 0.

slide-74
SLIDE 74

Optimization viewpoint of matrix analysis

Optimization: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v⊤v − 1). First derivative: ∇L(v, λ) = 2(M(I, v) − λv). Stationary points are eigenvectors: ∇L(v, λ) = 0. The power method v → M(I, v)/‖M(I, v)‖ is a version of gradient ascent.

slide-75
SLIDE 75

Optimization viewpoint of matrix analysis

Optimization: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v⊤v − 1). First derivative: ∇L(v, λ) = 2(M(I, v) − λv). Stationary points are eigenvectors: ∇L(v, λ) = 0. The power method v → M(I, v)/‖M(I, v)‖ is a version of gradient ascent.

Second derivative: ∇²L(v, λ) = 2(M − λI).

slide-76
SLIDE 76

Optimization viewpoint of matrix analysis

Optimization: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v⊤v − 1). First derivative: ∇L(v, λ) = 2(M(I, v) − λv). Stationary points are eigenvectors: ∇L(v, λ) = 0. The power method v → M(I, v)/‖M(I, v)‖ is a version of gradient ascent.

Second derivative: ∇²L(v, λ) = 2(M − λI).

Local optimality condition for constrained optimization

w⊤∇²L(v, λ)w < 0 for all w ⊥ v, at a stationary point v.

slide-77
SLIDE 77

Optimization viewpoint of matrix analysis

Optimization: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v⊤v − 1). First derivative: ∇L(v, λ) = 2(M(I, v) − λv). Stationary points are eigenvectors: ∇L(v, λ) = 0. The power method v → M(I, v)/‖M(I, v)‖ is a version of gradient ascent.

Second derivative: ∇²L(v, λ) = 2(M − λI).

Local optimality condition for constrained optimization

w⊤∇²L(v, λ)w < 0 for all w ⊥ v, at a stationary point v. Verify: v1 is the only local optimum. Verify: all other eigenvectors are saddle points.

slide-78
SLIDE 78

Optimization viewpoint of matrix analysis

Optimization: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v⊤v − 1). First derivative: ∇L(v, λ) = 2(M(I, v) − λv). Stationary points are eigenvectors: ∇L(v, λ) = 0. The power method v → M(I, v)/‖M(I, v)‖ is a version of gradient ascent.

Second derivative: ∇²L(v, λ) = 2(M − λI).

Local optimality condition for constrained optimization

w⊤∇²L(v, λ)w < 0 for all w ⊥ v, at a stationary point v. Verify: v1 is the only local optimum. Verify: all other eigenvectors are saddle points. The power method recovers v1 when the initialization v satisfies ⟨v, v1⟩ ≠ 0.

slide-79
SLIDE 79

Analysis of Tensor Power Method

T = Σ_{i∈[k]} λi vi ⊗ vi ⊗ vi.

Bad news about tensors

Decomposition may not always exist for general tensors. Finding the decomposition is NP-hard in general.

slide-80
SLIDE 80

Analysis of Tensor Power Method

T = Σ_{i∈[k]} λi vi ⊗ vi ⊗ vi.

Bad news about tensors

Decomposition may not always exist for general tensors. Finding the decomposition is NP-hard in general. We will see that a tractable case is when we are promised that an orthogonal decomposition exists.
slide-81
SLIDE 81

Analysis of Tensor Power Method

T = Σ_{i∈[k]} λi vi ⊗ vi ⊗ vi.

Bad news about tensors

Decomposition may not always exist for general tensors. Finding the decomposition is NP-hard in general. We will see that a tractable case is when we are promised that an orthogonal decomposition exists.

Characterization of components {vi}

{vi} are eigenvectors: T(I, vi, vi) = λivi.

slide-82
SLIDE 82

Analysis of Tensor Power Method

T = Σ_{i∈[k]} λi vi ⊗ vi ⊗ vi.

Bad news about tensors

Decomposition may not always exist for general tensors. Finding the decomposition is NP-hard in general. We will see that a tractable case is when we are promised that an orthogonal decomposition exists.

Characterization of components {vi}

{vi} are eigenvectors: T(I, vi, vi) = λi vi. Bad news: there can be other eigenvectors (unlike the matrix case). With λi ≡ 1, v = (v1 + v2)/√2 satisfies T(I, v, v) = (1/√2) v.

slide-83
SLIDE 83

Analysis of Tensor Power Method

T = Σ_{i∈[k]} λi vi ⊗ vi ⊗ vi.

Bad news about tensors

Decomposition may not always exist for general tensors. Finding the decomposition is NP-hard in general. We will see that a tractable case is when we are promised that an orthogonal decomposition exists.

Characterization of components {vi}

{vi} are eigenvectors: T(I, vi, vi) = λi vi. Bad news: there can be other eigenvectors (unlike the matrix case). With λi ≡ 1, v = (v1 + v2)/√2 satisfies T(I, v, v) = (1/√2) v. How do we avoid spurious solutions (not part of the decomposition)?

slide-84
SLIDE 84

Optimization viewpoint of tensor analysis

Optimization: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − 1.5 λ(v⊤v − 1).

slide-85
SLIDE 85

Optimization viewpoint of tensor analysis

Optimization: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − 1.5 λ(v⊤v − 1). First derivative: ∇L(v, λ) = 3(T(I, v, v) − λv).

slide-86
SLIDE 86

Optimization viewpoint of tensor analysis

Optimization: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − 1.5 λ(v⊤v − 1). First derivative: ∇L(v, λ) = 3(T(I, v, v) − λv). Stationary points are eigenvectors: ∇L(v, λ) = 0.

slide-87
SLIDE 87

Optimization viewpoint of tensor analysis

Optimization: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − 1.5 λ(v⊤v − 1). First derivative: ∇L(v, λ) = 3(T(I, v, v) − λv). Stationary points are eigenvectors: ∇L(v, λ) = 0. The power method v → T(I, v, v)/‖T(I, v, v)‖ is a version of gradient ascent.

slide-88
SLIDE 88

Optimization viewpoint of tensor analysis

Optimization: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − 1.5 λ(v⊤v − 1). First derivative: ∇L(v, λ) = 3(T(I, v, v) − λv). Stationary points are eigenvectors: ∇L(v, λ) = 0. The power method v → T(I, v, v)/‖T(I, v, v)‖ is a version of gradient ascent.

Second derivative: ∇²L(v, λ) = 3(2 T(I, I, v) − λI).

slide-89
SLIDE 89

Optimization viewpoint of tensor analysis

Optimization: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − 1.5 λ(v⊤v − 1). First derivative: ∇L(v, λ) = 3(T(I, v, v) − λv). Stationary points are eigenvectors: ∇L(v, λ) = 0. The power method v → T(I, v, v)/‖T(I, v, v)‖ is a version of gradient ascent.

Second derivative: ∇²L(v, λ) = 3(2 T(I, I, v) − λI).

Local optimality condition for constrained optimization

w⊤∇²L(v, λ)w < 0 for all w ⊥ v, at a stationary point v.

slide-90
SLIDE 90

Optimization viewpoint of tensor analysis

Optimization: max

v

T(v, v, v) s.t. v = 1. Lagrangian: L(v, λ) := T(v, v, v) − 1.5λ(v⊤v − 1). First derivative: ∇L(v, λ) = 3(T(I, v, v) − λv). Stationary points are eigenvectors: ∇L(v, λ) = 0. Power method v →

T(I,v,v) T(I,v,v) is a version of gradient ascent.

Second derivative: ∇2L(v, λ) = 3(2T(I, I, v) − λI).

Local optimality condition for constrained optimization

w⊤∇2L(v, λ)w < 0 for all w ⊥ v, at a stationary point v. Verify: {vi} are the only local optima. Verify: All other eigenvectors are saddle points.

slide-91
SLIDE 91

Optimization viewpoint of tensor analysis

Optimization: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − 1.5 λ(v⊤v − 1). First derivative: ∇L(v, λ) = 3(T(I, v, v) − λv). Stationary points are eigenvectors: ∇L(v, λ) = 0. The power method v → T(I, v, v)/‖T(I, v, v)‖ is a version of gradient ascent.

Second derivative: ∇²L(v, λ) = 3(2 T(I, I, v) − λI).

Local optimality condition for constrained optimization

w⊤∇²L(v, λ)w < 0 for all w ⊥ v, at a stationary point v. Verify: {vi} are the only local optima. Verify: all other eigenvectors are saddle points. For an orthogonal tensor, there are no spurious local optima!

slide-92
SLIDE 92

Review: matrix power iteration

Recall matrix power iteration for a matrix M := Σ_i λi vi vi⊤:

Start with some v, and for j = 1, 2, …: v → Mv = Σ_i λi ⟨vi, v⟩ vi,

i.e., the component in the vi direction is scaled by λi.
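For comparison with the tensor iteration discussed next, here is a plain numpy sketch of matrix power iteration (illustration only, not from the talk):

import numpy as np

def matrix_power_method(M, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.normal(size=M.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = M @ v
        v /= np.linalg.norm(v)
    return v @ M @ v, v        # Rayleigh quotient and top-eigenvector estimate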

slide-93
SLIDE 93

Review: matrix power iteration

Recall matrix power iteration for a matrix M := Σ_i λi vi vi⊤:

Start with some v, and for j = 1, 2, …: v → Mv = Σ_i λi ⟨vi, v⟩ vi,

i.e., the component in the vi direction is scaled by λi.

If λ1 > λ2 ≥ · · ·, then after t iterations

⟨v1, v⟩² / Σ_i ⟨vi, v⟩² ≥ 1 − k (λ2/λ1)^(2t).

Converges linearly to v1, assuming a gap λ2/λ1 < 1.

slide-94
SLIDE 94

Tensor power iteration convergence analysis

Let ci := ⟨vi, v⟩ be the initial component in the vi direction; assume WLOG λ1|c1| > λ2|c2| ≥ λ3|c3| ≥ · · ·.

slide-95
SLIDE 95

Tensor power iteration convergence analysis

Let ci := ⟨vi, v⟩ be the initial component in the vi direction; assume WLOG λ1|c1| > λ2|c2| ≥ λ3|c3| ≥ · · ·.

Then v → Σ_i λi ⟨vi, v⟩² vi = Σ_i λi ci² vi,

i.e., the component in the vi direction is squared and then scaled by λi.

slide-96
SLIDE 96

Tensor power iteration convergence analysis

Let ci := ⟨vi, v⟩ be the initial component in the vi direction; assume WLOG λ1|c1| > λ2|c2| ≥ λ3|c3| ≥ · · ·.

Then v → Σ_i λi ⟨vi, v⟩² vi = Σ_i λi ci² vi,

i.e., the component in the vi direction is squared and then scaled by λi.

By induction, after t iterations v ∝ Σ_i λi^(2^t − 1) ci^(2^t) vi, so

⟨v1, v⟩² / Σ_i ⟨vi, v⟩² ≥ 1 − k (λ1 / max_{i≠1} λi)² (λ2|c2| / (λ1|c1|))^(2^(t+1)).
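A toy numerical illustration of this "squaring" behaviour (values invented): evolve the coefficients ci under the map ci → λi ci², renormalizing each step; the mass collapses quadratically fast onto the index with the largest λi|ci|:

import numpy as np

lam = np.array([1.0, 0.9, 0.8])
c = np.array([0.6, 0.55, 0.58])           # initial components <v_i, v>
c /= np.linalg.norm(c)
for t in range(6):
    c = lam * c**2                         # component-wise squaring and scaling
    c /= np.linalg.norm(c)
    print(t, np.round(c, 6))               # mass collapses onto the first entry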

slide-97
SLIDE 97

Matrix vs. tensor power iteration

Matrix power iteration: Tensor power iteration:

slide-98
SLIDE 98

Matrix vs. tensor power iteration

Matrix power iteration:

1. Requires a gap between the largest and second-largest eigenvalue. A property of the matrix only.

Tensor power iteration:

1. Requires a gap between the largest and second-largest λi|ci|. A property of the tensor and of the initialization v.

slide-99
SLIDE 99

Matrix vs. tensor power iteration

Matrix power iteration:

1. Requires a gap between the largest and second-largest eigenvalue. A property of the matrix only.
2. Converges to the top eigenvector.

Tensor power iteration:

1. Requires a gap between the largest and second-largest λi|ci|. A property of the tensor and of the initialization v.
2. Converges to the vi for which λi|ci| is largest; this could be any of them.

slide-100
SLIDE 100

Matrix vs. tensor power iteration

Matrix power iteration:

1. Requires a gap between the largest and second-largest eigenvalue. A property of the matrix only.
2. Converges to the top eigenvector.
3. Linear convergence: needs O(log(1/ε)) iterations.

Tensor power iteration:

1. Requires a gap between the largest and second-largest λi|ci|. A property of the tensor and of the initialization v.
2. Converges to the vi for which λi|ci| is largest; this could be any of them.
3. Quadratic convergence: needs O(log log(1/ε)) iterations.

slide-101
SLIDE 101

Perturbation Analysis

T̂ = T + E,   T = Σ_i λi vi ⊗ vi ⊗ vi,   ‖E‖ := max_{x: ‖x‖=1} |E(x, x, x)| ≤ ε.

Theorem: Let N be the number of iterations. If N ≥ log k + log log(λmax/ε) and ε < λmin/k, then the output (v, λ) (after polynomially many restarts) satisfies

‖v − v1‖ ≤ O(ε/λ1),   |λ − λ1| ≤ O(ε),

where v1 is such that λ1|c1| > λ2|c2| > …, ci := ⟨vi, v⟩, and v is the (successful) initializer. Careful analysis of deflation: avoid buildup of errors. Implies polynomial sample complexity for learning.
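A self-contained numpy sketch of power iteration with deflation in the spirit of the theorem above (a single random start per component here, rather than the polynomially many restarts the statement assumes):

import numpy as np

def deflate_all(T, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    pairs, T_work = [], T.copy()
    for _ in range(k):
        v = rng.normal(size=T.shape[0])
        v /= np.linalg.norm(v)
        for _ in range(n_iter):
            v = np.einsum('ijk,j,k->i', T_work, v, v)     # T(I, v, v)
            v /= np.linalg.norm(v)
        lam = np.einsum('ijk,i,j,k->', T_work, v, v, v)    # eigenvalue estimate
        pairs.append((lam, v))
        T_work = T_work - lam * np.einsum('i,j,k->ijk', v, v, v)   # subtract the rank-1 term
    return pairs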

slide-102
SLIDE 102

Outline

1. Introduction
2. Latent Variable Models and Moments
3. Community Detection in Graphs
4. Analysis of Tensor Power Method
5. Advanced Topics
6. Conclusion

slide-103
SLIDE 103

Beyond Orthogonal Tensor Decomposition

a ⊗ a ⊗ a is a rank-1 tensor whose (i1, i2, i3)-th entry is a(i1) · a(i2) · a(i3). For a tensor T, find a decomposition into rank-one terms: T = Σ_{j∈[k]} wj aj ⊗ aj ⊗ aj, aj ∈ S^(d−1).

[Figure: Tensor T = w1 · a1 ⊗ a1 ⊗ a1 + w2 · a2 ⊗ a2 ⊗ a2 + ….]

k: tensor rank, d: ambient dimension. k > d: overcomplete. A is incoherent: ⟨ai, aj⟩ ∼ 1/√d for i ≠ j.

Guaranteed recovery when k = o(d^1.5).

“Guaranteed Non-Orthogonal Tensor Decomposition via Alternating Rank-1 Updates” by A., R. Ge, M. Janzamin. Preprint, Feb. 2014. “Provable Learning of Overcomplete Latent Variable Models: Semi-supervised & Unsupervised”.

slide-104
SLIDE 104

Semi-supervised Learning of Gaussian Mixtures

n unlabeled samples; mj: number of labeled samples for component j.

  • No. of mixture components: k = o(d^1.5)
  • No. of labeled samples: mj = Ω̃(1).
  • No. of unlabeled samples: n = Ω̃(k).

Our result: achieved error with n unlabeled samples

max_i ‖âi − ai‖ = Õ(√(k/n)) + Õ(√k / d)

Can handle (polynomially) overcomplete mixtures. Extremely small number of labeled samples: polylog(d). Sample complexity is tight: need Ω̃(k) samples! Approximation error: decaying in high dimensions.

slide-105
SLIDE 105

Unsupervised Learning of Gaussian Mixtures

Conditions for recovery

  • No. of mixture components: k = C · d
  • No. of unlabeled samples: n = Ω̃(k · d). Computational complexity: Õ(e^(C²)).

Our result: achieved error with n unlabeled samples

max_i ‖âi − ai‖ = Õ(√(k/n)) + Õ(√k / d)

Error: same as before, for the semi-supervised setting. Sample complexity: worse than semi-supervised, but better than previous works (no dependence on the condition number of A). Computational complexity: polynomial when k = Θ(d).

slide-106
SLIDE 106

Learning Overcomplete Dictionaries

[Figure: Y = AX, with Y ∈ Rd×n, A ∈ Rd×k, X ∈ Rk×n.]

Linear model: Y = AX, both A and X unknown. Sparse X: each column is randomly s-sparse. Overcomplete dictionary A ∈ Rd×k: k ≥ d. Incoherence: max_{i≠j} |⟨ai, aj⟩| ≈ 0 (satisfied by random vectors).

“Learning Sparsely Used Overcomplete Dictionaries” by A. Agarwal, A., P. Jain, P. Netrapalli, and R. Tandon. COLT 2014.
slide-107
SLIDE 107

Experiments on MNIST

[Figure: MNIST digits: original, reconstruction, and learnt representation.]

slide-108
SLIDE 108

Outline

1. Introduction
2. Latent Variable Models and Moments
3. Community Detection in Graphs
4. Analysis of Tensor Power Method
5. Advanced Topics
6. Conclusion

slide-109
SLIDE 109

Conclusion

Guaranteed Learning of Latent Variable Models

Guaranteed to recover the correct model. Efficient sample and computational complexities. Better performance compared to EM, Variational Bayes, etc. Tensor approach: mixed membership communities, topic models, latent trees… Sparsity-based approach: overcomplete models, e.g. sparse coding and topic models.

[Figure: Y = AX.]

Tomorrow’s lecture

Implementation of tensor approaches.