

slide-1
SLIDE 1

Guaranteed Learning of Latent Variable Models through Spectral and Tensor Methods

Anima Anandkumar

U.C. Irvine

slide-2
SLIDE 2

Guaranteed Unsupervised Learning

Unsupervised Learning: no labeled samples available for training.

slide-3
SLIDE 3

Guaranteed Unsupervised Learning

Unsupervised Learning: no labeled samples available for training.

Challenge: Conditions for Identifiability

When can the model be identified (given infinite computation and data)? Does identifiability also lead to tractable algorithms?

slide-4
SLIDE 4

Guaranteed Unsupervised Learning

Unsupervised Learning: no labeled samples available for training.

Challenge: Conditions for Identifiability

When can the model be identified (given infinite computation and data)? Does identifiability also lead to tractable algorithms?

Challenge: Efficient Learning of Latent Variable Models

Maximum likelihood is NP-hard. In practice, EM and Variational Bayes have no consistency guarantees. Efficient computational and sample complexities?

slide-5
SLIDE 5

Guaranteed Unsupervised Learning

Unsupervised Learning: no labeled samples available for training.

Challenge: Conditions for Identifiability

When can the model be identified (given infinite computation and data)? Does identifiability also lead to tractable algorithms?

Challenge: Efficient Learning of Latent Variable Models

Maximum likelihood is NP-hard. In practice, EM and Variational Bayes have no consistency guarantees. Efficient computational and sample complexities? In this series: guaranteed and efficient learning through spectral methods.

slide-6
SLIDE 6

Probabilistic Models

Latent Variable Models

Concise statistical description through graphical modeling: conditional independence relationships or a hierarchy of variables.

[Figure: observed variable x and hidden variable h.]

slide-7
SLIDE 7

Probabilistic Models

Latent Variable Models

Concise statistical description through graphical modeling: conditional independence relationships or a hierarchy of variables.

[Figure: observed variables x1, …, x5 with a single hidden variable h.]

slide-8
SLIDE 8

Probabilistic Models

Latent Variable Models

Concise statistical description through graphical modeling: conditional independence relationships or a hierarchy of variables.

[Figure: observed variables x1, …, x5 with hidden variables h1, h2, h3.]

slide-9
SLIDE 9

Probabilistic Models

Latent Variable Models

Concise statistical description through graphical modeling: conditional independence relationships or a hierarchy of variables.

[Figure: observed variables x1, …, x5 with hidden variables h1, h2, h3.]

Maximum Likelihood vs. Moment method

Finding MLE is NP-hard in general. Expectation maximization (EM) converges to a local optimum.

slide-10
SLIDE 10

Probabilistic Models

Latent Variable Models

Concise statistical description through graphical modeling: conditional independence relationships or a hierarchy of variables.

[Figure: observed variables x1, …, x5 with hidden variables h1, h2, h3.]

Maximum Likelihood vs. Moment method

Finding MLE is NP-hard in general. Expectation maximization (EM) converges to a local optimum. Moment estimate: polynomial computational & sample complexity. Le Cam theory: Newton-Raphson on the moment estimate leads to an asymptotically efficient estimator. Scalable implementation: linear and multilinear algebraic operations.

slide-11
SLIDE 11

Game Plan: In this talk

Recall Yesterday’s Talk

Gaussian mixtures and (single) topic models. Analysis of third order moments. Tensor decomposition method: whitening and power method.

slide-12
SLIDE 12

Game Plan: In this talk

Recall Yesterday’s Talk

Gaussian mixtures and (single) topic models. Analysis of third order moments. Tensor decomposition method: whitening and power method.

Today’s talk

Moments for various latent variable models. Analysis of tensor power method.

slide-13
SLIDE 13

Game Plan: In this talk

Recall Yesterday’s Talk

Gaussian mixtures and (single) topic models. Analysis of third order moments. Tensor decomposition method: whitening and power method.

Today’s talk

Moments for various latent variable models. Analysis of tensor power method.

Tomorrow’s talk

Implementation of tensor method.

slide-14
SLIDE 14

Outline

1. Introduction
2. Latent Variable Models and Moments
3. Community Detection in Graphs
4. Analysis of Tensor Power Method
5. Advanced Topics
6. Conclusion

slide-15
SLIDE 15

Recap: Gaussian Mixtures and (single) Topic Models

(spherical) Mixture of Gaussians: k means a1, …, ak. Component h = i with prob. wi. Observe x with spherical noise: x = ai + z, z ∼ N(0, σi² I).

(single) Topic Models: k topics a1, …, ak. Topic h = i with prob. wi. Observe ℓ (exchangeable) words x1, x2, …, xℓ drawn i.i.d. from ai.

Unified Linear Model: E[x|h] = Ah. Gaussian mixture: single view, spherical noise. Topic model: multi-view, heteroskedastic noise.

M3 = Σ_i wi ai ⊗ ai ⊗ ai,   M2 = Σ_i wi ai ⊗ ai.
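To make the notation concrete, here is a small numpy sketch (not from the talk; the dimensions, weights w, and component means A are made-up placeholders) that forms the population moments M2 and M3 from the columns a_i of A:

import numpy as np

d, k = 10, 3
rng = np.random.default_rng(0)
A = rng.normal(size=(d, k))                    # hypothetical component means a_1, ..., a_k (columns)
w = np.ones(k) / k                             # hypothetical mixing weights w_i

M2 = np.einsum('i,ai,bi->ab', w, A, A)         # sum_i w_i a_i (x) a_i
M3 = np.einsum('i,ai,bi,ci->abc', w, A, A, A)  # sum_i w_i a_i (x) a_i (x) a_i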

slide-16
SLIDE 16

Recap: Geometric Picture for Topic Models

Topic models are exchangeable multiview models. M2 = E[x1 ⊗ x2]. M3 = E[x1 ⊗ x2 ⊗ x3].

[Figure: document generated from the topic proportions vector h.]

slide-17
SLIDE 17

Recap: Geometric Picture for Topic Models

Topic models are exchangeable multiview models. M2 = E[x1 ⊗ x2]. M3 = E[x1 ⊗ x2 ⊗ x3]. [Figure: single topic h.]

slide-18
SLIDE 18

Recap: Geometric Picture for Topic Models

Topic models are exchangeable multiview models. M2 = E[x1 ⊗ x2]. M3 = E[x1 ⊗ x2 ⊗ x3]. [Figure: topic proportions vector h.]

slide-19
SLIDE 19

Recap: Geometric Picture for Topic Models

Topic models are exchangeable multiview models. M2 = E[x1 ⊗ x2]. M3 = E[x1 ⊗ x2 ⊗ x3]. [Figure: topic proportions vector h; words x1, x2, x3 generated through A.]

slide-20
SLIDE 20

Latent Dirichlet Allocation

ℓ words in a document: x1, …, xℓ. Word xi is generated from topic yi. Exchangeability: x1 ⊥⊥ x2 ⊥⊥ … | h. A(i, j) := P[xm = i | ym = j]: topic-word matrix.

[Figure: words x1, …, x5 drawn from topics y1, …, y5 through A, with topic mixture h.]

If there are k topics, h is distributed over the simplex ∆k−1 := {h ∈ Rk : hi ∈ [0, 1], Σ_i hi = 1}. Latent Dirichlet Allocation: h is drawn from a Dirichlet distribution.

slide-21
SLIDE 21

Dirichlet Distribution

P[h] ∝ Π_{j=1}^k h(j)^(αj − 1),  with Σ_{j=1}^k h(j) = 1.

slide-22
SLIDE 22

Dirichlet Distribution

P[h] ∝ Π_{j=1}^k h(j)^(αj − 1),  with Σ_{j=1}^k h(j) = 1.

[Figure: Dir(α).]

slide-23
SLIDE 23

Dirichlet Distribution

P[h] ∝ Π_{j=1}^k h(j)^(αj − 1),  with Σ_{j=1}^k h(j) = 1.

[Figure: Dir(α).]

Dirichlet concentration parameter α0 := Σ_j αj. Sparsity level in h is O(α0).

slide-24
SLIDE 24

Dirichlet Distribution

P[h] ∝ Π_{j=1}^k h(j)^(αj − 1),  with Σ_{j=1}^k h(j) = 1.

[Figure: αj → 0.]

Dirichlet concentration parameter α0 := Σ_j αj. Sparsity level in h is O(α0).

slide-25
SLIDE 25

Dirichlet Distribution

P[h] ∝ Π_{j=1}^k h(j)^(αj − 1),  with Σ_{j=1}^k h(j) = 1.

[Figure: αj < 1.]

Dirichlet concentration parameter α0 := Σ_j αj. Sparsity level in h is O(α0).

slide-26
SLIDE 26

Dirichlet Distribution

P[h] ∝ Π_{j=1}^k h(j)^(αj − 1),  with Σ_{j=1}^k h(j) = 1.

[Figure: large αj.]

Dirichlet concentration parameter α0 := Σ_j αj. Sparsity level in h is O(α0).

slide-27
SLIDE 27

Dirichlet Distribution

P[h] ∝ Π_{j=1}^k h(j)^(αj − 1),  with Σ_{j=1}^k h(j) = 1.

[Figure: αj → ∞.]

Dirichlet concentration parameter α0 := Σ_j αj. Sparsity level in h is O(α0).

slide-28
SLIDE 28

Dirichlet Distribution

P[h] ∝ Π_{j=1}^k h(j)^(αj − 1),  with Σ_{j=1}^k h(j) = 1.

[Figure: Dir(α).]

Dirichlet concentration parameter α0 := Σ_j αj. Sparsity level in h is O(α0).
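As an illustration of the sparsity comment above, a short numpy sketch (values chosen arbitrarily) draws h ∼ Dir(α) for a few concentration levels α0; small α0 yields a nearly one-hot h, large α0 a nearly uniform h:

import numpy as np

rng = np.random.default_rng(0)
k = 5
for alpha0 in (0.1, 1.0, 100.0):
    alpha = np.full(k, alpha0 / k)     # symmetric Dirichlet parameters summing to alpha0
    h = rng.dirichlet(alpha)
    print(alpha0, np.round(h, 3))      # small alpha0 -> mass concentrated on few coordinates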

slide-29
SLIDE 29

Moments under LDA

M2 := E[x1 ⊗ x2] − (α0/(α0 + 1)) E[x1] ⊗ E[x1]

M3 := E[x1 ⊗ x2 ⊗ x3] − (α0/(α0 + 2)) E[x1 ⊗ x2 ⊗ E[x1]] − more stuff…

Then M2 = Σ_i w̃i ai ⊗ ai and M3 = Σ_i w̃i ai ⊗ ai ⊗ ai. Three words per document suffice for learning LDA.
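A hedged sketch of the empirical version of the adjusted second moment above (assuming the first two words of each document are available as one-hot row vectors X1, X2 and that α0 is known; the function name is mine, not from the talk):

import numpy as np

def lda_m2(X1, X2, alpha0):
    """X1, X2: (n, d) one-hot encodings of the 1st and 2nd word of n documents."""
    n = X1.shape[0]
    cross = X1.T @ X2 / n              # empirical E[x1 (x) x2]
    mean1 = X1.mean(axis=0)            # empirical E[x1]
    return cross - alpha0 / (alpha0 + 1.0) * np.outer(mean1, mean1)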

slide-30
SLIDE 30

General Multiview Mixtures (Naive Bayes)

E[xi|h] = Aih and multiple views.

slide-31
SLIDE 31

General Multiview Mixtures (Naive Bayes)

E[xi | h] = Ai h, with multiple views. Symmetrize the views: x̃1 := E[x3 ⊗ x2] E[x1 ⊗ x2]† x1,  x̃2 := E[x3 ⊗ x1] E[x2 ⊗ x1]† x2. Then M2 := E[x̃1 ⊗ x̃2],  M3 := E[x̃1 ⊗ x̃2 ⊗ x3].

slide-32
SLIDE 32

General Multiview Mixtures (Naive Bayes)

E[xi | h] = Ai h, with multiple views. Symmetrize the views: x̃1 := E[x3 ⊗ x2] E[x1 ⊗ x2]† x1,  x̃2 := E[x3 ⊗ x1] E[x2 ⊗ x1]† x2. Then M2 := E[x̃1 ⊗ x̃2],  M3 := E[x̃1 ⊗ x̃2 ⊗ x3].

M2 = Σ_i wi a3,i ⊗ a3,i,   M3 = Σ_i wi a3,i ⊗ a3,i ⊗ a3,i.
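A minimal numpy sketch of the symmetrization as reconstructed above (my own illustration: X1, X2, X3 hold n samples of the three views as rows, and empirical averages and pseudo-inverses stand in for the population quantities):

import numpy as np

def symmetrize_views(X1, X2, X3):
    n = X1.shape[0]
    P12, P21 = X1.T @ X2 / n, X2.T @ X1 / n          # E[x1 (x) x2], E[x2 (x) x1]
    P31, P32 = X3.T @ X1 / n, X3.T @ X2 / n          # E[x3 (x) x1], E[x3 (x) x2]
    X1t = X1 @ (P32 @ np.linalg.pinv(P12)).T         # x~1 = E[x3(x)x2] E[x1(x)x2]^+ x1
    X2t = X2 @ (P31 @ np.linalg.pinv(P21)).T         # x~2 = E[x3(x)x1] E[x2(x)x1]^+ x2
    M2 = X1t.T @ X2t / n                              # E[x~1 (x) x~2]
    M3 = np.einsum('na,nb,nc->abc', X1t, X2t, X3) / n # E[x~1 (x) x~2 (x) x3]
    return M2, M3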

slide-33
SLIDE 33

Hidden Markov Models

P[ht+1 = i | ht = j] = Ti,j. E[xt | ht = j] = O ej. π: initial distribution (of h1).

[Figure: Markov chain h1 → h2 → h3 with transitions T and observations x1, x2, x3 emitted through O.]

slide-34
SLIDE 34

Hidden Markov Models

P[ht+1 = i | ht = j] = Ti,j. E[xt | ht = j] = O ej. π: initial distribution (of h1). Three-view model: w := Tπ.

[Figure: Markov chain h1 → h2 → h3 with transitions T and observations x1, x2, x3 emitted through O.]

slide-35
SLIDE 35

Hidden Markov Models

P[ht+1 = i | ht = j] = Ti,j. E[xt | ht = j] = O ej. π: initial distribution (of h1). Three-view model: w := Tπ.

[Figure: Markov chain h1 → h2 → h3 with observations x1, x2, x3.]

E[x1 | h2] = O Diag(π) T⊤ Diag(w)⁻¹ h2,  E[x2 | h2] = O h2,  E[x3 | h2] = O T h2.

slide-36
SLIDE 36

Hidden Markov Models

P[ht+1 = i | ht = j] = Ti,j. E[xt | ht = j] = O ej. π: initial distribution (of h1). Three-view model: w := Tπ.

[Figure: Markov chain h1 → h2 → h3 with observations x1, x2, x3.]

E[x1 | h2] = O Diag(π) T⊤ Diag(w)⁻¹ h2,  E[x2 | h2] = O h2,  E[x3 | h2] = O T h2.

Condition for non-degeneracy

O ∈ Rd×k has full column rank. T is invertible, π and Tπ have positive entries.

slide-37
SLIDE 37

Independent Component Analysis

Independent sources, unknown mixing. Blind source separation. Applications: speech, image, video… k sources, d dimensions.

[Figure: sources h1, …, hk mixed through A into observations x1, …, xd.]

x = Ah + z, z ∼ N(0, σ²I). The sources hi are independent. Form the fourth-order cumulant tensor

M4 := E[x⊗4] − E[xi1 xi2] E[xi3 xi4] − … = Σ_i κi ai ⊗ ai ⊗ ai ⊗ ai.   Kurtosis: κi := E[hi⁴] − 3.

Assumption: sources have non-zero kurtosis (κi ≠ 0).
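For reference, a sketch of the empirical fourth-order cumulant for zero-mean data (zero mean is an assumption here); the Gaussian part E[xi1 xi2] E[xi3 xi4] and its two other pairings are subtracted from the raw fourth moment:

import numpy as np

def fourth_order_cumulant(X):
    """Empirical 4th-order cumulant tensor of zero-mean data X with shape (n, d)."""
    n, d = X.shape
    M4 = np.einsum('ni,nj,nk,nl->ijkl', X, X, X, X) / n   # E[x (x) x (x) x (x) x]
    C = X.T @ X / n                                        # E[x (x) x]
    # subtract the Gaussian (pairwise) part: three pairings of second moments
    M4 -= (np.einsum('ij,kl->ijkl', C, C)
           + np.einsum('ik,jl->ijkl', C, C)
           + np.einsum('il,jk->ijkl', C, C))
    return M4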

slide-38
SLIDE 38

Outline

1. Introduction
2. Latent Variable Models and Moments
3. Community Detection in Graphs
4. Analysis of Tensor Power Method
5. Advanced Topics
6. Conclusion

slide-39
SLIDE 39

Social Networks & Recommender Systems

Social Networks

Network of social ties, e.g. friendships, co-authorships. Hidden: communities of actors.

Recommender Systems

Observed: Ratings of users for various products. Goal: New recommendations. Modeling: User/product groups.

slide-40
SLIDE 40

Network Community Models

How are communities formed? How do communities interact?

slide-41
SLIDE 41

Network Community Models

How are communities formed? How do communities interact?

[Figure: communities with interaction probabilities 0.4, 0.3, 0.3, 0.7, 0.2, 0.1, 0.1, 0.8, 0.1.]

slide-42
SLIDE 42

Network Community Models

How are communities formed? How do communities interact?

[Figure: communities with interaction probabilities 0.4, 0.3, 0.3, 0.7, 0.2, 0.1, 0.1, 0.8, 0.1.]

slide-43
SLIDE 43

Network Community Models

How are communities formed? How do communities interact?

[Figure: communities with interaction probabilities 0.4, 0.3, 0.3, 0.7, 0.2, 0.1, 0.1, 0.8, 0.1; highlighted edge probability 0.9.]

slide-44
SLIDE 44

Network Community Models

How are communities formed? How do communities interact?

[Figure: communities with interaction probabilities 0.4, 0.3, 0.3, 0.7, 0.2, 0.1, 0.1, 0.8, 0.1; highlighted edge probability 0.1.]

slide-45
SLIDE 45

Network Community Models

How are communities formed? How do communities interact?

[Figure: communities with interaction probabilities 0.4, 0.3, 0.3, 0.7, 0.2, 0.1, 0.1, 0.8, 0.1.]

slide-46
SLIDE 46

Mixed Membership Model (Airoldi et al)

k communities and n nodes. Graph G ∈ Rn×n (adjacency matrix). Fractional memberships: πx ∈ ∆k−1 is the membership vector of node x, where ∆k−1 := {π ∈ Rk : π(i) ∈ [0, 1], Σ_i π(i) = 1}, for all x ∈ [n]. Node memberships {πu} are drawn from a Dirichlet distribution.

slide-47
SLIDE 47

Mixed Membership Model (Airoldi et al)

k communities and n nodes. Graph G ∈ Rn×n (adjacency matrix). Fractional memberships: πx ∈ ∆k−1 is the membership vector of node x, where ∆k−1 := {π ∈ Rk : π(i) ∈ [0, 1], Σ_i π(i) = 1}, for all x ∈ [n]. Node memberships {πu} are drawn from a Dirichlet distribution.

Edges are conditionally independent given community memberships: Gi,j ⊥⊥ Ga,b | πi, πj, πa, πb. Edge probability averaged over community memberships: P[Gi,j = 1 | πi, πj] = E[Gi,j | πi, πj] = πi⊤ P πj.

P ∈ Rk×k: average edge connectivity for pure communities.

Airoldi, Blei, Fienberg, and Xing. Mixed membership stochastic blockmodels. J. of Machine Learning Research, June 2008.

slide-48
SLIDE 48

Networks under Community Models

slide-49
SLIDE 49

Networks under Community Models

Stochastic Block Model

α0 = 0

slide-50
SLIDE 50

Networks under Community Models

Stochastic Block Model

α0 = 0

Mixed Membership Model

α0 = 1

slide-51
SLIDE 51

Networks under Community Models

Stochastic Block Model

α0 = 0

Mixed Membership Model

α0 = 10

slide-52
SLIDE 52

Networks under Community Models

Stochastic Block Model

α0 = 0

Mixed Membership Model

α0 = 10

Unifying Assumption

Edges conditionally independent given community memberships

slide-53
SLIDE 53

Subgraph Counts as Graph Moments

slide-54
SLIDE 54

Subgraph Counts as Graph Moments

slide-55
SLIDE 55

Subgraph Counts as Graph Moments

3-star counts sufficient for identifiability and learning of MMSB

slide-56
SLIDE 56

Subgraph Counts as Graph Moments

3-star counts sufficient for identifiability and learning of MMSB

3-Star Count Tensor

M̃3(a, b, c) = (1/|X|) · (# of common neighbors of a, b, c in X) = (1/|X|) Σ_{x∈X} G(x, a) G(x, b) G(x, c).

M̃3 = (1/|X|) Σ_{x∈X} [G⊤x,A ⊗ G⊤x,B ⊗ G⊤x,C]

[Figure: node x ∈ X adjacent to a ∈ A, b ∈ B, c ∈ C.]
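A small numpy sketch of this 3-star count tensor (assuming G is the n × n adjacency matrix and X, A, B, C are disjoint index arrays chosen by the user):

import numpy as np

def three_star_tensor(G, X, A, B, C):
    GA, GB, GC = G[np.ix_(X, A)], G[np.ix_(X, B)], G[np.ix_(X, C)]
    # M3(a, b, c) = (1/|X|) * sum_{x in X} G(x, a) G(x, b) G(x, c)
    return np.einsum('xa,xb,xc->abc', GA, GB, GC) / len(X)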

slide-57
SLIDE 57

Multi-view Representation

Conditional independence of the three views. πx: community membership vector of node x.

[Figure: 3-stars from x ∈ X into the sets A, B, C; graphical model with πx generating G⊤x,A, G⊤x,B, G⊤x,C through U, V, W.]

Linear Multiview Model: E[G⊤x,A | Π] = Π⊤A P⊤ πx = U πx.

slide-58
SLIDE 58

Subgraph Counts as Graph Moments

Second and Third Order Moments

M̂2 := (1/|X|) Σ_x ZC G⊤x,C Gx,B Z⊤B − shift

M̂3 := (1/|X|) Σ_x [G⊤x,A ⊗ ZB G⊤x,B ⊗ ZC G⊤x,C] − shift

Symmetrization matrices: PairsC,B := G⊤X,C ⊗ G⊤X,B,  ZB := Pairs(A, C) (Pairs(B, C))†,  ZC := Pairs(A, B) (Pairs(C, B))†.

[Figure: node x ∈ X with neighbors a ∈ A, b ∈ B, c ∈ C.] Linear Multiview Model: E[G⊤x,A | Π] = U πx.

E[M̂2 | ΠA,B,C] = Σ_i (αi/α0) ui ⊗ ui,   E[M̂3 | ΠA,B,C] = Σ_i (αi/α0) ui ⊗ ui ⊗ ui.

slide-59
SLIDE 59

Outline

1. Introduction
2. Latent Variable Models and Moments
3. Community Detection in Graphs
4. Analysis of Tensor Power Method
5. Advanced Topics
6. Conclusion

slide-60
SLIDE 60

Recap of Tensor Method

M2 = Σ_i wi ai ⊗ ai,   M3 = Σ_i wi ai ⊗ ai ⊗ ai.

Whitening matrix W from the SVD of M2. [Figure: W maps a1, a2, a3 to orthonormal v1, v2, v3.] Multilinear transform: T = M3(W, W, W). [Figure: tensor M3 → tensor T.]

Eigenvectors of T through the power method and deflation: v → T(I, v, v)/‖T(I, v, v)‖.
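A hedged numpy sketch of the whitening step (assuming M2 is positive semidefinite with at least k strictly positive eigenvalues; an eigendecomposition is used in place of the SVD mentioned above):

import numpy as np

def whiten(M2, M3, k):
    evals, evecs = np.linalg.eigh(M2)
    idx = np.argsort(evals)[::-1][:k]                 # top-k eigenpairs of M2
    W = evecs[:, idx] / np.sqrt(evals[idx])           # whitening matrix W, so W.T @ M2 @ W = I
    T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)   # multilinear transform T = M3(W, W, W)
    return W, T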

slide-61
SLIDE 61

Orthogonal Tensor Eigen Decomposition

T = Σ_{i∈[k]} λi vi ⊗ vi ⊗ vi,   ⟨vi, vj⟩ = δi,j ∀ i, j.

T(I, v1, v1) = Σ_i λi ⟨vi, v1⟩² vi = λ1 v1.

vi are eigenvectors of the tensor T.

slide-62
SLIDE 62

Orthogonal Tensor Eigen Decomposition

T = Σ_{i∈[k]} λi vi ⊗ vi ⊗ vi,   ⟨vi, vj⟩ = δi,j ∀ i, j.

T(I, v1, v1) = Σ_i λi ⟨vi, v1⟩² vi = λ1 v1.

vi are eigenvectors of the tensor T.

Tensor Power Method

Start from an initial vector v and iterate v → T(I, v, v)/‖T(I, v, v)‖.
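A minimal numpy sketch of this iteration (single random start; the full method discussed in the talk uses multiple restarts and deflation):

import numpy as np

def power_update(T, v):
    Tv = np.einsum('ijk,j,k->i', T, v, v)       # T(I, v, v)
    return Tv / np.linalg.norm(Tv)

def tensor_power_method(T, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.normal(size=T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = power_update(T, v)
    lam = np.einsum('ijk,i,j,k->', T, v, v, v)  # estimated eigenvalue T(v, v, v)
    return lam, v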

slide-63
SLIDE 63

Orthogonal Tensor Eigen Decomposition

T = Σ_{i∈[k]} λi vi ⊗ vi ⊗ vi,   ⟨vi, vj⟩ = δi,j ∀ i, j.

T(I, v1, v1) = Σ_i λi ⟨vi, v1⟩² vi = λ1 v1.

vi are eigenvectors of the tensor T.

Tensor Power Method

Start from an initial vector v and iterate v → T(I, v, v)/‖T(I, v, v)‖.

Questions

Is there convergence? Does the convergence depend on initialization? What about performance under noise?

slide-64
SLIDE 64

Recap of Matrix Eigen Analysis

For symmetric M ∈ Rk×k, eigen decomposition: M = Σ_i λi vi vi⊤.

Eigenvectors are fixed points: Mv = λv.

◮ In our notation: M(I, v) = λv.

Uniqueness (Identifiability): holds iff the λi are distinct.
slide-65
SLIDE 65

Recap of Matrix Eigen Analysis

For symmetric M ∈ Rk×k, eigen decomposition: M = Σ_i λi vi vi⊤.

Eigenvectors are fixed points: Mv = λv.

◮ In our notation: M(I, v) = λv.

Uniqueness (Identifiability): holds iff the λi are distinct.

Power method: v → M(I, v)/‖M(I, v)‖.

slide-66
SLIDE 66

Recap of Matrix Eigen Analysis

For symmetric M ∈ Rk×k, eigen decomposition: M = Σ_i λi vi vi⊤.

Eigenvectors are fixed points: Mv = λv.

◮ In our notation: M(I, v) = λv.

Uniqueness (Identifiability): holds iff the λi are distinct.

Power method: v → M(I, v)/‖M(I, v)‖.

Convergence properties

Let λ1 > λ2 > · · · > λd and let {vi} form a basis. Let the initialization be v = Σ_i ci vi. If c1 ≠ 0, the power method converges to v1.

slide-67
SLIDE 67

Recap of Matrix Eigen Analysis

For symmetric M ∈ Rk×k, eigen decomposition: M = Σ_i λi vi vi⊤.

Eigenvectors are fixed points: Mv = λv.

◮ In our notation: M(I, v) = λv.

Uniqueness (Identifiability): holds iff the λi are distinct.

Power method: v → M(I, v)/‖M(I, v)‖.

Convergence properties

Let λ1 > λ2 > · · · > λd and let {vi} form a basis. Let the initialization be v = Σ_i ci vi. If c1 ≠ 0, the power method converges to v1.

Perturbation analysis (Davis-Kahan): M + E. Require ‖E‖ < min_{i≠j} |λi − λj|.

slide-68
SLIDE 68

Optimization viewpoint of matrix analysis

M = Σ_{i∈[k]} λi vi ⊗ vi, with λ1 > λ2 > · · · .

Rayleigh quotient at v: M(v, v) = v⊤Mv = Σ_i λi ⟨vi, v⟩².

Optimization problem: max_v M(v, v) s.t. ‖v‖ = 1.

slide-69
SLIDE 69

Optimization viewpoint of matrix analysis

M = Σ_{i∈[k]} λi vi ⊗ vi, with λ1 > λ2 > · · · .

Rayleigh quotient at v: M(v, v) = v⊤Mv = Σ_i λi ⟨vi, v⟩².

Optimization problem: max_v M(v, v) s.t. ‖v‖ = 1. Non-convex problem. The global maximizer is v1 (the top eigenvector).

slide-70
SLIDE 70

Optimization viewpoint of matrix analysis

M = Σ_{i∈[k]} λi vi ⊗ vi, with λ1 > λ2 > · · · .

Rayleigh quotient at v: M(v, v) = v⊤Mv = Σ_i λi ⟨vi, v⟩².

Optimization problem: max_v M(v, v) s.t. ‖v‖ = 1. Non-convex problem. The global maximizer is v1 (the top eigenvector). What are the local optimizers?

slide-71
SLIDE 71

Optimization viewpoint of matrix analysis

Optimization: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v⊤v − 1).

slide-72
SLIDE 72

Optimization viewpoint of matrix analysis

Optimization: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v⊤v − 1). First derivative: ∇L(v, λ) = 2(M(I, v) − λv).

slide-73
SLIDE 73

Optimization viewpoint of matrix analysis

Optimization: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v⊤v − 1). First derivative: ∇L(v, λ) = 2(M(I, v) − λv). Stationary points are eigenvectors: ∇L(v, λ) = 0.

slide-74
SLIDE 74

Optimization viewpoint of matrix analysis

Optimization: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v⊤v − 1). First derivative: ∇L(v, λ) = 2(M(I, v) − λv). Stationary points are eigenvectors: ∇L(v, λ) = 0. The power method v → M(I, v)/‖M(I, v)‖ is a version of gradient ascent.

slide-75
SLIDE 75

Optimization viewpoint of matrix analysis

Optimization: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v⊤v − 1). First derivative: ∇L(v, λ) = 2(M(I, v) − λv). Stationary points are eigenvectors: ∇L(v, λ) = 0. The power method v → M(I, v)/‖M(I, v)‖ is a version of gradient ascent.

Second derivative: ∇²L(v, λ) = 2(M − λI).

slide-76
SLIDE 76

Optimization viewpoint of matrix analysis

Optimization: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v⊤v − 1). First derivative: ∇L(v, λ) = 2(M(I, v) − λv). Stationary points are eigenvectors: ∇L(v, λ) = 0. The power method v → M(I, v)/‖M(I, v)‖ is a version of gradient ascent.

Second derivative: ∇²L(v, λ) = 2(M − λI).

Local optimality condition for constrained optimization

w⊤∇²L(v, λ)w < 0 for all w ⊥ v, at a stationary point v.

slide-77
SLIDE 77

Optimization viewpoint of matrix analysis

Optimization: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v⊤v − 1). First derivative: ∇L(v, λ) = 2(M(I, v) − λv). Stationary points are eigenvectors: ∇L(v, λ) = 0. The power method v → M(I, v)/‖M(I, v)‖ is a version of gradient ascent.

Second derivative: ∇²L(v, λ) = 2(M − λI).

Local optimality condition for constrained optimization

w⊤∇²L(v, λ)w < 0 for all w ⊥ v, at a stationary point v. Verify: v1 is the only local optimum. Verify: all other eigenvectors are saddle points.

slide-78
SLIDE 78

Optimization viewpoint of matrix analysis

Optimization: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v⊤v − 1). First derivative: ∇L(v, λ) = 2(M(I, v) − λv). Stationary points are eigenvectors: ∇L(v, λ) = 0. The power method v → M(I, v)/‖M(I, v)‖ is a version of gradient ascent.

Second derivative: ∇²L(v, λ) = 2(M − λI).

Local optimality condition for constrained optimization

w⊤∇²L(v, λ)w < 0 for all w ⊥ v, at a stationary point v. Verify: v1 is the only local optimum. Verify: all other eigenvectors are saddle points. The power method recovers v1 when the initialization v satisfies ⟨v, v1⟩ ≠ 0.

slide-79
SLIDE 79

Analysis of Tensor Power Method

T = Σ_{i∈[k]} λi vi ⊗ vi ⊗ vi.

Bad news about tensors

Decomposition may not always exist for general tensors. Finding the decomposition is NP-hard in general.

slide-80
SLIDE 80

Analysis of Tensor Power Method

T = Σ_{i∈[k]} λi vi ⊗ vi ⊗ vi.

Bad news about tensors

Decomposition may not always exist for general tensors. Finding the decomposition is NP-hard in general. We will see that a tractable case is when we are promised that an orthogonal decomposition exists.
slide-81
SLIDE 81

Analysis of Tensor Power Method

T = Σ_{i∈[k]} λi vi ⊗ vi ⊗ vi.

Bad news about tensors

Decomposition may not always exist for general tensors. Finding the decomposition is NP-hard in general. We will see that a tractable case is when we are promised that an orthogonal decomposition exists.

Characterization of components {vi}

{vi} are eigenvectors: T(I, vi, vi) = λivi.

slide-82
SLIDE 82

Analysis of Tensor Power Method

T = Σ_{i∈[k]} λi vi ⊗ vi ⊗ vi.

Bad news about tensors

Decomposition may not always exist for general tensors. Finding the decomposition is NP-hard in general. We will see that a tractable case is when we are promised that an orthogonal decomposition exists.

Characterization of components {vi}

{vi} are eigenvectors: T(I, vi, vi) = λi vi. Bad news: there can be other eigenvectors (unlike the matrix case). With λi ≡ 1, v = (v1 + v2)/√2 satisfies T(I, v, v) = (1/√2) v.

slide-83
SLIDE 83

Analysis of Tensor Power Method

T = Σ_{i∈[k]} λi vi ⊗ vi ⊗ vi.

Bad news about tensors

Decomposition may not always exist for general tensors. Finding the decomposition is NP-hard in general. We will see that a tractable case is when we are promised that an orthogonal decomposition exists.

Characterization of components {vi}

{vi} are eigenvectors: T(I, vi, vi) = λi vi. Bad news: there can be other eigenvectors (unlike the matrix case). With λi ≡ 1, v = (v1 + v2)/√2 satisfies T(I, v, v) = (1/√2) v. How do we avoid spurious solutions (not part of the decomposition)?

slide-84
SLIDE 84

Optimization viewpoint of tensor analysis

Optimization: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − 1.5 λ(v⊤v − 1).

slide-85
SLIDE 85

Optimization viewpoint of tensor analysis

Optimization: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − 1.5 λ(v⊤v − 1). First derivative: ∇L(v, λ) = 3(T(I, v, v) − λv).

slide-86
SLIDE 86

Optimization viewpoint of tensor analysis

Optimization: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − 1.5 λ(v⊤v − 1). First derivative: ∇L(v, λ) = 3(T(I, v, v) − λv). Stationary points are eigenvectors: ∇L(v, λ) = 0.

slide-87
SLIDE 87

Optimization viewpoint of tensor analysis

Optimization: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − 1.5 λ(v⊤v − 1). First derivative: ∇L(v, λ) = 3(T(I, v, v) − λv). Stationary points are eigenvectors: ∇L(v, λ) = 0. The power method v → T(I, v, v)/‖T(I, v, v)‖ is a version of gradient ascent.

slide-88
SLIDE 88

Optimization viewpoint of tensor analysis

Optimization: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − 1.5 λ(v⊤v − 1). First derivative: ∇L(v, λ) = 3(T(I, v, v) − λv). Stationary points are eigenvectors: ∇L(v, λ) = 0. The power method v → T(I, v, v)/‖T(I, v, v)‖ is a version of gradient ascent.

Second derivative: ∇²L(v, λ) = 3(2 T(I, I, v) − λI).

slide-89
SLIDE 89

Optimization viewpoint of tensor analysis

Optimization: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − 1.5 λ(v⊤v − 1). First derivative: ∇L(v, λ) = 3(T(I, v, v) − λv). Stationary points are eigenvectors: ∇L(v, λ) = 0. The power method v → T(I, v, v)/‖T(I, v, v)‖ is a version of gradient ascent.

Second derivative: ∇²L(v, λ) = 3(2 T(I, I, v) − λI).

Local optimality condition for constrained optimization

w⊤∇²L(v, λ)w < 0 for all w ⊥ v, at a stationary point v.

slide-90
SLIDE 90

Optimization viewpoint of tensor analysis

Optimization: max

v

T(v, v, v) s.t. v = 1. Lagrangian: L(v, λ) := T(v, v, v) − 1.5λ(v⊤v − 1). First derivative: ∇L(v, λ) = 3(T(I, v, v) − λv). Stationary points are eigenvectors: ∇L(v, λ) = 0. Power method v →

T(I,v,v) T(I,v,v) is a version of gradient ascent.

Second derivative: ∇2L(v, λ) = 3(2T(I, I, v) − λI).

Local optimality condition for constrained optimization

w⊤∇2L(v, λ)w < 0 for all w ⊥ v, at a stationary point v. Verify: {vi} are the only local optima. Verify: All other eigenvectors are saddle points.

slide-91
SLIDE 91

Optimization viewpoint of tensor analysis

Optimization: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − 1.5 λ(v⊤v − 1). First derivative: ∇L(v, λ) = 3(T(I, v, v) − λv). Stationary points are eigenvectors: ∇L(v, λ) = 0. The power method v → T(I, v, v)/‖T(I, v, v)‖ is a version of gradient ascent.

Second derivative: ∇²L(v, λ) = 3(2 T(I, I, v) − λI).

Local optimality condition for constrained optimization

w⊤∇²L(v, λ)w < 0 for all w ⊥ v, at a stationary point v. Verify: {vi} are the only local optima. Verify: all other eigenvectors are saddle points. For an orthogonal tensor, there are no spurious local optima!

slide-92
SLIDE 92

Review: matrix power iteration

Recall matrix power iteration for a matrix M := Σ_i λi vi vi⊤:

Start with some v, and for j = 1, 2, …: v → Mv = Σ_i λi ⟨vi, v⟩ vi,

i.e., the component in the vi direction is scaled by λi.
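For comparison with the tensor iteration discussed next, here is a plain numpy sketch of matrix power iteration (illustration only, not from the talk):

import numpy as np

def matrix_power_method(M, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.normal(size=M.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = M @ v
        v /= np.linalg.norm(v)
    return v @ M @ v, v        # Rayleigh quotient and top-eigenvector estimate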

slide-93
SLIDE 93

Review: matrix power iteration

Recall matrix power iteration for a matrix M := Σ_i λi vi vi⊤:

Start with some v, and for j = 1, 2, …: v → Mv = Σ_i λi ⟨vi, v⟩ vi,

i.e., the component in the vi direction is scaled by λi.

If λ1 > λ2 ≥ · · ·, then after t iterations

⟨v1, v⟩² / Σ_i ⟨vi, v⟩² ≥ 1 − k (λ2/λ1)^(2t).

Converges linearly to v1, assuming a gap λ2/λ1 < 1.

slide-94
SLIDE 94

Tensor power iteration convergence analysis

Let ci := ⟨vi, v⟩ be the initial component in the vi direction; assume WLOG λ1|c1| > λ2|c2| ≥ λ3|c3| ≥ · · ·.

slide-95
SLIDE 95

Tensor power iteration convergence analysis

Let ci := ⟨vi, v⟩ be the initial component in the vi direction; assume WLOG λ1|c1| > λ2|c2| ≥ λ3|c3| ≥ · · ·.

Then v → Σ_i λi ⟨vi, v⟩² vi = Σ_i λi ci² vi,

i.e., the component in the vi direction is squared and then scaled by λi.

slide-96
SLIDE 96

Tensor power iteration convergence analysis

Let ci := ⟨vi, v⟩ be the initial component in the vi direction; assume WLOG λ1|c1| > λ2|c2| ≥ λ3|c3| ≥ · · ·.

Then v → Σ_i λi ⟨vi, v⟩² vi = Σ_i λi ci² vi,

i.e., the component in the vi direction is squared and then scaled by λi.

By induction, after t iterations v ∝ Σ_i λi^(2^t − 1) ci^(2^t) vi, so

⟨v1, v⟩² / Σ_i ⟨vi, v⟩² ≥ 1 − k (λ1 / max_{i≠1} λi)² (λ2|c2| / (λ1|c1|))^(2^(t+1)).
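A toy numerical illustration of this "squaring" behaviour (values invented): evolve the coefficients ci under the map ci → λi ci², renormalizing each step; the mass collapses quadratically fast onto the index with the largest λi|ci|:

import numpy as np

lam = np.array([1.0, 0.9, 0.8])
c = np.array([0.6, 0.55, 0.58])           # initial components <v_i, v>
c /= np.linalg.norm(c)
for t in range(6):
    c = lam * c**2                         # component-wise squaring and scaling
    c /= np.linalg.norm(c)
    print(t, np.round(c, 6))               # mass collapses onto the first entry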

slide-97
SLIDE 97

Matrix vs. tensor power iteration

Matrix power iteration: Tensor power iteration:

slide-98
SLIDE 98

Matrix vs. tensor power iteration

Matrix power iteration:

1. Requires a gap between the largest and second-largest eigenvalue. A property of the matrix only.

Tensor power iteration:

1. Requires a gap between the largest and second-largest λi|ci|. A property of the tensor and of the initialization v.

slide-99
SLIDE 99

Matrix vs. tensor power iteration

Matrix power iteration:

1. Requires a gap between the largest and second-largest eigenvalue. A property of the matrix only.
2. Converges to the top eigenvector.

Tensor power iteration:

1. Requires a gap between the largest and second-largest λi|ci|. A property of the tensor and of the initialization v.
2. Converges to the vi for which λi|ci| is largest; this could be any of them.

slide-100
SLIDE 100

Matrix vs. tensor power iteration

Matrix power iteration:

1. Requires a gap between the largest and second-largest eigenvalue. A property of the matrix only.
2. Converges to the top eigenvector.
3. Linear convergence: needs O(log(1/ε)) iterations.

Tensor power iteration:

1. Requires a gap between the largest and second-largest λi|ci|. A property of the tensor and of the initialization v.
2. Converges to the vi for which λi|ci| is largest; this could be any of them.
3. Quadratic convergence: needs O(log log(1/ε)) iterations.

slide-101
SLIDE 101

Perturbation Analysis

T̂ = T + E,   T = Σ_i λi vi ⊗ vi ⊗ vi,   ‖E‖ := max_{x: ‖x‖=1} |E(x, x, x)| ≤ ε.

Theorem: Let N be the number of iterations. If N ≥ log k + log log(λmax/ε) and ε < λmin/k, then the output (v, λ) (after polynomially many restarts) satisfies

‖v − v1‖ ≤ O(ε/λ1),   |λ − λ1| ≤ O(ε),

where v1 is such that λ1|c1| > λ2|c2| > …, ci := ⟨vi, v⟩, and v is the (successful) initializer. Careful analysis of deflation: avoid buildup of errors. Implies polynomial sample complexity for learning.
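A self-contained numpy sketch of power iteration with deflation in the spirit of the theorem above (a single random start per component here, rather than the polynomially many restarts the statement assumes):

import numpy as np

def deflate_all(T, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    pairs, T_work = [], T.copy()
    for _ in range(k):
        v = rng.normal(size=T.shape[0])
        v /= np.linalg.norm(v)
        for _ in range(n_iter):
            v = np.einsum('ijk,j,k->i', T_work, v, v)     # T(I, v, v)
            v /= np.linalg.norm(v)
        lam = np.einsum('ijk,i,j,k->', T_work, v, v, v)    # eigenvalue estimate
        pairs.append((lam, v))
        T_work = T_work - lam * np.einsum('i,j,k->ijk', v, v, v)   # subtract the rank-1 term
    return pairs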

slide-102
SLIDE 102

Outline

1. Introduction
2. Latent Variable Models and Moments
3. Community Detection in Graphs
4. Analysis of Tensor Power Method
5. Advanced Topics
6. Conclusion

slide-103
SLIDE 103

Beyond Orthogonal Tensor Decomposition

a ⊗ a ⊗ a is a rank-1 tensor whose (i1, i2, i3)-th entry is a(i1) · a(i2) · a(i3). For a tensor T, find a decomposition into rank-one terms: T = Σ_{j∈[k]} wj aj ⊗ aj ⊗ aj, aj ∈ S^(d−1).

[Figure: Tensor T = w1 · a1 ⊗ a1 ⊗ a1 + w2 · a2 ⊗ a2 ⊗ a2 + ….]

k: tensor rank, d: ambient dimension. k > d: overcomplete. A is incoherent: ⟨ai, aj⟩ ∼ 1/√d for i ≠ j.

Guaranteed recovery when k = o(d^1.5).

“Guaranteed Non-Orthogonal Tensor Decomposition via Alternating Rank-1 Updates” by A., R. Ge, M. Janzamin. Preprint, Feb. 2014. “Provable Learning of Overcomplete Latent Variable Models: Semi-supervised & Unsupervised”.

slide-104
SLIDE 104

Semi-supervised Learning of Gaussian Mixtures

n unlabeled samples; mj: number of labeled samples for component j.

  • No. of mixture components: k = o(d^1.5)
  • No. of labeled samples: mj = Ω̃(1).
  • No. of unlabeled samples: n = Ω̃(k).

Our result: achieved error with n unlabeled samples

max_i ‖âi − ai‖ = Õ(√(k/n)) + Õ(√k / d)

Can handle (polynomially) overcomplete mixtures. Extremely small number of labeled samples: polylog(d). Sample complexity is tight: need Ω̃(k) samples! Approximation error: decaying in high dimensions.

slide-105
SLIDE 105

Unsupervised Learning of Gaussian Mixtures

Conditions for recovery

  • No. of mixture components: k = C · d
  • No. of unlabeled samples: n = Ω̃(k · d). Computational complexity: Õ(e^(C²)).

Our result: achieved error with n unlabeled samples

max_i ‖âi − ai‖ = Õ(√(k/n)) + Õ(√k / d)

Error: same as before, for the semi-supervised setting. Sample complexity: worse than semi-supervised, but better than previous works (no dependence on the condition number of A). Computational complexity: polynomial when k = Θ(d).

slide-106
SLIDE 106

Learning Overcomplete Dictionaries

[Figure: Y = AX, with Y ∈ Rd×n, A ∈ Rd×k, X ∈ Rk×n.]

Linear model: Y = AX, both A and X unknown. Sparse X: each column is randomly s-sparse. Overcomplete dictionary A ∈ Rd×k: k ≥ d. Incoherence: max_{i≠j} |⟨ai, aj⟩| ≈ 0 (satisfied by random vectors).

“Learning Sparsely Used Overcomplete Dictionaries” by A. Agarwal, A., P. Jain, P. Netrapalli, and R. Tandon. COLT 2014.
slide-107
SLIDE 107

Experiments on MNIST

[Figure: MNIST digits: original, reconstruction, and learnt representation.]

slide-108
SLIDE 108

Outline

1. Introduction
2. Latent Variable Models and Moments
3. Community Detection in Graphs
4. Analysis of Tensor Power Method
5. Advanced Topics
6. Conclusion

slide-109
SLIDE 109

Conclusion

Guaranteed Learning of Latent Variable Models

Guaranteed to recover the correct model. Efficient sample and computational complexities. Better performance compared to EM, Variational Bayes, etc. Tensor approach: mixed membership communities, topic models, latent trees… Sparsity-based approach: overcomplete models, e.g. sparse coding and topic models.

[Figure: Y = AX.]

Tomorrow’s lecture

Implementation of tensor approaches.