

SLIDE 1

Learning Latent Variable Models through Tensor Methods

Anima Anandkumar

U.C. Irvine

SLIDE 2

Challenges in Unsupervised Learning

Learn a latent variable model without labeled examples, e.g. topic models, hidden Markov models, Gaussian mixtures, community detection. Maximum likelihood estimation is NP-hard in most scenarios. In practice, EM and Variational Bayes have no consistency guarantees. Can we obtain efficient computational and sample complexities? In this talk: guaranteed and efficient learning through tensor methods.

SLIDE 3

How to model hidden effects?

Basic Approach: mixtures/clusters

Hidden variable h is categorical.

Advanced: Probabilistic models

Hidden variable h has more general distributions and can model mixed memberships. [Figure: hidden variables h1, h2, h3 connected to observations x1, . . . , x5.]

SLIDE 4

Moment Based Approaches

Multivariate Moments

M1 := E[x], M2 := E[x ⊗ x], M3 := E[x ⊗ x ⊗ x].

Matrix

E[x ⊗ x] ∈ R^(d×d) is a second-order tensor with entries E[x ⊗ x]i1,i2 = E[xi1 xi2]. For matrices: E[x ⊗ x] = E[xx⊤].

Tensor

E[x ⊗ x ⊗ x] ∈ R^(d×d×d) is a third-order tensor with entries E[x ⊗ x ⊗ x]i1,i2,i3 = E[xi1 xi2 xi3].
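As a quick illustration (a hedged numpy sketch with made-up placeholder data, not part of the talk), the empirical versions of these moments can be formed directly from samples:

```python
import numpy as np

# Empirical M1, M2, M3 from n samples of x in R^d, stacked as rows of X.
# The (1000, 5) standard-normal data here is only a placeholder.
X = np.random.default_rng(0).normal(size=(1000, 5))
M1 = X.mean(axis=0)                                 # E[x]
M2 = np.einsum('na,nb->ab', X, X) / len(X)          # E[x ⊗ x] = E[xx^T]
M3 = np.einsum('na,nb,nc->abc', X, X, X) / len(X)   # E[x ⊗ x ⊗ x]
```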

SLIDE 5

Outline

1. Introduction
2. Spectral Methods: Matrices to Tensors
3. Tensor Forms for Different Models
4. Experimental Results
5. Overcomplete Tensors
6. Conclusion

SLIDE 6

Classical Spectral Methods: Matrix PCA

Learning through Spectral Clustering

Dimension reduction through PCA (on the data matrix), then clustering of the projected vectors (e.g. k-means).

SLIDE 7

Classical Spectral Methods: Matrix PCA

Learning through Spectral Clustering

Dimension reduction through PCA (on the data matrix), then clustering of the projected vectors (e.g. k-means). The basic method works only for single memberships: it fails to cluster under small separation and requires long documents for good concentration bounds.

SLIDE 8

Classical Spectral Methods: Matrix PCA

Learning through Spectral Clustering

Dimension reduction through PCA (on the data matrix), then clustering of the projected vectors (e.g. k-means). The basic method works only for single memberships: it fails to cluster under small separation and requires long documents for good concentration bounds. Efficient learning without separation constraints?

SLIDE 9

Beyond SVD: Spectral Methods on Tensors

How to learn the mixture components without separation constraints?

◮ Are higher order moments helpful?

Unified framework?

◮ Moment-based Estimation of probabilistic latent variable models?

SVD gives spectral decomposition of matrices.

◮ What are the analogues for tensors?

SLIDE 10

Spectral Decomposition

Matrix: M2 = Σ_i λi ui ⊗ vi = λ1 u1 ⊗ v1 + λ2 u2 ⊗ v2 + · · ·

SLIDE 11

Spectral Decomposition

Matrix: M2 = Σ_i λi ui ⊗ vi = λ1 u1 ⊗ v1 + λ2 u2 ⊗ v2 + · · ·

Tensor: M3 = Σ_i λi ui ⊗ vi ⊗ wi = λ1 u1 ⊗ v1 ⊗ w1 + λ2 u2 ⊗ v2 ⊗ w2 + · · ·

u ⊗ v ⊗ w is a rank-1 tensor since its (i1, i2, i3)-th entry is ui1 vi2 wi3.
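For concreteness, a small numpy check (an illustrative sketch, not from the talk) that the rank-1 tensor u ⊗ v ⊗ w has entries ui1 vi2 wi3:

```python
import numpy as np

# Build u ⊗ v ⊗ w with einsum and verify one entry against the definition.
rng = np.random.default_rng(0)
u, v, w = rng.normal(size=4), rng.normal(size=4), rng.normal(size=4)
T = np.einsum('a,b,c->abc', u, v, w)    # rank-1 tensor, shape (4, 4, 4)
assert np.isclose(T[1, 2, 3], u[1] * v[2] * w[3])
```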

SLIDE 12

Decomposition of Orthogonal Tensors

A has orthogonal columns: M3 = Σ_i wi ai ⊗ ai ⊗ ai.

SLIDE 13

Decomposition of Orthogonal Tensors

A has orthogonal columns: M3 = Σ_i wi ai ⊗ ai ⊗ ai.

M3(I, a1, a1) = Σ_i wi ⟨ai, a1⟩^2 ai = w1 a1.

SLIDE 14

Decomposition of Orthogonal Tensors

A has orthogonal columns: M3 = Σ_i wi ai ⊗ ai ⊗ ai.

M3(I, a1, a1) = Σ_i wi ⟨ai, a1⟩^2 ai = w1 a1.

ai are eigenvectors of tensor M3. Analogous to matrix eigenvectors: Mv = M(I, v) = λv.

SLIDE 15

Decomposition of Orthogonal Tensors

A has orthogonal columns: M3 = Σ_i wi ai ⊗ ai ⊗ ai.

M3(I, a1, a1) = Σ_i wi ⟨ai, a1⟩^2 ai = w1 a1.

ai are eigenvectors of tensor M3. Analogous to matrix eigenvectors: Mv = M(I, v) = λv.
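The contraction M3(I, v, v) used here reduces to a single einsum; a minimal sketch (illustrative, not the talk's code):

```python
import numpy as np

# M3(I, v, v): contract the last two modes of M3 with v, returning the vector
# whose a-th entry is the sum over b, c of M3[a, b, c] * v[b] * v[c].
def contract(M3, v):
    return np.einsum('abc,b,c->a', M3, v, v)
```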

Two Problems

How to find eigenvectors of a tensor? A is not orthogonal in general.

SLIDE 16

Whitening

M3 = Σ_i wi ai ⊗ ai ⊗ ai,   M2 = Σ_i wi ai ⊗ ai.

Find a whitening matrix W s.t. W⊤A = V is an orthogonal matrix. When A ∈ R^(d×k) has full column rank, this is an invertible transformation.

[Figure: W maps a1, a2, a3 to orthonormal v1, v2, v3.]

Use the pairwise moments M2 to find W s.t. W⊤M2W = I: from the eigen-decomposition M2 = U Diag(λ̃) U⊤, set W = U Diag(λ̃^(−1/2)).

SLIDE 17

Using Whitening to Obtain Orthogonal Tensor

Multi-linear transform: tensor M3 → tensor T

M3 ∈ R^(d×d×d) and T ∈ R^(k×k×k):
T = M3(W, W, W) = Σ_i wi (W⊤ai)^⊗3 = Σ_{i∈[k]} λi · vi ⊗ vi ⊗ vi is orthogonal.

Dimensionality reduction when k ≪ d.
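The multi-linear transform itself is one contraction per mode; a hedged sketch:

```python
import numpy as np

# T = M3(W, W, W): contract each mode of M3 (d x d x d) with W (d x k),
# producing the k x k x k tensor T.
def multilinear_transform(M3, W):
    return np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)
```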

SLIDE 18

Putting it together

M2 = Σ_i wi ai ⊗ ai,   M3 = Σ_i wi ai ⊗ ai ⊗ ai.

Obtain the whitening matrix W from the SVD of M2. Use W for the multi-linear transform: T = M3(W, W, W). Find the eigenvectors of T through the power method and deflation. For what models can we obtain M2 and M3 in these forms?
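Putting these pieces together, a compact sketch of the tensor power method with deflation on the whitened tensor T (an illustrative implementation under the orthogonal-decomposition assumption, not the talk's code):

```python
import numpy as np

# Recover the eigenpairs of an orthogonally decomposable k x k x k tensor T
# by repeated power iterations v <- T(I, v, v)/||T(I, v, v)|| and deflation.
def tensor_power_method(T, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    k = T.shape[0]
    eigenpairs = []
    for _ in range(k):
        v = rng.normal(size=k)
        v /= np.linalg.norm(v)
        for _ in range(n_iters):
            Tv = np.einsum('abc,b,c->a', T, v, v)    # power update T(I, v, v)
            v = Tv / np.linalg.norm(Tv)
        lam = np.einsum('abc,a,b,c->', T, v, v, v)   # eigenvalue T(v, v, v)
        eigenpairs.append((lam, v))
        T = T - lam * np.einsum('a,b,c->abc', v, v, v)  # deflate this component
    return eigenpairs
```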

SLIDE 19

Outline

1. Introduction
2. Spectral Methods: Matrices to Tensors
3. Tensor Forms for Different Models
4. Experimental Results
5. Overcomplete Tensors
6. Conclusion

SLIDE 20

Topic Modeling

SLIDE 21

Geometric Picture for Topic Models

Topic proportions vector (h)

Document

SLIDE 22

Geometric Picture for Topic Models

Single topic (h)

SLIDE 23

Geometric Picture for Topic Models

Single topic (h); word generation (x1, x2, . . .). [Figure: topic h emits words x1, x2, x3, each through the topic-word matrix A.]

SLIDE 24

Geometric Picture for Topic Models

Single topic (h); word generation (x1, x2, . . .). [Figure: topic h emits words x1, x2, x3, each through the topic-word matrix A.] Linear model: E[xi|h] = Ah.

SLIDE 25

Moments for Single Topic Models

E[xi|h] = Ah, w := E[h]. Goal: learn the topic-word matrix A and the vector w.

[Figure: h generates words x1, . . . , x5, each through A.]

SLIDE 26

Moments for Single Topic Models

E[xi|h] = Ah, w := E[h]. Goal: learn the topic-word matrix A and the vector w.

[Figure: h generates words x1, . . . , x5, each through A.]

Pairwise Co-occurrence Matrix M2

M2 := E[x1 ⊗ x2] = E[E[x1 ⊗ x2 | h]] = Σ_{i=1}^k wi ai ⊗ ai

Triples Tensor M3

M3 := E[x1 ⊗ x2 ⊗ x3] = E[E[x1 ⊗ x2 ⊗ x3 | h]] = Σ_{i=1}^k wi ai ⊗ ai ⊗ ai
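As a hedged sketch (the (n, 3) word-triple layout and vocabulary size d are assumptions for illustration, not the talk's data format), the empirical moments follow directly from one-hot encodings of the first three words of each document:

```python
import numpy as np

# Empirical M2 = E[x1 ⊗ x2] and M3 = E[x1 ⊗ x2 ⊗ x3] from word triples.
# docs: integer array of shape (n, 3) with the first three word ids per document.
def empirical_topic_moments(docs, d):
    X1, X2, X3 = (np.eye(d)[docs[:, j]] for j in range(3))   # one-hot, (n, d)
    n = docs.shape[0]
    M2 = np.einsum('na,nb->ab', X1, X2) / n
    M3 = np.einsum('na,nb,nc->abc', X1, X2, X3) / n
    return M2, M3
```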

SLIDE 27

Moments under LDA

M2 := E[x1 ⊗ x2] − (α0/(α0 + 1)) E[x1] ⊗ E[x1]
M3 := E[x1 ⊗ x2 ⊗ x3] − (α0/(α0 + 2)) E[x1 ⊗ x2 ⊗ E[x1]] − more stuff...

Then M2 = Σ_i w̃i ai ⊗ ai and M3 = Σ_i w̃i ai ⊗ ai ⊗ ai.

Three words per document suffice for learning LDA. Similar forms for HMM, ICA, etc.
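For the M2 adjustment, which the slide states in full, a one-line sketch (E12 and m1 are assumed to be precomputed empirical estimates of E[x1 ⊗ x2] and E[x1]; the M3 adjustment has the additional symmetrized terms elided above):

```python
import numpy as np

# LDA-adjusted second moment: M2 = E[x1 ⊗ x2] - alpha0/(alpha0 + 1) E[x1] ⊗ E[x1].
def lda_m2(E12, m1, alpha0):
    return E12 - (alpha0 / (alpha0 + 1.0)) * np.outer(m1, m1)
```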

SLIDE 28

Network Community Models

SLIDE 29

Network Community Models

[Figure: network where each node has a mixed community membership vector, e.g. (0.4, 0.3, 0.3), (0.7, 0.2, 0.1), (0.1, 0.8, 0.1).]

SLIDE 30

Network Community Models

[Figure: network where each node has a mixed community membership vector, e.g. (0.4, 0.3, 0.3), (0.7, 0.2, 0.1), (0.1, 0.8, 0.1).]

SLIDE 31

Network Community Models

[Figure: same network with membership vectors (0.4, 0.3, 0.3), (0.7, 0.2, 0.1), (0.1, 0.8, 0.1); a highlighted node pair connects with probability 0.9.]

SLIDE 32

Network Community Models

[Figure: same network with membership vectors (0.4, 0.3, 0.3), (0.7, 0.2, 0.1), (0.1, 0.8, 0.1); a highlighted node pair connects with probability 0.1.]

SLIDE 33

Network Community Models

[Figure: network where each node has a mixed community membership vector, e.g. (0.4, 0.3, 0.3), (0.7, 0.2, 0.1), (0.1, 0.8, 0.1).]

SLIDE 34

Subgraph Counts as Graph Moments

SLIDE 35

Subgraph Counts as Graph Moments

SLIDE 36

Subgraph Counts as Graph Moments

3-star counts are sufficient for identifiability and learning of MMSB (the mixed membership stochastic blockmodel).

SLIDE 37

Subgraph Counts as Graph Moments

3-star counts sufficient for identifiability and learning of MMSB

3-Star Count Tensor

M̃3(a, b, c) = (1/|X|) · #{common neighbors of a, b, c in X} = (1/|X|) Σ_{x∈X} G(x, a) G(x, b) G(x, c).

M̃3 = (1/|X|) Σ_{x∈X} [G⊤x,A ⊗ G⊤x,B ⊗ G⊤x,C]

[Figure: 3-star with center x ∈ X and leaves a ∈ A, b ∈ B, c ∈ C.]
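In code, the 3-star count tensor is a single contraction over the partition X; a hedged sketch (the index-array layout is an assumption, not the released implementation):

```python
import numpy as np

# 3-star counts from adjacency matrix G: average over x in X of the rank-1
# tensor G[x, A] ⊗ G[x, B] ⊗ G[x, C]. X, A, B, C are integer index arrays.
def three_star_tensor(G, X, A, B, C):
    GA, GB, GC = G[np.ix_(X, A)], G[np.ix_(X, B)], G[np.ix_(X, C)]
    return np.einsum('xa,xb,xc->abc', GA, GB, GC) / len(X)
```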

SLIDE 38

Multi-view Representation

Conditional independence of the three views: given πx, the community membership vector of node x, the neighborhood vectors G⊤x,A, G⊤x,B, G⊤x,C are conditionally independent.

[Figure: 3-stars centered at x ∈ X; graphical model with πx as the parent of G⊤x,A, G⊤x,B, and G⊤x,C.]

Similar form as M2 and M3 for topic models.

SLIDE 39

Main Results

k communities, n nodes. Uniform communities. α0: Sparsity level of community memberships (Dirichlet parameter). p, q: intra/inter-community edge density.

Scaling Requirements

n = Ω̃(k^2 (α0 + 1)^3),   (p − q)/√p = Ω̃((α0 + 1)^(1.5) k / √n).

“A Tensor Spectral Approach to Learning Mixed Membership Community Models” by A. Anandkumar, R. Ge, D. Hsu, and S.M. Kakade. COLT 2013.

SLIDE 40

Main Results

k communities, n nodes. Uniform communities. α0: Sparsity level of community memberships (Dirichlet parameter). p, q: intra/inter-community edge density.

Scaling Requirements

n = Ω̃(k^2 (α0 + 1)^3),   (p − q)/√p = Ω̃((α0 + 1)^(1.5) k / √n).

For the stochastic block model (α0 = 0): tight results. Tight guarantees for sparse graphs (scaling of p, q). Tight guarantees on community size: requires communities of size at least √n. Efficient scaling w.r.t. the sparsity level of memberships α0.

“A Tensor Spectral Approach to Learning Mixed Membership Community Models” by A. Anandkumar, R. Ge, D. Hsu, and S.M. Kakade. COLT 2013.

SLIDE 41

Main Results (Contd)

α0: Sparsity level of community memberships (Dirichlet parameter). Π: community membership matrix, Π(i): ith community. S: estimated supports, S(i, j): support for node j in community i.

Norm Guarantees

(1/n) · max_i ‖Π̂(i) − Π(i)‖1 = Õ((α0 + 1)^(3/2) √p / ((p − q) √n))

SLIDE 42

Main Results (Contd)

α0: Sparsity level of community memberships (Dirichlet parameter). Π: community membership matrix, Π(i): ith community. S: estimated supports, S(i, j): support for node j in community i.

Norm Guarantees

(1/n) · max_i ‖Π̂(i) − Π(i)‖1 = Õ((α0 + 1)^(3/2) √p / ((p − q) √n))

Support Recovery

∃ ξ s.t. for all nodes j ∈ [n] and all communities i ∈ [k], w.h.p.: Π(i, j) ≥ ξ ⇒ S(i, j) = 1 and Π(i, j) ≤ ξ/2 ⇒ S(i, j) = 0. Zero-error support recovery of significant memberships of all nodes.

SLIDE 43

Outline

1. Introduction
2. Spectral Methods: Matrices to Tensors
3. Tensor Forms for Different Models
4. Experimental Results
5. Overcomplete Tensors
6. Conclusion

SLIDE 44

Computational Complexity (k ≪ n)

n = # of nodes, N = # of iterations, k = # of communities, c = # of cores.

          Whiten           STGD        Unwhiten
Space     O(nk)            O(k^2)      O(nk)
Time      O(nsk/c + k^3)   O(Nk^3/c)   O(nsk/c)

Whiten: matrix/vector products and SVD. STGD: Stochastic Tensor Gradient Descent. Unwhiten: matrix/vector products. Our approach: O(nsk/c + k^3) overall.

Embarrassingly parallel and fast!

SLIDE 45

Scaling Of The Stochastic Iterations

[Log-log plot: running time (secs) vs. number of communities k, comparing MATLAB Tensor Toolbox (CPU), CULA Standard Interface (GPU), CULA Device Interface (GPU), and Eigen Sparse (CPU).]

SLIDE 46

Summary of Results

Datasets: Facebook friendship network (n ∼ 20k), Yelp user–business review network (n ∼ 40k), DBLP coauthorship network (n ∼ 1 million; subgraph ∼ 100k). Error (E) and recovery ratio (R).

Dataset             k̂     Method       Running Time   E       R
Facebook (k=360)    500    ours         468            0.0175  100%
Facebook (k=360)    500    variational  86,808         0.0308  100%
Yelp (k=159)        100    ours         287            0.046   86%
Yelp (k=159)        100    variational  N.A.           –       –
DBLP sub (k=250)    500    ours         10,157         0.139   89%
DBLP sub (k=250)    500    variational  558,723        16.38   99%
DBLP (k=6000)       100    ours         5,407          0.105   95%

Thanks to Prem Gopalan and David Mimno for providing variational code.

SLIDE 47

Experimental Results on Yelp

Lowest error business categories & largest weight businesses

Rank  Category        Business                    Stars  Review Counts
1     Latin American  Salvadoreno Restaurant      4.0    36
2     Gluten Free     P.F. Chang's China Bistro   3.5    55
3     Hobby Shops     Make Meaning                4.5    14
4     Mass Media      KJZZ 91.5FM                 4.0    13
5     Yoga            Sutra Midtown               4.5    31

SLIDE 48

Experimental Results on Yelp

Lowest error business categories & largest weight businesses

Rank  Category        Business                    Stars  Review Counts
1     Latin American  Salvadoreno Restaurant      4.0    36
2     Gluten Free     P.F. Chang's China Bistro   3.5    55
3     Hobby Shops     Make Meaning                4.5    14
4     Mass Media      KJZZ 91.5FM                 4.0    13
5     Yoga            Sutra Midtown               4.5    31

Bridgeness: distance from the uniform vector [1/k̂, . . . , 1/k̂]⊤

Top-5 bridging nodes (businesses)

Business               Categories
Four Peaks Brewing     Restaurants, Bars, American, Nightlife, Food, Pubs, Tempe
Pizzeria Bianco        Restaurants, Pizza, Phoenix
FEZ                    Restaurants, Bars, American, Nightlife, Mediterranean, Lounges, Phoenix
Matt's Big Breakfast   Restaurants, Phoenix, Breakfast & Brunch
Cornish Pasty Co       Restaurants, Bars, Nightlife, Pubs, Tempe

SLIDE 49

Outline

1. Introduction
2. Spectral Methods: Matrices to Tensors
3. Tensor Forms for Different Models
4. Experimental Results
5. Overcomplete Tensors
6. Conclusion

SLIDE 50

Beyond Orthogonal Tensor Decomposition

T = Σ_{j∈[k]} wj aj ⊗ aj ⊗ aj. k: tensor rank, d: ambient dimension. k > d: overcomplete.

SLIDE 51

Beyond Orthogonal Tensor Decomposition

T = Σ_{j∈[k]} wj aj ⊗ aj ⊗ aj. k: tensor rank, d: ambient dimension. k > d: overcomplete.

A is incoherent: ⟨ai, aj⟩ ∼ 1/√d for i ≠ j.

SLIDE 52

Beyond Orthogonal Tensor Decomposition

T = Σ_{j∈[k]} wj aj ⊗ aj ⊗ aj. k: tensor rank, d: ambient dimension. k > d: overcomplete.

A is incoherent: ⟨ai, aj⟩ ∼ 1/√d for i ≠ j.

Guaranteed recovery when k = o(d^(1.5)).

SLIDE 53

Beyond Orthogonal Tensor Decomposition

T = Σ_{j∈[k]} wj aj ⊗ aj ⊗ aj. k: tensor rank, d: ambient dimension. k > d: overcomplete.

A is incoherent: ⟨ai, aj⟩ ∼ 1/√d for i ≠ j.

Guaranteed recovery when k = o(d^(1.5)). Tight sample complexity bounds.

“Guaranteed Non-Orthogonal Tensor Decomposition via Alternating Rank-1 Updates” by A. Anandkumar, R. Ge, and M. Janzamin. Preprint, Feb. 2014. “Provable Learning of Overcomplete Latent Variable Models: Semi-supervised & Unsupervised”.
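A quick numerical illustration of the incoherence condition (a sketch with random unit-norm columns, not tied to any particular model here): pairwise inner products of random directions in R^d concentrate at the ∼1/√d scale.

```python
import numpy as np

# Random unit-norm columns of an overcomplete dictionary (k > d) are
# incoherent: off-diagonal inner products are on the order of 1/sqrt(d).
rng = np.random.default_rng(0)
d, k = 100, 300
A = rng.normal(size=(d, k))
A /= np.linalg.norm(A, axis=0)             # normalize each column
G = A.T @ A
off_diag = np.abs(G[~np.eye(k, dtype=bool)])
print(off_diag.mean(), 1 / np.sqrt(d))     # comparable magnitudes
```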

SLIDE 54

High-level Intuition for Sample Bounds

Gaussian mixture model: x = Ah + z, where z is noise. Exact moment: T = Σ_i wi ai ⊗ ai ⊗ ai. Sample moment: T̂ = (1/n) Σ_i xi ⊗ xi ⊗ xi − . . ..

Naive idea: ‖T̂ − T‖ ≤ ‖mat(T̂) − mat(T)‖, then apply the matrix Bernstein inequality. Our idea: careful ε-net covering for ‖T̂ − T‖. T̂ − T has many terms, e.g. (1/n) Σ_i zi ⊗ zi ⊗ zi: need to bound (1/n) Σ_i ⟨zi, u⟩^3 for all u ∈ S^(d−1). Classify the inner products into buckets and bound each bucket separately.

SLIDE 55

High-level Intuition for Sample Bounds

Gaussian mixture model: x = Ah + z, where z is noise. Exact moment: T = Σ_i wi ai ⊗ ai ⊗ ai. Sample moment: T̂ = (1/n) Σ_i xi ⊗ xi ⊗ xi − . . ..

Naive idea: ‖T̂ − T‖ ≤ ‖mat(T̂) − mat(T)‖, then apply the matrix Bernstein inequality. Our idea: careful ε-net covering for ‖T̂ − T‖. T̂ − T has many terms, e.g. (1/n) Σ_i zi ⊗ zi ⊗ zi: need to bound (1/n) Σ_i ⟨zi, u⟩^3 for all u ∈ S^(d−1). Classify the inner products into buckets and bound each bucket separately.

Tight sample bounds for a range of latent variable models. E.g. require Ω̃(k) samples for k-Gaussian mixtures in the low-noise regime.

SLIDE 56

Main Result: Local Convergence

Initialization: ‖a1 − a^(0)‖ ≤ ε0, with ε0 < const. Noise: T̂ := T + E, with ‖E‖ ≤ 1/polylog(d). Error: εT := ‖E‖ + Õ(√k / d).

Theorem (Local Convergence)

After O(log(1/εT)) steps of alternating rank-1 updates, ‖a1 − a^(t)‖ = O(εT). Linear convergence, up to the approximation error. Guarantees for overcomplete tensors: k = o(d^(1.5)), and k = o(d^(p/2)) for pth-order tensors. Requires good initialization. What about global convergence?
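One plausible form of a single alternating rank-1 update (a hedged, asymmetric power-iteration-style sketch; the paper's exact update rule may differ):

```python
import numpy as np

# One alternating rank-1 update: refresh each factor by contracting T with
# the current estimates of the other two factors, then renormalize.
def rank1_update(T, u, v, w):
    u = np.einsum('abc,b,c->a', T, v, w); u /= np.linalg.norm(u)
    v = np.einsum('abc,a,c->b', T, u, w); v /= np.linalg.norm(v)
    w = np.einsum('abc,a,b->c', T, u, v); w /= np.linalg.norm(w)
    return u, v, w
```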

slide-57
SLIDE 57

Global Convergence k = O(d)

SVD Initialization

Find the top singular vector of T(I, I, θ) for θ ∼ N(0, I), and use it as the initialization; repeat over L trials.

Conditions for global convergence

Number of initializations: L ≥ k^(Ω((k/d)^2)). Tensor rank: k = O(d). Number of iterations: N = Θ(log(1/εT)). Recall εT: approx. error.
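A minimal sketch of one SVD-initialization trial as described above (illustrative, not the paper's code):

```python
import numpy as np

# One trial: draw theta ~ N(0, I), form the matrix T(I, I, theta), and take
# its top left singular vector as a candidate starting point.
def svd_init(T, rng):
    theta = rng.normal(size=T.shape[2])
    M = np.einsum('abc,c->ab', T, theta)   # T(I, I, theta)
    U, _, _ = np.linalg.svd(M)
    return U[:, 0]                         # top left singular vector
```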
SLIDE 58

Global Convergence k = O(d)

SVD Initialization

Find the top singular vector of T(I, I, θ) for θ ∼ N(0, I), and use it as the initialization; repeat over L trials.

Conditions for global convergence

Number of initializations: L ≥ k^(Ω((k/d)^2)). Tensor rank: k = O(d). Number of iterations: N = Θ(log(1/εT)). Recall εT: approx. error.

Latest Improvement (Assuming Gaussian aj’s)

Improved initialization requirement for convergence: |⟨x^(0), aj⟩| ≥ d^β · √k / d.

SLIDE 59

Global Convergence k = O(d)

SVD Initialization

Find the top singular vector of T(I, I, θ) for θ ∼ N(0, I), and use it as the initialization; repeat over L trials.

Conditions for global convergence

Number of initializations: L ≥ k^(Ω((k/d)^2)). Tensor rank: k = O(d). Number of iterations: N = Θ(log(1/εT)). Recall εT: approx. error.

Latest Improvement (Assuming Gaussian aj’s)

Improved initialization requirement for convergence: |⟨x^(0), aj⟩| ≥ d^β · √k / d. Initialize with samples whose noise variance dσ^2 satisfies σ = o(√d / √k).

SLIDE 60

Outline

1. Introduction
2. Spectral Methods: Matrices to Tensors
3. Tensor Forms for Different Models
4. Experimental Results
5. Overcomplete Tensors
6. Conclusion

SLIDE 61

Conclusion

Guaranteed Learning of Latent Variable Models

Efficient sample and computational complexities. Better performance compared to EM, Variational Bayes, etc.

In practice

Scalable and embarrassingly parallel: handles large datasets. Efficient performance: validated via perplexity or against ground truth.

Software Code

Topic modeling: https://github.com/FurongHuang/TopicModeling
Community detection: https://github.com/FurongHuang/Fast-Detection-of-Overlappi
Youtube videos and slides from the ML summer school: http://newport.eecs.uci.edu/anandkumar/MLSS.html