SLIDE 1
Tensor Methods for large-scale Machine Learning
Anima Anandkumar
U.C. Irvine
Learning with Big Data
SLIDE 2-7
Data vs. Information
Missing observations, gross corruptions, outliers.
High dimensional regime: as data grows, more variables!
A data deluge, an information desert!
SLIDE 8-10
Learning in High Dimensional Regime
Useful information: low-dimensional structures. Learning with big data: an ill-posed problem.
Learning is finding a needle in a haystack.
Learning with big data: computationally challenging!
Principled approaches for finding low-dimensional structures?
SLIDE 11-13
How to model information structures?
Latent variable models: incorporate hidden or latent variables. Information structures: relationships between latent variables and observed data.
Basic approach: mixtures/clusters. The hidden variable is categorical.
Advanced: probabilistic models. Hidden variables have more general distributions; can model mixed-membership/hierarchical groups.
(Graphical model: hidden variables h1, h2, h3 over observed variables x1, ..., x5.)
SLIDE 14
Latent Variable Models (LVMs)
Document modeling
Observed: words. Hidden: topics.
Social Network Modeling
Observed: social interactions. Hidden: communities, relationships.
Recommendation Systems
Observed: recommendations (e.g., reviews). Hidden: user and business attributes.
Unsupervised Learning: learn the LVM without labeled examples.
SLIDE 15
LVM for Feature Engineering
Learn good features/representations for classification tasks, e.g., computer vision and NLP.
Sparse Coding/Dictionary Learning
Sparse representations with low-dimensional hidden structure: a few dictionary elements combine to make complicated shapes.
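To make the sparse-coding picture concrete, here is a minimal sketch (not from the talk): it codes a signal against a fixed random dictionary D by solving min_z 0.5‖x − Dz‖² + λ‖z‖₁ with ISTA, and a few atoms suffice to reconstruct it:

```python
import numpy as np

def ista(D, x, lam=0.05, n_iter=200):
    """Sparse coding of x against dictionary D:
    min_z 0.5 * ||x - D z||^2 + lam * ||z||_1, solved by ISTA."""
    L = np.linalg.norm(D, 2) ** 2              # Lipschitz constant of the gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = z - D.T @ (D @ z - x) / L          # gradient step on the smooth part
        z = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return z

# A signal built from three dictionary elements is recovered sparsely.
rng = np.random.default_rng(0)
D = rng.standard_normal((50, 100))
D /= np.linalg.norm(D, axis=0)                 # unit-norm atoms
z_true = np.zeros(100)
z_true[[3, 40, 77]] = [1.5, -2.0, 1.0]
x = D @ z_true
z_hat = ista(D, x)
print(np.nonzero(np.abs(z_hat) > 0.1)[0])      # mostly {3, 40, 77}
```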
SLIDE 16
Challenges in Learning LVMs
Computational Challenges
Maximum likelihood: non-convex optimization, NP-hard in general.
In practice: local search approaches such as gradient descent, EM, and variational Bayes have no consistency guarantees, can get stuck in bad local optima, suffer poor convergence rates, and are hard to parallelize.
Alternatives? Guaranteed and efficient learning?
SLIDE 17
Outline
1. Introduction
2. Spectral Methods
3. Moment Tensors of Latent Variable Models
4. Experiments
5. Conclusion
SLIDE 18
Classical Spectral Methods: Matrix PCA
For centered samples {xi}, find the projection P with Rank(P) = k that solves
min_P (1/n) Σ_{i∈[n]} ‖xi − P xi‖².
Result: the eigen-decomposition of Cov(X).
Beyond PCA: spectral methods on tensors?
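As an illustration (not from the slides), a minimal numpy sketch of PCA via the eigen-decomposition of the sample covariance; the synthetic data and k = 5 are assumptions for the example:

```python
import numpy as np

# Minimal PCA sketch: the optimal rank-k projection for centered data
# comes from the top-k eigenvectors of the sample covariance.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20)) @ rng.standard_normal((20, 20))
X -= X.mean(axis=0)                        # center the samples

cov = X.T @ X / X.shape[0]                 # Cov(X), d x d
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
U = eigvecs[:, -5:]                        # top k = 5 eigenvectors
P = U @ U.T                                # rank-k projection matrix

# (1/n) * sum_i ||x_i - P x_i||^2 equals the discarded eigenvalue mass.
err = np.mean(np.sum((X - X @ P) ** 2, axis=1))
print(err, eigvals[:-5].sum())
```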
SLIDE 19
Moment Matrices and Tensors
Multivariate Moments
M1 := E[x], M2 := E[x ⊗ x], M3 := E[x ⊗ x ⊗ x].
Matrix: E[x ⊗ x] ∈ R^{d×d} is a second-order tensor, with E[x ⊗ x]_{i1,i2} = E[x_{i1} x_{i2}]. For matrices, E[x ⊗ x] = E[xx⊤].
Tensor: E[x ⊗ x ⊗ x] ∈ R^{d×d×d} is a third-order tensor, with E[x ⊗ x ⊗ x]_{i1,i2,i3} = E[x_{i1} x_{i2} x_{i3}].
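To make the definitions concrete, a short sketch (synthetic data, added for illustration) that forms the empirical moments with einsum:

```python
import numpy as np

# Empirical moment matrix and tensor from n samples x_i in R^d.
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 10))                # n = 5000, d = 10

M1 = X.mean(axis=0)                                # E[x]         in R^d
M2 = np.einsum('ni,nj->ij', X, X) / len(X)         # E[x ⊗ x]     in R^{d×d}
M3 = np.einsum('ni,nj,nk->ijk', X, X, X) / len(X)  # E[x ⊗ x ⊗ x] in R^{d×d×d}

# Entry-wise: M3[i1, i2, i3] estimates E[x_{i1} x_{i2} x_{i3}].
print(np.allclose(M2, X.T @ X / len(X)))           # E[x ⊗ x] = E[x x^T]
```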
SLIDE 20-21
Spectral Decomposition of Tensors
Matrix: M2 = Σ_i λi ui ⊗ vi = λ1 u1 ⊗ v1 + λ2 u2 ⊗ v2 + ...
Tensor: M3 = Σ_i λi ui ⊗ vi ⊗ wi = λ1 u1 ⊗ v1 ⊗ w1 + λ2 u2 ⊗ v2 ⊗ w2 + ...
u ⊗ v ⊗ w is a rank-1 tensor, since its (i1, i2, i3)-th entry is u_{i1} v_{i2} w_{i3}.
How to solve this non-convex problem?
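For concreteness, a brief sketch (not from the deck) of a rank-1 tensor as an outer product, and a rank-2 tensor as a weighted sum of such terms:

```python
import numpy as np

# u ⊗ v ⊗ w: the (i1, i2, i3)-th entry is u[i1] * v[i2] * w[i3].
rng = np.random.default_rng(0)
u, v, w = rng.standard_normal((3, 4))
T1 = np.einsum('i,j,k->ijk', u, v, w)              # rank-1, shape (4, 4, 4)
print(np.isclose(T1[1, 2, 3], u[1] * v[2] * w[3]))

# A rank-2 tensor: sum of two weighted rank-1 terms, as in M3 above.
lam = np.array([2.0, -1.0])
U, V, W = rng.standard_normal((3, 2, 4))
M3 = np.einsum('r,ri,rj,rk->ijk', lam, U, V, W)
print(np.isclose(M3[0, 1, 2], (lam * U[:, 0] * V[:, 1] * W[:, 2]).sum()))
```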
SLIDE 22-28
Orthogonal Tensor Power Method
Symmetric orthogonal tensor T ∈ R^{d×d×d}: T = Σ_{i∈[k]} λi vi ⊗ vi ⊗ vi.
Recall the matrix power method: v → M(I, v) / ‖M(I, v)‖.
Algorithm, the tensor power method: v → T(I, v, v) / ‖T(I, v, v)‖.
How do we avoid spurious solutions (not part of the decomposition)?
- The {vi} are the only robust fixed points.
- All other eigenvectors are saddle points.
For an orthogonal tensor, no spurious local optima!
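A minimal single-machine sketch of the power iteration with deflation, assuming an exactly orthogonally decomposable symmetric tensor (synthetic example; restart and iteration counts are arbitrary choices, not the talk's code):

```python
import numpy as np

def tensor_power_method(T, n_restarts=10, n_iter=100, seed=0):
    """One (eigenvalue, eigenvector) pair of a symmetric orthogonally
    decomposable tensor via v -> T(I, v, v) / ||T(I, v, v)||,
    keeping the best of several random restarts."""
    rng = np.random.default_rng(seed)
    best_lam, best_v = -np.inf, None
    for _ in range(n_restarts):
        v = rng.standard_normal(T.shape[0])
        v /= np.linalg.norm(v)
        for _ in range(n_iter):
            v = np.einsum('ijk,j,k->i', T, v, v)      # T(I, v, v)
            v /= np.linalg.norm(v)
        lam = np.einsum('ijk,i,j,k->', T, v, v, v)    # T(v, v, v)
        if lam > best_lam:
            best_lam, best_v = lam, v
    return best_lam, best_v

# Build T = sum_i lam_i v_i ⊗ v_i ⊗ v_i with orthonormal v_i, then deflate.
d, k = 8, 3
V, _ = np.linalg.qr(np.random.default_rng(1).standard_normal((d, k)))
lams = np.array([3.0, 2.0, 1.0])
T = np.einsum('r,ir,jr,kr->ijk', lams, V, V, V)
for _ in range(k):
    lam, v = tensor_power_method(T)
    print(lam, np.abs(V.T @ v).max())                 # recovers (lam_i, ±v_i)
    T = T - lam * np.einsum('i,j,k->ijk', v, v, v)    # deflate
```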
SLIDE 29-31
Putting it together
Non-orthogonal tensor: M3 = Σ_i wi ai ⊗ ai ⊗ ai, with M2 = Σ_i wi ai ⊗ ai.
Whitening matrix W (chosen so that W⊤ M2 W = I) and the multilinear transform T = M3(W, W, W).
(Figure: whitening maps the non-orthogonal components a1, a2, a3 of tensor M3 to orthogonal components v1, v2, v3 of tensor T.)
Tensor Decomposition: guaranteed non-convex optimization!
For what latent variable models can we obtain the M2 and M3 forms?
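A sketch of the whitening step under the stated forms of M2 and M3 (synthetic components; W is built from the top-k eigenpairs of M2 so that W⊤ M2 W = I):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10, 4
A = rng.standard_normal((d, k))               # non-orthogonal components a_i
w = rng.uniform(0.5, 1.5, size=k)             # positive weights w_i

M2 = np.einsum('r,ir,jr->ij', w, A, A)        # sum_i w_i a_i ⊗ a_i
M3 = np.einsum('r,ir,jr,kr->ijk', w, A, A, A)

# Whitening matrix from the rank-k eigen-decomposition of M2.
s, U = np.linalg.eigh(M2)
s, U = s[-k:], U[:, -k:]                      # top-k eigenpairs
W = U / np.sqrt(s)                            # d x k, so W^T M2 W = I
print(np.allclose(W.T @ M2 @ W, np.eye(k)))

# Multilinear transform: T = M3(W, W, W) is a k x k x k tensor whose
# components v_i = sqrt(w_i) * W^T a_i are orthonormal (weights w_i^{-1/2}).
T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)
Vcols = np.sqrt(w) * (W.T @ A)                # columns are the v_i
print(np.allclose(Vcols @ Vcols.T, np.eye(k)))
```

The tensor power method from the previous sketch can then decompose T, and the ai are recovered by undoing the whitening (via the pseudo-inverse of W⊤).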
SLIDE 32
Outline
1. Introduction
2. Spectral Methods
3. Moment Tensors of Latent Variable Models
4. Experiments
5. Conclusion
SLIDE 33
Topic Modeling
SLIDE 34-35
Moments for Single Topic Models
E[xi | h] = Ah, with w := E[h]. Goal: learn the topic-word matrix A and the vector w.
(Graphical model: a single topic h generates the words x1, ..., x5, each through the matrix A.)
Pairwise co-occurrence matrix M2:
M2 := E[x1 ⊗ x2] = E[E[x1 ⊗ x2 | h]] = Σ_{i=1}^{k} wi ai ⊗ ai
Triples tensor M3:
M3 := E[x1 ⊗ x2 ⊗ x3] = E[E[x1 ⊗ x2 ⊗ x3 | h]] = Σ_{i=1}^{k} wi ai ⊗ ai ⊗ ai
Can be extended to learning LDA: multiple topics per document.
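A toy numerical check (added for illustration; one-hot encoded words and synthetic topics are assumptions) that the empirical pairwise co-occurrence matches Σ_i wi ai ⊗ ai:

```python
import numpy as np

# Single-topic model: draw topic h ~ w, then words x1, x2 i.i.d. from
# the topic's word distribution a_h, encoded as one-hot vectors.
rng = np.random.default_rng(0)
d, k, n = 6, 2, 100_000
A = rng.dirichlet(np.ones(d), size=k).T       # d x k topic-word matrix
w = np.array([0.3, 0.7])                      # topic proportions E[h]

topics = rng.choice(k, size=n, p=w)
x1 = np.array([rng.multinomial(1, A[:, h]) for h in topics])
x2 = np.array([rng.multinomial(1, A[:, h]) for h in topics])

M2_hat = np.einsum('ni,nj->ij', x1, x2) / n   # empirical E[x1 ⊗ x2]
M2 = np.einsum('r,ir,jr->ij', w, A, A)        # sum_i w_i a_i ⊗ a_i
print(np.abs(M2_hat - M2).max())              # small sampling error
```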
SLIDE 36
Tractable Learning for LVMs
(Model diagrams: GMM; HMM with hidden chain h1, h2, h3 and observations x1, x2, x3; ICA with sources h1, ..., hk and observations x1, ..., xd; multiview and topic models.)
SLIDE 37
Overall Framework
(Pipeline: unlabeled data → probabilistic admixture model → moment tensor decomposition → inference.)
SLIDE 38
Outline
1. Introduction
2. Spectral Methods
3. Moment Tensors of Latent Variable Models
4. Experiments
5. Conclusion
SLIDE 39
Learning Communities through Tensor Methods
Datasets:
Yelp: business-user review graph, n ≈ 40k.
DBLP: author-coauthor graph, n ≈ 1 million (subsampled: ≈ 100k).
Error (E) and recovery ratio (R):
Dataset | k̂ | Method | Running time | E | R
DBLP sub (k=250) | 500 | Ours | 10,157 | 0.139 | 89%
DBLP sub (k=250) | 500 | Variational | 558,723 | 16.38 | 99%
DBLP (k=6000) | 100 | Ours | 5,407 | 0.105 | 95%
Thanks to Prem Gopalan and David Mimno for providing the variational code.
SLIDE 40-41
Experimental Results on Yelp
Lowest-error business categories & largest-weight businesses:
Rank | Category | Business | Stars | Review count
1 | Latin American | Salvadoreno Restaurant | 4.0 | 36
2 | Gluten Free | P.F. Chang's China Bistro | 3.5 | 55
3 | Hobby Shops | Make Meaning | 4.5 | 14
4 | Mass Media | KJZZ 91.5FM | 4.0 | 13
5 | Yoga | Sutra Midtown | 4.5 | 31
Bridgeness: distance from the uniform membership vector [1/k̂, ..., 1/k̂]⊤.
Top-5 bridging nodes (businesses):
Business | Categories
Four Peaks Brewing | Restaurants, Bars, American, Nightlife, Food, Pubs, Tempe
Pizzeria Bianco | Restaurants, Pizza, Phoenix
FEZ | Restaurants, Bars, American, Nightlife, Mediterranean, Lounges, Phoenix
Matt's Big Breakfast | Restaurants, Phoenix, Breakfast & Brunch
Cornish Pasty Co | Restaurants, Bars, Nightlife, Pubs, Tempe
SLIDE 42
Tensor Decomposition on GPUs
Embarrassingly parallel and fast!
(Log-log plot: running time in seconds vs. number of communities k, comparing MATLAB Tensor Toolbox (CPU), CULA Standard Interface (GPU), CULA Device Interface (GPU), and Eigen Sparse (CPU).)
SLIDE 43
Tensor Methods on the Cloud
Communication and System Architecture Overhead
Map-Reduce Framework
k tensor slices (each ∈ R^{k×k}) stored in HDFS.
(Diagram: every ALS sweep over modes a, b, c reads the factor matrices A, B, C from disk, allocates containers, runs ALS, and writes results back to disk.)
Overhead: disk reads, container allocation, intensive key/value design.
SLIDE 44
Tensor Methods on the Cloud
Solution: Retainable Evaluator Execution Framework (REEF)
(Diagram: a single disk read and a one-time container allocation; ALS then iterates over modes a, b, c with the tensor held in memory.)
Open-source distributed system; one-time container allocation keeps the tensor in memory. www.reef-project.org
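The ALS sweeps that these pipelines distribute can be sketched on one machine. Below is a minimal, unoptimized CP-ALS in numpy (illustrative only; plain pseudo-inverse solves stand in for the distributed implementation):

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: (J x R), (K x R) -> (J*K x R)."""
    return np.einsum('jr,kr->jkr', B, C).reshape(-1, B.shape[1])

def cp_als(T, rank, n_iter=50, seed=0):
    """Minimal CP-ALS: cycle over modes a, b, c, solving a linear
    least-squares problem for one factor while the other two are fixed."""
    rng = np.random.default_rng(seed)
    I, J, K = T.shape
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    Ta = T.reshape(I, -1)                      # mode-a matricization
    Tb = np.moveaxis(T, 1, 0).reshape(J, -1)   # mode-b
    Tc = np.moveaxis(T, 2, 0).reshape(K, -1)   # mode-c
    for _ in range(n_iter):
        A = Ta @ np.linalg.pinv(khatri_rao(B, C)).T
        B = Tb @ np.linalg.pinv(khatri_rao(A, C)).T
        C = Tc @ np.linalg.pinv(khatri_rao(A, B)).T
    return A, B, C

# Recover a synthetic rank-3 tensor (up to permutation and scaling).
rng = np.random.default_rng(1)
A0, B0, C0 = (rng.standard_normal((n, 3)) for n in (5, 6, 7))
T = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A, B, C = cp_als(T, rank=3)
print(np.abs(T - np.einsum('ir,jr,kr->ijk', A, B, C)).max())  # near zero
```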
SLIDE 45
Initial Results from Cloud Implementation
New York Times Corpus
Perplexity: Stochastic Variational Inference (SVI) 4000; Tensor Decomposition 3400.
Running time:
Method | Overall | Whiten | Matricize | ALS
SVI (1 node) | 2 hours | - | - | -
Map-Reduce (1 node) | 4 hours 31 mins | 16 mins | 15 mins | 4 hours
REEF (1 node) | 68 mins | 16 mins | 15 mins | 37 mins
REEF (4 nodes) | 36 mins | 16 mins | 4 mins | 16 mins
SLIDE 46
Outline
1. Introduction
2. Spectral Methods
3. Moment Tensors of Latent Variable Models
4. Experiments
5. Conclusion
SLIDE 47
Conclusion: Tensor Methods for Learning
Tensor Decomposition
Efficient sample and computational complexity. Better performance than EM, variational Bayes, etc.
In practice
Scalable and embarrassingly parallel: handles large datasets. Strong empirical performance, validated via perplexity and ground-truth recovery.
Related Topics
Tensor Methods for Discriminative Learning: learning neural networks, mixtures of classifiers, etc.
Overcomplete Tensor Decomposition: neural networks, sparse coding, and ICA models tend to be overcomplete (more neurons than input dimensions).