
slide-1
SLIDE 1

Guaranteed Learning of Latent Variable Models through Tensor Methods

Furong Huang

University of Maryland

furongh@cs.umd.edu ACM SIGMETRICS Tutorial 2018

1 / 75

slide-2
SLIDE 2

Tutorial Topic

Learning algorithms for latent variable models based on decompositions of moment tensors.

[Figure: latent variable model — observed words (life, gene, data, DNA, RNA) generated from hidden topics k1–k5 via matrix A, with hidden choice variable h; pipeline: unlabeled data → latent variable model → tensor decomposition → inference, with the moment tensor written as a sum of rank-1 terms]

“Method-of-moments” (Pearson, 1894)

2 / 75

slide-3
SLIDE 3

Tutorial Topic

Learning algorithms (parameter estimation) for latent variable models based on decompositions of moment tensors.

[Figure: latent variable model — observed words (life, gene, data, DNA, RNA) generated from hidden topics k1–k5 via matrix A, with hidden choice variable h; pipeline: unlabeled data → latent variable model → tensor decomposition → inference, with the moment tensor written as a sum of rank-1 terms]

“Method-of-moments” (Pearson, 1894)

2 / 75

slide-4
SLIDE 4

Application 1: Clustering

Basic operation of grouping data points. Hypothesis: each data point belongs to an unknown group.

3 / 75

slide-5
SLIDE 5

Application 1: Clustering

Basic operation of grouping data points. Hypothesis: each data point belongs to an unknown group.

Probabilistic/latent variable viewpoint

The groups represent different distributions. (e.g. Gaussian). Each data point is drawn from one of the given distributions. (e.g. Gaussian mixtures).

3 / 75

slide-6
SLIDE 6

Application 2: Topic Modeling

Document modeling

Observed: words in document corpus. Hidden: topics. Goal: carry out document summarization.

4 / 75

slide-7
SLIDE 7

Application 3: Understanding Human Communities

Social Networks

Observed: network of social ties, e.g. friendships, co-authorships Hidden: groups/communities of social actors.

5 / 75

slide-8
SLIDE 8

Application 4: Recommender Systems

Recommender System

Observed: Ratings of users for various products, e.g. yelp reviews. Goal: Predict new recommendations. Modeling: Find groups/communities of users and products.

6 / 75

slide-9
SLIDE 9

Application 5: Feature Learning

Feature Engineering

Learn good features/representations for classification tasks, e.g. image and speech recognition. Sparse representations, low dimensional hidden structures.

7 / 75

slide-10
SLIDE 10

Application 6: Computational Biology

Observed: gene expression levels Goal: discover gene groups Hidden variables: regulators controlling gene groups

8 / 75

slide-11
SLIDE 11

Application 7: Human Disease Hierarchy Discovery

CMS: 1.6 million patients, 168 million diagnostic events, 11 k diseases.

“Scalable Latent Tree Model and its Application to Health Analytics” by F. Huang, N. U. Niranjan, I. Perros, R. Chen, J. Sun, A. Anandkumar, NIPS 2015 MLHC workshop.

9 / 75

slide-12
SLIDE 12

How to model hidden effects?

Basic Approach: mixtures/clusters

Hidden variable h is categorical.

Advanced: Probabilistic models

Hidden variable h has more general distributions. Can model mixed memberships. [Diagram: observed variables x1–x5 with hidden variables h1–h3.] This talk: basic mixture model and some advanced models.

10 / 75

slide-13
SLIDE 13

Challenges in Learning

Basic goal in all mentioned applications

Discover hidden structure in data: unsupervised learning.

[Figure: unlabeled data → latent variable model → learning algorithm → inference]

11 / 75

slide-14
SLIDE 14

Challenges in Learning – find hidden structure in data

[Figure: unlabeled data → latent variable model → learning algorithm → inference]

Challenge: Conditions for Identifiability

Can the model be identified given infinite computation and data? Are there tractable algorithms under identifiability?

11 / 75

slide-15
SLIDE 15

Challenges in Learning – find hidden structure in data

[Figure: unlabeled data → latent variable model → MCMC → inference]

Challenge: Conditions for Identifiability

Can the model be identified given infinite computation and data? Are there tractable algorithms under identifiability?

Challenge: Efficient Learning of Latent Variable Models

MCMC: random sampling, slow

◮ Exponential mixing time 11 / 75

slide-16
SLIDE 16

Challenges in Learning – find hidden structure in data

[Figure: unlabeled data → latent variable model → learning algorithm → inference]

Challenge: Conditions for Identifiability

Can the model be identified given infinite computation and data? Are there tractable algorithms under identifiability?

Challenge: Efficient Learning of Latent Variable Models

MCMC: random sampling, slow

◮ Exponential mixing time

Likelihood: non-convex, not scalable

◮ Exponential critical points 11 / 75

slide-17
SLIDE 17

Challenges in Learning – find hidden structure in data

[Figure: unlabeled data → latent variable model → learning algorithm → inference]

Challenge: Conditions for Identifiability

Can the model be identified given infinite computation and data? Are there tractable algorithms under identifiability?

Challenge: Efficient Learning of Latent Variable Models

MCMC: random sampling, slow

◮ Exponential mixing time

Likelihood: non-convex, not scalable

◮ Exponential critical points

Efficient computational and sample complexities?

11 / 75

slide-18
SLIDE 18

Challenges in Learning – find hidden structure in data

[Figure: unlabeled data → latent variable model → tensor decomposition → inference, with the moment tensor written as a sum of rank-1 terms]

Challenge: Conditions for Identifiability

Can the model be identified given infinite computation and data? Are there tractable algorithms under identifiability?

Challenge: Efficient Learning of Latent Variable Models

MCMC: random sampling, slow

◮ Exponential mixing time

Likelihood: non-convex, not scalable

◮ Exponential critical points

Efficient computational and sample complexities? Guaranteed and efficient learning through spectral methods

11 / 75

slide-19
SLIDE 19

What this tutorial will cover

Outline

1

Introduction

2

Motivation: Challenges of MLE for Gaussian Mixtures

12 / 75

slide-20
SLIDE 20

What this tutorial will cover

Outline

1

Introduction

2

Motivation: Challenges of MLE for Gaussian Mixtures

3

Introduction of Method of Moments and Tensor Notations

12 / 75

slide-21
SLIDE 21

What this tutorial will cover

Outline

1

Introduction

2

Motivation: Challenges of MLE for Gaussian Mixtures

3

Introduction of Method of Moments and Tensor Notations

4

Topic Model for Single-topic Documents

12 / 75

slide-22
SLIDE 22

What this tutorial will cover

Outline

1

Introduction

2

Motivation: Challenges of MLE for Gaussian Mixtures

3

Introduction of Method of Moments and Tensor Notations

4

Topic Model for Single-topic Documents

◮ Identifiability ◮ Parameter recovery via decomposition of exact moments 12 / 75

slide-23
SLIDE 23

What this tutorial will cover

Outline

1

Introduction

2

Motivation: Challenges of MLE for Gaussian Mixtures

3

Introduction of Method of Moments and Tensor Notations

4

Topic Model for Single-topic Documents

◮ Identifiability ◮ Parameter recovery via decomposition of exact moments 5

Error-tolerant Algorithms for Tensor Decompositions

12 / 75

slide-24
SLIDE 24

What this tutorial will cover

Outline

1

Introduction

2

Motivation: Challenges of MLE for Gaussian Mixtures

3

Introduction of Method of Moments and Tensor Notations

4

Topic Model for Single-topic Documents

◮ Identifiability ◮ Parameter recovery via decomposition of exact moments 5

Error-tolerant Algorithms for Tensor Decompositions

◮ Decomposition for tensors with linearly independent components ◮ Decomposition for tensors with orthogonal components 12 / 75

slide-25
SLIDE 25

What this tutorial will cover

Outline

1

Introduction

2

Motivation: Challenges of MLE for Gaussian Mixtures

3

Introduction of Method of Moments and Tensor Notations

4

Topic Model for Single-topic Documents

◮ Identifiability ◮ Parameter recovery via decomposition of exact moments 5

Error-tolerant Algorithms for Tensor Decompositions

◮ Decomposition for tensors with linearly independent components ◮ Decomposition for tensors with orthogonal components 6

Tensor Decomposition for Neural Network Compression

7

Conclusion

12 / 75

slide-26
SLIDE 26

Outline

1

Introduction

2

Motivation: Challenges of MLE for Gaussian Mixtures

3

Introduction of Method of Moments and Tensor Notations

4

Topic Model for Single-topic Documents

5

Algorithms for Tensor Decompositions

6

Tensor Decomposition for Neural Network Compression

7

Conclusion

13 / 75

slide-27
SLIDE 27

Gaussian Mixture Model

Generative Model

Samples come from a mixture of K Gaussians with mixing weights Cat(π1, π2, . . . , πK). Each sample is drawn from one of the K Gaussians N(µh, Σh), h ∈ [K]: H ∼ Cat(π1, π2, . . . , πK), X | H = h ∼ N(µh, Σh), ∀h ∈ [K]

14 / 75

slide-28
SLIDE 28

Gaussian Mixture Model

Generative Model

Samples come from a mixture of K Gaussians with mixing weights Cat(π1, π2, . . . , πK). Each sample is drawn from one of the K Gaussians N(µh, Σh), h ∈ [K]: H ∼ Cat(π1, π2, . . . , πK), X | H = h ∼ N(µh, Σh), ∀h ∈ [K]

Learning Problem

Estimate the mean vector µh, covariance matrix Σh, and mixing weight πh of each subpopulation from unlabeled data.

14 / 75
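To make the generative model concrete, here is a minimal NumPy sketch (illustrative only; the mixing weights, means, and covariances below are made-up toy values) that samples exactly as described: first a hidden label H from Cat(π), then X | H = h from N(µ_h, Σ_h).

```python
import numpy as np

def sample_gmm(n, pis, mus, sigmas, rng=np.random.default_rng(0)):
    """Draw n samples: H ~ Cat(pi), X | H = h ~ N(mu_h, Sigma_h)."""
    K = len(pis)
    h = rng.choice(K, size=n, p=pis)                       # hidden component labels
    x = np.stack([rng.multivariate_normal(mus[k], sigmas[k]) for k in h])
    return x, h

# Toy example with K = 3 spherical components in R^2 (made-up parameters).
pis = np.array([0.5, 0.3, 0.2])
mus = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
sigmas = np.array([np.eye(2)] * 3)
x, h = sample_gmm(1000, pis, mus, sigmas)
print(x.shape, np.bincount(h) / len(h))                    # empirical label frequencies ≈ pis
```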

slide-29
SLIDE 29

Maximum Likelihood Estimator (MLE)

Data: {x_i}_{i=1}^n

Likelihood (iid): Pr_θ(data) = ∏_{i=1}^n Pr_θ(x_i)

Model parameter estimation: θ_mle := argmax_{θ∈Θ} log Pr_θ(data)

Latent variable models: some variables are hidden
◮ No “direct” estimators when some variables are hidden
◮ Local optimization via Expectation-Maximization (EM) (Dempster, Laird, & Rubin, 1977)

15 / 75

slide-30
SLIDE 30

MLE for Gaussian Mixture Models

Given data {x_i}_{i=1}^n and the number of Gaussian components K, the model parameters to be estimated are θ = {(µh, Σh, πh)}_{h=1}^K.

θ_mle for Gaussian Mixture Models:
θ_mle := argmax_θ Σ_{i=1}^n log Σ_{h=1}^K [ πh / det(Σh)^{1/2} ] exp( −(1/2) (x_i − µh)^⊤ Σh^{−1} (x_i − µh) )

Solving the MLE is NP-hard (Dasgupta, 2008; Aloise, Deshpande, Hansen, & Popat, 2009; Mahajan, Nimbhorkar, & Varadarajan, 2009; Vattani, 2009; Awasthi, Charikar, Krishnaswamy, & Sinop, 2015).

16 / 75

slide-31
SLIDE 31

Consistent Estimator

Definition

Suppose iid samples {x_i}_{i=1}^n are generated by the distribution Pr_θ, where the model parameters θ ∈ Θ are unknown. An estimator θ̂ is consistent if E‖θ̂ − θ‖ → 0 as n → ∞.

Spherical Gaussian Mixtures (Σh = I), as n → ∞:

For K = 2 and πh = 1/2: EM is consistent (Xu, H., & Maleki, 2016; Daskalakis, Tzamos, & Zampetakis, 2016).

Larger K: EM is easily trapped in local maxima far from the global maximum (Jin, Zhang, Balakrishnan, Wainwright, & Jordan, 2016).

Practitioners often run EM with many (random) restarts, but it may take a long time to get near the global maximum.

17 / 75

slide-32
SLIDE 32

Hardness of Parameter Estimation

Exponentially difficult computationally or statistically to learn model parameters, even under the parametric setting.

Cryptographic hardness

E.g., Mossel & Roch, 2006

Information-theoretic hardness

E.g., Moitra & Valiant, 2010. May require 2^{Ω(K)} running time or 2^{Ω(K)} sample size.

18 / 75

slide-33
SLIDE 33

Ways Around the Hardness

Separation conditions.

◮ E.g., assume min_{i≠j} ‖µi − µj‖² / (σi² + σj²) is sufficiently large.
◮ (Dasgupta, 1999; Arora & Kannan, 2001; Vempala & Wang, 2002; . . . )

Structural assumptions.

◮ E.g., assume sparsity, separable (anchor words). ◮ (Spielman, Wang & Wright, 2012; Arora, Ge & Moitra, 2012; . . . )

Non-degeneracy conditions.

◮ E.g., assume µ1, . . ., µK span a K-dimensional space.

This tutorial: statistically and computationally efficient learning algorithms for non-degenerate instances via method-of-moments.

19 / 75

slide-34
SLIDE 34

Outline

1

Introduction

2

Motivation: Challenges of MLE for Gaussian Mixtures

3

Introduction of Method of Moments and Tensor Notations

4

Topic Model for Single-topic Documents

5

Algorithms for Tensor Decompositions

6

Tensor Decomposition for Neural Network Compression

7

Conclusion

20 / 75

slide-35
SLIDE 35

Method-of-Moments At A Glance

1. Determine functions of the model parameters θ that are estimable from observable data:
◮ Moments E_θ[f(X)]

2. Form estimates of the moments using data (iid samples {x_i}_{i=1}^n):
◮ Empirical moments Ê[f(X)]

3. Solve the (approximate) moment-matching equations for the parameters θ:
◮ Moment matching E_θ[f(X)] = Ê[f(X)] as n → ∞

Toy Example

How do we estimate a Gaussian, i.e., (µ, Σ), given iid samples {x_i}_{i=1}^n ∼ N(µ, Σ)?

21 / 75
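As a sanity check on the three-step recipe, the sketch below (an assumption-laden toy: one multivariate Gaussian, synthetic NumPy data) matches the first two moments to recover (µ, Σ), using E[x] = µ and E[(x − µ)(x − µ)^⊤] = Σ.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([1.0, -2.0])
sigma_true = np.array([[2.0, 0.5], [0.5, 1.0]])
x = rng.multivariate_normal(mu_true, sigma_true, size=50_000)   # iid samples

# Step 2: empirical moments; Step 3: moment matching gives the estimators directly.
mu_hat = x.mean(axis=0)                          # matches E[x] = mu
sigma_hat = np.cov(x, rowvar=False, bias=True)   # matches E[(x - mu)(x - mu)^T] = Sigma

print(np.round(mu_hat, 2), np.round(sigma_hat, 2))
```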

slide-36
SLIDE 36

What is a tensor?

Multi-dimensional Array

Tensor: a higher-order analogue of a matrix. The number of dimensions (modes) is called the tensor order.

22 / 75

slide-37
SLIDE 37

Tensor Product

[a ⊗ b]_{i1,i2} = a_{i1} b_{i2}: rank-1 matrix
[a ⊗ b ⊗ c]_{i1,i2,i3} = a_{i1} b_{i2} c_{i3}: rank-1 tensor

23 / 75
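A quick NumPy illustration of these definitions (a sketch with made-up vectors, not part of the slides): `np.einsum` builds the rank-1 matrix a ⊗ b and the rank-1 tensor a ⊗ b ⊗ c entrywise.

```python
import numpy as np

a, b, c = np.array([1.0, 2.0]), np.array([3.0, 4.0, 5.0]), np.array([6.0, 7.0])

M = np.einsum('i,j->ij', a, b)        # [a ⊗ b]_{i1,i2} = a_{i1} b_{i2}, a rank-1 matrix
T = np.einsum('i,j,k->ijk', a, b, c)  # [a ⊗ b ⊗ c]_{i1,i2,i3} = a_{i1} b_{i2} c_{i3}

print(M.shape, np.linalg.matrix_rank(M))          # (2, 3), rank 1
print(T.shape, T[1, 2, 0] == a[1] * b[2] * c[0])  # (2, 3, 2), True
```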

slide-38
SLIDE 38

Slices

Horizontal slices Lateral slices Frontal slices

24 / 75

slide-39
SLIDE 39

Fiber

Mode-1 (column) fibers Mode-2 (row) fibers Mode-3 (tube) fibers

25 / 75

slide-40
SLIDE 40

CP decomposition

X = Σ_{h=1}^R a_h ⊗ b_h ⊗ c_h

Rank: the minimum number of rank-1 tensors whose sum generates the tensor.

26 / 75

slide-41
SLIDE 41

Multi-linear Transform

Multi-linear Operation

If T = Σ_{h=1}^R a_h ⊗ b_h ⊗ c_h, the multi-linear operation using matrices (X, Y, Z) is
T(X, Y, Z) := Σ_{h=1}^R (X^⊤ a_h) ⊗ (Y^⊤ b_h) ⊗ (Z^⊤ c_h).
Similarly, the multi-linear operation using vectors (x, y, z) is
T(x, y, z) := Σ_{h=1}^R (x^⊤ a_h) ⊗ (y^⊤ b_h) ⊗ (z^⊤ c_h).

27 / 75
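The multi-linear operation is just a contraction of each mode with the corresponding matrix. The sketch below (illustrative, NumPy only, random made-up factors) builds a CP tensor T = Σ_h a_h ⊗ b_h ⊗ c_h and checks that contracting the modes with (X, Y, Z) equals Σ_h (X^⊤a_h) ⊗ (Y^⊤b_h) ⊗ (Z^⊤c_h).

```python
import numpy as np

rng = np.random.default_rng(0)
d, R = 5, 3
A, B, C = rng.normal(size=(d, R)), rng.normal(size=(d, R)), rng.normal(size=(d, R))

# CP tensor T = sum_h a_h ⊗ b_h ⊗ c_h
T = np.einsum('ih,jh,kh->ijk', A, B, C)

# Multi-linear operation T(X, Y, Z): contract mode 1 with X, mode 2 with Y, mode 3 with Z.
X, Y, Z = rng.normal(size=(d, 4)), rng.normal(size=(d, 4)), rng.normal(size=(d, 4))
lhs = np.einsum('ijk,ia,jb,kc->abc', T, X, Y, Z)
rhs = np.einsum('ih,jh,kh->ijk', X.T @ A, Y.T @ B, Z.T @ C)  # sum_h (X^T a_h)⊗(Y^T b_h)⊗(Z^T c_h)

print(np.allclose(lhs, rhs))  # True
```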

slide-42
SLIDE 42

Tensors in Method of Moments

Matrix: pair-wise relationships. Observed signal or data x ∈ R^d. Rank-1 matrix: [x ⊗ x]_{i,j} = x_i x_j. Aggregated pair-wise relationship: M2 = E[x ⊗ x].

Tensor: triple-wise (or higher-order) relationships. Observed signal or data x ∈ R^d. Rank-1 tensor: [x ⊗ x ⊗ x]_{i,j,k} = x_i x_j x_k. Aggregated triple-wise relationship: M3 = E[x ⊗ x ⊗ x] = E[x^{⊗3}].

28 / 75
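Forming the empirical moments from data is a one-liner per moment; the sketch below (NumPy, synthetic data only) averages the rank-1 outer products x_i ⊗ x_i and x_i ⊗ x_i ⊗ x_i.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 4
x = rng.normal(size=(n, d))                       # n iid observations in R^d

M2_hat = np.einsum('ni,nj->ij', x, x) / n         # empirical E[x ⊗ x]
M3_hat = np.einsum('ni,nj,nk->ijk', x, x, x) / n  # empirical E[x ⊗ x ⊗ x]

print(M2_hat.shape, M3_hat.shape)                 # (4, 4) and (4, 4, 4)
```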

slide-43
SLIDE 43

Why are tensors powerful?

Matrix Orthogonal Decomposition

Not unique without eigenvalue gap

I = e1 e1^⊤ + e2 e2^⊤ = u1 u1^⊤ + u2 u2^⊤, with u1 = [√2/2, −√2/2], u2 = [√2/2, √2/2]
29 / 75

slide-44
SLIDE 44

Why are tensors powerful?

Matrix Orthogonal Decomposition

Not unique without eigenvalue gap

I = e1 e1^⊤ + e2 e2^⊤ = u1 u1^⊤ + u2 u2^⊤, with u1 = [√2/2, −√2/2], u2 = [√2/2, √2/2]

Unique with eigenvalue gap

29 / 75

slide-45
SLIDE 45

Why are tensors powerful?

Matrix Orthogonal Decomposition

Not unique without eigenvalue gap

I = e1 e1^⊤ + e2 e2^⊤ = u1 u1^⊤ + u2 u2^⊤, with u1 = [√2/2, −√2/2], u2 = [√2/2, √2/2]

Unique with eigenvalue gap

Tensor Orthogonal Decomposition (Harshman, 1970)

Unique: eigenvalue gap not needed

+ = ≠

29 / 75

slide-46
SLIDE 46

Why are tensors powerful?

Matrix Orthogonal Decomposition

Not unique without eigenvalue gap

I = e1 e1^⊤ + e2 e2^⊤ = u1 u1^⊤ + u2 u2^⊤, with u1 = [√2/2, −√2/2], u2 = [√2/2, √2/2]

Unique with eigenvalue gap

Tensor Orthogonal Decomposition (Harshman, 1970)

Unique: eigenvalue gap not needed Slice of tensor has eigenvalue gap

+ =

29 / 75

slide-47
SLIDE 47

Why are tensors powerful?

Matrix Orthogonal Decomposition

Not unique without eigenvalue gap

I = e1 e1^⊤ + e2 e2^⊤ = u1 u1^⊤ + u2 u2^⊤, with u1 = [√2/2, −√2/2], u2 = [√2/2, √2/2]

Unique with eigenvalue gap

Tensor Orthogonal Decomposition (Harshman, 1970)

Unique: eigenvalue gap not needed Slice of tensor has eigenvalue gap

+ = ≠

29 / 75

slide-48
SLIDE 48

Outline

1

Introduction

2

Motivation: Challenges of MLE for Gaussian Mixtures

3

Introduction of Method of Moments and Tensor Notations

4

Topic Model for Single-topic Documents

5

Algorithms for Tensor Decompositions

6

Tensor Decomposition for Neural Network Compression

7

Conclusion

30 / 75

slide-49
SLIDE 49

Topic Modeling

General Topic Model (e.g., Latent Dirichlet Allocation)

K topics
◮ each associated with a distribution over vocabulary words {a_h}_{h=1}^K

Hidden topic proportion w
◮ per document i, w^(i) ∈ ∆^{K−1}

Documents iid ∼ mixture of topics

[Figure: word counts per document and topic-word matrix, with topics Politics, Science, Sports, Business and words such as game, season, play]

31 / 75

slide-50
SLIDE 50

Topic Modeling

Topic Model for Single-topic Documents

K topics
◮ each associated with a distribution over vocabulary words {a_h}_{h=1}^K

Hidden topic proportion w
◮ per document i, w^(i) ∈ {e1, . . . , eK}

Documents iid ∼ a_h

[Figure: word counts per document (a single topic with weight 1.0) and topic-word matrix, with topics Politics, Science, Sports, Business and words such as game, season, play]

31 / 75

slide-51
SLIDE 51

Model Parameters of Topic Model for Single-topic Documents

Estimate Topic Proportion
Topic proportion w = [w1, . . . , wK], w_h = P[topic of word = h]

Estimate Topic Word Matrix
Topic-word matrix A = [a1, . . . , aK], A_{jh} = P[word = e_j | topic = h]

Goal: estimate the model parameters {(a_h, w_h)}_{h=1}^K, given iid samples of n documents (word counts {c^(i)}_{i=1}^n)

Frequency vector x^(i) = c^(i)/L, where the document length is L = Σ_j c^(i)_j

32 / 75

slide-52
SLIDE 52

Moment Matching

Nondegenerate model (linearly independent topic-word matrix)

[Figure: topic-word matrix with topics Politics, Science, Sports, Business]

Generative process:
◮ Choose h ∼ Cat(w1, . . . , wK)
◮ Generate L words ∼ a_h

E[x] = Σ_{h=1}^K P[topic = h] E[x | topic = h]
E[x | topic = h] = Σ_j P[word = e_j | topic = h] e_j = a_h

33 / 75

slide-53
SLIDE 53

Moment Matching

Nondegenerate model (linearly independent topic-word matrix)

[Figure: topic-word matrix with topics Politics, Science, Sports, Business]

Generative process:
◮ Choose h ∼ Cat(w1, . . . , wK)
◮ Generate L words ∼ a_h

E[x] = Σ_{h=1}^K P[topic = h] E[x | topic = h] = Σ_{h=1}^K w_h a_h
E[x | topic = h] = Σ_j P[word = e_j | topic = h] e_j = a_h

33 / 75

slide-54
SLIDE 54

Identifiability: how long must the documents be?

Nondegenerate model (linearly independent topic-word matrix)

[Figure: topic-word matrix with topics Politics, Science, Sports, Business]

Generative process:
◮ Choose h ∼ Cat(w1, . . . , wK)
◮ Generate L words ∼ a_h

E[x] = Σ_{h=1}^K P[topic = h] E[x | topic = h] = Σ_{h=1}^K w_h a_h
E[x | topic = h] = Σ_j P[word = e_j | topic = h] e_j = a_h

M1: distribution of words (M̂1: occurrence frequency of words)
M1 = E[x] = Σ_h w_h a_h;  M̂1 = (1/n) Σ_{i=1}^n x^(i)

[Figure: M1 as a weighted sum of topic vectors (Crime, Sports, Education) over the words campus, police, witness]

33 / 75

slide-55
SLIDE 55

Identifiability: how long must the documents be?

Nondegenerate model (linearly independent topic-word matrix)

[Figure: topic-word matrix with topics Politics, Science, Sports, Business]

Generative process:
◮ Choose h ∼ Cat(w1, . . . , wK)
◮ Generate L words ∼ a_h

E[x] = Σ_{h=1}^K P[topic = h] E[x | topic = h] = Σ_{h=1}^K w_h a_h
E[x | topic = h] = Σ_j P[word = e_j | topic = h] e_j = a_h

M1: distribution of words (M̂1: occurrence frequency of words)
M1 = E[x] = Σ_h w_h a_h;  M̂1 = (1/n) Σ_{i=1}^n x^(i)

[Figure: M1 as a weighted sum of topic vectors (Crime, Sports, Education) over the words campus, police, witness]

No unique decomposition of vectors

33 / 75

slide-56
SLIDE 56

Identifiability: how long must the documents be?

Nondegenerate model (linearly independent topic-word matrix)

[Figure: topic-word matrix with topics Politics, Science, Sports, Business]

Generative process:
◮ Choose h ∼ Cat(w1, . . . , wK)
◮ Generate L words ∼ a_h

E[x] = Σ_{h=1}^K P[topic = h] E[x | topic = h] = Σ_{h=1}^K w_h a_h
E[x | topic = h] = Σ_j P[word = e_j | topic = h] e_j = a_h

M2: distribution of word pairs (M̂2: co-occurrence of word pairs)
M2 = E[x ⊗ x] = Σ_h w_h a_h ⊗ a_h;  M̂2 = (1/n) Σ_{i=1}^n x^(i) ⊗ x^(i)

[Figure: M2 as a weighted sum of rank-1 matrices for topics Crime, Sports, Education over the words campus, police, witness]

33 / 75

slide-57
SLIDE 57

Identifiability: how long must the documents be?

Nondegenerate model (linearly independent topic-word matrix)

[Figure: topic-word matrix with topics Politics, Science, Sports, Business]

Generative process:
◮ Choose h ∼ Cat(w1, . . . , wK)
◮ Generate L words ∼ a_h

E[x] = Σ_{h=1}^K P[topic = h] E[x | topic = h] = Σ_{h=1}^K w_h a_h
E[x | topic = h] = Σ_j P[word = e_j | topic = h] e_j = a_h

M2: distribution of word pairs (M̂2: co-occurrence of word pairs)
M2 = E[x ⊗ x] = Σ_h w_h a_h ⊗ a_h;  M̂2 = (1/n) Σ_{i=1}^n x^(i) ⊗ x^(i)

[Figure: M2 as a weighted sum of rank-1 matrices for topics Crime, Sports, Education over the words campus, police, witness]

Matrix decomposition recovers subspace, not actual model

33 / 75

slide-58
SLIDE 58

Identifiability: how long must the documents be?

Nondegenerate model (linearly independent topic-word matrix)

Find a W (applied to each mode: W^⊤, W^⊤, W^⊤) such that

M2: distribution of word pairs (M̂2: co-occurrence of word pairs)
M2 = E[x ⊗ x] = Σ_h w_h a_h ⊗ a_h;  M̂2 = (1/n) Σ_{i=1}^n x^(i) ⊗ x^(i)

[Figure: M2 as a weighted sum of rank-1 matrices for topics Crime, Sports, Education over the words campus, police, witness]

Many such W’s exist; find one such that the v_h = W^⊤ a_h are orthogonal

33 / 75

slide-59
SLIDE 59

Identifiability: how long must the documents be?

Nondegenerate model (linearly independent topic-word matrix)

Know a W (applied to each mode: W^⊤, W^⊤, W^⊤) such that the components become orthogonal.

M3: distribution of word triples (M̂3: co-occurrence of word triples)
M3 = E[x^{⊗3}] = Σ_h w_h a_h^{⊗3};  M̂3 = (1/n) Σ_{i=1}^n (x^(i))^{⊗3}

[Figure: M3 as a weighted sum of rank-1 tensors for topics Crime, Sports, Education over the words campus, police, witness]

Orthogonalize the tensor, project data with W : M3(W , W , W )

33 / 75

slide-60
SLIDE 60

Identifiability: how long must the documents be?

Nondegenerate model (linearly independent topic-word matrix)

Know a W (applied to each mode: W^⊤, W^⊤, W^⊤) such that the components become orthogonal.

M3: distribution of word triples (M̂3: co-occurrence of word triples)
M3(W, W, W) = E[(W^⊤ x)^{⊗3}] = Σ_h w_h (W^⊤ a_h)^{⊗3};  M̂3(W, W, W) = (1/n) Σ_{i=1}^n (W^⊤ x^(i))^{⊗3}

Unique orthogonal tensor decomposition {v̂_h}_{h=1}^K

33 / 75

slide-61
SLIDE 61

Identifiability: how long must the documents be?

Nondegenerate model (linearly independent topic-word matrix)

Know a W (applied to each mode: W^⊤, W^⊤, W^⊤) such that the components become orthogonal.

M3: distribution of word triples (M̂3: co-occurrence of word triples)
M3(W, W, W) = E[(W^⊤ x)^{⊗3}] = Σ_h w_h (W^⊤ a_h)^{⊗3};  M̂3(W, W, W) = (1/n) Σ_{i=1}^n (W^⊤ x^(i))^{⊗3}

Model parameter estimation: a_h = (W^⊤)^† v̂_h

33 / 75

slide-62
SLIDE 62

Identifiability: how long must the documents be?

Nondegenerate model (linearly independent topic-word matrix)

Know a W (applied to each mode: W^⊤, W^⊤, W^⊤) such that the components become orthogonal.

M3: distribution of word triples (M̂3: co-occurrence of word triples)
M3(W, W, W) = E[(W^⊤ x)^{⊗3}] = Σ_h w_h (W^⊤ a_h)^{⊗3};  M̂3(W, W, W) = (1/n) Σ_{i=1}^n (W^⊤ x^(i))^{⊗3}

L ≥ 3: Learning Topic Models through Matrix/Tensor Decomposition

33 / 75
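To tie the recipe together, here is a compact NumPy sketch (illustrative only: it works on exact population moments M2 and M3 built from synthetic made-up parameters, and uses an eigendecomposition of a random slice combination of the whitened tensor in place of the tensor power method covered later). It whitens with M2, decomposes the whitened M3, and un-whitens to recover {(a_h, w_h)}.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 8, 3
A = rng.dirichlet(np.ones(d), size=K).T          # topic-word matrix, columns a_h (d x K)
w = np.array([0.5, 0.3, 0.2])                    # topic proportions

# Exact population moments of the single-topic model.
M2 = np.einsum('h,ih,jh->ij', w, A, A)           # sum_h w_h a_h ⊗ a_h
M3 = np.einsum('h,ih,jh,kh->ijk', w, A, A, A)    # sum_h w_h a_h^{⊗3}

# Whitening: W^T M2 W = I using the top-K eigenpairs of M2.
evals, U = np.linalg.eigh(M2)
evals, U = evals[-K:], U[:, -K:]
W = U / np.sqrt(evals)                           # d x K
T = np.einsum('ijk,ia,jb,kc->abc', M3, W, W, W)  # orthogonal K x K x K tensor

# Components of T: eigenvectors of a random slice combination T(I, I, c).
c = rng.normal(size=K)
_, V = np.linalg.eigh(np.einsum('ijk,k->ij', T, c))
lam = np.einsum('ijk,ih,jh,kh->h', T, V, V, V)   # lambda_h = T(v_h, v_h, v_h)
V, lam = V * np.sign(lam), np.abs(lam)           # fix signs so lambda_h > 0

# Un-whiten: a_h = (W^T)^† (lambda_h v_h), w_h = lambda_h^{-2}.
A_hat = np.linalg.pinv(W.T) @ (V * lam)
w_hat = lam ** -2
print(np.round(np.sort(w_hat), 3))               # ≈ sorted w (up to permutation)
```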

slide-63
SLIDE 63

Take Away Message

Consider topic models whose word distributions under different topics are linearly independent. The parameters of the topic model for single-topic documents can be efficiently recovered from the distribution of three-word documents.

◮ Distribution of three-word documents (word triples): M3 = E[x ⊗ x ⊗ x] = Σ_h w_h a_h ⊗ a_h ⊗ a_h (M̂3: co-occurrence of word triples)

Two-word documents are not sufficient for identifiability.

34 / 75

slide-64
SLIDE 64

Tensor Methods Compared with Variational Inference

Learning Topics from PubMed on Spark: 8 million docs

[Plot: perplexity vs. running time (s) for tensor vs. variational methods]

35 / 75

slide-65
SLIDE 65

Tensor Methods Compared with Variational Inference

Learning Topics from PubMed on Spark: 8 million docs

[Plot: perplexity vs. running time (s) for tensor vs. variational methods]

Learning Communities from Graph Connectivity

Facebook: n ∼ 20k, Yelp: n ∼ 40k, DBLPsub: n ∼ 0.1m, DBLP: n ∼ 1m

[Plot: error per group and running time (s) for FB, YP, DBLPsub, DBLP]

35 / 75

slide-66
SLIDE 66

Tensor Methods Compared with Variational Inference

Learning Topics from PubMed on Spark: 8 million docs

[Plot: perplexity vs. running time (s) for tensor vs. variational methods]

Learning Communities from Graph Connectivity

Facebook: n ∼ 20k, Yelp: n ∼ 40k, DBLPsub: n ∼ 0.1m, DBLP: n ∼ 1m

[Plot: error per group and running time (s) for FB, YP, DBLPsub, DBLP]

Orders of Magnitude Faster & More Accurate

“Online Tensor Methods for Learning Latent Variable Models”, F. Huang, U. Niranjan, M. Hakeem, A. Anandkumar, JMLR 2014. “Tensor Methods on Apache Spark”, F. Huang, A. Anandkumar, Oct. 2015.

35 / 75

slide-67
SLIDE 67

Outline

1

Introduction

2

Motivation: Challenges of MLE for Gaussian Mixtures

3

Introduction of Method of Moments and Tensor Notations

4

Topic Model for Single-topic Documents

5

Algorithms for Tensor Decompositions

6

Tensor Decomposition for Neural Network Compression

7

Conclusion

36 / 75

slide-68
SLIDE 68

Jennrich’s Algorithm (Simplified)

Task: Given tensor T = Σ_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling).

37 / 75

slide-69
SLIDE 69

Jennrich’s Algorithm (Simplified)

Task: Given tensor T = Σ_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling).

Properties of Tensor Slices

Linear combination of slices: T(I, I, c) = Σ_h ⟨µ_h, c⟩ µ_h ⊗ µ_h

37 / 75

slide-70
SLIDE 70

Jennrich’s Algorithm (Simplified)

Task: Given tensor T = Σ_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling).

Properties of Tensor Slices

Linear combination of slices: T(I, I, c) = Σ_h ⟨µ_h, c⟩ µ_h ⊗ µ_h

Intuitions for Jennrich’s Algorithm

37 / 75

slide-71
SLIDE 71

Jennrich’s Algorithm (Simplified)

Task: Given tensor T = Σ_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling).

Properties of Tensor Slices

Linear combination of slices: T(I, I, c) = Σ_h ⟨µ_h, c⟩ µ_h ⊗ µ_h

Intuitions for Jennrich’s Algorithm

Linear comb. of slices of a tensor share the same set of eigenvectors

37 / 75

slide-72
SLIDE 72

Jennrich’s Algorithm (Simplified)

Task: Given tensor T = Σ_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling).

Properties of Tensor Slices

Linear combination of slices: T(I, I, c) = Σ_h ⟨µ_h, c⟩ µ_h ⊗ µ_h

Intuitions for Jennrich’s Algorithm

Linear combinations of slices of a tensor share the same set of eigenvectors. The shared eigenvectors are the tensor components {µ_h}_{h=1}^K

37 / 75

slide-73
SLIDE 73

Jennrich’s Algorithm (Simplified)

Task: Given tensor T = Σ_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling).

Algorithm: Jennrich’s Algorithm
Require: Tensor T ∈ R^{d×d×d}
Ensure: Components {µ̂_h}_{h=1}^K a.s. = {µ_h}_{h=1}^K
1: Sample c and c′ independently and uniformly at random from S^{d−1}
2: Return {µ̂_h}_{h=1}^K ← eigenvectors of T(I, I, c) T(I, I, c′)^†

38 / 75
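A minimal NumPy sketch of the two steps above (an illustration, assuming an exactly rank-K symmetric tensor as input; component scaling and ordering are not recovered, matching the “up to scaling” caveat):

```python
import numpy as np

def jennrich(T, K, rng=np.random.default_rng(0)):
    """Recover the K linearly independent components of T = sum_h mu_h^{⊗3}, up to scaling."""
    d = T.shape[0]
    c1, c2 = rng.normal(size=d), rng.normal(size=d)
    c1, c2 = c1 / np.linalg.norm(c1), c2 / np.linalg.norm(c2)   # random directions on S^{d-1}
    S1 = np.einsum('ijk,k->ij', T, c1)          # T(I, I, c)
    S2 = np.einsum('ijk,k->ij', T, c2)          # T(I, I, c')
    evals, vecs = np.linalg.eig(S1 @ np.linalg.pinv(S2))
    idx = np.argsort(-np.abs(evals))[:K]        # keep K eigenvectors with largest |eigenvalue|
    return np.real(vecs[:, idx])                # columns are mu_h up to scaling (and order)

# Check on a synthetic rank-3 symmetric tensor.
rng = np.random.default_rng(1)
d, K = 6, 3
U = rng.normal(size=(d, K))
T = np.einsum('ih,jh,kh->ijk', U, U, U)
M = jennrich(T, K)
# Each recovered column should be parallel to some true mu_h.
cos = np.abs((M / np.linalg.norm(M, axis=0)).T @ (U / np.linalg.norm(U, axis=0)))
print(np.round(cos.max(axis=1), 4))             # ≈ [1, 1, 1]
```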

slide-74
SLIDE 74

Jennrich’s Algorithm (Simplified)

Task: Given tensor T = Σ_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling).

Algorithm: Jennrich’s Algorithm
Require: Tensor T ∈ R^{d×d×d}
Ensure: Components {µ̂_h}_{h=1}^K a.s. = {µ_h}_{h=1}^K
1: Sample c and c′ independently and uniformly at random from S^{d−1}
2: Return {µ̂_h}_{h=1}^K ← eigenvectors of T(I, I, c) T(I, I, c′)^†

Consistency of Jennrich’s Algorithm?

Estimators {µ̂_h}_{h=1}^K ≡ unknown components {µ_h}_{h=1}^K (up to scaling)?

38 / 75

slide-75
SLIDE 75

Analysis of Consistency of Jennrich’s algorithm

Recall: linear combinations of slices share the eigenvectors {µ_h}_{h=1}^K, i.e.,
T(I, I, c) T(I, I, c′)^† a.s.= U D_c U^⊤ (U^⊤)^† D_{c′}^{−1} U^† a.s.= U (D_c D_{c′}^{−1}) U^†,
where U = [µ1 | . . . | µK] collects the linearly independent tensor components and D_c = Diag(⟨µ1, c⟩, . . . , ⟨µK, c⟩) is diagonal.

39 / 75

slide-76
SLIDE 76

Analysis of Consistency of Jennrich’s algorithm

Recall: linear combinations of slices share the eigenvectors {µ_h}_{h=1}^K, i.e.,
T(I, I, c) T(I, I, c′)^† a.s.= U D_c U^⊤ (U^⊤)^† D_{c′}^{−1} U^† a.s.= U (D_c D_{c′}^{−1}) U^†,
where U = [µ1 | . . . | µK] collects the linearly independent tensor components and D_c = Diag(⟨µ1, c⟩, . . . , ⟨µK, c⟩) is diagonal.

By linear independence of {µ_h}_{h=1}^K and the random choice of c and c′:

1. U has rank K;

39 / 75

slide-77
SLIDE 77

Analysis of Consistency of Jennrich’s algorithm

Recall: linear combinations of slices share the eigenvectors {µ_h}_{h=1}^K, i.e.,
T(I, I, c) T(I, I, c′)^† a.s.= U D_c U^⊤ (U^⊤)^† D_{c′}^{−1} U^† a.s.= U (D_c D_{c′}^{−1}) U^†,
where U = [µ1 | . . . | µK] collects the linearly independent tensor components and D_c = Diag(⟨µ1, c⟩, . . . , ⟨µK, c⟩) is diagonal.

By linear independence of {µ_h}_{h=1}^K and the random choice of c and c′:

1. U has rank K;
2. D_c and D_{c′} are invertible (a.s.);

39 / 75

slide-78
SLIDE 78

Analysis of Consistency of Jennrich’s algorithm

Recall: linear combinations of slices share the eigenvectors {µ_h}_{h=1}^K, i.e.,
T(I, I, c) T(I, I, c′)^† a.s.= U D_c U^⊤ (U^⊤)^† D_{c′}^{−1} U^† a.s.= U (D_c D_{c′}^{−1}) U^†,
where U = [µ1 | . . . | µK] collects the linearly independent tensor components and D_c = Diag(⟨µ1, c⟩, . . . , ⟨µK, c⟩) is diagonal.

By linear independence of {µ_h}_{h=1}^K and the random choice of c and c′:

1. U has rank K;
2. D_c and D_{c′} are invertible (a.s.);
3. Diagonal entries of D_c D_{c′}^{−1} are distinct (a.s.);

39 / 75

slide-79
SLIDE 79

Analysis of Consistency of Jennrich’s algorithm

Recall: linear combinations of slices share the eigenvectors {µ_h}_{h=1}^K, i.e.,
T(I, I, c) T(I, I, c′)^† a.s.= U D_c U^⊤ (U^⊤)^† D_{c′}^{−1} U^† a.s.= U (D_c D_{c′}^{−1}) U^†,
where U = [µ1 | . . . | µK] collects the linearly independent tensor components and D_c = Diag(⟨µ1, c⟩, . . . , ⟨µK, c⟩) is diagonal.

By linear independence of {µ_h}_{h=1}^K and the random choice of c and c′:

1. U has rank K;
2. D_c and D_{c′} are invertible (a.s.);
3. Diagonal entries of D_c D_{c′}^{−1} are distinct (a.s.).

So {µ_h}_{h=1}^K are the eigenvectors of T(I, I, c) T(I, I, c′)^† with distinct non-zero eigenvalues. Jennrich’s algorithm is consistent.

39 / 75

slide-80
SLIDE 80

Error-tolerant algorithms for tensor decompositions

40 / 75

slide-81
SLIDE 81

Moment Estimator: Empirical Moments

41 / 75

slide-82
SLIDE 82

Moment Estimator: Empirical Moments

Moments E_θ[f(X)] are functions of the model parameters θ. Empirical moments Ê[f(X)] are computed using only the iid samples {x_i}_{i=1}^n.

41 / 75

slide-83
SLIDE 83

Moment Estimator: Empirical Moments

Moments E_θ[f(X)] are functions of the model parameters θ. Empirical moments Ê[f(X)] are computed using only the iid samples {x_i}_{i=1}^n.

Example

Third-order moment: distribution of word triples
◮ E[x ⊗ x ⊗ x] = Σ_h w_h a_h ⊗ a_h ⊗ a_h

Empirical third-order moment: co-occurrence frequency of word triples
Ê[x ⊗ x ⊗ x] = (1/n) Σ_{i=1}^n x_i ⊗ x_i ⊗ x_i

41 / 75

slide-84
SLIDE 84

Moment Estimator: Empirical Moments

Moments E_θ[f(X)] are functions of the model parameters θ. Empirical moments Ê[f(X)] are computed using only the iid samples {x_i}_{i=1}^n.

Example

Third-order moment: distribution of word triples
◮ E[x ⊗ x ⊗ x] = Σ_h w_h a_h ⊗ a_h ⊗ a_h

Empirical third-order moment: co-occurrence frequency of word triples
Ê[x ⊗ x ⊗ x] = (1/n) Σ_{i=1}^n x_i ⊗ x_i ⊗ x_i

Inevitably expect an error of order n^{−1/2} in some norm, e.g.,
◮ Operator norm: ‖E[x ⊗ x ⊗ x] − Ê[x ⊗ x ⊗ x]‖ ∼ n^{−1/2}, where ‖T‖ := sup_{x,y,z ∈ S^{d−1}} T(x, y, z)
◮ Frobenius norm: ‖E[x ⊗ x ⊗ x] − Ê[x ⊗ x ⊗ x]‖_F ∼ n^{−1/2}, where ‖T‖_F := ( Σ_{i,j,k} T_{i,j,k}² )^{1/2}

41 / 75

slide-85
SLIDE 85

Stability of Jennrich’s Algorithm

Recall Jennrich’s algorithm

Given tensor T = Σ_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling).

Algorithm: Jennrich’s Algorithm
Require: Tensor T ∈ R^{d×d×d}
Ensure: Components {µ̂_h}_{h=1}^K a.s. = {µ_h}_{h=1}^K
1: Sample c and c′ independently and uniformly at random from S^{d−1}
2: Return {µ̂_h}_{h=1}^K ← eigenvectors of T(I, I, c) T(I, I, c′)^†

42 / 75

slide-86
SLIDE 86

Stability of Jennrich’s Algorithm

Recall Jennrich’s algorithm

Given tensor T = Σ_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling).

Algorithm: Jennrich’s Algorithm
Require: Tensor T ∈ R^{d×d×d}
Ensure: Components {µ̂_h}_{h=1}^K a.s. = {µ_h}_{h=1}^K
1: Sample c and c′ independently and uniformly at random from S^{d−1}
2: Return {µ̂_h}_{h=1}^K ← eigenvectors of T(I, I, c) T(I, I, c′)^†

Challenge: we only have access to T̂ such that ‖T̂ − T‖ ∼ n^{−1/2}

42 / 75

slide-87
SLIDE 87

Stability of Jennrich’s Algorithm

Recall Jennrich’s algorithm

Given tensor T = Σ_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling).

Algorithm: Jennrich’s Algorithm (run on the empirical tensor)
Require: Tensor T̂ ∈ R^{d×d×d}
Ensure: Components {µ̂_h}_{h=1}^K a.s. = {µ_h}_{h=1}^K ?
1: Sample c and c′ independently and uniformly at random from S^{d−1}
2: Return {µ̂_h}_{h=1}^K ← eigenvectors of T̂(I, I, c) T̂(I, I, c′)^†

Challenge: we only have access to T̂ such that ‖T̂ − T‖ ∼ n^{−1/2}

42 / 75

slide-88
SLIDE 88

Stability of Jennrich’s Algorithm

Recall Jennrich’s algorithm

Given tensor T = Σ_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling).

Algorithm: Jennrich’s Algorithm (run on the empirical tensor)
Require: Tensor T̂ ∈ R^{d×d×d}
Ensure: Components {µ̂_h}_{h=1}^K a.s. = {µ_h}_{h=1}^K ?
1: Sample c and c′ independently and uniformly at random from S^{d−1}
2: Return {µ̂_h}_{h=1}^K ← eigenvectors of T̂(I, I, c) T̂(I, I, c′)^†

Stability of eigenvectors requires eigenvalue gaps

42 / 75

slide-89
SLIDE 89

Stability of Jennrich’s Algorithm

Recall Jennrich’s algorithm

Given tensor T = Σ_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling).

Algorithm: Jennrich’s Algorithm (run on the empirical tensor)
Require: Tensor T̂ ∈ R^{d×d×d}
Ensure: Components {µ̂_h}_{h=1}^K a.s. = {µ_h}_{h=1}^K ?
1: Sample c and c′ independently and uniformly at random from S^{d−1}
2: Return {µ̂_h}_{h=1}^K ← eigenvectors of T̂(I, I, c) T̂(I, I, c′)^†

Stability of eigenvectors requires eigenvalue gaps

To ensure eigenvalue gaps for T̂(·, ·, c) T̂(·, ·, c′)^†, we need ‖T̂(·, ·, c) T̂(·, ·, c′)^† − T(·, ·, c) T(·, ·, c′)^†‖ ≪ ∆.

42 / 75

slide-90
SLIDE 90

Stability of Jennrich’s Algorithm

Recall Jennrich’s algorithm

Given tensor T = Σ_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling).

Algorithm: Jennrich’s Algorithm (run on the empirical tensor)
Require: Tensor T̂ ∈ R^{d×d×d}
Ensure: Components {µ̂_h}_{h=1}^K a.s. = {µ_h}_{h=1}^K ?
1: Sample c and c′ independently and uniformly at random from S^{d−1}
2: Return {µ̂_h}_{h=1}^K ← eigenvectors of T̂(I, I, c) T̂(I, I, c′)^†

Stability of eigenvectors requires eigenvalue gaps

To ensure eigenvalue gaps for T̂(·, ·, c) T̂(·, ·, c′)^†, we need ‖T̂(·, ·, c) T̂(·, ·, c′)^† − T(·, ·, c) T(·, ·, c′)^†‖ ≪ ∆. Ultimately, ‖T̂ − T‖_F ≪ 1/poly(d) is required.

42 / 75

slide-91
SLIDE 91

Stability of Jennrich’s Algorithm

Recall Jennrich’s algorithm

Given tensor T = Σ_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling).

Algorithm: Jennrich’s Algorithm (run on the empirical tensor)
Require: Tensor T̂ ∈ R^{d×d×d}
Ensure: Components {µ̂_h}_{h=1}^K a.s. = {µ_h}_{h=1}^K ?
1: Sample c and c′ independently and uniformly at random from S^{d−1}
2: Return {µ̂_h}_{h=1}^K ← eigenvectors of T̂(I, I, c) T̂(I, I, c′)^†

Stability of eigenvectors requires eigenvalue gaps

To ensure eigenvalue gaps for T̂(·, ·, c) T̂(·, ·, c′)^†, we need ‖T̂(·, ·, c) T̂(·, ·, c′)^† − T(·, ·, c) T(·, ·, c′)^†‖ ≪ ∆. Ultimately, ‖T̂ − T‖_F ≪ 1/poly(d) is required.

A different approach?

42 / 75

slide-92
SLIDE 92

Initial Ideas

In many applications, we estimate moments of the form M3 = Σ_{h=1}^K w_h a_h^{⊗3}, where {a_h}_{h=1}^K are assumed to be linearly independent.

What if {a_h}_{h=1}^K are orthonormal?

43 / 75

slide-93
SLIDE 93

Initial Ideas

In many applications, we estimate moments of the form M3 = Σ_{h=1}^K w_h a_h^{⊗3}, where {a_h}_{h=1}^K are assumed to be linearly independent.

What if {a_h}_{h=1}^K are orthonormal?

M3(I, a_i, a_i) = Σ_h w_h ⟨a_h, a_i⟩² a_h = w_i a_i, ∀i.

43 / 75

slide-94
SLIDE 94

Initial Ideas

In many applications, we estimate moments of the form M3 = Σ_{h=1}^K w_h a_h^{⊗3}, where {a_h}_{h=1}^K are assumed to be linearly independent.

What if {a_h}_{h=1}^K are orthonormal?

M3(I, a_i, a_i) = Σ_h w_h ⟨a_h, a_i⟩² a_h = w_i a_i, ∀i.

Analogous to matrix eigenvectors: Mv = M(I, v) = λv.

43 / 75

slide-95
SLIDE 95

Initial Ideas

In many applications, we estimate moments of the form M3 = Σ_{h=1}^K w_h a_h^{⊗3}, where {a_h}_{h=1}^K are assumed to be linearly independent.

What if {a_h}_{h=1}^K are orthonormal?

M3(I, a_i, a_i) = Σ_h w_h ⟨a_h, a_i⟩² a_h = w_i a_i, ∀i.

Analogous to matrix eigenvectors: Mv = M(I, v) = λv.

Define orthonormal {a_h}_{h=1}^K as eigenvectors of the tensor M3.

43 / 75

slide-96
SLIDE 96

Initial Ideas

In many applications, we estimate moments of the form M3 = Σ_{h=1}^K w_h a_h^{⊗3}, where {a_h}_{h=1}^K are assumed to be linearly independent.

What if {a_h}_{h=1}^K are orthonormal?

M3(I, a_i, a_i) = Σ_h w_h ⟨a_h, a_i⟩² a_h = w_i a_i, ∀i.

Analogous to matrix eigenvectors: Mv = M(I, v) = λv.

Define orthonormal {a_h}_{h=1}^K as eigenvectors of the tensor M3.

Two Problems

{a_h}_{h=1}^K is not orthogonal in general.

How to find eigenvectors of a tensor?

43 / 75

slide-97
SLIDE 97

Initial Ideas

In many applications, we estimate moments of the form M3 = Σ_{h=1}^K w_h a_h^{⊗3}, where {a_h}_{h=1}^K are assumed to be linearly independent.

What if {a_h}_{h=1}^K are orthonormal?

M3(I, a_i, a_i) = Σ_h w_h ⟨a_h, a_i⟩² a_h = w_i a_i, ∀i.

Analogous to matrix eigenvectors: Mv = M(I, v) = λv.

Define orthonormal {a_h}_{h=1}^K as eigenvectors of the tensor M3.

Two Problems

{a_h}_{h=1}^K is not orthogonal in general.

How to find eigenvectors of a tensor?

43 / 75

slide-98
SLIDE 98

Whitening is the process of finding a whitening matrix W such that the multi-linear operation (using W) on M3 orthogonalizes its components:
M3(W, W, W) = Σ_h w_h (W^⊤ a_h)^{⊗3} = Σ_h w_h v_h^{⊗3}, with v_h ⊥ v_{h′}, ∀ h ≠ h′

44 / 75

slide-99
SLIDE 99

Whitening

Given M3 = Σ_h w_h a_h^{⊗3} and M2 = Σ_h w_h a_h ⊗ a_h,

45 / 75

slide-100
SLIDE 100

Whitening

Given M3 = Σ_h w_h a_h^{⊗3} and M2 = Σ_h w_h a_h ⊗ a_h, find a whitening matrix W such that the v_h = W^⊤ a_h are orthogonal.

45 / 75

slide-101
SLIDE 101

Whitening

Given M3 = Σ_h w_h a_h^{⊗3} and M2 = Σ_h w_h a_h ⊗ a_h, find a whitening matrix W such that the v_h = W^⊤ a_h are orthogonal. When A = [a1, . . . , aK] ∈ R^{d×K} has full column rank, this is an invertible transformation.

[Figure: W maps components a1, a2, a3 to orthogonal vectors v1, v2, v3]

45 / 75

slide-102
SLIDE 102

Using Whitening to Obtain Orthogonal Tensor

[Figure: tensor M3 (d × d × d) mapped to an orthogonal tensor T (K × K × K)]

46 / 75

slide-103
SLIDE 103

Using Whitening to Obtain Orthogonal Tensor

[Figure: tensor M3 mapped to an orthogonal tensor T]

Multi-linear transform: T = M3(W, W, W) = Σ_h w_h (W^⊤ a_h)^{⊗3}.

46 / 75

slide-104
SLIDE 104

Using Whitening to Obtain Orthogonal Tensor

[Figure: tensor M3 mapped to an orthogonal tensor T]

Multi-linear transform: T = M3(W, W, W) = Σ_h w_h (W^⊤ a_h)^{⊗3}.

T = Σ_{h∈[K]} w_h · v_h^{⊗3} has orthogonal components.

46 / 75

slide-105
SLIDE 105

Using Whitening to Obtain Orthogonal Tensor

[Figure: tensor M3 mapped to an orthogonal tensor T]

Multi-linear transform: T = M3(W, W, W) = Σ_h w_h (W^⊤ a_h)^{⊗3}.

T = Σ_{h∈[K]} w_h · v_h^{⊗3} has orthogonal components.

Dimensionality reduction when K ≪ d, as M3 ∈ R^{d×d×d} and T ∈ R^{K×K×K}.

46 / 75

slide-106
SLIDE 106

How to Find Whitening Matrix?

Given M3 = Σ_h w_h a_h^{⊗3} and M2 = Σ_h w_h a_h ⊗ a_h. Goal: find W such that the v_h = W^⊤ a_h are orthogonal.

47 / 75

slide-107
SLIDE 107

How to Find Whitening Matrix?

Given M3 = Σ_h w_h a_h^{⊗3} and M2 = Σ_h w_h a_h ⊗ a_h. Goal: find W such that the v_h = W^⊤ a_h are orthogonal.

Use the pairwise moments M2 to find W s.t. W^⊤ M2 W = I.

47 / 75

slide-108
SLIDE 108

How to Find Whitening Matrix?

Given M3 = Σ_h w_h a_h^{⊗3} and M2 = Σ_h w_h a_h ⊗ a_h. Goal: find W such that the v_h = W^⊤ a_h are orthogonal.

Use the pairwise moments M2 to find W s.t. W^⊤ M2 W = I.

W = U Diag(λ̃^{−1/2}), where M2 = U Diag(λ̃) U^⊤ is the eigendecomposition of M2.

47 / 75

slide-109
SLIDE 109

How to Find Whitening Matrix?

Given M3 = Σ_h w_h a_h^{⊗3} and M2 = Σ_h w_h a_h ⊗ a_h. Goal: find W such that the v_h = W^⊤ a_h are orthogonal.

Use the pairwise moments M2 to find W s.t. W^⊤ M2 W = I.

W = U Diag(λ̃^{−1/2}), where M2 = U Diag(λ̃) U^⊤ is the eigendecomposition of M2.

V := W^⊤ A Diag(w)^{1/2} is an orthogonal matrix.
T = M3(W, W, W) = Σ_h w_h^{−1/2} (W^⊤ a_h √w_h)^{⊗3} = Σ_h λ_h v_h^{⊗3}, with λ_h := w_h^{−1/2}. T is an orthogonal tensor.

47 / 75
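A small NumPy sketch of this construction (illustrative only: exact population moments and made-up synthetic parameters): W is built from the top-K eigenpairs of M2, and one can check both W^⊤ M2 W = I and that the whitened components √w_h W^⊤ a_h are orthonormal.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 10, 4
A = rng.normal(size=(d, K))                      # linearly independent components a_h
w = rng.dirichlet(np.ones(K))                    # positive weights

M2 = np.einsum('h,ih,jh->ij', w, A, A)           # sum_h w_h a_h ⊗ a_h

evals, U = np.linalg.eigh(M2)                    # eigendecomposition M2 = U diag(evals) U^T
evals, U = evals[-K:], U[:, -K:]                 # keep the top-K (nonzero) eigenpairs
W = U / np.sqrt(evals)                           # W = U diag(evals)^{-1/2}

V = W.T @ A * np.sqrt(w)                         # columns v_h = sqrt(w_h) W^T a_h
print(np.allclose(W.T @ M2 @ W, np.eye(K)))      # whitening: True
print(np.allclose(V.T @ V, np.eye(K)))           # orthonormal components: True
```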

slide-110
SLIDE 110

Initial Ideas

In many applications, we estimate moments of the form M3 = Σ_h w_h a_h^{⊗3}, where {a_h}_{h=1}^K are assumed to be linearly independent.

What if {a_h}_{h=1}^K are orthonormal?

M3(I, a_i, a_i) = Σ_h w_h ⟨a_h, a_i⟩² a_h = w_i a_i, ∀i.

Analogous to matrix eigenvectors: Mv = M(I, v) = λv.

Define orthonormal {a_h}_{h=1}^K as eigenvectors of the tensor M3.

Two Problems

{a_h}_{h=1}^K is not orthogonal in general.

How to find eigenvectors of a tensor?

48 / 75

slide-111
SLIDE 111

Review: Orthogonal Matrix Eigen Decomposition

Task: Given matrix M = Σ_{h=1}^K λ_h v_h ⊗ v_h with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, ∀ h ≠ h′), find the components/eigenvectors.

49 / 75

slide-112
SLIDE 112

Review: Orthogonal Matrix Eigen Decomposition

Task: Given matrix M = Σ_{h=1}^K λ_h v_h ⊗ v_h with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, ∀ h ≠ h′), find the components/eigenvectors.

Properties of Matrix Eigenvectors

Fixed point of the linear transform: M(I, v_i) = Σ_h λ_h ⟨v_i, v_h⟩ v_h = λ_i v_i

49 / 75

slide-113
SLIDE 113

Review: Orthogonal Matrix Eigen Decomposition

Task: Given matrix M = Σ_{h=1}^K λ_h v_h ⊗ v_h with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, ∀ h ≠ h′), find the components/eigenvectors.

Properties of Matrix Eigenvectors

Fixed point of the linear transform: M(I, v_i) = Σ_h λ_h ⟨v_i, v_h⟩ v_h = λ_i v_i

Intuitions for Matrix Power Method

49 / 75

slide-114
SLIDE 114

Review: Orthogonal Matrix Eigen Decomposition

Task: Given matrix M = Σ_{h=1}^K λ_h v_h ⊗ v_h with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, ∀ h ≠ h′), find the components/eigenvectors.

Properties of Matrix Eigenvectors

Fixed point of the linear transform: M(I, v_i) = Σ_h λ_h ⟨v_i, v_h⟩ v_h = λ_i v_i

Intuitions for Matrix Power Method

The linear transform preserves the directions of the eigenvectors {v_h}_{h=1}^K

49 / 75

slide-115
SLIDE 115

Orthogonal Tensor Eigen Decomposition

Task: Given tensor T = Σ_{h=1}^K λ_h v_h^{⊗3} with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, ∀ h ≠ h′), find the components/eigenvectors.

50 / 75

slide-116
SLIDE 116

Orthogonal Tensor Eigen Decomposition

Task: Given tensor T = Σ_{h=1}^K λ_h v_h^{⊗3} with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, ∀ h ≠ h′), find the components/eigenvectors.

Properties of Tensor Eigenvectors

Fixed point of the bi-linear transform: T(I, v_i, v_i) = Σ_h λ_h ⟨v_i, v_h⟩² v_h = λ_i v_i

50 / 75

slide-117
SLIDE 117

Orthogonal Tensor Eigen Decomposition

Task: Given tensor T = Σ_{h=1}^K λ_h v_h^{⊗3} with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, ∀ h ≠ h′), find the components/eigenvectors.

Properties of Tensor Eigenvectors

Fixed point of the bi-linear transform: T(I, v_i, v_i) = Σ_h λ_h ⟨v_i, v_h⟩² v_h = λ_i v_i

Intuitions for Tensor Power Method

50 / 75

slide-118
SLIDE 118

Orthogonal Tensor Eigen Decomposition

Task: Given tensor T = Σ_{h=1}^K λ_h v_h^{⊗3} with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, ∀ h ≠ h′), find the components/eigenvectors.

Properties of Tensor Eigenvectors

Fixed point of the bi-linear transform: T(I, v_i, v_i) = Σ_h λ_h ⟨v_i, v_h⟩² v_h = λ_i v_i

Intuitions for Tensor Power Method

The bi-linear transform preserves the directions of the eigenvectors {v_h}_{h=1}^K

50 / 75

slide-119
SLIDE 119

Orthogonal Matrix Eigen Decomposition

Task: Given matrix M = Σ_{h=1}^K λ_h v_h^{⊗2} with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, ∀ h ≠ h′), find the components/eigenvectors.

Algorithm: Matrix Power Method
Require: Matrix M ∈ R^{K×K}
Ensure: Components {v̂_h}_{h=1}^K w.h.p. = {v_h}_{h=1}^K
1: for h = 1 : K do
2:   Sample u_0 uniformly at random from S^{K−1}
3:   for i = 1 : T do
4:     u_i ← M(I, u_{i−1}) / ‖M(I, u_{i−1})‖
5:   end for
6:   v̂_h ← u_T, λ̂_h ← M(v̂_h, v̂_h)
7:   Deflate: M ← M − λ̂_h v̂_h^{⊗2}
8: end for

51 / 75
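A direct NumPy transcription of the algorithm above (a sketch: fixed iteration count, no restarts, synthetic test matrix):

```python
import numpy as np

def matrix_power_method(M, K, n_iter=100, rng=np.random.default_rng(0)):
    """Recover eigenpairs of M = sum_h lambda_h v_h^{⊗2} by power iteration with deflation."""
    M = M.copy()
    vals, vecs = [], []
    for _ in range(K):
        u = rng.normal(size=M.shape[0])
        u /= np.linalg.norm(u)                   # random start on the sphere
        for _ in range(n_iter):
            u = M @ u
            u /= np.linalg.norm(u)               # u <- M(I, u) / ||M(I, u)||
        lam = u @ M @ u                          # lambda = M(v, v)
        vals.append(lam)
        vecs.append(u)
        M -= lam * np.outer(u, u)                # deflate: M <- M - lambda v v^T
    return np.array(vals), np.column_stack(vecs)

# Synthetic check: random orthonormal components with distinct positive weights.
rng = np.random.default_rng(1)
K = 4
V, _ = np.linalg.qr(rng.normal(size=(K, K)))
lam_true = np.array([4.0, 3.0, 2.0, 1.0])
M = (V * lam_true) @ V.T
vals, vecs = matrix_power_method(M, K)
print(np.round(np.sort(vals)[::-1], 3))          # ≈ [4, 3, 2, 1]
```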

slide-120
SLIDE 120

Orthogonal Matrix Eigen Decomposition

Task: Given matrix M = Σ_{h=1}^K λ_h v_h^{⊗2} with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, ∀ h ≠ h′), find the components/eigenvectors.

Algorithm: Matrix Power Method
Require: Matrix M ∈ R^{K×K}
Ensure: Components {v̂_h}_{h=1}^K w.h.p. = {v_h}_{h=1}^K
1: for h = 1 : K do
2:   Sample u_0 uniformly at random from S^{K−1}
3:   for i = 1 : T do
4:     u_i ← M(I, u_{i−1}) / ‖M(I, u_{i−1})‖
5:   end for
6:   v̂_h ← u_T, λ̂_h ← M(v̂_h, v̂_h)
7:   Deflate: M ← M − λ̂_h v̂_h^{⊗2}
8: end for

Consistency of Matrix Power Method?

Is there convergence? {v̂_h}_{h=1}^K ≡ {v_h}_{h=1}^K w.h.p.?

51 / 75

slide-121
SLIDE 121

Orthogonal Matrix Eigen Decomposition

Task: Given matrix M = Σ_{h=1}^K λ_h v_h^{⊗2} with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, ∀ h ≠ h′), find the components/eigenvectors.

Algorithm: Matrix Power Method
Require: Matrix M ∈ R^{K×K}
Ensure: Components {v̂_h}_{h=1}^K w.h.p. = {v_h}_{h=1}^K
1: for h = 1 : K do
2:   Sample u_0 uniformly at random from S^{K−1}
3:   for i = 1 : T do
4:     u_i ← M(I, u_{i−1}) / ‖M(I, u_{i−1})‖
5:   end for
6:   v̂_h ← u_T, λ̂_h ← M(v̂_h, v̂_h)
7:   Deflate: M ← M − λ̂_h v̂_h^{⊗2}
8: end for

Consistency of Matrix Power Method?

Is there convergence? {v̂_h}_{h=1}^K ≡ {v_h}_{h=1}^K w.h.p.?

Does the convergence depend on initialization?

51 / 75

slide-122
SLIDE 122

Orthogonal Tensor Eigen Decomposition

Task: Given tensor T = Σ_{h=1}^K λ_h v_h^{⊗3} with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, ∀ h ≠ h′), find the components/eigenvectors.

Algorithm: Tensor Power Method
Require: Tensor T ∈ R^{K×K×K}
Ensure: Components {v̂_h}_{h=1}^K w.h.p. = {v_h}_{h=1}^K
1: for h = 1 : K do
2:   Sample u_0 uniformly at random from S^{K−1}
3:   for i = 1 : T do
4:     u_i ← T(I, u_{i−1}, u_{i−1}) / ‖T(I, u_{i−1}, u_{i−1})‖
5:   end for
6:   v̂_h ← u_T, λ̂_h ← T(v̂_h, v̂_h, v̂_h)
7:   Deflate: T ← T − λ̂_h v̂_h^{⊗3}
8: end for

52 / 75
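And the tensor analogue, again as a sketch with a fixed number of iterations and deflation (assumes a clean orthogonal input tensor; the robust version with restarts is what the perturbation analysis below addresses):

```python
import numpy as np

def tensor_power_method(T, K, n_iter=30, rng=np.random.default_rng(0)):
    """Recover eigenpairs of T = sum_h lambda_h v_h^{⊗3} (orthonormal v_h) by power iteration."""
    T = T.copy()
    vals, vecs = [], []
    for _ in range(K):
        u = rng.normal(size=T.shape[0])
        u /= np.linalg.norm(u)
        for _ in range(n_iter):
            u = np.einsum('ijk,j,k->i', T, u, u)     # u <- T(I, u, u)
            u /= np.linalg.norm(u)
        lam = np.einsum('ijk,i,j,k->', T, u, u, u)   # lambda = T(v, v, v)
        vals.append(lam)
        vecs.append(u)
        T -= lam * np.einsum('i,j,k->ijk', u, u, u)  # deflate: T <- T - lambda v^{⊗3}
    return np.array(vals), np.column_stack(vecs)

# Synthetic check: orthonormal components with distinct weights.
rng = np.random.default_rng(1)
K = 4
V, _ = np.linalg.qr(rng.normal(size=(K, K)))
lam_true = np.array([4.0, 3.0, 2.0, 1.0])
T = np.einsum('h,ih,jh,kh->ijk', lam_true, V, V, V)
vals, vecs = tensor_power_method(T, K)
print(np.round(np.sort(vals)[::-1], 3))              # ≈ [4, 3, 2, 1]
```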

slide-123
SLIDE 123

Orthogonal Tensor Eigen Decomposition

Task: Given tensor T = Σ_{h=1}^K λ_h v_h^{⊗3} with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, ∀ h ≠ h′), find the components/eigenvectors.

Algorithm: Tensor Power Method
Require: Tensor T ∈ R^{K×K×K}
Ensure: Components {v̂_h}_{h=1}^K w.h.p. = {v_h}_{h=1}^K
1: for h = 1 : K do
2:   Sample u_0 uniformly at random from S^{K−1}
3:   for i = 1 : T do
4:     u_i ← T(I, u_{i−1}, u_{i−1}) / ‖T(I, u_{i−1}, u_{i−1})‖
5:   end for
6:   v̂_h ← u_T, λ̂_h ← T(v̂_h, v̂_h, v̂_h)
7:   Deflate: T ← T − λ̂_h v̂_h^{⊗3}
8: end for

Consistency of Tensor Power Method?

Is there convergence? {v̂_h}_{h=1}^K ≡ {v_h}_{h=1}^K w.h.p.?

Does the convergence depend on initialization?

52 / 75

slide-124
SLIDE 124

Analysis of Consistency of Matrix Power Method

Order the eigenvectors {v_h}_{h=1}^K so that the corresponding eigenvalues satisfy λ1 ≥ λ2 ≥ . . . ≥ λK. Project the initial point u_0 onto the eigenvectors {v_h}_{h=1}^K: c_h = ⟨u_0, v_h⟩, ∀h.

Convergence properties

Unique (identifiable) iff {λ_h}_{h=1}^K are distinct.

If the gap λ2/λ1 < 1 and c1 ≠ 0, the matrix power method converges to v1.

Converges linearly to v1 assuming the gap λ2/λ1 < 1.
◮ The linear transform gives M(I, u_0) = Σ_h λ_h (v_h^⊤ u_0) v_h = Σ_h λ_h c_h v_h, i.e., the projection in the v_h direction is scaled by λ_h.
◮ In t iterations, (v1^⊤ v̂)² / Σ_i (v_i^⊤ v̂)² ≥ 1 − K (λ2/λ1)^{2t}.

53 / 75

slide-125
SLIDE 125

Analysis of Consistency of Tensor Power Method

Project initial point u0 onto eigenvectors ch = u0, vh, ∀h. Order eigenvectors {vh}K

h=1 such that

λ1|c1| > λ2|c2| ≥ · · · ≥ λK|cK|.

Convergence properties

Identifiable i.f.f. {λh|ch|}K

h=1 are distinct. Initialization dependent.

If λ2|c2|

λ1|c1| < 1 and λ1|c1| = 0, tensor power method converges to v1.

Note v1 is NOT necessarily the largest eigenvector. Converges quadraticly to v1 assuming gap λ2|c2|

λ1|c1| < 1.

◮ Bi-linear transform permits T (I, u0, u0) = h λh

v⊤

h u0

2vh =

h λhc2 h vh

i.e., projection in vh direction is squared then scaled by λh.

◮ In t iterations,

  • v⊤

1 v2

  • i
  • v⊤

i v2 ≥ 1 − k

  • λ1

maxi=1 λi

2

  • v2c2

v1c1

  • 2t+1

.

54 / 75

slide-126
SLIDE 126

Matrix vs. tensor power iteration

Matrix power iteration: Tensor power iteration:

55 / 75

slide-127
SLIDE 127

Matrix vs. tensor power iteration

Matrix power iteration:

1

Requires gap between largest and second-largest eigenvalue. Property of the matrix only. Tensor power iteration:

1

Requires gap between largest and second-largest λh|ch|. Property of the tensor and initialization u0.

55 / 75

slide-128
SLIDE 128

Matrix vs. tensor power iteration

Matrix power iteration:

1

Requires gap between largest and second-largest eigenvalue. Property of the matrix only.

2

Converges to top eigenvector. Tensor power iteration:

1

Requires gap between largest and second-largest λh|ch|. Property of the tensor and initialization u0.

2

Converges to the v_i with the largest λ_i|c_i|. Not necessarily the largest eigenvector.

55 / 75

slide-129
SLIDE 129

Matrix vs. tensor power iteration

Matrix power iteration:
1. Requires a gap between the largest and second-largest eigenvalue; a property of the matrix only.
2. Converges to the top eigenvector.
3. Linear convergence: needs O(log(1/ǫ)) iterations.

Tensor power iteration:
1. Requires a gap between the largest and second-largest λ_h|c_h|; a property of the tensor and the initialization u_0.
2. Converges to the v_i with the largest λ_i|c_i|; not necessarily the largest eigenvector.
3. Quadratic convergence: needs O(log log(1/ǫ)) iterations.

55 / 75

slide-130
SLIDE 130

Spurious Eigenvectors for Tensor Eigen Decomposition

T = Σ_{h∈[K]} λ_h v_h^{⊗3}.

Characterization of eigenvectors: T(I, v, v) = λv?

{v_h}_{h=1}^K are eigenvectors since T(I, v_h, v_h) = λ_h v_h.

56 / 75

slide-131
SLIDE 131

Spurious Eigenvectors for Tensor Eigen Decomposition

T = Σ_{h∈[K]} λ_h v_h^{⊗3}.

Characterization of eigenvectors: T(I, v, v) = λv?

{v_h}_{h=1}^K are eigenvectors since T(I, v_h, v_h) = λ_h v_h.

Bad news: there can be other eigenvectors (unlike the matrix case).
◮ E.g., when {λ_h}_{h=1}^K ≡ 1, v = (v1 + v2)/√2 satisfies T(I, v, v) = (1/√2) v.

56 / 75

slide-132
SLIDE 132

Spurious Eigenvectors for Tensor Eigen Decomposition

T = Σ_{h∈[K]} λ_h v_h^{⊗3}.

Characterization of eigenvectors: T(I, v, v) = λv?

{v_h}_{h=1}^K are eigenvectors since T(I, v_h, v_h) = λ_h v_h.

Bad news: there can be other eigenvectors (unlike the matrix case).
◮ E.g., when {λ_h}_{h=1}^K ≡ 1, v = (v1 + v2)/√2 satisfies T(I, v, v) = (1/√2) v.

How do we avoid spurious solutions (those that are not components {v_h}_{h=1}^K)?

56 / 75

slide-133
SLIDE 133

Spurious Eigenvectors for Tensor Eigen Decomposition

T = Σ_{h∈[K]} λ_h v_h^{⊗3}.

Characterization of eigenvectors: T(I, v, v) = λv?

{v_h}_{h=1}^K are eigenvectors since T(I, v_h, v_h) = λ_h v_h.

Bad news: there can be other eigenvectors (unlike the matrix case).
◮ E.g., when {λ_h}_{h=1}^K ≡ 1, v = (v1 + v2)/√2 satisfies T(I, v, v) = (1/√2) v.

How do we avoid spurious solutions (those that are not components {v_h}_{h=1}^K)?

The optimization viewpoint of tensor eigen decomposition will help.

56 / 75

slide-134
SLIDE 134

Spurious Eigenvectors for Tensor Eigen Decomposition

T = Σ_{h∈[K]} λ_h v_h^{⊗3}.

Characterization of eigenvectors: T(I, v, v) = λv?

{v_h}_{h=1}^K are eigenvectors since T(I, v_h, v_h) = λ_h v_h.

Bad news: there can be other eigenvectors (unlike the matrix case).
◮ E.g., when {λ_h}_{h=1}^K ≡ 1, v = (v1 + v2)/√2 satisfies T(I, v, v) = (1/√2) v.

How do we avoid spurious solutions (those that are not components {v_h}_{h=1}^K)?

The optimization viewpoint of tensor eigen decomposition will help. All spurious eigenvectors are saddle points.

56 / 75

slide-135
SLIDE 135

Optimization Viewpoint of Matrix/Tensor Eigen Decomposition

57 / 75

slide-136
SLIDE 136

Optimization Viewpoint of Matrix/Tensor Eigen Decomposition

Optimization Problem

Matrix: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v^⊤v − 1).

Tensor: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − 1.5 λ(v^⊤v − 1).

58 / 75

slide-137
SLIDE 137

Optimization Viewpoint of Matrix/Tensor Eigen Decomposition

Optimization Problem

Matrix: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v^⊤v − 1).

Tensor: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − 1.5 λ(v^⊤v − 1).

Non-convex: stationary points = {global optima, local optima, saddle points}

58 / 75

slide-138
SLIDE 138

Optimization Viewpoint of Matrix/Tensor Eigen Decomposition

Optimization Problem

Matrix: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v^⊤v − 1).

Tensor: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − 1.5 λ(v^⊤v − 1).

Non-convex: stationary points = {global optima, local optima, saddle points}

Stationary points: first derivative ∇L(v, λ) = 0

Matrix: ∇L(v, λ) = 2(M(I, v) − λv) = 0. Eigenvectors are stationary points. The power method v ← M(I, v)/‖M(I, v)‖ is a version of gradient ascent.

Tensor: ∇L(v, λ) = 3(T(I, v, v) − λv) = 0. Eigenvectors are stationary points. The power method v ← T(I, v, v)/‖T(I, v, v)‖ is a version of gradient ascent.

58 / 75

slide-139
SLIDE 139

Optimization Viewpoint of Matrix/Tensor Eigen Decomposition

Optimization Problem

Matrix: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v^⊤v − 1).

Tensor: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − 1.5 λ(v^⊤v − 1).

Non-convex: stationary points = {global optima, local optima, saddle points}

Stationary points: first derivative ∇L(v, λ) = 0

Matrix: ∇L(v, λ) = 2(M(I, v) − λv) = 0. Eigenvectors are stationary points. The power method v ← M(I, v)/‖M(I, v)‖ is a version of gradient ascent.

Tensor: ∇L(v, λ) = 3(T(I, v, v) − λv) = 0. Eigenvectors are stationary points. The power method v ← T(I, v, v)/‖T(I, v, v)‖ is a version of gradient ascent.

Local optima: w^⊤ ∇²L(v, λ) w < 0 for all w ⊥ v, at a stationary point v

Matrix: v1 is the only local optimum; all other eigenvectors are saddle points.

Tensor: {v_h}_{h=1}^K are the only local optima; all spurious eigenvectors are saddle points.

58 / 75

slide-140
SLIDE 140

Question: What about performance under noise?

59 / 75

slide-141
SLIDE 141

Tensor Perturbation Analysis

T̂ = T + E, T = Σ_h λ_h v_h^{⊗3}, ‖E‖ := max_{x:‖x‖=1} |E(x, x, x)| ≤ ǫ.

60 / 75

slide-142
SLIDE 142

Tensor Perturbation Analysis

T̂ = T + E, T = Σ_h λ_h v_h^{⊗3}, ‖E‖ := max_{x:‖x‖=1} |E(x, x, x)| ≤ ǫ.

Theorem: Let T be the number of iterations. If T ≥ log K + log log(λ_max/ǫ) and ǫ < λ_min/K, then the output (v̂, λ̂) (after polynomially many restarts) satisfies
‖v̂ − v1‖ ≤ O(ǫ/λ1), |λ̂ − λ1| ≤ O(ǫ),
where v1 is such that λ1|c1| > λ2|c2| ≥ . . . , c_i := ⟨v_i, u_0⟩, and u_0 is the (successful) initializer.

60 / 75

slide-143
SLIDE 143

Tensor Perturbation Analysis

T̂ = T + E, T = Σ_h λ_h v_h^{⊗3}, ‖E‖ := max_{x:‖x‖=1} |E(x, x, x)| ≤ ǫ.

Theorem: Let T be the number of iterations. If T ≥ log K + log log(λ_max/ǫ) and ǫ < λ_min/K, then the output (v̂, λ̂) (after polynomially many restarts) satisfies
‖v̂ − v1‖ ≤ O(ǫ/λ1), |λ̂ − λ1| ≤ O(ǫ),
where v1 is such that λ1|c1| > λ2|c2| ≥ . . . , c_i := ⟨v_i, u_0⟩, and u_0 is the (successful) initializer.

Careful analysis of deflation avoids the buildup of errors. This implies polynomial sample complexity for learning.

60 / 75

slide-144
SLIDE 144

Other tensor decomposition techniques

61 / 75

slide-145
SLIDE 145

Orthogonal Tensor Decomposition

Simultaneous Power Method

(Wang & Lu, 2017)

◮ Simultaneous recovery of eigenvectors ◮ Initialization is not optimal

Orthogonalized Simultaneous Alternating Least Square

(Sharan & Valiant, 2017)

◮ Random initialization ◮ Proved convergence for symmetric tensor

Initialization

SVD based initialization (Anandkumar & Janzamin, 2014). State-of-the-art (trace based) initialization (Li & Huang, 2018).

62 / 75

slide-146
SLIDE 146

Outline

1

Introduction

2

Motivation: Challenges of MLE for Gaussian Mixtures

3

Introduction of Method of Moments and Tensor Notations

4

Topic Model for Single-topic Documents

5

Algorithms for Tensor Decompositions

6

Tensor Decomposition for Neural Network Compression

7

Conclusion

63 / 75

slide-147
SLIDE 147

Neural Network - Nonlinear Function Approximation

Image classification Speech recognition Text processing

Success of Deep Neural Networks

computation power growth enormous labeled data

64 / 75

slide-148
SLIDE 148

Neural Network - Nonlinear Function Approximation

Image classification Speech recognition Text processing

Success of Deep Neural Networks

computation power growth enormous labeled data

Expressive Power

linear composition vs. nonlinear composition; shallow network vs. deep structure

64 / 75

slide-149
SLIDE 149

Revolution of Depth

[Figure: ImageNet classification top-5 error (%): ILSVRC'10 (shallow) 28.2; ILSVRC'11 (shallow) 25.8; ILSVRC'12 AlexNet (8 layers) 16.4; ILSVRC'13 (8 layers) 11.7; ILSVRC'14 VGG (19 layers) 7.3; ILSVRC'14 GoogleNet (22 layers) 6.7; ILSVRC'15 ResNet (152 layers) 3.57]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

65 / 75

slide-150
SLIDE 150

Revolution of Depth

[Figure: AlexNet, 8 layers (ILSVRC 2012): 11x11 conv, 96, /4, pool/2; 5x5 conv, 256, pool/2; 3x3 conv, 384; 3x3 conv, 384; 3x3 conv, 256, pool/2; fc, 4096; fc, 4096; fc, 1000]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

65 / 75

slide-151
SLIDE 151

Revolution of Depth

[Figure: layer-by-layer comparison of AlexNet, 8 layers (ILSVRC 2012) with VGG, 19 layers (ILSVRC 2014) and GoogleNet, 22 layers (ILSVRC 2014)]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

65 / 75

slide-152
SLIDE 152

Revolution of Depth

[Figure: layer-by-layer comparison of AlexNet, 8 layers (ILSVRC 2012), VGG, 19 layers (ILSVRC 2014), and ResNet, 152 layers (ILSVRC 2015)]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

65 / 75

slide-153
SLIDE 153

Revolution of Depth

[Figure: PASCAL VOC 2007 object detection mAP (%): HOG, DPM (shallow) 34; AlexNet, RCNN (8 layers) 58; VGG, RCNN (16 layers) 66; ResNet, Faster RCNN* (101 layers) 86. *w/ other improvements & more data]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

Engines of visual recognition

65 / 75

slide-154
SLIDE 154

Challenges for Large Deep Neural Networks

Learning

Learning takes longer, might not converge, and is susceptible to vanishing/exploding gradients. One-time cost.

66 / 75

slide-155
SLIDE 155

Challenges for Large Deep Neural Networks

Learning

Learning takes longer, might not converge, and is susceptible to vanishing/exploding gradients. One-time cost.

Test

Requires a large amount of computation and memory storage.

◮ Ill-suited for smartphones or IoT devices.

Repeated cost.

66 / 75

slide-156
SLIDE 156

Challenges for Large Deep Neural Networks

Learning

Learning takes longer, might not converge, and is susceptible to vanishing/exploding gradients. One-time cost.

Test

Requires a large amount of computation and memory storage.

◮ Ill-suited for smartphones or IoT devices.

Repeated cost. How to compress the neural network without much performance loss?

66 / 75

slide-157
SLIDE 157

Common Types of Tensor Decompositions

m-order tensor T ∈ R^{I_0×I_1×···×I_{m−1}}

67 / 75

slide-158
SLIDE 158

Common Types of Tensor Decompositions

m-order tensor T ∈ R^{I_0×I_1×···×I_{m−1}}

CANDECOMP/PARAFAC (CP) Decomposition
Factorize a tensor into a sum of rank-1 tensors; a rank-1 tensor is the outer product of multiple vectors:
T_{i_0,···,i_{m−1}} = Σ_{r=0}^{R−1} M^{(0)}_{r,i_0} · · · M^{(m−1)}_{r,i_{m−1}}

67 / 75

slide-159
SLIDE 159

Common Types of Tensor Decompositions

m-order tensor T ∈ R^{I_0×I_1×···×I_{m−1}}

CANDECOMP/PARAFAC (CP) Decomposition
Factorize a tensor into a sum of rank-1 tensors; a rank-1 tensor is the outer product of multiple vectors:
T_{i_0,···,i_{m−1}} = Σ_{r=0}^{R−1} M^{(0)}_{r,i_0} · · · M^{(m−1)}_{r,i_{m−1}}

Tucker (TK) Decomposition
More general than CP decomposition; a multilinear operation on a core tensor C: C(M^{(0)}, . . . , M^{(m−1)})
T_{i_0,···,i_{m−1}} = Σ_{r_0=0}^{R_0−1} · · · Σ_{r_{m−1}=0}^{R_{m−1}−1} C_{r_0,...,r_{m−1}} M^{(0)}_{r_0,i_0} · · · M^{(m−1)}_{r_{m−1},i_{m−1}}

67 / 75

slide-160
SLIDE 160

Common Types of Tensor Decompositions

m-order tensor T ∈ R^{I_0×I_1×···×I_{m−1}}

CANDECOMP/PARAFAC (CP) Decomposition
Factorize a tensor into a sum of rank-1 tensors; a rank-1 tensor is the outer product of multiple vectors:
T_{i_0,···,i_{m−1}} = Σ_{r=0}^{R−1} M^{(0)}_{r,i_0} · · · M^{(m−1)}_{r,i_{m−1}}

Tucker (TK) Decomposition
More general than CP decomposition; a multilinear operation on a core tensor C: C(M^{(0)}, . . . , M^{(m−1)})
T_{i_0,···,i_{m−1}} = Σ_{r_0=0}^{R_0−1} · · · Σ_{r_{m−1}=0}^{R_{m−1}−1} C_{r_0,...,r_{m−1}} M^{(0)}_{r_0,i_0} · · · M^{(m−1)}_{r_{m−1},i_{m−1}}

Tensor-Train (TT) Decomposition
Factorize a tensor into a number of interconnected lower-order tensors:
T_{i_0,...,i_{m−1}} = Σ_{r_0=0}^{R_0−1} · · · Σ_{r_{m−2}=0}^{R_{m−2}−1} T^{(0)}_{i_0,r_0} T^{(1)}_{r_0,i_1,r_1} · · · T^{(m−1)}_{r_{m−2},i_{m−1}}

67 / 75
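The three factorizations differ only in how an entry of T is assembled from the factors. The numpy sketch below mirrors the index formulas above for a 3rd-order example; all shapes and ranks here are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
I0, I1, I2 = 4, 5, 6

# CP: T[i0,i1,i2] = sum_r M0[r,i0] * M1[r,i1] * M2[r,i2]
R = 3
M0, M1, M2 = (rng.standard_normal((R, n)) for n in (I0, I1, I2))
T_cp = np.einsum('ri,rj,rk->ijk', M0, M1, M2)

# Tucker: T[i0,i1,i2] = sum_{r0,r1,r2} C[r0,r1,r2] * A0[r0,i0] * A1[r1,i1] * A2[r2,i2]
R0, R1, R2 = 2, 3, 2
C = rng.standard_normal((R0, R1, R2))
A0, A1, A2 = (rng.standard_normal((r, n)) for r, n in ((R0, I0), (R1, I1), (R2, I2)))
T_tk = np.einsum('abc,ai,bj,ck->ijk', C, A0, A1, A2)

# Tensor-train: T[i0,i1,i2] = sum_{r0,r1} G0[i0,r0] * G1[r0,i1,r1] * G2[r1,i2]
G0 = rng.standard_normal((I0, R0))
G1 = rng.standard_normal((R0, I1, R1))
G2 = rng.standard_normal((R1, I2))
T_tt = np.einsum('ia,ajb,bk->ijk', G0, G1, G2)

print(T_cp.shape, T_tk.shape, T_tt.shape)   # each is (4, 5, 6)
```

Libraries such as TensorLy (listed on the software slide at the end) provide these decompositions directly; the snippet only illustrates the index patterns.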

slide-161
SLIDE 161

Compression of Convolutional Layer w/ Tensor Decompositions

Convolutional Kernel: tensor K ∈ R^{H×W×S×T}

68 / 75

slide-162
SLIDE 162

Compression of Convolutional Layer w/ Tensor Decompositions

Convolutional Kernel: tensor K ∈ R^{H×W×S×T}

Filter height/width H/W; No. of input/output channels S/T.

68 / 75

slide-163
SLIDE 163

Compression of Convolutional Layer w/ Tensor Decompositions

Convolutional Kernel: tensor K ∈ R^{H×W×S×T}

Filter height/width H/W; No. of input/output channels S/T.
Map an input tensor U ∈ R^{X×Y×S} to an output tensor V ∈ R^{X′×Y′×T}.

68 / 75

slide-164
SLIDE 164

Compression of Convolutional Layer w/ Tensor Decompositions

Convolutional Kernel: tensor K ∈ R^{H×W×S×T}

Filter height/width H/W; No. of input/output channels S/T.
Map an input tensor U ∈ R^{X×Y×S} to an output tensor V ∈ R^{X′×Y′×T}.

Kernel CP Decomposition

CP: Decompose kernel K into 3 factor tensors:
K_{i,j,s,t} = Σ_{r=0}^{R−1} K^{(0)}_{s,r} K^{(1)}_{i,j,r} K^{(2)}_{r,t}

No. of param.: HWST → (HW + S + T)R

[Figure: CP decomposition of the kernel, with factor dimensions H, W, R, S, T]

68 / 75
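As a sanity check on the parameter count above, the sketch below builds a CP-factored kernel and compares stored parameters against the dense H×W×S×T kernel; the sizes and rank are illustrative assumptions, not values from the slides' experiments.

```python
import numpy as np

H, W, S, T, R = 3, 3, 64, 128, 16          # illustrative kernel size and CP rank
rng = np.random.default_rng(0)
K0 = rng.standard_normal((S, R))            # input-channel factor  K^(0)_{s,r}
K1 = rng.standard_normal((H, W, R))         # spatial factor        K^(1)_{i,j,r}
K2 = rng.standard_normal((R, T))            # output-channel factor K^(2)_{r,t}

# Dense kernel recovered by contracting the factors over the shared rank index r.
K = np.einsum('sr,ijr,rt->ijst', K0, K1, K2)        # shape (H, W, S, T)

dense_params = H * W * S * T                 # HWST
cp_params = (H * W + S + T) * R              # (HW + S + T) R
print(K.shape, dense_params, cp_params)      # (3, 3, 64, 128) 73728 3216
```

In a network, the factored kernel can be applied without re-assembling K: a 1×1 convolution (S → R channels), a small H×W spatial convolution on the R channels, and another 1×1 convolution (R → T).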

slide-165
SLIDE 165

Compression of Convolutional Layer w/ Tensor Decompositions

Convolutional Kernel: tensor K ∈ R^{H×W×S×T}

Filter height/width H/W; No. of input/output channels S/T.
Map an input tensor U ∈ R^{X×Y×S} to an output tensor V ∈ R^{X′×Y′×T}.

Kernel TK Decomposition

TK: Decompose K into 1 core tensor and 2 factor tensors:
K_{i,j,s,t} = Σ_{r_s=0}^{R_s−1} Σ_{r_t=0}^{R_t−1} K^{(0)}_{s,r_s} K^{(1)}_{i,j,r_s,r_t} K^{(2)}_{r_t,t}

No. of param.: HWST → SR_s + HWR_sR_t + R_tT

[Figure: TK decomposition of the kernel, with factor dimensions H, W, R_s, R_t, S, T]

68 / 75

slide-166
SLIDE 166

Compression of Convolutional Layer w/ Tensor Decompositions

Convolutional Kernel: tensor K ∈ R^{H×W×S×T}

Filter height/width H/W; No. of input/output channels S/T.
Map an input tensor U ∈ R^{X×Y×S} to an output tensor V ∈ R^{X′×Y′×T}.

Kernel TT Decomposition

TT: Decompose K into 4 factor tensors:
K_{i,j,s,t} = Σ_{r_s=0}^{R_s−1} Σ_{r=0}^{R−1} Σ_{r_t=0}^{R_t−1} K^{(0)}_{s,r_s} K^{(1)}_{r_s,i,r} K^{(2)}_{r,j,r_t} K^{(3)}_{r_t,t}

No. of param.: HWST → SR_s + HR_sR + WR_tR + R_tT

[Figure: TT decomposition of the kernel, with factor dimensions H, W, R_s, R, R_t, S, T]

68 / 75
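Putting the three kernel factorizations side by side, a small helper (a sketch with illustrative ranks, assumed equal here for simplicity) evaluates the parameter-count formulas from the last three slides.

```python
def conv_kernel_params(H, W, S, T, R, Rs, Rt):
    """Parameter counts for a dense kernel and its CP / TK / TT factorizations."""
    return {
        "dense": H * W * S * T,
        "CP": (H * W + S + T) * R,
        "TK": S * Rs + H * W * Rs * Rt + Rt * T,
        "TT": S * Rs + H * Rs * R + W * Rt * R + Rt * T,
    }

print(conv_kernel_params(H=3, W=3, S=64, T=128, R=16, Rs=16, Rt=16))
# {'dense': 73728, 'CP': 3216, 'TK': 5376, 'TT': 4608}
```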

slide-167
SLIDE 167

Tensorized Spectrum Preserving Compression of Neural Networks

Convolutional Kernel: K ∈ R^{H×W×S×T} tensorized to K′ ∈ R^{H×W×S_0×···×S_{m−1}×T_0×···×T_{m−1}}

69 / 75

slide-168
SLIDE 168

Tensorized Spectrum Preserving Compression of Neural Networks

Convolutional Kernel: K ∈ R^{H×W×S×T} tensorized to K′ ∈ R^{H×W×S_0×···×S_{m−1}×T_0×···×T_{m−1}}

Tensorization: kernel reshaped to a higher-order tensor.

69 / 75

slide-169
SLIDE 169

Tensorized Spectrum Preserving Compression of Neural Networks

Convolutional Kernel: K ∈ R^{H×W×S×T} tensorized to K′ ∈ R^{H×W×S_0×···×S_{m−1}×T_0×···×T_{m−1}}

Tensorization: kernel reshaped to a higher-order tensor. S = ∏_{i=0}^{m−1} S_i and T = ∏_{i=0}^{m−1} T_i.

69 / 75

slide-170
SLIDE 170

Tensorized Spectrum Preserving Compression of Neural Networks

Convolutional Kernel: K ∈ R^{H×W×S×T} tensorized to K′ ∈ R^{H×W×S_0×···×S_{m−1}×T_0×···×T_{m−1}}

Tensorization: kernel reshaped to a higher-order tensor. S = ∏_{i=0}^{m−1} S_i and T = ∏_{i=0}^{m−1} T_i.

Input tensor U ∈ R^{X×Y×S} tensorized to U′ ∈ R^{X×Y×S_0×···×S_{m−1}}.

69 / 75

slide-171
SLIDE 171

Tensorized Spectrum Preserving Compression of Neural Networks

Convolutional Kernel: K ∈ R^{H×W×S×T} tensorized to K′ ∈ R^{H×W×S_0×···×S_{m−1}×T_0×···×T_{m−1}}

Tensorization: kernel reshaped to a higher-order tensor. S = ∏_{i=0}^{m−1} S_i and T = ∏_{i=0}^{m−1} T_i.

Input tensor U ∈ R^{X×Y×S} tensorized to U′ ∈ R^{X×Y×S_0×···×S_{m−1}}. Output V ∈ R^{X′×Y′×T} reshaped to V′ ∈ R^{X′×Y′×T_0×···×T_{m−1}}.

69 / 75
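The tensorization step itself is just a reshape. The sketch below (with assumed factorizations of S and T; all sizes are illustrative) shows the kernel and the input feature map being reshaped to the higher-order forms used on these slides.

```python
import numpy as np

H, W = 3, 3
S_dims, T_dims = (4, 4, 4), (4, 4, 8)       # assumed factorizations: S = 64, T = 128, m = 3
S, T = int(np.prod(S_dims)), int(np.prod(T_dims))

rng = np.random.default_rng(0)
K = rng.standard_normal((H, W, S, T))                       # original kernel
K_prime = K.reshape((H, W) + S_dims + T_dims)               # tensorized kernel K'

X, Y = 32, 32
U = rng.standard_normal((X, Y, S))                          # input feature map
U_prime = U.reshape((X, Y) + S_dims)                        # tensorized input U'

print(K_prime.shape, U_prime.shape)
# (3, 3, 4, 4, 4, 4, 4, 8) (32, 32, 4, 4, 4)
```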

slide-172
SLIDE 172

Tensorized Spectrum Preserving Compression of Neural Networks

Convolutional Kernel: K ∈ R^{H×W×S×T} tensorized to K′ ∈ R^{H×W×S_0×···×S_{m−1}×T_0×···×T_{m−1}}

Tensorization: kernel reshaped to a higher-order tensor. S = ∏_{i=0}^{m−1} S_i and T = ∏_{i=0}^{m−1} T_i.

Input tensor U ∈ R^{X×Y×S} tensorized to U′ ∈ R^{X×Y×S_0×···×S_{m−1}}. Output V ∈ R^{X′×Y′×T} reshaped to V′ ∈ R^{X′×Y′×T_0×···×T_{m−1}}.

Tensorized Kernel CP Decomposition

[Figure: CP decomposition of the kernel vs. CP decomposition of the tensorized kernel, with rank R and mode sizes H, W, S_0, …, S_{m−1}, T_0, …, T_{m−1}]

Param. No.: HWST → (HW + S + T)R → (m(ST)^{1/m} + HW)R

69 / 75

slide-173
SLIDE 173

Tensorized Spectrum Preserving Compression of Neural Networks

Convolutional Kernel: K ∈ R^{H×W×S×T} tensorized to K′ ∈ R^{H×W×S_0×···×S_{m−1}×T_0×···×T_{m−1}}

Tensorization: kernel reshaped to a higher-order tensor. S = ∏_{i=0}^{m−1} S_i and T = ∏_{i=0}^{m−1} T_i.

Input tensor U ∈ R^{X×Y×S} tensorized to U′ ∈ R^{X×Y×S_0×···×S_{m−1}}. Output V ∈ R^{X′×Y′×T} reshaped to V′ ∈ R^{X′×Y′×T_0×···×T_{m−1}}.

Tensorized Kernel TK Decomposition

[Figure: TK decomposition of the kernel vs. TK decomposition of the tensorized kernel, with ranks R_s, R_t, R and mode sizes H, W, S_0, …, S_{m−1}, T_0, …, T_{m−1}]

Param. No.: HWST → SR_s + HWR_sR_t + R_tT → m(S^{1/m} + T^{1/m})R + HWR^{2m}

69 / 75

slide-174
SLIDE 174

Tensorized Spectrum Preserving Compression of Neural Networks

Convolutional Kernel: K ∈ R^{H×W×S×T} tensorized to K′ ∈ R^{H×W×S_0×···×S_{m−1}×T_0×···×T_{m−1}}

Tensorization: kernel reshaped to a higher-order tensor. S = ∏_{i=0}^{m−1} S_i and T = ∏_{i=0}^{m−1} T_i.

Input tensor U ∈ R^{X×Y×S} tensorized to U′ ∈ R^{X×Y×S_0×···×S_{m−1}}. Output V ∈ R^{X′×Y′×T} reshaped to V′ ∈ R^{X′×Y′×T_0×···×T_{m−1}}.

Tensorized Kernel TT Decomposition

[Figure: TT decomposition of the kernel vs. TT decomposition of the tensorized kernel, with ranks R_s, R, R_t and mode sizes H, W, S_0, …, S_{m−1}, T_0, …, T_{m−1}]

Param. No.: HWST → SR_s + HR_sR + WR_tR + R_tT → (m(ST)^{1/m}R + HW)R

69 / 75
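The tensorized counts can be evaluated the same way as the non-tensorized ones. The helper below simply codes the formulas from the last three slides as written; the sizes and the single rank R are illustrative assumptions (the formulas implicitly treat the factors as roughly balanced, S_i ≈ S^{1/m} and T_i ≈ T^{1/m}), so the numbers are not directly comparable to any accuracy result.

```python
def tensorized_kernel_params(H, W, S, T, R, m):
    """Tensorized CP / TK / TT parameter counts, following the formulas on the slides."""
    st_root = (S * T) ** (1.0 / m)
    return {
        "t-CP": (m * st_root + H * W) * R,
        "t-TK": m * (S ** (1.0 / m) + T ** (1.0 / m)) * R + H * W * R ** (2 * m),
        "t-TT": (m * st_root * R + H * W) * R,
    }

print(tensorized_kernel_params(H=3, W=3, S=64, T=128, R=2, m=3))
```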

slide-175
SLIDE 175

Experiments - Compress CIFAR10 Resnet-34

Successful Compression of CIFAR10 Resnet-34 Network (Su, Li, Bhattacharjee & Huang, 2018)

Testing accuracies (%) of the tensor methods under different compression rates:

Method   SPC, E2E                        t-SPC, Seq.
         5%     10%    20%    40%        2%     5%     10%    20%
CP       84.02  86.93  88.75  88.75      85.7   89.86  91.28  -
TK       83.57  86.00  88.03  89.35      61.06  71.34  81.59  87.11
TT       77.44  82.92  84.13  86.64      78.95  84.26  87.89  -

The uncompressed network achieves 93.2% accuracy. CIFAR10 Resnet-34 has 4 × 10^5 parameters that have to be trained and retained during testing.

70 / 75

slide-176
SLIDE 176

Experiments - Compress ImageNet Resnet-50

Successful Compression of ImageNet Resnet-50 Network (Su, Li, Bhattacharjee & Huang, 2018)

# Epochs   Uncompressed   SPC-TT (E2E)   t-SPC-TT (Seq.)
0.2        4.22           0.66x          10.51x
0.3        6.23           0.64x          7.54x
0.5        9.01           0.83x          5.54x
1.0        17.3           0.74x          3.04x
2.0        30.8           0.59x          1.75x

Testing accuracy of the tensor methods compared to the uncompressed ImageNet Resnet-50. The accuracies of the tensor-method results (both non-tensorized and tensorized) are shown normalized to the uncompressed network's accuracy.

71 / 75

slide-177
SLIDE 177

Outline

1 Introduction
2 Motivation: Challenges of MLE for Gaussian Mixtures
3 Introduction of Method of Moments and Tensor Notations
4 Topic Model for Single-topic Documents
5 Algorithms for Tensor Decompositions
6 Tensor Decomposition for Neural Network Compression
7 Conclusion

72 / 75

slide-178
SLIDE 178

Conclusion

Method-of-moments can efficiently estimate parameters for many latent variable models.

◮ Exploit distributional properties, multi-view structure, and other structure to determine usable moment tensors.
◮ Some efficient algorithms for carrying out the tensor decomposition to obtain parameter estimates.

Tensor decomposition of neural network kernels/weights effectively compresses the network.

Many issues to resolve:

◮ Handle model misspecification, increase robustness.
◮ Learning deep neural network parameters using tensor decomposition?

73 / 75

slide-179
SLIDE 179

A Short List of Related Papers to Today’s Talk

“A Method of Moments for Mixture Models and Hidden Markov Models”, by Anima Anandkumar, Daniel Hsu and Sham Kakade. In Conference on Learning Theory, 2012.
“Tensor Decompositions for Learning Latent Variable Models”, by Anima Anandkumar, Rong Ge, Daniel Hsu, Sham Kakade and Matus Telgarsky. In Journal of Machine Learning Research, 2014.
“Escaping from Saddle Points: Online Stochastic Gradient for Tensor Decomposition”, by Rong Ge, Furong Huang, Chi Jin and Yang Yuan. In Conference on Learning Theory, 2015.
“Online Tensor Methods for Learning Latent Variable Models”, by Furong Huang, Niranjan U. N., Mohammad Umar Hakeem and Anima Anandkumar. In Journal of Machine Learning Research, 2016.
“Guaranteed Simultaneous Asymmetric Tensor Decomposition via Orthogonalized Alternating Least Squares”, by Jialin Li and Furong Huang, 2018.
“Tensorized Spectrum Preserving Compression for Neural Networks”, by Jiahao Su, Jingling Li, Bobby Bhattacharjee and Furong Huang, 2018.

74 / 75

slide-180
SLIDE 180

Tensor Software

Spark implementation of the method of moments to learn Latent Dirichlet Allocation, available at https://github.com/FurongHuang/spectrallda-tensorspark.
Tensorly: Simple and Fast Tensor Learning in Python, available at http://tensorly.org/stable/home.html.
A general library with higher-order tensor operations is coming soon.

75 / 75