SLIDE 1

Method-of-moments

Daniel Hsu

SLIDE 2

Example: modeling the topics of a document corpus

Goal: model the topics of documents in a corpus.

[Diagram: sample of documents → learning algorithm → model parameters θ]

SLIDES 3-4

Topic model

(e.g., Hofmann, '99; Blei-Ng-Jordan, '03)

k topics (distributions over vocab words), e.g., sports, science, business, politics. Each document ↔ mixture of topics. Words in document ∼iid mixture dist.

[Figure: a document's words drawn ∼iid from a mixture of the topics with weights 0.6, 0.3, 0.1, and 0.]

E.g., Prθ[“play” | sports] = 0.0002, Prθ[“game” | sports] = 0.0003, Prθ[“season” | sports] = 0.0001, . . .

SLIDE 5

Learning topic models

Topic model: k topics (dists. over d words) µ1, . . . , µk; each document ↔ mixture of topics; words in document ∼iid mixture dist.

SLIDES 6-8

Learning topic models

Simple topic model (each document about a single topic): k topics (dists. over d words) µ1, . . . , µk; topic t chosen with prob. wt, words in document ∼iid µt.

◮ Input: sample of documents, generated by the simple topic model with unknown parameters θ⋆ := {(µt⋆, wt⋆)}.

◮ Task: find parameters θ := {(µt, wt)} so that θ ≈ θ⋆.
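
To make this generative process concrete, here is a minimal sketch (not from the talk), assuming numpy; the function name and the toy parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_document(mu, w, length):
    """Simple topic model: pick ONE topic per document, then draw words iid.
    mu: (k, d) topic-word distributions; w: (k,) topic probabilities."""
    t = rng.choice(len(w), p=w)                           # topic t chosen with prob. w_t
    return rng.choice(mu.shape[1], size=length, p=mu[t])  # words ~iid mu_t

# Toy instance: k = 2 topics over a d = 5 word vocabulary.
mu = np.array([[0.4, 0.3, 0.2, 0.1, 0.0],
               [0.0, 0.1, 0.2, 0.3, 0.4]])
w = np.array([0.7, 0.3])
docs = [sample_document(mu, w, length=50) for _ in range(1000)]
```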

SLIDES 9-14

Some approaches to estimation

Maximum-likelihood (e.g., Fisher, 1912):

θMLE := arg maxθ Prθ[data].

Current practice (> 40 years): local search for local maxima — can be quite far from θMLE.

Method-of-moments (Pearson, 1894):

Find parameters θ that (approximately) satisfy a system of equations based on the data. Many ways to instantiate & implement.

SLIDES 15-17

Moments: normal distribution

Normal distribution: x ∼ N(µ, v). First- and second-order moments:

E(µ,v)[x] = µ,   E(µ,v)[x²] = µ² + v.

Method-of-moments estimators of µ⋆ and v⋆: find µ̂ and v̂ s.t.

ES[x] ≈ µ̂,   ES[x²] ≈ µ̂² + v̂.

A reasonable solution: µ̂ := ES[x], v̂ := ES[x²] − µ̂², since ES[x] → E(µ⋆,v⋆)[x] and ES[x²] → E(µ⋆,v⋆)[x²] by LLN.
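
The same recipe in code, as a sanity check: a minimal numpy sketch (names illustrative) that matches the first two empirical moments of a Gaussian sample:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)  # sample S with mu* = 2, v* = 9

m1 = x.mean()             # E_S[x]
m2 = (x ** 2).mean()      # E_S[x^2]
mu_hat = m1               # mu_hat := E_S[x]
v_hat = m2 - mu_hat ** 2  # v_hat := E_S[x^2] - mu_hat^2
print(mu_hat, v_hat)      # close to (2, 9) by the LLN
```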

SLIDES 18-19

Moments: simple topic model

For any n-tuple (i1, i2, . . . , in) ∈ Vocabularyⁿ:

(Population) moments under some parameter θ:
Prθ[document contains words i1, i2, . . . , in],
e.g., Prθ[“machine” & “learning” co-occur].

Empirical moments from sample S of documents:
PrS[document contains words i1, i2, . . . , in],
i.e., empirical frequency of co-occurrences in sample S.
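
A sketch (assuming numpy, with documents encoded as integer word indices; names illustrative) of how such empirical pair moments could be tabulated:

```python
import numpy as np

def pair_cooccurrence(docs, d):
    """Estimate Pr_S[document contains words i and j] for all pairs (i, j).
    docs: list of integer arrays of word indices; d: vocabulary size."""
    counts = np.zeros((d, d))
    for doc in docs:
        present = np.unique(doc)                 # distinct words in this doc
        counts[np.ix_(present, present)] += 1.0  # each pair co-occurs once
    return counts / len(docs)
```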

SLIDE 20

Method-of-moments

Method-of-moments strategy: given data sample S, find θ to satisfy the system of equations

momentsθ = momentsS.

(Recall: we expect momentsS ≈ momentsθ⋆ by LLN.)

Q1. Which moments should we use?
Q2. How do we (approx.) solve these moment equations?

SLIDES 21-29

Q1. Which moments should we use?

moment order | reliable estimates? | unique solution?
1st, 2nd     |          ✓          |        ✗
Ω(k)th       |          ✗          |        ✓

1st- and 2nd-order moments (e.g., prob. of word pairs):
◮ Fairly easy to get reliable estimates: PrS[“machine”, “learning”] ≈ Prθ⋆[“machine”, “learning”].
◮ Can have multiple solutions to the moment equations: momentsθ1 = moments = momentsθ2 with θ1 ≠ θ2.

Ω(k)th-order moments (prob. of word k-tuples):
◮ Uniquely pin down the solution. [Moitra-Valiant, '10]
◮ Empirical estimates very unreliable.

Can we get best-of-both-worlds? Yes! In high dimensions, low-order multivariate moments suffice (1st-, 2nd-, and 3rd-order moments) [this work].

[Figure: prior work arranged by order of moments used. 1st/2nd order: McSherry '01; Vempala-Wang '02; Kleinberg-Sandler '04; Arora-Ge-Moitra '12. Ω(k)th order: Prony 1795; Lindsay '89; Moitra-Valiant '10; Gravin et al. '12. Up to 3rd order: this work.]

SLIDE 30
SLIDE 30

Low-order multivariate moments suffice

Key observation: in high dimensions (d ≫ k), low-order moments have simple (“low-rank”) algebraic structure.


SLIDES 31-40

Low-order multivariate moments suffice

Pθ := [ Prθ[words i, j] ]   (empirical version: P)
Tθ := [ Prθ[words i, j, k] ]   (empirical version: T)

Given a document about topic t,
Prθ[ words i, j | topic t ] = (µt)i · (µt)j = (µt ⊗ µt)i,j.

Averaging over topics,
Prθ[ words i, j ] = Σ_t wt · (µt ⊗ µt)i,j,
and similarly
Prθ[ words i, j, k ] = Σ_t wt · (µt ⊗ µt ⊗ µt)i,j,k.

In matrix and tensor notation:

Pθ = Σ_{t=1}^{k} wt µt ⊗ µt   and   Tθ = Σ_{t=1}^{k} wt µt ⊗ µt ⊗ µt,

a low-rank matrix and tensor.

Moment equations: Pθ = P, Tθ = T
(i.e., find low-rank decompositions of the empirical moments).

Claim: Pθ and Tθ uniquely determine the parameters θ.
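
These population moments are easy to write down with einsum; a minimal sketch assuming numpy (function name illustrative):

```python
import numpy as np

def population_moments(mu, w):
    """P = sum_t w_t mu_t (x) mu_t and T = sum_t w_t mu_t (x) mu_t (x) mu_t.
    mu: (k, d) topic distributions; w: (k,) mixing weights."""
    P = np.einsum('t,ti,tj->ij', w, mu, mu)
    T = np.einsum('t,ti,tj,tk->ijk', w, mu, mu, mu)
    return P, T  # rank-k d x d matrix and d x d x d tensor
```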

SLIDES 41-44

Reduction to orthogonal case via whitening

Pθ = Σ_t wt µt ⊗ µt defines a “whitened” coordinate system.

Technical reduction: apply the change-of-basis transformation Pθ^{−1/2} to Tθ:

Tθ = Σ_{t=1}^{k} wt µt ⊗ µt ⊗ µt   →   Bθ = Σ_{t=1}^{k} λt vt ⊗ vt ⊗ vt,

where λt = 1/√wt and vt = Pθ^{−1/2} (√wt µt).

Upshot: {v1, v2, . . . , vk} are orthonormal, so the “whitened” third-order moment tensor Bθ has an orthogonal decomposition. (And {(λt, vt)} are related to the parameters {(wt, µt)}.)

Claim: the orthogonal decomposition of Bθ is unique.
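
A sketch of the whitening step, assuming numpy; here P^{−1/2} is taken on the rank-k range of P via an eigendecomposition (names illustrative):

```python
import numpy as np

def whiten(P, T, k):
    """Build W with W^T P W = I_k from the top-k eigenpairs of P, then
    form the whitened k x k x k tensor B = T(W, W, W)."""
    evals, evecs = np.linalg.eigh(P)
    top = np.argsort(evals)[::-1][:k]                # top-k eigenpairs of P
    W = evecs[:, top] / np.sqrt(evals[top])          # d x k whitening map
    B = np.einsum('ijk,ia,jb,kc->abc', T, W, W, W)   # orthogonally decomposable
    return W, B
```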

SLIDES 45-53

The spectral theorem and eigendecompositions

Any symmetric matrix: A = Σ_{i=1}^{k} λi vi ⊗ vi.
The decomposition is unique only if all eigenvalues λi are distinct.

[Figure: with a repeated eigenvalue, a rotated orthonormal pair {v1′, v2′} gives the same matrix as {v1, v2}.]

Special 3rd-order tensor: B = Σ_{i=1}^{k} λi vi ⊗ vi ⊗ vi.
If a decomposition exists, then it is always unique (even if the λi are all the same).

Uniqueness of the orthogonal decomposition (+ low-rank structure) implies that Pθ and Tθ uniquely determine θ.
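
The contrast can be checked numerically; a small sketch (assuming numpy, not from the talk) showing that rotating a repeated-eigenvalue pair leaves the matrix unchanged but not the tensor:

```python
import numpy as np

v1, v2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
c, s = np.cos(0.7), np.sin(0.7)
u1, u2 = c * v1 + s * v2, -s * v1 + c * v2     # rotated orthonormal pair

# Matrix with repeated eigenvalue 2: both bases give the SAME matrix.
A = 2 * (np.outer(v1, v1) + np.outer(v2, v2))
A_rot = 2 * (np.outer(u1, u1) + np.outer(u2, u2))
assert np.allclose(A, A_rot)

# The analogous 3rd-order tensors DIFFER: the decomposition is pinned down.
cube = lambda v: np.einsum('i,j,k->ijk', v, v, v)
B = 2 * (cube(v1) + cube(v2))
B_rot = 2 * (cube(u1) + cube(u2))
assert not np.allclose(B, B_rot)
```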

SLIDES 54-60

Q2. How to solve the moment equations?

Solve the moment equations via the optimization problem

min_θ ‖Tθ − T‖²  s.t.  Pθ = P.  (†)

Not convex in the parameters θ = {(µi, wi)}.

What we do: find one topic (µi, wi) at a time, using local optimization on a rank-1 approximation objective

min_{λ, v} ‖λ v ⊗ v ⊗ v − B‖²,  equivalently  max_{‖u‖ ≤ 1} Σ_{i,j,k} Bi,j,k ui uj uk,  (‡)

(after the change-of-coordinate system via P: T → B).

Can approximate all local optima u1∗, . . . , uk∗, each corresponding to a topic (µt⋆, wt⋆) → near-optimal solution to (†).
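
One standard way to find these local maximizers is tensor power iteration with deflation; a minimal sketch assuming numpy (one possible implementation, not necessarily the talk's):

```python
import numpy as np

def tensor_power_method(B, k, n_iters=100, seed=0):
    """Find k (eigenvalue, eigenvector) pairs of an orthogonally
    decomposable k x k x k tensor B by power iteration + deflation."""
    rng = np.random.default_rng(seed)
    pairs = []
    B = B.copy()
    for _ in range(k):
        u = rng.normal(size=B.shape[0])
        u /= np.linalg.norm(u)
        for _ in range(n_iters):
            u = np.einsum('ijk,j,k->i', B, u, u)        # u <- B(I, u, u)
            u /= np.linalg.norm(u)
        lam = np.einsum('ijk,i,j,k->', B, u, u, u)      # lam = B(u, u, u)
        pairs.append((lam, u))
        B = B - lam * np.einsum('i,j,k->ijk', u, u, u)  # deflate and repeat
    return pairs
```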

SLIDES 61-63

Variational argument

Interpret Pθ : Rd × Rd → R and Tθ : Rd × Rd × Rd → R as bi-linear and tri-linear forms.

Lemma. Assuming {µi} are linearly independent and wi > 0, each of the k distinct, isolated local maximizers u∗ of

max_{u ∈ Rd} Tθ(u, u, u)  s.t.  Pθ(u, u) ≤ 1  (‡)

satisfies, for some i ∈ [k],

Pθ u∗ = √wi µi,   Tθ(u∗, u∗, u∗) = 1/√wi.

∴ {(µi, wi) : i ∈ [k]} are uniquely determined by Pθ and Tθ.
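
The lemma's conclusion reads off directly as a recovery rule: since Tθ(u∗, u∗, u∗) = 1/√wi, we get wi = 1/Tθ(u∗, u∗, u∗)², and then µi = (Pθ u∗)/√wi. A sketch assuming numpy (function name illustrative):

```python
import numpy as np

def recover_topic(P, T, u_star):
    """Map a local maximizer u* of the variational problem back to (mu_i, w_i)."""
    t_val = np.einsum('ijk,i,j,k->', T, u_star, u_star, u_star)  # = 1/sqrt(w_i)
    w_i = 1.0 / t_val ** 2
    mu_i = (P @ u_star) * t_val     # (P u*) / sqrt(w_i)
    return mu_i, w_i
```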

SLIDES 64-67

Implementation of topic model estimator

Potential deal-breakers: explicitly forming T by counting word triples → Ω(d³) space, Ω(length³) time per document.

Can exploit algebraic structure to avoid these bottlenecks.

Implicit representation of T:

T ≈ (1/|S|) Σ_{h ∈ S} h ⊗ h ⊗ h,

where h ∈ Nd is the (sparse) histogram vector for a document.

Computation of the objective gradient at a vector u ∈ Rd: with the representation above,

T(u, u, ·) ≈ (1/|S|) Σ_{h ∈ S} (h⊤u)² h

(sparse vector operations; time = O(input size)).
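
A sketch of that computation, assuming each histogram is a scipy.sparse CSR row (names illustrative); the d × d × d tensor is never materialized:

```python
import numpy as np
from scipy.sparse import csr_matrix  # expected type of each histogram row h

def apply_T(hists, u):
    """Compute (1/|S|) * sum_h (h^T u)^2 h over sparse 1 x d CSR rows h."""
    out = np.zeros(u.shape[0])
    for h in hists:
        c = h.dot(u)[0]                     # h^T u via a sparse dot product
        out[h.indices] += (c * c) * h.data  # accumulate only the nonzeros
    return out / len(hists)
```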

SLIDES 68-69

Illustrative empirical results

◮ Corpus: 300,000 New York Times articles.
◮ Vocabulary size: 102,660 words.
◮ Number of topics: k := 50.

Predictive performance of a straightforward implementation: ≈ 4-8× speed-up over Gibbs sampling.

[Plot: log loss vs. training time (×10⁴ sec) for method-of-moments and Gibbs sampling.]

SLIDE 70

Illustrative empirical results

Sample topics (showing top 10 words for each topic):

Econ.       Baseball   Edu.        Health care   Golf
sales       run        school      drug          player
economic    inning     student     patient       tiger_wood
consumer    hit        teacher     million       won
major       game       program     company       shot
home        season     official    doctor        play
indicator   home       public      companies     round
weekly      right      children    percent       win
order       games      high        cost          tournament
claim       dodger     education   program       tour
scheduled   left       district    health        right

SLIDE 71

Illustrative empirical results

Sample topics (showing top 10 words for each topic):

Invest.      Election       Auto race   Child's Lit.   Afghan War
percent      al_gore        car         book           taliban
stock        campaign       race        children       attack
market       president      driver      ages           afghanistan
fund         george_bush    team        author         official
investor     bush           won         read           military
companies    clinton        win         newspaper      u_s
analyst      vice           racing      web            united_states
money        presidential   track       writer         terrorist
investment   million        season      written        war
economy      democratic     lap         sales          bin

SLIDE 72

Illustrative empirical results

Sample topics (showing top 10 words for each topic):

Web           Antitrust    TV           Movies      Music
com           court        show         film        music
www           case         network      movie       song
site          law          season       director    group
web           lawyer       nbc          play        part
sites         federal      cb           character   new_york
information   government   program      actor       company
online        decision     television   show        million
mail          trial        series       movies      band
internet      microsoft    night        million     show
telegram      right        new_york     part        album

(etc.)

SLIDES 73-75

Recap

Efficient learning algorithms for topic models, based on solving the moment equations momentsθ = momentsS.

Q1. Which moments should we use?
Suffices to use low-order (up to 3rd-order) moments, and exploit multivariate structure in high dimensions.

Q2. How do we (approx.) solve these moment equations?
Local optimization based on orthogonal tensor decompositions.

SLIDE 76

Structure in latent variable models

“Eigen-structure” Σ_{i=1}^{k} λi vi ⊗ vi ⊗ vi found in low-order moments for many other models of high-dimensional data.

[Figures: topic models (sports, science, business, politics); word distributions (bank, manager, investment); a parse tree for “the shortstop caught the ball”.]

SLIDES 77-78

Latent Dirichlet Allocation and Mixtures of Gaussians

Latent Dirichlet Allocation (Blei-Ng-Jordan, '03) topic model: k topics (distributions over d words); each document ↔ mixture of topics; the document's mixing weights ∼ Dirichlet(α); words in doc. ∼iid mixture dist.

Mixtures of Gaussians (Pearson, 1894): k sub-populations in Rd; the t-th sub-population is modeled as a Gaussian N(µt, Σt) with mixing weight wt.
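
LDA's per-document step differs from the simple model only in how the mixing weights arise; a minimal sketch assuming numpy (names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_lda_document(mu, alpha, length):
    """mu: (k, d) topics; alpha: (k,) Dirichlet parameter."""
    weights = rng.dirichlet(alpha)   # this document's topic proportions
    word_dist = weights @ mu         # mixture distribution over d words
    return rng.choice(mu.shape[1], size=length, p=word_dist)
```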

SLIDE 79

Finding the relevant eigenstructure

In both LDA and mixtures of axis-aligned Gaussians:

f(≤ 2nd-order momentsθ) = Σ_t wt µt ⊗ µt,
g(≤ 3rd-order momentsθ) = Σ_t wt µt ⊗ µt ⊗ µt,

for suitable f and g based on additional model structure.

SLIDES 80-81

Hidden Markov Models (HMMs)

[Diagram: hidden states h1 → h2 → · · · → hℓ, with observations x1, x2, . . . , xℓ]

Workhorse statistical model for sequence data (e.g., phoneme sequences: /k/ /a/ /t/).

◮ Hidden state variables h1 → h2 → · · · form a Markov chain.
◮ Observation xt at time t depends only on hidden state ht at time t.

SLIDES 82-83

Learning HMMs

Correlations between past, present, and future: ht−1 → ht → ht+1, with observations xt−1, xt, xt+1.

Suffices to use low-order (asymmetric) cross moments Eθ[ xt−1 ⊗ xt ⊗ xt+1 ].
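
A sketch of estimating that cross moment from data, assuming numpy and discrete observations encoded as symbol indices (so each xt is a one-hot vector):

```python
import numpy as np

def cross_moment(sequences, d):
    """Empirical E[x_{t-1} (x) x_t (x) x_{t+1}] with one-hot observations.
    sequences: list of integer arrays of observation symbols in {0,...,d-1}."""
    M = np.zeros((d, d, d))
    n = 0
    for seq in sequences:
        for t in range(1, len(seq) - 1):
            M[seq[t - 1], seq[t], seq[t + 1]] += 1.0
            n += 1
    return M / n
```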

SLIDE 84

Where to read more

Tensor decompositions for learning latent variable models.
A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, M. Telgarsky.
Journal of Machine Learning Research, 2014.
http://jmlr.org/papers/v15/anandkumar14b.html