SLIDE 1

An Introduction to Spectral Learning

Hanxiao Liu November 8, 2013

SLIDE 2

Outline

1. Method of Moments
2. Learning topic models using spectral properties
3. Anchor words

SLIDE 3

Preliminaries

Setting: X1, · · · , Xn ∼ p(x; θ), where θ = (θ1, · · · , θm)⊤.
An estimator is a statistic θ̂ = θ̂n = w(X1, · · · , Xn).

Maximum Likelihood Estimator (MLE):
θ̂ = argmax_θ log L(θ)

Bayes Estimator (BE):
θ̂ = E(θ | X) = ∫ θ p(x|θ) π(θ) dθ / ∫ p(x|θ) π(θ) dθ

SLIDE 4

Preliminaries

Question: What makes a good estimator?

• The MLE is consistent.
• Both the MLE and the BE are asymptotically normal:
  √n (θ̂n − θ) → N(0, 1/I(θ))
  under mild regularity conditions, where I(θ) is the Fisher information.
• But they can be computationally expensive.

SLIDE 5

Preliminaries

Example (Gamma distribution):
p(xi; α, θ) = 1/(Γ(α) θ^α) · xi^(α−1) exp(−xi/θ)

Likelihood of the sample:
L(α, θ) = (1/(Γ(α) θ^α))^n · (∏_{i=1}^n xi)^(α−1) · exp(−∑_{i=1}^n xi / θ)

The MLE is hard to compute because Γ(α) appears in the likelihood: there is no closed-form maximizer, so it must be found by iterative optimization.
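To make this concrete, here is a minimal numerical-MLE sketch (our own illustration, not from the slides; the log-reparametrization and the name `gamma_mle` are our choices). The point is that, unlike the moment estimator coming up, it needs an iterative optimizer:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def gamma_mle(x):
    """Numerical MLE for Gamma(alpha, theta). No closed form exists
    because Gamma(alpha) appears in the likelihood, so the negative
    log-likelihood is minimized iteratively."""
    x = np.asarray(x, dtype=float)
    n, sum_x, sum_log_x = x.size, x.sum(), np.log(x).sum()

    def nll(params):
        alpha, theta = np.exp(params)  # log-scale keeps both positive
        return -(-n * gammaln(alpha) - n * alpha * np.log(theta)
                 + (alpha - 1) * sum_log_x - sum_x / theta)

    return np.exp(minimize(nll, x0=[0.0, 0.0]).x)  # (alpha_hat, theta_hat)
```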
SLIDE 6

Method of Moments

j-th theoretical moment, j ∈ [k]:
µj(θ) := Eθ[X^j]

j-th sample moment, j ∈ [k]:
Mj := (1/n) ∑_{i=1}^n Xi^j

Plug in and solve the multivariate polynomial equations
Mj = µj(θ), j ∈ [k],
which can sometimes be recast as a spectral decomposition.

SLIDE 7

Method of Moments

Example (Gamma distribution):
p(xi; α, θ) = 1/(Γ(α) θ^α) · xi^(α−1) exp(−xi/θ)

Match the first two moments:
X̄ = E(Xi) = αθ
(1/n) ∑_{i=1}^n (Xi − X̄)² = Var(Xi) = αθ²

Solving gives closed-form estimators:
θ̂ = (1/(nX̄)) ∑_{i=1}^n (Xi − X̄)²
α̂ = X̄/θ̂ = nX̄² / ∑_{i=1}^n (Xi − X̄)²
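Because the moment equations solve in closed form, the estimator is a few lines of NumPy (a minimal sketch; the name `gamma_mom` is ours):

```python
import numpy as np

def gamma_mom(x):
    """Method-of-moments estimates for Gamma(alpha, theta),
    matching mean = alpha * theta and variance = alpha * theta**2."""
    x = np.asarray(x, dtype=float)
    mean, var = x.mean(), x.var()   # var = (1/n) * sum (x_i - mean)^2
    theta_hat = var / mean          # theta = var / mean
    alpha_hat = mean / theta_hat    # alpha = mean^2 / var
    return alpha_hat, theta_hat

# Sanity check on synthetic data:
rng = np.random.default_rng(0)
x = rng.gamma(shape=3.0, scale=2.0, size=100_000)
print(gamma_mom(x))  # close to (3.0, 2.0)
```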

SLIDE 8

Method of Moments

Drawbacks of the method of moments:
• no general guarantee about the solution
• high-order sample moments are hard to estimate: to reach a specified accuracy, the required sample size and computational cost are exponential in k (or n)!

Question: Can we recover the true θ from only low-order moments?
Question: Can we lower the sample requirement and the computational complexity under some (hopefully mild) assumptions?

SLIDE 9

Learning the Topic Models

Papadimitriou et al. (2000)
• non-overlapping separation condition (strong)

Anandkumar et al. (2012): MoM + SD (spectral decomposition)
• full-rank assumption (weak)
• multinomial mixture, LDA

Arora et al. (2012): MoM + NMF + LP
• anchor words (mild)
• LDA, correlated topic model
• a more practical algorithm proposed in 2013

SLIDE 10

Learning the Topic Models

Suppose there are n documents, k hidden topics, and d features (words).

Topic-word matrix: M = [µ1 | µ2 | … | µk] ∈ R^{d×k}, with µj ∈ Δ^{d−1} ∀j ∈ [k]
Topic proportions: w = (w1, …, wk) ∈ Δ^{k−1}, with P(h = j) = wj, j ∈ [k],
where h is the document's hidden topic.

The v-th word in a document is one-hot encoded, xv ∈ {e1, …, ed}, and
P(xv = ei | h = j) = (µj)i,  j ∈ [k], i ∈ [d]

Goal: recover M using low-order moments.

SLIDE 11

Learning the Topic Models

Construct moment statistics:
Pairs_ij := P(x1 = ei, x2 = ej),   Pairs = E[x1 ⊗ x2] ∈ R^{d×d}
Triples_ijt := P(x1 = ei, x2 = ej, x3 = et),   Triples = E[x1 ⊗ x2 ⊗ x3] ∈ R^{d×d×d}

The empirical plug-ins P̂airs and T̂riples can be obtained from data in a straightforward manner: count word co-occurrences and normalize.

We want to establish an equivalence between these empirical moments and the parameters of interest.
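A minimal sketch of the plug-in step, assuming each document arrives as a list of word indices in {0, …, d−1} and we read off its first three words (the function names and that convention are our choices):

```python
import numpy as np

def empirical_pairs(docs, d):
    """Plug-in estimate of Pairs_ij = P(x1 = e_i, x2 = e_j), using the
    first two words of each document (docs: lists of word indices)."""
    pairs = np.zeros((d, d))
    for doc in docs:
        pairs[doc[0], doc[1]] += 1.0
    return pairs / len(docs)

def empirical_triples_eta(docs, d, eta):
    """Plug-in estimate of the contraction Triples(eta) defined on the
    next slide, i.e. E[<eta, x3> x1 x2^T]; the full d x d x d tensor
    never needs to be formed."""
    t = np.zeros((d, d))
    for doc in docs:
        t[doc[0], doc[1]] += eta[doc[2]]
    return t / len(docs)
```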

SLIDE 12

Learning the Topic Models

Define the contraction
Triples(η) := E[(x1 ⊗ x2) ⟨η, x3⟩] ∈ R^{d×d},  so Triples(η) : R^d → R^{d×d}.

Lemma
Pairs = M diag(w) M⊤
Triples(η) = M diag(M⊤η) diag(w) M⊤

The unknowns M and w are entangled in these moments; the next step is to disentangle them.

SLIDE 13

Learning the Topic Models

Assumption (Non-degeneracy): M has full column rank k.

1. Find U, V ∈ R^{d×k} such that (U⊤M)⁻¹ and (V⊤M)⁻¹ exist.

2. For any η ∈ R^d, define B(η) ∈ R^{k×k} by
   B(η) := (U⊤ Triples(η) V)(U⊤ Pairs V)⁻¹

Lemma (Observable Operator)
B(η) = (U⊤M) diag(M⊤η) (U⊤M)⁻¹
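Why the lemma holds: substitute the factorizations of Pairs and Triples(η) from the previous slide, then cancel the common right factor, using that U⊤M, V⊤M, and diag(w) are all invertible (each wj > 0):

B(η) = (U⊤ Triples(η) V)(U⊤ Pairs V)⁻¹
     = [U⊤M diag(M⊤η) diag(w) M⊤V] [U⊤M diag(w) M⊤V]⁻¹
     = (U⊤M) diag(M⊤η) (U⊤M)⁻¹

So B(η) is similar to a diagonal matrix: its eigenvalues are the entries of M⊤η and its right eigenvectors are the columns of U⊤M, which is exactly what the algorithm on the next slide exploits.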

SLIDE 14

Learning the Topic Models

Input: P̂airs and T̂riples
Output: topic-word distributions M̂

  Û, V̂ ← top-k left and right singular vectors of P̂airs (a)
  η ← random sample from range(Û)
  ξ̂1, ξ̂2, …, ξ̂k ← right eigenvectors of B̂(η) (b)
  for j ← 1 to k do
      µ̂j ← Û ξ̂j / ⟨1, Û ξ̂j⟩
  end
  return M̂ = [µ̂1 | µ̂2 | … | µ̂k]

(a) Pairs = M diag(w) M⊤
(b) B(η) = (U⊤M) diag(M⊤η) (U⊤M)⁻¹
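A compact NumPy sketch of the whole pipeline, run on population moments computed from a synthetic ground truth rather than on P̂airs/T̂riples (all variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 20, 3

# Synthetic ground truth: topic-word matrix M (columns in the simplex)
# and topic proportions w.
M = rng.dirichlet(np.ones(d), size=k).T            # d x k
w = rng.dirichlet(np.ones(k))                      # k

# Population moments (in practice: empirical plug-ins).
Pairs = M @ np.diag(w) @ M.T

# U, V <- top-k left and right singular vectors of Pairs.
U, _, Vt = np.linalg.svd(Pairs)
U, V = U[:, :k], Vt[:k].T

# eta <- random sample from range(U).
eta = U @ rng.standard_normal(k)
Triples_eta = M @ np.diag(M.T @ eta) @ np.diag(w) @ M.T

# Observable operator B(eta) = (U^T Triples(eta) V)(U^T Pairs V)^{-1}.
B = (U.T @ Triples_eta @ V) @ np.linalg.inv(U.T @ Pairs @ V)

# Right eigenvectors of B(eta) are the columns of U^T M up to scale;
# eigenvalues are real since B is similar to a real diagonal matrix.
_, Xi = np.linalg.eig(B)
Xi = np.real(Xi)

# mu_j <- U xi_j / <1, U xi_j>: lift to R^d, normalize to sum to one.
M_hat = U @ Xi
M_hat /= M_hat.sum(axis=0, keepdims=True)

# Columns come back in arbitrary order; align each with its nearest
# ground-truth column before comparing.
perm = [np.argmin(np.linalg.norm(M - M_hat[:, [j]], axis=0)) for j in range(k)]
print(np.max(np.abs(M_hat - M[:, perm])))  # close to 0 up to numerical error
```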

SLIDE 15

Learning the Topic Models

Lemma (Observable Operator)
B(η) = (U⊤M) diag(M⊤η) (U⊤M)⁻¹

We need M⊤η to have distinct entries, so that the eigenvectors of B(η) are uniquely determined. How do we pick η?

• η ← ei ⇒ M⊤η is the i-th word's distribution over topics, but this requires prior knowledge of a suitable word i.
• Otherwise, η ← Uθ with θ ∼ Uniform(S^{k−1}); the entries of M⊤η are then distinct with probability 1.
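The uninformed choice is tiny in NumPy (reusing `U` and `k` from the sketch above):

```python
import numpy as np

rng = np.random.default_rng()
theta = rng.standard_normal(k)    # isotropic Gaussian ...
theta /= np.linalg.norm(theta)    # ... normalized: theta ~ Uniform(S^{k-1})
eta = U @ theta                   # eta lies in range(U)
```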

SLIDE 16

Learning the Topic Models

• SVD is carried out in R^{k×k}, with k ≪ d.
• Only trigram statistics are involved, i.e., low-order moments.
• The method is guaranteed to recover the parameters.
• Parameters of more complicated models such as LDA can be recovered in the same manner.

SLIDE 17

Tensor Decomposition

Recall:
Pairs = M diag(w) M⊤
Triples(η) = M diag(M⊤η) diag(w) M⊤

Equivalently, as sums of rank-one terms:
Pairs = ∑_{j=1}^k wj · µj ⊗ µj
Triples = ∑_{j=1}^k wj · µj ⊗ µj ⊗ µj

Symmetric tensor decomposition? For that, the µj would need to be orthogonal.

SLIDE 18

Tensor Decomposition

Whiten Pairs: let Pairs = U D U⊤ be its (rank-k) eigendecomposition and set
W := U D^{−1/2}  ⇒  W⊤ Pairs W = I

Define µ′j := √wj · W⊤µj. One can check that the µ′j, j ∈ [k], are orthonormal vectors.

Do an orthogonal tensor decomposition on the whitened third moment:
Triples(W, W, W) = ∑_{j=1}^k wj (W⊤µj)^⊗3 = ∑_{j=1}^k (1/√wj) · µ′j^⊗3

Then recover µj from µ′j.
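A sketch of the whitening step, reusing `M`, `w`, `Pairs`, and `k` from the earlier synthetic example; we verify the two claimed identities numerically:

```python
import numpy as np

# Whiten using the top-k eigenpairs of the symmetric, rank-k Pairs.
evals, evecs = np.linalg.eigh(Pairs)            # ascending eigenvalues
U_k, D_k = evecs[:, -k:], evals[-k:]            # top-k eigenpairs
W = U_k @ np.diag(D_k ** -0.5)                  # W = U D^{-1/2}

print(np.allclose(W.T @ Pairs @ W, np.eye(k)))  # True: Pairs is whitened

# mu'_j = sqrt(w_j) W^T mu_j are orthonormal:
M_prime = (W.T @ M) * np.sqrt(w)
print(np.allclose(M_prime.T @ M_prime, np.eye(k)))  # True

# Whitened third moment Triples(W, W, W): a k x k x k tensor with the
# orthogonal decomposition sum_j (1/sqrt(w_j)) mu'_j^{(x)3}.
WM = W.T @ M
T = np.einsum('ia,ja,ka,a->ijk', WM, WM, WM, w)
```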

SLIDE 19

Anchor Words

Drawbacks of the previous algorithms:
• topics cannot be correlated
• the bound is weak (comparatively speaking)
• empirical runtime performance is not satisfactory

Alternative assumptions?

SLIDE 20

Anchor Words

Definition (p-separable)
M is p-separable if for every topic j there exists a word i (an "anchor word") such that Mij ≥ p and Mij′ = 0 for all j′ ≠ j.

Note that a given document does not necessarily contain anchor words.

Two-stage algorithm (a sketch of the selection stage follows below):

1. Selection: find the anchor word for each topic.
2. Recovery: recover M based on the anchor words.

Good theoretical guarantees and empirical results.
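The selection stage can be sketched as a greedy farthest-point search; the following is a simplified version in the spirit of the FastAnchorWords routine of Arora et al. (2013), not a faithful reproduction (`Q`, the initialization, and the function name are our choices):

```python
import numpy as np

def select_anchors(Q, k):
    """Greedy anchor selection: repeatedly take the row of Q farthest
    from the span of the rows chosen so far.

    Q: row-normalized word co-occurrence matrix; row i is word i's
    co-occurrence profile. Under p-separability the anchor rows are
    extreme points among these profiles.
    """
    R = Q.astype(float).copy()
    anchors = [int(np.argmax(np.linalg.norm(R, axis=1)))]
    for _ in range(k - 1):
        # Project every row orthogonally to the direction of the most
        # recently chosen anchor (a Gram-Schmidt step) ...
        u = R[anchors[-1]]
        u = u / np.linalg.norm(u)
        R -= np.outer(R @ u, u)
        # ... then the row with the largest residual norm is the next
        # anchor; already-chosen rows have (near-)zero residual.
        anchors.append(int(np.argmax(np.linalg.norm(R, axis=1))))
    return anchors
```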

SLIDE 21

Anchor Words

[Illustration of the anchor-words geometry omitted; it is taken from Ankur Moitra's slides, http://people.csail.mit.edu/moitra/docs/IASM.pdf]

SLIDE 22

Discussion

Summary:
• a brief introduction to the method of moments (MoM)
• learning topic models by spectral decomposition
• the anchor-words assumption

Connections with our work?