An Introduction to Spectral Learning
Hanxiao Liu
November 8, 2013
Outline
1. Method of Moments
2. Learning topic models using spectral properties
3. Anchor words
Preliminaries
Setting: X1, · · · , Xn ∼ p(x; θ), with parameter vector θ = (θ1, · · · , θm)⊤.
An estimator is a statistic θ̂ = θ̂n = w(X1, · · · , Xn).
Maximum Likelihood Estimator (MLE): θ̂ = argmax_θ log L(θ)
Bayes Estimator (BE): θ̂ = E(θ | X) = ∫ θ p(x | θ) π(θ) dθ / ∫ p(x | θ) π(θ) dθ
Preliminaries
Question: What makes a good estimator?
- The MLE is consistent.
- Both the MLE and the BE are asymptotically normal: √n (θ̂n − θ) → N(0, 1/I(θ)) under mild (regularity) conditions.
- Both can be computationally expensive.
Preliminaries
Example (Gamma distribution)
p(xi; α, θ) = 1/(Γ(α) θ^α) · xi^(α−1) exp(−xi/θ)
L(α, θ) = (1/(Γ(α) θ^α))^n (∏_{i=1}^n xi)^(α−1) exp(−(∑_{i=1}^n xi)/θ)
The MLE is hard to compute due to the presence of Γ(α).
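As a quick illustration (not from the slides), the sketch below maximizes the Gamma log-likelihood numerically with NumPy/SciPy; the Γ(α) term rules out a closed-form solution for α, so an iterative optimizer is needed. The true parameters (α = 3, θ = 2) and all variable names are illustrative.

```python
# A minimal sketch: numerical MLE for the Gamma distribution.
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def gamma_neg_log_likelihood(params, x):
    """Negative log-likelihood of x under Gamma(shape=alpha, scale=theta)."""
    alpha, theta = params
    if alpha <= 0 or theta <= 0:
        return np.inf
    n = len(x)
    return -(np.sum((alpha - 1) * np.log(x) - x / theta)
             - n * gammaln(alpha) - n * alpha * np.log(theta))

rng = np.random.default_rng(0)
x = rng.gamma(shape=3.0, scale=2.0, size=10_000)

# No closed form for alpha, so optimize iteratively.
result = minimize(gamma_neg_log_likelihood, x0=np.array([1.0, 1.0]),
                  args=(x,), method="Nelder-Mead")
print("MLE (alpha, theta):", result.x)   # should be close to (3.0, 2.0)
```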
Method of Moments
j-th theoretical moment, j ∈ [k]: µj(θ) := Eθ[X^j]
j-th sample moment, j ∈ [k]: Mj := (1/n) ∑_{i=1}^n Xi^j
Plug in and solve the multivariate polynomial equations Mj = µj(θ), j ∈ [k]; this can sometimes be recast as a spectral decomposition.
Method of Moments
Example (Gamma distribution)
p(xi; α, θ) = 1/(Γ(α) θ^α) · xi^(α−1) exp(−xi/θ)
First moment: X̄ = E(Xi) = αθ
Second central moment: (1/n) ∑_{i=1}^n (Xi − X̄)^2 = Var(Xi) = αθ^2
⇒ θ̂ = (1/(n X̄)) ∑_{i=1}^n (Xi − X̄)^2,   α̂ = X̄ / θ̂ = n X̄^2 / ∑_{i=1}^n (Xi − X̄)^2
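Here is a minimal NumPy sketch of these moment-matching estimates on synthetic data (the true parameters α = 3, θ = 2 are made up for the example):

```python
# Method of moments for the Gamma distribution:
# match the sample mean and variance to alpha*theta and alpha*theta^2.
import numpy as np

rng = np.random.default_rng(0)
x = rng.gamma(shape=3.0, scale=2.0, size=10_000)

x_bar = x.mean()
s2 = np.mean((x - x_bar) ** 2)      # (1/n) * sum (x_i - x_bar)^2

theta_hat = s2 / x_bar              # = (1/(n*x_bar)) * sum (x_i - x_bar)^2
alpha_hat = x_bar / theta_hat       # = n*x_bar^2 / sum (x_i - x_bar)^2

print("MoM (alpha, theta):", alpha_hat, theta_hat)   # close to (3.0, 2.0)
```

Unlike the MLE, both estimates are available in closed form.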
Method of Moments
Drawbacks of the plain method of moments:
- There is no guarantee about the solution.
- High-order sample moments are hard to estimate: to reach a specified accuracy, the required sample size and computational cost grow exponentially in k (or n)!
Question: Could we recover the true θ from only low-order moments?
Question: Could we lower the sample requirement and computational complexity based on some (hopefully mild) assumptions?
Learning the Topic Models
- Papadimitriou et al. (2000): non-overlapping separation condition (strong).
- Anandkumar et al. (2012), MoM + SD: full-rank assumption (weak); covers the multinomial mixture and LDA.
- Arora et al. (2012), MoM + NMF + LP: anchor words (mild); covers LDA and the Correlated Topic Model; a more practical algorithm was proposed in 2013.
Learning the Topic Models
Suppose there are n documents, k hidden topics, and d features (vocabulary words).
Topic-word matrix: M = [µ1 | µ2 | . . . | µk] ∈ R^{d×k}, with µj ∈ ∆^{d−1} for all j ∈ [k].
Topic proportions: w = (w1, . . . , wk) ∈ ∆^{k−1}, and P(h = j) = wj for j ∈ [k].
For the v-th word in a document, xv ∈ {e1, . . . , ed} and P(xv = ei | h = j) = (µj)_i, for j ∈ [k], i ∈ [d].
Goal: recover M using low-order moments.
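Below is a minimal sketch of this generative process with made-up sizes (d = 50 words, k = 3 topics, 3 words per document); all names are illustrative.

```python
# Single-topic ("mixture of multinomials") generative process:
# draw a topic h ~ w, then draw each word x_v ~ mu_h independently.
import numpy as np

rng = np.random.default_rng(0)
d, k, n_docs, doc_len = 50, 3, 2_000, 3

w = rng.dirichlet(np.ones(k))              # topic proportions, a point in the simplex
M = rng.dirichlet(np.ones(d), size=k).T    # d x k matrix; column j is topic mu_j

def sample_document():
    h = rng.choice(k, p=w)                            # latent topic for the document
    return rng.choice(d, size=doc_len, p=M[:, h])     # word indices x_1, x_2, x_3

docs = np.array([sample_document() for _ in range(n_docs)])
```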
Learning the Topic Models
Construct moment statistics:
Pairs_{ij} := P(x1 = ei, x2 = ej), i.e. Pairs = E[x1 ⊗ x2] ∈ R^{d×d}
Triples_{ijt} := P(x1 = ei, x2 = ej, x3 = et), i.e. Triples = E[x1 ⊗ x2 ⊗ x3] ∈ R^{d×d×d}
Empirical plug-in estimates of Pairs and Triples can be obtained from the data in a straightforward manner.
We want to establish an equivalence between these empirical moments and the parameters of interest.
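For concreteness, a minimal sketch of the plug-in estimates, reusing docs and d from the previous sketch (each document contributes one (x1, x2, x3) triple):

```python
# Empirical moment estimates from word triples.
import numpy as np

def empirical_pairs(docs, d):
    """Pairs[i, j] ~ P(x1 = e_i, x2 = e_j), averaged over documents."""
    P = np.zeros((d, d))
    for x1, x2, _ in docs:
        P[x1, x2] += 1.0
    return P / len(docs)

def empirical_triples_eta(docs, d, eta):
    """Triples(eta)[i, j] ~ E[ <eta, x3> * 1{x1 = e_i, x2 = e_j} ]."""
    T = np.zeros((d, d))
    for x1, x2, x3 in docs:
        T[x1, x2] += eta[x3]
    return T / len(docs)
```

Only the contraction Triples(η), defined just below, is materialized here: the full d×d×d tensor is never needed by the algorithm that follows.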
Learning the Topic Models
For η ∈ R^d, define the contracted third moment
Triples(η) := E[(x1 ⊗ x2) ⟨η, x3⟩] ∈ R^{d×d},   so Triples(·): R^d → R^{d×d}.
Lemma
  Pairs = M diag(w) M⊤
  Triples(η) = M diag(M⊤η) diag(w) M⊤
The unknown M and w are entangled in these moments.
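For intuition, the first identity follows in one line from the conditional independence of x1 and x2 given the topic h (the Triples(η) identity is derived the same way, with the extra scalar factor ⟨η, x3⟩):
Pairs = E[x1 ⊗ x2] = ∑_{j=1}^k P(h = j) · E[x1 | h = j] ⊗ E[x2 | h = j] = ∑_{j=1}^k wj · µj ⊗ µj = M diag(w) M⊤.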
Learning the Topic Models
Assumption (Non-degeneracy): M has full column rank k.
1. Find U, V ∈ R^{d×k} such that (U⊤M)^{−1} and (V⊤M)^{−1} exist.
2. For any η ∈ R^d, define B(η) ∈ R^{k×k} by
   B(η) := (U⊤ Triples(η) V)(U⊤ Pairs V)^{−1}
Lemma (Observable Operator)
   B(η) = (U⊤M) diag(M⊤η) (U⊤M)^{−1}
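To see why the lemma holds, substitute the two moment identities and cancel the common invertible factors diag(w) and M⊤V:
U⊤ Triples(η) V = (U⊤M) diag(M⊤η) diag(w) (M⊤V) and U⊤ Pairs V = (U⊤M) diag(w) (M⊤V), hence
B(η) = (U⊤M) diag(M⊤η) diag(w) (M⊤V) · [(U⊤M) diag(w) (M⊤V)]^{−1} = (U⊤M) diag(M⊤η) (U⊤M)^{−1}.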
Learning the Topic Models
Input: empirical estimates of Pairs and Triples
Output: topic-word distributions M̂
1. Û, V̂ ← top-k left and right eigenvectors of the empirical Pairs   (recall Pairs = M diag(w) M⊤)
2. η ← random sample from range(Û)
3. (ξ̂1, ξ̂2, . . . , ξ̂k) ← right eigenvectors of B(η) formed from the empirical moments   (recall B(η) = (U⊤M) diag(M⊤η) (U⊤M)^{−1})
4. for j ← 1 to k: µ̂j ← Û ξ̂j / ⟨1, Û ξ̂j⟩
5. return M̂ = [µ̂1 | µ̂2 | . . . | µ̂k]
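A minimal NumPy sketch of this procedure, reusing docs, d, k and the moment helpers from the earlier sketches; with finite samples the eigenvectors may carry small imaginary parts and arbitrary signs, so this is illustrative rather than robust:

```python
# Spectral recovery of the topic-word matrix from Pairs and Triples(eta).
import numpy as np

Pairs_hat = empirical_pairs(docs, d)

# Step 1: top-k left/right eigenvectors (via SVD; Pairs is symmetric PSD).
U_full, _, Vt_full = np.linalg.svd(Pairs_hat)
U_hat, V_hat = U_full[:, :k], Vt_full[:k, :].T

# Step 2: a random direction in range(U_hat).
rng = np.random.default_rng(1)
eta = U_hat @ rng.standard_normal(k)

# Step 3: form B(eta) and take its right eigenvectors.
Triples_eta = empirical_triples_eta(docs, d, eta)
B = (U_hat.T @ Triples_eta @ V_hat) @ np.linalg.inv(U_hat.T @ Pairs_hat @ V_hat)
_, Xi = np.linalg.eig(B)                 # columns are the right eigenvectors xi_j

# Step 4: map back to word space and normalize each column to sum to 1.
M_hat = np.real(U_hat @ Xi)
M_hat = M_hat / M_hat.sum(axis=0, keepdims=True)
print(M_hat.shape)                       # (d, k): estimated columns mu_j, up to permutation
```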
Learning the Topic Models
Lemma (Observable Operator)
   B(η) = (U⊤M) diag(M⊤η) (U⊤M)^{−1}
We hope M⊤η has distinct entries; how should η be picked?
- η ← ei ⇒ M⊤η is the i-th word's distribution over topics, but this requires prior knowledge.
- Otherwise, η ← Uθ with θ ∼ Uniform(S^{k−1}).
Learning the Topic Models
- SVD is carried out on R^{k×k}, with k ≪ d.
- Only trigram statistics, i.e. low-order moments, are involved.
- The parameters are guaranteed to be recovered.
- Parameters of more complicated models such as LDA can be recovered in the same manner.
Tensor Decomposition
Recall:
  Pairs = M diag(w) M⊤,   Triples(η) = M diag(M⊤η) diag(w) M⊤
Equivalently,
  Pairs = ∑_{j=1}^k wj · µj ⊗ µj,   Triples = ∑_{j=1}^k wj · µj ⊗ µj ⊗ µj
Symmetric tensor decomposition? The µj would need to be orthogonal.
Tensor Decomposition
Whiten Pairs: write Pairs = U D U⊤ (top-k eigenpairs) and set W := U D^{−1/2}, so that W⊤ Pairs W = I.
Define µ′j := √wj · W⊤µj. One can check that the µ′j, j ∈ [k], are orthonormal vectors.
Perform an orthogonal tensor decomposition of
  Triples(W, W, W) = ∑_{j=1}^k wj (W⊤µj)^⊗3 = ∑_{j=1}^k (1/√wj) (µ′j)^⊗3,
then recover each µj from µ′j.
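The sketch below follows this whitening-plus-tensor-power-iteration route, assuming (population or well-estimated) Pairs ∈ R^{d×d} and Triples ∈ R^{d×d×d} arrays are available; the helper names and the fixed iteration count are my own choices, not from the slides:

```python
# Whitening + orthogonal tensor decomposition (tensor power method with deflation).
import numpy as np

def whiten(Pairs, k):
    """W = U D^{-1/2} from the top-k eigenpairs of Pairs (assumed positive)."""
    vals, vecs = np.linalg.eigh(Pairs)
    idx = np.argsort(vals)[::-1][:k]
    U, D = vecs[:, idx], vals[idx]
    return U @ np.diag(1.0 / np.sqrt(D))        # W^T Pairs W = I_k

def decompose(Triples, W, k, n_iter=200):
    """Eigenpairs of the whitened tensor T = Triples(W, W, W) via power iteration."""
    T = np.einsum('abc,ai,bj,ct->ijt', Triples, W, W, W)   # k x k x k tensor
    lams, vecs = [], []
    rng = np.random.default_rng(0)
    for _ in range(k):
        v = rng.standard_normal(k)
        v /= np.linalg.norm(v)
        for _ in range(n_iter):                            # v <- T(I, v, v), normalized
            v = np.einsum('ijt,j,t->i', T, v, v)
            v /= np.linalg.norm(v)
        lam = np.einsum('ijt,i,j,t->', T, v, v, v)          # ~ 1/sqrt(w_j)
        lams.append(lam)
        vecs.append(v)
        T = T - lam * np.einsum('i,j,t->ijt', v, v, v)      # deflate
    return np.array(lams), np.array(vecs)

def recover_topics(Pairs, Triples, k):
    W = whiten(Pairs, k)
    lams, V = decompose(Triples, W, k)
    # mu_j = lam_j * pinv(W^T) v_j; then renormalize columns onto the simplex.
    M_hat = np.linalg.pinv(W.T) @ (V.T * lams)
    return M_hat / M_hat.sum(axis=0, keepdims=True)
```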
Anchor Words
Drawbacks of the previous algorithms:
- Topics cannot be correlated.
- The bound is weak (comparatively speaking).
- Empirical runtime performance is not satisfactory.
Alternative assumptions?
Anchor Words
Definition (p-separable)
M is p-separable if for every topic j there exists an anchor word i such that Mij ≥ p and Mij′ = 0 for all j′ ≠ j.
Individual documents do not necessarily contain anchor words.
Two-stage algorithm (a sketch of the selection step follows below):
1. Selection: find an anchor word for each topic.
2. Recovery: recover M based on the anchor words.
Good theoretical guarantees and empirical results.
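One way to implement the selection step, sketched below in the spirit of the greedy geometric selection from the 2013 practical algorithm: repeatedly pick the row of a row-normalized word co-occurrence matrix Q̄ that is farthest from the span of the rows already chosen. Q̄ is an assumed input and the recovery step is omitted, so treat this as illustrative rather than as the authors' exact procedure.

```python
# Greedy anchor-word selection on a row-normalized co-occurrence matrix.
import numpy as np

def find_anchor_rows(Q_bar, k):
    """Q_bar: (d, d) row-normalized co-occurrence matrix; returns k candidate anchor rows."""
    R = np.array(Q_bar, dtype=float)
    anchors = []
    i = int(np.argmax(np.linalg.norm(R, axis=1)))    # start with the row farthest from 0
    for _ in range(k):
        anchors.append(i)
        b = R[i] / np.linalg.norm(R[i])              # new orthonormal direction
        R = R - np.outer(R @ b, b)                   # project every row off this direction
        i = int(np.argmax(np.linalg.norm(R, axis=1)))
    return anchors
```

Under p-separability, the rows of Q̄ lie in the convex hull of the anchor rows, which is why this farthest-point strategy tends to find the anchors.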
Anchor Words
[Geometric illustration of anchor words omitted; the figure is taken from Ankur Moitra's slides, http://people.csail.mit.edu/moitra/docs/IASM.pdf]