Lecture 6: (Probabilistic) Latent Semantic Analysis


SLIDE 1

CS598JHM: Advanced NLP (Spring 2013)

http://courses.engr.illinois.edu/cs598jhm/

Julia Hockenmaier

juliahmr@illinois.edu
3324 Siebel Center
Office hours: by appointment

Lecture 6: (Probabilistic) Latent Semantic Analysis

SLIDE 2

Indexing by Latent Semantic Analysis

(Deerwester et al., 1990)


SLIDE 3

Latent Semantic Analysis


The task: return relevant documents for text queries.

The problem: relevance is conceptual/semantic.

  • The index of relevant documents may not contain all query terms (synonymy and missing information).
  • The query terms may be ambiguous (polysemy).

Indexing by Latent Semantic Analysis:

  • Map queries and documents into a new vector space whose k dimensions correspond to independent concepts.
  • In this space, queries will be near semantically close documents.

SLIDE 4

[Figure: documents, terms, and a query plotted in a two-dimensional latent space (Dimension 1 vs. Dimension 2); the region closest to the query (e.g. cosine > .9) contains the retrieved documents.]

SLIDE 5

Latent Semantic Analysis


Low-rank approximation via Singular Value Decomposition (SVD):

X ≈ T0 × S0 × D0′ = X̂

X: term-document matrix (= the data): Xij = frequency of wi in Dj (rows = terms, columns = documents)
X̂ = T0 S0 D0′: rank-k approximation of X
T0: columns (concepts) are orthogonal and unit-length: T0′T0 = I (rows = terms)
S0: diagonal matrix of the k largest singular values
D0: columns (concepts) are orthogonal and unit-length: D0′D0 = I (rows = documents)
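A minimal numpy sketch of this truncated SVD; the toy count matrix and the choice k = 2 are made-up examples, not from the slides:

```python
import numpy as np

def lsa(X, k):
    """Rank-k SVD factors: X ≈ T0 @ np.diag(S0) @ D0.T."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    T0 = U[:, :k]        # terms x concepts, orthonormal columns
    S0 = s[:k]           # k largest singular values
    D0 = Vt[:k, :].T     # documents x concepts, orthonormal columns
    return T0, S0, D0

# Toy 5-term x 4-document count matrix, reduced to k = 2 concepts.
X = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 0., 1.],
              [0., 0., 3., 1.],
              [1., 0., 0., 2.]])
T0, S0, D0 = lsa(X, k=2)
X_hat = T0 @ np.diag(S0) @ D0.T   # rank-2 approximation of X
```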

SLIDE 6

LSA: term similarity

Term wi corresponds to row i of T0.

X̂ X̂′ = T0 S0 S0 T0′   (D0 cancels out because S0 is diagonal and D0 is orthonormal)

Similarity of terms wi, wj in the new space: (X̂ X̂′)ij, the dot product of rows i and j of T0 S0.
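A sketch reusing the hypothetical T0, S0 factors from the SVD example above; term-term dot products come directly from the rows of T0 S0:

```python
import numpy as np

def term_similarities(T0, S0):
    """X_hat @ X_hat.T = T0 S0^2 T0': term-term dot products in the latent space."""
    W = T0 * S0        # row i of T0 S0 represents term w_i
    return W @ W.T     # entry (i, j) = similarity of terms w_i and w_j
```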

SLIDE 7

LSA: document similarity

Document Dj corresponds to row j of D0.

X̂′ X̂ = D0 S0 S0 D0′   (T0 cancels out because S0 is diagonal and T0 is orthonormal)

Similarity of documents di, dj in the new space: (X̂′ X̂)ij, the dot product of rows i and j of D0 S0.
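Analogously, a sketch with the same hypothetical factors; document-document dot products come from the rows of D0 S0:

```python
import numpy as np

def document_similarities(D0, S0):
    """X_hat.T @ X_hat = D0 S0^2 D0': document-document dot products in the latent space."""
    D = D0 * S0        # row j of D0 S0 represents document d_j
    return D @ D.T     # entry (i, j) = similarity of documents d_i and d_j
```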

SLIDE 8

LSA: term-document similarity

The elements of X̂ give the similarity of terms and documents. Here, terms are projected to T0 S0^(1/2) and documents to D0 S0^(1/2).
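A short sketch of that projection, using the same hypothetical factors as in the examples above:

```python
import numpy as np

def term_doc_similarities(T0, S0, D0):
    """Term-document similarities: the entries of X_hat = T0 S0 D0'."""
    Wt = T0 * np.sqrt(S0)   # term coordinates  T0 S0^(1/2)
    Wd = D0 * np.sqrt(S0)   # document coordinates  D0 S0^(1/2)
    return Wt @ Wd.T        # equals T0 S0 D0' = X_hat
```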


SLIDE 9

LSA: query-document similarity

Queries q are ‘pseudo-documents’: they do not appear in X.

  • Construct their term vector Xq.
  • Define their document vector Dq = Xq′ T0 S0^(-1).
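A sketch of this folding-in step with the same hypothetical factors; ranking candidate documents by cosine similarity in the D0 S0 space is one common choice, added here for illustration:

```python
import numpy as np

def fold_in_query(x_q, T0, S0):
    """Map a query's term-count vector (length = vocabulary size) to a k-dim pseudo-document."""
    return (x_q @ T0) / S0                  # D_q = X_q' T0 S0^(-1)

def rank_documents(x_q, T0, S0, D0):
    """Rank documents by cosine similarity to the folded-in query in the D0 S0 space."""
    d_q = fold_in_query(x_q, T0, S0) * S0   # query coordinates, scaled like D0 S0
    docs = D0 * S0                          # document coordinates
    sims = docs @ d_q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(d_q) + 1e-12)
    return np.argsort(-sims)                # document indices, best match first
```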


SLIDE 10

Probabilistic Latent Semantic Indexing

(Hofmann 1999)


SLIDE 11

The aspect model


Observations are document-word pairs (d, w).
Assume there are k aspects z1...zk.
Each observation is associated with a hidden aspect z.

P(d, w) = P(d) P(w | d)   with   P(w | d) = ∑z∈Z P(w | z) P(z | d)

Or, equivalently:

P(d, w) = ∑z∈Z P(z) P(d | z) P(w | z)
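A small sketch of the second parameterization; the array names and shapes (p_z of size K, p_d_given_z of size K x D, p_w_given_z of size K x V) are assumptions for illustration:

```python
import numpy as np

def joint_prob(d, w, p_z, p_d_given_z, p_w_given_z):
    """P(d, w) = sum_z P(z) P(d | z) P(w | z)."""
    return np.sum(p_z * p_d_given_z[:, d] * p_w_given_z[:, w])
```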

SLIDE 12

A geometric interpretation

[Figure: the probability simplex over three words w1, w2, w3; each corner assigns probability 1.0 to one word.]

Documents P(w | d): each document corresponds to one multinomial over words.

Topics P(w | z): each topic is a multinomial over words.

Word simplex: any point in this simplex defines a multinomial over words.

Topic simplex: the topics define the corners of a (sub)simplex. All training documents lie inside this topic simplex.

P(w | d) = λ1 P(w | z1) + λ2 P(w | z2) + λ3 P(w | z3)
         = P(z1 | d) P(w | z1) + P(z2 | d) P(w | z2) + P(z3 | d) P(w | z3)
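A toy numeric sketch of this convex combination; the topic distributions and the mixing weights are made up:

```python
import numpy as np

# Hypothetical topic distributions over words w1, w2, w3 (rows sum to 1).
p_w_given_z = np.array([[0.7, 0.2, 0.1],   # topic z1
                        [0.1, 0.8, 0.1],   # topic z2
                        [0.2, 0.2, 0.6]])  # topic z3
p_z_given_d = np.array([0.5, 0.3, 0.2])    # mixing weights P(z | d) for one document

# P(w | d) is a convex combination of the topic rows: a point inside the topic simplex.
p_w_given_d = p_z_given_d @ p_w_given_z    # -> array([0.42, 0.38, 0.20])
```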

SLIDE 13

PLSA is a mixture model

Mixture models:

  • K mixture components and N observations x1...xN
  • Mixing weights (θ1...θK): P(k) = θk
  • Each observation xn is generated by mixture component zn:
    P(xn) = P(zn) P(xn | zn)

PLSI:

  • Mixture components = topics
  • Mixing weights are specific to each document: θd = (θd1...θdK)
  • Each observation (word) wd,n is a sample from the document-specific mixture model; it is drawn from one of the components zd,n:
    P(wd,n) = P(zd,n | θd) P(wd,n | zd,n)
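A sketch of this generative process for one word, with hypothetical arrays theta_d (the document's mixing weights) and p_w_given_z (K x V topic distributions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_word(theta_d, p_w_given_z):
    """Generate one (component, word) pair from a document-specific mixture."""
    K, V = p_w_given_z.shape
    z = rng.choice(K, p=theta_d)           # component z_{d,n} ~ P(z | theta_d)
    w = rng.choice(V, p=p_w_given_z[z])    # word w_{d,n} ~ P(w | z_{d,n})
    return z, w
```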


SLIDE 14

Estimation: EM algorithm

E-step: Recompute

P(z | d, w) = P(z, d, w) / ∑z′ P(z′, d, w)
with P(z, d, w) = P(z) P(d | z) P(w | z)

M-step: Recompute

P(w | z) ∝ ∑d freq(d, w) P(z | d, w)
P(d | z) ∝ ∑w freq(d, w) P(z | d, w)
P(z) ∝ ∑d ∑w freq(d, w) P(z | d, w)
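A compact numpy sketch of one EM iteration implementing these updates; the count matrix freq (D x V) and the parameter array shapes are assumptions for illustration:

```python
import numpy as np

def em_step(freq, p_z, p_d_given_z, p_w_given_z):
    """One EM iteration for PLSA. freq is a D x V count matrix; there are K aspects."""
    # E-step: P(z | d, w) ∝ P(z) P(d | z) P(w | z)
    post = p_z[:, None, None] * p_d_given_z[:, :, None] * p_w_given_z[:, None, :]  # K x D x V
    post /= post.sum(axis=0, keepdims=True) + 1e-12
    # M-step: re-estimate parameters from expected counts freq(d, w) * P(z | d, w)
    counts = freq[None, :, :] * post                      # K x D x V
    p_w_given_z = counts.sum(axis=1)                      # sum over documents -> K x V
    p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)
    p_d_given_z = counts.sum(axis=2)                      # sum over words -> K x D
    p_d_given_z /= p_d_given_z.sum(axis=1, keepdims=True)
    p_z = counts.sum(axis=(1, 2))                         # sum over (d, w) -> K
    p_z /= p_z.sum()
    return p_z, p_d_given_z, p_w_given_z
```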
