

SLIDE 1

Models for Retrieval

Berlin Chen 2003

References:

  • 1. Berlin Chen et al., “An HMM/N-gram-based Linguistic Processing Approach for Mandarin Spoken Document Retrieval,” EUROSPEECH 2001

  • 2. M. W. Berry et al., “Using Linear Algebra for Intelligent Information Retrieval,” technical report, 1994
  • 3. Thomas Hofmann, “Unsupervised Learning by Probabilistic Latent Semantic Analysis,” Machine Learning, 2001
Outline:

  • 1. HMM/N-gram-based
  • 2. Latent Semantic Indexing (LSI)
  • 3. Probabilistic Latent Semantic Analysis (PLSA)

SLIDE 2

HMM/N-gram-based Model

  • Model the query as a sequence of input observations (index terms), $Q = q_1 q_2 \ldots q_N$
  • Model the doc $D$ as a discrete HMM composed of distributions of N-gram parameters
  • The relevance measure, $P(Q \mid D \text{ is } R)$, can be estimated by the N-gram probabilities of the index term sequence for the query, $Q = q_1 q_2 \ldots q_N$, predicted by the doc $D$

– A generative model for IR

$$D^* = \arg\max_{D} P(D \text{ is } R \mid Q) = \arg\max_{D} \frac{P(Q \mid D \text{ is } R)\, P(D \text{ is } R)}{P(Q)} \approx \arg\max_{D} P(Q \mid D \text{ is } R)$$

with the assumption that the prior $P(D \text{ is } R)$ is roughly the same for all docs

SLIDE 3

HMM/N-gram-based Model

  • N-gram approximation (Language Model)

– Unigram – Bigram – Trigram – ……..

Chain rule:

$$P(W) = P(w_1 w_2 \ldots w_N) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2) \cdots P(w_N \mid w_1 w_2 \ldots w_{N-1})$$

Unigram:

$$P(W) \approx P(w_1)\, P(w_2)\, P(w_3) \cdots P(w_N)$$

Bigram:

$$P(W) \approx P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2) \cdots P(w_N \mid w_{N-1})$$

Trigram:

$$P(W) \approx P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2) \cdots P(w_N \mid w_{N-2} w_{N-1})$$

SLIDE 4

HMM/N-gram-based Model

  • A discrete HMM composed of distributions of N-gram parameters

The doc HMM mixes four N-gram distributions, with mixture weights satisfying $m_1 + m_2 + m_3 + m_4 = 1$:

– $P(q_n \mid D)$ (doc unigram), weight $m_1$
– $P(q_n \mid \text{Corpus})$ (corpus unigram), weight $m_2$
– $P(q_n \mid q_{n-1}, D)$ (doc bigram), weight $m_3$
– $P(q_n \mid q_{n-1}, \text{Corpus})$ (corpus bigram), weight $m_4$

For a query $Q = q_1 q_2 \ldots q_N$:

$$P(Q \mid D \text{ is } R) = \big[m_1 P(q_1 \mid D) + m_2 P(q_1 \mid \text{Corpus})\big] \cdot \prod_{n=2}^{N} \big[m_1 P(q_n \mid D) + m_2 P(q_n \mid \text{Corpus}) + m_3 P(q_n \mid q_{n-1}, D) + m_4 P(q_n \mid q_{n-1}, \text{Corpus})\big]$$

SLIDE 5

HMM/N-gram-based Model

  • Three Types of HMM Structures

– Type I: Unigram-Based (Uni)
– Type II: Unigram/Bigram-Based (Uni+Bi)
– Type III: Unigram/Bigram/Corpus-Based (Uni+Bi*)

Type I (Uni):

$$P(Q \mid D \text{ is } R) = \prod_{n=1}^{N} \big[m_1 P(q_n \mid D) + m_2 P(q_n \mid \text{Corpus})\big]$$

Type II (Uni+Bi):

$$P(Q \mid D \text{ is } R) = \big[m_1 P(q_1 \mid D) + m_2 P(q_1 \mid \text{Corpus})\big] \cdot \prod_{n=2}^{N} \big[m_1 P(q_n \mid D) + m_2 P(q_n \mid \text{Corpus}) + m_3 P(q_n \mid q_{n-1}, D)\big]$$

Type III (Uni+Bi*):

$$P(Q \mid D \text{ is } R) = \big[m_1 P(q_1 \mid D) + m_2 P(q_1 \mid \text{Corpus})\big] \cdot \prod_{n=2}^{N} \big[m_1 P(q_n \mid D) + m_2 P(q_n \mid \text{Corpus}) + m_3 P(q_n \mid q_{n-1}, D) + m_4 P(q_n \mid q_{n-1}, \text{Corpus})\big]$$

Example (Type III), for the query 陳水扁 總統 視察 阿里山 小火車 (Chen Shui-bian / president / inspects / Alishan / train):

P(陳水扁 總統 視察 阿里山 小火車 | D)
  = [m1 P(陳水扁|D) + m2 P(陳水扁|C)]
  × [m1 P(總統|D) + m2 P(總統|C) + m3 P(總統|陳水扁,D) + m4 P(總統|陳水扁,C)]
  × [m1 P(視察|D) + m2 P(視察|C) + m3 P(視察|總統,D) + m4 P(視察|總統,C)]
  × ……….
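
A minimal sketch of the Type III (Uni+Bi*) scoring formula above, assuming the doc and corpus unigram/bigram probabilities are available as plain Python dicts (all names here are illustrative, not from the original system):

```python
def hmm_ngram_score(query, doc_lm, corpus_lm, m):
    """Score P(Q | D is R) under the Type III (Uni+Bi*) mixture.

    query     : list of index terms [q1, ..., qN]
    doc_lm    : {"uni": {q: P(q|D)},      "bi": {(q_prev, q): P(q|q_prev, D)}}
    corpus_lm : {"uni": {q: P(q|Corpus)}, "bi": {(q_prev, q): P(q|q_prev, Corpus)}}
    m         : mixture weights (m1, m2, m3, m4), summing to 1
    """
    m1, m2, m3, m4 = m
    # First query term: only the unigram distributions apply.
    score = m1 * doc_lm["uni"].get(query[0], 0.0) + m2 * corpus_lm["uni"].get(query[0], 0.0)
    # Remaining terms: interpolate doc/corpus unigrams and bigrams.
    for prev, q in zip(query, query[1:]):
        score *= (m1 * doc_lm["uni"].get(q, 0.0)
                  + m2 * corpus_lm["uni"].get(q, 0.0)
                  + m3 * doc_lm["bi"].get((prev, q), 0.0)
                  + m4 * corpus_lm["bi"].get((prev, q), 0.0))
    return score
```

Documents would then be ranked by this score for a given query; in practice the product is usually accumulated in log space to avoid underflow.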

SLIDE 6

HMM/N-gram-based Model

  • The role of the corpus N-gram probabilities

– Model the general distribution of the index terms

  • Help to solve the zero-frequency problem
  • Help to differentiate the contributions of different missing terms in a doc

– The corpus N-gram probabilities were estimated using an outside corpus

Example: for a doc $D$ = “qa qa qa qb qa qb qb qc qc qd”, the MLE doc unigrams are $P(q_a \mid D) = 0.4$, $P(q_b \mid D) = 0.3$, $P(q_c \mid D) = 0.2$, $P(q_d \mid D) = 0.1$, while $P(q_e \mid D) = P(q_f \mid D) = 0$; the corpus models $P(q_n \mid \text{Corpus})$ and $P(q_n \mid q_{n-1}, \text{Corpus})$ supply nonzero probabilities for the missing terms $q_e$ and $q_f$.

SLIDE 7

HMM/N-gram-based Model

  • Estimation of N-grams (Language Models)

– Maximum likelihood estimation (MLE) for doc N-grams

  • Unigram
  • Bigram

– Similar formulas for corpus N-grams

Doc unigram:

$$P(q_i \mid D) = \frac{C_D(q_i)}{\sum_{q_j} C_D(q_j)} = \frac{C_D(q_i)}{|D|}$$

where $C_D(q_i)$ is the count of term $q_i$ in the doc $D$ and $|D|$ is the length of the doc $D$.

Doc bigram:

$$P(q_i \mid q_j, D) = \frac{C_D(q_j, q_i)}{C_D(q_j)}$$

where $C_D(q_j, q_i)$ is the count of the term pair $(q_j, q_i)$ in the doc $D$.

Corpus N-grams (Corpus: an outside corpus or just the doc collection):

$$P(q_i \mid \text{Corpus}) = \frac{C_{\text{Corpus}}(q_i)}{|\text{Corpus}|}, \qquad P(q_i \mid q_j, \text{Corpus}) = \frac{C_{\text{Corpus}}(q_j, q_i)}{C_{\text{Corpus}}(q_j)}$$
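
A minimal sketch of these MLE estimates, assuming a doc is simply a list of index terms (names are illustrative):

```python
from collections import Counter

def estimate_doc_lm(doc_terms):
    """MLE unigram/bigram language model for one doc (a list of index terms)."""
    uni_counts = Counter(doc_terms)
    bi_counts = Counter(zip(doc_terms, doc_terms[1:]))
    total = len(doc_terms)
    uni = {q: c / total for q, c in uni_counts.items()}                  # P(q|D) = C_D(q) / |D|
    bi = {(p, q): c / uni_counts[p] for (p, q), c in bi_counts.items()}  # P(q|p,D) = C_D(p,q) / C_D(p)
    return {"uni": uni, "bi": bi}

# The corpus model is estimated the same way from a large outside corpus
# (or from the whole doc collection), and both feed the scoring sketch above.
```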

SLIDE 8

HMM/N-gram-based Model

  • Basically, m1, m2, m3, m4 can be estimated by using the Expectation-Maximization (EM) algorithm
– All docs share the same weights here
– The N-gram probability distributions also can be estimated using the EM algorithm instead of maximum likelihood estimation

  • For those docs with training queries, m1, m2, m3, m4 can be estimated by using the Minimum Classification Error (MCE) training algorithm
– The docs can have different weights
(because of the insufficiency of training data)

SLIDE 9

HMM/N-gram-based Model

  • Expectation-Maximization Training

– The weights are tied among the documents
– E.g. m1 of the Type I HMM can be trained using the following equation:

  • Where $\text{TrainSet}_Q$ is the set of training query exemplars, $\text{Doc}_{R,Q}$ is the set of docs that are relevant to a specific training query exemplar $Q$, $|Q|$ is the length of the query $Q$, and $|\text{Doc}_{R,Q}|$ is the total number of docs relevant to the query $Q$

$$\hat{m}_1^{\,new} = \frac{1}{\sum_{Q \in \text{TrainSet}_Q} |\text{Doc}_{R,Q}| \cdot |Q|} \; \sum_{Q \in \text{TrainSet}_Q} \; \sum_{D \in \text{Doc}_{R,Q}} \; \sum_{q_n \in Q} \frac{\hat{m}_1\, P(q_n \mid D)}{\hat{m}_1\, P(q_n \mid D) + \hat{m}_2\, P(q_n \mid \text{Corpus})}$$

($\hat{m}_1$, $\hat{m}_2$ on the right-hand side are the old weights; $\hat{m}_1^{\,new}$ is the new weight. Training set: 819 queries, ≦2265 docs)
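
A minimal sketch of this tied-weight EM update for the Type I mixture, assuming training queries with known relevant docs and the dict-based language models from the earlier sketches (all names illustrative):

```python
def em_update_m1(train_set, doc_lms, corpus_lm, m1, m2):
    """One EM re-estimation of the tied weight m1 for the Type I (Uni) HMM.

    train_set : list of (query_terms, relevant_doc_ids) pairs
    doc_lms   : {doc_id: {"uni": {q: P(q|D)}}}
    corpus_lm : {"uni": {q: P(q|Corpus)}}
    """
    num, denom = 0.0, 0.0
    for query, rel_docs in train_set:
        for doc_id in rel_docs:
            for q in query:
                p_doc = m1 * doc_lms[doc_id]["uni"].get(q, 0.0)
                p_bg = m2 * corpus_lm["uni"].get(q, 0.0)
                if p_doc + p_bg > 0:
                    num += p_doc / (p_doc + p_bg)   # posterior of the doc-unigram state
                denom += 1.0                        # one observation per (Q, D, q_n)
    m1_new = num / denom
    return m1_new, 1.0 - m1_new                     # m2 = 1 - m1 for the Type I HMM
```

Iterating this update until convergence yields the tied weights shared by all docs.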

SLIDE 10

HMM/N-gram-based Model

  • Expectation-Maximization Training

– Empirical derivation: the re-estimated (new) model $\hat{D}$ does not decrease the query likelihood of the old model $D$

$$\log P(Q \mid \hat{D}) - \log P(Q \mid D) = \sum_{q_n \in Q} \log \frac{P(q_n \mid \hat{D})}{P(q_n \mid D)} \;\ge\; \sum_{q_n \in Q} \sum_{k} P(k \mid q_n, D)\, \log \frac{P(q_n, k \mid \hat{D})}{P(q_n, k \mid D)}$$

by Jensen’s inequality (equivalently, $\log x \le x - 1$), where $P(k \mid q_n, D)$ is the posterior of mixture component $k$ under the old model. Hence, if the new model improves the auxiliary function, $\Phi(D, \hat{D}) \ge \Phi(D, D)$, the right-hand side is nonnegative and therefore

$$\log P(Q \mid \hat{D}) \ge \log P(Q \mid D)$$

SLIDE 11

HMM/N-gram-based Model

  • Expectation-Maximization Training

– Q function (auxiliary function), maximized with respect to the new weights $\hat{m}_k$ while the component distributions $P(q_n \mid k, D)$ stay fixed:

$$\Phi(D, \hat{D}) = \sum_{q_n \in Q} \sum_{k} P(k \mid q_n, D)\, \log\big[\hat{m}_k\, P(q_n \mid k, D)\big], \qquad P(k \mid q_n, D) = \frac{m_k\, P(q_n \mid k, D)}{\sum_j m_j\, P(q_n \mid j, D)}$$

– Add the normalization constraint $\sum_k \hat{m}_k = 1$ with a Lagrange multiplier $l$:

$$\Phi'(D, \hat{D}) = \Phi(D, \hat{D}) + l\Big(1 - \sum_k \hat{m}_k\Big), \qquad \frac{\partial \Phi'(D, \hat{D})}{\partial \hat{m}_k} = 0 \;\Rightarrow\; \hat{m}_k = \frac{G_k}{l}, \quad G_k = \sum_{q_n \in Q} \frac{m_k\, P(q_n \mid k, D)}{\sum_j m_j\, P(q_n \mid j, D)}$$

– Summing over $k$ gives $l = \sum_s G_s = |Q|$, so

$$\hat{m}_k = \frac{G_k}{|Q|} = \frac{1}{|Q|} \sum_{q_n \in Q} \frac{m_k\, P(q_n \mid k, D)}{\sum_j m_j\, P(q_n \mid j, D)}$$

($P(k \mid q_n, D)$ plays the role of the empirical (posterior) distribution under the old model; the normalization constraint is handled with the Lagrange multiplier $l$)

SLIDE 12

HMM/N-gram-based Model

  • Experimental results with EM training

– HMM/N-gram-based approach
– Vector space model
– The HMM/N-gram-based approach is consistently better than the vector space model

Average Precision (HMM/N-gram-based approach):

               Word-level                      Syllable-level
               Uni      Uni+Bi   Uni+Bi*       Uni      Uni+Bi   Uni+Bi*
TDT2  TQ/TD    0.6327   0.6069   0.5427        0.4698   0.5220   0.5718
TDT2  TQ/SD    0.5658   0.5702   0.4803        0.4411   0.5011   0.5307
TDT3  TQ/TD    0.6569   0.6542   0.6141        0.5343   0.5970   0.6560
TDT3  TQ/SD    0.6308   0.6361   0.5808        0.5177   0.5678   0.6433

Average Precision (vector space model):

               Word-level                      Syllable-level
               S(N), N=1    S(N), N=1~2        S(N), N=1    S(N), N=1~2
TDT2  TQ/TD    0.5548       0.5623             0.3412       0.5254
TDT2  TQ/SD    0.5122       0.5225             0.3306       0.5077
TDT3  TQ/TD    0.6505       0.6531             0.3963       0.6502
TDT3  TQ/SD    0.6216       0.6233             0.3708       0.6353

SLIDE 13

Review: The EM Algorithm

  • Introduction of EM (Expectation Maximization):

– Why EM?

  • Simple optimization algorithms for the likelihood function rely on the intermediate variables, called latent (hidden) data

In our case here, the state sequence is the latent data

  • Direct access to the data necessary to estimate the parameters is impossible or difficult

– Two Major Steps:

  • E: expectation with respect to the latent data, using the current estimate of the parameters and conditioned on the observations
  • M: provides a new estimation of the parameters according to ML (or MAP)

SLIDE 14

Review: The EM Algorithm

  • The EM Algorithm is important to HMMs and other learning techniques
– Discover new model parameters that maximize the log-likelihood of the incomplete data, $\log P(\mathbf{o} \mid \lambda)$, by iteratively maximizing the expectation of the log-likelihood of the complete data, $\log P(\mathbf{o}, \mathbf{s} \mid \lambda)$

  • Example
– The observable training data $\mathbf{o}$
  • We want to maximize $P(\mathbf{o} \mid \lambda)$; $\lambda$ is a parameter vector
– The hidden (unobservable) data $\mathbf{s}$
  • E.g. the component densities of the observable data, or the underlying state sequence in HMMs

The auxiliary function:

$$\Theta(\lambda, \hat{\lambda}) = E\big[\log P(\mathbf{o}, \mathbf{s} \mid \lambda) \,\big|\, \mathbf{o}, \hat{\lambda}\big]$$

SLIDE 15

HMM/N-gram-based Model

  • Minimum Classification Error (MCE) Training

– Given a query $Q$ and a desired relevant doc $D^*$, define the classification error function as:

$$E(Q, D^*) = \frac{1}{|Q|}\Big[-\log P(Q \mid D^* \text{ is } R) + \max_{D' \ne D^*} \log P(Q \mid D' \text{ is } R)\Big]$$

  • $E(Q, D^*) > 0$ means misclassified; $E(Q, D^*) \le 0$ means a correct decision

– Transform the error function to the loss function:

$$L(Q, D^*) = \frac{1}{1 + \exp\big(-\alpha\, E(Q, D^*) + \beta\big)}$$

  • In the range between 0 and 1

SLIDE 16

HMM/N-gram-based Model

  • Minimum Classification Error (MCE) Training

– Apply the loss function to the MCE procedure for iteratively updating the weighting parameters

  • Constraints: $m_k \ge 0$, $\sum_k m_k = 1$
  • Parameter transforms (e.g., Type I HMM):

$$m_1 = \frac{e^{\tilde{m}_1}}{e^{\tilde{m}_1} + e^{\tilde{m}_2}}, \qquad m_2 = \frac{e^{\tilde{m}_2}}{e^{\tilde{m}_1} + e^{\tilde{m}_2}}$$

– Iteratively update $\tilde{m}_1$ (e.g., Type I HMM):

$$\tilde{m}_1(i+1) = \tilde{m}_1(i) - \varepsilon \left.\frac{\partial L(Q, D^*)}{\partial \tilde{m}_1}\right|_{\tilde{m}_1 = \tilde{m}_1(i)}$$

  • Where

$$\nabla_{D^*, \tilde{m}_1}(i) = \varepsilon\, \frac{\partial L(Q, D^*)}{\partial \tilde{m}_1} = \varepsilon\, \frac{\partial L(Q, D^*)}{\partial E(Q, D^*)} \cdot \frac{\partial E(Q, D^*)}{\partial \tilde{m}_1}, \qquad \frac{\partial L(Q, D^*)}{\partial E(Q, D^*)} = \alpha\, L(Q, D^*)\, \big[1 - L(Q, D^*)\big]$$

SLIDE 17

HMM/N-gram-based Model

  • Minimum Classification Error (MCE) Training

– Iteratively update $\tilde{m}_1$ (e.g., Type I HMM)

Only the desired-doc term of $E(Q, D^*)$ depends on $D^*$’s own weights, so with the parameter transform above,

$$\frac{\partial E(Q, D^*)}{\partial \tilde{m}_1} = -\frac{1}{|Q|}\, \frac{\partial}{\partial \tilde{m}_1} \sum_{q_n \in Q} \log\big[m_1 P(q_n \mid D^*) + m_2 P(q_n \mid \text{Corpus})\big] = -\frac{m_1}{|Q|} \sum_{q_n \in Q} \left[\frac{P(q_n \mid D^*)}{m_1 P(q_n \mid D^*) + m_2 P(q_n \mid \text{Corpus})} - 1\right]$$

SLIDE 18

HMM/N-gram-based Model

  • Minimum Classification Error (MCE) Training

– Iteratively update $\tilde{m}_1$ (e.g., Type I HMM)

$$\nabla_{D^*, \tilde{m}_1}(i) = -\varepsilon\, \alpha\, L(Q, D^*)\big[1 - L(Q, D^*)\big]\, \frac{m_1(i)}{|Q|} \sum_{q_n \in Q} \left[\frac{P(q_n \mid D^*)}{m_1(i)\, P(q_n \mid D^*) + m_2(i)\, P(q_n \mid \text{Corpus})} - 1\right]$$

$$\tilde{m}_1(i+1) = \tilde{m}_1(i) - \nabla_{D^*, \tilde{m}_1}(i) \qquad \text{(the new weight from the old weight)}$$

Mapping back through the exponential transform:

$$m_1(i+1) = \frac{e^{\tilde{m}_1(i+1)}}{e^{\tilde{m}_1(i+1)} + e^{\tilde{m}_2(i+1)}} = \frac{e^{\tilde{m}_1(i) - \nabla_{D^*, \tilde{m}_1}(i)}}{e^{\tilde{m}_1(i) - \nabla_{D^*, \tilde{m}_1}(i)} + e^{\tilde{m}_2(i) - \nabla_{D^*, \tilde{m}_2}(i)}} = \frac{m_1(i)\, e^{-\nabla_{D^*, \tilde{m}_1}(i)}}{m_1(i)\, e^{-\nabla_{D^*, \tilde{m}_1}(i)} + m_2(i)\, e^{-\nabla_{D^*, \tilde{m}_2}(i)}}$$

SLIDE 19

HMM/N-gram-based Model

  • Minimum Classification Error (MCE) Training

– Final Equations

  • Iteratively update $m_1$:

$$\nabla_{D^*, \tilde{m}_1}(i) = -\varepsilon\, \alpha\, L(Q, D^*)\big[1 - L(Q, D^*)\big]\, \frac{m_1(i)}{|Q|} \sum_{q_n \in Q} \left[\frac{P(q_n \mid D^*)}{m_1(i)\, P(q_n \mid D^*) + m_2(i)\, P(q_n \mid \text{Corpus})} - 1\right]$$

$$m_1(i+1) = \frac{m_1(i)\, e^{-\nabla_{D^*, \tilde{m}_1}(i)}}{m_1(i)\, e^{-\nabla_{D^*, \tilde{m}_1}(i)} + m_2(i)\, e^{-\nabla_{D^*, \tilde{m}_2}(i)}}$$

  • $m_2$ can be updated in the similar way
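
A minimal sketch of one MCE step on (m1, m2) for a single (query, desired doc) pair, combining the error, loss, and update formulas above and reusing the dict-based language models from the earlier sketches (names and the alpha/beta/eps values are illustrative):

```python
import math

def mce_update_type1(query, doc_lm, corpus_lm, m1, m2,
                     logp_target, logp_best_competitor,
                     alpha=1.0, beta=0.0, eps=0.1):
    """One MCE update of the desired doc's weights under the Type I (Uni) HMM."""
    # Classification error and sigmoid loss.
    err = (-logp_target + logp_best_competitor) / len(query)
    loss = 1.0 / (1.0 + math.exp(-alpha * err + beta))
    # dE/d(m1_tilde): only the desired doc's own score depends on its weights.
    grad_e = 0.0
    for q in query:
        denom = m1 * doc_lm["uni"].get(q, 0.0) + m2 * corpus_lm["uni"].get(q, 0.0)
        if denom > 0:
            grad_e += doc_lm["uni"].get(q, 0.0) / denom - 1.0
    grad_e *= -m1 / len(query)
    # Gradient in the transformed space; with two weights, the m2_tilde gradient is the negative.
    g1 = eps * alpha * loss * (1.0 - loss) * grad_e
    g2 = -g1
    # Map back to the probability simplex through the exponential transform.
    a, b = m1 * math.exp(-g1), m2 * math.exp(-g2)
    return a / (a + b), b / (a + b)
```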

SLIDE 20

HMM/N-gram-based Model

  • Experimental results with MCE training

– The results for the syllable-level index features were significantly improved

Average Precision after MCE training (values in parentheses: before MCE training), TDT2 collection:

            Word-level: Uni     Syllable-level: Uni+Bi*    Fusion
TQ/TD       0.6459 (0.6327)     0.6858 (0.5718)            0.7329
TQ/SD       0.5810 (0.5658)     0.6300 (0.5307)            0.6914

Iterations = 100

[Figure: Average Precision vs. MCE iterations (word-based and syllable-based), for TQ/TD and TQ/SD, compared against the values before MCE training]

SLIDE 21

HMM/N-gram-based Model

  • Advantages

– A formal mathematical framework
– Use collection statistics rather than heuristics
– The retrieval system can be gradually improved through usage

  • Disadvantages

– Only literal term matching (or word-overlap measure)
  • The issue of relevance or aboutness is not taken into consideration
– The implementation of relevance feedback or query expansion is not straightforward

SLIDE 22

Latent Semantic Indexing (LSI)

  • LSI: a technique that projects queries and docs into a space with “latent” semantic dimensions
– Co-occurring terms are projected onto the same dimensions
– In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
– Dimensions of the reduced space correspond to the axes of greatest variation

  • Closely related to Principal Component Analysis (PCA)

SLIDE 23

Latent Semantic Indexing (LSI)

  • Dimension Reduction and Feature Extraction

– PCA – SVD (in LSI)

– PCA: project onto the leading orthonormal basis vectors $\varphi_1, \ldots, \varphi_k$ of the feature space

$$y_i = \varphi_i^T X, \qquad \hat{X} = \sum_{i=1}^{k} y_i\, \varphi_i, \qquad \min \|X - \hat{X}\|^2 \text{ for a given } k$$

– SVD (in LSI): decompose the $m \times n$ word-document matrix and keep only the $k$ largest singular values

$$A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^T_{r \times n}, \quad r \le \min(m, n); \qquad A'_{m \times n} = U_{m \times k}\, \Sigma_{k \times k}\, V^T_{k \times n}, \qquad \min \|A - A'\|_F^2 \text{ for a given } k$$

The $k$ retained dimensions span the latent semantic space (as opposed to the original feature space).

SLIDE 24

Latent Semantic Indexing (LSI)

– Singular Value Decomposition (SVD) used for the word-document matrix

  • A least-squares method for dimension reduction

SLIDE 25

Latent Semantic Indexing (LSI)

  • Frameworks to circumvent vocabulary mismatch

[Diagram: a doc and a query, each represented by its terms; literal term matching links the two term sets directly, doc expansion and query expansion enrich either side via a structure model, and latent semantic structure retrieval matches doc and query through the latent structure]

SLIDE 26

Latent Semantic Indexing (LSI)

SLIDE 27

Latent Semantic Indexing (LSI)

  • Singular Value Decomposition (SVD)

$$A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^T_{r \times n}, \qquad r \le \min(m, n)$$

$$A'_{m \times n} = U_{m \times k}\, \Sigma_{k \times k}\, V^T_{k \times n}, \qquad k \le r, \qquad \|A\|_F^2 \ge \|A'\|_F^2$$

(rows of $A$: words $w_1, \ldots, w_m$; columns of $A$: docs $d_1, \ldots, d_n$)

Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$. Both $U$ and $V$ have orthonormal column vectors: $U^T U = I_{k \times k}$, $V^T V = I_{k \times k}$.
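
A minimal sketch of this rank-k approximation with NumPy, assuming A is a small word-document count (or tf-idf) matrix (the matrix and k are illustrative):

```python
import numpy as np

def truncated_svd(A, k):
    """Return U_k, S_k, Vt_k and the rank-k approximation A' = U_k S_k Vt_k."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) Vt
    U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]
    return U_k, S_k, Vt_k, U_k @ S_k @ Vt_k

# Example: a tiny 5-word x 4-doc matrix reduced to k = 2 latent dimensions.
A = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 0., 1.],
              [0., 0., 1., 2.],
              [1., 0., 0., 1.]])
U_k, S_k, Vt_k, A_prime = truncated_svd(A, k=2)
print(np.linalg.norm(A, "fro") >= np.linalg.norm(A_prime, "fro"))   # True, as stated above
```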

SLIDE 28

Latent Semantic Indexing (LSI)

  • Singular Value Decomposition (SVD)

– $A^T A$ is a symmetric $n \times n$ matrix

  • All eigenvalues $\lambda_j$ are nonnegative real numbers
  • All eigenvectors $v_j$ are orthonormal
  • Define the singular values:
– As the square roots of the eigenvalues of $A^T A$
– As the lengths of the vectors $A v_1, A v_2, \ldots, A v_n$

$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0, \qquad \sigma_j = \sqrt{\lambda_j}, \; j = 1, \ldots, n$$

$$V = [\, v_1 \; v_2 \; \cdots \; v_n \,], \qquad v_j^T v_j = 1, \qquad V^T V = I_{n \times n}$$

$$\sigma_1 = \|A v_1\|, \quad \sigma_2 = \|A v_2\|, \quad \ldots$$

For $\lambda_i \ne 0$, $i = 1, \ldots, r$, $\{A v_1, A v_2, \ldots, A v_r\}$ is an orthogonal basis of Col $A$.

SLIDE 29

Latent Semantic Indexing (LSI)

  • {Av1, Av2, …, Avr} is an orthogonal basis of Col A
– Suppose that A (or A^T A) has rank r ≤ n
– Define an orthonormal basis {u1, u2, …, ur} for Col A
  • Extend it to an orthonormal basis {u1, u2, …, um} of Rm

$$(A v_i)^T (A v_j) = v_i^T A^T A v_j = \lambda_j\, v_i^T v_j = 0 \quad (i \ne j)$$

$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > \lambda_{r+1} = \cdots = \lambda_n = 0$$

$$u_i = \frac{A v_i}{\|A v_i\|} = \frac{1}{\sigma_i} A v_i \;\Rightarrow\; A v_i = \sigma_i u_i \;\Rightarrow\; A\, [\, v_1 \; v_2 \cdots v_r \,] = [\, u_1 \; u_2 \cdots u_r \,]\, \Sigma$$

$$A V = U \Sigma \;\Rightarrow\; A = U \Sigma V^T$$

$$\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2 \;\; (?)$$

SLIDE 30

Latent Semantic Indexing (LSI)

SLIDE 31

Latent Semantic Indexing (LSI)

  • Fundamental comparisons based on SVD

– The original word-document matrix A (m×n; rows: words w1…wm, columns: docs d1…dn)
  • compare two terms → dot product of two rows of A (or an entry in A A^T)
  • compare two docs → dot product of two columns of A (or an entry in A^T A)
  • compare a term and a doc → each individual entry of A

– The new word-document matrix A’, with U’ = Umxk, Σ’ = Σk, V’ = Vnxk
  • compare two terms → dot product of two rows of U’Σ’
  • compare two docs → dot product of two rows of V’Σ’
  • compare a query and a doc → each individual entry of A’

$$A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^T V'\Sigma'^T U'^T = (U'\Sigma')(U'\Sigma')^T$$

$$A'^T A' = (U'\Sigma'V'^T)^T (U'\Sigma'V'^T) = V'\Sigma'^T U'^T U'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$$

(U’ and V’ have orthonormal columns, so U’ᵀU’ = V’ᵀV’ = I; the diagonal Σ’ only stretches or shrinks each latent dimension)
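
A minimal sketch of these comparisons in the reduced space, continuing from the truncated_svd example earlier (illustrative):

```python
# Rows of U_k @ S_k represent terms; rows of Vt_k.T @ S_k represent docs.
term_vecs = U_k @ S_k        # rows of U'Σ'
doc_vecs = Vt_k.T @ S_k      # rows of V'Σ'

print(term_vecs[0] @ term_vecs[1])   # compare two terms: dot product of two rows of U'Σ'
print(doc_vecs[0] @ doc_vecs[1])     # compare two docs: dot product of two rows of V'Σ'
print(A_prime[0, 1])                 # compare a term and a doc: an individual entry of A'
```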

SLIDE 32

Latent Semantic Indexing (LSI)

  • Fold-in: find representations for pseudo-docs q

– For objects (new queries or docs) that did not appear in the original analysis

  • Fold-in a new mx1 query (or doc) vector

– Cosine measure between the query and doc vectors in the latent semantic space

$$\hat{q}_{1 \times k} = q^T_{1 \times m}\; U_{m \times k}\; \Sigma_{k \times k}^{-1}$$

(the query is represented by the weighted sum of its constituent term vectors, with the separate dimensions differentially weighted by $\Sigma^{-1}$; $\hat{q}$ is just like a row of $V$)

$$sim(\hat{q}, \hat{d}) = \text{cosine}(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{\|\hat{q}\Sigma\|\; \|\hat{d}\Sigma\|} \qquad (\hat{q}, \hat{d} \text{ are row vectors})$$
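
A minimal sketch of folding in a new query vector and scoring it against the docs, again continuing from the truncated_svd example (the query vector is illustrative):

```python
def cosine(x, y):
    return float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def fold_in_query(q, U_k, S_k):
    """Map an m-dimensional term-count query vector into the k-dim latent space."""
    return q @ U_k @ np.linalg.inv(S_k)          # q_hat = q^T U_k inv(Sigma_k)

q = np.array([1., 0., 0., 1., 0.])               # toy query using words w1 and w4
q_hat = fold_in_query(q, U_k, S_k)
for j, d_hat in enumerate(Vt_k.T):               # each row of V is a doc in the latent space
    print(f"d{j+1}", cosine(q_hat @ S_k, d_hat @ S_k))   # cosine of the Sigma-weighted vectors
```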

SLIDE 33

Latent Semantic Indexing (LSI)

  • Fold-in a new 1xn term vector

$$\hat{t}_{1 \times k} = t_{1 \times n}\; V_{n \times k}\; \Sigma_{k \times k}^{-1}$$

SLIDE 34

Latent Semantic Indexing (LSI)

  • Experimental results

– HMM is consistently better than VSM at all recall levels
– LSI is better than VSM at higher recall levels

[Figure] Recall-Precision curve at 11 standard recall levels evaluated on the TDT-3 SD collection (using word-level indexing terms)

SLIDE 35

Latent Semantic Indexing (LSI)

  • Advantages

– A clean formal framework and a clearly defined optimization criterion (least-squares)
  • Conceptual simplicity and clarity
– Handle synonymy problems (“heterogeneous vocabulary”)
– Good results for high-recall search
  • Take term co-occurrence into account

  • Disadvantages

– High computational complexity
– LSI offers only a partial solution to polysemy
  • E.g. bank, bass, …

SLIDE 36

Probabilistic Latent Semantic Analysis (PLSA)

  • Also called the Aspect Model, or Probabilistic Latent Semantic Indexing (PLSI)

– Can be viewed as a complex HMM Model

Thomas Hofmann 1999

– The query is a sequence of words, $Q = w_1 w_2 \ldots w_J$; each word is generated by first picking a latent topic $T_k$ for the doc $D_i$ (with probability $P(T_k \mid D_i)$, $k = 1, \ldots, K$) and then generating the word from that topic (with probability $P(w_j \mid T_k)$)

$$sim(Q, D_i) = P(D_i \mid Q) = \frac{P(Q, D_i)}{P(Q)} \approx P(Q, D_i) = P(Q \mid D_i)\, P(D_i) \approx P(Q \mid D_i) \;\Rightarrow\; sim(Q, D_i) \approx P(Q \mid D_i)$$

$$sim(Q, D_i) \approx P(Q \mid D_i) = \prod_{w_j \in Q} P(w_j \mid D_i) = \prod_{w_j \in Q} \sum_{k=1}^{K} P(w_j, T_k \mid D_i) = \prod_{w_j \in Q} \left[\sum_{k=1}^{K} P(w_j \mid T_k)\, P(T_k \mid D_i)\right]$$

The latent variables => the unobservable class variables $T_k$ (topics or domains)

SLIDE 37

Probabilistic Latent Semantic Analysis (PLSA)

  • Definition

– $P(D_i)$: the prob. of selecting a doc $D_i$
– $P(T_k \mid D_i)$: the prob. of picking a latent class $T_k$ for the doc $D_i$
– $P(w_j \mid T_k)$: the prob. of generating a word $w_j$ from the class $T_k$

SLIDE 38

Probabilistic Latent Semantic Analysis (PLSA)

  • Assumptions

– Bag-of-words: treat docs as a memoryless source; words are generated independently
– Conditional independence: the doc and the word are independent conditioned on the state of the associated latent variable

$$P(w_j, D_i \mid T_k) \approx P(w_j \mid T_k)\, P(D_i \mid T_k)$$

$$P(w_j, D_i) = \sum_{k=1}^{K} P(w_j, T_k, D_i) = \sum_{k=1}^{K} P(w_j, D_i \mid T_k)\, P(T_k) = \sum_{k=1}^{K} P(w_j \mid T_k)\, P(D_i \mid T_k)\, P(T_k) = \sum_{k=1}^{K} P(w_j \mid T_k)\, P(T_k \mid D_i)\, P(D_i) = P(D_i) \sum_{k=1}^{K} P(w_j \mid T_k)\, P(T_k \mid D_i)$$
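
A minimal sketch of the resulting word probability $P(w_j \mid D_i) = \sum_k P(w_j \mid T_k)\, P(T_k \mid D_i)$ with the parameters stored as NumPy arrays (shapes and values are illustrative):

```python
import numpy as np

# P(w|T): |V| x K matrix;  P(T|D): K x |docs| matrix  (columns sum to 1 in both).
p_w_given_t = np.array([[0.5, 0.1],
                        [0.3, 0.1],
                        [0.1, 0.4],
                        [0.1, 0.4]])          # 4 words, K = 2 topics
p_t_given_d = np.array([[0.8, 0.2, 0.5],
                        [0.2, 0.8, 0.5]])     # 2 topics, 3 docs

p_w_given_d = p_w_given_t @ p_t_given_d       # P(w_j | D_i) = sum_k P(w_j|T_k) P(T_k|D_i)

def plsa_query_likelihood(query_word_ids, doc_index):
    """P(Q | D_i) under the bag-of-words assumption: a product over the query words."""
    return float(np.prod(p_w_given_d[query_word_ids, doc_index]))

print(plsa_query_likelihood([0, 2], doc_index=0))
```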

SLIDE 39

Probabilistic Latent Semantic Analysis (PLSA)

  • Probability estimation using the EM (expectation-maximization) algorithm

– E (expectation) step

$$E[L_C] = \sum_{D_i} \sum_{w_j} n(w_j, D_i)\, E\big[\log P(w_j, T_k, D_i)\big]$$

where $L_C$ is the complete-data log-likelihood and $n(w_j, D_i)$ is the count of word $w_j$ in doc $D_i$. Taking the expectation over the latent topics with the posterior computed from the current (hatted) parameters:

$$E[L_C] = \sum_{D_i} \sum_{w_j} n(w_j, D_i) \sum_{T_k} \hat{P}(T_k \mid w_j, D_i)\, \log P(w_j, T_k \mid D_i)$$

$$\hat{P}(T_k \mid w_j, D_i) = \frac{\hat{P}(w_j \mid T_k)\, \hat{P}(T_k \mid D_i)}{\hat{P}(w_j \mid D_i)} = \frac{\hat{P}(w_j \mid T_k)\, \hat{P}(T_k \mid D_i)}{\sum_{T_k} \hat{P}(w_j \mid T_k)\, \hat{P}(T_k \mid D_i)}$$

$$E[L_C] = \sum_{D_i} \sum_{w_j} n(w_j, D_i) \sum_{T_k} \hat{P}(T_k \mid w_j, D_i)\, \log\big[P(w_j \mid T_k)\, P(T_k \mid D_i)\big] = \sum_{D_i} \sum_{w_j} n(w_j, D_i) \sum_{T_k} \frac{\hat{P}(w_j \mid T_k)\, \hat{P}(T_k \mid D_i)}{\sum_{T_k} \hat{P}(w_j \mid T_k)\, \hat{P}(T_k \mid D_i)}\, \log\big[P(w_j \mid T_k)\, P(T_k \mid D_i)\big]$$

(maximizing this expectation of the complete data likelihood is equivalent to minimizing the cross entropy / Kullback-Leibler divergence between the empirical distribution and the model)

SLIDE 40

Probabilistic Latent Semantic Analysis (PLSA)

  • Probability estimation using EM

– M (maximization) step

$$Q = E[L_C] + \sum_{T_k} \tau_k \Big(1 - \sum_{w_j} P(w_j \mid T_k)\Big) + \sum_{D_i} \rho_i \Big(1 - \sum_{T_k} P(T_k \mid D_i)\Big)$$

$$Q_{P(w_j \mid T_k)} = \sum_{D_i} \sum_{w_j} n(w_j, D_i) \sum_{T_k} \hat{P}(T_k \mid w_j, D_i)\, \log P(w_j \mid T_k) + \sum_{T_k} \tau_k \Big(1 - \sum_{w_j} P(w_j \mid T_k)\Big)$$

$$Q_{P(T_k \mid D_i)} = \sum_{D_i} \sum_{w_j} n(w_j, D_i) \sum_{T_k} \hat{P}(T_k \mid w_j, D_i)\, \log P(T_k \mid D_i) + \sum_{D_i} \rho_i \Big(1 - \sum_{T_k} P(T_k \mid D_i)\Big)$$

(normalization constraints enforced using the Lagrange multipliers $\tau_k$ and $\rho_i$)

SLIDE 41

Probabilistic Latent Semantic Analysis (PLSA)

  • Probability estimation using EM

– M (maximization) step

  • Take differentiation

The training formulas:

$$P(w_j \mid T_k) = \frac{\sum_{D_i} n(w_j, D_i)\, \hat{P}(T_k \mid w_j, D_i)}{\sum_{w_j} \sum_{D_i} n(w_j, D_i)\, \hat{P}(T_k \mid w_j, D_i)}$$

$$P(T_k \mid D_i) = \frac{\sum_{w_j} n(w_j, D_i)\, \hat{P}(T_k \mid w_j, D_i)}{\sum_{T_k} \sum_{w_j} n(w_j, D_i)\, \hat{P}(T_k \mid w_j, D_i)} = \frac{\sum_{w_j} n(w_j, D_i)\, \hat{P}(T_k \mid w_j, D_i)}{\sum_{w_j} n(w_j, D_i)} = \frac{\sum_{w_j} n(w_j, D_i)\, \hat{P}(T_k \mid w_j, D_i)}{n(D_i)}$$
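
A minimal sketch of one full EM iteration for PLSA with NumPy, assuming n is a word-by-doc count matrix (shapes and names are illustrative):

```python
import numpy as np

def plsa_em_step(n, p_w_t, p_t_d):
    """One EM iteration of PLSA.

    n     : |V| x |D| count matrix, n[j, i] = n(w_j, D_i)
    p_w_t : |V| x K,  p_w_t[j, k] = P(w_j | T_k)
    p_t_d : K x |D|,  p_t_d[k, i] = P(T_k | D_i)
    """
    # E-step: posterior P(T_k | w_j, D_i) for every (word, doc) pair.
    joint = p_w_t[:, :, None] * p_t_d[None, :, :]                  # (V, K, D): P(w|T) P(T|D)
    post = joint / np.maximum(joint.sum(axis=1, keepdims=True), 1e-12)
    # M-step: re-estimate P(w|T) and P(T|D) from expected counts.
    exp_counts = n[:, None, :] * post                              # n(w_j, D_i) * P(T_k | w_j, D_i)
    p_w_t_new = exp_counts.sum(axis=2)                             # sum over docs  -> (V, K)
    p_w_t_new /= p_w_t_new.sum(axis=0, keepdims=True)
    p_t_d_new = exp_counts.sum(axis=0)                             # sum over words -> (K, D)
    p_t_d_new /= np.maximum(n.sum(axis=0, keepdims=True), 1e-12)   # divide by doc lengths n(D_i)
    return p_w_t_new, p_t_d_new
```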

SLIDE 42

Probabilistic Latent Semantic Analysis (PLSA)

  • Latent Probability Spaces

[Figure: example latent classes learned by PLSA, drawn from contexts such as image sequence analysis, medical imaging, contour/boundary detection, and phonetic segmentation]

In the symmetric parametrization,

$$P(w_j, D_i) = \sum_{T_k} P(w_j, T_k, D_i) = \sum_{T_k} P(w_j \mid T_k)\, P(D_i, T_k) = \sum_{T_k} P(w_j \mid T_k)\, P(T_k)\, P(D_i \mid T_k)$$

which has the same form as an SVD:

$$\hat{U} : \big(P(w_j \mid T_k)\big)_{j,k}, \qquad \hat{\Sigma} : \text{diag}\big(P(T_k)\big)_k, \qquad \hat{V} : \big(P(D_i \mid T_k)\big)_{i,k}, \qquad P(W, D) = \hat{U}\, \hat{\Sigma}\, \hat{V}^T$$

Dimensionality K = 128 (latent classes)

SLIDE 43

Probabilistic Latent Semantic Analysis (PLSA)

  • One more example on TDT1 dataset

[Figure: example PLSA topics on the TDT1 dataset, with top words suggesting themes such as aviation, space missions, family love, and Hollywood love]

SLIDE 44

Probabilistic Latent Semantic Analysis (PLSA)

  • Comparison with LSI

– Decomposition/Approximation

  • LSI: least-squares criterion measured on the L2 or Frobenius norm of the word-doc matrices
  • PLSA: maximization of the likelihood function, based on the cross entropy or Kullback-Leibler divergence between the empirical distribution and the model

– Computational complexity

  • LSI: SVD decomposition
  • PLSA: EM training, which is time-consuming over the iterations

SLIDE 45

Probabilistic Latent Semantic Analysis (PLSA)

  • Experimental Results

PLSI-U*

– Two ways to smooth the empirical distribution with PLSI:
  • Combine the cosine score with that of the vector space model (so does LSI)
  • Combine the multinomials individually
– Both provide almost identical performance
– It’s not known if PLSA was used alone

$$P_{PLSI\text{-}U^*}(w_j \mid d_i) = \lambda\, P_{Empirical}(w_j \mid d_i) + (1 - \lambda)\, P_{PLSA}(w_j \mid d_i)$$

$$P_{Empirical}(w_j \mid d_i) = \frac{n(w_j, d_i)}{n(d_i)}$$
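
A minimal sketch of this interpolation, assuming a word-by-doc count matrix and a matrix of PLSA unigrams as in the earlier sketches (lambda is a tuning parameter; value illustrative):

```python
import numpy as np

def plsi_u_smooth(n, p_plsa, lam=0.5):
    """P_PLSI-U*(w|d) = lam * P_Empirical(w|d) + (1 - lam) * P_PLSA(w|d).

    n      : |V| x |D| word-doc count matrix
    p_plsa : |V| x |D| matrix of PLSA unigrams P(w_j | d_i)
    """
    p_empirical = n / np.maximum(n.sum(axis=0, keepdims=True), 1e-12)   # n(w_j, d_i) / n(d_i)
    return lam * p_empirical + (1.0 - lam) * p_plsa
```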

SLIDE 46

Probabilistic Latent Semantic Analysis (PLSA)

  • Experimental Results

PLSI-Q*

– Use the low-dimensional representations $P(T_k \mid Q)$ and $P(T_k \mid D_i)$ (i.e., query and doc viewed in a k-dimensional latent space) to evaluate relevance by means of the cosine measure
– Combine the cosine score with that of the vector space model
– Use an ad hoc approach to reweight the different model components (dimensions) by

$$RW(T_k) = \sum_{w_j} P(w_j \mid T_k)\; idf(w_j)$$

$$sim(Q, D_i) = \frac{\sum_{w_j \in Q} n(w_j, Q) \sum_{T_k} RW(T_k)^2\, P(T_k \mid w_j)\, P(T_k \mid D_i)}{\sqrt{\sum_{w_j \in Q} n(w_j, Q) \sum_{T_k} \big[RW(T_k)\, P(T_k \mid w_j)\big]^2}\; \sqrt{\sum_{T_k} \big[RW(T_k)\, P(T_k \mid D_i)\big]^2}}$$

SLIDE 47

Probabilistic Latent Semantic Analysis (PLSA)

  • Experimental Results