Statistical Modeling Approaches for Information Retrieval - PowerPoint PPT Presentation


slide-1
SLIDE 1

Statistical Modeling Approaches for Information Retrieval

Berlin Chen 2004

References:

  • 1. W. B. Croft and J. Lafferty (Editors). Language Modeling for Information Retrieval. July 2003
  • 2. Berlin Chen et al. A Discriminative HMM/N-Gram-Based Retrieval Approach for Mandarin Spoken Documents. ACM Transactions on Asian Language Information Processing, June 2004
  • 3. Berlin Chen. Exploring the Use of Latent Topical Information for Statistical Chinese Spoken Document Retrieval. 2004
  • 4. M. W. Berry et al. Using Linear Algebra for Intelligent Information Retrieval. Technical report, 1994
  • 5. Thomas Hofmann. Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning, 2001

Outline:

  • 1. HMM/N-gram-based Model
  • 2. Topical Mixture Model (TMM)
  • 3. Latent Semantic Indexing (LSI)
  • 4. Probabilistic Latent Semantic Analysis (PLSA)
slide-2
SLIDE 2

IR 2004 – Berlin Chen 2

Taxonomy of Classic IR Models

[Diagram: taxonomy of classic IR models]

  • User task: Retrieval (ad hoc, filtering), Browsing
  • Classic models: Boolean, Vector, Probabilistic
  • Set-theoretic models: Fuzzy, Extended Boolean
  • Probability-based models: Inference Network, Belief Network, Hidden Markov Model, Topical Mixture Model, Probabilistic LSI, Language Model
  • Algebraic models: Generalized Vector, Latent Semantic Indexing (LSI), Neural Networks
  • Structured models: Non-Overlapping Lists, Proximal Nodes
  • Browsing: Flat, Structure Guided, Hypertext

slide-3
SLIDE 3

IR 2004 – Berlin Chen 3

Two Perspectives for IR Models

  • Matching Strategy

– Literal term matching

  • Vector Space Model (VSM), Hidden Markov Model (HMM), Language Model (LM)

– Concept matching

  • Latent Semantic Indexing (LSI), Probabilistic Latent Semantic Indexing (PLSI), Topical Mixture Model (TMM)

  • Learning Capability

– Term weight, query expansion, document expansion, etc.

  • Vector Space Model, Latent Semantic Indexing

– Solid statistical foundations

  • Hidden Markov Model, Probabilistic Latent Semantic Indexing (PLSI), Topical Mixture Model (TMM)

slide-4
SLIDE 4

IR 2004 – Berlin Chen 4

Two Perspectives for IR Models (cont.)

  • Literal Term Matching vs. Concept Matching

Example (Chinese broadcast-news transcript, containing recognition errors): "A Hong Kong Sing Tao Daily report quotes military observers as saying that by 2005 Taiwan will have completely lost its air superiority, because mainland Chinese fighters will surpass Taiwan's in both quantity and performance. The report notes that while importing large quantities of advanced Russian weapons, China must also speed up development of its own weapon systems; the improved Flying Leopard fighter from the Xi'an aircraft plant is about to be deployed … According to Japanese media, with a war in the Taiwan Strait possible at any time, Beijing's basic policy is to fight a high-tech local war, so the PLA plans to field two hundred Sukhoi fighters, including the Su-30, before 2004."

Query terms: 中國解放軍 (Chinese PLA), 蘇愷戰機 (Sukhoi fighters), 中共新一代 (China's new generation), 空軍戰力 (air force strength)
slide-5
SLIDE 5

IR 2004 – Berlin Chen 5

HMM/N-gram-based Model

  • Model the query as a sequence of input observations (index terms): $Q = q_1 q_2 \cdots q_N$

  • Model the doc $D$ as a discrete HMM composed of distributions of N-gram parameters

  • The relevance measure, $P(Q \mid D \text{ is } R)$, can be estimated by the N-gram probabilities of the index term sequence for the query, $Q$, predicted by the doc $D$

– A generative model for IR

$D^{*} = \arg\max_{D} P(D \text{ is } R \mid Q) \approx \arg\max_{D} P(Q \mid D \text{ is } R)\, P(D \text{ is } R) \approx \arg\max_{D} P(Q \mid D \text{ is } R)$

with the assumption that …… (e.g., the prior $P(D \text{ is } R)$ is roughly uniform across documents)

slide-6
SLIDE 6

IR 2004 – Berlin Chen 6

HMM/N-gram-based Model (cont.)

  • Given a word sequence, $W = w_1 w_2 \cdots w_N$, of length N

– How to estimate its corresponding probability $P(W)$?

$P(W) = P(w_1 w_2 \cdots w_N) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2) \cdots P(w_N \mid w_1 w_2 \cdots w_{N-1})$

The chain rule is applied. Too complicated to estimate all the necessary probability items!

slide-7
SLIDE 7

IR 2004 – Berlin Chen 7

HMM/N-gram-based Model (cont.)

  • N-gram approximation (Language Model)

– Unigram: $P(W) = P(w_1)\, P(w_2)\, P(w_3) \cdots P(w_N)$

– Bigram: $P(W) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2) \cdots P(w_N \mid w_{N-1})$

– Trigram: $P(W) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2) \cdots P(w_N \mid w_{N-2} w_{N-1})$

– ……

slide-8
SLIDE 8

IR 2004 – Berlin Chen 8

HMM/N-gram-based Model (cont.)

  • A discrete HMM composed of distributions of N-gram parameters (viewed as a language model source)

$Q = q_1 q_2 \cdots q_N$, with mixture weights $m_1 + m_2 + m_3 + m_4 = 1$ attached to the four component distributions $P(q_n \mid D)$, $P(q_n \mid \text{Corpus})$, $P(q_n \mid q_{n-1}, D)$ and $P(q_n \mid q_{n-1}, \text{Corpus})$

Bigram modeling: $P(Q \mid D \text{ is } R) = P(q_1 \mid D) \prod_{n=2}^{N} P(q_n \mid q_{n-1}, D)$

A mixture of N probability distributions:

$P(Q \mid D \text{ is } R) = \left[m_1 P(q_1 \mid D) + m_2 P(q_1 \mid \text{Corpus})\right] \cdot \prod_{n=2}^{N} \left[m_1 P(q_n \mid D) + m_2 P(q_n \mid \text{Corpus}) + m_3 P(q_n \mid q_{n-1}, D) + m_4 P(q_n \mid q_{n-1}, \text{Corpus})\right]$

smoothing/interpolation, but reasons for what: avoiding zero prob., and …?
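The interpolated scoring rule above maps almost directly onto code. Below is a minimal Python sketch (not the authors' implementation) of the Uni+Bi* scoring for one document, assuming the document and corpus unigram/bigram probabilities have already been estimated and stored in dictionaries; log-probabilities are summed to avoid underflow.

```python
import math

def score_query(query_terms, doc_uni, doc_bi, corpus_uni, corpus_bi,
                m1=0.4, m2=0.3, m3=0.2, m4=0.1):
    """Log P(Q | D is R) under the Uni+Bi* mixture (a sketch, not the original code).

    doc_uni[q]     ~ P(q | D)          doc_bi[(prev, q)]    ~ P(q | prev, D)
    corpus_uni[q]  ~ P(q | Corpus)     corpus_bi[(prev, q)] ~ P(q | prev, Corpus)
    The weights m1 + m2 + m3 + m4 should sum to 1.
    """
    log_p = 0.0
    prev = None
    for q in query_terms:
        p = m1 * doc_uni.get(q, 0.0) + m2 * corpus_uni.get(q, 0.0)
        if prev is not None:  # bigram terms apply only from the second query word on
            p += m3 * doc_bi.get((prev, q), 0.0) + m4 * corpus_bi.get((prev, q), 0.0)
        log_p += math.log(p) if p > 0 else float("-inf")
        prev = q
    return log_p

# Documents are ranked by this score; the corpus components keep p > 0 for unseen terms.
```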

slide-9
SLIDE 9

IR 2004 – Berlin Chen 9

HMM/N-gram-based Model (cont.)

  • Variants: Three Types of HMM Structures

– Type I: Unigram-Based (Uni)

$P(Q \mid D \text{ is } R) = \prod_{n=1}^{N} \left[m_1 P(q_n \mid D) + m_2 P(q_n \mid \text{Corpus})\right]$

– Type II: Unigram/Bigram-Based (Uni+Bi)

$P(Q \mid D \text{ is } R) = \left[m_1 P(q_1 \mid D) + m_2 P(q_1 \mid \text{Corpus})\right] \prod_{n=2}^{N} \left[m_1 P(q_n \mid D) + m_2 P(q_n \mid \text{Corpus}) + m_3 P(q_n \mid q_{n-1}, D)\right]$

– Type III: Unigram/Bigram/Corpus-Based (Uni+Bi*)

$P(Q \mid D \text{ is } R) = \left[m_1 P(q_1 \mid D) + m_2 P(q_1 \mid \text{Corpus})\right] \prod_{n=2}^{N} \left[m_1 P(q_n \mid D) + m_2 P(q_n \mid \text{Corpus}) + m_3 P(q_n \mid q_{n-1}, D) + m_4 P(q_n \mid q_{n-1}, \text{Corpus})\right]$

Example: P(陳水扁 總統 視察 阿里山 小火車 | D) = [m1·P(陳水扁|D) + m2·P(陳水扁|C)] × [m1·P(總統|D) + m2·P(總統|C) + m3·P(總統|陳水扁,D) + m4·P(總統|陳水扁,C)] × [m1·P(視察|D) + m2·P(視察|C) + m3·P(視察|總統,D) + m4·P(視察|總統,C)] × ……

slide-10
SLIDE 10

IR 2004 – Berlin Chen 10

HMM/N-gram-based Model (cont.)

  • The role of the corpus N-gram probabilities $P(q_n \mid \text{Corpus})$ and $P(q_n \mid q_{n-1}, \text{Corpus})$

– Model the general distribution of the index terms

  • Help to solve the zero-frequency problem, i.e. the case $P(q_n \mid D) = 0$
  • Help to differentiate the contributions of different missing terms in a doc (global information like IDF?)

– The corpus N-gram probabilities were estimated using an outside corpus

Example doc D = "qa qa qa qb qa qb qb qc qc qd": P(qa|D)=0.4, P(qb|D)=0.3, P(qc|D)=0.2, P(qd|D)=0.1, P(qe|D)=0.0, P(qf|D)=0.0

slide-11
SLIDE 11

IR 2004 – Berlin Chen 11

HMM/N-gram-based Model (cont.)

  • Estimation of N-grams (Language Models)

– Maximum likelihood estimation (MLE) for doc N-grams

  • Unigram: $P(q_i \mid D) = \dfrac{C_D(q_i)}{\sum_{q_j} C_D(q_j)} = \dfrac{C_D(q_i)}{|D|}$, where $C_D(q_i)$ is the count of term $q_i$ in the doc D and $|D|$ is the length of (number of terms in) the doc D

  • Bigram: $P(q_i \mid q_j, D) = \dfrac{C_D(q_j, q_i)}{C_D(q_j)}$, where $C_D(q_j, q_i)$ is the count of the term pair $(q_j, q_i)$ in the doc D

– Similar formulas for corpus N-grams:

$P(q_i \mid \text{Corpus}) = \dfrac{C_{\text{Corpus}}(q_i)}{|\text{Corpus}|}$, $\qquad P(q_i \mid q_j, \text{Corpus}) = \dfrac{C_{\text{Corpus}}(q_j, q_i)}{C_{\text{Corpus}}(q_j)}$

Corpus: an outside corpus or just the doc collection
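As a concrete illustration of the MLE formulas above, the following Python sketch (a hypothetical helper, not taken from the original paper) builds the unigram and bigram tables for one tokenized document; the corpus tables are built the same way over the whole collection or an outside corpus.

```python
from collections import Counter

def estimate_ngrams(tokens):
    """Relative-frequency (MLE) unigram and bigram estimates for one document."""
    uni_counts = Counter(tokens)                           # C_D(q_i)
    bi_counts = Counter(zip(tokens[:-1], tokens[1:]))      # C_D(q_j, q_i)
    total = len(tokens)                                    # |D|
    unigram = {w: c / total for w, c in uni_counts.items()}                      # P(q_i | D)
    bigram = {(p, w): c / uni_counts[p] for (p, w), c in bi_counts.items()}      # P(q_i | q_j, D)
    return unigram, bigram

uni, bi = estimate_ngrams("qa qa qa qb qa qb qb qc qc qd".split())
print(uni["qa"])  # 0.4, matching the toy example above
```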

slide-12
SLIDE 12

IR 2004 – Berlin Chen 12

HMM/N-gram-based Model (cont.)

  • Basically, m1, m2, m3, m4 can be estimated by using the Expectation-Maximization (EM) algorithm

– All docs share the same weights mi here
– The N-gram probability distributions also can be estimated using the EM algorithm instead of maximum likelihood (ML) estimation

  • Unsupervised: using the doc itself, ML
  • Supervised: using query exemplars, EM

  • For those docs with training queries, m1, m2, m3, m4 can be estimated by using the Minimum Classification Error (MCE) training algorithm

– The docs can have different weights

(because of the insufficiency of training data)

slide-13
SLIDE 13

IR 2004 – Berlin Chen 13

HMM/N-gram-based Model (cont.)

  • Expectation-Maximization Training

– The weights are tied among the documents
– E.g. $m_1$ of the Type I HMM can be trained using the following equation:

$\hat m_1 = \dfrac{\sum_{Q \in \text{TrainSet}_Q}\ \sum_{D \in \text{Doc}_{Q\text{-to-}R}}\ \sum_{q_n \in Q} \dfrac{m_1 P(q_n \mid D)}{m_1 P(q_n \mid D) + m_2 P(q_n \mid \text{Corpus})}}{\sum_{Q \in \text{TrainSet}_Q} |Q| \cdot |\text{Doc}_{Q\text{-to-}R}|}$

(the weights $m_1, m_2$ inside the sum are the old weights; $\hat m_1$ is the new weight)

  • Where $\text{TrainSet}_Q$ is the set of training query exemplars, $\text{Doc}_{Q\text{-to-}R}$ is the set of docs that are relevant to a specific training query exemplar $Q$, $|Q|$ is the length of the query $Q$, and $|\text{Doc}_{Q\text{-to-}R}|$ is the total number of docs relevant to the query $Q$

819 queries, ≦ 2265 docs
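A minimal Python sketch of this tied-weight re-estimation for the Type I (Uni) model is shown below; the training set is assumed to be a list of (query_terms, relevant_docs) pairs, and doc_uni / corpus_uni are hypothetical per-document and corpus unigram tables as in the earlier sketches.

```python
def em_update_weights(train_set, doc_uni, corpus_uni, m1, m2, n_iters=10):
    """Re-estimate the tied mixture weights (m1, m2) of the Type I HMM (a sketch)."""
    for _ in range(n_iters):
        numer, denom = 0.0, 0.0
        for query_terms, relevant_docs in train_set:
            for d in relevant_docs:
                for q in query_terms:
                    p_doc = m1 * doc_uni[d].get(q, 0.0)
                    p_all = p_doc + m2 * corpus_uni.get(q, 0.0)
                    if p_all > 0:
                        numer += p_doc / p_all   # posterior weight of the document mixture
                    denom += 1                    # accumulates sum over |Q| * |Doc_Q-to-R|
        m1 = numer / denom
        m2 = 1.0 - m1
    return m1, m2
```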

slide-14
SLIDE 14

IR 2004 – Berlin Chen 14

HMM/N-gram-based Model (cont.)

  • Expectation-Maximization Training

– Step 1: Expectation

  • Write down the log-likelihood expression and take the expectation over the mixture sequences $\mathbf{K}$

$P(Q, \mathbf{K} \mid \hat D) = P(\mathbf{K} \mid Q, \hat D)\, P(Q \mid \hat D)$

$\Rightarrow \log P(Q \mid \hat D) = \log P(Q, \mathbf{K} \mid \hat D) - \log P(\mathbf{K} \mid Q, \hat D)$

$\Rightarrow \log P(Q \mid \hat D) = E_{\mathbf{K} \mid Q, D}\left[\log P(Q, \mathbf{K} \mid \hat D)\right] - E_{\mathbf{K} \mid Q, D}\left[\log P(\mathbf{K} \mid Q, \hat D)\right]$

$\Rightarrow \log P(Q \mid \hat D) = \sum_{\mathbf{K}} P(\mathbf{K} \mid Q, D)\, \log P(Q, \mathbf{K} \mid \hat D) - \sum_{\mathbf{K}} P(\mathbf{K} \mid Q, D)\, \log P(\mathbf{K} \mid Q, \hat D)$

where $Q = q_1 q_2 \cdots q_N$ is the query word sequence and $\mathbf{K} = k_1 k_2 \cdots k_N$ (e.g. 1 3 2 4 …) is a mixture sequence; the expectation is taken over all possible mixture sequences $\mathbf{K}$ (conditioned on $Q, D$)

slide-15
SLIDE 15

IR 2004 – Berlin Chen 15

HMM/N-gram-based Model (cont.)

  • Explanation

$P(Q \mid \hat D) = \prod_{n=1}^{N} \sum_{k_n=1}^{M} \hat m_{k_n}\, P(q_n \mid k_n, D) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_N=1}^{M} \left[\prod_{n=1}^{N} \hat m_{k_n}\, P(q_n \mid k_n, D)\right]$

where $\hat m_{k_n} = P(k_n \mid \hat D)$

$\Rightarrow P(Q \mid \hat D) = \sum_{k_1} \sum_{k_2} \cdots \sum_{k_N} P(\mathbf{K} \mid \hat D)\, P(Q \mid \mathbf{K}, \hat D) = \sum_{\mathbf{K}} P(Q, \mathbf{K} \mid \hat D)$

(independence assumption: $P(Q \mid \mathbf{K}, \hat D) = \prod_n P(q_n \mid k_n, \hat D)$)

How many kinds of mixture sequences $\mathbf{K} = k_1 k_2 \cdots k_N$ are there? $M^{N}$ kinds, for $M$ mixtures of distributions

Note (sum-product → product-sum): $\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + \cdots + a_{2M}) \cdots (a_{T1} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$

slide-16
SLIDE 16

IR 2004 – Berlin Chen 16

HMM/N-gram-based Model (cont.)

  • Expectation-Maximization Training

– Step 1: Expectation (cont.)

  • Express $\log P(Q \mid \hat D)$ using two auxiliary functions

$\log P(Q \mid \hat D) = \sum_{\mathbf{K}} P(\mathbf{K} \mid Q, D)\, \log P(Q, \mathbf{K} \mid \hat D) - \sum_{\mathbf{K}} P(\mathbf{K} \mid Q, D)\, \log P(\mathbf{K} \mid Q, \hat D) = \Phi(\hat D, D) - \mathrm{H}(\hat D, D)$

where

$\Phi(\hat D, D) = E_{\mathbf{K}}\left[\log P(Q, \mathbf{K} \mid \hat D)\right] = \sum_{\mathbf{K}} P(\mathbf{K} \mid Q, D)\, \log P(Q, \mathbf{K} \mid \hat D)$   (expected log-likelihood of the complete data)

$\mathrm{H}(\hat D, D) = \sum_{\mathbf{K}} P(\mathbf{K} \mid Q, D)\, \log P(\mathbf{K} \mid Q, \hat D)$

slide-17
SLIDE 17

IR 2004 – Berlin Chen 17

HMM/N-gram-based Model (cont.)

  • Expectation-Maximization Training

– Step 1: Expectation (cont.)

  • We want $\log P(Q \mid \hat D) \ge \log P(Q \mid D)$

$\log P(Q \mid \hat D) - \log P(Q \mid D) = \left[\Phi(\hat D, D) - \mathrm{H}(\hat D, D)\right] - \left[\Phi(D, D) - \mathrm{H}(D, D)\right] = \left[\Phi(\hat D, D) - \Phi(D, D)\right] + \left[-\mathrm{H}(\hat D, D) + \mathrm{H}(D, D)\right]$

($\hat D$ denotes the unknown (new) model setting, $D$ the current one)

slide-18
SLIDE 18

IR 2004 – Berlin Chen 18

HMM/N-gram-based Model (cont.)

  • Expectation-Maximization Training

– Step 1: Expectation (cont.)

  • The term $-\mathrm{H}(\hat D, D) + \mathrm{H}(D, D)$ has the following property:

$-\mathrm{H}(\hat D, D) + \mathrm{H}(D, D) = -\sum_{\mathbf{K}} P(\mathbf{K} \mid Q, D)\, \log P(\mathbf{K} \mid Q, \hat D) + \sum_{\mathbf{K}} P(\mathbf{K} \mid Q, D)\, \log P(\mathbf{K} \mid Q, D) = \sum_{\mathbf{K}} P(\mathbf{K} \mid Q, D)\, \log \dfrac{P(\mathbf{K} \mid Q, D)}{P(\mathbf{K} \mid Q, \hat D)} \ge -\sum_{\mathbf{K}} P(\mathbf{K} \mid Q, D) \left(\dfrac{P(\mathbf{K} \mid Q, \hat D)}{P(\mathbf{K} \mid Q, D)} - 1\right) = 0$

Jensen's inequality (note: $\log x \le x - 1$); the middle expression is the Kullback-Leibler (KL) distance $\sum_x P(x)\, \log \dfrac{P(x)}{q(x)}$

slide-19
SLIDE 19

IR 2004 – Berlin Chen 19

HMM/N-gram-based Model (cont.)

  • Expectation-Maximization Training

– Step 1: Expectation (cont.)

  • Therefore, for maximizing $\log P(Q \mid \hat D)$, we only need to maximize the Φ-function (auxiliary function)

$\Phi(\hat D, D) = \sum_{\mathbf{K}} P(\mathbf{K} \mid Q, D)\, \log P(Q, \mathbf{K} \mid \hat D)$

  • If the unigram model is used, the Φ-function can be further expressed as

$\Phi(\hat D, D) = \sum_{q_n \in Q} \sum_{k} P(k \mid q_n, D)\, \log P(q_n, k \mid \hat D)$

slide-20
SLIDE 20

IR 2004 – Berlin Chen 20

HMM/N-gram-based Model (cont.)

  • Expectation-Maximization Training

– Step 1: Expectation (cont.)

$\Phi(\hat D, D) = \sum_{q_n \in Q} \sum_{k} P(k \mid q_n, D)\, \log P(q_n, k \mid \hat D) = \sum_{q_n \in Q} \sum_{k} P(k \mid q_n, D)\, \log \left[P(k \mid \hat D)\, P(q_n \mid k, \hat D)\right] = \sum_{q_n \in Q} \sum_{k} \dfrac{m_k\, P(q_n \mid k, D)}{\sum_j m_j\, P(q_n \mid j, D)}\, \log \left[\hat m_k\, \hat P(q_n \mid k, D)\right]$

Auxiliary function; the empirical distribution uses the old weights $m_k = P(k \mid D)$, and the model uses the new weights $\hat m_k = P(k \mid \hat D)$

slide-21
SLIDE 21

IR 2004 – Berlin Chen 21

HMM/N-gram-based Model (cont.)

  • Expectation-Maximization Training

– Step 1: Expectation (cont.)

  • The Φ-function (auxiliary function) can be treated in two parts

$\Phi_{\hat m} = \sum_{q_n \in Q} \sum_{k} \dfrac{m_k\, P(q_n \mid k, D)}{\sum_j m_j\, P(q_n \mid j, D)}\, \log \hat m_k$

$\Phi_{\hat P(q \mid k, D)} = \sum_{q_n \in Q} \sum_{k} \dfrac{m_k\, P(q_n \mid k, D)}{\sum_j m_j\, P(q_n \mid j, D)}\, \log \hat P(q_n \mid k, D)$

The reestimation of the probabilities $\hat P(q \mid k, D)$ will not be discussed here!

slide-22
SLIDE 22

IR 2004 – Berlin Chen 22

HMM/N-gram-based Model (cont.)

  • Expectation-Maximization Training

– Step 2: Maximization

  • Apply a Lagrange multiplier

Suppose that $F = \sum_{j=1}^{N} w_j \log y_j + l \left(1 - \sum_{j=1}^{N} y_j\right)$, with the constraint $\sum_{j=1}^{N} y_j = 1$. Then $\dfrac{\partial F}{\partial y_j} = \dfrac{w_j}{y_j} - l = 0 \Rightarrow w_j = l\, y_j,\ \forall j \Rightarrow \sum_{j=1}^{N} w_j = l \sum_{j=1}^{N} y_j = l \Rightarrow y_j = \dfrac{w_j}{\sum_{j'=1}^{N} w_{j'}}$

Note: $\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}$

slide-23
SLIDE 23

IR 2004 – Berlin Chen 23

HMM/N-gram-based Model (cont.)

  • Expectation-Maximization Training

– Step 2: Maximization (cont.)

  • Apply the Lagrange multiplier to $\Phi_{\hat m}$ (normalization constraint $\sum_i \hat m_i = 1$):

$\Phi_{\hat m} = \sum_{q_n \in Q} \sum_{k} \dfrac{m_k\, P(q_n \mid k, D)}{\sum_j m_j\, P(q_n \mid j, D)}\, \log \hat m_k + l \left(1 - \sum_i \hat m_i\right)$

Assume $G_k = \sum_{q_n \in Q} \dfrac{m_k\, P(q_n \mid k, D)}{\sum_j m_j\, P(q_n \mid j, D)}$. Setting $\dfrac{\partial \Phi_{\hat m}}{\partial \hat m_k} = \dfrac{G_k}{\hat m_k} - l = 0$ gives $G_k = l\, \hat m_k$, hence $\hat m_k = \dfrac{G_k}{l} = \dfrac{G_k}{G_1 + G_2 + \cdots}$

Note: $\dfrac{\partial \log \hat m_k}{\partial \hat m_k} = \dfrac{1}{\hat m_k}$  (normalization constraints using Lagrange multipliers)

slide-24
SLIDE 24

IR 2004 – Berlin Chen 24

HMM/N-gram-based Model (cont.)

  • Expectation-Maximization Training

– Step 2: Maximization (cont.)
– Extension:

  • Multiple training queries for a doc
  • Weights are tied among docs

$\hat m_k = \dfrac{G_k}{l} = \dfrac{\sum_{q_n \in Q} \dfrac{m_k\, P(q_n \mid k, D)}{\sum_j m_j\, P(q_n \mid j, D)}}{\sum_s \sum_{q_n \in Q} \dfrac{m_s\, P(q_n \mid s, D)}{\sum_j m_j\, P(q_n \mid j, D)}} = \dfrac{1}{|Q|} \sum_{q_n \in Q} \dfrac{m_k\, P(q_n \mid k, D)}{\sum_j m_j\, P(q_n \mid j, D)}$

and, summing over the training queries and their relevant docs (weights tied):

$\hat m_k = \dfrac{\sum_{Q \in \text{TrainSet}_Q}\ \sum_{D \in \text{Doc}_{Q\text{-to-}R}}\ \sum_{q_n \in Q} \dfrac{m_k\, P(q_n \mid k, D)}{\sum_j m_j\, P(q_n \mid j, D)}}{\sum_{Q \in \text{TrainSet}_Q} |Q| \cdot |\text{Doc}_{Q\text{-to-}R}|}$

slide-25
SLIDE 25

IR 2004 – Berlin Chen 25

HMM/N-gram-based Model (cont.)

  • Experimental results with EM training

– HMM/N-gram-based approach vs. vector space model
– The HMM/N-gram-based approach is consistently better than the vector space model

HMM/N-gram-based approach (Average Precision):

                 Word-level                      Syllable-level
                 Uni      Uni+Bi   Uni+Bi*       Uni      Uni+Bi   Uni+Bi*
  TDT2  TQ/TD    0.6327   0.6069   0.5427        0.4698   0.5220   0.5718
  TDT2  TQ/SD    0.5658   0.5702   0.4803        0.4411   0.5011   0.5307
  TDT3  TQ/TD    0.6569   0.6542   0.6141        0.5343   0.5970   0.6560
  TDT3  TQ/SD    0.6308   0.6361   0.5808        0.5177   0.5678   0.6433

Vector space model (Average Precision):

                 Word-level                      Syllable-level
                 S(N), N=1   S(N), N=1~2         S(N), N=1   S(N), N=1~2
  TDT2  TQ/TD    0.5548      0.5623              0.3412      0.5254
  TDT2  TQ/SD    0.5122      0.5225              0.3306      0.5077
  TDT3  TQ/TD    0.6505      0.6531              0.3963      0.6502
  TDT3  TQ/SD    0.6216      0.6233              0.3708      0.6353

slide-26
SLIDE 26

IR 2004 – Berlin Chen 26

Review: The EM Algorithm

  • Introduction of EM (Expectation Maximization):

– Why EM?

  • Simple optimization algorithms for the likelihood function rely on the intermediate variables, called latent (隱藏的) data; in our case here, the state (mixture) sequence is the latent data
  • Direct access to the data necessary to estimate the parameters is impossible or difficult

– Two Major Steps:

  • E: expectation with respect to the latent data, using the current estimate of the parameters and conditioned on the observations
  • M: provides a new estimation of the parameters according to ML (or MAP)

slide-27
SLIDE 27

IR 2004 – Berlin Chen 27

Review: The EM Algorithm (cont.)

  • The EM Algorithm is important to HMMs and other learning techniques

– Discover new model parameters to maximize the log-likelihood of the incomplete data, $\log P(\mathbf{O} \mid \lambda)$, by iteratively maximizing the expectation of the log-likelihood of the complete data, $\log P(\mathbf{O}, \mathbf{S} \mid \lambda)$

  • Example

– The observable training data $\mathbf{O}$

  • We want to maximize $P(\mathbf{O} \mid \lambda)$, where $\lambda$ is a parameter vector

– The hidden (unobservable) data $\mathbf{S}$

  • E.g. the component densities of the observable data $\mathbf{O}$, or the underlying state sequence in HMMs

$\Phi(\lambda, \hat\lambda) = E_{\mathbf{S}}\left[\log P(\mathbf{O}, \mathbf{S} \mid \hat\lambda) \,\middle|\, \mathbf{O}, \lambda\right]$

slide-28
SLIDE 28

IR 2004 – Berlin Chen 28

HMM/N-gram-based Model (cont.)

  • Minimum Classification Error (MCE) Training

– Given a query $Q$ and a desired relevant doc $D^{*}$, define the classification error function as:

$E(Q, D^{*}) = -\log P(Q \mid D^{*} \text{ is } R) + \max_{D' \text{ is not } R} \log P(Q \mid D')$

“>0”: means misclassified; “<=0”: means a correct decision

– Transform the error function to the loss function

$L(Q, D^{*}) = \dfrac{1}{1 + \exp\left(-\alpha\, E(Q, D^{*}) + \beta\right)}$

  • In the range between 0 and 1

slide-29
SLIDE 29

IR 2004 – Berlin Chen 29

HMM/N-gram-based Model (cont.)

  • Minimum Classification Error (MCE) Training

– Apply the loss function to the MCE procedure for iteratively updating the weighting parameters

  • Constraints: $m_k \ge 0,\ \sum_k m_k = 1$
  • Parameter transformation (e.g., Type I HMM):

$m_1 = \dfrac{e^{\tilde m_1}}{e^{\tilde m_1} + e^{\tilde m_2}}$ and $m_2 = \dfrac{e^{\tilde m_2}}{e^{\tilde m_1} + e^{\tilde m_2}}$

– Iteratively update $\tilde m_1$ (e.g., Type I HMM) by gradient descent:

$\tilde m_1(i+1) = \tilde m_1(i) - \varepsilon \left.\dfrac{\partial L(Q, D^{*})}{\partial \tilde m_1}\right|_{\tilde m_1 = \tilde m_1(i)}$

  • Where

$\nabla_{D^{*}, \tilde m_1}(i) = \varepsilon \left.\dfrac{\partial L(Q, D^{*})}{\partial \tilde m_1}\right|_{\tilde m_1 = \tilde m_1(i)} = \varepsilon \cdot \dfrac{\partial L(Q, D^{*})}{\partial E(Q, D^{*})} \cdot \dfrac{\partial E(Q, D^{*})}{\partial \tilde m_1}$, with $\dfrac{\partial L(Q, D^{*})}{\partial E(Q, D^{*})} = \alpha \cdot L(Q, D^{*}) \cdot \left[1 - L(Q, D^{*})\right]$

Gradient descent

slide-30
SLIDE 30

IR 2004 – Berlin Chen 30

HMM/N-gram-based Model (cont.)

  • Minimum Classification Error (MCE) Training

– Iteratively update $\tilde m_1$ (e.g., Type I HMM): differentiating the (length-normalized) error function with respect to $\tilde m_1$ gives

$\dfrac{\partial E(Q, D^{*})}{\partial \tilde m_1} = -\dfrac{1}{|Q|} \sum_{q_n \in Q} \left[\dfrac{m_1\, P(q_n \mid D^{*})}{m_1\, P(q_n \mid D^{*}) + m_2\, P(q_n \mid \text{Corpus})} - m_1\right]$

Note: $\left[\log f(x)\right]' = \dfrac{f'(x)}{f(x)}$, $\quad \left[\dfrac{f(x)}{g(x)}\right]' = \dfrac{f'(x)\, g(x) - f(x)\, g'(x)}{g(x)^2}$, $\quad \left[f(x)\, g(x)\right]' = f'(x)\, g(x) + f(x)\, g'(x)$

slide-31
SLIDE 31

IR 2004 – Berlin Chen 31

HMM/N-gram-based Model (cont.)

  • Minimum Classification Error (MCE) Training

– Iteratively update $\tilde m_1$ (e.g., Type I HMM)

$\nabla_{D^{*}, \tilde m_1}(i) = -\varepsilon \cdot \alpha \cdot L(Q, D^{*}) \cdot \left[1 - L(Q, D^{*})\right] \cdot \dfrac{1}{|Q|} \sum_{q_n \in Q} \left[\dfrac{m_1(i)\, P(q_n \mid D^{*})}{m_1(i)\, P(q_n \mid D^{*}) + m_2(i)\, P(q_n \mid \text{Corpus})} - m_1(i)\right]$

$\tilde m_1(i+1) = \tilde m_1(i) - \nabla_{D^{*}, \tilde m_1}(i)$

and, transforming back to the original weights,

$m_1(i+1) = \dfrac{e^{\tilde m_1(i+1)}}{e^{\tilde m_1(i+1)} + e^{\tilde m_2(i+1)}} = \dfrac{e^{\tilde m_1(i) - \nabla_{D^{*}, \tilde m_1}(i)}}{e^{\tilde m_1(i) - \nabla_{D^{*}, \tilde m_1}(i)} + e^{\tilde m_2(i) - \nabla_{D^{*}, \tilde m_2}(i)}} = \dfrac{m_1(i)\, e^{-\nabla_{D^{*}, \tilde m_1}(i)}}{m_1(i)\, e^{-\nabla_{D^{*}, \tilde m_1}(i)} + m_2(i)\, e^{-\nabla_{D^{*}, \tilde m_2}(i)}}$

(the new weight $m_1(i+1)$ is obtained from the old weights $m_1(i)$, $m_2(i)$)

slide-32
SLIDE 32

IR 2004 – Berlin Chen 32

HMM/N-gram-based Model (cont.)

  • Minimum Classification Error (MCE) Training

– Final Equations

  • Iteratively update $m_1$:

$\nabla_{D^{*}, \tilde m_1}(i) = -\varepsilon \cdot \alpha \cdot L(Q, D^{*}) \cdot \left[1 - L(Q, D^{*})\right] \cdot \dfrac{1}{|Q|} \sum_{q_n \in Q} \left[\dfrac{m_1(i)\, P(q_n \mid D^{*})}{m_1(i)\, P(q_n \mid D^{*}) + m_2(i)\, P(q_n \mid \text{Corpus})} - m_1(i)\right]$

$m_1(i+1) = \dfrac{m_1(i)\, e^{-\nabla_{D^{*}, \tilde m_1}(i)}}{m_1(i)\, e^{-\nabla_{D^{*}, \tilde m_1}(i)} + m_2(i)\, e^{-\nabla_{D^{*}, \tilde m_2}(i)}}$

  • $m_2$ can be updated in a similar way
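The MCE update amounts to a sigmoid loss over the score difference plus a gradient step in the softmax-transformed weights. The following Python sketch (hypothetical, for a Type I model with only m1/m2; not the original implementation) shows one such update for a single training query; alpha, beta and the step size epsilon are tuning constants as in the slides, and only the relevant-document term of the gradient is shown, as above.

```python
import math

def mce_update(query_terms, rel_doc_uni, best_irrel_doc_uni, corpus_uni,
               m1, m2, alpha=1.0, beta=0.0, epsilon=0.1):
    """One MCE gradient step on the tied Type I weights (a sketch of the procedure above)."""
    def norm_loglik(doc_uni):
        # length-normalized log P(Q | D); assumes the corpus term keeps the mixture positive
        return sum(math.log(m1 * doc_uni.get(q, 0.0) + m2 * corpus_uni.get(q, 0.0))
                   for q in query_terms) / len(query_terms)

    # classification error and sigmoid loss
    err = -norm_loglik(rel_doc_uni) + norm_loglik(best_irrel_doc_uni)
    loss = 1.0 / (1.0 + math.exp(-alpha * err + beta))

    # dE/dm~1 for the relevant-document term of the error function
    dE = -sum((m1 * rel_doc_uni.get(q, 0.0) /
               (m1 * rel_doc_uni.get(q, 0.0) + m2 * corpus_uni.get(q, 0.0))) - m1
              for q in query_terms) / len(query_terms)
    grad = epsilon * alpha * loss * (1.0 - loss) * dE

    # update in the transformed (softmax) domain and map back onto the simplex;
    # for two weights the gradient w.r.t. m~2 is exactly the opposite of grad
    new_m1 = m1 * math.exp(-grad)
    new_m2 = m2 * math.exp(+grad)
    z = new_m1 + new_m2
    return new_m1 / z, new_m2 / z
```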

slide-33
SLIDE 33

IR 2004 – Berlin Chen 33

HMM/N-gram-based Model (cont.)

  • Experimental results with MCE training (iterations = 100)

– The results for the syllable-level index features were significantly improved

TDT2 Average Precision (values in parentheses: before MCE training):

                 Word-level Uni     Syllable-level Uni+Bi*     Fusion
  TQ/TD          0.6459 (0.6327)    0.6858 (0.5718)            0.7329
  TQ/SD          0.5810 (0.5658)    0.6300 (0.5307)            0.6914

[Figure: average precision vs. number of MCE iterations for word-based and syllable-based indexing, TQ/TD and TQ/SD, compared against the values before MCE training]

slide-34
SLIDE 34

IR 2004 – Berlin Chen 34

HMM/N-gram-based Model (cont.)

  • Advantages

– A formal mathematical framework
– Uses collection statistics rather than heuristics
– The retrieval system can be gradually improved through usage

  • Disadvantages

– Only literal term matching (or word overlap measure)

  • The issue of relevance or aboutness is not taken into consideration

– The implementation of relevance feedback or query expansion is not straightforward

slide-35
SLIDE 35

IR 2004 – Berlin Chen 35

Topical Mixture Model (TMM)

  • Perform Concept Matching in the Likelihood Space (under the likelihood criterion)

– Latent topical distributions are shared (tied) among docs

  • Various Theoretically Attractive Model Training Algorithms can be applied

– Maximum likelihood (e.g. EM) or discriminative (e.g. MCE or MMI) training

A document model $D_i$ generates the query $Q = q_1 q_2 \cdots q_N$ through a set of latent topics: topic weights $P(T_1 \mid D_i), P(T_2 \mid D_i), \ldots, P(T_K \mid D_i)$ and topic-conditioned word distributions $P(q_n \mid T_1), P(q_n \mid T_2), \ldots, P(q_n \mid T_K)$

$P(Q \mid D_i) \approx \prod_{n=1}^{N} \sum_{k=1}^{K} P(q_n \mid T_k)\, P(T_k \mid D_i)$
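A minimal Python sketch of this TMM scoring rule is shown below (illustrative, not the paper's implementation); p_w_given_t is a hypothetical list of topic-word tables P(q|T_k) shared by all documents, and p_t_given_d holds the per-document topic weights P(T_k|D_i).

```python
import math

def tmm_score(query_terms, p_w_given_t, p_t_given_d):
    """Log P(Q | D_i) = sum_n log sum_k P(q_n | T_k) P(T_k | D_i)  (a sketch)."""
    log_p = 0.0
    for q in query_terms:
        p = sum(p_w_given_t[k].get(q, 0.0) * p_t_given_d[k]
                for k in range(len(p_t_given_d)))
        log_p += math.log(p) if p > 0 else float("-inf")
    return log_p
```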

slide-36
SLIDE 36

IR 2004 – Berlin Chen 36

Topical Mixture Model (cont.)

  • EM: Supervised Training

– Given a training set of query exemplars with the corresponding query-document relevance information

$\hat P(q_n \mid T_k) = \dfrac{\sum_{Q \in \text{TrainSet}_Q}\ \sum_{D_i \in \text{Doc}_{Q\text{-to-}R}} n(q_n, Q)\, P(T_k \mid q_n, D_i)}{\sum_{Q \in \text{TrainSet}_Q}\ \sum_{D_i \in \text{Doc}_{Q\text{-to-}R}}\ \sum_{q_s \in Q} n(q_s, Q)\, P(T_k \mid q_s, D_i)}$

$\hat P(T_k \mid D_i) = \dfrac{\sum_{Q \in \text{TrainSet}_Q \text{ s.t. } D_i \in \text{Doc}_{Q\text{-to-}R}}\ \sum_{q_s \in Q} n(q_s, Q)\, P(T_k \mid q_s, D_i)}{\sum_{Q \in \text{TrainSet}_Q \text{ s.t. } D_i \in \text{Doc}_{Q\text{-to-}R}} |Q|}$

where $P(T_k \mid q_n, D_i) = \dfrac{P(q_n \mid T_k)\, P(T_k \mid D_i)}{\sum_{l=1}^{K} P(q_n \mid T_l)\, P(T_l \mid D_i)}$ and $n(q_n, Q)$ is the count of term $q_n$ in the query exemplar $Q$

slide-37
SLIDE 37

IR 2004 – Berlin Chen 37

Topical Mixture Model (cont.)

  • EM: Unsupervised Training

– Use each document itself as a query exemplar to train its own document mixture model

$\hat P(T_k \mid D_i) = \dfrac{\sum_{w_s \in D_i} n(w_s, D_i)\, P(T_k \mid w_s, D_i)}{|D_i|}$

$\hat P(w_n \mid T_k) = \dfrac{\sum_{D_i} n(w_n, D_i)\, P(T_k \mid w_n, D_i)}{\sum_{D_i} \sum_{w_s \in D_i} n(w_s, D_i)\, P(T_k \mid w_s, D_i)}$

where $P(T_k \mid w_n, D_i) = \dfrac{P(w_n \mid T_k)\, P(T_k \mid D_i)}{\sum_{l=1}^{K} P(w_n \mid T_l)\, P(T_l \mid D_i)}$, and $n(w_s, D_i)$ is the count of word $w_s$ in the doc $D_i$ (so $|D_i| = \sum_{w_s} n(w_s, D_i)$)
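The unsupervised update can be written directly from these formulas. Below is a compact NumPy sketch (an illustration under the stated assumptions, not the original code): counts is a term-by-document count matrix, and one EM iteration re-estimates P(w|T) and P(T|D).

```python
import numpy as np

def tmm_em_step(counts, p_w_t, p_t_d, eps=1e-12):
    """One unsupervised EM iteration for TMM (sketch).

    counts: (V, N) term-document counts n(w, D_i)
    p_w_t : (V, K) current P(w | T_k)
    p_t_d : (K, N) current P(T_k | D_i)
    """
    # E-step: posterior P(T_k | w, D_i) for every (w, D_i) pair
    joint = p_w_t[:, :, None] * p_t_d[None, :, :]            # (V, K, N)
    post = joint / (joint.sum(axis=1, keepdims=True) + eps)

    # M-step: accumulate posterior-weighted counts
    weighted = counts[:, None, :] * post                      # (V, K, N)
    new_p_t_d = weighted.sum(axis=0) / (counts.sum(axis=0, keepdims=True) + eps)
    new_p_w_t = weighted.sum(axis=2)
    new_p_w_t = new_p_w_t / (new_p_w_t.sum(axis=0, keepdims=True) + eps)
    return new_p_w_t, new_p_t_d
```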

slide-38
SLIDE 38

IR 2004 – Berlin Chen 38

Latent Semantic Indexing (LSI)

  • LSI: a technique that projects queries and docs into a space with “latent” semantic dimensions

– Co-occurring terms are projected onto the same dimensions
– In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
– Dimensions of the reduced space correspond to the axes of greatest variation

  • Closely related to Principal Component Analysis (PCA)
slide-39
SLIDE 39

IR 2004 – Berlin Chen 39

Latent Semantic Indexing (cont.)

  • Dimension Reduction and Feature Extraction

– PCA: project a feature vector $X$ onto an orthonormal basis $\{\phi_1, \ldots, \phi_n\}$, $y_i = \phi_i^{T} X$, and keep only the first $k$ components; the reconstruction $\hat X = \sum_{i=1}^{k} y_i\, \phi_i$ minimizes $\|X - \hat X\|^{2}$ for a given $k$ (feature space → latent space spanned by the orthonormal basis $\phi_1, \ldots, \phi_k$)

– SVD (in LSI): $A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^{T}_{r \times n}$ with $r \le \min(m, n)$; keeping only the $k$ largest singular values gives $A'_{m \times n} = U'_{m \times k}\, \Sigma_{k \times k}\, V'^{T}_{k \times n}$, which minimizes $\|A - A'\|_{F}^{2}$ for a given $k$ (the latent semantic space has $k$ dimensions)
slide-40
SLIDE 40

IR 2004 – Berlin Chen 40

Latent Semantic Indexing (cont.)

– Singular Value Decomposition (SVD) used for the word-document matrix

  • A least-squares method for dimension reduction

Projection of a vector $x$ onto an orthonormal basis $\{\varphi_1, \varphi_2, \ldots\}$: $y_1 = \varphi_1^{T} x = \|x\| \cos\theta_1$, where $\|\varphi_1\| = 1$ (and similarly $y_2 = \varphi_2^{T} x$)

slide-41
SLIDE 41

IR 2004 – Berlin Chen 41

Latent Semantic Indexing (cont.)

  • Frameworks to circumvent vocabulary mismatch

[Diagram: a query and a doc are each represented by terms; they can be related by literal term matching, by query expansion or doc expansion on the term level, or by mapping both into a structure model for latent semantic structure retrieval]

slide-42
SLIDE 42

IR 2004 – Berlin Chen 42

Latent Semantic Indexing (cont.)

slide-43
SLIDE 43

IR 2004 – Berlin Chen 43

Latent Semantic Indexing (cont.)

Query: “human computer interaction”

An OOV word

slide-44
SLIDE 44

IR 2004 – Berlin Chen 44

Latent Semantic Indexing (cont.)

  • Singular Value Decomposition (SVD)

$A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^{T}_{r \times n}$, where the rows of $A$ correspond to the words $w_1, w_2, \ldots, w_m$, the columns to the docs $d_1, d_2, \ldots, d_n$, and $r \le \min(m, n)$

$A'_{m \times n} = U'_{m \times k}\, \Sigma_{k \times k}\, V'^{T}_{k \times n}$, with $k \le r$

– Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
– Both U and V have orthonormal column vectors: $U^{T}U = I_{r \times r}$, $V^{T}V = I_{r \times r}$
– $\|A\|_{F}^{2} \ge \|A'\|_{F}^{2}$, where $\|A\|_{F}^{2} = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^{2}$
– The row space of A lies in $R^{n}$ and the column space of A lies in $R^{m}$
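The reduced-rank decomposition is a one-liner with NumPy; the sketch below (illustrative, with a made-up 4-term by 3-doc count matrix) keeps the k largest singular values and reconstructs A'.

```python
import numpy as np

# toy word-document count matrix A (4 terms x 3 docs), purely illustrative
A = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 3., 1.],
              [0., 1., 2.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt

k = 2                                              # keep the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k approximation A'

# ||A - A'||_F^2 equals the sum of the discarded squared singular values
print(np.allclose(np.sum((A - A_k) ** 2), np.sum(s[k:] ** 2)))   # True
```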

slide-45
SLIDE 45

IR 2004 – Berlin Chen 45

Latent Semantic Indexing (cont.)

  • Singular Value Decomposition (SVD)

– $A^{T}A$ is a symmetric n×n matrix

  • All eigenvalues $\lambda_j$ are nonnegative real numbers: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$, with eigenvalue matrix $\mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$
  • All eigenvectors $v_j$ are orthonormal (in $R^{n}$): $V = [v_1\ v_2\ \cdots\ v_n]$, $v_j^{T} v_j = 1$, $V^{T}V = I_{n \times n}$
  • Define singular values: $\sigma_j = \sqrt{\lambda_j}$, $j = 1, \ldots, n$

– As the square roots of the eigenvalues of $A^{T}A$
– As the lengths of the vectors $A v_1, A v_2, \ldots, A v_n$: $\sigma_1 = \|A v_1\|$, $\sigma_2 = \|A v_2\|$, …, since $\|A v_i\|^{2} = v_i^{T} A^{T} A\, v_i = \lambda_i\, v_i^{T} v_i = \lambda_i \Rightarrow \|A v_i\| = \sigma_i$

For $\lambda_i \ne 0$, $i = 1, \ldots, r$, $\{A v_1, A v_2, \ldots, A v_r\}$ is an orthogonal basis of Col A

slide-46
SLIDE 46

IR 2004 – Berlin Chen 46

Latent Semantic Indexing (cont.)

  • $\{A v_1, A v_2, \ldots, A v_r\}$ is an orthogonal basis of Col A

– Suppose that A (or $A^{T}A$) has rank $r \le n$: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0$, $\lambda_{r+1} = \cdots = \lambda_n = 0$
– Orthogonality: $(A v_i)^{T}(A v_j) = v_j^{T} A^{T} A\, v_i = \lambda_i\, v_j^{T} v_i = 0$ for $i \ne j$
– Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for Col A:

$u_i = \dfrac{1}{\|A v_i\|} A v_i = \dfrac{1}{\sigma_i} A v_i \;\Rightarrow\; \sigma_i u_i = A v_i \;\Rightarrow\; A\,[v_1\ v_2\ \cdots\ v_r] = [u_1\ u_2\ \cdots\ u_r]\,\Sigma$

  • Extend to an orthonormal basis $\{u_1, u_2, \ldots, u_m\}$ of $R^{m}$ (padding $\Sigma$ with zero rows as needed, $\Sigma_{m \times r}$):

$A\,[v_1\ v_2\ \cdots\ v_n] = [u_1\ u_2\ \cdots\ u_m]\,\Sigma \;\Rightarrow\; AV = U\Sigma \;\Rightarrow\; AVV^{T} = U\Sigma V^{T} \;\Rightarrow\; A = U\Sigma V^{T}$

(V is an orthonormal matrix, so $VV^{T} = I_{n \times n}$; $\Sigma$ and V are known in advance from the eigen-decomposition of $A^{T}A$)

$\|A\|_{F}^{2} = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^{2} = \sigma_1^{2} + \sigma_2^{2} + \cdots + \sigma_r^{2}$

slide-47
SLIDE 47

IR 2004 – Berlin Chen 47

Latent Semantic Indexing (cont.)

$A_{m \times n} = \begin{pmatrix} U_1 & U_2 \end{pmatrix} \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^{T} \\ V_2^{T} \end{pmatrix} = U_1 \Sigma_1 V_1^{T}$, so $A V_1 = U_1 \Sigma_1$

[Diagram: the SVD maps between $R^{n}$ and $R^{m}$; the $v_i$ (rows of $V^{T}$) span the row space of A, and the $u_i$ span the column space of A]

slide-48
SLIDE 48

IR 2004 – Berlin Chen 48

Latent Semantic Indexing (cont.)

  • Fundamental comparisons based on SVD

– The original word-document matrix (A), of size m×n with rows $w_1 \ldots w_m$ and columns $d_1 \ldots d_n$:

  • compare two terms → dot product of two rows of A (or an entry in $AA^{T}$)
  • compare two docs → dot product of two columns of A (or an entry in $A^{T}A$)
  • compare a term and a doc → each individual entry of A

– The new word-document matrix (A'), with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$:

  • compare two terms → dot product of two rows of $U'\Sigma'$, since $A'A'^{T} = (U'\Sigma'V'^{T})(U'\Sigma'V'^{T})^{T} = U'\Sigma'V'^{T}V'\Sigma'^{T}U'^{T} = (U'\Sigma')(U'\Sigma')^{T}$
  • compare two docs → dot product of two rows of $V'\Sigma'$, since $A'^{T}A' = (U'\Sigma'V'^{T})^{T}(U'\Sigma'V'^{T}) = V'\Sigma'^{T}U'^{T}U'\Sigma'V'^{T} = (V'\Sigma')(V'\Sigma')^{T}$
  • compare a query word and a doc → each individual entry of A'

($\Sigma'$ acts only by stretching or shrinking along the latent axes)

slide-49
SLIDE 49

IR 2004 – Berlin Chen 49

Latent Semantic Indexing (cont.)

  • Fold-in: find representations for pseudo-docs q

– For objects (new queries or docs) that did not appear in the original analysis

  • Fold-in a new m×1 query (or doc) vector:

$\hat q_{1 \times k} = q^{T}_{1 \times m}\; U_{m \times k}\; \Sigma^{-1}_{k \times k}$

(the query is represented by the weighted sum of its constituent term vectors; the separate dimensions are differentially weighted; the result behaves just like a row of V)

– Cosine measure between the query and doc vectors in the latent semantic space:

$\mathrm{sim}(\hat q, \hat d) = \mathrm{cosine}(\hat q\,\Sigma,\ \hat d\,\Sigma) = \dfrac{(\hat q\,\Sigma)(\hat d\,\Sigma)^{T}}{\|\hat q\,\Sigma\|\ \|\hat d\,\Sigma\|}$  ($\hat q$ and $\hat d$ are row vectors)
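In code, folding in a query and ranking documents by the latent-space cosine looks roughly like the sketch below (continuing the hypothetical NumPy variables U, s, Vt and k from the earlier SVD example; q is a raw m-dimensional term-count vector).

```python
import numpy as np

def fold_in_query(q, U, s, k):
    """q_hat = q^T U_k Sigma_k^{-1}  -> a k-dimensional row vector (sketch)."""
    return q @ U[:, :k] / s[:k]

def latent_cosine(q_hat, d_hat, s, k):
    """Cosine between query and doc in the latent space, with axes weighted by Sigma_k."""
    qs, ds = q_hat * s[:k], d_hat * s[:k]
    return float(qs @ ds / (np.linalg.norm(qs) * np.linalg.norm(ds)))

# document representations are the rows of V_k (i.e. Vt[:k, :].T); rank docs by latent_cosine
```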

slide-50
SLIDE 50

IR 2004 – Berlin Chen 50

Latent Semantic Indexing (cont.)

  • Fold-in a new 1×n term vector:

$\hat t_{1 \times k} = t_{1 \times n}\; V_{n \times k}\; \Sigma^{-1}_{k \times k}$

slide-51
SLIDE 51

IR 2004 – Berlin Chen 51

Latent Semantic Indexing (cont.)

  • Experimental results

– HMM is consistently better than VSM at all recall levels
– LSI is better than VSM at higher recall levels

[Figure: Recall-Precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection, using word-level indexing terms]

slide-52
SLIDE 52

IR 2004 – Berlin Chen 52

Latent Semantic Indexing (LSI)

  • Advantages

– A clean formal framework and a clearly defined optimization criterion (least-squares)

  • Conceptual simplicity and clarity

– Handles synonymy problems (“heterogeneous vocabulary”)
– Good results for high-recall search

  • Takes term co-occurrence into account

  • Disadvantages

– High computational complexity
– LSI offers only a partial solution to polysemy

  • E.g. bank, bass, …
slide-53
SLIDE 53

IR 2004 – Berlin Chen 53

Probabilistic Latent Semantic Analysis (PLSA)

  • Also called the Aspect Model or Probabilistic Latent Semantic Indexing (PLSI) (Thomas Hofmann, 1999)

– Graphical Model Representation

Without latent topics: $D \rightarrow Q$ with parameters $P(D_i)$ and $P(w_n \mid D_i)$. With latent topics: $D \rightarrow T \rightarrow Q$ with parameters $P(D_i)$, $P(T_k \mid D_i)$ and $P(w_n \mid T_k)$. The latent variables are the unobservable class variables $T_k$ (topics or domains).

$\mathrm{sim}(Q, D_i) = P(D_i \mid Q) = \dfrac{P(Q \mid D_i)\, P(D_i)}{P(Q)} \approx P(Q \mid D_i)\, P(D_i) \approx P(Q \mid D_i)$

$\mathrm{sim}(Q, D_i) \approx P(Q \mid D_i) = \prod_{w_j \in Q} P(w_j \mid D_i) = \prod_{w_j \in Q} \left[\sum_{k=1}^{K} P(w_j \mid T_k)\, P(T_k \mid D_i)\right]$

slide-54
SLIDE 54

IR 2004 – Berlin Chen 54

Probabilistic Latent Semantic Analysis (cont.)

  • Definition

– $P(D_i)$: the prob. of selecting a doc $D_i$
– $P(T_k \mid D_i)$: the prob. of picking a latent class $T_k$ for the doc $D_i$
– $P(w_j \mid T_k)$: the prob. of generating a word $w_j$ from the class $T_k$

slide-55
SLIDE 55

IR 2004 – Berlin Chen 55

Probabilistic Latent Semantic Analysis (cont.)

  • Assumptions

– Bag-of-words: treat docs as memoryless sources, words are generated independently
– Conditional independence: the doc $D_i$ and word $w_j$ are independent conditioned on the state of the associated latent variable $T_k$:

$P(w_j, D_i \mid T_k) = P(w_j \mid T_k)\, P(D_i \mid T_k)$

$P(w_j \mid D_i) = \dfrac{P(w_j, D_i)}{P(D_i)} = \dfrac{\sum_{k=1}^{K} P(w_j, D_i, T_k)}{P(D_i)} = \dfrac{\sum_{k=1}^{K} P(w_j, D_i \mid T_k)\, P(T_k)}{P(D_i)} = \dfrac{\sum_{k=1}^{K} P(w_j \mid T_k)\, P(D_i \mid T_k)\, P(T_k)}{P(D_i)} = \sum_{k=1}^{K} P(w_j \mid T_k)\, \dfrac{P(D_i \mid T_k)\, P(T_k)}{P(D_i)} = \sum_{k=1}^{K} P(w_j \mid T_k)\, P(T_k \mid D_i)$

$\mathrm{sim}(Q, D_i) \approx P(Q \mid D_i) = \prod_{w_j \in Q} P(w_j \mid D_i)$

Can be viewed as the topics being tied among the document HMMs

slide-56
SLIDE 56

IR 2004 – Berlin Chen 56

Probabilistic Latent Semantic Analysis (cont.)

  • Probability estimation using the EM (expectation-maximization) algorithm: unsupervised training (without the introduction of query exemplars for training)

– E (expectation) step

  • Define the auxiliary function as the expectation of the complete-data log-likelihood:

$\Phi = E\left[L_C\right] = \sum_{D_i} \sum_{w_j} n(w_j, D_i)\; E_{T_k \mid w_j, D_i}\left[\log \hat P(w_j, T_k \mid D_i)\right] = \sum_{D_i} \sum_{w_j} n(w_j, D_i) \sum_{T_k} P(T_k \mid w_j, D_i)\, \log \hat P(w_j, T_k \mid D_i)$

  • With the property $\hat P(w_j, T_k \mid D_i) = \hat P(w_j \mid T_k)\, \hat P(T_k \mid D_i)$, so

$\Phi = \sum_{D_i} \sum_{w_j} n(w_j, D_i) \sum_{T_k} P(T_k \mid w_j, D_i)\, \log\left[\hat P(w_j \mid T_k)\, \hat P(T_k \mid D_i)\right]$

($n(w_j, D_i)$ is the empirical count of word $w_j$ in doc $D_i$; $P(T_k \mid w_j, D_i)$ is computed with the current model, and the hatted terms belong to the new model)

slide-57
SLIDE 57

IR 2004 – Berlin Chen 57

Probabilistic Latent Semantic Analysis (cont.)

  • Probability estimation using the EM (expectation-maximization) algorithm

– E (expectation) step (cont.)

  • The posterior $\hat P(T_k \mid w_j, D_i)$ can be further decomposed as

$\hat P(T_k \mid w_j, D_i) = \dfrac{\hat P(w_j, T_k \mid D_i)}{\hat P(w_j \mid D_i)} = \dfrac{\hat P(w_j \mid T_k)\, \hat P(T_k \mid D_i)}{\sum_{T_{k'}} \hat P(w_j \mid T_{k'})\, \hat P(T_{k'} \mid D_i)}$

  • The auxiliary function then reads

$\Phi = \sum_{D_i} \sum_{w_j} n(w_j, D_i) \sum_{T_k} \dfrac{P(w_j \mid T_k)\, P(T_k \mid D_i)}{\sum_{T_{k'}} P(w_j \mid T_{k'})\, P(T_{k'} \mid D_i)}\, \log\left[\hat P(w_j \mid T_k)\, \hat P(T_k \mid D_i)\right]$

(closely related to the Kullback-Leibler divergence between the empirical distribution and the model)

slide-58
SLIDE 58

IR 2004 – Berlin Chen 58

Probabilistic Latent Semantic Analysis (cont.)

  • Probability estimation using EM

– M (maximization) step: add the normalization constraints using Lagrange multipliers

$\Phi' = E\left[L_C\right] + \sum_{T_k} \tau_k \left(1 - \sum_{w_j} \hat P(w_j \mid T_k)\right) + \sum_{D_i} \rho_i \left(1 - \sum_{T_k} \hat P(T_k \mid D_i)\right)$

which splits into

$\Phi'_{\hat P(w_j \mid T_k)} = \sum_{D_i} \sum_{w_j} n(w_j, D_i) \sum_{T_k} P(T_k \mid w_j, D_i)\, \log \hat P(w_j \mid T_k) + \sum_{T_k} \tau_k \left(1 - \sum_{w_j} \hat P(w_j \mid T_k)\right)$

$\Phi'_{\hat P(T_k \mid D_i)} = \sum_{D_i} \sum_{w_j} n(w_j, D_i) \sum_{T_k} P(T_k \mid w_j, D_i)\, \log \hat P(T_k \mid D_i) + \sum_{D_i} \rho_i \left(1 - \sum_{T_k} \hat P(T_k \mid D_i)\right)$

slide-59
SLIDE 59

IR 2004 – Berlin Chen 59

Probabilistic Latent Semantic Analysis (cont.)

  • Probability estimation using EM

– M (maximization) step (cont.)

  • Take the differentiation with respect to each parameter and solve; the training formulas are

$\hat P(w_j \mid T_k) = \dfrac{\sum_{D_i} n(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{w_{j'}} \sum_{D_i} n(w_{j'}, D_i)\, P(T_k \mid w_{j'}, D_i)}$

$\hat P(T_k \mid D_i) = \dfrac{\sum_{w_j} n(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{T_{k'}} \sum_{w_j} n(w_j, D_i)\, P(T_{k'} \mid w_j, D_i)} = \dfrac{\sum_{w_j} n(w_j, D_i)\, P(T_k \mid w_j, D_i)}{n(D_i)}$

where $n(D_i) = \sum_{w_j} n(w_j, D_i)$ is the length of doc $D_i$

slide-60
SLIDE 60

IR 2004 – Berlin Chen 60

Probabilistic Latent Semantic Analysis (cont.)

  • Latent Probability Space

[Figure: example PLSA latent factors, e.g. image sequence analysis, medical imaging, context of contour/boundary detection, phonetic segmentation; dimensionality K = 128 latent classes]

$P(w_j, D_i) = \sum_{T_k} P(w_j, D_i, T_k) = \sum_{T_k} P(w_j, D_i \mid T_k)\, P(T_k) = \sum_{T_k} P(w_j \mid T_k)\, P(T_k)\, P(D_i \mid T_k)$

In matrix form, $P(W, D) = \hat U\, \hat\Sigma\, \hat V^{T}$, where $\hat U := \left(P(w_j \mid T_k)\right)_{j,k}$, $\hat\Sigma := \mathrm{diag}\left(P(T_k)\right)_k$, and $\hat V := \left(P(D_i \mid T_k)\right)_{i,k}$
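This SVD-like reading of PLSA can be checked numerically. The NumPy sketch below (purely illustrative; p_w_t, p_t_d and p_d are hypothetical trained PLSA tables) rebuilds the factors from the conditional parameterization and confirms that their product equals the joint matrix P(w_j, D_i).

```python
import numpy as np

def plsa_as_factorization(p_w_t, p_t_d, p_d):
    """p_w_t: (V, K) = P(w|T), p_t_d: (K, N) = P(T|D), p_d: (N,) = P(D)."""
    p_t = p_t_d @ p_d                                  # P(T_k) = sum_i P(T_k|D_i) P(D_i)
    p_d_t = (p_t_d * p_d[None, :]).T / p_t[None, :]    # (N, K): P(D_i|T_k) by Bayes' rule
    U_hat, Sigma_hat, V_hat = p_w_t, np.diag(p_t), p_d_t
    # P(w_j, D_i) = sum_k P(w_j|T_k) P(T_k|D_i) P(D_i)
    joint = p_w_t @ (p_t_d * p_d[None, :])
    assert np.allclose(U_hat @ Sigma_hat @ V_hat.T, joint)
    return U_hat, Sigma_hat, V_hat
```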

slide-61
SLIDE 61

IR 2004 – Berlin Chen 61

Probabilistic Latent Semantic Analysis (cont.)

  • Latent Probability Space

[Figure: the m×n matrix $P = \left(P(w_j, D_i)\right)$, with rows $w_1 \ldots w_m$ and columns $D_1 \ldots D_n$, factored as $U_{m \times k}\, \Sigma_k\, V^{T}$ over the latent topics $t_1, \ldots, t_K$]

$P(w_j, D_i) = \sum_{T_k} P(w_j \mid T_k)\, P(T_k)\, P(D_i \mid T_k)$

slide-62
SLIDE 62

IR 2004 – Berlin Chen 62

Probabilistic Latent Semantic Analysis (cont.)

  • One more example on TDT1 dataset

[Figure: example latent topics, e.g. aviation / space missions and family / love / Hollywood]

slide-63
SLIDE 63

IR 2004 – Berlin Chen 63

Probabilistic Latent Semantic Analysis (cont.)

  • Comparison with LSI

– Decomposition/Approximation

  • LSI: least-squares criterion measured on the L2 (Frobenius) norm of the word-doc matrices
  • PLSA: maximization of the likelihood function, based on the cross entropy or Kullback-Leibler divergence between the empirical distribution and the model

– Computational complexity

  • LSI: SVD decomposition
  • PLSA: EM training, which is time-consuming over the iterations (?)
slide-64
SLIDE 64

IR 2004 – Berlin Chen 64

Probabilistic Latent Semantic Analysis (cont.)

  • Experimental Results: PLSI-U*

– Two ways to smooth the empirical distribution with PLSI

  • Combine the cosine score with that of the vector space model (so does LSI)
  • Combine the multinomials individually:

$P_{\text{PLSI-U}^{*}}(w_j \mid d_i) = \lambda\, P_{\text{Empirical}}(w_j \mid d_i) + (1 - \lambda)\, P_{\text{PLSA}}(w_j \mid d_i)$, where $P_{\text{Empirical}}(w_j \mid d_i) = \dfrac{n(w_j, d_i)}{n(d_i)}$

Both provide almost identical performance
– It is not known if PLSA was used alone

slide-65
SLIDE 65

IR 2004 – Berlin Chen 65

Probabilistic Latent Semantic Analysis (cont.)

  • Experimental Results: PLSI-Q*

– Use the low-dimensional representations $P(T_k \mid Q)$ and $P(T_k \mid D_i)$ (viewed in a k-dimensional latent space) to evaluate relevance by means of the cosine measure:

$R_{\text{PLSI-Q}^{*}}(Q, D_i) = \dfrac{\sum_{k} P(T_k \mid Q)\, P(T_k \mid D_i)}{\sqrt{\sum_{k} P(T_k \mid Q)^{2}}\ \sqrt{\sum_{k} P(T_k \mid D_i)^{2}}}$

where $P(T_k \mid Q) = \dfrac{\sum_{q_n \in Q} n(q_n, Q)\, P(T_k \mid q_n, Q)}{|Q|}$; queries are folded in online

– Combine the cosine score with that of the vector space model:

$\tilde R_{\text{PLSI-Q}^{*}}(Q, D_i) = \lambda \cdot R_{\text{PLSI}}(Q, D_i) + (1 - \lambda) \cdot R_{\text{VSM}}(Q, D_i)$

– Use the ad hoc approach to re-weight the different model components (dimensions)
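The PLSI-Q* measure is just a cosine between two K-dimensional topic-posterior vectors. A small Python sketch (illustrative only) is shown below; the folded-in query posteriors P(T_k|Q) are assumed to be given, since in practice they come from a few folding-in EM iterations.

```python
import math

def plsi_q_relevance(p_t_q, p_t_d):
    """Cosine between P(T|Q) and P(T|D_i) in the K-dimensional latent topic space."""
    dot = sum(a * b for a, b in zip(p_t_q, p_t_d))
    nq = math.sqrt(sum(a * a for a in p_t_q))
    nd = math.sqrt(sum(b * b for b in p_t_d))
    return dot / (nq * nd) if nq and nd else 0.0

def combined_score(r_plsi, r_vsm, lam=0.5):
    """Interpolation with the vector space model score, as on the slide."""
    return lam * r_plsi + (1 - lam) * r_vsm
```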
slide-66
SLIDE 66

IR 2004 – Berlin Chen 66

Probabilistic Latent Semantic Analysis (cont.)

  • Why ?

– Reminder: in LSI, the relations between any two docs can be formulated as $A'^{T}A' = (V'\Sigma')(V'\Sigma')^{T}$, i.e.

$\mathrm{sim}(\hat D_i, \hat D_s) = \mathrm{cosine}(\hat D_i\,\Sigma,\ \hat D_s\,\Sigma) = \dfrac{(\hat D_i\,\Sigma)(\hat D_s\,\Sigma)^{T}}{\|\hat D_i\,\Sigma\|\ \|\hat D_s\,\Sigma\|}$  ($\hat D_i$ and $\hat D_s$ are row vectors)

– PLSA mimics LSI in the similarity measure: using $P(D_i \mid T_k)\, P(T_k) = P(T_k \mid D_i)\, P(D_i)$,

$R_{\text{PLSI-Q}^{*}}(D_i, D_s) = \dfrac{\sum_{k} \left[P(D_i \mid T_k)\, P(T_k)\right]\left[P(D_s \mid T_k)\, P(T_k)\right]}{\sqrt{\sum_{k} \left[P(D_i \mid T_k)\, P(T_k)\right]^{2}}\ \sqrt{\sum_{k} \left[P(D_s \mid T_k)\, P(T_k)\right]^{2}}} = \dfrac{\sum_{k} P(T_k \mid D_i)\, P(D_i)\, P(T_k \mid D_s)\, P(D_s)}{\sqrt{\sum_{k} \left[P(T_k \mid D_i)\, P(D_i)\right]^{2}}\ \sqrt{\sum_{k} \left[P(T_k \mid D_s)\, P(D_s)\right]^{2}}} = \dfrac{\sum_{k} P(T_k \mid D_i)\, P(T_k \mid D_s)}{\sqrt{\sum_{k} P(T_k \mid D_i)^{2}}\ \sqrt{\sum_{k} P(T_k \mid D_s)^{2}}}$

slide-67
SLIDE 67

IR 2004 – Berlin Chen 67

Probabilistic Latent Semantic Analysis (cont.)

  • Experimental Results
slide-68
SLIDE 68

IR 2004 – Berlin Chen 68

Comparisons

  • TDT-3 Voice of America Spoken Document Collection

– Measured in mean Average Precision (mAP), using both word- and syllable-level indexing features & MCE training; TMM and PLSA use 256 topics

  Retrieval Model    HMM       VSM       LSI       TMM       PLSA
  TD                 0.7174    0.6505    0.6440    0.7870    0.6513
  SD                 0.7156    0.6216    0.6390    0.7852    0.5989