SLIDE 1

CS54701: Information Retrieval

Retrieval Models: Language Models

Luo Si
Department of Computer Science, Purdue University

SLIDE 2

Retrieval Model: Language Model

• Introduction to language models
• Unigram language models
• Document language model estimation
  – Maximum likelihood estimation
  – Maximum a posteriori estimation
  – Jelinek-Mercer smoothing
• Model-based feedback

SLIDE 3

Language Models: Motivation

• Vector space model for information retrieval
  – Documents and queries are vectors in the term space
  – Relevance is measured by the similarity between document vectors and the query vector

• Problems with the vector space model
  – Ad-hoc term weighting schemes
  – Ad-hoc similarity measurement
  – No justification of the relationship between relevance and similarity

We need more principled retrieval models…

SLIDE 4

Introduction to Language Models

• A language model can be created for any language sample
  – A document
  – A collection of documents
  – A sentence, paragraph, chapter, query…

• The size of the language sample affects the quality of the language model
  – Long documents yield more accurate models
  – Short documents yield less accurate models
  – Models for a sentence, paragraph, or query may not be reliable
SLIDE 5

Introduction to Language Models

• A document language model defines a probability distribution over indexed terms
  – E.g., the probability of generating a term
  – The probabilities sum to 1

• A query can be seen as observed data from an unknown model
  – A query also defines a language model (more on this later)

• How might the models be used for IR?
  – Rank documents by Pr(q | d_i)
  – Rank documents by the Kullback-Leibler (KL) divergence between the language models of q and d_i (covered later)

SLIDE 6

Language Model for IR: Example

Estimate a language model for each document:

  d1: sport, basketball, ticket, sport
  d2: basketball, ticket, finance, ticket, sport
  d3: stock, finance, finance, stock

Given the query q = "sport, basketball", estimate the generation probability Pr(q | d_i) for each document and use it to generate the retrieval results.

SLIDE 7

Language Models

Three basic problems for language models:

• What type of probability distribution can be used to construct language models?
• How do we estimate the parameters of the distribution of the language models?
• How do we compute the likelihood of generating queries given the language models of documents?

SLIDE 8

Multinomial/Unigram Language Models

A language model built from a multinomial distribution over single terms (i.e., unigrams) in the vocabulary.

Example: five words in the vocabulary (sport, basketball, ticket, finance, stock). For a document d_i, its language model is:

  {P_i("sport"), P_i("basketball"), P_i("ticket"), P_i("finance"), P_i("stock")}

Formally, the language model of d_i is {P_i(w) for every word w in the vocabulary V}, with

  0 ≤ P_i(w_k) ≤ 1  and  Σ_k P_i(w_k) = 1

SLIDE 9

Multinomial/Unigram Language Models

Estimate a multinomial (unigram) language model for each document:

  d1: sport, basketball, ticket, sport            → multinomial model for d1
  d2: basketball, ticket, finance, ticket, sport  → multinomial model for d2
  d3: stock, finance, finance, stock              → multinomial model for d3

SLIDE 10

Maximum Likelihood Estimation (MLE)

Maximum likelihood estimation: find the model parameters that maximize the generation likelihood:

  M* = argmax_M Pr(d_i | M)

• There are K words in the vocabulary, w_1, …, w_K (e.g., K = 5)
• Data: one document d_i with term counts tf_i(w_1), …, tf_i(w_K) and length |d_i|
• Model: a multinomial distribution M with parameters {p_i(w_k)}
• Likelihood: Pr(d_i | M)

SLIDE 11

Maximum Likelihood Estimation (MLE)

  Pr(d_i | M) ∝ p_i(w_1)^{tf_i(w_1)} · … · p_i(w_K)^{tf_i(w_K)}

  l(d_i | M) = log Pr(d_i | M) = Σ_k tf_i(w_k) log p_i(w_k)

Use the Lagrange multiplier approach for the constraint Σ_k p_i(w_k) = 1:

  l' = Σ_k tf_i(w_k) log p_i(w_k) + λ (Σ_k p_i(w_k) − 1)

Set the partial derivatives to zero:

  ∂l'/∂p_i(w_k) = tf_i(w_k)/p_i(w_k) + λ = 0  ⇒  p_i(w_k) ∝ tf_i(w_k)

Since Σ_k p_i(w_k) = 1, the maximum likelihood estimate is

  p_i*(w_k) = tf_i(w_k) / |d_i|

SLIDE 12

Maximum Likelihood Estimation (MLE)

Estimate the language model of each document by MLE, with parameter order (p_sp, p_b, p_t, p_f, p_st):

  d1: sport, basketball, ticket, sport            → (0.5, 0.25, 0.25, 0, 0)
  d2: basketball, ticket, finance, ticket, sport  → (0.2, 0.2, 0.4, 0.2, 0)
  d3: stock, finance, finance, stock              → (0, 0, 0, 0.5, 0.5)
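As an illustration (not from the original slides), here is a minimal Python sketch of the MLE estimator p_i(w) = tf_i(w)/|d_i| on the three example documents; the function name mle_unigram is our own.

```python
from collections import Counter

def mle_unigram(doc_terms, vocabulary):
    """Maximum likelihood estimate: p_i(w) = tf_i(w) / |d_i|."""
    tf = Counter(doc_terms)
    return {w: tf[w] / len(doc_terms) for w in vocabulary}

vocab = ["sport", "basketball", "ticket", "finance", "stock"]
d1 = ["sport", "basketball", "ticket", "sport"]
d2 = ["basketball", "ticket", "finance", "ticket", "sport"]
d3 = ["stock", "finance", "finance", "stock"]

for name, doc in [("d1", d1), ("d2", d2), ("d3", d3)]:
    print(name, mle_unigram(doc, vocab))
# d1 -> {'sport': 0.5, 'basketball': 0.25, 'ticket': 0.25, 'finance': 0.0, 'stock': 0.0}
```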

SLIDE 13

Maximum Likelihood Estimation (MLE)

Maximum likelihood estimation assigns zero probability to words unseen in a small sample.

A specific example: only two words in the vocabulary (w1 = sport, w2 = business), like (head, tail) for a coin; a document generates a sequence of the two words, like flipping the coin many times.

  Pr(d_i | M) ∝ p_i(w_1)^{tf_i(w_1)} · (1 − p_i(w_1))^{tf_i(w_2)}

Observe only two words (flip the coin twice); the MLE estimates are:
  "business sport"     → P_i(w_1) = 0.5
  "sport sport"        → P_i(w_1) = 1 ?
  "business business"  → P_i(w_1) = 0 ?

SLIDE 14

Maximum Likelihood Estimation (MLE)

The same example: observe only two words (flip the coin twice); the MLE estimates are:
  "business sport"     → P_i(w_1)* = 0.5
  "sport sport"        → P_i(w_1)* = 1 ?
  "business business"  → P_i(w_1)* = 0 ?

This is the data sparseness problem.

SLIDE 15

Solutions to the Sparse Data Problem

• Maximum a posteriori (MAP) estimation
• Shrinkage
• Bayesian ensemble approach

SLIDE 16

Maximum A Posteriori (MAP) Estimation

Maximum a posteriori estimation: select the model that maximizes the probability of the model given the observed data:

  M* = argmax_M Pr(M | d_i) = argmax_M Pr(d_i | M) Pr(M)

• Pr(M): prior belief/knowledge
• Use the prior Pr(M) to avoid zero probabilities

A specific example: only two words in the vocabulary (sport, business). For a document d_i:

  Pr(M | d_i) ∝ p_i(w_1)^{tf_i(w_1)} · p_i(w_2)^{tf_i(w_2)} · Pr(M)

where Pr(M) is the prior distribution.

SLIDE 17

Maximum A Posteriori (MAP) Estimation

Introduce a prior on the multinomial distribution:

• Use the prior Pr(M) to avoid zero probabilities; most coins are more or less unbiased
• Use a Dirichlet prior on p(w):

  Dir(p_i | α_1, …, α_K) = [Γ(Σ_{k=1}^K α_k) / Π_{k=1}^K Γ(α_k)] · Π_{k=1}^K p_i(w_k)^{α_k − 1},
  with Σ_k p_i(w_k) = 1 and 0 ≤ p_i(w_k) ≤ 1

The α_k are hyper-parameters; the ratio of Gamma functions is a normalizing constant with respect to p_i. Γ(x) is the Gamma function:

  Γ(x) = ∫_0^∞ t^{x−1} e^{−t} dt,  and  Γ(n) = (n − 1)!  if n is a positive integer
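To build intuition, here is a small sketch (our own, not from the slides) using NumPy's Dirichlet sampler; the choice α = (3, 3) matches the two-word prior used on the next slide, where each α_k − 1 = 2 acts as a pseudo count.

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric Dirichlet prior over a 2-word vocabulary (equivalently, a Beta
# distribution). alpha = (3, 3) corresponds to Pr(M) ∝ p^2 (1 - p)^2.
samples = rng.dirichlet(alpha=[3.0, 3.0], size=5)
print(samples)                 # each row is a valid multinomial parameter vector
print(samples.sum(axis=1))     # rows sum to 1; values concentrate near 0.5
```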

SLIDE 18

Maximum A Posteriori (MAP) Estimation

For the two-word example, use a Dirichlet (here, Beta) prior with α_1 = α_2 = 3:

  Pr(M) ∝ p_i(w_1)^2 · (1 − p_i(w_1))^2

[Plot: the prior density P(w_1)^2 (1 − P(w_1))^2, peaked at P(w_1) = 0.5]

SLIDE 19

Maximum A Posteriori (MAP) Estimation

Maximum a posteriori:

  M* = argmax_M Pr(M | d_i) = argmax_M Pr(d_i | M) Pr(M)

  Pr(d_i | M) Pr(M) ∝ p_i(w_1)^{tf_i(w_1)} (1 − p_i(w_1))^{tf_i(w_2)} · p_i(w_1)^{α_1 − 1} (1 − p_i(w_1))^{α_2 − 1}
                    = p_i(w_1)^{tf_i(w_1) + α_1 − 1} (1 − p_i(w_1))^{tf_i(w_2) + α_2 − 1}

  M* = argmax_M p_i(w_1)^{tf_i(w_1) + α_1 − 1} (1 − p_i(w_1))^{tf_i(w_2) + α_2 − 1}

The α_k − 1 terms act as pseudo counts.

SLIDE 20

Maximum A Posteriori (MAP) Estimation

A specific example. Observe only two words (flip a coin twice): "sport sport". Is P_i(w_1)* = 1?

[Plot: the posterior is the likelihood multiplied by the prior density P(w_1)^2 (1 − P(w_1))^2]

SLIDE 21

Maximum A Posteriori (MAP) Estimation

The same example. Observe only two words (flip a coin twice): "sport sport". Is P_i(w_1)* = 1? With α_1 = α_2 = 3:

  p_i(w_1)* = (tf_i(w_1) + α_1 − 1) / (tf_i(w_1) + tf_i(w_2) + α_1 + α_2 − 2)
            = (2 + 3 − 1) / (2 + 0 + 3 + 3 − 2) = 4/6 = 2/3

SLIDE 22

MAP Estimation

Maximum a posteriori estimation for the unigram language model:

• Use a Dirichlet prior for the multinomial distribution
• How do we set the parameters of the Dirichlet prior?

SLIDE 23

MAP Estimation

Maximum a posteriori estimation for the unigram language model: use a Dirichlet prior for the multinomial distribution. There are K terms in the vocabulary:

  Multinomial: p_i = {p_i(w_1), …, p_i(w_K)},  Σ_k p_i(w_k) = 1,  0 ≤ p_i(w_k) ≤ 1

  Dir(p_i | α_1, …, α_K) = [Γ(Σ_{k=1}^K α_k) / Π_{k=1}^K Γ(α_k)] · Π_{k=1}^K p_i(w_k)^{α_k − 1}

The α_k are hyper-parameters; the Gamma-function ratio is a normalizing constant with respect to p_i.

SLIDE 24

MAP Estimation

MAP estimation for the unigram language model:

  p_i* = argmax_{p_i} Dir(p_i | α_1, …, α_K) · Π_{k=1}^K p_i(w_k)^{tf_i(w_k)}
       = argmax_{p_i} Π_{k=1}^K p_i(w_k)^{tf_i(w_k) + α_k − 1},   s.t. Σ_k p_i(w_k) = 1, 0 ≤ p_i(w_k) ≤ 1

Use a Lagrange multiplier and set the derivative to 0:

  p_i*(w_k) = (tf_i(w_k) + α_k − 1) / Σ_{k'} (tf_i(w_{k'}) + α_{k'} − 1)

The α_k − 1 are pseudo counts set by the hyper-parameters.

SLIDE 25

MAP Estimation

MAP estimation for the unigram language model:

  p_i*(w_k) = (tf_i(w_k) + α_k − 1) / Σ_{k'} (tf_i(w_{k'}) + α_{k'} − 1)

How do we determine appropriate values for the hyper-parameters? When nothing is observed from a document (all tf_i(w_k) = 0):

  p_i*(w_k) = (α_k − 1) / Σ_{k'} (α_{k'} − 1)

What is the most likely p_i(w_k) without looking at the content of the document?

SLIDE 26

MAP Estimation

The most likely p_i(w_k) without looking into the content of document d is the unigram probability of the collection:

  {p(w_1 | c), p(w_2 | c), …, p(w_K | c)}

Without any other information, guess the behavior of one member from the behavior of the whole population. So set

  α_k − 1 = μ · p(w_k | c),  where μ is a constant,

which gives p_i*(w_k) = μ p(w_k | c) / Σ_{k'} μ p(w_{k'} | c) = p(w_k | c) when nothing is observed.

SLIDE 27

MAP Estimation

MAP estimation for the unigram language model, with α_k = 1 + μ p_c(w_k):

  p_i* = argmax_{p_i} Π_{k=1}^K p_i(w_k)^{tf_i(w_k) + μ p_c(w_k)},   s.t. Σ_k p_i(w_k) = 1, 0 ≤ p_i(w_k) ≤ 1

Use a Lagrange multiplier and set the derivative to 0:

  p_i*(w_k) = (tf_i(w_k) + μ p_c(w_k)) / (|d_i| + μ)

Here the μ p_c(w_k) are pseudo counts and μ is a pseudo document length.

SLIDE 28

Maximum A Posteriori (MAP) Estimation

Dirichlet MAP estimation for the unigram language model:

Step 0: compute the word probabilities on the whole collection (the collection unigram language model):

  p_c(w) = Σ_i tf_i(w) / Σ_i |d_i|

Step 1: for each document d_i, compute its smoothed unigram language model (Dirichlet smoothing) as

  p_i(w_k) = (tf_i(w_k) + μ p_c(w_k)) / (|d_i| + μ)
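A minimal sketch of Steps 0 and 1 in Python, continuing the running example; the helper names and the default μ = 2000 (a commonly used magnitude) are our own illustration.

```python
from collections import Counter

def collection_model(docs):
    """Step 0: collection unigram model p_c(w) = total tf / total length."""
    tf, total = Counter(), 0
    for doc in docs:
        tf.update(doc)
        total += len(doc)
    return {w: c / total for w, c in tf.items()}

def dirichlet_smoothed_model(doc, p_c, mu=2000.0):
    """Step 1: p_i(w) = (tf_i(w) + mu * p_c(w)) / (|d_i| + mu)."""
    tf = Counter(doc)
    return {w: (tf[w] + mu * pw) / (len(doc) + mu) for w, pw in p_c.items()}
```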

SLIDE 29

Maximum A Posteriori (MAP) Estimation

Dirichlet MAP estimation for the unigram language model:

Step 2: for a given query q = {tf_q(w_1), …, tf_q(w_K)} and each document d_i, compute the query likelihood

  p(q | d_i) = Π_{k=1}^K p_i(w_k)^{tf_q(w_k)} = Π_{k=1}^K [ (tf_i(w_k) + μ p_c(w_k)) / (|d_i| + μ) ]^{tf_q(w_k)}

The larger the likelihood, the more relevant the document is to the query.
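Continuing the sketch (reusing d1–d3, collection_model, and dirichlet_smoothed_model from the blocks above), Step 2 ranks documents by log p(q | d_i):

```python
import math

def query_log_likelihood(query_terms, doc, p_c, mu=2000.0):
    """log p(q | d_i) under the Dirichlet-smoothed document model."""
    model = dirichlet_smoothed_model(doc, p_c, mu)
    return sum(math.log(model[w]) for w in query_terms)

docs = {"d1": d1, "d2": d2, "d3": d3}
p_c = collection_model(docs.values())
scores = {name: query_log_likelihood(["sport", "basketball"], doc, p_c)
          for name, doc in docs.items()}
print(sorted(scores.items(), key=lambda kv: -kv[1]))  # d1 ranked first
```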

SLIDE 30

Dirichlet Smoothing & TF-IDF

TF-IDF weighting:

  sim(q, d_i) = Σ_{k=1}^K tf_q(w_k) · tf_i(w_k) · idf(w_k) / norm(d_i)

Dirichlet smoothing:

  p(q | d_i) = Π_{k=1}^K [ (tf_i(w_k) + μ p_c(w_k)) / (|d_i| + μ) ]^{tf_q(w_k)}

What is the relationship between the two?

SLIDE 31

Dirichlet Smoothing & TF-IDF

TF-IDF weighting:

  sim(q, d_i) = Σ_{k=1}^K tf_q(w_k) · tf_i(w_k) · idf(w_k) / norm(d_i)

Dirichlet smoothing: take the logarithm of the query likelihood:

  log p(q | d_i) = Σ_k tf_q(w_k) [ log(1 + tf_i(w_k) / (μ p_c(w_k))) + log μ + log p_c(w_k) − log(|d_i| + μ) ]

SLIDE 32

Dirichlet Smoothing & TF-IDF

Dirichlet smoothing, step by step: since

  tf_i(w_k) + μ p_c(w_k) = μ p_c(w_k) · (1 + tf_i(w_k) / (μ p_c(w_k))),

we have

  log p(q | d_i) = Σ_k tf_q(w_k) [ log(tf_i(w_k) + μ p_c(w_k)) − log(|d_i| + μ) ]
                 = Σ_k tf_q(w_k) [ log(1 + tf_i(w_k) / (μ p_c(w_k))) + log μ + log p_c(w_k) − log(|d_i| + μ) ]

SLIDE 33

Dirichlet Smoothing & TF-IDF

TF-IDF weighting:

  sim(q, d_i) = Σ_{k=1}^K tf_q(w_k) · tf_i(w_k) · idf(w_k) / norm(d_i)

Dirichlet smoothing: the terms log μ + log p_c(w_k) do not depend on the document, so they are irrelevant for ranking and can be dropped:

  log p(q | d_i) ∝_rank Σ_k tf_q(w_k) [ log(1 + tf_i(w_k) / (μ p_c(w_k))) − log(|d_i| + μ) ]

SLIDE 34

Dirichlet Smoothing & TF-IDF

Look at the tf.idf part of the Dirichlet smoothing score:

  log(1 + tf_i(w_k) / (μ p_c(w_k)))

• It grows with tf_i(w_k): a TF-like effect
• It grows as p_c(w_k) shrinks, i.e., 1/p_c(w_k) behaves like an IDF factor
• The remaining term −log(|d_i| + μ) plays a role analogous to the length normalization norm(d_i)

SLIDE 35

Dirichlet Smoothing Hyper-Parameter

Dirichlet smoothing:

  p_i(w_k) = (tf_i(w_k) + μ p_c(w_k)) / (|d_i| + μ)

μ is the hyper-parameter:

• When μ is very small, the estimate approaches the MLE estimator
• When μ is very large, the estimate approaches the probability on the whole collection
• How do we set an appropriate μ?

SLIDE 36

Dirichlet Smoothing Hyper-Parameter

Leave-one-out validation: remove one word occurrence from the document, estimate the smoothed model on the rest, and score the held-out word.

Leave w_1 out:

  p(w_1 | d_i \ w_1) = (tf_i(w_1) − 1 + μ p_c(w_1)) / (|d_i| − 1 + μ)

Leave w_j out:

  p(w_j | d_i \ w_j) = (tf_i(w_j) − 1 + μ p_c(w_j)) / (|d_i| − 1 + μ)

SLIDE 37

Dirichlet Smoothing Hyper-Parameter

Leave-one-out validation: leave all words out one by one for a document d_i and sum the log probabilities:

  l(μ, d_i) = Σ_{j=1}^{|d_i|} log( (tf_i(w_j) − 1 + μ p_c(w_j)) / (|d_i| − 1 + μ) )

Do this for all documents in the collection C:

  l(μ, C) = Σ_{i=1}^{|C|} Σ_{j=1}^{|d_i|} log( (tf_i(w_j) − 1 + μ p_c(w_j)) / (|d_i| − 1 + μ) )

Find the appropriate μ:

  μ* = argmax_μ l(μ, C)
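A sketch of this procedure in Python, continuing the running example (docs and p_c from the earlier blocks); the candidate grid of μ values is purely illustrative.

```python
import math
from collections import Counter

def loo_log_likelihood(docs, p_c, mu):
    """Leave-one-out log likelihood l(mu, C) over a collection."""
    total = 0.0
    for doc in docs:
        tf, n = Counter(doc), len(doc)
        for w in doc:  # leave each occurrence out in turn
            total += math.log((tf[w] - 1 + mu * p_c[w]) / (n - 1 + mu))
    return total

# Grid-search a mu that maximizes l(mu, C).
candidates = [10.0, 100.0, 500.0, 1000.0, 2000.0, 5000.0]
best_mu = max(candidates, key=lambda mu: loo_log_likelihood(docs.values(), p_c, mu))
print(best_mu)
```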

SLIDE 38

Dirichlet Smoothing Hyper-Parameter

• What type of document/collection would get a large μ?
  – One where most documents use vocabulary and wording patterns similar to the whole collection

• What type of document/collection would get a small μ?
  – One where most documents use vocabulary and wording patterns different from the whole collection

SLIDE 39

Shrinkage

• Maximum likelihood estimation (MLE) builds the model purely on document data and generates the query words
  – The model may not be accurate when the document is short (many unseen words)

• A shrinkage estimator builds a more reliable model by consulting more general models (e.g., the collection language model)

Example: estimate P(Lung_Cancer | Smoke) for West Lafayette. With little local data, shrink the estimate toward the statistics of larger populations that contain it: Indiana, then the U.S.

SLIDE 40

Shrinkage

Jelinek-Mercer (JM) smoothing:

• Assume that each word is generated from the document language model (MLE) with probability λ, and from the collection language model (MLE) with probability 1 − λ
• This is a linear interpolation between the document language model and the collection language model:

  p_i(w_k) = λ · tf_i(w_k) / |d_i| + (1 − λ) · p_c(w_k)

SLIDE 41

Shrinkage

Relationship between JM smoothing and Dirichlet smoothing.

JM smoothing:

  p_i(w_k) = λ · tf_i(w_k) / |d_i| + (1 − λ) · p_c(w_k)

Dirichlet smoothing can be rewritten in the same form:

  p_i(w_k) = (tf_i(w_k) + μ p_c(w_k)) / (|d_i| + μ)
           = (|d_i| / (|d_i| + μ)) · tf_i(w_k)/|d_i| + (μ / (|d_i| + μ)) · p_c(w_k)

So Dirichlet smoothing is JM smoothing with a document-dependent interpolation weight λ_i = |d_i| / (|d_i| + μ).
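The equivalence is easy to check numerically; a small sketch continuing the running example (docs, p_c, and dirichlet_smoothed_model from the earlier blocks):

```python
from collections import Counter

def jm_smoothed_model(doc, p_c, lam):
    """JM smoothing: p_i(w) = lam * tf_i(w)/|d_i| + (1 - lam) * p_c(w)."""
    tf, n = Counter(doc), len(doc)
    return {w: lam * tf[w] / n + (1 - lam) * pw for w, pw in p_c.items()}

# Dirichlet smoothing equals JM smoothing with lam = |d_i| / (|d_i| + mu).
mu = 2000.0
for name, doc in docs.items():
    lam = len(doc) / (len(doc) + mu)
    jm = jm_smoothed_model(doc, p_c, lam)
    dirichlet = dirichlet_smoothed_model(doc, p_c, mu)
    assert all(abs(jm[w] - dirichlet[w]) < 1e-12 for w in p_c)
```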

SLIDE 42

Model Based Feedback

Retrieval based on query-generation likelihood is equivalent to retrieval based on the Kullback-Leibler (KL) divergence between the query and document language models.

KL divergence between two probability distributions p and q:

  KL(p ‖ q) = Σ_x p(x) log( p(x) / q(x) )

• It is a distance-like measure between two probability distributions
• It is always non-negative (zero exactly when p = q)

How can this be proved? (Hint: apply Jensen's inequality to the concave log function.)

SLIDE 43

Model Based Feedback

Equivalence of retrieval based on query-generation likelihood and on the KL divergence between the query and document language models:

  Sim(q, d_i) = −KL(q ‖ d_i) = −Σ_w q(w) log( q(w) / p_i(w) )
              = Σ_w q(w) log p_i(w) − Σ_w q(w) log q(w)

The first term is the log likelihood of the query generation probability; the second term is a document-independent constant.

• This generalizes the query representation to a distribution (fractional term weighting)
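A sketch of KL-based scoring, continuing the running example (docs, p_c, dirichlet_smoothed_model from the earlier blocks); the document-independent entropy term is dropped, since it does not affect ranking.

```python
import math

def kl_score(q_model, d_model):
    """Sim(q, d) = -KL(q || d) up to a document-independent constant:
    sum_w q(w) * log p_d(w)."""
    return sum(qw * math.log(d_model[w]) for w, qw in q_model.items() if qw > 0)

# The query "sport basketball" represented as a distribution
# (fractional term weights).
q_model = {"sport": 0.5, "basketball": 0.5}
scores = {name: kl_score(q_model, dirichlet_smoothed_model(doc, p_c))
          for name, doc in docs.items()}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```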

SLIDE 44

Model Based Feedback

Two equivalent retrieval pipelines:

• Query generation: estimate a language model for each document d_i, then estimate the generation probability Pr(q | d_i) and return the retrieval results.

• KL divergence: estimate a document language model for each d_i and a query language model for q, then calculate KL(q ‖ d_i) and return the retrieval results.

SLIDE 45

Model Based Feedback

In the KL-divergence pipeline, feedback documents from the initial results can be used to estimate a feedback language model q_F, which is interpolated with the original query model q to form a new query model:

  q' = (1 − α) q + α q_F

• No feedback:   α = 0, so q' = q
• Full feedback: α = 1, so q' = q_F

SLIDE 46

Model Based Feedback: Estimate q_F

Assume a generative model produces each word w within the feedback document(s) F: flip a coin; with probability λ the word is generated from the topic model q_F(w), and with probability 1 − λ from the background (collection) model p_C(w). Given λ, estimate q_F by maximum likelihood over the n words of F:

  q_F* = argmax_{q_F} l(F | q_F) = argmax_{q_F} Σ_{i=1}^n log( λ q_F(w_i) + (1 − λ) p_C(w_i) )

SLIDE 47

Model Based Feedback: Estimate q_F

For each word, a hidden variable tells which language model it comes from. Each word of the feedback documents is drawn from the unknown query topic model p(w|F) ("Basketball": sport = ?, basketball = ?, game = ?, player = ?) with probability λ = 0.2, and from the background model p_C(w|C) (the 0.12, to 0.05, it 0.04, a 0.02, …, sport 0.0001, basketball 0.00005) with probability 1 − λ = 0.8.

If we knew the value of the hidden variable for each word, we could use the MLE estimator directly.

SLIDE 48

Model Based Feedback: Estimate q_F

For each word w_i, introduce a hidden variable z_i ∈ {1 (feedback), 0 (background)}.

• Step 1 (Expectation): estimate the hidden variables based on the current model parameters:

  p(z_i = 1 | w_i) = p(z_i = 1) p(w_i | z_i = 1) / [ p(z_i = 1) p(w_i | z_i = 1) + p(z_i = 0) p(w_i | z_i = 0) ]
                   = λ q_F^{(t)}(w_i) / [ λ q_F^{(t)}(w_i) + (1 − λ) p(w_i | C) ]

  e.g., p(z = 1 | w): the (0.1), basketball (0.7), game (0.6), is (0.2), …

• Step 2 (Maximization): update the model parameters based on the guess from Step 1:

  q_F^{(t+1)}(w_i) = c(w_i, F) · p(z_i = 1 | w_i) / Σ_j c(w_j, F) · p(z_j = 1 | w_j)

SLIDE 49

Model Based Feedback: Estimate q_F

The Expectation-Maximization (EM) algorithm, given λ = 0.5:

• Step 0: initialize the values of q_F^{(0)}
• Step 1 (Expectation):

  p(z_i = 1 | w_i) = λ q_F^{(t)}(w_i) / [ λ q_F^{(t)}(w_i) + (1 − λ) p(w_i | C) ]

• Step 2 (Maximization):

  q_F^{(t+1)}(w_i) = c(w_i, F) · p(z_i = 1 | w_i) / Σ_j c(w_j, F) · p(z_j = 1 | w_j)

Iterate Steps 1 and 2 until convergence; a code sketch follows.

SLIDE 50

Model Based Feedback: Estimate q_F

Properties of the parameter λ:

• If λ is close to 0, most common words can be generated from the collection language model, so more topic words end up in the query (feedback) language model
• If λ is close to 1, the query language model has to generate the most common words itself, so fewer topic words end up in the query language model

SLIDE 51

Retrieval Model: Language Model

• Introduction to language models
• Unigram language models
• Document language model estimation
  – Maximum likelihood estimation
  – Maximum a posteriori estimation
  – Jelinek-Mercer smoothing
• Model-based feedback