

slide-1
SLIDE 1

Probabilistic Information Retrieval

CE-324: Modern Information Retrieval

Sharif University of Technology

  • M. Soleymani

Fall 2017

Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

slide-2
SLIDE 2

Why probabilities in IR?

[Diagram: the user's information need is expressed as a query representation; documents are turned into a document representation; how do we match the two?]

In traditional IR systems, matching between each doc and query is attempted in a semantically imprecise space of index terms. Probabilities provide a principled foundation for uncertain reasoning. Can we use probabilities to quantify our uncertainties?

Uncertain guess of whether the doc has relevant content

Understanding of the user need is uncertain

2

slide-3
SLIDE 3

Probabilistic IR

  • Probabilistic methods are one of the oldest but also one of the currently hottest topics in IR.
  • Traditionally: neat ideas, but they didn't win on performance
  • It may be different now.

3

slide-4
SLIDE 4

Probabilistic IR topics

  • Classical probabilistic retrieval model
      • Probability Ranking Principle
      • Binary Independence Model (≈ we will see that it's a Naïve Bayes text categorization model)
      • (Okapi) BM25
  • Language model approach to IR
      • An important emphasis on this approach in recent work

4

slide-5
SLIDE 5

The document ranking problem

  • Problem specification:
      • We have a collection of docs
      • User issues a query
      • A list of docs needs to be returned
  • Ranking method is the core of an IR system:
      • In what order do we present documents to the user?
      • Idea: rank by probability of relevance of the doc w.r.t. the information need:

P(R = 1 | doc_i, query)

5

slide-6
SLIDE 6

Probability Ranking Principle (PRP)

“If a reference retrieval system’s response to each request is a ranking of the docs in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data.”

[1960s/1970s] S. Robertson, W.S. Cooper, M.E. Maron; van Rijsbergen (1979:113); Manning & Schütze (1999:538)

6

slide-7
SLIDE 7

Recall a few probability basics

  • Product rule: $p(a, b) = p(a \mid b)\, p(b)$
  • Sum rule: $p(a) = \sum_b p(a, b)$
  • Bayes' Rule (posterior $p(a \mid b)$, prior $p(a)$):

$$p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)} = \frac{p(b \mid a)\, p(a)}{p(b \mid a)\, p(a) + p(b \mid \bar{a})\, p(\bar{a})}$$

  • Odds:

$$O(a) = \frac{p(a)}{p(\bar{a})} = \frac{p(a)}{1 - p(a)}$$

7
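A quick numeric illustration of the odds transform (my own example; the 0.75 is arbitrary): if p(a) = 0.75, then

$$O(a) = \frac{p(a)}{1 - p(a)} = \frac{0.75}{0.25} = 3.$$

Since odds are a monotone function of probability, ranking docs by O(R | d, q) later in the lecture gives the same order as ranking by p(R = 1 | d, q).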

slide-8
SLIDE 8

Probability Ranking Principle (PRP)

d: doc, q: query
R: relevance of a doc w.r.t. a given (fixed) query
R = 1: relevant, R = 0: not relevant

Need to find the probability that a doc d is relevant to a query q: p(R = 1 | d, q)

p(R = 0 | d, q) = 1 − p(R = 1 | d, q)

8

slide-9
SLIDE 9

Probability Ranking Principle (PRP)

  • p(d | R = 1, q): probability of d in the class of docs relevant to the query q
  • p(d | R = 0, q): probability of d in the class of docs non-relevant to the query q

$$p(R = 1 \mid d, q) = \frac{p(d \mid R = 1, q)\, p(R = 1 \mid q)}{p(d \mid q)} \qquad p(R = 0 \mid d, q) = \frac{p(d \mid R = 0, q)\, p(R = 0 \mid q)}{p(d \mid q)}$$

9

slide-10
SLIDE 10

Probability Ranking Principle (PRP)

  • How do we compute all those probabilities?
      • We do not know the exact probabilities, so we have to use estimates
  • Binary Independence Model (BIM)
      • which we discuss next – is the simplest model

10

slide-11
SLIDE 11

Probabilistic Retrieval Strategy

  • Estimate how terms contribute to relevance
      • How do things like tf, df, and length influence your judgments about doc relevance?
      • A more nuanced answer is the Okapi formula (Spärck Jones / Robertson)
  • Combine the above estimated values to find the doc relevance probability
  • Order docs by decreasing probability

11

slide-12
SLIDE 12

Probabilistic Ranking

Basic concept:

“For a given query, if we know some docs that are relevant, terms that occur in those docs should be given greater weighting in searching for other relevant docs. By making assumptions about the distribution of terms and applying Bayes Theorem, it is possible to derive weights theoretically.”

Van Rijsbergen

12

slide-13
SLIDE 13

Binary Independence Model

  • Traditionally used in conjunction with the PRP
  • “Binary” = Boolean: docs are represented as binary incidence vectors of terms
      • x = (x1, x2, …, xn)
      • xi = 1 iff term i is present in document x
  • “Independence”: terms occur in docs independently
  • Equivalent to the Multivariate Bernoulli Naive Bayes model
      • sometimes used for text categorization [we will see it in the next lectures]

13

slide-14
SLIDE 14

Binary Independence Model

  • Will use odds and Bayes’ Rule:

$$O(R \mid \vec{x}, q) = \frac{p(R = 1 \mid \vec{x}, q)}{p(R = 0 \mid \vec{x}, q)} = \frac{p(R = 1 \mid q)\, p(\vec{x} \mid R = 1, q) / p(\vec{x} \mid q)}{p(R = 0 \mid q)\, p(\vec{x} \mid R = 0, q) / p(\vec{x} \mid q)}$$

14

slide-15
SLIDE 15

Binary Independence Model

Using the independence assumption:

$$\frac{p(\vec{x} \mid R = 1, q)}{p(\vec{x} \mid R = 0, q)} = \prod_{i=1}^{n} \frac{p(x_i \mid R = 1, q)}{p(x_i \mid R = 0, q)}$$

$$O(R \mid \vec{x}, q) = \frac{p(R = 1 \mid q)}{p(R = 0 \mid q)} \cdot \frac{p(\vec{x} \mid R = 1, q)}{p(\vec{x} \mid R = 0, q)}$$

So:

$$O(R \mid \vec{x}, q) = \underbrace{O(R \mid q)}_{\text{constant for a given query}} \cdot \underbrace{\prod_{i=1}^{n} \frac{p(x_i \mid R = 1, q)}{p(x_i \mid R = 0, q)}}_{\text{needs estimation}}$$

15

slide-16
SLIDE 16

Binary Independence Model

Since each xi is either 0 or 1:

$$O(R \mid \vec{x}, q) = O(R \mid q) \cdot \prod_{x_i = 1} \frac{p(x_i = 1 \mid R = 1, q)}{p(x_i = 1 \mid R = 0, q)} \cdot \prod_{x_i = 0} \frac{p(x_i = 0 \mid R = 1, q)}{p(x_i = 0 \mid R = 0, q)}$$

Let

$$p_i = p(x_i = 1 \mid R = 1, q), \qquad u_i = p(x_i = 1 \mid R = 0, q)$$

Assume, for all terms not occurring in the query (qi = 0), that pi = ui.
This can be changed (e.g., in relevance feedback).

16

slide-17
SLIDE 17

Probabilities

                           relevant (R = 1)    not relevant (R = 0)
term present   xi = 1            pi                    ui
term absent    xi = 0          1 − pi                1 − ui

Then...

17

slide-18
SLIDE 18

Binary Independence Model

$$O(R \mid q, \vec{x}) = O(R \mid q) \cdot \underbrace{\prod_{x_i = q_i = 1} \frac{p_i}{u_i}}_{\text{all matching terms}} \cdot \underbrace{\prod_{x_i = 0,\ q_i = 1} \frac{1 - p_i}{1 - u_i}}_{\text{non-matching query terms}}$$

$$O(R \mid q, \vec{x}) = O(R \mid q) \cdot \underbrace{\prod_{x_i = q_i = 1} \frac{p_i (1 - u_i)}{u_i (1 - p_i)}}_{\text{all matching terms}} \cdot \underbrace{\prod_{q_i = 1} \frac{1 - p_i}{1 - u_i}}_{\text{all query terms}}$$

18

slide-19
SLIDE 19

Binary Independence Model

$$O(R \mid \vec{x}, q) = \underbrace{O(R \mid q) \cdot \prod_{q_i = 1} \frac{1 - p_i}{1 - u_i}}_{\text{constant for each query}} \cdot \prod_{x_i = q_i = 1} \frac{p_i (1 - u_i)}{u_i (1 - p_i)}$$

Retrieval Status Value:

$$RSV = \log \prod_{x_i = q_i = 1} \frac{p_i (1 - u_i)}{u_i (1 - p_i)} = \sum_{x_i = q_i = 1} \log \frac{p_i (1 - u_i)}{u_i (1 - p_i)}$$

This is the only quantity that needs to be estimated for ranking.

19

slide-20
SLIDE 20

Binary Independence Model

All boils down to computing the RSV:

$$RSV = \sum_{x_i = q_i = 1} \log \frac{p_i (1 - u_i)}{u_i (1 - p_i)} = \sum_{x_i = q_i = 1} c_i\,; \qquad c_i = \log \frac{p_i (1 - u_i)}{u_i (1 - p_i)}$$

So, how do we compute the ci's from our data? The ci's function as the term weights in this model.

20
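To make the scoring concrete, here is a minimal sketch (my own illustration, not from the slides) of ranking by RSV once the term weights ci are known. The dictionaries and the set-of-terms document format are assumptions for the example; the weights reuse the numbers from the example on the next slide.

```python
import math

def bim_rsv(query_terms, doc_terms, c):
    """Sum the BIM term weights c_i over terms present in both query and doc."""
    return sum(c[t] for t in query_terms if t in doc_terms and t in c)

# Hypothetical term weights c_i = log[ p_i (1 - u_i) / (u_i (1 - p_i)) ]
c = {"probabilistic": math.log(10 / 3), "retrieval": math.log(7 / 5)}

docs = {
    "d1": {"probabilistic", "retrieval", "model"},
    "d2": {"retrieval", "boolean"},
}
query = {"probabilistic", "retrieval"}

# Rank docs by decreasing RSV, as the PRP prescribes
ranking = sorted(docs, key=lambda d: bim_rsv(query, docs[d], c), reverse=True)
print(ranking)  # ['d1', 'd2']
```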

slide-21
SLIDE 21

BIM: example

  • q = {x1, x2}
  • Relevance judgments for 20 docs, together with the distribution of x1, x2 within these docs (counts for the patterns (1,1), (1,0), (0,1), (0,0)):
      • p1 = 8/12, u1 = 3/8
      • p2 = 7/12 and u2 = 4/8
      • c1 = log 10/3
      • c2 = log 7/5

21
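Filling in the arithmetic behind these weights (using only the pi and ui values given above):

$$c_1 = \log \frac{p_1 (1 - u_1)}{u_1 (1 - p_1)} = \log \frac{\tfrac{8}{12}\cdot\tfrac{5}{8}}{\tfrac{3}{8}\cdot\tfrac{4}{12}} = \log \frac{10/24}{3/24} = \log \frac{10}{3}, \qquad c_2 = \log \frac{\tfrac{7}{12}\cdot\tfrac{4}{8}}{\tfrac{4}{8}\cdot\tfrac{5}{12}} = \log \frac{7}{5}.$$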

slide-22
SLIDE 22

Binary Independence Model

Estimating RSV coefficients in theory. For each term i, look at this table of document counts:

Documents        Relevant       Non-Relevant          Total
xi = 1              s              df − s               df
xi = 0            S − s        N − df − S + s         N − df
Total               S               N − S                N

Estimates:

$$p_i = \frac{s}{S}, \qquad u_i = \frac{df - s}{N - S}$$

Weight of the i-th term:

$$c_i \approx \log \frac{s / (S - s)}{(df - s) / (N - df - S + s)}$$

For now, assume no zero terms.

22
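A small sketch of these estimates in code (my own illustration; the variable names s, S, df, N follow the table above, and no smoothing is applied, as on the slide):

```python
import math

def bim_term_weight(s, S, df, N):
    """c_i from the contingency table: s relevant docs contain the term,
    S relevant docs in total, df docs contain the term, N docs in total.
    Assumes no zero cells."""
    p = s / S                  # p_i = s / S
    u = (df - s) / (N - S)     # u_i = (df - s) / (N - S)
    return math.log((p * (1 - u)) / (u * (1 - p)))

# Hypothetical counts for one term (chosen to match the example two slides back)
print(bim_term_weight(s=8, S=12, df=11, N=20))  # ≈ 1.204 = log(10/3)
```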

slide-23
SLIDE 23

Estimation – key challenge

  • If non-relevant docs are approximated by the whole collection:
      • ui = dfi / N (prob. of occurrence in non-relevant docs for the query)
      • log (1 − ui)/ui = log (N − dfi)/dfi ≈ log N/dfi

IDF!

23
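A quick numeric check of that approximation (hypothetical numbers, not from the slides): with N = 1,000,000 docs and dfi = 1,000,

$$\log \frac{N - df_i}{df_i} = \log \frac{999{,}000}{1{,}000} = \log 999 \approx \log 1000 = \log \frac{N}{df_i},$$

so for rare terms the BIM weight under this approximation is essentially the classic IDF.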

slide-24
SLIDE 24

Estimation – key challenge

  • pi cannot be approximated as easily as ui
      • probability of occurrence in relevant docs
  • pi can be estimated in various ways:
      • constant (Croft and Harper combination match)
          • then we just get idf weighting of terms
      • proportional to the prob. of occurrence in the collection
          • Greiff (SIGIR 1998) argues for 1/3 + 2/3 · dfi/N
      • from relevant docs, if we know some
          • relevance weighting can be used in a feedback loop

24

slide-25
SLIDE 25

Probabilistic Relevance Feedback

1. Guess pi and ui and use them to retrieve a first set of relevant docs VR.

2. Interact with the user to refine the description: the user specifies some definite members with R = 1 (the set VR) and R = 0 (the set VNR).

3. Re-estimate pi and ui:

$$p_i = \frac{|VR_i| + \tfrac{1}{2}}{|VR| + 1}, \qquad u_i = \frac{|VNR_i| + \tfrac{1}{2}}{|VNR| + 1}$$

4. Repeat, thus generating a succession of approximations to the relevant docs.

25

slide-26
SLIDE 26

Probabilistic Relevance Feedback

1. Guess pi and ui and use them to retrieve a first set of relevant docs VR.

2. Interact with the user to refine the description: learn some definite members with R = 1 and R = 0.

3. Re-estimate pi and ui:
   Or we can combine the new info with the original guess (a Bayesian update):

$$p_i^{(t+1)} = \frac{|VR_i| + \kappa\, p_i^{(t)}}{|VR| + \kappa}$$

   where κ is the prior weight.

4. Repeat, thus generating a succession of approximations to the relevant docs.

26
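A minimal sketch of the re-estimation step (my own illustration; the smoothing follows the formulas above, and the default κ = 5.0 is an arbitrary choice for the example):

```python
def reestimate_pi(VR, VR_i):
    """Smoothed estimate p_i = (|VR_i| + 1/2) / (|VR| + 1)."""
    return (len(VR_i) + 0.5) / (len(VR) + 1)

def bayesian_update_pi(p_prev, VR, VR_i, kappa=5.0):
    """Combine new relevance info with the previous estimate:
    p_i^(t+1) = (|VR_i| + kappa * p_i^(t)) / (|VR| + kappa)."""
    return (len(VR_i) + kappa * p_prev) / (len(VR) + kappa)

# Hypothetical feedback: 6 docs judged relevant, 4 of them contain term i
VR = {"d1", "d2", "d3", "d4", "d5", "d6"}
VR_i = {"d1", "d2", "d4", "d6"}
print(reestimate_pi(VR, VR_i))            # ≈ 0.64
print(bayesian_update_pi(0.5, VR, VR_i))  # ≈ 0.59, pulled toward the prior guess 0.5
```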

slide-27
SLIDE 27

Iteratively estimating pi (= pseudo-relevance feedback)

1. Assume that pi is constant over all xi in the query:
   pi = 0.5 (even odds) for any given doc

2. Determine a guess of the relevant doc set:
   V is a fixed-size set of the highest-ranked docs on this model.

3. We need to improve our guesses for pi and ui:
   Let Vi be the set of docs in V containing xi:

$$p_i = \frac{|V_i| + \tfrac{1}{2}}{|V| + 1}$$

   Assume that if a doc is not retrieved then it is not relevant:

$$u_i = \frac{df_i - |V_i| + \tfrac{1}{2}}{N - |V| + 1}$$

4. Go to 2 until convergence, then return the ranking.

27
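A compact sketch of this iteration (my own illustration under simplifying assumptions: docs is a dict mapping ids to sets of terms, rsv computes the BIM retrieval status value from the current pi and ui, and a fixed number of iterations stands in for a convergence test):

```python
import math

def rsv(doc, query, p, u):
    """BIM retrieval status value with the current estimates p_i, u_i."""
    return sum(math.log(p[t] * (1 - u[t]) / (u[t] * (1 - p[t])))
               for t in query if t in doc)

def pseudo_relevance_feedback(docs, query, k=10, iters=5):
    # Sketch: assumes no query term occurs in every doc.
    N = len(docs)
    df = {t: sum(t in d for d in docs.values()) for t in query}
    p = {t: 0.5 for t in query}            # step 1: even odds
    u = {t: df[t] / N for t in query}      # initial u_i = df_i / N
    ranking = []
    for _ in range(iters):
        # step 2: V = top-k docs under the current model
        ranking = sorted(docs, key=lambda d: rsv(docs[d], query, p, u), reverse=True)
        V = ranking[:k]
        # step 3: re-estimate p_i and u_i from V
        for t in query:
            Vi = sum(t in docs[d] for d in V)
            p[t] = (Vi + 0.5) / (len(V) + 1)
            u[t] = (df[t] - Vi + 0.5) / (N - len(V) + 1)
    return ranking  # step 4: iterate, then return the ranking
```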

slide-28
SLIDE 28

PRP and BIM

  • Getting reasonable approximations of the probabilities is possible.
  • Requires restrictive assumptions:
      • Boolean representation of docs/queries/relevance
      • term independence
      • terms that do not appear in the query don't affect the outcome
      • doc relevance values are independent
  • Some of these assumptions can be removed
  • Problem: we either require partial relevance information or can only derive somewhat inferior term weights

28

slide-29
SLIDE 29

Removing term independence

  • In general, index terms aren't independent
  • Dependencies can be complex
  • van Rijsbergen (1979) proposed a model of simple tree dependencies
      • Exactly Friedman and Goldszmidt's Tree Augmented Naive Bayes (AAAI 13, 1996)
      • Each term is dependent on one other term
  • In the 1970s, estimation problems held back the success of this model

29

slide-30
SLIDE 30

A key limitation of the BIM

  • BIM was designed for titles or abstracts, not for modern full-text search
      • like much of original IR
  • We want to pay attention to term frequency and doc lengths
      • just like in the other models we've discussed

30

slide-31
SLIDE 31

Okapi BM25

  • BM25 = “Best Match 25” (they had a bunch of tries!)
      • Developed in the context of the Okapi system
      • Started to be increasingly adopted by other teams during the TREC competitions
      • It works well
  • Goal: relax some assumptions of BIM while not adding too many parameters
      • (Spärck Jones et al. 2000)
  • I'll omit the theory, but show the form…

31

slide-32
SLIDE 32

Recall: BIM

  • Boils down to:

$$RSV^{BIM} = \sum_{x_i = q_i = 1} c_i^{BIM}\,; \qquad c_i^{BIM} = \log \frac{p_i (1 - u_i)}{u_i (1 - p_i)} \quad \text{(log odds ratio)}$$

                           relevant (R = 1)    not relevant (R = 0)
term present   xi = 1            pi                    ui
term absent    xi = 0          1 − pi                1 − ui

  • Simplifies to (with constant pi = 0.5):

$$c_i^{BIM} = \log \frac{N}{df_i}$$

slide-33
SLIDE 33

“Early” versions of BM25

  • Version 1: using the saturation function tfi / (k1 + tfi):

$$c_i^{BM25v_1}(tf_i) = c_i^{BIM} \cdot \frac{tf_i}{k_1 + tf_i}$$

  • Version 2: BIM simplification to IDF:

$$c_i^{BM25v_2}(tf_i) = \log \frac{N}{df_i} \cdot \frac{(k_1 + 1)\, tf_i}{k_1 + tf_i}$$

      • The (k1 + 1) factor doesn't change the ranking, but makes the term score 1 when tfi = 1
      • Similar to tf-idf, but term scores are bounded
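To see the saturation with a concrete number (the value k1 = 1.2 is a hypothetical choice, not from this slide):

$$\frac{(k_1 + 1)\,tf}{k_1 + tf} = 1 \ \ (tf = 1), \qquad \approx 1.77 \ \ (tf = 5), \qquad \approx 2.15 \ \ (tf = 50),$$

so the term's contribution grows with tf but is bounded by k1 + 1 = 2.2, unlike raw term frequency.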

slide-34
SLIDE 34

Document length normalization

  • Longer documents are likely to have larger tfi values
  • Why might documents be longer?
      • Verbosity: suggests the observed tfi is too high
      • Larger scope: suggests the observed tfi may be right
  • A real document collection probably has both effects
      • … so we should apply some kind of normalization

slide-35
SLIDE 35

Document length normalization

  • Document length: $dl = \sum_{i \in V} tf_{i,d}$
  • avdl: average document length over the collection
  • Length normalization component:

$$B = \left( (1 - b) + b \frac{dl}{avdl} \right), \qquad 0 \le b \le 1$$

      • b = 1: full document length normalization
      • b = 0: no document length normalization
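For intuition (hypothetical numbers): with b = 0.75, a doc twice the average length gets

$$B = (1 - 0.75) + 0.75 \cdot \frac{2\,avdl}{avdl} = 0.25 + 1.5 = 1.75,$$

so its term frequencies are effectively scaled down by 1.75 in the BM25 formula on the next slides, while a doc of exactly average length has B = 1 and is left unchanged.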

slide-36
SLIDE 36

Okapi BM25

  • Factor in the frequency of each term versus doc length:

$$RSV^{BM25} = \sum_{q_i = 1} c_i^{BM25}(tf_i)\,; \qquad c_i^{BM25}(tf_i) = \log \frac{N}{df_i} \cdot \frac{(k_1 + 1)\, tf_i}{k_1 \left( (1 - b) + b \frac{dl}{avdl} \right) + tf_i}$$

      • tf_{i,d} is the term frequency of term i in doc d
      • dl is the length of d and avdl is the average doc length
      • k1 and b are tuning parameters

36

slide-37
SLIDE 37

Okapi BM25

  • k1 controls term frequency scaling
      • k1 = 0 is the binary model; large k1 gives raw term frequency
  • b controls doc length normalization
      • b = 0: no length normalization; b = 1: relative frequency (fully scale by doc length)
  • Typically, k1 is set around 1.2–2 and b around 0.75

37

$$RSV^{BM25} = \sum_{q_i = 1} \log \frac{N}{df_i} \cdot \frac{(k_1 + 1)\, tf_i}{k_1 \left( (1 - b) + b \frac{dl}{avdl} \right) + tf_i}$$
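Putting the pieces together, a minimal BM25 scorer sketch (my own illustration of the formula above; documents are plain token lists, and refinements such as query term frequency or IDF smoothing are left out):

```python
import math
from collections import Counter

def bm25_score(query, doc, df, N, avdl, k1=1.5, b=0.75):
    """RSV_BM25 for one doc: sum over query terms of
    log(N/df_i) * (k1+1)*tf_i / (k1*((1-b) + b*dl/avdl) + tf_i)."""
    tf = Counter(doc)
    dl = len(doc)
    norm = k1 * ((1 - b) + b * dl / avdl)   # length-normalized k1 term
    score = 0.0
    for term in query:
        if tf[term] == 0 or term not in df:
            continue
        idf = math.log(N / df[term])
        score += idf * (k1 + 1) * tf[term] / (norm + tf[term])
    return score

# Hypothetical toy collection
docs = [["probabilistic", "retrieval", "model"],
        ["boolean", "retrieval"],
        ["probabilistic", "probabilistic", "ranking"]]
N = len(docs)
avdl = sum(len(d) for d in docs) / N
df = Counter(t for d in docs for t in set(d))   # doc frequency per term
query = ["probabilistic", "retrieval"]

ranked = sorted(range(N), key=lambda j: bm25_score(query, docs[j], df, N, avdl), reverse=True)
print(ranked)
```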

slide-38
SLIDE 38

Resources

  • S. E. Robertson and K. Spärck Jones. 1976. Relevance Weighting of Search Terms. Journal of the American Society for Information Sciences 27(3): 129–146.
  • C. J. van Rijsbergen. 1979. Information Retrieval. 2nd ed. London: Butterworths, chapter 6. [Most details of the math] http://www.dcs.gla.ac.uk/Keith/Preface.html
  • N. Fuhr. 1992. Probabilistic Models in Information Retrieval. The Computer Journal 35(3): 243–255. [Easiest read, with BNs]
  • F. Crestani, M. Lalmas, C. J. van Rijsbergen, and I. Campbell. 1998. Is This Document Relevant? ... Probably: A Survey of Probabilistic Models in Information Retrieval. ACM Computing Surveys 30(4): 528–552. http://www.acm.org/pubs/citations/journals/surveys/1998-30-4/p528-crestani/ [Adds very little material that isn't in van Rijsbergen or Fuhr]

38