Probabilistic Information Retrieval
CE-324: Modern Information Retrieval
Sharif University of Technology
- M. Soleymani
Fall 2017
Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
Why probabilities in IR?

User Information Need → Query Representation
Documents → Document Representation
How to match?

In traditional IR systems, matching between each doc and query is attempted in a semantically imprecise space of index terms. Probabilities provide a principled foundation for reasoning under uncertainty. Can we use probabilities to quantify our uncertainties?

Uncertain guess of whether a doc has relevant content
Understanding of the user need is uncertain
Probabilistic methods are one of the oldest but also one of the currently hottest topics in IR.
Traditionally: neat ideas, but they didn't win on performance. It may be different now.
Classical probabilistic retrieval model
- Probability Ranking Principle
- Binary Independence Model (we will see that it is ≈ a Naïve Bayes text categorization model)
- (Okapi) BM25
Language model approach to IR
- An important emphasis on this approach in recent work
Problem specification:
We have a collection of docs. A user issues a query. A list of docs needs to be returned.
The ranking method is the core of an IR system:
In what order do we present documents to the user?
Idea: rank by probability of relevance of the doc w.r.t. the query:
P(R = 1 | doc_j, query)
“If a reference retrieval system’s response to each request is a ranking of the docs in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data.”
[1960s/1970s] S. Robertson, W.S. Cooper, M.E. Maron; van Rijsbergen (1979:113); Manning & Schütze (1999:538)
Product rule: p(a, b) = p(a|b) p(b)
Sum rule: p(a) = Σ_b p(a, b)
Bayes' rule: p(a|b) = p(b|a) p(a) / p(b)   (posterior = likelihood × prior / evidence)
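These rules can be checked numerically on a toy joint distribution (the probabilities below are illustrative, not from the slides):

```python
# Toy joint distribution over a (relevance, 0/1) and b (term present, 0/1).
# The probabilities are made up for illustration.
joint = {(1, 1): 0.10, (1, 0): 0.05, (0, 1): 0.15, (0, 0): 0.70}

def p_a(a):
    # Sum rule: p(a) = sum over b of p(a, b)
    return sum(v for (ai, bi), v in joint.items() if ai == a)

def p_b(b):
    return sum(v for (ai, bi), v in joint.items() if bi == b)

def cond(a, b):
    # Product rule rearranged: p(a|b) = p(a, b) / p(b)
    return joint[(a, b)] / p_b(b)

# Bayes' rule: p(a|b) = p(b|a) p(a) / p(b)
posterior = cond(1, 1)
via_bayes = (joint[(1, 1)] / p_a(1)) * p_a(1) / p_b(1)
assert abs(posterior - via_bayes) < 1e-12
print(posterior)  # 0.10 / 0.25 = 0.4
```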
p(d|R = 1, q): probability of doc d in the class of relevant docs (for query q)
p(d|R = 0, q): probability of doc d in the class of non-relevant docs
P(R = 1|d, q) = P(d|R = 1, q) P(R = 1|q) / P(d|q)
P(R = 0|d, q) = P(d|R = 0, q) P(R = 0|q) / P(d|q)

How do we compute all those probabilities?
We do not know the exact probabilities, so we have to use estimates. The Binary Independence Model (BIM), which we discuss next, is the simplest model.
Estimate how terms contribute to relevance:
How do things like tf, df, and doc length influence your judgments about doc relevance? A more nuanced answer is the Okapi formula (Spärck Jones / Robertson).
Combine the estimated values to find doc relevance.
Order docs by decreasing probability.
Basic concept (van Rijsbergen): for a given query, if we know some docs that are relevant, terms that occur in those docs should be given greater weighting in searching for other relevant docs.
Traditionally used in conjunction with the PRP.
"Binary" = Boolean: docs are represented as binary incidence vectors of terms:
x = [x_1, x_2, …, x_n], where x_i = 1 iff term i is present in doc x.
"Independence": terms occur in docs independently.
Equivalent to the multivariate Bernoulli Naive Bayes model sometimes used for text categorization [we will see it in the next lectures].
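A minimal sketch of the binary incidence representation (the toy docs and whitespace tokenization are assumptions, not from the slides):

```python
# Represent each doc as a binary incidence vector over a fixed vocabulary.
# Toy documents for illustration only.
docs = [
    "news about presidential campaign",
    "news about organic food campaign",
    "presidential candidate news",
]

# Fixed term ordering: term i of the vocabulary maps to component x_i.
vocab = sorted({t for d in docs for t in d.split()})

def to_incidence(doc):
    # x_i = 1 iff term i occurs in the doc; term frequency is discarded.
    present = set(doc.split())
    return [1 if t in present else 0 for t in vocab]

X = [to_incidence(d) for d in docs]
print(vocab)
print(X[0])
```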
Will use odds and Bayes' rule:

O(R|q, x) = P(R = 1|q, x) / P(R = 0|q, x)
          = [P(R = 1|q) / P(R = 0|q)] · [P(x|R = 1, q) / P(x|R = 0, q)]
Using the independence assumption:

P(x|R = 1, q) / P(x|R = 0, q) = Π_{i=1..n} P(x_i|R = 1, q) / P(x_i|R = 0, q)

So:

O(R|q, x) = O(R|q) · Π_{i=1..n} P(x_i|R = 1, q) / P(x_i|R = 0, q)

O(R|q) is constant for a given query; only the product needs estimation.
Since each x_i is either 0 or 1, split the product:

O(R|q, x) = O(R|q) · Π_{x_i=1} [P(x_i = 1|R = 1, q) / P(x_i = 1|R = 0, q)] · Π_{x_i=0} [P(x_i = 0|R = 1, q) / P(x_i = 0|R = 0, q)]

Let p_i = P(x_i = 1|R = 1, q)

Assume, for all terms not occurring in the query (q_i = 0), that p_i = u_i. This can be changed (e.g., in relevance feedback).
u_i = P(x_i = 1|R = 0, q)

                        relevant (R = 1)    not relevant (R = 0)
term present (x_i = 1)        p_i                  u_i
term absent  (x_i = 0)      1 − p_i              1 − u_i
O(R|q, x) = O(R|q) · Π_{x_i=q_i=1} (p_i / u_i) · Π_{x_i=0, q_i=1} (1 − p_i)/(1 − u_i)
            [all matching terms]    [non-matching query terms]

          = O(R|q) · Π_{x_i=q_i=1} p_i(1 − u_i) / [u_i(1 − p_i)] · Π_{q_i=1} (1 − p_i)/(1 − u_i)
            [all matching terms]                                   [all query terms]
O(R|q, x) = O(R|q) · Π_{q_i=1} (1 − p_i)/(1 − u_i) · Π_{x_i=q_i=1} p_i(1 − u_i) / [u_i(1 − p_i)]

The first two factors are constant for each query; the last product is the only quantity that needs to be estimated for ranking.
Retrieval Status Value:

RSV = log Π_{x_i=q_i=1} p_i(1 − u_i) / [u_i(1 − p_i)] = Σ_{x_i=q_i=1} c_i

where c_i = log [p_i(1 − u_i)] / [u_i(1 − p_i)]
Example: q = {x_1, x_2}. Relevance judgements from 20 docs together with the doc representations give the estimates:
p_1 = 8/12, u_1 = 3/8, p_2 = 7/12, u_2 = 4/8
c_1 = log(10/3), c_2 = log(7/5)
Doc representations (x_1, x_2): (1,1), (1,0), (0,1), (0,0)
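The numbers in this example can be checked directly; ranking the four doc vectors by RSV reproduces the expected order:

```python
from math import log

# Estimates taken from the example above.
p = [8/12, 7/12]   # P(x_i = 1 | R = 1)
u = [3/8, 4/8]     # P(x_i = 1 | R = 0)

# c_i = log [p_i (1 - u_i)] / [u_i (1 - p_i)]
c = [log(p[i] * (1 - u[i]) / (u[i] * (1 - p[i]))) for i in range(2)]

def rsv(x, q=(1, 1)):
    # Sum c_i over terms present in both the doc and the query.
    return sum(c[i] for i in range(2) if x[i] == 1 and q[i] == 1)

docs = [(1, 1), (1, 0), (0, 1), (0, 0)]
ranking = sorted(docs, key=rsv, reverse=True)
print(ranking)  # (1,1) ranks first, (0,0) last
```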
Estimating RSV coefficients in theory. For each term i, look at this table of document counts:

                 relevant    non-relevant         total
x_i = 1             s          df_i − s            df_i
x_i = 0           S − s    N − df_i − S + s      N − df_i
total               S          N − S                N

p_i ≈ s/S,   u_i ≈ (df_i − s)/(N − S)
c_i ≈ K(N, df_i, S, s) = log [s/(S − s)] / [(df_i − s)/(N − df_i − S + s)]

For now, assume no zero terms.
Weight of the i-th term: c_i.

If non-relevant docs are approximated by the whole collection:
u_i = df_i / N   (prob. of occurrence in non-relevant docs for the query)
log (1 − u_i)/u_i = log (N − df_i)/df_i ≈ log N/df_i
IDF!
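A sketch of the c_i estimate from the contingency counts, with 1/2 smoothing to avoid zeros, and a check that it approaches plain idf when no relevance information is available (the counts below are made up):

```python
from math import log

def c_i(N, df, S, s):
    # BIM term weight from doc counts: N docs in total, df containing the term,
    # S judged relevant, s of which contain the term. 1/2 smoothing avoids zeros.
    p = (s + 0.5) / (S + 1)            # estimate of P(x_i = 1 | R = 1)
    u = (df - s + 0.5) / (N - S + 1)   # estimate of P(x_i = 1 | R = 0)
    return log(p * (1 - u) / (u * (1 - p)))

def idf(N, df):
    # log (1 - u_i)/u_i = log (N - df)/df ≈ log N/df when u_i = df/N
    return log(N / df)

# With no relevance information (S = s = 0), c_i is close to idf for rare terms.
N, df = 100_000, 50
print(c_i(N, df, S=0, s=0), idf(N, df))
```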
p_i cannot be approximated as easily as u_i (it is the probability of occurrence in relevant docs). p_i can be estimated in various ways:
- as a constant (Croft and Harper combination match); then we just get idf weighting of terms
- proportional to the prob. of occurrence in the collection; Greiff (SIGIR 1998) argues for 1/3 + 2/3 · df_i/N
- from relevant docs, if we know some; relevance weighting can be used in a feedback loop
Probabilistic relevance feedback:
1. Guess initial estimates of p_i and u_i.
2. Use the current estimates to rank docs and show the user a set of candidates.
3. The user marks some of them as relevant (VR) and some as non-relevant (VNR).
4. Re-estimate p_i and u_i on the basis of these judged docs, and repeat.
p_i = (|VR_i| + 1/2) / (|VR| + 1),   u_i = (|VNR_i| + 1/2) / (|VNR| + 1)

(VR_i: judged relevant docs containing term i; VNR_i: judged non-relevant docs containing term i)
Or we can combine the new information with the original guess (a Bayesian update):

p_i^(t+1) = (|VR_i| + κ p_i^(t)) / (|VR| + κ)

where κ is the prior weight.
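The Bayesian update can be written directly; κ and the judged counts below are illustrative:

```python
def bayesian_update(p_prev, vr_i, vr, kappa=5.0):
    # p_i^(t+1) = (|VR_i| + kappa * p_i^(t)) / (|VR| + kappa)
    # kappa weighs the previous estimate against the newly judged docs.
    return (vr_i + kappa * p_prev) / (vr + kappa)

# Start from even odds; 10 judged relevant docs, 7 of which contain the term.
p = bayesian_update(0.5, vr_i=7, vr=10)
print(p)  # (7 + 2.5) / 15
```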
Estimating p_i and u_i without relevance information (pseudo-relevance feedback):
1. Assume p_i = 0.5 (even odds) for any given doc.
2. Rank docs with this model; let V be a fixed-size set of the highest-ranked docs.
3. Let V_i be the set of docs in V containing x_i, and re-estimate
   p_i = (|V_i| + 1/2) / (|V| + 1)
4. Assume that docs not retrieved are not relevant:
   u_i = (df_i − |V_i| + 1/2) / (N − |V| + 1)
Iterate steps 2–4 until the ranking stabilizes.
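The steps above can be sketched over a binary doc-term matrix; the matrix, query, and collection statistics below are made up, and df is assumed to come from the full collection (with 0 < df_i < N):

```python
from math import log

def prf_rank(X, q_terms, N, df, rounds=2, v_size=3):
    # Pseudo-relevance feedback over binary doc vectors X (list of 0/1 lists).
    # q_terms: indices of query terms; df[i]: collection doc frequency of term i;
    # N: collection size. Assumes 0 < df[i] < N.
    p = {i: 0.5 for i in q_terms}          # step 1: even odds
    u = {i: df[i] / N for i in q_terms}    # initial u_i from the collection
    ranking = list(range(len(X)))
    for _ in range(rounds):
        def rsv(d):
            return sum(log(p[i] * (1 - u[i]) / (u[i] * (1 - p[i])))
                       for i in q_terms if X[d][i] == 1)
        ranking = sorted(range(len(X)), key=rsv, reverse=True)  # step 2
        V = ranking[:v_size]               # highest-ranked set
        for i in q_terms:                  # steps 3-4: re-estimate p_i, u_i
            vi = sum(X[d][i] for d in V)
            p[i] = (vi + 0.5) / (len(V) + 1)
            u[i] = (df[i] - vi + 0.5) / (N - len(V) + 1)
    return ranking

X = [[1, 1], [1, 0], [0, 1], [0, 0]]
print(prf_rank(X, q_terms=[0, 1], N=10, df=[4, 3]))
```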
Getting reasonable approximations of the probabilities is possible, but it requires restrictive assumptions:
- Boolean representation of docs/queries/relevance
- term independence
- terms that do not appear in the query don't affect the outcome
- doc relevance values are independent
Some of these assumptions can be removed. Problem: we either require partial relevance information, or can only derive somewhat inferior term weights.
In the 1970s, estimation problems held back the success of this model.
Friedman and Goldszmidt's Tree Augmented Naive Bayes (AAAI 13, 1996) is a later approach to modeling term dependencies.
BIM was designed for titles or abstracts, and not for modern full-text search (like much of original IR).
We want to pay attention to term frequency and doc lengths, just like in the other models we've discussed.
BM25, "Best Match 25" (they had a bunch of tries!):
- developed in the context of the Okapi system
- started to be increasingly adopted by other teams during the TREC competitions
- it works well
Goal: relax some assumptions of BIM while not adding too many parameters (Spärck Jones et al. 2000).
I'll omit the theory, but show the form…
BIM boils down to:

RSV^BIM = Σ_{x_i=q_i=1} c_i^BIM,   c_i^BIM = log [p_i(1 − u_i)] / [u_i(1 − p_i)]

which simplifies (with constant p_i = 0.5 and u_i = df_i/N) to:

RSV^BIM = Σ_{x_i=q_i=1} log N/df_i
Version 1: using the saturation function tf/(k1 + tf):

c_i^BM25v1(tf_i) = c_i^BIM · tf_i / (k1 + tf_i)

Version 2: using the BIM simplification to IDF:

c_i^BM25v2(tf_i) = log(N/df_i) · (k1 + 1) tf_i / (k1 + tf_i)

The (k1 + 1) factor doesn't change the ranking, but makes the term score 1 when tf_i = 1. Similar to tf-idf, but term scores are bounded.
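The saturation behavior is easy to check numerically; the (k1 + 1) tf / (k1 + tf) factor is 1 at tf = 1 and bounded above by k1 + 1:

```python
def tf_factor(tf, k1=1.2):
    # Saturating tf component of BM25 version 2 (no length normalization yet).
    return (k1 + 1) * tf / (k1 + tf)

print(tf_factor(1))   # exactly 1.0
print([round(tf_factor(t), 3) for t in (2, 5, 20, 1000)])  # climbs toward k1 + 1 = 2.2
```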
Longer docs are likely to have larger tf_i values. Why might docs be longer?
- Verbosity: suggests the observed tf_i is too high
- Larger scope: suggests the observed tf_i may be right
A real doc collection probably has both effects, so we should apply some kind of normalization.
Let dl be the doc length and avdl the average doc length over the collection. Length normalization component:
B = (1 − b) + b · dl/avdl,   0 ≤ b ≤ 1
b = 1: full document length normalization; b = 0: no document length normalization
Factor in the frequency of each term versus doc length:

c_i^BM25(tf_i) = log(N/df_i) · (k1 + 1) tf_i / (k1((1 − b) + b · dl/avdl) + tf_i)

RSV^BM25 = Σ_{q_i=1} c_i^BM25(tf_i)

tf_i is the frequency of term i in doc d; dl is the length of d and avdl is the average doc length; k1 and b are tuning parameters.
k1 controls term frequency scaling:
k1 = 0 is the binary model; large k1 gives raw term frequency.
b controls doc length normalization:
b = 0 is no length normalization; b = 1 is relative frequency (fully scale by doc length).
Typically, k1 is set around 1.2–2 and b around 0.75.
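Putting the pieces together: a minimal BM25 scorer using the formula and the typical parameter values above (the toy collection and whitespace tokenization are assumptions for illustration):

```python
from math import log
from collections import Counter

def bm25(query_terms, doc_tf, dl, avdl, N, df, k1=1.5, b=0.75):
    # RSV^BM25 = sum over query terms of
    #   log(N/df_i) * (k1 + 1) * tf_i / (k1*((1 - b) + b*dl/avdl) + tf_i)
    norm = k1 * ((1 - b) + b * dl / avdl)
    return sum(log(N / df[t]) * (k1 + 1) * doc_tf[t] / (norm + doc_tf[t])
               for t in query_terms if doc_tf.get(t, 0) > 0 and t in df)

# Toy collection, tokenized by whitespace (illustrative only).
docs = ["the cat sat on the mat", "the dog sat", "cat cat cat"]
tfs = [Counter(d.split()) for d in docs]
N = len(docs)
df = Counter(t for tf in tfs for t in tf)
avdl = sum(len(d.split()) for d in docs) / N

scores = [bm25(["cat", "sat"], tfs[i], len(docs[i].split()), avdl, N, df)
          for i in range(N)]
print(scores)  # "cat cat cat" scores highest despite matching only one query term
```

Note how the third doc wins on the saturated but repeated "cat", while the first doc's two single matches are dampened by its greater length.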
Resources:
C. J. van Rijsbergen. 1979. Information Retrieval. 2nd ed. London: Butterworths, chapter 6. [Most details of the math] http://www.dcs.gla.ac.uk/Keith/Preface.html
N. Fuhr. 1992. Probabilistic Models in Information Retrieval. The Computer Journal, 35(3), 243–255. [Easiest read, with BNs]
F. Crestani, M. Lalmas, C. J. van Rijsbergen, and I. Campbell. 1998. Is This Document Relevant? … Probably: A Survey of Probabilistic Models in Information Retrieval. ACM Computing Surveys 30(4): 528–552. http://www.acm.org/pubs/citations/journals/surveys/1998-30-4/p528-crestani/ [Adds very little material that isn't in van Rijsbergen or Fuhr]