CS6200: Information Retrieval
TF-IDF and Okapi BM25
LM, session 3
Binary Independence Models
In Bayesian classification, we rank documents by their likelihood ratios calculated from some probabilistic model. The model predicts the features that a relevant or non-relevant document is likely to have. Our first model is a unigram language model, which independently estimates the probability of each term appearing in a relevant or non-relevant document. Any model like this, based on independent binary features , is called a binary independence model.
Likelihood Ratio

For independent binary features $f_i \in F$:

$$\frac{P(D \mid R=1)}{P(D \mid R=0)} = \frac{\prod_{i=1}^{|F|} P(f_i \mid R=1)}{\prod_{i=1}^{|F|} P(f_i \mid R=0)}$$
Simplifying the binary independence model leads to a ranking score that lets us ignore terms not found in the document. This is important for efficient query processing.
Let $p_i := P(f_i \mid R=1)$, $q_i := P(f_i \mid R=0)$, and $d_i \in \{0, 1\}$ be the value of $f_i$ in document $D$.
Ranking Score

Then

$$\begin{aligned}
\frac{P(D \mid R=1)}{P(D \mid R=0)}
&= \prod_{i:d_i=1} \frac{p_i}{q_i} \cdot \prod_{i:d_i=0} \frac{1-p_i}{1-q_i} \\
&= \prod_{i:d_i=1} \left( \frac{p_i}{q_i} \cdot \frac{1-q_i}{1-p_i} \right) \cdot \prod_{i=1}^{|F|} \frac{1-p_i}{1-q_i} \\
&= \prod_{i:d_i=1} \frac{p_i(1-q_i)}{q_i(1-p_i)} \cdot \prod_{i=1}^{|F|} \frac{1-p_i}{1-q_i} \\
&\stackrel{\text{rank}}{=} \prod_{i:d_i=1} \frac{p_i(1-q_i)}{q_i(1-p_i)} \\
&\stackrel{\text{rank}}{=} \sum_{i:d_i=1} \log \frac{p_i(1-q_i)}{q_i(1-p_i)}
\end{aligned}$$

The product over all $|F|$ features is the same for every document, so it can be dropped for ranking; only terms present in the document contribute.
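The derivation above can be checked numerically. This is a minimal sketch with hypothetical $p_i$ and $q_i$ values (not from the slides): summing the log term weights over only the terms present in a document produces the same ranking as the full likelihood ratio over all features.

```python
import math

# Hypothetical per-term probabilities (illustrative, not from the slides):
# p[i] = P(f_i | R=1), q[i] = P(f_i | R=0)
p = [0.6, 0.3, 0.5]
q = [0.1, 0.2, 0.5]

def full_likelihood_ratio(d):
    """P(D|R=1) / P(D|R=0) over ALL features, present or absent."""
    ratio = 1.0
    for i, di in enumerate(d):
        if di:
            ratio *= p[i] / q[i]
        else:
            ratio *= (1 - p[i]) / (1 - q[i])
    return ratio

def bim_score(d):
    """Sum of log p_i(1-q_i) / (q_i(1-p_i)) over terms PRESENT in the doc."""
    return sum(math.log(p[i] * (1 - q[i]) / (q[i] * (1 - p[i])))
               for i, di in enumerate(d) if di)

# Three documents as binary feature vectors.
docs = [(1, 0, 0), (0, 1, 1), (1, 1, 0)]

# Both scores induce the same ranking: the simplified score differs from
# the log likelihood ratio only by a document-independent constant.
by_ratio = sorted(docs, key=full_likelihood_ratio, reverse=True)
by_score = sorted(docs, key=bim_score, reverse=True)
```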
Under certain assumptions, the ranking score is just IDF:

- Every term has the same probability of appearing in a relevant document: $p_i = 1/2$.
- Most documents containing the term are non-relevant, so $q_i \approx df_i / D$.
- Most documents do not contain the term, so $D - df_i \approx D$.

Ranking Score, approximated using these assumptions, becomes IDF:
$$\log \frac{p_i(1-q_i)}{q_i(1-p_i)} \approx \log \frac{0.5\,\bigl(1 - \frac{df_i}{D}\bigr)}{\frac{df_i}{D}\,(1 - 0.5)} = \log \frac{1 - \frac{df_i}{D}}{\frac{df_i}{D}} = \log \frac{D - df_i}{df_i} \approx \log \frac{D}{df_i}$$
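A quick numeric sanity check of the last approximation, with an illustrative collection size of my choosing: the exact score $\log((D - df_i)/df_i)$ and the IDF approximation $\log(D/df_i)$ are nearly identical when $df_i \ll D$, and drift apart only for very common terms.

```python
import math

D = 1_000_000  # hypothetical collection size (not from the slides)

for df in (10, 1_000, 100_000):
    exact = math.log((D - df) / df)   # log (D - df_i) / df_i
    approx = math.log(D / df)         # IDF: log D / df_i
    print(f"df={df:>7}  exact={exact:.4f}  idf={approx:.4f}")
```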
It turns out that we can do better than IDF. To get there, we’ll start by considering the contingency table of all combinations of di and R.
          | R = 1    | R = 0               | Total
d_i = 1   | r_i      | df_i − r_i          | df_i
d_i = 0   | R − r_i  | D − R − df_i + r_i  | D − df_i
Total     | R        | D − R               | D
We will estimate pi and qi using this table and a technique called "add-α smoothing," with α = 0.5:

$$p_i = \frac{r_i + 0.5}{R + 1}; \qquad q_i = \frac{df_i - r_i + 0.5}{D - R + 1}$$

This leads to a slightly different ranking score:
$$\log \frac{p_i(1-q_i)}{q_i(1-p_i)}
= \log \frac{(\mathrm{num}(d_i=1, R=1) + 0.5)\,/\,(\mathrm{num}(d_i=0, R=1) + 0.5)}{(\mathrm{num}(d_i=1, R=0) + 0.5)\,/\,(\mathrm{num}(d_i=0, R=0) + 0.5)}
= \log \frac{(r_i+0.5)/(R-r_i+0.5)}{(df_i-r_i+0.5)/(D-R-df_i+r_i+0.5)}$$
Let's unpack this formula to understand it better. The numerator is the ratio of counts of relevant documents the term does and does not appear in: the amount of "evidence of relevance" the term provides. The denominator is the same ratio for non-relevant documents: the amount of "evidence of non-relevance" for the term. If the term appears in many documents, but most of them are relevant, this score does not discount the term as IDF would.
A better IDF?

$$\log \frac{(r_i+0.5)/(R-r_i+0.5)}{(df_i-r_i+0.5)/(D-R-df_i+r_i+0.5)}$$
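This weight can be computed directly from the contingency counts. A minimal sketch, with hypothetical counts of my own choosing to show the behavior described above:

```python
import math

def rsj_weight(df_i, r_i, R, D):
    """Smoothed term weight: log of the 'evidence of relevance' ratio
    over the 'evidence of non-relevance' ratio (add-0.5 smoothing)."""
    relevance_evidence = (r_i + 0.5) / (R - r_i + 0.5)
    nonrelevance_evidence = (df_i - r_i + 0.5) / (D - R - df_i + r_i + 0.5)
    return math.log(relevance_evidence / nonrelevance_evidence)

# Hypothetical collection: 1,000 documents, 10 judged relevant.
D, R = 1000, 10

# A term in 100 documents, 8 of them relevant: despite its fairly high df,
# the weight stays positive, because most of its occurrences are relevant.
w_good = rsj_weight(df_i=100, r_i=8, R=R, D=D)

# A term in 100 documents, none relevant: the weight goes negative.
w_bad = rsj_weight(df_i=100, r_i=0, R=R, D=D)
```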
Okapi BM25

Okapi BM25 is one of the strongest "simple" scoring functions, and has proven a useful baseline for experiments and a useful feature for ranking. It combines:

- the "better IDF" relevance weight shown above,
- document term frequency tf_{i,d}, normalized by the ratio of the document's length dl to the average length avg(dl), and
- query term frequency tf_{i,q}.

$$BM25(Q, D) = \sum_{i \in Q} \log \frac{(r_i+0.5)/(R-r_i+0.5)}{(df_i-r_i+0.5)/(D-R-df_i+r_i+0.5)} \cdot \frac{(k_1+1)\,tf_{i,d}}{k_1\bigl((1-b) + b \cdot \frac{dl}{avg(dl)}\bigr) + tf_{i,d}} \cdot \frac{(k_2+1)\,tf_{i,q}}{k_2 + tf_{i,q}}$$

k1, k2, and b are empirically-set parameters. Typical values at TREC are k1 = 1.2, 0 ≤ k2 ≤ 1000, and b = 0.75.
Example query: "president lincoln"

- collection: df_president = 40,000
- collection: df_lincoln = 300
- document length ratio: dl/avg(dl) = 0.9

tf_president,d | tf_lincoln,d | BM25
      15       |      25      | 20.66
      15       |       1      | 12.74
      15       |       0      |  5.00
       1       |      25      | 18.2
       0       |      25      | 15.66
The low-df term, lincoln, plays a bigger role in the score.
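The example above can be sketched as a small BM25 implementation. The collection size N and the assumption of no relevance information (R = r_i = 0) are mine, not given on the slide, so the exact scores may differ slightly from the table; the qualitative behavior (the low-df term lincoln dominating the score) still shows.

```python
import math

def bm25(query_tfs, doc_tfs, dfs, N, dl_ratio,
         k1=1.2, k2=100.0, b=0.75, R=0, rs=None):
    """Okapi BM25 for one query-document pair.
    rs[i] = r_i (relevant docs containing term i); with R = r_i = 0,
    the first factor reduces to an IDF-like weight."""
    if rs is None:
        rs = [0] * len(dfs)
    K = k1 * ((1 - b) + b * dl_ratio)  # length-normalized k1
    score = 0.0
    for tf_q, tf_d, df, r in zip(query_tfs, doc_tfs, dfs, rs):
        w = math.log(((r + 0.5) / (R - r + 0.5)) /
                     ((df - r + 0.5) / (N - R - df + r + 0.5)))
        score += (w
                  * ((k1 + 1) * tf_d / (K + tf_d))
                  * ((k2 + 1) * tf_q / (k2 + tf_q)))
    return score

# "president lincoln": df_president = 40,000, df_lincoln = 300,
# dl/avg(dl) = 0.9. N = 500,000 is an assumed collection size.
N, dl_ratio = 500_000, 0.9
dfs = [40_000, 300]

s_both       = bm25([1, 1], [15, 25], dfs, N, dl_ratio)
s_lincoln_1  = bm25([1, 1], [15, 1],  dfs, N, dl_ratio)
s_no_pres    = bm25([1, 1], [0, 25],  dfs, N, dl_ratio)
```

Dropping lincoln occurrences hurts the score far more than dropping president occurrences, as the table shows.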
Binary Independence Models are a principled, general way to combine evidence from many binary features (not just unigrams!). The version of BM25 shown here is one of many in a family of scoring functions; other members take additional evidence, such as anchor text, into account. Next, we'll generalize what we've learned so far into the fundamental topics of machine learning.