TF-IDF and Okapi BM25 LM, session 3 CS6200: Information Retrieval



SLIDE 1

CS6200: Information Retrieval

Slides by: Jesse Anderton

TF-IDF and Okapi BM25

LM, session 3

SLIDE 2

In Bayesian classification, we rank documents by their likelihood ratios calculated from some probabilistic model. The model predicts the features that a relevant or non-relevant document is likely to have. Our first model is a unigram language model, which independently estimates the probability of each term appearing in a relevant or non-relevant document. Any model like this, based on independent binary features, is called a binary independence model.

Binary Independence Models

Likelihood Ratio

$$\frac{P(D \mid R = 1)}{P(D \mid R = 0)}$$

Binary Independence Model

$$\frac{\prod_{i=1}^{|F|} P(f_i \mid R = 1)}{\prod_{i=1}^{|F|} P(f_i \mid R = 0)}, \qquad f_i \in F$$
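As a minimal sketch of the likelihood ratio under the independence assumption, the snippet below multiplies per-feature probabilities for a document given as a 0/1 feature vector. The function name `likelihood_ratio` and the probability values are hypothetical, chosen only for illustration; absent features contribute the complementary probability.

```python
from math import prod

def likelihood_ratio(doc_features, p_rel, p_nonrel):
    """P(D|R=1) / P(D|R=0) under the binary independence assumption.

    doc_features[i] is 1 if feature i is present in the document, else 0.
    p_rel[i] and p_nonrel[i] are the probabilities that feature i is present
    in a relevant and in a non-relevant document, respectively.
    """
    num = prod(p if d else 1.0 - p for d, p in zip(doc_features, p_rel))
    den = prod(q if d else 1.0 - q for d, q in zip(doc_features, p_nonrel))
    return num / den

# Hypothetical probabilities for three features (illustration only).
p_rel = [0.8, 0.5, 0.1]
p_nonrel = [0.2, 0.5, 0.3]

print(likelihood_ratio([1, 0, 0], p_rel, p_nonrel))  # ratio above 1: evidence of relevance
print(likelihood_ratio([0, 0, 1], p_rel, p_nonrel))  # ratio below 1: evidence of non-relevance
```

A document containing only the first feature, which is far more common among relevant documents in this made-up setup, gets a ratio above 1, while a document containing only the third feature gets a ratio below 1.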

SLIDE 3

Simplifying the binary independence model leads to a ranking score which allows us to ignore terms not found in the document. This is important for efficient queries.

Ranking with B.I. Models

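Let $p_i := P(f_i \mid R = 1)$, $q_i := P(f_i \mid R = 0)$, and $d_i \in \{0, 1\}$ := value of $f_i$ in doc $D$. Then

$$
\begin{aligned}
\frac{P(D \mid R = 1)}{P(D \mid R = 0)}
&= \prod_{i:d_i=1} \frac{p_i}{q_i} \cdot \prod_{i:d_i=0} \frac{1 - p_i}{1 - q_i} \\
&= \prod_{i:d_i=1} \frac{p_i}{q_i} \cdot \left( \prod_{i:d_i=1} \frac{1 - q_i}{1 - p_i} \cdot \prod_{i:d_i=1} \frac{1 - p_i}{1 - q_i} \right) \cdot \prod_{i:d_i=0} \frac{1 - p_i}{1 - q_i} \\
&= \prod_{i:d_i=1} \frac{p_i(1 - q_i)}{q_i(1 - p_i)} \cdot \prod_{i=1}^{|F|} \frac{1 - p_i}{1 - q_i} \\
&\overset{\text{rank}}{=} \prod_{i:d_i=1} \frac{p_i(1 - q_i)}{q_i(1 - p_i)} \\
&\overset{\text{rank}}{=} \sum_{i:d_i=1} \log \frac{p_i(1 - q_i)}{q_i(1 - p_i)}
\end{aligned}
$$

The product over all $|F|$ features is the same for every document, so dropping it does not change the ranking, and taking the logarithm is monotonic, so it preserves the ranking as well. The final sum is the Ranking Score.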
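The rank equivalence can be checked numerically. The sketch below, with hypothetical probabilities `p` and `q` for three features, sorts every possible document by the full likelihood ratio and by the simplified sum of logs over present features, then compares the two orderings. The function names are illustrative, not from the slides.

```python
from itertools import product
from math import log, prod

def full_ratio(doc, p, q):
    """Full likelihood ratio: every feature contributes, present or absent."""
    num = prod(pi if d else 1 - pi for d, pi in zip(doc, p))
    den = prod(qi if d else 1 - qi for d, qi in zip(doc, q))
    return num / den

def ranking_score(doc, p, q):
    """Simplified score: sum of log odds ratios over present features only."""
    return sum(log(pi * (1 - qi) / (qi * (1 - pi)))
               for d, pi, qi in zip(doc, p, q) if d)

p = [0.8, 0.6, 0.1]  # hypothetical P(f_i | R = 1)
q = [0.2, 0.5, 0.3]  # hypothetical P(f_i | R = 0)

# Enumerate all 2^3 possible documents and rank them both ways.
docs = list(product([0, 1], repeat=3))
by_ratio = sorted(docs, key=lambda d: full_ratio(d, p, q))
by_score = sorted(docs, key=lambda d: ranking_score(d, p, q))
print(by_ratio == by_score)
```

The simplified score only touches features with $d_i = 1$, which is what lets a retrieval system skip terms that do not occur in the document.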

SLIDE 4

Under certain assumptions, the ranking score is just IDF:

  1. All words have a fixed uniform probability of appearing in a relevant document: $p_i = 1/2$.
  2. Most documents containing the term are non-relevant, so $q_i \approx df_i / D$.
  3. Most documents do not contain the term, so $D - df_i \approx D$.

Relationship to IDF

Ranking Score, approximated using assumptions, becomes IDF

$$
\begin{aligned}
\log \frac{p_i(1 - q_i)}{q_i(1 - p_i)}
&\approx \log \frac{0.5\left(1 - \frac{df_i}{D}\right)}{\frac{df_i}{D}(1 - 0.5)} \\
&= \log \frac{1 - \frac{df_i}{D}}{\frac{df_i}{D}} \\
&= \log \left( \frac{D}{df_i} - \frac{df_i \cdot D}{df_i \cdot D} \right) \\
&= \log \frac{D - df_i}{df_i} \\
&\approx \log \frac{D}{df_i}
\end{aligned}
$$
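To see how good the approximation is, the sketch below compares the ranking weight with $p_i = 0.5$ and $q_i = df_i / D$ against plain $\log(D / df_i)$. The collection size and document frequencies are hypothetical, and the function names are illustrative.

```python
from math import log

def bim_weight(p, q):
    """Per-term ranking weight log[p(1 - q) / (q(1 - p))]."""
    return log(p * (1 - q) / (q * (1 - p)))

def idf(df, num_docs):
    """Plain inverse document frequency, log(D / df)."""
    return log(num_docs / df)

num_docs = 1_000_000  # hypothetical collection size D
for df in (10, 1_000, 100_000):
    approx = bim_weight(0.5, df / num_docs)
    print(df, round(approx, 4), round(idf(df, num_docs), 4))
```

The two values agree closely for rare terms and drift apart as $df_i$ grows, which is where the third assumption ($D - df_i \approx D$) starts to fail.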

SLIDE 5

It turns out that we can do better than IDF. To get there, we’ll start by considering the contingency table of all combinations of di and R.

Improving on IDF

               R = 1        R = 0                    Total
  d_i = 1      r_i          df_i − r_i               df_i
  d_i = 0      R − r_i      D − R − df_i + r_i       D − df_i
  Total        R            D − R                    D

We will estimate $p_i$ and $q_i$ using this table and a technique called “add-α smoothing,” with α = 0.5:

$$p_i = \frac{r_i + 0.5}{R + 1}; \qquad q_i = \frac{df_i - r_i + 0.5}{D - R + 1}$$

This leads to a slightly different ranking score:

$$
\begin{aligned}
\sum_{i:d_i=1} \log \frac{p_i(1 - q_i)}{q_i(1 - p_i)}
&= \sum_{i:d_i=1} \log \frac{(\mathrm{num}(d_i = 1, R = 1) + 0.5) / (\mathrm{num}(d_i = 0, R = 1) + 0.5)}{(\mathrm{num}(d_i = 1, R = 0) + 0.5) / (\mathrm{num}(d_i = 0, R = 0) + 0.5)} \\
&= \sum_{i:d_i=1} \log \frac{(r_i + 0.5) / (R - r_i + 0.5)}{(df_i - r_i + 0.5) / (D - R - df_i + r_i + 0.5)}
\end{aligned}
$$
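A short sketch of the smoothed estimates, assuming hypothetical counts for a single term: it computes $p_i$ and $q_i$ from the contingency-table counts and plugs them into the per-term weight. The function names `estimate_p_q` and `term_weight` are illustrative.

```python
from math import log

def estimate_p_q(r_i, df_i, R, D, alpha=0.5):
    """Add-alpha estimates of p_i and q_i from the contingency-table counts."""
    p_i = (r_i + alpha) / (R + 2 * alpha)
    q_i = (df_i - r_i + alpha) / (D - R + 2 * alpha)
    return p_i, q_i

def term_weight(p_i, q_i):
    """Per-term ranking weight log[p(1 - q) / (q(1 - p))]."""
    return log(p_i * (1 - q_i) / (q_i * (1 - p_i)))

# Hypothetical counts: 10 relevant documents, 8 of which contain the term,
# in a collection of 10,000 documents where the term occurs in 100.
p_i, q_i = estimate_p_q(r_i=8, df_i=100, R=10, D=10_000)
print(p_i, q_i, term_weight(p_i, q_i))
```

With no relevance information ($R = r_i = 0$) the estimate of $p_i$ is exactly 0.5, which matches the uniform-probability assumption used to derive IDF.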

SLIDE 6

Let’s unpack this formula to understand it better. The numerator is a ratio of counts of relevant documents the term does and does not appear in. It’s a likelihood ratio giving the amount of “evidence of relevance” the term provides. The denominator is the same ratio, for non-relevant documents. It gives the amount of “evidence of non-relevance” for the term. If the term is in many documents, but most of them are relevant, it doesn’t discount the term as IDF would.

Is it better?

A better IDF?

$$\log \frac{(r_i + 0.5) / (R - r_i + 0.5)}{(df_i - r_i + 0.5) / (D - R - df_i + r_i + 0.5)}$$
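The following sketch illustrates the last point with made-up counts: a term that occurs in half the collection but almost entirely in relevant documents. The helper names and all numbers are hypothetical.

```python
from math import log

def relevance_evidence(r_i, R):
    """Numerator: relevant docs with the term vs. relevant docs without it."""
    return (r_i + 0.5) / (R - r_i + 0.5)

def nonrelevance_evidence(r_i, df_i, R, D):
    """Denominator: the same ratio over non-relevant documents."""
    return (df_i - r_i + 0.5) / (D - R - df_i + r_i + 0.5)

def weight(r_i, df_i, R, D):
    """Log of evidence of relevance over evidence of non-relevance."""
    return log(relevance_evidence(r_i, R) / nonrelevance_evidence(r_i, df_i, R, D))

# Hypothetical collection: the term is frequent but mostly in relevant docs.
D, R = 10_000, 4_000
df_i, r_i = 5_000, 3_900

print(log(D / df_i))            # plain IDF: small, because the term is common
print(weight(r_i, df_i, R, D))  # relevance-aware weight: much larger
```

Plain IDF only sees that the term is common, while the relevance-aware weight also sees where it is common.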

SLIDE 7

Okapi BM25 is one of the strongest “simple” scoring functions, and has proven a useful baseline for experiments and feature for ranking. It combines:

  • The IDF-like ranking score from the last slide,
  • the document term frequency $tf_{i,d}$, normalized by the ratio of the document’s length $dl$ to the average length $avg(dl)$, and
  • the query term frequency $tf_{i,q}$.

Okapi BM25

$k_1$, $k_2$, and $b$ are empirically-set parameters. Typical values at TREC are:

$$k_1 = 1.2, \qquad 0 \le k_2 \le 1000, \qquad b = 0.75$$

$$
\sum_{i: d_i = q_i = 1}
\log \frac{(r_i + 0.5) / (R - r_i + 0.5)}{(df_i - r_i + 0.5) / (D - R - df_i + r_i + 0.5)}
\cdot \frac{tf_{i,d} + k_1 \cdot tf_{i,d}}{tf_{i,d} + k_1 \left( (1 - b) + b \cdot \frac{dl}{avg(dl)} \right)}
\cdot \frac{tf_{i,q} + k_2 \cdot tf_{i,q}}{tf_{i,q} + k_2}
$$
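The document term-frequency factor is the part of the formula that is least obvious at a glance. The sketch below isolates it, using the typical parameter values as defaults. The function name `tf_factor` and the argument `dl_ratio` (standing for $dl / avg(dl)$) are illustrative.

```python
def tf_factor(tf, dl_ratio, k1=1.2, b=0.75):
    """Document term-frequency factor of BM25.

    dl_ratio is the document length divided by the average document length.
    """
    length_norm = k1 * ((1 - b) + b * dl_ratio)
    return (tf + k1 * tf) / (tf + length_norm)

# For an average-length document the factor grows with tf but levels off.
for tf in (1, 2, 5, 20, 100):
    print(tf, round(tf_factor(tf, dl_ratio=1.0), 3))

# The same tf counts for less in a document twice the average length.
print(round(tf_factor(5, dl_ratio=2.0), 3))
```

The factor is bounded above by $k_1 + 1$, so repeating a term many times gives diminishing returns, and $b$ controls how strongly long documents are penalized.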

SLIDE 8

Example query: “president lincoln”

  • $tf_{president,q} = tf_{lincoln,q} = 1$
  • No relevance information: $R = r_i = 0$
  • “president” is in 40,000 documents in the collection: $df_{president} = 40{,}000$
  • “lincoln” is in 300 documents in the collection: $df_{lincoln} = 300$
  • The document length is 90% of the average length: $dl / avg(dl) = 0.9$
  • We pick $k_1 = 1.2$, $k_2 = 100$, $b = 0.75$

Example: BM25

  tf_president    tf_lincoln    BM25
  15              25            20.66
  15              1             12.74
  15              0              5.00
  1               25            18.2
  0               25            15.66

The low df term plays a bigger role.
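The example can be roughly reproduced with a short sketch of the BM25 formula. One value is missing from the slide: the collection size $D$. The code assumes $D = 500{,}000$, which is an assumption made here and not a figure from the slides. It also assumes natural logarithms. The pairs with a term frequency of 0 cover documents that contain only one of the two query terms.

```python
from math import log

def bm25_term(tf_d, tf_q, df, D, dl_ratio, k1=1.2, k2=100, b=0.75, r=0, R=0):
    """Contribution of one query term to the BM25 score."""
    idf_like = log(((r + 0.5) / (R - r + 0.5))
                   / ((df - r + 0.5) / (D - R - df + r + 0.5)))
    length_norm = k1 * ((1 - b) + b * dl_ratio)
    doc_part = (tf_d + k1 * tf_d) / (tf_d + length_norm)
    query_part = (tf_q + k2 * tf_q) / (tf_q + k2)
    return idf_like * doc_part * query_part

D = 500_000  # ASSUMPTION: collection size is not stated on the slide
DF = {"president": 40_000, "lincoln": 300}

def score(tf_president, tf_lincoln):
    """BM25 score for the query "president lincoln" (each term once)."""
    tfs = {"president": tf_president, "lincoln": tf_lincoln}
    return sum(bm25_term(tfs[t], 1, DF[t], D, dl_ratio=0.9) for t in DF)

for tf_p, tf_l in [(15, 25), (15, 1), (15, 0), (1, 25), (0, 25)]:
    print(tf_p, tf_l, round(score(tf_p, tf_l), 2))
```

Under these assumptions the printed scores land within a few hundredths of the tabulated values; small differences are expected if the slide's numbers were computed with rounded intermediate values. A term with $tf_{i,d} = 0$ contributes nothing, consistent with the sum running only over terms present in both query and document.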

SLIDE 9

Binary Independence Models are a principled, general way to combine evidence from many binary features (not just unigrams!). The version of BM25 shown here is one of many in a family of scoring functions. Modern alternatives can take additional evidence, such as anchor text, into account. Next, we’ll generalize what we’ve learned so far into the fundamental topics of machine learning.

Wrapping Up