TF-IDF and Okapi BM25
LM, session 3
CS6200: Information Retrieval
Slides by: Jesse Anderton

Binary Independence Models

In Bayesian classification, we rank documents by their likelihood ratios, calculated from some probabilistic model. The model predicts the features that a relevant or non-relevant document is likely to have.

Likelihood ratio:

$$\frac{P(D \mid R = 1)}{P(D \mid R = 0)}$$

Our first model is a unigram language model, which independently estimates the probability of each term appearing in a relevant or non-relevant document.

Binary independence model:

$$\frac{\prod_{i=1}^{|F|} P(f_i \mid R = 1)}{\prod_{i=1}^{|F|} P(f_i \mid R = 0)}$$

Any model like this, based on independent binary features $f_i \in F$, is called a binary independence model.

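Ranking with B.I. Models

Simplifying the binary independence model leads to a ranking score which allows us to ignore terms not found in the document. This is important for efficient queries.

Let $p_i := P(f_i \mid R = 1)$, $q_i := P(f_i \mid R = 0)$, and $d_i \in \{0, 1\}$ := value of $f_i$ in doc $D$. Then

$$
\begin{aligned}
\frac{P(D \mid R = 1)}{P(D \mid R = 0)}
&= \prod_{i : d_i = 1} \frac{p_i}{q_i} \cdot \prod_{i : d_i = 0} \frac{1 - p_i}{1 - q_i} \\
&= \prod_{i : d_i = 1} \frac{p_i}{q_i} \cdot \left( \prod_{i : d_i = 1} \frac{1 - q_i}{1 - p_i} \cdot \prod_{i : d_i = 1} \frac{1 - p_i}{1 - q_i} \right) \cdot \prod_{i : d_i = 0} \frac{1 - p_i}{1 - q_i} \\
&= \prod_{i : d_i = 1} \frac{p_i (1 - q_i)}{q_i (1 - p_i)} \cdot \prod_{i = 1}^{|F|} \frac{1 - p_i}{1 - q_i} \\
&\stackrel{rank}{=} \prod_{i : d_i = 1} \frac{p_i (1 - q_i)}{q_i (1 - p_i)}
\stackrel{rank}{=} \sum_{i : d_i = 1} \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}
\end{aligned}
$$

The last expression is the ranking score. The product over all $|F|$ features is the same for every document, so dropping it does not change the ranking, and taking the logarithm is monotonic.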
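The simplification above can be checked numerically. The following is a minimal sketch, assuming made-up probabilities `p` and `q` for three features and four toy documents (none of these values come from the slides). It computes both the full likelihood ratio and the simplified ranking score and compares the orderings they produce.

```python
import math

def likelihood_ratio(doc, p, q):
    """Full ratio P(D|R=1) / P(D|R=0), multiplying over every feature."""
    ratio = 1.0
    for i, present in enumerate(doc):
        if present:
            ratio *= p[i] / q[i]
        else:
            ratio *= (1 - p[i]) / (1 - q[i])
    return ratio

def ranking_score(doc, p, q):
    """Simplified score: sum of log odds ratios over features present in the doc."""
    return sum(
        math.log(p[i] * (1 - q[i]) / (q[i] * (1 - p[i])))
        for i, present in enumerate(doc)
        if present
    )

# Hypothetical values, chosen only for illustration.
p = [0.8, 0.3, 0.6]   # P(f_i | R = 1)
q = [0.2, 0.4, 0.1]   # P(f_i | R = 0)
docs = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]

# Document indices sorted from best to worst under each scoring function.
by_ratio = sorted(range(len(docs)),
                  key=lambda j: likelihood_ratio(docs[j], p, q), reverse=True)
by_score = sorted(range(len(docs)),
                  key=lambda j: ranking_score(docs[j], p, q), reverse=True)
print(by_ratio, by_score)
```

The score only touches features with $d_i = 1$, which is what lets a search engine skip every term that does not occur in the document.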

Relationship to IDF

Under certain assumptions, the ranking score is just IDF:

1. All words have a fixed uniform probability of appearing in a relevant document: $p_i = 1/2$.
2. Most documents containing the term are non-relevant, so $q_i \approx df_i / D$.
3. Most documents do not contain the term, so $D - df_i \approx D$.

The ranking score, approximated using these assumptions, becomes IDF:

$$
\begin{aligned}
\log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}
&\approx \log \frac{0.5 \left(1 - \frac{df_i}{D}\right)}{\frac{df_i}{D} (1 - 0.5)} \\
&= \log \frac{1 - \frac{df_i}{D}}{\frac{df_i}{D}} \\
&= \log \left( \frac{D - df_i}{D} \cdot \frac{D}{df_i} \right) \\
&= \log \frac{D - df_i}{df_i} \\
&\approx \log \frac{D}{df_i}
\end{aligned}
$$
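To get a feel for how good the approximation is, here is a small sketch comparing the score under assumptions 1 and 2 with plain IDF, which additionally needs assumption 3. The collection size and document frequencies are hypothetical.

```python
import math

def bim_weight(p_i, q_i):
    """Per-term ranking score: log of p_i(1 - q_i) / (q_i(1 - p_i))."""
    return math.log(p_i * (1 - q_i) / (q_i * (1 - p_i)))

def idf(df_i, num_docs):
    """Plain IDF: log(D / df_i)."""
    return math.log(num_docs / df_i)

num_docs = 1_000_000  # hypothetical collection size D
for df_i in (10, 1_000, 100_000, 500_000):
    score = bim_weight(0.5, df_i / num_docs)  # assumptions 1 and 2 only
    approx = idf(df_i, num_docs)              # adds assumption 3
    print(f"df={df_i:>7}  score={score:.4f}  idf={approx:.4f}")
```

The two values agree closely for rare terms and drift apart as $df_i$ approaches $D$, which is where assumption 3 stops holding.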

Improving on IDF

It turns out that we can do better than IDF. To get there, we'll start by considering the contingency table of all combinations of $d_i$ and $R$.

            R = 1        R = 0                      Total
d_i = 1     r_i          df_i - r_i                 df_i
d_i = 0     R - r_i      D - R - df_i + r_i         D - df_i
Total       R            D - R                      D

We will estimate $p_i$ and $q_i$ using this table and a technique called "add-α smoothing," with α = 0.5:

$$p_i = \frac{r_i + 0.5}{R + 1}; \qquad q_i = \frac{df_i - r_i + 0.5}{D - R + 1}$$

This leads to a slightly different ranking score:

$$
\begin{aligned}
\sum_{i : d_i = 1} \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}
&= \sum_{i : d_i = 1} \log \frac{(\mathrm{num}(d_i = 1, R = 1) + 0.5) / (\mathrm{num}(d_i = 0, R = 1) + 0.5)}{(\mathrm{num}(d_i = 1, R = 0) + 0.5) / (\mathrm{num}(d_i = 0, R = 0) + 0.5)} \\
&= \sum_{i : d_i = 1} \log \frac{(r_i + 0.5) / (R - r_i + 0.5)}{(df_i - r_i + 0.5) / (D - R - df_i + r_i + 0.5)}
\end{aligned}
$$
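As a sanity check on the algebra, the sketch below computes the smoothed estimates from contingency-table counts and confirms that the probability form and the count form of the per-term weight agree. The counts are invented for illustration, and the `2 * alpha` in the denominators is this sketch's generalization of the slide's `+ 1` for α = 0.5.

```python
import math

def smoothed_estimates(r_i, df_i, num_rel, num_docs, alpha=0.5):
    """Add-alpha estimates of p_i and q_i from the contingency table."""
    p_i = (r_i + alpha) / (num_rel + 2 * alpha)
    q_i = (df_i - r_i + alpha) / (num_docs - num_rel + 2 * alpha)
    return p_i, q_i

def weight_from_counts(r_i, df_i, num_rel, num_docs):
    """Per-term weight written directly in terms of the four table cells."""
    rel_odds = (r_i + 0.5) / (num_rel - r_i + 0.5)
    nonrel_odds = (df_i - r_i + 0.5) / (num_docs - num_rel - df_i + r_i + 0.5)
    return math.log(rel_odds / nonrel_odds)

# Hypothetical counts: r_i, df_i, R, D.
r_i, df_i, num_rel, num_docs = 8, 50, 10, 1000

p_i, q_i = smoothed_estimates(r_i, df_i, num_rel, num_docs)
via_probs = math.log(p_i * (1 - q_i) / (q_i * (1 - p_i)))
via_counts = weight_from_counts(r_i, df_i, num_rel, num_docs)
print(via_probs, via_counts)
```

The smoothing keeps every cell strictly positive, so the logarithm stays defined even when a term never occurs in a relevant document.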

Is it better?

Let's unpack this formula to understand it better.

A better IDF?

$$\log \frac{(r_i + 0.5) / (R - r_i + 0.5)}{(df_i - r_i + 0.5) / (D - R - df_i + r_i + 0.5)}$$

The numerator is a ratio of counts of relevant documents the term does and does not appear in. It's a likelihood ratio giving the amount of "evidence of relevance" the term provides.

The denominator is the same ratio, for non-relevant documents. It gives the amount of "evidence of non-relevance" for the term.

If the term is in many documents, but most of them are relevant, it doesn't discount the term as IDF would.
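The last point can be illustrated with a small sketch. It assumes a hypothetical collection of 10,000 documents, 4,000 of them relevant, and a term that occurs in 3,000 documents. IDF gives that term one fixed weight, while the relevance-aware weight depends on how its occurrences split between relevant and non-relevant documents.

```python
import math

def relevance_weight(r_i, df_i, num_rel, num_docs):
    """Smoothed log odds ratio from the formula above."""
    rel_odds = (r_i + 0.5) / (num_rel - r_i + 0.5)
    nonrel_odds = (df_i - r_i + 0.5) / (num_docs - num_rel - df_i + r_i + 0.5)
    return math.log(rel_odds / nonrel_odds)

def idf(df_i, num_docs):
    """Plain IDF: log(D / df_i)."""
    return math.log(num_docs / df_i)

# Hypothetical collection: D = 10,000 documents, R = 4,000 relevant.
num_docs, num_rel = 10_000, 4_000
df_i = 3_000  # a fairly common term

mostly_relevant = relevance_weight(2_900, df_i, num_rel, num_docs)
mostly_nonrelevant = relevance_weight(100, df_i, num_rel, num_docs)
plain_idf = idf(df_i, num_docs)
print(mostly_relevant, plain_idf, mostly_nonrelevant)
```

With these numbers the mostly-relevant case scores well above IDF and the mostly-non-relevant case goes negative, even though $df_i$ is identical in both.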

Okapi BM25

Okapi BM25 is one of the strongest "simple" scoring functions, and has proven a useful baseline for experiments and feature for ranking.

It combines:

• the IDF-like ranking score from the last slide,
• the document term frequency $tf_{i,d}$, normalized by the ratio of the document's length $dl$ to the average length $\mathrm{avg}(dl)$, and
• the query term frequency $tf_{i,q}$.

Okapi BM25:

$$
\sum_{i : d_i = q_i = 1}
\log \frac{(r_i + 0.5) / (R - r_i + 0.5)}{(df_i - r_i + 0.5) / (D - R - df_i + r_i + 0.5)}
\cdot \frac{tf_{i,d} + k_1 \cdot tf_{i,d}}{tf_{i,d} + k_1 \left( (1 - b) + b \cdot \frac{dl}{\mathrm{avg}(dl)} \right)}
\cdot \frac{tf_{i,q} + k_2 \cdot tf_{i,q}}{tf_{i,q} + k_2}
$$

$k_1$, $k_2$, and $b$ are empirically-set parameters. Typical values at TREC are:

$k_1 = 1.2$
$0 \le k_2 \le 1000$
$b = 0.75$
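The formula above translates fairly directly into code. The following is a minimal sketch rather than a reference implementation: the function name `bm25_score`, its dictionary-based arguments, the default `k2=100.0` (an arbitrary pick inside the range quoted above), and the toy inputs in the usage call are all assumptions made for illustration.

```python
import math

def bm25_score(query_tf, doc_tf, df, num_docs, dl_ratio,
               k1=1.2, k2=100.0, b=0.75, rel_df=None, num_rel=0):
    """Score one document for one query.

    query_tf: term -> frequency in the query (tf_{i,q})
    doc_tf:   term -> frequency in the document (tf_{i,d})
    df:       term -> number of documents containing the term (df_i)
    dl_ratio: document length divided by average length, dl / avg(dl)
    rel_df:   term -> number of relevant documents containing it (r_i)
    num_rel:  number of known relevant documents (R)
    """
    rel_df = rel_df or {}
    length_norm = k1 * ((1 - b) + b * dl_ratio)
    score = 0.0
    for term, tf_q in query_tf.items():
        tf_d = doc_tf.get(term, 0)
        if tf_d == 0:
            continue  # the sum only covers terms in both query and document
        r_i = rel_df.get(term, 0)
        df_i = df[term]
        idf_like = math.log(
            ((r_i + 0.5) / (num_rel - r_i + 0.5))
            / ((df_i - r_i + 0.5) / (num_docs - num_rel - df_i + r_i + 0.5))
        )
        doc_part = (tf_d + k1 * tf_d) / (tf_d + length_norm)
        query_part = (tf_q + k2 * tf_q) / (tf_q + k2)
        score += idf_like * doc_part * query_part
    return score

# Toy usage with made-up statistics and no relevance information.
score = bm25_score(
    query_tf={"okapi": 1, "ranking": 1},
    doc_tf={"okapi": 3, "ranking": 1, "model": 7},
    df={"okapi": 50, "ranking": 2_000},
    num_docs=100_000,
    dl_ratio=1.0,
)
print(round(score, 2))
```

Passing `dl_ratio` instead of raw lengths keeps the sketch independent of how document length is measured. With no relevance information ($R = r_i = 0$) the first factor reduces to an IDF-like term.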

Example: BM25

Example query: "president lincoln"

• $tf_{\text{president},q} = tf_{\text{lincoln},q} = 1$
• No relevance information: $R = r_i = 0$
• "president" is in 40,000 documents in the collection: $df_{\text{president}} = 40{,}000$
• "lincoln" is in 300 documents in the collection: $df_{\text{lincoln}} = 300$
• The document length is 90% of the average length: $dl / \mathrm{avg}(dl) = 0.9$
• We pick $k_1 = 1.2$, $k_2 = 100$, $b = 0.75$

tf_president   tf_lincoln   BM25
15             25           20.66
15             1            12.74
15             0            5.00
1              25           18.2
0              25           15.66

The low df term plays a bigger role.
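The first row of the table can be worked by hand. The slide does not state the collection size $D$, so the derivation below assumes $D = 500{,}000$ and natural logarithms; both are assumptions, chosen because they reproduce the tabulated values.

```latex
% Assumption (not on the slide): D = 500,000; log is the natural logarithm.
\begin{align*}
k_1\Bigl((1-b) + b \cdot \tfrac{dl}{\mathrm{avg}(dl)}\Bigr)
  &= 1.2\,(0.25 + 0.75 \cdot 0.9) = 1.11 \\[4pt]
\text{president:}\quad
  \log\frac{(0 + 0.5)/(0 - 0 + 0.5)}{(40000 - 0 + 0.5)/(500000 - 0 - 40000 + 0 + 0.5)}
  \cdot \frac{15 + 1.2 \cdot 15}{15 + 1.11}
  \cdot \frac{1 + 100 \cdot 1}{1 + 100}
  &= \log\frac{460000.5}{40000.5} \cdot \frac{33}{16.11} \cdot 1 \\
  &\approx 2.44 \cdot 2.05 \approx 5.00 \\[4pt]
\text{lincoln:}\quad
  \log\frac{(0 + 0.5)/(0 - 0 + 0.5)}{(300 - 0 + 0.5)/(500000 - 0 - 300 + 0 + 0.5)}
  \cdot \frac{25 + 1.2 \cdot 25}{25 + 1.11}
  \cdot \frac{1 + 100 \cdot 1}{1 + 100}
  &= \log\frac{499700.5}{300.5} \cdot \frac{55}{26.11} \cdot 1 \\
  &\approx 7.42 \cdot 2.11 \approx 15.66 \\[4pt]
\text{BM25} &\approx 5.00 + 15.66 = 20.66
\end{align*}
```

The two contributions match the rows where only one of the terms occurs in the document (5.00 and 15.66). Rounding each factor to two decimals before multiplying gives the tabulated 20.66; carrying full precision gives roughly 20.63.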

Wrapping Up

Binary Independence Models are a principled, general way to combine evidence from many binary features (not just unigrams!).

The version of BM25 shown here is one of many in a family of scoring functions. Modern alternatives can take additional evidence, such as anchor text, into account.

Next, we'll generalize what we've learned so far into the fundamental topics of machine learning.
