Probabilistic Information Retrieval
CE-324: Modern Information Retrieval
Sharif University of Technology
- M. Soleymani
Fall 2017
Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
Why probabilities in IR?

User Information Need → Query Representation
Documents → Document Representation
How to match?

In traditional IR systems, matching between each doc and query is attempted in a semantically imprecise space of index terms. Probabilities provide a principled foundation for reasoning under uncertainty. Can we use probabilities to quantify our uncertainties?

Uncertain guess of whether a doc has relevant content
Understanding of the user need is uncertain
Probabilistic methods are one of the oldest but also one of the currently hottest topics in IR.
Traditionally: neat ideas, but they didn't win on performance. It may be different now.
Classical probabilistic retrieval model
- Probability Ranking Principle
- Binary Independence Model (we will see that it is ≈ a Naïve Bayes text categorization model)
- (Okapi) BM25
Language model approach to IR
- An important emphasis on this approach in recent work
Problem specification:
We have a collection of docs. A user issues a query. A list of docs needs to be returned.
The ranking method is the core of an IR system:
In what order do we present documents to the user?
Idea: rank by probability of relevance of the doc w.r.t. the query:
P(R = 1 | doc_j, query)
“If a reference retrieval system’s response to each request is a ranking of the docs in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data.”
[1960s/1970s] S. Robertson, W.S. Cooper, M.E. Maron; van Rijsbergen (1979:113); Manning & Schütze (1999:538)
Product rule: p(a, b) = p(a|b) p(b)
Sum rule: p(a) = Σ_b p(a, b)
Bayes' rule: p(a|b) = p(b|a) p(a) / p(b)   (posterior = likelihood × prior / evidence)
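These rules can be checked numerically on a toy joint distribution (the probabilities below are illustrative, not from the slides):

```python
# Toy joint distribution over a (relevance, 0/1) and b (term present, 0/1).
# The probabilities are made up for illustration.
joint = {(1, 1): 0.10, (1, 0): 0.05, (0, 1): 0.15, (0, 0): 0.70}

def p_a(a):
    # Sum rule: p(a) = sum over b of p(a, b)
    return sum(v for (ai, bi), v in joint.items() if ai == a)

def p_b(b):
    return sum(v for (ai, bi), v in joint.items() if bi == b)

def cond(a, b):
    # Product rule rearranged: p(a|b) = p(a, b) / p(b)
    return joint[(a, b)] / p_b(b)

# Bayes' rule: p(a|b) = p(b|a) p(a) / p(b)
posterior = cond(1, 1)
via_bayes = (joint[(1, 1)] / p_a(1)) * p_a(1) / p_b(1)
assert abs(posterior - via_bayes) < 1e-12
print(posterior)  # 0.10 / 0.25 = 0.4
```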
p(d|R = 1, q): probability of doc d in the class of relevant docs (for query q)
p(d|R = 0, q): probability of doc d in the class of non-relevant docs
P(R = 1|d, q) = P(d|R = 1, q) P(R = 1|q) / P(d|q)
P(R = 0|d, q) = P(d|R = 0, q) P(R = 0|q) / P(d|q)

How do we compute all those probabilities?
We do not know the exact probabilities, so we have to use estimates. The Binary Independence Model (BIM), which we discuss next, is the simplest model.
Estimate how terms contribute to relevance:
How do things like tf, df, and doc length influence your judgments about doc relevance? A more nuanced answer is the Okapi formula (Spärck Jones / Robertson).
Combine the estimated values to find doc relevance.
Order docs by decreasing probability.
Basic concept (van Rijsbergen): for a given query, if we know some docs that are relevant, terms that occur in those docs should be given greater weighting in searching for other relevant docs.
Traditionally used in conjunction with the PRP.
"Binary" = Boolean: docs are represented as binary incidence vectors of terms:
x = [x_1, x_2, …, x_n], where x_i = 1 iff term i is present in doc x.
"Independence": terms occur in docs independently.
Equivalent to the multivariate Bernoulli Naive Bayes model sometimes used for text categorization [we will see it in the next lectures].
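A minimal sketch of the binary incidence representation (the toy docs and whitespace tokenization are assumptions, not from the slides):

```python
# Represent each doc as a binary incidence vector over a fixed vocabulary.
# Toy documents for illustration only.
docs = [
    "news about presidential campaign",
    "news about organic food campaign",
    "presidential candidate news",
]

# Fixed term ordering: term i of the vocabulary maps to component x_i.
vocab = sorted({t for d in docs for t in d.split()})

def to_incidence(doc):
    # x_i = 1 iff term i occurs in the doc; term frequency is discarded.
    present = set(doc.split())
    return [1 if t in present else 0 for t in vocab]

X = [to_incidence(d) for d in docs]
print(vocab)
print(X[0])
```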
Will use odds and Bayes' rule:

O(R|q, x) = P(R = 1|q, x) / P(R = 0|q, x)
          = [P(R = 1|q) / P(R = 0|q)] · [P(x|R = 1, q) / P(x|R = 0, q)]
Using the independence assumption:

P(x|R = 1, q) / P(x|R = 0, q) = Π_{i=1..n} P(x_i|R = 1, q) / P(x_i|R = 0, q)

So:

O(R|q, x) = O(R|q) · Π_{i=1..n} P(x_i|R = 1, q) / P(x_i|R = 0, q)

O(R|q) is constant for a given query; only the product needs estimation.
Since each x_i is either 0 or 1, split the product:

O(R|q, x) = O(R|q) · Π_{x_i=1} [P(x_i = 1|R = 1, q) / P(x_i = 1|R = 0, q)] · Π_{x_i=0} [P(x_i = 0|R = 1, q) / P(x_i = 0|R = 0, q)]

Let p_i = P(x_i = 1|R = 1, q)

Assume, for all terms not occurring in the query (q_i = 0), that p_i = u_i. This can be changed (e.g., in relevance feedback).
u_i = P(x_i = 1|R = 0, q)

                        relevant (R = 1)    not relevant (R = 0)
term present (x_i = 1)        p_i                  u_i
term absent  (x_i = 0)      1 − p_i              1 − u_i
O(R|q, x) = O(R|q) · Π_{x_i=q_i=1} (p_i / u_i) · Π_{x_i=0, q_i=1} (1 − p_i)/(1 − u_i)
            [all matching terms]    [non-matching query terms]

          = O(R|q) · Π_{x_i=q_i=1} p_i(1 − u_i) / [u_i(1 − p_i)] · Π_{q_i=1} (1 − p_i)/(1 − u_i)
            [all matching terms]                                   [all query terms]
O(R|q, x) = O(R|q) · Π_{q_i=1} (1 − p_i)/(1 − u_i) · Π_{x_i=q_i=1} p_i(1 − u_i) / [u_i(1 − p_i)]

The first two factors are constant for each query; the last product is the only quantity that needs to be estimated for ranking.
Retrieval Status Value:

RSV = log Π_{x_i=q_i=1} p_i(1 − u_i) / [u_i(1 − p_i)] = Σ_{x_i=q_i=1} c_i

where c_i = log [p_i(1 − u_i)] / [u_i(1 − p_i)]
Example: q = {x_1, x_2}. Relevance judgements from 20 docs together with the doc representations give the estimates:
p_1 = 8/12, u_1 = 3/8, p_2 = 7/12, u_2 = 4/8
c_1 = log(10/3), c_2 = log(7/5)
Doc representations (x_1, x_2): (1,1), (1,0), (0,1), (0,0)
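The numbers in this example can be checked directly; ranking the four doc vectors by RSV reproduces the expected order:

```python
from math import log

# Estimates taken from the example above.
p = [8/12, 7/12]   # P(x_i = 1 | R = 1)
u = [3/8, 4/8]     # P(x_i = 1 | R = 0)

# c_i = log [p_i (1 - u_i)] / [u_i (1 - p_i)]
c = [log(p[i] * (1 - u[i]) / (u[i] * (1 - p[i]))) for i in range(2)]

def rsv(x, q=(1, 1)):
    # Sum c_i over terms present in both the doc and the query.
    return sum(c[i] for i in range(2) if x[i] == 1 and q[i] == 1)

docs = [(1, 1), (1, 0), (0, 1), (0, 0)]
ranking = sorted(docs, key=rsv, reverse=True)
print(ranking)  # (1,1) ranks first, (0,0) last
```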
Estimating RSV coefficients in theory. For each term i, look at this table of document counts:

                 relevant    non-relevant         total
x_i = 1             s          df_i − s            df_i
x_i = 0           S − s    N − df_i − S + s      N − df_i
total               S          N − S                N

p_i ≈ s/S,   u_i ≈ (df_i − s)/(N − S)
c_i ≈ K(N, df_i, S, s) = log [s/(S − s)] / [(df_i − s)/(N − df_i − S + s)]

For now, assume no zero terms.
Weight of the i-th term: c_i.

If non-relevant docs are approximated by the whole collection:
u_i = df_i / N   (prob. of occurrence in non-relevant docs for the query)
log (1 − u_i)/u_i = log (N − df_i)/df_i ≈ log N/df_i
IDF!
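A sketch of the c_i estimate from the contingency counts, with 1/2 smoothing to avoid zeros, and a check that it approaches plain idf when no relevance information is available (the counts below are made up):

```python
from math import log

def c_i(N, df, S, s):
    # BIM term weight from doc counts: N docs in total, df containing the term,
    # S judged relevant, s of which contain the term. 1/2 smoothing avoids zeros.
    p = (s + 0.5) / (S + 1)            # estimate of P(x_i = 1 | R = 1)
    u = (df - s + 0.5) / (N - S + 1)   # estimate of P(x_i = 1 | R = 0)
    return log(p * (1 - u) / (u * (1 - p)))

def idf(N, df):
    # log (1 - u_i)/u_i = log (N - df)/df ≈ log N/df when u_i = df/N
    return log(N / df)

# With no relevance information (S = s = 0), c_i is close to idf for rare terms.
N, df = 100_000, 50
print(c_i(N, df, S=0, s=0), idf(N, df))
```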
p_i cannot be approximated as easily as u_i (it is the probability of occurrence in relevant docs). p_i can be estimated in various ways:
- as a constant (Croft and Harper combination match); then we just get idf weighting of terms
- proportional to the prob. of occurrence in the collection; Greiff (SIGIR 1998) argues for 1/3 + 2/3 · df_i/N
- from relevant docs, if we know some; relevance weighting can be used in a feedback loop
Probabilistic relevance feedback:
1. Guess initial estimates of p_i and u_i.
2. Use the current estimates to rank docs and show the user a set of candidates.
3. The user marks some of them as relevant (VR) and some as non-relevant (VNR).
4. Re-estimate p_i and u_i on the basis of these judged docs, and repeat.
p_i = (|VR_i| + 1/2) / (|VR| + 1),   u_i = (|VNR_i| + 1/2) / (|VNR| + 1)

(VR_i: judged relevant docs containing term i; VNR_i: judged non-relevant docs containing term i)
Or we can combine the new information with the original guess (a Bayesian update):

p_i^(t+1) = (|VR_i| + κ p_i^(t)) / (|VR| + κ)

where κ is the prior weight.
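The Bayesian update can be written directly; κ and the judged counts below are illustrative:

```python
def bayesian_update(p_prev, vr_i, vr, kappa=5.0):
    # p_i^(t+1) = (|VR_i| + kappa * p_i^(t)) / (|VR| + kappa)
    # kappa weighs the previous estimate against the newly judged docs.
    return (vr_i + kappa * p_prev) / (vr + kappa)

# Start from even odds; 10 judged relevant docs, 7 of which contain the term.
p = bayesian_update(0.5, vr_i=7, vr=10)
print(p)  # (7 + 2.5) / 15
```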
Estimating p_i and u_i without relevance information (pseudo-relevance feedback):
1. Assume p_i = 0.5 (even odds) for any given doc.
2. Rank docs with this model; let V be a fixed-size set of the highest-ranked docs.
3. Let V_i be the set of docs in V containing x_i, and re-estimate
   p_i = (|V_i| + 1/2) / (|V| + 1)
4. Assume that docs not retrieved are not relevant:
   u_i = (df_i − |V_i| + 1/2) / (N − |V| + 1)
Iterate steps 2–4 until the ranking stabilizes.
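The steps above can be sketched over a binary doc-term matrix; the matrix, query, and collection statistics below are made up, and df is assumed to come from the full collection (with 0 < df_i < N):

```python
from math import log

def prf_rank(X, q_terms, N, df, rounds=2, v_size=3):
    # Pseudo-relevance feedback over binary doc vectors X (list of 0/1 lists).
    # q_terms: indices of query terms; df[i]: collection doc frequency of term i;
    # N: collection size. Assumes 0 < df[i] < N.
    p = {i: 0.5 for i in q_terms}          # step 1: even odds
    u = {i: df[i] / N for i in q_terms}    # initial u_i from the collection
    ranking = list(range(len(X)))
    for _ in range(rounds):
        def rsv(d):
            return sum(log(p[i] * (1 - u[i]) / (u[i] * (1 - p[i])))
                       for i in q_terms if X[d][i] == 1)
        ranking = sorted(range(len(X)), key=rsv, reverse=True)  # step 2
        V = ranking[:v_size]               # highest-ranked set
        for i in q_terms:                  # steps 3-4: re-estimate p_i, u_i
            vi = sum(X[d][i] for d in V)
            p[i] = (vi + 0.5) / (len(V) + 1)
            u[i] = (df[i] - vi + 0.5) / (N - len(V) + 1)
    return ranking

X = [[1, 1], [1, 0], [0, 1], [0, 0]]
print(prf_rank(X, q_terms=[0, 1], N=10, df=[4, 3]))
```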
Getting reasonable approximations of the probabilities is possible, but it requires restrictive assumptions:
- Boolean representation of docs/queries/relevance
- term independence
- terms that do not appear in the query don't affect the outcome
- doc relevance values are independent
Some of these assumptions can be removed. Problem: we either require partial relevance information, or can only derive somewhat inferior term weights.
In the 1970s, estimation problems held back the success of this model.
Friedman and Goldszmidt's Tree Augmented Naive Bayes (AAAI 13, 1996) is a later approach to modeling term dependencies.
BIM was designed for titles or abstracts, and not for modern full-text search (like much of original IR).
We want to pay attention to term frequency and doc lengths, just like in the other models we've discussed.
BM25, "Best Match 25" (they had a bunch of tries!):
- developed in the context of the Okapi system
- started to be increasingly adopted by other teams during the TREC competitions
- it works well
Goal: relax some assumptions of BIM while not adding too many parameters (Spärck Jones et al. 2000).
I'll omit the theory, but show the form…
BIM boils down to:

RSV^BIM = Σ_{x_i=q_i=1} c_i^BIM,   c_i^BIM = log [p_i(1 − u_i)] / [u_i(1 − p_i)]

which simplifies (with constant p_i = 0.5 and u_i = df_i/N) to:

RSV^BIM = Σ_{x_i=q_i=1} log N/df_i
Version 1: using the saturation function tf/(k1 + tf):

c_i^BM25v1(tf_i) = c_i^BIM · tf_i / (k1 + tf_i)

Version 2: using the BIM simplification to IDF:

c_i^BM25v2(tf_i) = log(N/df_i) · (k1 + 1) tf_i / (k1 + tf_i)

The (k1 + 1) factor doesn't change the ranking, but makes the term score 1 when tf_i = 1. Similar to tf-idf, but term scores are bounded.
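The saturation behavior is easy to check numerically; the (k1 + 1) tf / (k1 + tf) factor is 1 at tf = 1 and bounded above by k1 + 1:

```python
def tf_factor(tf, k1=1.2):
    # Saturating tf component of BM25 version 2 (no length normalization yet).
    return (k1 + 1) * tf / (k1 + tf)

print(tf_factor(1))   # exactly 1.0
print([round(tf_factor(t), 3) for t in (2, 5, 20, 1000)])  # climbs toward k1 + 1 = 2.2
```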
Longer docs are likely to have larger tf_i values. Why might docs be longer?
- Verbosity: suggests the observed tf_i is too high
- Larger scope: suggests the observed tf_i may be right
A real doc collection probably has both effects, so we should apply some kind of normalization.
Let dl be the doc length and avdl the average doc length over the collection. Length normalization component:
B = (1 − b) + b · dl/avdl,   0 ≤ b ≤ 1
b = 1: full document length normalization; b = 0: no document length normalization
Factor in the frequency of each term versus doc length:

c_i^BM25(tf_i) = log(N/df_i) · (k1 + 1) tf_i / (k1((1 − b) + b · dl/avdl) + tf_i)

RSV^BM25 = Σ_{q_i=1} c_i^BM25(tf_i)

tf_i is the frequency of term i in doc d; dl is the length of d and avdl is the average doc length; k1 and b are tuning parameters.
k1 controls term frequency scaling:
k1 = 0 is the binary model; large k1 gives raw term frequency.
b controls doc length normalization:
b = 0 is no length normalization; b = 1 is relative frequency (fully scale by doc length).
Typically, k1 is set around 1.2–2 and b around 0.75.
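Putting the pieces together: a minimal BM25 scorer using the formula and the typical parameter values above (the toy collection and whitespace tokenization are assumptions for illustration):

```python
from math import log
from collections import Counter

def bm25(query_terms, doc_tf, dl, avdl, N, df, k1=1.5, b=0.75):
    # RSV^BM25 = sum over query terms of
    #   log(N/df_i) * (k1 + 1) * tf_i / (k1*((1 - b) + b*dl/avdl) + tf_i)
    norm = k1 * ((1 - b) + b * dl / avdl)
    return sum(log(N / df[t]) * (k1 + 1) * doc_tf[t] / (norm + doc_tf[t])
               for t in query_terms if doc_tf.get(t, 0) > 0 and t in df)

# Toy collection, tokenized by whitespace (illustrative only).
docs = ["the cat sat on the mat", "the dog sat", "cat cat cat"]
tfs = [Counter(d.split()) for d in docs]
N = len(docs)
df = Counter(t for tf in tfs for t in tf)
avdl = sum(len(d.split()) for d in docs) / N

scores = [bm25(["cat", "sat"], tfs[i], len(docs[i].split()), avdl, N, df)
          for i in range(N)]
print(scores)  # "cat cat cat" scores highest despite matching only one query term
```

Note how the third doc wins on the saturated but repeated "cat", while the first doc's two single matches are dampened by its greater length.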
Resources:
C. J. van Rijsbergen. 1979. Information Retrieval. 2nd ed. London: Butterworths, chapter 6. [Most details of the math] http://www.dcs.gla.ac.uk/Keith/Preface.html
N. Fuhr. 1992. Probabilistic Models in Information Retrieval. The Computer Journal, 35(3), 243–255. [Easiest read, with BNs]
F. Crestani, M. Lalmas, C. J. van Rijsbergen, and I. Campbell. 1998. Is This Document Relevant? … Probably: A Survey of Probabilistic Models in Information Retrieval. ACM Computing Surveys 30(4): 528–552. http://www.acm.org/pubs/citations/journals/surveys/1998-30-4/p528-crestani/ [Adds very little material that isn't in van Rijsbergen or Fuhr]