III.3 Probabilistic Retrieval Models
1. Probability Ranking Principle
2. Binary Independence Model
3. Okapi BM25
4. Tree Dependence Model
5. Bayesian Networks for IR

Based on MRS Chapter 11
IR&DM ’13/’14

TF*IDF vs. Probabilistic IR vs. Statistical LMs
• TF*IDF and the vector space model (VSM) produce sufficiently good results in practice but are often criticized for being “too ad-hoc” or “not principled”
• They are typically outperformed by probabilistic retrieval models and statistical language models in IR benchmarks (e.g., TREC)
• Probabilistic retrieval models
  • use generative models of documents as bags-of-words
  • explicitly model the probability of relevance $P[R \mid d, q]$
• Statistical language models
  • use generative models of documents and queries as sequences-of-words
  • consider the likelihood of generating the query from the document model, or the divergence of document model and query model (e.g., Kullback-Leibler)

Probabilistic Information Retrieval
• Generative model
  • probabilistic mechanism for producing documents (or queries)
  • usually based on a family of parameterized probability distributions

[Figure: diagram with labels $t_1, \dots, t_M$ and $d_1$]

• Powerful model but restricted through practical limitations
  • often strong independence assumptions required for tractability
  • parameter estimation has to deal with sparseness of available data (e.g., a collection with $M$ terms has $2^M$ distinct possible documents, but model parameters need to be estimated from $N \ll 2^M$ documents)

Multivariate Bernoulli Model
• For generating document $d$ from joint (multivariate) term distribution $\Phi$
  • consider binary random variables: $d_t = 1$ if term $t$ occurs in $d$, 0 otherwise
  • postulate independence among these random variables

$$P[d \mid \Phi] = \prod_{t \in V} \phi_t^{d_t} \, (1 - \phi_t)^{1 - d_t}, \qquad \phi_t = P[\text{term } t \text{ occurs in a document}]$$

• Problems:
  • underestimates probability of short documents
  • product for absent terms underestimates probability of likely documents
  • too much probability mass given to very unlikely term combinations
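The Bernoulli likelihood above can be sketched in a few lines of Python. This is only an illustration: the function name, the three-term vocabulary, and the values of $\phi_t$ are assumptions made up for the example, and the computation is done in log space to avoid underflow for large vocabularies.

```python
import math

def bernoulli_log_likelihood(doc_terms, phi):
    """log P[d | Phi] under the multivariate Bernoulli model.

    doc_terms: set of terms present in the document (d_t = 1).
    phi: dict mapping every vocabulary term t to phi_t, with 0 < phi_t < 1.
    """
    log_p = 0.0
    for term, phi_t in phi.items():
        if term in doc_terms:
            log_p += math.log(phi_t)        # factor phi_t^(d_t) with d_t = 1
        else:
            log_p += math.log(1.0 - phi_t)  # factor (1 - phi_t)^(1 - d_t) with d_t = 0
    return log_p

# Hypothetical vocabulary and parameters, chosen only for illustration.
phi = {"retrieval": 0.5, "model": 0.25, "poisson": 0.1}
log_p = bernoulli_log_likelihood({"retrieval", "model"}, phi)
```

Note that every absent vocabulary term contributes a factor $(1 - \phi_t)$, which is where the "product for absent terms" problem listed above comes from.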

1. Probability Ranking Principle (PRP)
“If a reference retrieval system’s response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data.” [van Rijsbergen 1979]
• PRP with costs [Robertson 1977] defines the cost of retrieving $d$ as the next result in a ranked list for query $q$ as

$$\mathrm{cost}(d, q) = C_1 \, P[R \mid d, q] + C_0 \, P[\bar{R} \mid d, q]$$

  with cost constants
  • $C_1$ as the cost of retrieving a relevant document
  • $C_0$ as the cost of retrieving an irrelevant document
• For $C_1 < C_0$, the cost is minimized by choosing $\arg\max_d P[R \mid d, q]$
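To see why minimizing cost and maximizing relevance probability agree, note that $P[\bar{R} \mid d, q] = 1 - P[R \mid d, q]$, so the cost equals $C_0 + (C_1 - C_0) \, P[R \mid d, q]$, which decreases in $P[R \mid d, q]$ whenever $C_1 < C_0$. A minimal sketch follows; the relevance estimates and cost constants are hypothetical values, not taken from the slides.

```python
def expected_cost(p_rel, c1, c0):
    """cost(d, q) = C1 * P[R | d, q] + C0 * P[not R | d, q]."""
    return c1 * p_rel + c0 * (1.0 - p_rel)

# Hypothetical relevance estimates P[R | d, q] and cost constants with C1 < C0.
p_rel = {"d1": 0.2, "d2": 0.9, "d3": 0.6}
C1, C0 = 0.0, 1.0

# Ranking by increasing cost and by decreasing relevance probability.
by_cost = sorted(p_rel, key=lambda d: expected_cost(p_rel[d], C1, C0))
by_prob = sorted(p_rel, key=lambda d: p_rel[d], reverse=True)
```

Both orderings place the document with the highest estimated relevance probability first.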

Probability Ranking Principle (cont’d)
• Probability ranking principle makes two strong assumptions
  • $P[R \mid d, q]$ can be determined accurately
  • $P[R \mid d, q]$ and $P[R \mid d', q]$ are pairwise independent for documents $d$, $d'$
• PRP without costs (based on Bayes’ optimal decision rule)
  • returns the set of documents $d$ for which $P[R \mid d, q] > 1 - P[R \mid d, q]$
  • minimizes the expected loss (a.k.a. Bayes’ risk) under the 1/0 loss function

2. Binary Independence Model (BIM)
• Binary independence model [Robertson and Spärck-Jones 1976] has traditionally been used with the probability ranking principle
• Assumptions:
  • relevant and irrelevant documents differ in their term distribution
  • probabilities of term occurrences are pairwise independent
  • documents are sets of terms, i.e., binary term weights in {0,1}
  • non-query terms have the same probability of occurring in relevant and non-relevant documents
  • relevance of a document is independent of the relevance of other documents

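Ranking Proportional to Relevance Odds (cont’d)

$$
\begin{aligned}
\cdots &= \prod_{t \in d,\, t \in q} \frac{P[D_t \mid R]}{P[D_t \mid \bar{R}]} \times \prod_{t \notin d,\, t \in q} \frac{P[\bar{D}_t \mid R]}{P[\bar{D}_t \mid \bar{R}]} \\
&= \prod_{t \in d,\, t \in q} \frac{p_t}{q_t} \times \prod_{t \notin d,\, t \in q} \frac{1 - p_t}{1 - q_t} \qquad \text{(shortcuts } p_t = P[D_t \mid R] \text{ and } q_t = P[D_t \mid \bar{R}] \text{)} \\
&= \prod_{t \in q} \frac{p_t^{d_t}}{q_t^{d_t}} \times \prod_{t \in q} \frac{(1 - p_t)^{1 - d_t}}{(1 - q_t)^{1 - d_t}} \\
&\propto \sum_{t \in q} \left( \log \frac{p_t^{d_t} \, (1 - p_t)}{(1 - p_t)^{d_t}} - \log \frac{q_t^{d_t} \, (1 - q_t)}{(1 - q_t)^{d_t}} \right) \\
&= \sum_{t \in q} d_t \log \frac{p_t}{1 - p_t} + \sum_{t \in q} d_t \log \frac{1 - q_t}{q_t} + \sum_{t \in q} \log \frac{1 - p_t}{1 - q_t} \\
&\propto \sum_{t \in q} d_t \log \frac{p_t}{1 - p_t} + \sum_{t \in q} d_t \log \frac{1 - q_t}{q_t} \qquad \text{(the dropped sum is invariant of } d \text{)}
\end{aligned}
$$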
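The last step of the derivation drops a sum that does not depend on the document. The following sketch checks this numerically: it computes the final BIM score and the logarithm of the full product over query terms, and shows that their difference is the same for two different documents. All term names and probability values are hypothetical.

```python
import math

def bim_score(doc_terms, query_terms, p, q):
    """Sum over query terms present in d of log(p_t/(1-p_t)) + log((1-q_t)/q_t)."""
    return sum(
        math.log(p[t] / (1.0 - p[t])) + math.log((1.0 - q[t]) / q[t])
        for t in query_terms
        if t in doc_terms
    )

def full_log_odds(doc_terms, query_terms, p, q):
    """Log of the product form over all query terms, present or absent."""
    total = 0.0
    for t in query_terms:
        if t in doc_terms:
            total += math.log(p[t] / q[t])
        else:
            total += math.log((1.0 - p[t]) / (1.0 - q[t]))
    return total

# Hypothetical query, parameters, and documents.
query = ["a", "b", "c"]
p = {"a": 0.8, "b": 0.5, "c": 0.3}
q = {"a": 0.2, "b": 0.4, "c": 0.3}
doc_x, doc_y = {"a", "b"}, {"c"}

# The difference is the document-invariant sum of log((1-p_t)/(1-q_t)).
offset_x = full_log_odds(doc_x, query, p, q) - bim_score(doc_x, query, p, q)
offset_y = full_log_odds(doc_y, query, p, q) - bim_score(doc_y, query, p, q)
```

Because the offset is identical for every document, ranking by `bim_score` gives the same order as ranking by the full log odds.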

Estimating $p_t$ and $q_t$ with a Training Sample
• We can estimate $p_t$ and $q_t$ based on a training sample obtained by evaluating the query $q$ on a small sample of the corpus and asking the user for relevance feedback about the results
• Let
  • $N$ be the number of documents in our sample
  • $R$ be the number of relevant documents in our sample
  • $n_t$ be the number of documents in our sample that contain $t$
  • $r_t$ be the number of relevant documents in our sample that contain $t$
• We estimate

$$p_t = \frac{r_t}{R} \qquad q_t = \frac{n_t - r_t}{N - R}$$

  or with Lidstone smoothing ($\lambda = 0.5$)

$$p_t = \frac{r_t + 0.5}{R + 1} \qquad q_t = \frac{n_t - r_t + 0.5}{N - R + 1}$$

Smoothing (with Uniform Prior)
• Probabilities $p_t$ and $q_t$ for term $t$ are estimated by MLE for a Binomial distribution
  • repeated coin tosses for term $t$ in relevant documents ($p_t$)
  • repeated coin tosses for term $t$ in irrelevant documents ($q_t$)
• Avoid overfitting to the training sample by smoothing estimates
  • Laplace smoothing (based on Laplace’s law of succession)

$$p_t = \frac{r_t + 1}{R + 2} \qquad q_t = \frac{n_t - r_t + 1}{N - R + 2}$$

  • Lidstone smoothing (heuristic generalization with $\lambda > 0$)

$$p_t = \frac{r_t + \lambda}{R + 2\lambda} \qquad q_t = \frac{n_t - r_t + \lambda}{N - R + 2\lambda}$$
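The unsmoothed, Laplace, and Lidstone estimators differ only in the value of $\lambda$, so one small function can cover all three. The sketch below uses exact fractions; the function name and the counts are hypothetical and serve only to illustrate the formulas.

```python
from fractions import Fraction

def estimate_p_q(r_t, n_t, R, N, lam=Fraction(1, 2)):
    """Lidstone-smoothed estimates of p_t and q_t.

    lam = 1 gives Laplace smoothing, lam = 0 the unsmoothed MLE
    (which requires R > 0 and N > R).
    """
    p_t = (Fraction(r_t) + lam) / (R + 2 * lam)
    q_t = (Fraction(n_t - r_t) + lam) / (N - R + 2 * lam)
    return p_t, q_t

# Hypothetical counts: 100 sampled documents, 5 relevant,
# term t occurs in 10 documents, 3 of them relevant.
mle = estimate_p_q(3, 10, 5, 100, lam=0)
laplace = estimate_p_q(3, 10, 5, 100, lam=1)
lidstone = estimate_p_q(3, 10, 5, 100)  # default lam = 1/2
```

Smoothing keeps both estimates strictly between 0 and 1, so the log-odds weights of the binary independence model stay finite even when $r_t = 0$ or $r_t = R$.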

Binary Independence Model (Example)
• Consider query $q = \{t_1, \dots, t_6\}$ and a sample of four documents ($R = 2$, $N = 4$)

        t1    t2    t3    t4    t5    t6    R
  d1    1     0     1     1     0     0     1
  d2    1     1     0     1     1     0     1
  d3    0     0     0     1     1     0     0
  d4    0     0     1     0     0     0     0

  n_t   2     1     2     3     2     0
  r_t   2     1     1     2     1     0
  p_t   5/6   1/2   1/2   5/6   1/2   1/6
  q_t   1/6   1/6   1/2   1/2   1/2   1/6

• For document $d_6 = \{t_1, t_2, t_6\}$ we obtain

$$P[R \mid d_6, q] \propto \log 5 + \log 1 + \log \tfrac{1}{5} + \log 5 + \log 5 + \log 5$$
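The numbers in this example can be reproduced with a short script. It is a sketch that assumes the $p_t$ and $q_t$ values in the table were obtained with Lidstone smoothing ($\lambda = 0.5$), which matches the listed fractions; the variable names are made up for the illustration.

```python
import math
from fractions import Fraction

# Sample from the example: rows d1..d4, columns t1..t6, plus relevance labels.
docs = [
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 1, 0, 0, 0],
]
relevant = [1, 1, 0, 0]

N, R = len(docs), sum(relevant)
# n_t: documents containing t; r_t: relevant documents containing t.
n = [sum(row[t] for row in docs) for t in range(6)]
r = [sum(row[t] for row, rel in zip(docs, relevant) if rel) for t in range(6)]

# Lidstone smoothing with lambda = 1/2.
lam = Fraction(1, 2)
p = [(r_t + lam) / (R + 2 * lam) for r_t in r]
q = [(n_t - r_t + lam) / (N - R + 2 * lam) for n_t, r_t in zip(n, r)]

# Document d6 = {t1, t2, t6}, given as zero-based term indices.
d6 = [0, 1, 5]
odds_factors = [(p[t] / (1 - p[t])) * ((1 - q[t]) / q[t]) for t in d6]
score = sum(math.log(float(f)) for f in odds_factors)
```

The six logarithms in the example sum to $3 \log 5 = \log 125$, which is the value of `score`.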

Estimating $p_t$ and $q_t$ without a Training Sample
• When no training sample is available, we estimate $p_t$ and $q_t$ as

$$p_t = (1 - p_t) = \frac{1}{2} \qquad q_t = \frac{\mathit{df}_t}{|D|}$$

  • $p_t$ reflects that we have no information about relevant documents
  • $q_t$ follows from the assumption that the number of relevant documents is much smaller than the number of documents
• When we plug in these estimates of $p_t$ and $q_t$, we obtain

$$P[R \mid d, q] \propto \sum_{t \in q} d_t \log 1 + \sum_{t \in q} d_t \log \frac{|D| - \mathit{df}_t}{\mathit{df}_t} \approx \sum_{t \in q} d_t \log \frac{|D|}{\mathit{df}_t}$$

  which can be seen as TF*IDF with binary term frequencies and logarithmically dampened inverse document frequencies
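A small numerical check illustrates how close the IDF-style approximation is when document frequencies are small relative to the collection size. The function names, document frequencies, and collection size below are hypothetical.

```python
import math

def bim_score_no_feedback(doc_terms, query_terms, df, num_docs):
    """BIM score with p_t = 1/2 and q_t = df_t / |D|.

    The p_t part contributes log 1 = 0, leaving log((|D| - df_t) / df_t).
    Assumes 0 < df_t < |D| for every query term.
    """
    return sum(
        math.log((num_docs - df[t]) / df[t])
        for t in query_terms
        if t in doc_terms
    )

def idf_score(doc_terms, query_terms, df, num_docs):
    """Approximation: sum of log(|D| / df_t) over query terms present in d."""
    return sum(
        math.log(num_docs / df[t])
        for t in query_terms
        if t in doc_terms
    )

# Hypothetical document frequencies in a collection of one million documents.
df = {"okapi": 10, "model": 1000}
num_docs = 1_000_000
query = ["okapi", "model"]
doc = {"okapi", "model"}

exact = bim_score_no_feedback(doc, query, df, num_docs)
approx = idf_score(doc, query, df, num_docs)
```

The approximation degrades for very frequent terms: as $\mathit{df}_t$ approaches $|D|/2$ the exact weight goes to zero and then turns negative, while $\log(|D| / \mathit{df}_t)$ stays positive.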

Poisson Model
• For generating document $d$ from joint (multivariate) term distribution $\Phi$
  • consider counting random variables: $d_t = \mathit{tf}_{t,d}$
  • postulate independence among these random variables
• Poisson model with term-specific parameters $\mu_t$:

$$P[d \mid \mu] = \prod_{t \in V} \frac{e^{-\mu_t} \, \mu_t^{d_t}}{d_t!} = e^{-\sum_{t \in V} \mu_t} \prod_{t \in d} \frac{\mu_t^{d_t}}{d_t!}$$

• MLE for $\mu_t$ from $n$ sample documents $\{d_1, \dots, d_n\}$:

$$\hat{\mu}_t = \frac{1}{n} \sum_{i=1}^{n} \mathit{tf}_{t,d_i}$$

• no penalty for absent words
• no control of document length
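The Poisson likelihood and its maximum-likelihood estimate translate directly into code. The following is a minimal sketch in log space; the function names and the four sample documents are hypothetical, and it assumes $\mu_t > 0$ for every term that occurs in the scored document.

```python
import math

def poisson_log_likelihood(tf, mu):
    """log P[d | mu] = -sum_t mu_t + sum_{t in d} (d_t * log mu_t - log d_t!).

    tf: dict term -> count for the terms that occur in d.
    mu: dict term -> mu_t over the whole vocabulary.
    """
    log_p = -sum(mu.values())  # the factor exp(-sum of mu_t over the vocabulary)
    for term, count in tf.items():
        # lgamma(count + 1) equals log(count!)
        log_p += count * math.log(mu[term]) - math.lgamma(count + 1)
    return log_p

def mle_mu(sample, vocabulary):
    """MLE: average term frequency of each term over the sample documents."""
    n = len(sample)
    return {t: sum(doc.get(t, 0) for doc in sample) / n for t in vocabulary}

# Hypothetical sample of four documents given as term-frequency dicts.
sample = [{"a": 2, "b": 1}, {"a": 1}, {"b": 3, "c": 1}, {"a": 1, "c": 1}]
mu_hat = mle_mu(sample, ["a", "b", "c"])
log_p = poisson_log_likelihood({"a": 2, "b": 1}, mu_hat)
```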

3. Okapi BM25
• Generalizes term weight

$$w = \log \frac{p \, (1 - q)}{q \, (1 - p)}$$

  into

$$w = \log \frac{p_{\mathit{tf}} \; q_0}{q_{\mathit{tf}} \; p_0}$$

  where $p_i$ and $q_i$ denote the probability that the term occurs $i$ times in a relevant or irrelevant document, respectively
• Postulates Poisson (or 2-Poisson-mixture) distributions for terms

$$p_{\mathit{tf}} = e^{-\lambda} \, \frac{\lambda^{\mathit{tf}}}{\mathit{tf}!} \qquad q_{\mathit{tf}} = e^{-\mu} \, \frac{\mu^{\mathit{tf}}}{\mathit{tf}!}$$
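Under the single-Poisson assumption, the exponential and factorial factors cancel in the generalized weight, leaving $w = \mathit{tf} \cdot \log(\lambda / \mu)$. The sketch below checks this numerically. The rates $\lambda$ and $\mu$ are hypothetical, and the simplification does not carry over to the 2-Poisson mixture.

```python
import math

def poisson_pmf(k, rate):
    """Probability that a Poisson variable with the given rate equals k."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

def tf_weight(tf, lam, mu):
    """Generalized term weight w = log(p_tf * q_0 / (q_tf * p_0))."""
    p_tf, p_0 = poisson_pmf(tf, lam), poisson_pmf(0, lam)
    q_tf, q_0 = poisson_pmf(tf, mu), poisson_pmf(0, mu)
    return math.log((p_tf * q_0) / (q_tf * p_0))

# Hypothetical rates: the term is more frequent in relevant documents.
lam, mu = 3.0, 0.5
weights = [tf_weight(tf, lam, mu) for tf in range(4)]
closed_form = [tf * math.log(lam / mu) for tf in range(4)]
```

With a single Poisson per class the weight grows linearly in the term frequency, without any saturation.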
