Query Likelihood Retrieval LM, session 6 CS6200: Information - PowerPoint PPT Presentation

Query Likelihood Retrieval LM, session 6 CS6200: Information Retrieval Slides by: Jesse Anderton

Retrieval With Language Models

So far, we’ve focused on language models like P(D = w_1, w_2, …, w_n). Where’s the query?

Remember the key insight from vector space models: we want to represent queries and documents in the same way. The query is just a “short document”: a sequence of words.

There are three obvious approaches we can use for ranking:

1. Query likelihood: Train a language model on a document, and estimate the query’s probability.
2. Document likelihood: Train a language model on the query, and estimate the document’s probability.
3. Model divergence: Train language models on the document and the query, and compare them.
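As a minimal sketch of what "training a language model on a document" can mean in the simplest case, the snippet below estimates unigram probabilities by relative frequency (maximum likelihood). The function name `unigram_mle` and the lowercase, whitespace-based tokenization are assumptions made for illustration; the slides do not prescribe either.

```python
from collections import Counter


def unigram_mle(text):
    """Estimate P(w | D) as count(w in D) / |D| (maximum likelihood).

    Lowercasing and splitting on whitespace is an illustrative
    assumption, not a tokenizer taken from the slides.
    """
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


# Toy document: 2 of its 6 tokens are "war".
model = unigram_mle("the war was a global war")
print(model["war"])
```

The same function can be applied to a query, which is what treating the query as a "short document" amounts to in this setting.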

Query Likelihood Retrieval

Suppose that the query specifies a topic. We want to know the probability of a document being generated from that topic, or P(D | Q).

However, the query is very small, and documents are long: document language models have less variance.

In the Query Likelihood Model, we use Bayes' Rule to rank documents based on the probability of generating the query from the documents’ language models.

\begin{align*}
P(D \mid Q) &\stackrel{rank}{=} P(Q \mid D)\, P(D) \\
            &= P(Q \mid D) && \text{Assuming uniform prior} \\
            &= \prod_{w \in Q} P(w \mid D) && \text{Naive Bayes unigram model} \\
            &\stackrel{rank}{=} \sum_{w \in Q} \log P(w \mid D) && \text{Numerically stable version}
\end{align*}
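The last two lines of the derivation can be sketched in code as follows. This is an illustration under stated assumptions, not the course's reference implementation: the name `query_log_likelihood` is hypothetical, probabilities are unsmoothed relative frequencies, and base-10 logarithms are chosen only to match the figures in the worked examples (any base gives the same ranking).

```python
import math
from collections import Counter


def query_log_likelihood(query_tokens, doc_tokens):
    """Return the sum over query terms of log10 P(w | D).

    P(w | D) is the unsmoothed relative frequency of w in the document.
    Returns -inf when a query term is absent from the document, because
    its estimated probability is then zero.
    """
    counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for w in query_tokens:
        if counts[w] == 0:
            return float("-inf")
        score += math.log10(counts[w] / doc_len)
    return score


doc = "a b a c".split()
query = "a c".split()
log_score = query_log_likelihood(query, doc)
product = (2 / 4) * (1 / 4)  # P(a | D) * P(c | D) computed directly

# The product form and the log-sum form agree up to floating-point rounding.
print(math.isclose(10 ** log_score, product))
```

Summing logs avoids the underflow that a long product of small probabilities would eventually cause, which is why the slide calls it the numerically stable version.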

Example: Query Likelihood

Query: “deadliest war in history”

Document (Wikipedia: WWI):

World War I (WWI or WW1 or World War One), also known as the First World War or the Great War, was a global war centred in Europe that began on 28 July 1914 and lasted until 11 November 1918. More than 9 million combatants and 7 million civilians died as a result of the war, a casualty rate exacerbated by the belligerents' technological and industrial sophistication, and tactical stalemate. It was one of the deadliest conflicts in history, paving the way for major political changes, including revolutions in many of the nations involved.

Term        P(w|D)          log10 P(w|D)
deadliest   1/94 ≈ 0.011    -1.973
war         6/94 ≈ 0.064    -1.195
in          3/94 ≈ 0.032    -1.496
history     1/94 ≈ 0.011    -1.973

Π ≈ 2.31e-7    Σ = -6.637

Example: Query Likelihood

Query: “deadliest war in history”

Document (Wikipedia: Taiping Rebellion):

The Taiping Rebellion was a massive civil war in southern China from 1850 to 1864, against the ruling Manchu Qing dynasty. It was a millenarian movement led by Hong Xiuquan, who announced that he had received visions, in which he learned that he was the younger brother of Jesus. At least 20 million people died, mainly civilians, in one of the deadliest military conflicts in history.

Term        P(w|D)          log10 P(w|D)
deadliest   1/56 ≈ 0.018    -1.748
war         1/56 ≈ 0.018    -1.748
in          2/56 ≈ 0.036    -1.447
history     1/56 ≈ 0.018    -1.748

Π ≈ 2.03e-7    Σ = -6.691
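The two worked examples can be compared with a short script. The sketch below takes the term counts and document lengths exactly as given in the tables (it does not re-tokenize the Wikipedia text), and the dictionary layout and the name `log_score` are illustrative assumptions.

```python
import math

# Term counts and document lengths copied from the two example tables.
docs = {
    "WWI": {
        "length": 94,
        "counts": {"deadliest": 1, "war": 6, "in": 3, "history": 1},
    },
    "Taiping Rebellion": {
        "length": 56,
        "counts": {"deadliest": 1, "war": 1, "in": 2, "history": 1},
    },
}
query = ["deadliest", "war", "in", "history"]


def log_score(doc):
    """Sum of log10 P(w | D) over the query terms (unsmoothed estimates)."""
    return sum(math.log10(doc["counts"][w] / doc["length"]) for w in query)


scores = {name: log_score(doc) for name, doc in docs.items()}

# Higher (less negative) log-likelihood ranks first.
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(f"{name}: sum of logs = {scores[name]:.2f}, product = {10 ** scores[name]:.1e}")
```

Under these counts the WWI article scores slightly higher than the Taiping Rebellion article, so it is ranked first for this query. The last digit of a computed sum can differ from a table entry because the tables add per-term logs that were already rounded.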

Wrapping Up

There are many ways to move beyond this basic model.

• Use n-gram or skip-gram probabilities, instead of unigrams.
• Model document probabilities P(D) based on length, authority, genre, etc. instead of assuming a uniform probability.
• Use the tools from the VSM slides: stemming, stopping, etc.

Next, we’ll see how to fix a major issue with our probability estimates: what happens if a query term doesn’t appear in the document?
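To make the closing question concrete, the sketch below shows what the unsmoothed estimate does when a query term is missing from the document. The toy document, the query, and the whitespace tokenization are illustrative assumptions; the snippet only demonstrates the problem and does not attempt the fix.

```python
from collections import Counter

# Toy document that does not contain the query term "deadliest".
doc_tokens = "the taiping rebellion was a massive civil war".split()
query_tokens = ["deadliest", "war"]

counts = Counter(doc_tokens)
doc_len = len(doc_tokens)

# Unsmoothed maximum-likelihood estimates of P(w | D) for each query term.
probs = {w: counts[w] / doc_len for w in query_tokens}

# A single unseen term zeroes the whole product, whatever the other terms contribute.
likelihood = 1.0
for w in query_tokens:
    likelihood *= probs[w]

print(probs)
print(likelihood)
```

In the log form the same situation is just as awkward: the logarithm of zero is undefined, and Python's `math.log10(0.0)` raises a `ValueError`. Either way, a document that matches every query term but one becomes indistinguishable from a document that matches none.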
