Information Retrieval Modeling
Russian Summer School in Information Retrieval
Djoerd Hiemstra
http://www.cs.utwente.nl/~hiemstra
SLIDE 1
SLIDE 2
Overview
- 1. Smoothing methods
- 2. Translation models
- 3. Document priors
- 4. ...
SLIDE 3
Course Material
- Djoerd Hiemstra, “Language Models, Smoothing, and N-grams”. In M. Tamer Özsu and Ling Liu (eds.), Encyclopedia of Database Systems, Springer, 2009
SLIDE 4
Noisy channel paradigm (Shannon 1948)
I (input) → noisy channel → O (output)

- hypothesise all possible input texts I and take the one with the highest probability, symbolically:

Î = argmax_I P(I | O) = argmax_I P(I) · P(O | I)
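As a minimal sketch, the decision rule can be written directly in Python. The candidate inputs and their probabilities below are invented for illustration (a spelling-correction-style channel); they are not taken from the slides.

```python
# Toy noisy-channel decoder: pick the input I that maximizes P(I) * P(O|I).
# Candidates map each hypothesised input I to (P(I), P(O|I)); the numbers
# are made up for this example.
candidates = {
    "their": (0.30, 0.50),
    "there": (0.45, 0.40),
    "they're": (0.25, 0.10),
}

def decode(candidates):
    """Return argmax_I P(I) * P(O|I) over the hypothesised inputs."""
    return max(candidates, key=lambda i: candidates[i][0] * candidates[i][1])

print(decode(candidates))  # "there": 0.45 * 0.40 = 0.18 beats 0.30 * 0.50 = 0.15
```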
SLIDE 5
Noisy channel paradigm (Shannon 1948)
D (document) → noisy channel → T1, T2, … (query)

- hypothesise all possible documents D and take the one with the highest probability, symbolically:

D̂ = argmax_D P(D | T1, T2, …) = argmax_D P(D) · P(T1, T2, … | D)
SLIDE 6
Noisy channel paradigm
- Did you get the picture? Formulate the following systems as a noisy channel:
– Automatic Speech Recognition
– Optical Character Recognition
– Parsing of Natural Language
– Machine Translation
– Part-of-speech tagging
SLIDE 7
Statistical language models
- Given a query T1,T2,…,Tn , rank the documents
according to the following probability measure:
P(T1, T2, …, Tn | D) = ∏_{i=1..n} ((1 − λi) P(Ti) + λi P(Ti | D))

λi : probability that the term on position i is important
1 − λi : probability that the term is unimportant
P(Ti | D) : probability of an important term
P(Ti) : probability of an unimportant term
SLIDE 8
Statistical language models
- Definition of probability measures:

important term: P(Ti = ti | D = d) = tf(ti, d) / Σ_t tf(t, d)

unimportant term: P(Ti = ti) = df(ti) / Σ_t df(t)

λi = 0.5
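The ranking formula with these two estimators can be sketched in a few lines of Python. The toy documents and document frequencies below are invented for illustration.

```python
from collections import Counter

def p_term_doc(term, doc_tf):
    """P(Ti | D): relative frequency of the term in the document."""
    total = sum(doc_tf.values())
    return doc_tf[term] / total if total else 0.0

def p_term(term, df, total_df):
    """P(Ti): document frequency of the term over summed document frequencies."""
    return df.get(term, 0) / total_df

def query_likelihood(query, doc_tf, df, total_df, lam=0.5):
    """P(T1, ..., Tn | D) = prod_i ((1 - lam) * P(Ti) + lam * P(Ti | D))."""
    score = 1.0
    for t in query:
        score *= (1 - lam) * p_term(t, df, total_df) + lam * p_term_doc(t, doc_tf)
    return score

# Toy collection: two documents (as term-frequency Counters, so missing
# terms count as zero) and the collection's document frequencies.
d1 = Counter({"information": 1, "retrieval": 2})
d2 = Counter({"information": 1, "theory": 1})
df = {"information": 2, "retrieval": 1, "theory": 1}
total_df = sum(df.values())

query = ["information", "retrieval"]
```

With these numbers d1, which contains both query terms, outranks d2; a term missing from a document still contributes its background probability P(Ti) rather than zeroing the score.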
SLIDE 9
Statistical language models
- How to estimate value of λi ?
– For ad-hoc retrieval (i.e. no previously retrieved documents to guide the search)
λi = constant (i.e. each term equally important)
– Note that for extreme values:
λi = 0 : term does not influence ranking
λi = 1 : term is mandatory in retrieved docs.
lim λi → 1 : docs containing n query terms are ranked above docs containing n − 1 terms
(Hiemstra 2002)
SLIDE 10
Statistical language models
- Presentation as hidden Markov model
– finite state machine: probabilities governing transitions
– sequence of state transitions cannot be determined from sequence of output symbols (i.e. are hidden)
SLIDE 11
Statistical language models
- Implementation
P(T1, T2, …, Tn | D) = ∏_{i=1..n} ((1 − λi) P(Ti) + λi P(Ti | D))

⋮

P(T1, T2, …, Tn | D) ∝ Σ_{i=1..n} log(1 + (λi P(Ti | D)) / ((1 − λi) P(Ti)))
SLIDE 12
Statistical language models
- Implementation as vector product:
score(q, d) = Σ_{k ∈ matching terms} qk · dk

qk = tf(k, q)

dk = log(1 + (tf(k, d) · Σ_t df(t)) / (df(k) · Σ_t tf(t, d)) · λk / (1 − λk))
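The rank-equivalent vector-product form sums only over terms that occur in both query and document. A sketch, with the same kind of invented toy data as before:

```python
import math
from collections import Counter

def score(query, doc_tf, df, total_df, lam=0.5):
    """score(q, d) = sum over matching terms k of
    tf(k, q) * log(1 + tf(k, d) * sum_t df(t) / (df(k) * doclen) * lam / (1 - lam))."""
    query_tf = Counter(query)
    doc_len = sum(doc_tf.values())
    s = 0.0
    for k, qk in query_tf.items():
        if doc_tf.get(k, 0) == 0 or df.get(k, 0) == 0:
            continue  # non-matching terms contribute nothing to the sum
        dk = math.log(1 + (doc_tf[k] * total_df) / (df[k] * doc_len) * lam / (1 - lam))
        s += qk * dk
    return s

# Invented toy data for illustration.
d1 = {"information": 1, "retrieval": 2}
d2 = {"information": 1, "theory": 1}
df = {"information": 2, "retrieval": 1, "theory": 1}
```

Because the document-independent factor Σ_i log((1 − λi) P(Ti)) has been dropped, the scores differ from the probabilities on the previous slides, but the document ordering is the same.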
SLIDE 13
Smoothing
- Sparse data problem:
– many events that are plausible in reality are not found in the data used to estimate probabilities.
– i.e., documents are short, and do not contain all words that would be good index terms
SLIDE 14
No smoothing
- Maximum likelihood estimate
– Documents that do not contain all terms get zero probability (are not retrieved)
P(Ti = ti | D = d) = tf(ti, d) / Σ_t tf(t, d)
SLIDE 15
Laplace smoothing
- Simply add 1 to every possible event
– over-estimates probabilities of unseen events
P(Ti = ti | D = d) = (tf(ti, d) + 1) / Σ_t (tf(t, d) + 1)
SLIDE 16
Linear interpolation smoothing
- Linear combination of maximum
likelihood and model that is less sparse
– also called “Jelinek-Mercer smoothing”
P(Ti | D) = (1 − λ) P(Ti) + λ P(Ti | D), where 0 ≤ λ ≤ 1
SLIDE 17
Dirichlet smoothing
- Has a relatively big effect on small documents,
but a relatively small effect on big documents.
(Zhai & Lafferty 2004)
P(Ti = ti | D = d) = (tf(ti, d) + μ · P(Ti | C)) / (Σ_t tf(t, d) + μ)
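The four estimators can be compared side by side. A minimal sketch; the toy document, the collection model `coll`, and the λ and μ values are invented for illustration.

```python
def p_mle(t, d_tf):
    """Maximum likelihood: zero for terms not in the document."""
    return d_tf.get(t, 0) / sum(d_tf.values())

def p_laplace(t, d_tf, vocab_size):
    """Laplace: add one to every count; vocab_size = number of possible terms."""
    return (d_tf.get(t, 0) + 1) / (sum(d_tf.values()) + vocab_size)

def p_jm(t, d_tf, p_coll, lam=0.5):
    """Jelinek-Mercer: (1 - lam) * P(t) + lam * P(t | D)."""
    return (1 - lam) * p_coll[t] + lam * p_mle(t, d_tf)

def p_dirichlet(t, d_tf, p_coll, mu=2000):
    """Dirichlet: (tf(t, d) + mu * P(t | C)) / (doclen + mu)."""
    return (d_tf.get(t, 0) + mu * p_coll[t]) / (sum(d_tf.values()) + mu)

doc = {"retrieval": 2, "model": 1}   # a 3-term toy document
coll = {"retrieval": 0.01, "model": 0.01, "smoothing": 0.005}
```

For this very short document, the Dirichlet estimate for "retrieval" sits close to the collection probability 0.01 rather than the maximum-likelihood 2/3, illustrating the slide's point: the shorter the document, the stronger the pull toward the collection model.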
SLIDE 18
Cross-language IR
cross-language information retrieval
zoeken in anderstalige informatie (Dutch: searching foreign-language information)
recherche d'informations multilingues (French: multilingual information retrieval)
SLIDE 19
Language models & translation
- Cross-language information retrieval
(CLIR):
– Enter query in one language (language of choice) and retrieve documents in one or more other languages.
– The system takes care of automatic translation
SLIDE 20
SLIDE 21
Language models & translation
- Noisy channel paradigm
D (doc.) → noisy channel → T1, T2, … (query) → noisy channel → S1, S2, … (request)

- hypothesise all possible documents D and take the one with the highest probability:

D̂ = argmax_D P(D | S1, S2, …) = argmax_D P(D) · Σ_{T1, T2, …} P(T1, T2, …, S1, S2, … | D)
SLIDE 22
Language models & translation
- Cross-language information retrieval:
– Assume that the translation of a word/term does not depend on the document in which it occurs.
– if: S1, S2, …, Sn is a Dutch query of length n
– and ti1, ti2, …, timi are the mi English translations of the Dutch query term Si

P(S1, S2, …, Sn | D) = ∏_{i=1..n} Σ_{j=1..mi} P(Si | Ti = tij) · ((1 − λi) P(Ti = tij) + λi P(Ti = tij | D))
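The translation model above multiplies, per source-language term, a weighted sum over its translations. A sketch; the background probabilities `p_bg` and the toy documents are invented, and the translation weights follow the example on the later slides.

```python
def clir_score(query_translations, doc_tf, p_bg, lam=0.5):
    """P(S1, ..., Sn | D): for each source term Si, sum over its translations
    tij of P(Si | Ti = tij) * ((1 - lam) * P(Ti = tij) + lam * P(Ti = tij | D))."""
    doc_len = sum(doc_tf.values())
    score = 1.0
    for translations in query_translations:
        total = 0.0
        for t, p_translation in translations.items():
            p_doc = doc_tf.get(t, 0) / doc_len if doc_len else 0.0
            total += p_translation * ((1 - lam) * p_bg.get(t, 0.0) + lam * p_doc)
        score *= total
    return score

# One dict of weighted translations per source-language query term;
# as on the slides, the weights need not sum to one.
query = [
    {"dangerous": 0.8, "hazardous": 1.0},
    {"radioactivity": 0.3, "chemicals": 0.3, "waste": 0.1},
]
# Invented uniform background model for the illustration.
p_bg = {t: 0.01 for t in
        ("dangerous", "hazardous", "radioactivity", "chemicals", "waste")}
```

A document containing any of the translations outranks one containing none, while the background probabilities keep every document's score non-zero.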
SLIDE 23
Language models & translation
- Presentation as hidden Markov model
SLIDE 24
Language models & translation
- How does it work in practice?
– Find for each Russian query term Si the possible translations ti1, ti2, …, tim and the translation probabilities
– Combine them in a structured query
– Process the structured query
SLIDE 25
Language models & translation
- Example:
– Russian query: ОСТОРОЖНО РАДИОАКТИВНЫЕ ОТХОДЫ (“caution: radioactive waste”)
– Translations of ОСТОРОЖНО: dangerous (0.8) or hazardous (1.0)
– Translations of РАДИОАКТИВНЫЕ ОТХОДЫ: radioactivity (0.3) or radioactive chemicals (0.3) or radioactive waste (0.1)
– Structured query: ((0.8 dangerous ∪ 1.0 hazardous), (0.3 radioactivity ∪ 0.3 chemicals ∪ 0.1 waste))
SLIDE 26
Structured query
– Structured query:
((0.8 dangerous ∪ 1.0 hazardous), (0.3 radioactivity ∪ 0.3 chemicals ∪ 0.1 waste))
SLIDE 27
Language models & translation
- Other applications using the translation model:
– On-line stemming
– Synonym expansion
– Spelling correction
– ‘fuzzy’ matching
– Extended (ranked) Boolean retrieval
SLIDE 28
Language models & translation
- Note that:
– λi = 1, for all 1 ≤ i ≤ n : Boolean retrieval
– Stemming and on-line morphological generation give exactly the same results:
P(funny ∪ funnies, table ∪ tables ∪ tabled) = P(funni, tabl)
SLIDE 29
Experimental Results
- translation language model
– (source: parallel corpora)
– average precision: 0.335 (83% of baseline)
- no translation model, using all translations:
– average precision: 0.308 (76% of baseline)
- manually disambiguated run (take best translation):
– average precision: 0.315 (78% of baseline)

(Hiemstra and De Jong 1999)
SLIDE 30
Prior probabilities
SLIDE 31
Prior probabilities and static ranking
- Noisy channel paradigm (Shannon 1948)
D (document) → noisy channel → T1, T2, … (query)

- hypothesise all possible documents D and take the one with the highest probability, symbolically:

D̂ = argmax_D P(D | T1, T2, …) = argmax_D P(D) · P(T1, T2, … | D)
SLIDE 32
Prior probability of relevance on informational queries
[figure: probability of relevance vs. document length]

P_doclen(D) = C · doclen(D)
SLIDE 33
Priors in Entry Page Search
- Sources of Information
– Document length
– Number of links pointing to a document
– The depth of the URL
– Occurrence of cue words (‘welcome’, ‘home’)
– Number of links in a document
– Page traffic
SLIDE 34
Prior probability of relevance on navigational queries

[figure: probability of relevance vs. document length]
SLIDE 35
Priors in Entry Page Search
- Assumption
– Entry pages are referenced more often
- Different types of inlinks
– From other hosts (recommendation)
– From the same host (navigational)
- Both types often point to entry pages
SLIDE 36
Priors in Entry Page Search
[figure: probability of relevance vs. number of inlinks]

P_inlinks(D) = C · inlinkCount(D)
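A sketch of how such a prior combines with a content score: rank by log P(D) + log P(Q|D), with P(D) proportional to the inlink count. The add-one on the count (to keep the logarithm defined for documents without inlinks) and all toy numbers are my own assumptions, not from the slides.

```python
import math

def rank_with_inlink_prior(content_scores, inlink_counts):
    """Rank documents by log P(D) + log P(Q|D), with P(D) proportional to
    inlinkCount(D). The normalizing constant C drops out of the ranking;
    add-one keeps the log defined for documents without inlinks."""
    return sorted(
        content_scores,
        key=lambda d: math.log(inlink_counts.get(d, 0) + 1)
                      + math.log(content_scores[d]),
        reverse=True,
    )

# Two documents with equal content scores: the heavily linked entry page wins.
content = {"entry_page": 0.2, "deep_page": 0.2}
inlinks = {"entry_page": 120, "deep_page": 3}
```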
SLIDE 37
Priors in Entry Page Search: URL depth
- Top level documents are often entry pages
- Four types of URLs
– root: www.romip.ru/
– subroot: www.romip.ru/russir2009/
– path: www.romip.ru/russir2009/en/
– file: www.romip.ru/russir2009/en/venue.html
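The four URL types can be detected with a small helper. A sketch that looks only at the path component; the handling of scheme-less URLs is a simplifying assumption.

```python
from urllib.parse import urlparse

def url_type(url):
    """Classify a URL as 'root', 'subroot', 'path', or 'file' by the depth
    and shape of its path component."""
    if "://" not in url:
        url = "http://" + url  # assume scheme-less URLs, as on the slide
    path = urlparse(url).path
    if path in ("", "/"):
        return "root"
    if path.endswith("/"):
        depth = len([seg for seg in path.split("/") if seg])
        return "subroot" if depth == 1 else "path"
    return "file"
```

Applied to the slide's examples: `www.romip.ru/` is root, `www.romip.ru/russir2009/` is subroot, `www.romip.ru/russir2009/en/` is path, and `www.romip.ru/russir2009/en/venue.html` is file.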
SLIDE 38
Priors in Entry Page Search: results
method               Content   Anchors
P(Q|D)               0.3375    0.4188
P(Q|D)·Pdoclen(D)    0.2634    0.5600
P(Q|D)·Pinlink(D)    0.4974    0.5365
P(Q|D)·PURL(D)       0.7705    0.6301
(Kraaij, Westerveld and Hiemstra 2002)
SLIDE 39
Language Models conclusion
- Smoothing: accounts for sparse documents and bad queries
- Translation model: accounts for multiple query representations (e.g. CLIR or stemming)
- Document priors: account for "non-content" information
SLIDE 40
References
- Djoerd Hiemstra and Franciska de Jong. Disambiguation strategies for cross-language information retrieval. Lecture Notes in Computer Science 1696: Research and Advanced Technology for Digital Libraries, Springer-Verlag, 1999
- Wessel Kraaij, Thijs Westerveld and Djoerd Hiemstra. The importance of prior probabilities for entry page search. In Proceedings of SIGIR 2002
- Claude Shannon. A mathematical theory of communication. Bell System Technical Journal 27, 1948
- ChengXiang Zhai and John Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems 22(2), 2004