
Advanced Language Modeling Approaches (case study: expert search)
ECIR Tutorial, 30 March 2008. Djoerd Hiemstra, http://www.cs.utwente.nl/~hiemstra



  1. Overview
     1. Introduction to information retrieval and three basic probabilistic approaches
        – The probabilistic model / Naïve Bayes
        – Google PageRank
        – Language models
     2. Advanced language modeling approaches 1
        – Statistical translation
        – Prior probabilities
     3. Advanced language modeling approaches 2
        – Relevance models & expert search
        – EM-training & expert search
        – Probabilistic random walks & expert search

  2. Information Retrieval
     [Figure: the IR process. Off-line computation: documents are converted into representations and indexed. On-line computation: the information problem is converted into a query representation and compared against the indexed documents, yielding retrieved documents; user feedback closes the loop.]


  5. PART 1: Introduction to probabilistic information retrieval
     IR models: probabilistic models
     • Rank documents by the probability that, for instance:
       – a random document from the documents that contain the query is relevant (known as "the probabilistic model" or "naïve Bayes")
       – a random surfer visits the page (known as "Google PageRank")
       – random words from the document form the query (known as "language models")

  6. Probabilistic model (Robertson & Sparck-Jones 1976)
     • Probability of getting (retrieving) a relevant document from the set of documents indexed by "social":
       r = 1      (number of relevant docs containing "social")
       R = 11     (number of relevant docs)
       n = 1000   (number of docs containing "social")
       N = 10000  (total number of docs)
     Probabilistic retrieval
     • Bayes' rule:  P(L|D) = P(D|L) P(L) / P(D)
     • Conditional independence:  P(D|L) = \prod_k P(D_k|L)
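
The slide stops at the raw counts; a minimal sketch of how such counts are commonly combined into the Robertson/Sparck-Jones term weight follows (the +0.5 smoothing is the usual correction, an assumption here rather than something stated on the slide):

```python
import math

def rsj_weight(r, R, n, N):
    """Robertson/Sparck-Jones relevance weight for one term.

    r: relevant docs containing the term
    R: relevant docs in total
    n: docs containing the term
    N: docs in total
    The +0.5 terms are the customary smoothing correction.
    """
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

# Counts from the "social" example on the slide:
print(rsj_weight(r=1, R=11, n=1000, N=10000))
# ~0.25 with these counts: "social" occurs in relevant docs at about
# the collection-wide rate, so it carries little evidence of relevance.
```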

  7. Google PageRank (Brin & Page 1998)
     • Suppose a million monkeys browse the www by randomly following links
     • At any time, what percentage of the monkeys do we expect to look at page D?
     • Compute the probability, and use it to rank the documents that contain all query terms
     Google PageRank
     • Given a document D, the document's page rank at step n is:
       P_n(D) = (1 - \lambda) P_0(D) + \lambda \sum_{I linking to D} P_{n-1}(I) P(D|I)
       where
       P(D|I):       probability that the monkey reaches page D through page I (= 1 / #outlinks of I)
       \lambda:      probability that the monkey follows a link
       1 - \lambda:  probability that the monkey types a url
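
A minimal power-iteration sketch of the update above (the toy link graph, the λ value and the fixed iteration count are illustrative assumptions, not from the slides):

```python
def pagerank(links, lam=0.85, iters=50):
    """Iterate P_n(D) = (1-lam) * P_0(D) + lam * sum over in-links I
    of P_{n-1}(I) / #outlinks(I).  `links` maps each page to the pages
    it links to; P_0 is taken uniform."""
    pages = set(links) | {d for out in links.values() for d in out}
    p = {d: 1.0 / len(pages) for d in pages}    # P_0: uniform start
    for _ in range(iters):
        nxt = {d: (1 - lam) / len(pages) for d in pages}  # monkey types a url
        for i, out in links.items():
            for d in out:                       # monkey follows a link
                nxt[d] += lam * p[i] / len(out)
        p = nxt
    return p

# Tiny illustrative graph (assumed):
print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```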

  8. Language models?
     • A language model assigns a probability to a piece of language (i.e. a sequence of tokens):
       P(how are you today) > P(cow barks moo soufflé) > P(asj mokplah qnbgol yokii)
     Language models (Hiemstra 1998)
     • Let's assume we point blindly, one at a time, at 3 words in a document.
     • What is the probability that I, by accident, pointed at the words "ECIR", "models" and "tutorial"?
     • Compute the probability, and use it to rank the documents.
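
The "pointing blindly" story is just the unsmoothed unigram model; a minimal sketch (the document text is made up for illustration):

```python
def point_blindly(query, doc):
    """P(query | doc) under the 'pointing blindly' story: each query
    word is drawn independently with probability tf(word, doc)/|doc|."""
    words = doc.lower().split()
    p = 1.0
    for t in query.lower().split():
        p *= words.count(t) / len(words)
    return p

doc = "the ecir tutorial covers language models models and expert search"
print(point_blindly("ecir models tutorial", doc))
# 1/10 * 2/10 * 1/10 = 0.002
```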

  9. Language models
     • P(D | T_1, ..., T_n) = P(T_1, ..., T_n | D) P(D) / P(T_1, ..., T_n)
       (the denominator is the same for every document, so ranking by P(T_1, ..., T_n | D) P(D) is equivalent)
     • Probability theory / hidden Markov model theory
     • Successfully applied to speech recognition, and: optical character recognition, part-of-speech tagging, stochastic grammars, spelling correction, machine translation, etc.
     Half way conclusion
     • Email filtering? → Naive Bayes
     • Navigational web queries? → PageRank
     • Informational queries? → Language models
     • Expert search? → ...

  10. PART 2: Advanced statistical language models
      Noisy channel paradigm (Shannon 1948)
      I (input) → noisy channel → O (output)
      • Hypothesise all possible input texts I and take the one with the highest probability, symbolically:
        I* = argmax_I P(I | O) = argmax_I P(I) · P(O | I)

  11. Noisy channel paradigm (Shannon 1948)
      D (document) → noisy channel → T_1, T_2, ... (query)
      • Hypothesise all possible documents D and take the one with the highest probability, symbolically:
        D* = argmax_D P(D | T_1, T_2, ...) = argmax_D P(D) · P(T_1, T_2, ... | D)
      Noisy channel paradigm
      • Did you get the picture? Formulate the following systems as a noisy channel:
        – Automatic Speech Recognition
        – Optical Character Recognition
        – Parsing of Natural Language
        – Machine Translation
        – Part-of-speech tagging

  12. Statistical language models
      • Given a query T_1, T_2, ..., T_n, rank the documents according to the following probability measure:
        P(T_1, T_2, ..., T_n | D) = \prod_{i=1}^{n} ((1 - \lambda_i) P(T_i) + \lambda_i P(T_i | D))
        \lambda_i:      probability that the term on position i is important
        1 - \lambda_i:  probability that the term is unimportant
        P(T_i | D):     probability of an important term
        P(T_i):         probability of an unimportant term
      Statistical language models
      • Definition of probability measures:
        P(T_i = t_i | D = d) = tf(t_i, d) / \sum_t tf(t, d)    (important term)
        P(T_i = t_i) = df(t_i) / \sum_t df(t)                  (unimportant term)
        \lambda_i = 0.5
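
A minimal sketch of this ranking formula over a toy collection (the documents and query are assumed for illustration; λ_i = 0.5 as on the slide):

```python
from collections import Counter

def lm_score(query, doc, docs, lam=0.5):
    """P(T_1..T_n | D) = product over query terms t of
    (1-lam) * df(t)/sum_t df(t)  +  lam * tf(t, d)/|d|."""
    tf = Counter(doc)
    df = Counter(t for d in docs for t in set(d))
    total_df = sum(df.values())
    p = 1.0
    for t in query:
        p *= (1 - lam) * df[t] / total_df + lam * tf[t] / len(doc)
    return p

docs = [d.split() for d in (
    "ecir tutorial on language models",
    "expert search with language models",
    "cooking tutorial souffle recipes")]
q = "language models tutorial".split()
for d in docs:
    print(lm_score(q, d, docs), " ".join(d))
```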

  13. Exercise: an expert search test collection
      1. Define your personal three-word language model: choose three words (and for each word a probability)
      2. Write two joint papers, each with two or more co-authors of your choice, for the Int. Conference on Short Papers (ICSP)
         – Papers must not exceed two words per author
         – Use only words from your personal language model
         – ICSP does not do blind reviewing, so clearly put the names of the authors on the paper
         – Deadline: after the coffee break
      3. Question: can the PC find out who are the experts on x?
      Exercise 2: simple LM scoring
      • Calculate the language modeling scores for the query y on your document(s)
        – What needs to be decided before we are able to do this?
        – 5 minutes!

  14. Statistical language models
      • How to estimate the value of \lambda_i?
        – For ad-hoc retrieval (i.e. no previously retrieved documents to guide the search): \lambda_i = constant (i.e. each term equally important)
        – Note that for extreme values:
          \lambda_i = 0: the term does not influence the ranking
          \lambda_i = 1: the term is mandatory in retrieved docs
          lim \lambda_i → 1: docs containing n query terms are ranked above docs containing n-1 terms (Hiemstra 2004)
      Statistical language models
      • Presentation as hidden Markov model
        – finite state machine: probabilities governing transitions
        – the sequence of state transitions cannot be determined from the sequence of output symbols (i.e. the states are hidden)
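
A small check of the λ extremes, continuing the lm_score sketch and toy docs from the slide-12 example above:

```python
# With lam -> 1 the smoothing term vanishes, so a document missing a
# query term scores (near) zero: docs matching all n terms outrank
# docs matching n-1.  With lam = 0 every document gets the same
# score, so the terms have no influence on the ranking.
q = "language models tutorial".split()
for lam in (0.0, 0.5, 0.999):
    print(lam, [round(lm_score(q, d, docs, lam), 8) for d in docs])
```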

  15. Statistical language models
      • Implementation
        P(T_1, T_2, ..., T_n | D) = \prod_{i=1}^{n} ((1 - \lambda_i) P(T_i) + \lambda_i P(T_i | D))
        P(T_1, T_2, ..., T_n | D) \propto \sum_{i=1}^{n} \log(1 + (\lambda_i P(T_i | D)) / ((1 - \lambda_i) P(T_i)))
      Statistical language models
      • Implementation as vector product:
        score(q, d) = \sum_{k \in matching terms} q_k · d_k
        q_k = tf(k, q)
        d_k = \log(1 + (tf(k, d) / \sum_t tf(t, d)) · (\sum_t df(t) / df(k)) · (\lambda_k / (1 - \lambda_k)))
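
A minimal sketch of the vector-product form, under the same toy-collection assumptions as the slide-12 example:

```python
import math
from collections import Counter

def vector_score(query, doc, docs, lam=0.5):
    """score(q,d) = sum over matching terms k of tf(k,q) * log(1 +
    tf(k,d)/|d| * total_df/df(k) * lam/(1-lam)).  Rank-equivalent to
    the product form, but a sum over matching terms only."""
    tf_d, tf_q = Counter(doc), Counter(query)
    df = Counter(t for d in docs for t in set(d))
    total_df = sum(df.values())
    return sum(tf_q[k] * math.log(1 + tf_d[k] / len(doc)
                                    * total_df / df[k]
                                    * lam / (1 - lam))
               for k in tf_q if tf_d[k] > 0)    # matching terms only
```

Because only matching terms contribute, this form can be evaluated term-at-a-time with a standard inverted index, which is the point of rewriting the product as a sum of logs.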

  16. Cross-language IR
      cross-language information retrieval (English)
      zoeken in anderstalige informatie (Dutch)
      recherche d'informations multilingues (French)
      Language models & translation
      • Cross-language information retrieval (CLIR):
        – Enter the query in one language (the language of choice) and retrieve documents in one or more other languages.
        – The system takes care of automatic translation.

  17. Language models & translation
      • Noisy channel paradigm:
        D (doc.) → noisy channel → T_1, T_2, ... (query) → noisy channel → S_1, S_2, ... (request)
      • Hypothesise all possible documents D and take the one with the highest probability:
        D* = argmax_D P(D | S_1, S_2, ...)
           = argmax_D P(D) · \sum_{T_1, T_2, ...} P(T_1, T_2, ...; S_1, S_2, ... | D)
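
The sum over translations can be folded into the per-term model; a minimal sketch, assuming a given translation table P(s|t) and folding it into the slide-12 mixture (the table and documents are made up, and this marginalisation is one common way to instantiate the formula, not necessarily the exact variant developed later in the tutorial):

```python
from collections import Counter

def clir_score(request, doc, docs, trans, lam=0.5):
    """Score a source-language request against a target-language doc:
    each request word s is matched through all its translations t,
    i.e. P(s | d) = sum_t P(s|t) * ((1-lam) P(t) + lam P(t|d))."""
    tf = Counter(doc)
    df = Counter(t for d in docs for t in set(d))
    total_df = sum(df.values())
    p = 1.0
    for s in request:
        p *= sum(pt * ((1 - lam) * df[t] / total_df + lam * tf[t] / len(doc))
                 for t, pt in trans.get(s, {}).items())
    return p

# Illustrative translation table (assumed): Dutch request -> English terms
trans = {"zoeken": {"search": 0.7, "retrieval": 0.3},
         "informatie": {"information": 1.0}}
docs = [d.split() for d in ("information retrieval and search",
                            "cooking recipes")]
for d in docs:
    print(clir_score(["zoeken", "informatie"], d, docs, trans), d)
```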
