

SLIDE 1

III.4 Statistical Language Models

November 10, 2011 III.1 IR&DM, WS'11/12

  • III.4 Statistical LM (MRS book, Chapter 12*)

– 4.1 What is a statistical language model?
– 4.2 Smoothing Methods
– 4.3 Extended LMs

*With extensions from: C. Zhai, J. Lafferty: A Study of Smoothing Methods for Language Models Applied to Information Retrieval, TOIS 22(2), 2004

SLIDE 2

III.4.1 What is a Statistical Language Model?

Generative model for word sequences (generates a probability distribution over word sequences, or bags-of-words, or sets-of-words, or structured docs, or ...)

Example:
P[“Today is Tuesday”] = 0.01
P[“The Eigenvalue is positive”] = 0.001
P[“Today Wednesday is”] = 0.000001

The LM itself is highly context- / application-dependent. Application examples:

  • speech recognition: given that we heard “Julia” and “feels”, how likely will we next hear “happy” or “habit”?
  • text classification: given that we saw “soccer” 3 times and “game” 2 times, how likely is the news about sports?
  • information retrieval: given that the user is interested in math, how likely would the user use “distribution” in a query?


SLIDE 3

Types of Language Models

A language model is well-formed over alphabet Σ if $\sum_{s \in \Sigma^*} P[s] = 1$.
Key idea: A document is a good match to a query if the document model is likely to generate the query, i.e., if P(q|d) “is high”.

Generic Language Model:
“Today is Tuesday” 0.01, “The Eigenvalue is positive” 0.001, “Today Wednesday is” 0.00001, …

Unigram Language Model:
“Today” 0.1, “is” 0.3, “Tuesday” 0.2, “Wednesday” 0.2, …

Bigram Language Model:
“Today” 0.1, “is” | “Today” 0.4, “Tuesday” | “is” 0.8, …

How to handle sequences?

  • Chain Rule (requires long chains of cond. prob.):
    $P(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2 \mid t_1)\,P(t_3 \mid t_1 t_2)\,P(t_4 \mid t_1 t_2 t_3)$
  • Bigram LM (pairwise cond. prob.):
    $P_{bi}(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2 \mid t_1)\,P(t_3 \mid t_2)\,P(t_4 \mid t_3)$
  • Unigram LM (no cond. prob.):
    $P_{uni}(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2)\,P(t_3)\,P(t_4)$
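As a small sketch (not from the slides), the three factorizations above can be computed directly; the unigram probabilities are the toy values shown, while the bigram conditionals are made-up assumptions:

```python
# Unigram probabilities from the toy example above.
p_uni = {"Today": 0.1, "is": 0.3, "Tuesday": 0.2, "Wednesday": 0.2}

# Bigram conditionals P[t2|t1]; only the pairs we need (assumed toy values).
p_bi = {("Today", "is"): 0.4, ("is", "Tuesday"): 0.8}

def unigram_prob(seq):
    # P_uni(t1..tn) = prod_i P(t_i)
    prob = 1.0
    for t in seq:
        prob *= p_uni.get(t, 0.0)
    return prob

def bigram_prob(seq):
    # P_bi(t1..tn) = P(t1) * prod_i P(t_i | t_{i-1})
    prob = p_uni.get(seq[0], 0.0)
    for prev, cur in zip(seq, seq[1:]):
        prob *= p_bi.get((prev, cur), 0.0)
    return prob

print(unigram_prob(["Today", "is", "Tuesday"]))  # 0.1 * 0.3 * 0.2
print(bigram_prob(["Today", "is", "Tuesday"]))   # 0.1 * 0.4 * 0.8
```

Note that the bigram model, unlike the unigram model, assigns probability 0 to the scrambled sequence “Today Wednesday is” (no such pairs in the table), mirroring the intuition from the examples above.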

SLIDE 4

Text Generation with (Unigram) LM

LM for topic 1 (IR&DM): text 0.2, mining 0.1, n-gram 0.01, cluster 0.02, ..., healthy 0.000001, …
LM for topic 2 (Health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, ..., n-gram 0.00002, …

Sampling words from the LM θd of a document d — P[word | θd] — generates an article (e.g., a “Text Mining” article from topic 1, a “Food Nutrition” article from topic 2); a different θd is sampled for each d.
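A minimal sketch of this generation process, with scaled-down toy distributions standing in for the topic LMs above (the exact probabilities here are assumptions, chosen so each distribution sums to 1):

```python
import random

# Toy topic LMs (assumed values, normalized to sum to 1).
lm_irdm   = {"text": 0.5, "mining": 0.3, "n-gram": 0.1, "cluster": 0.1}
lm_health = {"food": 0.5, "nutrition": 0.3, "healthy": 0.1, "diet": 0.1}

def sample_article(lm, n, seed=0):
    # Draw n words i.i.d. from the unigram LM (text generation).
    rng = random.Random(seed)
    words = list(lm)
    weights = [lm[w] for w in words]
    return rng.choices(words, weights=weights, k=n)

print(sample_article(lm_irdm, 5))
print(sample_article(lm_health, 5))
```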

SLIDE 5

Basic LM for IR

LM for topic 1 (IR&DM): text ?, mining ?, n-gram ?, cluster ?, ..., healthy ?, …
LM for topic 2 (Health): food ?, nutrition ?, healthy ?, diet ?, ..., n-gram ?, …

parameter estimation from the observed articles (“Text Mining”, “Food Nutrition”)

Query q: “data mining algorithms”
Which LM is more likely to generate q? (better explains q)
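A sketch of this comparison, estimating each LM by maximum likelihood from a toy article (the texts are assumptions) and asking which model better explains the query:

```python
from collections import Counter

def mle_lm(text):
    # Unigram MLE: p(w) = freq(w) / number of tokens.
    counts = Counter(text.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def query_likelihood(query, lm):
    prob = 1.0
    for w in query.split():
        prob *= lm.get(w, 0.0)  # unseen words give 0 -> motivates smoothing later
    return prob

lm1 = mle_lm("text mining text data mining algorithms data")  # "Text Mining" article
lm2 = mle_lm("food nutrition healthy diet food")              # "Food Nutrition" article

q = "data mining algorithms"
print(query_likelihood(q, lm1) > query_likelihood(q, lm2))  # lm1 better explains q
```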

SLIDE 6

LM Illustration:

Document as Model and Query as Sample

Model M: A A C A D E E E E C C B A E B
document d: sample of M, used for parameter estimation

query: A A B C E E
estimate likelihood of observing the query: P[query | M]

SLIDE 7

LM Illustration:

Document as Model and Query as Sample

Model M: A A C A D E E E E C C B A E B  +  C A D A B E F
document d + background corpus and/or smoothing, used for parameter estimation

query: A A B C E E
estimate likelihood of observing the query: P[query | M]

SLIDE 8

Prob.-IR vs. Language Models

P[R|d,q]: user likes doc (R) given that it has features d and user poses query q

  • prob. IR (ranking proportional to relevance odds): rank by $\frac{P[d \mid R, q]}{P[d \mid \bar{R}, q]}$
  • statist. LM (ranking proportional to query likelihood):
    $P[R \mid d,q] \propto P[q,d \mid R]\cdot P[R] = P[q \mid d,R]\cdot P[d \mid R]\cdot P[R] \propto P[q \mid d]$

query likelihood: $s(q,d) = \log P[q \mid d] = \sum_{j \in q} \log P[j \mid d]$
(the MLE for $P[j \mid d]$ would be $tf_j / |d|$)

top-k query result: $\operatorname{argmax}^{(k)}_{d}\ \log P[q \mid d]$
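A sketch of the query-likelihood score $s(q,d) = \sum_{j \in q} \log P[j \mid d]$ with the plain MLE $tf_j/|d|$ (the toy document is an assumption); it also shows why one unseen query term is fatal without smoothing:

```python
import math
from collections import Counter

def score(query, doc):
    # s(q,d) = sum over query terms j of log P[j|d], with MLE P[j|d] = tf_j / |d|.
    tf = Counter(doc.split())
    dlen = sum(tf.values())
    s = 0.0
    for j in query.split():
        p = tf[j] / dlen            # MLE; zero for unseen terms
        if p == 0.0:
            return float("-inf")    # without smoothing, one unseen term kills the doc
        s += math.log(p)
    return s

d = "france won the fifa world cup"
print(score("world cup", d))        # log(1/6) + log(1/6)
print(score("world series", d))     # -inf: "series" never occurs in d
```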

SLIDE 9

Multi-Bernoulli vs. Multinomial LM

Multi-Bernoulli:
$P[q \mid d] = \prod_j p_j(d)^{X_j(q)}\,(1 - p_j(d))^{1 - X_j(q)}$
with $X_j(q) = 1$ if $j \in q$, 0 otherwise

Multinomial:
$P[q \mid d] = \frac{|q|!}{f_1(q)!\,f_2(q)!\,\cdots\,f_m(q)!}\,\prod_j p_j(d)^{f_j(q)}$
with $f_j(q) = f(j)$ = frequency of j in q and $\sum_j f(j) = |q|$

⇒ multinomial LM more expressive and usually preferred
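The two query models above can be sketched side by side; the per-term probabilities $p_j(d)$ and the queries are toy assumptions:

```python
from math import factorial, prod
from collections import Counter

p = {"data": 0.2, "mining": 0.1, "text": 0.3}  # p_j(d) over a tiny vocabulary

def multi_bernoulli(query_terms, p):
    # prod_j p_j^{X_j} (1-p_j)^{1-X_j}, X_j = 1 iff term j occurs in q (set semantics)
    return prod(p[j] if j in query_terms else 1 - p[j] for j in p)

def multinomial(query_tokens, p):
    # |q|! / prod_j f_j(q)!  *  prod_j p_j^{f_j(q)}  (bag semantics, counts matter)
    f = Counter(query_tokens)
    coeff = factorial(sum(f.values()))
    for c in f.values():
        coeff //= factorial(c)
    return coeff * prod(p[j] ** c for j, c in f.items())

print(multi_bernoulli({"data", "mining"}, p))      # 0.2 * 0.1 * (1 - 0.3)
print(multinomial(["data", "mining", "data"], p))  # 3 * 0.2^2 * 0.1
```

The multinomial version distinguishes “data mining data” from “data mining” (term frequencies enter the score); the multi-Bernoulli version cannot, which is one reason the multinomial LM is usually preferred.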

SLIDE 10

LM Scoring by Kullback-Leibler Divergence

$\log_2 P[q \mid d] = \log_2\left[\frac{|q|!}{f_1(q)!\,f_2(q)!\,\cdots\,f_m(q)!}\,\prod_j p_j(d)^{f_j(q)}\right] \;\propto\; \sum_{j \in q} f_j(q)\,\log_2 p_j(d) \;=\; -H(f(q),\,p(d))$
(neg. cross-entropy)

$-H(f(q),\,p(d)) + H(f(q)) \;=\; -\sum_j f_j(q)\,\log_2\frac{f_j(q)}{p_j(d)} \;=\; -D(f(q)\,\|\,p(d))$
(neg. KL divergence of q and d = neg. cross-entropy + entropy)
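A sketch of ranking by negative KL divergence $-D(f(q)\,\|\,p(d))$, which for the multinomial model orders documents the same way as query likelihood (the two smoothed document LMs here are toy assumptions with no zero entries on the query terms):

```python
import math
from collections import Counter

def neg_kl(query_tokens, p_d):
    # -D(f(q) || p(d)) with f(q) the relative term frequencies of the query.
    f = Counter(query_tokens)
    total = sum(f.values())
    fq = {j: c / total for j, c in f.items()}
    return -sum(fj * math.log2(fj / p_d[j]) for j, fj in fq.items())

p_d1 = {"data": 0.3, "mining": 0.3, "text": 0.4}   # doc LM closer to the query
p_d2 = {"data": 0.1, "mining": 0.1, "text": 0.8}   # doc LM far from the query

q = ["data", "mining"]
print(neg_kl(q, p_d1) > neg_kl(q, p_d2))  # d1 ranks above d2
```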

SLIDE 11

III.4.2 Smoothing Methods


Possible methods:

  • Laplace smoothing
  • Absolute Discounting
  • Jelinek-Mercer smoothing
  • Dirichlet-prior smoothing
  • Katz smoothing
  • Good-Turing smoothing
  • ...

most with their own parameters

Smoothing is absolutely crucial to avoid overfitting and to make LMs useful in practice (one LM per doc, one LM per query)! The choice of method and its parameter settings are still mostly “black art” (i.e., determined empirically).

SLIDE 12

Laplace Smoothing and Absolute Discounting

Estimation of θd: MLE would yield $\hat p_j(d) = \frac{freq(j,d)}{|d|}$ where $|d| = \sum_j freq(j,d)$.

Additive Laplace smoothing (for multinomial over vocabulary W with |W| = m):
$\hat p_j(d) = \frac{freq(j,d) + 1}{|d| + m}$

Absolute discounting (with corpus C, δ ∈ [0,1]):
$\hat p_j(d) = \frac{\max(freq(j,d) - \delta,\ 0)}{|d|} + \sigma_d \cdot \frac{freq(j,C)}{|C|}$
where $\sigma_d = \delta \cdot \frac{\#\,\text{distinct terms in } d}{|d|}$
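The two estimators above, as a sketch (the toy document and corpus are assumptions):

```python
from collections import Counter

def laplace(term, doc_tf, m):
    # (freq(j,d) + 1) / (|d| + m) over a vocabulary of size m.
    return (doc_tf[term] + 1) / (sum(doc_tf.values()) + m)

def absolute_discount(term, doc_tf, corpus_tf, delta=0.5):
    dlen = sum(doc_tf.values())
    clen = sum(corpus_tf.values())
    sigma = delta * len(doc_tf) / dlen  # delta * #distinct terms in d / |d|
    return max(doc_tf[term] - delta, 0) / dlen + sigma * corpus_tf[term] / clen

d = Counter("data mining data algorithms".split())
C = Counter("data mining algorithms text retrieval data data".split())

print(laplace("mining", d, m=10))        # (1 + 1) / (4 + 10)
print(absolute_discount("text", d, C))   # unseen in d, but still > 0
```

Both estimators give unseen terms nonzero probability: Laplace by adding one pseudo-count everywhere, absolute discounting by redistributing the subtracted mass according to the corpus LM.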

SLIDE 13

Jelinek-Mercer Smoothing

Idea: use a linear combination of the doc LM with a background LM (corpus, common language); could also consider a query log as background LM for the query.

$\hat p_j(d) = \lambda\,\frac{freq(j,d)}{|d|} + (1-\lambda)\,\frac{freq(j,C)}{|C|}$

Parameter tuning of λ by cross-validation with held-out data:

  • divide set of relevant (d,q) pairs into n partitions
  • build LM on the pairs from n−1 partitions
  • choose λ to maximize precision (or recall or F1) on the n-th partition
  • iterate with different choice of the n-th partition and average
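The Jelinek-Mercer estimator itself is a one-liner; a sketch with a toy document and corpus (assumed), showing how unseen document terms still get probability mass from the background:

```python
from collections import Counter

def jm_prob(term, doc_tf, corpus_tf, lam=0.5):
    # lambda * freq(j,d)/|d| + (1-lambda) * freq(j,C)/|C|
    dlen = sum(doc_tf.values())
    clen = sum(corpus_tf.values())
    return lam * doc_tf[term] / dlen + (1 - lam) * corpus_tf[term] / clen

d = Counter("data mining data".split())
C = Counter("data mining text retrieval text".split())

print(jm_prob("data", d, C))   # 0.5*(2/3) + 0.5*(1/5)
print(jm_prob("text", d, C))   # unseen in d: 0.5*0 + 0.5*(2/5)
```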
SLIDE 14

Jelinek-Mercer Smoothing: Relationship to TF*IDF

$P[q] = \lambda\,P[q \mid d] + (1-\lambda)\,P[q \mid C]$

$\log P[q] = \sum_{i \in q} \log\left(\lambda\,\frac{tf(i,d)}{\sum_k tf(k,d)} + (1-\lambda)\,\frac{df(i)}{\sum_k df(k)}\right)$

$= \sum_{i \in q} \log\left(1 + \frac{\lambda}{1-\lambda}\cdot\frac{tf(i,d)}{\sum_k tf(k,d)}\cdot\frac{\sum_k df(k)}{df(i)}\right) + const.$

with absolute frequencies tf, df: relative tf ~ relative idf

SLIDE 15

Dirichlet-Prior Smoothing

MAP for θ with Dirichlet distribution as prior:
$M(\theta) := P[\theta \mid f] = \frac{P[f \mid \theta]\,P[\theta]}{P[f]} \propto P[f \mid \theta]\,P[\theta]$
with term frequencies f in document d

Dirichlet:
$f(\theta_1,\ldots,\theta_m;\ \alpha_1,\ldots,\alpha_m) = \frac{\Gamma\!\left(\sum_{j=1..m}\alpha_j\right)}{\prod_{j=1..m}\Gamma(\alpha_j)}\ \prod_{j=1..m}\theta_j^{\alpha_j - 1}$
with $\sum_{j=1..m}\theta_j = 1$

prior: $\theta \sim Dirichlet(\alpha)$  ⇒  posterior: $\theta \mid f \sim Dirichlet(\alpha + f)$
(Dirichlet is conjugate prior for the parameters of the multinomial distribution: Dirichlet prior implies Dirichlet posterior, only with different parameters)

$\hat p_j(d) = \hat\theta_j = \arg\max_\theta M(\theta) = \frac{f_j + \alpha_j - 1}{n + \sum_j \alpha_j - m} = \frac{|d|}{|d| + \mu}\,P[j \mid d] + \frac{\mu}{|d| + \mu}\,P[j \mid C]$
with αj set to μ·P[j|C] + 1 for the Dirichlet hypergenerator and μ > 1 set to a multiple of the average document length
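A sketch of the resulting estimator $\hat p_j(d) = \frac{f_j + \mu\,P[j|C]}{|d| + \mu}$, with toy texts and μ as assumptions; it also checks the mixture form with $\lambda = |d|/(|d|+\mu)$:

```python
from collections import Counter

def dirichlet_prob(term, doc_tf, corpus_tf, mu=2.0):
    # (f_j + mu * P[j|C]) / (|d| + mu)
    dlen = sum(doc_tf.values())
    clen = sum(corpus_tf.values())
    p_c = corpus_tf[term] / clen
    return (doc_tf[term] + mu * p_c) / (dlen + mu)

d = Counter("data mining data".split())
C = Counter("data mining text retrieval text".split())

# equivalently: lambda * P[j|d] + (1-lambda) * P[j|C] with lambda = |d| / (|d| + mu)
lam = 3 / (3 + 2.0)
print(dirichlet_prob("data", d, C))
print(lam * (2 / 3) + (1 - lam) * (1 / 5))   # same value
```

In contrast to Jelinek-Mercer, the effective interpolation weight depends on |d|: long documents trust their own counts more, short documents lean on the corpus LM.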

SLIDE 16

Dirichlet-Prior Smoothing: Relationship to Jelinek-Mercer Smoothing

$\hat p_j(d) = \lambda\,P[j \mid d] + (1-\lambda)\,P[j \mid C]$ with $\lambda = \frac{|d|}{|d| + \mu}$,
i.e. $\hat p_j(d) = \frac{|d|}{|d| + \mu}\,P[j \mid d] + \frac{\mu}{|d| + \mu}\,P[j \mid C]$

where α1 = μ·P[1|C], ..., αm = μ·P[m|C] are the parameters of the underlying Dirichlet distribution, with constant μ > 1 typically set to a multiple of the average document length; P[j|d] is the MLE $tf_j / |d|$, and P[j|C] the MLE from the corpus.

⇒ Jelinek-Mercer special case of Dirichlet!

SLIDE 17

Effect of Dirichlet Smoothing

[Figure: curves for p(w|c), p(w|d), and p(w|d) using Dirichlet prior]

Source: Rong Jin, Language Modeling Approaches for Information Retrieval, http://www.cse.msu.edu/~cse484/lectures/lang_model.ppt

SLIDE 18

Two-Stage Smoothing [Zhai/Lafferty, TOIS 2004]

Query = “the algorithms for data mining”
d1: 0.04 0.001 0.02 0.002 0.003
d2: 0.02 0.001 0.01 0.003 0.004

p(“algorithms”|d1) = p(“algorithms”|d2), p(“data”|d1) < p(“data”|d2), p(“mining”|d1) < p(“mining”|d2)
But: p(q|d1) > p(q|d2)! We should make p(“the”) and p(“for”) less different across docs.
⇒ Combine Dirichlet smoothing (good at short keyword queries) and Jelinek-Mercer smoothing (good at verbose queries)!

SLIDE 19

Two-Stage Smoothing [Zhai/Lafferty, TOIS 2004]

$P(w \mid d) = \lambda\,\frac{c(w,d) + \mu\,p(w \mid C)}{|d| + \mu} + (1-\lambda)\,p(w \mid U)$

Stage 1 (Dirichlet prior): explain unseen words.
Stage 2 (2-component mixture): explain noise in the query.

U: user’s background LM, or approximated by corpus LM C
Source: Manning/Raghavan/Schütze, lecture12-lmodels.ppt
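A sketch of the two stages composed, approximating the user LM U by the corpus LM C as the slide allows (toy texts, μ and λ are assumptions):

```python
from collections import Counter

def two_stage_prob(w, doc_tf, corpus_tf, mu=2.0, lam=0.7):
    dlen = sum(doc_tf.values())
    clen = sum(corpus_tf.values())
    p_c = corpus_tf[w] / clen
    # Stage 1: Dirichlet prior explains unseen words.
    stage1 = (doc_tf[w] + mu * p_c) / (dlen + mu)
    # Stage 2: Jelinek-Mercer mixture with the user/background LM (U ~ C here).
    return lam * stage1 + (1 - lam) * p_c

d = Counter("data mining data".split())
C = Counter("data mining text retrieval text".split())

print(two_stage_prob("data", d, C))
print(two_stage_prob("text", d, C))  # unseen in d, still > 0 via both stages
```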

SLIDE 20

III.4.3 Extended LMs


Large variety of extensions:

  • Term-specific smoothing

(JM with term-specific λj, e.g., based on idf values)

  • Parsimonious LM

(JM-style smoothing with smaller feature space)

  • N-gram (Sequence) Models (e.g. HMMs)
  • (Semantic) Translation Models
  • Cross-Lingual Models
  • Query-Log- & Click-Stream-based LM
  • LMs for Question Answering
SLIDE 21

(Semantic) Translation Model

$P[q \mid d] = \prod_{j \in q}\ \sum_w P[j \mid w]\ P[w \mid d]$
with word-word translation model P[j|w]

Opportunities and difficulties:

  • synonymy, hypernymy/hyponymy, polysemy
  • efficiency
  • training: estimate P[j|w] by overlap statistics on a background corpus (Dice coefficients, Jaccard coefficients, etc.)
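A sketch of the translation likelihood above; the toy translation table and document LM are assumptions, chosen so that a query word absent from the document (“automobile”) still gets probability via its translation from “car”:

```python
p_w_d = {"car": 0.6, "engine": 0.4}  # document LM P[w|d]

# Word-word translation model P[j|w]: prob. that doc word w generates query word j.
p_j_w = {
    ("automobile", "car"): 0.5, ("car", "car"): 0.5,
    ("motor", "engine"): 0.4, ("engine", "engine"): 0.6,
}

def translation_likelihood(query_terms, p_w_d, p_j_w):
    # P[q|d] = prod over j in q of sum over w of P[j|w] * P[w|d]
    prob = 1.0
    for j in query_terms:
        prob *= sum(p_j_w.get((j, w), 0.0) * pw for w, pw in p_w_d.items())
    return prob

print(translation_likelihood(["automobile"], p_w_d, p_j_w))  # 0.5 * 0.6
print(translation_likelihood(["motor"], p_w_d, p_j_w))       # 0.4 * 0.4
```

This is exactly how the model addresses synonymy: the plain query-likelihood LM would assign “automobile” probability 0 against this document.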

SLIDE 22

Translation Models for Cross-Lingual IR

$P[q \mid d] = \prod_{j \in q}\ \sum_w P[j \mid w]\ P[w \mid d]$
with q in language F (e.g. French) and d in language E (e.g. English)

Needs estimates of P[j|w] from cross-lingual corpora (docs available in both F and E). Can rank docs in E (or F) for queries in F.

Example: q: “moteur de recherche” returns d: “Quaero is a French initiative for developing a search engine that can serve as a European alternative to Google ... ”

See also the CLEF benchmark: http://www.clef-campaign.org/

SLIDE 23

Query-Log-Based LM (User LM)

Idea: For current query qk, leverage the following:

  • prior query history Hq = q1 ... qk−1 and
  • prior click stream Hc = d1 ... dk−1 as background LMs

Example: qk = “Java library” benefits from qk−1 = “cgi programming”

Simple Mixture Model with Fixed Coefficient Interpolation:

$P[w \mid q_i] = \frac{freq(w, q_i)}{|q_i|}$, $\qquad P[w \mid H_q] = \frac{1}{k-1}\sum_{i=1..k-1} P[w \mid q_i]$

$P[w \mid d_i] = \frac{freq(w, d_i)}{|d_i|}$, $\qquad P[w \mid H_c] = \frac{1}{k-1}\sum_{i=1..k-1} P[w \mid d_i]$

$P[w \mid H_q, H_c] = \beta\,P[w \mid H_q] + (1-\beta)\,P[w \mid H_c]$

$P[w] = \alpha\,P[w \mid q_k] + (1-\alpha)\,P[w \mid H_q, H_c]$

More advanced models with Dirichlet priors in the literature…
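The fixed-coefficient interpolation can be sketched end to end; the queries, clicked documents, α, and β are toy assumptions:

```python
from collections import Counter

def lm(text):
    # Unigram MLE over one query or clicked document.
    tf = Counter(text.split())
    n = sum(tf.values())
    return {w: c / n for w, c in tf.items()}

def avg_lm(texts):
    # P[w|H] = average of the per-item LMs.
    lms = [lm(t) for t in texts]
    vocab = {w for m in lms for w in m}
    return {w: sum(m.get(w, 0.0) for m in lms) / len(lms) for w in vocab}

def user_prob(w, q_k, history_q, history_c, alpha=0.6, beta=0.5):
    p_hq = avg_lm(history_q).get(w, 0.0)         # P[w|Hq]
    p_hc = avg_lm(history_c).get(w, 0.0)         # P[w|Hc]
    p_hist = beta * p_hq + (1 - beta) * p_hc     # P[w|Hq,Hc]
    return alpha * lm(q_k).get(w, 0.0) + (1 - alpha) * p_hist

p = user_prob("programming", "java library",
              history_q=["cgi programming"],
              history_c=["perl cgi scripts programming"])
print(p)  # > 0 although "programming" is not in the current query
```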

 

SLIDE 24

Entity Search with LM [Nie et al.: WWW’07]

query: keywords → answer: entities

LM of entity e = prob. distr. of words seen in the context of e

$score(e,q) = \prod_i \big(\lambda\,P[q_i \mid e] + (1-\lambda)\,P[q_i]\big)$ (smoothed with background $P[q_i]$)
$\propto -KL(LM(q)\,\|\,LM(e))$
additionally weighted by extraction accuracy

Query q: “Dutch soccer player Barca”

Candidate entities: e1: Johan Cruyff, e2: Ruud van Nistelroy, e3: Ronaldinho, e4: Zinedine Zidane, e5: FC Barcelona

Example context words: “Dutch goalgetter soccer champion”, “Dutch player Ajax Amsterdam trainer”, “Barca 8 years Camp Nou”, “played soccer FC Barcelona”, “Jordi Cruyff son”, “Zizou champions league 2002 Real Madrid”, “van Nistelroy Dutch soccer world cup”, “best player 2005”, “lost against Barca”

SLIDE 25

Language Models for Question Answering (QA)

Use of LMs:

  • Passage retrieval: likelihood of passage generating the question
  • Translation model: likelihood of answer generating the question, with parameter estimation from a manually compiled question-answer corpus

E.g. factoid questions: who? where? when? ...

Pipeline: question → (question-type-specific NL parsing) → query → (finding most promising short text passages) → passages → (NL parsing and entity extraction) → answers

Example: Where is the Louvre museum located?
... The Louvre is the most visited and one of the oldest, largest, and most famous art galleries and museums in the world. It is located in Paris, France. Its address is Musée du Louvre, 75058 Paris cedex 01. ...
Q: Louvre museum location A: The Louvre museum is in Paris.

SLIDE 26

LM for Temporal Search

Keyword queries that express temporal interest.
Example: q = “FIFA world cup 1990s” would not retrieve doc d = “France won the FIFA world cup in 1998”

Approach:

  • extract temporal phrases from docs
  • normalize temporal expressions
  • split query and docs into text ∪ time

$P[q \mid d] = P[text(q) \mid text(d)] \cdot P[time(q) \mid time(d)]$

$P[time(q) \mid time(d)] = \prod_{x \in tempexpr(q)}\ \sum_{y \in tempexpr(d)} P[x \mid y]$

$P[x \mid y] := \frac{|x \cap y|}{|y|}$ (plus smoothing), with $|x| = end(x) - begin(x)$
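The temporal part can be sketched with intervals as (begin, end) pairs; the year values below are toy assumptions matching the FIFA example:

```python
def p_x_given_y(x, y):
    # P[x|y] = |x ∩ y| / |y|, with |interval| = end - begin.
    overlap = max(0, min(x[1], y[1]) - max(x[0], y[0]))
    return overlap / (y[1] - y[0])

def time_likelihood(query_intervals, doc_intervals):
    # prod over query temp. expressions x of sum over doc temp. expressions y.
    prob = 1.0
    for x in query_intervals:
        prob *= sum(p_x_given_y(x, y) for y in doc_intervals)
    return prob

q_time = [(1990, 2000)]  # normalized "1990s"
d_time = [(1998, 1999)]  # normalized "in 1998"
print(time_likelihood(q_time, d_time))  # "1998" lies entirely inside the 1990s
```

Because “1998” is fully contained in the 1990s interval, P[x|y] = 1 here, so the doc matches the temporal condition; a document mentioning only 2005 would get probability 0 (hence the smoothing noted above).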

SLIDE 27

Summary of Section III.4

  • LMs are a clean form of generative models for docs, corpora, queries:
    • one LM per doc (with doc itself for parameter estimation)
    • likelihood of LM generating query yields ranking of docs
    • for multinomial model: equivalent to ranking by KL(q || d)
  • parameter smoothing is essential:
    • use background corpus, query & click log, etc.
    • Jelinek-Mercer and Dirichlet smoothing perform very well
  • LMs very useful for specialized IR: cross-lingual, passages, etc.
SLIDE 28

Additional Literature for Section III.4

Statistical Language Models in General:

  • Manning/Raghavan/Schütze book, Chapter 12
  • D. Hiemstra: Language Models, Smoothing, and N-grams. In: Encyclopedia of Database Systems, Springer, 2009
  • C. Zhai: Statistical Language Models for Information Retrieval. Morgan & Claypool Publishers, 2008
  • C. Zhai: Statistical Language Models for Information Retrieval: A Critical Review. Foundations and Trends in Information Retrieval 2(3), 2008
  • X. Liu, W.B. Croft: Statistical Language Modeling for Information Retrieval. Annual Review of Information Science and Technology 39, 2004
  • J. Ponte, W.B. Croft: A Language Modeling Approach to Information Retrieval. SIGIR 1998
  • C. Zhai, J. Lafferty: A Study of Smoothing Methods for Language Models Applied to Information Retrieval. TOIS 22(2), 2004
  • C. Zhai, J. Lafferty: A Risk Minimization Framework for Information Retrieval. Information Processing and Management 42, 2006
  • M.E. Maron, J.L. Kuhns: On Relevance, Probabilistic Indexing, and Information Retrieval. Journal of the ACM 7, 1960

SLIDE 29

Additional Literature for Section III.4

LMs for Specific Retrieval Tasks:

  • X. Shen, B. Tan, C. Zhai: Context-Sensitive Information Retrieval Using Implicit Feedback. SIGIR 2005
  • Y. Lv, C. Zhai: Positional Language Models for Information Retrieval. SIGIR 2009
  • V. Lavrenko, M. Choquette, W.B. Croft: Cross-lingual relevance models. SIGIR 2002
  • D. Nguyen, A. Overwijk, C. Hauff, D. Trieschnigg, D. Hiemstra, F. de Jong: WikiTranslate: Query Translation for Cross-Lingual Information Retrieval Using Only Wikipedia. CLEF 2008
  • C. Clarke: Web Question Answering. Encyclopedia of Database Systems, 2009
  • C. Clarke, E.L. Terra: Passage retrieval vs. document retrieval for factoid question answering. SIGIR 2003
  • D. Shen, J.L. Leidner, A. Merkel, D. Klakow: The Alyssa System at TREC 2006: A Statistically-Inspired Question Answering System. TREC 2006
  • Z. Nie, Y. Ma, S. Shi, J.-R. Wen, W.-Y. Ma: Web object retrieval. WWW 2007
  • H. Zaragoza et al.: Ranking very many typed entities on wikipedia. CIKM 2007
  • P. Serdyukov, D. Hiemstra: Modeling Documents as Mixtures of Persons for Expert Finding. ECIR 2008
  • S. Elbassuoni, M. Ramanath, R. Schenkel, M. Sydow, G. Weikum: Language-model-based Ranking for Queries on RDF-Graphs. CIKM 2009
  • K. Berberich, O. Alonso, S. Bedathur, G. Weikum: A Language Modeling Approach for Temporal Information Needs. ECIR 2010