

SLIDE 1

III.4 Statistical Language Models

November 10, 2011 III.1 IR&DM, WS'11/12

  • III.4 Statistical LM (MRS book, Chapter 12*)

– 4.1 What is a statistical language model?
– 4.2 Smoothing Methods
– 4.3 Extended LMs

*With extensions from: C. Zhai, J. Lafferty: A Study of Smoothing Methods for Language Models Applied to Information Retrieval, TOIS 22(2), 2004

SLIDE 2

III.4.1 What is a Statistical Language Model?

Generative model for word sequences (generates a probability distribution over word sequences, or bags-of-words, or sets-of-words, or structured docs, or ...)

Example:
P[“Today is Tuesday”] = 0.01
P[“The Eigenvalue is positive”] = 0.001
P[“Today Wednesday is”] = 0.000001

The LM itself is highly context- / application-dependent. Application examples:

  • speech recognition: given that we heard “Julia” and “feels”, how likely will we next hear “happy” or “habit”?
  • text classification: given that we saw “soccer” 3 times and “game” 2 times, how likely is the news about sports?
  • information retrieval: given that the user is interested in math, how likely would the user use “distribution” in a query?


SLIDE 3

Types of Language Models

A language model is well-formed over alphabet Σ if $\sum_{s \in \Sigma^*} P[s] = 1$.
Key idea: A document is a good match to a query if the document model is likely to generate the query, i.e., if P(q|d) “is high”.

Generic Language Model:
“Today is Tuesday” 0.01, “The Eigenvalue is positive” 0.001, “Today Wednesday is” 0.00001, …

Unigram Language Model:
“Today” 0.1, “is” 0.3, “Tuesday” 0.2, “Wednesday” 0.2, …

Bigram Language Model:
“Today” 0.1, “is” | “Today” 0.4, “Tuesday” | “is” 0.8, …

How to handle sequences?

  • Chain Rule (requires long chains of cond. prob.):
    $P(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2 \mid t_1)\,P(t_3 \mid t_1 t_2)\,P(t_4 \mid t_1 t_2 t_3)$
  • Bigram LM (pairwise cond. prob.):
    $P_{bi}(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2 \mid t_1)\,P(t_3 \mid t_2)\,P(t_4 \mid t_3)$
  • Unigram LM (no cond. prob.):
    $P_{uni}(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2)\,P(t_3)\,P(t_4)$
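As a small sketch (not from the slides), the three factorizations above can be computed directly; the unigram probabilities are the toy values shown, while the bigram conditionals are made-up assumptions:

```python
# Unigram probabilities from the toy example above.
p_uni = {"Today": 0.1, "is": 0.3, "Tuesday": 0.2, "Wednesday": 0.2}

# Bigram conditionals P[t2|t1]; only the pairs we need (assumed toy values).
p_bi = {("Today", "is"): 0.4, ("is", "Tuesday"): 0.8}

def unigram_prob(seq):
    # P_uni(t1..tn) = prod_i P(t_i)
    prob = 1.0
    for t in seq:
        prob *= p_uni.get(t, 0.0)
    return prob

def bigram_prob(seq):
    # P_bi(t1..tn) = P(t1) * prod_i P(t_i | t_{i-1})
    prob = p_uni.get(seq[0], 0.0)
    for prev, cur in zip(seq, seq[1:]):
        prob *= p_bi.get((prev, cur), 0.0)
    return prob

print(unigram_prob(["Today", "is", "Tuesday"]))  # 0.1 * 0.3 * 0.2
print(bigram_prob(["Today", "is", "Tuesday"]))   # 0.1 * 0.4 * 0.8
```

Note that the bigram model, unlike the unigram model, assigns probability 0 to the scrambled sequence “Today Wednesday is” (no such pairs in the table), mirroring the intuition from the examples above.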

SLIDE 4

Text Generation with (Unigram) LM

LM for topic 1 (IR&DM): text 0.2, mining 0.1, n-gram 0.01, cluster 0.02, ..., healthy 0.000001, …
LM for topic 2 (Health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, ..., n-gram 0.00002, …

Sampling words from the LM θd of a document d — P[word | θd] — generates an article (e.g., a “Text Mining” article from topic 1, a “Food Nutrition” article from topic 2); a different θd is sampled for each d.
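A minimal sketch of this generation process, with scaled-down toy distributions standing in for the topic LMs above (the exact probabilities here are assumptions, chosen so each distribution sums to 1):

```python
import random

# Toy topic LMs (assumed values, normalized to sum to 1).
lm_irdm   = {"text": 0.5, "mining": 0.3, "n-gram": 0.1, "cluster": 0.1}
lm_health = {"food": 0.5, "nutrition": 0.3, "healthy": 0.1, "diet": 0.1}

def sample_article(lm, n, seed=0):
    # Draw n words i.i.d. from the unigram LM (text generation).
    rng = random.Random(seed)
    words = list(lm)
    weights = [lm[w] for w in words]
    return rng.choices(words, weights=weights, k=n)

print(sample_article(lm_irdm, 5))
print(sample_article(lm_health, 5))
```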

SLIDE 5

Basic LM for IR

LM for topic 1 (IR&DM): text ?, mining ?, n-gram ?, cluster ?, ..., healthy ?, …
LM for topic 2 (Health): food ?, nutrition ?, healthy ?, diet ?, ..., n-gram ?, …

parameter estimation from the observed articles (“Text Mining”, “Food Nutrition”)

Query q: “data mining algorithms”
Which LM is more likely to generate q? (better explains q)
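A sketch of this comparison, estimating each LM by maximum likelihood from a toy article (the texts are assumptions) and asking which model better explains the query:

```python
from collections import Counter

def mle_lm(text):
    # Unigram MLE: p(w) = freq(w) / number of tokens.
    counts = Counter(text.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def query_likelihood(query, lm):
    prob = 1.0
    for w in query.split():
        prob *= lm.get(w, 0.0)  # unseen words give 0 -> motivates smoothing later
    return prob

lm1 = mle_lm("text mining text data mining algorithms data")  # "Text Mining" article
lm2 = mle_lm("food nutrition healthy diet food")              # "Food Nutrition" article

q = "data mining algorithms"
print(query_likelihood(q, lm1) > query_likelihood(q, lm2))  # lm1 better explains q
```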

SLIDE 6

LM Illustration:

Document as Model and Query as Sample

Model M: A A C A D E E E E C C B A E B
document d: sample of M, used for parameter estimation

query: A A B C E E
estimate likelihood of observing the query: P[query | M]

SLIDE 7

LM Illustration:

Document as Model and Query as Sample

Model M: A A C A D E E E E C C B A E B  +  C A D A B E F
document d + background corpus and/or smoothing, used for parameter estimation

query: A A B C E E
estimate likelihood of observing the query: P[query | M]

SLIDE 8

Prob.-IR vs. Language Models

P[R|d,q]: user likes doc (R) given that it has features d and user poses query q

  • prob. IR (ranking proportional to relevance odds): rank by $\frac{P[d \mid R, q]}{P[d \mid \bar{R}, q]}$
  • statist. LM (ranking proportional to query likelihood):
    $P[R \mid d,q] \propto P[q,d \mid R]\cdot P[R] = P[q \mid d,R]\cdot P[d \mid R]\cdot P[R] \propto P[q \mid d]$

query likelihood: $s(q,d) = \log P[q \mid d] = \sum_{j \in q} \log P[j \mid d]$
(the MLE for $P[j \mid d]$ would be $tf_j / |d|$)

top-k query result: $\operatorname{argmax}^{(k)}_{d}\ \log P[q \mid d]$
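A sketch of the query-likelihood score $s(q,d) = \sum_{j \in q} \log P[j \mid d]$ with the plain MLE $tf_j/|d|$ (the toy document is an assumption); it also shows why one unseen query term is fatal without smoothing:

```python
import math
from collections import Counter

def score(query, doc):
    # s(q,d) = sum over query terms j of log P[j|d], with MLE P[j|d] = tf_j / |d|.
    tf = Counter(doc.split())
    dlen = sum(tf.values())
    s = 0.0
    for j in query.split():
        p = tf[j] / dlen            # MLE; zero for unseen terms
        if p == 0.0:
            return float("-inf")    # without smoothing, one unseen term kills the doc
        s += math.log(p)
    return s

d = "france won the fifa world cup"
print(score("world cup", d))        # log(1/6) + log(1/6)
print(score("world series", d))     # -inf: "series" never occurs in d
```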

SLIDE 9

Multi-Bernoulli vs. Multinomial LM

Multi-Bernoulli:
$P[q \mid d] = \prod_j p_j(d)^{X_j(q)}\,(1 - p_j(d))^{1 - X_j(q)}$
with $X_j(q) = 1$ if $j \in q$, 0 otherwise

Multinomial:
$P[q \mid d] = \frac{|q|!}{f_1(q)!\,f_2(q)!\,\cdots\,f_m(q)!}\,\prod_j p_j(d)^{f_j(q)}$
with $f_j(q) = f(j)$ = frequency of j in q and $\sum_j f(j) = |q|$

⇒ multinomial LM more expressive and usually preferred
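The two query models above can be sketched side by side; the per-term probabilities $p_j(d)$ and the queries are toy assumptions:

```python
from math import factorial, prod
from collections import Counter

p = {"data": 0.2, "mining": 0.1, "text": 0.3}  # p_j(d) over a tiny vocabulary

def multi_bernoulli(query_terms, p):
    # prod_j p_j^{X_j} (1-p_j)^{1-X_j}, X_j = 1 iff term j occurs in q (set semantics)
    return prod(p[j] if j in query_terms else 1 - p[j] for j in p)

def multinomial(query_tokens, p):
    # |q|! / prod_j f_j(q)!  *  prod_j p_j^{f_j(q)}  (bag semantics, counts matter)
    f = Counter(query_tokens)
    coeff = factorial(sum(f.values()))
    for c in f.values():
        coeff //= factorial(c)
    return coeff * prod(p[j] ** c for j, c in f.items())

print(multi_bernoulli({"data", "mining"}, p))      # 0.2 * 0.1 * (1 - 0.3)
print(multinomial(["data", "mining", "data"], p))  # 3 * 0.2^2 * 0.1
```

The multinomial version distinguishes “data mining data” from “data mining” (term frequencies enter the score); the multi-Bernoulli version cannot, which is one reason the multinomial LM is usually preferred.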

SLIDE 10

LM Scoring by Kullback-Leibler Divergence

$\log_2 P[q \mid d] = \log_2\left[\frac{|q|!}{f_1(q)!\,f_2(q)!\,\cdots\,f_m(q)!}\,\prod_j p_j(d)^{f_j(q)}\right] \;\propto\; \sum_{j \in q} f_j(q)\,\log_2 p_j(d) \;=\; -H(f(q),\,p(d))$
(neg. cross-entropy)

$-H(f(q),\,p(d)) + H(f(q)) \;=\; -\sum_j f_j(q)\,\log_2\frac{f_j(q)}{p_j(d)} \;=\; -D(f(q)\,\|\,p(d))$
(neg. KL divergence of q and d = neg. cross-entropy + entropy)
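A sketch of ranking by negative KL divergence $-D(f(q)\,\|\,p(d))$, which for the multinomial model orders documents the same way as query likelihood (the two smoothed document LMs here are toy assumptions with no zero entries on the query terms):

```python
import math
from collections import Counter

def neg_kl(query_tokens, p_d):
    # -D(f(q) || p(d)) with f(q) the relative term frequencies of the query.
    f = Counter(query_tokens)
    total = sum(f.values())
    fq = {j: c / total for j, c in f.items()}
    return -sum(fj * math.log2(fj / p_d[j]) for j, fj in fq.items())

p_d1 = {"data": 0.3, "mining": 0.3, "text": 0.4}   # doc LM closer to the query
p_d2 = {"data": 0.1, "mining": 0.1, "text": 0.8}   # doc LM far from the query

q = ["data", "mining"]
print(neg_kl(q, p_d1) > neg_kl(q, p_d2))  # d1 ranks above d2
```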

SLIDE 11

III.4.2 Smoothing Methods


Possible methods:

  • Laplace smoothing
  • Absolute Discounting
  • Jelinek-Mercer smoothing
  • Dirichlet-prior smoothing
  • Katz smoothing
  • Good-Turing smoothing
  • ...

most with their own parameters

Smoothing is absolutely crucial to avoid overfitting and to make LMs useful in practice (one LM per doc, one LM per query)! The choice of method and its parameter settings are still mostly “black art” (i.e., determined empirically).

SLIDE 12

Laplace Smoothing and Absolute Discounting

Estimation of θd: MLE would yield $\hat p_j(d) = \frac{freq(j,d)}{|d|}$ where $|d| = \sum_j freq(j,d)$.

Additive Laplace smoothing (for multinomial over vocabulary W with |W| = m):
$\hat p_j(d) = \frac{freq(j,d) + 1}{|d| + m}$

Absolute discounting (with corpus C, δ ∈ [0,1]):
$\hat p_j(d) = \frac{\max(freq(j,d) - \delta,\ 0)}{|d|} + \sigma_d \cdot \frac{freq(j,C)}{|C|}$
where $\sigma_d = \delta \cdot \frac{\#\,\text{distinct terms in } d}{|d|}$
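The two estimators above, as a sketch (the toy document and corpus are assumptions):

```python
from collections import Counter

def laplace(term, doc_tf, m):
    # (freq(j,d) + 1) / (|d| + m) over a vocabulary of size m.
    return (doc_tf[term] + 1) / (sum(doc_tf.values()) + m)

def absolute_discount(term, doc_tf, corpus_tf, delta=0.5):
    dlen = sum(doc_tf.values())
    clen = sum(corpus_tf.values())
    sigma = delta * len(doc_tf) / dlen  # delta * #distinct terms in d / |d|
    return max(doc_tf[term] - delta, 0) / dlen + sigma * corpus_tf[term] / clen

d = Counter("data mining data algorithms".split())
C = Counter("data mining algorithms text retrieval data data".split())

print(laplace("mining", d, m=10))        # (1 + 1) / (4 + 10)
print(absolute_discount("text", d, C))   # unseen in d, but still > 0
```

Both estimators give unseen terms nonzero probability: Laplace by adding one pseudo-count everywhere, absolute discounting by redistributing the subtracted mass according to the corpus LM.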

SLIDE 13

Jelinek-Mercer Smoothing

Idea: use a linear combination of the doc LM with a background LM (corpus, common language); could also consider a query log as background LM for the query.

$\hat p_j(d) = \lambda\,\frac{freq(j,d)}{|d|} + (1-\lambda)\,\frac{freq(j,C)}{|C|}$

Parameter tuning of λ by cross-validation with held-out data:

  • divide set of relevant (d,q) pairs into n partitions
  • build LM on the pairs from n−1 partitions
  • choose λ to maximize precision (or recall or F1) on the n-th partition
  • iterate with different choice of the n-th partition and average
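The Jelinek-Mercer estimator itself is a one-liner; a sketch with a toy document and corpus (assumed), showing how unseen document terms still get probability mass from the background:

```python
from collections import Counter

def jm_prob(term, doc_tf, corpus_tf, lam=0.5):
    # lambda * freq(j,d)/|d| + (1-lambda) * freq(j,C)/|C|
    dlen = sum(doc_tf.values())
    clen = sum(corpus_tf.values())
    return lam * doc_tf[term] / dlen + (1 - lam) * corpus_tf[term] / clen

d = Counter("data mining data".split())
C = Counter("data mining text retrieval text".split())

print(jm_prob("data", d, C))   # 0.5*(2/3) + 0.5*(1/5)
print(jm_prob("text", d, C))   # unseen in d: 0.5*0 + 0.5*(2/5)
```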
SLIDE 14

Jelinek-Mercer Smoothing: Relationship to TF*IDF

$P[q] = \lambda\,P[q \mid d] + (1-\lambda)\,P[q \mid C]$

$\log P[q] = \sum_{i \in q} \log\left(\lambda\,\frac{tf(i,d)}{\sum_k tf(k,d)} + (1-\lambda)\,\frac{df(i)}{\sum_k df(k)}\right)$

$= \sum_{i \in q} \log\left(1 + \frac{\lambda}{1-\lambda}\cdot\frac{tf(i,d)}{\sum_k tf(k,d)}\cdot\frac{\sum_k df(k)}{df(i)}\right) + const.$

with absolute frequencies tf, df: relative tf ~ relative idf

SLIDE 15

Dirichlet-Prior Smoothing

MAP for θ with Dirichlet distribution as prior:
$M(\theta) := P[\theta \mid f] = \frac{P[f \mid \theta]\,P[\theta]}{P[f]} \propto P[f \mid \theta]\,P[\theta]$
with term frequencies f in document d

Dirichlet:
$f(\theta_1,\ldots,\theta_m;\ \alpha_1,\ldots,\alpha_m) = \frac{\Gamma\!\left(\sum_{j=1..m}\alpha_j\right)}{\prod_{j=1..m}\Gamma(\alpha_j)}\ \prod_{j=1..m}\theta_j^{\alpha_j - 1}$
with $\sum_{j=1..m}\theta_j = 1$

prior: $\theta \sim Dirichlet(\alpha)$  ⇒  posterior: $\theta \mid f \sim Dirichlet(\alpha + f)$
(Dirichlet is conjugate prior for the parameters of the multinomial distribution: Dirichlet prior implies Dirichlet posterior, only with different parameters)

$\hat p_j(d) = \hat\theta_j = \arg\max_\theta M(\theta) = \frac{f_j + \alpha_j - 1}{n + \sum_j \alpha_j - m} = \frac{|d|}{|d| + \mu}\,P[j \mid d] + \frac{\mu}{|d| + \mu}\,P[j \mid C]$
with αj set to μ·P[j|C] + 1 for the Dirichlet hypergenerator and μ > 1 set to a multiple of the average document length
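A sketch of the resulting estimator $\hat p_j(d) = \frac{f_j + \mu\,P[j|C]}{|d| + \mu}$, with toy texts and μ as assumptions; it also checks the mixture form with $\lambda = |d|/(|d|+\mu)$:

```python
from collections import Counter

def dirichlet_prob(term, doc_tf, corpus_tf, mu=2.0):
    # (f_j + mu * P[j|C]) / (|d| + mu)
    dlen = sum(doc_tf.values())
    clen = sum(corpus_tf.values())
    p_c = corpus_tf[term] / clen
    return (doc_tf[term] + mu * p_c) / (dlen + mu)

d = Counter("data mining data".split())
C = Counter("data mining text retrieval text".split())

# equivalently: lambda * P[j|d] + (1-lambda) * P[j|C] with lambda = |d| / (|d| + mu)
lam = 3 / (3 + 2.0)
print(dirichlet_prob("data", d, C))
print(lam * (2 / 3) + (1 - lam) * (1 / 5))   # same value
```

In contrast to Jelinek-Mercer, the effective interpolation weight depends on |d|: long documents trust their own counts more, short documents lean on the corpus LM.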

SLIDE 16

Dirichlet-Prior Smoothing: Relationship to Jelinek-Mercer Smoothing

$\hat p_j(d) = \lambda\,P[j \mid d] + (1-\lambda)\,P[j \mid C]$ with $\lambda = \frac{|d|}{|d| + \mu}$,
i.e. $\hat p_j(d) = \frac{|d|}{|d| + \mu}\,P[j \mid d] + \frac{\mu}{|d| + \mu}\,P[j \mid C]$

where α1 = μ·P[1|C], ..., αm = μ·P[m|C] are the parameters of the underlying Dirichlet distribution, with constant μ > 1 typically set to a multiple of the average document length; P[j|d] is the MLE $tf_j / |d|$, and P[j|C] the MLE from the corpus.

⇒ Jelinek-Mercer special case of Dirichlet!

SLIDE 17

Effect of Dirichlet Smoothing

[Figure: curves for p(w|c), p(w|d), and p(w|d) using Dirichlet prior]

Source: Rong Jin, Language Modeling Approaches for Information Retrieval, http://www.cse.msu.edu/~cse484/lectures/lang_model.ppt

SLIDE 18

Two-Stage Smoothing [Zhai/Lafferty, TOIS 2004]

Query = “the algorithms for data mining”
d1: 0.04 0.001 0.02 0.002 0.003
d2: 0.02 0.001 0.01 0.003 0.004

p(“algorithms”|d1) = p(“algorithms”|d2), p(“data”|d1) < p(“data”|d2), p(“mining”|d1) < p(“mining”|d2)
But: p(q|d1) > p(q|d2)! We should make p(“the”) and p(“for”) less different across docs.
⇒ Combine Dirichlet smoothing (good at short keyword queries) and Jelinek-Mercer smoothing (good at verbose queries)!

SLIDE 19

Two-Stage Smoothing [Zhai/Lafferty, TOIS 2004]

$P(w \mid d) = \lambda\,\frac{c(w,d) + \mu\,p(w \mid C)}{|d| + \mu} + (1-\lambda)\,p(w \mid U)$

Stage 1 (Dirichlet prior): explain unseen words.
Stage 2 (2-component mixture): explain noise in the query.

U: user’s background LM, or approximated by corpus LM C
Source: Manning/Raghavan/Schütze, lecture12-lmodels.ppt
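A sketch of the two stages composed, approximating the user LM U by the corpus LM C as the slide allows (toy texts, μ and λ are assumptions):

```python
from collections import Counter

def two_stage_prob(w, doc_tf, corpus_tf, mu=2.0, lam=0.7):
    dlen = sum(doc_tf.values())
    clen = sum(corpus_tf.values())
    p_c = corpus_tf[w] / clen
    # Stage 1: Dirichlet prior explains unseen words.
    stage1 = (doc_tf[w] + mu * p_c) / (dlen + mu)
    # Stage 2: Jelinek-Mercer mixture with the user/background LM (U ~ C here).
    return lam * stage1 + (1 - lam) * p_c

d = Counter("data mining data".split())
C = Counter("data mining text retrieval text".split())

print(two_stage_prob("data", d, C))
print(two_stage_prob("text", d, C))  # unseen in d, still > 0 via both stages
```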

SLIDE 20

III.4.3 Extended LMs


Large variety of extensions:

  • Term-specific smoothing

(JM with term-specific λj, e.g., based on idf values)

  • Parsimonious LM

(JM-style smoothing with smaller feature space)

  • N-gram (Sequence) Models (e.g. HMMs)
  • (Semantic) Translation Models
  • Cross-Lingual Models
  • Query-Log- & Click-Stream-based LM
  • LMs for Question Answering
SLIDE 21

(Semantic) Translation Model

$P[q \mid d] = \prod_{j \in q}\ \sum_w P[j \mid w]\ P[w \mid d]$
with word-word translation model P[j|w]

Opportunities and difficulties:

  • synonymy, hypernymy/hyponymy, polysemy
  • efficiency
  • training: estimate P[j|w] by overlap statistics on a background corpus (Dice coefficients, Jaccard coefficients, etc.)
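A sketch of the translation likelihood above; the toy translation table and document LM are assumptions, chosen so that a query word absent from the document (“automobile”) still gets probability via its translation from “car”:

```python
p_w_d = {"car": 0.6, "engine": 0.4}  # document LM P[w|d]

# Word-word translation model P[j|w]: prob. that doc word w generates query word j.
p_j_w = {
    ("automobile", "car"): 0.5, ("car", "car"): 0.5,
    ("motor", "engine"): 0.4, ("engine", "engine"): 0.6,
}

def translation_likelihood(query_terms, p_w_d, p_j_w):
    # P[q|d] = prod over j in q of sum over w of P[j|w] * P[w|d]
    prob = 1.0
    for j in query_terms:
        prob *= sum(p_j_w.get((j, w), 0.0) * pw for w, pw in p_w_d.items())
    return prob

print(translation_likelihood(["automobile"], p_w_d, p_j_w))  # 0.5 * 0.6
print(translation_likelihood(["motor"], p_w_d, p_j_w))       # 0.4 * 0.4
```

This is exactly how the model addresses synonymy: the plain query-likelihood LM would assign “automobile” probability 0 against this document.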

SLIDE 22

Translation Models for Cross-Lingual IR

$P[q \mid d] = \prod_{j \in q}\ \sum_w P[j \mid w]\ P[w \mid d]$
with q in language F (e.g. French) and d in language E (e.g. English)

Needs estimates of P[j|w] from cross-lingual corpora (docs available in both F and E). Can rank docs in E (or F) for queries in F.

Example: q: “moteur de recherche” returns d: “Quaero is a French initiative for developing a search engine that can serve as a European alternative to Google ... ”

See also the CLEF benchmark: http://www.clef-campaign.org/

SLIDE 23

Query-Log-Based LM (User LM)

Idea: For current query qk, leverage the following:

  • prior query history Hq = q1 ... qk−1 and
  • prior click stream Hc = d1 ... dk−1 as background LMs

Example: qk = “Java library” benefits from qk−1 = “cgi programming”

Simple Mixture Model with Fixed Coefficient Interpolation:

$P[w \mid q_i] = \frac{freq(w, q_i)}{|q_i|}$, $\qquad P[w \mid H_q] = \frac{1}{k-1}\sum_{i=1..k-1} P[w \mid q_i]$

$P[w \mid d_i] = \frac{freq(w, d_i)}{|d_i|}$, $\qquad P[w \mid H_c] = \frac{1}{k-1}\sum_{i=1..k-1} P[w \mid d_i]$

$P[w \mid H_q, H_c] = \beta\,P[w \mid H_q] + (1-\beta)\,P[w \mid H_c]$

$P[w] = \alpha\,P[w \mid q_k] + (1-\alpha)\,P[w \mid H_q, H_c]$

More advanced models with Dirichlet priors in the literature…
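The fixed-coefficient interpolation can be sketched end to end; the queries, clicked documents, α, and β are toy assumptions:

```python
from collections import Counter

def lm(text):
    # Unigram MLE over one query or clicked document.
    tf = Counter(text.split())
    n = sum(tf.values())
    return {w: c / n for w, c in tf.items()}

def avg_lm(texts):
    # P[w|H] = average of the per-item LMs.
    lms = [lm(t) for t in texts]
    vocab = {w for m in lms for w in m}
    return {w: sum(m.get(w, 0.0) for m in lms) / len(lms) for w in vocab}

def user_prob(w, q_k, history_q, history_c, alpha=0.6, beta=0.5):
    p_hq = avg_lm(history_q).get(w, 0.0)         # P[w|Hq]
    p_hc = avg_lm(history_c).get(w, 0.0)         # P[w|Hc]
    p_hist = beta * p_hq + (1 - beta) * p_hc     # P[w|Hq,Hc]
    return alpha * lm(q_k).get(w, 0.0) + (1 - alpha) * p_hist

p = user_prob("programming", "java library",
              history_q=["cgi programming"],
              history_c=["perl cgi scripts programming"])
print(p)  # > 0 although "programming" is not in the current query
```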

 

SLIDE 24

Entity Search with LM [Nie et al.: WWW’07]

query: keywords → answer: entities

LM of entity e = prob. distr. of words seen in the context of e

$score(e,q) = \prod_i \big(\lambda\,P[q_i \mid e] + (1-\lambda)\,P[q_i]\big)$ (smoothed with background $P[q_i]$)
$\propto -KL(LM(q)\,\|\,LM(e))$
additionally weighted by extraction accuracy

Query q: “Dutch soccer player Barca”

Candidate entities: e1: Johan Cruyff, e2: Ruud van Nistelroy, e3: Ronaldinho, e4: Zinedine Zidane, e5: FC Barcelona

Example context words: “Dutch goalgetter soccer champion”, “Dutch player Ajax Amsterdam trainer”, “Barca 8 years Camp Nou”, “played soccer FC Barcelona”, “Jordi Cruyff son”, “Zizou champions league 2002 Real Madrid”, “van Nistelroy Dutch soccer world cup”, “best player 2005”, “lost against Barca”

SLIDE 25

Language Models for Question Answering (QA)

Use of LMs:

  • Passage retrieval: likelihood of passage generating the question
  • Translation model: likelihood of answer generating the question, with parameter estimation from a manually compiled question-answer corpus

E.g. factoid questions: who? where? when? ...

Pipeline: question → (question-type-specific NL parsing) → query → (finding most promising short text passages) → passages → (NL parsing and entity extraction) → answers

Example: Where is the Louvre museum located?
... The Louvre is the most visited and one of the oldest, largest, and most famous art galleries and museums in the world. It is located in Paris, France. Its address is Musée du Louvre, 75058 Paris cedex 01. ...
Q: Louvre museum location A: The Louvre museum is in Paris.

SLIDE 26

LM for Temporal Search

Keyword queries that express temporal interest.
Example: q = “FIFA world cup 1990s” would not retrieve doc d = “France won the FIFA world cup in 1998”

Approach:

  • extract temporal phrases from docs
  • normalize temporal expressions
  • split query and docs into text ∪ time

$P[q \mid d] = P[text(q) \mid text(d)] \cdot P[time(q) \mid time(d)]$

$P[time(q) \mid time(d)] = \prod_{x \in tempexpr(q)}\ \sum_{y \in tempexpr(d)} P[x \mid y]$

$P[x \mid y] := \frac{|x \cap y|}{|y|}$ (plus smoothing), with $|x| = end(x) - begin(x)$
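The temporal part can be sketched with intervals as (begin, end) pairs; the year values below are toy assumptions matching the FIFA example:

```python
def p_x_given_y(x, y):
    # P[x|y] = |x ∩ y| / |y|, with |interval| = end - begin.
    overlap = max(0, min(x[1], y[1]) - max(x[0], y[0]))
    return overlap / (y[1] - y[0])

def time_likelihood(query_intervals, doc_intervals):
    # prod over query temp. expressions x of sum over doc temp. expressions y.
    prob = 1.0
    for x in query_intervals:
        prob *= sum(p_x_given_y(x, y) for y in doc_intervals)
    return prob

q_time = [(1990, 2000)]  # normalized "1990s"
d_time = [(1998, 1999)]  # normalized "in 1998"
print(time_likelihood(q_time, d_time))  # "1998" lies entirely inside the 1990s
```

Because “1998” is fully contained in the 1990s interval, P[x|y] = 1 here, so the doc matches the temporal condition; a document mentioning only 2005 would get probability 0 (hence the smoothing noted above).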

SLIDE 27

Summary of Section III.4

  • LMs are a clean form of generative models for docs, corpora, queries:
    • one LM per doc (with doc itself for parameter estimation)
    • likelihood of LM generating query yields ranking of docs
    • for multinomial model: equivalent to ranking by KL(q || d)
  • parameter smoothing is essential:
    • use background corpus, query & click log, etc.
    • Jelinek-Mercer and Dirichlet smoothing perform very well
  • LMs very useful for specialized IR: cross-lingual, passages, etc.
SLIDE 28

Additional Literature for Section III.4

Statistical Language Models in General:

  • Manning/Raghavan/Schütze book, Chapter 12
  • D. Hiemstra: Language Models, Smoothing, and N-grams. In: Encyclopedia of Database Systems, Springer, 2009
  • C. Zhai: Statistical Language Models for Information Retrieval. Morgan & Claypool Publishers, 2008
  • C. Zhai: Statistical Language Models for Information Retrieval: A Critical Review. Foundations and Trends in Information Retrieval 2(3), 2008
  • X. Liu, W.B. Croft: Statistical Language Modeling for Information Retrieval. Annual Review of Information Science and Technology 39, 2004
  • J. Ponte, W.B. Croft: A Language Modeling Approach to Information Retrieval. SIGIR 1998
  • C. Zhai, J. Lafferty: A Study of Smoothing Methods for Language Models Applied to Information Retrieval. TOIS 22(2), 2004
  • C. Zhai, J. Lafferty: A Risk Minimization Framework for Information Retrieval. Information Processing and Management 42, 2006
  • M.E. Maron, J.L. Kuhns: On Relevance, Probabilistic Indexing, and Information Retrieval. Journal of the ACM 7, 1960

SLIDE 29

Additional Literature for Section III.4

LMs for Specific Retrieval Tasks:

  • X. Shen, B. Tan, C. Zhai: Context-Sensitive Information Retrieval Using Implicit Feedback. SIGIR 2005
  • Y. Lv, C. Zhai: Positional Language Models for Information Retrieval. SIGIR 2009
  • V. Lavrenko, M. Choquette, W.B. Croft: Cross-lingual relevance models. SIGIR 2002
  • D. Nguyen, A. Overwijk, C. Hauff, D. Trieschnigg, D. Hiemstra, F. de Jong: WikiTranslate: Query Translation for Cross-Lingual Information Retrieval Using Only Wikipedia. CLEF 2008
  • C. Clarke: Web Question Answering. Encyclopedia of Database Systems, 2009
  • C. Clarke, E.L. Terra: Passage retrieval vs. document retrieval for factoid question answering. SIGIR 2003
  • D. Shen, J.L. Leidner, A. Merkel, D. Klakow: The Alyssa System at TREC 2006: A Statistically-Inspired Question Answering System. TREC 2006
  • Z. Nie, Y. Ma, S. Shi, J.-R. Wen, W.-Y. Ma: Web object retrieval. WWW 2007
  • H. Zaragoza et al.: Ranking very many typed entities on wikipedia. CIKM 2007
  • P. Serdyukov, D. Hiemstra: Modeling Documents as Mixtures of Persons for Expert Finding. ECIR 2008
  • S. Elbassuoni, M. Ramanath, R. Schenkel, M. Sydow, G. Weikum: Language-model-based Ranking for Queries on RDF-Graphs. CIKM 2009
  • K. Berberich, O. Alonso, S. Bedathur, G. Weikum: A Language Modeling Approach for Temporal Information Needs. ECIR 2010