III.4 Statistical Language Models


  1. III.4 Statistical Language Models
     • III.4 Statistical LM (MRS book, Chapter 12*)
       – 4.1 What is a statistical language model?
       – 4.2 Smoothing Methods
       – 4.3 Extended LMs
     *With extensions from: C. Zhai, J. Lafferty: A Study of Smoothing Methods for Language Models Applied to Information Retrieval, ACM TOIS 22(2), 2004

  2. III.4.1 What is a Statistical Language Model?
     A generative model for word sequences: it defines a probability distribution over word sequences (or bags of words, sets of words, structured docs, ...).
     Example:
       – P["Today is Tuesday"] = 0.01
       – P["The Eigenvalue is positive"] = 0.001
       – P["Today Wednesday is"] = 0.000001
     The LM itself is highly context- and application-dependent.
     Application examples:
     • Speech recognition: given that we heard "Julia" and "feels", how likely will we next hear "happy" or "habit"?
     • Text classification: given that we saw "soccer" 3 times and "game" 2 times, how likely is the news about sports?
     • Information retrieval: given that the user is interested in math, how likely would the user use "distribution" in a query?

  3. Types of Language Models
     Key idea: A document is a good match to a query if the document model is likely to generate the query, i.e., if P(q|d) "is high".
     A language model is well-formed over alphabet Σ if $\sum_{s \in \Sigma^*} P(s) = 1$.
     Example distributions:
     • Generic LM (over sequences): P["Today is Tuesday"] = 0.01, P["The Eigenvalue is positive"] = 0.001, P["Today Wednesday is"] = 0.00001, ...
     • Unigram LM (over terms): P["Today"] = 0.1, P["is"] = 0.3, P["Tuesday"] = 0.2, P["Wednesday"] = 0.2, ...
     • Bigram LM (over term pairs): P["Today"] = 0.1, P["is" | "Today"] = 0.4, P["Tuesday" | "is"] = 0.8, ...
     How to handle sequences?
     • Chain rule (requires long chains of conditional probabilities):
       $P(t_1 t_2 t_3 t_4) = P(t_1)\, P(t_2|t_1)\, P(t_3|t_1 t_2)\, P(t_4|t_1 t_2 t_3)$
     • Bigram LM (pairwise conditional probabilities):
       $P_{bi}(t_1 t_2 t_3 t_4) = P(t_1)\, P(t_2|t_1)\, P(t_3|t_2)\, P(t_4|t_3)$
     • Unigram LM (no conditional probabilities):
       $P_{uni}(t_1 t_2 t_3 t_4) = P(t_1)\, P(t_2)\, P(t_3)\, P(t_4)$
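To make the two tractable factorizations concrete, here is a minimal Python sketch (an illustration, not from the lecture) using the slide's example unigram and bigram probabilities:

```python
# Minimal sketch of the unigram and bigram factorizations, using the
# example probabilities from this slide; the helper names are illustrative.

unigram = {"Today": 0.1, "is": 0.3, "Tuesday": 0.2, "Wednesday": 0.2}
bigram = {("Today", "is"): 0.4, ("is", "Tuesday"): 0.8}  # P[t_i | t_{i-1}]

def p_uni(tokens):
    # P_uni(t1 ... tn) = prod_i P(t_i)
    p = 1.0
    for t in tokens:
        p *= unigram[t]
    return p

def p_bi(tokens):
    # P_bi(t1 ... tn) = P(t1) * prod_i P(t_i | t_{i-1})
    p = unigram[tokens[0]]
    for prev, cur in zip(tokens, tokens[1:]):
        p *= bigram[(prev, cur)]
    return p

print(p_uni(["Today", "is", "Tuesday"]))  # 0.1 * 0.3 * 0.2 = 0.006
print(p_bi(["Today", "is", "Tuesday"]))   # 0.1 * 0.4 * 0.8 = 0.032
```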

  4. Text Generation with (Unigram) LM
     An LM θ_d defines P[word | θ_d]; a sample document d is generated by repeatedly drawing words from θ_d. A different θ_d is used for different d.
     [Figure: two topic LMs generating sample documents]
     • LM for topic 1 (article on "Text Mining"): text 0.2, mining 0.1, n-gram 0.01, cluster 0.02, healthy 0.000001, ...
     • LM for topic 2 (article on "Food Nutrition" / health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, n-gram 0.00002, ...
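A hypothetical sketch of this generation process, drawing words i.i.d. from a unigram LM (the probabilities are the slide's topic-1 values over a tiny vocabulary; `random.choices` normalizes the weights):

```python
import random

theta_d = {"text": 0.2, "mining": 0.1, "cluster": 0.02, "n-gram": 0.01}
words, weights = zip(*theta_d.items())

def sample_document(n_words):
    # unigram assumption: every word is drawn independently from theta_d
    return random.choices(words, weights=weights, k=n_words)

print(" ".join(sample_document(10)))
# e.g. "text text mining cluster text mining text text n-gram text"
```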

  5. Basic LM for IR
     [Figure: the two topic LMs from the previous slide, but with unknown word probabilities "?" that must be obtained by parameter estimation from the articles on "Text Mining" and "Food Nutrition".]
     Given a query q, e.g. q = "data mining algorithms": which LM is more likely to generate q (i.e., better explains q)?

  6. LM Illustration: Document as Model and Query as Sample
     [Figure: a document d, depicted as a bag of symbols (A A A A B B C C C D E E E E E), is a sample of an underlying model M and is used for parameter estimation; for a query (A A B C E E) we estimate the likelihood P[query | M] of observing it under M.]

  7. LM Illustration: Document as Model and Query as Sample
     [Figure: as on the previous slide, but the model M is now estimated from the document d plus a background corpus and/or smoothing, which contributes additional symbol occurrences (e.g. F) beyond those observed in d.]

  8. Prob.-IR vs. Language Models
     P[R|d,q]: probability that the user likes the doc (R), given that it has features d and the user poses query q. By Bayes' rule:
     $P[R|d,q] \propto P[q,d|R]\, P[R] = P[q|d,R]\, P[d|R]\, P[R]$
     • Prob. IR: ranking proportional to the relevance odds.
     • Statistical LM: ranking proportional to the query likelihood P[q|d].
     Query likelihood: $s(q,d) = \log P[q|d] = \sum_{j \in q} \log P[j|d]$
     Top-k query result: $\arg\max^{(k)}_{d} \log P[q|d]$
     The MLE for P[j|d] would be $tf_j / |d|$.
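A minimal sketch of query-likelihood scoring with the MLE $tf_j/|d|$ (function and document are made up for illustration); it also shows why the unsmoothed MLE breaks down:

```python
import math
from collections import Counter

def score_mle(query_terms, doc_terms):
    tf = Counter(doc_terms)
    n = len(doc_terms)
    score = 0.0
    for j in query_terms:
        p = tf[j] / n  # MLE estimate of P[j|d]
        if p == 0.0:
            return float("-inf")  # one unseen term zeroes out P[q|d]
        score += math.log(p)
    return score

doc = "text mining and cluster analysis for text data".split()
print(score_mle(["text", "mining"], doc))        # finite score
print(score_mle(["mining", "algorithms"], doc))  # -inf: motivates smoothing (4.2)
```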

  9. Multi-Bernoulli vs. Multinomial LM
     Multi-Bernoulli:
     $P[q|d] = \prod_j p_j(d)^{X_j(q)}\, (1 - p_j(d))^{1 - X_j(q)}$   with $X_j(q) = 1$ if $j \in q$, 0 otherwise
     Multinomial:
     $P[q|d] = \binom{|q|}{f(j_1)\, f(j_2)\, \ldots\, f(j_{|q|})} \prod_{j \in q} p_j(d)^{f(j)}$   with $f_j(q) = f(j)$ = frequency of j in q and $\sum_j f(j) = |q|$
     The multinomial LM is more expressive and usually preferred.
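The following sketch contrasts the two likelihoods on a toy document LM (the distribution `p_d` is made up; only terms in this tiny vocabulary are handled):

```python
import math
from collections import Counter

p_d = {"data": 0.3, "mining": 0.4, "text": 0.2, "health": 0.1}

def multi_bernoulli(query_terms):
    q = set(query_terms)  # only presence/absence matters
    p = 1.0
    for j, pj in p_d.items():  # product over the whole vocabulary
        p *= pj if j in q else (1.0 - pj)
    return p

def multinomial(query_terms):
    f = Counter(query_terms)
    coeff = math.factorial(len(query_terms))  # multinomial coefficient
    for c in f.values():
        coeff //= math.factorial(c)
    p = float(coeff)
    for j, c in f.items():
        p *= p_d[j] ** c
    return p

print(multi_bernoulli(["data", "mining", "data"]))  # ignores the repeated "data"
print(multinomial(["data", "mining", "data"]))      # term frequencies matter
```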

  10. LM Scoring by Kullback-Leibler Divergence
      $\log_2 P[q|d] = \log_2 \left[ \binom{|q|}{f(j_1)\, f(j_2)\, \ldots\, f(j_{|q|})} \prod_{j \in q} p_j(d)^{f_j(q)} \right] \propto \sum_{j \in q} f_j(q)\, \log_2 p_j(d) = -H(f(q), p(d))$   (neg. cross-entropy)
      The cross-entropy decomposes into entropy plus KL divergence:
      $H(f(q), p(d)) = H(f(q)) + D(f(q) \,\|\, p(d))$
      Since the query entropy H(f(q)) does not depend on d, ranking by negative cross-entropy is equivalent to ranking by the negative KL divergence of θ_q and θ_d:
      $-D(f(q) \,\|\, p(d)) = -\sum_{j \in q} f_j(q)\, \log_2 \frac{f_j(q)}{p_j(d)}$
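A sketch of ranking by negative KL divergence (the document LMs are made-up, already-smoothed distributions so that $p_j(d) > 0$; θ_q is taken as the query's relative term-frequency distribution):

```python
import math
from collections import Counter

def neg_kl(query_terms, p_d):
    f = Counter(query_terms)
    n = len(query_terms)
    score = 0.0
    for j, c in f.items():
        q_j = c / n  # theta_q: relative frequency of j in the query
        score -= q_j * math.log2(q_j / p_d[j])
    return score  # higher (closer to 0) means a better match

p_d1 = {"data": 0.3, "mining": 0.5, "health": 0.2}
p_d2 = {"data": 0.1, "mining": 0.1, "health": 0.8}
q = ["data", "mining"]
print(neg_kl(q, p_d1), neg_kl(q, p_d2))  # d1 scores higher than d2
```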

  11. III.4.2 Smoothing Methods
      Smoothing is absolutely crucial to avoid overfitting and make LMs useful in practice (one LM per doc, one LM per query)!
      Possible methods (most with their own parameters):
      • Laplace smoothing
      • Absolute discounting
      • Jelinek-Mercer smoothing
      • Dirichlet-prior smoothing
      • Katz smoothing
      • Good-Turing smoothing
      • ...
      Choice and parameter setting are still mostly "black art" (or empirical).

  12. Laplace Smoothing and Absolute Discounting
      Estimation of θ_d by MLE would yield $p_j(d) = \frac{freq(j,d)}{|d|}$ where $|d| = \sum_j freq(j,d)$.
      Additive Laplace smoothing, for a multinomial over vocabulary W with |W| = m:
      $\hat p_j(d) = \frac{freq(j,d) + 1}{|d| + m}$
      Absolute discounting, with corpus C and δ ∈ [0,1]:
      $\hat p_j(d) = \frac{\max(freq(j,d) - \delta,\, 0)}{|d|} + \frac{\delta\, d_u}{|d|} \cdot \frac{freq(j,C)}{|C|}$
      where $d_u$ = number of distinct terms in d.
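A sketch of both estimators (function names, documents, and parameter values are illustrative):

```python
from collections import Counter

def laplace(j, doc_tf, doc_len, m):
    # (freq(j,d) + 1) / (|d| + m), with m = |W| the vocabulary size
    return (doc_tf[j] + 1) / (doc_len + m)

def absolute_discounting(j, doc_tf, doc_len, corpus_tf, corpus_len, delta=0.7):
    d_u = len(doc_tf)  # number of distinct terms in d
    discounted = max(doc_tf[j] - delta, 0.0) / doc_len
    backoff = (delta * d_u / doc_len) * (corpus_tf[j] / corpus_len)
    return discounted + backoff

doc = "text mining and cluster analysis for text data".split()
corpus = doc + "food nutrition and healthy diet data mining".split()
d_tf, c_tf = Counter(doc), Counter(corpus)
print(laplace("mining", d_tf, len(doc), m=1000))
print(absolute_discounting("mining", d_tf, len(doc), c_tf, len(corpus)))
```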

  13. Jelinek-Mercer Smoothing
      Idea: use a linear combination of the doc LM with a background LM (corpus, common language); for the query side, one could also use the query log as background LM:
      $\hat p_j(d) = (1 - \lambda)\, \frac{freq(j,d)}{|d|} + \lambda\, \frac{freq(j,C)}{|C|}$
      Parameter tuning of λ by cross-validation with held-out data (see the sketch after this list):
      • divide the set of relevant (d,q) pairs into n partitions
      • build the LM on the pairs from n−1 partitions
      • choose λ to maximize precision (or recall or F1) on the n-th partition
      • iterate with a different choice of the held-out partition and average
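A minimal sketch of the estimator itself (λ would be tuned by the cross-validation loop above; the document and corpus are made up):

```python
from collections import Counter

def jelinek_mercer(j, doc_tf, doc_len, corpus_tf, corpus_len, lam=0.5):
    p_doc = doc_tf[j] / doc_len       # freq(j,d) / |d|
    p_bg = corpus_tf[j] / corpus_len  # freq(j,C) / |C|
    return (1 - lam) * p_doc + lam * p_bg

doc = "text mining and cluster analysis for text data".split()
corpus = doc + "food nutrition and healthy diet data mining".split()
print(jelinek_mercer("diet", Counter(doc), len(doc), Counter(corpus), len(corpus)))
# "diet" never occurs in d, yet gets nonzero probability from the background LM
```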

  14. Jelinek-Mercer Smoothing: Relationship to TF*IDF
      $P[q|\theta] = (1-\lambda)\, P[q|d] + \lambda\, P[q|C]$
      With absolute frequencies tf, df:
      $\log P[q|\theta] = \sum_{i \in q} \log \left[ (1-\lambda)\, \frac{tf(i,d)}{\sum_k tf(k,d)} + \lambda\, \frac{df(i)}{\sum_k df(k)} \right]$
      $= \sum_{i \in q} \log \left[ 1 + \frac{1-\lambda}{\lambda} \cdot \frac{tf(i,d)}{\sum_k tf(k,d)} \cdot \frac{\sum_k df(k)}{df(i)} \right] + \text{const}$
      The dropped term depends only on q, not on d, so it does not affect the ranking. Inside the log, a relative tf is multiplied by a relative idf: JM-smoothed query likelihood thus mimics TF*IDF ranking.
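A quick numeric check of the rearrangement, for a single query term with made-up relative tf and df values:

```python
import math

lam, rel_tf, rel_df = 0.5, 0.05, 0.01  # tf(i,d)/sum_k tf(k,d), df(i)/sum_k df(k)

direct = math.log((1 - lam) * rel_tf + lam * rel_df)
# the dropped constant per term is log(lam * rel_df), independent of d
rearranged = math.log(1 + (1 - lam) / lam * rel_tf / rel_df) + math.log(lam * rel_df)
print(abs(direct - rearranged) < 1e-12)  # True: the two forms agree
```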

  15. Dirichlet-Prior Smoothing
      MAP estimation for θ with a Dirichlet distribution as prior:
      prior: $\theta \sim Dirichlet(\alpha)$
      posterior: $M(\theta) := P[\theta | f] = \frac{P[f | \theta]\, P[\theta]}{\int P[f | \theta']\, P[\theta']\, d\theta'} \sim Dirichlet(\alpha + f)$   with term frequencies f in document d
      $\hat p_j(d) = \arg\max_\theta M(\theta) = \frac{f_j + \alpha_j - 1}{n + \sum_k \alpha_k - m}$
      With α_j set to μ·P[j|C] + 1 for the Dirichlet hypergenerator, and μ > 1 set to a multiple of the average document length, this becomes:
      $\hat p_j(d) = \frac{freq(j,d) + \mu\, P[j|C]}{|d| + \mu} = \frac{|d|}{|d| + \mu}\, \hat P[j|d] + \frac{\mu}{|d| + \mu}\, P[j|C]$
      Dirichlet density: $f(\theta_1, \ldots, \theta_m; \alpha_1, \ldots, \alpha_m) = \frac{\Gamma(\sum_{j=1..m} \alpha_j)}{\prod_{j=1..m} \Gamma(\alpha_j)} \prod_{j=1..m} \theta_j^{\alpha_j - 1}$   with $\sum_{j=1..m} \theta_j = 1$
      (Dirichlet is the conjugate prior for the parameters of the multinomial distribution: a Dirichlet prior implies a Dirichlet posterior, only with different parameters.)
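A sketch of the resulting estimator (document and corpus are made up; μ = 2000 is in the range the Zhai/Lafferty study reports as working well, but is an assumption here):

```python
from collections import Counter

def dirichlet_smoothing(j, doc_tf, doc_len, corpus_tf, corpus_len, mu=2000.0):
    p_c = corpus_tf[j] / corpus_len  # background model P[j|C]
    return (doc_tf[j] + mu * p_c) / (doc_len + mu)

doc = "text mining and cluster analysis for text data".split()
corpus = doc + "food nutrition and healthy diet data mining".split()
print(dirichlet_smoothing("mining", Counter(doc), len(doc), Counter(corpus), len(corpus)))
# Unlike Jelinek-Mercer, the interpolation weight mu/(|d|+mu) adapts to |d|:
# short documents are smoothed more heavily than long ones.
```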
