 
Background Smoothing
LM, session 8
CS6200: Information Retrieval
Slides by: Jesse Anderton
Limits of Uniform Smoothing

Uniform smoothing assigns the same probability to all unseen words, which isn’t realistic. This is easiest to see for n-gram models:

$$P(\text{house} \mid \text{the}, \text{white}) > P(\text{effortless} \mid \text{the}, \text{white})$$

We strongly believe that “house” is more likely to follow “the white” than “effortless” is, even if neither trigram appears in our training data.

Our bigram counts should help: “white house” probably appears more often than “white effortless.” We can use bigram probabilities as a background distribution to help smooth our trigram probabilities.
Jelinek-Mercer Smoothing

One way to combine foreground and background distributions is to take their linear combination. This is the simplest form of Jelinek-Mercer Smoothing.

$$\hat{p}(e) = \lambda\, p_{fg}(e) + (1 - \lambda)\, p_{bg}(e), \qquad 0 < \lambda < 1$$

For instance, you can smooth n-grams with (n-1)-gram probabilities.

$$\hat{p}(w_n \mid w_1, \ldots, w_{n-1}) = \lambda\, p(w_n \mid w_1, \ldots, w_{n-1}) + (1 - \lambda)\, p(w_n \mid w_2, \ldots, w_{n-1})$$

You can also smooth document estimates with corpus-wide estimates.

$$\hat{p}(w \mid d) = \lambda \frac{tf_{w,d}}{|d|} + (1 - \lambda) \frac{cf_w}{\sum_w cf_w}$$
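The document-level interpolation can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `jelinek_mercer`, the toy corpus, and the choice of λ = 0.5 are all assumptions made for the example.

```python
from collections import Counter


def jelinek_mercer(word, doc_tokens, corpus_tokens, lam=0.5):
    """Interpolate a document (foreground) estimate with a corpus
    (background) estimate: lam * tf/|d| + (1 - lam) * cf/sum(cf)."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie strictly between 0 and 1")
    tf = Counter(doc_tokens)      # term frequencies in the document
    cf = Counter(corpus_tokens)   # collection frequencies in the corpus
    p_fg = tf[word] / len(doc_tokens)
    p_bg = cf[word] / len(corpus_tokens)
    return lam * p_fg + (1.0 - lam) * p_bg


# Hypothetical toy data, chosen only to make the arithmetic easy to follow.
corpus = "the white house is white the house is big the effortless win".split()
doc = "the white house".split()

# "house" occurs in the document; "big" does not, but still gets
# non-zero probability from the background distribution.
print(jelinek_mercer("house", doc, corpus))
print(jelinek_mercer("big", doc, corpus))
```

Because both distributions sum to one over the vocabulary, so does their linear combination, as long as every document word also appears in the corpus.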
Relationship to Laplace Smoothing

Most smoothing techniques amount to finding a particular value for λ in Jelinek-Mercer smoothing.

For instance, add-one smoothing is Jelinek-Mercer smoothing with a uniform background distribution and a particular value of λ.

Pick $\lambda = \dfrac{|d|}{|d| + |V|}$:

$$
\begin{aligned}
\hat{p}(w \mid d) &= \lambda \frac{tf_{w,d}}{|d|} + (1 - \lambda) \frac{1}{|V|} \\
&= \left(\frac{|d|}{|d| + |V|}\right) \frac{tf_{w,d}}{|d|} + \left(\frac{|V|}{|d| + |V|}\right) \frac{1}{|V|} \\
&= \frac{tf_{w,d}}{|d| + |V|} + \frac{1}{|d| + |V|} \\
&= \frac{tf_{w,d} + 1}{|d| + |V|}
\end{aligned}
$$
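A quick numerical check of this identity, written as a small Python sketch. The function names `add_one` and `jm_uniform` and the sample document length and vocabulary size are hypothetical, chosen only for illustration.

```python
import math


def add_one(tf, doc_len, vocab_size):
    """Add-one (Laplace) estimate: (tf + 1) / (|d| + |V|)."""
    return (tf + 1) / (doc_len + vocab_size)


def jm_uniform(tf, doc_len, vocab_size):
    """Jelinek-Mercer with a uniform background 1/|V| and
    lambda = |d| / (|d| + |V|)."""
    lam = doc_len / (doc_len + vocab_size)
    return lam * tf / doc_len + (1.0 - lam) / vocab_size


# Assumed sizes for the check: a 200-word document, a 5,000-word vocabulary.
doc_len, vocab_size = 200, 5000
for tf in (0, 1, 7, 200):
    assert math.isclose(add_one(tf, doc_len, vocab_size),
                        jm_uniform(tf, doc_len, vocab_size))
```

The two estimates agree for every term frequency tried, including tf = 0, which is the case smoothing exists to handle.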
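Relationship to TF-IDF

TF-IDF is also closely related to Jelinek-Mercer smoothing.

If you smooth the query likelihood model with a corpus-wide background probability, the resulting scoring function is proportional to TF and inversely proportional to DF.

$$
\begin{aligned}
\log P(q \mid d) &= \sum_{w \in q} \log\left(\lambda \frac{tf_{w,d}}{|d|} + (1 - \lambda) \frac{df_w}{|c|}\right) \\
&= \sum_{w \in q:\, tf_{w,d} > 0} \log\left(\lambda \frac{tf_{w,d}}{|d|} + (1 - \lambda) \frac{df_w}{|c|}\right) + \sum_{w \in q:\, tf_{w,d} = 0} \log\left((1 - \lambda) \frac{df_w}{|c|}\right) \\
&= \sum_{w \in q:\, tf_{w,d} > 0} \log\left(\frac{\lambda \frac{tf_{w,d}}{|d|} + (1 - \lambda) \frac{df_w}{|c|}}{(1 - \lambda) \frac{df_w}{|c|}}\right) + \sum_{w \in q} \log\left((1 - \lambda) \frac{df_w}{|c|}\right) \\
&\stackrel{rank}{=} \sum_{w \in q:\, tf_{w,d} > 0} \log\left(\frac{\lambda \frac{tf_{w,d}}{|d|}}{(1 - \lambda) \frac{df_w}{|c|}} + 1\right)
\end{aligned}
$$

The last step drops the sum over all query terms: it does not depend on the document, so it shifts every document's score by the same amount and leaves the ranking unchanged.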
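The rank equivalence can be checked numerically. The Python sketch below assumes toy statistics (the document frequencies, collection size, and two small documents are invented for illustration) and confirms that the full log-likelihood and the simplified score differ only by a term that is the same for every document.

```python
import math


def full_log_likelihood(query, tf, doc_len, df, num_docs, lam):
    """Sum over all query terms of log(lam*tf/|d| + (1-lam)*df/|c|)."""
    return sum(
        math.log(lam * tf.get(w, 0) / doc_len + (1 - lam) * df[w] / num_docs)
        for w in query
    )


def rank_equivalent(query, tf, doc_len, df, num_docs, lam):
    """Simplified score: only query terms that occur in the document."""
    return sum(
        math.log((lam * tf[w] / doc_len) / ((1 - lam) * df[w] / num_docs) + 1)
        for w in query
        if tf.get(w, 0) > 0
    )


def query_constant(query, df, num_docs, lam):
    """Document-independent part dropped by the rank-equivalence step."""
    return sum(math.log((1 - lam) * df[w] / num_docs) for w in query)


# Hypothetical statistics: 1,000 documents in the collection.
query = ["president", "lincoln"]
df = {"president": 200, "lincoln": 20}
docs = [({"president": 3, "lincoln": 2}, 100), ({"president": 1}, 50)]

for tf, doc_len in docs:
    full = full_log_likelihood(query, tf, doc_len, df, 1000, 0.5)
    simple = rank_equivalent(query, tf, doc_len, df, 1000, 0.5)
    assert math.isclose(full, simple + query_constant(query, df, 1000, 0.5))
```

Each term of the simplified score grows with tf and shrinks as df grows, which is the TF-IDF-like behaviour the derivation points to.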
Dirichlet Smoothing

Dirichlet Smoothing is the same as Jelinek-Mercer smoothing, picking λ based on document length and a parameter μ, an estimate of the average doc length.

$$\hat{p}(w \mid d) = \frac{tf_{w,d} + \mu \frac{cf_w}{\sum_w cf_w}}{|d| + \mu}, \qquad \lambda = 1 - \frac{\mu}{|d| + \mu}$$

$$\log p(q \mid d) = \sum_{w \in q} \log \frac{tf_{w,d} + \mu \frac{cf_w}{\sum_w cf_w}}{|d| + \mu}$$

The scoring function above is the Bayesian posterior using a Dirichlet prior with parameters:

$$\left(\mu \frac{cf_{w_1}}{\sum_w cf_w}, \ldots, \mu \frac{cf_{w_n}}{\sum_w cf_w}\right)$$
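The claim that Dirichlet smoothing is Jelinek-Mercer smoothing with a document-dependent λ can be verified with a short Python sketch. The function names and the sample values for μ, document length, and background probability are assumptions for illustration only.

```python
import math


def dirichlet(tf, doc_len, p_bg, mu):
    """Dirichlet-smoothed estimate: (tf + mu * p_bg) / (|d| + mu)."""
    return (tf + mu * p_bg) / (doc_len + mu)


def jelinek_mercer(tf, doc_len, p_bg, lam):
    """Linear interpolation of the document and background estimates."""
    return lam * tf / doc_len + (1.0 - lam) * p_bg


# Assumed values: mu = 2,000 and a background probability of 0.001.
mu, p_bg = 2000, 0.001
for doc_len in (50, 1800, 100_000):
    lam = 1.0 - mu / (doc_len + mu)   # equals |d| / (|d| + mu)
    for tf in (0, 3, 40):
        assert math.isclose(dirichlet(tf, doc_len, p_bg, mu),
                            jelinek_mercer(tf, doc_len, p_bg, lam))
    print(doc_len, round(lam, 3))
```

Short documents get a small λ and lean on the background; long documents get a λ close to 1 and rely mostly on their own counts.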
Example: Dirichlet Smoothing

Query: “president lincoln”

Quantity                   Value
$tf_{president,d}$         15
$cf_{president}$           160,000
$tf_{lincoln,d}$           25
$cf_{lincoln}$             2,400
$|d|$                      1,800
$\sum_w cf_w$              $10^9$
$\mu$                      2,000

$$
\begin{aligned}
\log p(q \mid d) &= \sum_{w \in q} \log \frac{tf_{w,d} + \mu \frac{cf_w}{\sum_w cf_w}}{|d| + \mu} \\
&= \log \frac{15 + 2{,}000 \times (160{,}000 / 10^9)}{1{,}800 + 2{,}000} + \log \frac{25 + 2{,}000 \times (2{,}400 / 10^9)}{1{,}800 + 2{,}000} \\
&= \log(15.32 / 3{,}800) + \log(25.005 / 3{,}800) \\
&= -5.51 + -5.02 = -10.53
\end{aligned}
$$
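The same arithmetic can be reproduced in Python. This is a sketch under one stated assumption: the logarithm is taken to be the natural log, which is what matches the per-term values −5.51 and −5.02. The function name `dirichlet_log_score` is hypothetical.

```python
import math


def dirichlet_log_score(term_stats, doc_len, corpus_len, mu):
    """Sum of log Dirichlet-smoothed probabilities over the query terms.

    term_stats is a list of (tf, cf) pairs, one per query term.
    Natural logarithms are assumed.
    """
    return sum(
        math.log((tf + mu * cf / corpus_len) / (doc_len + mu))
        for tf, cf in term_stats
    )


# Values from the example: (tf, cf) for "president" and "lincoln".
stats = [(15, 160_000), (25, 2_400)]
score = dirichlet_log_score(stats, doc_len=1_800, corpus_len=10**9, mu=2_000)
print(score)
```

The unrounded total comes out at roughly −10.54; the −10.53 in the worked example is what you get by adding the two per-term values after rounding each to two decimals.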
Effect of Dirichlet Smoothing

Dirichlet Smoothing is a good choice for many IR tasks.

• As with all smoothing techniques, it never assigns zero probability to a term.
• It is a Bayesian posterior which considers how the document differs from the corpus.
• It normalizes by document length, so estimates from short documents and long documents are comparable.
• It runs quickly, compared to many more exotic smoothing techniques.

tf president    tf lincoln    ML Score    Smoothed Score
15              25            -3.937      -10.53
15              1             -5.334      -13.75
15              0             N/A         -19.05
1               25            -5.113      -12.99
0               25            N/A         -14.40
Wrapping Up

Much of this information about smoothing is discussed in more detail, and with empirical analysis, by the following paper:

Chengxiang Zhai and John Lafferty. 2004. A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst. 22, 2 (April 2004), 179-214.

There are many other smoothing techniques. We have focused on those most often used in document scoring for IR.

Next, we’ll look at scoring documents using the query’s language model, instead of the document’s.