Statistical NLP
Spring 2011
Lecture 3: Language Models II
Dan Klein – UC Berkeley
Smoothing
- We often want to make estimates from sparse statistics:
- Smoothing flattens spiky distributions so they generalize better
- Very important all over NLP, but easy to do badly!
- We'll illustrate with bigrams today (h = the previous word here, but the history h could be anything)
P(w | denied the), observed counts (7 total):
  allegations 3, reports 2, claims 1, request 1
  (unseen in this context: charges, motion, benefits, …)
P(w | denied the), smoothed counts (7 total):
  allegations 2.5, reports 1.5, claims 0.5, request 0.5, other 2
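The smoothed counts above behave like absolute discounting with the collected mass spread over unseen words. A minimal Python sketch of that idea (the function name, the discount d = 0.5, and the uniform spread over three unseen words are illustrative assumptions chosen to match the numbers above, not a prescription):

import collections

# Minimal sketch: subtract a fixed discount d from every seen count and
# spread the collected mass uniformly over the unseen words.
# Assumes at least one unseen word; d = 0.5 is an illustrative choice.
def absolute_discount(counts, vocab, d=0.5):
    total = sum(counts.values())
    unseen = [w for w in vocab if w not in counts]
    reserved = d * len(counts)                      # mass removed from seen words
    probs = {w: (c - d) / total for w, c in counts.items()}
    for w in unseen:
        probs[w] = reserved / (total * len(unseen))
    return probs

# The slide's example: P(w | denied the)
counts = {"allegations": 3, "reports": 2, "claims": 1, "request": 1}
vocab = list(counts) + ["charges", "motion", "benefits"]
print(absolute_discount(counts, vocab))
# allegations 2.5/7, reports 1.5/7, claims 0.5/7, request 0.5/7, and 2/7 shared by the unseen words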
Kneser-Ney
- Kneser-Ney smoothing combines these two ideas (an interpolated bigram form is written out below):
  - Absolute discounting
  - Lower-order continuation probabilities
- KN smoothing has repeatedly proven effective
- Why should things work like this?
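For reference, a standard interpolated bigram form (the notation c(h,w) for bigram counts and the discount d are mine; the slide itself gives no equation):

P_KN(w | h) = max(c(h,w) − d, 0) / c(h) + λ(h) · P_cont(w)

P_cont(w) = |{h' : c(h',w) > 0}| / |{(h',w') : c(h',w') > 0}|
λ(h) = d · |{w' : c(h,w') > 0}| / c(h)

The first term is the absolute-discounted bigram estimate; P_cont(w) asks "in how many distinct contexts has w appeared?" rather than "how often has w appeared?", which is the continuation-probability idea.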
Predictive Distributions
- Parameter estimation: from the data "a b c a", the maximum-likelihood estimate is θ = P(w) = [a:0.5, b:0.25, c:0.25]
- With a parameter variable: treat Θ as a random variable with a prior, and condition it on the observed words a b c a
- Predictive distribution: predict the next word W by integrating out Θ rather than committing to a single point estimate (worked example below)
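A worked instance under an assumed symmetric Dirichlet(α) prior (the prior and the choice α = 1 are illustrative, not given on the slide):

P(W_{n+1} = w | w_1 … w_n) = ∫ P(w | θ) p(θ | w_1 … w_n) dθ = (c_w + α) / (n + αV)

where c_w is the count of w in the data, n is the number of observed tokens, and V is the vocabulary size. For "a b c a" with n = 4, V = 3, α = 1: P(a) = 3/7 and P(b) = P(c) = 2/7, instead of the maximum-likelihood [a:0.5, b:0.25, c:0.25]. This "count plus pseudo-count" form is what the restaurant processes on the next slide generalize.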
“Chinese Restaurant” Processes
[Teh, 2006; diagrams from Teh]
- Dirichlet Process restaurant: a new customer joins existing table k with P(k) ∝ c_k, or opens a new table with P(new) ∝ α; a new table's word is drawn from the base distribution over words (e.g. uniform, 1/V)
- Pitman-Yor Process restaurant: P(k) ∝ c_k − d and P(new) ∝ α + dK, where K is the current number of occupied tables and d is the discount
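A small simulation sketch of one seating step in this restaurant view (the function name seat_customer and the values of α, d, and the table occupancies are made up for illustration; setting d = 0 recovers the Dirichlet-process restaurant):

import random

# One customer enters: join existing table k with weight (c_k - d),
# or open a new table with weight (alpha + d*K), K = number of occupied tables.
def seat_customer(table_counts, alpha=1.0, d=0.5):
    K = len(table_counts)
    weights = [c - d for c in table_counts] + [alpha + d * K]
    r = random.uniform(0, sum(weights))
    for k, w in enumerate(weights):
        r -= w
        if r <= 0:
            break
    if k == K:
        table_counts.append(1)     # new table: its word is drawn from the base distribution
    else:
        table_counts[k] += 1       # join existing table k (reuse its word)
    return table_counts

tables = [3, 1]                    # current table occupancies
print(seat_customer(tables))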
Hierarchical Models
[Diagram from Teh: a tree of word distributions. A shared base distribution Θ0 sits at the root; per-context distributions Θa, Θb, Θc, Θd, Θe, Θf, Θg are drawn with Θ0 as their prior, and each generates the words observed after its own context.]
[MacKay and Peto, 94; Teh, 06]
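This tree of Θ's corresponds, in Teh (2006), to a hierarchical Pitman-Yor language model: each context's word distribution is drawn with its shortened context's distribution as the base measure (notation below is mine):

G_u ~ PY(d_|u|, θ_|u|, G_π(u)),   G_∅ ~ PY(d_0, θ_0, Uniform over the vocabulary)

where u is a word history, π(u) is u with its earliest word dropped, and |u| is the context length. The MacKay and Peto reference above is the analogous hierarchical Dirichlet construction (the d = 0 case).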