


Statistical NLP

Spring 2011

Lecture 3: Language Models II

Dan Klein – UC Berkeley

Smoothing

  • We often want to make estimates from sparse statistics:
  • Smoothing flattens spiky distributions so they generalize better
  • Very important all over NLP, but easy to do badly!
  • We’ll illustrate with bigrams today (h = previous word, could be anything).

P(w | denied the), estimated from 7 observations:

  allegations 3, reports 2, claims 1, request 1   (7 total)

After smoothing, some mass is reallocated to unseen words (charges, motion, benefits, …):

  allegations 2.5, reports 1.5, claims 0.5, request 0.5, other 2   (7 total)

[Figure: bar charts of the two distributions over allegations, reports, claims, request, charges, motion, benefits]
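To make the reallocation concrete, here is a toy sketch (the code is illustrative, not from the lecture) that applies an absolute discount of d = 0.5 to each observed count; it reproduces the 2.5 / 1.5 / 0.5 / 0.5 figures above and reserves 4 × 0.5 = 2 counts of probability mass for unseen words.

    import java.util.LinkedHashMap;
    import java.util.Map;

    /** Toy illustration of absolute discounting on the "denied the" counts; d = 0.5 is an assumption. */
    public class AbsoluteDiscountingDemo {
        public static void main(String[] args) {
            Map<String, Double> counts = new LinkedHashMap<>();
            counts.put("allegations", 3.0);
            counts.put("reports", 2.0);
            counts.put("claims", 1.0);
            counts.put("request", 1.0);

            double d = 0.5;                        // discount subtracted from every seen count
            double total = 7.0;                    // total observations of "denied the"
            double reserved = d * counts.size();   // mass reallocated to unseen words ("other")

            for (Map.Entry<String, Double> e : counts.entrySet()) {
                System.out.printf("P(%s | denied the) = %.3f%n", e.getKey(), (e.getValue() - d) / total);
            }
            System.out.printf("P(other | denied the) = %.3f%n", reserved / total);
        }
    }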

Kneser-Ney

  • Kneser-Ney smoothing combines these two ideas:
    • Absolute discounting
    • Lower-order continuation probabilities
  • KN smoothing has repeatedly proven effective
  • Why should things work like this?
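For reference, a standard statement of interpolated Kneser-Ney for bigrams (the slide names the ideas but not the formula), in LaTeX notation:

    P_{KN}(w \mid w_{i-1}) = \frac{\max\bigl(c(w_{i-1}, w) - d,\ 0\bigr)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{cont}(w)

    P_{cont}(w) = \frac{\lvert \{\, w' : c(w', w) > 0 \,\} \rvert}{\lvert \{\, (w', w'') : c(w', w'') > 0 \,\} \rvert},
    \qquad
    \lambda(w_{i-1}) = \frac{d}{c(w_{i-1})} \, \lvert \{\, w : c(w_{i-1}, w) > 0 \,\} \rvert

The first term is the absolutely discounted bigram estimate; the continuation probability P_cont rewards words that have appeared after many different contexts rather than words that are merely frequent, and λ makes the two terms sum to one.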

Predictive Distributions

  • Parameter estimation: from the data a b c a, the maximum-likelihood estimate is
    θ = P(w) = [a: 0.5, b: 0.25, c: 0.25]
  • With parameter variable: treat Θ as a random variable alongside the observed words a b c a
  • Predictive distribution: average over Θ to predict the next word W

[Figure: graphical models for the three cases, with observed words a b c a, latent Θ, and query word W]
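One way to make the predictive-distribution picture concrete, assuming a symmetric Dirichlet(β) prior on Θ (the slide does not specify the prior): integrating out the parameter gives

    P(W = w \mid \text{data}) = \int P(w \mid \theta)\, p(\theta \mid \text{data})\, d\theta = \frac{c_w + \beta}{N + V\beta}

For the data a b c a with β = 1 and V = 3 this yields a: 3/7, b: 2/7, c: 2/7, a slightly flatter distribution than the maximum-likelihood estimate above.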

“Chinese Restaurant” Processes

[Teh, 06, diagrams from Teh]

Dirichlet Process CRP:
  P(customer joins table k) ∝ c_k
  P(customer starts a new table) ∝ α
  (the word served at each table is drawn from the base distribution, e.g. θ_w = 1/V)

Pitman-Yor Process CRP:
  P(customer joins table k) ∝ c_k − d
  P(customer starts a new table) ∝ α + dK

where c_k is the number of customers at table k, K is the number of occupied tables, d is the discount, and α is the concentration parameter.
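A minimal sketch of the Pitman-Yor seating rule above (illustrative only, not Teh's code); setting d = 0 recovers the Dirichlet Process case.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    /** Toy Pitman-Yor Chinese Restaurant Process sampler; d = 0 gives the Dirichlet Process CRP. */
    public class PitmanYorCrp {
        private final double alpha;                                   // concentration parameter
        private final double d;                                       // discount, 0 <= d < 1
        private final List<Integer> tableCounts = new ArrayList<>();  // c_k for each occupied table
        private int totalCustomers = 0;
        private final Random rng = new Random(0);

        public PitmanYorCrp(double alpha, double d) { this.alpha = alpha; this.d = d; }

        /** Seat one customer and return the chosen table index (a fresh index for a new table). */
        public int seatCustomer() {
            int K = tableCounts.size();
            // Existing tables have weight c_k - d; a new table has weight alpha + d*K.
            // These weights sum to totalCustomers + alpha.
            double u = rng.nextDouble() * (totalCustomers + alpha);
            for (int k = 0; k < K; k++) {
                u -= tableCounts.get(k) - d;
                if (u < 0) {
                    tableCounts.set(k, tableCounts.get(k) + 1);
                    totalCustomers++;
                    return k;
                }
            }
            tableCounts.add(1);    // start a new table
            totalCustomers++;
            return K;
        }
    }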

Hierarchical Models

[Figure: a hierarchy of distributions: context-specific Θ_a, Θ_b, Θ_c, Θ_d, Θ_e, Θ_f each generate the words observed after their context, and are tied together under a shared base distribution Θ_0 / Θ_g]

[MacKay and Peto, 94; Teh, 06]


What Actually Works?

  • Trigrams and beyond:
    • Unigrams and bigrams are generally useless
    • Trigrams are much better (when there's enough data)
    • 4- and 5-grams are really useful in MT, but not so much for speech
  • Discounting
    • Absolute discounting, Good-Turing, held-out estimation, Witten-Bell
  • Context counting
    • Kneser-Ney construction of lower-order models
  • See the [Chen+Goodman] reading for tons of graphs!

[Graphs from Joshua Goodman]

Data >> Method?

  • Having more data is better…
  • … but so is using a better estimator
  • Another issue: N > 3 has huge costs in speech and MT decoders

[Figure: test entropy vs. n-gram order (1–20) for Katz and Kneser-Ney smoothing, with training sets of 100,000, 1,000,000, 10,000,000, and all tokens]

Tons of Data?

[Brants et al, 2007]

Large Scale Methods

  • Language models get big, fast

    • English Gigaword corpus: 2G tokens, 0.3G trigrams, 1.2G 5-grams
    • Google N-grams: 13M unigrams, 0.3G bigrams, ~1G each of 3-, 4-, and 5-grams
    • Need to access entries very often, ideally in memory

  • What do you do when language models get too big?

    • Distributing LMs across machines
    • Quantizing probabilities (toy sketch below)
    • Random hashing (e.g. Bloom filters) [Talbot and Osborne 07]
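As a toy illustration of the quantization idea (names and ranges here are assumptions, not from the lecture): log-probabilities can be binned into one-byte codes against a fixed uniform codebook, trading a small amount of accuracy for 1 byte per value instead of 4 or 8.

    /** Toy 8-bit quantizer for log-probabilities; the range [minLogProb, maxLogProb] is an assumption. */
    public class LogProbQuantizer {
        private final float minLogProb;  // assumed lower bound, e.g. -20.0f
        private final float step;        // width of each of the 256 bins

        public LogProbQuantizer(float minLogProb, float maxLogProb) {
            this.minLogProb = minLogProb;
            this.step = (maxLogProb - minLogProb) / 255f;
        }

        /** Map a log-probability to its bin index, stored in a single byte. */
        public byte quantize(float logProb) {
            int bin = Math.round((logProb - minLogProb) / step);
            return (byte) Math.max(0, Math.min(255, bin));
        }

        /** Recover the approximate log-probability for a stored code. */
        public float dequantize(byte code) {
            return minLogProb + (code & 0xFF) * step;
        }
    }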

A Simple Java Hashmap?

Per 3-gram:
  • 1 pointer = 8 bytes
  • 1 Map.Entry = 8 bytes (obj) + 3×8 bytes (pointers)
  • 1 Double = 8 bytes (obj) + 8 bytes (double)
  • 1 String[] = 8 bytes (obj) + 3×8 bytes (pointers)
  • … and that is at best, assuming the Strings are canonicalized
  • Total: > 88 bytes
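To put that in context: at more than 88 bytes per entry, just the 0.3G Gigaword trigrams would need over 0.3G × 88 bytes ≈ 26 GB of heap, before counting the strings themselves.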

Obvious alternatives:

  • Sorted arrays
  • Open addressing
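A minimal sketch of the sorted-array alternative, assuming each n-gram has already been packed into a single long key (the word+context encodings discussed next); a lookup is then a binary search over roughly 12 bytes per entry instead of 88+.

    import java.util.Arrays;

    /** Toy sorted-array n-gram store; keys are assumed to be n-grams pre-encoded as longs. */
    public class SortedNgramArray {
        private final long[] keys;     // encoded n-grams, sorted ascending
        private final float[] values;  // log-probabilities, parallel to keys

        public SortedNgramArray(long[] sortedKeys, float[] values) {
            this.keys = sortedKeys;
            this.values = values;
        }

        /** Binary-search lookup; returns notFoundValue if the n-gram was never seen. */
        public float logProb(long encodedNgram, float notFoundValue) {
            int idx = Arrays.binarySearch(keys, encodedNgram);
            return idx >= 0 ? values[idx] : notFoundValue;
        }
    }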

Word+Context Encodings


Compression

Memory Requirements

Speed and Caching

Full LM

LM Interfaces

Approximate LMs

  • Simplest option: hash-and-hope (see the sketch below)
    • Array of size K ~ N
    • (Optional) store a hash of the keys
    • Store values in direct-address or open addressing
    • Collisions: store the max
    • What kind of errors can there be?
  • More complex options, like Bloom filters (originally for membership, but see Talbot and Osborne 07), perfect hashing, etc.
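A minimal sketch of hash-and-hope with open addressing, again assuming n-grams are pre-encoded as long keys (all names and constants here are illustrative). Because only a fingerprint of each key is stored, two different n-grams can occasionally look identical; keeping the max value on such collisions means that errors, when they happen, can only overestimate a count or probability.

    /** Toy hash-and-hope store: open addressing over key fingerprints, keeping the max value on collisions. */
    public class HashAndHopeLM {
        private final int[] fingerprints;  // fingerprint of each stored key; 0 marks an empty slot
        private final float[] values;      // stored counts or log-probabilities
        private final int capacity;        // K ~ N; assumed never completely filled

        public HashAndHopeLM(int capacity) {
            this.capacity = capacity;
            this.fingerprints = new int[capacity];
            this.values = new float[capacity];
        }

        private int slot(long key) {
            return (int) Math.floorMod(key * 0x9E3779B97F4A7C15L, (long) capacity);
        }

        private int fingerprint(long key) {
            int h = (int) (key ^ (key >>> 32));
            return h == 0 ? 1 : h;  // reserve 0 for "empty"
        }

        public void put(long encodedNgram, float value) {
            int i = slot(encodedNgram), fp = fingerprint(encodedNgram);
            while (fingerprints[i] != 0 && fingerprints[i] != fp) i = (i + 1) % capacity;  // linear probing
            if (fingerprints[i] == fp) {
                values[i] = Math.max(values[i], value);  // collision: store the max
            } else {
                fingerprints[i] = fp;
                values[i] = value;
            }
        }

        public float get(long encodedNgram, float notFoundValue) {
            int i = slot(encodedNgram), fp = fingerprint(encodedNgram);
            while (fingerprints[i] != 0) {
                if (fingerprints[i] == fp) return values[i];  // may be a false positive
                i = (i + 1) % capacity;
            }
            return notFoundValue;
        }
    }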


Beyond N-Gram LMs

  • Lots of ideas we won’t have time to discuss:

    • Caching models: recent words more likely to appear again
    • Trigger models: recent words trigger other words
    • Topic models

  • A few other classes of ideas:

    • Syntactic models: use tree models to capture long-distance syntactic effects [Chelba and Jelinek, 98]
    • Discriminative models: set n-gram weights to improve final task accuracy rather than fit training set density [Roark, 05, for ASR; Liang et al., 06, for MT]
    • Structural zeros: some n-grams are syntactically forbidden; keep estimates at zero if they look like real zeros [Mohri and Roark, 06]
    • Bayesian document and IR models [Daume 06]