Smoothing
Statistical NLP, Spring 2011
Lecture 3: Language Models II
Dan Klein, UC Berkeley




Smoothing
• We often want to make estimates from sparse statistics. Example: after seeing "denied the" 7 times, the observed counts might be P(w | denied the): allegations 3, reports 2, claims 1, request 1 (7 total), with other plausible continuations (charges, benefits, motion, ...) unseen.
• Smoothing flattens spiky distributions so they generalize better, e.g. P(w | denied the): allegations 2.5, reports 1.5, claims 0.5, request 0.5, other 2 (7 total).
• Very important all over NLP, but easy to do badly!
• We'll illustrate with bigrams today (h = previous word, but the history could be anything).

Kneser-Ney
• Kneser-Ney smoothing combines these two ideas (a sketch follows below):
  • Absolute discounting
  • Lower-order continuation probabilities
• KN smoothing has repeatedly proven effective.
• Why should things work like this?

Predictive Distributions
• Parameter estimation: from the observed sequence a b a c, θ = P(w) = [a: 0.5, b: 0.25, c: 0.25].
• With a parameter variable: a random variable Θ whose children are the observations a b a c.
• Predictive distribution: the same model with an additional unobserved word W to predict.
(Figure: graphical-model diagrams for these three settings.)

Hierarchical Models
(Figure: a hierarchy of per-context distributions Θ_0, Θ_a, ..., Θ_g, each generating the words observed in its context.)
[MacKay and Peto, 94; Teh, 06]

"Chinese Restaurant" Processes
• Dirichlet Process: P(sit at existing table k) ∝ c_k; P(sit at a new table) ∝ α.
• Pitman-Yor Process: P(sit at existing table k) ∝ c_k − d; P(sit at a new table) ∝ α + dK. (A seating sketch follows below.)
• Each new table is assigned a word w according to the base distribution, e.g. 1/V for a uniform base over the vocabulary.
[Teh, 06; diagrams from Teh]
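To make the Kneser-Ney slide concrete, here is a minimal interpolated bigram sketch combining the two ideas it names, absolute discounting and lower-order continuation probabilities. The class, field names, and discount value are illustrative, not from the lecture.

    import java.util.*;

    /** Minimal interpolated Kneser-Ney bigram model (illustrative sketch). */
    class KneserNeyBigram {
        private final double d;                                                  // absolute discount, e.g. 0.75
        private final Map<String, Map<String, Integer>> bigram = new HashMap<>(); // c(v, w)
        private final Map<String, Integer> contextTotal = new HashMap<>();        // c(v)
        private final Map<String, Set<String>> continuations = new HashMap<>();   // distinct contexts v per word w
        private int bigramTypes = 0;                                              // |{(v, w) : c(v, w) > 0}|

        KneserNeyBigram(double discount) { this.d = discount; }

        void observe(String v, String w) {
            Map<String, Integer> row = bigram.computeIfAbsent(v, k -> new HashMap<>());
            if (row.merge(w, 1, Integer::sum) == 1) bigramTypes++;                // new bigram type
            contextTotal.merge(v, 1, Integer::sum);
            continuations.computeIfAbsent(w, k -> new HashSet<>()).add(v);
        }

        /** P_KN(w | v) = max(c(v,w) - d, 0)/c(v) + lambda(v) * P_continuation(w). */
        double prob(String v, String w) {
            if (bigramTypes == 0) return 0.0;                                     // nothing observed yet
            double pCont = continuations.getOrDefault(w, Collections.emptySet()).size()
                           / (double) bigramTypes;                                // fraction of bigram types ending in w
            int cv = contextTotal.getOrDefault(v, 0);
            if (cv == 0) return pCont;                                            // unseen context: back off entirely
            Map<String, Integer> row = bigram.get(v);
            int cvw = row.getOrDefault(w, 0);
            double lambda = d * row.size() / cv;                                  // mass freed by discounting
            return Math.max(cvw - d, 0) / cv + lambda * pCont;
        }
    }

The continuation probability counts how many distinct contexts a word completes rather than how often the word occurs, which is exactly the "lower-order continuation" idea on the slide.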

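The "Chinese Restaurant" Processes slide gives the seating rule only in proportional form. A small sketch of one seating step under those proportions, with illustrative class and RNG choices; setting d = 0 recovers the Dirichlet process.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    /** One-restaurant Pitman-Yor seating sketch; d = 0 gives the Dirichlet process. */
    class PitmanYorRestaurant {
        private final double alpha;                              // concentration parameter
        private final double d;                                  // discount, 0 <= d < 1
        private final List<Integer> counts = new ArrayList<>();  // customers per table, c_k
        private final Random rng = new Random();

        PitmanYorRestaurant(double alpha, double d) { this.alpha = alpha; this.d = d; }

        /** Seat one customer; returns the table index (index == old K means a new table). */
        int seatCustomer() {
            int n = counts.stream().mapToInt(Integer::intValue).sum();  // customers seated so far
            int K = counts.size();                                      // occupied tables
            double total = n + alpha;                                   // sum of all proportions
            double u = rng.nextDouble() * total;
            for (int k = 0; k < K; k++) {
                u -= counts.get(k) - d;                                 // P(table k) proportional to c_k - d
                if (u < 0) { counts.set(k, counts.get(k) + 1); return k; }
            }
            counts.add(1);                                              // P(new table) proportional to alpha + d*K
            return K;
        }
    }

The proportions sum to n + α, since the discounts removed from the K occupied tables (dK in total) are exactly the extra mass given to the new table.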
What Actually Works?
• Trigrams and beyond:
  • Unigrams and bigrams are generally useless on their own.
  • Trigrams are much better (when there's enough data).
  • 4- and 5-grams are really useful in MT, but not so much for speech.
• Discounting: absolute discounting, Good-Turing, held-out estimation, Witten-Bell.
• Context counting: Kneser-Ney construction of lower-order models.
• See the [Chen+Goodman] reading for tons of graphs!

Data >> Method?
• Having more data is better...
  (Graph: test entropy vs. n-gram order for Katz and Kneser-Ney smoothing with 100,000 / 1,000,000 / 10,000,000 / all training tokens; graphs from Joshua Goodman.)
• ... but so is using a better estimator.
• Another issue: N > 3 has huge costs in speech and MT decoders.

Tons of Data?
[Brants et al., 2007]

Large Scale Methods
• Language models get big, fast:
  • English Gigaword corpus: 2G tokens, 0.3G trigrams, 1.2G 5-grams.
  • Google N-grams: 13M unigrams, 0.3G bigrams, ~1G each of 3-, 4-, and 5-grams.
• Need to access entries very often, ideally in memory.
• What do you do when language models get too big?
  • Distributing LMs across machines
  • Quantizing probabilities (a sketch follows below)
  • Random hashing (e.g. Bloom filters) [Talbot and Osborne 07]

A Simple Java Hashmap?
• Per 3-gram:
  • 1 pointer = 8 bytes
  • 1 Map.Entry = 8 bytes (object) + 3 x 8 bytes (pointers)
  • 1 Double = 8 bytes (object) + 8 bytes (double)
  • 1 String[] = 8 bytes (object) + 3 x 8 bytes (pointers), and that is at best, assuming the Strings themselves are canonicalized
• Total: > 88 bytes per 3-gram.
• Obvious alternatives: sorted arrays, open addressing (a sketch follows below).

Word+Context Encodings
(Figure slide: how words and contexts are encoded compactly.)
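A hedged sketch of the "sorted arrays" alternative named on the Java hashmap slide, which implicitly costs out a map from a String[] key to a boxed Double value: if each word has already been mapped to an int id, a trigram can be packed into a single long and looked up by binary search over parallel primitive arrays. The 21-bits-per-word packing and float values are illustrative assumptions.

    import java.util.Arrays;

    /** Trigram table as parallel sorted arrays: roughly 12 bytes per entry instead of 88+. */
    class PackedTrigramTable {
        private final long[] keys;    // each trigram packed into one long: 21 bits per word id
        private final float[] values; // log-probabilities stored as floats

        /** keys must be packed with pack() and sorted ascending, aligned with values. */
        PackedTrigramTable(long[] sortedKeys, float[] values) {
            this.keys = sortedKeys;
            this.values = values;
        }

        /** Pack three word ids (each < 2^21, i.e. vocabularies up to ~2M words) into one long. */
        static long pack(int w1, int w2, int w3) {
            return ((long) w1 << 42) | ((long) w2 << 21) | w3;
        }

        /** Binary search over the sorted key array; returns NaN if the trigram is absent. */
        float logProb(int w1, int w2, int w3) {
            int i = Arrays.binarySearch(keys, pack(w1, w2, w3));
            return i >= 0 ? values[i] : Float.NaN;
        }
    }

At 8 bytes of key and 4 bytes of value this is on the order of 12 bytes per trigram, against the 88+ bytes tallied above; open addressing over the same packed keys trades the sort for constant-time lookups.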

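The "Quantizing probabilities" bullet can be made concrete with codebook quantization: store each value as a one-byte index into a small table of representative log-probabilities. The uniform 256-bin codebook below is an illustrative assumption, not the scheme of any particular system.

    /** Quantize log-probabilities to one byte each via a 256-entry codebook. */
    class LogProbQuantizer {
        private final float[] codebook = new float[256]; // representative value per bin

        /** Uniform bins between the smallest and largest log-prob in the model. */
        LogProbQuantizer(float minLogProb, float maxLogProb) {
            float step = (maxLogProb - minLogProb) / 255f;
            for (int i = 0; i < 256; i++) codebook[i] = minLogProb + i * step;
        }

        /** Map a log-probability to the nearest bin index (stored as one byte). */
        byte encode(float logProb) {
            float min = codebook[0];
            float step = codebook[1] - codebook[0];
            int bin = Math.round((logProb - min) / step);
            return (byte) Math.max(0, Math.min(255, bin));
        }

        /** Recover the (approximate) log-probability from its byte code. */
        float decode(byte code) {
            return codebook[code & 0xFF];
        }
    }

One byte per value in place of an 8-byte double is an 8x saving on the value side, at the cost of a small, bounded rounding error in every stored log-probability.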
Word+Context Encodings / Compression / Memory Requirements / Speed and Caching / LM Interfaces
(Figure slides: encoding layouts, compression schemes, and memory-requirement and speed/caching measurements for the full LM.)

Approximate LMs
• Simplest option: hash-and-hope (a sketch follows below):
  • Use an array of size K ~ N.
  • (Optionally) store a hash of each key.
  • Store values by direct addressing or open addressing.
  • Collisions: store the max.
  • What kind of errors can there be?
• More complex options: Bloom filters (originally for membership testing, but see Talbot and Osborne 07), perfect hashing, etc.
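A sketch of hash-and-hope as the Approximate LMs slide lists it: an array of size K ~ N indexed by a hash of the n-gram, an optional stored key hash to catch most collisions, and store-the-max on collision. The hash functions and types are illustrative choices.

    /** "Hash-and-hope" n-gram store: array indexed by hash, collisions keep the max. */
    class HashAndHopeLM {
        private final int[] keyCheck;   // secondary hash of the stored key (0 = empty slot)
        private final float[] value;    // stored log-probability (or count)

        HashAndHopeLM(int capacity) {   // capacity K chosen on the order of the number of n-grams N
            keyCheck = new int[capacity];
            value = new float[capacity];
        }

        private int slot(String ngram)  { return Math.floorMod(ngram.hashCode(), keyCheck.length); }
        private int check(String ngram) { return ngram.hashCode() * 0x9E3779B9 | 1; } // never 0

        void put(String ngram, float v) {
            int i = slot(ngram);
            if (keyCheck[i] == 0 || v > value[i]) {   // collision policy: store the max
                keyCheck[i] = check(ngram);
                value[i] = v;
            }
        }

        /** Returns NaN when the slot is empty or the check hash disagrees. */
        float get(String ngram) {
            int i = slot(ngram);
            return keyCheck[i] == check(ngram) ? value[i] : Float.NaN;
        }
    }

As for the errors the slide asks about: colliding n-grams share a slot, so under this sketch a lookup can either miss a stored n-gram whose slot was taken over by a larger value, or return another n-gram's value, and with store-the-max the returned values err on the high side.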

Beyond N-Gram LMs
• Lots of ideas we won't have time to discuss:
  • Caching models: recent words are more likely to appear again (a sketch follows below).
  • Trigger models: recent words trigger other words.
  • Topic models.
• A few other classes of ideas:
  • Syntactic models: use tree models to capture long-distance syntactic effects [Chelba and Jelinek, 98].
  • Discriminative models: set n-gram weights to improve final task accuracy rather than fit training-set density [Roark, 05, for ASR; Liang et al., 06, for MT].
  • Structural zeros: some n-grams are syntactically forbidden; keep estimates at zero if they look like real zeros [Mohri and Roark, 06].
  • Bayesian document and IR models [Daume 06].
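The caching-model bullet is often realized by interpolating the static model with a unigram model estimated from the last few hundred words. A minimal sketch under that interpretation; the window size, mixing weight lambda, and the staticProb callback are illustrative assumptions, not the lecture's formulation.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.ToDoubleFunction;

    /** Cache LM: mix a static model with a unigram model over the recent history. */
    class CacheLM {
        private final ToDoubleFunction<String> staticProb; // base model's P(w | its own context)
        private final double lambda;                       // weight on the cache component
        private final int window;                          // how many recent words to remember
        private final Deque<String> recent = new ArrayDeque<>();
        private final Map<String, Integer> cacheCounts = new HashMap<>();

        CacheLM(ToDoubleFunction<String> staticProb, double lambda, int window) {
            this.staticProb = staticProb;
            this.lambda = lambda;
            this.window = window;
        }

        /** P(w) = (1 - lambda) * P_static(w) + lambda * count_cache(w) / |cache|. */
        double prob(String w) {
            double pStatic = staticProb.applyAsDouble(w);
            if (recent.isEmpty()) return pStatic;
            double pCache = cacheCounts.getOrDefault(w, 0) / (double) recent.size();
            return (1 - lambda) * pStatic + lambda * pCache;
        }

        /** Slide the cache window forward after observing a word. */
        void observe(String w) {
            recent.addLast(w);
            cacheCounts.merge(w, 1, Integer::sum);
            if (recent.size() > window) {
                String old = recent.removeFirst();
                cacheCounts.merge(old, -1, Integer::sum);
                if (cacheCounts.get(old) == 0) cacheCounts.remove(old);
            }
        }
    }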
