


Statistical NLP

Spring 2011

Lecture 3: Language Models II

Dan Klein – UC Berkeley

Smoothing

  • We often want to make estimates from sparse statistics:
  • Smoothing flattens spiky distributions so they generalize better
  • Very important all over NLP, but easy to do badly!
  • We’ll illustrate with bigrams today (h = previous word, could be anything).

P(w | denied the), estimated from 7 observations:

  allegations 3, reports 2, claims 1, request 1   (7 total)

After smoothing, some mass is reallocated to unseen words (charges, motion, benefits, …):

  allegations 2.5, reports 1.5, claims 0.5, request 0.5, other 2   (7 total)

[Figure: bar charts of the two distributions over allegations, reports, claims, request, charges, motion, benefits]
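To make the reallocation concrete, here is a toy sketch (the code is illustrative, not from the lecture) that applies an absolute discount of d = 0.5 to each observed count; it reproduces the 2.5 / 1.5 / 0.5 / 0.5 figures above and reserves 4 × 0.5 = 2 counts of probability mass for unseen words.

    import java.util.LinkedHashMap;
    import java.util.Map;

    /** Toy illustration of absolute discounting on the "denied the" counts; d = 0.5 is an assumption. */
    public class AbsoluteDiscountingDemo {
        public static void main(String[] args) {
            Map<String, Double> counts = new LinkedHashMap<>();
            counts.put("allegations", 3.0);
            counts.put("reports", 2.0);
            counts.put("claims", 1.0);
            counts.put("request", 1.0);

            double d = 0.5;                        // discount subtracted from every seen count
            double total = 7.0;                    // total observations of "denied the"
            double reserved = d * counts.size();   // mass reallocated to unseen words ("other")

            for (Map.Entry<String, Double> e : counts.entrySet()) {
                System.out.printf("P(%s | denied the) = %.3f%n", e.getKey(), (e.getValue() - d) / total);
            }
            System.out.printf("P(other | denied the) = %.3f%n", reserved / total);
        }
    }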

Kneser-Ney

  • Kneser-Ney smoothing combines these two ideas:
    • Absolute discounting
    • Lower-order continuation probabilities
  • KN smoothing has repeatedly proven effective
  • Why should things work like this?
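For reference, a standard statement of interpolated Kneser-Ney for bigrams (the slide names the ideas but not the formula), in LaTeX notation:

    P_{KN}(w \mid w_{i-1}) = \frac{\max\bigl(c(w_{i-1}, w) - d,\ 0\bigr)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{cont}(w)

    P_{cont}(w) = \frac{\lvert \{\, w' : c(w', w) > 0 \,\} \rvert}{\lvert \{\, (w', w'') : c(w', w'') > 0 \,\} \rvert},
    \qquad
    \lambda(w_{i-1}) = \frac{d}{c(w_{i-1})} \, \lvert \{\, w : c(w_{i-1}, w) > 0 \,\} \rvert

The first term is the absolutely discounted bigram estimate; the continuation probability P_cont rewards words that have appeared after many different contexts rather than words that are merely frequent, and λ makes the two terms sum to one.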

Predictive Distributions

  • Parameter estimation: from the data a b c a, the maximum-likelihood estimate is
    θ = P(w) = [a: 0.5, b: 0.25, c: 0.25]
  • With parameter variable: treat Θ as a random variable alongside the observed words a b c a
  • Predictive distribution: average over Θ to predict the next word W

[Figure: graphical models for the three cases, with observed words a b c a, latent Θ, and query word W]
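One way to make the predictive-distribution picture concrete, assuming a symmetric Dirichlet(β) prior on Θ (the slide does not specify the prior): integrating out the parameter gives

    P(W = w \mid \text{data}) = \int P(w \mid \theta)\, p(\theta \mid \text{data})\, d\theta = \frac{c_w + \beta}{N + V\beta}

For the data a b c a with β = 1 and V = 3 this yields a: 3/7, b: 2/7, c: 2/7, a slightly flatter distribution than the maximum-likelihood estimate above.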

“Chinese Restaurant” Processes

[Teh, 06, diagrams from Teh]

Dirichlet Process CRP:
  P(customer joins table k) ∝ c_k
  P(customer starts a new table) ∝ α
  (the word served at each table is drawn from the base distribution, e.g. θ_w = 1/V)

Pitman-Yor Process CRP:
  P(customer joins table k) ∝ c_k − d
  P(customer starts a new table) ∝ α + dK

where c_k is the number of customers at table k, K is the number of occupied tables, d is the discount, and α is the concentration parameter.
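A minimal sketch of the Pitman-Yor seating rule above (illustrative only, not Teh's code); setting d = 0 recovers the Dirichlet Process case.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    /** Toy Pitman-Yor Chinese Restaurant Process sampler; d = 0 gives the Dirichlet Process CRP. */
    public class PitmanYorCrp {
        private final double alpha;                                   // concentration parameter
        private final double d;                                       // discount, 0 <= d < 1
        private final List<Integer> tableCounts = new ArrayList<>();  // c_k for each occupied table
        private int totalCustomers = 0;
        private final Random rng = new Random(0);

        public PitmanYorCrp(double alpha, double d) { this.alpha = alpha; this.d = d; }

        /** Seat one customer and return the chosen table index (a fresh index for a new table). */
        public int seatCustomer() {
            int K = tableCounts.size();
            // Existing tables have weight c_k - d; a new table has weight alpha + d*K.
            // These weights sum to totalCustomers + alpha.
            double u = rng.nextDouble() * (totalCustomers + alpha);
            for (int k = 0; k < K; k++) {
                u -= tableCounts.get(k) - d;
                if (u < 0) {
                    tableCounts.set(k, tableCounts.get(k) + 1);
                    totalCustomers++;
                    return k;
                }
            }
            tableCounts.add(1);    // start a new table
            totalCustomers++;
            return K;
        }
    }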

Hierarchical Models

[Figure: a hierarchy of distributions: context-specific Θ_a, Θ_b, Θ_c, Θ_d, Θ_e, Θ_f each generate the words observed after their context, and are tied together under a shared base distribution Θ_0 / Θ_g]

[MacKay and Peto, 94; Teh, 06]


What Actually Works?

  • Trigrams and beyond:
    • Unigrams and bigrams are generally useless
    • Trigrams are much better (when there's enough data)
    • 4- and 5-grams are really useful in MT, but not so much for speech
  • Discounting
    • Absolute discounting, Good-Turing, held-out estimation, Witten-Bell
  • Context counting
    • Kneser-Ney construction of lower-order models
  • See the [Chen+Goodman] reading for tons of graphs!

[Graphs from Joshua Goodman]

Data >> Method?

  • Having more data is better…
  • … but so is using a better estimator
  • Another issue: N > 3 has huge costs in speech and MT decoders

[Figure: test entropy vs. n-gram order (1–20) for Katz and Kneser-Ney smoothing, with training sets of 100,000, 1,000,000, 10,000,000, and all tokens]

Tons of Data?

[Brants et al, 2007]

Large Scale Methods

  • Language models get big, fast

    • English Gigaword corpus: 2G tokens, 0.3G trigrams, 1.2G 5-grams
    • Google N-grams: 13M unigrams, 0.3G bigrams, ~1G each of 3-, 4-, and 5-grams
    • Need to access entries very often, ideally in memory

  • What do you do when language models get too big?

    • Distributing LMs across machines
    • Quantizing probabilities (toy sketch below)
    • Random hashing (e.g. Bloom filters) [Talbot and Osborne 07]
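As a toy illustration of the quantization idea (names and ranges here are assumptions, not from the lecture): log-probabilities can be binned into one-byte codes against a fixed uniform codebook, trading a small amount of accuracy for 1 byte per value instead of 4 or 8.

    /** Toy 8-bit quantizer for log-probabilities; the range [minLogProb, maxLogProb] is an assumption. */
    public class LogProbQuantizer {
        private final float minLogProb;  // assumed lower bound, e.g. -20.0f
        private final float step;        // width of each of the 256 bins

        public LogProbQuantizer(float minLogProb, float maxLogProb) {
            this.minLogProb = minLogProb;
            this.step = (maxLogProb - minLogProb) / 255f;
        }

        /** Map a log-probability to its bin index, stored in a single byte. */
        public byte quantize(float logProb) {
            int bin = Math.round((logProb - minLogProb) / step);
            return (byte) Math.max(0, Math.min(255, bin));
        }

        /** Recover the approximate log-probability for a stored code. */
        public float dequantize(byte code) {
            return minLogProb + (code & 0xFF) * step;
        }
    }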

A Simple Java Hashmap?

Per 3-gram:
  • 1 pointer = 8 bytes
  • 1 Map.Entry = 8 bytes (obj) + 3×8 bytes (pointers)
  • 1 Double = 8 bytes (obj) + 8 bytes (double)
  • 1 String[] = 8 bytes (obj) + 3×8 bytes (pointers)
  • … and that is at best, assuming the Strings are canonicalized
  • Total: > 88 bytes
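To put that in context: at more than 88 bytes per entry, just the 0.3G Gigaword trigrams would need over 0.3G × 88 bytes ≈ 26 GB of heap, before counting the strings themselves.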

Obvious alternatives:

  • Sorted arrays
  • Open addressing
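A minimal sketch of the sorted-array alternative, assuming each n-gram has already been packed into a single long key (the word+context encodings discussed next); a lookup is then a binary search over roughly 12 bytes per entry instead of 88+.

    import java.util.Arrays;

    /** Toy sorted-array n-gram store; keys are assumed to be n-grams pre-encoded as longs. */
    public class SortedNgramArray {
        private final long[] keys;     // encoded n-grams, sorted ascending
        private final float[] values;  // log-probabilities, parallel to keys

        public SortedNgramArray(long[] sortedKeys, float[] values) {
            this.keys = sortedKeys;
            this.values = values;
        }

        /** Binary-search lookup; returns notFoundValue if the n-gram was never seen. */
        public float logProb(long encodedNgram, float notFoundValue) {
            int idx = Arrays.binarySearch(keys, encodedNgram);
            return idx >= 0 ? values[idx] : notFoundValue;
        }
    }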

Word+Context Encodings


Compression

Memory Requirements

Speed and Caching

Full LM

LM Interfaces

Approximate LMs

  • Simplest option: hash-and-hope (see the sketch below)
    • Array of size K ~ N
    • (Optional) store a hash of the keys
    • Store values in direct-address or open addressing
    • Collisions: store the max
    • What kind of errors can there be?
  • More complex options, like Bloom filters (originally for membership, but see Talbot and Osborne 07), perfect hashing, etc.
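A minimal sketch of hash-and-hope with open addressing, again assuming n-grams are pre-encoded as long keys (all names and constants here are illustrative). Because only a fingerprint of each key is stored, two different n-grams can occasionally look identical; keeping the max value on such collisions means that errors, when they happen, can only overestimate a count or probability.

    /** Toy hash-and-hope store: open addressing over key fingerprints, keeping the max value on collisions. */
    public class HashAndHopeLM {
        private final int[] fingerprints;  // fingerprint of each stored key; 0 marks an empty slot
        private final float[] values;      // stored counts or log-probabilities
        private final int capacity;        // K ~ N; assumed never completely filled

        public HashAndHopeLM(int capacity) {
            this.capacity = capacity;
            this.fingerprints = new int[capacity];
            this.values = new float[capacity];
        }

        private int slot(long key) {
            return (int) Math.floorMod(key * 0x9E3779B97F4A7C15L, (long) capacity);
        }

        private int fingerprint(long key) {
            int h = (int) (key ^ (key >>> 32));
            return h == 0 ? 1 : h;  // reserve 0 for "empty"
        }

        public void put(long encodedNgram, float value) {
            int i = slot(encodedNgram), fp = fingerprint(encodedNgram);
            while (fingerprints[i] != 0 && fingerprints[i] != fp) i = (i + 1) % capacity;  // linear probing
            if (fingerprints[i] == fp) {
                values[i] = Math.max(values[i], value);  // collision: store the max
            } else {
                fingerprints[i] = fp;
                values[i] = value;
            }
        }

        public float get(long encodedNgram, float notFoundValue) {
            int i = slot(encodedNgram), fp = fingerprint(encodedNgram);
            while (fingerprints[i] != 0) {
                if (fingerprints[i] == fp) return values[i];  // may be a false positive
                i = (i + 1) % capacity;
            }
            return notFoundValue;
        }
    }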


Beyond N-Gram LMs

  • Lots of ideas we won’t have time to discuss:

    • Caching models: recent words more likely to appear again
    • Trigger models: recent words trigger other words
    • Topic models

  • A few other classes of ideas:

    • Syntactic models: use tree models to capture long-distance syntactic effects [Chelba and Jelinek, 98]
    • Discriminative models: set n-gram weights to improve final task accuracy rather than fit training set density [Roark, 05, for ASR; Liang et al., 06, for MT]
    • Structural zeros: some n-grams are syntactically forbidden; keep estimates at zero if they look like real zeros [Mohri and Roark, 06]
    • Bayesian document and IR models [Daume 06]