SLIDE 1

Lecture 4: Language Model Evaluation and Advanced Methods

Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net
Course webpage: http://kwchang.net/teaching/NLP16

SLIDE 2

This lecture

- Kneser-Ney smoothing
- Discriminative language models
- Neural language models
- Evaluation: cross-entropy and perplexity

SLIDE 3

Recap: Smoothing

- Add-one smoothing
- Add-λ smoothing
  - parameters tuned by cross-validation
- Witten-Bell smoothing (see the sketch after this list)
  - T: # of word types, N: # of tokens
  - T/(N+T): total prob. mass for unseen words
  - N/(N+T): total prob. mass for observed tokens
- Good-Turing
  - Reallocate the probability mass of n-grams that occur r+1 times to n-grams that occur r times.
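To make the Witten-Bell allocation concrete, here is a minimal unigram sketch (function name and toy corpus are illustrative):

```python
from collections import Counter

def witten_bell_unigram(tokens):
    """Witten-Bell: reserve T/(N+T) probability mass for unseen words,
    where T = # of observed word types and N = # of tokens."""
    counts = Counter(tokens)
    N, T = sum(counts.values()), len(counts)
    unseen_mass = T / (N + T)                            # mass for unseen words
    seen = {w: c / (N + T) for w, c in counts.items()}   # sums to N/(N+T)
    return seen, unseen_mass

seen, unseen = witten_bell_unigram("the cat sat on the mat".split())
print(unseen)  # 5 types / (6 tokens + 5 types) = 5/11 ≈ 0.455
```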

SLIDE 4

Recap: Back-off and interpolation

- Idea: even if we've never seen "red glasses", we know it is more likely to occur than "red abacus"
- Interpolation:

  p_average(z | xy) = μ3 p(z | xy) + μ2 p(z | y) + μ1 p(z),
  where μ3 + μ2 + μ1 = 1 and all are ≥ 0
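In code, the interpolation is a one-liner; a sketch with illustrative weights (in practice the μ's are tuned on held-out data):

```python
def interpolated_trigram(p3, p2, p1, mu=(0.6, 0.3, 0.1)):
    """p_average(z|xy) = mu3*p(z|xy) + mu2*p(z|y) + mu1*p(z),
    with mu3 + mu2 + mu1 = 1 and all weights >= 0."""
    mu3, mu2, mu1 = mu
    assert abs(mu3 + mu2 + mu1 - 1.0) < 1e-9
    return mu3 * p3 + mu2 * p2 + mu1 * p1

# Even if the trigram estimate is 0 ("red glasses" never seen),
# the bigram and unigram terms keep the result nonzero:
print(interpolated_trigram(0.0, 0.02, 0.001))  # 0.0061
```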

SLIDE 5

Absolute Discounting

- Save ourselves some time and just subtract 0.75 (or some d)!
- But should we really just use the regular unigram P(w)?

  P_{AbsoluteDiscounting}(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i) - d}{c(w_{i-1})} + \lambda(w_{i-1})\, P(w_i)

  (first term: discounted bigram; λ: interpolation weight; P(w_i): unigram)

SLIDE 6

Kneser-Ney Smoothing

- Better estimate for probabilities of lower-order unigrams!
  - Shannon game: "I can't see without my reading ___?"
  - "Francisco" is more common than "glasses"
  - ... but "Francisco" always follows "San"

SLIDE 7

Kneser-Ney Smoothing

- Instead of P(w): "How likely is w?"
- P_CONTINUATION(w): "How likely is w to appear as a novel continuation?"
  - For each word, count the number of bigram types it completes
  - Every bigram type was a novel continuation the first time it was seen

  P_{CONTINUATION}(w) \propto \bigl|\{w_{i-1} : c(w_{i-1}, w) > 0\}\bigr|

SLIDE 8

Kneser-Ney Smoothing

- How many times does w appear as a novel continuation,
- normalized by the total number of bigram types:

  P_{CONTINUATION}(w) = \frac{\bigl|\{w_{i-1} : c(w_{i-1}, w) > 0\}\bigr|}{\bigl|\{(w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0\}\bigr|}

SLIDE 9

Kneser-Ney Smoothing

- Alternative metaphor: the number of word types seen to precede w,
- normalized by the number of word types preceding all words:

  P_{CONTINUATION}(w) = \frac{\bigl|\{w_{i-1} : c(w_{i-1}, w) > 0\}\bigr|}{\sum_{w'} \bigl|\{w'_{i-1} : c(w'_{i-1}, w') > 0\}\bigr|}

- A frequent word ("Francisco") occurring in only one context ("San") will have a low continuation probability
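A minimal sketch of the continuation count (the toy bigram list is illustrative):

```python
from collections import defaultdict

def continuation_probs(bigrams):
    """P_CONTINUATION(w) = |{w_prev : c(w_prev, w) > 0}| / # of bigram types:
    how many distinct contexts w completes, normalized."""
    contexts = defaultdict(set)
    bigram_types = set()
    for prev, w in bigrams:
        contexts[w].add(prev)
        bigram_types.add((prev, w))
    return {w: len(prevs) / len(bigram_types) for w, prevs in contexts.items()}

pairs = [("San", "Francisco"), ("San", "Francisco"), ("reading", "glasses"),
         ("red", "glasses"), ("new", "glasses")]
pc = continuation_probs(pairs)
print(pc["Francisco"], pc["glasses"])  # 0.25 vs. 0.75: "glasses" completes more contexts
```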

SLIDE 10

Kneser-Ney Smoothing

  P_{KN}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1} w_i) - d,\ 0)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{CONTINUATION}(w_i)

  \lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\ \bigl|\{w : c(w_{i-1}, w) > 0\}\bigr|

- λ is a normalizing constant: the probability mass we've discounted
  - d / c(w_{i-1}): the normalized discount
  - |{w : c(w_{i-1}, w) > 0}|: the number of word types that can follow w_{i-1} = # of word types we discounted = # of times we applied the normalized discount
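Putting the pieces together, a rough sketch of interpolated Kneser-Ney for bigrams (it assumes every queried history was seen in training; a real implementation also handles unseen histories and uses the recursive formulation on the next slide):

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, d=0.75):
    bigrams = list(zip(tokens, tokens[1:]))
    c_bi = Counter(bigrams)
    c_hist = Counter(tokens[:-1])        # history counts c(w_{i-1})
    followers = defaultdict(set)         # word types that follow each history
    contexts = defaultdict(set)          # word types that precede each word
    for prev, w in bigrams:
        followers[prev].add(w)
        contexts[w].add(prev)
    n_bigram_types = len(c_bi)

    def p_kn(w, prev):
        p_cont = len(contexts[w]) / n_bigram_types
        lam = d * len(followers[prev]) / c_hist[prev]   # discounted mass
        return max(c_bi[(prev, w)] - d, 0) / c_hist[prev] + lam * p_cont
    return p_kn

p = kneser_ney_bigram("San Francisco is foggy but San Jose is sunny".split())
print(p("Francisco", "San"))  # 0.25/2 + 0.75 * 1/8 = 0.21875
```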

SLIDE 11

Kneser-Ney Smoothing: Recursive formulation

  P_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max(c_{KN}(w_{i-n+1}^{i}) - d,\ 0)}{c_{KN}(w_{i-n+1}^{i-1})} + \lambda(w_{i-n+1}^{i-1})\, P_{KN}(w_i \mid w_{i-n+2}^{i-1})

  c_{KN}(\cdot) = \begin{cases} \text{count}(\cdot) & \text{for the highest order} \\ \text{continuation count}(\cdot) & \text{for lower orders} \end{cases}

- Continuation count = number of unique single-word contexts for ·

SLIDE 12

Practical issue: huge web-scale n-grams

- How to deal with, e.g., the Google N-gram corpus?
- Pruning
  - Only store n-grams with count > threshold
  - Remove singletons of higher-order n-grams
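Pruning is just a filter over the count table; a sketch (threshold illustrative; the public Google Web 1T corpus reportedly kept only n-grams occurring 40+ times):

```python
def prune_ngrams(counts, threshold=1):
    """Keep only n-grams whose count exceeds the threshold."""
    return {ng: c for ng, c in counts.items() if c > threshold}

counts = {("the", "red", "glasses"): 12, ("the", "red", "abacus"): 1}
print(prune_ngrams(counts))  # the singleton trigram is dropped
```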

SLIDE 13

Huge web-scale n-grams

- Efficiency
  - Efficient data structures, e.g. tries (sketch below)
  - Store words as indexes, not strings
  - Quantize probabilities (4-8 bits instead of an 8-byte float)

https://en.wikipedia.org/wiki/Trie
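A count trie in miniature (strings are stored here for readability; per the slide, real systems store integer word indexes and quantized probabilities):

```python
class TrieNode:
    """Each path from the root spells an n-gram prefix, stored once."""
    __slots__ = ("children", "count")
    def __init__(self):
        self.children, self.count = {}, 0

def add_ngram(root, ngram):
    node = root
    for w in ngram:
        node = node.children.setdefault(w, TrieNode())
        node.count += 1          # shared prefixes share storage and counts

root = TrieNode()
add_ngram(root, ("the", "red", "glasses"))
add_ngram(root, ("the", "red", "shoes"))
print(root.children["the"].children["red"].count)  # 2
```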

SLIDE 14

Smoothing

"This dark art is why NLP is taught in the engineering school." (J. Eisner, 600.465 Intro to NLP)

There are more principled smoothing methods, too. We'll look next at log-linear models, which are a good and popular general technique.
SLIDE 15

Conditional Modeling

- Generative language model (trigram model):

  P(x_1, \ldots, x_n) = P(x_1)\, P(x_2 \mid x_1) \cdots P(x_n \mid x_{n-2}, x_{n-1})

- Then, we compute the conditional probabilities by maximum likelihood estimation
- Can we model P(x_i | x_{i-2}, x_{i-1}) directly?
- Given a context x, which outcomes y are likely in that context?

  P(NextWord = y | PrecedingWords = x)

(Slide credit: J. Eisner, 600.465 Intro to NLP)

SLIDE 16

Modeling conditional probabilities

- Let's assume

  P(y \mid x) = \frac{\exp(\text{score}(x, y))}{\sum_{y'} \exp(\text{score}(x, y'))}

  where y = NextWord and x = PrecedingWords

- P(y | x) is high ⇔ score(x, y) is high
- This is called the soft-max
- It requires P(y | x) ≥ 0 and Σ_y P(y | x) = 1, which is not true of the raw score(x, y)
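A small sketch of the soft-max (the max-shift is a standard numerical-stability trick, not something the slide requires):

```python
import math

def softmax_probs(scores):
    """P(y|x) = exp(score(x,y)) / sum_y' exp(score(x,y')): nonnegative and
    sums to 1, even though the raw scores need not."""
    m = max(scores.values())
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    Z = sum(exps.values())
    return {y: e / Z for y, e in exps.items()}

print(softmax_probs({"glasses": 2.0, "shoes": 1.0, "abacus": -1.0}))
```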

SLIDE 17

Linear Scoring

- Score(x, y): how well does y go with x?
- Simplest option: a linear function of (x, y). But (x, y) isn't a number, so describe it by some numbers (i.e., numeric features), then use a linear function of those numbers:

  \text{Score}(x, y) = \sum_k \theta_k f_k(x, y)

  - k ranges over all features
  - f_k(x, y): whether (x, y) has feature k (0 or 1), or how many times it fires (≥ 0), or how strongly it fires (a real number)
  - θ_k: weight of the k-th feature, to be learned

SLIDE 18

What features should we use?

- Model p(x_i | x_{i-1}, x_{i-2}). Features f_k for Score((x_{i-2}, x_{i-1}), x_i) can be:
  - the number of times "x_{i-1}" appears in the training corpus
  - 1 if "x_i" is an unseen word, 0 otherwise
  - 1 if "x_{i-2} x_{i-1}" = "a red", 0 otherwise
  - 1 if "x_{i-1}" belongs to the "color" category, 0 otherwise
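As a sketch, those features and the linear score might look like this (the category set, corpus counts, and weights are invented for illustration):

```python
COLORS = {"red", "blue", "green", "yellow"}   # hypothetical word category
TRAIN_COUNTS = {"red": 120, "glasses": 35}    # toy corpus statistics

def features(context, y):
    """Count/binary features of (x, y), mirroring the slide's examples."""
    prev2, prev1 = context                    # x_{i-2}, x_{i-1}
    return {
        "count(prev1)":   TRAIN_COUNTS.get(prev1, 0),
        "y_unseen":       1 if y not in TRAIN_COUNTS else 0,
        "ctx=a_red":      1 if (prev2, prev1) == ("a", "red") else 0,
        "prev1_is_color": 1 if prev1 in COLORS else 0,
    }

def score(theta, context, y):
    """Linear score: sum_k theta_k * f_k(x, y)."""
    return sum(theta.get(k, 0.0) * v for k, v in features(context, y).items())

theta = {"ctx=a_red": 1.5, "y_unseen": -2.0, "prev1_is_color": 0.5}
print(score(theta, ("a", "red"), "glasses"))  # 1.5 + 0 + 0.5 = 2.0
```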

SLIDE 19

What features should we use?

- Model p("glasses" | "a red"). Features f_k(("a", "red"), "glasses") for Score(("a", "red"), "glasses") can be:
  - the number of times "red" appears in the training corpus
  - 1 if "glasses" is an unseen word, 0 otherwise
  - 1 if "a red" = "a red", 0 otherwise
  - 1 if "red" belongs to the "color" category, 0 otherwise

SLIDE 20

Log-Linear Conditional Probability

  p(y \mid x) = \frac{\exp(\text{Score}(x, y))}{Z(x)}

  The numerator is an unnormalized probability (at least it's positive!). We choose Z(x) = \sum_{y'} \exp(\text{Score}(x, y')) to ensure that \sum_y p(y \mid x) = 1. Z(x) is called the partition function.

(Slide credit: J. Eisner, 600.465 Intro to NLP)

SLIDE 21

Training θ

- n training examples; feature functions f_1, f_2, ...
- Want to maximize p(training data | θ):

  \prod_{i=1}^{n} p(y_i \mid x_i; \theta)

- Easier to maximize the log of that:

  \sum_{i=1}^{n} \log p(y_i \mid x_i; \theta)

- Alas, some weights θ_k may be optimal at -∞ or +∞. When would this happen? What's going "wrong"?

This version is "discriminative training": to learn to predict y from x, maximize p(y | x). Whereas in "generative models", we learn to model x too, by maximizing p(x, y).

SLIDE 22

Generalization via Regularization

- n training examples; feature functions f_1, f_2, ...
- Want to maximize p(training data | θ) ⋅ p_prior(θ); easier to maximize the log of that
- Encourages weights close to 0
- "L2 regularization" corresponds to a Gaussian prior:

  p(\theta) \propto e^{-\|\theta\|^2 / (2\sigma^2)}

SLIDE 23

Gradient-based training

- Gradually adjust θ in a direction that improves the objective

Gradient ascent to gradually increase f(θ):

  while (∇f(θ) ≠ 0)        // not at a local max or min
      θ = θ + η ∇f(θ)      // for some small learning rate η > 0

Remember: ∇f(θ) = (∂f(θ)/∂θ_1, ∂f(θ)/∂θ_2, ...), so the update means θ_k += η ∂f(θ)/∂θ_k.
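The ascent loop in code, on a toy concave objective (learning rate and tolerance are illustrative):

```python
def gradient_ascent(grad_f, theta, eta=0.01, tol=1e-6, max_iters=100_000):
    """theta <- theta + eta * grad_f(theta) until the gradient vanishes."""
    for _ in range(max_iters):
        g = grad_f(theta)
        if all(abs(gk) < tol for gk in g):
            break
        theta = [tk + eta * gk for tk, gk in zip(theta, g)]
    return theta

# maximize f(t) = -(t1 - 3)^2 - (t2 + 1)^2, so grad f = (-2(t1-3), -2(t2+1))
print(gradient_ascent(lambda t: [-2 * (t[0] - 3), -2 * (t[1] + 1)], [0.0, 0.0]))
# -> approximately [3.0, -1.0]
```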

SLIDE 24

Gradient-based training

- Gradually adjust θ in a direction that improves the objective
- Gradient w.r.t. θ: for the log-linear model, the partial derivative of the conditional log-likelihood is the observed feature total minus the expected feature total:

  \frac{\partial}{\partial \theta_k} \sum_i \log p(y_i \mid x_i; \theta) = \sum_i \Bigl( f_k(x_i, y_i) - \sum_{y} p(y \mid x_i; \theta)\, f_k(x_i, y) \Bigr)
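A sketch of that gradient as observed-minus-expected feature counts, reusing the softmax_probs and score sketches from the earlier slides:

```python
def loglik_gradient(data, theta, feats, candidates):
    """d/d theta_k of sum_i log p(y_i|x_i; theta) =
    sum_i [ f_k(x_i, y_i) - E_{p(y|x_i; theta)} f_k(x_i, y) ]."""
    grad = {}
    for x, y in data:
        probs = softmax_probs({yc: score(theta, x, yc) for yc in candidates})
        for k, v in feats(x, y).items():            # observed features
            grad[k] = grad.get(k, 0.0) + v
        for yc, p in probs.items():                 # expected features
            for k, v in feats(x, yc).items():
                grad[k] = grad.get(k, 0.0) - p * v
    return grad
```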

SLIDE 25

More complex assumption?

v ๐‘„(๐‘ง|๐‘ฆ) = exp(score x,y )/ โˆ‘ exp(๐‘ก๐‘‘๐‘๐‘ ๐‘“ ๐‘ฆ,๐‘งโ€ฒ )

๐‘งโ€ฒ

Y: NextWord, x: PrecedingWords v Assume we saw:

What is P(shoes; blue)?

v Can we learn categories of words(representation) automatically? v Can we build a high order n-gram model without blowing up the model size?

25

red glasses; yellow glasses; green glasses; blue glasses red shoes; yellow shoes; green shoes;

6501 Natural Language Processing

SLIDE 26

Neural language model

v Model ๐‘„(๐‘ง|๐‘ฆ) with a neural network

26

Example 1: One hot vector: each component of the vector represents one word [0, 0, 1, 0, 0] Example 2: word embeddings

6501 Natural Language Processing

SLIDE 27

Neural language model

v Model ๐‘„(๐‘ง|๐‘ฆ) with a neural network

27

Learned matrices to project the input vectors Obtain (y|x) by performing softmax Concatenate projected vectors Non-linear function e.g., โ„Ž = tanh (๐‘‹b ๐‘‘ โƒ‘ + ๐‘)
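A forward pass of this architecture in NumPy, with invented dimensions for illustration (training would fit E, W1, b1, W2, b2 by back-propagation):

```python
import numpy as np

V, d, h, n = 10_000, 64, 128, 2      # vocab size, embedding dim, hidden dim, context length
rng = np.random.default_rng(0)
E  = rng.normal(0, 0.1, (V, d))      # word embeddings (a lookup replaces the one-hot matmul)
W1 = rng.normal(0, 0.1, (h, n * d))  # projects the concatenated context
b1 = np.zeros(h)
W2 = rng.normal(0, 0.1, (V, h))      # maps the hidden state to vocabulary scores
b2 = np.zeros(V)

def next_word_probs(context_ids):
    c = np.concatenate([E[i] for i in context_ids])  # concatenate projected vectors
    hidden = np.tanh(W1 @ c + b1)                    # non-linearity: h = tanh(W c + b)
    logits = W2 @ hidden + b2
    logits -= logits.max()                           # numerical stability
    p = np.exp(logits)
    return p / p.sum()                               # softmax over next words

print(next_word_probs([17, 42]).sum())  # ~1.0
```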

SLIDE 28

Why?

- Potentially generalize to unseen contexts
  - Example: P("red" | "the", "shoes", "are")
  - This does not occur in the training corpus, but ["the", "glasses", "are", "red"] does.
  - If the word representations of "shoes" and "glasses" are similar, then the model can generalize.
- Why are "red" and "blue" similar?
  - Because the NN saw "red skirt", "blue skirt", "red pen", "blue pen", etc.

SLIDE 29

Training neural language models

- Can use gradient ascent as well
- Use the chain rule to derive the gradient, a.k.a. back-propagation
- More complex NN architectures can be used, e.g., LSTMs, character-based models

SLIDE 30

Language model evaluation

- How do we compare models?
  - We need an unseen test set. Why?
- Information theory: the study of the resolution of uncertainty
  - Perplexity: measures how well a probability distribution predicts a sample

SLIDE 31

Cross-Entropy

- A common measure of model quality
  - Task-independent
  - Continuous: slight improvements show up here even if they don't change the # of right answers on a task
- Just measure the probability of (enough) test data
  - Higher probability means the model better predicts the future
  - There's a limit to how well you can predict random stuff
  - The limit depends on "how random" the dataset is (easier to predict weather than headlines, especially in Arizona)

SLIDE 32

Cross-Entropy ("xent")

- Want the probability of the test data to be high:

  p(h | <S>, <S>) * p(o | <S>, h) * p(r | h, o) * p(s | o, r) ... = 1/8 * 1/8 * 1/8 * 1/16 ...

- High prob → low xent, by 3 cosmetic improvements:
  - Take the logarithm (base 2) to prevent underflow:
    log₂(1/8 * 1/8 * 1/8 * 1/16 ...) = log₂ 1/8 + log₂ 1/8 + log₂ 1/8 + log₂ 1/16 ... = (-3) + (-3) + (-3) + (-4) + ...
  - Negate to get a positive value in bits: 3 + 3 + 3 + 4 + ...
  - Divide by the length of the text → 3.25 bits per letter (or per word)

- Average? A geometric average of the probabilities:

  (1/2³ · 1/2³ · 1/2³ · 1/2⁴)^(1/4) = 1/2^3.25 ≈ 1/9.5
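The whole computation in a few lines, using the slide's toy probabilities:

```python
import math

def cross_entropy_bits(probs):
    """Per-token cross-entropy: -(1/N) * sum_i log2 p(w_i | history)."""
    return -sum(math.log2(p) for p in probs) / len(probs)

probs = [1/8, 1/8, 1/8, 1/16]   # the slide's toy example
xent = cross_entropy_bits(probs)
print(xent)        # (3 + 3 + 3 + 4) / 4 = 3.25 bits per symbol
print(2 ** xent)   # perplexity = 2^xent ≈ 9.5
```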

SLIDE 33

Cross-Entropy ("xent")

- Want the probability of the test data to be high:

  p(h | <S>, <S>) * p(o | <S>, h) * p(r | h, o) * p(s | o, r) ... = 1/8 * 1/8 * 1/8 * 1/16 ...

- Cross-entropy → 3.25 bits per letter (or per word)
- Want this to be small (equivalent to wanting good compression!)
- The lower limit is called entropy: obtained in principle as the cross-entropy of the true model, measured on an infinite amount of data
- Perplexity = 2^xent (meaning ≈ 9.5 choices per letter)

SLIDE 34

More math: Entropy H(X)

- The entropy H(p) of a discrete random variable X is the expected negative log probability:

  H(p) = -\sum_x p(x) \log_2 p(x)

- Entropy is a measure of uncertainty

SLIDE 35

Entropy of coin tossing

- Toss a coin: P(H) = p, P(T) = 1 - p

  H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)

  - p = 0.5: H(p) = 1
  - p = 1: H(p) = 0


SLIDE 37

How many bits to encode messages

- Consider four letters (A, B, C, D). How many bits per letter, on average, are needed to encode a message ~ p?
- If p = (1/2, 1/2, 0, 0):
  - Encode A as 0, B as 1; AAABBBAA ⇒ 00011100 (1 bit per letter)
- If p = (1/4, 1/4, 1/4, 1/4):
  - A: 00, B: 01, C: 10, D: 11; ABDA ⇒ 00011100 (2 bits per letter)
- How about p = (1/2, 1/4, 1/4, 0)?
  - A: 0, B: 10, C: 11; AAACBA ⇒ 00011100 (1.5 bits per letter on average under p)

SLIDE 38

More math: Cross Entropy

- Cross-entropy: the average # of bits needed to encode events ~ p(x) using a coding scheme m(x):

  H(p, m) = -\sum_x p(x) \log_2 m(x)

  - Not symmetric: H(p, m) ≠ H(m, p)
  - Lower bounded by H(p)

- Let p = (1/2, 1/4, 1/4, 0)
  - We encode A: 00, B: 01, C: 10, D: 11 (i.e., m = (1/4, 1/4, 1/4, 1/4))
  - AAACBA? ⇒ 000000100100 (2 bits per letter, vs. H(p) = 1.5)
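Both quantities in code, checking the slide's numbers:

```python
import math

def H(p):
    """Entropy: expected negative log2 probability."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def xent(p, m):
    """Cross-entropy H(p, m): avg. bits to encode events ~ p with code m."""
    return -sum(pi * math.log2(mi) for pi, mi in zip(p, m) if pi > 0)

p = [0.5, 0.25, 0.25, 0.0]      # true distribution over A, B, C, D
m = [0.25, 0.25, 0.25, 0.25]    # the fixed-length 2-bit code
print(H(p))        # 1.5 bits (achieved by A:0, B:10, C:11)
print(xent(p, m))  # 2.0 bits; H(p, m) >= H(p)
```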

SLIDE 39

Perplexity and geometric mean

  \text{Perplexity}(w_1 \ldots w_N) = 2^{H(w_1 \ldots w_N)} = 2^{-\frac{1}{N} \log_2 m(w_1 \ldots w_N)} = m(w_1 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{m(w_1 \ldots w_N)}}

(Perplexity is the inverse of the geometric mean of the per-word probabilities.)

SLIDE 40

An experiment

- Train: 38M words of WSJ text, |V| = 20k
- Test: 1.5M words of WSJ text

Results:

  Model:       Unigram   Bigram   Trigram
  Perplexity:  962       170      109

- Word-level LSTM: ~85
- Char-level: ~79