SLIDE 1

Lecture 4: Language Model Evaluation and Advanced Methods

Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net
Course webpage: http://kwchang.net/teaching/NLP16

SLIDE 2

This lecture

- Kneser-Ney smoothing
- Discriminative language models
- Neural language models
- Evaluation: cross-entropy and perplexity

SLIDE 3

Recap: Smoothing

- Add-one smoothing
- Add-λ smoothing
  - parameters tuned by cross-validation
- Witten-Bell smoothing (see the sketch after this list)
  - T: # of word types, N: # of tokens
  - T/(N+T): total prob. mass for unseen words
  - N/(N+T): total prob. mass for observed tokens
- Good-Turing
  - Reallocate the probability mass of n-grams that occur r+1 times to n-grams that occur r times.
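To make the Witten-Bell allocation concrete, here is a minimal unigram sketch (function name and toy corpus are illustrative):

```python
from collections import Counter

def witten_bell_unigram(tokens):
    """Witten-Bell: reserve T/(N+T) probability mass for unseen words,
    where T = # of observed word types and N = # of tokens."""
    counts = Counter(tokens)
    N, T = sum(counts.values()), len(counts)
    unseen_mass = T / (N + T)                            # mass for unseen words
    seen = {w: c / (N + T) for w, c in counts.items()}   # sums to N/(N+T)
    return seen, unseen_mass

seen, unseen = witten_bell_unigram("the cat sat on the mat".split())
print(unseen)  # 5 types / (6 tokens + 5 types) = 5/11 ≈ 0.455
```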

SLIDE 4

Recap: Back-off and interpolation

- Idea: even if we've never seen "red glasses", we know it is more likely to occur than "red abacus"
- Interpolation:

  p_average(z | xy) = μ3 p(z | xy) + μ2 p(z | y) + μ1 p(z),
  where μ3 + μ2 + μ1 = 1 and all are ≥ 0
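In code, the interpolation is a one-liner; a sketch with illustrative weights (in practice the μ's are tuned on held-out data):

```python
def interpolated_trigram(p3, p2, p1, mu=(0.6, 0.3, 0.1)):
    """p_average(z|xy) = mu3*p(z|xy) + mu2*p(z|y) + mu1*p(z),
    with mu3 + mu2 + mu1 = 1 and all weights >= 0."""
    mu3, mu2, mu1 = mu
    assert abs(mu3 + mu2 + mu1 - 1.0) < 1e-9
    return mu3 * p3 + mu2 * p2 + mu1 * p1

# Even if the trigram estimate is 0 ("red glasses" never seen),
# the bigram and unigram terms keep the result nonzero:
print(interpolated_trigram(0.0, 0.02, 0.001))  # 0.0061
```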

SLIDE 5

Absolute Discounting

- Save ourselves some time and just subtract 0.75 (or some d)!
- But should we really just use the regular unigram P(w)?

  P_{AbsoluteDiscounting}(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i) - d}{c(w_{i-1})} + \lambda(w_{i-1})\, P(w_i)

  (first term: discounted bigram; λ: interpolation weight; P(w_i): unigram)

SLIDE 6

Kneser-Ney Smoothing

- Better estimate for probabilities of lower-order unigrams!
  - Shannon game: "I can't see without my reading ___?"
  - "Francisco" is more common than "glasses"
  - ... but "Francisco" always follows "San"

SLIDE 7

Kneser-Ney Smoothing

- Instead of P(w): "How likely is w?"
- P_CONTINUATION(w): "How likely is w to appear as a novel continuation?"
  - For each word, count the number of bigram types it completes
  - Every bigram type was a novel continuation the first time it was seen

  P_{CONTINUATION}(w) \propto \bigl|\{w_{i-1} : c(w_{i-1}, w) > 0\}\bigr|

SLIDE 8

Kneser-Ney Smoothing

- How many times does w appear as a novel continuation,
- normalized by the total number of bigram types:

  P_{CONTINUATION}(w) = \frac{\bigl|\{w_{i-1} : c(w_{i-1}, w) > 0\}\bigr|}{\bigl|\{(w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0\}\bigr|}

SLIDE 9

Kneser-Ney Smoothing

- Alternative metaphor: the number of word types seen to precede w,
- normalized by the number of word types preceding all words:

  P_{CONTINUATION}(w) = \frac{\bigl|\{w_{i-1} : c(w_{i-1}, w) > 0\}\bigr|}{\sum_{w'} \bigl|\{w'_{i-1} : c(w'_{i-1}, w') > 0\}\bigr|}

- A frequent word ("Francisco") occurring in only one context ("San") will have a low continuation probability
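A minimal sketch of the continuation count (the toy bigram list is illustrative):

```python
from collections import defaultdict

def continuation_probs(bigrams):
    """P_CONTINUATION(w) = |{w_prev : c(w_prev, w) > 0}| / # of bigram types:
    how many distinct contexts w completes, normalized."""
    contexts = defaultdict(set)
    bigram_types = set()
    for prev, w in bigrams:
        contexts[w].add(prev)
        bigram_types.add((prev, w))
    return {w: len(prevs) / len(bigram_types) for w, prevs in contexts.items()}

pairs = [("San", "Francisco"), ("San", "Francisco"), ("reading", "glasses"),
         ("red", "glasses"), ("new", "glasses")]
pc = continuation_probs(pairs)
print(pc["Francisco"], pc["glasses"])  # 0.25 vs. 0.75: "glasses" completes more contexts
```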

SLIDE 10

Kneser-Ney Smoothing

  P_{KN}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1} w_i) - d,\ 0)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{CONTINUATION}(w_i)

  \lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\ \bigl|\{w : c(w_{i-1}, w) > 0\}\bigr|

- λ is a normalizing constant: the probability mass we've discounted
  - d / c(w_{i-1}): the normalized discount
  - |{w : c(w_{i-1}, w) > 0}|: the number of word types that can follow w_{i-1} = # of word types we discounted = # of times we applied the normalized discount
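Putting the pieces together, a rough sketch of interpolated Kneser-Ney for bigrams (it assumes every queried history was seen in training; a real implementation also handles unseen histories and uses the recursive formulation on the next slide):

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, d=0.75):
    bigrams = list(zip(tokens, tokens[1:]))
    c_bi = Counter(bigrams)
    c_hist = Counter(tokens[:-1])        # history counts c(w_{i-1})
    followers = defaultdict(set)         # word types that follow each history
    contexts = defaultdict(set)          # word types that precede each word
    for prev, w in bigrams:
        followers[prev].add(w)
        contexts[w].add(prev)
    n_bigram_types = len(c_bi)

    def p_kn(w, prev):
        p_cont = len(contexts[w]) / n_bigram_types
        lam = d * len(followers[prev]) / c_hist[prev]   # discounted mass
        return max(c_bi[(prev, w)] - d, 0) / c_hist[prev] + lam * p_cont
    return p_kn

p = kneser_ney_bigram("San Francisco is foggy but San Jose is sunny".split())
print(p("Francisco", "San"))  # 0.25/2 + 0.75 * 1/8 = 0.21875
```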

SLIDE 11

Kneser-Ney Smoothing: Recursive formulation

  P_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max(c_{KN}(w_{i-n+1}^{i}) - d,\ 0)}{c_{KN}(w_{i-n+1}^{i-1})} + \lambda(w_{i-n+1}^{i-1})\, P_{KN}(w_i \mid w_{i-n+2}^{i-1})

  c_{KN}(\cdot) = \begin{cases} \text{count}(\cdot) & \text{for the highest order} \\ \text{continuation count}(\cdot) & \text{for lower orders} \end{cases}

- Continuation count = number of unique single-word contexts for ·

SLIDE 12

Practical issue: huge web-scale n-grams

- How to deal with, e.g., the Google N-gram corpus?
- Pruning
  - Only store n-grams with count > threshold
  - Remove singletons of higher-order n-grams
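Pruning is just a filter over the count table; a sketch (threshold illustrative; the public Google Web 1T corpus reportedly kept only n-grams occurring 40+ times):

```python
def prune_ngrams(counts, threshold=1):
    """Keep only n-grams whose count exceeds the threshold."""
    return {ng: c for ng, c in counts.items() if c > threshold}

counts = {("the", "red", "glasses"): 12, ("the", "red", "abacus"): 1}
print(prune_ngrams(counts))  # the singleton trigram is dropped
```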

SLIDE 13

Huge web-scale n-grams

- Efficiency
  - Efficient data structures, e.g. tries (sketch below)
  - Store words as indexes, not strings
  - Quantize probabilities (4-8 bits instead of an 8-byte float)

https://en.wikipedia.org/wiki/Trie
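A count trie in miniature (strings are stored here for readability; per the slide, real systems store integer word indexes and quantized probabilities):

```python
class TrieNode:
    """Each path from the root spells an n-gram prefix, stored once."""
    __slots__ = ("children", "count")
    def __init__(self):
        self.children, self.count = {}, 0

def add_ngram(root, ngram):
    node = root
    for w in ngram:
        node = node.children.setdefault(w, TrieNode())
        node.count += 1          # shared prefixes share storage and counts

root = TrieNode()
add_ngram(root, ("the", "red", "glasses"))
add_ngram(root, ("the", "red", "shoes"))
print(root.children["the"].children["red"].count)  # 2
```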

SLIDE 14

Smoothing

"This dark art is why NLP is taught in the engineering school." (J. Eisner, 600.465 Intro to NLP)

There are more principled smoothing methods, too. We'll look next at log-linear models, which are a good and popular general technique.
SLIDE 15

Conditional Modeling

- Generative language model (trigram model):

  P(x_1, \ldots, x_n) = P(x_1)\, P(x_2 \mid x_1) \cdots P(x_n \mid x_{n-2}, x_{n-1})

- Then, we compute the conditional probabilities by maximum likelihood estimation
- Can we model P(x_i | x_{i-2}, x_{i-1}) directly?
- Given a context x, which outcomes y are likely in that context?

  P(NextWord = y | PrecedingWords = x)

(Slide credit: J. Eisner, 600.465 Intro to NLP)

SLIDE 16

Modeling conditional probabilities

- Let's assume

  P(y \mid x) = \frac{\exp(\text{score}(x, y))}{\sum_{y'} \exp(\text{score}(x, y'))}

  where y = NextWord and x = PrecedingWords

- P(y | x) is high ⇔ score(x, y) is high
- This is called the soft-max
- It requires P(y | x) ≥ 0 and Σ_y P(y | x) = 1, which is not true of the raw score(x, y)
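A small sketch of the soft-max (the max-shift is a standard numerical-stability trick, not something the slide requires):

```python
import math

def softmax_probs(scores):
    """P(y|x) = exp(score(x,y)) / sum_y' exp(score(x,y')): nonnegative and
    sums to 1, even though the raw scores need not."""
    m = max(scores.values())
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    Z = sum(exps.values())
    return {y: e / Z for y, e in exps.items()}

print(softmax_probs({"glasses": 2.0, "shoes": 1.0, "abacus": -1.0}))
```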

SLIDE 17

Linear Scoring

- Score(x, y): how well does y go with x?
- Simplest option: a linear function of (x, y). But (x, y) isn't a number, so describe it by some numbers (i.e., numeric features), then use a linear function of those numbers:

  \text{Score}(x, y) = \sum_k \theta_k f_k(x, y)

  - k ranges over all features
  - f_k(x, y): whether (x, y) has feature k (0 or 1), or how many times it fires (≥ 0), or how strongly it fires (a real number)
  - θ_k: weight of the k-th feature, to be learned

SLIDE 18

What features should we use?

- Model p(x_i | x_{i-1}, x_{i-2}). Features f_k for Score((x_{i-2}, x_{i-1}), x_i) can be:
  - the number of times "x_{i-1}" appears in the training corpus
  - 1 if "x_i" is an unseen word, 0 otherwise
  - 1 if "x_{i-2} x_{i-1}" = "a red", 0 otherwise
  - 1 if "x_{i-1}" belongs to the "color" category, 0 otherwise
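As a sketch, those features and the linear score might look like this (the category set, corpus counts, and weights are invented for illustration):

```python
COLORS = {"red", "blue", "green", "yellow"}   # hypothetical word category
TRAIN_COUNTS = {"red": 120, "glasses": 35}    # toy corpus statistics

def features(context, y):
    """Count/binary features of (x, y), mirroring the slide's examples."""
    prev2, prev1 = context                    # x_{i-2}, x_{i-1}
    return {
        "count(prev1)":   TRAIN_COUNTS.get(prev1, 0),
        "y_unseen":       1 if y not in TRAIN_COUNTS else 0,
        "ctx=a_red":      1 if (prev2, prev1) == ("a", "red") else 0,
        "prev1_is_color": 1 if prev1 in COLORS else 0,
    }

def score(theta, context, y):
    """Linear score: sum_k theta_k * f_k(x, y)."""
    return sum(theta.get(k, 0.0) * v for k, v in features(context, y).items())

theta = {"ctx=a_red": 1.5, "y_unseen": -2.0, "prev1_is_color": 0.5}
print(score(theta, ("a", "red"), "glasses"))  # 1.5 + 0 + 0.5 = 2.0
```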

SLIDE 19

What features should we use?

- Model p("glasses" | "a red"). Features f_k(("a", "red"), "glasses") for Score(("a", "red"), "glasses") can be:
  - the number of times "red" appears in the training corpus
  - 1 if "glasses" is an unseen word, 0 otherwise
  - 1 if "a red" = "a red", 0 otherwise
  - 1 if "red" belongs to the "color" category, 0 otherwise

SLIDE 20

Log-Linear Conditional Probability

  p(y \mid x) = \frac{\exp(\text{Score}(x, y))}{Z(x)}

  The numerator is an unnormalized probability (at least it's positive!). We choose Z(x) = \sum_{y'} \exp(\text{Score}(x, y')) to ensure that \sum_y p(y \mid x) = 1. Z(x) is called the partition function.

(Slide credit: J. Eisner, 600.465 Intro to NLP)

SLIDE 21

Training θ

- n training examples; feature functions f_1, f_2, ...
- Want to maximize p(training data | θ):

  \prod_{i=1}^{n} p(y_i \mid x_i; \theta)

- Easier to maximize the log of that:

  \sum_{i=1}^{n} \log p(y_i \mid x_i; \theta)

- Alas, some weights θ_k may be optimal at -∞ or +∞. When would this happen? What's going "wrong"?

This version is "discriminative training": to learn to predict y from x, maximize p(y | x). Whereas in "generative models", we learn to model x too, by maximizing p(x, y).

SLIDE 22

Generalization via Regularization

- n training examples; feature functions f_1, f_2, ...
- Want to maximize p(training data | θ) ⋅ p_prior(θ); easier to maximize the log of that
- Encourages weights close to 0
- "L2 regularization" corresponds to a Gaussian prior:

  p(\theta) \propto e^{-\|\theta\|^2 / (2\sigma^2)}

SLIDE 23

Gradient-based training

- Gradually adjust θ in a direction that improves the objective

Gradient ascent to gradually increase f(θ):

  while (∇f(θ) ≠ 0)        // not at a local max or min
      θ = θ + η ∇f(θ)      // for some small learning rate η > 0

Remember: ∇f(θ) = (∂f(θ)/∂θ_1, ∂f(θ)/∂θ_2, ...), so the update means θ_k += η ∂f(θ)/∂θ_k.
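The ascent loop in code, on a toy concave objective (learning rate and tolerance are illustrative):

```python
def gradient_ascent(grad_f, theta, eta=0.01, tol=1e-6, max_iters=100_000):
    """theta <- theta + eta * grad_f(theta) until the gradient vanishes."""
    for _ in range(max_iters):
        g = grad_f(theta)
        if all(abs(gk) < tol for gk in g):
            break
        theta = [tk + eta * gk for tk, gk in zip(theta, g)]
    return theta

# maximize f(t) = -(t1 - 3)^2 - (t2 + 1)^2, so grad f = (-2(t1-3), -2(t2+1))
print(gradient_ascent(lambda t: [-2 * (t[0] - 3), -2 * (t[1] + 1)], [0.0, 0.0]))
# -> approximately [3.0, -1.0]
```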

SLIDE 24

Gradient-based training

- Gradually adjust θ in a direction that improves the objective
- Gradient w.r.t. θ: for the log-linear model, the partial derivative of the conditional log-likelihood is the observed feature total minus the expected feature total:

  \frac{\partial}{\partial \theta_k} \sum_i \log p(y_i \mid x_i; \theta) = \sum_i \Bigl( f_k(x_i, y_i) - \sum_{y} p(y \mid x_i; \theta)\, f_k(x_i, y) \Bigr)
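A sketch of that gradient as observed-minus-expected feature counts, reusing the softmax_probs and score sketches from the earlier slides:

```python
def loglik_gradient(data, theta, feats, candidates):
    """d/d theta_k of sum_i log p(y_i|x_i; theta) =
    sum_i [ f_k(x_i, y_i) - E_{p(y|x_i; theta)} f_k(x_i, y) ]."""
    grad = {}
    for x, y in data:
        probs = softmax_probs({yc: score(theta, x, yc) for yc in candidates})
        for k, v in feats(x, y).items():            # observed features
            grad[k] = grad.get(k, 0.0) + v
        for yc, p in probs.items():                 # expected features
            for k, v in feats(x, yc).items():
                grad[k] = grad.get(k, 0.0) - p * v
    return grad
```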

SLIDE 25

More complex assumption?

v ๐‘„(๐‘ง|๐‘ฆ) = exp(score x,y )/ โˆ‘ exp(๐‘ก๐‘‘๐‘๐‘ ๐‘“ ๐‘ฆ,๐‘งโ€ฒ )

๐‘งโ€ฒ

Y: NextWord, x: PrecedingWords v Assume we saw:

What is P(shoes; blue)?

v Can we learn categories of words(representation) automatically? v Can we build a high order n-gram model without blowing up the model size?

25

red glasses; yellow glasses; green glasses; blue glasses red shoes; yellow shoes; green shoes;

6501 Natural Language Processing

SLIDE 26

Neural language model

v Model ๐‘„(๐‘ง|๐‘ฆ) with a neural network

26

Example 1: One hot vector: each component of the vector represents one word [0, 0, 1, 0, 0] Example 2: word embeddings

6501 Natural Language Processing

SLIDE 27

Neural language model

v Model ๐‘„(๐‘ง|๐‘ฆ) with a neural network

27

Learned matrices to project the input vectors Obtain (y|x) by performing softmax Concatenate projected vectors Non-linear function e.g., โ„Ž = tanh (๐‘‹b ๐‘‘ โƒ‘ + ๐‘)
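A forward pass of this architecture in NumPy, with invented dimensions for illustration (training would fit E, W1, b1, W2, b2 by back-propagation):

```python
import numpy as np

V, d, h, n = 10_000, 64, 128, 2      # vocab size, embedding dim, hidden dim, context length
rng = np.random.default_rng(0)
E  = rng.normal(0, 0.1, (V, d))      # word embeddings (a lookup replaces the one-hot matmul)
W1 = rng.normal(0, 0.1, (h, n * d))  # projects the concatenated context
b1 = np.zeros(h)
W2 = rng.normal(0, 0.1, (V, h))      # maps the hidden state to vocabulary scores
b2 = np.zeros(V)

def next_word_probs(context_ids):
    c = np.concatenate([E[i] for i in context_ids])  # concatenate projected vectors
    hidden = np.tanh(W1 @ c + b1)                    # non-linearity: h = tanh(W c + b)
    logits = W2 @ hidden + b2
    logits -= logits.max()                           # numerical stability
    p = np.exp(logits)
    return p / p.sum()                               # softmax over next words

print(next_word_probs([17, 42]).sum())  # ~1.0
```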

SLIDE 28

Why?

- Potentially generalize to unseen contexts
  - Example: P("red" | "the", "shoes", "are")
  - This does not occur in the training corpus, but ["the", "glasses", "are", "red"] does.
  - If the word representations of "shoes" and "glasses" are similar, then the model can generalize.
- Why are "red" and "blue" similar?
  - Because the NN saw "red skirt", "blue skirt", "red pen", "blue pen", etc.

SLIDE 29

Training neural language models

- Can use gradient ascent as well
- Use the chain rule to derive the gradient, a.k.a. back-propagation
- More complex NN architectures can be used, e.g., LSTMs, character-based models

SLIDE 30

Language model evaluation

- How do we compare models?
  - We need an unseen test set. Why?
- Information theory: the study of the resolution of uncertainty
  - Perplexity: measures how well a probability distribution predicts a sample

SLIDE 31

Cross-Entropy

- A common measure of model quality
  - Task-independent
  - Continuous: slight improvements show up here even if they don't change the # of right answers on a task
- Just measure the probability of (enough) test data
  - Higher probability means the model better predicts the future
  - There's a limit to how well you can predict random stuff
  - The limit depends on "how random" the dataset is (easier to predict weather than headlines, especially in Arizona)

SLIDE 32

Cross-Entropy ("xent")

- Want the probability of the test data to be high:

  p(h | <S>, <S>) * p(o | <S>, h) * p(r | h, o) * p(s | o, r) ... = 1/8 * 1/8 * 1/8 * 1/16 ...

- High prob → low xent, by 3 cosmetic improvements:
  - Take the logarithm (base 2) to prevent underflow:
    log₂(1/8 * 1/8 * 1/8 * 1/16 ...) = log₂ 1/8 + log₂ 1/8 + log₂ 1/8 + log₂ 1/16 ... = (-3) + (-3) + (-3) + (-4) + ...
  - Negate to get a positive value in bits: 3 + 3 + 3 + 4 + ...
  - Divide by the length of the text → 3.25 bits per letter (or per word)

- Average? A geometric average of the probabilities:

  (1/2³ · 1/2³ · 1/2³ · 1/2⁴)^(1/4) = 1/2^3.25 ≈ 1/9.5
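The whole computation in a few lines, using the slide's toy probabilities:

```python
import math

def cross_entropy_bits(probs):
    """Per-token cross-entropy: -(1/N) * sum_i log2 p(w_i | history)."""
    return -sum(math.log2(p) for p in probs) / len(probs)

probs = [1/8, 1/8, 1/8, 1/16]   # the slide's toy example
xent = cross_entropy_bits(probs)
print(xent)        # (3 + 3 + 3 + 4) / 4 = 3.25 bits per symbol
print(2 ** xent)   # perplexity = 2^xent ≈ 9.5
```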

SLIDE 33

Cross-Entropy ("xent")

- Want the probability of the test data to be high:

  p(h | <S>, <S>) * p(o | <S>, h) * p(r | h, o) * p(s | o, r) ... = 1/8 * 1/8 * 1/8 * 1/16 ...

- Cross-entropy → 3.25 bits per letter (or per word)
- Want this to be small (equivalent to wanting good compression!)
- The lower limit is called entropy: obtained in principle as the cross-entropy of the true model, measured on an infinite amount of data
- Perplexity = 2^xent (meaning ≈ 9.5 choices per letter)

SLIDE 34

More math: Entropy H(X)

- The entropy H(p) of a discrete random variable X is the expected negative log probability:

  H(p) = -\sum_x p(x) \log_2 p(x)

- Entropy is a measure of uncertainty

SLIDE 35

Entropy of coin tossing

- Toss a coin: P(H) = p, P(T) = 1 - p

  H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)

  - p = 0.5: H(p) = 1
  - p = 1: H(p) = 0


SLIDE 37

How many bits to encode messages

- Consider four letters (A, B, C, D). How many bits per letter, on average, are needed to encode a message ~ p?
- If p = (1/2, 1/2, 0, 0):
  - Encode A as 0, B as 1; AAABBBAA ⇒ 00011100 (1 bit per letter)
- If p = (1/4, 1/4, 1/4, 1/4):
  - A: 00, B: 01, C: 10, D: 11; ABDA ⇒ 00011100 (2 bits per letter)
- How about p = (1/2, 1/4, 1/4, 0)?
  - A: 0, B: 10, C: 11; AAACBA ⇒ 00011100 (1.5 bits per letter on average under p)

SLIDE 38

More math: Cross Entropy

- Cross-entropy: the average # of bits needed to encode events ~ p(x) using a coding scheme m(x):

  H(p, m) = -\sum_x p(x) \log_2 m(x)

  - Not symmetric: H(p, m) ≠ H(m, p)
  - Lower bounded by H(p)

- Let p = (1/2, 1/4, 1/4, 0)
  - We encode A: 00, B: 01, C: 10, D: 11 (i.e., m = (1/4, 1/4, 1/4, 1/4))
  - AAACBA? ⇒ 000000100100 (2 bits per letter, vs. H(p) = 1.5)
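Both quantities in code, checking the slide's numbers:

```python
import math

def H(p):
    """Entropy: expected negative log2 probability."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def xent(p, m):
    """Cross-entropy H(p, m): avg. bits to encode events ~ p with code m."""
    return -sum(pi * math.log2(mi) for pi, mi in zip(p, m) if pi > 0)

p = [0.5, 0.25, 0.25, 0.0]      # true distribution over A, B, C, D
m = [0.25, 0.25, 0.25, 0.25]    # the fixed-length 2-bit code
print(H(p))        # 1.5 bits (achieved by A:0, B:10, C:11)
print(xent(p, m))  # 2.0 bits; H(p, m) >= H(p)
```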

SLIDE 39

Perplexity and geometric mean

  \text{Perplexity}(w_1 \ldots w_N) = 2^{H(w_1 \ldots w_N)} = 2^{-\frac{1}{N} \log_2 m(w_1 \ldots w_N)} = m(w_1 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{m(w_1 \ldots w_N)}}

(Perplexity is the inverse of the geometric mean of the per-word probabilities.)

SLIDE 40

An experiment

- Train: 38M words of WSJ text, |V| = 20k
- Test: 1.5M words of WSJ text

Results:

  Model:       Unigram   Bigram   Trigram
  Perplexity:  962       170      109

- Word-level LSTM: ~85
- Char-level: ~79