
Statistical Natural Language Processing: N-gram Language Models - PowerPoint PPT Presentation

Statistical Natural Language Processing: N-gram Language Models. Çağrı Çöltekin, Seminar für Sprachwissenschaft, University of Tübingen, Summer Semester 2017.


  1. Assigning probabilities to sentences: count and divide?
     • How do we calculate the probability of a sentence, e.g. P(I like pizza with spinach)?
     • Can we count the occurrences of the sentence and divide by the total number of sentences (in a large corpus)?
     • Short answer: no.
       – Many sentences are not observed even in very large corpora.
       – For the ones observed in a corpus, the probabilities will not reflect our intuition, or will not be useful in most applications.

  2. Assigning probabilities to sentences: applying the chain rule
     • The solution is to decompose: we use the probabilities of parts of the sentence (words) to calculate the probability of the whole sentence.
     • Using the chain rule of probability (without loss of generality), we can write
       P(w1, w2, …, wm) = P(w1) × P(w2 | w1) × P(w3 | w1, w2) × … × P(wm | w1, w2, …, wm−1)

  3. Example: applying the chain rule
       P(I like pizza with spinach) = P(I) × P(like | I) × P(pizza | I like) × P(with | I like pizza) × P(spinach | I like pizza with)
     • Did we solve the problem?
     • Not really: the last term is just as difficult to estimate.

  4. Assigning probabilities to sentences: the Markov assumption
     • We make a conditional independence assumption: the probability of a word depends only on the preceding n − 1 words,
         P(wi | w1, …, wi−1) = P(wi | wi−n+1, …, wi−1)
       and therefore
         P(w1, …, wm) = ∏ i=1..m P(wi | wi−n+1, …, wi−1)
     • For example, with n = 2 (bigram, first-order Markov model):
         P(w1, …, wm) = ∏ i=1..m P(wi | wi−1)
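
To make the bigram case concrete, here is a minimal sketch (not from the slides) of the factorization above: given a table of conditional probabilities, the sentence probability is just a product over adjacent word pairs. The probability values and the start symbol `<s>` are invented for illustration.

```python
# Sketch of P(w1..wm) = prod_i P(w_i | w_{i-1}) under the bigram Markov assumption.
# The conditional probabilities below are made up for illustration only.
bigram_prob = {
    ("<s>", "I"): 1.0,
    ("I", "like"): 0.4,
    ("like", "pizza"): 0.2,
    ("pizza", "with"): 0.3,
    ("with", "spinach"): 0.1,
}

def sentence_prob(tokens, cond_prob):
    """Multiply P(w_i | w_{i-1}), conditioning the first word on a start symbol."""
    prob, prev = 1.0, "<s>"
    for w in tokens:
        prob *= cond_prob.get((prev, w), 0.0)  # an unseen bigram gives probability 0
        prev = w
    return prob

print(sentence_prob("I like pizza with spinach".split(), bigram_prob))  # 0.0024
```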

  5. Example: bigram probabilities of a sentence
     • The full chain-rule decomposition is
         P(I like pizza with spinach) = P(I) × P(like | I) × P(pizza | I like) × P(with | I like pizza) × P(spinach | I like pizza with)
     • Under the bigram (n = 2) Markov assumption this becomes
         P(I like pizza with spinach) ≈ P(I) × P(like | I) × P(pizza | like) × P(with | pizza) × P(spinach | with)
     • Now, hopefully, we can count them in a corpus.

  6. Maximum-likelihood estimation (MLE)
     • Maximum-likelihood estimation of n-gram probabilities is based on their frequencies in a corpus.
     • We are interested in conditional probabilities of the form P(wi | w1, …, wi−1), which we estimate using
         P(wi | wi−n+1, …, wi−1) = C(wi−n+1 … wi) / C(wi−n+1 … wi−1)
       where C() is the frequency (count) of the sequence in the corpus.
     • For example, the probability P(like | I) would be
         P(like | I) = C(I like) / C(I) = (number of times 'I like' occurs in the corpus) / (number of times 'I' occurs in the corpus)
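
A small count-and-divide sketch of the MLE estimate, using the two-sentence corpus that appears on the following slides; the function and variable names are mine, not the author's.

```python
from collections import Counter

# The toy corpus from the following slides, already tokenized.
corpus = [
    ["I", "'m", "sorry", ",", "Dave", "."],
    ["I", "'m", "afraid", "I", "can", "'t", "do", "that", "."],
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

def p_mle(w2, w1):
    """MLE estimate P(w2 | w1) = C(w1 w2) / C(w1)."""
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(p_mle("'m", "I"))      # C(I 'm) / C(I) = 2/3
print(p_mle("sorry", "'m"))  # C('m sorry) / C('m) = 1/2
```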

  7. MLE estimation of an n-gram language model
     An n-gram model conditions on the n − 1 previous words.
     • In a 1-gram (unigram) model, P(wi) = C(wi) / N
     • In a 2-gram (bigram) model, P(wi | wi−1) = C(wi−1 wi) / C(wi−1)
     • In a 3-gram (trigram) model, P(wi | wi−2 wi−1) = C(wi−2 wi−1 wi) / C(wi−2 wi−1)
     Training an n-gram model involves estimating these parameters (the conditional probabilities).

  8. Unigrams
     Unigrams are simply the single words (or tokens).
     A small corpus:  I ’m sorry , Dave .   I ’m afraid I can ’t do that .
     When tokenized, we have 15 tokens and 11 types.
     Unigram counts: I 3; ’m 2; . 2; sorry 1; , 1; Dave 1; afraid 1; can 1; ’t 1; do 1; that 1
     Note: traditionally, can't is tokenized as ca n’t (similar to have n’t, is n’t, etc.), but for our purposes can ’t is more readable.

  9. Unigram probability of a sentence
     Using the unigram counts above (I 3; ’m 2; . 2; all other types 1; 15 tokens in total):
       P(I ’m sorry , Dave .) = P(I) × P(’m) × P(sorry) × P(,) × P(Dave) × P(.)
                              = 3/15 × 2/15 × 1/15 × 1/15 × 1/15 × 2/15
                              = 0.00000105
     • P(, ’m I . sorry Dave) = ? (the same value: a unigram model ignores word order)
     • What is the most likely sentence according to this model?

  10. N-gram models define probability distributions
      • An n-gram model defines a probability distribution over words: Σ w∈V P(w) = 1
        (unigram MLE probabilities in our corpus: I 0.200; ’m 0.133; . 0.133; sorry, ,, Dave, afraid, can, ’t, do, that 0.067 each; total 1.000)
      • They also define probability distributions over word sequences of equal size. For example (length 2): Σ w∈V Σ v∈V P(w) P(v) = 1
      • What about sentences?

  11. Unigram probabilities
      (Bar chart of the MLE unigram probabilities in the small corpus: I 0.20 (count 3); ’m 0.13 and . 0.13 (count 2 each); sorry, ,, Dave, afraid, can, ’t, do, that 0.07 each (count 1).)

  12. Unigram probabilities in a (slightly) larger corpus
      (Plot of MLE unigram probabilities by rank in the Universal Declaration of Human Rights: the probabilities drop quickly over the first few hundred ranks, and a long tail follows, out to 536 word types.)

  13. Zipf's law – a short divergence
      The frequency of a word is inversely proportional to its rank:
        rank × frequency = k,   or   frequency ∝ 1 / rank
      • This is a recurring theme in (computational) linguistics: most linguistic units follow a more or less similar distribution.
      • Important consequence for us (in this lecture): even very large corpora will not contain some of the words (or n-grams).
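
A rough sketch of how to eyeball Zipf's law on any tokenized text: sort the types by frequency and inspect rank × frequency. The tiny text below is a stand-in; on a real corpus (such as the UDHR text from the previous slide) the product stays roughly constant over the middle ranks.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency), most frequent first."""
    ranked = sorted(Counter(tokens).items(), key=lambda kv: kv[1], reverse=True)
    return [(r, w, f, r * f) for r, (w, f) in enumerate(ranked, start=1)]

# Stand-in text; swap in a real corpus to see the effect properly.
text = "the cat sat on the mat and the dog sat on the log".split()
for row in rank_frequency(text)[:4]:
    print(row)
```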

  14. Bigrams
      Bigrams are overlapping sequences of two tokens.
      Bigram counts: I ’m 2; ’m sorry 1; sorry , 1; , Dave 1; Dave . 1; ’m afraid 1; afraid I 1; I can 1; can ’t 1; ’t do 1; do that 1; that . 1
      • What about the bigram '. I'?

  15. Sentence boundary markers
      If we want sentence probabilities, we need to mark the boundaries:
        ⟨s⟩ I ’m sorry , Dave . ⟨/s⟩
        ⟨s⟩ I ’m afraid I can ’t do that . ⟨/s⟩
      • The bigram '⟨s⟩ I' is not the same as the unigram 'I'
      • Including ⟨s⟩ allows us to predict likely words at the beginning of a sentence
      • Including ⟨/s⟩ allows us to assign a proper probability distribution to sentences
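
A small sketch of padding sentences with boundary markers before counting, so that '⟨s⟩ I' and '. ⟨/s⟩' become ordinary bigrams (written <s> and </s> in the code); the helper names are mine.

```python
from collections import Counter

def pad(sentence, n=2):
    """Add n-1 start markers and one end marker around a tokenized sentence."""
    return ["<s>"] * (n - 1) + sentence + ["</s>"]

corpus = [
    ["I", "'m", "sorry", ",", "Dave", "."],
    ["I", "'m", "afraid", "I", "can", "'t", "do", "that", "."],
]

bigram_counts = Counter()
for sent in corpus:
    padded = pad(sent)
    bigram_counts.update(zip(padded, padded[1:]))

print(bigram_counts[("<s>", "I")])   # 2: both sentences start with I
print(bigram_counts[(".", "</s>")])  # 2: both sentences end with .
print(bigram_counts[(".", "I")])     # 0: we never cross a sentence boundary
```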

  16. Calculating bigram probabilities: recap with some more detail
      We want to calculate P(w2 | w1). From the chain rule:
        P(w2 | w1) = P(w1, w2) / P(w1)
      and, with the MLE estimates P(w1, w2) = C(w1 w2)/N and P(w1) = C(w1)/N:
        P(w2 | w1) = (C(w1 w2)/N) / (C(w1)/N) = C(w1 w2) / C(w1)
      • P(w2 | w1) is the probability of w2 given that the previous word is w1
      • P(w1, w2) is the probability of the sequence w1 w2
      • P(w1) is the probability of w1 occurring as the first item in a bigram, not its unigram probability

  17. Bigram probabilities
      w1 w2       C(w1 w2)  C(w1)  P(w1 w2)  P(w1)  P(w2|w1)  P(w2)
      ⟨s⟩ I          2        2      0.12     0.12    1.00    0.18
      I ’m           2        3      0.12     0.18    0.67    0.12
      ’m sorry       1        2      0.06     0.12    0.50    0.06
      sorry ,        1        1      0.06     0.06    1.00    0.06
      , Dave         1        1      0.06     0.06    1.00    0.06
      Dave .         1        1      0.06     0.06    1.00    0.12
      ’m afraid      1        2      0.06     0.12    0.50    0.06
      afraid I       1        1      0.06     0.06    1.00    0.18
      I can          1        3      0.06     0.18    0.33    0.06
      can ’t         1        1      0.06     0.06    1.00    0.06
      ’t do          1        1      0.06     0.06    1.00    0.06
      do that        1        1      0.06     0.06    1.00    0.06
      that .         1        1      0.06     0.06    1.00    0.12
      . ⟨/s⟩         2        2      0.12     0.12    1.00    0.12
      Note: the P(w1) column is the probability of w1 as the first element of a bigram, not its unigram probability.

  18. Sentence probability: bigram vs. unigram
        P_uni(⟨s⟩ I ’m sorry , Dave . ⟨/s⟩) = 2.83 × 10⁻⁹
        P_bi(⟨s⟩ I ’m sorry , Dave . ⟨/s⟩) = 0.33
      (Bar chart comparing the per-word unigram and bigram probabilities of I, ’m, sorry, ,, Dave, ., ⟨/s⟩.)

  19. Unigram vs. bigram probabilities in sentences and non-sentences
        I ’m sorry , Dave .       P_uni = 2.83 × 10⁻⁹    P_bi = 0.33
        , ’m I . sorry Dave       P_uni = 2.83 × 10⁻⁹    P_bi = 0.00  (scrambled word order: same unigram probability, but every bigram is unseen)
        I ’m afraid , Dave .      P_uni = 2.83 × 10⁻⁹    P_bi = 0.00  (a fluent sentence that happens to contain the unseen bigram 'afraid ,')
      The unigram model is insensitive to word order, while the bigram model assigns zero probability to any sequence containing an unseen bigram.

  20. Bigram model as a finite-state automaton
      (Figure: a weighted finite-state automaton over the vocabulary, with ⟨s⟩ as the start state and ⟨/s⟩ as the final state; the arcs are weighted with the bigram probabilities from the table above, e.g. ⟨s⟩ → I with 1.0, I → ’m with 0.67, I → can with 0.33, ’m → sorry and ’m → afraid with 0.5, and . → ⟨/s⟩ with 1.0.)

  21. Trigrams
      Trigram counts (with two start markers per sentence):
        ⟨s⟩ ⟨s⟩ I ’m sorry , Dave . ⟨/s⟩
        ⟨s⟩ ⟨s⟩ I ’m afraid I can ’t do that . ⟨/s⟩
      ⟨s⟩ ⟨s⟩ I 2; ⟨s⟩ I ’m 2; and one each for I ’m sorry, ’m sorry ,, sorry , Dave, , Dave ., Dave . ⟨/s⟩, I ’m afraid, ’m afraid I, afraid I can, I can ’t, can ’t do, ’t do that, do that ., that . ⟨/s⟩
      • How many n-grams are there in a sentence of length m?
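
A generic n-gram extractor in the same spirit (n − 1 start markers, one end marker, as in the trigram counts above). The length check at the end illustrates the answer suggested by this padding scheme: a sentence of m words yields m + 1 padded n-grams (and m − n + 1 without any padding).

```python
def ngrams(sentence, n):
    """All n-grams of a sentence padded with n-1 <s> markers and one </s>."""
    padded = ["<s>"] * (n - 1) + sentence + ["</s>"]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

sent = ["I", "'m", "sorry", ",", "Dave", "."]
trigrams = ngrams(sent, 3)
print(trigrams[0])                      # ('<s>', '<s>', 'I')
print(trigrams[-1])                     # ('Dave', '.', '</s>')
print(len(trigrams) == len(sent) + 1)   # True: m + 1 n-grams with this padding
```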

  22. Trigram probabilities of a sentence
        P_uni(I ’m sorry , Dave . ⟨/s⟩) = 2.83 × 10⁻⁹
        P_bi(I ’m sorry , Dave . ⟨/s⟩) = 0.33
        P_tri(I ’m sorry , Dave . ⟨/s⟩) = 0.50
      (Bar chart comparing the per-word unigram, bigram and trigram probabilities.)

  23. Short detour: colorless green ideas
      'But it must be recognized that the notion "probability of a sentence" is an entirely useless one, under any known interpretation of this term.' — Chomsky (1968)
      • The following 'sentences' are categorically different:
        – Colorless green ideas sleep furiously
        – Furiously sleep ideas green colorless
      • Can n-gram models model the difference?
      • Should n-gram models model the difference?

  24. What do n-gram models model?
      • Some morphosyntax: the bigram 'ideas are' is (much more) likely than 'ideas is'
      • Some semantics: 'bright ideas' is more likely than 'green ideas'
      • Some cultural aspects of everyday language: 'Chinese food' is more likely than 'British food'
      • … and more aspects of the 'usage' of language
      N-gram models are practical tools, and they have been useful for many tasks.

  25. N-grams, so far …
      • N-gram language models are one of the basic tools in NLP
      • They capture some linguistic (and non-linguistic) regularities that are useful in many applications
      • The idea is to estimate the probability of a sentence based on its parts (sequences of words)
      • N-grams are n consecutive units in a sequence
      • Typically we use sequences of words to estimate sentence probabilities, but other units are also possible: characters, phonemes, phrases, …
      • For most applications, we introduce sentence boundary markers

  26. N-grams, so far …
      • The most straightforward method for estimating probabilities is using relative frequencies (this leads to the MLE)
      • Due to Zipf's law, as we increase n the counts become smaller (data sparseness), and many counts become 0
      • If there are unknown words, we get 0 probabilities for both words and sentences
      • In practice bigrams or trigrams are used most commonly; for some applications/datasets up to 5-grams are also used

  27. How to test n-gram models?
      • Extrinsic: how (much) the model improves the target application, e.g.
        – speech recognition accuracy
        – BLEU score for machine translation
        – keystroke savings in predictive text applications
      • Intrinsic: the higher the probability assigned to a test set, the better the model. A few measures:
        – likelihood
        – (cross) entropy
        – perplexity

  28. Training and test set division
      • We (almost) never use a statistical (language) model on the training data
      • Testing a model on the training set is misleading: the model may overfit the training set
      • Always test your models on a separate test set

  29. Intrinsic evaluation metrics: likelihood
      • The likelihood of a model M is the probability of the (test) set w given the model:
          L(M | w) = P(w | M) = ∏ s∈w P(s)
      • The higher the likelihood (for a given test set), the better the model
      • Likelihood is sensitive to the test set size
      • Practical note: (minus) log likelihood is more common, because of readability and ease of numerical manipulation

  30. Intrinsic evaluation metrics: cross entropy
      • The cross entropy of a language model on a test set w is
          H(w) = −(1/N) log₂ P(w)
      • The lower the cross entropy, the better the model
      • Remember that cross entropy is the average number of bits required to encode data coming from one distribution (the test set distribution) using an approximate distribution (the language model)
      • Note that cross entropy is not sensitive to the length of the test set

  31. Intrinsic evaluation metrics: perplexity
      • Perplexity is a more common measure for evaluating language models:
          PP(w) = 2^H(w) = P(w)^(−1/N), i.e. the N-th root of 1/P(w)
      • Perplexity is the average branching factor
      • Similar to cross entropy:
        – lower is better
        – not sensitive to test set size
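
A sketch tying the three intrinsic measures together for a scored test set: log-likelihood, cross entropy H(w) = −(1/N) log₂ P(w), and perplexity 2^H(w). The `sentence_logprob` argument is a stand-in for whatever model is being evaluated; counting one end-of-sentence symbol per sentence in N is an assumption of this sketch, not something the slides prescribe.

```python
import math

def evaluate(test_set, sentence_logprob):
    """Return (log2-likelihood, cross entropy in bits, perplexity) for a test set.

    sentence_logprob(sentence) must return log2 P(sentence) under the model;
    N counts the tokens plus one end-of-sentence symbol per sentence.
    """
    log_likelihood = sum(sentence_logprob(s) for s in test_set)
    n_tokens = sum(len(s) + 1 for s in test_set)
    cross_entropy = -log_likelihood / n_tokens   # H(w) = -(1/N) log2 P(w)
    perplexity = 2 ** cross_entropy              # PP(w) = 2^H(w)
    return log_likelihood, cross_entropy, perplexity

# Toy stand-in model: every token (including the end symbol) gets probability 1/10.
uniform10 = lambda s: (len(s) + 1) * math.log2(1 / 10)
print(evaluate([["I", "'m", "sorry", ",", "Dave", "."]], uniform10))
# cross entropy = log2(10) ~ 3.32 bits, perplexity = 10.0
```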

  32. What do we do with unseen n-grams? (and other issues with MLE estimates)
      • Words (and word sequences) are distributed according to Zipf's law: many words are rare
      • MLE will assign 0 probabilities to unseen words, and to sequences containing unseen words
      • Even with non-zero probabilities, MLE overfits the training data
      • One solution is smoothing: take some probability mass from known words and assign it to unknown words

  33. Smoothing: what is in the name?
      (Figure: histograms of 5, 10, 30 and 1000 samples from N(0, 1). With few samples the empirical distribution is jagged, with many empty bins; with more samples it approaches the smooth underlying density. Smoothing small-sample estimates is the analogue of what we want to do with n-gram counts.)

  34. Laplace smoothing (add-one smoothing)
      • The idea (from 1790): add one to all counts
      • The probability of a word is estimated by
          P+1(w) = (C(w) + 1) / (N + V)
        where N is the number of word tokens and V is the number of word types (the size of the vocabulary)
      • The probability of an unknown word is then (0 + 1) / (N + V)

  35. Laplace smoothing for n-grams
      • The probability of a bigram becomes
          P+1(wi−1 wi) = (C(wi−1 wi) + 1) / (N + V²)
      • and the conditional probability
          P+1(wi | wi−1) = (C(wi−1 wi) + 1) / (C(wi−1) + V)
      • In general,
          P+1(wi−n+1 … wi) = (C(wi−n+1 … wi) + 1) / (N + Vⁿ)
          P+1(wi | wi−n+1 … wi−1) = (C(wi−n+1 … wi) + 1) / (C(wi−n+1 … wi−1) + V)
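
A sketch of the smoothed conditional estimate above, written with a general α so that α = 1 gives Laplace smoothing and a smaller α gives the Lidstone correction of the later slides. The counts come from the padded toy corpus; taking V to include the boundary markers is my choice here, not something fixed by the slides.

```python
from collections import Counter

corpus = [
    ["<s>", "I", "'m", "sorry", ",", "Dave", ".", "</s>"],
    ["<s>", "I", "'m", "afraid", "I", "can", "'t", "do", "that", ".", "</s>"],
]
unigram_counts = Counter(w for s in corpus for w in s)
bigram_counts = Counter(pair for s in corpus for pair in zip(s, s[1:]))
V = len(unigram_counts)  # vocabulary size (boundary markers included here)

def p_add_alpha(w2, w1, alpha=1.0):
    """(C(w1 w2) + alpha) / (C(w1) + alpha * V); alpha = 1 is Laplace smoothing."""
    return (bigram_counts[(w1, w2)] + alpha) / (unigram_counts[w1] + alpha * V)

print(p_add_alpha("'m", "I"))             # seen bigram, smoothed downwards
print(p_add_alpha("pizza", "I"))          # unseen bigram, small but non-zero
print(p_add_alpha("'m", "I", alpha=0.1))  # Lidstone: closer to the MLE value 2/3
```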

  36. Bigram probabilities: non-smoothed vs. Laplace smoothing
      w1 w2       C(w1 w2)+1  P_MLE(w1 w2)  P+1(w1 w2)  P_MLE(w2|w1)  P+1(w2|w1)
      ⟨s⟩ I           3          0.118        0.019        1.000        0.188
      I ’m            3          0.118        0.019        0.667        0.176
      ’m sorry        2          0.059        0.012        0.500        0.125
      sorry ,         2          0.059        0.012        1.000        0.133
      , Dave          2          0.059        0.012        1.000        0.133
      Dave .          2          0.059        0.012        1.000        0.133
      ’m afraid       2          0.059        0.012        0.500        0.125
      afraid I        2          0.059        0.012        1.000        0.133
      I can           2          0.059        0.012        0.333        0.118
      can ’t          2          0.059        0.012        1.000        0.133
      ’t do           2          0.059        0.012        1.000        0.133
      do that         2          0.059        0.012        1.000        0.133
      that .          2          0.059        0.012        1.000        0.133
      . ⟨/s⟩          3          0.118        0.019        1.000        0.188
      Σ over the observed bigrams: P_MLE 1.000, P+1 0.193 (add-one smoothing reserves the remaining mass for unseen bigrams).

  37. MLE vs. Laplace probabilities: bigram probabilities in sentences and non-sentences
        ⟨s⟩ I ’m sorry , Dave . ⟨/s⟩     P_MLE: 1.00 0.67 0.50 1.00 1.00 1.00 1.00 → 0.33       P+1: 0.25 0.23 0.17 0.18 0.18 0.18 0.25 → 1.44 × 10⁻⁵
        ⟨s⟩ , ’m I . sorry Dave ⟨/s⟩     P_MLE: 0.00 for every transition → 0.00                 P+1: 0.08 0.09 0.08 0.08 0.08 0.09 0.09 → 3.34 × 10⁻⁸
        ⟨s⟩ I ’m afraid , Dave . ⟨/s⟩    P_MLE: 1.00 0.67 0.50 0.00 1.00 1.00 1.00 → 0.00        P+1: 0.25 0.23 0.17 0.09 0.18 0.18 0.25 → 7.22 × 10⁻⁶
      Laplace smoothing gives a non-zero probability to all three sequences, but still ranks the fluent sentences above the scrambled one.

  38. How much mass does +1 smoothing steal?
      In our small corpus, the share of the probability mass reserved for unseen n-grams is:
        unigrams: 3.33 %    bigrams: 83.33 %    trigrams: 98.55 %
      • Laplace smoothing reserves probability mass proportional to the size of the vocabulary
      • This is just too much for large vocabularies and higher-order n-grams
      • Note that only very few of the higher-level n-grams (e.g., trigrams) are possible

  39. Lidstone correction (add-α smoothing)
      • A simple improvement over Laplace smoothing is to add 0 < α (typically α < 1) instead of 1:
          P+α(wi−n+1 … wi) = (C(wi−n+1 … wi) + α) / (N + αV)
      • With smaller α values the model behaves more like the MLE: it has high variance and overfits
      • Larger α values reduce the variance, but have a large bias

  40. How do we pick a good α value?
      • We want the α value that works best outside the training data
      • Peeking at your test data during training/development is wrong
      • This calls for another division of the available data: set aside a development set for tuning hyperparameters such as the smoothing parameter
      • Alternatively, we can use k-fold cross validation and take the α with the best average score (more on cross validation later in this course)
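
One way to implement the advice on this slide is a plain grid search over candidate α values, scoring each on a held-out development set and keeping the best one. `dev_logprob` is a placeholder for the development-set log-probability under an add-α model; the candidate values and toy scores are invented.

```python
def tune_alpha(candidates, dev_logprob):
    """Return the alpha with the highest development-set log-probability."""
    best_alpha, best_score = None, float("-inf")
    for alpha in candidates:
        score = dev_logprob(alpha)
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha

# Toy example: pretend the development log-probability peaks at alpha = 0.1.
toy_scores = {1.0: -120.0, 0.5: -115.0, 0.1: -110.0, 0.05: -112.0, 0.01: -118.0}
print(tune_alpha(toy_scores, dev_logprob=toy_scores.get))  # 0.1
```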

  41. Absolute discounting
      • An alternative to additive smoothing is to reserve an explicit amount of probability mass, ε, for the unseen events
      • How do we decide what ε value to use?
      • The probabilities of the known events have to be re-normalized
      • This is often not very convenient

  42. Good-Turing smoothing: the 'discounting' view
      • Estimate the probability mass to be reserved for novel n-grams using the observed n-grams
      • The novel events in our training set are the ones that occur once:
          p0 = n_1 / n
        where n_1 is the number of distinct n-grams with frequency 1 in the training data
      • Now we need to discount this mass from the higher counts
      • The probability of an n-gram that occurred r times in the corpus is
          (r + 1) n_{r+1} / (n_r n)

  43. Some terminology: frequencies of frequencies and equivalence classes
      In the small corpus the frequencies of frequencies are n_1 = 8, n_2 = 2, n_3 = 1 (eight types occur once, ’m and . occur twice, I occurs three times).
      • We often put n-grams into equivalence classes
      • Good-Turing forms the equivalence classes based on frequency
      Note: n = Σ_r r × n_r

  44. Good-Turing estimation: leave-one-out justification
      • Leave each n-gram out, one at a time
      • Count the number of times the left-out n-gram had frequency r in the remaining data:
        – novel n-grams: n_1 / n
        – n-grams with frequency 1 (singletons): (1 + 1) n_2 / (n_1 n)
        – n-grams with frequency 2 (doubletons*): (2 + 1) n_3 / (n_2 n)
      * Yes, this seems to be a word.

  45. Adjusted counts
      Sometimes it is instructive to see the 'effective count' of an n-gram under the smoothing method. For Good-Turing smoothing, the updated count r* is
          r* = (r + 1) n_{r+1} / n_r
      • novel items: n_1
      • singletons: 2 × n_2 / n_1
      • doubletons: 3 × n_3 / n_2
      • …
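
A sketch of the Good-Turing quantities on the toy unigram counts: the frequencies of frequencies n_r, the reserved mass p0 = n_1/n, and the adjusted count r* = (r + 1) n_{r+1}/n_r, which is only defined when n_{r+1} > 0 (one of the issues taken up two slides below).

```python
from collections import Counter

counts = Counter({"I": 3, "'m": 2, ".": 2, "sorry": 1, ",": 1, "Dave": 1,
                  "afraid": 1, "can": 1, "'t": 1, "do": 1, "that": 1})

n_total = sum(counts.values())     # 15 tokens
n_r = Counter(counts.values())     # frequencies of frequencies: {1: 8, 2: 2, 3: 1}

p0 = n_r[1] / n_total              # mass reserved for novel items: 8/15
print(n_r, p0)

def adjusted_count(r):
    """Good-Turing r* = (r + 1) * n_{r+1} / n_r (breaks down if n_{r+1} == 0)."""
    return (r + 1) * n_r[r + 1] / n_r[r]

print(adjusted_count(1))            # singletons: 2 * 2/8 = 0.5
print(adjusted_count(2))            # doubletons: 3 * 1/2 = 1.5
print(adjusted_count(1) / n_total)  # Good-Turing probability of a singleton
```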

  46. Good-Turing example
      With n_1 = 8, n_2 = 2, n_3 = 1 and n = 15:
        P_GT(a novel item, e.g. the or a) comes out of the reserved mass p0 = n_1 / n = 8/15
        P_GT(that) = P_GT(do) = … = (2 × n_2/n_1) / 15 = (2 × 2/8) / 15 ≈ 0.033
        P_GT(’m) = P_GT(.) = (3 × n_3/n_2) / 15 = (3 × 1/2) / 15 = 0.1

  47. Issues with Good-Turing discounting (with some solutions)
      • Zero counts: we cannot assign probabilities if n_{r+1} = 0
      • The estimates of some of the frequencies of frequencies are unreliable
      • A solution is to replace n_r with smoothed counts z_r
      • A well-known technique for smoothing n_r (simple Good-Turing) is to use linear interpolation in log space:
          log z_r = a + b log r
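
A rough sketch of that linear fit: ordinary least squares of log n_r on log r, giving smoothed values z_r = exp(a + b log r) that can replace unreliable or missing n_r. The frequency-of-frequency numbers are invented, and real simple Good-Turing implementations do a bit more (e.g., averaging n_r over neighbouring non-zero r and switching between raw and smoothed estimates).

```python
import math

# Invented frequencies of frequencies n_r for r = 1..6.
n_r = {1: 120, 2: 40, 3: 24, 4: 13, 5: 15, 6: 5}

xs = [math.log(r) for r in n_r]
ys = [math.log(n_r[r]) for r in n_r]

# Least-squares fit of log(z_r) = a + b * log(r).
mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
a = mean_y - b * mean_x

def z(r):
    """Smoothed frequency of frequency for count r."""
    return math.exp(a + b * math.log(r))

print(round(z(2), 1), round(z(7), 1))  # z_7 exists even though n_7 was never observed
```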

  48. N-grams, so far …
      • Two different ways of evaluating n-gram models:
        – Intrinsic: likelihood, (cross) entropy, perplexity
        – Extrinsic: success in an external application
      • Intrinsic evaluation metrics often correlate well with the extrinsic metrics
      • Test your n-gram models on an 'unseen' test set

  49. N-grams, so far …
      • Smoothing methods solve the zero-count problem (and also reduce the variance)
      • Smoothing takes away some probability mass from the observed n-grams and assigns it to unobserved ones
        – Additive smoothing: add a constant α to all counts
          • α = 1 (Laplace smoothing) simply adds one to all counts – simple but often not very useful
          • A simple correction is to add a smaller α, which requires tuning over a development set
        – Discounting removes a fixed amount of probability mass, ε, from the observed n-grams
          • We need to re-normalize the probability estimates
          • Again, we need a development set to tune ε
        – Good-Turing discounting reserves the probability mass for the unobserved events based on the n-grams seen only once: p0 = n_1 / n

  50. Not all (unknown) n-grams are equal
      • Let's assume that black squirrel is an unknown bigram
      • How do we calculate the smoothed probability?
          P+1(squirrel | black) = (0 + 1) / (C(black) + V)
      • How about black wug?
          P+1(wug | black) = (0 + 1) / (C(black) + V), exactly the same value
      • Would it make a difference if we used a better smoothing method (e.g., Good-Turing)?
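
A tiny numeric illustration of the point, with invented counts: under add-one smoothing the estimate for an unseen bigram depends only on the history count and the vocabulary size, so the plausible black squirrel and the implausible black wug get exactly the same probability.

```python
def p_add_one(bigram_count, history_count, vocab_size):
    """Add-one estimate P(w2 | w1) = (C(w1 w2) + 1) / (C(w1) + V)."""
    return (bigram_count + 1) / (history_count + vocab_size)

C_BLACK = 20        # invented: 'black' occurs 20 times in the corpus
V = 10_000          # invented vocabulary size

# Neither bigram was observed, so both counts are 0 and the estimates coincide.
print(p_add_one(0, C_BLACK, V))  # P+1(squirrel | black)
print(p_add_one(0, C_BLACK, V))  # P+1(wug | black) -- the same value
```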
