


  1. Natural Language Processing Info 159/259 Lecture 6: Language models 1 (Sept 12, 2017) David Bamman, UC Berkeley

  2. Language Model
• Vocabulary 𝒲 is a finite set of discrete symbols (e.g., words, characters); V = |𝒲|
• 𝒲⁺ is the infinite set of sequences of symbols from 𝒲; each sequence ends with STOP
• x ∈ 𝒲⁺

  3. Language Model
P(w) = P(w_1, ..., w_n)
P("Call me Ishmael") = P(w_1 = "call", w_2 = "me", w_3 = "Ishmael") × P(STOP)
Σ_{w ∈ 𝒲⁺} P(w) = 1 (summed over all sequence lengths!), and 0 ≤ P(w) ≤ 1

  4. Language Model • Language models provide us with a way to quantify the likelihood of a sequence — i.e., how plausible a sentence is.

  5. OCR • OCR output: to fee great Pompey paffe the Areets of Rome: • Correct reading: to see great Pompey passe the streets of Rome:

  6. Machine translation • Fidelity (to source text) • Fluency (of the translation)

  7. Speech Recognition • 'Scuse me while I kiss the sky. • 'Scuse me while I kiss this guy • 'Scuse me while I kiss this fly. • 'Scuse me while my biscuits fry

  8. Dialogue generation Li et al. (2016), "Deep Reinforcement Learning for Dialogue Generation" (EMNLP)

  9. Information theoretic view Y “One morning I shot an elephant in my pajamas” encode(Y) decode(encode(Y)) Shannon 1948

  10. Noisy Channel
X (observed) → Y (recovered): ASR: speech signal → transcription; MT: source text → target text; OCR: pixel densities → transcription
P(Y | X) ∝ P(X | Y) P(Y), where P(X | Y) is the channel model and P(Y) is the source model

  11. Language Model • Language modeling is the task of estimating P(w) • Why is this hard? P(“It was the best of times, it was the worst of times”)

  12. Chain rule (of probability) P(x_1, x_2, x_3, x_4, x_5) = P(x_1) × P(x_2 | x_1) × P(x_3 | x_1, x_2) × P(x_4 | x_1, x_2, x_3) × P(x_5 | x_1, x_2, x_3, x_4)

  13. Chain rule (of probability) P(“It was the best of times, it was the worst of times”)

  14. Chain rule (of probability)
P(w_1) (this is easy): P("It")
P(w_2 | w_1): P("was" | "It")
P(w_3 | w_1, w_2)
P(w_4 | w_1, w_2, w_3)
…
P(w_n | w_1, ..., w_{n−1}) (this is hard): P("times" | "It was the best of times, it was the worst of")

  15. Markov assumption
P(x_i | x_1, ..., x_{i−1}) ≈ P(x_i | x_{i−1}) (first-order)
P(x_i | x_1, ..., x_{i−1}) ≈ P(x_i | x_{i−2}, x_{i−1}) (second-order)

  16. Markov assumption
bigram model (first-order Markov): ∏_{i=1}^{n} P(w_i | w_{i−1}) × P(STOP | w_n)
trigram model (second-order Markov): ∏_{i=1}^{n} P(w_i | w_{i−2}, w_{i−1}) × P(STOP | w_{n−1}, w_n)

  17. “It was the best of times, it was the worst of times”
P(It | START1, START2) P(was | START2, It) P(the | It, was) … P(times | worst, of) P(STOP | of, times)
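The padded trigram factorization above can be sketched in code. This is an illustrative helper (the function name and the START1/START2/STOP padding symbols are my own), assuming `trigram_prob` is any function returning P(w | w−2, w−1):

```python
import math

def sentence_logprob(tokens, trigram_prob):
    """Log probability of a sentence under a trigram model: pad with
    two START symbols and end with STOP, then sum
    log P(w_i | w_{i-2}, w_{i-1}) over every position."""
    padded = ["START1", "START2"] + tokens + ["STOP"]
    logp = 0.0
    for i in range(2, len(padded)):
        logp += math.log(trigram_prob(padded[i - 2], padded[i - 1], padded[i]))
    return logp

# Toy usage: a (fake) model that gives every word probability 0.5;
# 3 words + STOP means 4 factors in the product.
uniform = lambda w2, w1, w: 0.5
lp = sentence_logprob(["It", "was", "the"], uniform)
```

Summing log probabilities rather than multiplying raw probabilities avoids numeric underflow on long sequences.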

  18. Estimation
unigram: ∏_{i=1}^{n} P(w_i) × P(STOP)
bigram: ∏_{i=1}^{n} P(w_i | w_{i−1}) × P(STOP | w_n)
trigram: ∏_{i=1}^{n} P(w_i | w_{i−2}, w_{i−1}) × P(STOP | w_{n−1}, w_n)
Maximum likelihood estimates:
P(w_i) = c(w_i) / N
P(w_i | w_{i−1}) = c(w_{i−1}, w_i) / c(w_{i−1})
P(w_i | w_{i−2}, w_{i−1}) = c(w_{i−2}, w_{i−1}, w_i) / c(w_{i−2}, w_{i−1})
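The bigram maximum likelihood estimate is just counting. A minimal sketch (function and padding names are my own; sentences are lists of tokens):

```python
from collections import Counter

def mle_bigram(corpus):
    """Estimate P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})
    from a list of tokenized sentences, with START/STOP padding."""
    context_counts, bigram_counts = Counter(), Counter()
    for sent in corpus:
        tokens = ["START"] + sent + ["STOP"]
        for prev, w in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, w)] += 1
    return lambda prev, w: bigram_counts[(prev, w)] / context_counts[prev]

corpus = [["a", "b"], ["a", "b"], ["a", "c"]]
p = mle_bigram(corpus)
# c("a","b") = 2 and c("a") = 3, so p("a", "b") = 2/3
```

Note that any bigram unseen in training gets probability zero here, which is exactly the sparsity problem the smoothing slides address later.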

  19. Generating
[bar chart: a multinomial P(word | context) over a small vocabulary: a, amazing, bad, best, good, like, love, movie, not, of, sword, the, worst]
• What we learn in estimating language models is P(word | context), where context — at least here — is the previous n−1 words (for an n-gram model of order n)
• We have one multinomial over the vocabulary (including STOP) for each context

  20. Generating
• As we sample, the words we generate form the new context we condition on:
context1 context2 → generated word
START START → The
START The → dog
The dog → walked
dog walked → in

  21. Aside: sampling?

  22. Sampling from a Multinomial
Probability mass function (PMF): P(z = x) is the probability that z is exactly x. [bar chart over x = 1…5]

  23. Sampling from a Multinomial
Cumulative distribution function (CDF): P(z ≤ x). [step plot over x = 1…5]

  24. Sampling from a Multinomial
Sample p uniformly in [0,1]; find the point CDF⁻¹(p). [example: p = .78]

  25. Sampling from a Multinomial
Sample p uniformly in [0,1]; find the point CDF⁻¹(p). [example: p = .06]

  26. Sampling from a Multinomial
Sample p uniformly in [0,1]; find the point CDF⁻¹(p). [cumulative thresholds: ≤ .008, ≤ .059, ≤ .071, ≤ .703, ≤ 1.000 for x = 1…5]
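The inverse-CDF procedure on slides 24–26 can be written directly. A sketch (the function name is my own; the probabilities are read off the slide's cumulative thresholds):

```python
import random
from collections import Counter

def sample_multinomial(probs, rng):
    """Inverse-CDF sampling: draw p uniformly in [0, 1), then walk the
    cumulative distribution and return the first index whose CDF value
    exceeds p."""
    p = rng.random()
    cumulative = 0.0
    for i, q in enumerate(probs):
        cumulative += q
        if p < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point round-off

# Probabilities implied by the slide's thresholds .008, .059, .071, .703, 1.000
probs = [0.008, 0.051, 0.012, 0.632, 0.297]
rng = random.Random(0)
counts = Counter(sample_multinomial(probs, rng) for _ in range(10000))
# x = 4 (index 3) should be drawn most often, roughly 63% of the time
```

This is exactly how one draws the next word from P(word | context) when generating text from an n-gram model.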

  27. Unigram model • the around, she They I blue talking “Don’t to and little come of • on fallen used there. young people to Lázaro • of the • the of of never that ordered don't avoided to complaining. • words do had men flung killed gift the one of but thing seen I plate Bradley was by small Kingmaker.

  28. Bigram Model • “What the way to feel where we’re all those ancients called me one of the Council member, and smelled Tales of like a Korps peaks.” • Tuna battle which sold or a monocle, I planned to help and distinctly. • “I lay in the canoe ” • She started to be able to the blundering collapsed. • “Fine.”

  29. Trigram Model • “I’ll worry about it.” • Avenue Great-Grandfather Edgeworth hasn’t gotten there. • “If you know what. It was a photograph of seventeenth-century flourishin’ To their right hands to the fish who would not care at all. Looking at the clock, ticking away like electronic warnings about wonderfully SAT ON FIFTH • Democratic Convention in rags soaked and my past life, I managed to wring your neck a boss won’t so David Pritchet giggled. • He humped an argument but her bare He stood next to Larry, these days it will have no trouble Jay Grayer continued to peer around the Germans weren’t going to faint in the

  30. 4gram Model • Our visitor in an idiot sister shall be blotted out in bars and flirting with curly black hair right marble, wallpapered on screen credit.” • You are much instant coffee ranges of hills. • Madison might be stored here and tell everyone about was tight in her pained face was an old enemy, trading-posts of the outdoors watching Anyog extended On my lips moved feebly. • said. • “I’m in my mind, threw dirt in an inch,’ the Director.

  31. Evaluation • The best evaluation metrics are external — how does a better language model influence the application you care about? • Speech recognition (word error rate), machine translation (BLEU score), topic models (sensemaking)

  32. Evaluation • A good language model should judge unseen real language to have high probability • Perplexity = inverse probability of the test data, averaged by word • To be reliable, the test data must be truly unseen (including knowledge of its vocabulary).
perplexity = P(w_1, ..., w_n)^(−1/N)

  33. Experiment design
training: 80% of the data; purpose: training models
development: 10%; purpose: evaluation, model selection, hyperparameter tuning
testing: 10%; purpose: evaluation; never look at it until the very end
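The 80/10/10 split is a one-liner in code. A sketch (the function name is my own; the slide fixes only the proportions, and in practice one would shuffle before splitting):

```python
def split_corpus(sentences, train_frac=0.8, dev_frac=0.1):
    """Split data into training (estimate models), development
    (model selection, hyperparameter tuning), and test
    (never look at it until the very end)."""
    n = len(sentences)
    n_train = int(n * train_frac)
    n_dev = int(n * dev_frac)
    return (sentences[:n_train],
            sentences[n_train:n_train + n_dev],
            sentences[n_train + n_dev:])

train, dev, test = split_corpus(list(range(100)))
```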

  34. Evaluation
log P(w_1, ..., w_N) = Σ_{i=1}^{N} log P(w_i)
average per word: (1/N) Σ_{i=1}^{N} log P(w_i)
perplexity = exp(−(1/N) Σ_{i=1}^{N} log P(w_i))

  35. Perplexity
trigram model (second-order Markov): perplexity = exp(−(1/N) Σ_{i=1}^{N} log P(w_i | w_{i−2}, w_{i−1}))
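Computed in log space, the perplexity formula is a few lines. A sketch (assuming `logprobs` holds log P(w_i | context) for every token in the test data, one entry per word):

```python
import math

def perplexity(logprobs):
    """perplexity = exp(-(1/N) * sum_i log P(w_i | context))."""
    return math.exp(-sum(logprobs) / len(logprobs))

# Sanity check: a model that assigns every token probability 1/4
# should have perplexity exactly 4.
pp = perplexity([math.log(0.25)] * 10)
```

Intuitively, perplexity is the model's average branching factor: a perplexity of 4 means the model is, on average, as uncertain as a uniform choice among 4 words.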

  36. Perplexity (SLP3 4.3)
Model: Unigram | Bigram | Trigram
Perplexity: 962 | 170 | 109

  37. Smoothing • When estimating a language model, we’re relying on the data we’ve observed in a training corpus. • Training data is a small (and biased) sample of the creativity of language.

  38. Data sparsity SLP3 4.1

  39. ∏_{i=1}^{n} P(w_i | w_{i−1}) × P(STOP | w_n)
• As in Naive Bayes, a single P(w_i | w_{i−1}) = 0 causes P(w) = 0. (What happens to perplexity?)

  40. Smoothing in NB • One solution: add a little probability mass to every element.
maximum likelihood estimate: P(x_i | y) = n_{i,y} / n_y
smoothed estimate (same α for all x_i): P(x_i | y) = (n_{i,y} + α) / (n_y + Vα)
smoothed estimate (possibly different α_j for each x_j): P(x_i | y) = (n_{i,y} + α_i) / (n_y + Σ_{j=1}^{V} α_j)
n_{i,y} = count of word i in class y; n_y = number of words in y; V = size of vocabulary

  41. Additive smoothing
P(w_i) = (c(w_i) + α) / (N + Vα)    (Laplace smoothing: α = 1)
P(w_i | w_{i−1}) = (c(w_{i−1}, w_i) + α) / (c(w_{i−1}) + Vα)
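The add-α bigram estimate above, as code. A sketch (names are my own; the counts would come from a training corpus):

```python
def additive_bigram(bigram_counts, context_counts, vocab_size, alpha=1.0):
    """P(w | prev) = (c(prev, w) + alpha) / (c(prev) + V * alpha).
    alpha = 1 is Laplace smoothing; unseen bigrams get nonzero mass."""
    def prob(prev, w):
        return ((bigram_counts.get((prev, w), 0) + alpha)
                / (context_counts.get(prev, 0) + vocab_size * alpha))
    return prob

# Toy counts over a two-word vocabulary {"a", "b"}
p = additive_bigram({("a", "a"): 1, ("a", "b"): 1}, {"a": 2}, vocab_size=2)
# p("a", ·) sums to 1 over the vocabulary, and a context never seen
# in training ("z") backs off to the uniform distribution 1/V.
```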

  42. Smoothing
Smoothing is the re-allocation of probability mass. [bar charts: MLE vs. smoothing with α = 1, over outcomes 1…6]

  43. Smoothing • How can we best re-allocate probability mass? Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Center for Research in Computing Technology, Harvard University, 1998.

  44. Interpolation • As n-gram order rises, we have the potential for higher precision but also higher variability in our estimates. • A linear interpolation of any two language models p and q (with λ ∈ [0,1]), λp + (1 − λ)q, is also a valid language model. Example: p = the web, q = political speeches.

  45. Interpolation • We can use this fact to make higher-order language models more robust.
P(w_i | w_{i−2}, w_{i−1}) = λ_1 P(w_i | w_{i−2}, w_{i−1}) + λ_2 P(w_i | w_{i−1}) + λ_3 P(w_i), with λ_1 + λ_2 + λ_3 = 1
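The interpolated trigram is a weighted sum of three component models. A sketch (the λ values here are illustrative placeholders; in practice they are tuned on development data):

```python
def interpolate(p_tri, p_bi, p_uni, lambdas=(0.5, 0.3, 0.2)):
    """Linear interpolation of trigram, bigram, and unigram models:
    lambda1 * P(w | w-2, w-1) + lambda2 * P(w | w-1) + lambda3 * P(w),
    with the lambdas summing to 1 so the result is a valid model."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    def prob(w2, w1, w):
        return l1 * p_tri(w2, w1, w) + l2 * p_bi(w1, w) + l3 * p_uni(w)
    return prob

# Toy check with constant component models:
# 0.5*0.5 + 0.3*0.25 + 0.2*0.125 = 0.35
p = interpolate(lambda a, b, c: 0.5, lambda a, b: 0.25, lambda a: 0.125)
```

Because each λ is non-negative and they sum to 1, the interpolated estimate is nonzero whenever the unigram estimate is, which is what makes the higher-order model robust to unseen contexts.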
