  1. Language Modeling CS 6956: Deep Learning for NLP

  2. Overview
     • What is a language model?
     • How do we evaluate language models?
     • Traditional language models
     • Feedforward neural networks for language modeling
     • Recurrent neural networks for language modeling

  4–6. Language models
     • What is the probability of a sentence?
       – Grammatically incorrect or rare sentences should receive lower probability
       – Or equivalently: what is the probability of a word following a sequence of words?
         “The cat chased a mouse” vs “The cat chased a turnip” (see the sketch after this slide)
     • Language modeling can be framed as a sequence modeling task
     • Two classes of models
       – Count-based models: Markov assumptions with smoothing
       – Neural models
     • We have seen this difference before; in this lecture, we will look at some of the details
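
To make the next-word framing concrete, here is a minimal sketch. The function next_word_probs is a hypothetical stand-in (not from the slides) for any trained language model: given the words seen so far, it returns a probability distribution over possible next words. A good model should then score “The cat chased a mouse” higher than “The cat chased a turnip”.

    import math

    def sentence_log_prob(sentence, next_word_probs):
        """Score a sentence as a sum of log next-word probabilities.

        `next_word_probs(history)` is assumed to return a dict mapping each
        candidate next word to its conditional probability given `history`.
        """
        words = sentence.split()
        total = 0.0
        for i, word in enumerate(words):
            p = next_word_probs(words[:i]).get(word, 0.0)
            if p == 0.0:
                # An unseen continuation gets probability zero here; avoiding
                # such zeros is exactly what smoothing and neural models address.
                return float("-inf")
            total += math.log(p)
        return total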

  7. Overview (outline repeated; next topic: how do we evaluate language models?)

  8–10. Evaluating language models: extrinsic evaluation
     • A good language model should help with an end task such as machine translation
       – If we have an MT system that uses a language model to produce outputs…
       – …a better language model should produce better outputs
     • To evaluate a language model, is a downstream task needed?
       – This can be slow, and it depends on the quality of the downstream system
     • Can we define an intrinsic evaluation instead?

  11–12. What is a good language model?
     • It should prefer good sentences to bad ones
       – It should assign higher probabilities to valid/grammatical/frequent sentences
       – It should assign lower probabilities to invalid/ungrammatical/rare sentences
     • Can we construct an evaluation metric that directly measures this?
       – Answer: perplexity

  13–15. Perplexity
     • A good language model should assign high probability to sentences that occur in the real world
       – We need a metric that captures this intuition but normalizes for sentence length
     • Given a sentence x_1 x_2 x_3 ⋯ x_n, define the perplexity of a language model P as

           Perplexity = P(x_1 x_2 x_3 ⋯ x_n)^{-1/n}

     • Lower perplexity corresponds to higher probability
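
As a worked number (not from the slides): if a model assigns probability 1/125 to a particular three-word sentence, its perplexity on that sentence is (1/125)^{-1/3} = 5. Intuitively, the model was on average as uncertain as if it were choosing among 5 equally likely words at each position.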

  16. Example: uniformly likely words
     • Suppose a sentence has n words, each chosen independently and uniformly from n equally likely words
       – This would be a strange language…
     • Each word then has probability 1/n, so

           Perplexity = P(x_1 x_2 x_3 ⋯ x_n)^{-1/n} = ((1/n)^n)^{-1/n} = n

       A quick numerical check follows.
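
A minimal numerical check of this example (the value of n and the sentence length below are arbitrary choices, not from the slides). Note that the result depends only on the number of equally likely words, not on how long the sentence is:

    # Perplexity of a model in which every word is one of n equally likely choices.
    import math

    n = 50                    # number of equally likely words (arbitrary)
    length = 12               # sentence length (arbitrary)

    sentence_prob = (1.0 / n) ** length          # P(x_1 ... x_length)
    perplexity = sentence_prob ** (-1.0 / length)

    print(perplexity)                            # ~50.0, i.e. equal to n
    assert math.isclose(perplexity, n)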

  17–21. Perplexity of history-based models
     • Given a sentence x_1 x_2 x_3 ⋯ x_n, the perplexity of a language model P is

           Perplexity = P(x_1 x_2 x_3 ⋯ x_n)^{-1/n}

     • A history-based model factors the sentence probability with the chain rule:

           P(x_1 ⋯ x_n) = ∏_i P(x_i | x_{1:i-1})

     • Substituting and rewriting in terms of logarithms:

           Perplexity = [∏_i P(x_i | x_{1:i-1})]^{-1/n}
                      = 2^{log_2 [∏_i P(x_i | x_{1:i-1})]^{-1/n}}
                      = 2^{-(1/n) ∑_i log_2 P(x_i | x_{1:i-1})}

     • The exponent is the average number of bits per word needed to encode the sentence; a sketch of this computation follows
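
A minimal sketch of the last expression, computing perplexity from per-token conditional probabilities in log space (the probabilities below are made up for illustration; real values would come from a trained model):

    import math

    def perplexity(token_probs):
        """Perplexity = 2 ** (-(1/n) * sum_i log2 P(x_i | x_{1:i-1})).

        `token_probs` lists the model's conditional probability of each token
        given its history; summing logs avoids numerical underflow on long texts.
        """
        n = len(token_probs)
        avg_bits = -sum(math.log2(p) for p in token_probs) / n
        return 2.0 ** avg_bits

    # Made-up conditional probabilities for a four-token sentence:
    print(perplexity([0.2, 0.1, 0.5, 0.05]))   # ≈ 6.69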

  22. Evaluating language models
     • Several benchmark datasets are available
       – Penn Treebank Wall Street Journal corpus
         • Standard preprocessing by Mikolov
         • Vocabulary size: 10K words
         • Training size: 890K tokens
       – Billion Word Benchmark
         • English news text [Chelba et al., 2013]
         • Vocabulary size: ~793K words
         • Training size: ~800M tokens
     • Standard methodology: train on the training set and evaluate on the test set
       – Some papers also continue training on the evaluation set, because no labels are needed

  23. Overview (outline repeated; next topic: traditional language models)

  24–29. Traditional language models
     • Traditional language models require counting n-grams
     • The goal: compute P(x_1 x_2 ⋯ x_n) for any sequence of words
     • The (k+1)-th order Markov assumption:

           P(x_1 x_2 ⋯ x_n) ≈ ∏_i P(x_{i+1} | x_{i-k:i})

     • We need to estimate these conditional probabilities from data, by counting:

           P(x_{i+1} | x_{i-k:i}) = Count(x_{i-k:i}, x_{i+1}) / Count(x_{i-k:i})

     • The problem: zeros in the counts
     • The solution: smoothing (a small sketch follows)
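
As a concrete illustration of the count-based recipe, here is a minimal sketch of a bigram model (k = 1) with add-one (Laplace) smoothing. The toy corpus is made up, and add-one is only one of many smoothing methods; the slides do not commit to a particular one.

    from collections import Counter

    # Made-up toy corpus; real benchmarks (e.g. Penn Treebank) are far larger.
    corpus = [
        ["the", "cat", "chased", "a", "mouse"],
        ["the", "dog", "chased", "a", "cat"],
    ]

    bigram_counts = Counter()
    unigram_counts = Counter()
    vocab = set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        vocab.update(tokens)
        for prev, curr in zip(tokens, tokens[1:]):
            bigram_counts[(prev, curr)] += 1
            unigram_counts[prev] += 1

    def prob(curr, prev, alpha=1.0):
        """Add-one (Laplace) smoothed estimate of P(curr | prev).

        Without the +alpha terms this is the plain ratio
        Count(prev, curr) / Count(prev), which is zero for unseen bigrams.
        """
        return (bigram_counts[(prev, curr)] + alpha) / (unigram_counts[prev] + alpha * len(vocab))

    print(prob("cat", "the"))     # seen bigram: 0.2
    print(prob("mouse", "the"))   # unseen bigram: 0.1, small but nonzero thanks to smoothing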
