  1. Natural Language Processing with Deep Learning CS224N/Ling284 Lecture 6: Language Models and Recurrent Neural Networks Abigail See

  2. Overview Today we will: • Introduce a new NLP task: Language Modeling • …which motivates us to introduce a new family of neural networks: Recurrent Neural Networks (RNNs) • These are two of the most important ideas for the rest of the class!

  3. Language Modeling • Language Modeling is the task of predicting what word comes next: "the students opened their ______" (books? laptops? exams? minds?) • More formally: given a sequence of words x^(1), x^(2), …, x^(t), compute the probability distribution of the next word x^(t+1): P(x^(t+1) | x^(t), …, x^(1)), where x^(t+1) can be any word in the vocabulary V = {w_1, …, w_|V|} • A system that does this is called a Language Model.

  4. Language Modeling • You can also think of a Language Model as a system that assigns a probability to a piece of text. • For example, if we have some text x^(1), …, x^(T), then the probability of this text (according to the Language Model) is: P(x^(1), …, x^(T)) = P(x^(1)) × P(x^(2) | x^(1)) × … × P(x^(T) | x^(T−1), …, x^(1)) = ∏_{t=1}^{T} P(x^(t) | x^(t−1), …, x^(1)) • Each factor in this product is exactly what our LM provides.
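To make the chain-rule decomposition concrete, here is a minimal Python sketch (not from the slides): it scores a piece of text by summing log conditional probabilities. `score_text` and `next_word_probs` are hypothetical names standing in for any Language Model interface.

```python
import math

def score_text(words, next_word_probs):
    """Return the log-probability of `words` under a Language Model.

    `next_word_probs(prefix)` is assumed to return a dict mapping each
    possible next word to its probability given the prefix seen so far.
    """
    log_prob = 0.0
    for t, word in enumerate(words):
        # add log P(x^(t+1) | x^(t), ..., x^(1)); logs turn the product into a sum
        log_prob += math.log(next_word_probs(words[:t])[word])
    return log_prob
```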

  5. You use Language Models every day!

  6. You use Language Models every day!

  7. n-gram Language Models "the students opened their ______" • Question: How to learn a Language Model? • Answer (pre-Deep Learning): learn an n-gram Language Model! • Definition: An n-gram is a chunk of n consecutive words. • unigrams: "the", "students", "opened", "their" • bigrams: "the students", "students opened", "opened their" • trigrams: "the students opened", "students opened their" • 4-grams: "the students opened their" • Idea: Collect statistics about how frequent different n-grams are, and use these to predict the next word.

  8. n-gram Language Models • First we make a simplifying assumption: x^(t+1) depends only on the preceding n−1 words. P(x^(t+1) | x^(t), …, x^(1)) ≈ P(x^(t+1) | x^(t), …, x^(t−n+2)) (assumption) = P(x^(t+1), x^(t), …, x^(t−n+2)) / P(x^(t), …, x^(t−n+2)) (definition of conditional prob: prob of an n-gram divided by prob of an (n−1)-gram) • Question: How do we get these n-gram and (n−1)-gram probabilities? • Answer: By counting them in some large corpus of text! ≈ count(x^(t+1), x^(t), …, x^(t−n+2)) / count(x^(t), …, x^(t−n+2)) (statistical approximation)
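A minimal sketch of this counting estimate on a toy corpus (hypothetical helper names, not code from the lecture):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple of words) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

corpus = "the students opened their books . the students opened their exams".split()
n = 3
ngrams = ngram_counts(corpus, n)          # counts of n-grams
contexts = ngram_counts(corpus, n - 1)    # counts of (n-1)-grams

# P(books | opened their) = count(opened their books) / count(opened their)
context, word = ("opened", "their"), "books"
print(ngrams[context + (word,)] / contexts[context])   # 1 / 2 = 0.5 on this toy corpus
```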

  9. n-gram Language Models: Example Suppose we are learning a 4-gram Language Model. "as the proctor started the clock, the students opened their _____" — discard everything but the last 3 words and condition on "students opened their". For example, suppose that in the corpus: • "students opened their" occurred 1000 times • "students opened their books" occurred 400 times → P(books | students opened their) = 0.4 • "students opened their exams" occurred 100 times → P(exams | students opened their) = 0.1 (Should we have discarded the "proctor" context?)

  10. Sparsity Problems with n-gram Language Models • Sparsity Problem 1: What if "students opened their w" never occurred in the data? Then w has probability 0! (Partial) Solution: Add a small ε to the count for every w ∈ V. This is called smoothing. • Sparsity Problem 2: What if "students opened their" never occurred in the data? Then we can't calculate the probability for any w! (Partial) Solution: Just condition on "opened their" instead. This is called backoff. • Note: Increasing n makes sparsity problems worse. Typically we can't have n bigger than 5.
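A minimal sketch of the two partial solutions (hypothetical function names; `counts` is assumed to map word tuples of any length to their corpus counts, and `vocab` is the vocabulary):

```python
def smoothed_prob(counts, vocab, context, word, eps=1e-3):
    """Smoothing: add a small eps to the count of every word in the vocabulary,
    so no word w gets probability exactly 0 given the context."""
    num = counts.get(context + (word,), 0) + eps
    den = counts.get(context, 0) + eps * len(vocab)
    return num / den

def backoff_prob(counts, vocab, context, word, eps=1e-3):
    """Backoff: if the full context was never seen, condition on a shorter one
    (e.g. fall back from "students opened their" to "opened their")."""
    while context and counts.get(context, 0) == 0:
        context = context[1:]   # drop the earliest word of the context
    return smoothed_prob(counts, vocab, context, word, eps)
```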

  11. Storage Problems with n-gram Language Models • Storage: Need to store counts for all n-grams you saw in the corpus. Increasing n or increasing the corpus increases model size!

  12. n-gram Language Models in practice • You can build a simple trigram Language Model over a 1.7 million word corpus (Reuters business and financial news) in a few seconds on your laptop* • Condition on "today the ___" and get a probability distribution: company 0.153, bank 0.153, price 0.077, italian 0.039, emirate 0.039, … • Sparsity problem: not much granularity in the probability distribution. Otherwise, seems reasonable! * Try for yourself: https://nlpforhackers.io/language-models/
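The linked tutorial builds its model with NLTK; the sketch below follows the same general recipe but is an illustration, not the tutorial's or the lecture's code: count trigrams over the Reuters corpus, then normalize one context's counts into a probability distribution.

```python
from collections import Counter, defaultdict
import nltk
from nltk.corpus import reuters

nltk.download("reuters")   # the ~1.7M-word Reuters business/financial news corpus

# count how often each word follows each (w1, w2) context
counts = defaultdict(Counter)
for sent in reuters.sents():
    for w1, w2, w3 in nltk.trigrams([w.lower() for w in sent],
                                    pad_left=True, pad_right=True):
        counts[(w1, w2)][w3] += 1

# normalize the counts for one context into a probability distribution
dist = counts[("today", "the")]
total = sum(dist.values())
for word, count in dist.most_common(5):
    print(word, round(count / total, 3))   # high-probability continuations of "today the"
```

The exact probabilities you see will depend on tokenization details (lowercasing, padding), but the shape of the distribution should resemble the table above.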

  13. Generating text with an n-gram Language Model • You can also use a Language Model to generate text. • Condition on "today the ___" and get the probability distribution: company 0.153, bank 0.153, price 0.077, italian 0.039, emirate 0.039, … • Then sample a word from this distribution.

  14. Generating text with an n-gram Language Model • We sampled "price", so now condition on "the price ___" and get the probability distribution: of 0.308, for 0.050, it 0.046, to 0.046, is 0.031, … • Sample again.

  15. Generating text with an n-gram Language Model • We sampled "of", so now condition on "price of ___" and get the probability distribution: the 0.072, 18 0.043, oil 0.043, its 0.036, gold 0.018, … • Sample again.

  16. Generating text with an n-gram Language Model • We sampled "gold", so the text so far is: today the price of gold _______

  17. Generating text with an n-gram Language Model • Continuing to sample, we get: "today the price of gold per ton , while production of shoe lasts and shoe industry , the bank intervened just after it considered and rejected an imf demand to rebuild depleted european stocks , sept 30 end primary 76 cts a share ." • Surprisingly grammatical! …but incoherent. We need to consider more than three words at a time if we want to model language well. But increasing n worsens the sparsity problem and increases model size…
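A minimal sketch of the sampling loop just illustrated (hypothetical `generate`; `counts` is assumed to be a trigram table like the one sketched earlier): condition on the last two words, sample the next word in proportion to its count, append, repeat.

```python
import random

def generate(counts, w1="today", w2="the", max_len=30):
    """Sample text from a trigram LM stored as {(w1, w2): Counter of next words}."""
    out = [w1, w2]
    for _ in range(max_len):
        dist = counts.get((w1, w2))
        if not dist:
            break                                                 # unseen context: stop
        words, freqs = zip(*dist.items())
        next_word = random.choices(words, weights=freqs)[0]       # sample proportional to count
        if next_word is None:                                     # end-of-sentence padding
            break
        out.append(next_word)
        w1, w2 = w2, next_word
    return " ".join(out)

# e.g. print(generate(counts)) with the Reuters counts sketched above
```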

  18. How to build a neural Language Model? • Recall the Language Modeling task: • Input: sequence of words x^(1), x^(2), …, x^(t) • Output: probability distribution of the next word P(x^(t+1) | x^(t), …, x^(1)) • How about a window-based neural model? • We saw this applied to Named Entity Recognition in Lecture 3: a fixed window around the center word ("museums in Paris are amazing") is fed to a classifier that labels the center word, e.g. "Paris" → LOCATION.

  19. A fixed-window neural Language Model "as the proctor started the clock, the students opened their ______" — discard the context outside a fixed window and condition only on the window (e.g., the last 4 words: "the students opened their").

  20. A fixed-window neural Language Model • words / one-hot vectors: x^(1), x^(2), x^(3), x^(4) = "the students opened their" • concatenated word embeddings: e = [e^(1); e^(2); e^(3); e^(4)] • hidden layer: h = f(W e + b_1) • output distribution: ŷ = softmax(U h + b_2) ∈ R^|V| (e.g., high probability on "books", "laptops"; low probability on "a", "zoo")
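A minimal PyTorch sketch of this architecture (layer sizes and the class name are assumptions, not the course's reference implementation):

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, window=4, hidden_dim=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)      # one-hot x -> embedding e
        self.W = nn.Linear(window * embed_dim, hidden_dim)    # h = f(W e + b1)
        self.U = nn.Linear(hidden_dim, vocab_size)            # y-hat = softmax(U h + b2)

    def forward(self, window_ids):                            # (batch, window) word ids
        e = self.embed(window_ids).flatten(start_dim=1)       # concatenated embeddings
        h = torch.tanh(self.W(e))                             # hidden layer
        return torch.log_softmax(self.U(h), dim=-1)           # log prob of the next word

# e.g. one window: "the students opened their" as word ids
model = FixedWindowLM(vocab_size=10_000)
log_probs = model(torch.tensor([[1, 2, 3, 4]]))               # shape (1, 10000)
```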

  21. A fixed-window neural Language Model • Improvements over n-gram LM: • No sparsity problem • Don't need to store all observed n-grams • Remaining problems: • Fixed window is too small • Enlarging the window enlarges W • Window can never be large enough! • x^(1) and x^(2) are multiplied by completely different weights in W. No symmetry in how the inputs are processed. • We need a neural architecture that can process any length input.

  22. Recurrent Neural Networks (RNN) • A family of neural architectures • Core idea: Apply the same weights W repeatedly • input sequence (any length): x^(1), x^(2), … • hidden states: h^(0), h^(1), h^(2), … • outputs (optional): ŷ^(1), ŷ^(2), …

  23. An RNN Language Model • words / one-hot vectors: x^(1), …, x^(4) = "the students opened their" • word embeddings: e^(t) = E x^(t) • hidden states: h^(t) = σ(W_h h^(t−1) + W_e e^(t) + b_1), where h^(0) is the initial hidden state • output distribution: ŷ^(t) = softmax(U h^(t) + b_2) ∈ R^|V| (e.g., high probability on "books", "laptops"; low on "a", "zoo") • Note: this input sequence could be much longer, but this slide doesn't have space!
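A minimal PyTorch sketch of these equations (class name and sizes are assumptions, not the course's reference implementation); it unrolls the recurrence step by step, applying the same weights at every timestep:

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=200):
        super().__init__()
        self.E = nn.Embedding(vocab_size, embed_dim)           # e^(t) = E x^(t)
        self.W_e = nn.Linear(embed_dim, hidden_dim, bias=False)
        self.W_h = nn.Linear(hidden_dim, hidden_dim)           # its bias plays the role of b1
        self.U = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids):                               # (batch, seq_len) word ids
        batch = word_ids.size(0)
        h = torch.zeros(batch, self.W_h.out_features)          # h^(0): initial hidden state
        outputs = []
        for t in range(word_ids.size(1)):
            e_t = self.E(word_ids[:, t])                       # embedding of x^(t)
            h = torch.sigmoid(self.W_h(h) + self.W_e(e_t))     # same weights every step
            outputs.append(torch.log_softmax(self.U(h), dim=-1))   # log y-hat^(t)
        return torch.stack(outputs, dim=1)                     # (batch, seq_len, vocab)
```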

  24. An RNN Language Model • RNN Advantages: • Can process any length input • Computation for step t can (in theory) use information from many steps back • Model size doesn't increase for longer input • Same weights applied on every timestep, so there is symmetry in how inputs are processed • RNN Disadvantages: • Recurrent computation is slow • In practice, difficult to access information from many steps back • More on these later in the course

  25. Training an RNN Language Model • Get a big corpus of text, which is a sequence of words x^(1), …, x^(T) • Feed it into the RNN-LM; compute the output distribution ŷ^(t) for every step t. • i.e. predict the probability distribution of every word, given the words so far • Loss function on step t is the cross-entropy between the predicted probability distribution ŷ^(t) and the true next word y^(t) (one-hot for x^(t+1)): J^(t)(θ) = CE(y^(t), ŷ^(t)) = −Σ_{w∈V} y^(t)_w log ŷ^(t)_w = −log ŷ^(t)_{x^(t+1)} • Average this to get the overall loss for the entire training set: J(θ) = (1/T) Σ_{t=1}^{T} J^(t)(θ)
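A minimal sketch of this loss (assumed tensor shapes; `lm_loss` is a hypothetical helper). Because y^(t) is one-hot, the cross-entropy at each step reduces to the negative log probability assigned to the true next word:

```python
import torch
import torch.nn.functional as F

def lm_loss(log_probs, targets):
    """Average cross-entropy over a sequence.

    `log_probs`: (seq_len, |V|) log y-hat^(t) from the RNN-LM;
    `targets`: (seq_len,) index of the true next word x^(t+1) at each step.
    """
    per_step = -log_probs[torch.arange(len(targets)), targets]   # J^(t) for each t
    return per_step.mean()                                       # J = (1/T) * sum_t J^(t)

# the built-in F.nll_loss computes the same average:
# loss = F.nll_loss(log_probs, targets)
```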

  26. Training an RNN Language Model • Feed the corpus "the students opened their exams …" into the RNN-LM and compute the predicted prob dists ŷ^(1), ŷ^(2), … • Loss on step 1: J^(1)(θ) = negative log prob of "students"

  27. Training an RNN Language Model • Loss on step 2: J^(2)(θ) = negative log prob of "opened"

  28. Training an RNN Language Model • Loss on step 3: J^(3)(θ) = negative log prob of "their"

  29. Training an RNN Language Model • Loss on step 4: J^(4)(θ) = negative log prob of "exams"

  30. Training an RNN Language Model • Total loss: J(θ) = (1/T) Σ_t J^(t)(θ), i.e. J^(1)(θ) + J^(2)(θ) + J^(3)(θ) + J^(4)(θ) + … averaged over the sequence

  31. Training an RNN Language Model • However: Computing loss and gradients across the entire corpus x^(1), …, x^(T) is too expensive! • In practice, consider x^(1), …, x^(T) to be a sentence (or a document) • Recall: Stochastic Gradient Descent allows us to compute loss and gradients for a small chunk of data and update. • Compute the loss J(θ) for a sentence (actually a batch of sentences), compute gradients and update weights. Repeat.
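A minimal sketch of that loop (hypothetical `train` and `batches`; batching and data loading are omitted):

```python
import torch
import torch.nn.functional as F

def train(model, batches, lr=0.1, epochs=1):
    """SGD loop: loss on a batch of sentences, gradients, weight update, repeat.

    `model` is an RNN-LM returning (batch, seq_len, |V|) log-probs; `batches`
    yields (inputs, targets) word-id tensors, with targets shifted one step ahead.
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for inputs, targets in batches:
            log_probs = model(inputs)                           # forward pass
            loss = F.nll_loss(log_probs.flatten(0, 1),          # average J(theta)
                              targets.flatten())                # over all steps
            optimizer.zero_grad()
            loss.backward()                                     # backpropagation
            optimizer.step()                                    # update weights
```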

  32. Backpropagation for RNNs • Question: What's the derivative of J^(t)(θ) w.r.t. the repeated weight matrix W_h? • Answer: ∂J^(t)/∂W_h = Σ_{i=1}^{t} ∂J^(t)/∂W_h |_(i) — "The gradient w.r.t. a repeated weight is the sum of the gradient w.r.t. each time it appears" • Why?
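One way to see why (a sketch, not from the slides): let autograd do the bookkeeping and check numerically that the gradient w.r.t. a shared weight matrix equals the sum of the gradients w.r.t. per-timestep copies of that matrix.

```python
import torch

torch.manual_seed(0)
d = 4
W = torch.randn(d, d, requires_grad=True)
xs = [torch.randn(d) for _ in range(3)]

# Unrolled RNN-style computation where the same W is applied at every step
h = torch.zeros(d)
for x in xs:
    h = torch.tanh(W @ h + x)
h.sum().backward()
grad_shared = W.grad.clone()            # gradient w.r.t. the repeated weight

# Give each timestep its own copy of W (same values) and sum their gradients
W_copies = [W.detach().clone().requires_grad_(True) for _ in xs]
h = torch.zeros(d)
for W_i, x in zip(W_copies, xs):
    h = torch.tanh(W_i @ h + x)
h.sum().backward()
grad_sum = sum(W_i.grad for W_i in W_copies)

print(torch.allclose(grad_shared, grad_sum))   # True: sum over appearances
```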
