  1. Natural Language Processing with Deep Learning: Language Modeling with Recurrent Neural Networks. Navid Rekab-Saz (navid.rekabsaz@jku.at), Institute of Computational Perception

  2. Agenda • Language Modeling with n-grams • Recurrent Neural Networks • Language Modeling with RNN • Backpropagation Through Time. The slides are adapted from http://web.stanford.edu/class/cs224n/

  3. Agenda • Language Modeling with n-grams • Recurrent Neural Networks • Language Modeling with RNN • Backpropagation Through Time

  4. Language Modeling § Language Modeling is the task of predicting a word (or a subword or character) given a context: P(w | context) § A Language Model can answer questions like P(w | the students opened their ___)

  5. Language Modeling § Formally, given a sequence of words y^(1), y^(2), ..., y^(t), a language model calculates the probability distribution of the next word y^(t+1) over all words in the vocabulary: P(y^(t+1) | y^(t), y^(t-1), ..., y^(1)), where y^(t+1) can be any word in the vocabulary 𝕎 = {w_1, w_2, ..., w_|𝕎|}

  6. Language Modeling § You can also think of a Language Model as a system that assigns a probability to a piece of text: how probable is it that someone generates this sentence? "colorless green ideas sleep furiously" § According to a Language Model, the probability of a given text is computed by the chain rule: P(y^(1), ..., y^(T)) = P(y^(1)) × P(y^(2) | y^(1)) × ... × P(y^(T) | y^(T-1), ..., y^(1)) = ∏_{t=1}^{T} P(y^(t) | y^(t-1), ..., y^(1))
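As a minimal illustration of this chain-rule factorization, a sentence's log-probability can be accumulated left to right. The toy `cond_prob` table and helper below are assumptions for illustration, not part of the lecture:

```python
import math

# Hypothetical conditional probabilities P(word | history) for a toy model;
# in practice these come from an n-gram or neural language model.
cond_prob = {
    ("<s>",): {"the": 0.5, "a": 0.5},
    ("<s>", "the"): {"students": 0.2, "cat": 0.1},
    ("<s>", "the", "students"): {"opened": 0.3},
}

def sentence_log_prob(words):
    """Chain rule: log P(y1..yT) = sum_t log P(y_t | y_1, ..., y_{t-1})."""
    history = ("<s>",)
    total = 0.0
    for w in words:
        p = cond_prob.get(history, {}).get(w, 1e-12)  # tiny floor for unseen events
        total += math.log(p)
        history = history + (w,)
    return total

print(sentence_log_prob(["the", "students", "opened"]))  # log(0.5 * 0.2 * 0.3)
```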

  7. Why Language Modeling? § Language Modeling is a benchmark task that helps us measure our progress on understanding language § Language Modeling is a subcomponent of many NLP tasks, especially those involving generating text or estimating the probability of text: - Predictive typing - Spelling/grammar correction - Speech recognition - Handwriting recognition - Machine translation - Summarization - Dialogue/chatbots - etc.

  8. Language Modeling [link]

  9. Language Modeling

  10. n-gram Language Model § Recall: an n-gram is a chunk of n consecutive words. the students opened their ______ § unigrams: "the", "students", "opened", "their" § bigrams: "the students", "students opened", "opened their" § trigrams: "the students opened", "students opened their" § 4-grams: "the students opened their" § An n-gram Language Model collects frequency statistics of different n-grams in a corpus, and uses these to calculate probabilities
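A minimal sketch of extracting the n-grams listed above (the helper name `ngrams` is mine, not from the slides):

```python
def ngrams(tokens, n):
    """Return all chunks of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the students opened their".split()
print(ngrams(tokens, 1))  # unigrams: ('the',), ('students',), ...
print(ngrams(tokens, 2))  # bigrams: ('the', 'students'), ...
print(ngrams(tokens, 3))  # trigrams
print(ngrams(tokens, 4))  # 4-grams
```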

  11. n-gram Language Model § Markov assumption: the decision at time t depends only on the current state § In an n-gram Language Model: predicting y^(t+1) depends only on the preceding n-1 words § Without the Markov assumption: P(y^(t+1) | y^(t), y^(t-1), ..., y^(1)) § n-gram Language Model: P(y^(t+1) | y^(t), y^(t-1), ..., y^(t-n+2)), i.e. conditioning only on the preceding n-1 words

  12. n-gram Language Model § Based on the definition of conditional probability: P(y^(t+1) | y^(t), ..., y^(t-n+2)) = P(y^(t+1), y^(t), ..., y^(t-n+2)) / P(y^(t), ..., y^(t-n+2)) § The n-gram probability is estimated by counting n-grams and (n-1)-grams in a large corpus of text: P(y^(t+1) | y^(t), ..., y^(t-n+2)) ≈ count(y^(t+1), y^(t), ..., y^(t-n+2)) / count(y^(t), ..., y^(t-n+2))
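A sketch of this count-based estimate, building toy count tables from a tiny corpus (the function names and the corpus are assumptions for illustration):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram occurring in a token stream."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

corpus = "the students opened their books and the students opened their exams".split()
n = 4
num = ngram_counts(corpus, n)        # counts of n-grams
den = ngram_counts(corpus, n - 1)    # counts of (n-1)-grams

def prob(next_word, context):
    """P(next | context) ≈ count(context + next) / count(context)."""
    c = den[tuple(context)]
    return num[tuple(context) + (next_word,)] / c if c else 0.0

print(prob("books", ["students", "opened", "their"]))  # 0.5 in this toy corpus
```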

  13. n-gram Language Model § Example: learning a 4-gram Language Model. as the exam clerk started the clock, the students opened their ______ (the model conditions only on the last 3 words, "students opened their") P(w | students opened their) = count(students opened their w) / count(students opened their) § For example, suppose that in the corpus: - "students opened their" occurred 1000 times - "students opened their books" occurred 400 times • P(books | students opened their) = 0.4 - "students opened their exams" occurred 100 times • P(exams | students opened their) = 0.1

  14. n-gram Language Model – problems § Sparsity - If the numerator "students opened their w" never occurred in the corpus • Smoothing: add a small hyper-parameter ε to every word's count - If the denominator "students opened their" never occurred in the corpus • Backoff: condition on "students opened" instead - Increasing n makes the sparsity problem worse! § Storage - The model needs to store all n-grams (from unigrams to n-grams) observed in the corpus - Increasing n worsens the storage problem radically!
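Both remedies can be sketched on top of the count tables from the previous sketch; the exact placement of ε and the simple one-step backoff below are my assumptions, not the slide's precise recipe:

```python
def smoothed_prob(next_word, context, num, den, vocab_size, eps=1e-3):
    """Add-eps smoothing: give every word a small pseudo-count."""
    c_num = num[tuple(context) + (next_word,)]
    c_den = den[tuple(context)]
    return (c_num + eps) / (c_den + eps * vocab_size)

def backoff_prob(next_word, context, tables):
    """Back off to a shorter context whenever the current one was never seen.
    `tables[k]` holds the (num, den) Counters of a (k+1)-gram estimate."""
    for k in range(len(context), 0, -1):
        num, den = tables[k]
        ctx = tuple(context[-k:])
        if den[ctx] > 0:
            return num[ctx + (next_word,)] / den[ctx]
    return 0.0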

  15. n-gram Language Models – generating text § A trigram Language Model trained on the Reuters corpus (1.7M words)

  16. n-gram Language Models – generating text § Generating text by sampling from the probability distributions

  17. n-gram Language Models – generating text § Generating text by sampling from the probability distributions

  18. n-gram Language Models – generating text § Generating text by sampling from the probability distributions

  19. n-gram Language Models – generating text § Generating text by sampling from the probability distributions § Syntactically quite good ... but incoherent! § Increasing n makes the text more coherent, but also intensifies the sparsity and storage issues discussed above
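A minimal sketch of such a sampling loop, reusing the 4-gram Counter `num` from the counting sketch above (the loop structure and names are mine):

```python
import random

def sample_next(context, num):
    """Sample the next word from P(. | context), proportional to n-gram counts."""
    candidates = {ng[-1]: c for ng, c in num.items() if ng[:-1] == tuple(context)}
    if not candidates:
        return None
    words, counts = zip(*candidates.items())
    return random.choices(words, weights=counts, k=1)[0]

# `num` is the 4-gram Counter built in the counting sketch above
text = "the students opened their".split()
for _ in range(5):                     # generate up to 5 more words
    w = sample_next(text[-3:], num)    # condition on the last 3 words
    if w is None:
        break
    text.append(w)
print(" ".join(text))
```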

  20. Agenda • Language Modeling with n-grams • Recurrent Neural Networks • Training Language Models with RNN • Backpropagation Through Time

  21. Recurrent Neural Network § A Recurrent Neural Network (RNN) encodes/embeds a sequential input of any length, such as ... - a sequence of word/subword/character vectors - a time series ... into compositional embeddings § The RNN captures dependencies across the sequence by applying the same parameters at every time step § The RNN outputs a final embedding, but also an intermediate embedding at each time step

  22. Recurrent Neural Networks (diagram: the RNN cell maps the input e^(t) and the previous hidden state h^(t-1) to the new hidden state h^(t))

  23. Recurrent Neural Networks § The output h^(t) is a function of the input e^(t) and the output of the previous time step h^(t-1): h^(t) = RNN(h^(t-1), e^(t)) § h^(t) is called the hidden state § Through the hidden state h^(t-1), the model has access to a sort of memory of all previous inputs

  24. RNN – Unrolling (diagram: the same RNN cell is applied at every time step; the word embeddings e^(1), e^(2), ..., e^(t) of "The quick brown fox jumps over the lazy dog" produce hidden states h^(1), h^(2), ..., h^(t), each fed into the next step)

  25. RNN – Compositional embedding (diagram: the hidden states h^(1), ..., h^(5) for the sentence "cat sunbathes on river bank" are combined into a sentence embedding)

  26. RNN – Compositional embedding § The sentence embedding can be taken from the last hidden state (diagram: same sentence "cat sunbathes on river bank", using h^(5))

  27. RNN – Compositional embedding § Alternatively, the sentence embedding can be the element-wise max, or the mean, of all hidden states (diagram: same sentence "cat sunbathes on river bank")
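A small numpy sketch of the pooling options next to the last-hidden-state option (the shapes are toy assumptions):

```python
import numpy as np

# toy hidden states: 5 time steps ("cat sunbathes on river bank"), dimension 8
H = np.random.randn(5, 8)

last_state = H[-1]              # use the last hidden state
max_pooled = H.max(axis=0)      # element-wise max over time steps
mean_pooled = H.mean(axis=0)    # mean over time steps
print(last_state.shape, max_pooled.shape, mean_pooled.shape)  # (8,) each
```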

  28. Standard (Elman) RNN § General form of an RNN function: h^(t) = RNN(h^(t-1), e^(t)) § Standard RNN: - a linear projection of the previous hidden state h^(t-1) - a linear projection of the input e^(t) - the two projections are summed and a non-linearity is applied: h^(t) = σ(h^(t-1) W_h + e^(t) W_e + b)
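A minimal numpy sketch of this update rule, assuming tanh as the non-linearity and toy dimensions:

```python
import numpy as np

d_in, d_h = 8, 16                       # embedding size, hidden size (assumed)
W_e = np.random.randn(d_in, d_h) * 0.1  # input projection
W_h = np.random.randn(d_h, d_h) * 0.1   # hidden-state projection
b = np.zeros(d_h)

def rnn_step(h_prev, e_t):
    """One Elman step: h_t = tanh(h_{t-1} W_h + e_t W_e + b)."""
    return np.tanh(h_prev @ W_h + e_t @ W_e + b)

# unroll over a toy sequence of 5 input vectors
h = np.zeros(d_h)
for e_t in np.random.randn(5, d_in):
    h = rnn_step(h, e_t)               # the same parameters are reused at every step
print(h.shape)                         # (16,)
```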

  29. Agenda • Language Modeling with n-grams • Recurrent Neural Networks • Language Modeling with RNN • Backpropagation Through Time

  30. RNN Language Model (diagram: the words "the students opened their" are mapped to embeddings via E, passed through the RNN to hidden states h^(1), ..., h^(4), and the last hidden state is projected with V and softmaxed into ŷ^(4), the distribution P(y^(5) | the students opened their))

  31. RNN Language Model § Encoder - word at time step t → y^(t) - one-hot vector of y^(t) → 𝒚^(t) ∈ ℝ^|𝕎| - word embedding → e^(t) = 𝒚^(t) E § RNN: h^(t) = RNN(h^(t-1), e^(t)) § Decoder - predicted probability distribution: ŷ^(t) = softmax(V h^(t) + b) ∈ ℝ^|𝕎| - probability of any word w at step t: P(w | y^(t-1), ..., y^(1)) = ŷ^(t)_w
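Putting the encoder, RNN and decoder together, a rough numpy sketch of one prediction pass (the toy vocabulary, dimensions and softmax helper are assumptions for illustration):

```python
import numpy as np

vocab = ["the", "students", "opened", "their", "books", "exams"]  # toy vocabulary
V_size, d_e, d_h = len(vocab), 8, 16

E = np.random.randn(V_size, d_e) * 0.1    # embedding matrix
W_e = np.random.randn(d_e, d_h) * 0.1
W_h = np.random.randn(d_h, d_h) * 0.1
V = np.random.randn(d_h, V_size) * 0.1    # output projection
b_h, b_o = np.zeros(d_h), np.zeros(V_size)

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

h = np.zeros(d_h)
for word in ["the", "students", "opened", "their"]:
    e_t = E[vocab.index(word)]                 # encoder: one-hot lookup into E
    h = np.tanh(h @ W_h + e_t @ W_e + b_h)     # RNN step
y_hat = softmax(h @ V + b_o)                   # decoder: distribution over the vocabulary
print(dict(zip(vocab, y_hat.round(3))))        # P(next word | the students opened their)
```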

  32. Training an RNN Language Model § Start with a large text corpus: y^(1), ..., y^(T) § For every step t, predict the output distribution ŷ^(t) § Calculate the loss function: the Negative Log Likelihood of the predicted probability of the true next word y^(t+1): L^(t) = -log ŷ^(t)_{y^(t+1)} = -log P(y^(t+1) | y^(t), ..., y^(1)) § The overall loss is the average of the loss values over the entire training set: L = (1/T) Σ_{t=1}^{T} L^(t)
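A sketch of this loss for one token-id sequence, assuming a per-step function that returns the new hidden state and the predicted distribution (all names here are mine, not the lecturer's code):

```python
import numpy as np

def nll_loss(corpus_ids, step_fn, d_h=16):
    """Average negative log likelihood: (1/T) * sum_t -log y_hat^(t)[y^(t+1)]."""
    h, losses = np.zeros(d_h), []
    for t in range(len(corpus_ids) - 1):
        h, y_hat = step_fn(h, corpus_ids[t])               # y_hat: distribution over vocab
        losses.append(-np.log(y_hat[corpus_ids[t + 1]]))   # loss for the true next word
    return float(np.mean(losses))

# toy check: a model that always predicts the uniform distribution over 6 words
uniform = lambda h, tok: (h, np.full(6, 1 / 6))
print(nll_loss([0, 1, 2, 3, 4], uniform))  # = log 6 ≈ 1.79
```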

  33. Training (diagram: for the training text "the students opened their exams ...", the model reads y^(1) = "the", predicts the distribution ŷ^(1), and the loss L^(1) is the NLL of the true next word "students"; the same computation is repeated at every time step)
