 
              Natural Language Processing with Deep Learning Language Modeling with Recurrent Neural Networks Navid Rekab-Saz navid.rekabsaz@jku.at Institute of Computational Perception
Agenda • Language Modeling with n- grams • Recurrent Neural Networks • Language Modeling with RNN • Backpropagation Through Time The slides are adopted from http://web.stanford.edu/class/cs224n/
Agenda • Language Modeling with n- grams • Recurrent Neural Networks • Language Modeling with RNN • Backpropagation Through Time
Language Modeling § Language Modeling is the task of predicting a word (or a subword or character) given a context: 𝑄(𝑤|context) § A Language Model can answer the questions like 𝑄(𝑤| the students opened their ) 4
Language Modeling § Formally, given a sequence of words 𝑦 (") , 𝑦 ($) , … , 𝑦 (%) , a language model calculates the probability distribution of next word 𝑦 (%&") over all words in vocabulary 𝑄(𝑦 (%&") |𝑦 % , 𝑦 (%'") , … , 𝑦 (") ) 𝑦 is any word in the vocabulary 𝕎 = {𝑤1, 𝑤2, … , 𝑤𝑂} 5
Language Modeling § You can also think of a Language Model as a system that assigns probability to a piece of text - How probable is it that someone generates this sentence?! “ colorless green ideas sleep furiously” § According to a Language Model, the probability of a given text is computed by: 𝑄 𝑦 ! , … , 𝑦 " = 𝑄 𝑦 ! ×𝑄 𝑦 # 𝑦 ! × ⋯×𝑄 𝑦 " 𝑦 "$! , … , 𝑦 ! " 𝑄 𝑦 ! , … , 𝑦 " 𝑄(𝑦 " |𝑦 "$! , … , 𝑦 ! ) = / %&! 6
Why Language Modeling? § Language Modeling is a benchmark task that helps us measure our progress on understanding language § Language Modeling is a subcomponent of many NLP tasks, especially those involving generating text or estimating the probability of text: - Predictive typing - Spelling/grammar correction - Speech recognition - Handwriting recognition - Machine translation - Summarization - Dialogue /chatbots - etc. 7
Language Modeling [link] 8
Language Modeling 9
n -gram Language Model § Recall: a n -gram is a chunk of n consecutive words. the students opened their ______ § unigrams: “the”, “students”, “opened”, “their” § bigrams: “the students”, “students opened”, “opened their” § trigrams: “the students opened”, “students opened their” § 4-grams: “the students opened their” § A n -gram Language Model collects frequency statistics of different n -grams in a corpus, and use these to calculate probabilities 10
n -gram Language Model § Markov assumption: decision at time 𝑢 depends only on the current state § In n -gram Language Model: predicting 𝑦 (%&") depends on preceding n-1 words § Without Markovian assumption: 𝑄(𝑦 (%&") |𝑦 % , 𝑦 (%'") , … , 𝑦 (") ) § n -gram Language Model: 𝑄(𝑦 (%&") |𝑦 % , 𝑦 (%'") , … , 𝑦 (%'(&$) ) n-1 words 11
n -gram Language Model § Based on definition of conditional probability: = 𝑄 𝑦 %&" , 𝑦 % , … , 𝑦 %'(&$ 𝑄 𝑦 %&" 𝑦 % , … , 𝑦 %'(&$ 𝑄 𝑦 % , … , 𝑦 %'(&$ § The n -gram probability is calculated by counting n -grams and [ n–1 ]-grams in a large corpus of text: ≈ count 𝑦 %&" , 𝑦 % , … , 𝑦 %'(&$ 𝑄 𝑦 %&" 𝑦 % , … , 𝑦 %'(&$ count 𝑦 % , … , 𝑦 %'(&$ 12
n -gram Language Model § Example: learning a 4-gram Language Model as the exam clerk started the clock, the students opened their ______ condition on this 𝑄 𝑤 students opened their = 𝑄 students opened their 𝑤 𝑄 students opened their § For example, suppose that in the corpus: - “ students opened their ” occurred 1000 times - “ students opened their books ” occurred 400 times • 𝑄 ( books | students opened their ) = 0.4 - “ students opened their exams ” occurred 100 times • 𝑄 ( exams | students opened their ) = 0.1 13
n -gram Language Model – problems § Sparsity - If nominator „ students opened their 𝑤“ never occurred in corpus • Smoothing: add small hyper-parameter 𝜀 to all words - If denominator „ students opened their “ never occurred in corpus • Backoff: condition on “ students opened ” instead - Increasing n makes sparsity problem worse! § Storage - The model needs to store all n- grams (from unigram to n- gram), observed in the corpus - Increasing n worsens the storage problem radically! 14
n -gram Language Models – generating text § A trigram Language Model trained on Reuters corpus (1.7 M words) 15
n -gram Language Models – generating text § Generating text by sampling from the probability distributions 16
n -gram Language Models – generating text § Generating text by sampling from the probability distributions 17
n -gram Language Models – generating text § Generating text by sampling from the probability distributions 18
n -gram Language Models – generating text § Generating text by sampling from the probability distributions § Very good in syntax … but incoherent! § Increasing n makes the text more coherent but also intensifies the discussed issues 19
Agenda • Language Modeling with n- grams • Recurrent Neural Networks • Training Language Models with RNN • Backpropagation Through Time
Recurrent Neural Network § Recurrent Neural Network (RNN) encodes/embeds a sequential input of any size like … - Sequence of word/subword/character vectors - Time series … into compositional embeddings § RNN captures dependencies through the sequence by applying the same parameters repeatedly … § RNN outputs a final embedding but also intermediary embeddings on each time step 21
Recurrent Neural Networks 𝒊 (%) 𝒊 (%'") RNN 𝒇 (%) 22
Recurrent Neural Networks § Output 𝒊 (%) is a function of input 𝒇 (%) and the output of the previous time step 𝒊 (%'") 𝒊 (%) 𝒊 (%) = RNN(𝒊 %'" , 𝒇 (%) ) § 𝒊 (%) is called hidden state 𝒊 (%'") RNN § With hidden state 𝒊 (%'") , the model accesses to a sort of memory from all previous entities 𝒇 (%) 23
RNN – Unrolling 𝒊 (") 𝒊 ($) 𝒊 (+) 𝒊 (,) 𝒊 (,'") … 𝒊 (-) - RNN RNN RNN RNN 𝒇 ($) 𝒇 (,) 𝒇 (") 𝒇 (+) The quick brown fox jumps over the lazy dog 𝑦 (") 𝑦 ($) 𝑦 (+) 𝑦 (,) 24
RNN – Compositional embedding sentence embedding 𝒊 (") 𝒊 ($) 𝒊 (+) 𝒊 (.) 𝒊 (/) 𝒊 (-) RNN RNN RNN RNN RNN 𝒇 ($) 𝒇 (") 𝒇 (+) 𝒇 (.) 𝒇 (/) cat sunbathes on river bank 25
RNN – Compositional embedding sentence embedding use last hidden state 𝒊 (") 𝒊 ($) 𝒊 (+) 𝒊 (.) 𝒊 (/) 𝒊 (-) RNN RNN RNN RNN RNN 𝒇 ($) 𝒇 (") 𝒇 (+) 𝒇 (.) 𝒇 (/) cat sunbathes on river bank 26
RNN – Compositional embedding sentence embedding calculate element-wise max, or the mean of hidden states 𝒊 (") 𝒊 ($) 𝒊 (+) 𝒊 (.) 𝒊 (/) 𝒊 (-) RNN RNN RNN RNN RNN 𝒇 ($) 𝒇 (") 𝒇 (+) 𝒇 (.) 𝒇 (/) cat sunbathes on river bank 27
Standard (Elman) RNN § General form of an RNN function 𝒊 (%) = RNN(𝒊 %'" , 𝒇 (%) ) 𝒊 (%) § Standard RNN: linear projection of the previous hidden state 𝒊 !"# - 𝒊 (%'") linear projection of input 𝒇 (!) - RNN - summing the projections and applying a non-linearity 𝒊 (%) = 𝜏(𝒊 %'" 𝑿 2 + 𝒇 (%) 𝑿 3 + 𝒄) 𝒇 (%) 28
Agenda • Language Modeling with n- grams • Recurrent Neural Networks • Language Modeling with RNN • Backpropagation Through Time
RNN Language Model 𝑄(𝑦 (() | the students opened their ) 𝒛 (.) ; 𝑽 𝒊 (") 𝒊 ($) 𝒊 (+) 𝒊 (.) 𝒊 (-) RNN RNN RNN RNN 𝒇 ($) 𝒇 (") 𝒇 (+) 𝒇 (.) 𝑭 𝑭 𝑭 𝑭 the students open their 𝑦 (") 𝑦 (.) 𝑦 ($) 𝑦 (+) 30
RNN Language Model § Encoder - word at time step 𝑢 → 𝑦 (%) - One-hot vector of 𝑦 (%) → 𝒚 (%) ∈ ℝ 𝕎 - Word embedding → 𝒇 (%) = 𝒚 (%) 𝑭 § RNN 𝒊 (%) = RNN(𝒊 %'" , 𝒇 (%) ) § Decoder - Predicted probability distribution: 𝒛 (%) = softmax 𝑽𝒊 % + 𝒄 ∈ ℝ 𝕎 ; - Probability of any word 𝑤 at step 𝑢 : 𝑄 𝑤 𝑦 %'" , … , 𝑦 (") (%) = E 𝑧 5 31
Training an RNN Language Model § Start with a large text corpus: 𝑦 " , … , 𝑦 , 𝒛 (%) § For every step 𝑢 predict the output distribution ; § Calculate the loss function: Negative Log Likelihood of the predicted probability of the true next word 𝑦 %&" ℒ (%) = − log E = − log 𝑄 𝑦 %&" 𝑦 % , … , 𝑦 (") % 𝑧 𝒚 !"# § Overall loss is the average of loss values over the entire training set: , ℒ = 1 ℒ (%) 𝑈 M %7" 32
NLL of students Training ℒ (") 𝒛 (") ; 𝑽 𝒊 (") 𝒊 (-) RNN 𝒇 (") 𝑭 the students open their exams … 𝑦 (") 𝑦 (.) 𝑦 (/) 𝑦 ($) 𝑦 (+) 33
Recommend
More recommend