SLIDE 1

Natural Language Processing with Deep Learning
Language Modeling with Recurrent Neural Networks

Navid Rekab-Saz
navid.rekabsaz@jku.at
Institute of Computational Perception

SLIDE 2

Agenda

  • Language Modeling with n-grams
  • Recurrent Neural Networks
  • Language Modeling with RNN
  • Backpropagation Through Time

The slides are adapted from http://web.stanford.edu/class/cs224n/

SLIDE 3

Agenda

  • Language Modeling with n-grams
  • Recurrent Neural Networks
  • Language Modeling with RNN
  • Backpropagation Through Time
SLIDE 4

Language Modeling

§ Language Modeling is the task of predicting a word (or a subword or character) given a context:

P(w | context)

§ A Language Model can answer questions like P(w | the students opened their)

SLIDE 5

Language Modeling

§ Formally, given a sequence of words y^(1), y^(2), …, y^(t), a language model calculates the probability distribution of the next word y^(t+1) over all words in the vocabulary:

P(y^(t+1) | y^(t), y^(t−1), …, y^(1))

where y can be any word in the vocabulary 𝕎 = {w_1, w_2, …, w_|𝕎|}

SLIDE 6

Language Modeling

§ You can also think of a Language Model as a system that assigns a probability to a piece of text

  • How probable is it that someone generates this sentence?!

β€œcolorless green ideas sleep furiously”

§ According to a Language Model, the probability of a given text is computed by:

P(y^(1), …, y^(T)) = P(y^(1)) × P(y^(2) | y^(1)) × ⋯ × P(y^(T) | y^(T−1), …, y^(1))

P(y^(1), …, y^(T)) = ∏_{t=1}^{T} P(y^(t) | y^(t−1), …, y^(1))
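To make the chain rule concrete, here is a minimal sketch that scores a sentence by multiplying conditional word probabilities; the cond_prob function and its toy values are hypothetical stand-ins for a trained model:

```python
import math

def cond_prob(word, context):
    # Hypothetical conditional probabilities P(word | context),
    # standing in for a trained language model
    toy_model = {
        (): {"the": 0.05},
        ("the",): {"students": 0.02},
        ("the", "students"): {"opened": 0.10},
        ("the", "students", "opened"): {"their": 0.35},
    }
    return toy_model.get(tuple(context), {}).get(word, 1e-6)

def sentence_log_prob(words):
    # Chain rule: log P(y(1)..y(T)) = sum_t log P(y(t) | y(1)..y(t-1))
    return sum(math.log(cond_prob(w, words[:i])) for i, w in enumerate(words))

print(sentence_log_prob(["the", "students", "opened", "their"]))
```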

SLIDE 7

Why Language Modeling?

§ Language Modeling is a benchmark task that helps us measure our progress on understanding language

§ Language Modeling is a subcomponent of many NLP tasks, especially those involving generating text or estimating the probability of text:

  • Predictive typing
  • Spelling/grammar correction
  • Speech recognition
  • Handwriting recognition
  • Machine translation
  • Summarization
  • Dialogue / chatbots
  • etc.
SLIDE 8

Language Modeling

[link]

SLIDE 9

Language Modeling

SLIDE 10

n-gram Language Model

§ Recall: an n-gram is a chunk of n consecutive words.

the students opened their ______

§ unigrams: "the", "students", "opened", "their"
§ bigrams: "the students", "students opened", "opened their"
§ trigrams: "the students opened", "students opened their"
§ 4-grams: "the students opened their"

§ An n-gram Language Model collects frequency statistics of different n-grams in a corpus, and uses these statistics to calculate probabilities

SLIDE 11

n-gram Language Model

§ Markov assumption: the decision at time t depends only on the current state

§ In an n-gram Language Model: predicting y^(t+1) depends only on the preceding n−1 words

§ Without the Markov assumption: P(y^(t+1) | y^(t), y^(t−1), …, y^(1))

§ n-gram Language Model: P(y^(t+1) | y^(t), y^(t−1), …, y^(t−n+2))

(y^(t), …, y^(t−n+2) are the preceding n−1 words)

SLIDE 12

n-gram Language Model

§ Based on the definition of conditional probability:

P(y^(t+1) | y^(t), …, y^(t−n+2)) = P(y^(t+1), y^(t), …, y^(t−n+2)) / P(y^(t), …, y^(t−n+2))

§ The n-gram probability is calculated by counting n-grams and (n−1)-grams in a large corpus of text:

P(y^(t+1) | y^(t), …, y^(t−n+2)) ≈ count(y^(t+1), y^(t), …, y^(t−n+2)) / count(y^(t), …, y^(t−n+2))
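A minimal counting sketch of this estimate (the toy corpus and whitespace tokenization are assumptions for illustration):

```python
from collections import Counter

def ngram_counts(tokens, n):
    # Count every n-gram of length n in the token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_prob(tokens, context, word):
    # P(word | context) ~= count(context, word) / count(context)
    n = len(context) + 1
    numerator = ngram_counts(tokens, n)[tuple(context) + (word,)]
    denominator = ngram_counts(tokens, n - 1)[tuple(context)]
    return numerator / denominator if denominator > 0 else 0.0

corpus = "the students opened their books . the students opened their exams .".split()
print(ngram_prob(corpus, ["students", "opened", "their"], "books"))  # 0.5
```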

SLIDE 13

n-gram Language Model

§ Example: learning a 4-gram Language Model

as the exam clerk started the clock, the students opened their ______

P(w | students opened their) = count(students opened their w) / count(students opened their)

(a 4-gram model conditions only on the last three words, "students opened their")

§ For example, suppose that in the corpus:

  • "students opened their" occurred 1000 times
  • "students opened their books" occurred 400 times
    → P(books | students opened their) = 0.4
  • "students opened their exams" occurred 100 times
    → P(exams | students opened their) = 0.1

SLIDE 14

n-gram Language Model – problems

§ Sparsity

  • If the numerator "students opened their w" never occurred in the corpus
    → Smoothing: add a small hyper-parameter ε to the count of every word
  • If the denominator "students opened their" never occurred in the corpus
    → Backoff: condition on "students opened" instead
  • Increasing n makes the sparsity problem worse!

§ Storage

  • The model needs to store all n-grams (from unigrams to n-grams) observed in the corpus
  • Increasing n radically worsens the storage problem!
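A hedged sketch of these two fixes (the ε value and the one-word backoff are illustrative choices; ngram_counts is the helper from the previous sketch):

```python
def smoothed_prob(tokens, context, word, vocab, eps=0.01):
    # Add-epsilon smoothing: every word in the vocabulary gets a pseudo-count
    n = len(context) + 1
    numerator = ngram_counts(tokens, n)[tuple(context) + (word,)] + eps
    denominator = ngram_counts(tokens, n - 1)[tuple(context)] + eps * len(vocab)
    return numerator / denominator

def backoff_prob(tokens, context, word, vocab, eps=0.01):
    # Back off to a shorter context while the full context was never observed
    while context and ngram_counts(tokens, len(context))[tuple(context)] == 0:
        context = context[1:]  # drop the earliest context word
    return smoothed_prob(tokens, context, word, vocab, eps)
```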
SLIDE 15

n-gram Language Models – generating text

§ A trigram Language Model trained on the Reuters corpus (1.7M words)

SLIDE 16

n-gram Language Models – generating text

§ Generating text by sampling from the probability distributions

SLIDE 17

n-gram Language Models – generating text

§ Generating text by sampling from the probability distributions

SLIDE 18

n-gram Language Models – generating text

§ Generating text by sampling from the probability distributions

SLIDE 19

n-gram Language Models – generating text

§ Generating text by sampling from the probability distributions
§ Very good syntax … but incoherent!
§ Increasing n makes the text more coherent, but also intensifies the sparsity and storage issues discussed above
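A minimal sketch of this sampling loop under a trigram model (ngram_counts as above; the toy corpus and seed words are assumptions):

```python
import random
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sample_next(tokens, context):
    # Sample a word from the count-based distribution P(w | context)
    candidates = {ng[-1]: c
                  for ng, c in ngram_counts(tokens, len(context) + 1).items()
                  if ng[:-1] == tuple(context)}
    words, counts = zip(*candidates.items())
    return random.choices(words, weights=counts)[0]

corpus = "the students opened their books . the students opened their minds .".split()
text = ["the", "students"]
for _ in range(4):               # trigram model: condition on the last 2 words
    text.append(sample_next(corpus, text[-2:]))
print(" ".join(text))
```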

SLIDE 20

Agenda

  • Language Modeling with n-grams
  • Recurrent Neural Networks
  • Language Modeling with RNN
  • Backpropagation Through Time
SLIDE 21

Recurrent Neural Network

§ A Recurrent Neural Network (RNN) encodes/embeds a sequential input of any size, such as …

  • a sequence of word/subword/character vectors
  • a time series

… into compositional embeddings

§ An RNN captures dependencies through the sequence by applying the same parameters repeatedly

§ An RNN outputs a final embedding, but also intermediary embeddings at each time step

SLIDE 22

Recurrent Neural Networks

(diagram: an RNN cell takes the previous hidden state h^(t−1) and the input x^(t), and outputs the new hidden state h^(t))

SLIDE 23

Recurrent Neural Networks

§ The output h^(t) is a function of the input x^(t) and the output of the previous time step, h^(t−1):

h^(t) = RNN(h^(t−1), x^(t))

§ h^(t) is called the hidden state

§ Through the hidden state h^(t−1), the model has access to a sort of memory of all previous entities

SLIDE 24

RNN – Unrolling

(diagram: the RNN cell unrolled over "The quick brown fox jumps over the lazy dog …"; starting from h^(0), each step t consumes the input vector x^(t) of word y^(t) and the previous hidden state, and produces h^(t))

SLIDE 25

RNN – Compositional embedding

(diagram: an RNN unrolled over "cat sunbathes on river bank", producing hidden states h^(1) … h^(5) that are combined into a sentence embedding)

SLIDE 26

RNN – Compositional embedding

§ Sentence embedding: use the last hidden state

(diagram: the same unrolled RNN over "cat sunbathes on river bank"; the last hidden state h^(5) is taken as the sentence embedding)

SLIDE 27

RNN – Compositional embedding

§ Sentence embedding: calculate the element-wise max, or the mean, of the hidden states

(diagram: the same unrolled RNN; the hidden states h^(1) … h^(5) are pooled into a sentence embedding)

SLIDE 28

Standard (Elman) RNN

§ General form of an RNN function: h^(t) = RNN(h^(t−1), x^(t))

§ Standard RNN:

  • a linear projection of the previous hidden state h^(t−1)
  • a linear projection of the input x^(t)
  • sum the projections and apply a non-linearity

h^(t) = τ(h^(t−1) W_h + x^(t) W_x + c)

SLIDE 29

Agenda

  • Language Modeling with n-grams
  • Recurrent Neural Networks
  • Language Modeling with RNN
  • Backpropagation Through Time
SLIDE 30

RNN Language Model

(diagram: the words "the students opened their" are embedded via F and fed through the unrolled RNN; the final hidden state h^(4) is decoded via V into ẑ^(4), which gives P(y^(5) | the students opened their))

SLIDE 31

RNN Language Model

§ Encoder

  • word at time step t → y^(t)
  • one-hot vector of y^(t) → 𝐲^(t) ∈ ℝ^|𝕎|
  • word embedding → x^(t) = 𝐲^(t) F

§ RNN: h^(t) = RNN(h^(t−1), x^(t))

§ Decoder

  • predicted probability distribution: ẑ^(t) = softmax(h^(t) V + b) ∈ ℝ^|𝕎|
  • probability of any word w at step t: P(w | y^(t), …, y^(1)) = ẑ^(t)_w
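A hedged NumPy sketch of one encoder/RNN/decoder step; the sizes, random initialization, and softmax helper are assumptions for illustration:

```python
import numpy as np

vocab_size, e, h = 1000, 16, 32            # |W|, embedding dim, hidden dim
F = np.random.randn(vocab_size, e) * 0.1   # encoder embedding matrix
V = np.random.randn(h, vocab_size) * 0.1   # decoder matrix
W_x = np.random.randn(e, h) * 0.1
W_h = np.random.randn(h, h) * 0.1
c, b = np.zeros(h), np.zeros(vocab_size)   # RNN bias and decoder bias

def softmax(a):
    ex = np.exp(a - a.max())
    return ex / ex.sum()

def lm_step(h_prev, word_id):
    x_t = F[word_id]                             # lookup = one-hot vector @ F
    h_t = np.tanh(h_prev @ W_h + x_t @ W_x + c)  # RNN step
    z_t = softmax(h_t @ V + b)                   # distribution over next word
    return h_t, z_t
```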

SLIDE 32

Training an RNN Language Model

§ Start with a large text corpus: y^(1), …, y^(T)

§ For every step t, predict the output distribution ẑ^(t)

§ Calculate the loss: the Negative Log Likelihood (NLL) of the predicted probability of the true next word y^(t+1):

L^(t) = −log ẑ^(t)_{y^(t+1)} = −log P(y^(t+1) | y^(t), …, y^(1))

§ The overall loss is the average of the loss values over the entire training set:

L = (1/T) Σ_{t=1}^{T} L^(t)
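Continuing the sketch above, the per-step loss is the negative log of the probability the model assigns to the true next word (the word ids here are hypothetical):

```python
def nll(z_t, next_word_id):
    # L(t) = -log z(t)[y(t+1)]
    return -np.log(z_t[next_word_id])

h0 = np.zeros(h)
h1, z1 = lm_step(h0, word_id=42)    # feed a word with hypothetical id 42
loss_1 = nll(z1, next_word_id=7)    # score against a hypothetical true next word
```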

SLIDE 33

Training

(diagram: training on "the students open their exams …"; step 1 embeds "the", produces ẑ^(1), and L^(1) is the NLL of the true next word "students")

SLIDE 34

Training

(diagram: step 2; after "the students", ẑ^(2) is produced and L^(2) is the NLL of the true next word "open")

SLIDE 35

Training

(diagram: step 3; after "the students open", ẑ^(3) is produced and L^(3) is the NLL of the true next word "their")

SLIDE 36

Training

(diagram: step 4; after "the students open their", ẑ^(4) is produced and L^(4) is the NLL of the true next word "exams")

SLIDE 37

Training RNN Language Model – Mini-batches

§ In practice, the overall loss is calculated not over the whole corpus but over (mini-)batches of length M:

L = (1/M) Σ_{t=1}^{M} L^(t)

§ After calculating L for one batch, the gradients are computed and the weights are updated (e.g. using SGD)

SLIDE 38

Training RNN Language Model – Data preparation

§ In practice, every forward pass contains l sequences (batch rows); these are all trained in parallel

§ With batch size l, the tensor of each forward pass has the shape (l, M)

§ To prepare the data, the corpus is split into sub-corpora, where each sub-corpus provides the text for one row of the forward-pass tensors

§ For example, for batch size l = 2, the corpus is split in the middle, and the first forward-pass tensor is:

    y^(1)   …  y^(M)
    y^(υ+1) …  y^(υ+M)

where υ = T/2 is the overall length of the first sub-corpus

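A sketch of this splitting in NumPy (function and variable names are mine, not from the slides):

```python
import numpy as np

def batchify(token_ids, batch_size, seq_len):
    # Split the corpus into batch_size parallel sub-corpora (rows),
    # then cut each forward pass as a (batch_size, seq_len) window
    ids = np.asarray(token_ids)
    upsilon = len(ids) // batch_size                  # sub-corpus length
    streams = ids[: upsilon * batch_size].reshape(batch_size, upsilon)
    return [streams[:, i:i + seq_len] for i in range(0, upsilon - 1, seq_len)]

batches = batchify(range(20), batch_size=2, seq_len=4)
print(batches[0])   # first window: row 1 is y(1)..y(4), row 2 is y(11)..y(14)
```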
SLIDE 39

Training RNN Language Model – π’Š(-)

Β§ At the beginning, the initial hidden state π’Š(-) is set to vectors of zeros (no memory) Β§ To carry the memory of previous forward passes, at each forward pass, the initial hidden states are initialized with the values of the last hidden states of the previous forward pass Example Β§ First forward pass: π’Š(-) is set to zero values 𝑦 ! … 𝑦 * 𝑦 +,! … 𝑦 +,* Β§ Second forward pass: π’Š(-) is set to the π’Š(*) values in the first pass 𝑦 *,! … 𝑦 #* 𝑦 *,+,! … 𝑦 +,#*
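A sketch of carrying the state across forward passes, reusing the hypothetical parameters (F, W_h, W_x, c) and the batchify windows from the sketches above (in an autograd framework the carried state would also be detached from the old graph):

```python
batch_size = 2
h_state = np.zeros((batch_size, 32))   # h(0) = 0 at the very beginning
for window in batches:                 # windows from batchify(...)
    for t in range(window.shape[1]):
        x_t = F[window[:, t]]          # embed one column of word ids
        h_state = np.tanh(h_state @ W_h + x_t @ W_x + c)
    # h_state now holds the last hidden states and becomes h(0) of the next pass
```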

SLIDE 40

Training RNN Language Model – Weight Tying

§ Parameters in a vanilla RNN (bias terms discarded):

  • F → |𝕎| × h
  • V → h × |𝕎|
  • W_x → e × h
  • W_h → h × h

where e and h are the number of dimensions of the input embedding and hidden vectors, respectively.

§ The encoder and decoder embedding matrices F and V contain the most parameters

SLIDE 41

Training RNN Language Model – Weight Tying

§ Parameters in a vanilla RNN (bias terms discarded):

  • F → |𝕎| × h
  • V → h × |𝕎|
  • W_x → e × h
  • W_h → h × h

where e and h are the number of dimensions of the input embedding and hidden vectors, respectively.

§ Weight tying: set the decoder parameters to be the same as the encoder parameters (saving |𝕎| × h decoding parameters)

  • In this case, e must be equal to h
  • If e ≠ h, usually a linear projection maps the output vector from h to e dimensions

SLIDE 42

Generating text

(diagram: the model is fed the word "a", produces the distribution ẑ^(1), and the next word "cat" is sampled from it)

SLIDE 43

Generating text

(diagram: the sampled word "cat" is fed back as the next input; ẑ^(2) is produced and "sunbathes" is sampled)

SLIDE 44

Generating text

(diagram: "a cat sunbathes" has been generated so far; ẑ^(3) is produced, "on" is sampled, and the process continues …)
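Putting the pieces together, a hedged sketch of this autoregressive loop, reusing the hypothetical lm_step from the RNN Language Model sketch above:

```python
def generate(first_word_id, n_words):
    # Feed each sampled word back in as the next input
    h_t = np.zeros(h)                    # h(0) = 0
    word_id, output = first_word_id, [first_word_id]
    for _ in range(n_words):
        h_t, z_t = lm_step(h_t, word_id)             # forward one step
        word_id = np.random.choice(len(z_t), p=z_t)  # sample from z(t)
        output.append(word_id)
    return output

print(generate(first_word_id=0, n_words=5))   # a list of sampled word ids
```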

SLIDE 45

Generating text with RNN Language Model

§ Trained on Obama speeches:

Jobs The United States will step up to the cost of a new challenges of the American people that will share the fact that we created the problem. They were attacked and so that they have to say that all the task of the final days of war that I will not be able to get this done. The promise of the men and women who were still going to take out the fact that the American people have fought to make sure that they have to be able to protect our part. It was a chance to stand together to completely look for the commitment to borrow from the American people.

https://medium.com/@samim/obama-rnn-machine-generated-political-speeches-c8abd18a2ea0

SLIDE 46

Generating text with RNN Language Model

§ Trained on Trump speeches:

make the country rich. it was terrible. but saudi arabia, they make a billion dollars a day. i was the king. i was the king. i was the smartest person in yemen, we have to get to business. i have to say, but he was an early starter. and we have to get to business. i have to say, donald, i can't believe it. it's so important. but this is what they're saying, dad, you're going to be really pro, growth, blah, blah. it's disgusting what's disgusting., and it was a 19 set washer and to go to japan. did you hear that character. we are going to have to think about it. but you know, i've been nice to me.

https://github.com/ppramesi/RoboTrumpDNN

SLIDE 47

Agenda

  • Language Modeling with n-grams
  • Recurrent Neural Networks
  • Language Modeling with RNN
  • Backpropagation Through Time
SLIDE 48

Backpropagation for RNNs

§ Unrolling the computation graph of the RNN

§ Simplified: the interactions with V and the input parameters (F and W_x) are removed

§ What is ∂L^(t)/∂W_h ?

(diagram: the chain h^(0) → … → h^(t−2) → h^(t−1) → h^(t) → L^(t), with W_h applied at every step)

SLIDE 49

Backpropagation for RNNs π’Š(-) 𝑿2 π’Š(") 𝑿2 π’Š($) 𝑿2 π’Š(+) β„’(+) πœ–β„’(+) πœ–π‘Ώ2 =?

SLIDE 50

Backpropagation for RNNs

§ Gradient with respect to W_h at time step 3:

[∂L^(3)/∂W_h]^(3) = ∂L^(3)/∂h^(3) · ∂h^(3)/∂W_h

SLIDE 51

Backpropagation for RNNs

§ Gradient with respect to W_h at time step 2:

[∂L^(3)/∂W_h]^(2) = ∂L^(3)/∂h^(3) · ∂h^(3)/∂h^(2) · ∂h^(2)/∂W_h

SLIDE 52

Backpropagation for RNNs

§ Gradient with respect to W_h at time step 1:

[∂L^(3)/∂W_h]^(1) = ∂L^(3)/∂h^(3) · ∂h^(3)/∂h^(2) · ∂h^(2)/∂h^(1) · ∂h^(1)/∂W_h

SLIDE 53

Backpropagation Through Time

§ The final gradient is the sum of the gradients with respect to a model parameter (such as W_h) from the current time step back to the beginning of the corpus (or batch):

∂L^(t)/∂W_h = Σ_{k=1}^{t} [∂L^(t)/∂W_h]^(k)

§ In this simplified case, this can be written as:

∂L^(t)/∂W_h = Σ_{k=1}^{t} ∂L^(t)/∂h^(t) · ∂h^(t)/∂h^(t−1) · ⋯ · ∂h^(k)/∂W_h

SLIDE 54

Summary

§ A Language Model calculates …

  • the probability of the appearance of a word/subword/character given its context
  • the probability of a stream of text

§ Recurrent Neural Network

  • predicts the next entity (word/subword/character) based on a memory of previous steps and the given input
  • Backpropagation Through Time at step t accumulates t gradient contributions for the model parameters, from the current point back to the beginning of the sequence

SLIDE 55

Summary – RNN for Language Modeling

§ Pros:

  • can process input of any length
  • can (in theory) use information from many steps back
  • model size doesn't increase for longer input

§ Cons:

  • recurrent computation is slow → (in its vanilla form) it does not fully exploit the parallel computation capability of GPUs
  • in practice, it is difficult to access information from many steps back