

SLIDE 1

CS 6956: Deep Learning for NLP

Recurrent Neural Networks

SLIDE 2

Overview

1. Modeling sequences
2. Recurrent neural networks: An abstraction
3. Usage patterns for RNNs
4. Bidirectional RNNs
5. A concrete example: The Elman RNN
6. The vanishing gradient problem
7. Gating and long short-term memory units

SLIDE 4

A simple RNN

  • 1. How to generate the current state using the previous state and the current input?

    Next state: s_t = h(s_{t-1} X^s + y_t X^y + c)

  • 2. How to generate the current output using the current state?

    The output is the state. That is, z_t = s_t
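To make the recurrence concrete, here is a minimal sketch of one step in Python/NumPy. The names (elman_step, X_s, X_y) are illustrative, not from the slides, and tanh is assumed as the activation h.

```python
import numpy as np

def elman_step(s_prev, y_t, X_s, X_y, c):
    """One Elman RNN step: s_t = h(s_{t-1} X_s + y_t X_y + c), with h = tanh."""
    s_t = np.tanh(s_prev @ X_s + y_t @ X_y + c)
    z_t = s_t          # the output is the state itself
    return s_t, z_t
```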

SLIDE 5

How do we train a recurrent network?

We need to specify a problem first. Let's take an example.

– Inputs are sequences (say, of words)
– The outputs are labels associated with each word
– Losses for each word are added up

[Figure: an RNN unrolled over the sentence "I like cake", starting from an initial state, with a label for each word (Pronoun, Verb, Noun) and per-word losses loss1, loss2, loss3 summed into a total Loss]

SLIDE 8

Gradients to the rescue

  • We have a computation graph
  • Use backpropagation to compute gradients of the loss with respect to the parameters (X^s, X^y, c)
    – Sometimes called Backpropagation Through Time (BPTT)
  • Update the parameters using SGD or a variant
    – Adam, for example
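As a sketch, the whole pipeline might look like the following in PyTorch; the sizes, data, and three-tag label set are illustrative assumptions, and calling loss.backward() performs BPTT through the unrolled graph.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=50, hidden_size=100, batch_first=True)
tagger = nn.Linear(100, 3)                      # e.g., Pronoun / Verb / Noun
loss_fn = nn.CrossEntropyLoss(reduction="sum")  # per-word losses are added up
params = list(rnn.parameters()) + list(tagger.parameters())
optimizer = torch.optim.Adam(params)

x = torch.randn(1, 3, 50)         # one sentence of three word vectors
y = torch.tensor([[0, 1, 2]])     # one label per word

states, _ = rnn(x)                # one state per position
logits = tagger(states)           # one prediction per word
loss = loss_fn(logits.view(-1, 3), y.view(-1))

optimizer.zero_grad()
loss.backward()                   # backpropagation through time
optimizer.step()
```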


SLIDE 10

Does this work? Let's see a simple example

To avoid complicating the notation more than necessary, suppose:

  • 1. The inputs, states, and outputs are all scalars
  • 2. The loss at each step is a function g of the state at that step

First input: y_1
Transform: e_1 = s_0 X^s + y_1 X^y + c
State: s_1 = h(e_1)
Loss: m_1 = g(s_1)

Let's compute the derivative of the loss with respect to the parameter X. Following the chain rule:

∂m_1/∂X = ∂m_1/∂s_1 · ∂s_1/∂e_1 · ∂e_1/∂X

Let us examine the non-linearity in this system due to the activation function. Suppose h(z) = tanh(z). Then dh/dz = 1 − tanh²(z), which is always between zero and one. That is,

∂s_1/∂e_1 = 1 − tanh²(e_1)

is a number between zero and one.

SLIDE 23

Does this work? Let's see a simple example

Let's see what happens with another input.

First input: y_1
Transform: e_1 = s_0 X^s + y_1 X^y + c
State: s_1 = h(e_1)
Loss: m_1 = g(s_1)

Second input: y_2
Transform: e_2 = s_1 X^s + y_2 X^y + c
State: s_2 = h(e_2)
Loss: m_2 = g(s_2)

Once again, the chain rule:

∂m_2/∂X = ∂m_2/∂s_2 · ∂s_2/∂e_2 · (∂e_2/∂X + ∂e_2/∂s_1 · ∂s_1/∂e_1 · ∂e_1/∂X)

There are two dependencies on X: a direct one through e_2, and an indirect one through the previous state s_1. How does the first input affect the loss at the second step? Through the second term, and that gradient is multiplied by all the other terms in the chain.

Let's focus on the impact of the activation terms. With h(z) = tanh(z), dh/dz = 1 − tanh²(z), so both ∂s_1/∂e_1 and ∂s_2/∂e_2 are numbers between zero and one. Multiplying them scales the gradient down.
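Extending the sanity check to two steps shows autograd adding up exactly these two dependency paths; as before, g(s) = s² and the constants are arbitrary, and X plays the role of the recurrent weight.

```python
import torch

X = torch.tensor(0.5, requires_grad=True)   # recurrent weight
X_y, c, s0, y1, y2 = 0.3, 0.1, 0.2, 1.0, -1.0

s1 = torch.tanh(s0 * X + y1 * X_y + c)
s2 = torch.tanh(s1 * X + y2 * X_y + c)
m2 = s2 ** 2
m2.backward()

with torch.no_grad():
    prefix = 2 * s2 * (1 - s2 ** 2)             # dm2/ds2 * ds2/de2
    direct = prefix * s1                        # de2/dX: the direct path
    indirect = prefix * X * (1 - s1 ** 2) * s0  # the path through s1
print(X.grad.item(), (direct + indirect).item())  # the two values agree
```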
SLIDE 40

Does this work? Let's see a simple example

First input: y_1, e_1 = s_0 X^s + y_1 X^y + c, s_1 = h(e_1), loss m_1 = g(s_1)
Second input: y_2, e_2 = s_1 X^s + y_2 X^y + c, s_2 = h(e_2), loss m_2 = g(s_2)
nth input: y_n, e_n = s_{n-1} X^s + y_n X^y + c, s_n = h(e_n), loss m_n = g(s_n)

With one input, the contribution of the first input towards the gradient of the loss of the first output is scaled by one term between zero and one.

With two inputs, the contribution of the first input towards the gradient of the loss of the second output is scaled by two terms between zero and one.

With n inputs, the contribution of the first input towards the gradient of the loss of the nth output is scaled by n terms between zero and one.
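A tiny simulation makes the shrinkage visible: each step contributes one factor of 1 − tanh²(e_t), so the accumulated product decays toward zero as n grows. The weights and random inputs below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X_s, X_y, c = 0.9, 0.5, 0.0
s, product = 0.0, 1.0
for n in range(1, 51):
    e = s * X_s + rng.normal() * X_y + c
    s = np.tanh(e)
    product *= 1 - s ** 2            # the activation factor at step n
    if n in (1, 5, 10, 25, 50):
        print(n, product)            # shrinks rapidly toward zero
```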

SLIDE 43

The vanishing gradient problem

  • As the length of the sequence grows, the impact of far-away inputs diminishes because the gradient vanishes
  • We saw an example where states and inputs are scalars
    – The same applies when the states and inputs are vectors/matrices, as in usual networks
  • This happens because the gradient of the non-linear activation is a number between zero and one
    – … and many such numbers are multiplied together
  • Applicable not only to recurrent networks, but to any case where we have a long chain of such activations (i.e., in a deep network): layers closer to the loss will get larger updates

[Bengio et al. 1994]

Why is this a problem? The signal needed to fill the blank can sit farther and farther away:

I have a banana and an apple. My friend ate the banana and I ate the ________?

I have a banana and an apple. My friend ate the banana. I was hungry and wanted a fruit. So I ate the ________?

I have a banana and an apple. My friend ate the banana. I was hungry and wanted a fruit. I really wished I had a banana as well, but we were all out. So I ate the ________?

Consider an RNN language model for this task. If it makes a mistake in the final word, the signal for correcting it is far away.

[Hochreiter and Schmidhuber 1997]: "Backpropagation through time is too sensitive to recent distractions."

SLIDE 52

Addressing the vanishing gradient problem

Approach 1: Change the activation
– The problem occurs because the derivatives of the activation function are small, so change it
– Commonly used: the rectified linear unit, ReLU(z) = max(0, z)
– What is its derivative? d ReLU(z)/dz = 1 if z ≥ 0, else 0

Multiplying many of these won't vanish the gradient if the pre-activation value is positive, but it can completely erase the gradient if it is negative.
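A small illustration of that derivative and its failure mode; relu_grad is an ad hoc helper, not a library function.

```python
import numpy as np

def relu_grad(z):
    """Derivative of ReLU as on the slide: 1 if z >= 0, else 0."""
    return (z >= 0).astype(float)

all_positive = np.array([0.7, 1.2, 3.0, 0.4, 2.5])
one_negative = np.array([0.7, 1.2, -0.1, 0.4, 2.5])
print(np.prod(relu_grad(all_positive)))  # 1.0: the gradient survives intact
print(np.prod(relu_grad(one_negative)))  # 0.0: one negative value erases it
```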

SLIDE 57

Exploding gradients

If our gradients are not fractional (e.g., with ReLUs), we might end up multiplying many large numbers during gradient computation. This can quickly cause numeric overflow errors.

This is called the exploding gradient problem.
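For instance, with a recurrent weight of magnitude above one and per-step activation gradients of one, the accumulated factor grows geometrically; a toy illustration:

```python
grad = 1.0
w = 3.0                   # a recurrent weight with magnitude above one
for t in range(700):
    grad *= w             # one multiplicative factor per time step
print(grad)               # inf: 3**700 overflows float64 (max ~1.8e308)
```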

SLIDE 58

Addressing vanishing/exploding gradients

Approach 2: Don't take derivatives all the way back to the beginning
– The problem occurs because we need to compute derivatives with respect to the early inputs
– Truncate the backpropagation process instead
– Called Truncated Backpropagation Through Time (TBPTT)

Essentially, this makes a Markov-like assumption. A sketch of one common way to do this follows.
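A minimal sketch, assuming we process a long sequence in fixed-size chunks and carry the hidden state across chunk boundaries while detaching it, so gradients stop at each boundary; the sizes and the stand-in loss are illustrative.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
long_sequence = torch.randn(1, 100, 10)       # too long to backprop end to end

h = None
for chunk in long_sequence.split(25, dim=1):  # 25-step windows
    out, h = rnn(chunk, h)
    loss = out.pow(2).mean()                  # stand-in loss for the sketch
    rnn.zero_grad()
    loss.backward()                           # gradients flow only within the chunk
    h = h.detach()                            # cut the graph: truncated BPTT
```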

SLIDE 59

Addressing vanishing/exploding gradients

Approach 3: Use a ReLU activation, but explicitly avoid exploding gradients
– If a gradient is larger than a certain threshold, truncate (clip) it
– ReLUs reduce vanishing gradients, and clipping takes care of exploding gradients
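In PyTorch, clipping by global norm is built in as torch.nn.utils.clip_grad_norm_; a minimal sketch with a contrived loss:

```python
import torch

params = [torch.randn(5, requires_grad=True)]
loss = (1000.0 * params[0]).sum()   # produces a deliberately huge gradient
loss.backward()

torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)  # rescale to norm <= 1
print(params[0].grad.norm())        # now at most 1.0
```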

SLIDE 60

Addressing vanishing/exploding gradients

Approach 4: Change the internals of the RNN more thoroughly…

… by using a gated architecture such as an LSTM or a GRU unit
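In PyTorch, swapping the simple cell for a gated one is nearly a drop-in change; nn.LSTM and nn.GRU are the standard built-in gated architectures (the sizes below are illustrative).

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=50, hidden_size=100, batch_first=True)
x = torch.randn(1, 3, 50)
states, (h_n, c_n) = lstm(x)   # an LSTM carries a gated cell state c alongside h
print(states.shape)            # torch.Size([1, 3, 100]): one state per word
```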