SLIDE 1

CS7015 (Deep Learning): Lecture 16
Encoder Decoder Models, Attention Mechanism

Mitesh M. Khapra
Department of Computer Science and Engineering, Indian Institute of Technology Madras

SLIDE 2

Module 16.1: Introduction to Encoder Decoder Models

SLIDE 3

[Figure: an RNN unrolled over time, generating "I am at home today ⟨stop⟩" one word per step; input x_t, state s_t, output y_t, parameters U, V, W; the model outputs P(y_t = j | y_1^{t−1})]

We will start by revisiting the problem of language modeling. Informally, given t − 1 words we are interested in predicting the t-th word. More formally, given y_1, y_2, ..., y_{t−1} we want to find

y* = argmax P(y_t | y_1, y_2, ..., y_{t−1})

Let us see how we model P(y_t | y_1, y_2, ..., y_{t−1}) using an RNN. We will refer to P(y_t | y_1, y_2, ..., y_{t−1}) by the shorthand notation P(y_t | y_1^{t−1}).

SLIDE 4

[Figure: the same unrolled RNN generating "I am at home today ⟨stop⟩"; at each step the model outputs P(y_t = j | y_1^{t−1})]

We are interested in P(y_t = j | y_1, y_2, ..., y_{t−1}), where j ∈ V and V is the set of all vocabulary words. Using an RNN we compute this as

P(y_t = j | y_1^{t−1}) = softmax(V s_t + c)_j

In other words, we compute P(y_t = j | y_1^{t−1}) = P(y_t = j | s_t) = softmax(V s_t + c)_j. Notice that the recurrent connections ensure that s_t has information about y_1^{t−1}.

SLIDE 5

[Figure: the same unrolled RNN, now annotated with the training data, model, and loss below]

Data: "India, officially the Republic of India, is a country in South Asia. It is the seventh-largest country by area, ..." In general, all sentences from any large corpus (say, Wikipedia)

Model:
s_t = σ(W s_{t−1} + U x_t + b)
P(y_t = j | y_1^{t−1}) = softmax(V s_t + c)_j

Parameters: U, V, W, b, c

Loss: L(θ) = Σ_{t=1}^{T} L_t(θ), where L_t(θ) = − log P(y_t = ℓ_t | y_1^{t−1}) and ℓ_t is the true word at time step t
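As an aside, this model and loss translate almost line for line into code. Below is a minimal sketch in PyTorch (not part of the lecture; the sizes and the plain RNN cell are illustrative assumptions):

```python
import torch
import torch.nn as nn

vocab_size, embed_size, hidden_size = 10000, 256, 512  # illustrative sizes

emb = nn.Embedding(vocab_size, embed_size)          # x_t = e(v_j)
rnn_cell = nn.RNNCell(embed_size, hidden_size)      # s_t = sigma(W s_{t-1} + U x_t + b)
V = nn.Linear(hidden_size, vocab_size)              # logits = V s_t + c

def lm_loss(tokens):
    """tokens: 1D LongTensor [y_1, ..., y_T]; returns L(theta) = sum_t L_t(theta)."""
    s = torch.zeros(1, hidden_size)                 # s_0 (the lecture learns this too)
    loss = 0.0
    for t in range(len(tokens) - 1):
        x = emb(tokens[t].view(1))                  # embedding of the previous word
        s = rnn_cell(x, s)                          # recurrent update
        log_probs = torch.log_softmax(V(s), dim=-1) # P(y_t = j | y_1^{t-1})
        loss = loss - log_probs[0, tokens[t + 1]]   # -log P(y_t = l_t | y_1^{t-1})
    return loss
```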

SLIDE 6

[Figure: the unrolled RNN; the word predicted at step t − 1 ("I am at home today") is fed back as the input x_t at step t]

What is the input at each time step? It is simply the word that we predicted at the previous time step. In general, s_t = RNN(s_{t−1}, x_t). Let j be the index of the word which has been assigned the maximum probability at time step t − 1; then x_t = e(v_j). Here x_t is essentially a one-hot vector e(v_j) representing the j-th word in the vocabulary. In practice, instead of a one-hot representation we use a pre-trained word embedding of the j-th word.
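A sketch of this feedback loop, built on the hypothetical modules from the language-model sketch above (greedy argmax decoding assumed):

```python
import torch

def greedy_step(rnn_cell, emb, V, s_prev, y_prev):
    """One decoding step: feed the previously predicted word back as input.

    rnn_cell, emb, V are the modules from the earlier sketch;
    y_prev is the index j chosen at time step t-1.
    """
    x = emb(torch.tensor([y_prev]))       # pre-trained embedding instead of one-hot e(v_j)
    s = rnn_cell(x, s_prev)               # s_t = RNN(s_{t-1}, x_t)
    probs = torch.softmax(V(s), dim=-1)
    y = int(probs.argmax(dim=-1))         # j = argmax_j P(y_t = j | y_1^{t-1})
    return s, y
```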

SLIDE 7

[Figure: the same unrolled RNN, highlighting the initial state s_0]

Notice that s_0 is not computed but just randomly initialized. We learn it along with the other parameters of the RNN (or LSTM or GRU). We will return to this later.

SLIDE 8

Before moving on, we will see a compact way of writing the functions computed by an RNN, GRU, and LSTM. We will use these notations going forward.

RNN:
s_t = σ(U x_t + W s_{t−1} + b)
Compactly: s_t = RNN(s_{t−1}, x_t)

GRU:
s̃_t = σ(W(o_t ⊙ s_{t−1}) + U x_t + b)
s_t = i_t ⊙ s_{t−1} + (1 − i_t) ⊙ s̃_t
Compactly: s_t = GRU(s_{t−1}, x_t)

LSTM:
s̃_t = σ(W h_{t−1} + U x_t + b)
s_t = f_t ⊙ s_{t−1} + i_t ⊙ s̃_t
h_t = o_t ⊙ σ(s_t)
Compactly: h_t, s_t = LSTM(h_{t−1}, s_{t−1}, x_t)
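For reference, PyTorch's recurrent cells expose essentially these compact interfaces. A small sketch (the sizes are arbitrary; the gate computations live inside the cells):

```python
import torch
import torch.nn as nn

x_t = torch.randn(1, 64)        # input at time t
s_prev = torch.zeros(1, 128)    # s_{t-1}
h_prev = torch.zeros(1, 128)    # h_{t-1} (LSTM only)

s_t = nn.RNNCell(64, 128)(x_t, s_prev)                  # s_t = RNN(s_{t-1}, x_t)
s_t = nn.GRUCell(64, 128)(x_t, s_prev)                  # s_t = GRU(s_{t-1}, x_t)
h_t, s_t = nn.LSTMCell(64, 128)(x_t, (h_prev, s_prev))  # h_t, s_t = LSTM(h_{t-1}, s_{t-1}, x_t)
```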

SLIDE 9

[Figure: an RNN decoder generating the caption "A man throwing a frisbee in a park", one word per step, with output distribution P(y_t = j | y_1^{t−1})]

So far we have seen how to model the conditional probability distribution P(y_t | y_1^{t−1}). More informally, we have seen how to generate a sentence given the previous words. What if we want to generate a sentence given an image? We are now interested in P(y_t | y_1^{t−1}, I) instead of P(y_t | y_1^{t−1}), where I is an image. Notice that P(y_t | y_1^{t−1}, I) is again a conditional distribution.

SLIDE 10

[Figure: a CNN encodes the image; its output s_0 = fc7(I) initializes the RNN decoder, which generates the caption and models P(y_t = j | y_1^{t−1}, I)]

Earlier we modeled P(y_t | y_1^{t−1}) as P(y_t = j | s_t), where s_t was a state capturing all the previous words. We could now model P(y_t = j | y_1^{t−1}, I) as P(y_t = j | s_t, fc7(I)), where fc7(I) is the representation obtained from the fc7 layer of a CNN applied to the image.

SLIDE 11

There are many ways of making P(y_t = j) conditional on fc7(I). Let us see two such options.

SLIDE 12

Option 1

[Figure: the image encoding fc7(I) is used only to initialize the decoder: s_0 = fc7(I)]

Option 1: Set s_0 = fc7(I). Now s_0, and hence all subsequent s_t's, depend on fc7(I). We can thus say that P(y_t = j) depends on fc7(I). In other words, we are computing P(y_t = j | s_t, fc7(I)).

SLIDE 13

Option 2

[Figure: the image encoding fc7(I) is concatenated to the decoder input at every time step]

Option 2: Another, more explicit, way of doing this is to compute s_t = RNN(s_{t−1}, [x_t, fc7(I)]). In other words, we are explicitly using fc7(I) to compute s_t and hence P(y_t = j). You could think of other ways of conditioning P(y_t = j) on fc7(I).
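A sketch of both options in PyTorch (the projection in Option 1 is an assumption to make the dimensions match; the slide simply sets s_0 = fc7(I)):

```python
import torch
import torch.nn as nn

hidden_size, embed_size, feat_size = 512, 256, 4096  # fc7 of VGGNet is 4096-d

# Option 1: use the image encoding only to initialize the decoder state
proj = nn.Linear(feat_size, hidden_size)             # assumed projection, not in the slide
def init_state(fc7):                                  # fc7: (1, 4096)
    return torch.tanh(proj(fc7))                      # s_0 derived from fc7(I)

# Option 2: concatenate the image encoding to the input at every time step
cell = nn.RNNCell(embed_size + feat_size, hidden_size)
def step(s_prev, x_t, fc7):
    return cell(torch.cat([x_t, fc7], dim=-1), s_prev)  # s_t = RNN(s_{t-1}, [x_t, fc7(I)])
```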

SLIDE 14

[Figure: the full architecture: a CNN encoder produces h_0 from the image; an RNN decoder generates the caption, modeling P(y_t = j | y_1^{t−1}, I)]

Let us look at the full architecture. A CNN is first used to encode the image. An RNN is then used to decode (generate) a sentence from the encoding. This is a typical encoder decoder architecture. Both the encoder and the decoder are neural networks.

SLIDE 15

[Figure: the same architecture, with the encoder's output fed to every step of the decoder]

Alternatively, the encoder's output can be fed to every step of the decoder instead of only at the first step.

SLIDE 16

Module 16.2: Applications of Encoder Decoder models

SLIDE 17

For all these applications we will try to answer the following questions: What kind of network can we use to encode the input(s)? (What is an appropriate encoder?) What kind of network can we use to decode the output? (What is an appropriate decoder?) What are the parameters of the model? What is an appropriate loss function?

SLIDE 18

[Figure: encoder-decoder for image captioning: a CNN encodes the image into h_0, an RNN decoder generates the caption, and L_t(θ) = − log P(y_t = j | y_1^{t−1}, fc7)]

Task: Image captioning
Data: {x_i = image_i, y_i = caption_i}_{i=1}^{N}
Model:
  Encoder: s_0 = CNN(x_i)
  Decoder: s_t = RNN(s_{t−1}, e(ŷ_{t−1}))
  P(y_t | y_1^{t−1}, I) = softmax(V s_t + b)
Parameters: U_dec, V, W_dec, W_conv, b
Loss: L(θ) = Σ_{t=1}^{T} L_t(θ) = − Σ_{t=1}^{T} log P(y_t = ℓ_t | y_1^{t−1}, I)
Algorithm: Gradient descent with backpropagation
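Putting the pieces together, a minimal training-step sketch (assuming precomputed fc7 features, a learned projection to s_0, and teacher forcing, i.e., feeding the true previous word during training; names and sizes are illustrative):

```python
import torch
import torch.nn as nn

hidden, embed, vocab, feat = 512, 256, 10000, 4096   # illustrative sizes

enc = nn.Linear(feat, hidden)        # maps fc7(I) to s_0 (projection is an assumption)
emb = nn.Embedding(vocab, embed)
dec = nn.RNNCell(embed, hidden)
V = nn.Linear(hidden, vocab)

def caption_loss(fc7_feats, caption):
    """fc7_feats: (1, 4096); caption: LongTensor [<GO>, w_1, ..., <stop>]."""
    s = torch.tanh(enc(fc7_feats))                     # Encoder: s_0 = CNN(x_i)
    loss = 0.0
    for t in range(len(caption) - 1):
        s = dec(emb(caption[t].view(1)), s)            # s_t = RNN(s_{t-1}, e(y_{t-1}))
        loss = loss - torch.log_softmax(V(s), -1)[0, caption[t + 1]]
    return loss                                        # sum_t -log P(y_t = l_t | y_1^{t-1}, I)
```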

SLIDE 19

i/p: It is raining outside
o/p: The ground is wet

[Figure: an RNN encoder reads the premise word by word (h_t); the decoder generates the hypothesis word by word (s_t)]

Task: Textual entailment
Data: {x_i = premise_i, y_i = hypothesis_i}_{i=1}^{N}
Model (Option 1):
  Encoder: h_t = RNN(h_{t−1}, x_it)
  Decoder: s_0 = h_T (T is the length of the input)
  s_t = RNN(s_{t−1}, e(ŷ_{t−1}))
  P(y_t | y_1^{t−1}, x) = softmax(V s_t + b)
Parameters: U_dec, V, W_dec, U_enc, W_enc, b
Loss: L(θ) = Σ_{t=1}^{T} L_t(θ) = − Σ_{t=1}^{T} log P(y_t = ℓ_t | y_1^{t−1}, x)
Algorithm: Gradient descent with backpropagation
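This and the following task slides (entailment, translation, transliteration, summarization, dialog) all instantiate the same recipe, so one sketch covers them. A minimal PyTorch sketch of Option 1 (vocabulary sizes and dimensions are illustrative assumptions):

```python
import torch
import torch.nn as nn

src_vocab, tgt_vocab, embed, hidden = 8000, 8000, 256, 512  # illustrative

src_emb = nn.Embedding(src_vocab, embed)
tgt_emb = nn.Embedding(tgt_vocab, embed)
enc = nn.RNNCell(embed, hidden)
dec = nn.RNNCell(embed, hidden)
V = nn.Linear(hidden, tgt_vocab)

def seq2seq_loss(src, tgt):
    """src, tgt: 1D LongTensors; tgt = [<GO>, w_1, ..., <stop>]."""
    h = torch.zeros(1, hidden)
    for t in range(len(src)):                    # Encoder: h_t = RNN(h_{t-1}, x_t)
        h = enc(src_emb(src[t].view(1)), h)
    s, loss = h, 0.0                             # Decoder: s_0 = h_T
    for t in range(len(tgt) - 1):
        s = dec(tgt_emb(tgt[t].view(1)), s)      # s_t = RNN(s_{t-1}, e(y_{t-1}))
        loss = loss - torch.log_softmax(V(s), -1)[0, tgt[t + 1]]
    return loss
```

For Option 2 (feeding h_T at every step), only the decoder input changes: dec becomes nn.RNNCell(embed + hidden, hidden) and each step receives torch.cat([tgt_emb(tgt[t].view(1)), h], dim=-1) computed from the final encoder state.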

SLIDE 20

i/p: It is raining outside
o/p: The ground is wet

[Figure: the same encoder-decoder, with the final encoder state h_T also fed to every decoder step]

Task: Textual entailment
Data: {x_i = premise_i, y_i = hypothesis_i}_{i=1}^{N}
Model (Option 2):
  Encoder: h_t = RNN(h_{t−1}, x_it)
  Decoder: s_0 = h_T (T is the length of the input)
  s_t = RNN(s_{t−1}, [h_T, e(ŷ_{t−1})])
  P(y_t | y_1^{t−1}, x) = softmax(V s_t + b)
Parameters: U_dec, V, W_dec, U_enc, W_enc, b
Loss: L(θ) = Σ_{t=1}^{T} L_t(θ) = − Σ_{t=1}^{T} log P(y_t = ℓ_t | y_1^{t−1}, x)
Algorithm: Gradient descent with backpropagation

SLIDE 21

i/p: I am going home
o/p: Mein ghar ja raha hoon

[Figure: an RNN encoder reads the source sentence; the decoder generates the target sentence]

Task: Machine translation
Data: {x_i = source_i, y_i = target_i}_{i=1}^{N}
Model (Option 1):
  Encoder: h_t = RNN(h_{t−1}, x_it)
  Decoder: s_0 = h_T (T is the length of the input)
  s_t = RNN(s_{t−1}, e(ŷ_{t−1}))
  P(y_t | y_1^{t−1}, x) = softmax(V s_t + b)
Parameters: U_dec, V, W_dec, U_enc, W_enc, b
Loss: L(θ) = Σ_{t=1}^{T} L_t(θ) = − Σ_{t=1}^{T} log P(y_t = ℓ_t | y_1^{t−1}, x)
Algorithm: Gradient descent with backpropagation

SLIDE 22

i/p: I am going home
o/p: Mein ghar ja raha hoon

[Figure: the same encoder-decoder, with h_T fed to every decoder step]

Task: Machine translation
Data: {x_i = source_i, y_i = target_i}_{i=1}^{N}
Model (Option 2):
  Encoder: h_t = RNN(h_{t−1}, x_it)
  Decoder: s_0 = h_T (T is the length of the input)
  s_t = RNN(s_{t−1}, [h_T, e(ŷ_{t−1})])
  P(y_t | y_1^{t−1}, x) = softmax(V s_t + b)
Parameters: U_dec, V, W_dec, U_enc, W_enc, b
Loss: L(θ) = Σ_{t=1}^{T} L_t(θ) = − Σ_{t=1}^{T} log P(y_t = ℓ_t | y_1^{t−1}, x)
Algorithm: Gradient descent with backpropagation

SLIDE 23

i/p: I N D I A
o/p: š ' @ i y a (the target word, character by character, in Devanagari script)

[Figure: an RNN encoder reads the source characters; the decoder generates the target characters]

Task: Transliteration
Data: {x_i = srcword_i, y_i = tgtword_i}_{i=1}^{N}
Model (Option 1):
  Encoder: h_t = RNN(h_{t−1}, x_it)
  Decoder: s_0 = h_T (T is the length of the input)
  s_t = RNN(s_{t−1}, e(ŷ_{t−1}))
  P(y_t | y_1^{t−1}, x) = softmax(V s_t + b)
Parameters: U_dec, V, W_dec, U_enc, W_enc, b
Loss: L(θ) = Σ_{t=1}^{T} L_t(θ) = − Σ_{t=1}^{T} log P(y_t = ℓ_t | y_1^{t−1}, x)
Algorithm: Gradient descent with backpropagation

SLIDE 24

i/p: I N D I A
o/p: š ' @ i y a (the target word, character by character, in Devanagari script)

[Figure: the same encoder-decoder, with h_T fed to every decoder step]

Task: Transliteration
Data: {x_i = srcword_i, y_i = tgtword_i}_{i=1}^{N}
Model (Option 2):
  Encoder: h_t = RNN(h_{t−1}, x_it)
  Decoder: s_0 = h_T (T is the length of the input)
  s_t = RNN(s_{t−1}, [e(ŷ_{t−1}), h_T])
  P(y_t | y_1^{t−1}, x) = softmax(V s_t + b)
Parameters: U_dec, V, W_dec, U_enc, W_enc, b
Loss: L(θ) = Σ_{t=1}^{T} L_t(θ) = − Σ_{t=1}^{T} log P(y_t = ℓ_t | y_1^{t−1}, x)
Algorithm: Gradient descent with backpropagation

SLIDE 25

Question: What is the bird's color? o/p: White

[Figure: a CNN encodes the image into ĥ_I and an RNN encodes the question into h̃_T; the two are combined into s to predict the answer "White"]

Task: Image question answering
Data: {x_i = {I, q}_i, y_i = answer_i}_{i=1}^{N}
Model:
  Encoder: ĥ_I = CNN(I), h̃_t = RNN(h̃_{t−1}, q_it), s = [h̃_T; ĥ_I]
  Decoder: P(y | q, I) = softmax(V s + b)
Parameters: V, U_q, W_q, W_conv, b
Loss: L(θ) = − log P(y = ℓ | I, q)
Algorithm: Gradient descent with backpropagation
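Here the "decoder" is just a classification layer over the concatenated encodings. A minimal sketch (assuming precomputed image features; the answer vocabulary of 1000 classes is an arbitrary choice):

```python
import torch
import torch.nn as nn

hidden, feat, classes, vocab, embed = 512, 4096, 1000, 8000, 256  # illustrative

q_emb = nn.Embedding(vocab, embed)
q_enc = nn.GRUCell(embed, hidden)
clf = nn.Linear(hidden + feat, classes)          # decoder is a single softmax layer

def answer_logits(question, img_feat):
    """question: 1D LongTensor; img_feat: (1, 4096) CNN features (assumed precomputed)."""
    h = torch.zeros(1, hidden)
    for t in range(len(question)):
        h = q_enc(q_emb(question[t].view(1)), h)  # h_t = RNN(h_{t-1}, q_t)
    s = torch.cat([h, img_feat], dim=-1)          # s = [h_T ; h_I]
    return clf(s)                                 # P(y | q, I) = softmax(V s + b)
```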

SLIDE 26

i/p: India beats Srilanka to win ICC WC 2011. Dhoni and Gambhir's half centuries help beat SL
o/p: India won the world cup

[Figure: an RNN encoder reads the document; the decoder generates the summary]

Task: Document summarization
Data: {x_i = document_i, y_i = summary_i}_{i=1}^{N}
Model:
  Encoder: h_t = RNN(h_{t−1}, x_it)
  Decoder: s_0 = h_T
  s_t = RNN(s_{t−1}, e(ŷ_{t−1}))
  P(y_t | y_1^{t−1}, x) = softmax(V s_t + b)
Parameters: U_dec, V, W_dec, U_enc, W_enc, b
Loss: L(θ) = Σ_{t=1}^{T} L_t(θ) = − Σ_{t=1}^{T} log P(y_t = ℓ_t | y_1^{t−1}, x)
Algorithm: Gradient descent with backpropagation

SLIDE 27

o/p: A man walking on a rope

[Figure: a CNN is applied to each video frame; an RNN encodes the sequence of frame representations; an RNN decoder generates the caption]

Task: Video captioning
Data: {x_i = video_i, y_i = description_i}_{i=1}^{N}
Model:
  Encoder: h_t = RNN(h_{t−1}, CNN(x_it))
  Decoder: s_0 = h_T
  s_t = RNN(s_{t−1}, e(ŷ_{t−1}))
  P(y_t | y_1^{t−1}, x) = softmax(V s_t + b)
Parameters: U_dec, W_dec, V, W_conv, U_enc, W_enc, b
Loss: L(θ) = Σ_{t=1}^{T} L_t(θ) = − Σ_{t=1}^{T} log P(y_t = ℓ_t | y_1^{t−1}, x)
Algorithm: Gradient descent with backpropagation

SLIDE 28

o/p: Surya Namaskar

[Figure: a CNN is applied to each video frame; an RNN encodes the sequence of frame representations; a single softmax predicts the activity]

Task: Video classification
Data: {x_i = video_i, y_i = activity_i}_{i=1}^{N}
Model:
  Encoder: h_t = RNN(h_{t−1}, CNN(x_it))
  Decoder: s = h_T, P(y | video) = softmax(V s + b)
Parameters: V, W_conv, U_enc, W_enc, b
Loss: L(θ) = − log P(y = ℓ | video)
Algorithm: Gradient descent with backpropagation

SLIDE 29

i/p: How are you
o/p: I am fine

[Figure: an RNN encoder reads the utterance; the decoder generates the response]

Task: Dialog
Data: {x_i = utterance_i, y_i = response_i}_{i=1}^{N}
Model:
  Encoder: h_t = RNN(h_{t−1}, x_it)
  Decoder: s_0 = h_T (T is the length of the input)
  s_t = RNN(s_{t−1}, e(ŷ_{t−1}))
  P(y_t | y_1^{t−1}, x) = softmax(V s_t + b)
Parameters: U_dec, V, W_dec, U_enc, W_enc, b
Loss: L(θ) = Σ_{t=1}^{T} L_t(θ) = − Σ_{t=1}^{T} log P(y_t = ℓ_t | y_1^{t−1}, x)
Algorithm: Gradient descent with backpropagation

SLIDE 30

And the list continues... Try picking a problem from your domain and see if you can model it using the encoder decoder paradigm. Encoder decoder models can be made even more expressive by adding an "attention" mechanism. We will first motivate the need for this and then explain how to model it.

SLIDE 31

Module 16.3: Attention Mechanism

SLIDE 32

i/p: Main ghar ja raha hoon
o/p: I am going home

[Figure: encoder-decoder (Option 2) for translation: the encoder states h_i summarize the input into a single encoding c, which is fed to every decoder step s_i]

Let us motivate the task of attention with the help of machine translation. The encoder reads the sentence only once and encodes it. At each timestep the decoder uses this embedding to produce a new word. Is this how humans translate a sentence? Not really!

SLIDE 33

o/p: I am going home
i/p: Main ghar ja raha hoon

t_1: [1, 0, 0, 0, 0]
t_2: [0, 0, 0, 0, 1]
t_3: [0, 0, 0.5, 0.5, 0]
t_4: [0, 1, 0, 0, 0]

Humans try to produce each word in the output by focusing only on certain words in the input. Essentially, at each time step we come up with a distribution over the input words. This distribution tells us how much attention to pay to each input word at each time step. Ideally, at each time step we should feed only this relevant information (i.e., the encodings of the relevant words) to the decoder.

SLIDE 34

i/p: Main ghar ja raha hoon
o/p: I am going home

[Figure: the same encoder-decoder, feeding the single encoding c to every decoder step]

Let us revisit the decoder that we have seen so far. We either feed in the encoder information only once (at s_0), or we feed the same encoder information at each time step. Now suppose an oracle told you which words to focus on at a given time step t. Can you think of a smarter way of feeding information to the decoder?

SLIDE 35

[Figure: at each decoder time step t, a context vector c_t is computed as a weighted combination (weights α_{j,t}) of the encoder's word representations]

We could just take a weighted average of the corresponding word representations and feed it to the decoder. For example, at time step 3 we could take a weighted average of the representations of 'ja' and 'raha'. Intuitively this should work better because we are not overloading the decoder with irrelevant information (about words that do not matter at this time step). How do we convert this intuition into a model?

SLIDE 36

[Figure: the context vector c_t as a weighted combination of encoder states, with weights α_{j,t}]

Of course, in practice we will not have this oracle. The machine will have to learn this from the data. To enable this we define a function

e_jt = f_ATT(s_{t−1}, c_j)

This quantity captures the importance of the j-th input word for decoding the t-th output word (we will see the exact form of f_ATT later). We can normalize these weights by using the softmax function:

α_jt = exp(e_jt) / Σ_{j=1}^{M} exp(e_jt)

SLIDE 37

[Figure: the context vector c_t as a weighted combination of encoder states, with weights α_{j,t}]

α_jt = exp(e_jt) / Σ_{j=1}^{M} exp(e_jt)

α_jt denotes the probability of focusing on the j-th word to produce the t-th output word. We are now trying to learn the α's, instead of an oracle informing us about them. Learning always involves some parameters, so let's define a parametric form for the α's.

SLIDE 38

[Figure: the context vector c_t as a weighted combination of encoder states, with weights α_{j,t}]

From now on we will refer to the decoder RNN's state at the t-th time step as s_t and the encoder RNN's state at the j-th time step as c_j. Given these new notations, one (among many) possible choice for f_ATT is

e_jt = V_att^T tanh(U_att s_{t−1} + W_att c_j)

where V_att ∈ R^d, U_att ∈ R^{d×d}, W_att ∈ R^{d×d} are additional parameters of the model. These parameters will be learned along with the other parameters of the encoder and decoder.
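This f_ATT is straightforward to implement. A sketch (d and the bias-free linear maps are assumptions consistent with the formula above):

```python
import torch
import torch.nn as nn

d = 512  # illustrative state size

U_att = nn.Linear(d, d, bias=False)
W_att = nn.Linear(d, d, bias=False)
v_att = nn.Linear(d, 1, bias=False)   # plays the role of V_att^T

def attention_weights(s_prev, enc_states):
    """s_prev: (1, d) decoder state s_{t-1}; enc_states: (M, d) encoder states c_1..c_M."""
    e = v_att(torch.tanh(U_att(s_prev) + W_att(enc_states)))  # e_jt, shape (M, 1)
    return torch.softmax(e.squeeze(-1), dim=0)                # alpha_jt over the M inputs
```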

SLIDE 39

[Figure: the context vector c_t as a weighted combination of encoder states, with weights α_{j,t}]

Wait a minute! This model would make a lot of sense if we were given the true α's at training time, e.g.,

α_tj^true = [0, 0, 0.5, 0.5, 0]
α_tj^pred = [0.1, 0.1, 0.35, 0.35, 0.1]

We could then minimize L(α^true, α^pred) in addition to L(θ) as defined earlier. But in practice it is very hard to get α^true.

SLIDE 40

[Figure: the context vector c_t as a weighted combination of encoder states, with weights α_{j,t}]

For example, in our translation example we would want someone to manually annotate the source words which contribute to every target word. It is hard to get such annotated data. Then how would this model work in the absence of such data?

SLIDE 41

[Figure: the context vector c_t as a weighted combination of encoder states, with weights α_{j,t}]

It works because it is a better modeling choice. This is a more informed model: we are essentially asking the model to approach the problem in a better (more natural) way. Given enough data it should be able to learn these attention weights, just as humans do. That's the hope (and hope is a good thing). And in practice these models do indeed work better than vanilla encoder decoder models.

SLIDE 42

Let us revisit the MT model that we saw earlier and answer the same set of questions again (data, encoder, decoder, loss, training algorithm).

SLIDE 43

[Figure: machine translation with attention: each decoder step t computes a context vector c_t = Σ_j α_{j,t} h_j over the encoder states]

Task: Machine translation
Data: {x_i = source_i, y_i = target_i}_{i=1}^{N}
Model:
  Encoder: h_t = RNN(h_{t−1}, x_t), s_0 = h_T
  Decoder:
    e_jt = V_attn^T tanh(U_attn h_j + W_attn s_{t−1})
    α_jt = softmax(e_jt)
    c_t = Σ_{j=1}^{T} α_jt h_j
    s_t = RNN(s_{t−1}, [e(ŷ_{t−1}), c_t])
    ℓ_t = softmax(V s_t + b)
Parameters: U_dec, V, W_dec, U_enc, W_enc, b, U_attn, V_attn
Loss and algorithm remain the same as before
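A sketch of one decoder step under this model, reusing the attention-weights function sketched in Module 16.3 (names and sizes are illustrative):

```python
import torch
import torch.nn as nn

d, embed, vocab = 512, 256, 8000  # illustrative

emb = nn.Embedding(vocab, embed)
dec = nn.RNNCell(embed + d, d)    # input is [e(y_{t-1}), c_t]
V = nn.Linear(d, vocab)

def decode_step(s_prev, y_prev, enc_states, attn):
    """One decoder step with attention; attn is the scoring function defined earlier."""
    alpha = attn(s_prev, enc_states)                               # alpha_jt
    c_t = (alpha.unsqueeze(-1) * enc_states).sum(0, keepdim=True)  # c_t = sum_j alpha_jt h_j
    x = torch.cat([emb(torch.tensor([y_prev])), c_t], dim=-1)
    s_t = dec(x, s_prev)                                           # s_t = RNN(s_{t-1}, [e(y_{t-1}), c_t])
    return s_t, torch.softmax(V(s_t), dim=-1)                      # distribution over target words
```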

SLIDE 44

You can try adding an attention component to all the other encoder decoder models that we discussed earlier and answer the same set of questions (data, encoder, decoder, loss, training algorithm).

SLIDE 45

Can we check if the attention model actually learns something meaningful? In other words, does it really learn to focus on the most relevant words in the input at the t-th time step? We can check this by plotting the attention weights as a heatmap (we will see some examples on the next slide).

SLIDE 46

Figure: Example output of an attention-based summarization system [Rush et al., 2015]. Figure: Example output of an attention-based neural machine translation model [Cho et al., 2015].

The heat map shows a soft alignment between the input and the generated output. Each cell in the heat map corresponds to α_tj (i.e., the importance of the j-th input word for predicting the t-th output word, as determined by the model).

SLIDE 47

Figure: Example output of an attention-based video captioning system [Yao et al., 2015].

SLIDE 48

Module 16.4: Attention over images

SLIDE 49

A man throwing a frisbee in a park

How do we model an attention mechanism for images?

SLIDE 50

[Figure: in the text case, attention is computed over the encoder's per-word representations h_i, which are combined into c_t]

How do we model an attention mechanism for images? In the case of text we have a representation for every location (time step) of the input sequence.

SLIDE 51

[Figure: a CNN encoder producing a single vector h_0]

In the case of text we have a representation for every location (time step) of the input sequence. But for images we typically use a representation from one of the fully connected layers, and this representation does not contain any location information. So then what is the input to the attention mechanism?

SLIDE 52

[Figure: the VGGNet architecture: a 224 × 224 input, alternating convolution and maxpool blocks (64, 128, 256, 512 channels, spatial size shrinking 224 → 112 → 56 → 28 → 14 → 7), two fully connected layers of size 4096, and a softmax over 1000 classes]

Well, instead of the fc7 representation we can use the output of one of the convolutional layers, which does have spatial information. For example, the output of the 5th convolutional layer of VGGNet is a 14 × 14 × 512 feature map.

SLIDE 53

[Figure: the 14 × 14 × 512 feature map viewed as 196 locations numbered 1 ... 196, each with a 512-dimensional representation, combined using attention weights α_t1, ..., α_t196]

We could think of this 14 × 14 × 512 feature map as 196 locations, each having a 512-dimensional representation. The model will then learn an attention distribution over these locations (which in turn correspond to actual locations in the image).
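So the only change from the text case is where the "location" vectors come from. A sketch of computing the attended context from a conv feature map (shapes follow the VGGNet example above; the weights α would come from the same kind of f_ATT as before):

```python
import torch

def spatial_context(conv_map, alpha):
    """conv_map: (512, 14, 14) conv5 features; alpha: (196,) attention weights."""
    locs = conv_map.flatten(1).t()               # (196, 512): one 512-d vector per location
    return alpha.unsqueeze(-1).mul(locs).sum(0)  # weighted average over the 196 locations
```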

SLIDE 54

Let us look at some examples of attention over images for the task of image captioning

SLIDE 55

Figure: Examples of the attention-based model attending to the correct object (white indicates the attended regions, underlines indicate the corresponding word) [Cho et al., 2015].

SLIDE 56

Module 16.5: Hierarchical Attention

SLIDE 57

Context:
U: Can you suggest a good movie?
B: Yes, sure. How about Logan?
U: Okay, who is the lead actor?
Response:
B: Hugh Jackman, of course

Consider a dialog between a user (U) and a bot (B). The dialog contains a sequence of utterances between the user and the bot. Each utterance in turn is a sequence of words. Thus what we have here is a "sequence of sequences" as input. Can you think of an encoder for such a sequence of sequences?

SLIDE 58

[Figure: a hierarchical encoder: word-level RNNs encode each utterance ("Can you ... movie?", "Yes sure ... Logan?", "Okay who ... actor?"); an utterance-level RNN encodes the resulting sequence of utterance representations; a decoder generates "Hugh Jackman, of course"]

We could think of a two-level hierarchical RNN encoder. The first-level RNN operates on the sequence of words in each utterance and gives us a representation per utterance. We now have a sequence of utterance representations (the red vectors in the image). We can then have another RNN which encodes this sequence and gives a single representation for the sequence of utterances. The decoder can then produce an output sequence conditioned on this utterance representation.

SLIDE 59

Politics is the process of making decisions applying to all members of each group. More narrowly, it refers to achieving and ...

[Figure: a hierarchical RNN over the document: word-level RNNs encode each sentence; a sentence-level RNN encodes the sequence of sentence representations]

Let us look at another example. Consider the task of document classification or summarization. A document is a sequence of sentences, and each sentence in turn is a sequence of words. We can again use a hierarchical RNN to model this.

SLIDE 60

[Figure: the same hierarchical RNN over the document]

Data: {document_i, class_i}_{i=1}^{N}
Model:
  Word-level (level 1) encoder:
    h1_ij = RNN(h1_{i,j−1}, w_ij)
    s_i = h1_{i,T_i}  [T_i is the length of sentence i]
  Sentence-level (level 2) encoder:
    h2_i = RNN(h2_{i−1}, s_i)
    s = h2_K  [K is the number of sentences]
  Decoder: P(y | document) = softmax(V s + b)
Parameters: W1_enc, U1_enc, W2_enc, U2_enc, V, b
Loss: cross entropy
Algorithm: Gradient descent with backpropagation
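A minimal sketch of this two-level encoder and classifier (the GRU cells and the class count are illustrative choices):

```python
import torch
import torch.nn as nn

embed, hidden = 256, 512  # illustrative

word_emb = nn.Embedding(30000, embed)
word_rnn = nn.GRUCell(embed, hidden)    # level-1 (word) encoder
sent_rnn = nn.GRUCell(hidden, hidden)   # level-2 (sentence) encoder
clf = nn.Linear(hidden, 5)              # e.g. 5 document classes (assumed)

def classify(document):
    """document: list of sentences, each a 1D LongTensor of word indices."""
    h2 = torch.zeros(1, hidden)
    for sent in document:
        h1 = torch.zeros(1, hidden)
        for j in range(len(sent)):
            h1 = word_rnn(word_emb(sent[j].view(1)), h1)  # h1_ij
        h2 = sent_rnn(h1, h2)           # s_i = h1_{i,T_i}; h2_i = RNN(h2_{i-1}, s_i)
    return clf(h2)                      # logits for P(y | document) = softmax(V s + b)
```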

SLIDE 61

Figure: Hierarchical Attention Network [Yang et al.]

How would you model attention in such a hierarchical encoder decoder model? We need attention at two levels: first we need to attend to the important (most informative) words in a sentence, and then we need to attend to the important (most informative) sentences in a document. Let us see how to model this.

SLIDE 62

Figure: Hierarchical Attention Network [Yang et al.]

Data: {document_i, class_i}_{i=1}^{N}
Word-level (level 1) encoder:
  h_ij = RNN(h_{i,j−1}, w_ij)
  u_ij = tanh(W_w h_ij + b_w)
  α_ij = exp(u_ij^T u_w) / Σ_t exp(u_it^T u_w)
  s_i = Σ_j α_ij h_ij
Sentence-level (level 2) encoder:
  h_i = RNN(h_{i−1}, s_i)
  u_i = tanh(W_s h_i + b_s)
  α_i = exp(u_i^T u_s) / Σ_i exp(u_i^T u_s)
  s = Σ_i α_i h_i
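The word-level attention pooling can be sketched as follows (the sentence level is identical with W_s and u_s in place of W_w and u_w; d is illustrative):

```python
import torch
import torch.nn as nn

d = 512  # illustrative

W_w = nn.Linear(d, d)                  # includes the bias b_w
u_w = nn.Parameter(torch.randn(d))     # learned word-level context vector

def attend(h):
    """h: (T, d) word-level states h_i1..h_iT for one sentence."""
    u = torch.tanh(W_w(h))             # u_ij = tanh(W_w h_ij + b_w)
    alpha = torch.softmax(u @ u_w, 0)  # alpha_ij = softmax_j(u_ij^T u_w)
    return (alpha.unsqueeze(-1) * h).sum(0)  # s_i = sum_j alpha_ij h_ij
```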

SLIDE 63

Figure: Hierarchical Attention Network [Yang et al.]

Decoder: P(y | document) = softmax(V s + b)
Parameters: W_w, W_s, V, b_w, b_s, b, u_w, u_s
Loss: cross entropy
Algorithm: Gradient descent with backpropagation
