CS7015 (Deep Learning): Lecture 16 Encoder Decoder Models, Attention Mechanism
Mitesh M. Khapra, Department of Computer Science and Engineering, Indian Institute of Technology Madras


  1. 1/63 CS7015 (Deep Learning): Lecture 16 Encoder Decoder Models, Attention Mechanism. Mitesh M. Khapra, Department of Computer Science and Engineering, Indian Institute of Technology Madras.

  2. 2/63 Module 16.1: Introduction to Encoder Decoder Models

  3. 3/63 We will start today by revisiting the problem of language modeling. Informally, given $t-1$ words we are interested in predicting the $t$-th word. More formally, given $y_1, y_2, \ldots, y_{t-1}$ we want to find
     $y^* = \arg\max_{y_t} P(y_t \mid y_1, y_2, \ldots, y_{t-1})$
     Let us see how we model $P(y_t \mid y_1, y_2, \ldots, y_{t-1})$ using an RNN. We will refer to $P(y_t \mid y_1, y_2, \ldots, y_{t-1})$ by the shorthand notation $P(y_t \mid y_1^{t-1})$.
     [Figure: an RNN unrolled over time with parameters $U, V, W$, states $s_t$, inputs $x_t$ and outputs $y_t$, generating "I am at home today ⟨stop⟩" starting from ⟨GO⟩ and initial state $s_0$.]

  4. 4/63 We are interested in $P(y_t = j \mid y_1, y_2, \ldots, y_{t-1})$ where $j \in V$ and $V$ is the set of all vocabulary words. Using an RNN we compute this as
     $P(y_t = j \mid y_1^{t-1}) = \mathrm{softmax}(Vs_t + c)_j$
     In other words we compute $P(y_t = j \mid y_1^{t-1}) = P(y_t = j \mid s_t) = \mathrm{softmax}(Vs_t + c)_j$. Notice that the recurrent connections ensure that $s_t$ has information about $y_1^{t-1}$.

  5. 5/63 Data: all sentences from any large corpus (say Wikipedia), e.g., "India, officially the Republic of India, is a country in South Asia. It is the seventh-largest country by area, ....."
     Model: $s_t = \sigma(Ws_{t-1} + Ux_t + b)$ and $P(y_t = j \mid y_1^{t-1}) = \mathrm{softmax}(Vs_t + c)_j$
     Parameters: $U, V, W, b, c$
     Loss: $\mathcal{L}(\theta) = \sum_{t=1}^{T} \mathcal{L}_t(\theta)$ where $\mathcal{L}_t(\theta) = -\log P(y_t = \ell_t \mid y_1^{t-1})$ and $\ell_t$ is the true word at time step $t$.
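
A minimal PyTorch sketch of the model and loss built up on slides 3 to 5, under assumed sizes (vocabulary 10,000, embeddings 256, state 512); the slides' $U, W, b$ sit inside `nn.RNNCell`, $V, c$ inside the output `nn.Linear`, and `lm_loss` is a hypothetical helper, not from the lecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, emb_dim, hidden_dim = 10000, 256, 512   # assumed sizes, not from the slides

embed = nn.Embedding(vocab_size, emb_dim)           # x_t = e(v_j): (pre-trained) word embedding
cell = nn.RNNCell(emb_dim, hidden_dim)              # s_t = sigma(U x_t + W s_{t-1} + b)
out = nn.Linear(hidden_dim, vocab_size)             # logits = V s_t + c

def lm_loss(tokens):
    """tokens: (T,) LongTensor of word indices, starting with <GO>.
    Returns sum_t -log P(y_t = l_t | y_1^{t-1}) for one sentence."""
    s = torch.zeros(1, hidden_dim)                  # s_0 (learned in practice, see slide 7)
    loss = 0.0
    for x_t, l_t in zip(tokens[:-1], tokens[1:]):   # predict the next word at every step
        s = cell(embed(x_t.view(1)), s)             # s_t = RNN(s_{t-1}, x_t)
        loss = loss + F.cross_entropy(out(s), l_t.view(1))  # -log softmax(V s_t + c)_{l_t}
    return loss
```

Calling `lm_loss` on each index-encoded sentence of the corpus and running gradient descent on the summed losses trains $U, V, W, b, c$.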

  6. 6/63 What is the input at each time step? It is simply the word that we predicted at the previous time step. Let $j$ be the index of the word which has been assigned the max probability at time step $t-1$. Then $x_t = e(v_j)$, and in general $s_t = RNN(s_{t-1}, x_t)$. Here $x_t$ is essentially a one-hot vector $e(v_j)$ representing the $j$-th word in the vocabulary. In practice, instead of the one-hot representation we use a pre-trained word embedding of the $j$-th word.
     [Figure: the output at each step (o/p: "I", "am", "at", "home", "today", ⟨stop⟩) shown as a one-hot vector and fed back as the input at the next step, starting from ⟨GO⟩.]
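
At generation time this feedback loop might look like the following sketch (greedy arg-max decoding; the `GO`/`STOP` indices and the components, re-declared here for self-containment, are assumptions matching the sketch above):

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 10000, 256, 512    # assumed sizes
embed = nn.Embedding(vocab_size, emb_dim)            # same components as in the sketch above
cell = nn.RNNCell(emb_dim, hidden_dim)
out = nn.Linear(hidden_dim, vocab_size)

@torch.no_grad()
def generate(max_len=20, GO=0, STOP=1):              # GO/STOP indices are assumptions
    """Greedy decoding: feed the arg-max word of step t-1 back as the input of step t."""
    s = torch.zeros(1, hidden_dim)                   # s_0
    x = torch.tensor([GO])                           # start from <GO>
    words = []
    for _ in range(max_len):
        s = cell(embed(x), s)                        # s_t = RNN(s_{t-1}, e(v_j))
        j = out(s).argmax(dim=-1)                    # index with max probability at step t
        if j.item() == STOP:
            break
        words.append(j.item())
        x = j                                        # next input: the word just predicted
    return words
```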

  7. 7/63 Notice that $s_0$ is not computed but just randomly initialized. We learn it along with the other parameters of the RNN (or LSTM or GRU). We will return to this later.
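
One plausible way to realize this (an implementation assumption, not spelled out on the slide) is to register $s_0$ as a trainable parameter:

```python
import torch
import torch.nn as nn

hidden_dim = 512                                     # assumed size
# s_0 is not computed from data: it is randomly initialized and registered as a
# parameter, so gradient descent updates it along with U, V, W, b, c.
s0 = nn.Parameter(0.01 * torch.randn(1, hidden_dim))
```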

  8. 8/63 Before moving on we will see a compact way of writing the function computed by an RNN, GRU and LSTM. We will use these notations going forward.
     RNN: $s_t = \sigma(Ux_t + Ws_{t-1} + b)$, written as $s_t = RNN(s_{t-1}, x_t)$
     GRU: $\tilde{s}_t = \sigma(W(o_t \odot s_{t-1}) + Ux_t + b)$, $s_t = i_t \odot s_{t-1} + (1 - i_t) \odot \tilde{s}_t$, written as $s_t = GRU(s_{t-1}, x_t)$
     LSTM: $\tilde{s}_t = \sigma(Wh_{t-1} + Ux_t + b)$, $s_t = f_t \odot s_{t-1} + i_t \odot \tilde{s}_t$, $h_t = o_t \odot \sigma(s_t)$, written as $h_t, s_t = LSTM(h_{t-1}, s_{t-1}, x_t)$
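
These shorthands line up with the standard recurrent-cell interfaces, e.g. in PyTorch (the built-in gate equations differ in detail from the simplified forms above; this sketch only illustrates the calling conventions, with assumed sizes):

```python
import torch
import torch.nn as nn

emb_dim, hidden_dim = 256, 512                        # assumed sizes
x_t = torch.randn(1, emb_dim)                         # input at time t
s_prev = torch.randn(1, hidden_dim)                   # s_{t-1}
h_prev = torch.randn(1, hidden_dim)                   # h_{t-1} (LSTM only)

rnn = nn.RNNCell(emb_dim, hidden_dim)
gru = nn.GRUCell(emb_dim, hidden_dim)
lstm = nn.LSTMCell(emb_dim, hidden_dim)

s_t = rnn(x_t, s_prev)                                # s_t = RNN(s_{t-1}, x_t)
s_t = gru(x_t, s_prev)                                # s_t = GRU(s_{t-1}, x_t)
h_t, s_t = lstm(x_t, (h_prev, s_prev))                # h_t, s_t = LSTM(h_{t-1}, s_{t-1}, x_t)
                                                      # (PyTorch's cell state c plays the role of s)
```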

  9. 9/63 So far we have seen how to model the conditional probability distribution $P(y_t \mid y_1^{t-1})$. More informally, we have seen how to generate a sentence given the previous words. What if we want to generate a sentence given an image? We are now interested in $P(y_t \mid y_1^{t-1}, I)$ instead of $P(y_t \mid y_1^{t-1})$, where $I$ is an image. Notice that $P(y_t \mid y_1^{t-1}, I)$ is again a conditional distribution.
     [Figure: an RNN generating the caption "A man throwing a frisbee in a park" word by word, starting from ⟨Go⟩.]

  10. 10/63 Earlier we modeled $P(y_t \mid y_1^{t-1})$ as $P(y_t = j \mid y_1^{t-1}) = P(y_t = j \mid s_t)$, where $s_t$ was a state capturing all the previous words. We could now model $P(y_t = j \mid y_1^{t-1}, I)$ as $P(y_t = j \mid s_t, fc_7(I))$, where $fc_7(I)$ is the representation obtained from the fc7 layer of a CNN applied to the image.
     [Figure: a CNN processes the image and its output feeds the RNN that generates "A man throwing ... ⟨stop⟩", with $s_0 = fc_7(I)$.]

  11. 11/63 There are many ways of making $P(y_t = j)$ conditional on $fc_7(I)$. Let us see two such options.

  12. 12/63 Option 1: set $s_0 = fc_7(I)$. Now $s_0$, and hence all subsequent $s_t$'s, depend on $fc_7(I)$. We can thus say that $P(y_t = j)$ depends on $fc_7(I)$. In other words, we are computing $P(y_t = j \mid s_t, fc_7(I))$.

  13. 13/63 Option 2: another, more explicit, way of doing this is to compute $s_t = RNN(s_{t-1}, [x_t, fc_7(I)])$. In other words, we are explicitly using $fc_7(I)$ to compute $s_t$ and hence $P(y_t = j)$. You could think of other ways of conditioning $P(y_t = j)$ on $fc_7(I)$.
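
A hedged sketch of both options (the 4096-dimensional fc7 feature, the linear projection used in Option 1 to match sizes, and all dimensions are assumptions for illustration):

```python
import torch
import torch.nn as nn

fc7_dim, emb_dim, hidden_dim = 4096, 256, 512         # assumed sizes
fc7_I = torch.randn(1, fc7_dim)                       # fc7(I): image feature from a CNN
x_t = torch.randn(1, emb_dim)                         # embedding of the previous word

# Option 1: condition only through the initial state s_0.
proj = nn.Linear(fc7_dim, hidden_dim)                 # projection to match sizes (my addition)
cell1 = nn.RNNCell(emb_dim, hidden_dim)
s = proj(fc7_I)                                       # s_0 = fc7(I)
s = cell1(x_t, s)                                     # every s_t now depends on fc7(I) via s_0

# Option 2: feed fc7(I) explicitly at every time step.
cell2 = nn.RNNCell(emb_dim + fc7_dim, hidden_dim)
s = torch.zeros(1, hidden_dim)
s = cell2(torch.cat([x_t, fc7_I], dim=-1), s)         # s_t = RNN(s_{t-1}, [x_t, fc7(I)])
```

Option 1 touches the image feature once, while Option 2 keeps it available at every step of the generation.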

  14. 14/63 Let us look at the full architecture. A CNN is first used to encode the image; an RNN is then used to decode (generate) a sentence from the encoding. This is a typical encoder decoder architecture. Both the encoder and the decoder use a neural network.

  15. 15/63 The full architecture again: a CNN first encodes the image and an RNN then decodes (generates) a sentence from the encoding, with both the encoder and the decoder being neural networks. Alternatively, the encoder's output can be fed to every step of the decoder rather than only initializing it.
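
Putting the two halves together, a toy end-to-end sketch (the small CNN below merely stands in for the fc7 feature extractor of the slides; every size, the `GO`/`STOP` indices and the greedy decoding loop are assumptions):

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 10000, 256, 512     # assumed sizes

# Encoder: a tiny CNN standing in for the fc7 feature extractor of the slides.
encoder = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, hidden_dim),                         # image encoding, used as s_0
)

# Decoder: an RNN language model conditioned on the image encoding.
embed = nn.Embedding(vocab_size, emb_dim)
cell = nn.RNNCell(emb_dim, hidden_dim)
out = nn.Linear(hidden_dim, vocab_size)

@torch.no_grad()
def caption(image, max_len=20, GO=0, STOP=1):          # GO/STOP indices are assumptions
    """Encode the image, then greedily decode a sentence from the encoding."""
    s = encoder(image)                                 # s_0 = CNN(image)
    x = torch.tensor([GO])
    words = []
    for _ in range(max_len):
        s = cell(embed(x), s)                          # s_t = RNN(s_{t-1}, e(y_hat_{t-1}))
        j = out(s).argmax(dim=-1)
        if j.item() == STOP:
            break
        words.append(j.item())
        x = j
    return words

# e.g. caption(torch.randn(1, 3, 224, 224)) -> a list of word indices
```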

  16. 16/63 Module 16.2: Applications of Encoder Decoder models

  17. 17/63 For all these applications we will try to answer the following questions: What kind of a network can we use to encode the input(s)? (What is an appropriate encoder?) What kind of a network can we use to decode the output? (What is an appropriate decoder?) What are the parameters of the model? What is an appropriate loss function?

  18. 18/63 Task: image captioning.
     Data: $\{x_i = \text{image}_i, y_i = \text{caption}_i\}_{i=1}^{N}$
     Model:
       Encoder: $s_0 = CNN(x_i)$
       Decoder: $s_t = RNN(s_{t-1}, e(\hat{y}_{t-1}))$, $P(y_t \mid y_1^{t-1}, I) = \mathrm{softmax}(Vs_t + b)$
     Parameters: $U_{dec}, V, W_{dec}, W_{conv}, b$
     Loss: $\mathcal{L}(\theta) = \sum_{i=1}^{N}\sum_{t=1}^{T} \mathcal{L}_t(\theta)$ where $\mathcal{L}_t(\theta) = -\log P(y_t = \ell_t \mid y_1^{t-1}, I)$
     Algorithm: gradient descent with backpropagation.
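
A sketch of the per-example training loss (the projection of the CNN feature to the state size is my addition; the slide writes $e(\hat{y}_{t-1})$, the previously predicted word, while the sketch feeds the gold previous word, i.e. the usual teacher-forcing choice):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, emb_dim, hidden_dim, fc7_dim = 10000, 256, 512, 4096   # assumed sizes

embed = nn.Embedding(vocab_size, emb_dim)              # e(.)
cell = nn.RNNCell(emb_dim, hidden_dim)                 # U_dec, W_dec
proj = nn.Linear(fc7_dim, hidden_dim)                  # maps CNN(x_i) to s_0 (my addition)
out = nn.Linear(hidden_dim, vocab_size)                # V, b

def caption_loss(fc7_feat, caption):
    """fc7_feat: (1, fc7_dim) CNN feature of image_i.
    caption: (T,) gold word indices with <GO> prepended and <stop> appended.
    Returns sum_t -log P(y_t = l_t | y_1^{t-1}, I) for this example."""
    s = proj(fc7_feat)                                 # s_0 = CNN(x_i)
    loss = 0.0
    for y_prev, l_t in zip(caption[:-1], caption[1:]): # gold previous word fed at each step
        s = cell(embed(y_prev.view(1)), s)             # s_t = RNN(s_{t-1}, e(y_{t-1}))
        loss = loss + F.cross_entropy(out(s), l_t.view(1))  # L_t = -log softmax(V s_t + b)_{l_t}
    return loss                                        # sum over the N examples to get L(theta)
```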

  19. 19/63 Task: textual entailment.
     Data: $\{x_i = \text{premise}_i, y_i = \text{hypothesis}_i\}_{i=1}^{N}$, e.g., i/p: "It is raining outside", o/p: "The ground is wet".
     Model (Option 1):
       Encoder: $h_t = RNN(h_{t-1}, x_{it})$, $s_0 = h_T$ (T is the length of the input)
       Decoder: $s_t = RNN(s_{t-1}, e(\hat{y}_{t-1}))$, $P(y_t \mid y_1^{t-1}, x) = \mathrm{softmax}(Vs_t + b)$
     Parameters: $U_{dec}, V, W_{dec}, U_{enc}, W_{enc}, b$
     Loss: $\mathcal{L}(\theta) = \sum_{i=1}^{N}\sum_{t=1}^{T} \mathcal{L}_t(\theta)$ where $\mathcal{L}_t(\theta) = -\log P(y_t = \ell_t \mid y_1^{t-1}, x)$
     Algorithm: gradient descent with backpropagation.
     [Figure: an encoder RNN reads the premise "It is raining outside" ($x_1, \ldots, x_4$); the decoder RNN then generates the hypothesis "The ground is wet ⟨STOP⟩" from ⟨Go⟩, with outputs shown as one-hot vectors.]
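
A minimal sketch of this sequence-to-sequence option (all sizes are assumptions; as above, the gold previous hypothesis word is fed at each decoder step):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, emb_dim, hidden_dim = 10000, 256, 512      # assumed sizes

embed_enc = nn.Embedding(vocab_size, emb_dim)          # premise embeddings
enc_cell = nn.RNNCell(emb_dim, hidden_dim)             # U_enc, W_enc
embed_dec = nn.Embedding(vocab_size, emb_dim)          # hypothesis embeddings
dec_cell = nn.RNNCell(emb_dim, hidden_dim)             # U_dec, W_dec
out = nn.Linear(hidden_dim, vocab_size)                # V, b

def entailment_loss(premise, hypothesis):
    """premise: (T_x,) word indices; hypothesis: (T_y,) word indices with <GO>/<STOP> added."""
    h = torch.zeros(1, hidden_dim)
    for x_t in premise:                                # encoder: h_t = RNN(h_{t-1}, x_it)
        h = enc_cell(embed_enc(x_t.view(1)), h)
    s = h                                              # s_0 = h_T
    loss = 0.0
    for y_prev, l_t in zip(hypothesis[:-1], hypothesis[1:]):
        s = dec_cell(embed_dec(y_prev.view(1)), s)     # s_t = RNN(s_{t-1}, e(y_{t-1}))
        loss = loss + F.cross_entropy(out(s), l_t.view(1))  # -log P(y_t = l_t | y_1^{t-1}, x)
    return loss
```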
