
Deep learning: Recurrent neural networks (Hamid Beigy, Sharif University of Technology) - PowerPoint PPT Presentation

Deep learning: Recurrent neural networks
Hamid Beigy, Sharif University of Technology, November 10, 2019

Table of contents: 1 Introduction, 2 Recurrent neural networks, 3 Training recurrent neural networks, 4 Design patterns of RNN, 5 Long-term dependencies, 6 Attention models, 7 Reading.


  1. Deep learning | Recurrent neural networks: Recurrent neural networks. Usually, we want to predict a vector at some time steps. We can process a sequence of vectors $x$ by applying a recurrence formula at every time step: $h_t = f_W(x_t, h_{t-1})$, where $h_t$ is the new state, $h_{t-1}$ the old state, $x_t$ the input vector at time step $t$, and $f_W$ a function with parameters $W$. Assuming the activation function is tanh:
     $h_t = \tanh(U x_t + W h_{t-1})$
     $y_t = V h_t$
     (From Fei-Fei Li et al. slides.)
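
A minimal NumPy sketch of this recurrence; the layer sizes, random initialization, and input sequence are illustrative assumptions, not from the slides:

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, V):
    """One vanilla RNN step: h_t = tanh(U x_t + W h_{t-1}), y_t = V h_t."""
    h_t = np.tanh(U @ x_t + W @ h_prev)
    y_t = V @ h_t
    return h_t, y_t

# Hypothetical sizes: 4-dim input, 8-dim hidden state, 3-dim output.
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(8, 4))
W = rng.normal(scale=0.1, size=(8, 8))
V = rng.normal(scale=0.1, size=(3, 8))

h = np.zeros(8)
for x in rng.normal(size=(5, 4)):      # a sequence of 5 input vectors
    h, y = rnn_step(x, h, U, W, V)     # the same weights are reused at every step
```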

  2. Deep learning | Recurrent neural networks: RNN computational graph, one to many. The computational graph reuses the same function $f_W$ and the same weights $W$ at every time step: a single input $x$ produces hidden states $h_1, h_2, \dots, h_T$ and outputs $y_1, y_2, \dots, y_T$. (Figure from Fei-Fei Li et al. slides.)

  3. Deep learning | Recurrent neural networks: Recurrent neural networks (character-level language model). Assume that the vocabulary is [h, e, l, o]. Example training sequence: "hello". At the output layer, we use softmax. (From Fei-Fei Li et al. slides.)

  4. Deep learning | Recurrent neural networks: Recurrent neural networks (character-level language model). Assume that the vocabulary is [h, e, l, o]. Example training sequence: "hello". At the output layer, we use softmax. At test time, we sample characters one at a time and feed each sample back into the model as the next input. (Figure from Fei-Fei Li et al. slides.)
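
A sketch of this test-time sampling loop; the weights, seed character, and sequence length are illustrative assumptions:

```python
import numpy as np

def sample_text(U, W, V, vocab, seed_ix, length, rng):
    """Sample characters one at a time, feeding each sample back as the next input."""
    h = np.zeros(W.shape[0])
    x = np.zeros(len(vocab)); x[seed_ix] = 1.0          # one-hot seed character
    chars = [vocab[seed_ix]]
    for _ in range(length):
        h = np.tanh(U @ x + W @ h)                      # recurrence
        o = V @ h                                       # output scores
        p = np.exp(o - o.max()); p /= p.sum()           # softmax over the vocabulary
        ix = rng.choice(len(vocab), p=p)                # sample the next character
        x = np.zeros(len(vocab)); x[ix] = 1.0           # feed it back in
        chars.append(vocab[ix])
    return "".join(chars)

vocab = ["h", "e", "l", "o"]
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(16, 4))
W = rng.normal(scale=0.1, size=(16, 16))
V = rng.normal(scale=0.1, size=(4, 16))
print(sample_text(U, W, V, vocab, seed_ix=0, length=10, rng=rng))
```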

  5. Deep learning | Recurrent neural networks: Comparing two models. The two figures (the training model of slide 3 and the sampling model of slide 4) contrast recurrence through the hidden state with recurrence through the fed-back output. The first is more powerful, because a very high-dimensional hidden vector can be used, but training is harder since it must be sequential. The second is less powerful unless a very high-dimensional and rich output vector is used, but training is easier since it allows parallelization. (Figures from Fei-Fei Li et al. slides.)

  6. Deep learning | Training recurrent neural networks: Table of contents. 1 Introduction, 2 Recurrent neural networks, 3 Training recurrent neural networks, 4 Design patterns of RNN, 5 Long-term dependencies, 6 Attention models, 7 Reading.

  7. Deep learning | Training recurrent neural networks: Backpropagation through time. We have a collection of labeled samples $S = \{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\}$, where $x_i = (x_{i,0}, x_{i,1}, \dots, x_{i,T})$ is the input sequence and $y_i = (y_{i,0}, y_{i,1}, \dots, y_{i,T})$ is the output sequence. The goal is to find weights of the network that minimize the error between $\hat{y}_i = (\hat{y}_{i,0}, \hat{y}_{i,1}, \dots, \hat{y}_{i,T})$ and $y_i = (y_{i,0}, y_{i,1}, \dots, y_{i,T})$. In the forward phase, the input is given to the network and the output is calculated. In the backward phase, the gradients of the cost function with respect to the weights are calculated and the weights are updated.

  8. Deep learning | Training recurrent neural networks: Forward phase. The input is given to the network and the output is calculated. Consider two hidden layers, denoted by $h^{(1)}$ and $h^{(2)}$. The output of the first hidden layer is
     $h_i^{(1)}(t) = \sigma_1\Big(\sum_j u_{ij} x_j(t) + \sum_j w_{ji}^{(1)} h_j^{(1)}(t-1) + b_i^{(1)}\Big)$
     The output of the second hidden layer is
     $h_i^{(2)}(t) = \sigma_2\Big(\sum_j u_{ij}^{(2)} h_j^{(1)}(t) + \sum_j w_{ji}^{(2)} h_j^{(2)}(t-1) + b_i^{(2)}\Big)$
     The output is
     $\hat{y}_i(t) = \sigma_3\Big(\sum_j v_{ij} h_j^{(2)}(t) + c_i\Big)$
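
A sketch of this forward phase in NumPy; the activation choices ($\sigma_1 = \sigma_2 = \tanh$, $\sigma_3 =$ softmax) and the weight shapes are assumptions:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def forward_two_layer(x_seq, U1, W1, b1, U2, W2, b2, V, c):
    """Forward phase of a two-hidden-layer RNN following the slide's equations.
    sigma_1 = sigma_2 = tanh and sigma_3 = softmax are assumptions."""
    h1 = np.zeros(W1.shape[0])
    h2 = np.zeros(W2.shape[0])
    y_hats = []
    for x in x_seq:
        h1 = np.tanh(U1 @ x + W1 @ h1 + b1)    # first hidden layer
        h2 = np.tanh(U2 @ h1 + W2 @ h2 + b2)   # second hidden layer
        y_hats.append(softmax(V @ h2 + c))     # output layer
    return y_hats
```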

  9. Deep learning | Training recurrent neural networks: Backpropagation through time. Forward through the entire sequence to compute the loss, then backward through the entire sequence to compute the gradient. (Figure from Fei-Fei Li et al. slides.)

  10. Deep learning | Training recurrent neural networks: Backpropagation through time. We must find $\nabla_V L(\theta)$, $\nabla_W L(\theta)$, $\nabla_U L(\theta)$, $\nabla_b L(\theta)$, and $\nabla_c L(\theta)$. Then we treat the network as a usual multi-layer network and apply backpropagation on the unrolled network.

  11. Deep learning | Training recurrent neural networks: Backpropagation through time. Consider a network with one hidden layer, with a softmax activation function at the output layer and tanh at the hidden layer. Let $L(t)$ be the loss at time $t$. If $L(t)$ is the negative log-likelihood of $y(t)$ given $x(1), x(2), \dots, x(t)$, then
     $L\big(\{x(1), \dots, x(T)\}, \{y(1), \dots, y(T)\}\big) = \sum_t L(t) = -\sum_t \log p_{\text{model}}\big(y(t) \mid \{x(1), \dots, x(t)\}\big)$

  12. Deep learning | Training recurrent neural networks: Backpropagation through time. We backpropagate the gradient through the unrolled network ($L$ is the loss function): at every time step, $\partial L / \partial V$ flows from the output $\hat{y}_t$, while $\partial L / \partial W$ and $\partial L / \partial U$ flow through the hidden states $h_t$. (Figure credit: Trivedi & Kondor.)

  13. Deep learning | Training recurrent neural networks: Backpropagation through time. In the forward phase, the hidden layer computes
     $h_t = \tanh(W^\top h_{t-1} + U^\top x_t + b)$
     and the output layer computes
     $o_t = V^\top h_t + c$
     $\hat{y}_t = \mathrm{softmax}(o_t)$

  14. Deep learning | Training recurrent neural networks: Backpropagation through time. Calculating the gradient, we have $\partial L / \partial L(t) = 1$ and
     $(\nabla_{o(t)} L)_i = \dfrac{\partial L}{\partial o_i(t)} = \dfrac{\partial L}{\partial L(t)} \dfrac{\partial L(t)}{\partial o_i(t)} = \dfrac{\partial L(t)}{\partial o_i(t)}$
     By using softmax in the output layer, we have
     $\hat{y}_i(t) = \dfrac{e^{o_i(t)}}{\sum_k e^{o_k(t)}}$
     $\dfrac{\partial \hat{y}_i(t)}{\partial o_j(t)} = \begin{cases} \hat{y}_i(t)\,(1 - \hat{y}_j(t)) & i = j \\ -\hat{y}_j(t)\,\hat{y}_i(t) & i \neq j \end{cases}$

  15. Deep learning | Training recurrent neural networks: Backpropagation through time. We compute the gradient of the loss function with respect to $o_i(t)$:
     $L(t) = -\sum_k y_k(t) \log \hat{y}_k(t)$
     $\dfrac{\partial L(t)}{\partial o_i(t)} = -\sum_k y_k(t) \dfrac{\partial \log \hat{y}_k(t)}{\partial o_i(t)} = -\sum_k y_k(t) \dfrac{\partial \log \hat{y}_k(t)}{\partial \hat{y}_k(t)} \times \dfrac{\partial \hat{y}_k(t)}{\partial o_i(t)} = -\sum_k \dfrac{y_k(t)}{\hat{y}_k(t)} \times \dfrac{\partial \hat{y}_k(t)}{\partial o_i(t)}$

  16. Deep learning | Training recurrent neural networks: Backpropagation through time. Continuing,
     $\dfrac{\partial L(t)}{\partial o_i(t)} = -y_i(t)(1 - \hat{y}_i(t)) - \sum_{k \neq i} \dfrac{y_k(t)}{\hat{y}_k(t)} \big(-\hat{y}_k(t)\,\hat{y}_i(t)\big)$
     $\;= -y_i(t)(1 - \hat{y}_i(t)) + \sum_{k \neq i} y_k(t)\,\hat{y}_i(t)$
     $\;= -y_i(t) + y_i(t)\hat{y}_i(t) + \sum_{k \neq i} y_k(t)\,\hat{y}_i(t)$
     $\;= \hat{y}_i(t)\Big(y_i(t) + \sum_{k \neq i} y_k(t)\Big) - y_i(t)$

  17. Deep learning | Training recurrent neural networks: Backpropagation through time. We had
     $\dfrac{\partial L(t)}{\partial o_i(t)} = \hat{y}_i(t)\Big(y_i(t) + \sum_{k \neq i} y_k(t)\Big) - y_i(t)$
     Since $y(t)$ is a one-hot encoded vector of the labels, $\sum_k y_k(t) = 1$, i.e. $y_i(t) + \sum_{k \neq i} y_k(t) = 1$. So we have
     $\dfrac{\partial L(t)}{\partial o_i(t)} = \hat{y}_i(t) - y_i(t)$
     This is a very simple and elegant expression.
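
A quick numerical check of this result; the logits, the target index, and the tolerance are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
o = rng.normal(size=5)                       # logits o(t)
y = np.zeros(5); y[2] = 1.0                  # one-hot target y(t)

def loss(o):
    """Cross-entropy of a softmax output against the one-hot target."""
    p = np.exp(o - o.max()); p /= p.sum()
    return -np.sum(y * np.log(p))

p = np.exp(o - o.max()); p /= p.sum()
analytic = p - y                             # the slide's result: y_hat - y

eps = 1e-6                                   # central finite differences
numeric = np.array([(loss(o + eps * e) - loss(o - eps * e)) / (2 * eps) for e in np.eye(5)])
assert np.allclose(numeric, analytic, atol=1e-6)
```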

  18. Deep learning | Training recurrent neural networks: Backpropagation through time. At the final time step $T$, $h(T)$ has only $o(T)$ as a descendant, so
     $\nabla_{h(T)} L = V^\top \nabla_{o(T)} L$
     We can then iterate backward in time to back-propagate gradients through time, from $t = T-1$ down to $t = 1$:
     $\nabla_{h(t)} L = \Big(\dfrac{\partial h(t+1)}{\partial h(t)}\Big)^\top \nabla_{h(t+1)} L + \Big(\dfrac{\partial o(t)}{\partial h(t)}\Big)^\top \nabla_{o(t)} L = W^\top \mathrm{diag}\big(1 - (h(t+1))^2\big)\, \nabla_{h(t+1)} L + V^\top \nabla_{o(t)} L$

  19. Deep learning | Training recurrent neural networks: Backpropagation through time. The gradients on the remaining parameters are given by
     $\nabla_c L = \sum_t \Big(\dfrac{\partial o(t)}{\partial c}\Big)^\top \nabla_{o(t)} L = \sum_t \nabla_{o(t)} L$
     $\nabla_b L = \sum_t \Big(\dfrac{\partial h(t)}{\partial b(t)}\Big)^\top \nabla_{h(t)} L = \sum_t \mathrm{diag}\big(1 - (h(t))^2\big)\, \nabla_{h(t)} L$
     $\nabla_V L = \sum_t \sum_i \dfrac{\partial L}{\partial o_i(t)}\, \nabla_{V(t)}\, o_i(t) = \sum_t \big(\nabla_{o(t)} L\big)\, h(t)^\top$
     $\nabla_W L = \sum_t \sum_i \dfrac{\partial L}{\partial h_i(t)}\, \nabla_{W(t)}\, h_i(t) = \sum_t \mathrm{diag}\big(1 - (h(t))^2\big)\, \big(\nabla_{h(t)} L\big)\, h(t-1)^\top$
     $\nabla_U L = \sum_t \sum_i \dfrac{\partial L}{\partial h_i(t)}\, \nabla_{U(t)}\, h_i(t) = \sum_t \mathrm{diag}\big(1 - (h(t))^2\big)\, \big(\nabla_{h(t)} L\big)\, x(t)^\top$

  20. Deep learning | Training recurrent neural networks: Backpropagation through time. Finally, bearing in mind that the weights to and from each unit in the hidden layer are the same at every time step, we sum over the whole sequence to get the derivatives with respect to each of the network weights:
     $\Delta U = -\eta \sum_{t=1}^{T} \nabla_U L(t) \qquad \Delta W = -\eta \sum_{t=1}^{T} \nabla_W L(t) \qquad \Delta V = -\eta \sum_{t=1}^{T} \nabla_V L(t)$
     $\Delta b = -\eta \sum_{t=1}^{T} \nabla_b L(t) \qquad \Delta c = -\eta \sum_{t=1}^{T} \nabla_c L(t)$
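
The forward and backward phases assembled into one function; a sketch following the equations above (tanh hidden layer, softmax output, cross-entropy loss), with the transpose convention of the slides absorbed into how the weight matrices are stored:

```python
import numpy as np

def bptt(x_seq, y_seq, U, W, V, b, c):
    """Backpropagation through time for the single-hidden-layer RNN above.
    Returns the gradients (dU, dW, dV, db, dc); shapes are left to the caller."""
    hs, ps = [np.zeros(W.shape[0])], []
    for x in x_seq:                                 # forward phase
        h = np.tanh(W @ hs[-1] + U @ x + b)
        o = V @ h + c
        p = np.exp(o - o.max()); p /= p.sum()       # softmax output
        hs.append(h); ps.append(p)
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    db, dc = np.zeros_like(b), np.zeros_like(c)
    dh_next = np.zeros(W.shape[0])
    for t in reversed(range(len(x_seq))):           # backward phase
        do = ps[t] - y_seq[t]                       # dL/do(t) = y_hat(t) - y(t)
        dh = V.T @ do + dh_next                     # gradient flowing into h(t)
        dhraw = (1.0 - hs[t + 1] ** 2) * dh         # through tanh: diag(1 - h(t)^2)
        dV += np.outer(do, hs[t + 1]); dc += do
        dW += np.outer(dhraw, hs[t]); dU += np.outer(dhraw, x_seq[t]); db += dhraw
        dh_next = W.T @ dhraw                       # pass gradient back to h(t-1)
    return dU, dW, dV, db, dc
```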

  21. Deep learning | Training recurrent neural networks: Truncated backpropagation through time. Run forward and backward through chunks of the sequence instead of the whole sequence: carry hidden states forward in time forever, but only backpropagate for some smaller number of steps. (Figure from Fei-Fei Li et al. slides.)

  22. Deep learning | Training recurrent neural networks: Truncated backpropagation through time. Run forward and backward through chunks of the sequence instead of the whole sequence. (Figure from Fei-Fei Li et al. slides.)
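
A sketch of the chunked training loop; `bptt_chunk` is a hypothetical adaptation of the `bptt()` sketch above that accepts an initial hidden state and returns the final one, and the chunk length and learning rate are assumptions:

```python
import numpy as np

def truncated_bptt(x_seq, y_seq, params, bptt_chunk, chunk=25, lr=1e-2):
    """Truncated BPTT: carry the hidden state across chunks, but backpropagate
    only within each chunk. `bptt_chunk` is a hypothetical helper that takes an
    initial hidden state and returns (gradients, final hidden state)."""
    U, W, V, b, c = params
    h = np.zeros(W.shape[0])                      # carried forward across the whole sequence
    for start in range(0, len(x_seq), chunk):
        xs = x_seq[start:start + chunk]
        ys = y_seq[start:start + chunk]
        grads, h = bptt_chunk(xs, ys, U, W, V, b, c, h_init=h)
        for p, g in zip(params, grads):
            p -= lr * g                           # SGD update after every chunk
    return params
```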

  23. Deep learning | Design patterns of RNN: Table of contents. 1 Introduction, 2 Recurrent neural networks, 3 Training recurrent neural networks, 4 Design patterns of RNN, 5 Long-term dependencies, 6 Attention models, 7 Reading.

  24. Deep learning | Design patterns of RNN: Design patterns of RNN (summarization). Producing a single output, with recurrent connections between hidden units. This is useful for summarizing a sequence, e.g. for sentiment analysis.

  25. Deep learning | Design patterns of RNN: Design patterns of RNN (fixed vector as input). Sometimes we are interested in taking only a single, fixed-size vector $x$ as input, which generates the $y$ sequence. The most common way is to provide the extra input at each time step. Other solutions? (Please consider them.) Application: image caption generation.

  26. Deep learning | Design patterns of RNN: Bidirectional RNNs. We considered RNNs in the context of a sequence $x(t)$, $t = 1, \dots, T$. In many applications, $y(t)$ may depend on the whole input sequence. Bidirectional RNNs were introduced to address this need.

  27. Deep learning | Design patterns of RNN: Encoder-Decoder. How do we map input sequences to output sequences that are not necessarily of the same length? The input to the RNN is called the context; we want to find a representation of the context, $C$. $C$ may be a vector or a sequence that summarizes $x = \{x(1), \dots, x(n_x)\}$.

  28. Deep learning | Design patterns of RNN: Encoder-Decoder. (Figure.)

  29. Deep learning | Design patterns of RNN: Deep recurrent networks. The computations in RNNs can be decomposed into three blocks of parameters and transformations: input to hidden state, previous hidden state to the next hidden state, and hidden state to the output. Each of these transformations is a learned affine transformation followed by a fixed nonlinearity. Introducing depth in each of these operations is advantageous; the intuition for why depth should be useful is quite similar to that in deep feed-forward networks. Optimization can become much harder, but this can be mitigated by tricks such as introducing skip connections.

  30. Deep learning | Design patterns of RNN: Deep recurrent networks. (Figure.)

  31. Deep learning | Long-term dependencies: Table of contents. 1 Introduction, 2 Recurrent neural networks, 3 Training recurrent neural networks, 4 Design patterns of RNN, 5 Long-term dependencies, 6 Attention models, 7 Reading.

  32. Deep learning | Long-term dependencies: Long-term dependencies. RNNs involve the composition of a function multiple times, once per time step; this function composition resembles matrix multiplication. Consider the recurrence relationship
     $h(t+1) = W^\top h(t)$
     This is a very simple RNN without a nonlinear activation and without inputs $x$. The recurrence can be written as $h(t) = (W^t)^\top h(0)$. If $W$ has an eigendecomposition of the form $W = Q \Lambda Q^\top$ with orthogonal $Q$, the recurrence becomes
     $h(t) = (W^t)^\top h(0) = Q^\top \Lambda^t Q\, h(0)$
     where $Q$ is the matrix composed of the eigenvectors of $W$ and $\Lambda$ is a diagonal matrix with the eigenvalues placed on the diagonal.

  33. Deep learning | Long-term dependencies: Long-term dependencies. The recurrence becomes $h(t) = (W^t)^\top h(0) = Q^\top \Lambda^t Q\, h(0)$, where $Q$ is the matrix of eigenvectors of $W$ and $\Lambda$ is a diagonal matrix of eigenvalues. The eigenvalues are raised to the power $t$: they quickly decay to zero or explode, giving vanishing gradients or exploding gradients. Problem: gradients propagated over many stages tend to vanish (most of the time) or explode (relatively rarely).
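
A small numerical illustration of this effect; the matrix size and the chosen eigenvalues are assumptions:

```python
import numpy as np

# Powers of W scale h(0) along each eigenvector by its eigenvalue raised to t,
# so components with |lambda| < 1 decay to zero and those with |lambda| > 1 explode.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))       # random orthogonal Q
lams = np.array([0.5, 0.9, 1.0, 1.1])              # chosen eigenvalues (assumption)
W = Q @ np.diag(lams) @ Q.T

h0 = rng.normal(size=4)
for t in (1, 10, 50, 100):
    h_t = np.linalg.matrix_power(W, t).T @ h0      # h(t) = (W^t)^T h(0)
    print(f"t={t:3d}  ||h(t)|| = {np.linalg.norm(h_t):.3e}")
# The component with eigenvalue 1.1 eventually dominates (growing like 1.1**t),
# while the 0.5 and 0.9 directions shrink toward zero.
```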

  34. Deep learning | Long-term dependencies: Why do gradients vanish or explode? The expression for $h(t)$ was $h(t) = \tanh(W h(t-1) + U x(t))$. The partial derivative of the loss with respect to the hidden states equals
     $\dfrac{\partial L}{\partial h(t)} = \dfrac{\partial L}{\partial h(T)} \dfrac{\partial h(T)}{\partial h(t)} = \dfrac{\partial L}{\partial h(T)} \prod_{k=t}^{T-1} \dfrac{\partial h(k+1)}{\partial h(k)} = \dfrac{\partial L}{\partial h(T)} \prod_{k=t}^{T-1} D_{k+1} W^\top$
     where $D_{k+1} = \mathrm{diag}\big(1 - \tanh^2(W h(k) + U x(k+1))\big)$.

  35. Deep learning | Long-term dependencies: Why do gradients vanish or explode? For any matrices $A$, $B$, we have $\|AB\| \leq \|A\| \|B\|$, so
     $\Big\|\dfrac{\partial L}{\partial h(t)}\Big\| = \Big\|\dfrac{\partial L}{\partial h(T)} \prod_{k=t}^{T-1} D_{k+1} W^\top\Big\| \leq \Big\|\dfrac{\partial L}{\partial h(T)}\Big\| \prod_{k=t}^{T-1} \big\|D_{k+1} W^\top\big\|$
     Since $\|A\|$ equals the largest singular value of $A$, $\sigma_{\max}(A)$, we have
     $\Big\|\dfrac{\partial L}{\partial h(t)}\Big\| \leq \Big\|\dfrac{\partial L}{\partial h(T)}\Big\| \prod_{k=t}^{T-1} \sigma_{\max}(D_{k+1})\, \sigma_{\max}(W)$
     Hence the gradient norm can shrink to zero or grow exponentially fast depending on $\sigma_{\max}$.

  36. Deep learning | Long-term dependencies: Echo state networks. Set the recurrent weights such that they do a good job of capturing past history, and learn only the output weights. Methods: echo state machines and liquid state machines; the general methodology is called reservoir computing. How to choose the recurrent weights? In echo state machines, choose recurrent weights such that the hidden-to-hidden transition Jacobian has eigenvalues close to 1.

  37. Deep learning | Long-term dependencies: Other solutions. Adding skip connections through time: adding direct connections from variables in the distant past to variables in the present. Leaky units: having units with self-connections. Removing connections: removing length-one connections and replacing them with longer connections.

  38. Deep learning | Long-term dependencies: Gated units. RNNs can accumulate information, but it might be useful to forget. We want to create paths through time where derivatives can flow, and to learn when to forget. Gates allow learning how to read, write, and forget. We consider two gated units: Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU).

  39. Deep learning | Long-term dependencies: Long short-term memory. LSTMs are explicitly designed to avoid the long-term dependency problem. All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module has a very simple structure, such as a single tanh layer.

  40. Deep learning | Long-term dependencies: Long short-term memory. LSTMs also have this chain-like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.

  41. Deep learning | Long-term dependencies: Long short-term memory. Let us define the following notation. In the figure, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition. The yellow boxes are learned neural network layers. Lines merging denote concatenation; a line forking denotes its content being copied, with the copies going to different locations.

  42. Deep learning | Long-term dependencies: Long short-term memory. The key to LSTMs is the cell state, the horizontal line running through the top of the diagram. The cell state is kind of like a conveyor belt: it runs straight down the entire chain, with only some minor linear interactions, so it is very easy for information to just flow along it unchanged. The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

  43. Deep learning | Long-term dependencies: Long short-term memory. Gates are a way to optionally let information through. They are composed of a sigmoid neural net layer and a pointwise multiplication operation. The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through: a value of zero means "let nothing through", while a value of one means "let everything through". An LSTM has three of these gates to protect and control the cell state.

  44. Deep learning | Long-term dependencies: Long short-term memory. The first decision is which information to throw away from the cell state; it is made by a sigmoid layer called the forget gate layer. It looks at $h_{t-1}$ and $x_t$, and outputs a number between 0 and 1 for each number in the cell state $C_{t-1}$.

  45. Deep learning | Long-term dependencies: Long short-term memory. The next step is to decide what new information we are going to store in the cell state. This has the following two parts that could be added to the state: a sigmoid layer called the input gate layer decides which values we will update, and a tanh layer creates a vector of new candidate values, $\tilde{C}_t$. Then we combine these two to create an update to the state.

  46. Deep learning | Long-term dependencies: Long short-term memory. Now we must update the old cell state $C_{t-1}$ into the new cell state $C_t$. We multiply the old state by $f_t$, forgetting the things we decided to forget earlier. Then we add $i_t \times \tilde{C}_t$: the new candidate values, scaled by how much we decided to update each state value.

  47. Deep learning | Long-term dependencies: Long short-term memory. Finally, we need to decide what we are going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we are going to output. Then we put the cell state through tanh and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
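
A compact NumPy sketch of one LSTM step following the gate descriptions above; the layout of each weight matrix acting on the concatenation $[h_{t-1}, x_t]$ is an assumption:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    """One LSTM step. Each W* acts on the concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(Wf @ z + bf)             # forget gate: what to drop from C_{t-1}
    i_t = sigmoid(Wi @ z + bi)             # input gate: which values to update
    C_tilde = np.tanh(Wc @ z + bc)         # candidate values
    C_t = f_t * C_prev + i_t * C_tilde     # new cell state
    o_t = sigmoid(Wo @ z + bo)             # output gate: which parts of the state to expose
    h_t = o_t * np.tanh(C_t)               # filtered cell state becomes the new hidden state
    return h_t, C_t
```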

  48. Deep learning | Long-term dependencies: Long short-term memory. (Figure.)

  49. Deep learning | Long-term dependencies: Variants on long short-term memory. One popular LSTM variant adds peephole connections: we let the gate layers look at the cell state. (F. A. Gers and J. Schmidhuber, "Recurrent nets that time and count", Proceedings of the IEEE International Joint Conference on Neural Networks, 2000.)

  50. Deep learning | Long-term dependencies: Variants on long short-term memory. Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what new information to add, we make those decisions together: we only forget when we are going to input something in its place, and we only input new values to the state when we forget something older.

  51. Deep learning | Long-term dependencies: Variants on long short-term memory (GRU). The GRU combines the forget and input gates into a single update gate. It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models and has been growing increasingly popular. (Kyunghyun Cho et al., "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation", Proc. of the Conference on Empirical Methods in Natural Language Processing, pages 1724-1734, 2014.)
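
A matching NumPy sketch of one GRU step; the gate equations follow the standard GRU formulation, and the weight layout on $[h_{t-1}, x_t]$ is an assumption:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Wr, Wh, bz, br, bh):
    """One GRU step: a single update gate replaces the LSTM's forget/input pair,
    and the cell and hidden states are merged into h."""
    zx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(Wz @ zx + bz)                                        # update gate
    r_t = sigmoid(Wr @ zx + br)                                        # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r_t * h_prev, x_t]) + bh)   # candidate state
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde                         # interpolate old and new
    return h_t
```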

  52. Deep learning | Long-term dependencies: Gated recurrent units (GRU). (Figure.)

  53. Deep learning | Long-term dependencies: Variants on long short-term memory. There are also other LSTM variants, such as: Jan Koutnik et al., "A Clockwork RNN", Proceedings of the 31st International Conference on Machine Learning, 2014; Kaisheng Yao et al., "Depth-Gated Recurrent Neural Networks", https://arxiv.org/pdf/1508.03790v2.pdf.

  54. Deep learning | Long-term dependencies: Optimization for long-term dependencies. Two simple solutions for gradient vanishing/exploding: for vanishing gradients, initialization + ReLUs; for exploding gradients, the clipping trick: when $\|g\| > v$, set $g \leftarrow \dfrac{v\, g}{\|g\|}$. Vanishing gradients happen when the optimization gets stuck in a saddle point and the gradient becomes too small for the optimization to progress; this can also be addressed by using gradient descent with momentum or other methods. Exploding gradients happen when the gradient becomes too big and you get numerical overflow; this can be easily mitigated by initializing the network's weights to smaller values.
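
A one-function sketch of the clipping trick; the example gradient and threshold are illustrative:

```python
import numpy as np

def clip_gradient(g, v):
    """Norm clipping: if ||g|| > v, rescale so that the norm equals v (g <- v*g/||g||)."""
    norm = np.linalg.norm(g)
    return (v / norm) * g if norm > v else g

# Example: a gradient with norm 50 clipped to the (illustrative) threshold v = 5.
g = np.full(100, 5.0)                           # ||g|| = 50
print(np.linalg.norm(clip_gradient(g, 5.0)))    # -> 5.0
```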

  55. Deep learning | Long-term dependencies: Optimization for long-term dependencies. Gradient clipping helps to deal with exploding gradients, but it does not help with vanishing gradients. Another way is to encourage creating paths in the unfolded recurrent architecture along which the product of gradients is near 1. Solution: regularize to encourage information flow. We want $(\nabla_{h(t)} L)\, \dfrac{\partial h(t)}{\partial h(t-1)}$ to be as large as $\nabla_{h(t)} L$. One regularizer is
     $\sum_t \left( \dfrac{\big\| (\nabla_{h(t)} L)\, \frac{\partial h(t)}{\partial h(t-1)} \big\|}{\big\| \nabla_{h(t)} L \big\|} - 1 \right)^2$
     Computing this regularizer is difficult, and an approximation is used instead.

  56. Deep learning | Attention models: Table of contents. 1 Introduction, 2 Recurrent neural networks, 3 Training recurrent neural networks, 4 Design patterns of RNN, 5 Long-term dependencies, 6 Attention models, 7 Reading.

  57. Deep learning | Attention models: Sequence-to-sequence modeling. Sequence-to-sequence modeling transforms an input sequence (source) into a new one (target); both sequences can be of arbitrary lengths. Examples of transformation tasks include machine translation between multiple languages in either text or audio, question-answer dialog generation, and parsing sentences into grammar trees. The sequence-to-sequence model normally has an encoder-decoder architecture, composed of: an encoder, which processes the input sequence and compresses the information into a context vector of fixed length, expected to be a good summary of the meaning of the whole source sequence; and a decoder, which is initialized with the context vector to emit the transformed output. Early work used only the last state of the encoder network as the decoder's initial state.

  58. Deep learning | Attention models: Attention models. Both the encoder and the decoder are recurrent neural networks, e.g. with LSTM or GRU units. A critical disadvantage of this fixed-length context vector design is its incapability of remembering long sentences.

  59. Deep learning | Attention models: Attention models. The attention mechanism was born to help memorize long source sentences in neural machine translation (NMT). Instead of building a single context vector out of the encoder's last hidden state, the goal of attention is to create shortcuts between the context vector and the entire source input. The weights of these shortcut connections are customizable for each output element. The alignment between source and target is learned and controlled by the context vector. Essentially, the context vector consumes three pieces of information: the encoder hidden states, the decoder hidden states, and the alignment between source and target. (Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate", ICLR 2015.)

  60. Deep learning | Attention models: Attention models. Assume that we have a source sequence $x$ of length $n$ and try to output a target sequence $y$ of length $m$:
     $x = [x_1, x_2, \dots, x_n] \qquad y = [y_1, y_2, \dots, y_m]$
     The encoder is a bidirectional RNN with a forward hidden state $\overrightarrow{h}_i$ and a backward one $\overleftarrow{h}_i$. A simple concatenation of the two represents the encoder state; the motivation is to include both the preceding and following words in the annotation of one word:
     $h_i = \big[\overrightarrow{h}_i^\top ; \overleftarrow{h}_i^\top\big]^\top, \quad i = 1, 2, \dots, n$

  61. Deep learning | Attention models: Attention models. Model of attention. (Figure.)

  62. Deep learning | Attention models: Attention models. The decoder network has hidden state $s_t = f(s_{t-1}, y_{t-1}, c_t)$ at position $t = 1, 2, \dots, m$. The context vector $c_t$ is a sum of the hidden states of the input sequence, weighted by alignment scores:
     $c_t = \sum_{i=1}^{n} \alpha_{t,i} h_i$  (context vector for output $y_t$)
     $\alpha_{t,i} = \mathrm{align}(y_t, x_i) = \dfrac{\exp(\mathrm{score}(s_{t-1}, h_i))}{\sum_{j=1}^{n} \exp(\mathrm{score}(s_{t-1}, h_j))}$  (softmax of a predefined alignment score; how well $y_t$ and $x_i$ are aligned)
     The alignment model assigns a score $\alpha_{t,i}$ to the pair $(y_t, x_i)$ based on how well they match. The set $\{\alpha_{t,i}\}$ are weights defining how much of each source hidden state should be considered for each output.

  63. Deep learning | Attention models: Attention models. The alignment score $\alpha$ is parametrized by a feed-forward network with a single hidden layer, and this network is jointly trained with the other parts of the model. The score function is therefore in the following form, given that tanh is used as the activation function:
     $\mathrm{score}(s_t, h_i) = v_a^\top \tanh\big(W_a [s_t ; h_i]\big)$
     where both $v_a$ and $W_a$ are weight matrices to be learned in the alignment model. (Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate", ICLR 2015.)
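
A sketch of this additive (Bahdanau-style) scoring together with the context-vector computation from the previous slide; the dimensions and random values in the usage example are illustrative assumptions:

```python
import numpy as np

def additive_attention(s_prev, H, v_a, W_a):
    """Additive attention: score(s, h_i) = v_a^T tanh(W_a [s ; h_i]).

    s_prev : decoder state s_{t-1}, shape (d_s,)
    H      : encoder states h_1..h_n stacked as rows, shape (n, d_h)
    Returns the context vector c_t and the alignment weights alpha_{t,i}."""
    scores = np.array([v_a @ np.tanh(W_a @ np.concatenate([s_prev, h_i])) for h_i in H])
    scores -= scores.max()                           # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()    # softmax over source positions
    c_t = alpha @ H                                  # weighted sum of encoder states
    return c_t, alpha

# Illustrative usage with assumed dimensions.
rng = np.random.default_rng(0)
n, d_h, d_s, d_a = 6, 8, 8, 10
H = rng.normal(size=(n, d_h))
s_prev = rng.normal(size=d_s)
W_a = rng.normal(scale=0.1, size=(d_a, d_s + d_h))
v_a = rng.normal(scale=0.1, size=d_a)
c_t, alpha = additive_attention(s_prev, H, v_a, W_a)
```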

  64. Deep learning | Attention models: Alignment scores. The matrix of alignment scores explicitly shows the correlation between source and target words. (Figure.)

  65. Deep learning | Attention models: Alignment scores. A second example of an alignment-score matrix between source and target words. (Figure.)

  66. Deep learning | Attention models: Alignment scores. Summary of popular alignment score functions (from https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html):
     Content-base attention: $\mathrm{score}(s_t, h_i) = \mathrm{cosine}[s_t, h_i]$ (A. Graves et al., "Neural Turing Machines", arXiv, 2014).
     Additive: $\mathrm{score}(s_t, h_i) = v_a^\top \tanh(W_a [s_t ; h_i])$ (D. Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015).
     Location-based: $\alpha_{t,i} = \mathrm{softmax}(W_a s_t)$ (T. Luong et al., "Effective Approaches to Attention-based Neural Machine Translation", EMNLP 2015).
     General: $\mathrm{score}(s_t, h_i) = s_t^\top W_a h_i$ (same as above).
     Dot-product: $\mathrm{score}(s_t, h_i) = s_t^\top h_i$ (same as above).
     Scaled dot-product: $\mathrm{score}(s_t, h_i) = \dfrac{s_t^\top h_i}{\sqrt{n}}$ (A. Vaswani et al., "Attention is all you need", NIPS 2017).

  67. Deep learning | Attention models: Self-attention. Self-attention (intra-attention) is an attention mechanism relating different positions of a single sequence in order to compute a representation of the same sequence. It is very useful in machine reading (the automatic, unsupervised understanding of text), abstractive summarization, and image description generation. (Jianpeng Cheng, Li Dong, and Mirella Lapata, "Long short-term memory-networks for machine reading", EMNLP 2016.)

  68. Deep learning | Attention models: Self-attention. The self-attention mechanism enables us to learn the correlation between the current word and the previous part of the sentence. In the figure, the current word is in red and the size of the blue shade indicates the activation level.

  69. Deep learning | Attention models: Self-attention. Self-attention can be applied to an image to generate descriptions. The image is encoded by a CNN, and an RNN with self-attention consumes the CNN feature maps to generate the descriptive words one by one. The visualization of the attention weights clearly demonstrates which regions of the image the model pays attention to in order to output a certain word. (Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio, "Show, attend and tell: Neural image caption generation with visual attention", ICML 2015.)

  70. Deep learning | Attention models: Self-attention. (Figure.)

  71. Deep learning | Attention models: Soft vs. hard attention. Soft vs. hard attention is another way to categorize attention mechanisms, based on whether the attention has access to the entire image or only a patch. Soft attention: the alignment weights are learned and placed "softly" over all patches in the source image (the same idea as in Bahdanau et al., 2015); pro: the model is smooth and differentiable; con: expensive when the source input is large. Hard attention: only selects one patch of the image to attend to at a time; pro: less calculation at inference time; con: the model is non-differentiable and requires more complicated techniques such as variance reduction or reinforcement learning to train. (Thang Luong, Hieu Pham, and Christopher D. Manning, "Effective Approaches to Attention-based Neural Machine Translation", EMNLP 2015.)

  72. Deep learning | Attention models: Global vs. local attention. Global and local attention were proposed by Luong et al. The idea of a global attentional model is to consider all the hidden states of the encoder when deriving the context vector $c_t$. (Figure: the target hidden state $h_t$ and the source hidden states $\bar{h}_s$ produce global align weights $a_t$, the context vector $c_t$, and the attentional state $\tilde{h}_t$ for output $y_t$.) (Thang Luong, Hieu Pham, and Christopher D. Manning, "Effective Approaches to Attention-based Neural Machine Translation", EMNLP 2015.)

  73. Deep learning | Attention models: Global vs. local attention. Global attention has a drawback: it has to attend to all words on the source side for each target word, which is expensive and can potentially render it impractical for translating longer sequences. The local attentional mechanism instead chooses to focus only on a small subset of the source positions per target word. Local attention is an interesting blend between hard and soft attention, an improvement over hard attention that makes it differentiable: the model first predicts a single aligned position for the current target word, and a window centered around that source position is then used to compute a context vector:
     $p_t = n \times \mathrm{sigmoid}\big(v_p^\top \tanh(W_p h_t)\big)$
     where $n$ is the length of the source sequence; hence $p_t \in [0, n]$.

  74. Deep learning | Attention models: Global vs. local attention. To favor alignment points near $p_t$, a Gaussian distribution centered around $p_t$ is used. Specifically, the alignment weights are defined as
     $a_{st} = \mathrm{align}(h_t, \bar{h}_s)\, \exp\!\left(-\dfrac{(s - p_t)^2}{2\sigma^2}\right)$
     with
     $p_t = n \times \mathrm{sigmoid}\big(v_p^\top \tanh(W_p h_t)\big)$
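
A sketch of these local attention weights; the dot-product form used for align(., .) and the value of sigma are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_attention_weights(h_t, H_bar, v_p, W_p, sigma=2.0):
    """Local attention: predict an aligned position p_t, then damp the alignment
    weights with a Gaussian centered at p_t.

    h_t   : current decoder hidden state, shape (d,)
    H_bar : source hidden states stacked as rows, shape (n, d)"""
    n = H_bar.shape[0]
    p_t = n * sigmoid(v_p @ np.tanh(W_p @ h_t))          # predicted aligned position in [0, n]
    scores = H_bar @ h_t                                 # align(h_t, h_bar_s): dot-product here
    scores -= scores.max()
    align = np.exp(scores) / np.exp(scores).sum()        # softmax alignment
    s = np.arange(n)
    gauss = np.exp(-((s - p_t) ** 2) / (2 * sigma ** 2)) # favor positions near p_t
    a = align * gauss                                    # final local attention weights
    c_t = a @ H_bar                                      # context vector
    return c_t, a, p_t
```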

  75. Deep learning | Attention models: Global vs. local attention. (Figure: the local attention model; the model first predicts a single aligned position $p_t$, places local weights $a_t$ around it over the source states $\bar{h}_s$, and computes the context vector $c_t$ that feeds the attention layer producing $\tilde{h}_t$.)

  76. Deep learning | Attention models: Transformer model. The transformer improves on soft attention and makes it possible to do sequence-to-sequence modeling without recurrent network units. The transformer model is entirely built on self-attention mechanisms, without using any sequence-aligned recurrent architecture. (Ashish Vaswani et al., "Attention is all you need", NIPS 2017.)

  77. Deep learning | Attention models: Transformer encoder-decoder. The encoding component is a stack of six encoders. The decoding component is a stack of the same number of decoders.

  78. Deep learning | Attention models: Transformer encoder-decoder. Each encoder has two sub-layers. Each decoder has three sub-layers.

  79. Deep learning | Attention models: Transformer encoder. All encoders receive a list of vectors, each of size 512. The size of this list is a hyperparameter we can set; basically, it would be the length of the longest sentence in our training dataset.

  80. Deep learning | Attention models: Transformer encoder. Each sub-layer has a residual connection.

  81. Deep learning | Attention models: Transformer. A transformer of two stacked encoders and decoders. (Figure.)
