
Deep Learning: Recurrent Networks (10/16/2017). Recap of the earlier examples: Which open source project? Related math: what is it talking about? And a Wikipedia page explaining it all ("The unreasonable effectiveness of recurrent neural networks"). All previous examples..


  1. Linear recursions: Vector version • Consider the simple, linear recursion (note change of notation) – h(t) = W h(t-1) + C x(t) – Response to a single input at 0: h_0(t) = W^t C x(0) • Length of the response (||h||) to a single input at 0 • We can write W = U Λ U^(-1) – W u_i = λ_i u_i – For any vector h we can write • h = a_1 u_1 + a_2 u_2 + ⋯ + a_n u_n • W h = a_1 λ_1 u_1 + a_2 λ_2 u_2 + ⋯ + a_n λ_n u_n • W^t h = a_1 λ_1^t u_1 + a_2 λ_2^t u_2 + ⋯ + a_n λ_n^t u_n – lim_{t→∞} W^t h = a_m λ_m^t u_m, where m = argmax_j |λ_j| • If |λ_max| > 1 the response will blow up, otherwise it will contract and shrink to 0 rapidly • What about at middling values of t? It will depend on the other eigenvalues • For any input, for large t the length of the hidden vector will expand or contract according to the t-th power of the largest eigenvalue of the hidden-layer weight matrix – Unless the input has no component along the eigenvector corresponding to the largest eigenvalue; in that case it will grow according to the second largest eigenvalue.. And so on..
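A minimal NumPy sketch of the eigenvalue argument above (the random matrix, its symmetrization, and the target spectral radius of 1.1 are illustrative assumptions, not values from the slides): repeatedly applying W scales a generic vector roughly by |λ_max|^t.

```python
import numpy as np

# Minimal sketch: the response of the linear recursion h(t) = W h(t-1)
# to a single initial vector is dominated by W's largest eigenvalue.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
W = 0.5 * (W + W.T)                               # symmetric => real eigenvalues
W *= 1.1 / np.max(np.abs(np.linalg.eigvals(W)))   # force |lambda_max| = 1.1

h = np.ones(4)
for t in range(1, 21):
    h = W @ h
    if t % 5 == 0:
        # ||W^t h|| grows roughly like |lambda_max|^t = 1.1^t
        print(t, np.linalg.norm(h), 1.1 ** t)
```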

  2. Linear recursions • Vector linear recursion – h(t) = W h(t-1) + C x(t) – h_0(t) = W^t C x(0) • Response to a single input [1 1 1 1] at 0 • [Plots: response magnitude vs. time for λ_max = 0.9, 1.1, and 1]

  3. Linear recursions • Vector linear recursion – h(t) = W h(t-1) + C x(t) – h_0(t) = W^t C x(0) • Response to a single input [1 1 1 1] at 0 • [Plots: responses for λ_max = 0.9; λ_max = 1.1; λ_max = 1.1 with λ_2nd = 0.5; λ_max = 1; λ_max = 1 with λ_2nd = 0.1; and complex eigenvalues]

  4. Lesson.. • In linear systems, long-term behavior depends entirely on the eigenvalues of the hidden-layer weight matrix – If the largest eigenvalue is greater than 1, the system will "blow up" – If it is less than 1, the response will "vanish" very quickly – Complex eigenvalues cause oscillatory response • Which we may or may not want • Force the matrix to have real eigenvalues for smooth behavior – Symmetric weight matrix

  5. How about non-linearities • h(t) = f(w·h(t-1) + c·x(t)) • The behavior of scalar non-linearities • Left: sigmoid, middle: tanh, right: ReLU – Sigmoid: saturates in a limited number of steps, regardless of w – Tanh: sensitive to w, but eventually saturates • "Prefers" weights close to 1.0 – ReLU: sensitive to w, can blow up
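A small sketch of the scalar recursion h(t) = f(w·h(t-1) + c·x(t)) for the three activations discussed here; the weights tried (0.5, 1.0, 2.0), the single unit input at t = 0, and the horizon are arbitrary choices for illustration.

```python
import numpy as np

# Minimal sketch: scalar recursion h(t) = f(w*h(t-1) + c*x(t)) with a single
# input x(0)=1 and x(t)=0 afterwards, for sigmoid / tanh / ReLU activations.
def run(f, w, c=1.0, T=20):
    h, out = 0.0, []
    for t in range(T):
        x = 1.0 if t == 0 else 0.0
        h = f(w * h + c * x)
        out.append(h)
    return out

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
relu    = lambda z: max(z, 0.0)

for w in (0.5, 1.0, 2.0):
    print("w =", w)
    print("  sigmoid:", [round(v, 3) for v in run(sigmoid, w)[-3:]])
    print("  tanh   :", [round(v, 3) for v in run(np.tanh, w)[-3:]])
    print("  relu   :", [round(v, 3) for v in run(relu, w)[-3:]])
```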

  6. How about non-linearities • h(t) = f(w·h(t-1) + c·x(t)) • With a negative start (equivalent to a negative weight) • Left: sigmoid, middle: tanh, right: ReLU – Sigmoid: saturates in a limited number of steps, regardless of w – Tanh: sensitive to w, but eventually saturates – ReLU: for negative starts, has no response

  7. Vector Process • h(t) = f(W h(t-1) + C x(t)) • Assuming a uniform unit vector initialization – [1, 1, 1, …]/√N – Behavior similar to the scalar recursion – Interestingly, ReLU is more prone to blowing up (why?) • Eigenvalues less than 1.0 retain the most "memory" • [Plots: sigmoid, tanh, ReLU]
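A rough sketch of the vector recursion started from the uniform unit vector and given no further input (an assumption); the dimension, random W, and spectral radius of 1.1 are also assumptions, chosen only to show the qualitative difference between activations.

```python
import numpy as np

# Minimal sketch: vector recursion h(t) = f(W h(t-1)) started from the
# uniform unit vector [1,1,...,1]/sqrt(N), for different activations.
rng = np.random.default_rng(1)
N = 8
W = rng.standard_normal((N, N))
W *= 1.1 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius 1.1

def run(f, T=50):
    h = np.ones(N) / np.sqrt(N)
    for _ in range(T):
        h = f(W @ h)
    return np.linalg.norm(h)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
relu    = lambda z: np.maximum(z, 0.0)
print("sigmoid:", run(sigmoid), "tanh:", run(np.tanh), "relu:", run(relu))
```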

  8. Vector Process • h(t) = f(W h(t-1) + C x(t)) • Assuming a uniform unit vector initialization – [-1, -1, -1, …]/√N – Behavior similar to the scalar recursion – Interestingly, ReLU is more prone to blowing up (why?) • [Plots: sigmoid, tanh, ReLU]

  9. Stability Analysis • Formal stability analysis considers convergence of "Lyapunov" functions – Alternately, Routh's criterion and/or pole-zero analysis – Positive definite functions evaluated at h – Conclusions are similar: only the tanh activation gives us any reasonable behavior • And still has very short "memory" • Lessons: – Bipolar activations (e.g. tanh) have the best behavior – Still sensitive to the eigenvalues of W – Best-case memory is short – Exponential memory behavior • "Forgets" in exponential manner

  10. How about deeper recursion • Consider the simple, scalar, linear recursion – Adding more "taps" adds more "modes" to memory in somewhat non-obvious ways • h(t) = 0.5 h(t-1) + 0.25 h(t-5) + x(t) • h(t) = 0.5 h(t-1) + 0.25 h(t-5) + 0.1 h(t-8) + x(t)
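A short sketch of the two multi-tap recursions written on this slide, driven by a single input at t = 0 (the 40-step horizon is an arbitrary choice):

```python
# Minimal sketch: scalar recursions with extra "taps", as on this slide.
# h(t) = 0.5 h(t-1) + 0.25 h(t-5) + x(t), and a variant with a t-8 tap.
def multitap(taps, T=40):
    h = [0.0] * T
    for t in range(T):
        x = 1.0 if t == 0 else 0.0            # single input at t = 0
        h[t] = x + sum(w * h[t - d] for d, w in taps.items() if t - d >= 0)
    return h

resp1 = multitap({1: 0.5, 5: 0.25})
resp2 = multitap({1: 0.5, 5: 0.25, 8: 0.1})
print([round(v, 3) for v in resp1[:12]])
print([round(v, 3) for v in resp2[:12]])
```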

  11. Stability Analysis • Similar analysis of vector functions with non-linear activations is relatively straightforward – Linear systems: Routh's criterion • And pole-zero analysis (involves tensors) – On board? – Non-linear systems: Lyapunov functions • Conclusions do not change

  12. RNNs.. • Excellent models for time-series analysis tasks – Time-series prediction – Time-series classification – Sequence prediction.. – They can even simplify problems that are difficult for MLPs • But the memory isn't all that great.. – Also..

  13. The vanishing gradient problem • A particular problem with training deep networks.. – The gradient of the error with respect to weights is unstable..

  14. Some useful preliminary math: The problem with training deep networks • [Figure: deep network with weight matrices W_0, W_1, W_2, …] • A multilayer perceptron is a nested function: Y = f_N(W_{N-1} f_{N-1}(W_{N-2} f_{N-2}(… W_0 X))) • W_k is the weight matrix at the k-th layer • The error for X can be written as: Div(X) = D(f_N(W_{N-1} f_{N-1}(W_{N-2} f_{N-2}(… W_0 X))))
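A tiny sketch of this nested-function view of an MLP; the layer widths, the tanh activation, and the 0.1 weight scale are hypothetical.

```python
import numpy as np

# Minimal sketch of the nested-function view of an MLP:
#   Y = f_N( W_{N-1} f_{N-1}( W_{N-2} f_{N-2}( ... W_0 X ) ) )
rng = np.random.default_rng(2)
widths = [16, 32, 32, 10]                       # hypothetical layer sizes
Ws = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(widths[:-1], widths[1:])]

def forward(X, Ws, f=np.tanh):
    h = X
    for W in Ws:
        h = f(W @ h)                            # one nesting level per layer
    return h

Y = forward(rng.standard_normal(widths[0]), Ws)
print(Y.shape)
```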

  15. Training deep networks • Vector derivative chain rule: for any f(Wg(X)): df(Wg(X))/dX = df(Wg(X))/d(Wg(X)) · d(Wg(X))/dg(X) · dg(X)/dX (poor notation) • Equivalently: ∇_X f = ∇_z f · W · ∇_X g • Where – z = Wg(X) – ∇_z f is the Jacobian matrix of f(z) w.r.t. z • Using the notation ∇_z f instead of J_f(z) for consistency

  16. Training deep networks • For Div(X) = D(f_N(W_{N-1} f_{N-1}(W_{N-2} f_{N-2}(… W_0 X)))) • We get: ∇_{f_k} Div = ∇D · ∇f_N · W_{N-1} · ∇f_{N-1} · W_{N-2} … ∇f_{k+1} · W_k • Where – ∇_{f_k} Div is the gradient of the error Div(X) w.r.t. the output of the k-th layer of the network • Needed to compute the gradient of the error w.r.t. W_{k-1} – ∇f_n is the Jacobian of f_N() w.r.t. its current input – All blue terms are matrices
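A minimal sketch of this backward chain through a deep tanh network, multiplying the Jacobian of each activation and the corresponding weight matrix while tracking the gradient's norm; the depth, width, and 1/√N weight scale are assumptions.

```python
import numpy as np

# Minimal sketch: the backward chain  grad = gradD . Jf_N . W_{N-1} . Jf_{N-1} ...
# through a deep tanh network, showing how its norm changes layer by layer.
rng = np.random.default_rng(3)
L, N = 20, 64
Ws = [rng.standard_normal((N, N)) / np.sqrt(N) for _ in range(L)]

# Forward pass, remembering each layer's pre-activation for the Jacobian.
h, zs = rng.standard_normal(N), []
for W in Ws:
    z = W @ h
    zs.append(z)
    h = np.tanh(z)

grad = rng.standard_normal(N)                    # stands in for the error gradient
for layer in reversed(range(L)):
    J = np.diag(1.0 - np.tanh(zs[layer]) ** 2)   # Jacobian of the tanh layer
    grad = grad @ J @ Ws[layer]                  # one more factor in the chain
    if layer % 5 == 0:
        print("layer", layer, "grad norm", np.linalg.norm(grad))
```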

  17. The Jacobian of the hidden layers • ∇f_t(z) = diag(f'_{t,1}(z_1), f'_{t,2}(z_2), …, f'_{t,N}(z_N)), with h_i(t) = f_t(z_i(t)) • ∇f_t() is the derivative of the output of the (layer of) hidden recurrent neurons with respect to their input – A matrix where the diagonal entries are the derivatives of the activation of the recurrent hidden layer

  18. The Jacobian • h_i(t) = f_t(z_i(t)); ∇f_t(z) = diag(f'_{t,1}(z_1), f'_{t,2}(z_2), …, f'_{t,N}(z_N)) • The derivative (or subgradient) of the activation function is always bounded – The diagonals of the Jacobian are bounded • There is a limit on how much multiplying a vector by the Jacobian will scale it

  19. The derivative of the hidden state activation • ∇f_t(z) = diag(f'_{t,1}(z_1), f'_{t,2}(z_2), …, f'_{t,N}(z_N)) • Most common activation functions, such as sigmoid, tanh() and ReLU, have derivatives that are never greater than 1 • The most common activation for the hidden units in an RNN is the tanh() – The derivative of tanh() is never greater than 1 (it equals 1 only at 0) • Multiplication by the Jacobian is always a shrinking operation
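A short check of the claim about the tanh Jacobian, under the assumption of a diagonal Jacobian built from a few sample pre-activations:

```python
import numpy as np

# Minimal sketch: the Jacobian of an elementwise tanh layer is diagonal,
# with entries tanh'(z) = 1 - tanh(z)^2 <= 1, so multiplying by it cannot
# grow any component of a vector.
z = np.linspace(-4, 4, 9)
J = np.diag(1.0 - np.tanh(z) ** 2)          # diagonal Jacobian
v = np.ones_like(z)
print(np.round(np.diag(J), 3))              # all entries <= 1
print(np.linalg.norm(J @ v), "<=", np.linalg.norm(v))
```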

  20. Training deep networks • ∇_{f_k} Div = ∇D · ∇f_N · W_{N-1} · ∇f_{N-1} · W_{N-2} … ∇f_{k+1} · W_k • As we go back in layers, the Jacobians of the activations constantly shrink the derivative – After a few instants the derivative of the divergence at any time is totally "forgotten"

  21. What about the weights • ∇_{f_k} Div = ∇D · ∇f_N · W_{N-1} · ∇f_{N-1} · W_{N-2} … ∇f_{k+1} · W_k • In a single-layer RNN, the weight matrices are identical • The chain product for ∇_{f_k} Div will – Expand ∇D along directions in which the singular values of the weight matrices are greater than 1 – Shrink ∇D in directions where the singular values are less than 1 – Exploding or vanishing gradients

  22. Exploding/Vanishing gradients • ∇_{f_k} Div = ∇D · ∇f_N · W_{N-1} · ∇f_{N-1} · W_{N-2} … ∇f_{k+1} · W_k • Every blue term is a matrix • ∇D is proportional to the actual error – Particularly for L2 and KL divergence • The chain product for ∇_{f_k} Div will – Expand ∇D in directions where each stage has singular values greater than 1 – Shrink ∇D in directions where each stage has singular values less than 1
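A small sketch of the repeated-weight effect: powers of a single random W (an assumption standing in for the RNN's recurrent weight matrix) stretch some directions and crush others, as the singular values of the product show.

```python
import numpy as np

# Minimal sketch: in a single-layer RNN the same W appears in every factor,
# so directions with singular values > 1 grow and those < 1 shrink.
rng = np.random.default_rng(4)
N, T = 6, 30
W = rng.standard_normal((N, N)) / np.sqrt(N)

P = np.eye(N)
for t in range(1, T + 1):
    P = P @ W
    if t % 10 == 0:
        s = np.linalg.svd(P, compute_uv=False)
        print("t =", t, "largest/smallest singular value:", s[0], s[-1])
```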

  23. Gradient problems in deep networks • ∇_{f_k} Div = ∇D · ∇f_N · W_{N-1} · ∇f_{N-1} · W_{N-2} … ∇f_{k+1} · W_k • The gradients in the lower/earlier layers can explode or vanish – Resulting in insignificant or unstable gradient descent updates – The problem gets worse as network depth increases

  24. Vanishing gradient examples.. • [Figure: ELU activation, batch gradients; gradient magnitudes shown from the input layer to the output layer] • 19-layer MNIST model – Different activations: exponential linear units, ReLU, sigmoid, tanh – Each layer is 1024 neurons wide – Gradients shown at initialization • Will actually decrease with additional training • Figure shows log ||∇_{W_neuron} E|| where W_neuron is the vector of incoming weights to each neuron – i.e. the gradient of the loss w.r.t. the entire set of weights to each neuron

  25. Vanishing gradient examples.. • [Figure: ReLU activation, batch gradients; gradient magnitudes shown from the input layer to the output layer] • 19-layer MNIST model – Different activations: exponential linear units, ReLU, sigmoid, tanh – Each layer is 1024 neurons wide – Gradients shown at initialization • Will actually decrease with additional training • Figure shows log ||∇_{W_neuron} E|| where W_neuron is the vector of incoming weights to each neuron – i.e. the gradient of the loss w.r.t. the entire set of weights to each neuron

  26. Vanishing gradient examples.. • [Figure: sigmoid activation, batch gradients; gradient magnitudes shown from the input layer to the output layer] • 19-layer MNIST model – Different activations: exponential linear units, ReLU, sigmoid, tanh – Each layer is 1024 neurons wide – Gradients shown at initialization • Will actually decrease with additional training • Figure shows log ||∇_{W_neuron} E|| where W_neuron is the vector of incoming weights to each neuron – i.e. the gradient of the loss w.r.t. the entire set of weights to each neuron

  27. Vanishing gradient examples.. • [Figure: tanh activation, batch gradients; gradient magnitudes shown from the input layer to the output layer] • 19-layer MNIST model – Different activations: exponential linear units, ReLU, sigmoid, tanh – Each layer is 1024 neurons wide – Gradients shown at initialization • Will actually decrease with additional training • Figure shows log ||∇_{W_neuron} E|| where W_neuron is the vector of incoming weights to each neuron – i.e. the gradient of the loss w.r.t. the entire set of weights to each neuron

  28. Vanishing gradient examples.. • [Figure: ELU activation, gradients for individual instances] • 19-layer MNIST model – Different activations: exponential linear units, ReLU, sigmoid, tanh – Each layer is 1024 neurons wide – Gradients shown at initialization • Will actually decrease with additional training • Figure shows log ||∇_{W_neuron} E|| where W_neuron is the vector of incoming weights to each neuron – i.e. the gradient of the loss w.r.t. the entire set of weights to each neuron

  29. Vanishing gradients • ELU activations maintain gradients longest • But in all cases gradients effectively vanish after about 10 layers! – Your results may vary • Both batch gradients and gradients for individual instances disappear – In reality a tiny number may actually blow up.

  30. Recurrent nets are very deep nets • [Figure: RNN unrolled from input X(0) to output Y(T), with initial state h_f(-1)] • ∇_{f_k} Div = ∇D · ∇f_N · W_{N-1} · ∇f_{N-1} · W_{N-2} … ∇f_{k+1} · W_k • The relation between X(0) and Y(T) is one of a very deep network – Gradients from errors at t = T will vanish by the time they're propagated to t = 0
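A rough sketch of how the Jacobian of the final state with respect to earlier states typically shrinks in a tanh RNN unrolled over time; the width, horizon, and weight scaling are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: unrolled in time, the Jacobian of the last hidden state
# w.r.t. earlier states is a product of many factors (tanh Jacobian times
# the same recurrent W), so it typically vanishes.
rng = np.random.default_rng(5)
N, T = 32, 50
W = rng.standard_normal((N, N)) / np.sqrt(N)

h, zs = rng.standard_normal(N), []
for _ in range(T):
    z = W @ h
    zs.append(z)
    h = np.tanh(z)

J = np.eye(N)                                  # accumulates the Jacobian product
for t in reversed(range(T)):
    J = J @ np.diag(1.0 - np.tanh(zs[t]) ** 2) @ W
    if t % 10 == 0:
        print("steps back:", T - t, "Jacobian norm:", np.linalg.norm(J))
```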

  31. Recall: Vanishing stuff.. • [Figure: RNN unrolled over time with inputs X(0) … X(T), outputs Y(0) … Y(T), and initial state h_{-1}] • Stuff gets forgotten in the forward pass too

  32. The long-term dependency problem • PATTERN1 [………………..] PATTERN2 • "Jane had a quick lunch in the bistro. Then she.." • Any other pattern of any length can happen between pattern 1 and pattern 2 – The RNN will "forget" pattern 1 if the intermediate stuff is too long – "Jane" → the next pronoun referring to her will be "she" • Must know to "remember" for extended periods of time and "recall" when necessary – Can be performed with a multi-tap recursion, but how many taps? – Need an alternate way to "remember" stuff

  33. And now we enter the domain of..

  34. Exploding/Vanishing gradients • ∇_{f_k} Div = ∇D · ∇f_N · W_{N-1} · ∇f_{N-1} · W_{N-2} … ∇f_{k+1} · W_k • Can we replace this with something that doesn't fade or blow up? • ∇_{f_k} Div = ∇D · C·σ_N · C·σ_{N-1} · C … σ_k • Can we have a network that just "remembers" arbitrarily long, to be recalled on demand?

  35. Enter – the constant error carousel • [Figure: a memory value h(t) carried through steps t+1 … t+4, multiplied at each step by a gate value σ(t+k)] • History is carried through uncompressed – No weights, no nonlinearities – The only scaling is through the σ "gating" term that captures other triggers – E.g. "Have I seen Pattern2"?
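A toy sketch of the carousel idea: the carried value is only rescaled by gate values in (0,1), with no weights and no nonlinearity. The gate sequence below is hypothetical.

```python
# Minimal sketch of the constant error carousel: the carried value is
# only scaled by gate activations in (0,1); no weights, no nonlinearity.
h = 1.0
gates = [1.0, 1.0, 0.99, 1.0, 1.0, 0.2, 1.0]   # hypothetical gate values
for t, s in enumerate(gates, start=1):
    h = s * h                                   # h(t) = sigma(t) * h(t-1)
    print("t =", t, "h =", round(h, 4))
```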

  36. Enter – the constant error carousel • [Figure: the carousel as before, with the gate values σ(t+k) computed from the inputs X(t+k)] • Actual non-linear work is done by other portions of the network

  37. Enter – the constant error carousel • [Figure: the carousel as before; the gate values σ(t+k) are computed from the inputs X(t+k) by other portions of the network ("other stuff")] • Actual non-linear work is done by other portions of the network

  38. Enter – the constant error carousel • [Figure: the carousel as before; the gate values σ(t+k) are computed from the inputs X(t+k) by other portions of the network ("other stuff")] • Actual non-linear work is done by other portions of the network

  39. Enter – the constant error carousel • [Figure: the carousel as before; the gate values σ(t+k) are computed from the inputs X(t+k) by other portions of the network ("other stuff")] • Actual non-linear work is done by other portions of the network

  40. Enter the LSTM • Long Short-Term Memory • Explicitly latch information to prevent decay / blowup • The following notes borrow liberally from http://colah.github.io/posts/2015-08-Understanding-LSTMs/

  41. Standard RNN • Recurrent neurons receive past recurrent outputs and the current input as inputs • Processed through a tanh() activation function – As mentioned earlier, tanh() is the generally used activation for the hidden layer • Current recurrent output passed to next higher layer and next time instant

  42. Long Short-Term Memory • The σ() are multiplicative gates that decide if something is important or not • Remember, every line actually represents a vector

  43. LSTM: Constant Error Carousel • Key component: a remembered cell state

  44. LSTM: CEC • C_t is the linear history carried by the constant-error carousel • Carries information through, only affected by a gate – And the addition of history, which too is gated..

  45. LSTM: Gates • Gates are simple sigmoidal units with outputs in the range (0,1) • Controls how much of the information is to be let through

  46. LSTM: Forget gate • The first gate determines whether to carry over the history or to forget it – More precisely, how much of the history to carry over – Also called the "forget" gate – Note, we're actually distinguishing between the cell memory C and the state h that is carried over time! They're related, though

  47. LSTM: Input gate • The second gate has two parts – A perceptron layer that determines if there's something interesting in the input – A gate that decides if it's worth remembering – If so, it's added to the current memory cell

  48. LSTM: Memory cell update • The second gate has two parts – A perceptron layer that determines if there's something interesting in the input – A gate that decides if it's worth remembering – If so, it's added to the current memory cell

  49. LSTM: Output and Output gate • The output of the cell – Simply compress it with tanh to make it lie between -1 and 1 • Note that this compression no longer affects our ability to carry memory forward – While we're at it, let's toss in an output gate • To decide if the memory contents are worth reporting at this time

  50. LSTM: The "Peephole" Connection • Why not just let the cell directly influence the gates while at it – Party!!

  51. The complete LSTM unit • [Figure: LSTM cell taking C_{t-1}, h_{t-1} and x_t to C_t and h_t, with forget gate f_t, input gate i_t, output gate o_t, candidate C̃_t, sigmoid σ() and tanh blocks] • With input, output, and forget gates and the peephole connection..
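A minimal sketch of one forward step through such a unit, with input, forget and output gates plus peephole terms; the parameter names (Wf, Pf, …), shapes, and random initialization are assumptions for illustration, not the lecture's code.

```python
import numpy as np

# Minimal sketch of one forward step of a peephole LSTM unit.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, p):
    cat = np.concatenate([h_prev, x])
    f = sigmoid(p["Wf"] @ cat + p["Pf"] * C_prev + p["bf"])   # forget gate (peephole Pf)
    i = sigmoid(p["Wi"] @ cat + p["Pi"] * C_prev + p["bi"])   # input gate  (peephole Pi)
    C_tilde = np.tanh(p["Wc"] @ cat + p["bc"])                # candidate memory
    C = f * C_prev + i * C_tilde                              # CEC update
    o = sigmoid(p["Wo"] @ cat + p["Po"] * C + p["bo"])        # output gate (peephole Po)
    h = o * np.tanh(C)
    return h, C

# Tiny usage example with random parameters.
rng = np.random.default_rng(6)
nx, nh = 3, 4
p = {k: rng.standard_normal((nh, nh + nx)) * 0.1 for k in ("Wf", "Wi", "Wc", "Wo")}
p.update({k: rng.standard_normal(nh) * 0.1 for k in ("Pf", "Pi", "Po", "bf", "bi", "bc", "bo")})
h, C = lstm_step(rng.standard_normal(nx), np.zeros(nh), np.zeros(nh), p)
print(h, C)
```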

  52. Backpropagation rules: Forward • [Figure: the complete LSTM unit with cell state C_{t-1} → C_t, gates f_t, i_t, o_t, candidate C̃_t, state h_{t-1} → h_t, and input x_t] • Forward rules for the gates and the variables (equations given in the figure)

  53. Backpropagation rules: Backward • [Figure: two LSTM steps unrolled at t and t+1, with cell states C_{t-1}, C_t, C_{t+1}, gates f, i, o, candidates C̃_t, C̃_{t+1}, states h_{t-1}, h_t, h_{t+1}, inputs x_t, x_{t+1}, and output pre-activation z_t] • ∇_{C_t} Div = (built up term by term on the following slides)

  54. Backpropagation rules: Backward • [Figure: two LSTM steps unrolled at t and t+1] • ∇_{C_t} Div = ∇_{h_t} Div ∘ o_t ∘ tanh'(C_t)

  55. Backpropagation rules: Backward • [Figure: two LSTM steps unrolled at t and t+1] • ∇_{C_t} Div = ∇_{h_t} Div ∘ o_t ∘ tanh'(C_t) + ∇_{h_t} Div ∘ tanh(C_t) ∘ σ'(·)·W_{Co}

  56. Backpropagation rules: Backward • [Figure: two LSTM steps unrolled at t and t+1] • ∇_{C_t} Div = ∇_{h_t} Div ∘ o_t ∘ tanh'(C_t) + ∇_{h_t} Div ∘ tanh(C_t) ∘ σ'(·)·W_{Co} + ∇_{C_{t+1}} Div ∘ f_{t+1}

  57. Backpropagation rules: Backward • [Figure: two LSTM steps unrolled at t and t+1] • ∇_{C_t} Div = ∇_{h_t} Div ∘ o_t ∘ tanh'(C_t) + ∇_{h_t} Div ∘ tanh(C_t) ∘ σ'(·)·W_{Co} + ∇_{C_{t+1}} Div ∘ f_{t+1} + ∇_{C_{t+1}} Div ∘ C_t ∘ σ'(·)·W_{Cf}

  58. Backpropagation rules: Backward • [Figure: two LSTM steps unrolled at t and t+1] • ∇_{C_t} Div = ∇_{h_t} Div ∘ o_t ∘ tanh'(C_t) + ∇_{h_t} Div ∘ tanh(C_t) ∘ σ'(·)·W_{Co} + ∇_{C_{t+1}} Div ∘ f_{t+1} + ∇_{C_{t+1}} Div ∘ C_t ∘ σ'(·)·W_{Cf} + ∇_{C_{t+1}} Div ∘ C̃_{t+1} ∘ σ'(·)·W_{Ci}

  59. Backpropagation rules: Backward • [Figure: two LSTM steps unrolled at t and t+1] • ∇_{C_t} Div = ∇_{h_t} Div ∘ o_t ∘ tanh'(C_t) + ∇_{h_t} Div ∘ tanh(C_t) ∘ σ'(·)·W_{Co} + ∇_{C_{t+1}} Div ∘ f_{t+1} + ∇_{C_{t+1}} Div ∘ C_t ∘ σ'(·)·W_{Cf} + ∇_{C_{t+1}} Div ∘ C̃_{t+1} ∘ σ'(·)·W_{Ci} • ∇_{h_t} Div = ∇_{z_t} Div · ∇_{h_t} z_t

  60. Backpropagation rules: Backward • [Figure: two LSTM steps unrolled at t and t+1] • ∇_{C_t} Div = ∇_{h_t} Div ∘ o_t ∘ tanh'(C_t) + ∇_{h_t} Div ∘ tanh(C_t) ∘ σ'(·)·W_{Co} + ∇_{C_{t+1}} Div ∘ f_{t+1} + ∇_{C_{t+1}} Div ∘ C_t ∘ σ'(·)·W_{Cf} + ∇_{C_{t+1}} Div ∘ C̃_{t+1} ∘ σ'(·)·W_{Ci} • ∇_{h_t} Div = ∇_{z_t} Div · ∇_{h_t} z_t + ∇_{C_{t+1}} Div ∘ C_t ∘ σ'(·)·W_{hf}

  61. Backpropagation rules: Backward • [Figure: two LSTM steps unrolled at t and t+1] • ∇_{C_t} Div = ∇_{h_t} Div ∘ o_t ∘ tanh'(C_t) + ∇_{h_t} Div ∘ tanh(C_t) ∘ σ'(·)·W_{Co} + ∇_{C_{t+1}} Div ∘ f_{t+1} + ∇_{C_{t+1}} Div ∘ C_t ∘ σ'(·)·W_{Cf} + ∇_{C_{t+1}} Div ∘ C̃_{t+1} ∘ σ'(·)·W_{Ci} • ∇_{h_t} Div = ∇_{z_t} Div · ∇_{h_t} z_t + ∇_{C_{t+1}} Div ∘ C_t ∘ σ'(·)·W_{hf} + ∇_{C_{t+1}} Div ∘ C̃_{t+1} ∘ σ'(·)·W_{hi}

  62. Backpropagation rules: Backward • [Figure: two LSTM steps unrolled at t and t+1] • ∇_{C_t} Div = ∇_{h_t} Div ∘ o_t ∘ tanh'(C_t) + ∇_{h_t} Div ∘ tanh(C_t) ∘ σ'(·)·W_{Co} + ∇_{C_{t+1}} Div ∘ f_{t+1} + ∇_{C_{t+1}} Div ∘ C_t ∘ σ'(·)·W_{Cf} + ∇_{C_{t+1}} Div ∘ C̃_{t+1} ∘ σ'(·)·W_{Ci} • ∇_{h_t} Div = ∇_{z_t} Div · ∇_{h_t} z_t + ∇_{C_{t+1}} Div ∘ C_t ∘ σ'(·)·W_{hf} + ∇_{C_{t+1}} Div ∘ C̃_{t+1} ∘ σ'(·)·W_{hi} + ∇_{C_{t+1}} Div ∘ i_{t+1} ∘ tanh'(·)·W_{hC}

  63. Backpropagation rules: Backward • [Figure: two LSTM steps unrolled at t and t+1] • ∇_{C_t} Div = ∇_{h_t} Div ∘ o_t ∘ tanh'(C_t) + ∇_{h_t} Div ∘ tanh(C_t) ∘ σ'(·)·W_{Co} + ∇_{C_{t+1}} Div ∘ f_{t+1} + ∇_{C_{t+1}} Div ∘ C_t ∘ σ'(·)·W_{Cf} + ∇_{C_{t+1}} Div ∘ C̃_{t+1} ∘ σ'(·)·W_{Ci} • ∇_{h_t} Div = ∇_{z_t} Div · ∇_{h_t} z_t + ∇_{C_{t+1}} Div ∘ C_t ∘ σ'(·)·W_{hf} + ∇_{C_{t+1}} Div ∘ C̃_{t+1} ∘ σ'(·)·W_{hi} + ∇_{C_{t+1}} Div ∘ i_{t+1} ∘ tanh'(·)·W_{hC} + ∇_{h_{t+1}} Div ∘ tanh(C_{t+1}) ∘ σ'(·)·W_{ho}

  64. Backpropagation rules: Backward • [Figure: two LSTM steps unrolled at t and t+1] • Not explicitly deriving the derivatives w.r.t. weights; left as an exercise • ∇_{C_t} Div = ∇_{h_t} Div ∘ o_t ∘ tanh'(C_t) + ∇_{h_t} Div ∘ tanh(C_t) ∘ σ'(·)·W_{Co} + ∇_{C_{t+1}} Div ∘ f_{t+1} + ∇_{C_{t+1}} Div ∘ C_t ∘ σ'(·)·W_{Cf} + ∇_{C_{t+1}} Div ∘ C̃_{t+1} ∘ σ'(·)·W_{Ci} • ∇_{h_t} Div = ∇_{z_t} Div · ∇_{h_t} z_t + ∇_{C_{t+1}} Div ∘ C_t ∘ σ'(·)·W_{hf} + ∇_{C_{t+1}} Div ∘ C̃_{t+1} ∘ σ'(·)·W_{hi} + ∇_{C_{t+1}} Div ∘ i_{t+1} ∘ tanh'(·)·W_{hC} + ∇_{h_{t+1}} Div ∘ tanh(C_{t+1}) ∘ σ'(·)·W_{ho}

  65. Gated Recurrent Units: Let's simplify the LSTM • A simplified LSTM which addresses some of your concerns of why..

  66. Gated Recurrent Units: Let's simplify the LSTM • Combine the forget and input gates – If new input is to be remembered, then this means old memory is to be forgotten • Why compute twice?

  67. Gated Recurrent Units: Let's simplify the LSTM • Don't bother to separately maintain compressed and regular memories – Pointless computation! • But compress it before using it to decide on the usefulness of the current input!
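A minimal sketch of a GRU step. The slide describes the simplification but does not list equations, so the standard GRU formulation (update gate z, reset gate r) is assumed here, with hypothetical parameter names and shapes.

```python
import numpy as np

# Minimal sketch of one GRU step: a single interpolation replaces the
# separate forget/input gates, and there is no separate cell memory.
def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x, h_prev, p):
    z = sigmoid(p["Uz"] @ x + p["Wz"] @ h_prev)               # update gate
    r = sigmoid(p["Ur"] @ x + p["Wr"] @ h_prev)               # reset gate
    h_tilde = np.tanh(p["Uh"] @ x + p["Wh"] @ (r * h_prev))   # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                   # single interpolation

rng = np.random.default_rng(7)
nx, nh = 3, 4
p = {k: rng.standard_normal((nh, nx)) * 0.1 for k in ("Uz", "Ur", "Uh")}
p.update({k: rng.standard_normal((nh, nh)) * 0.1 for k in ("Wz", "Wr", "Wh")})
print(gru_step(rng.standard_normal(nx), np.zeros(nh), p))
```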

  68. LSTM Equations • i = σ(x_t U^i + s_{t-1} W^i) • f = σ(x_t U^f + s_{t-1} W^f) • o = σ(x_t U^o + s_{t-1} W^o) • g = tanh(x_t U^g + s_{t-1} W^g) • c_t = c_{t-1} ∘ f + g ∘ i • s_t = tanh(c_t) ∘ o • y = softmax(V s_t) • i: input gate, how much of the new information will be let through the memory cell • f: forget gate, responsible for information that should be thrown away from the memory cell • o: output gate, how much of the information will be passed to expose to the next time step • g: self-recurrent, which is equal to the standard RNN • c_t: internal memory of the memory cell • s_t: hidden state • y: final output • [Figure: LSTM memory cell]
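A minimal sketch implementing the equations listed on this slide, using the same row-vector convention (x_t U + s_{t-1} W); the parameter shapes and random values are assumptions.

```python
import numpy as np

# Minimal sketch of the LSTM equations on this slide.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lstm_cell(x_t, s_prev, c_prev, U, W, V):
    i = sigmoid(x_t @ U["i"] + s_prev @ W["i"])   # input gate
    f = sigmoid(x_t @ U["f"] + s_prev @ W["f"])   # forget gate
    o = sigmoid(x_t @ U["o"] + s_prev @ W["o"])   # output gate
    g = np.tanh(x_t @ U["g"] + s_prev @ W["g"])   # self-recurrent candidate
    c_t = c_prev * f + g * i                      # internal memory
    s_t = np.tanh(c_t) * o                        # hidden state
    y_t = softmax(s_t @ V)                        # final output
    return s_t, c_t, y_t

rng = np.random.default_rng(8)
nx, nh, ny = 5, 4, 3
U = {k: rng.standard_normal((nx, nh)) * 0.1 for k in "ifog"}
W = {k: rng.standard_normal((nh, nh)) * 0.1 for k in "ifog"}
V = rng.standard_normal((nh, ny)) * 0.1
s, c, y = lstm_cell(rng.standard_normal(nx), np.zeros(nh), np.zeros(nh), U, W, V)
print(y)
```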

  69. LSTM architectures example • [Figure: recurrent network unrolled over time, mapping X(t) to Y(t)] • Each green box is now an entire LSTM or GRU unit • Also keep in mind each box is an array of units

  70. Bidirectional LSTM • [Figure: a forward recurrent layer with initial state h_f(-1) and a backward recurrent layer with initial state h_b(inf), both reading X(0) … X(T) and jointly producing Y(0) … Y(T)] • Like the BRNN, but now the hidden nodes are LSTM units • Can have multiple layers of LSTM units in either direction – It's also possible to have MLP feed-forward layers between the hidden layers.. • The output nodes (orange boxes) may be complete MLPs

  71. Significant issue left out • The Divergence

  72. Story so far • [Figure: RNN unrolled over time; at each step the output Y(t) is compared to Y_desired(t) through a DIVERGENCE] • Outputs may not be defined at all times – Often no clear synchrony between input and desired output • Unclear how to specify alignment • Unclear how to compute a divergence – Obvious choices for divergence may not be differentiable (e.g. edit distance) • In later lectures..

  73. Some typical problem settings • Let's consider a few typical problems • Issues: – How to define the divergence() – How to compute the gradient – How to backpropagate – Specific problem: the constant error carousel..
