  1. Gradients of Deep Networks. Chris Cremer, March 29, 2017.

  2. Neural Net. Input $X$, hidden layers $h_1, h_2, h_3$ (each followed by an activation), and output $\hat{Y}$, connected by weights $W_1, W_2, W_3, W_4$. Each layer computes $h_t = g(W_t \cdot h_{t-1})$, where $g$ is a non-linear activation function (sigmoid, tanh, ReLU, Softplus, Maxout, ...).
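
A minimal sketch of this forward pass, $h_t = g(W_t \cdot h_{t-1})$ with the input playing the role of $h_0$. The layer sizes, the sigmoid choice for $g$, and the linear output layer are illustrative assumptions, not taken from the slides.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    x = rng.normal(size=4)                            # input X, acting as h_0
    Ws = [rng.normal(size=(4, 4)) for _ in range(3)]  # W_1, W_2, W_3 (hidden layers)
    W_out = rng.normal(size=(1, 4))                   # W_4, mapping h_3 to the output

    h = x
    for W in Ws:
        h = sigmoid(W @ h)                            # h_t = g(W_t . h_{t-1})
    y_hat = W_out @ h                                 # Y_hat (no output activation; an assumption)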

  3. Recurrent Neural Net http://colah.github.io/posts/2015-08-Understanding-LSTMs/

  4. Recurrent Neural Net. Unrolled diagram: inputs $X_1, X_2, X_3, X_4$, initial state $h_0 = [0]$, hidden states $h_1, h_2, h_3$, and output $\hat{Y}$, with an activation at every step. The update is $h_t = g(W \cdot h_{t-1})$, where $g$ is a non-linear activation function (sigmoid, tanh, ReLU, Softplus, Maxout, ...). Notice that the weights are the same at every timestep.
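
A minimal sketch of the unrolled recurrence with a single shared $W$, following the slide's simplified update $h_t = g(W \cdot h_{t-1})$ (the per-step inputs are left out of the formula, as on the slide). The sizes, the sigmoid choice, and the hypothetical output weights V are assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 4))   # one recurrent weight matrix, reused at every step
    V = rng.normal(size=(1, 4))   # hypothetical hidden-to-output weights
    h = np.zeros(4)               # h_0 = [0]

    for t in range(4):
        h = sigmoid(W @ h)        # h_t = g(W . h_{t-1}); same W for every t
    y_hat = V @ h                 # read the output off the last state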

  5. Recurrent Neural Network – One Timestep. A single step takes $h_{t-1}$ (and the input $X_t$), multiplies by $W$, applies $g$, and produces $h_t$.

  6. Gradient Descent. We want $\partial D / \partial W_1$, $\partial D / \partial W_2$, $\partial D / \partial W_3$, ... so that we can do gradient descent: $W_{new} = W_{old} - \alpha \, \partial D / \partial W_{old}$. The same setup applies to the feedforward net ($X, h_1, h_2, h_3, \hat{Y}$) and to the RNN (inputs $X_1, \ldots, X_4$, states $h_0, \ldots, h_3$, output $\hat{Y}$). Here $D(\hat{Y}, Y)$ is a cost function (squared error, cross-entropy, ...).
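
A minimal sketch of the update rule $W_{new} = W_{old} - \alpha \, \partial D / \partial W_{old}$. The gradient values and the learning rate are arbitrary stand-ins; in practice the gradient comes from backprop (next slide).

    import numpy as np

    def gradient_step(W_old, grad_D, alpha=0.1):
        # one gradient-descent update: W_new = W_old - alpha * dD/dW_old
        return W_old - alpha * grad_D

    W = np.array([[0.5, -0.2], [0.3, 0.8]])
    grad = np.array([[0.1, 0.0], [-0.05, 0.2]])   # hypothetical dD/dW from backprop
    W = gradient_step(W, grad)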

  7. Backprop (Chain Rule). We want $\partial D / \partial W_1$, $\partial D / \partial W_2$, $\partial D / \partial W_3$, ... for the network $X \to h_1 \to h_2 \to h_3 \to \hat{Y}$ with $h_t = g(W_t \cdot h_{t-1})$, $g$ a non-linear activation function (sigmoid, tanh, ReLU, Softplus, ...), and $D$ a cost function (squared error, cross-entropy, ...).
     Example with squared error, $D(\hat{Y}, Y) = (\hat{Y} - Y)^2$:
     • $\partial D / \partial \hat{Y} = 2(\hat{Y} - Y)$  (derivative of the cost function)
     • $\partial D / \partial W = (\partial D / \partial \hat{Y}) \cdot (\partial \hat{Y} / \partial W)$
     • For a sigmoid layer $h_t = \sigma(W_t \cdot h_{t-1})$: $\partial h_t / \partial (W_t \cdot h_{t-1}) = h_t (1 - h_t)$  (derivative of the activation function)
     • Chaining through layers: $\partial D / \partial h_2 = (\partial D / \partial \hat{Y}) \cdot (\partial \hat{Y} / \partial h_2) = (\partial D / \partial \hat{Y}) \cdot (\partial \hat{Y} / \partial h_3) \cdot (\partial h_3 / \partial h_2)$
     • In general, $\partial h_t / \partial h_{t-1} = \partial g(W_t \cdot h_{t-1}) / \partial h_{t-1} = g'(W_t \cdot h_{t-1}) \, W_t$
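
A minimal scalar sketch of this chain rule, assuming sigmoid layers $h_t = \sigma(w_t h_{t-1})$, a linear output $\hat{Y} = w_4 h_3$, and squared error. The weight values and target are arbitrary numbers chosen for illustration.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    w = [None, 1.5, -0.8, 0.6, 1.1]               # w_1 .. w_4 (index 0 unused)
    h = [0.5]                                     # h_0 = input x
    for t in range(1, 4):
        h.append(sigmoid(w[t] * h[t - 1]))        # h_t = g(w_t * h_{t-1})
    y_hat, y = w[4] * h[3], 1.0                   # prediction and target

    dD_dyhat = 2.0 * (y_hat - y)                  # dD/dY_hat = 2(Y_hat - Y)
    dD_dh3 = dD_dyhat * w[4]                      # chain through Y_hat = w_4 * h_3
    dh3_dh2 = h[3] * (1.0 - h[3]) * w[3]          # g'(w_3 h_2) * w_3 for the sigmoid
    dD_dh2 = dD_dh3 * dh3_dh2                     # dD/dh_2 via the chain rule
    dD_dw3 = dD_dh3 * h[3] * (1.0 - h[3]) * h[2]  # and finally a weight gradient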

  8. Vanishing/Exploding Gradient. For the same network, $\partial h_t / \partial h_{t-1} = \partial g(W_t \cdot h_{t-1}) / \partial h_{t-1} = g'(W_t \cdot h_{t-1}) \, W_t$, so
     $\partial D / \partial W = (\partial D / \partial \hat{Y}) \cdot (\partial \hat{Y} / \partial h_3) \cdot (\partial h_3 / \partial h_2) \cdots = (\partial D / \partial \hat{Y}) \cdot \big( g'(W_t \cdot h_{t-1}) \, W_t \big)^T$
     where $T$ = number of layers = number of timesteps. For NNs, $t$ goes from $T$ to $0$; for RNNs, $W$ is the same for every $t$.
     • If $g'(W_t \cdot h_{t-1}) \, W_t > 1$, the product $\big( g'(W_t \cdot h_{t-1}) \, W_t \big)^T$ explodes.
     • If $g'(W_t \cdot h_{t-1}) \, W_t < 1$, the gradient vanishes.
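
A minimal numeric sketch of that repeated factor: write $a = g'(W \cdot h_{t-1}) \, W$ and multiply it $T$ times. The values $a = 0.9$, $a = 1.1$, and $T = 50$ are arbitrary illustrations.

    for a in (0.9, 1.1):          # per-layer factor a = g'(W . h_{t-1}) * W
        product = 1.0
        for _ in range(50):       # T = 50 layers / timesteps
            product *= a
        print(f"a = {a}: a^50 = {product:.2e}")
    # a = 0.9: a^50 = 5.15e-03   (gradient vanishes)
    # a = 1.1: a^50 = 1.17e+02   (gradient explodes)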

  9. ResNet / Highway Net / GRU / LSTM
     • NNs: ResNet (2015), Highway Net (2015)
     • RNNs: LSTM (1997), GRU (2014)
     K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.

  10. Residual Network (ResNet). The layer update becomes $h_t = g(W_t \cdot h_{t-1}) + h_{t-1}$.
      Idea:
      • If a layer is useless (i.e., it loses information), the network can skip it.
      • It is easier for the network to learn zero weights than to learn the identity.
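
A minimal sketch of the residual update $h_t = g(W_t \cdot h_{t-1}) + h_{t-1}$, assuming tanh for $g$ and a small layer size. With zero weights the layer reduces to the identity, which is the point of the skip connection.

    import numpy as np

    def residual_layer(W, h_prev):
        return np.tanh(W @ h_prev) + h_prev   # g(W_t . h_{t-1}) + h_{t-1}

    h = np.ones(4)
    W_zero = np.zeros((4, 4))
    print(residual_layer(W_zero, h))          # tanh(0) + h = h: the layer passes h through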

  11. ResNet Gradient. With $h_t = g(W_t \cdot h_{t-1}) + h_{t-1}$:
      $\partial h_t / \partial h_{t-1} = \partial [\, g(W_t \cdot h_{t-1}) + h_{t-1} \,] / \partial h_{t-1} = g'(W_t \cdot h_{t-1}) \, W_t + 1$
      so the product over layers becomes $\big( g'(W_t \cdot h_{t-1}) \, W_t + 1 \big)^T = (a + 1)^T$, where $a = g'(W_t \cdot h_{t-1}) \, W_t$:
      $(a + 1)^2 = a^2 + 2a + 1$
      $(a + 1)^3 = a^3 + 3a^2 + 3a + 1$
      $(a + 1)^4 = a^4 + 4a^3 + 6a^2 + 4a + 1$
      • Vanishing gradient problem: the constant term means the gradient persists through layers.
      • Exploding gradient problem: handled with weight decay, weight norm, layer norm, batch norm, ...
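
A minimal numeric sketch of why the $+1$ helps, with an arbitrary small per-layer factor $a = 0.05$ and $T = 50$.

    a, T = 0.05, 50
    print(a ** T)          # plain layers:    ~8.9e-66, the gradient vanishes
    print((a + 1) ** T)    # residual layers: ~11.5, the gradient persists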

  12. Highway Network. Add a gate $C$ to each layer: $h_t = g(W_t \cdot h_{t-1}) \cdot C + h_{t-1} \cdot (1 - C)$, with $C = \sigma(W_{t,2} \cdot h_{t-1})$, where $\sigma$ is the sigmoid, since its output lies in $(0, 1)$.
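
A minimal sketch of one highway layer under these formulas, assuming tanh for $g$ and a small layer size; W and W2 stand for $W_t$ and $W_{t,2}$. When $C \to 0$ the layer copies $h_{t-1}$ through; when $C \to 1$ it behaves like a plain layer.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def highway_layer(W, W2, h_prev):
        C = sigmoid(W2 @ h_prev)                            # transform gate in (0, 1)
        return np.tanh(W @ h_prev) * C + h_prev * (1.0 - C) # g(W h) * C + h * (1 - C)

    rng = np.random.default_rng(0)
    h = rng.normal(size=4)
    h_next = highway_layer(rng.normal(size=(4, 4)), rng.normal(size=(4, 4)), h)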

  13. Highway Net Gradient.
      $h_t = g(W_t \cdot h_{t-1}) \cdot C + h_{t-1} \cdot (1 - C)$, with $C = \sigma(W_{t,2} \cdot h_{t-1})$
      $\;\; = g(W_t \cdot h_{t-1}) \cdot \sigma(W_{t,2} \cdot h_{t-1}) + h_{t-1} \cdot \big(1 - \sigma(W_{t,2} \cdot h_{t-1})\big)$
      $\;\; = g(W_t \cdot h_{t-1}) \cdot \sigma(W_{t,2} \cdot h_{t-1}) + h_{t-1} - \sigma(W_{t,2} \cdot h_{t-1}) \cdot h_{t-1}$
      The bare $h_{t-1}$ term contributes $\partial h_{t-1} / \partial h_{t-1} = 1$ to $\partial h_t / \partial h_{t-1}$.
      • Vanishing gradient problem: the gradient persists through layers.
      • Exploding gradient problem: weight decay, weight norm, layer norm, batch norm, ...

  14. Back to RNNs. Inputs $X_1, \ldots, X_4$, initial state $h_0 = [0]$, hidden states $h_1, h_2, h_3$, output $\hat{Y}$, with $h_t = g(W \cdot h_{t-1})$, where $g$ is a non-linear activation function (sigmoid, tanh, ReLU, Softplus, Maxout, ...). Note: the weights are the same at every timestep, and $\partial h_t / \partial h_{t-1} = \partial g(W \cdot h_{t-1}) / \partial h_{t-1} = g'(W \cdot h_{t-1}) \, W$, so the gradient vanishes or explodes.

  15. RNN: Gated Recurrent Unit. Apply the same gating idea to the recurrence, with the weights shared across timesteps: $h_t = g(W \cdot h_{t-1}) \cdot C + h_{t-1} \cdot (1 - C)$, with $C = \sigma(W_2 \cdot h_{t-1})$, where $\sigma$ is the sigmoid, since its output lies in $(0, 1)$. (Diagram: the candidate $g(W \cdot h_{t-1})$ and the previous state $h_{t-1}$ are mixed through the gate to give $h_t$.)
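
A minimal sketch of this gated recurrence, reusing one W and one gate matrix W2 at every step, as in the simplified cell above (not the full GRU of Cho et al.). The tanh choice for $g$, the sizes, and the nonzero starting state are assumptions; the slide's formula omits per-step inputs, so without a nonzero start nothing would change.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    W, W2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
    h = rng.normal(size=4)                       # nonzero start (slide omits per-step inputs)

    for t in range(4):
        C = sigmoid(W2 @ h)                      # gate, shared W2 at every step
        h = np.tanh(W @ h) * C + h * (1.0 - C)   # gated update, shared W at every step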

  16. RNN: Another view of GRUs. The same gated cell unrolled over time: inputs $X_1, \ldots, X_4$, states $h_0, \ldots, h_3$, output $\hat{Y}$, with $h_t = g(W \cdot h_{t-1}) \cdot C + h_{t-1} \cdot (1 - C)$ and $C = \sigma(W_2 \cdot h_{t-1})$ at every step.

  17. GRU/LSTM: More Gates. (Side-by-side diagrams of the GRU and LSTM cells.)

  18. Memory Concerns. If T = 10000, you need to keep 10000 activations/states in memory to backprop through them.
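
A back-of-the-envelope sketch of what that costs; the hidden size, batch size, and float32 storage are assumed numbers, only T = 10000 comes from the slide.

    T, hidden_size, batch, bytes_per_float = 10_000, 1_000, 32, 4   # only T is from the slide
    total_bytes = T * hidden_size * batch * bytes_per_float
    print(f"{total_bytes / 1e9:.2f} GB of stored states")           # 1.28 GB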

  19. Deep Network Gradients: Conclusion
      • The models we saw all use the same idea.
      • NNs: ResNet (2015), Highway Net (2015). RNNs: LSTM (1997), GRU (2014). Neither ResNet nor Highway Net references GRUs/LSTMs.
      • One of the earlier uses of skip connections was in the Nonlinear AutoRegressive with eXogenous inputs method (NARX; Lin et al., 1996), where they improved the RNN's ability to infer finite state machines (Ilya Sutskever, PhD thesis, 2013).

  20. Thanks
