
Gradients of Deep Networks. Chris Cremer, March 29, 2017. PowerPoint PPT Presentation.



  1. Gradients of Deep Networks. Chris Cremer, March 29, 2017.

  2. Neural Net
  A feed-forward network: input $X$, hidden activations $A_1, A_2, A_3$, output $\hat{Y}$, with weight matrices $W_1, \dots, W_4$ between layers and an activation applied at each hidden layer.
  $A_t = f(W_t \cdot A_{t-1})$, where $f$ is a non-linear activation function (sigmoid, tanh, ReLU, Softplus, Maxout, ...).
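
  A minimal NumPy sketch of this forward pass; the layer sizes, the choice of sigmoid for $f$, and all variable names below are illustrative, not taken from the slides:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Illustrative sizes: input dim 5, three hidden layers of width 8, output dim 1.
    sizes = [5, 8, 8, 8, 1]
    rng = np.random.default_rng(0)
    W = [rng.normal(scale=0.1, size=(sizes[t + 1], sizes[t])) for t in range(4)]  # W_1..W_4

    X = rng.normal(size=(5,))     # input
    A = X                         # A_0 is the input
    for t in range(4):
        A = sigmoid(W[t] @ A)     # A_t = f(W_t . A_{t-1})
    Y_hat = A                     # prediction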

  3. Recurrent Neural Net http://colah.github.io/posts/2015-08-Understanding-LSTMs/

  4. Recurrent Neural Net
  The unrolled network: inputs $X_1, X_2, X_3, X_4$, initial hidden state $A_0 = [0]$, hidden states $A_1, A_2, A_3$, output $\hat{Y}$, with an activation at every step.
  $A_t = f(W \cdot A_{t-1})$, where $f$ is a non-linear activation function (sigmoid, tanh, ReLU, Softplus, Maxout, ...). Notice that the weights are the same at every timestep.
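
  A sketch of the unrolled recurrence with a single shared weight matrix; as in the slide's simplified formula, the input terms are left out, and the sizes are made up:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    hidden = 8
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(hidden, hidden))   # one W, reused at every timestep

    A = np.zeros(hidden)           # A_0 = [0]
    states = [A]
    for t in range(4):             # four timesteps, same W each time
        A = sigmoid(W @ A)         # A_t = f(W . A_{t-1})  (inputs X_t omitted, as in the formula above)
        states.append(A)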

  5. Recurrent Neural Network – One Timestep
  The previous state $A_{t-1}$ and the input $X_t$ are multiplied by $W$ and passed through the activation $f$ to produce $A_t$.

  6. Gradient Descent
  The cost $C(\hat{Y}, Y)$ is computed at the output of either network: the feed-forward net (input $X$, weights $W_1, \dots, W_4$, activations $A_1, A_2, A_3$, prediction $\hat{Y}$) or the RNN (inputs $X_1, \dots, X_4$, shared weights $W$, states $A_0, \dots, A_3$, prediction $\hat{Y}$).
  • We want $\partial C / \partial W_1$, $\partial C / \partial W_2$, $\partial C / \partial W_3$, ... so that we can do gradient descent: $W_{new} = W_{old} - \alpha \, \partial C / \partial W_{old}$
  Where $C$ is a cost function (squared error, cross-entropy, ...).
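
  The update rule as a short sketch, assuming the per-layer gradients $\partial C / \partial W_t$ have already been computed by backprop (random arrays stand in for them here):

    import numpy as np

    rng = np.random.default_rng(0)
    W = [rng.normal(size=(8, 8)) for _ in range(4)]       # current weights W_1..W_4
    grads = [rng.normal(size=(8, 8)) for _ in range(4)]   # stand-ins for dC/dW_t from backprop

    alpha = 0.1                                           # learning rate
    for t in range(len(W)):
        W[t] = W[t] - alpha * grads[t]                    # W_new = W_old - alpha * dC/dW_old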

  7. Backprop (Chain Rule)
  We want $\partial C / \partial W_1$, $\partial C / \partial W_2$, $\partial C / \partial W_3$, ...
  Example: layers $A_t = f(W_t \cdot A_{t-1})$ with squared-error cost $C(\hat{Y}, Y) = (\hat{Y} - Y)^2$.
  • Derivative of the cost function: $\partial C / \partial \hat{Y} = 2(\hat{Y} - Y)$
  • Derivative of the activation function: for the sigmoid, $A_t = \sigma(W_t \cdot A_{t-1})$ gives $\partial A_t / \partial (W_t \cdot A_{t-1}) = A_t (1 - A_t)$
  • Last layer: $\partial C / \partial W_4 = \partial C / \partial \hat{Y} \cdot \partial \hat{Y} / \partial W_4$
  • Earlier layers chain through the hidden activations: $\partial C / \partial A_2 = \partial C / \partial \hat{Y} \cdot \partial \hat{Y} / \partial A_2 = \partial C / \partial \hat{Y} \cdot \partial \hat{Y} / \partial A_3 \cdot \partial A_3 / \partial A_2$
  • In general, $\partial A_t / \partial A_{t-1} = \partial f(W_t \cdot A_{t-1}) / \partial A_{t-1} = f'(W_t \cdot A_{t-1}) \, W_t$
  Where $f$ is a non-linear activation function (sigmoid, tanh, ReLU, Softplus, ...) and $C$ is a cost function (squared error, cross-entropy, ...).
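
  A sketch of this chain rule for a small all-sigmoid network with squared-error cost; shapes and names are illustrative:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    sizes = [5, 8, 8, 8, 1]
    W = [rng.normal(scale=0.5, size=(sizes[t + 1], sizes[t])) for t in range(4)]

    X = rng.normal(size=(5,))
    Y = np.array([1.0])                       # target

    # Forward pass, keeping every activation for the backward pass.
    A = [X]
    for t in range(4):
        A.append(sigmoid(W[t] @ A[-1]))       # A_t = sigma(W_t . A_{t-1})
    Y_hat = A[-1]
    C = np.sum((Y_hat - Y) ** 2)              # squared-error cost

    # Backward pass (chain rule).
    dC_dA = 2.0 * (Y_hat - Y)                 # dC/dY_hat: derivative of the cost
    grads = [None] * 4
    for t in reversed(range(4)):
        dC_dZ = dC_dA * A[t + 1] * (1.0 - A[t + 1])   # sigmoid derivative: A_t (1 - A_t)
        grads[t] = np.outer(dC_dZ, A[t])              # dC/dW_t
        dC_dA = W[t].T @ dC_dZ                        # dC/dA_{t-1}, passed to the layer below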

  8. Vanishing/Exploding Gradient
  $\partial A_t / \partial A_{t-1} = \partial f(W_t \cdot A_{t-1}) / \partial A_{t-1} = f'(W_t \cdot A_{t-1}) \, W_t$
  Chaining this factor through every layer, $\partial C / \partial W_1 = \partial C / \partial \hat{Y} \cdot \partial \hat{Y} / \partial A_3 \cdot \partial A_3 / \partial A_2 \cdot \dots = \partial C / \partial \hat{Y} \cdot \left( f'(W_t \cdot A_{t-1}) \, W_t \right)^T$
  $T$ = number of layers = number of timesteps. For NNs, $t$ goes from $T$ to $0$; for RNNs, $W$ is the same for every $t$.
  • If $f'(W_t \cdot A_{t-1}) \, W_t > 1$: $\left( f'(W_t \cdot A_{t-1}) \, W_t \right)^T$ grows and the gradient explodes.
  • If $f'(W_t \cdot A_{t-1}) \, W_t < 1$: it shrinks and the gradient vanishes.
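
  A toy numeric check of how that repeated factor behaves; the scalars 0.9 and 1.1 are arbitrary stand-ins for $f'(W_t \cdot A_{t-1}) \, W_t$:

    # Scalar stand-ins for f'(W_t . A_{t-1}) * W_t, multiplied over T layers/timesteps.
    T = 50
    for factor in (0.9, 1.1):
        print(f"factor {factor}: product over {T} steps = {factor ** T:.3e}")
    # factor 0.9 -> ~5e-03  (gradient vanishes)
    # factor 1.1 -> ~1e+02  (gradient explodes)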

  9. ResNet / Highway Net / GRU / LSTM
  • NNs: ResNet (2015), Highway Net (2015)
  • RNNs: LSTM (1997), GRU (2014)
  K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.

  10. Residual Network (ResNet)
  $A_t = f(W_t \cdot A_{t-1}) + A_{t-1}$
  Idea:
  • If a layer is useless (i.e., it would lose information), the network can skip it.
  • It is easier for the network to learn zero weights (and fall back on the identity skip connection) than to learn the identity mapping itself.
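
  A sketch of a stack of residual layers; equal layer widths are assumed so the skip connection can be a plain addition, and all names are illustrative:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    hidden = 8
    rng = np.random.default_rng(0)
    W = [rng.normal(scale=0.1, size=(hidden, hidden)) for _ in range(3)]

    A = rng.normal(size=(hidden,))       # first hidden activation
    for t in range(3):
        A = sigmoid(W[t] @ A) + A        # A_t = f(W_t . A_{t-1}) + A_{t-1}  (skip connection)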

  11. ResNet Gradient
  $A_t = f(W_t \cdot A_{t-1}) + A_{t-1}$
  $\partial A_t / \partial A_{t-1} = \partial \left[ f(W_t \cdot A_{t-1}) + A_{t-1} \right] / \partial A_{t-1} = f'(W_t \cdot A_{t-1}) \, W_t + 1$
  Over $T$ layers the chained factor is $\left( f'(W_t \cdot A_{t-1}) \, W_t + 1 \right)^T = (a + 1)^T$, where $a = f'(W_t \cdot A_{t-1}) \, W_t$:
  $(a + 1)^2 = a^2 + 2a + 1$
  $(a + 1)^3 = a^3 + 3a^2 + 3a + 1$
  $(a + 1)^4 = a^4 + 4a^3 + 6a^2 + 4a + 1$
  • Vanishing gradient problem: avoided, because the $+1$ term lets the gradient persist through the layers.
  • Exploding gradient problem: still possible; addressed with weight decay, weight norm, layer norm, batch norm, ...
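
  A quick numeric illustration of why the $+1$ matters, with $a = 0.1$ as an arbitrary stand-in for $f'(W_t \cdot A_{t-1}) \, W_t$:

    a, T = 0.1, 10
    print(a ** T)          # 1e-10: plain deep net, the gradient vanishes
    print((a + 1) ** T)    # ~2.59: ResNet, the +1 keeps the gradient alive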

  12. Highway Network
  Each layer adds a gate on top of the feed-forward net (input $X$, weights $W_1, \dots, W_4$, activations $A_1, A_2, A_3$, prediction $\hat{Y}$):
  $A_t = f(W_t \cdot A_{t-1}) \cdot B + A_{t-1} \cdot (1 - B)$, where $B = \sigma(W_{t,2} \cdot A_{t-1})$
  $\sigma$ = sigmoid, since its output lies in $(0, 1)$.
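
  A sketch of a stack of highway layers; the gate's second weight matrix is written W2 here, and all sizes and names are made up:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    hidden = 8
    rng = np.random.default_rng(0)
    W  = [rng.normal(scale=0.1, size=(hidden, hidden)) for _ in range(3)]   # transform weights W_t
    W2 = [rng.normal(scale=0.1, size=(hidden, hidden)) for _ in range(3)]   # gate weights W_{t,2}

    A = rng.normal(size=(hidden,))
    for t in range(3):
        B = sigmoid(W2[t] @ A)                      # gate B = sigma(W_{t,2} . A_{t-1}), in (0, 1)
        A = sigmoid(W[t] @ A) * B + A * (1.0 - B)   # A_t = f(W_t . A_{t-1}) * B + A_{t-1} * (1 - B)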

  13. Highway Net Gradient
  $A_t = f(W_t \cdot A_{t-1}) \cdot B + A_{t-1} \cdot (1 - B)$, with $B = \sigma(W_{t,2} \cdot A_{t-1})$
  $= f(W_t \cdot A_{t-1}) \cdot \sigma(W_{t,2} \cdot A_{t-1}) + A_{t-1} \cdot \left( 1 - \sigma(W_{t,2} \cdot A_{t-1}) \right)$
  $= f(W_t \cdot A_{t-1}) \cdot \sigma(W_{t,2} \cdot A_{t-1}) + A_{t-1} - \sigma(W_{t,2} \cdot A_{t-1}) \cdot A_{t-1}$
  The standalone $A_{t-1}$ term contributes $\partial A_{t-1} / \partial A_{t-1} = 1$, so $\partial A_t / \partial A_{t-1}$ again contains a direct $+1$ path, just as in ResNet.
  • Vanishing gradient problem: avoided, the gradient persists through the layers.
  • Exploding gradient problem: still possible; addressed with weight decay, weight norm, layer norm, batch norm, ...

  14. Back to RNNs
  Inputs $X_1, \dots, X_4$, initial state $A_0 = [0]$, hidden states $A_1, A_2, A_3$, output $\hat{Y}$, with an activation at every step.
  $A_t = f(W \cdot A_{t-1})$, where $f$ is a non-linear activation function (sigmoid, tanh, ReLU, Softplus, Maxout, ...). Note: the weights are the same at every timestep.
  Vanishing/exploding gradient: $\partial A_t / \partial A_{t-1} = \partial f(W \cdot A_{t-1}) / \partial A_{t-1} = f'(W \cdot A_{t-1}) \, W$.

  15. RNN: Gated Recurrent Unit
  The cell mixes the previous state with a new candidate using a gate:
  $A_t = f(W \cdot A_{t-1}) \cdot B + A_{t-1} \cdot (1 - B)$, where $B = \sigma(W_2 \cdot A_{t-1})$
  $\sigma$ = sigmoid, since its output lies in $(0, 1)$.
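
  A sketch of this gated recurrence with both weight matrices shared across timesteps; as in the formula above, the inputs $X_t$ are omitted (a full GRU also conditions the gate and candidate on the current input):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    hidden = 8
    rng = np.random.default_rng(0)
    W  = rng.normal(scale=0.1, size=(hidden, hidden))   # candidate weights, shared over time
    W2 = rng.normal(scale=0.1, size=(hidden, hidden))   # gate weights, shared over time

    A = np.zeros(hidden)                                # A_0 = [0]
    for t in range(4):                                  # same W and W2 at every timestep
        B = sigmoid(W2 @ A)                             # gate B = sigma(W2 . A_{t-1})
        A = sigmoid(W @ A) * B + A * (1.0 - B)          # A_t = f(W . A_{t-1}) * B + A_{t-1} * (1 - B)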

  16. RNN: Another view of GRUs
  The same gated update, drawn inside the unrolled RNN (inputs $X_1, \dots, X_4$, states $A_0, \dots, A_3$, prediction $\hat{Y}$):
  $A_t = f(W \cdot A_{t-1}) \cdot B + A_{t-1} \cdot (1 - B)$, where $B = \sigma(W_2 \cdot A_{t-1})$
  $\sigma$ = sigmoid, since its output lies in $(0, 1)$.

  17. GRU/LSTM: More Gates
  (Diagrams of the full GRU and LSTM cells, which add further gates on top of the simple gated update above.)
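
  For contrast, a sketch of a standard GRU step with both an update gate and a reset gate and with the input included; the weight names (Wz, Uz, ...) are illustrative, not from the slides:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    hidden, inp = 8, 5
    rng = np.random.default_rng(0)
    Wz, Uz = rng.normal(size=(hidden, inp)), rng.normal(size=(hidden, hidden))   # update gate
    Wr, Ur = rng.normal(size=(hidden, inp)), rng.normal(size=(hidden, hidden))   # reset gate
    Wh, Uh = rng.normal(size=(hidden, inp)), rng.normal(size=(hidden, hidden))   # candidate state

    def gru_step(x_t, h_prev):
        z = sigmoid(Wz @ x_t + Uz @ h_prev)              # update gate
        r = sigmoid(Wr @ x_t + Ur @ h_prev)              # reset gate
        h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev))  # candidate state
        return (1.0 - z) * h_prev + z * h_tilde          # mix old state and candidate

    h = np.zeros(hidden)
    for x_t in rng.normal(size=(4, inp)):                # four timesteps of input
        h = gru_step(x_t, h)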

  18. Memory Concerns • If T=10000, you need to keep 10000 activations/states in memory
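
  A rough back-of-the-envelope version of that memory cost, assuming a hypothetical hidden size of 1024, a batch of 32 sequences, and float32 storage:

    # Activation memory for backprop through time under the assumptions above.
    T, hidden, batch, bytes_per_float = 10_000, 1024, 32, 4
    total_bytes = T * hidden * batch * bytes_per_float
    print(f"{total_bytes / 1e9:.1f} GB of stored states")   # ~1.3 GB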

  19. Deep Network Gradients: Conclusion
  • The models we saw all use the same idea.
  • One of the earlier uses of skip connections was in the Nonlinear AutoRegressive with eXogenous inputs method (NARX; Lin et al., 1996), where they improved the RNN's ability to infer finite state machines. (Ilya Sutskever, PhD thesis, 2013)
  • NNs: ResNet (2015), Highway Net (2015). RNNs: LSTM (1997), GRU (2014). Neither ResNet nor Highway Net references GRUs/LSTMs.

  20. Thanks
