Gradients of Deep Networks
Chris Cremer, March 29, 2017
Neural Net

Input X → Hidden Activation B_1 → Hidden Activation B_2 → Hidden Activation B_3 → Output Ẑ, with weight matrices X_1, X_2, X_3, X_4 between the layers.

B_t = g(X_t · B_{t-1}), with B_0 = X (the input)

where g = non-linear activation function (sigmoid, tanh, ReLU, Softplus, Maxout, …)
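A minimal NumPy sketch of this forward pass, assuming vector activations and tanh for g; the function name, layer sizes, and the choice to treat the last activation as Ẑ are illustrative, not from the slides:

```python
import numpy as np

def g(a):
    # Non-linear activation; tanh here, any of the listed choices would do
    return np.tanh(a)

def forward(weights, x):
    """Forward pass: B_t = g(X_t . B_{t-1}), with B_0 = x, the input."""
    B = x
    for X_t in weights:            # X_1 ... X_4
        B = g(X_t @ B)
    return B                       # the last activation plays the role of Z_hat

rng = np.random.default_rng(0)
weights = [0.5 * rng.normal(size=(5, 5)) for _ in range(4)]
x = rng.normal(size=5)
Z_hat = forward(weights, x)
```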
Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Recurrent Neural Net (RNN)

Unrolled: Init Activation B_0 = [0] → Hidden Activation B_1 → Hidden Activation B_2 → Hidden Activation B_3 → Output Ẑ, with an input Y_1, Y_2, Y_3, Y_4 feeding each step.

B_t = g(X · B_{t-1})

where g = non-linear activation function (sigmoid, tanh, ReLU, Softplus, Maxout, …). Notice that the weights X are the same at every timestep.

Recurrent cell: B_{t-1} and Y_t enter, g is applied, B_t comes out.
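Continuing the sketch above, the same recurrence with one shared weight matrix X. The slide's formula omits the Y_t input term, but the diagram feeds one in at every step; the input weights V below are a hypothetical addition, without which B_0 = [0] and tanh would keep every B_t at zero:

```python
def rnn_forward(X, V, Ys):
    B = np.zeros(X.shape[0])       # B_0 = [0]
    for Y_t in Ys:                 # Y_1 ... Y_4
        B = g(X @ B + V @ Y_t)     # B_t = g(X . B_{t-1} + V . Y_t)
    return B

X = 0.5 * rng.normal(size=(5, 5))  # one X, reused at every timestep
V = 0.5 * rng.normal(size=(5, 3))  # hypothetical input weights
Ys = [rng.normal(size=3) for _ in range(4)]
Z_hat = rnn_forward(X, V, Ys)
```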
Cost

X → B_1 → B_2 → B_3 → Ẑ, then compare the output against the target: D(Ẑ, Z)

where D is a cost function (Squared error, Cross-Entropy, …)
We want ∂D/∂X_1, ∂D/∂X_2, ∂D/∂X_3, … so that we can do gradient descent:

X_new = X_old − β · ∂D/∂X_old
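The update itself is one line per weight matrix; computing the gradients is what the rest of the talk derives. A sketch, with β = 0.1 as an arbitrary learning rate:

```python
def sgd_step(weights, grads, beta=0.1):
    # X_new = X_old - beta * dD/dX_old, applied to every weight matrix
    return [X_old - beta * dD_dX for X_old, dD_dX in zip(weights, grads)]
```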
The chain rule gives each gradient as a product of local derivatives:

X → B_1 → B_2 → B_3 → Ẑ → D(Ẑ, Z)

∂D/∂Ẑ = derivative of the cost function

∂D/∂X_4 = ∂D/∂Ẑ · ∂Ẑ/∂X_4   (∂Ẑ/∂X_4 is a derivative of the activation function)

∂D/∂B_2 = ∂D/∂Ẑ · ∂Ẑ/∂B_2 = ∂D/∂Ẑ · ∂Ẑ/∂B_3 · ∂B_3/∂B_2

Since B_t = g(X_t · B_{t-1}):

∂B_t/∂B_{t-1} = ∂g(X_t · B_{t-1})/∂B_{t-1} = g'(X_t · B_{t-1}) · X_t

g = non-linear activation function (sigmoid, tanh, ReLU, Softplus, …)
D = cost function (Squared error, Cross-Entropy, …)
𝐷 𝑍 ", 𝑍 = (𝑍 " − 𝑍)% 𝜖𝐷 𝜖𝑍 " = 2(𝑍 " − 𝑍) 𝐵) = 𝜏 𝑋
) - 𝐵).$
𝜖𝐵) 𝜖𝐵).$ = 𝐵)(1 − 𝐵)) Example We want
78 9
: ,
78 9
;,
78 9
< …
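Putting the example together: a scalar sketch of the whole backward pass (sigmoid activations, squared error, three layers; all names and values here are illustrative), checked against finite differences:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward_all(xs, b0):
    """All activations B_0..B_T for B_t = sigmoid(x_t * B_{t-1})."""
    Bs = [b0]
    for x_t in xs:
        Bs.append(sigmoid(x_t * Bs[-1]))
    return Bs

def backprop(xs, b0, Z):
    Bs = forward_all(xs, b0)
    delta = 2.0 * (Bs[-1] - Z)                    # dD/dZ_hat for D = (Z_hat - Z)^2
    grads = [0.0] * len(xs)
    for t in range(len(xs) - 1, -1, -1):          # last layer back to the first
        g_prime = Bs[t + 1] * (1.0 - Bs[t + 1])   # sigmoid': B_t (1 - B_t)
        grads[t] = delta * g_prime * Bs[t]        # dD/dx_t
        delta = delta * g_prime * xs[t]           # dD/dB_{t-1}: the g' x_t factor
    return grads

xs, b0, Z = [0.5, -1.2, 0.8], 0.7, 0.3
grads = backprop(xs, b0, Z)

# sanity check against central finite differences
eps = 1e-6
for t in range(len(xs)):
    xp = list(xs); xp[t] += eps
    xm = list(xs); xm[t] -= eps
    num = ((forward_all(xp, b0)[-1] - Z) ** 2
           - (forward_all(xm, b0)[-1] - Z) ** 2) / (2 * eps)
    assert abs(num - grads[t]) < 1e-6
```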
X → B_1 → B_2 → B_3 → Ẑ → D(Ẑ, Z)

Each step back through the network multiplies in another factor ∂B_t/∂B_{t-1} = g'(X_t · B_{t-1}) · X_t, so

∂D/∂X_1 = ∂D/∂Ẑ · ∂Ẑ/∂B_3 · ∂B_3/∂B_2 · ∂B_2/∂B_1 · ∂B_1/∂X_1 ≈ ∂D/∂Ẑ · (g'(X_t · B_{t-1}) · X_t)^T

T = number of layers = number of timesteps. For NNs, t goes from T to 0; for RNNs, X is the same for every t.

(g'(X_t · B_{t-1}) · X_t)^T:
if g'(X_t · B_{t-1}) · X_t > 1 → Gradient Explodes
if g'(X_t · B_{t-1}) · X_t < 1 → Gradient Vanishes
Cho, K., van Merriënboer, B., Bahdanau, D., & Bengio, Y. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
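A two-line illustration of that T-th power (the factor values and the exponent 50 are arbitrary choices):

```python
# The backward pass multiplies T factors of a = g'(X_t . B_{t-1}) . X_t,
# so it behaves like a**T:
for a in (0.9, 1.0, 1.1):
    print(f"a = {a}: a**50 = {a ** 50:.3g}")
# a = 0.9 -> 0.00515 (vanishes); a = 1.0 -> 1; a = 1.1 -> 117 (explodes)
```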
Residual Connections

X → B_1 → B_2 → B_3 → Ẑ, now with a skip connection around each layer:

B_t = g(X_t · B_{t-1}) + B_{t-1}

Idea: if a layer adds no new information, the network can skip it; it is easier for the weights to learn zero than to learn the identity.
∂B_t/∂B_{t-1} = ∂[g(X_t · B_{t-1}) + B_{t-1}]/∂B_{t-1} = g'(X_t · B_{t-1}) · X_t + 1

Over T layers the product becomes (g'(X_t · B_{t-1}) · X_t + 1)^T = (a + 1)^T, where a = g'(X_t · B_{t-1}) · X_t.

(a + 1)² = a² + 2a + 1
(a + 1)³ = a³ + 3a² + 3a + 1
(a + 1)⁴ = a⁴ + 4a³ + 6a² + 4a + 1

However small a gets, the expansion always keeps the constant 1 and the low-order terms, so the gradient no longer vanishes.
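A numeric illustration of the (a + 1)^T point, with arbitrary values:

```python
a, T = 0.05, 50          # a: a small per-layer factor g'(X_t . B_{t-1}) . X_t
print(a ** T)            # plain chain:    ~9e-66, the gradient has vanished
print((a + 1) ** T)      # with skip path: ~11.5, the gradient survives
```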
Gating

Instead of always adding the skip connection, let a gate C decide, per unit, how much of the new activation to let through and how much of B_{t-1} to carry over:

B_t = g(X_t · B_{t-1}) · C + B_{t-1} · (1 − C)

C = τ(X²_t · B_{t-1})

τ = sigmoid, since its output lies in (0, 1); X²_t is a second weight matrix that parametrises the gate.

X → B_1 → B_2 → B_3 → Ẑ, with a gate G at every layer.
Expanding the gated update before taking ∂B_t/∂B_{t-1}:

B_t = g(X_t · B_{t-1}) · C + B_{t-1} · (1 − C)
    = g(X_t · B_{t-1}) · τ(X²_t · B_{t-1}) + B_{t-1} · (1 − τ(X²_t · B_{t-1}))
    = g(X_t · B_{t-1}) · τ(X²_t · B_{t-1}) + B_{t-1} − τ(X²_t · B_{t-1}) · B_{t-1}

with C = τ(X²_t · B_{t-1}). Differentiating the middle term B_{t-1} contributes a 1, so the gated unit keeps a direct gradient path just like the plain residual connection.
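A sketch of one gated layer, reusing g and rng from the earlier snippets; X2_t stands in for the slide's second weight matrix X²_t:

```python
def tau(a):
    # Gate non-linearity: sigmoid, since its output lies in (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

def gated_layer(X_t, X2_t, B_prev):
    """B_t = g(X_t . B_prev) * C + B_prev * (1 - C), with C = tau(X2_t . B_prev)."""
    C = tau(X2_t @ B_prev)                          # per-unit gate in (0, 1)
    return g(X_t @ B_prev) * C + B_prev * (1.0 - C)

B = gated_layer(0.5 * rng.normal(size=(5, 5)),
                0.5 * rng.normal(size=(5, 5)),
                rng.normal(size=5))
```

The element-wise mix means each unit can independently copy its old value (C near 0) or take the new activation (C near 1).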
Vanishing/Exploding Gradient in RNNs

Recap: Init Activation B_0 = [0] → B_1 → B_2 → B_3 → Ẑ, with inputs Y_1 … Y_4 and B_t = g(X · B_{t-1}), where g = non-linear activation function (sigmoid, tanh, ReLU, Softplus, Maxout, …). Note: the weights are the same at every timestep, so

∂B_t/∂B_{t-1} = ∂g(X · B_{t-1})/∂B_{t-1} = g'(X · B_{t-1}) · X

and the same factor multiplies in at every step back through time.
Gated RNN cell: B_{t-1} and Y_t enter; the candidate activation g and the gate τ are computed; × and + mix them as

B_t = g(X · B_{t-1}) · C + B_{t-1} · (1 − C)

C = τ(X² · B_{t-1})

τ = sigmoid, since its output lies in (0, 1). Unrolled: B_0 → B_1 → B_2 → B_3 → Ẑ with inputs Y_1 … Y_4 and the same X, X² at every step.
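The same gated step run as a recurrence, reusing g, tau, and rng from above. The slide's formula omits the Y_t inputs, so a nonzero starting state stands in for them here; the shared X, X² at every timestep is what makes this GRU-flavoured rather than a per-layer highway stack:

```python
def gated_rnn_step(X, X2, B_prev):
    """One timestep: B_t = g(X . B_{t-1}) * C + B_{t-1} * (1 - C)."""
    C = tau(X2 @ B_prev)
    return g(X @ B_prev) * C + B_prev * (1.0 - C)

B = rng.normal(size=5)                 # nonzero start, standing in for inputs
X, X2 = 0.5 * rng.normal(size=(5, 5)), 0.5 * rng.normal(size=(5, 5))
for _ in range(4):                     # same weights at every timestep
    B = gated_rnn_step(X, X2, B)
```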
GRU, LSTM

The GRU and LSTM cells build on this same gating idea, with additional gates.

A related precedent: the NARX approach (nonlinear autoregressive models with exogenous inputs; Lin et al., 1996), where delayed connections improved the RNN's ability to infer finite state machines.

NNs: ResNet (2015), Highway Net (2015)
RNNs: LSTM (1997), GRU (2014)
Neither the ResNet nor the Highway paper references GRUs/LSTMs.