Gradients of Deep Networks
Chris Cremer
March 29, 2017
Neural Net

Input $X$ → Hidden $B_1$ → Hidden $B_2$ → Hidden $B_3$ → Output $\hat{y}$, with weights $W_1, W_2, W_3, W_4$ and an activation at each hidden layer.

$B_t = f(W_t \cdot B_{t-1})$

where $f$ = non-linear activation function (sigmoid, tanh, ReLU, Softplus, Maxout, …)
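A minimal sketch of this forward pass in NumPy, assuming sigmoid activations, random weights, and the layer sizes chosen here (none of these specifics come from the slides):

```python
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid; could be tanh, ReLU, Softplus, ...

rng = np.random.default_rng(0)
X = rng.normal(size=(5,))                          # input
Ws = [rng.normal(size=(5, 5)) for _ in range(3)]   # W_1, W_2, W_3 for the hidden layers
W_out = rng.normal(size=(1, 5))                    # W_4, maps the last hidden layer to the output

B = X
for W in Ws:
    B = f(W @ B)      # B_t = f(W_t . B_{t-1})
y_hat = W_out @ B     # output prediction
print(y_hat)
```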
Recurrent Neural Net

[Figure: unrolled RNN diagram from http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
Recurrent Neural Net

Init $B_0 = [0]$ → Hidden $B_1$ → Hidden $B_2$ → Hidden $B_3$ → Output $\hat{y}$, with inputs $X_1, X_2, X_3, X_4$ entering at each timestep and an activation at every step.

$B_t = f(W_t \cdot B_{t-1})$

where $f$ = non-linear activation function (sigmoid, tanh, ReLU, Softplus, Maxout, …)

Notice that the weights are the same at every timestep.
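A minimal sketch of the recurrent update with one shared weight matrix, assuming tanh activations and small random values. The $U \cdot X_t$ term is my addition to show where the inputs enter; the slide's simplified formula only writes $B_t = f(W \cdot B_{t-1})$:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4)) * 0.5   # the SAME W is reused at every timestep
U = rng.normal(size=(4, 3)) * 0.5   # (assumed) input-to-hidden weights
Xs = rng.normal(size=(4, 3))        # inputs X_1 .. X_4
B = np.zeros(4)                     # B_0 = [0]

for X_t in Xs:
    B = np.tanh(W @ B + U @ X_t)    # hidden state update, same W at every step
print(B)
```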
Recurrent Neural Network: One Timestep

[Figure: a single timestep; $B_{t-1}$ and $X_t$ are combined through $W$ and the activation $f$ to produce $B_t$.]
Gradient Descent

Network: $X \to B_1 \to B_2 \to B_3 \to \hat{y}$, with cost $D(\hat{y}, y)$. The same picture applies to the RNN: $B_0 \to B_1 \to B_2 \to B_3 \to \hat{y}$ with inputs $X_1, \ldots, X_4$.

We want $\frac{\partial D}{\partial W_1}, \frac{\partial D}{\partial W_2}, \frac{\partial D}{\partial W_3}, \ldots$ so that we can do gradient descent:

$W_{new} = W_{old} - \alpha \, \frac{\partial D}{\partial W}\big|_{old}$

where $D$ is a cost function (squared error, cross-entropy, …).
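A minimal sketch of the update rule on a toy problem, assuming a 1-D linear model, the squared-error cost $D(\hat{y}, y) = (\hat{y} - y)^2$, and the learning-rate name `alpha` (the numbers here are illustrative, not from the slides):

```python
w, x, y = 0.0, 2.0, 3.0
alpha = 0.1                        # learning rate

for step in range(50):
    y_hat = w * x                  # prediction
    dD_dw = 2 * (y_hat - y) * x    # dD/dw for the squared-error cost
    w = w - alpha * dD_dw          # W_new = W_old - alpha * dD/dW
print(w, w * x)                    # w -> 1.5, so the prediction -> 3.0
```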
Backprop (Chain Rule)

We want $\frac{\partial D}{\partial W_1}, \frac{\partial D}{\partial W_2}, \frac{\partial D}{\partial W_3}, \ldots$ for the network $X \to B_1 \to B_2 \to B_3 \to \hat{y}$, where $B_t = f(W_t \cdot B_{t-1})$, $f$ = non-linear activation function (sigmoid, tanh, ReLU, Softplus, …), and $D$ = cost function (squared error, cross-entropy, …).

Example:
• $D(\hat{y}, y) = (\hat{y} - y)^2$
• $\frac{\partial D}{\partial \hat{y}} = 2(\hat{y} - y)$  (take the derivative of the cost with respect to the output)
• $\frac{\partial D}{\partial B_3} = \frac{\partial D}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial B_3}$  (take derivatives with respect to the intermediate variables)
• $\frac{\partial D}{\partial B_2} = \frac{\partial D}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial B_3} \cdot \frac{\partial B_3}{\partial B_2}$
• If $f$ is the sigmoid: $\frac{\partial B_t}{\partial (W_t \cdot B_{t-1})} = B_t (1 - B_t)$
• $\frac{\partial B_t}{\partial B_{t-1}} = \frac{\partial f(W_t \cdot B_{t-1})}{\partial B_{t-1}} = f'(W_t \cdot B_{t-1}) \, W_t$
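A minimal scalar sketch of this chain rule, assuming three sigmoid layers and the specific weight values below (the sizes, values, and the finite-difference check are my choices, not the slides'):

```python
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid

x, y = 1.0, 0.5
w1, w2, w3, w4 = 0.3, -0.7, 0.5, 1.2

# forward pass: B_t = f(w_t * B_{t-1}), then a linear output and squared-error cost
B1 = f(w1 * x)
B2 = f(w2 * B1)
B3 = f(w3 * B2)
y_hat = w4 * B3
D = (y_hat - y) ** 2

# backward pass (chain rule)
dD_dyhat = 2 * (y_hat - y)             # derivative of cost w.r.t. output
dD_dB3 = dD_dyhat * w4                 # derivative w.r.t. the last hidden unit
dD_dB2 = dD_dB3 * B3 * (1 - B3) * w3   # dB_t/dB_{t-1} = f'(.) * w_t, with sigmoid f' = B_t(1 - B_t)
dD_dw2 = dD_dB2 * B2 * (1 - B2) * B1   # finally, derivative w.r.t. the weight w_2

# sanity check with a finite difference
eps = 1e-6
B2e = f((w2 + eps) * B1)
B3e = f(w3 * B2e)
D_eps = (w4 * B3e - y) ** 2
print(dD_dw2, (D_eps - D) / eps)       # the two values should match closely
```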
Vanishing/Exploding Gradient

$\frac{\partial B_t}{\partial B_{t-1}} = \frac{\partial f(W_t \cdot B_{t-1})}{\partial B_{t-1}} = f'(W_t \cdot B_{t-1}) \, W_t$

$\frac{\partial D}{\partial W} = \frac{\partial D}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial B_3} \cdot \frac{\partial B_3}{\partial B_2} \cdots$  so the chain rule multiplies $T$ factors of this form, i.e. the gradient contains $\left( f'(W_t \cdot B_{t-1}) \, W_t \right)^T$

where $T$ = number of layers / timesteps. For NNs, $t$ goes from $T$ to $0$; for RNNs, $W$ is the same for every $t$.

If $f'(W_t \cdot B_{t-1}) \, W_t > 1$: the gradient explodes.
If $f'(W_t \cdot B_{t-1}) \, W_t < 1$: the gradient vanishes.
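A minimal numeric sketch of the effect, assuming a scalar weight $w$, a tanh activation, and a fixed hidden value $B$ (all assumptions): multiplying $T$ factors of $f'(w \cdot B)\,w$ either collapses toward 0 or blows up:

```python
import numpy as np

def df(z):
    return 1.0 - np.tanh(z) ** 2        # derivative of tanh

T = 50                                  # number of layers / timesteps
B = 0.1                                 # a fixed (assumed) hidden activation
for w in (0.5, 1.5):
    grad = np.prod([df(w * B) * w for _ in range(T)])
    print(f"w = {w}: product of {T} factors = {grad:.3e}")
# w = 0.5 gives a factor below 1 -> the product vanishes;
# w = 1.5 gives a factor above 1 -> the product explodes.
```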
ResNet / Highway Net / GRU / LSTM

• NNs:
  • ResNet (2015)
  • Highway Net (2015)
• RNNs:
  • LSTM (1997)
  • GRU (2014)

K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
Residual Network (ResNet)

$B_t = f(W_t \cdot B_{t-1}) + B_{t-1}$

Idea:
• If a layer is useless (i.e., it loses information), the network can skip it.
• It is easier for the network to learn zero weights than to learn the identity.
ResNet Gradient

$B_t = f(W_t \cdot B_{t-1}) + B_{t-1}$

$\frac{\partial B_t}{\partial B_{t-1}} = \frac{\partial \left[ f(W_t \cdot B_{t-1}) + B_{t-1} \right]}{\partial B_{t-1}} = f'(W_t \cdot B_{t-1}) \, W_t + 1$

Over $T$ layers: $\left( f'(W_t \cdot B_{t-1}) \, W_t + 1 \right)^T = (a + 1)^T$, where $a = f'(W_t \cdot B_{t-1}) \, W_t$

$(a+1)^2 = a^2 + 2a + 1$
$(a+1)^3 = a^3 + 3a^2 + 3a + 1$
$(a+1)^4 = a^4 + 4a^3 + 6a^2 + 4a + 1$

• Vanishing gradient problem: the gradient persists through layers (there is always a $+1$ term).
• Exploding gradient problem: weight decay, weight norm, layer norm, batch norm, …
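Continuing the scalar toy from above (same assumed values), the residual connection changes the per-layer factor from $f'(w \cdot B)\,w$ to $f'(w \cdot B)\,w + 1$, so the product no longer collapses toward zero; keeping it from growing too large is then handled by weight decay / normalization, as the slide notes:

```python
import numpy as np

def df(z):
    return 1.0 - np.tanh(z) ** 2        # derivative of tanh

T, B, w = 50, 0.1, 0.5
plain    = np.prod([df(w * B) * w       for _ in range(T)])   # vanishes
residual = np.prod([df(w * B) * w + 1.0 for _ in range(T)])   # persists (here it grows)
print(f"plain net: {plain:.3e}   resnet: {residual:.3e}")
```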
Highway Network

The same network, but each layer has a gate $C$:

$B_t = f(W_t \cdot B_{t-1}) \cdot C + B_{t-1} \cdot (1 - C)$
$C = \sigma(W_t^{(2)} \cdot B_{t-1})$

$\sigma$ = sigmoid, since the gate output must lie in $(0, 1)$.
Highway Net Gradient

$B_t = f(W_t \cdot B_{t-1}) \cdot C + B_{t-1} \cdot (1 - C)$, with $C = \sigma(W_t^{(2)} \cdot B_{t-1})$
$\;\;\;= f(W_t \cdot B_{t-1}) \cdot \sigma(W_t^{(2)} \cdot B_{t-1}) + B_{t-1} \cdot \left(1 - \sigma(W_t^{(2)} \cdot B_{t-1})\right)$
$\;\;\;= f(W_t \cdot B_{t-1}) \cdot \sigma(W_t^{(2)} \cdot B_{t-1}) + B_{t-1} - \sigma(W_t^{(2)} \cdot B_{t-1}) \cdot B_{t-1}$

Differentiating the standalone $B_{t-1}$ term gives $\frac{\partial B_{t-1}}{\partial B_{t-1}} = 1$, so $\frac{\partial B_t}{\partial B_{t-1}}$ always contains a $1$ and the gradient can pass straight through.

• Vanishing gradient problem: the gradient persists through layers.
• Exploding gradient problem: weight decay, weight norm, layer norm, batch norm, …
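A minimal scalar sketch of one highway layer (the weight values and the tanh transform are my assumptions): the gate $C$ blends the transformed path with the untouched $B_{t-1}$, and the carry path keeps the derivative from collapsing:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway(B_prev, w=0.8, w2=-0.3):
    C = sigmoid(w2 * B_prev)                             # gate in (0, 1)
    return np.tanh(w * B_prev) * C + B_prev * (1 - C)    # gated blend of transform and carry

B_prev, eps = 0.4, 1e-6
dB_dBprev = (highway(B_prev + eps) - highway(B_prev)) / eps
print(dB_dBprev)   # stays of order 1 thanks to the B_prev * (1 - C) carry path
```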
Back to RNNs

Init $B_0 = [0]$ → Hidden $B_1$ → Hidden $B_2$ → Hidden $B_3$ → Output $\hat{y}$, with inputs $X_1, X_2, X_3, X_4$ and an activation at every timestep.

$B_t = f(W_t \cdot B_{t-1})$

$\frac{\partial B_t}{\partial B_{t-1}} = \frac{\partial f(W_t \cdot B_{t-1})}{\partial B_{t-1}} = f'(W_t \cdot B_{t-1}) \, W_t$  → vanishing/exploding gradient

where $f$ = non-linear activation function (sigmoid, tanh, ReLU, Softplus, Maxout, …)

Note: the weights are the same at every timestep.
RNN: Gated Recurrent Unit

[Diagram: $B_{t-1}$ and $X_t$ enter the cell; one path is transformed by $W$ and $f$, and the gate $\sigma(W^{(2)} \cdot B_{t-1})$ blends it with $B_{t-1}$ through the $\times$, $1-$, and $+$ operations to give $B_t$.]

$B_t = f(W \cdot B_{t-1}) \cdot C + B_{t-1} \cdot (1 - C)$
$C = \sigma(W^{(2)} \cdot B_{t-1})$

$\sigma$ = sigmoid, since the gate output must lie in $(0, 1)$.
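A minimal sketch of this gated recurrent update with the same $W$ and $W^{(2)}$ reused at every timestep. The $U \cdot X_t$ term and all sizes are my assumptions; the slide's simplified formula only writes $B_t = f(W \cdot B_{t-1}) \cdot C + B_{t-1} \cdot (1 - C)$, and a full GRU adds a reset gate as well:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W  = rng.normal(size=(4, 4)) * 0.5    # shared transform weights
W2 = rng.normal(size=(4, 4)) * 0.5    # shared gate weights
U  = rng.normal(size=(4, 3)) * 0.5    # (assumed) input weights
Xs = rng.normal(size=(10, 3))         # inputs X_1 .. X_10
B  = np.zeros(4)                      # B_0 = [0]

for X_t in Xs:
    C = sigmoid(W2 @ B)                              # gate, each entry in (0, 1)
    B = np.tanh(W @ B + U @ X_t) * C + B * (1 - C)   # gated blend of new and old state
print(B)
```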
RNN: Another View of GRUs

[Diagram: the same gated cell unrolled over the timesteps $X_1, X_2, X_3, X_4$, starting from $B_0$ and ending at the output $\hat{y}$.]

$B_t = f(W \cdot B_{t-1}) \cdot C + B_{t-1} \cdot (1 - C)$
$C = \sigma(W^{(2)} \cdot B_{t-1})$

$\sigma$ = sigmoid, since the gate output must lie in $(0, 1)$.
GRU/LSTM: More Gates

[Figure: GRU cell and LSTM cell diagrams, each with additional gates.]
Memory Concerns

• If $T = 10000$, you need to keep 10000 activations/states in memory for the backward pass.
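A back-of-envelope sketch of what that means, assuming a hidden state of size 1024 stored as 32-bit floats (both numbers are assumptions, not from the slides):

```python
T, hidden_size, bytes_per_float = 10_000, 1024, 4
megabytes = T * hidden_size * bytes_per_float / 1e6
print(f"~{megabytes:.0f} MB of stored states per sequence")   # roughly 41 MB, times the batch size
```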
Deep Network Gradients: Conclusion

• The models we saw all use the same idea.
• "One of the earlier uses of skip connections was in the Nonlinear AutoRegressive with eXogenous inputs method (NARX; Lin et al., 1996), where they improved the RNN's ability to infer finite state machines." - Ilya Sutskever, PhD thesis, 2013
• NNs: ResNet (2015), Highway Net (2015)
• RNNs: LSTM (1997), GRU (2014)
• Neither the ResNet nor the Highway Net paper references GRUs/LSTMs.
Thanks