

SLIDE 1

Gradients of Deep Networks

Chris Cremer, March 29, 2017

SLIDE 2

Neural Net

[Diagram: input $X$ → hidden activations $B_1$, $B_2$, $B_3$ → output $\hat{Z}$, with weight matrices $X_1$, $X_2$, $X_3$, $X_4$ between the layers.]

$$B_t = g(X_t \cdot B_{t-1})$$

where $g$ = non-linear activation function (sigmoid, tanh, ReLU, Softplus, Maxout, …)
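As a concrete picture of this forward pass, here is a minimal NumPy sketch; the layer sizes, the choice of sigmoid for $g$, and leaving the last layer linear are illustrative assumptions, not from the slides:

```python
import numpy as np

def g(a):
    # Non-linear activation; sigmoid here, as one of the choices listed above.
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
# Illustrative sizes: input of dim 5, three hidden layers of dim 4, output of dim 1.
X1 = rng.normal(size=(4, 5))   # input -> B_1
X2 = rng.normal(size=(4, 4))   # B_1   -> B_2
X3 = rng.normal(size=(4, 4))   # B_2   -> B_3
X4 = rng.normal(size=(1, 4))   # B_3   -> Z_hat

def forward(x):
    B = x
    for X_t in (X1, X2, X3):
        B = g(X_t @ B)         # B_t = g(X_t . B_{t-1})
    return X4 @ B              # output Z_hat (last layer left linear in this sketch)

Z_hat = forward(rng.normal(size=5))
```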

SLIDE 3

Recurrent Neural Net

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

SLIDE 4

Recurrent Neural Net

[Diagram: an RNN unrolled over time. Initial activation $B_0 = [0]$; inputs $Y_1, Y_2, Y_3, Y_4$; hidden activations $B_1, B_2, B_3$; output $\hat{Z}$; the same weights $X$ are applied at every timestep.]

$$B_t = g(X_t \cdot B_{t-1})$$

where $g$ = non-linear activation function (sigmoid, tanh, ReLU, Softplus, Maxout, …). Notice that the weights are the same at every timestep.

SLIDE 5

Recurrent Neural Network – One Timestep

[Diagram: one RNN cell. The previous state $B_{t-1}$ and the input $Y_t$ enter; they are combined with the weights $X$ and passed through the activation $g$ to produce $B_t$.]
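A minimal sketch of this step, unrolled over a few timesteps with the same weights $X$ reused each time; the slides' simplified recurrence omits the input $Y_t$, and the sizes and sigmoid choice for $g$ are illustrative:

```python
import numpy as np

def g(a):
    return 1.0 / (1.0 + np.exp(-a))   # example activation

def rnn_step(X, B_prev):
    # One timestep of the simplified recurrence: B_t = g(X . B_{t-1}).
    return g(X @ B_prev)

rng = np.random.default_rng(0)
hidden = 4
X = rng.normal(size=(hidden, hidden))  # the SAME weights at every timestep

B = np.zeros(hidden)                   # B_0 = [0]
states = []
for t in range(3):                     # three timesteps, as in the unrolled diagram
    B = rnn_step(X, B)
    states.append(B)
```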

SLIDE 6

Gradient Descent

[Diagram: the feedforward net and the unrolled RNN from before, each with a cost $D(\hat{Z}, Z)$ attached to the output $\hat{Z}$.]

where $D$ is a cost function (squared error, cross-entropy, …).

  • We want $\frac{\partial D}{\partial X_1}, \frac{\partial D}{\partial X_2}, \frac{\partial D}{\partial X_3}, \ldots$ so that we can do gradient descent:

$$X_{new} = X_{old} - \beta \, \frac{\partial D}{\partial X_{old}}$$
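The update itself is one line; a minimal sketch (the learning rate value is arbitrary, and the gradient is assumed to come from backprop, covered next):

```python
def gradient_descent_step(X_old, grad_D_wrt_X_old, beta=0.01):
    # X_new = X_old - beta * dD/dX_old
    return X_old - beta * grad_D_wrt_X_old
```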

SLIDE 7

Backprop (Chain Rule)

[Diagram: the network with cost $D(\hat{Z}, Z)$ attached to the output.]

We want $\frac{\partial D}{\partial X_1}, \frac{\partial D}{\partial X_2}, \frac{\partial D}{\partial X_3}, \ldots$

$\frac{\partial D}{\partial \hat{Z}}$ = derivative of the cost function:

$$\frac{\partial D}{\partial X_4} = \frac{\partial D}{\partial \hat{Z}} \cdot \frac{\partial \hat{Z}}{\partial X_4}$$

$$\frac{\partial D}{\partial B_2} = \frac{\partial D}{\partial \hat{Z}} \cdot \frac{\partial \hat{Z}}{\partial B_2} = \frac{\partial D}{\partial \hat{Z}} \cdot \frac{\partial \hat{Z}}{\partial B_3} \cdot \frac{\partial B_3}{\partial B_2}$$

$\frac{\partial B_t}{\partial B_{t-1}}$ = derivative of the activation function:

$$\frac{\partial B_t}{\partial B_{t-1}} = \frac{\partial g(X_t \cdot B_{t-1})}{\partial B_{t-1}} = g'(X_t \cdot B_{t-1}) \, X_t \qquad \text{where } B_t = g(X_t \cdot B_{t-1})$$

where $g$ = non-linear activation function (sigmoid, tanh, ReLU, Softplus, …) and $D$ = cost function (squared error, cross-entropy, …).

Example: squared-error cost and sigmoid activation,

$$D(\hat{Z}, Z) = (\hat{Z} - Z)^2 \quad \Rightarrow \quad \frac{\partial D}{\partial \hat{Z}} = 2(\hat{Z} - Z)$$

$$B_t = \tau(X_t \cdot B_{t-1}) \quad \Rightarrow \quad \frac{\partial B_t}{\partial B_{t-1}} = B_t (1 - B_t) \, X_t$$
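Putting the example above into code: a minimal NumPy sketch of backprop for the squared-error cost and sigmoid layers, accumulating the chain-rule factors from the output back to the input (sizes, names, and applying the sigmoid on the output layer are illustrative assumptions):

```python
import numpy as np

def tau(a):                         # sigmoid
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
sizes = [5, 4, 4, 4, 1]             # input, three hidden layers, output
Xs = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(4)]   # X_1 .. X_4

x = rng.normal(size=5)
Z = np.array([1.0])                 # target

# Forward pass, storing B_0 .. B_4 (here B_0 is the input and B_4 is Z_hat).
Bs = [x]
for X_t in Xs:
    Bs.append(tau(X_t @ Bs[-1]))    # B_t = tau(X_t . B_{t-1})
Z_hat = Bs[-1]

# Backward pass: dD/dZ_hat = 2 (Z_hat - Z) for D = (Z_hat - Z)^2.
delta = 2.0 * (Z_hat - Z)
grads = [None] * 4
for t in reversed(range(4)):
    delta = delta * Bs[t + 1] * (1.0 - Bs[t + 1])   # multiply by tau' = B_t (1 - B_t)
    grads[t] = np.outer(delta, Bs[t])               # dD/dX_t
    delta = Xs[t].T @ delta                         # dD/dB_{t-1}, continue the chain
```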

SLIDE 8

Vanishing/Exploding Gradient

[Diagram: the network with cost $D(\hat{Z}, Z)$ attached to the output.]

$$\frac{\partial B_t}{\partial B_{t-1}} = \frac{\partial g(X_t \cdot B_{t-1})}{\partial B_{t-1}} = g'(X_t \cdot B_{t-1}) \, X_t$$

$$\frac{\partial D}{\partial X_1} = \frac{\partial D}{\partial \hat{Z}} \cdot \frac{\partial \hat{Z}}{\partial B_3} \cdot \frac{\partial B_3}{\partial B_2} \cdot \frac{\partial B_2}{\partial B_1} \cdots = \frac{\partial D}{\partial \hat{Z}} \cdot \big(g'(X_t \cdot B_{t-1}) \, X_t\big)^{T}$$

where $T$ = number of layers = number of timesteps. For NNs, $t$ goes from $T$ down to $0$; for RNNs, the weight matrix is the same for every $t$.

Looking at the repeated factor $\big(g'(X_t \cdot B_{t-1}) \, X_t\big)^{T}$:

  • if $g'(X_t \cdot B_{t-1}) \, X_t > 1$, the gradient explodes
  • if $g'(X_t \cdot B_{t-1}) \, X_t < 1$, the gradient vanishes
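A quick numerical illustration of why the repeated factor matters; the per-layer factor values 0.9 and 1.1 are made up:

```python
T = 50           # number of layers / timesteps
print(0.9 ** T)  # ~0.005: a factor slightly below 1 makes the gradient vanish
print(1.1 ** T)  # ~117:   a factor slightly above 1 makes the gradient explode
```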

SLIDE 9

Resnet/HighwayNet/GRU/LSTM

  • NNs:
    • ResNet (2015)
    • Highway Net (2015)
  • RNNs:
    • LSTM (1997)
    • GRU (2014): K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.

SLIDE 10

Residual Network (ResNet)

[Diagram: the network from before, with a skip connection added around each layer.]

$$B_t = g(X_t \cdot B_{t-1}) + B_{t-1}$$

Idea:

  • If a layer is useless (i.e., it loses information), the network can skip it
  • It is easier for the network to drive a layer's weights to zero than to learn the identity mapping
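A minimal sketch of the residual layer; square weight matrices so the shapes of $g(X_t \cdot B_{t-1})$ and $B_{t-1}$ match, and the sizes and sigmoid choice for $g$ are illustrative:

```python
import numpy as np

def g(a):
    return 1.0 / (1.0 + np.exp(-a))   # example activation

def residual_layer(X_t, B_prev):
    # B_t = g(X_t . B_{t-1}) + B_{t-1}: the skip connection adds the layer input back in.
    return g(X_t @ B_prev) + B_prev

rng = np.random.default_rng(0)
B_t = residual_layer(rng.normal(size=(4, 4)), rng.normal(size=4))
```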

SLIDE 11

ResNet Gradient

$$\frac{\partial B_t}{\partial B_{t-1}} = \frac{\partial \big(g(X_t \cdot B_{t-1}) + B_{t-1}\big)}{\partial B_{t-1}} = g'(X_t \cdot B_{t-1}) \, X_t + 1 \qquad \text{where } B_t = g(X_t \cdot B_{t-1}) + B_{t-1}$$

$$\big(g'(X_t \cdot B_{t-1}) \, X_t + 1\big)^{T} = (a + 1)^{T} \qquad \text{where } a = g'(X_t \cdot B_{t-1}) \, X_t$$

$$(a+1)^2 = a^2 + 2a + 1 \qquad (a+1)^3 = a^3 + 3a^2 + 3a + 1 \qquad (a+1)^4 = a^4 + 4a^3 + 6a^2 + 4a + 1$$

  • Vanishing gradient problem: solved, since the gradient persists through the layers (the $+1$ term)
  • Exploding gradient problem: still addressed with weight decay, weight norm, layer norm, batch norm, …
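The same kind of numerical illustration as before, now with the ResNet $+1$ term included; the per-layer factor $a = 0.1$ is made up:

```python
T, a = 50, 0.1       # a = g'(X_t . B_{t-1}) X_t, repeated over T layers
print(a ** T)        # ~1e-50: without the skip connection the gradient vanishes
print((a + 1) ** T)  # ~117:   the +1 from the skip connection keeps the gradient alive
```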

SLIDE 12

Highway Network

$$B_t = g(X_t \cdot B_{t-1}) \cdot C + B_{t-1} \cdot (1 - C)$$

$$C = \tau(X_t^{2} \cdot B_{t-1})$$

where $\tau$ = sigmoid, since its output lies in $(0, 1)$, and $X_t^{2}$ is a second weight matrix for the gate.

[Diagram: the network from before, with a gate $G$ controlling each layer's skip connection.]
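A minimal sketch of one highway layer; the two square weight matrices (one for the transform, one for the gate), the sizes, and the tanh choice for $g$ are illustrative:

```python
import numpy as np

def tau(a):                         # sigmoid, so the gate C is in (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

def highway_layer(X_t, X2_t, B_prev, g=np.tanh):
    # B_t = g(X_t . B_{t-1}) * C + B_{t-1} * (1 - C), with gate C = tau(X2_t . B_{t-1}).
    C = tau(X2_t @ B_prev)
    return g(X_t @ B_prev) * C + B_prev * (1.0 - C)

rng = np.random.default_rng(0)
B_t = highway_layer(rng.normal(size=(4, 4)), rng.normal(size=(4, 4)), rng.normal(size=4))
```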

SLIDE 13

Highway Net Gradient

  • Vanishing gradient problem: solved, since the gradient persists through the layers
  • Exploding gradient problem: still addressed with weight decay, weight norm, layer norm, batch norm, …

To get $\frac{\partial B_t}{\partial B_{t-1}}$, expand $B_t$ using $C = \tau(X_t^{2} \cdot B_{t-1})$:

$$B_t = g(X_t \cdot B_{t-1}) \cdot C + B_{t-1} \cdot (1 - C)$$
$$= g(X_t \cdot B_{t-1}) \cdot \tau(X_t^{2} \cdot B_{t-1}) + B_{t-1} \cdot \big(1 - \tau(X_t^{2} \cdot B_{t-1})\big)$$
$$= g(X_t \cdot B_{t-1}) \cdot \tau(X_t^{2} \cdot B_{t-1}) + B_{t-1} - \tau(X_t^{2} \cdot B_{t-1}) \cdot B_{t-1}$$

The lone $B_{t-1}$ term differentiates to $1$, which is what lets the gradient pass through when the gate is closed.
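One way to finish the derivative (a sketch in the slides' scalar-style notation, using the product rule and $\tau' = \tau(1-\tau)$):

$$\frac{\partial B_t}{\partial B_{t-1}} = g'(X_t \cdot B_{t-1}) \, X_t \cdot C \;+\; (1 - C) \;+\; \big(g(X_t \cdot B_{t-1}) - B_{t-1}\big) \, C \, (1 - C) \, X_t^{2}$$

When the gate is closed ($C \approx 0$) this is approximately $1$, so the gradient passes straight through; when it is fully open ($C \approx 1$) it reduces to the plain layer's factor $g'(X_t \cdot B_{t-1}) \, X_t$.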

SLIDE 14

Back to RNNs

[Diagram: the unrolled RNN from before. Initial activation $B_0 = [0]$; inputs $Y_1, Y_2, Y_3, Y_4$; hidden activations $B_1, B_2, B_3$; output $\hat{Z}$; the same weights $X$ at every timestep.]

$$B_t = g(X_t \cdot B_{t-1})$$

where $g$ = non-linear activation function (sigmoid, tanh, ReLU, Softplus, Maxout, …). Note: the weights are the same at every timestep.

$$\frac{\partial B_t}{\partial B_{t-1}} = \frac{\partial g(X_t \cdot B_{t-1})}{\partial B_{t-1}} = g'(X_t \cdot B_{t-1}) \, X_t$$

Vanishing/Exploding Gradient

SLIDE 15

Gated Recurrent Unit

[Diagram: one gated recurrent cell. As in the plain RNN cell, $B_{t-1}$ and $Y_t$ enter and $B_t$ comes out, but a gate $C$ mixes the new activation with the previous state.]

$$B_t = g(X \cdot B_{t-1}) \cdot C + B_{t-1} \cdot (1 - C)$$

$$C = \tau(X^{2} \cdot B_{t-1})$$

where $\tau$ = sigmoid, since its output lies in $(0, 1)$.
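A minimal sketch of this simplified gated recurrence, unrolled with the same weights at every timestep; the input $Y_t$ and the GRU's reset gate are omitted, as in the simplified form above, and the sizes are illustrative:

```python
import numpy as np

def tau(a):                         # sigmoid gate
    return 1.0 / (1.0 + np.exp(-a))

def gated_step(X, X2, B_prev, g=np.tanh):
    # B_t = g(X . B_{t-1}) * C + B_{t-1} * (1 - C), with C = tau(X2 . B_{t-1}).
    C = tau(X2 @ B_prev)
    return g(X @ B_prev) * C + B_prev * (1.0 - C)

rng = np.random.default_rng(0)
hidden = 4
X, X2 = rng.normal(size=(hidden, hidden)), rng.normal(size=(hidden, hidden))

B = rng.normal(size=hidden)         # B_0 (nonzero here just so the states vary)
for t in range(3):                  # the same X, X2 are reused at every timestep
    B = gated_step(X, X2, B)
```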

SLIDE 16

Another view of GRUs

[Diagram: the same gated cell as on the previous slide, shown alongside the unrolled RNN: $B_0 \to B_1 \to B_2 \to B_3 \to \hat{Z}$ with inputs $Y_1, Y_2, Y_3, Y_4$ and the same weights reused at every timestep.]

$$B_t = g(X \cdot B_{t-1}) \cdot C + B_{t-1} \cdot (1 - C), \qquad C = \tau(X^{2} \cdot B_{t-1})$$

where $\tau$ = sigmoid, since its output lies in $(0, 1)$.

SLIDE 17

GRU/LSTM: More Gates

[Diagrams: a GRU cell and an LSTM cell, which have more gates than the simplified cell above.]

SLIDE 18

Memory Concerns

  • If T=10000, you need to keep 10000 activations/states in memory
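For a sense of scale (illustrative numbers, not from the slides): with $T = 10000$ timesteps, a hidden state of 1000 units, a batch of 32 sequences, and 4-byte floats, the stored states alone take about $10000 \times 1000 \times 32 \times 4$ bytes $\approx 1.3$ GB.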
SLIDE 19

Deep Network Gradients Conclusion

  • The models we saw all use the same idea: a skip or gated connection that lets the gradient persist through layers/timesteps
  • One of the earlier uses of skip connections was in the Nonlinear AutoRegressive with eXogenous inputs method (NARX; Lin et al., 1996), where they improved the RNN's ability to infer finite state machines
  • Ilya Sutskever, PhD thesis, 2013

NNs: ResNet (2015), Highway Net (2015). RNNs: LSTM (1997), GRU (2014). Neither the ResNet nor the Highway paper references GRUs/LSTMs.

SLIDE 20

Thanks