SLIDE 1

CS 6956: Deep Learning for NLP

The vanishing gradient problem revisited: Highway and residual connections

SLIDE 2

Revisiting the vanishing gradient problem

Stems from the fact that the derivative of the activation is between zero and one… and as the number of steps of gradient computation grows, these factors get multiplied together. Not just applicable to LSTMs.

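A quick numeric sketch (mine, not from the slides) of how fast these multiplied factors shrink: the derivative of the sigmoid activation is at most 0.25, so even a few dozen such factors multiply out to something astronomically small.

```python
# Sketch (not from the slides): sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)) <= 0.25,
# so a product of `depth` such factors shrinks exponentially with depth.
MAX_SIGMOID_DERIVATIVE = 0.25

for depth in (5, 10, 20, 50):
    # Upper bound on the gradient factor contributed by `depth` sigmoid layers
    print(depth, MAX_SIGMOID_DERIVATIVE ** depth)
# 5  ~ 9.8e-04
# 10 ~ 9.5e-07
# 20 ~ 9.1e-13
# 50 ~ 7.9e-31
```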

SLIDE 3

Revisiting the vanishing gradient problem

Not just applicable to LSTMs

[Figure: inputs → many layers in between → outputs → loss]

SLIDE 4

Revisiting the vanishing gradient problem

Not just applicable to LSTMs

[Figure: inputs → many layers in between → outputs → loss]
Gradient vanishes as the depth grows

SLIDE 5

Revisiting the vanishing gradient problem

Not just applicable to LSTMs

[Figure: inputs → many layers in between → outputs → loss]
Gradient vanishes as the depth grows
The loss is no longer influenced by the inputs for very deep networks!

SLIDE 6

Revisiting the vanishing gradient problem

Not just applicable to LSTMs

[Figure: inputs → many layers in between → outputs → loss]
Gradient vanishes as the depth grows
The loss is no longer influenced by the inputs for very deep networks!
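As a concrete illustration of this picture, here is a rough PyTorch sketch (my own; the width, depth, and toy loss are arbitrary choices) that stacks plain sigmoid layers and measures how much gradient reaches the input.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def input_grad_norm(depth, width=64):
    """Gradient norm at the input of a plain stack of `depth` sigmoid layers."""
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), nn.Sigmoid()]
    net = nn.Sequential(*layers)

    x = torch.randn(1, width, requires_grad=True)
    net(x).sum().backward()   # .sum() stands in for a real loss
    return x.grad.norm().item()

for depth in (2, 10, 30):
    print(depth, input_grad_norm(depth))
# The norm collapses as depth grows: the loss is barely influenced
# by the inputs for very deep plain networks.
```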

Can we use ideas from LSTMs/GRUs to fix this problem?

SLIDE 7

Revisiting the vanishing gradient problem

Intuition: Consider a single layer $\mathbf{m}_t = g(\mathbf{m}_{t-1}\mathbf{X} + \mathbf{c})$


The (t-1)-th layer is used to calculate the value of the t-th layer
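In code, one such layer might look like the small sketch below; the names m_prev, X, and c mirror the slide's symbols, and tanh is just one illustrative choice for the nonlinearity $g$.

```python
import torch

def plain_layer(m_prev, X, c, g=torch.tanh):
    # m_t = g(m_{t-1} X + c): the value of layer t-1 is used to
    # compute the value of layer t.
    return g(m_prev @ X + c)
```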

SLIDE 8

Revisiting the vanishing gradient problem

Intuition: Consider a single layer $\mathbf{m}_t = g(\mathbf{m}_{t-1}\mathbf{X} + \mathbf{c})$
$\mathbf{m}_t = \mathbf{m}_{t-1} + g(\mathbf{m}_{t-1}\mathbf{X} + \mathbf{c})$


Instead of a non-linear update that directly calculates the next layer, let us try a linear update

SLIDE 9

Revisiting the vanishing gradient problem

Intuition: Consider a single layer $\mathbf{m}_t = g(\mathbf{m}_{t-1}\mathbf{X} + \mathbf{c})$
$\mathbf{m}_t = \mathbf{m}_{t-1} + g(\mathbf{m}_{t-1}\mathbf{X} + \mathbf{c})$


Instead of a non-linear update that directly calculates the next layer, let us try a linear update. The gradients can then be propagated all the way to the input without attenuation.
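To see this claim concretely, the sketch below (again mine, not from the slides) repeats the depth experiment from before, but can switch to the additive update $\mathbf{m}_t = \mathbf{m}_{t-1} + g(\mathbf{m}_{t-1}\mathbf{X} + \mathbf{c})$; the width and depth are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def input_grad_norm(depth, width=64, residual=True):
    """Gradient norm at the input of a deep stack, with or without the
    additive update m_t = m_{t-1} + g(m_{t-1} X + c)."""
    linears = nn.ModuleList([nn.Linear(width, width) for _ in range(depth)])
    x = torch.randn(1, width, requires_grad=True)

    m = x
    for linear in linears:
        update = torch.tanh(linear(m))         # g(m_{t-1} X + c)
        m = m + update if residual else update
    m.sum().backward()                         # toy scalar loss
    return x.grad.norm().item()

print("plain:   ", input_grad_norm(30, residual=False))
print("residual:", input_grad_norm(30, residual=True))
# With the additive update there is a direct path from the loss back to the
# input, so a much larger gradient reaches it.
```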

SLIDE 10

Residual networks

Each layer is reformulated as $\mathbf{m}_t = \mathbf{m}_{t-1} + g(\mathbf{m}_{t-1}\mathbf{X} + \mathbf{c})$


[He et al 2015]
[Figure: the original layer: $\mathbf{m}_{t-1}$ is fed through $g(\mathbf{m}_{t-1}\mathbf{X} + \mathbf{c})$ to produce $\mathbf{m}_t$]

SLIDE 11

Residual networks

Each layer is reformulated as $\mathbf{m}_t = \mathbf{m}_{t-1} + g(\mathbf{m}_{t-1}\mathbf{X} + \mathbf{c})$


[He et al 2015]
[Figure: the original layer vs. the residual connection: in the residual version, $\mathbf{m}_{t-1}$ is added to $g(\mathbf{m}_{t-1}\mathbf{X} + \mathbf{c})$ to produce $\mathbf{m}_t$]

SLIDE 12

Residual networks

Each layer is reformulated as $\mathbf{m}_t = \mathbf{m}_{t-1} + g(\mathbf{m}_{t-1}\mathbf{X} + \mathbf{c})$
The computation graph $g$ is not trained to predict the next layer. It predicts an update to the current layer value instead. That is, it can be seen as a residual function (the difference between the layers).


[He et al 2015]
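A minimal PyTorch sketch of one such residual layer, assuming a fully connected layer with tanh standing in for the computation graph $g$ (He et al. use convolutional blocks):

```python
import torch
import torch.nn as nn

class ResidualLayer(nn.Module):
    """m_t = m_{t-1} + g(m_{t-1} X + c): the layer learns an update
    (a residual) to the current value rather than the next value itself."""

    def __init__(self, width):
        super().__init__()
        self.linear = nn.Linear(width, width)   # parameters X and c

    def forward(self, m_prev):
        return m_prev + torch.tanh(self.linear(m_prev))
```

Stacking many of these layers keeps an identity path from the output back to the input, which is what lets gradients flow without attenuation.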

SLIDE 13

Highway connections

Extend the idea, using gates to stabilize learning

  • First, compute a proposed update: $\mathbf{D} = g(\mathbf{m}_{t-1}\mathbf{X} + \mathbf{c})$

  • Next, compute how much of the proposed update should be retained: $\mathbf{U} = \tau(\mathbf{m}_{t-1}\mathbf{X}' + \mathbf{c}')$

  • Finally, compute the actual value of the next layer: $\mathbf{m}_t = (1 - \mathbf{U}) \odot \mathbf{m}_{t-1} + \mathbf{U} \odot \mathbf{D}$


[Srivastava et al 2015]
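A corresponding sketch of a highway layer; the gate nonlinearity (written $\tau$ above) is a sigmoid as in Srivastava et al. 2015, and the fully connected layers with tanh are again illustrative choices.

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """m_t = (1 - U) * m_{t-1} + U * D, where D is a proposed update and
    U is a gate deciding (element-wise) how much of D is retained."""

    def __init__(self, width):
        super().__init__()
        self.update = nn.Linear(width, width)   # parameters X and c
        self.gate = nn.Linear(width, width)     # parameters X' and c'

    def forward(self, m_prev):
        D = torch.tanh(self.update(m_prev))     # proposed update
        U = torch.sigmoid(self.gate(m_prev))    # gate in (0, 1)
        return (1 - U) * m_prev + U * D
```

Srivastava et al. also suggest initializing the gate's bias to a negative value so that, early in training, each layer mostly carries its input through unchanged.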


SLIDE 17

Why residual/highway connections?

  • As networks become deeper, or as sequences get longer, we can no longer hope for gradients to be carried through the network

  • If we want to capture long-range dependencies within the input, we need this mechanism

  • More generally, this is a blueprint for an idea that can be combined with your neural network model if it gets too deep
