The vanishing gradient problem revisited: Highway and residual connections
CS 6956: Deep Learning for NLP
Revisiting the vanishing gradient problem
Stems from the fact that the derivative of the activation is between zero and one… and as the number of steps of gradient computation grows, these derivatives get multiplied together.
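As a quick numeric sketch (not from the slides; it assumes a sigmoid activation for concreteness), the sigmoid's derivative never exceeds 0.25, so multiplying even this best-case factor across 50 steps already wipes out the gradient:

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # always in (0, 0.25]

# Backpropagating through 50 steps multiplies 50 such factors together.
grad = 1.0
for _ in range(50):
    grad *= sigmoid_grad(0.0)  # 0.25, the largest value the derivative can take

print(grad)  # 0.25 ** 50 is roughly 7.9e-31 -- effectively zero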
Revisiting the vanishing gradient problem
Not just applicable to LSTMs.
[Figure: a deep feedforward network — inputs, many layers in between, outputs, and the loss at the end]
The gradient vanishes as the depth grows: the loss is no longer influenced by the inputs for very deep networks!
Can we use ideas from LSTMs/GRUs to fix this problem?
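Before turning to the fix, a small experiment (a sketch in PyTorch, which the slides do not prescribe) makes "the loss is no longer influenced by the inputs" concrete: measure the gradient of the loss with respect to the input for increasingly deep stacks of tanh layers.

import torch
import torch.nn as nn

def input_grad_norm(depth, dim=64):
    # "Many layers in between": a plain stack of Linear + tanh layers.
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(dim, dim), nn.Tanh()]
    net = nn.Sequential(*layers)

    x = torch.randn(1, dim, requires_grad=True)
    loss = net(x).sum()          # stand-in for a real loss
    loss.backward()
    return x.grad.norm().item()  # how strongly the loss still depends on the input

for depth in (5, 20, 80):
    print(depth, input_grad_norm(depth))
# With default initialization, the input-gradient norm typically collapses
# toward zero as the depth grows.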
Revisiting the vanishing gradient problem
Intuition: Consider a single layer
y_t = g(y_{t-1}W + b)
The (t-1)-th layer is used to calculate the value of the t-th layer.
Instead of a non-linear update that directly calculates the next layer, let us try a linear update:
y_t = y_{t-1} + g(y_{t-1}W + b)
Now the gradients can be propagated all the way back to the input without attenuation.
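One way to see why (a sketch, not spelled out on the slides): differentiate the linear update and chain it across layers.

\[
y_t = y_{t-1} + g(y_{t-1}W + b)
\quad\Rightarrow\quad
\frac{\partial y_t}{\partial y_{t-1}} = I + \frac{\partial\, g(y_{t-1}W + b)}{\partial y_{t-1}}
\]
\[
\frac{\partial y_T}{\partial y_1}
= \prod_{t=2}^{T}\left(I + \frac{\partial g_t}{\partial y_{t-1}}\right)
\]

Expanding the product always leaves a bare identity term, so there is a direct path along which the gradient is not attenuated by the nonlinear factors; with the original update, the Jacobian is a product of factors that can all be smaller than one.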
Residual networks
Each layer is reformulated as
y_t = y_{t-1} + g(y_{t-1}W + b)
[Figure: the original layer computes y_t directly as g(y_{t-1}W + b); the residual connection instead adds y_{t-1} to g's output to produce y_t]
The computation graph g is not trained to predict the next layer. It predicts an update to the current layer's value instead. That is, it can be seen as a residual function: the difference between the layers.
[He et al., 2015]
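A minimal PyTorch sketch of such a layer (not code from the slides; it assumes g is a single Linear followed by tanh, and that the layer keeps its width, which the elementwise sum requires):

import torch
import torch.nn as nn

class ResidualLayer(nn.Module):
    """y_t = y_{t-1} + g(y_{t-1} W + b), with g chosen here as tanh."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)  # W and b

    def forward(self, y_prev):
        # g predicts an update (a residual), not the next layer itself.
        return y_prev + torch.tanh(self.linear(y_prev))

When a layer does change dimensionality, He et al. add a linear projection on the shortcut so the two terms can still be summed.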
Highway connections
Extend the idea, using gates to stabilize learning
- First, compute a proposed update:
  z = g(y_{t-1}W + b)
- Next, compute how much of the proposed update should be retained:
  r = σ(y_{t-1}W_1 + b_1)
- Finally, compute the actual value of the next layer:
  y_t = (1 - r) ⊙ y_{t-1} + r ⊙ z
Here σ is the sigmoid function and ⊙ is elementwise multiplication, so r gates each dimension separately.
[Srivastava et al., 2015]
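A matching PyTorch sketch of the three steps above (again an illustration, not code from the slides; g = tanh is an assumption, and W_1, b_1 live in a second Linear layer for the gate):

import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """y_t = (1 - r) * y_{t-1} + r * z: a gated mix of the old value and the update."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)  # W, b: proposed update
        self.gate = nn.Linear(dim, dim)       # W_1, b_1: how much to retain

    def forward(self, y_prev):
        z = torch.tanh(self.transform(y_prev))  # proposed update
        r = torch.sigmoid(self.gate(y_prev))    # gate in (0, 1), per dimension
        return (1 - r) * y_prev + r * z

Srivastava et al. initialize the gate bias to a negative value so that, early in training, r stays small and each layer mostly carries its input through unchanged.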
Why residual/highway connections?
- As networks become deeper, or as sequences get longer, we can no longer hope for gradients to be carried all the way through the network.
- If we want to capture long-range dependencies within the input, we need this kind of mechanism.
- More generally, this is a blueprint of an idea that can be combined with your neural network model if it gets too deep.