Lecture 8: Nonlinearities. Mark Hasegawa-Johnson, ECE 417: Multimedia Signal Processing, Fall 2020.



SLIDE 1

Review Binary Nonlinearities Classifiers BCE Loss CE Loss Summary

Lecture 8: Nonlinearities

Mark Hasegawa-Johnson ECE 417: Multimedia Signal Processing, Fall 2020


SLIDE 3

Outline

1. Review: Neural Network
2. Binary Nonlinearities
3. Classifiers
4. Binary Cross Entropy Loss
5. Multinomial Classifier: Cross-Entropy Loss
6. Summary

SLIDE 4

Review: How to train a neural network

1. Find a training dataset that contains $n$ examples showing the desired output, $y_i$, that the NN should compute in response to input vector $x_i$: $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$
2. Randomly initialize the weights and biases, $W^{(1)}$, $b^{(1)}$, $W^{(2)}$, and $b^{(2)}$.
3. Perform forward propagation: find out what the neural net computes as $\hat{y}_i$ for each $x_i$.
4. Define a loss function that measures how badly $\hat{y}$ differs from $y$.
5. Perform back-propagation to improve $W^{(1)}$, $b^{(1)}$, $W^{(2)}$, and $b^{(2)}$.
6. Repeat steps 3-5 until convergence.

SLIDE 5

Review: Second Layer = Piece-Wise Approximation

The second layer of the network approximates $\hat{y}$ using a bias term $b^{(2)}$, plus correction vectors $w_j^{(2)}$, each scaled by its activation $h_j$:
$$\hat{y} = b^{(2)} + \sum_j w_j^{(2)} h_j$$
The activation, $h_j$, is a number between 0 and 1. For example, we could use the logistic sigmoid function:
$$h_k = \sigma\left(e_k^{(1)}\right) = \frac{1}{1 + \exp(-e_k^{(1)})} \in (0, 1)$$
The logistic sigmoid is a differentiable approximation to a unit step function.

SLIDE 6

Review: First Layer = A Series of Decisions

The first layer of the network decides whether or not to "turn on" each of the $h_j$'s. It does this by comparing $x$ to a series of linear threshold vectors:
$$h_k = \sigma\left(\bar{w}_k^{(1)} x\right) \approx \begin{cases} 1 & \bar{w}_k^{(1)} x > 0 \\ 0 & \bar{w}_k^{(1)} x < 0 \end{cases}$$

SLIDE 7

Gradient Descent: How do we improve W and b?

Given some initial neural net parameter (called $u_{kj}$ in this figure), we want to find a better value of the same parameter. We do that using gradient descent:
$$u_{kj} \leftarrow u_{kj} - \eta \frac{dL}{du_{kj}},$$
where $\eta$ is a learning rate (some small constant, e.g., $\eta = 0.02$ or so).
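The update rule can be sketched in a few lines of NumPy. The toy scalar loss $L(u) = (u-3)^2$ below is my own illustration, not from the lecture; the update itself is exactly the rule above.

```python
import numpy as np

def sgd_step(u, grad, eta=0.02):
    """One gradient-descent update: u <- u - eta * dL/du."""
    return u - eta * grad

# Toy example: minimize L(u) = (u - 3)^2, whose gradient is 2(u - 3).
u = 0.0
for _ in range(500):
    u = sgd_step(u, 2 * (u - 3))
# u has converged very close to the minimizer u = 3
```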


SLIDE 9

The Basic Binary Nonlinearity: Unit Step (a.k.a. Heaviside function)

$$u\left(\bar{w}_k^{(1)} x\right) = \begin{cases} 1 & \bar{w}_k^{(1)} x > 0 \\ 0 & \bar{w}_k^{(1)} x < 0 \end{cases}$$

Pros and Cons of the Unit Step

Pro: it gives exactly piece-wise constant approximation of any desired $y$.
Con: if $h_k = u(e_k)$, then you can't use back-propagation to train the neural network. Remember back-prop:
$$\frac{dL}{dw_{kj}} = \sum_k \frac{dL}{dh_k} \frac{\partial h_k}{\partial e_k} \frac{\partial e_k}{\partial w_{kj}},$$
but $du(x)/dx$ is a Dirac delta function: zero everywhere, except where it's infinite.

SLIDE 10

The Differentiable Approximation: Logistic Sigmoid

$$\sigma(b) = \frac{1}{1 + e^{-b}}$$

Why to use the logistic function:
$$\sigma(b) \approx \begin{cases} 1 & b \to \infty \\ 0 & b \to -\infty \\ \text{in between} & \text{in between,} \end{cases}$$
and $\sigma(b)$ is smoothly differentiable, so back-prop works.

SLIDE 11

Derivative of a sigmoid

The derivative of a sigmoid is pretty easy to calculate:
$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \frac{d\sigma}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2}$$
An interesting fact that's extremely useful in computing back-prop is that if $h = \sigma(x)$, then we can write the derivative in terms of $h$, without any need to store $x$:
$$\frac{d\sigma}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2} = \left(\frac{1}{1 + e^{-x}}\right)\left(\frac{e^{-x}}{1 + e^{-x}}\right) = \left(\frac{1}{1 + e^{-x}}\right)\left(1 - \frac{1}{1 + e^{-x}}\right) = \sigma(x)(1 - \sigma(x)) = h(1 - h)$$
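The identity $d\sigma/dx = h(1-h)$ is easy to check numerically. This sketch (my own, not from the slides) compares the closed form against a central finite difference:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# The identity dsigma/dx = h(1 - h), checked against a finite difference.
x = 0.7
h = sigmoid(x)
analytic = h * (1 - h)

eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
```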

SLIDE 12

[Figures: the step function and its derivative; the logistic function and its derivative. The derivative of the step function is the Dirac delta, which is not very useful in backprop.]

SLIDE 13

Signum and Tanh

The signum function is a signed binary nonlinearity. It is used if, for some reason, you want your output to be $h \in \{-1, 1\}$, instead of $h \in \{0, 1\}$:
$$\mathrm{sign}(b) = \begin{cases} -1 & b < 0 \\ 1 & b > 0 \end{cases}$$
It is usually approximated by the hyperbolic tangent function (tanh), which is just a scaled, shifted version of the sigmoid:
$$\tanh(b) = \frac{e^b - e^{-b}}{e^b + e^{-b}} = \frac{1 - e^{-2b}}{1 + e^{-2b}} = 2\sigma(2b) - 1,$$
and which has a scaled version of the sigmoid derivative:
$$\frac{d\tanh(b)}{db} = 1 - \tanh^2(b)$$
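The relation $\tanh(b) = 2\sigma(2b) - 1$ can be confirmed numerically; a minimal sketch (sample points are my own choice):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

b = np.linspace(-3, 3, 7)
lhs = np.tanh(b)
rhs = 2 * sigmoid(2 * b) - 1      # tanh(b) = 2*sigma(2b) - 1
deriv = 1 - np.tanh(b) ** 2       # d tanh(b) / db
```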
SLIDE 14

[Figures: the signum function and its derivative; the tanh function and its derivative. The derivative of the signum function is the Dirac delta, which is not very useful in backprop.]

SLIDE 15

A surprising problem with the sigmoid: Vanishing gradients

The sigmoid has a surprising problem: for large values of $w$, $\sigma'(wx) \to 0$. When we begin training, we start with small values of $w$; $\sigma'(wx)$ is reasonably large, and training proceeds. If $w$ and $\nabla_w L$ are vectors in opposite directions, then $w \leftarrow w - \eta \nabla_w L$ makes $w$ larger. After a few iterations, $w$ gets very large. At that point, $\sigma'(wx) \to 0$, and training effectively stops. After that point, even if the neural net sees new training data that don't match what it has already learned, it can no longer change. We say that it has suffered from the "vanishing gradient problem."

SLIDE 16

A solution to the vanishing gradient problem: ReLU

The most ubiquitous solution to the vanishing gradient problem is to use a ReLU (rectified linear unit) instead of a sigmoid. The ReLU is given by
$$\mathrm{ReLU}(b) = \begin{cases} b & b \ge 0 \\ 0 & b \le 0, \end{cases}$$
and its derivative is the unit step. Notice that the unit step is equally large ($u(wx) = 1$) for any positive value ($wx > 0$), so no matter how large $w$ gets, back-propagation continues to work.
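A minimal NumPy sketch of the ReLU and its unit-step derivative (the choice of derivative value 0 at exactly $b = 0$ is my own convention; the slides leave that point unspecified):

```python
import numpy as np

def relu(b):
    """ReLU(b) = b for b >= 0, else 0."""
    return np.maximum(b, 0.0)

def relu_deriv(b):
    """The derivative of ReLU is the unit step (value at b = 0 chosen as 0 here)."""
    return (b > 0).astype(float)

b = np.array([-2.0, -0.5, 0.5, 2.0])
out = relu(b)
```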

SLIDE 17

A solution to the vanishing gradient problem: ReLU

Pro: The ReLU derivative is equally large ($\frac{d\,\mathrm{ReLU}(wx)}{d(wx)} = 1$) for any positive value ($wx > 0$), so no matter how large $w$ gets, back-propagation continues to work.
Con: If the ReLU is used as a hidden unit ($h_j = \mathrm{ReLU}(e_j)$), then your output is no longer a piece-wise constant approximation of $y$. It is now piece-wise linear.
On the other hand, maybe piece-wise linear is better than piece-wise constant, so. . .

SLIDE 18

A solution to the vanishing gradient problem: the ReLU

Pro: The ReLU derivative is equally large ($\frac{d\,\mathrm{ReLU}(wx)}{d(wx)} = 1$) for any positive value ($wx > 0$), so no matter how large $w$ gets, back-propagation continues to work.
Pro: If the ReLU is used as a hidden unit ($h_j = \mathrm{ReLU}(e_j)$), then your output is no longer a piece-wise constant approximation of $y$. It is now piece-wise linear.
Con: ??

SLIDE 19

The dying ReLU problem

Pro: The ReLU derivative is equally large ($\frac{d\,\mathrm{ReLU}(wx)}{d(wx)} = 1$) for any positive value ($wx > 0$), so no matter how large $w$ gets, back-propagation continues to work.
Pro: If the ReLU is used as a hidden unit ($h_j = \mathrm{ReLU}(e_j)$), then your output is no longer a piece-wise constant approximation of $y$. It is now piece-wise linear.
Con: If $wx + b < 0$, then $\frac{d\,\mathrm{ReLU}(wx)}{d(wx)} = 0$, and learning stops. In the worst case, if $b$ becomes very negative, then all of the hidden nodes are turned off: the network computes nothing, and no learning can take place! This is called the "Dying ReLU problem."

SLIDE 20

Solutions to the Dying ReLU problem

Softplus: Pro: always positive. Con: gradient $\to 0$ as $x \to -\infty$.
$$f(x) = \ln(1 + e^x)$$
Leaky ReLU: Pro: gradient constant, output piece-wise linear. Con: negative part might fail to match your dataset.
$$f(x) = \begin{cases} x & x \ge 0 \\ 0.01x & x \le 0 \end{cases}$$
Parametric ReLU (PReLU): Pro: gradient constant, output PWL. The slope of the negative part ($a$) is a trainable parameter, so it can adapt to your dataset. Con: you have to train it.
$$f(x) = \begin{cases} x & x \ge 0 \\ ax & x \le 0 \end{cases}$$
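The three variants above are one-liners in NumPy; a minimal sketch (the sample inputs are my own):

```python
import numpy as np

def softplus(x):
    """f(x) = ln(1 + e^x): always positive, smooth."""
    return np.log1p(np.exp(x))

def leaky_relu(x, slope=0.01):
    """Leaky ReLU with fixed negative-side slope 0.01."""
    return np.where(x >= 0, x, slope * x)

def prelu(x, a):
    """PReLU: 'a', the negative-side slope, is a trainable parameter."""
    return np.where(x >= 0, x, a * x)

x = np.array([-2.0, 0.0, 2.0])
```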


SLIDE 22

A classifier target function

A "classifier" is a neural network with discrete outputs. For example, suppose you need to color a 2D picture. The goal is to output $\hat{y}(x) = 1$ if $x$ should be red, and $\hat{y} = -1$ if $x$ should be blue:

SLIDE 23

A classifier neural network

We can discretize the output by simply using an output nonlinearity, e.g., $\hat{y}_k = g(e_k^{(2)})$, for some nonlinearity $g(x)$. With $x$ as the input vector:
$$e_k^{(1)} = b_k^{(1)} + \sum_{j=1}^{D} w_{kj}^{(1)} x_j$$
$$h_k = \sigma(e_k^{(1)})$$
$$e_k^{(2)} = b_k^{(2)} + \sum_{j=1}^{N} w_{kj}^{(2)} h_j$$
$$\hat{y}_k = g(e_k^{(2)})$$
$$\hat{y} = h(x, W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})$$
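The forward pass can be sketched directly from these equations. This is a minimal illustration (the dimensions and random weights are my own choices), with tanh as the output nonlinearity $g$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W1, b1, W2, b2, g=np.tanh):
    """Two-layer classifier network: e1 -> h -> e2 -> yhat = g(e2)."""
    e1 = b1 + W1 @ x      # first-layer excitation
    h = sigmoid(e1)       # first-layer activation
    e2 = b2 + W2 @ h      # second-layer excitation
    return g(e2)          # output nonlinearity

rng = np.random.default_rng(0)
x = rng.standard_normal(3)                                  # D = 3 inputs
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)  # N = 4 hidden
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)  # K = 2 outputs
yhat = forward(x, W1, b1, W2, b2)
```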

SLIDE 24

Nonlinearities for classifier neural networks

During testing: the output is passed through a hard nonlinearity, e.g., a unit step or a signum. During training: the output is passed through the corresponding soft nonlinearity, e.g., sigmoid or tanh.

SLIDE 25

Excitation, First Layer: $e_k^{(1)} = b_k^{(1)} + \sum_{j=1}^{2} w_{kj}^{(1)} x_j$

SLIDE 26

Activation, First Layer: $h_k = \tanh(e_k^{(1)})$

Here, I'm using tanh as the nonlinearity for the hidden layer. But it often works better if we use ReLU or PReLU.
SLIDE 27

Excitation, Second Layer: $e_k^{(2)} = b_k^{(2)} + \sum_{j=1}^{2} w_{kj}^{(2)} h_j$

SLIDE 28

Activation, Second Layer: $\hat{y}_k = \mathrm{sign}(e_k^{(2)})$

During training, the output layer uses a soft nonlinearity. During testing, though, the soft nonlinearity is replaced with a hard nonlinearity, e.g., signum:


SLIDE 30

Review: MSE

Until now, we've assumed that the loss function is MSE:
$$L = \frac{1}{2n} \sum_{i=1}^{n} \left\| y_i - \hat{y}(x_i) \right\|^2$$
MSE makes sense if $y$ and $\hat{y}$ are both real-valued vectors, and we want to compute $\hat{y}_{\mathrm{MMSE}}(x) = E[y|x]$. But what if $\hat{y}$ and $y$ are discrete-valued (i.e., classifiers)?

Surprise: MSE works surprisingly well, even with discrete $y$! But a different metric, binary cross-entropy (BCE), works slightly better.

SLIDE 31

MSE with a binary target vector

Suppose $y$ is just a scalar binary classifier label, $y \in \{0, 1\}$ (for example: "is it a dog or a cat?"). Suppose that the input vector, $x$, is not quite enough information to tell us what $y$ should be. Instead, $x$ only tells us the probability of $y = 1$:
$$y = \begin{cases} 1 & \text{with probability } p_{Y|X}(1|x) \\ 0 & \text{with probability } p_{Y|X}(0|x) \end{cases}$$
In the limit as $n \to \infty$, assuming that gradient descent finds the global optimum, the MMSE solution gives us:
$$\hat{y}(x) \xrightarrow{n \to \infty} E[y|x] = 1 \times p_{Y|X}(1|x) + 0 \times p_{Y|X}(0|x) = p_{Y|X}(1|x)$$

SLIDE 32

Pros and Cons of MMSE for Binary Classifiers

Pro: In the limit as $n \to \infty$, the global optimum is $\hat{y}(x) \to p_{Y|X}(1|x)$.
Con: The sigmoid nonlinearity is hard to train using MMSE. Remember the vanishing gradient problem: $\sigma'(wx) \to 0$ as $w \to \infty$, so after a few epochs of training, the neural net just stops learning.
Solution: Can we devise a different loss function (not MMSE) that will give us the same solution ($\hat{y}(x) \to p_{Y|X}(1|x)$), but without suffering from the vanishing gradient problem?

SLIDE 33

Binary Cross Entropy

Suppose we treat the neural net output as a noisy estimator, $\hat{p}_{Y|X}(y|x)$, of the unknown true pmf $p_{Y|X}(y|x)$:
$$\hat{y}_i = \hat{p}_{Y|X}(1|x), \quad \text{so that} \quad \hat{p}_{Y|X}(y_i|x_i) = \begin{cases} \hat{y}_i & y_i = 1 \\ 1 - \hat{y}_i & y_i = 0 \end{cases}$$
The binary cross-entropy loss is the negative log probability of the training data, assuming i.i.d. training examples:
$$L_{\mathrm{BCE}} = -\frac{1}{n} \sum_{i=1}^{n} \ln \hat{p}_{Y|X}(y_i|x_i) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \ln \hat{y}_i + (1 - y_i) \ln(1 - \hat{y}_i) \right]$$
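The BCE formula translates directly into code. A minimal sketch (the clipping guard and the example labels/predictions are my own additions, not from the slides):

```python
import numpy as np

def bce_loss(y, yhat, eps=1e-12):
    """L_BCE = -(1/n) * sum(y*ln(yhat) + (1-y)*ln(1-yhat))."""
    yhat = np.clip(yhat, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

y = np.array([1.0, 0.0, 1.0, 0.0])
yhat = np.array([0.9, 0.1, 0.8, 0.2])   # fairly confident, fairly correct
loss = bce_loss(y, yhat)
```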

SLIDE 34

The Derivative of BCE

BCE is useful because it has the same solution as MSE, without allowing the sigmoid to suffer from vanishing gradients. Suppose $\hat{y}_i = \sigma(w h_i)$. Then
$$\nabla_w L = -\frac{1}{n} \left[ \sum_{i: y_i = 1} \nabla_w \ln \sigma(w h_i) + \sum_{i: y_i = 0} \nabla_w \ln(1 - \sigma(w h_i)) \right]$$
$$= -\frac{1}{n} \left[ \sum_{i: y_i = 1} \frac{\nabla_w \sigma(w h_i)}{\sigma(w h_i)} + \sum_{i: y_i = 0} \frac{\nabla_w (1 - \sigma(w h_i))}{1 - \sigma(w h_i)} \right]$$
$$= -\frac{1}{n} \left[ \sum_{i: y_i = 1} \frac{\hat{y}_i (1 - \hat{y}_i) h_i}{\hat{y}_i} + \sum_{i: y_i = 0} \frac{-\hat{y}_i (1 - \hat{y}_i) h_i}{1 - \hat{y}_i} \right]$$
$$= -\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i) h_i$$
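The final line of the derivation can be verified with a finite-difference check. This sketch (random data and scalar $w$ are my own illustration) compares $-\frac{1}{n}\sum_i (y_i - \hat{y}_i) h_i$ against a numerical derivative of the BCE loss:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(w, h, y):
    """BCE loss when yhat_i = sigmoid(w * h_i), scalar w."""
    yhat = sigmoid(w * h)
    return -np.mean(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

rng = np.random.default_rng(1)
h = rng.standard_normal(8)
y = (rng.random(8) > 0.5).astype(float)
w = 0.3

yhat = sigmoid(w * h)
analytic = -np.mean((y - yhat) * h)   # -(1/n) sum (y_i - yhat_i) h_i

eps = 1e-6
numeric = (bce(w + eps, h, y) - bce(w - eps, h, y)) / (2 * eps)
```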

SLIDE 35

Why Cross-Entropy is Useful for Machine Learning

Binary cross-entropy is useful for machine learning because:

1. Just like MSE, it estimates the true class probability: in the limit as $n \to \infty$, $\nabla_W L \to E\left[(Y - \hat{Y}) H\right]$, which is zero only if $\hat{Y} = E[Y|X] = p_{Y|X}(1|x)$.
2. Unlike MSE, it does not suffer from the vanishing gradient problem of the sigmoid.

SLIDE 36

Unlike MSE, BCE does not suffer from the vanishing gradient problem of the sigmoid.

The vanishing gradient problem was caused by $\sigma' = \sigma(1 - \sigma)$, which goes to zero when its input is either plus or minus infinity.
If $y_i = 1$, then differentiating $\ln \sigma$ cancels the $\sigma$ term in the numerator, leaving only the $(1 - \sigma)$ term, which is large if and only if the neural net is wrong.
If $y_i = 0$, then differentiating $\ln(1 - \sigma)$ cancels the $(1 - \sigma)$ term in the numerator, leaving only the $\sigma$ term, which is large if and only if the neural net is wrong.
So binary cross-entropy ignores training tokens only if the neural net guesses them right. If it guesses wrong, then back-propagation happens.


SLIDE 38

Multinomial Classifier

Suppose, instead of just a 2-class classifier, we want the neural network to classify $x$ as being one of $K$ different classes. There are many ways to encode this, but one of the best is
$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_K \end{bmatrix}, \qquad y_k = \begin{cases} 1 & k = k^* \ (k \text{ is the correct class}) \\ 0 & \text{otherwise} \end{cases}$$
A vector $y$ like this is called a "one-hot vector," because it is a binary vector in which only one of the elements is nonzero ("hot"). This is useful because minimizing the MSE loss gives:
$$\hat{y} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_K \end{bmatrix} = \begin{bmatrix} \hat{p}_{Y_1|X}(1|x) \\ \hat{p}_{Y_2|X}(1|x) \\ \vdots \\ \hat{p}_{Y_K|X}(1|x) \end{bmatrix},$$
where the global optimum of $\hat{p}(y|x) \to p(y|x)$ as $n \to \infty$.

SLIDE 39

One-hot vectors and Cross-entropy loss

The cross-entropy loss, for a training database coded with one-hot vectors, is
$$L_{\mathrm{CE}} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ki} \ln \hat{y}_{ki}$$
This is useful because:

1. Like MSE, cross-entropy has an asymptotic global optimum at $\hat{y}_k \to p_{Y_k|X}(1|x)$.
2. Unlike MSE, cross-entropy with a softmax nonlinearity suffers no vanishing gradient problem.
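With one-hot rows, only the term for the correct class survives the inner sum. A minimal sketch of $L_{\mathrm{CE}}$ (the clipping guard and the 3-example, 3-class data are my own):

```python
import numpy as np

def ce_loss(Y, Yhat, eps=1e-12):
    """L_CE = -(1/n) * sum_i sum_k y_ki * ln(yhat_ki); rows = training examples."""
    return -np.mean(np.sum(Y * np.log(np.clip(Yhat, eps, 1.0)), axis=1))

# One-hot targets for n = 3 examples, K = 3 classes.
Y = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)
Yhat = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.2, 0.2, 0.6]])
loss = ce_loss(Y, Yhat)   # = -(ln 0.7 + ln 0.8 + ln 0.6)/3
```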

SLIDE 40

Softmax Nonlinearity

The multinomial cross-entropy loss is only well-defined if $0 < \hat{y}_{ki} < 1$, and it is only well-interpretable if $\sum_k \hat{y}_{ki} = 1$. We can guarantee these two properties by setting
$$\hat{y}_k = \mathrm{softmax}_k(W h) = \frac{\exp(\bar{w}_k h)}{\sum_{\ell=1}^{K} \exp(\bar{w}_\ell h)},$$
where $\bar{w}_k$ is the $k$th row of the $W$ matrix.
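A minimal softmax sketch. The max-subtraction trick is my own addition for numerical stability (it doesn't change the result, since it multiplies numerator and denominator by the same constant):

```python
import numpy as np

def softmax(e):
    """softmax_k(e) = exp(e_k) / sum_l exp(e_l)."""
    e = e - np.max(e)        # subtract max for numerical stability
    exp_e = np.exp(e)
    return exp_e / np.sum(exp_e)

W = np.array([[1.0, -2.0],
              [0.5,  0.3],
              [-1.0, 2.0]])   # K = 3 rows w_k
h = np.array([0.2, 0.7])
yhat = softmax(W @ h)
```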

SLIDE 41

Sigmoid is a special case of Softmax!

$$\mathrm{softmax}_k(W h) = \frac{\exp(\bar{w}_k h)}{\sum_{\ell=1}^{K} \exp(\bar{w}_\ell h)}$$
Notice that, in the 2-class case, the softmax is just exactly a logistic sigmoid function:
$$\mathrm{softmax}_1(W h) = \frac{e^{\bar{w}_1 h}}{e^{\bar{w}_1 h} + e^{\bar{w}_2 h}} = \frac{1}{1 + e^{-(\bar{w}_1 - \bar{w}_2) h}} = \sigma\left((\bar{w}_1 - \bar{w}_2) h\right),$$
so everything that you've already learned about the sigmoid applies equally well here.
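The 2-class equivalence is easy to confirm numerically; a small sketch with random weights (my own illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(e):
    exp_e = np.exp(e - np.max(e))
    return exp_e / np.sum(exp_e)

rng = np.random.default_rng(2)
w1, w2 = rng.standard_normal(4), rng.standard_normal(4)
h = rng.standard_normal(4)

p1 = softmax(np.array([w1 @ h, w2 @ h]))[0]   # 2-class softmax, class 1
p1_sigmoid = sigmoid((w1 - w2) @ h)           # sigma((w1 - w2) h)
```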


SLIDE 43

Nonlinearities Summarized

Unit-step and signum nonlinearities, on the hidden layer, cause the neural net to compute a piece-wise constant approximation of the target function. Unfortunately, they're not differentiable, so they're not trainable.
Sigmoid and tanh are differentiable approximations of unit-step and signum, respectively. Unfortunately, they suffer from a vanishing gradient problem: as the weight matrix gets larger, the derivatives of sigmoid and tanh go to zero, so error doesn't get back-propagated through the nonlinearity any more.
ReLU has the nice property that the output is a piece-wise-linear approximation of the target function, instead of piece-wise constant. It also has no vanishing gradient problem. Instead, it has the dying-ReLU problem.
Softplus, Leaky ReLU, and PReLU are different solutions to the dying-ReLU problem.

SLIDE 44

Error Metrics Summarized

Use MSE to achieve $\hat{y} \to E[y|x]$. That's almost always what you want.
For a binary classifier with a sigmoid output, BCE loss gives you the MSE result without the vanishing gradient problem.
For a multi-class classifier with a softmax output, CE loss gives you the MSE result without the vanishing gradient problem.
After you're done training, you can make your cell phone app more efficient by throwing away the uncertainty:
Replace softmax output nodes with max.
Replace logistic output nodes with unit-step.
Replace tanh output nodes with signum.