

  1. Deep Learning Primer Nishith Khandwala

  2. Neural Networks

  3. Overview
● Neural Network Basics
● Activation Functions
● Stochastic Gradient Descent (SGD)
● Regularization (Dropout)
● Training Tips and Tricks

  4. Neural Network (NN) Basics
Dataset: (x, y) where x : inputs, y : labels
Steps to train a 1-hidden-layer NN:
● Do a forward pass: ŷ = f(xW + b)
● Compute the loss: loss(y, ŷ)
● Compute gradients using backprop
● Update the weights using an optimization algorithm, like SGD
● Do hyperparameter tuning on the dev set
● Evaluate the NN on the test set
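
The steps above as a minimal NumPy sketch (not from the slides): a 1-hidden-layer network with a tanh hidden layer and mean-squared-error loss; all shapes and variable names are illustrative.

```python
# Sketch of one training step for a 1-hidden-layer network (illustrative shapes).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # 4 examples, 3 features
y = rng.normal(size=(4, 1))          # regression targets

W1 = rng.normal(size=(3, 5)) * 0.1   # input -> hidden
b1 = np.zeros(5)
W2 = rng.normal(size=(5, 1)) * 0.1   # hidden -> output
b2 = np.zeros(1)
lr = 0.01

# Forward pass: y_hat = f(xW + b) with a tanh hidden layer
h = np.tanh(x @ W1 + b1)             # (4, 5)
y_hat = h @ W2 + b2                  # (4, 1)
loss = np.mean((y_hat - y) ** 2)     # loss(y, y_hat)

# Backprop: local gradients chained from the loss back to the weights
d_y_hat = 2 * (y_hat - y) / y.shape[0]
dW2 = h.T @ d_y_hat
db2 = d_y_hat.sum(axis=0)
d_h = d_y_hat @ W2.T
d_pre = d_h * (1 - h ** 2)           # tanh'(z) = 1 - tanh(z)^2
dW1 = x.T @ d_pre
db1 = d_pre.sum(axis=0)

# SGD update
for param, grad in [(W1, dW1), (b1, db1), (W2, dW2), (b2, db2)]:
    param -= lr * grad
```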

  5. Activation Functions: Sigmoid
Properties:
● Squashes input between 0 and 1.
Problems:
● Saturation of neurons kills gradients.
● Output is not centered at 0.

  6. Activation Functions: Tanh
Properties:
● Squashes input between -1 and 1.
● Output centered at 0.
Problems:
● Saturation of neurons kills gradients.

  7. Activation Functions: ReLU
Properties:
● No saturation (for positive inputs)
● Computationally cheap
● Empirically known to converge faster
Problems:
● Output not centered at 0
● When the input < 0, the ReLU gradient is 0, so the neuron never updates (the “dying ReLU” problem).
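
For reference, here is a small NumPy sketch (not from the slides) of the three activations above and their local gradients, which backprop multiplies into the upstream gradient; the zero gradient of ReLU for negative inputs is what kills “dead” units.

```python
# Sketch of sigmoid, tanh, and ReLU with their local gradients (NumPy).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)              # near 0 for large |z|: saturation kills gradients

def tanh_grad(z):
    return 1 - np.tanh(z) ** 2      # zero-centered output, but still saturates

def relu(z):
    return np.maximum(0, z)

def relu_grad(z):
    return (z > 0).astype(z.dtype)  # exactly 0 for z < 0: the unit stops updating
```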

  8. Stochastic Gradient Descent (SGD)
● SGD update: θ ← θ − α ∇θ J(θ)
  ○ θ : weights/parameters
  ○ α : learning rate
  ○ J : loss function
● The SGD update happens after every training example.
● Minibatch SGD (sometimes also abbreviated as SGD) considers a small batch of training examples at once, averages their loss, and updates θ.
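
A minimal sketch of the minibatch SGD loop described above (my illustration, not code from the slides): grad_J stands in for whatever backprop returns, and the batch size and learning rate are arbitrary.

```python
# Minibatch SGD sketch: average the gradient over a small batch, then step.
import numpy as np

def minibatch_sgd(theta, grad_J, X, Y, alpha=0.01, batch_size=32, epochs=10):
    """grad_J(theta, x_batch, y_batch) -> gradient averaged over the batch."""
    n = X.shape[0]
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(n)                # shuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            g = grad_J(theta, X[idx], Y[idx])     # averaged over the minibatch
            theta = theta - alpha * g             # theta <- theta - alpha * grad
    return theta
```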

  9. Regularization: Dropout
● Randomly drop neurons in the forward pass during training.
● At test time, turn dropout off.
● Prevents overfitting by forcing the network to learn redundancies.
● Think of dropout as training an ensemble of networks.
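
A sketch of how dropout is commonly implemented ("inverted dropout", a standard variant, not necessarily exactly what the slide shows): neurons are zeroed at random during training and the survivors are rescaled, so the forward pass needs no change at test time.

```python
# Inverted-dropout sketch: drop units during training, do nothing at test time.
import numpy as np

def dropout(h, p_drop=0.5, train=True):
    if not train:
        return h                      # dropout is turned off at test time
    mask = (np.random.rand(*h.shape) >= p_drop) / (1.0 - p_drop)
    return h * mask                   # zero some units, rescale the rest
```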

  10. Training Tips and Tricks
● Learning rate:
  ○ If the loss curve seems to be unstable (a jagged line), decrease the learning rate.
  ○ If the loss curve appears to be “linear”, increase the learning rate.
[Figure: loss curves for very high, high, low, and good learning rates]

  11. Training Tips and Tricks
● Regularization (Dropout, L2 norm, …):
  ○ If the gap between train and dev accuracies is large (overfitting), increase the regularization constant.
● DO NOT test your model on the test set until overfitting is no longer an issue.

  12. Backpropagation and Gradients Slides courtesy of Barak Oshri

  13. Problem Statement
Given a function f with respect to inputs x, labels y, and parameters θ, compute the gradient of the loss with respect to θ.

  14. Backpropagation An algorithm for computing the gradient of a compound function as a series of local, intermediate gradients

  15. Backpropagation
1. Identify intermediate functions (forward prop)
2. Compute local gradients
3. Combine with the upstream error signal to get the full gradient
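
A tiny sketch of those three steps for the illustrative function f(x, w, b) = sigmoid(w·x + b) (my example, not one from the slides): name the intermediates in the forward pass, compute each local gradient, then multiply by the upstream error signal.

```python
# Backprop sketch for f = sigmoid(w*x + b): intermediates, local grads, chain.
import numpy as np

def forward_backward(x, w, b, upstream=1.0):
    # 1. Intermediate functions (forward prop)
    z = w * x + b
    s = 1.0 / (1.0 + np.exp(-z))          # f = sigmoid(z)

    # 2. Local gradients
    ds_dz = s * (1 - s)
    dz_dw, dz_dx, dz_db = x, w, 1.0

    # 3. Combine with the upstream error signal
    dz = upstream * ds_dz
    return {"w": dz * dz_dw, "x": dz * dz_dx, "b": dz * dz_db}
```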

  16. Modularity - Simple Example
[Figure: compound function and its intermediate variables (forward propagation)]

  17. Modularity - Neural Network Example
[Figure: compound function and its intermediate variables (forward propagation)]

  18. Intermediate Variables and Gradients
[Figure: intermediate variables (forward propagation) and intermediate gradients (backward propagation)]

  19. Chain Rule Behavior Key chain rule intuition: Slopes multiply

  20. Circuit Intuition

  21. Backprop Menu for Success
1. Write down the variable graph
2. Compute the derivative of the cost function
3. Keep track of error signals
4. Enforce the shape rule on error signals
5. Use matrix balancing when deriving over a linear transformation
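
One way to see the shape rule in item 4 (my illustration, not the slides'): the gradient with respect to any tensor must have exactly that tensor's shape, which pins down how the error signal combines with the inputs for a linear layer y = xW + b.

```python
# Shape-rule sketch for a linear layer y = x @ W + b (illustrative sizes).
import numpy as np

N, D, H = 8, 4, 3
x = np.random.randn(N, D)
W = np.random.randn(D, H)
b = np.random.randn(H)
d_y = np.random.randn(N, H)      # upstream error signal, same shape as y

dW = x.T @ d_y                   # (D, N) @ (N, H) -> (D, H), matches W
db = d_y.sum(axis=0)             # (H,), matches b
dx = d_y @ W.T                   # (N, H) @ (H, D) -> (N, D), matches x

assert dW.shape == W.shape and db.shape == b.shape and dx.shape == x.shape
```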

  22. Convolutional Neural Networks Slides courtesy of Justin Johnson, Serena Yeung, and Fei-Fei Li

  23. Fully Connected Layer
32x32x3 image -> stretch to 3072 x 1
[Figure: input (3072 x 1), weights W (10 x 3072), activation (10 x 1)]

  24. Fully Connected Layer
32x32x3 image -> stretch to 3072 x 1
[Figure: input (3072 x 1), weights W (10 x 3072), activation (10 x 1)]
Each activation entry is 1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product).
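
A sketch of these shapes in code (the sizes come from the slide; the code itself is illustrative): the 32x32x3 image is flattened to a 3072-vector, and each of the 10 outputs is a dot product between one row of a 10x3072 weight matrix and that vector.

```python
# Fully connected layer on a 32x32x3 image: flatten, then one matrix multiply.
import numpy as np

image = np.random.randn(32, 32, 3)
x = image.reshape(3072)          # stretch to 3072 x 1
W = np.random.randn(10, 3072)    # 10 x 3072 weights
activation = W @ x               # each entry: a 3072-dimensional dot product
print(activation.shape)          # (10,)
```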

  25. Convolution Layer
32x32x3 image -> preserve spatial structure
[Figure: image volume of width 32, height 32, depth 3]

  26. Convolution Layer
32x32x3 image, 5x5x3 filter
Convolve the filter with the image, i.e. “slide over the image spatially, computing dot products”.

  27. Convolution Layer
Filters always extend the full depth of the input volume.
32x32x3 image, 5x5x3 filter
Convolve the filter with the image, i.e. “slide over the image spatially, computing dot products”.

  28. Convolution Layer
32x32x3 image, 5x5x3 filter
1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. a 5*5*3 = 75-dimensional dot product + bias)

  29. Convolution Layer
32x32x3 image, 5x5x3 filter
Convolve (slide) over all spatial locations to produce a 28x28x1 activation map.

  30. Convolution Layer
Consider a second, green filter: convolving it over all spatial locations produces a second 28x28x1 activation map.

  31. Convolution Layer
For example, if we had 6 5x5 filters, we’d get 6 separate activation maps. We stack these up to get a “new image” of size 28x28x6!
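
A naive (slow, purely illustrative) sketch of the sliding-window computation above: 6 filters of size 5x5x3, stride 1, no padding, so a 32x32x3 input produces a 28x28x6 output.

```python
# Naive convolution sketch: 6 filters of 5x5x3 over a 32x32x3 image -> 28x28x6.
import numpy as np

image = np.random.randn(32, 32, 3)
filters = np.random.randn(6, 5, 5, 3)
bias = np.zeros(6)

out = np.zeros((28, 28, 6))                       # (32 - 5 + 1) = 28
for k in range(6):
    for i in range(28):
        for j in range(28):
            chunk = image[i:i+5, j:j+5, :]        # 5x5x3 chunk of the image
            out[i, j, k] = np.sum(chunk * filters[k]) + bias[k]  # 75-dim dot product + bias
print(out.shape)  # (28, 28, 6): six stacked activation maps
```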

  32. Preview: a ConvNet is a sequence of convolution layers, interspersed with activation functions
[Figure: 32x32x3 input -> CONV, ReLU (e.g. 6 5x5x3 filters) -> 28x28x6]

  33. Preview: a ConvNet is a sequence of convolution layers, interspersed with activation functions
[Figure: 32x32x3 -> CONV, ReLU (e.g. 6 5x5x3 filters) -> 28x28x6 -> CONV, ReLU (e.g. 10 5x5x6 filters) -> 24x24x10 -> ….]
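
The shape arithmetic for that sequence (stride 1, no padding, so output size = input size − filter size + 1) can be checked with a tiny sketch; the layer sizes are the ones on the slide.

```python
# Shape arithmetic for CONV -> ReLU -> CONV -> ReLU with stride 1, no padding.
def conv_out(size, filter_size, stride=1, pad=0):
    return (size + 2 * pad - filter_size) // stride + 1

s = 32                 # 32x32x3 input
s = conv_out(s, 5)     # 6 filters of 5x5x3  -> 28x28x6
print(s)               # 28
s = conv_out(s, 5)     # 10 filters of 5x5x6 -> 24x24x10
print(s)               # 24
```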

  34. RNNs, Language Models, LSTMs, GRUs Slides courtesy of Lisa Wang and Juhi Naik

  35. RNNs
● Review of RNNs
● RNN Language Models
● Vanishing Gradient Problem
● GRUs
● LSTMs

  36. RNN Review
Key points:
● Weights are shared (tied) across timesteps (W_xh, W_hh, W_hy)
● The hidden state at time t depends on the previous hidden state and the new input
● Backpropagation across timesteps (use the unrolled network)
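
A sketch of one vanilla RNN step using the tied weights named on the slide (W_xh, W_hh, W_hy); the tanh nonlinearity is the common choice and is an assumption here, not something stated on the slide.

```python
# Vanilla RNN step sketch: the same W_xh, W_hh, W_hy are reused at every timestep.
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    h_t = np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)  # depends on previous state and new input
    y_t = h_t @ W_hy + b_y                           # prediction at this timestep
    return h_t, y_t
```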

  37. RNN Review
RNNs are good for:
● Learning representations for sequential data with temporal relationships
● Making predictions at every timestep, or at the end of a sequence

  38. RNN Language Model
● Language Modeling (LM): the task of computing probability distributions over sequences of words
● Plays an important role in speech recognition, text summarization, etc.
● RNN Language Model: [Figure]
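
In an RNN language model, the output at each timestep is a softmax over the vocabulary, giving P(next word | words so far). A minimal sketch of that output step (the function and weight names are illustrative, not from the slides):

```python
# RNN LM output sketch: hidden state -> softmax distribution over the vocabulary.
import numpy as np

def next_word_distribution(h_t, W_hy, b_y):
    logits = h_t @ W_hy + b_y               # one score per vocabulary word
    logits = logits - logits.max()          # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs                            # P(w_{t+1} = w | w_1 ... w_t)
```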

  39. RNN Language Model for Machine Translation
● Encoder for the source language
● Decoder for the target language
● Different weights in the encoder and decoder sections of the RNN (could see them as two chained RNNs)

  40. Vanishing Gradient Problem
● Backprop in RNNs: recursive gradient call for the hidden layer
● The magnitude of the gradients of typical activation functions is between 0 and 1.
● When terms are less than 1, the product can get small very quickly.
● Vanishing gradients → RNNs fail to learn, since parameters barely update.
● GRUs and LSTMs to the rescue!
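
The "product of terms less than 1" point can be seen numerically: if each timestep contributed a gradient factor of, say, 0.5 (an illustrative number, not from the slides), the gradient reaching 30 steps back is vanishingly small.

```python
# Vanishing gradient sketch: multiplying many factors < 1 shrinks the gradient.
factor = 0.5                  # illustrative per-timestep gradient magnitude
grad = 1.0
for _ in range(30):
    grad *= factor
print(grad)                   # ~9.3e-10: parameters far back barely update
```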

  41. Gated Recurrent Units (GRUs)
● Reset gate, r_t
● Update gate, z_t
● r_t and z_t control long-term and short-term dependencies (mitigates the vanishing gradient problem)
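
A sketch of the standard GRU equations (a common formulation, with illustrative weight names; not copied from the slides): the update gate z_t interpolates between the previous state and a candidate state, and the reset gate r_t controls how much of the past feeds into that candidate.

```python
# GRU step sketch (standard formulation; weight names are illustrative).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    z_t = sigmoid(x_t @ Wz + h_prev @ Uz)              # update gate
    r_t = sigmoid(x_t @ Wr + h_prev @ Ur)              # reset gate
    h_tilde = np.tanh(x_t @ Wh + (r_t * h_prev) @ Uh)  # candidate state
    h_t = (1 - z_t) * h_prev + z_t * h_tilde           # interpolate old and new
    return h_t
```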

  42. Gated Recurrent Units (GRUs) Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

  43. LSTMs
● i_t : Input gate - How much does the current input matter?
● f_t : Forget gate - How much does the past matter?
● o_t : Output gate - How much should the current cell be exposed?
● c_t : New memory - Memory from the current cell
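
A sketch of a standard LSTM step using those gates (a common formulation with illustrative weight names, not code from the slides): the forget gate f_t scales the old cell memory, the input gate i_t scales the new memory, and the output gate o_t decides how much of the cell is exposed as the hidden state.

```python
# LSTM step sketch (common formulation; weight names are illustrative).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wi, Ui, Wf, Uf, Wo, Uo, Wc, Uc):
    i_t = sigmoid(x_t @ Wi + h_prev @ Ui)      # input gate: how much new memory matters
    f_t = sigmoid(x_t @ Wf + h_prev @ Uf)      # forget gate: how much of the past to keep
    o_t = sigmoid(x_t @ Wo + h_prev @ Uo)      # output gate: how much of the cell to expose
    c_tilde = np.tanh(x_t @ Wc + h_prev @ Uc)  # new memory from the current cell
    c_t = f_t * c_prev + i_t * c_tilde         # updated cell state
    h_t = o_t * np.tanh(c_t)                   # exposed hidden state
    return h_t, c_t
```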
