Deep Learning Primer
Nishith Khandwala
Neural Networks Overview
○ Neural Network Basics
○ Activation Functions
○ Stochastic Gradient Descent (SGD)
○ Regularization (Dropout)
○ Training Tips and Tricks

Neural Network (NN)
Dataset: (x, y), where x: inputs, y: labels.
Steps to train a 1-hidden-layer NN:
1. Forward pass: compute predictions from the inputs x.
2. Compute the loss between predictions and the labels y.
3. Backward pass: compute gradients of the loss with respect to the weights.
4. Update the weights using an optimization algorithm, like SGD.
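The slides spell out only the last step in full; here is a minimal sketch of the whole loop in NumPy (the layer sizes, tanh/sigmoid activations, and squared-error loss are illustrative assumptions, not from the slides):

```python
import numpy as np

# Toy data: 100 examples, 4 features, binary labels (shapes are illustrative).
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 4))
y = (x.sum(axis=1, keepdims=True) > 0).astype(float)

# 1-hidden-layer NN: 4 -> 8 -> 1 (sizes are assumptions).
W1 = rng.normal(scale=0.1, size=(4, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.1, size=(8, 1)); b2 = np.zeros(1)
lr = 0.1  # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(500):
    # 1. Forward pass: compute predictions.
    h = np.tanh(x @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    # 2. Loss between predictions and labels.
    loss = np.mean((p - y) ** 2)
    # 3. Backward pass: gradients of the loss w.r.t. the weights.
    dp = 2 * (p - y) / len(x)
    dz2 = dp * p * (1 - p)                # sigmoid'(z) = p(1-p)
    dW2 = h.T @ dz2; db2 = dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * (1 - h ** 2)     # tanh'(z) = 1 - tanh(z)^2
    dW1 = x.T @ dz1; db1 = dz1.sum(axis=0)
    # 4. Update weights (plain gradient descent here; SGD is discussed below).
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```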
Activation Functions
[Three slides, each listing the Properties and Problems of one activation function; the specifics did not survive extraction.]
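Since those specifics are gone, here is a hedged sketch assuming the usual trio covered at this level (sigmoid, tanh, ReLU; the slides may have chosen differently), with commonly cited properties and problems as comments:

```python
import numpy as np

def sigmoid(z):
    # Property: squashes to (0, 1). Problems: saturates (kills gradients),
    # and outputs are not zero-centered.
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Property: squashes to (-1, 1) and is zero-centered. Problem: still saturates.
    return np.tanh(z)

def relu(z):
    # Property: max(0, z) is cheap and does not saturate for z > 0.
    # Problem: units can "die" if z stays negative (zero gradient forever).
    return np.maximum(0.0, z)
```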
Stochastic Gradient Descent (SGD)
○ θ: weights/parameters
○ α: learning rate
○ J: loss function
In its strictest form, stochastic gradient descent updates θ after every single training example. Minibatch gradient descent (commonly abbreviated as SGD) instead considers a small batch of training examples at once, averages their loss, and updates θ.
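A minimal sketch of one such update in the θ/α/J notation above (the helper grad_J and the batch format are hypothetical):

```python
import numpy as np

def sgd_step(theta, grad_J, batch, alpha=0.01):
    """One minibatch SGD update: theta <- theta - alpha * averaged gradient.

    grad_J(theta, x, y) is a hypothetical helper returning dJ/dtheta
    for a single (x, y) training example.
    """
    grads = [grad_J(theta, x, y) for x, y in batch]
    return theta - alpha * np.mean(grads, axis=0)
```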
Regularization (Dropout)
○ Dropout randomly zeroes out a fraction of the units on each forward pass during training.
○ This forces the network to learn redundancies. Think about dropout as training an ensemble of networks.
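A minimal sketch of (inverted) dropout, assuming p is the probability of keeping a unit; scaling by 1/p during training means no rescaling is needed at test time:

```python
import numpy as np

def dropout(h, p=0.5, training=True):
    # h: a matrix of activations; p: keep probability (an assumption here).
    if not training:
        return h  # the full network is used at test time
    mask = (np.random.rand(*h.shape) < p) / p  # drop units, rescale survivors
    return h * mask
```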
Training Tips and Tricks
○ If the loss curve seems unstable (a jagged line), decrease the learning rate.
○ If the loss curve appears to be "linear", increase the learning rate.
[Figure: loss vs. epochs for very high, high, good, and low learning rates.]
○ If the gap between train and dev accuracies is large (overfitting), increase the regularization constant.
○ DO NOT test your model on the test set until you are completely done with tuning.
Slides courtesy of Barak Oshri
Given a function f with respect to inputs x, labels y, and parameters θ, compute the gradient of the loss with respect to θ.
Backpropagation: an algorithm for computing the gradient of a compound function as a series of local, intermediate gradients.
1. Identify intermediate functions (forward prop)
2. Compute local gradients
3. Combine with upstream error signal to get full gradient
[Worked example: a compound function is decomposed into intermediate variables (forward propagation), then intermediate gradients are computed (backward propagation); the equations did not survive extraction.]
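Since the worked example itself is gone, here is a sketch of the three-step recipe on a small compound function of my own choosing, f(x, W) = sum(sigmoid(Wx)):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # inputs
W = rng.normal(size=(2, 3))     # parameters

# 1. Forward prop: identify intermediate functions.
z = W @ x                       # linear transform
a = 1.0 / (1.0 + np.exp(-z))    # sigmoid
f = a.sum()                     # scalar output, standing in for the loss

# 2. Local gradients at each intermediate node.
df_da = np.ones_like(a)         # d(sum)/da = 1
da_dz = a * (1 - a)             # sigmoid'(z)

# 3. Combine with the upstream error signal (slopes multiply).
dz = df_da * da_dz              # error signal arriving at z
dW = np.outer(dz, x)            # gradient w.r.t. the parameters
dx = W.T @ dz                   # gradient flowing back to the inputs
```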
Key chain rule intuition: Slopes multiply
1. Write down the variable graph
2. Compute the derivative of the cost function
3. Keep track of error signals
4. Enforce the shape rule on error signals
5. Use matrix balancing when deriving over a linear transformation
Slides courtesy of Justin Johnson, Serena Yeung, and Fei-Fei Li
Fully-connected layer: stretch the 32x32x3 image into a 3072x1 input vector. A 10x3072 weight matrix W maps it to a 10x1 activation. Each activation is 1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product).
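A sketch of that computation in NumPy (random values stand in for a real image and learned weights):

```python
import numpy as np

image = np.random.rand(32, 32, 3)   # 32x32x3 input image
x = image.reshape(3072)             # stretched into a 3072-vector
W = np.random.rand(10, 3072)        # 10x3072 weight matrix
b = np.random.rand(10)              # bias (the slide focuses on W)

activation = W @ x + b              # 10 numbers: one dot product per row of W
print(activation.shape)             # (10,)
```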
Convolution layer: the input is a 32x32x3 image (width 32, height 32, depth 3). Convolve a 5x5x3 filter with the image, i.e. "slide over the image spatially, computing dot products". Filters always extend the full depth of the input volume. Each position gives 1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias).
Convolving (sliding) over all spatial locations yields a 28x28x1 activation map. Each additional filter produces its own activation map. For example, if we had 6 5x5 filters, we'd get 6 separate activation maps; we stack these up to get a "new image" of size 28x28x6!
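A naive sketch of this convolution, assuming stride 1 and no padding (loops left explicit for clarity, not speed):

```python
import numpy as np

def conv2d(image, filters, biases):
    """image: (H, W, D); filters: (K, F, F, D). Returns (H-F+1, W-F+1, K)."""
    H, W, D = image.shape
    K, F, _, _ = filters.shape
    out = np.zeros((H - F + 1, W - F + 1, K))
    for k in range(K):                          # one activation map per filter
        for i in range(H - F + 1):
            for j in range(W - F + 1):
                chunk = image[i:i+F, j:j+F, :]  # FxFxD chunk: full input depth
                out[i, j, k] = np.sum(chunk * filters[k]) + biases[k]
    return out

image = np.random.rand(32, 32, 3)
filters = np.random.rand(6, 5, 5, 3)            # 6 filters of size 5x5x3
maps = conv2d(image, filters, np.zeros(6))
print(maps.shape)                               # (28, 28, 6)
```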
Preview: a ConvNet is a sequence of convolution layers, interspersed with activation functions. E.g., a 32x32x3 image goes through CONV, ReLU with 6 5x5x3 filters to give 28x28x6, then through CONV, ReLU with 10 5x5x6 filters to give 24x24x10, and so on.
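The spatial sizes above follow the standard output-size formula; a quick check (the stride/padding generalization is my addition, not on the slides):

```python
def conv_output_size(n, f, stride=1, pad=0):
    # Output width/height = (N - F + 2P) / S + 1
    return (n - f + 2 * pad) // stride + 1

print(conv_output_size(32, 5))  # 28: after the first CONV layer
print(conv_output_size(28, 5))  # 24: after the second CONV layer
```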
Slides courtesy of Lisa Wang and Juhi Naik
Recurrent Neural Networks (RNNs)
Key points:
○ Weights are shared across timesteps (Wxh, Whh, Why).
○ The hidden state at each timestep is computed from the previous hidden state and the new input.
○ Backpropagation runs through time, across timesteps (use the unrolled network).
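A sketch of the recurrence using the three shared weight matrices; tanh as the nonlinearity is an assumption, though it is the standard choice for a vanilla RNN:

```python
import numpy as np

def rnn_forward(xs, h0, Wxh, Whh, Why):
    """Run a vanilla RNN over a sequence; the same weights serve every timestep."""
    h, hs, ys = h0, [], []
    for x in xs:                          # one iteration per timestep
        h = np.tanh(Wxh @ x + Whh @ h)    # new hidden state from old state + input
        hs.append(h)
        ys.append(Why @ h)                # an output can be read at each timestep
    return hs, ys
```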
RNNs are good for:
○ Sequential data with temporal relationships.
○ Tasks that need an output at every timestep, or at the end of a sequence.
○ E.g., processing a sentence as a sequence of words.
Sequence-to-sequence models split the RNN into encoder and decoder sections (you could see them as two chained RNNs).
LSTMs (Long Short-Term Memory networks) capture both long- and short-term dependencies (this mitigates the vanishing gradients problem).
Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
○ Input gate (i): how much does the current input matter?
○ Forget gate (f): how much does the past matter?
○ Output gate (o): how much should the current cell be exposed?
○ New memory (c): the memory contributed by the current cell.
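A sketch of one LSTM step matching these four roles, following the standard formulation in the colah post cited above (the stacked-parameter layout is my own convention):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM timestep. W, U, b stack the gate parameters:
    index 0..3 = input (i), forget (f), output (o), candidate (g) blocks."""
    i = sigmoid(W[0] @ x + U[0] @ h_prev + b[0])  # how much the input matters
    f = sigmoid(W[1] @ x + U[1] @ h_prev + b[1])  # how much the past matters
    o = sigmoid(W[2] @ x + U[2] @ h_prev + b[2])  # how much the cell is exposed
    g = np.tanh(W[3] @ x + U[3] @ h_prev + b[3])  # candidate new memory
    c = f * c_prev + i * g                        # updated cell state
    h = o * np.tanh(c)                            # exposed hidden state
    return h, c
```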
Credits:
○ Barak Oshri, Lisa Wang, and Juhi Naik (CS224N, Winter 2017)
○ Justin Johnson, Serena Yeung, and Fei-Fei Li (CS231N, Spring 2018)