SLIDE 1

Deep Learning Primer

Nishith Khandwala

SLIDE 2

Neural Networks

SLIDE 3

Overview

  • Neural Network Basics
  • Activation Functions
  • Stochastic Gradient Descent (SGD)
  • Regularization (Dropout)
  • Training Tips and Tricks
SLIDE 4

Neural Network (NN) Basics

Dataset: (x, y), where x are the inputs and y are the labels.

Steps to train a 1-hidden-layer NN (a minimal sketch follows this list):

  • Do a forward pass: ŷ = f(xW + b)
  • Compute the loss: loss(y, ŷ)
  • Compute gradients using backprop
  • Update weights using an optimization algorithm, like SGD
  • Do hyperparameter tuning on the dev set
  • Evaluate the NN on the test set
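A minimal numpy sketch of these steps, assuming a toy setup (tanh hidden layer, mean-squared-error loss, full-batch gradient steps; all names and sizes are illustrative, not from the slides):

```python
import numpy as np

# Minimal sketch: train a 1-hidden-layer NN with gradient descent.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))            # inputs
y = rng.normal(size=(100, 1))            # labels (toy regression targets)

W1, b1 = rng.normal(size=(3, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)) * 0.1, np.zeros(1)
lr = 0.01

for step in range(1000):
    # Forward pass: tanh hidden layer, linear output.
    h = np.tanh(x @ W1 + b1)
    y_hat = h @ W2 + b2
    loss = np.mean((y_hat - y) ** 2)     # compute loss(y, y_hat)

    # Backprop: local gradients combined via the chain rule.
    d_yhat = 2 * (y_hat - y) / len(x)
    dW2, db2 = h.T @ d_yhat, d_yhat.sum(0)
    dh = d_yhat @ W2.T
    dz = dh * (1 - h ** 2)               # tanh'(z) = 1 - tanh(z)^2
    dW1, db1 = x.T @ dz, dz.sum(0)

    # Update weights (here: full-batch steps for simplicity).
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```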
SLIDE 5

Activation Functions: Sigmoid

Properties:

  • Squashes input between 0 and 1.

Problems:

  • Saturation of neurons kills gradients.
  • Output is not centered at 0.
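A short numpy sketch of the sigmoid and its gradient, illustrating the saturation problem (values chosen for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)          # peaks at 0.25 near z = 0

# Saturation: for large |z| the local gradient is ~0, so any
# upstream gradient gets multiplied by ~0 and learning stalls.
print(sigmoid_grad(0.0))        # 0.25
print(sigmoid_grad(10.0))       # ~4.5e-05
```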
SLIDE 6

Activation Functions: Tanh

Properties:

  • Squashes input between -1 and 1.
  • Output centered at 0.

Problems:

  • Saturation of neurons kills gradients.
SLIDE 7

Activation Functions: ReLU

Properties:

  • No saturation
  • Computationally cheap
  • Empirically known to converge faster

Problems:

  • Output not centered at 0
  • When input < 0, the ReLU gradient is 0, so the neuron never updates (a "dead" ReLU).
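A matching sketch for ReLU and its gradient, showing why negative inputs stop learning (illustrative values):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)   # exactly 0 for negative inputs

# Dying ReLU: if a neuron's pre-activation is negative for every
# input, its gradient is 0 everywhere and its weights never change.
z = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu(z))        # [0.  0.  0.5 3. ]
print(relu_grad(z))   # [0. 0. 1. 1.]
```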

SLIDE 8

Stochastic Gradient Descent (SGD)

  • Stochastic Gradient Descent (SGD): θ ← θ − α ∇θ J(θ)
    ○ θ: weights/parameters
    ○ α: learning rate
    ○ J: loss function
  • The SGD update happens after every training example.
  • Minibatch SGD (sometimes also abbreviated as SGD) considers a small batch of training examples at once, averages their loss, and updates θ.
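A minimal sketch of minibatch SGD, assuming a hypothetical grad_fn that returns the batch-averaged gradient dJ/dθ (all names here are illustrative):

```python
import numpy as np

# Minibatch SGD sketch: shuffle the data each epoch, then take one
# gradient step per small batch of examples.
def minibatch_sgd(theta, grad_fn, data, lr=0.01, batch_size=32, epochs=10):
    n = len(data)
    for _ in range(epochs):
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = data[idx[start:start + batch_size]]
            theta -= lr * grad_fn(theta, batch)   # θ ← θ − α ∇J(θ)
    return theta
```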

SLIDE 9

Regularization: Dropout

  • Randomly drop neurons in the forward pass during training.
  • At test time, turn dropout off.
  • Prevents overfitting by forcing the network to learn redundancies. Think of dropout as training an ensemble of networks.
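A sketch of one common implementation, inverted dropout, which rescales surviving activations at train time so nothing changes at test time (this variant is assumed here; the slides do not specify one):

```python
import numpy as np

# Inverted dropout: drop each unit with probability p_drop during
# training and rescale survivors by 1 / (1 - p_drop).
def dropout_forward(h, p_drop=0.5, train=True):
    if not train:
        return h                      # dropout off at test time
    mask = (np.random.rand(*h.shape) >= p_drop) / (1 - p_drop)
    return h * mask                   # drop and rescale survivors
```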

SLIDE 10

Training Tips and Tricks

  • Learning rate:
    ○ If the loss curve seems unstable (a jagged line), decrease the learning rate.
    ○ If the loss curve appears "linear", increase the learning rate.

[Figure: loss vs. iteration curves for very high, high, good, and low learning rates.]

SLIDE 11

Training Tips and Tricks

  • Regularization (Dropout, L2 norm, ...):
    ○ If the gap between train and dev accuracies is large (overfitting), increase the regularization constant.
  • DO NOT test your model on the test set until overfitting is no longer an issue.

SLIDE 12

Backpropagation and Gradients

Slides courtesy of Barak Oshri

SLIDE 13

Problem Statement

Given a function f of inputs x, labels y, and parameters θ, compute the gradient of the loss with respect to θ.

SLIDE 14

Backpropagation

An algorithm for computing the gradient of a compound function as a series of local, intermediate gradients

SLIDE 15

Backpropagation

1. Identify intermediate functions (forward prop)
2. Compute local gradients
3. Combine with the upstream error signal to get the full gradient
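A worked scalar example of these three steps for f(x) = sigmoid(w·x + b), with illustrative values:

```python
import numpy as np

w, x, b = 2.0, -1.0, 1.0

# 1. Forward prop through intermediate functions.
z = w * x + b                 # z = -1.0
s = 1.0 / (1.0 + np.exp(-z))  # s ≈ 0.269

# 2. Local gradients of each intermediate function.
ds_dz = s * (1 - s)           # sigmoid'(z) ≈ 0.197
dz_dw, dz_dx = x, w

# 3. Combine with the upstream signal (df/ds = 1 here).
df_dw = 1.0 * ds_dz * dz_dw   # ≈ -0.197
df_dx = 1.0 * ds_dz * dz_dx   # ≈  0.393
```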

SLIDE 16

Modularity - Simple Example

[Slide shows a compound function broken into intermediate variables (forward propagation).]

SLIDE 17

Modularity - Neural Network Example

[Slide shows a neural network written as a compound function with intermediate variables (forward propagation).]

SLIDE 18

[Slide pairs the intermediate variables (forward propagation) with their intermediate gradients (backward propagation).]

SLIDE 19

Chain Rule Behavior

Key chain rule intuition: Slopes multiply
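A one-line worked instance of "slopes multiply", with made-up slopes:

```latex
% Suppose y = g(x) has slope 3 at x_0, and z = f(y) has slope 2 at g(x_0).
\frac{dz}{dx}\bigg|_{x_0}
  = \frac{dz}{dy}\bigg|_{g(x_0)} \cdot \frac{dy}{dx}\bigg|_{x_0}
  = 2 \cdot 3 = 6
```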

SLIDE 20

Circuit Intuition

SLIDE 21

Backprop Menu for Success

1. Write down the variable graph
2. Compute the derivative of the cost function
3. Keep track of error signals
4. Enforce the shape rule on error signals
5. Use matrix balancing when deriving over a linear transformation
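A numpy sketch of steps 4 and 5, the shape rule and matrix balancing, for a linear layer z = xW (shapes are illustrative):

```python
import numpy as np

# Shape rule: each gradient must have the same shape as the thing it
# is a gradient of. "Matrix balancing" means arranging transposes so
# the shapes work out when differentiating through z = x @ W.
x = np.ones((32, 100))        # batch of inputs   (N, D)
W = np.ones((100, 10))        # weights           (D, C)
z = x @ W                     # pre-activations   (N, C)

delta = np.ones_like(z)       # upstream error signal, shape (N, C)
dW = x.T @ delta              # (D, N) @ (N, C) -> (D, C), matches W
dx = delta @ W.T              # (N, C) @ (C, D) -> (N, D), matches x
```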

SLIDE 22

Fei-Fei Li & Justin Johnson & Serena Yeung

April 17, 2018 Lecture 5 -

Fei-Fei Li & Justin Johnson & Serena Yeung

Lecture 5 - April 17, 2018 22

Convolutional Neural Networks

Slides courtesy of Justin Johnson, Serena Yeung, and Fei-Fei Li

SLIDE 23

Fei-Fei Li & Justin Johnson & Serena Yeung

April 17, 2018 Lecture 5 -

Fei-Fei Li & Justin Johnson & Serena Yeung

Lecture 5 - April 17, 2018 23

3072 1

Fully Connected Layer

32x32x3 image -> stretch to 3072 x 1

[Figure: the 3072 x 1 input is multiplied by a 10 x 3072 weight matrix to give a 10 x 1 activation.]

SLIDE 24

Fei-Fei Li & Justin Johnson & Serena Yeung

April 17, 2018 Lecture 5 -

Fei-Fei Li & Justin Johnson & Serena Yeung

Lecture 5 - April 17, 2018 24

3072 1

Fully Connected Layer

32x32x3 image -> stretch to 3072 x 1

[Figure: input (3072 x 1), weights W (10 x 3072), activation (10 x 1).]

1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product)
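A numpy sketch of this layer, matching the shapes on the slide (random values for illustration):

```python
import numpy as np

# Fully connected layer: 32x32x3 image -> 10 activations.
image = np.random.rand(32, 32, 3)
x = image.reshape(3072)           # stretch to a 3072-vector

W = np.random.randn(10, 3072)     # 10 x 3072 weights
b = np.zeros(10)

# Each activation is one 3072-dimensional dot product
# between a row of W and the input.
activations = W @ x + b           # shape (10,)
```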

SLIDE 25

Fei-Fei Li & Justin Johnson & Serena Yeung

April 17, 2018 Lecture 5 -

Fei-Fei Li & Justin Johnson & Serena Yeung

Lecture 5 - April 17, 2018 25

32 32 3

Convolution Layer

32x32x3 image -> preserve spatial structure

[Figure: the input is a 32 (width) x 32 (height) x 3 (depth) volume.]

SLIDE 26

Fei-Fei Li & Justin Johnson & Serena Yeung

April 17, 2018 Lecture 5 -

Fei-Fei Li & Justin Johnson & Serena Yeung

Lecture 5 - April 17, 2018 26

32 32 3

Convolution Layer

[Figure: a 5x5x3 filter over a 32x32x3 image.]

Convolve the filter with the image, i.e. "slide over the image spatially, computing dot products".

SLIDE 27

Fei-Fei Li & Justin Johnson & Serena Yeung

April 17, 2018 Lecture 5 -

Fei-Fei Li & Justin Johnson & Serena Yeung

Lecture 5 - April 17, 2018 27

32 32 3

Convolution Layer

[Figure: a 5x5x3 filter over a 32x32x3 image.]

Convolve the filter with the image, i.e. "slide over the image spatially, computing dot products". Filters always extend the full depth of the input volume.

SLIDE 28

Fei-Fei Li & Justin Johnson & Serena Yeung

April 17, 2018 Lecture 5 -

Fei-Fei Li & Justin Johnson & Serena Yeung

Lecture 5 - April 17, 2018 28

32 32 3

Convolution Layer

[Figure: a 5x5x3 filter positioned over a 32x32x3 image.]

1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias)
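A numpy sketch of producing that one number (random values; the chunk position is arbitrary):

```python
import numpy as np

# One convolution output: dot product between a 5x5x3 filter and a
# 5x5x3 chunk of the image, plus a bias.
image = np.random.rand(32, 32, 3)
filt = np.random.randn(5, 5, 3)
bias = 0.1

chunk = image[0:5, 0:5, :]                # one 5x5x3 chunk
one_number = np.sum(chunk * filt) + bias  # 75-dim dot product + bias
```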

SLIDE 29

Fei-Fei Li & Justin Johnson & Serena Yeung

April 17, 2018 Lecture 5 -

Fei-Fei Li & Justin Johnson & Serena Yeung

Lecture 5 - April 17, 2018 29

32 32 3

Convolution Layer

[Figure: a 5x5x3 filter sliding over a 32x32x3 image.]

Convolve (slide) over all spatial locations to produce a 28x28x1 activation map.

SLIDE 30

Fei-Fei Li & Justin Johnson & Serena Yeung

April 17, 2018 Lecture 5 -

Fei-Fei Li & Justin Johnson & Serena Yeung

Lecture 5 - April 17, 2018 30

32 32 3

Convolution Layer

[Figure: a 32x32x3 image with a 5x5x3 filter.]

Convolve (slide) over all spatial locations; considering a second (green) filter gives a second 28x28x1 activation map.

SLIDE 31

Fei-Fei Li & Justin Johnson & Serena Yeung

April 17, 2018 Lecture 5 -

Fei-Fei Li & Justin Johnson & Serena Yeung

Lecture 5 - April 17, 2018 31

Convolution Layer

[Figure: a 32x32x3 input mapped to six 28x28 activation maps.]

For example, if we had 6 5x5 filters, we'd get 6 separate activation maps. We stack these up to get a "new image" of size 28x28x6!
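The spatial size follows from the slide's setup: with a 5x5 filter, stride 1, and no padding, (32 − 5)/1 + 1 = 28. A quick check:

```python
# Output size of a conv layer (stride 1, no padding):
# out = (in - filter) / stride + 1
in_size, filt, stride = 32, 5, 1
out = (in_size - filt) // stride + 1   # (32 - 5) / 1 + 1 = 28
n_filters = 6
print((out, out, n_filters))           # (28, 28, 6)
```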

SLIDE 32

Fei-Fei Li & Justin Johnson & Serena Yeung

April 17, 2018 Lecture 5 -

Fei-Fei Li & Justin Johnson & Serena Yeung

Lecture 5 - April 17, 2018 32

Preview: a ConvNet is a sequence of convolution layers, interspersed with activation functions.

[Figure: 32x32x3 -> CONV, ReLU (e.g. 6 5x5x3 filters) -> 28x28x6.]

SLIDE 33

Fei-Fei Li & Justin Johnson & Serena Yeung

April 17, 2018 Lecture 5 -

Fei-Fei Li & Justin Johnson & Serena Yeung

Lecture 5 - April 17, 2018 33

Preview: a ConvNet is a sequence of convolution layers, interspersed with activation functions.

[Figure: 32x32x3 -> CONV, ReLU (e.g. 6 5x5x3 filters) -> 28x28x6 -> CONV, ReLU (e.g. 10 5x5x6 filters) -> 24x24x10 -> CONV, ReLU -> ...]
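A quick shape trace through this stack, assuming stride 1 and no padding (as in the slides' arithmetic):

```python
# Trace the spatial sizes through the stack above.
size, depth = 32, 3
for n_filters, filt in [(6, 5), (10, 5)]:
    size = size - filt + 1        # 32 -> 28 -> 24
    depth = n_filters             # 3 -> 6 -> 10
    print((size, size, depth))    # (28, 28, 6) then (24, 24, 10)
```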

SLIDE 34

RNNs, Language Models, LSTMs, GRUs

Slides courtesy of Lisa Wang and Juhi Naik

SLIDE 35

RNNs

  • Review of RNNs
  • RNN Language Models
  • Vanishing Gradient Problem
  • GRUs
  • LSTMs
SLIDE 36

RNN Review

Key points:

  • Weights are shared (tied) across timesteps (W_xh, W_hh, W_hy)
  • The hidden state at time t depends on the previous hidden state and the new input
  • Backpropagation across timesteps (use the unrolled network)
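A minimal sketch of one timestep, using the slide's weight names (tanh nonlinearity assumed, which is the common choice):

```python
import numpy as np

# One RNN step: the hidden state at time t depends on the previous
# hidden state and the new input; the same weights are used at every t.
def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    h_t = np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)
    y_t = h_t @ W_hy + b_y        # a prediction can be made at every step
    return h_t, y_t
```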

SLIDE 37

RNN Review

RNNs are good for:

  • Learning representations for sequential data with temporal relationships
  • Predictions can be made at every timestep, or at the end of a sequence

SLIDE 38

RNN Language Model

  • Language Modeling (LM): the task of computing probability distributions over sequences of words
  • Plays an important role in speech recognition, text summarization, etc.
  • RNN Language Model: (a sketch of the prediction step follows)
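A sketch of the prediction step of an RNN LM: the hidden state is projected to vocabulary scores and normalized with a softmax (sizes and names are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

vocab_size, hidden = 10_000, 128
W_hy = np.random.randn(hidden, vocab_size) * 0.01
h_t = np.random.randn(hidden)        # hidden state at time t

# Probability distribution over the next word given the history.
p_next = softmax(h_t @ W_hy)         # P(w_{t+1} | w_1 .. w_t)
```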
SLIDE 39

RNN Language Model for Machine Translation

  • Encoder for the source language
  • Decoder for the target language
  • Different weights in the encoder and decoder sections of the RNN (you could see them as two chained RNNs)

SLIDE 40

Vanishing Gradient Problem

  • Backprop in RNNs: a recursive gradient call for the hidden layer
  • The gradients of typical activation functions have magnitudes between 0 and 1.
  • When the terms are less than 1, the product can get small very quickly (see the demo below).
  • Vanishing gradients → RNNs fail to learn, since parameters barely update.
  • GRUs and LSTMs to the rescue!
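A two-line demonstration of how quickly sub-unit terms shrink the gradient:

```python
# 50 timesteps of gradients, each with magnitude 0.9,
# shrink the signal by roughly a factor of 200.
g = 0.9
print(g ** 50)   # ≈ 0.005
```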
SLIDE 41

Gated Recurrent Units (GRUs)

  • Reset gate, r_t
  • Update gate, z_t
  • r_t and z_t control long-term and short-term dependencies (mitigating the vanishing gradient problem)
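For reference, one standard formulation of the GRU equations (notation assumed; the placement of z_t vs. 1 − z_t varies across sources):

```latex
\begin{aligned}
z_t &= \sigma(W^{(z)} x_t + U^{(z)} h_{t-1}) && \text{update gate} \\
r_t &= \sigma(W^{(r)} x_t + U^{(r)} h_{t-1}) && \text{reset gate} \\
\tilde{h}_t &= \tanh(W x_t + r_t \odot U h_{t-1}) && \text{candidate state} \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t && \text{new hidden state}
\end{aligned}
```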

SLIDE 42

Gated Recurrent Units (GRUs)

Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

SLIDE 43

LSTMs

  • i_t: Input gate - how much does the current input matter
  • f_t: Forget gate - how much does the past matter
  • o_t: Output gate - how much should the current cell be exposed
  • c_t: New memory - memory from the current cell
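For reference, one standard formulation of the LSTM equations (notation assumed, matching the gates above):

```latex
\begin{aligned}
i_t &= \sigma(W^{(i)} x_t + U^{(i)} h_{t-1}) && \text{input gate} \\
f_t &= \sigma(W^{(f)} x_t + U^{(f)} h_{t-1}) && \text{forget gate} \\
o_t &= \sigma(W^{(o)} x_t + U^{(o)} h_{t-1}) && \text{output gate} \\
\tilde{c}_t &= \tanh(W^{(c)} x_t + U^{(c)} h_{t-1}) && \text{new memory} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{final cell state} \\
h_t &= o_t \odot \tanh(c_t) && \text{hidden state}
\end{aligned}
```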

SLIDE 44

LSTMs

Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

SLIDE 45

Acknowledgements

  • Slides adapted from:
    ○ Barak Oshri, Lisa Wang, and Juhi Naik (CS224N, Winter 2017)
    ○ Justin Johnson, Serena Yeung, and Fei-Fei Li (CS231N, Spring 2018)

  • Andrej Karpathy, Research Scientist, OpenAI
  • Christopher Olah, Research Scientist, Google Brain