SLIDE 1

All You Want To Know About CNNs

Yukun Zhu

SLIDE 2-8

Deep Learning

Images from http://imgur.com/ (a run of image-only slides)

SLIDE 9-14

Deep Learning in Vision

Object detection performance (mAP) on PASCAL VOC 2010, built up method by method across these slides:

  • DPM (2010): 33.4
  • segDPM (2014): 40.4
  • RCNN (2014): 53.7
  • RCNN* (Oct 2014): 62.9
  • segRCNN (Jan 2015): 67.2
  • Fast RCNN (Jun 2015): 70.8

SLIDE 15

A Neuron

Image from http://cs231n.github.io/neural-networks-1/

SLIDE 16

A Neuron in Neural Network

Image from http://cs231n.github.io/neural-networks-1/

SLIDE 17

Activation Functions

  • Sigmoid: f(x) = 1 / (1 + e^(-x))
  • ReLU: f(x) = max(0, x)
  • Leaky ReLU: f(x) = max(ax, x), for a small constant a
  • Maxout: f(x) = max(w0·x + b0, w1·x + b1)
  • and many others…
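For quick reference, minimal NumPy sketches of these activations (the Maxout signature here is illustrative; in practice w0, b0, w1, b1 are learned per unit):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes to (0, 1); saturates for large |x|

def relu(x):
    return np.maximum(0.0, x)         # zero for negative inputs

def leaky_relu(x, a=0.01):
    return np.maximum(a * x, x)       # small negative slope avoids "dead" units

def maxout(x, w0, b0, w1, b1):
    # Max over two learned affine functions of the input.
    return np.maximum(w0 @ x + b0, w1 @ x + b1)
```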
SLIDE 18

Neural Network (MLP)

Image modified from http://cs231n.github.io/neural-networks-1/

The network computes a function y = f(x; w)

SLIDE 19-27

Forward Computation

Image and code modified from http://cs231n.github.io/optimization-2/

These slides step through a forward pass of f(x0, x1) = 1 / (1 + exp(-(w0·x0 + w1·x1 + w2))) on a computation graph, one gate at a time. The extracted slide text preserves only the magnitudes of the node values; with the signed values from the cs231n example these slides are modified from (w0 = 2.00, x0 = -1.00, w1 = -3.00, x1 = -2.00, w2 = -3.00), the intermediate results are:

  • w0·x0 = -2.00 and w1·x1 = 6.00
  • sum: 4.00; add w2: 1.00
  • negate: -1.00; exp: 0.37; add 1: 1.37; reciprocal: 0.73 (the sigmoid output)
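A minimal sketch of this forward pass in code, using the same example values:

```python
import math

w = [2.0, -3.0, -3.0]   # w0, w1, w2 (w2 acts as the bias)
x = [-1.0, -2.0]        # x0, x1

# Forward pass, gate by gate; values match the slide annotations.
dot = w[0] * x[0] + w[1] * x[1] + w[2]   # -2.00 + 6.00 - 3.00 = 1.00
f = 1.0 / (1.0 + math.exp(-dot))         # sigmoid(1.00) ≈ 0.73
```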

SLIDE 28

Loss Function

A loss function measures how well the prediction matches the true value. Commonly used loss functions:

  • Squared loss: (y - y′)²
  • Cross-entropy loss: -Σᵢ y′ᵢ · log(yᵢ)
  • and many others
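Minimal NumPy sketches of both losses (here y is the prediction and y′ the target, matching the slide's notation):

```python
import numpy as np

def squared_loss(y, y_true):
    return (y - y_true) ** 2

def cross_entropy_loss(y, y_true):
    # y: predicted probabilities; y_true: one-hot target distribution.
    eps = 1e-12                      # avoid log(0)
    return -np.sum(y_true * np.log(y + eps))
```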
SLIDE 29-30

Loss Function

During training, we would like to minimize the total loss on a set of training data:

  • We want to find w* = argmin_w Σᵢ loss(f(xᵢ; w), yᵢ)
  • Usually we use a gradient-based approach (one-step sketch below):
    ○ w_{t+1} = w_t - α·∇_w loss
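A minimal sketch of that update rule, assuming a `grad_loss` function (e.g., produced by backpropagation) that returns ∇_w of the total loss:

```python
def gradient_step(w, grad_loss, lr=0.01):
    # w_{t+1} = w_t - lr * gradient; `w` is a NumPy array of parameters.
    return w - lr * grad_loss(w)
```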

SLIDE 31-38

Backward Computation

Image and code modified from http://cs231n.github.io/optimization-2/

These slides run backpropagation through the same circuit, gate by gate: each gate multiplies the gradient arriving from above by its local derivative, using

  • f = 1/x → df/dx = -1/x²
  • f = x + 1 → df/dx = 1
  • f = e^x → df/dx = e^x
  • f = -x → df/dx = -1
  • f = x + a → df/dx = 1
  • f = ax → df/dx = a

Starting from df/df = 1.00 at the output, the gradient flows back as 1.00 → -0.53 → -0.53 → -0.20 → 0.20, passes unchanged (0.20) through both add gates, and the multiply gates then give dw0 = -0.20, dx0 = 0.40, dw1 = -0.40, dx1 = -0.60, dw2 = 0.20. (Signs follow the cs231n worked example; the extracted slide text preserves only magnitudes.)
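A minimal sketch of the backward pass, mirroring the cs231n walkthrough (it uses the fact that the chained local derivatives of the sigmoid collapse to (1 - f)·f):

```python
import math

w = [2.0, -3.0, -3.0]
x = [-1.0, -2.0]

# Forward pass (as before), keeping the intermediate for backprop.
dot = w[0] * x[0] + w[1] * x[1] + w[2]
f = 1.0 / (1.0 + math.exp(-dot))              # ≈ 0.73

# Backward pass: gradient on `dot` through the sigmoid.
ddot = (1 - f) * f                            # ≈ 0.20
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot]   # ≈ [-0.20, -0.39, 0.20]
dx = [w[0] * ddot, w[1] * ddot]               # ≈ [0.39, -0.59]
```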

SLIDE 39

Why NNs?

SLIDE 40

Universal Approximation Theorem

A feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of R^n, under mild assumptions on the activation function.

https://en.wikipedia.org/wiki/Universal_approximation_theorem

SLIDE 41

Stone’s Theorem

  • Suppose X is a compact Hausdorff space and B is a subalgebra of C(X, R) such that:
    ○ B separates points.
    ○ B contains the constant function 1.
    ○ If f ∈ B then af ∈ B for all constants a ∈ R.
    ○ If f, g ∈ B, then f + g, max{f, g} ∈ B.
  • Then every function in C(X, R) can be approximated as closely as desired by functions in B.

SLIDE 42

Why CNNs?

SLIDE 43

Problems of MLP in Vision

For input as a 10 × 10 image:

  • A 3-layer MLP with 200 hidden units contains ~100k parameters

For input as a 100 × 100 image:

  • A 1-layer MLP with 20k hidden units contains ~200M parameters
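A quick check of the arithmetic (a sketch; the exact totals depend on the output width and biases, which the slide leaves unspecified):

```python
# 10x10 input, three weight matrices of width 200 (weights only):
small = 10 * 10 * 200 + 200 * 200 + 200 * 200    # = 100,000 ≈ 100k

# 100x100 input, one layer of 20k hidden units (weights only):
large = 100 * 100 * 20_000                        # = 200,000,000 = 200M
print(small, large)
```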

SLIDE 44-47

Can We Do Better?

(Image-only build-up slides.)
SLIDE 48

Can We Do Better?

Based on this observation, the MLP can be improved in two ways:

  • Locally connected instead of fully connected
  • Sharing weights between neurons

We achieve both by using convolutional neurons.

SLIDE 49

Convolutional Layers

Image from http://cs231n.github.io/convolutional-networks/

SLIDE 50

Convolutional Layers

Image from http://cs231n.github.io/convolutional-networks/. See this page for an excellent example of convolution.

(Activation volume dimensions: width × height × depth.)
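To make "locally connected + shared weights" concrete, a minimal single-channel convolution sketch (no padding; the cs231n page linked above has a full multi-channel, multi-filter example):

```python
import numpy as np

def conv2d(x, w, b=0.0, stride=1):
    """Naive 'valid' 2D convolution of image x with kernel w."""
    kh, kw = w.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * w) + b   # same weights at every location
    return out
```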

SLIDE 51

Pooling Layers

Image from http://cs231n.github.io/convolutional-networks/

SLIDE 52

Pooling Layers Example: Max Pooling

Image from http://cs231n.github.io/convolutional-networks/

SLIDE 53

Pooling Layers

Commonly used pooling layers:

  • Max pooling
  • Average pooling

Why pooling layers?

  • Reduce activation dimensionality
  • Robustness against small shifts (see the sketch below)
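A minimal max-pooling sketch (2×2 window with stride 2, the most common configuration):

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out
```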
SLIDE 54

CNN Architecture: An Example

Image from http://cs231n.github.io/convolutional-networks/

SLIDE 55

Layer Activations for CNNs

Image modified from http://cs231n.github.io/convolutional-networks/

(Activations shown for: Conv:1, ReLU:1, Conv:2, ReLU:2, MaxPool:1, Conv:3)

SLIDE 56

Layer Activations for CNNs

Image modified from http://cs231n.github.io/convolutional-networks/

(Activations shown for: MaxPool:2, Conv:5, ReLU:5, Conv:6, ReLU:6, MaxPool:3)

SLIDE 57

Learnt Weights for CNNs: First Conv Layer of AlexNet

Image from http://cs231n.github.io/convolutional-networks/

SLIDE 58

Why CNNs Work Now?

SLIDE 59

Convolutional Neural Networks

  • Faster heterogeneous parallel computing
    ○ CPU clusters, GPUs, etc.
  • Large datasets
    ○ ImageNet: 1.2M images of 1,000 object classes
    ○ COCO: 300k images of 2M object instances
  • Improvements in model architecture
    ○ ReLU, dropout, inception, etc.

SLIDE 60

AlexNet

Krizhevsky, Alex, et al. "ImageNet classification with deep convolutional neural networks." NIPS 2012.

SLIDE 61

GoogLeNet

Szegedy, Christian, et al. "Going deeper with convolutions." arXiv preprint arXiv:1409.4842 (2014).

SLIDE 62

Quiz

# of parameters for the first conv layer of AlexNet?

SLIDE 63

Quiz

# of parameters if the first layer is fully-connected?
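A quick sanity check of both quiz questions, assuming the standard AlexNet conv1 configuration: 96 filters of size 11×11×3, stride 4, over a 227×227×3 input (the size that makes the arithmetic work out), producing a 55×55×96 output:

```python
# Convolutional layer: weights are shared across all spatial positions.
conv1 = 96 * (11 * 11 * 3) + 96           # + 96 biases -> 34,944 parameters

# Fully connected layer producing the same 55x55x96 output:
fc = (227 * 227 * 3) * (55 * 55 * 96)     # ≈ 4.5e10 weights (biases excluded)
print(conv1, fc)
```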

SLIDE 64

Quiz

Given a convolution operation written as f(x; w, b) = Σᵢ,ⱼ xᵢ,ⱼ · wᵢ,ⱼ + b, for a 3×3 patch x and a 3×3 kernel w, can you derive its gradients (∂f/∂x, ∂f/∂w, ∂f/∂b)?
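One way to answer, using the fact that f is linear in each argument:

```latex
\frac{\partial f}{\partial x_{i,j}} = w_{i,j}, \qquad
\frac{\partial f}{\partial w_{i,j}} = x_{i,j}, \qquad
\frac{\partial f}{\partial b} = 1
```

With an upstream gradient g, backpropagation scales each of these by g, which is exactly the multiply- and add-gate rules from the backward-computation slides.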

SLIDE 65

Ready to Build Your Own Networks?

SLIDE 66

Tips and Tricks for CNNs

  • Know your data, clean your data, and normalize your data

○ A common trick: subtract the mean and divide by the standard deviation (see the sketch below).

Image from http://cs231n.github.io/neural-networks-2/
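A minimal per-feature normalization sketch:

```python
import numpy as np

def normalize(X, eps=1e-8):
    """Zero-center each column of X (N x D) and scale it to unit variance."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)
```

In practice, compute the mean and standard deviation on the training set only and reuse them for validation and test data.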

SLIDE 67

Tips and Tricks for CNNs

  • Augment your data
SLIDE 68

Tips and Tricks for CNNs

  • Organize your data:
    ○ Keep training data balanced
    ○ Shuffle data before batching
  • Feed your data in the correct way
    ○ Image channel order
    ○ Tensor storage order

SLIDE 69

Tips and Tricks for CNNs

(Two images: “First Order, in order” vs. “First Order, out of order”, illustrating what happens when channel or storage order is wrong.)

SLIDE 70

Tips and Tricks for CNNs

Common tensor storage orders:

  • BDRC (batch, depth, row, column; i.e., NCHW)
    ○ Used in Caffe, Torch, Theano; supported by cuDNN
    ○ Pros: faster for convolution (FFT, memory access)
  • BRCD (batch, row, column, depth; i.e., NHWC)
    ○ Used in TensorFlow; limited support in cuDNN
    ○ Pros: fast batch normalization, easier batching

SLIDE 71

Tips and Tricks for CNNs

Designing model architecture

  • Convolution, max pooling, then fully connected layers
  • Nonlinearity
    ○ Stay away from sigmoid (except for the output)
    ○ ReLU is preferred
    ○ Try Leaky ReLU next
    ○ Use Maxout if most ReLU units die (have zero activation)

SLIDE 72

Tips and Tricks for CNNs

Setting parameters

  • Weights
    ○ Random initialization with proper variance (see the sketch below)
  • Biases
    ○ For ReLU we prefer a small positive bias, so that units start out active
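A minimal initialization sketch following this advice. The He scheme (variance 2/fan_in) is my assumption; the slide only asks for "proper variance":

```python
import numpy as np

def init_layer(fan_in, fan_out):
    # He initialization: keeps activation variance stable through ReLU layers.
    W = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)
    b = np.full(fan_out, 0.01)   # small positive bias so ReLU units start active
    return W, b
```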

SLIDE 73

Tips and Tricks for CNNs

Setting hyperparameters

  • Learning rate / momentum (Δw*_t = Δw_t + m·Δw_{t-1})
    ○ Decrease the learning rate while training
    ○ Set momentum to 0.8 - 0.9
  • Batch size
    ○ For large datasets: set to whatever fits your memory
    ○ For smaller datasets: find a tradeoff between instance randomness and gradient smoothness
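A minimal sketch of the momentum update named on this slide (the velocity `v` carries the previous step):

```python
def momentum_step(w, v, grad, lr, m=0.9):
    # New step = gradient step + m * previous step, per the slide's formula.
    v = m * v - lr * grad
    return w + v, v
```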

SLIDE 74

Tips and Tricks for CNNs

Monitoring your training:

  • Split your dataset into training, validation, and test sets
    ○ Tune hyperparameters on validation and evaluate on test
    ○ Keep track of training and validation loss during training
    ○ Stop early if training and validation loss diverge
    ○ Loss doesn't tell you everything. Also track precision, class-wise precision, and more

SLIDE 75

Tips and Tricks for CNNs

Borrow knowledge from another dataset

  • Pre-train your CNN on a large dataset (e.g. ImageNet)
  • Remove / reshape the last few layers
  • Fix the parameters of the first few layers, or use a small learning rate for them
  • Fine-tune the parameters on your own dataset (see the sketch below)
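A minimal fine-tuning sketch of this recipe in PyTorch (which postdates this deck; shown purely as an illustration, and the 10-class output is a placeholder):

```python
import torch.nn as nn
import torchvision.models as models

model = models.alexnet(pretrained=True)      # pre-trained on ImageNet
for param in model.features.parameters():    # fix the early conv layers
    param.requires_grad = False
model.classifier[6] = nn.Linear(4096, 10)    # reshape the last layer for 10 classes
# ...then train only the remaining trainable parameters on your own dataset.
```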
SLIDE 76

Tips and Tricks for CNNs

Debugging

  • import unittest, not import pdb
  • Check your gradient [deprecated]
  • Make your model large enough, and try overfitting the training set
  • Check gradient norms, weight norms, and activation norms
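For reference, a minimal numerical gradient check (centered differences), which is what the "check your gradient" bullet refers to:

```python
import numpy as np

def grad_check(f, x, analytic_grad, h=1e-5, tol=1e-5):
    """Compare the analytic gradient of scalar-valued f at x to finite differences."""
    for i in range(x.size):
        orig = x.flat[i]
        x.flat[i] = orig + h
        f_plus = f(x)
        x.flat[i] = orig - h
        f_minus = f(x)
        x.flat[i] = orig                      # restore
        numeric = (f_plus - f_minus) / (2 * h)
        denom = max(1e-12, abs(numeric) + abs(analytic_grad.flat[i]))
        assert abs(numeric - analytic_grad.flat[i]) / denom < tol, f"mismatch at {i}"
```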
SLIDE 77

Talk is Cheap, Show Me Some Code

SLIDE 78

Image from http://www.linkresearchtools.com/

SLIDE 79

Fully Convolutional Networks

Long, Jonathan, et al. "Fully convolutional networks for semantic segmentation." arXiv preprint arXiv:1411.4038 (2014).