ECE 6504: Deep Learning for Perception
Topics: (Finish) Backprop, Convolutional Neural Nets
Dhruv Batra, Virginia Tech


SLIDE 1

ECE 6504: Deep Learning for Perception

Dhruv Batra Virginia Tech

Topics:

– (Finish) Backprop
– Convolutional Neural Nets

SLIDE 2

Administrativia

  • Presentation Assignments

– https://docs.google.com/spreadsheets/d/1m76E4mC0wfRjc4HRBWFdAlXKPIzlEwfw1-u7rBw9TJ8/edit#gid=2045905312

(C) Dhruv Batra 2

SLIDE 3

Recap of last time

(C) Dhruv Batra 3

SLIDE 4

Last Time

  • Notation + Setup
  • Neural Networks
  • Chain Rule + Backprop

(C) Dhruv Batra 4

SLIDE 5

Recall: The Neuron Metaphor

  • Neurons
    – accept information from multiple inputs,
    – transmit information to other neurons.
  • Artificial neuron
    – multiply inputs by weights along edges,
    – apply some function to the set of inputs at each node.

5

Image Credit: Andrej Karpathy, CS231n

SLIDE 6

Activation Functions

  • sigmoid vs tanh

(C) Dhruv Batra 6
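As a quick reference, the two activations compared on this slide can be sketched in NumPy (a minimal sketch; the function names are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(x):
    # squashes inputs to (0, 1); outputs are not zero-centered
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # squashes inputs to (-1, 1); zero-centered, which often helps optimization
    return np.tanh(x)

# tanh is just a scaled and shifted sigmoid: tanh(x) = 2*sigmoid(2x) - 1
x = np.array([-2.0, 0.0, 2.0])
s, t = sigmoid(x), tanh(x)
```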

SLIDE 7

A quick note

(C) Dhruv Batra 7 Image Credit: LeCun et al. ‘98

SLIDE 8

Rectified Linear Units (ReLU)

(C) Dhruv Batra 8
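A minimal sketch of the ReLU and its (sub)gradient, which is what backprop uses (names illustrative):

```python
import numpy as np

def relu(x):
    # ReLU: max(0, x); cheap to compute and does not saturate for x > 0
    return np.maximum(0.0, x)

def relu_grad(x):
    # subgradient: 1 where x > 0, else 0 (ReLU is a.e. differentiable)
    return (x > 0).astype(float)
```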

SLIDE 9

(C) Dhruv Batra 9

SLIDE 10

(C) Dhruv Batra 10

SLIDE 11

Visualizing Loss Functions

  • Sum of individual losses

(C) Dhruv Batra 11

Image Credit: Andrej Karpathy, CS231n

SLIDE 12

Detour

(C) Dhruv Batra 12

SLIDE 13

Logistic Regression as a Cascade

(C) Dhruv Batra 13


Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

SLIDE 14

Key Computation: Forward-Prop

(C) Dhruv Batra 14

Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

SLIDE 15

Key Computation: Back-Prop

(C) Dhruv Batra 15

Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
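The forward/backward passes for the logistic-regression cascade can be sketched as follows (a minimal sketch assuming binary cross-entropy loss on a single example; variable names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fprop(w, x):
    # forward pass: compute and keep intermediate values
    z = np.dot(w, x)           # linear score
    p = sigmoid(z)             # predicted probability
    return z, p

def bprop(w, x, y):
    # backward pass: chain rule through loss -> sigmoid -> dot product
    z, p = fprop(w, x)
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    dz = p - y                 # dLoss/dz for sigmoid + cross-entropy
    dw = dz * x                # dLoss/dw
    return loss, dw
```

A finite-difference check is a handy way to convince yourself the analytic gradient is right.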

SLIDE 16

Plan for Today

  • MLPs
    – Notation
    – Backprop
  • CNNs
    – Notation
    – Convolutions
    – Forward pass
    – Backward pass

(C) Dhruv Batra 16

SLIDE 17

Multilayer Networks

  • Cascade neurons together
  • The output from one layer is the input to the next
  • Each layer has its own set of weights

(C) Dhruv Batra 17

Image Credit: Andrej Karpathy, CS231n
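The cascade above can be sketched as a forward pass through a stack of fully connected layers (a minimal sketch of an illustrative 3 -> 4 -> 2 network; shapes and names are not from the slides):

```python
import numpy as np

def mlp_forward(x, weights, biases):
    # each layer's output is the next layer's input
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(W @ h + b)       # hidden layers: affine map + nonlinearity
    W, b = weights[-1], biases[-1]
    return W @ h + b                 # output layer: raw scores

rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [np.zeros(4), np.zeros(2)]
scores = mlp_forward(rng.standard_normal(3), weights, biases)
```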

SLIDE 18

Equivalent Representations

(C) Dhruv Batra 18

Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

SLIDE 19

Backward Propagation

Question: Does BPROP work with ReLU layers only?
Answer: Nope, any a.e. (almost-everywhere) differentiable transformation works.

Question: What's the computational cost of BPROP?
Answer: About twice FPROP (need to compute gradients w.r.t. input and parameters at every layer).

Note: FPROP and BPROP are duals of each other. E.g., a SUM in FPROP becomes a COPY in BPROP (and vice versa).

Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

(C) Dhruv Batra 19

SLIDE 20

Fully Connected Layer

Example: 200x200 image, 40K hidden units: ~2B parameters!!!

  • Spatial correlation is local
  • Waste of resources; besides, we do not have enough training samples anyway

Slide Credit: Marc'Aurelio Ranzato

SLIDE 21

Locally Connected Layer

Example: 200x200 image, 40K hidden units, filter size 10x10: 4M parameters

Note: This parameterization is good when the input image is registered (e.g., face recognition).

Slide Credit: Marc'Aurelio Ranzato

SLIDE 22

Locally Connected Layer

STATIONARITY? Statistics are similar at different locations.

Example: 200x200 image, 40K hidden units, filter size 10x10: 4M parameters

Note: This parameterization is good when the input image is registered (e.g., face recognition).

Slide Credit: Marc'Aurelio Ranzato

SLIDE 23

Convolutional Layer

Share the same parameters across different locations (assuming input is stationary): convolutions with learned kernels

Slide Credit: Marc'Aurelio Ranzato
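Weight sharing can be made concrete with a naive "valid" 2D convolution (a minimal sketch; like most deep-learning libraries, this is really cross-correlation, i.e., the kernel is not flipped):

```python
import numpy as np

def conv2d(image, kernel):
    # slide the same KxK kernel over every location: weight sharing
    D, K = image.shape[0], kernel.shape[0]
    out = np.zeros((D - K + 1, D - K + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r + K, c:c + K]
            out[r, c] = np.sum(patch * kernel)   # dot product with the patch
    return out
```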

SLIDE 24

(C) Dhruv Batra 24

  • "Convolution of box signal with itself2" by Brian Amberg (derivative work: Tinos). Licensed under CC BY-SA 3.0 via Wikimedia Commons: https://commons.wikimedia.org/wiki/File:Convolution_of_box_signal_with_itself2.gif

SLIDE 25

Convolution Explained

  • http://setosa.io/ev/image-kernels/
  • https://github.com/bruckner/deepViz

(C) Dhruv Batra 25

SLIDE 26

Convolutional Layer

Slide Credit: Marc'Aurelio Ranzato

(C) Dhruv Batra 26

SLIDE 41

Mathieu et al. “Fast training of CNNs through FFTs” ICLR 2014

Convolutional Layer

Slide Credit: Marc'Aurelio Ranzato

(C) Dhruv Batra 41

SLIDE 42

Convolutional Layer

(input image) * kernel = (output feature map), e.g., with the 3x3 kernel:

  1 0 1
  1 0 1
  1 0 1

Slide Credit: Marc'Aurelio Ranzato

(C) Dhruv Batra 42

SLIDE 43

Convolutional Layer

Learn multiple filters.

E.g.: 200x200 image, 100 filters, filter size 10x10: 10K parameters

Slide Credit: Marc'Aurelio Ranzato

(C) Dhruv Batra 43
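The parameter counts quoted on the last few slides can be checked directly (a sketch; biases are ignored, as on the slides):

```python
def n_params_fully_connected(d, hidden):
    # every hidden unit sees every pixel of a d x d image
    return d * d * hidden

def n_params_locally_connected(hidden, k):
    # every hidden unit has its own private k x k filter
    return hidden * k * k

def n_params_convolutional(n_filters, k):
    # one k x k filter shared across all locations, per output map
    return n_filters * k * k

print(n_params_fully_connected(200, 40_000))   # ~2B
print(n_params_locally_connected(40_000, 10))  # 4M
print(n_params_convolutional(100, 10))         # 10K
```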

SLIDE 44

Convolutional Nets

(C) Dhruv Batra 44

INPUT 32x32
-> Convolutions -> C1: feature maps 6@28x28
-> Subsampling -> S2: f. maps 6@14x14
-> Convolutions -> C3: f. maps 16@10x10
-> Subsampling -> S4: f. maps 16@5x5
-> Full connection -> C5: layer 120
-> Full connection -> F6: layer 84
-> Gaussian connections -> OUTPUT 10

Image Credit: Yann LeCun, Kevin Murphy

SLIDE 45

Convolutional Layer

Conv. layer: input feature maps h^{n-1}_1, h^{n-1}_2, h^{n-1}_3 -> output feature maps h^n_1, h^n_2

Each output feature map is a rectified sum of the input feature maps convolved with the corresponding kernels:

    h^n_i = max( 0, sum_{j=1}^{#input channels} h^{n-1}_j * w^n_{ij} )

Slide Credit: Marc'Aurelio Ranzato

(C) Dhruv Batra 45
SLIDE 48

Question: What is the size of the output? What's the computational cost?

Answer: It is proportional to the number of filters and depends on the stride. If kernels have size KxK, the input has size DxD, the stride is 1, and there are M input feature maps and N output feature maps, then:

  • the input has size M@DxD
  • the output has size N@(D-K+1)x(D-K+1)
  • the kernels have MxNxKxK coefficients (which have to be learned)
  • cost: M*K*K*N*(D-K+1)*(D-K+1)

Question: How many feature maps? What's the size of the filters?

Answer: Usually, there are more output feature maps than input feature maps. Convolutional layers can increase the number of hidden units by big factors (and are expensive to compute). The size of the filters has to match the size/scale of the patterns we want to detect (task dependent).

Convolutional Layer

Slide Credit: Marc'Aurelio Ranzato

(C) Dhruv Batra 48
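The formulas on this slide translate directly into code (a sketch for stride-1 "valid" convolution; the example numbers below are illustrative):

```python
def conv_layer_stats(D, K, M, N):
    # D x D input, K x K kernels, M input maps, N output maps, stride 1
    out = D - K + 1                      # output is N @ out x out
    n_params = M * N * K * K             # kernel coefficients to learn
    cost = M * K * K * N * out * out     # multiply-adds in the forward pass
    return out, n_params, cost

out, n_params, cost = conv_layer_stats(D=200, K=10, M=3, N=100)
```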

SLIDE 49

A standard neural net applied to images:

  • scales quadratically with the size of the input
  • does not leverage stationarity

Solution:

  • connect each hidden unit to a small patch of the input
  • share the weight across space

This is called a convolutional layer. A network with convolutional layers is called a convolutional network.

LeCun et al. “Gradient-based learning applied to document recognition” IEEE 1998

Key Ideas

Slide Credit: Marc'Aurelio Ranzato

(C) Dhruv Batra 49

SLIDE 50

Let us assume the filter is an “eye” detector. Q: how can we make the detection robust to the exact location of the eye?

Pooling Layer

Slide Credit: Marc'Aurelio Ranzato

(C) Dhruv Batra 50

SLIDE 51

By “pooling” (e.g., taking max) filter responses at different locations we gain robustness to the exact spatial location of features.

Pooling Layer

Slide Credit: Marc'Aurelio Ranzato

(C) Dhruv Batra 51

SLIDE 52

Pooling Layer: Examples

Max-pooling:
    h^n_i(r, c) = max_{r̄ in N(r), c̄ in N(c)} h^{n-1}_i(r̄, c̄)

Average-pooling:
    h^n_i(r, c) = mean_{r̄ in N(r), c̄ in N(c)} h^{n-1}_i(r̄, c̄)

L2-pooling:
    h^n_i(r, c) = sqrt( sum_{r̄ in N(r), c̄ in N(c)} h^{n-1}_i(r̄, c̄)^2 )

L2-pooling over features:
    h^n_i(r, c) = sqrt( sum_{j in N(i)} h^{n-1}_j(r, c)^2 )

Slide Credit: Marc'Aurelio Ranzato

(C) Dhruv Batra 52
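Max-pooling, the most common of these, can be sketched for a single feature map (assuming non-overlapping KxK pools, i.e., stride = K):

```python
import numpy as np

def max_pool(fmap, K):
    # non-overlapping KxK max-pooling of a D x D feature map
    D = fmap.shape[0]
    out = np.zeros((D // K, D // K))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            # keep only the strongest response in each pool:
            # robustness to the exact spatial location of the feature
            out[r, c] = fmap[r*K:(r+1)*K, c*K:(c+1)*K].max()
    return out
```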

SLIDE 53

Question: What is the size of the output? What's the computational cost?

Answer: The size of the output depends on the stride between the pools. For instance, if pools do not overlap and have size KxK, and the input has size DxD with M input feature maps, then:

  • the output is M@(D/K)x(D/K)
  • the computational cost is proportional to the size of the input (negligible compared to a convolutional layer)

Question: How should I set the size of the pools?

Answer: It depends on how much invariance or robustness to distortions we want the representation to have. It is best to pool slowly (via a few stacks of conv-pooling layers).

Pooling Layer

Slide Credit: Marc'Aurelio Ranzato

(C) Dhruv Batra 53

SLIDE 54

Task: detect orientation L/R Conv layer: linearizes manifold

Pooling Layer: Interpretation

Slide Credit: Marc'Aurelio Ranzato

(C) Dhruv Batra 54

SLIDE 55

Conv layer: linearizes manifold Pooling layer: collapses manifold Task: detect orientation L/R

Pooling Layer: Interpretation

Slide Credit: Marc'Aurelio Ranzato

(C) Dhruv Batra 55

SLIDE 56

Pooling Layer: Receptive Field Size

Conv. layer: h^{n-1} -> h^n -> Pool. layer: h^{n+1}

If convolutional filters have size KxK and stride 1, and the pooling layer has pools of size PxP, then each unit in the pooling layer depends upon a patch (at the input of the preceding conv. layer) of size: (P+K-1)x(P+K-1)

Slide Credit: Marc'Aurelio Ranzato

(C) Dhruv Batra 56
SLIDE 58

ConvNets: Typical Stage

One stage (zoom): Convol. -> Pooling

Courtesy of K. Kavukcuoglu

Slide Credit: Marc'Aurelio Ranzato

(C) Dhruv Batra 58

SLIDE 59

ConvNets: Typical Stage

One stage (zoom): Convol. -> Pooling

Conceptually similar to: SIFT, HoG, etc.

Slide Credit: Marc'Aurelio Ranzato

(C) Dhruv Batra 59

SLIDE 60

Courtesy of K. Kavukcuoglu

Note: after one stage the number of feature maps is usually increased (conv. layer) and the spatial resolution is usually decreased (stride in conv. and pooling layers). The receptive field gets bigger.

Reasons:

  • gain invariance to spatial translation (pooling layer)
  • increase specificity of features (approaching object-specific units)

Slide Credit: Marc'Aurelio Ranzato

(C) Dhruv Batra 60

SLIDE 61

ConvNets: Typical Architecture

Whole system: Input Image -> 1st stage -> 2nd stage -> 3rd stage -> Fully Conn. Layers -> Class Labels

One stage (zoom): Convol. -> Pooling

Slide Credit: Marc'Aurelio Ranzato

(C) Dhruv Batra 61

SLIDE 62

Visualizing Learned Filters

(C) Dhruv Batra 62 Figure Credit: [Zeiler & Fergus ECCV14]

SLIDE 63

Visualizing Learned Filters

(C) Dhruv Batra 63 Figure Credit: [Zeiler & Fergus ECCV14]

SLIDE 64

Visualizing Learned Filters

(C) Dhruv Batra 64 Figure Credit: [Zeiler & Fergus ECCV14]

SLIDE 65

Frome et al. “DeViSE: A Deep Visual-Semantic Embedding Model” NIPS 2013

Matching: a CNN (image) and a text embedding (e.g., “tiger”) meet in a shared representation

Fancier Architectures: Multi-Modal

Slide Credit: Marc'Aurelio Ranzato

(C) Dhruv Batra 65

SLIDE 66

Zhang et al. “PANDA..” CVPR 2014

image -> [Conv -> Norm -> Pool] x4 -> Fully Conn. x4 -> Attr. 1, Attr. 2, ..., Attr. N

Fancier Architectures: Multi-Task

Slide Credit: Marc'Aurelio Ranzato

(C) Dhruv Batra 66

SLIDE 67

Any DAG of differentiable modules is allowed!

Fancier Architectures: Generic DAG

Slide Credit: Marc'Aurelio Ranzato

(C) Dhruv Batra 67