Lecture 13: Introduction to Deep Learning and Deep Convolutional Neural Networks - PowerPoint PPT Presentation



slide-1
SLIDE 1

Lecture 13:

− Introduction to Deep Learning
− Deep Convolutional Neural Networks

Aykut Erdem

November 2016 Hacettepe University

slide-2
SLIDE 2

Administrative

  • Assignment 3 is out!

− It is due November 30, 2016
− You will implement a 2-layer Neural Network

2

slide-3
SLIDE 3

An update on course projects

  • From now on, regular (weekly) blog posts about your progress on the course projects!
  • We will use medium.com

3

slide-4
SLIDE 4

Last time… Computational Graph

[Figure: computational graph — inputs x and W feed a * node producing scores s; hinge loss and regularization R combine at a + node to give the loss L. During backprop, each node f combines incoming gradients with its “local gradient” of outputs w.r.t. activations.]

4

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

slide-5
SLIDE 5

Last time… Training Neural Networks

5

Mini-batch SGD

Loop:
1. Sample a batch of data
2. Forward prop it through the graph, get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
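The four-step loop above can be sketched end-to-end on a toy linear model. This is a minimal illustration, not part of the lecture: the synthetic dataset, learning rate, batch size, and squared-error loss are all assumptions chosen to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (stand-in for a real labeled dataset).
X = rng.normal(size=(256, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.01 * rng.normal(size=256)

w = np.zeros(5)              # model parameters
lr, batch_size = 0.1, 32

for step in range(200):
    # 1. Sample a batch of data
    idx = rng.choice(len(X), size=batch_size, replace=False)
    xb, yb = X[idx], y[idx]
    # 2. Forward prop it through the graph, get loss (here: mean squared error)
    preds = xb @ w
    loss = np.mean((preds - yb) ** 2)
    # 3. Backprop to calculate the gradient of the loss w.r.t. the parameters
    grad = 2 * xb.T @ (preds - yb) / batch_size
    # 4. Update the parameters using the gradient
    w -= lr * grad
```

After a couple hundred steps, `w` approaches `true_w` and the batch loss drops toward the noise floor; real networks replace the analytic gradient with backprop through the computational graph.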

slide-6
SLIDE 6

This week

  • Introduction to Deep Learning
  • Deep Convolutional Neural Networks


6

slide-7
SLIDE 7

What is deep learning?

“Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction.”

− Yann LeCun, Yoshua Bengio and Geoff Hinton

  • Y. LeCun, Y. Bengio, G. Hinton, “Deep Learning”, Nature, Vol. 521, 28 May 2015

7

slide-8
SLIDE 8

1943 – 2006: 
 A Prehistory of Deep Learning

8

slide-9
SLIDE 9

1943: Warren McCulloch and Walter Pitts

  • First computational model
  • Neurons as logic gates

(AND, OR, NOT)

  • A neuron model that sums binary inputs and outputs 1 if the sum exceeds a certain threshold value, and otherwise outputs 0

9

slide-10
SLIDE 10

1958: Frank Rosenblatt’s Perceptron

  • A computational model of a single neuron
  • Solves a binary classification problem
  • Simple training algorithm
  • Built using specialized hardware

10

  • F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in the brain”,

Psychological Review, Vol. 65, 1958
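Rosenblatt's "simple training algorithm" is easy to sketch: on each misclassified example, nudge the weights toward (or away from) that example. The AND-style toy dataset and epoch count below are illustrative assumptions, not from the slide.

```python
import numpy as np

def perceptron_train(X, y, epochs=20):
    """Rosenblatt's rule: on every mistake, add (or subtract) the
    misclassified example to the weight vector. Labels are in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # misclassified (or on the boundary)
                w += yi * xi
                b += yi
    return w, b

# A tiny linearly separable problem (AND-like), labels in {-1, +1}.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
w, b = perceptron_train(X, y)
preds = np.sign(X @ w + b)   # all four points end up correctly classified
```

By the perceptron convergence theorem this loop terminates on any linearly separable problem; on XOR-style data it would cycle forever, which is exactly the limitation Minsky and Papert pointed out (next slide).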

slide-11
SLIDE 11

1969: Marvin Minsky and Seymour Papert

“No machine can learn to recognize 
 X unless it possesses, at least 
 potentially, some scheme for 
 representing X.” (p. xiii)

  • Perceptrons can only represent linearly separable functions; they cannot represent functions such as XOR.
  • Wrongly attributed as the reason behind the AI winter, a period of reduced funding and interest in AI research

11

slide-12
SLIDE 12

1990s

  • Multi-layer perceptrons can theoretically learn any function (Cybenko, 1989; Hornik, 1991)
  • Training multi-layer perceptrons
    − Back-propagation (Rumelhart, Hinton, Williams, 1986)
    − Back-propagation through time (BPTT) (Werbos, 1988)
  • New neural architectures
    − Convolutional neural nets (LeCun et al., 1989)
    − Long short-term memory networks (LSTM) (Hochreiter and Schmidhuber, 1997)

12

slide-13
SLIDE 13

Why it failed then

  • Too many parameters to learn from few labeled examples.
  • “I know my features are better for this task.”
  • Non-convex optimization? No, thanks.
  • Black-box model, no interpretability.
  • Very slow and inefficient.
  • Overshadowed by the success of SVMs (Cortes and Vapnik, 1995)

13

Adapted from Joan Bruna

slide-14
SLIDE 14

A major breakthrough in 2006

14

slide-15
SLIDE 15

2006 Breakthrough: 
 Hinton and Salakhutdinov

  • The first solution to the vanishing gradient problem
  • Build the model in a layer-by-layer fashion using unsupervised learning
    − The features in early layers are already initialized or “pretrained” with some suitable features (weights).
    − Pretrained features in early layers only need to be adjusted slightly during supervised learning to achieve good results.

15

  • G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks”, Science, Vol. 313, 28 July 2006.

slide-16
SLIDE 16

The 2012 revolution

16

slide-17
SLIDE 17

ImageNet Challenge

17

Image classification

  • Large Scale Visual Recognition Challenge (ILSVRC)
  • 1.2M training images with 1K categories
  • Measure top-5 classification error

[Figure: easiest vs. hardest classes, with example outputs such as scale, T-shirt, steel drum, giant panda, drumstick, mud turtle]

  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database”, CVPR 2009.
  • O. Russakovsky et al., “ImageNet Large Scale Visual Recognition Challenge”, Int. J. Comput. Vis., Vol. 115, Issue 3, pp. 211-252, 2015.

slide-18
SLIDE 18

ILSVRC 2012 Competition

2012 Teams             %Error
Supervision (Toronto)    15.3
ISI (Tokyo)              26.1
VGG (Oxford)             26.9
XRCE/INRIA               27.0
UvA (Amsterdam)          29.6
INRIA/LEAR               33.4

  • A. Krizhevsky, I. Sutskever, G.E. Hinton “ImageNet Classification with Deep Convolutional Neural Networks”, NIPS 2012
  • The success of AlexNet, a deep convolutional network
    − 7 hidden layers (not counting some max pooling layers)
    − 60M parameters
    − Combined several tricks: ReLU activation function, data augmentation, dropout

18

[Chart legend: CNN based vs. non-CNN based entries]

slide-19
SLIDE 19

2012 – now
 A Cambrian explosion in deep learning

19

slide-20
SLIDE 20

  • Speech recognition: Amodei et al., “Deep Speech 2: End-to-End Speech Recognition in English and Mandarin”, In CoRR 2015
  • Machine translation: M.-T. Luong et al., “Effective Approaches to Attention-based Neural Machine Translation”, EMNLP 2015
    [Figure: sequence-to-sequence translation of “Je suis étudiant” to “I am a student”]
  • Self-driving cars: M. Bojarski et al., “End to End Learning for Self-Driving Cars”, In CoRR 2016
  • Game playing: D. Silver et al., “Mastering the game of Go with deep neural networks and tree search”, Nature 529, 2016
  • Robotics: L. Pinto and A. Gupta, “Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours”, ICRA 2015
  • Genomics: H. Y. Xiong et al., “The human splicing code reveals new insights into the genetic determinants of disease”, Science 347, 2015
  • Audio generation: M. Ramona et al., “Capturing a Musician's Groove: Generation of Realistic Accompaniments from Single Song Recordings”, In IJCAI 2015

And many more… 20

slide-21
SLIDE 21

Why now?

21

slide-22
SLIDE 22

22

Slide credit: Neil Lawrence

slide-23
SLIDE 23

Datasets vs. Algorithms

23

Year | Breakthrough in AI | Dataset (First Available) | Algorithm (First Proposed)
1994 | Human-level spontaneous speech recognition | Spoken Wall Street Journal articles and other texts (1991) | Hidden Markov Model (1984)
1997 | IBM Deep Blue defeated Garry Kasparov | 700,000 Grandmaster chess games, aka “The Extended Book” (1991) | Negascout planning algorithm (1983)
2005 | Google's Arabic- and Chinese-to-English translation | 1.8 trillion tokens from Google Web and News pages (collected in 2005) | Statistical machine translation algorithm (1988)
2011 | IBM Watson became the world Jeopardy! champion | 8.6 million documents from Wikipedia, Wiktionary, and Project Gutenberg (updated in 2010) | Mixture-of-Experts (1991)
2014 | Google's GoogLeNet object classification at near-human performance | ImageNet corpus of 1.5 million labeled images and 1,000 object categories (2010) | Convolutional Neural Networks (1989)
2015 | Google's DeepMind achieved human parity in playing 29 Atari games by learning general control from video | Arcade Learning Environment dataset of over 50 Atari games (2013) | Q-learning (1992)

Average No. of Years to Breakthrough: Datasets 3 years, Algorithms 18 years

Table credit: Quant Quanto

slide-24
SLIDE 24
Powerful Hardware

  • CPU vs. GPU

24

slide-25
SLIDE 25

25

slide-26
SLIDE 26
Working ideas on how to train deep architectures

  • Better Learning Regularization (e.g. Dropout)

  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, JMLR Vol. 15, No. 1, 2014

26

slide-27
SLIDE 27

Working ideas on how to train deep architectures

  • Better Optimization Conditioning (e.g. Batch Normalization)

  • S. Ioffe, C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, In ICML 2015

27

slide-28
SLIDE 28

Working ideas on how to train deep architectures

  • Better neural architectures (e.g. Residual Nets)

  • K. He, X. Zhang, S. Ren, J. Sun, “Deep Residual Learning for Image Recognition”, In CVPR 2016

28

slide-29
SLIDE 29

So what is deep learning?

29

slide-30
SLIDE 30

Three key ideas

  • (Hierarchical) Compositionality
    − Cascade of non-linear transformations
    − Multiple layers of representations
  • End-to-End Learning
    − Learning (goal-driven) representations
    − Learning to feature extract
  • Distributed Representations
    − No single neuron “encodes” everything
    − Groups of neurons work together

30

slide by Dhruv Batra

slide-31
SLIDE 31

Three key ideas

  • (Hierarchical) Compositionality
    − Cascade of non-linear transformations
    − Multiple layers of representations
  • End-to-End Learning
    − Learning (goal-driven) representations
    − Learning to feature extract
  • Distributed Representations
    − No single neuron “encodes” everything
    − Groups of neurons work together

31

slide by Dhruv Batra

slide-32
SLIDE 32

Traditional Machine Learning

VISION: hand-crafted features SIFT/HOG (fixed) → your favorite classifier (learned) → “car”
SPEECH: hand-crafted features MFCC (fixed) → your favorite classifier (learned) → \ˈd ē p\
NLP: “This burrito place is yummy and fun!” → hand-crafted features Bag-of-words (fixed) → your favorite classifier (learned) → “+”

slide by Marc'Aurelio Ranzato, Yann LeCun

32

slide-33
SLIDE 33

It's an old paradigm

  • The first learning machine: the Perceptron
  • Built at Cornell in 1960
  • The Perceptron was a linear classifier on top of a simple feature extractor
  • The vast majority of practical applications of ML today use glorified linear classifiers or glorified template matching.
  • Designing a feature extractor requires considerable effort by experts.

y = sign( Σ_{i=1}^{N} W_i F_i(X) + b )

[Figure: input A → Feature Extractor → features F_i(X), combined with weights W_i]

33

slide by Marc'Aurelio Ranzato, Yann LeCun

slide-34
SLIDE 34

Hierarchical Compositionality

VISION: pixels → edge → texton → motif → part → object
SPEECH: sample → spectral band → formant → motif → phone → word
NLP: character → word → NP/VP/.. → clause → sentence → story

slide by Marc'Aurelio Ranzato, Yann LeCun

34

slide-35
SLIDE 35

Building A Complicated Function

Given a library of simple functions, compose them into a complicated function

slide by Marc’Aurelio Ranzato, Yann LeCun

35

slide-36
SLIDE 36

Building A Complicated Function

Given a library of simple functions

Idea 1: Linear Combinations

  • Boosting
  • Kernels

Compose into a complicated function

slide by Marc’Aurelio Ranzato, Yann LeCun

36

slide-37
SLIDE 37

Building A Complicated Function

Given a library of simple functions

Idea 2: Compositions

  • Deep Learning
  • Grammar models
  • Scattering transforms…

Compose into a complicated function

slide by Marc’Aurelio Ranzato, Yann LeCun

37

slide-38
SLIDE 38

Building A Complicated Function

Given a library of simple functions

Idea 2: Compositions

  • Deep Learning
  • Grammar models
  • Scattering transforms…

Compose into a complicated function

slide by Marc’Aurelio Ranzato, Yann LeCun

38

slide-39
SLIDE 39

“car”

slide by Marc’Aurelio Ranzato, Yann LeCun

Deep Learning = Hierarchical Compositionality

39

slide-40
SLIDE 40

Low-Level Feature → Mid-Level Feature → High-Level Feature → Trainable Classifier

Feature visualization of convolutional net trained on ImageNet from [Zeiler & Fergus 2013]

“car”

Deep Learning = Hierarchical Compositionality

slide by Marc’Aurelio Ranzato, Yann LeCun

40

slide-41
SLIDE 41

41 Sparse DBNs [Lee et al. ICML ‘09] Figure courtesy: Quoc Le

slide by Dhruv Batra

slide-42
SLIDE 42

Three key ideas

  • (Hierarchical) Compositionality
    − Cascade of non-linear transformations
    − Multiple layers of representations
  • End-to-End Learning
    − Learning (goal-driven) representations
    − Learning to feature extract
  • Distributed Representations
    − No single neuron “encodes” everything
    − Groups of neurons work together

42

slide by Dhruv Batra

slide-43
SLIDE 43

Traditional Machine Learning

VISION: hand-crafted features SIFT/HOG (fixed) → your favorite classifier (learned) → “car”
SPEECH: hand-crafted features MFCC (fixed) → your favorite classifier (learned) → \ˈd ē p\
NLP: “This burrito place is yummy and fun!” → hand-crafted features Bag-of-words (fixed) → your favorite classifier (learned) → “+”

slide by Marc'Aurelio Ranzato, Yann LeCun

43

slide-44
SLIDE 44

Traditional Machine Learning (more accurately)

VISION: SIFT/HOG (fixed) → K-Means/pooling (unsupervised, “Learned”) → classifier (supervised) → “car”
SPEECH: MFCC (fixed) → Mixture of Gaussians (unsupervised, “Learned”) → classifier (supervised) → \ˈd ē p\
NLP: “This burrito place is yummy and fun!” → Parse Tree Syntactic (fixed) → n-grams (unsupervised, “Learned”) → classifier (supervised) → “+”

slide by Marc'Aurelio Ranzato, Yann LeCun

44

slide-45
SLIDE 45

Deep Learning = End-to-End Learning

VISION: SIFT/HOG (fixed) → K-Means/pooling (unsupervised, “Learned”) → classifier (supervised) → “car”
SPEECH: MFCC (fixed) → Mixture of Gaussians (unsupervised, “Learned”) → classifier (supervised) → \ˈd ē p\
NLP: “This burrito place is yummy and fun!” → Parse Tree Syntactic (fixed) → n-grams (unsupervised, “Learned”) → classifier (supervised) → “+”

slide by Marc'Aurelio Ranzato, Yann LeCun

45

slide-46
SLIDE 46
Deep Learning = End-to-End Learning

  • A hierarchy of trainable feature transforms
  • Each module transforms its input representation into a higher-level one.
  • High-level features are more global and more invariant
  • Low-level features are shared among categories

Trainable Feature-Transform/Classifier → Trainable Feature-Transform/Classifier → Trainable Feature-Transform/Classifier (Learned Internal Representations)

slide by Marc'Aurelio Ranzato, Yann LeCun

46

slide-47
SLIDE 47
“Shallow” vs Deep Learning

  • “Shallow” models: hand-crafted Feature Extractor (fixed) → “Simple” Trainable Classifier (learned)
  • Deep models: Trainable Feature-Transform/Classifier → Trainable Feature-Transform/Classifier → Trainable Feature-Transform/Classifier (Learned Internal Representations)

slide by Marc'Aurelio Ranzato, Yann LeCun

47

slide-48
SLIDE 48

Three key ideas

  • (Hierarchical) Compositionality
    − Cascade of non-linear transformations
    − Multiple layers of representations
  • End-to-End Learning
    − Learning (goal-driven) representations
    − Learning to feature extract
  • Distributed Representations
    − No single neuron “encodes” everything
    − Groups of neurons work together

48

slide by Dhruv Batra

slide-49
SLIDE 49

Localist representations

  • The simplest way to represent things with neural networks is to dedicate one neuron to each thing.
    − Easy to understand.
    − Easy to code by hand; often used to represent inputs to a net
    − Easy to learn; this is what mixture models do: each cluster corresponds to one neuron
    − Easy to associate with other representations or responses.
  • But localist models are very inefficient whenever the data has componential structure.

49 Image credit: Moontae Lee

slide by Geoff Hinton

slide-50
SLIDE 50

Distributed Representations

  • Each neuron must represent something, so this must be a local representation.
  • Distributed representation means a many-to-many relationship between two types of representation (such as concepts and neurons).
    − Each concept is represented by many neurons
    − Each neuron participates in the representation of many concepts

50

Local Distributed

slide by Geoff Hinton

Image credit: Moontae Lee

slide-51
SLIDE 51

Power of distributed representations!

  • Possible internal representations:
  • Objects
  • Scene attributes
  • Object parts
  • Textures

51

  • B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Object Detectors Emerge in Deep Scene CNNs”, ICLR 2015

[Figure: scene classification examples, e.g. bedroom, mountain]

slide by Bolei Zhou

slide-52
SLIDE 52

Deep Convolutional 
 Neural Networks

52

slide-53
SLIDE 53

Convolutions

slide by Yisong Yue

53

slide-54
SLIDE 54

Convolution Filters

54

slide by Yisong Yue

slide-55
SLIDE 55

Gabor Filters

55

slide by Yisong Yue

slide-56
SLIDE 56

Gaussian Blur Filters

56

slide by Yisong Yue

slide-57
SLIDE 57

Convolutional Neural Networks

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

57

slide-58
SLIDE 58

Convolution Layer

32x32x3 image: width 32, height 32, depth 3

58

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-59
SLIDE 59

Convolution Layer

32x32x3 image, 5x5x3 filter

Convolve the filter with the image, i.e. “slide over the image spatially, computing dot products”

59

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-60
SLIDE 60

Convolution Layer

32x32x3 image, 5x5x3 filter

Convolve the filter with the image, i.e. “slide over the image spatially, computing dot products”

Filters always extend the full depth of the input volume.

60

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-61
SLIDE 61

Convolution Layer

32x32x3 image, 5x5x3 filter

1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias)

61

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-62
SLIDE 62

Convolution Layer

32x32x3 image, 5x5x3 filter

Convolve (slide) over all spatial locations → a 28x28x1 activation map

62

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
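The "slide over the image, computing dot products" description above can be written as a naive loop. This is a minimal sketch, not how real frameworks implement it (they vectorize heavily), and like most CNN libraries it computes a sliding dot product (cross-correlation) without flipping the filter:

```python
import numpy as np

def conv2d_single(image, filt, bias=0.0):
    """Naive 'valid' convolution of one filter over one image.
    image: (H, W, D); filt: (F, F, D) -- spans the full input depth."""
    H, W, D = image.shape
    F = filt.shape[0]
    out = np.zeros((H - F + 1, W - F + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # one number per location: a 5*5*3 = 75-dimensional dot product + bias
            patch = image[i:i+F, j:j+F, :]
            out[i, j] = np.sum(patch * filt) + bias
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 3))   # 32x32x3 input, as on the slide
filt = rng.normal(size=(5, 5, 3))      # 5x5x3 filter
amap = conv2d_single(image, filt)      # 28x28 activation map
```

Running several different filters and stacking their maps along the depth axis gives the "new image" described a few slides later.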

slide-63
SLIDE 63

Convolution Layer

32x32x3 image, 5x5x3 filter

Consider a second (green) filter: convolving (sliding) it over all spatial locations gives a second 28x28x1 activation map.

63

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-64
SLIDE 64

Convolution Layer

For example, if we had 6 5x5 filters, we'll get 6 separate 28x28 activation maps.

We stack these up to get a “new image” of size 28x28x6!

64

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-65
SLIDE 65

Preview: a ConvNet is a sequence of Convolutional Layers, interspersed with activation functions

32x32x3 → CONV, ReLU (e.g. 6 5x5x3 filters) → 28x28x6

65

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-66
SLIDE 66

Preview: ConvNet is a sequence of Convolutional Layers, interspersed with activation functions

32x32x3 → CONV, ReLU (e.g. 6 5x5x3 filters) → 28x28x6 → CONV, ReLU (e.g. 10 5x5x6 filters) → 24x24x10 → CONV, ReLU → ….

66

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-67
SLIDE 67

Preview

[From recent Yann LeCun slides]

67

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-68
SLIDE 68

[From recent Yann LeCun slides]

Preview

68

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-69
SLIDE 69

We call the layer convolutional because it is related to convolution of two signals: elementwise multiplication and sum of a filter and the signal (image).

One filter => one activation map

example 5x5 filters (32 total)

69

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-70
SLIDE 70

Preview

70

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-71
SLIDE 71

A closer look at spatial dimensions:

32x32x3 image, 5x5x3 filter

Convolve (slide) over all spatial locations → a 28x28x1 activation map

71

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-72
SLIDE 72

7x7 input (spatially), assume 3x3 filter

72

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

A closer look at spatial dimensions:

slide-73
SLIDE 73

7x7 input (spatially), assume 3x3 filter

73

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

A closer look at spatial dimensions:

slide-74
SLIDE 74

7x7 input (spatially), assume 3x3 filter

74

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

A closer look at spatial dimensions:

slide-75
SLIDE 75

7x7 input (spatially), assume 3x3 filter

75

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

A closer look at spatial dimensions:

slide-76
SLIDE 76

7x7 input (spatially), assume 3x3 filter => 5x5 output

76

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

A closer look at spatial dimensions:

slide-77
SLIDE 77

7x7 input (spatially), assume 3x3 filter applied with stride 2

77

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

A closer look at spatial dimensions:

slide-78
SLIDE 78

7x7 input (spatially), assume 3x3 filter applied with stride 2

78

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

A closer look at spatial dimensions:

slide-79
SLIDE 79

7x7 input (spatially), assume 3x3 filter applied with stride 2 => 3x3 output!

79

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

A closer look at spatial dimensions:

slide-80
SLIDE 80

7x7 input (spatially), assume 3x3 filter applied with stride 3?

80

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

A closer look at spatial dimensions:

slide-81
SLIDE 81

7x7 input (spatially), assume 3x3 filter applied with stride 3? Doesn't fit! Cannot apply 3x3 filter on 7x7 input with stride 3.

81

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

A closer look at spatial dimensions:

slide-82
SLIDE 82

Output size: (N - F) / stride + 1

e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 :\ (doesn't fit)

82

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
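The output-size formula, including its zero-padded generalization (N + 2P - F)/stride + 1 used a few slides later, fits in a small helper that also flags the "doesn't fit" case. The function name and its `None` convention are illustrative choices:

```python
def conv_output_size(N, F, stride, pad=0):
    """Spatial output size of a conv layer: (N - F + 2*pad) / stride + 1.
    Returns None when the filter does not tile the input evenly."""
    span = N - F + 2 * pad
    if span % stride != 0:
        return None   # e.g. a 3x3 filter with stride 3 on a 7x7 input: doesn't fit!
    return span // stride + 1

conv_output_size(7, 3, 1)          # 5  (stride 1)
conv_output_size(7, 3, 2)          # 3  (stride 2)
conv_output_size(7, 3, 3)          # None: (7 - 3)/3 + 1 = 2.33
conv_output_size(32, 5, 1, pad=2)  # 32: zero-padding with (F-1)/2 preserves size
```

The last call reproduces the 32x32 example worked through on the later "Examples time" slides.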

slide-83
SLIDE 83

In practice: Common to zero pad the border

e.g. input 7x7, 3x3 filter applied with stride 1, pad with 1 pixel border => what is the output?

(recall: (N - F) / stride + 1)

83

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-84
SLIDE 84

In practice: Common to zero pad the border

e.g. input 7x7, 3x3 filter applied with stride 1, pad with 1 pixel border => what is the output?

7x7 output!

84

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-85
SLIDE 85

In practice: Common to zero pad the border

e.g. input 7x7, 3x3 filter applied with stride 1, pad with 1 pixel border => what is the output?

7x7 output! In general, it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding with (F-1)/2 (will preserve size spatially), e.g.
F = 3 => zero pad with 1
F = 5 => zero pad with 2
F = 7 => zero pad with 3

85

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-86
SLIDE 86

Remember back to… e.g. 32x32 input convolved repeatedly with 5x5 filters shrinks volumes spatially! (32 -> 28 -> 24 ...). Shrinking too fast is not good, doesn't work well.

32x32x3 → CONV, ReLU (e.g. 6 5x5x3 filters) → 28x28x6 → CONV, ReLU (e.g. 10 5x5x6 filters) → 24x24x10 → CONV, ReLU → ….

86

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-87
SLIDE 87

Recap: Convolution Layer

(No padding, no strides) Convolving a 3 × 3 kernel over a 4 × 4 input using unit strides 
 (i.e., i = 4, k = 3, s = 1 and p = 0).

Image credit: Vincent Dumoulin and Francesco Visin 87

slide-88
SLIDE 88

Computing the output values of a 2D discrete convolution 
 i1 = i2 = 5, k1 = k2 = 3, s1 = s2 = 2, and p1 = p2 = 1

Image credit: Vincent Dumoulin and Francesco Visin

88

slide-89
SLIDE 89

Examples time:

Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2

Output volume size: ?

89

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-90
SLIDE 90

Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2

Output volume size: (32+2*2-5)/1+1 = 32 spatially, so 32x32x10

Examples time:

90

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-91
SLIDE 91

Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2

Number of parameters in this layer?

Examples time:

91

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-92
SLIDE 92

Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2

Number of parameters in this layer?
Each filter has 5*5*3 + 1 = 76 params (+1 for bias) => 76*10 = 760

Examples time:

92

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
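The parameter count above generalizes directly: each filter carries F*F*depth weights plus one bias, regardless of stride or padding. A tiny helper (the function name is an illustrative choice) makes this checkable:

```python
def conv_params(num_filters, F, input_depth):
    """Parameters in a conv layer: each filter has F*F*depth weights + 1 bias."""
    per_filter = F * F * input_depth + 1
    return num_filters * per_filter

conv_params(10, 5, 3)   # 10 * (5*5*3 + 1) = 760, matching the slide
conv_params(6, 5, 3)    # the earlier 6-filter example: 6 * 76 = 456
```

Note that the count is independent of the 32x32 spatial size: parameter sharing is what keeps conv layers so much smaller than fully connected ones.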

slide-93
SLIDE 93

93

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-94
SLIDE 94

Common settings: K = (powers of 2, e.g. 32, 64, 128, 512)

  • F = 3, S = 1, P = 1
  • F = 5, S = 1, P = 2
  • F = 5, S = 2, P = ? (whatever fits)
  • F = 1, S = 1, P = 0

94

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-95
SLIDE 95

(btw, 1x1 convolution layers make perfect sense)

56x56x64 input → 1x1 CONV with 32 filters → 56x56x32

(each filter has size 1x1x64, and performs a 64-dimensional dot product)

95

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
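Because each 1x1 filter is just a 64-dimensional dot product at every spatial position, the whole layer reduces to a matrix multiply over the depth axis. A sketch with random data (shapes taken from the slide, values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(56, 56, 64))   # 56x56x64 input volume
W = rng.normal(size=(64, 32))       # 32 filters, each of size 1x1x64

# A 1x1 convolution = a 64-dim dot product at every spatial position,
# i.e. one matrix multiply applied independently at each (row, col).
out = x @ W                          # shape (56, 56, 32)
```

This is why 1x1 convolutions are a cheap way to change the depth of a volume (e.g. in Network-in-Network and GoogLeNet-style bottlenecks) without touching the spatial layout.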

slide-96
SLIDE 96

Example: CONV layer in Torch

96

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-97
SLIDE 97

Example: CONV layer in Caffe

97

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-98
SLIDE 98

Example: CONV layer in Lasagne

98

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-99
SLIDE 99

The brain/neuron view of CONV Layer

32x32x3 image, 5x5x3 filter

1 number: the result of taking a dot product between the filter and this part of the image (i.e. 5*5*3 = 75-dimensional dot product)

99

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-100
SLIDE 100

32x32x3 image, 5x5x3 filter

1 number: the result of taking a dot product between the filter and this part of the image (i.e. 5*5*3 = 75-dimensional dot product). It's just a neuron with local connectivity...

100

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

The brain/neuron view of CONV Layer

slide-101
SLIDE 101

An activation map is a 28x28 sheet of neuron outputs:
1. Each is connected to a small region in the input
2. All of them share parameters

“5x5 filter” -> “5x5 receptive field for each neuron”

101

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

The brain/neuron view of CONV Layer

slide-102
SLIDE 102

E.g. with 5 filters, the CONV layer consists of neurons arranged in a 3D grid (28x28x5).

There will be 5 different neurons all looking at the same region in the input volume.

102

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

The brain/neuron view of CONV Layer

slide-103
SLIDE 103

103

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Activation Functions

slide-104
SLIDE 104

Activation Functions

Sigmoid | tanh: tanh(x) | ReLU: max(0,x)

104

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
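The three activation functions compared on the next few slides are one-liners; the sample input values below are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes to (0, 1); not zero-centered

def tanh(x):
    return np.tanh(x)                 # squashes to (-1, 1); zero-centered

def relu(x):
    return np.maximum(0.0, x)         # no saturation for x > 0; very cheap

x = np.array([-2.0, 0.0, 2.0])
sigmoid(x), tanh(x), relu(x)          # relu(x) -> [0., 0., 2.]
```

Plotting these over a range like [-5, 5] makes the saturation problem visible: sigmoid and tanh flatten out at both ends (killing gradients), while ReLU stays linear on the positive side.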

slide-105
SLIDE 105

Activation Functions: Sigmoid

  • Squashes numbers to range [0,1]
  • Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron

3 problems:
1. Saturated neurons “kill” the gradients
2. Sigmoid outputs are not zero-centered
3. exp() is a bit compute expensive

105

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-106
SLIDE 106

Activation Functions

tanh(x)

  • Squashes numbers to range [-1,1]
  • zero centered (nice)
  • still kills gradients when saturated :(

[LeCun et al., 1991]

106

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-107
SLIDE 107

Activation Functions: ReLU (Rectified Linear Unit)

Computes f(x) = max(0,x)

  • Does not saturate (in + region)
  • Very computationally efficient
  • Converges much faster than sigmoid/tanh in practice (e.g. 6x)

[Krizhevsky et al., 2012]

107

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-108
SLIDE 108

two more layers to go: POOL/FC

108

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-109
SLIDE 109

Pooling layer

  • makes the representations smaller and more manageable
  • operates over each activation map independently

109

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-110
SLIDE 110

Max Pooling

Single depth slice (x: width, y: height):

1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4

max pool with 2x2 filters and stride 2 =>

6 8
3 4

110

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
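Max pooling is just the conv-layer sliding window with `max` instead of a dot product. The transcript of the slide's 4x4 example is partly garbled, so the matrix values below are assumed so that the stated 2x2 output (6, 8 / 3, 4) is reproduced:

```python
import numpy as np

def max_pool(x, F=2, stride=2):
    """Max pooling over a single 2D depth slice."""
    H, W = x.shape
    out = np.zeros((1 + (H - F) // stride, 1 + (W - F) // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # take the max over each FxF window
            out[i, j] = x[i*stride:i*stride+F, j*stride:j*stride+F].max()
    return out

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)
max_pool(x)   # [[6. 8.]
              #  [3. 4.]]
```

With F = 2, S = 2 the spatial size halves and, unlike a conv layer, the pooling layer adds zero parameters.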

slide-111
SLIDE 111

111

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-112
SLIDE 112

Common settings:
F = 2, S = 2
F = 3, S = 2

112

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-113
SLIDE 113

Fully Connected Layer (FC layer)

  • Contains neurons that connect to the entire input volume, as in ordinary Neural Networks

113

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
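Concretely, "connects to the entire input volume" means flattening the last conv/pool output and applying an ordinary matrix multiply. The 4x4x10 volume and 10 output classes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose the last conv/pool stage produced a 4x4x10 volume.
x = rng.normal(size=(4, 4, 10))

# The FC layer flattens it and connects every input value to every
# output neuron, exactly as in an ordinary neural network.
W = rng.normal(size=(4 * 4 * 10, 10))   # e.g. 10 class scores
b = np.zeros(10)
scores = x.reshape(-1) @ W + b           # shape (10,)
```

This one layer already has 4*4*10*10 + 10 = 1,610 parameters, which is why FC layers (not conv layers) dominate the parameter count of classic architectures like AlexNet.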

slide-114
SLIDE 114

http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html

[ConvNetJS demo: training on CIFAR-10]

114

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson