SLIDE 1

Gradient for Cross-Entropy Loss with Sigmoid

For a single example $(x, y)$:

$$E = -\sum_{k=1}^{K}\Big[\, y_k \log \sigma_k^L(x) + (1-y_k)\log\big(1-\sigma_k^L(x)\big) \Big] + \frac{\lambda}{2m}\sum_{l=1}^{L}\sum_{i=1}^{s_{l-1}}\sum_{j=1}^{s_l}\big(w_{ij}^l\big)^2$$

where

$$\mathrm{sum}_j^l = \sum_{k=1}^{s_{l-1}} w_{kj}^l\, \sigma_k^{l-1} \qquad \text{and} \qquad \sigma_i^l = \frac{1}{1+e^{-\mathrm{sum}_i^l}}$$

$$\frac{\partial E}{\partial w_{ij}^l} = \frac{\partial E}{\partial \sigma_j^l}\, \frac{\partial \sigma_j^l}{\partial \mathrm{sum}_j^l}\, \frac{\partial \mathrm{sum}_j^l}{\partial w_{ij}^l} + \frac{\lambda}{m}\, w_{ij}^l$$

$$\frac{\partial \sigma_j^l}{\partial \mathrm{sum}_j^l} = \frac{1}{1+e^{-\mathrm{sum}_j^l}} \left( 1 - \frac{1}{1+e^{-\mathrm{sum}_j^l}} \right) = \sigma_j^l\big(1 - \sigma_j^l\big)$$

$$\frac{\partial \mathrm{sum}_j^l}{\partial w_{ij}^l} = \frac{\partial}{\partial w_{ij}^l} \sum_{k=1}^{s_{l-1}} w_{kj}^l\, \sigma_k^{l-1} = \sigma_i^{l-1}$$
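The identity $\partial\sigma/\partial\mathrm{sum} = \sigma(1-\sigma)$ is the piece reused on every slide that follows. Below is a minimal numpy sketch that checks it against a finite-difference estimate; the function name `sigmoid`, the sample points and the tolerance are illustrative choices, not from the slides.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)

# Analytic derivative from the slide: d(sigma)/d(sum) = sigma * (1 - sigma)
analytic = sigmoid(z) * (1.0 - sigmoid(z))

# Central finite-difference estimate of the same derivative
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)

assert np.allclose(analytic, numeric, atol=1e-8)
```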

SLIDE 2

Backpropagation in Action

[Figure: nodes $\sigma_1^{l-1}, \ldots, \sigma_i^{l-1}, \ldots, \sigma_{s_{l-1}}^{l-1}$ of the $(l-1)$th layer feed the $l$th layer, whose nodes carry the gradients $\partial E/\partial \sigma_j^l, \ldots, \partial E/\partial \sigma_{s_l}^l$ received back from the $(l+1)$th layer nodes $\sigma_1^{l+1}, \ldots, \sigma_{s_{l+1}}^{l+1}$]

$$\frac{\partial E}{\partial w_{ij}^l} = \frac{\partial E}{\partial \sigma_j^l}\, \frac{\partial \sigma_j^l}{\partial \mathrm{sum}_j^l}\, \frac{\partial \mathrm{sum}_j^l}{\partial w_{ij}^l} + \frac{\lambda}{m}\, w_{ij}^l$$

SLIDE 3

Gradient for Cross-Entropy Loss with Sigmoid: $\partial E/\partial \sigma_j^l$

For a single example $(x, y)$:

$$E = -\sum_{k=1}^{K}\Big[\, y_k \log \sigma_k^L(x) + (1-y_k)\log\big(1-\sigma_k^L(x)\big) \Big] + \frac{\lambda}{2m}\sum_{l=1}^{L}\sum_{i=1}^{s_{l-1}}\sum_{j=1}^{s_l}\big(w_{ij}^l\big)^2$$

$$\frac{\partial E}{\partial \sigma_j^l} = \;?$$

SLIDE 4

Gradient for Cross-Entropy Loss with Sigmoid: $\partial E/\partial \sigma_j^l$

For a single example $(x, y)$:

$$E = -\sum_{k=1}^{K}\Big[\, y_k \log \sigma_k^L(x) + (1-y_k)\log\big(1-\sigma_k^L(x)\big) \Big] + \frac{\lambda}{2m}\sum_{l=1}^{L}\sum_{i=1}^{s_{l-1}}\sum_{j=1}^{s_l}\big(w_{ij}^l\big)^2$$

$$\frac{\partial E}{\partial \sigma_j^l} = \sum_{p=1}^{s_{l+1}} \frac{\partial E}{\partial \mathrm{sum}_p^{l+1}}\, \frac{\partial \mathrm{sum}_p^{l+1}}{\partial \sigma_j^l} = \sum_{p=1}^{s_{l+1}} \frac{\partial E}{\partial \sigma_p^{l+1}}\, \frac{\partial \sigma_p^{l+1}}{\partial \mathrm{sum}_p^{l+1}}\, w_{jp}^{l+1} \qquad \text{since } \frac{\partial \mathrm{sum}_p^{l+1}}{\partial \sigma_j^l} = w_{jp}^{l+1}$$

$$\frac{\partial E}{\partial \sigma_j^L} = \;?$$

SLIDE 5

Gradient for Cross-Entropy Loss with Sigmoid: $\partial E/\partial \sigma_j^l$

For a single example $(x, y)$:

$$E = -\sum_{k=1}^{K}\Big[\, y_k \log \sigma_k^L(x) + (1-y_k)\log\big(1-\sigma_k^L(x)\big) \Big] + \frac{\lambda}{2m}\sum_{l=1}^{L}\sum_{i=1}^{s_{l-1}}\sum_{j=1}^{s_l}\big(w_{ij}^l\big)^2$$

$$\frac{\partial E}{\partial \sigma_j^l} = \sum_{p=1}^{s_{l+1}} \frac{\partial E}{\partial \mathrm{sum}_p^{l+1}}\, \frac{\partial \mathrm{sum}_p^{l+1}}{\partial \sigma_j^l} = \sum_{p=1}^{s_{l+1}} \frac{\partial E}{\partial \sigma_p^{l+1}}\, \frac{\partial \sigma_p^{l+1}}{\partial \mathrm{sum}_p^{l+1}}\, w_{jp}^{l+1} \qquad \text{since } \frac{\partial \mathrm{sum}_p^{l+1}}{\partial \sigma_j^l} = w_{jp}^{l+1}$$

$$\frac{\partial E}{\partial \sigma_j^L} = -\left( \frac{y_j}{\sigma_j^L} - \frac{1-y_j}{1-\sigma_j^L} \right)$$

SLIDE 6

Backpropagation in Action: Identify the Sigmoid + Cross-Entropy-specific steps

[Figure: the $(l-1)$th layer nodes $\sigma_1^{l-1}, \sigma_2^{l-1}, \ldots, \sigma_i^{l-1}, \ldots, \sigma_{s_{l-1}}^{l-1}$ connect through the weights $w_{1j}^l, w_{2j}^l, \ldots, w_{ij}^l, \ldots, w_{s_{l-1}j}^l$ (and $w_{1s_l}^l, \ldots, w_{s_{l-1}s_l}^l$) to the $l$th layer, which in turn connects through the weights $w_{jp}^{l+1}$ to the $(l+1)$th layer nodes $\sigma_1^{l+1}, \ldots, \sigma_{s_{l+1}}^{l+1}$]

$$\frac{\partial E}{\partial \sigma_j^l} = \sum_{p=1}^{s_{l+1}} \frac{\partial E}{\partial \sigma_p^{l+1}}\, \frac{\partial \sigma_p^{l+1}}{\partial \mathrm{sum}_p^{l+1}}\, w_{jp}^{l+1} \qquad\qquad \frac{\partial E}{\partial \sigma_{s_l}^l} = \sum_{p=1}^{s_{l+1}} \frac{\partial E}{\partial \sigma_p^{l+1}}\, \frac{\partial \sigma_p^{l+1}}{\partial \mathrm{sum}_p^{l+1}}\, w_{s_l p}^{l+1}$$

SLIDE 7

Backpropagation in Action: Identify the Sigmoid + Cross-Entropy-specific steps

[Figure: node $\sigma_i^{l-1}$ of the $(l-1)$th layer feeds the $l$th layer, whose nodes carry $\partial E/\partial \sigma_j^l, \ldots, \partial E/\partial \sigma_{s_l}^l$ received back from the $(l+1)$th layer nodes $\sigma_1^{l+1}, \ldots, \sigma_{s_{l+1}}^{l+1}$]

$$\frac{\partial E}{\partial w_{ij}^l} = \frac{\partial E}{\partial \sigma_j^l}\, \frac{\partial \sigma_j^l}{\partial \mathrm{sum}_j^l}\, \frac{\partial \mathrm{sum}_j^l}{\partial w_{ij}^l} + \frac{\lambda}{m}\, w_{ij}^l$$

SLIDE 8

Recall and Substitute: the Sigmoid + Cross-Entropy-specific pieces

$$\mathrm{sum}_j^l = \sum_{k=1}^{s_{l-1}} w_{kj}^l\, \sigma_k^{l-1} \qquad \text{and} \qquad \sigma_i^l = \frac{1}{1+e^{-\mathrm{sum}_i^l}}$$

$$\frac{\partial E}{\partial w_{ij}^l} = \frac{\partial E}{\partial \sigma_j^l}\, \frac{\partial \sigma_j^l}{\partial \mathrm{sum}_j^l}\, \frac{\partial \mathrm{sum}_j^l}{\partial w_{ij}^l} + \frac{\lambda}{m}\, w_{ij}^l$$

$$\frac{\partial \sigma_j^l}{\partial \mathrm{sum}_j^l} = \sigma_j^l\big(1-\sigma_j^l\big) \qquad\qquad \frac{\partial \mathrm{sum}_j^l}{\partial w_{ij}^l} = \sigma_i^{l-1}$$

$$\frac{\partial E}{\partial \sigma_j^l} = \sum_{p=1}^{s_{l+1}} \frac{\partial E}{\partial \sigma_p^{l+1}}\, \frac{\partial \sigma_p^{l+1}}{\partial \mathrm{sum}_p^{l+1}}\, w_{jp}^{l+1} \qquad\qquad \frac{\partial E}{\partial \sigma_j^L} = -\left( \frac{y_j}{\sigma_j^L} - \frac{1-y_j}{1-\sigma_j^L} \right)$$
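Putting the recalled pieces together for one layer: given the upstream gradient $\partial E/\partial\sigma^l$ (one entry per node $j$), the previous layer's activations $\sigma^{l-1}$, and the weight matrix $w^l$, the per-weight gradient is the outer product above plus the regularization term. A minimal numpy sketch under those assumptions; the function and variable names are illustrative, not from the slides.

```python
import numpy as np

def layer_gradient(dE_dsigma, sigma_l, sigma_prev, W, lam, m):
    """Gradient dE/dW for one layer, following the slide:
    dE/dw_ij = dE/dsigma_j * sigma_j * (1 - sigma_j) * sigma_i^{l-1} + (lam/m) * w_ij.
    Shapes: dE_dsigma, sigma_l: (s_l,); sigma_prev: (s_{l-1},); W: (s_{l-1}, s_l)."""
    delta = dE_dsigma * sigma_l * (1.0 - sigma_l)          # (s_l,)
    return np.outer(sigma_prev, delta) + (lam / m) * W     # (s_{l-1}, s_l)

def output_layer_dE_dsigma(y, sigma_L):
    """dE/dsigma_j^L = -(y_j / sigma_j^L - (1 - y_j) / (1 - sigma_j^L))."""
    return -(y / sigma_L - (1.0 - y) / (1.0 - sigma_L))
```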

SLIDE 9

Backpropagation in Action: Sigmoid + Cross-Entropy-specific

[Figure: node $\sigma_i^{l-1}$ of the $(l-1)$th layer feeds the $l$th layer, whose nodes now carry both $\partial E/\partial \sigma_j^l$ and $\sigma_j^l$, $\ldots$, $\partial E/\partial \sigma_{s_l}^l$ and $\sigma_{s_l}^l$, received back from the $(l+1)$th layer nodes $\sigma_1^{l+1}, \ldots, \sigma_{s_{l+1}}^{l+1}$]

$$\frac{\partial E}{\partial w_{ij}^l} = \frac{\partial E}{\partial \sigma_j^l}\, \sigma_j^l\big(1-\sigma_j^l\big)\, \sigma_i^{l-1} + \frac{\lambda}{m}\, w_{ij}^l$$

SLIDE 10

Backpropagation in Action

[Figure: the same network; each weight is updated using its gradient]

$$w_{ij}^l = w_{ij}^l - \eta\, \frac{\partial E}{\partial w_{ij}^l}$$

SLIDE 11

The Backpropagation Algorithm: Identify the Sigmoid + Cross-Entropy-specific steps

1. Randomly initialize the weights $w_{ij}^l$ for $l = 1, \ldots, L$, $i = 1, \ldots, s_l$, $j = 1, \ldots, s_{l+1}$.
2. Implement forward propagation to get $f_w(x)$ for every $x \in D$.
3. Execute backpropagation on any misclassified $x \in D$ by performing gradient descent to minimize the (non-convex) $E(w)$ as a function of the parameters $w$:
4. $\frac{\partial E}{\partial \sigma_j^L} = -\left( \frac{y_j}{\sigma_j^L} - \frac{1-y_j}{1-\sigma_j^L} \right)$ for $j = 1$ to $s_L$.
5. For $l = L-1$ down to $2$:
   1. $\frac{\partial E}{\partial \sigma_j^l} = \sum_{p=1}^{s_{l+1}} \frac{\partial E}{\partial \sigma_p^{l+1}}\, \sigma_p^{l+1}\big(1-\sigma_p^{l+1}\big)\, w_{jp}^{l+1}$
   2. $\frac{\partial E}{\partial w_{ij}^l} = \frac{\partial E}{\partial \sigma_j^l}\, \sigma_j^l\big(1-\sigma_j^l\big)\, \sigma_i^{l-1} + \frac{\lambda}{m}\, w_{ij}^l$
   3. $w_{ij}^l = w_{ij}^l - \eta\, \frac{\partial E}{\partial w_{ij}^l}$
6. Keep picking misclassified examples until the cost function $E(w)$ stops showing significant reduction; else resort to some random perturbation of the weights $w$ and restart a couple of times.
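A compact numpy sketch of steps 1-5 for a single training pair, with weight matrices `W[l]` of shape $(s_{l-1}, s_l)$ and no bias units. The network sizes, the learning rate, and the fixed iteration count standing in for step 6 are all illustrative assumptions, not the slides' prescription.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W):
    """Step 2: forward propagation. Returns activations sigma^0..sigma^L, with sigma^0 = x."""
    sigmas = [x]
    for Wl in W:
        sigmas.append(sigmoid(sigmas[-1] @ Wl))
    return sigmas

def backprop_update(x, y, W, eta=0.1, lam=0.0, m=1):
    """Steps 4-5: backpropagate dE/dsigma and take one gradient step on every layer."""
    sigmas = forward(x, W)
    # Step 4: output-layer gradient of the cross-entropy loss w.r.t. sigma^L
    dE_dsigma = -(y / sigmas[-1] - (1.0 - y) / (1.0 - sigmas[-1]))
    for l in range(len(W) - 1, -1, -1):
        sig_l, sig_prev = sigmas[l + 1], sigmas[l]
        delta = dE_dsigma * sig_l * (1.0 - sig_l)
        # Step 5.2: gradient w.r.t. the weights of layer l (plus regularization)
        dE_dW = np.outer(sig_prev, delta) + (lam / m) * W[l]
        # Step 5.1: gradient w.r.t. the previous layer's activations, used next iteration
        dE_dsigma = W[l] @ delta
        # Step 5.3: gradient-descent update
        W[l] -= eta * dE_dW
    return W

# Step 1: random initialization for a tiny 3-4-2 network; one (x, y) training pair
rng = np.random.default_rng(0)
W = [rng.normal(scale=0.1, size=(3, 4)), rng.normal(scale=0.1, size=(4, 2))]
x, y = rng.random(3), np.array([1.0, 0.0])
for _ in range(100):   # Step 6, simplified: just iterate a fixed number of times
    W = backprop_update(x, y, W)
```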

SLIDE 12

Challenges in training Deep Neural Networks

1. Actual evaluation function ≠ surrogate loss function (most often).
   Classification: F-measure ≠ MLE/cross-entropy/hinge loss. Regression: mean absolute error ≠ sum-of-squares error loss. Sequence prediction: BLEU score ≠ cross-entropy.
2. Appropriately exploiting the decomposability of surrogate loss functions: stochasticity vs. redundancy, mini-batches, etc.
3. Local minima, extremely fluctuating curvature², large gradients, ill-conditioning: momentum (a gradient accumulator), adaptive gradients, clipping.
4. Overfitting ⇒ need for generalization.
   Universal approximation properties and depth (Section 6.4): with a single hidden layer of sufficient size, one can represent any smooth function to any desired accuracy; the greater the desired accuracy, the more hidden units are required. L1 (L2) regularization, early stopping, dropout, etc.

²see demo

SLIDE 13

Generalization through Parameter Tying/Sharing, Sparse Representations: the Lego world of neural networks, convolutional neural networks, recurrent neural networks

SLIDE 14

Challenges and Opportunities with Neural Networks

1. Local optima: only an approximately correct solution. But stochastic gradient descent, by avoiding even local minima³, often gives good generalization.
2. Training data: need for a large number of training instances. Pre-training⁴.
3. Extensive computation and numerical precision: with lots of gradient computations and backpropagation, errors can compound. Advances in numerical computing tricks, matrix multiplication and GPUs!
4. Architecture design: how many nodes and edges in each hidden layer? How many layers? Network structures can be overestimated and then regularized using dropout, i.e., randomly multiply the output of a node by 0 using a random dropout bit vector $d \in \{0,1\}^{\sum_l s_l}$ across all nodes: $\Pr(y \mid x) = \sum_d \Pr(d)\, \Pr(y \mid x, d)$ (see the sketch after this list).
5. Associating semantics by parameter tying/sharing & sparse representations (Secs 7.9, 7.10): architectures to suit particular tasks? Architectures can be designed keeping the problem in mind: examples are CNNs, RNNs, memory cells, LSTMs, BiLSTMs, embeddings, Inception, attention networks, etc.

³See Quiz 2, Problem 3 for when an optimal solution can hurt.
⁴Unsupervised learning of the parameters in the first few layers of the NN.
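A hedged numpy sketch of the dropout idea from item 4: sample a bit vector $d$ over the node outputs during training, and approximate the marginal $\Pr(y \mid x) = \sum_d \Pr(d)\Pr(y \mid x, d)$ by averaging over sampled masks. The keep probability of 0.5, the Monte-Carlo averaging, and the `forward_fn` interface are illustrative assumptions, not prescribed by the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(sigma, p_keep=0.5, train=True):
    """Multiply each node output sigma_i by a random dropout bit d_i ~ Bernoulli(p_keep)."""
    if not train:
        return sigma * p_keep               # expected value of d_i * sigma_i at test time
    d = rng.random(sigma.shape) < p_keep    # one random bit per node
    return sigma * d

def predict_marginalized(x, forward_fn, n_masks=100):
    """Approximate Pr(y|x) = sum_d Pr(d) Pr(y|x, d) by Monte-Carlo over dropout masks."""
    return np.mean([forward_fn(x, train=True) for _ in range(n_masks)], axis=0)
```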

SLIDE 15

Recap: Regularization

Lowering neural network bias: increase (i) the number of epochs, (ii) the network size.

Lowering neural network variance: (i) increase the training set size, (ii) use a smarter/more intelligent neural network architecture, (iii) regularize.

Methods of regularization:
- Norm penalties (Sec 7.1) & dropout (Sec 7.12)
- Bagging and other ensemble methods (Sec 7.11)
- Domain-knowledge-based methods: dataset augmentation (Sec 7.4), manifold tangent classifier (Sec 7.14)
- Optional: multi-task learning (Sec 7.7), semi-supervised & adversarial learning, parameter tying & sharing, sparse representations (Secs 7.9, 7.10, Chapters 9 & 10)

SLIDE 16

Neural Networks: Towards Intelligence

Great expressive power:
- Recall the VC dimension. Recall from Tutorial 3 (Curse of Dimensionality): given $n$ boolean variables we can have $2^{2^n}$ boolean functions.
- Varied degrees of non-linearity using activation functions.

The catch? Training: it has to be scalable, fast, stable, generalizable, intelligent.

SLIDE 17

The Lego Blocks in Modern Deep Learning

1. Depth/feature maps
2. Patches/filters (provide for spatial interpolations)
3. Non-linear activation units (provide for detection/classification)
4. Strides (enable downsampling)
5. Padding (controls shrinking across layers)
6. Pooling (non-linear downsampling)
7. Inception [Optional: extra slides]
8. RNNs, Attention and LSTMs (backpropagation through time and memory cells) [Optional: extra slides]
9. Embeddings (unsupervised learning) [Optional: extra slides]

SLIDE 18

[Optional] What Changed with Neural Networks?

Origin: the computational model of threshold logic from Warren McCulloch and Walter Pitts (1943).
Big leap: on the ImageNet Challenge, AlexNet achieved 85% accuracy (NIPS 2012); the previous best was 75% (CVPR 2011), and a subsequent best was 96.5% by MSRA (arXiv 2015), comparable to human-level accuracy. The challenges involved varied backgrounds, the same object in different colors (e.g., cats), varied sizes and postures of the same objects, and varied illumination conditions.
Tasks like OCR and speech recognition are now possible without segmenting the word image/signal into character images/signals.

SLIDE 19

LeNet (1989 and 1998) vs. AlexNet (2012)

SLIDE 20

[Optional] Reasons for the Big Leap

Why was LeNet not as successful as AlexNet, even though the algorithm was the same? It was the right algorithm at the wrong time. Since then: modern features and advancements in machine learning; realistic data collection in huge amounts due to regular competitions, evaluation metrics and challenging problem statements; advances in computational resources (GPUs, industrial-scale clusters); and evolution of the tasks, from classification of 10 objects, to 100 objects, to a "structure of classes".

SLIDE 21

Example: Cat with varied poses/backgrounds

Can you similarly …

SLIDE 22

CONVOLUTIONAL NEURAL NETWORKS

SLIDE 23

Recall: Fully connected networks

[Figure: every node $\sigma_1^{l-1}, \sigma_2^{l-1}, \ldots, \sigma_i^{l-1}, \ldots, \sigma_{s_{l-1}}^{l-1}$ of the $(l-1)$th layer connects through a weight $w_{ij}^l$ to every node of the $l$th layer, where $\sigma_j^l = \sigma\big(\mathrm{sum}_j^l\big), \ldots, \sigma_{s_l}^l = \sigma\big(\mathrm{sum}_{s_l}^l\big)$; further layers lead to the outputs $\sigma_1^L, \ldots, \sigma_K^L$]

SLIDE 24

Challenges with Neural Networks (contd.)

A 200 × 200 grayscale image with 40k hidden units ⇒ around 1.6 billion parameters!

SLIDE 25

Challenges with Neural Networks (contd.)

Now consider the task of colored image recognition. Input image size: 200 × 200 × 3 (RGB). Multi-layer perceptron (MLP): a hidden layer with 40k neurons results in how many parameters? Question: How many neurons (location specific)? Answer:

SLIDE 26

Convolutional Neural Network

https://www.youtube.com/watch?v=vRDkZOv_kck&index=21&list=PLyo3HAXSZD3zfv9O-y9DJhvrWQPscqATa https://www.youtube.com/watch?v=KoEHIS06GfI&index=22&list=PLyo3HAXSZD3zfv9O-y9DJhvrWQPscqATa

SLIDE 27

Convolutional Neural Network

A variation of the multi-layer feedforward neural network designed to use minimal preprocessing, with wide application in image recognition and natural language processing. Traditional MLP models do not take into account the spatial structure of the data and suffer from the curse of dimensionality. A convolutional neural network has a smaller number of effective parameters due to local connections, weight sharing and equivariant representations.

SLIDE 28

Challenges and Opportunities with Neural Networks

Consider the task of RGB colored image recognition ⇒ the size of the previous feature map needs to be multiplied by 3. Input image size: 200 × 200 × 3 (RGB). MLP: a hidden layer with 40k neurons results in 4.8 billion parameters. With convolutional neural networks?

SLIDE 29

The Lego Blocks in Modern Deep Learning

SLIDE 30

The Lego Blocks in Modern Deep Learning

1. Depth/feature maps [e.g., red, green and blue feature maps]
2. Patches/filters (provide for spatial interpolations)
3. Non-linear activation units (provide for detection/classification)
4. Strides (enable downsampling)
5. Padding (controls shrinking across layers)
6. Pooling (non-linear downsampling)
7. Inception [Optional: extra slides]
8. RNNs, Attention and LSTMs (backpropagation through time and memory cells) [Optional: extra slides]
9. Embeddings (unsupervised learning) [Optional: extra slides]

SLIDE 31

Convolution: Sparse Interactions through Filters K(·) (for a Single Feature Map)

[Figure: input/$(l-1)$th layer nodes $x_1, \ldots, x_5$ connect to $l$th layer nodes $h_1, \ldots, h_5$ only locally, through the weights $w_{11}^l, w_{12}^l, w_{21}^l, w_{22}^l, w_{23}^l, w_{32}^l, w_{33}^l, w_{34}^l, w_{43}^l, w_{44}^l, w_{45}^l, w_{54}^l, w_{55}^l$]

SLIDE 32

Convolution: Sparse Interactions through Filters K(·) (for a Single Feature Map)

[Figure: the same locally connected network between the input/$(l-1)$th layer and the $l$th layer]

$$h_i = \sum_m x_m\, w_{mi}\, K(i-m)$$

On the RHS, $K(i-m) = 1$ iff $|m-i| \le 1$. For 2-D inputs (such as images):

SLIDE 33

Convolution: Sparse Interactions through Filters K(·) (for a Single Feature Map)

[Figure: the same locally connected network between the input/$(l-1)$th layer and the $l$th layer]

$$h_i = \sum_m x_m\, w_{mi}\, K(i-m)$$

On the RHS, $K(i-m) = 1$ iff $|m-i| \le 1$. For 2-D inputs (such as images):

$$h_{ij} = \sum_m \sum_n x_{mn}\, w_{ij,mn}\, K(i-m, j-n)$$

Intuition: neighboring signals $x_m$ (or pixels $x_{mn}$) are more relevant than ones further away; this reduces prediction time. The operation can be viewed as multiplication with a Toeplitz matrix $K$ (which has each row equal to the row above shifted by one element). Further, $K$ is sparse with respect to a parameter $\theta$ (e.g., $K(i-m) = 1$ iff $|m-i| \le \theta$).
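One way to read the 1-D formula is as an ordinary matrix product in which the banded mask $K$ zeroes out all but the local connections; the weights $w_{mi}$ here are still position-specific (sharing comes on the next slides). A small numpy sketch of that reading; the sizes and names are illustrative.

```python
import numpy as np

n = 5
rng = np.random.default_rng(0)
x = rng.random(n)
W = rng.normal(size=(n, n))      # position-specific weights w_mi (no sharing yet)

# K(i - m) = 1 iff |m - i| <= 1: a banded, Toeplitz-structured 0/1 mask
i, m = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
K = (np.abs(m - i) <= 1).astype(float)

# h_i = sum_m x_m * w_mi * K(i - m): only the neighbors of i contribute
h = (W.T * K) @ x
```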

SLIDE 34

Convolution: Shared Parameters and Patches (for a Single Feature Map)

[Figure: input/$(l-1)$th layer nodes $x_1, \ldots, x_5$ connect locally to $l$th layer nodes $h_1, \ldots, h_5$, now reusing the same shared weights $w_{-1}^l, w^l, w_{+1}^l$ at every position]

SLIDE 35

Convolution: Shared Parameters and Patches (for a Single Feature Map)

[Figure: the same locally connected network with shared weights]

$$h_i = \sum_m x_m\, w_{i-m}\, K(i-m)$$

On the RHS, $K(i-m) = 1$ iff $|m-i| \le 1$. For 2-D inputs (such as images):

SLIDE 36

Convolution: Shared Parameters and Patches (for a Single Feature Map)

[Figure: the same locally connected network with shared weights]

$$h_i = \sum_m x_m\, w_{i-m}\, K(i-m)$$

On the RHS, $K(i-m) = 1$ iff $|m-i| \le 1$. For 2-D inputs (such as images):

$$h_{ij} = \sum_m \sum_n x_{mn}\, w_{i-m,j-n}\, K(i-m, j-n)$$

Intuition: neighboring signals $x_m$ (or pixels $x_{mn}$) affect the output in a similar way irrespective of location (i.e., of the value of $m$ or $n$). More intuition: this corresponds to moving a patch around the image. It further reduces the storage requirement but does not affect prediction time. Further, $K$ is often sparse (e.g., $K(i-m) = 1$ iff $|m-i| \le \theta$).
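With sharing, the per-position weights collapse to one small patch that is slid over the whole input. A minimal sketch of the 2-D case with a 3×3 patch, stride 1 and no padding; it is written as cross-correlation (patch not flipped), the form most CNN frameworks use, whereas the slide's $w_{i-m,j-n}$ indexing corresponds to a flipped patch. Function and variable names are illustrative.

```python
import numpy as np

def conv2d_valid(x, w):
    """h_ij = sum over (m, n) of x[i+m, j+n] * w[m, n]: one shared patch slid over the image."""
    H, W_ = x.shape
    P, Q = w.shape
    out = np.zeros((H - P + 1, W_ - Q + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + P, j:j + Q] * w)
    return out

x = np.arange(36.0).reshape(6, 6)
w = np.ones((3, 3)) / 9.0        # e.g., the blurring patch from a later slide
h = conv2d_valid(x, w)           # output shape (4, 4): "valid", no padding
```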

SLIDE 37

Convolution: Strides and Padding (for a Single Feature Map)

[Figure: the same locally connected network with shared weights between the input/$(l-1)$th layer and the $l$th layer]

SLIDE 38

Convolution: Strides and Padding (for a Single Feature Map)

[Figure: the same locally connected network with shared weights]

Consider only the $h_i$'s where $i$ is a multiple of $s$. Intuition: a stride of $s$ corresponds to moving the patch $s$ steps at a time. More intuition: a stride of $s$ corresponds to downsampling by $s$. What to do at the corners?

SLIDE 39

Convolution: Strides and Padding (for a Single Feature Map)

[Figure: the same locally connected network with shared weights]

Consider only the $h_i$'s where $i$ is a multiple of $s$. Intuition: a stride of $s$ corresponds to moving the patch $s$ steps at a time. More intuition: a stride of $s$ corresponds to downsampling by $s$. What to do at the corners? Answer: pad with 0's at the edges to create an output of the same size as the input ("same" padding), or use no padding at all and let the next layer have fewer nodes ("valid" padding). This reduces the storage requirement as well as prediction time.
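Extending the earlier sketch with a stride $s$ and zero padding reproduces the "same" vs. "valid" choice just described. The helper name `conv2d` and the example sizes below are illustrative assumptions.

```python
import numpy as np

def conv2d(x, w, stride=1, pad=0):
    """Zero-pad the input by `pad` on each side, then slide the shared patch with step `stride`."""
    x = np.pad(x, pad)                      # 0's at the edges
    H, W_ = x.shape
    P, Q = w.shape
    out_h = (H - P) // stride + 1
    out_w = (W_ - Q) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(x[r:r + P, c:c + Q] * w)
    return out

x = np.random.default_rng(0).random((8, 8))
w = np.ones((3, 3)) / 9.0
same  = conv2d(x, w, stride=1, pad=1)   # "same" padding: output stays 8 x 8
valid = conv2d(x, w, stride=2, pad=0)   # stride 2, no padding: downsampled to 3 x 3
```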

SLIDE 40

Examples of Convolutional Filters: Guess what each does

Filter 1:
+1  −1
+2  −2
+1  −1

⁵Also referred to as kernels, but not to be confused with the positive semi-definite kernel.

SLIDE 41

Examples of Convolutional Filters: Guess what each does

Filter 1: Sobel vertical edge detector
+1  −1
+2  −2
+1  −1

Filter 2:
+1  +2  +1
−1  −2  −1

⁵Also referred to as kernels, but not to be confused with the positive semi-definite kernel.

SLIDE 42

Examples of Convolutional Filters: Guess what each does

Filter 1: Sobel vertical edge detector
+1  −1
+2  −2
+1  −1

Filter 2: Sobel horizontal edge detector
+1  +2  +1
−1  −2  −1

Filter 3: Image blurring filter
1/9  1/9  1/9
1/9  1/9  1/9
1/9  1/9  1/9

Filter 4: Image sharpening filter (coefficients): −1  −1  3  −1  −1

Illustration at https://www.saama.com/blog/different-kinds-convolutional-filters/
In CNNs, these filters⁵ (i.e., the weights $w_{i-m,j-n}$) are generally learnt from the data. Filter size ⇒ strong prior; filter values ⇒ posterior.

⁵Also referred to as kernels, but not to be confused with the positive semi-definite kernel.
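To make "guess what each does" concrete, here is a short numpy illustration that applies the standard 3×3 Sobel operators (the same ±1/±2 pattern as the slides, written with an explicit zero column and row) to a synthetic image containing a single vertical edge. The image, the helper name and the 3×3 variants are assumptions for illustration.

```python
import numpy as np

# Standard 3x3 Sobel operators (same +-1 / +-2 pattern as on the slides)
sobel_vertical = np.array([[+1, 0, -1],
                           [+2, 0, -2],
                           [+1, 0, -1]], dtype=float)
sobel_horizontal = sobel_vertical.T      # [[+1,+2,+1],[0,0,0],[-1,-2,-1]]

# Synthetic image: dark left half, bright right half -> one vertical edge
img = np.zeros((8, 8))
img[:, 4:] = 1.0

def correlate_valid(x, w):
    """Slide the filter over the image with stride 1 and no padding."""
    P, Q = w.shape
    return np.array([[np.sum(x[i:i + P, j:j + Q] * w)
                      for j in range(x.shape[1] - Q + 1)]
                     for i in range(x.shape[0] - P + 1)])

edges_v = correlate_valid(img, sobel_vertical)    # strong response at the vertical edge
edges_h = correlate_valid(img, sobel_horizontal)  # all zeros: no horizontal edge present
```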

SLIDE 43

The Convolutional Filter

SLIDE 44

The Convolutional Filter

SLIDE 45

The Convolutional Filter

SLIDE 46

Question: MLP vs. CNN

Convolution leverages three important ideas that can help improve a machine learning system: (a) sparse interactions, (b) parameter sharing, and (c) equivariant representations, i.e., $f(g(x)) = g(f(x))$ when $f$ is convolution and $g$ is a shift function. We just saw these in action:

SLIDE 47

Question: MLP vs. CNN

Convolution leverages three important ideas that can help improve a machine learning system: (a) sparse interactions, (b) parameter sharing, and (c) equivariant representations, i.e., $f(g(x)) = g(f(x))$ when $f$ is convolution and $g$ is a shift function. We just saw these in action:

Input image size: 200 × 200 × 3. MLP: the hidden layer has 40k neurons, resulting in 4.8 billion parameters. CNN: say the hidden layer has 20 feature maps, each of size 5 × 5 × 3, with stride = 1 and zero padding of 4 on each side, i.e., maximum overlap of the convolution windows. A feature map corresponds to one set of weights $w_{ij}^l$; $F$ feature maps ⇒ $F$ times the number of weight parameters.
Question: How many parameters? Answer:
Question: How many neurons (location specific)? Answer:

SLIDE 48

Answer: MLP vs. CNN

MLP: the hidden layer has 40k neurons, so it has 4.8 billion parameters. CNN: the hidden layer has 20 feature maps, each of size 5 × 5 × 3, with stride = 1 and zero padding of 4 on each side, i.e., maximum overlap of the convolution windows.
Question: How many parameters? Answer: just 1500.
Question: How many neurons (location specific)? Let $M \times N \times 3$ be the dimensions of the image and $P \times Q \times 3$ the dimensions of the convolution filter. Let $D$ be the number of zero paddings and $s$ the stride length.
Answer: Output size =

SLIDE 49

Answer: MLP vs. CNN

MLP: the hidden layer has 40k neurons, so it has 4.8 billion parameters. CNN: the hidden layer has 20 feature maps, each of size 5 × 5 × 3, with stride = 1 and zero padding of 4 on each side, i.e., maximum overlap of the convolution windows.
Question: How many parameters? Answer: just 1500.
Question: How many neurons (location specific)? Let $M \times N \times 3$ be the dimensions of the image and $P \times Q \times 3$ the dimensions of the convolution filter. Let $D$ be the number of zero paddings and $s$ the stride length.
Answer: Output size $= \left(\frac{M - P + 2D}{s} + 1\right) \times \left(\frac{N - Q + 2D}{s} + 1\right)$.
In the current case, $D = P - 1$ and $s = 1$ ⇒ Output size $= \left(\frac{M + P}{s} - 1\right) \times \left(\frac{N + Q}{s} - 1\right)$, so there are $20 \times \left(\frac{200 + 5}{s} - 1\right) \times \left(\frac{200 + 5}{s} - 1\right) = 832320$ neurons (around 830 thousand, which can increase with max-pooling). If $D = (P - 1)/2$ and $s = 1$, ...

SLIDE 50

Answer: MLP vs. CNN

MLP: the hidden layer has 40k neurons, so it has 4.8 billion parameters. CNN: the hidden layer has 20 feature maps, each of size 5 × 5 × 3, with stride = 1 and zero padding of 4 on each side, i.e., maximum overlap of the convolution windows.
Question: How many parameters? Answer: just 1500.
Question: How many neurons (location specific)? Let $M \times N \times 3$ be the dimensions of the image and $P \times Q \times 3$ the dimensions of the convolution filter. Let $D$ be the number of zero paddings and $s$ the stride length.
Answer: Output size $= \left(\frac{M - P + 2D}{s} + 1\right) \times \left(\frac{N - Q + 2D}{s} + 1\right)$.
In the current case, $D = P - 1$ and $s = 1$ ⇒ Output size $= \left(\frac{M + P}{s} - 1\right) \times \left(\frac{N + Q}{s} - 1\right)$, so there are $20 \times \left(\frac{200 + 5}{s} - 1\right) \times \left(\frac{200 + 5}{s} - 1\right) = 832320$ neurons (around 830 thousand, which can increase with max-pooling). If $D = (P - 1)/2$ and $s = 1$, the output will be of the same size as the input!