 
Supervising the neocognitron
[Figure: network with an added decision layer; label: Output class label(s)]
• Add an extra decision layer after the final C layer
  - Produces a class-label output
• We now have a fully feed-forward MLP with shared parameters
  - All the S-cells within an S-plane have the same weights
• Simple backpropagation can now train the S-cell weights in every plane of every layer
  - C-cells are not updated

Scanning vs. multiple filters
• Note: the original Neocognitron actually uses many identical copies of a neuron in each S and C plane
• This is, however, identical to "scanning" (convolving) with a single neuron/filter (what LeNet actually did)

Supervising the neocognitron: the math
• Assuming square receptive fields, rather than elliptical ones
  - Receptive field of S-cells in the $l$-th layer is $K_l \times K_l$
  - Receptive field of C-cells in the $l$-th layer is $L_l \times L_l$
• S-cell response:
  $$U_{S,l,n}(i,j) = \sigma\Big(\sum_{p}\sum_{k=1}^{K_l}\sum_{m=1}^{K_l} w_{S,l,n}(p,k,m)\, U_{C,l-1,p}(i+k,\, j+m)\Big)$$
• C-cell response:
  $$U_{C,l,n}(i,j) = \max_{k \in (i,\, i+L_l),\ m \in (j,\, j+L_l)} U_{S,l,n}(k,m)$$
Convolutional Neural Networks
The general architecture of a convolutional neural network
[Figure: network diagram; labels: Output, Multi-layer Perceptron]
• A convolutional neural network comprises "convolutional" and "downsampling" layers
  - The two may occur in any sequence, but typically they alternate
• Followed by an MLP with one or more layers
• Convolutional layers and the MLP are learnable
  - Their parameters must be learned from training data for the target classification task
• Downsampling layers are fixed and generally not learnable
A convolutional layer
[Figure: the maps of a layer, computed from the previous layer]
• A convolutional layer comprises a series of "maps"
  - Corresponding to the "S-planes" in the Neocognitron
  - Variously called feature maps or activation maps
• Each activation map has two components
  - A linear map, obtained by convolution over the maps in the previous layer
    • Each linear map has, associated with it, a learnable filter
  - An activation that operates on the output of the convolution
• All the maps in the previous layer contribute to each convolution
  - Consider the contribution of a single map
What is a convolution
Example 5x5 image with binary pixels (input map):
  1 1 1 0 0
  0 1 1 1 0
  0 0 1 1 1
  0 0 1 1 0
  0 1 1 0 0
Example 3x3 filter (plus a bias):
  1 0 1
  0 1 0
  1 0 1
$$z(i,j) = \sum_{k=1}^{3}\sum_{m=1}^{3} f(k,m)\, I(i+k,\, j+m) + b$$
• Scanning an image with a "filter"
  - Note: a filter is really just a perceptron, with weights and a bias
  - At each location, the filter and the underlying map values are multiplied component-wise, and the products are added along with the bias
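The scan described on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's own code: the function name `scan` is hypothetical, the bias is assumed to be 0, and indices are 0-based rather than the 1-based indices of the formula.

```python
import numpy as np

def scan(image, filt, bias=0.0):
    """Slide `filt` over `image` one pixel at a time, without padding."""
    rows = image.shape[0] - filt.shape[0] + 1
    cols = image.shape[1] - filt.shape[1] + 1
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            # Patch of the input currently under the filter
            patch = image[i:i + filt.shape[0], j:j + filt.shape[1]]
            # Component-wise product, summed, plus the bias
            out[i, j] = np.sum(patch * filt) + bias
    return out

# The 5x5 binary image and 3x3 filter from the slide
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
filt = np.array([[1, 0, 1],
                 [0, 1, 0],
                 [1, 0, 1]])

result = scan(image, filt)
print(result)
```

With a bias of 0 the scan yields a 3x3 map whose top-left entry is 4, since four of the five non-zero filter weights land on ones.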
The "stride" between adjacent scanned locations need not be 1
[Figure: the 3x3 filter (plus bias) scanned over the 5x5 image with a stride of 2; the output is filled in one entry at a time]
Resulting 2x2 output:
  4 4
  2 4
• Scanning an image with a "filter"
  - The filter may proceed by more than 1 pixel at a time
  - E.g. with a "stride" (or "hop") of two pixels per shift
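A strided scan visits only every `stride`-th window position. The sketch below is one possible vectorized formulation, assuming NumPy 1.20 or later for `sliding_window_view`. The name `scan_strided` is hypothetical and the bias is assumed to be 0.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def scan_strided(image, filt, stride=1, bias=0.0):
    """Scan `image` with `filt`, hopping `stride` pixels per shift."""
    # All stride-1 windows: shape (H-M+1, W-M+1, M, M)
    windows = sliding_window_view(image, filt.shape)
    # Keep only the windows the strided scan would visit
    windows = windows[::stride, ::stride]
    # Weighted sum over each window, plus the bias
    return np.einsum("ijkl,kl->ij", windows, filt) + bias

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
filt = np.array([[1, 0, 1],
                 [0, 1, 0],
                 [1, 0, 1]])

out_stride1 = scan_strided(image, filt, stride=1)
out_stride2 = scan_strided(image, filt, stride=2)
print(out_stride2)
```

The stride-2 result is simply the stride-1 result sampled at every second row and column, which is why the slicing step is all that changes.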
Extending to multiple input maps
• We actually compute any individual convolutional map from all the maps in the previous layer
• The actual processing is better understood if we change our visualization of the maps in a layer from a vertical arrangement to a stacked arrangement of planes
  [Figure: stacked arrangement of the kth layer of maps; filter applied to the kth layer of maps (convolutive component plus bias)]
• We can view the joint processing of the various maps as processing the stack using a three-dimensional filter
• For a filter spanning $L \times L$ positions in each input plane $p$:
  $$z(i,j) = \sum_{p}\sum_{k=1}^{L}\sum_{m=1}^{L} w_{S,l,n}(p,k,m)\, Y_p(i+k,\, j+m) + b$$
• The computation of the convolutive map at any location sums the convolutive outputs at all planes
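To make the "three-dimensional filter" view concrete, here is a minimal sketch with hypothetical names (`conv_map`) and random data. It assumes the maps are stored as an array of shape (planes, height, width) and uses 0-based indices. It also checks that the 3-D computation equals the sum of the per-plane 2-D outputs.

```python
import numpy as np

def conv_map(maps, filt, bias=0.0):
    """Compute one output map from a stack of input maps.

    maps: (P, H, W) stack of input planes
    filt: (P, L, L) three-dimensional filter, one L x L slice per plane
    """
    _, height, width = maps.shape
    size = filt.shape[1]
    out = np.zeros((height - size + 1, width - size + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Sum over planes and over the L x L window in one step
            block = maps[:, i:i + size, j:j + size]
            out[i, j] = np.sum(block * filt) + bias
    return out

rng = np.random.default_rng(0)
maps = rng.standard_normal((3, 6, 6))   # three input planes (illustrative)
filt = rng.standard_normal((3, 3, 3))   # one 3x3 slice per plane
z = conv_map(maps, filt, bias=0.5)

# Same result, computed plane by plane and then summed
per_plane = sum(conv_map(maps[p:p + 1], filt[p:p + 1]) for p in range(3)) + 0.5
print(np.allclose(z, per_plane))
```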
The size of the convolution
[Figure: the 3x3 filter (plus bias) scanned over the input map]
• Image size: 5x5, filter: 3x3, stride: 1
  - Output size = 3x3
• Image size: 5x5, filter: 3x3, stride: 2
  - Output size = 2x2
• Image size: $N \times N$, filter: $M \times M$, stride: $S$
  - Output size (each side) = $\lfloor (N - M)/S \rfloor + 1$
  - Assuming you're not allowed to go beyond the edge of the input
Convolution size
• Simple convolution size pattern:
  - Image size: $N \times N$
  - Filter: $M \times M$
  - Stride: $S$
  - Output size (each side) = $\lfloor (N - M)/S \rfloor + 1$
    • Assuming you're not allowed to go beyond the edge of the input
• Results in a reduction in the output size
  - Even if $S = 1$
  - Not considered acceptable
• If there's no active downsampling, through max pooling and/or $S > 1$, then the output map should ideally be the same size as the input
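The size pattern above is easy to check numerically. The helper below is a small sketch (the name `output_side` is hypothetical) that applies the formula under the same assumption that the filter never goes beyond the edge of the input.

```python
def output_side(n, m, s=1):
    """Side length of the output for an n x n input, m x m filter, stride s."""
    if m > n:
        raise ValueError("filter is larger than the input")
    # floor((N - M) / S) + 1
    return (n - m) // s + 1

print(output_side(5, 3, 1))   # the 5x5 image, 3x3 filter, stride 1 example
print(output_side(5, 3, 2))   # the same example with stride 2
print(output_side(32, 5, 1))  # a 32x32 input shrinks even at stride 1
```

Even at stride 1 the output is smaller than the input by M - 1 pixels per side, which is the shrinkage the slide calls unacceptable.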
Solution
[Figure: the 5x5 input surrounded by a one-pixel border of zeros, scanned with the 3x3 filter (plus bias)]
• Zero-pad the input
  - Pad the input image/map all around, as symmetrically as possible
    • Add $P_L$ columns of zeros on the left and $P_R$ columns of zeros on the right
    • Add $P_L$ rows of zeros on the top and $P_R$ rows of zeros at the bottom
  - $P_L$ and $P_R$ chosen such that:
    • $P_L = P_R$ or $|P_L - P_R| = 1$
    • $P_L + P_R = M - 1$
  - For stride 1, the result of the convolution is the same size as the original image
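The padding rule can be sketched with `np.pad`. The function names are hypothetical, and when M - 1 is odd this sketch arbitrarily puts the extra row/column at the bottom/right; the slide only requires that the two sides differ by at most 1.

```python
import numpy as np

def same_padding(m):
    """Split the M - 1 zeros per axis into P_L and P_R, as evenly as possible."""
    total = m - 1
    p_l = total // 2
    p_r = total - p_l       # equals p_l, or p_l + 1 when M is even
    return p_l, p_r

def zero_pad(image, m):
    """Zero-pad a 2-D map so a stride-1 scan with an m x m filter keeps its size."""
    p_l, p_r = same_padding(m)
    return np.pad(image, ((p_l, p_r), (p_l, p_r)), mode="constant", constant_values=0)

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])

padded = zero_pad(image, 3)
print(padded.shape)                 # padded input size
print(padded.shape[0] - 3 + 1)      # output side after a stride-1 scan
```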
Why convolution?
• Convolutional neural networks are, in fact, equivalent to scanning with an MLP
  - Just run the entire MLP on each block separately, and combine results
    • As opposed to scanning (convolving) the picture with individual neurons/filters
  - Even computationally, the number of operations in both computations is identical
• The neocognitron in fact views it equivalently to a scan
• So why convolutions?
Cost of correlation
[Figure: an $N \times N$ image correlated with an $M \times M$ filter]
• Correlation:
  $$y(i,j) = \sum_{l}\sum_{m} x(i+l,\, j+m)\, w(l,m)$$
• Cost of scanning an $N \times N$ image with an $M \times M$ filter: $O(N^2 M^2)$
  - $M^2$ multiplications at each of $N^2$ positions
    • Not counting boundary effects
  - Expensive, for large filters

Correlation in the transform domain
• Correlation using DFTs:
  $$Y = \mathrm{IDFT2}\big(\mathrm{DFT2}(X) \circ \mathrm{conj}(\mathrm{DFT2}(W))\big)$$
• Cost of doing this using the Fast Fourier Transform to compute the DFTs: $O(N^2 \log N)$
  - Significant saving for large filters
  - Or if there are many filters
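The transform-domain identity can be checked against the direct scan. In this sketch (hypothetical function names, random real-valued data) the filter is zero-padded to the image size before its DFT. The product computes a circular correlation, so only the first N - M + 1 rows and columns, where no wrap-around occurs, are compared with the direct result.

```python
import numpy as np

def correlate_direct(x, w):
    """Direct correlation: M^2 multiplications at each valid position."""
    n, m = x.shape[0], w.shape[0]
    out = np.zeros((n - m + 1, n - m + 1))
    for i in range(n - m + 1):
        for j in range(n - m + 1):
            out[i, j] = np.sum(x[i:i + m, j:j + m] * w)
    return out

def correlate_fft(x, w):
    """Correlation via Y = IDFT2(DFT2(X) * conj(DFT2(W)))."""
    n, m = x.shape[0], w.shape[0]
    X = np.fft.fft2(x)
    W = np.fft.fft2(w, s=x.shape)          # zero-pads w to N x N
    y = np.real(np.fft.ifft2(X * np.conj(W)))
    # Keep only positions unaffected by circular wrap-around
    return y[:n - m + 1, :n - m + 1]

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))
w = rng.standard_normal((5, 5))

direct = correlate_direct(x, w)
via_fft = correlate_fft(x, w)
print(np.allclose(direct, via_fft))
```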
A convolutional layer
• The convolution operation results in a convolution map
• An activation is finally applied to every entry in the map

The other component: downsampling/pooling
[Figure: network diagram; labels: Output, Multi-layer Perceptron]
• Convolution (and activation) layers are followed intermittently by "downsampling" (or "pooling") layers
  - Often, they alternate with convolution, though this is not necessary
Recall: max pooling
[Figure: a "max" window scanned over the input map, one output entry per position]
• Max pooling selects the largest from a pool of elements
• Pooling is performed by "scanning" the input

"Strides"
[Figure: the "max" window moved over the input in steps larger than one pixel]
• The "max" operations may "stride" by more than one pixel
Max pooling
Single depth slice (x across, y down):
  1 1 2 4
  5 6 7 8
  3 2 1 0
  1 2 3 4
Max pool with 2x2 filters and stride 2:
  6 8
  3 4
• An $N \times N$ picture compressed by a $P \times P$ max-pooling filter with stride $D$ results in an output map of side (N − …

Alternative to max pooling: mean pooling
Mean pool with 2x2 filters and stride 2, on the same depth slice:
  3.25 5.25
  2    2
• An $N \times N$ picture compressed by a $P \times P$ mean-pooling filter with stride $D$ results in an output map of side (N − …
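Both pooling variants can be reproduced on the 4x4 depth slice from these slides. The sketch uses a hypothetical `pool` helper. Because the slide's own output-size expression is cut off, the side length here is assumed to follow the same floor formula as the convolution, floor((N - P)/D) + 1.

```python
import numpy as np

def pool(x, p, d, reduce=np.max):
    """Pool p x p blocks of a square map `x` with stride d."""
    n = x.shape[0]
    side = (n - p) // d + 1     # assumed size rule, see the note above
    out = np.zeros((side, side))
    for i in range(side):
        for j in range(side):
            block = x[i * d:i * d + p, j * d:j * d + p]
            out[i, j] = reduce(block)
    return out

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])

max_pooled = pool(x, p=2, d=2, reduce=np.max)
mean_pooled = pool(x, p=2, d=2, reduce=np.mean)
print(max_pooled)
print(mean_pooled)
```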
Other options
[Figure: a network applied to each 2x2 block of the single depth slice, striding by 2 in this example ("network in network")]
• The pooling may even be a learned filter
• The same network is applied on each block
• (Again, a shared-parameter network)
Putting everything together
• Typical image classification task

Convolutional Neural Networks
• Input: 1 or 3 images
  - Black and white or color
  - Will assume color to be generic
• Input: 3 pictures (one per color), each an $I \times I$ image

Preprocessing
• Typically works with square images
  - Filters are also typically square
• Large images are a problem
  - Too much detail
  - Will need big networks
• Typically scaled to small sizes, e.g. 32x32 or 128x128
Convolutional Neural Networks
[Figure: $I \times I$ image convolved with $K_1$ total filters, each of size $L \times L \times 3$]
• Input is convolved with a set of $K_1$ filters
  - Typically $K_1$ is a power of 2, e.g. 2, 4, 8, 16, 32, ...
  - Filters are typically 5x5, 3x3, or even 1x1
    • Small enough to capture fine features (particularly important for scaled-down images)
    • 1x1: what on earth is this?
The 1x1 filter
• A 1x1 filter is simply a perceptron that operates over the depth of the map, but has no spatial extent
  - Takes one pixel from each of the maps (at a given location) as input
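A short sketch of this idea, with hypothetical names and random data: the 1x1 filter holds one weight per input map, and the same weighted sum (plus bias) is taken at every pixel location. Only the affine part of the perceptron is shown. An activation would be applied to the result afterwards.

```python
import numpy as np

def one_by_one(maps, weights, bias=0.0):
    """Apply a 1x1 filter to a (P, H, W) stack of maps.

    weights: (P,) vector, one weight per input map.
    Returns an (H, W) map.
    """
    # Contract over the depth axis only; there is no spatial extent
    return np.tensordot(weights, maps, axes=([0], [0])) + bias

rng = np.random.default_rng(0)
maps = rng.standard_normal((3, 4, 4))    # e.g. three input maps (illustrative)
weights = rng.standard_normal(3)
out = one_by_one(maps, weights, bias=0.1)

# The same value, computed as a perceptron on one pixel's depth column
pixel = weights @ maps[:, 2, 1] + 0.1
print(np.isclose(out[2, 1], pixel))
```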
Convolutional Neural Networks
[Figure: $I \times I$ image convolved with $K_1$ total filters, each of size $L \times L \times 3$]
• Input is convolved with a set of $K_1$ filters
  - Typically $K_1$ is a power of 2, e.g. 2, 4, 8, 16, 32, ...
  - Better notation: filters are typically 5x5(x3), 3x3(x3), or even 1x1(x3)
  - Typical stride: 1 or 2
• Parameters to choose: $K_1$, $L$ and $S$
  1. Number of filters $K_1$
  2. Size of filters $L \times L \times 3$ + bias
  3. Stride of convolution $S$
• Total number of parameters: $K_1(3L^2 + 1)$
• The input may be zero-padded according to the size of the chosen filters
Convolutional Neural Networks
[Figure: $I \times I$ image; $K_1$ filters of size $L \times L \times 3$; output maps $Y^{(1)}_1, \dots, Y^{(1)}_{K_1}$, each $I \times I$]
• The layer includes a convolution operation followed by an activation (typically ReLU):
  $$z^{(1)}_n(i,j) = \sum_{p \in \{R,G,B\}}\sum_{k=1}^{L}\sum_{m=1}^{L} w^{(1)}_n(p,k,m)\, I_p(i+k,\, j+m) + b^{(1)}_n$$
  $$Y^{(1)}_n(i,j) = f\big(z^{(1)}_n(i,j)\big)$$
• First convolutional layer: several convolutional filters
  - Filters are "3-D" (third dimension is color)
  - Convolution followed typically by a ReLU activation
• Each filter creates a single 2-D output map
Learnable parameters in the first convolutional layer
• The first convolutional layer comprises $K_1$ filters, each of size $L \times L \times 3$
  - Spatial span: $L \times L$
  - Depth: 3 (3 colors)
• This represents a total of $K_1(3L^2 + 1)$ parameters
  - "+ 1" because each filter also has a bias
• All of these parameters must be learned
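The first layer and its parameter count can be sketched together. Everything here is illustrative: the function names, the sizes (8 filters of 5x5x3 on a 32x32 color image), the random weights and the absence of zero-padding are assumptions for the example, not values from the slides.

```python
import numpy as np

def first_conv_layer(image, filters, biases):
    """image: (3, I, I); filters: (K1, 3, L, L); biases: (K1,).

    Returns K1 maps after a ReLU activation (stride 1, no padding).
    """
    k1, _, size, _ = filters.shape
    side = image.shape[1] - size + 1
    z = np.zeros((k1, side, side))
    for n in range(k1):                      # one output map per filter
        for i in range(side):
            for j in range(side):
                block = image[:, i:i + size, j:j + size]
                z[n, i, j] = np.sum(block * filters[n]) + biases[n]
    return np.maximum(z, 0.0)                # ReLU

def num_parameters(k1, size, depth=3):
    """K1 * (depth * L^2 + 1): weights plus one bias per filter."""
    return k1 * (depth * size * size + 1)

k1, size, side_in = 8, 5, 32
rng = np.random.default_rng(0)
image = rng.standard_normal((3, side_in, side_in))
filters = rng.standard_normal((k1, 3, size, size))
biases = np.zeros(k1)

maps_out = first_conv_layer(image, filters, biases)
print(maps_out.shape)
print(num_parameters(k1, size))
```

The count returned by `num_parameters` matches the number of entries in `filters` plus `biases`, which is the quantity a training procedure would have to learn.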
Convolutional Neural Networks
[Figure: $I \times I$ image; filters of size $L \times L \times 3$; $K_1$ maps of size $I \times I$, pooled down to $K_1$ maps of size $(I/D) \times (I/D)$]
• The layer pools $P \times P$ blocks of $Y$ into a single value
  - It employs a stride $D$ between adjacent blocks
  $$U^{(1)}_n(i,j) = \max_{k \in \{(i-1)D+1,\, \dots,\, iD\},\ m \in \{(j-1)D+1,\, \dots,\, jD\}} Y^{(1)}_n(k,m)$$
• Parameters to choose:
  - Size of pooling block $P$
  - Pooling stride $D$
• Choices: max pooling or mean pooling? Or learned pooling?
• First downsampling layer: from each $P \times P$ block of each map, pool down to a single value
  - For max pooling, during training keep track of which position had the highest value
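Keeping track of the winning position can be sketched as follows. The helper name and the choice to store (row, column) coordinates in the input map are assumptions for illustration. Recording these positions is what lets a training procedure send the gradient back only to the element that produced each maximum.

```python
import numpy as np

def max_pool_with_argmax(y, p, d):
    """Max-pool p x p blocks of a 2-D map with stride d.

    Returns the pooled map and, for each output entry, the (row, col)
    position in `y` that held the maximum.
    """
    rows = (y.shape[0] - p) // d + 1
    cols = (y.shape[1] - p) // d + 1
    pooled = np.zeros((rows, cols))
    positions = np.zeros((rows, cols, 2), dtype=int)
    for i in range(rows):
        for j in range(cols):
            block = y[i * d:i * d + p, j * d:j * d + p]
            # Position of the largest value inside the block
            r, c = np.unravel_index(np.argmax(block), block.shape)
            pooled[i, j] = block[r, c]
            positions[i, j] = (i * d + r, j * d + c)
    return pooled, positions

y = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])

pooled, positions = max_pool_with_argmax(y, p=2, d=2)
print(pooled)
print(positions[0, 0])   # where the top-left maximum came from
```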