Convolutional Neural Networks
Presented by Tristan Maidment
Adapted from Ke Yu's slides

Outline
– Neural Network recap
– Building blocks of CNNs
– Architecture of CNNs
– Visualizing and understanding
Fully-connected (FC) layer
A network with input x, two hidden layers a[1] and a[2] (4 units each), and output ŷ:

a[1] = g(W[1] x + b[1]),    W[1] ~ (4, 3),  x ~ (3, m),     a[1] ~ (4, m)
a[2] = g(W[2] a[1] + b[2]), W[2] ~ (4, 4),  a[1] ~ (4, m),  a[2] ~ (4, m)
ŷ    = g(W[3] a[2] + b[3]), W[3] ~ (1, 4),  a[2] ~ (4, m),  ŷ ~ (1, m)
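As a concrete check of the shapes above, here is a minimal numpy sketch of the forward pass. The weights are random placeholders, and taking g to be ReLU is an assumption (the slide leaves the activation generic):

```python
import numpy as np

def g(z):
    return np.maximum(0, z)  # ReLU, standing in for a generic activation

m = 5                        # batch size: each column is one example
x = np.random.randn(3, m)    # input x ~ (3, m)

W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

a1 = g(W1 @ x + b1)          # (4, m)
a2 = g(W2 @ a1 + b2)         # (4, m)
y_hat = g(W3 @ a2 + b3)      # (1, m)

print(a1.shape, a2.shape, y_hat.shape)
```

Each layer's output shape matches the annotations above: the bias broadcasts across the m columns.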
Activation functions:
– sigmoid: g(y) = 1 / (1 + e^(−y))
– tanh: g(y) = (e^y − e^(−y)) / (e^y + e^(−y))
– ReLU: max(0, y)
– leaky ReLU: max(0.1y, y)
Gradient descent update (learning rate α):
W[2] := W[2] − α d[W[2]]
b[2] := b[2] − α d[b[2]]
Forward propagation:
z[1] = W[1] x + b[1]
a[1] = g(z[1])
z[2] = W[2] a[1] + b[2]
ŷ = σ(z[2])
Loss: L(ŷ, y)

Backward propagation (binary cross-entropy):
L(ŷ, y) = −(y log ŷ + (1 − y) log(1 − ŷ))
d[ŷ] = ∂L/∂ŷ = −y/ŷ + (1 − y)/(1 − ŷ)
d[z[2]] = ŷ − y
d[W[2]] = d[z[2]] a[1]ᵀ
d[b[2]] = d[z[2]]
d[a[1]] = W[2]ᵀ d[z[2]]
d[z[1]] = W[2]ᵀ d[z[2]] ∘ g′(z[1])
d[W[1]] = d[z[1]] xᵀ
d[b[1]] = d[z[1]]
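The backpropagation steps above can be sketched in numpy for a two-layer network. The sizes and random weights are illustrative assumptions; the hidden activation is taken to be ReLU and the output sigmoid, matching the binary cross-entropy loss:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

np.random.seed(0)
m = 4                                      # batch size
x = np.random.randn(3, m)                  # inputs
y = np.random.randint(0, 2, (1, m))        # binary labels

W1, b1 = np.random.randn(4, 3) * 0.1, np.zeros((4, 1))
W2, b2 = np.random.randn(1, 4) * 0.1, np.zeros((1, 1))

# Forward pass
z1 = W1 @ x + b1
a1 = np.maximum(0, z1)                     # ReLU hidden layer
z2 = W2 @ a1 + b2
y_hat = sigmoid(z2)

# Backward pass: dz2 = y_hat - y follows from the cross-entropy loss
dz2 = y_hat - y
dW2 = dz2 @ a1.T / m
db2 = dz2.sum(axis=1, keepdims=True) / m
dz1 = (W2.T @ dz2) * (z1 > 0)              # (z1 > 0) is the ReLU derivative
dW1 = dz1 @ x.T / m
db1 = dz1.sum(axis=1, keepdims=True) / m

print(dW1.shape, dW2.shape)                # gradients match parameter shapes
```

Averaging over the batch (the 1/m factors) is the usual convention when the loss is a mean over examples.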
Optimizers beyond plain gradient descent:
– Momentum:
  v_dW = γ v_dW + (1 − γ) dW,  v_db = γ v_db + (1 − γ) db
  W := W − α v_dW,  b := b − α v_db
– RMSProp:
  s_dW = γ s_dW + (1 − γ) dW²
  W := W − α dW / √(s_dW)
– Adam (momentum + RMSProp, with bias correction):
  v_dW = γ₁ v_dW + (1 − γ₁) dW
  s_dW = γ₂ s_dW + (1 − γ₂) dW²
  v_dW^corr = v_dW / (1 − γ₁ᵗ),  s_dW^corr = s_dW / (1 − γ₂ᵗ)
  W := W − α v_dW^corr / (√(s_dW^corr) + ε)
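The Adam update can be sketched directly from these equations. This is a minimal illustration, not a production optimizer; the default hyperparameter values are the commonly used ones and are an assumption here:

```python
import numpy as np

def adam_step(W, dW, v, s, t, alpha=0.001, gamma1=0.9, gamma2=0.999, eps=1e-8):
    """One Adam update: momentum term v, RMSProp term s, bias correction by t."""
    v = gamma1 * v + (1 - gamma1) * dW          # momentum (first moment)
    s = gamma2 * s + (1 - gamma2) * dW ** 2     # RMSProp (second moment)
    v_corr = v / (1 - gamma1 ** t)              # bias correction
    s_corr = s / (1 - gamma2 ** t)
    W = W - alpha * v_corr / (np.sqrt(s_corr) + eps)
    return W, v, s

# Toy objective ||W||^2, whose gradient is 2W
W = np.array([1.0, -2.0])
v = np.zeros_like(W)
s = np.zeros_like(W)
for t in range(1, 4):
    dW = 2 * W
    W, v, s = adam_step(W, dW, v, s, t)
print(W)  # both entries move toward 0
```

Note how early steps have size close to α regardless of gradient magnitude, because the bias-corrected ratio v_corr/√s_corr is roughly the gradient's sign.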
– Adding L1 (Lasso), L2 (Ridge), or sometimes a combination of both (Elastic Net) to the cost function
– Other norms are computationally inefficient
– Forward: multiply the output of the hidden layer by a mask of 0s and 1s randomly drawn from a Bernoulli distribution, removing all links to the dropped-out nodes
– Backward: do gradient descent through the diminished network
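A minimal sketch of the dropout forward/backward masking described above. The 1/keep_prob rescaling ("inverted dropout") is an addition not stated on the slide; it keeps the expected activation unchanged so no scaling is needed at test time:

```python
import numpy as np

np.random.seed(1)
keep_prob = 0.8
a = np.random.randn(4, 5)                 # hidden-layer output

# Forward: Bernoulli mask of 0s and 1s; rescale to preserve expectations
mask = (np.random.rand(*a.shape) < keep_prob).astype(a.dtype)
a_dropped = a * mask / keep_prob

# Backward: the same mask gates the gradient through the diminished network
da = np.ones_like(a)                      # placeholder upstream gradient
da_dropped = da * mask / keep_prob

print(int(mask.sum()), "of", mask.size, "units kept")
```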
General form: s(u) = ∫ f(a) g(u − a) da
Denoted by: s(u) = (f ∗ g)(u)
Network terminology:
– f: input, usually a multidimensional array
– g: kernel or filter
– s: output, referred to as the feature map
The Fourier transform of a convolution is the pointwise product of the Fourier transforms of its inputs:
ℱ{y ∗ x} = ℱ{y} · ℱ{x}
y ∗ x = ℱ⁻¹{ℱ{y} · ℱ{x}}
The FFT computes this in O(n log n), and GPU implementations (via NVIDIA CUDA) greatly outperform CPU-only alternatives.
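The convolution theorem can be verified numerically: convolving via the FFT gives the same result as direct convolution. The signal sizes here are arbitrary assumptions:

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(64)
y = np.random.randn(16)

# Direct convolution: O(n^2)
direct = np.convolve(x, y)

# Via the convolution theorem: transform, multiply pointwise, invert
n = len(x) + len(y) - 1          # length of the full linear convolution
fast = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(y, n), n)

print(np.allclose(direct, fast))  # True
```

Zero-padding both transforms to length n is what turns the FFT's circular convolution into the linear convolution computed by `np.convolve`.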
An example of 2D convolution without kernel flipping. Boxes connected by arrows indicate how the upper-left element of the output is formed by applying the kernel to the corresponding upper-left region of the input. This process is called template matching: the inner product between the kernel and a piece of the image is maximized exactly when those two vectors match up.
Identity
Edge detection 1
Edge detection 2
Box blur
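The kernels listed above can be applied with a naive 2D convolution without kernel flipping (i.e. cross-correlation, as in the slide's example). The 6×6 test image is an illustrative assumption:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D convolution without kernel flipping (cross-correlation)."""
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Inner product of the kernel with one image patch
            out[i, j] = np.sum(image[i:i+k, j:j+k] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
identity = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]], dtype=float)
box_blur = np.ones((3, 3)) / 9.0

print(conv2d(image, identity).shape)   # (4, 4): a 6x6 input, 3x3 kernel
```

The identity kernel simply copies each patch's center pixel, so its output equals the interior of the input image.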
every input unit.
input, CNNs typically have sparse interactions.
the memory requirements and improves statistical efficiency.
deeper layers may indirectly interact with a larger portion of the input.
complicated interactions from constructing simple building blocks that each describe only sparse interactions.
variables, while g3 is connected to all 5 input variables through indirect connections
the weight matrix is used exactly once when computing the output of a layer.
(except some of the boundary pixels).
learning a separate set of parameters for every location, we learn only one set.
model parameters. Thus convolution is dramatically more efficient than dense matrix multiplication in terms of memory requirements and statistical efficiency.
and subtracting the value of its neighboring
pixels wide. The output image is 319 pixels wide.
the same transformation with a matrix multiplication would need 320 × 280 × 319 × 280 > 8e9 weights
Input size: 320 by 280 Kernel size: 2 by 1 Output size: 319 by 280
translation.
input, its representation will move the same amount in the output.
The same edges appear everywhere in the image, so the same kernel can be used to extract features throughout.
Downsides of convolution
steps, we end up with a very small output.
from the edges of the image are thrown away.
Example: convolving a 6 by 6 input with a 3 by 3 kernel of ones produces a 4 by 4 output. With zero padding, the 6 by 6 input is extended to 8 by 8, and the same 3 by 3 convolution then produces a 6 by 6 output.
every layer
pixels, without zero padding, we are only able to have three convolutional layers
prevents the representation from shrinking with depth
want smaller output dimensions.
Example: convolving a 7 by 7 input with a 3 by 3 kernel at stride 2 produces a 3 by 3 output: (7 − 3)/2 + 1 = 3.
sampling strategy
involves downsampling is computationally wasteful.
Conv layer, 2D case:
Accepts an input of size W1 × H1
Hyperparameters:
– filter size: F × F
– amount of zero padding: P
– stride: S
Produces an output of size W2 × H2 where:
W2 = (W1 − F + 2P)/S + 1
H2 = (H1 − F + 2P)/S + 1

Conv layer, 3D case:
Accepts an input volume of size W1 × H1 × D1
Hyperparameters:
– filter size: F × F × D1
– amount of zero padding: P
– stride: S
– number of filters: K
Produces an output volume of size W2 × H2 × D2 where:
W2 = (W1 − F + 2P)/S + 1
H2 = (H1 − F + 2P)/S + 1
D2 = K
Number of parameters:
– Weights: F × F × D1 × K
– Bias: K
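The output-size and parameter-count formulas can be wrapped in a small helper. As a sketch, it is exercised on AlexNet's CONV1 (96 filters of 11×11 on a 227×227×3 input at stride 4, pad 0, per the AlexNet slide later in this deck):

```python
def conv_output_shape(W1, H1, D1, F, P, S, K):
    """Output volume and parameter count of a conv layer, per the formulas above."""
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    D2 = K                              # one output channel per filter
    n_params = F * F * D1 * K + K       # weights plus one bias per filter
    return (W2, H2, D2), n_params

# AlexNet CONV1
shape, n_params = conv_output_shape(227, 227, 3, F=11, P=0, S=4, K=96)
print(shape, n_params)  # (55, 55, 96) 34944
```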
http://cs231n.github.io/assets/conv-demo/index.html
representation
layer
up the computation.
– Max Pooling (most popular)
– Average Pooling
– L2 norm of a rectangular neighborhood
Hyperparameters:
Common choices:
representation approximately invariant to small translations of the input.
property if we care more about whether some feature is present than exactly where it is
row of the lower network has changed, but only half of the values in the top pooling layer have changed, because the max pooling units are sensitive only to the maximum value in the neighborhood, not its exact location.
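A small numpy sketch of this invariance: after shifting the input one pixel, many pooled values survive because each max pooling unit only reports the maximum of its neighborhood. The 4×4 input values are illustrative assumptions:

```python
import numpy as np

def max_pool(x, f=2, s=2):
    """Max pooling with filter size f and stride s."""
    H = (x.shape[0] - f) // s + 1
    W = (x.shape[1] - f) // s + 1
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = x[i*s:i*s+f, j*s:j*s+f].max()
    return out

x = np.array([[1., 3., 2., 1.],
              [4., 6., 5., 2.],
              [0., 1., 2., 3.],
              [1., 0., 4., 1.]])
shifted = np.roll(x, 1, axis=1)   # shift every row one pixel to the right

print(max_pool(x))
print(max_pool(shifted))          # several pooled maxima survive the shift
```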
learn to be invariant to transformations of the input, such as rotation.
detect a handwritten 5 and each filter attempts to match a slightly different
a large activation regardless of which filter unit was activated.
with different spatial sizes (e.g. 1×1, 3×3, 5×5, etc.)
28×28×192 input → 1×1×192 conv, 16 filters → 28×28×16 output
Bottleneck layer
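The point of the bottleneck can be seen by counting multiplications. The slide shows the 1×1 conv reducing 192 channels to 16; the follow-up 5×5 conv with 32 filters used below is an assumed example (the standard GoogLeNet illustration), not stated on the slide:

```python
def mults(out_hw, out_d, f, in_d):
    """Multiplications in a conv layer: out_H * out_W * out_D * (F * F * in_D)."""
    return out_hw * out_hw * out_d * f * f * in_d

# Direct 5x5 conv: 28x28x192 -> 28x28x32
direct = mults(28, 32, 5, 192)

# Bottleneck: 1x1 conv to 16 channels (as on the slide), then 5x5 conv to 32
bottleneck = mults(28, 16, 1, 192) + mults(28, 32, 5, 16)

print(f"{direct:,} vs {bottleneck:,}")  # roughly a 10x reduction
```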
LeNet-5: 32×32×1 → 28×28×6 → 14×14×6 → 10×10×16 → 5×5×16 → FC 120 → FC 84 → output
CONV1: 5 × 5, s = 1, d = 1, k = 6; POOL1: f = 2, s = 2
CONV2: 5 × 5, s = 1, d = 6, k = 16; POOL2: f = 2, s = 2
Handwritten character recognition
conv conv FC FC
Common pattern
28×28 → 14×14 → 10×10 → 5×5), while the number of channels will increase (e.g. 1 → 6 → 16)
LeCun, Bottou, Bengio, Haffner, โGradient-Based Learning Applied to Document Recognitionโ, IEEE 1998
Layer              Activation shape   Activation size   # parameters
Input              (32, 32, 1)        1,024             0
CONV1 (f=5, s=1)   (28, 28, 6)        4,704             156
POOL1              (14, 14, 6)        1,176             0
CONV2 (f=5, s=1)   (10, 10, 16)       1,600             416
POOL2              (5, 5, 16)         400               0
FC3                (120, 1)           120               48,001
FC4                (84, 1)            84                10,081
Softmax            (10, 1)            10                841
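A quick sanity check of the table's activation sizes (each is the product of the activation shape) and the CONV1 parameter count; the remaining parameter counts are reproduced from the slide and not recomputed here:

```python
import numpy as np

# Activation size = product of the activation shape
shapes = {"Input": (32, 32, 1), "CONV1": (28, 28, 6), "POOL1": (14, 14, 6),
          "CONV2": (10, 10, 16), "POOL2": (5, 5, 16)}
sizes = {name: int(np.prod(shp)) for name, shp in shapes.items()}
print(sizes)  # 1024, 4704, 1176, 1600, 400

# CONV1 parameters: (5*5*1 weights + 1 bias) per filter, 6 filters
conv1_params = (5 * 5 * 1 + 1) * 6
print(conv1_params)  # 156
```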
ILSVRC top-5 classification error by year (chart):
2010 Lin et al: 28.2%
2011 Sanchez & Perronnin: 25.8%
2012 Krizhevsky et al. (AlexNet, 8 layers): 16.4%
2013 Zeiler & Fergus (8 layers): 11.7%
2014 Simonyan (VGG, 19 layers): 7.3%
2014 Szegedy et al. (GoogLeNet, 22 layers): 6.7%
2015 He et al. (ResNet, 152 layers): 3.6%
2016 Shao et al. (152 layers): 3.0%
2017 Hu et al. (SENet, 152 layers): 2.3%
Human error rate: 5.1%
classification and detection
algorithm produce a list of object categories present in the image. The quality of a labeling is evaluated based on the label that best matches the ground truth label for the image.
~1.2 million training images, 50,000 validation images and 150,000 testing images
1st CNN-based winner
– [227×227×3] INPUT
– [55×55×96] CONV1: 96 11×11 filters at stride 4, pad 0
– [27×27×96] MAX POOL1: 3×3 filters at stride 2
– [27×27×96] NORM1: Normalization layer
– [27×27×256] CONV2: 256 5×5 filters at stride 1, pad 2
– [13×13×256] MAX POOL2: 3×3 filters at stride 2
– [13×13×256] NORM2: Normalization layer
– [13×13×384] CONV3: 384 3×3 filters at stride 1, pad 1
– [13×13×384] CONV4: 384 3×3 filters at stride 1, pad 1
– [13×13×256] CONV5: 256 3×3 filters at stride 1, pad 1
– [6×6×256] MAX POOL3: 3×3 filters at stride 2
– [4096] FC6: 4096 neurons
– [4096] FC7: 4096 neurons
– [1000] FC8: 1000 neurons (class scores)
Krizhevsky, Sutskever, Hinton, โImageNet Classification with Deep Convolutional Neural Networksโ, NIPS 2012
Deeper Networks
– 3×3 CONV, stride 1
– 2×2 MAX POOLING, stride 2
receptive field as one 7×7 layer, with fewer parameters: 3×(3×3×C²) = 27C² versus 7×7×C² = 49C², for C channels per layer
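The parameter savings from stacking 3×3 layers can be checked with a line of arithmetic. The channel count C = 64 is an illustrative assumption, and biases are omitted:

```python
# Parameters for conv layers with C input and C output channels (no biases)
C = 64                              # example channel count (an assumption)
stack_3x3 = 3 * (3 * 3 * C * C)     # three stacked 3x3 layers: 27*C^2
single_7x7 = 7 * 7 * C * C          # one 7x7 layer, same receptive field: 49*C^2

print(stack_3x3, single_7x7)        # the 3x3 stack needs ~45% fewer weights
```

The stack also interposes two extra nonlinearities, which is part of VGG's argument for preferring it.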
Simonyan, Zisserman, โVery Deep Convolutional Networks For Large-Scale Image Recognitionโ, ICLR 2015
Revolution of Depth
โ 56-layer model performs worse on both training and test error โ It is not caused by overfitting
He, Zhang, Ren, and Sun, โDeep Residual Learning for Image Recognitionโ, CVPR 2015
Plain block:
a[l] → a[l+1] → a[l+2]
z[l+1] = W[l+1] a[l] + b[l+1],  a[l+1] = g(z[l+1])
z[l+2] = W[l+2] a[l+1] + b[l+2],  a[l+2] = g(z[l+2])

Residual block (shortcut from a[l] into the second activation):
a[l] → a[l+1] → a[l+2]
z[l+1] = W[l+1] a[l] + b[l+1],  a[l+1] = g(z[l+1])
z[l+2] = W[l+2] a[l+1] + b[l+2],  a[l+2] = g(z[l+2] + a[l])
If W[l+2] = 0 and b[l+2] = 0, then a[l+2] = g(a[l]) = a[l] (for ReLU, since a[l] ≥ 0).
If the added layers fail to learn any useful information (i.e. zero weights), the block reduces to the identity, so the extra layers don't hurt the network's overall performance.
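The identity behavior of a residual block can be demonstrated in a few lines of numpy. This is a sketch of the equations above, not the paper's implementation; the layer width is an arbitrary assumption:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block(a_l, W1, b1, W2, b2):
    """Two-layer residual block: the input a_l skips ahead and is added
    to z[l+2] before the final nonlinearity."""
    z1 = W1 @ a_l + b1
    a1 = relu(z1)
    z2 = W2 @ a1 + b2
    return relu(z2 + a_l)              # skip connection

n = 4
a_l = np.abs(np.random.randn(n, 1))    # ReLU outputs are non-negative

# With zero weights and biases the block reduces to the identity
zeros_W = np.zeros((n, n))
zeros_b = np.zeros((n, 1))
out = residual_block(a_l, zeros_W, zeros_b, zeros_W, zeros_b)
print(np.allclose(out, a_l))           # True
```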
He, Zhang, Ren, and Sun, โDeep Residual Learning for Image Recognitionโ, CVPR 2015
https://arxiv.org/pdf/1712.09913.pdf
– Multi-scale ensembling of Inception, Inception-ResNet, ResNet, and Wide ResNet models
– ILSVRC'16 classification winner
– Add a "feature recalibration" module that learns to adaptively reweight feature maps
– Global average pooling layer + 2 FC layers used to determine feature map weights
– ILSVRC'17 classification winner
convolutional network from scratch.
related new task.
– When the new dataset is small and similar to the original dataset
the new dataset.
– When the new dataset is large and similar to the original dataset
some of the earlier layers fixed and only fine-tune some deeper portion of the network
– When the new dataset is large and different from the original dataset
fine-tune through the entire network.
– Horizontal flips
– Random crops/scales
– Translation
– Color jitter
– Rotation
– etc.
Horizontal flips Crops/scales Jitter contrast
looking for?
What's going on inside a CNN?
96 convolutional kernels of size 11 by 11 by 3 learned by the first convolutional layer on the 224 by 224 by 3 input images. Why visualize the weights of the first layer? Template matching: the inner product between a kernel and a piece of the image is maximized exactly when those two vectors match up.
2nd-layer convolutional filters are not very interpretable: they are connected to the nonlinear output of the first layer, so visualizing them shows which pattern of first-layer activations would cause a second-layer filter to be maximally activated. 3rd and deeper convolutional filters become more and more difficult to interpret directly.
Nearest Neighbors in Pixel Space Nearest Neighbors in Feature Space
channel 20 in AlexNet
AlexNet and record values of the chosen channel
with maximal activations
and look at larger objects
Each row is a different channel
Springenberg et al. โStriving for Simplicity: The All Convolutional Netโ, ICLR Workshop 2015
Deeper Shallower
channel 20 in AlexNet
respect to image pixels
Zeiler and Fergus, โVisualizing and Understanding Convolutional Networksโ, ECCV 2014
fixed image and tries to find which part of the image, or which set of pixels, influences the output of the selected neuron
input in general would cause this neuron to activate
– Fix the weights of the trained network
– Synthesize an image by performing gradient ascent
– Maximize the score of a given class or an intermediate neuron
Simonyan, Vedaldi, and Zisserman, โDeep Inside Convolutional Networks: Visualizing Image Classification Models and Saliency Mapsโ, ICLR Workshop 2014
– Trained a deep 34-layer CNN which maps a sequence of ECG samples to a sequence of rhythm classes
– Optimized with residual blocks
– Achieved cardiologist-level accuracy
One dimensional filters looking at local patterns
attendance)
matching)
proteins)