Convolutional Neural Networks Basics
Praveen Krishnan
Overview
- Paradigm Shift
- Simple Network
- Convolutional Network
- Layers
- Case Study 1: AlexNet
- Training
- Generalization
- Visualizations
- Transfer Learning
- Case Study 2: JAZNet
- Practical Aspects: gradient checks, data, GPU, coding/libraries
Paradigm Shift
From hand-crafted feature extraction (SIFT, HoG, …) followed by a classifier, to feature learning (CNN, RBM, …) followed by a classifier.
Layers form a hierarchical decomposition (L1-L4), e.g. coding → pooling → part models → object ("sparrow").
A simple network
x0 → f1(·; w1) → x1 → … → x(n-1) → fn(·; wn) → xn

Here each output xj depends on the previous input x(j-1) through a function fj with parameters wj.
Feed forward neural network

[Diagram: an input vector x0 = (x0,1, …, x0,d) is mapped through weight matrices W1, …, Wn to an output vector xn = (xn,1, …, xn,c); the figure zooms in on one layer.]

The output xn is compared against the one-hot ground truth y = [0, 0, …, 1, …, 0] by a LOSS layer, producing a scalar z.

Weight updates are made using back propagation of gradients through W1, …, Wn.
Convolutional Network
Fully connected layer (200x200x3 input)
- #Hidden units: 120,000
- #Params: 12 billion
- Needs huge training data to prevent overfitting!

Locally connected layer (3x3x3 receptive field)
- #Hidden units: 120,000
- #Params: 1.08 million
- Useful when the image is highly registered
Convolutional Network
Convolutional layer (one shared 3x3x3 filter)
- #Hidden units: 120,000
- #Params: 27
- #Feature maps: 1
- Exploits the stationarity property.
Convolutional Network
- Use of multiple feature maps.
- Sharing parameters.
- Exploits stationarity of statistics.
- Preserves locality of pixel dependencies.

[Diagram: a 200x200x3 image convolved with 3x3x3 receptive fields, one set of shared weights per feature map.]
Convolutional Network
Image size: W1 x H1 x D1 (e.g. 200x200x3)
Receptive field size: F x F, stride S
#Feature maps: K
Q. Find out W2, H2 and D2 of the output volume.
Convolutional Network
Image size: W1 x H1 x D1 (e.g. 200x200x3); receptive field F x F, stride S; K feature maps.

    W2 = (W1 - F)/S + 1
    H2 = (H1 - F)/S + 1
    D2 = K

It is also common to zero-pad the input to preserve its spatial size.
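A quick sanity check of these formulas as a minimal Python sketch; the helper name and the optional zero-padding argument P are illustrative:

    def conv_output_size(W1, H1, D1, F, S, K, P=0):
        """Output volume of a conv layer: W1xH1xD1 input, FxF receptive
        field, stride S, K feature maps, zero padding P on each side."""
        W2 = (W1 - F + 2 * P) // S + 1
        H2 = (H1 - F + 2 * P) // S + 1
        D2 = K
        return W2, H2, D2

    # 200x200x3 image, 3x3 receptive field, stride 1, 5 feature maps:
    print(conv_output_size(200, 200, 3, F=3, S=1, K=5))  # (198, 198, 5)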
Convolutional Layer
For output feature map i of layer n:

    y_i^n = f( sum_{j=1..F} w_ij^n * x_j^(n-1) )

Here "f" is a non-linear activation function, F = no. of input feature maps, n = layer index, and "*" represents convolution/correlation.
- Q. Is there a difference between correlation and convolution in a learned network?
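A naive numpy sketch of this layer's forward pass under the definition above (single output map, valid cross-correlation; the helper name is illustrative):

    import numpy as np

    def conv_forward(x, w, f=np.tanh):
        """x: input maps (F, H, W); w: filters (F, k, k) for one output map.
        Computes y = f(sum_j w_j * x_j) with valid cross-correlation."""
        F, H, W = x.shape
        _, k, _ = w.shape
        y = np.zeros((H - k + 1, W - k + 1))
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                y[i, j] = np.sum(w * x[:, i:i + k, j:j + k])
        return f(y)

As to the question: since the filters are learned, a network trained with correlation simply learns flipped filters relative to true convolution, so the distinction makes no practical difference.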
Activation Functions
Sigmoid, tanh, ReLU, Leaky ReLU, maxout
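Minimal numpy sketches of these activations; the maxout sketch assumes its k linear pieces are already computed (Goodfellow et al.'s formulation):

    import numpy as np

    def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
    def tanh(x): return np.tanh(x)
    def relu(x): return np.maximum(0.0, x)
    def leaky_relu(x, a=0.01): return np.where(x > 0, x, a * x)

    def maxout(z):
        # z: (k, n) pre-activations from k linear pieces; max across pieces
        return z.max(axis=0)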
A Typical Supervised CNN Architecture
A typical deep convolutional network stacks other layers besides convolution: pooling, normalization, fully connected, etc.

CONV → POOL → NORM → CONV → POOL → NORM → FC → SOFTMAX
Pooling Layer
Aggregation over space or feature type. Provides invariance to image transformations and makes the representation more compact.
Pooling types: max, average, L2, etc.

Max pooling example (pool size 2x2, stride 2):

    Input:       Output:
    2 8 9 4      8 9
    3 6 5 7      5 7
    3 1 6 4
    2 5 7 3
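A minimal numpy sketch of max pooling matching the example above:

    import numpy as np

    def max_pool(x, size=2, stride=2):
        """Max pooling over a 2-D feature map (no padding)."""
        H, W = x.shape
        out = np.zeros((1 + (H - size) // stride, 1 + (W - size) // stride))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                r, c = i * stride, j * stride
                out[i, j] = x[r:r + size, c:c + size].max()
        return out

    x = np.array([[2, 8, 9, 4], [3, 6, 5, 7], [3, 1, 6, 4], [2, 5, 7, 3]])
    print(max_pool(x))  # [[8. 9.] [5. 7.]]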
Normalization
Local contrast normalization (Jarrett et al., ICCV'09)
- Reduces illumination artifacts.
- Performs local subtractive and divisive normalization.

Local response normalization (Krizhevsky et al., NIPS'12)
- A form of lateral inhibition across channels.
Batch normalization (More later)
Fully connected
- Multi-layer perceptron; plays the role of a classifier.
- Generally used in the final layers to classify the object, represented in terms of discriminative parts and higher semantic entities.
Case Study: AlexNet
- Winner of ImageNet LSVRC-2012.
- Trained over 1.2M images using SGD with regularization.
- Deep architecture (60M parameters).
- Optimized GPU implementation (cuda-convnet).
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. NIPS 2012.
Case Study: AlexNet
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. NIPS 2012.

Architecture:
CONV 11x11, 96 maps → LRN → MAX POOL 2x2 →
CONV 5x5, 256 maps → LRN → MAX POOL 2x2 →
CONV 3x3, 384 maps → CONV 3x3, 384 maps → CONV 3x3, 256 maps → MAX POOL 2x2 →
FC 4096 → FC 4096 → SOFTMAX 1000
Training
Learning: minimizing the loss function (incl. regularization) w.r.t. the parameters of the network.

Mini-batch stochastic gradient descent:
- Sample a batch of data.
- Forward propagation.
- Backward propagation.
- Parameter update.
Training
Back propagation
Consider a layer f with parameters w, which maps its input x^(n-1) to its output x^n = f(x^(n-1), w).

Here z is the scalar loss computed by the loss function h. By the chain rule, the derivative of the loss w.r.t. the parameters is

    ∂z/∂w = (∂z/∂x^n) (∂f/∂w)

and the gradient passed down to the layer's input is

    ∂z/∂x^(n-1) = (∂z/∂x^n) (∂f/∂x^(n-1))

a recursive equation which is applicable to each layer.
Training
Parameter update
Stochastic gradient descent:

    θ = θ - η ∇θ z

Here η is the learning rate and θ is the set of all parameters.

Stochastic gradient descent with momentum:

    v = μ v - η ∇θ z
    θ = θ + v

More in the coming slides…
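A minimal numpy sketch of both update rules (the learning-rate and momentum values are illustrative):

    import numpy as np

    def sgd_step(theta, grad, lr=0.01):
        return theta - lr * grad

    def momentum_step(theta, v, grad, lr=0.01, mu=0.9):
        v = mu * v - lr * grad      # velocity accumulates gradients
        return theta + v, v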
Training
Loss functions.
Measures the compatibility between prediction and ground truth.
one vs. rest classification
Soft-max classifier (cross-entropy loss):

    L = -log( e^{x_y} / Σ_j e^{x_j} )

Derivative w.r.t. x_i:

    ∂L/∂x_i = p_i - 1(i = y),  where p_i = e^{x_i} / Σ_j e^{x_j}

Proof?
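A minimal numpy sketch of the soft-max loss and its derivative, with the usual max-subtraction trick for numerical stability:

    import numpy as np

    def softmax_loss(x, y):
        """x: class scores (C,); y: ground-truth class index."""
        x = x - x.max()                   # stability shift
        p = np.exp(x) / np.exp(x).sum()   # class probabilities
        loss = -np.log(p[y])
        grad = p.copy()
        grad[y] -= 1.0                    # dL/dx_i = p_i - 1(i == y)
        return loss, grad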
Training
Loss functions.
one vs. rest classification
Hinge loss:

    L = Σ_{j≠y} max(0, x_j - x_y + 1)

Hinge loss is a convex function but not differentiable; a sub-gradient exists. Sub-gradient w.r.t. x_i:

    ∂L/∂x_j = 1(x_j - x_y + 1 > 0)  for j ≠ y
    ∂L/∂x_y = -Σ_{j≠y} 1(x_j - x_y + 1 > 0)
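A matching numpy sketch of the hinge loss and its sub-gradient (margin fixed at 1, as above):

    import numpy as np

    def hinge_loss(x, y, delta=1.0):
        """Multi-class hinge loss; x: scores (C,), y: true class index."""
        margins = np.maximum(0.0, x - x[y] + delta)
        margins[y] = 0.0
        loss = margins.sum()
        grad = (margins > 0).astype(float)   # sub-gradient for j != y
        grad[y] = -grad.sum()                # true class collects -count
        return loss, grad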
Training
Loss functions.
Regression
Euclidean loss / squared loss:

    L = (1/2) Σ_i (x_i - y_i)^2

Derivative w.r.t. x_i:

    ∂L/∂x_i = x_i - y_i
Training
Visualization of the loss function

[Figure: loss surface over θ, annotated with initialization, step size/learning rate, step direction, and momentum.]

Typically viewed as a highly non-convex function, but more recently it's believed to have smoother surfaces, though with many saddle regions!
Training
Momentum
- Better convergence rates.
- Physical perspective: the gradient affects the velocity of the update; velocity builds up along directions of consistent gradient.
- Momentum update:

    v = μ v - η ∇θ z    (velocity)
    θ = θ + v           (position)
Training
Learning Rates (η)
- Controls the kinetic energy of the updates.
- It is important to know when to decay η.
- Common methods (annealing, sketched below):
  - Step decay
  - Exponential/log-space decay
  - Manual
- Adaptive learning-rate methods: Adagrad, RMSprop
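Minimal sketches of two annealing schedules (the decay constants are illustrative):

    import math

    def step_decay(lr0, epoch, drop=0.5, every=10):
        """Multiply the learning rate by `drop` every `every` epochs."""
        return lr0 * (drop ** (epoch // every))

    def exp_decay(lr0, epoch, k=0.05):
        """Exponential decay: lr = lr0 * exp(-k * epoch)."""
        return lr0 * math.exp(-k * epoch)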
Figure courtesy: Fei-Fei et al., cs231n
Training
Initialization
- Never initialize weights to all zeros or the same value. (Why? All units then compute the same output and receive the same gradient, so they never differentiate.)
- Popular techniques:
  - Random values sampled from N(0, 1).
  - Xavier initialization (Glorot et al., JMLR'10): the scale depends on the number of input and output neurons (fan-in and fan-out); initial weights are sampled from N(0, Var(w)) with

        Var(w) = 2 / (fan_in + fan_out)

  - Pre-training, e.g. using RBMs (Hinton et al., Science 2006).
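A minimal numpy sketch of Xavier initialization under this variance rule:

    import numpy as np

    def xavier_init(fan_in, fan_out):
        """Weights ~ N(0, 2/(fan_in + fan_out)), per Glorot et al. '10."""
        std = np.sqrt(2.0 / (fan_in + fan_out))
        return np.random.randn(fan_out, fan_in) * std

    W1 = xavier_init(784, 256)   # e.g. first layer of an MNIST net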
Training
Generalization
How to prevent?
- Underfitting: use deeper networks.
- Overfitting:
  - Stopping at the right time.
  - Weight penalties: L1, L2, max norm.
  - Dropout.
  - Model ensembles (e.g. the same model with different initializations).
[Plot: top-5 error and training/validation (val-1, val-2) accuracy vs. epoch, illustrating overfitting.]
Generalization
Dropout
- Stochastic regularization; the idea is applicable to many other networks.
- Hidden units are dropped out randomly (each retained with fixed probability p, say 0.5), temporarily, while training.
- At test time all units are preserved, but their outputs are scaled by p.
- Dropout together with a max-norm constraint is found to be useful.
Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR 2014. [Figure: network before and after dropout.]
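A minimal numpy sketch of dropout for a single layer, with the train/test behavior described above (here p is the retention probability):

    import numpy as np

    def dropout(h, p=0.5, train=True):
        """h: layer activations; keep each unit with probability p."""
        if train:
            mask = (np.random.rand(*h.shape) < p)
            return h * mask      # drop units temporarily during training
        return h * p             # test time: keep all units, scale by p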
Generalization
[Figure: features learned by an autoencoder with one hidden layer on the MNIST dataset, without and with dropout; dropout yields noticeably sparser features.]
Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR 2014.
Generalization
Batch Normalization
Covariate shift
- Defined as a change in the distribution of a function's domain.
- Randomized mini-batches reduce the effect of covariate shift.

Internal covariate shift
- Updates to the current layer's parameters change the distribution of the inputs to successive layers.
- This slows down training and requires careful initialization.

Image credit: https://gab41.lab41.org/batch-normalization-what-the-hey-d480039a9e3b
Generalization
Batch Normalization
- Fixes the distribution of each layer's inputs as training progresses.
- Faster convergence.
Ioffe and Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv 2015.
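A minimal numpy sketch of the batch-norm transform for a mini-batch (γ and β are the learned scale and shift; the running statistics used at test time are omitted):

    import numpy as np

    def batch_norm(x, gamma, beta, eps=1e-5):
        """x: (N, D) mini-batch; normalize per feature, then scale/shift."""
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance
        return gamma * x_hat + beta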
Some results on ImageNet
Top-5 classification accuracy. [Chart: GoogLeNet vs. Clarifai vs. AlexNet. Source: Krizhevsky et al., NIPS'12]
Visualizing CNNs
CNNs are cool, but some of the questions below need answers before we move forward:
- How do I interpret the learned filters?
- What is it that stimulates/excites a neuron?
- How do I decide the architecture, or improve existing ones?

To answer these we need to probe the learned models:
- Deconvolutional networks [Zeiler et al., ICCV'11, ECCV'14]
- Synthesizing images [Simonyan et al., ICLR'14; Mahendran et al., CVPR'15]

Visualizing the first conv. layer is possible, but how about the later layers? (Source: Krizhevsky et al., NIPS'12)
Zeiler and Fergus. Visualizing and Understanding Convolutional Networks. ECCV 2014.
Visualizing CNNs
Deconvnets
- Non-parametric approach: projects feature activations back to the input space.
- Analyses a trained model, using validation data to interpret the feature activations.
- Visualizes a single activation, not the joint activity.
- Helps in understanding the generalizing ability of CNNs.

Zeiler and Fergus. Visualizing and Understanding Convolutional Networks. ECCV 2014. (Source: Zeiler et al., ECCV'14)
Visualizing CNNs
A. How do I interpret the learned filters? [Figure: deconvnet reconstructions; one filter responds to grass. Source: Zeiler and Fergus, ECCV 2014]
Visualizing CNNs
A. What is it that stimulates/excites a neuron? [Figure: top activations per neuron. Source: Zeiler and Fergus, ECCV 2014]
Visualizing CNNs
Two techniques: class model visualization and image-specific class saliency visualization. [Example figure: "washing machine".]
Karen Simonyan, Andrea Vedaldi, Andrew Zisserman. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. CoRR 2014.
Visualizing CNNs
Class Model Visualization
- Find an L2-regularized image I which maximizes the class score:

    argmax_I  S_c(I) - λ ||I||^2

  Here S_c(I) is the score of class c before the soft-max.
- Initialize with the mean image.
- Back-propagate to update the input pixels, keeping the weights of the intermediate layers fixed.

Karen Simonyan, Andrea Vedaldi, Andrew Zisserman. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. CoRR 2014.
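A schematic sketch of this gradient-ascent procedure; class_score_grad stands in for a framework-specific backward pass through the fixed, trained network and is hypothetical:

    import numpy as np

    def class_model_visualization(class_score_grad, shape, steps=200,
                                  lr=1.0, lam=1e-4, mean_image=None):
        """Gradient ascent on the input to maximize S_c(I) - lam*||I||^2.
        class_score_grad(I) must return dS_c/dI for the fixed net."""
        I = np.zeros(shape) if mean_image is None else mean_image.copy()
        for _ in range(steps):
            g = class_score_grad(I) - 2.0 * lam * I   # regularized gradient
            I += lr * g                               # ascend the class score
        return I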
Visualizing CNNs
Image-Specific Class Saliency Visualization
- Understanding the spatial support of a class in a specific image.
- The class score is a non-linear mapping of the image, but around a given image I0 it can be approximated with a first-order Taylor expansion:

    S_c(I) ≈ w^T I + b,  where  w = ∂S_c/∂I evaluated at I0

Karen Simonyan, Andrea Vedaldi, Andrew Zisserman. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. CoRR 2014.
[Figure: original image, its spatial support (saliency map), the derived object localization mask, and the GrabCut segmentation.]
Transfer Learning
A key observation from the visualizations: early layers learn general features (Gabor filters, color blobs), while later layers learn task-specific ones (e.g. dog faces).
Yosinski J, Clune J, Bengio Y, and Lipson H. How transferable are features in deep neural networks? NIPS 2014.
Transfer Learning
A key observation from the visualizations raises further questions:
- Can we quantify layer generality/specificity?
- Where does the transition occur?
- Is the transition sudden or spread over layers?
Yosinski J, Clune J, Bengio Y, and Lipson H. How transferable are features in deep neural networks? NIPS 2014.
Transfer Learning
Transfer performance experiment
- Two tasks, A and B.
- Types of networks: selffer (BnB / BnB+) and transfer (AnB / AnB+), where "+" denotes fine-tuning of the transferred layers.
- Datasets: random split and dissimilar split.

Observations:
- Higher-level neurons are more specialized.
- Co-adapted neurons exist between layers, which makes optimization difficult.

Yosinski J, Clune J, Bengio Y, and Lipson H. How transferable are features in deep neural networks? NIPS 2014. [Plot: accuracy of baseB, BnB, BnB+, AnB, and AnB+.]
Transfer Learning
Take away message
Initializing a network with transferred features almost always gives better generalization.

Notes:
- If the target dataset is small, retrain only the softmax layer.
- If the target dataset is reasonably large, retrain a larger portion of the network, fine-tuning the initial layers.
Transfer Learning
Benchmarks
Razavian et al., CVPRW 2014; Chatfield et al., BMVC 2014
Case Study: JAZNet
Scene text recognition
- Recognition of word images holistically.
- Synthetic generation of training data.
- State-of-the-art results on most document datasets.

Model 1: encoding words model, with a synthetic data engine.
- Challenge: huge number of target classes, ~90K (the size of the English vocabulary).
- Incremental training; uses 9M training samples.

Jaderberg, Max, et al. Synthetic data and artificial neural networks for natural scene text recognition. arXiv 2014.
Case Study: JAZNet
Model 2: sequence of characters.
Model 3: bag of N-grams.
Jaderberg, Max, et al. Synthetic data and artificial neural networks for natural scene text recognition. arXiv 2014.
Practical Aspects
Data preprocessing
- (0,1) normalization
- Whitening

Data augmentation
- Perturb the image to make the model more resilient to variations:
  - Cropping
  - Flipping
  - Jittering
  - Degradation models specific to the modality (e.g. text)
- Applicable at both train and test time.
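A minimal numpy sketch of two common augmentations, random cropping and horizontal flipping (the crop size is illustrative):

    import numpy as np

    def random_augment(img, crop=24):
        """img: (H, W, C). Random crop followed by a random horizontal flip."""
        H, W, _ = img.shape
        r = np.random.randint(0, H - crop + 1)
        c = np.random.randint(0, W - crop + 1)
        out = img[r:r + crop, c:c + crop]
        if np.random.rand() < 0.5:
            out = out[:, ::-1]   # flip left-right
        return out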
Practical Aspects
Modern CNN libraries
- Theano
- Torch
- Caffe
- MatConvNet
- TensorFlow
- and many more…