

slide-1
SLIDE 1

Convolutional Neural Networks Basics

Praveen Krishnan

slide-2
SLIDE 2

Overview

• Paradigm Shift
• Simple Network
• Convolutional Network
• Layers
• Case Study 1: AlexNet
• Training
• Generalization
• Visualizations
• Transfer Learning
• Case Study 2: JAZNet
• Practical Aspects
  • Gradient checks
  • Data
  • GPU coding / libraries

slide-3
SLIDE 3

Paradigm Shift

[Figure] Feature extraction (SIFT, HoG, …) + classifier vs. feature learning (CNN, RBM, …) + classifier. Layers L1–L4 form a hierarchical decomposition (coding, pooling, part models, object: "sparrow").

slide-4
SLIDE 4

A simple network

[Figure] A chain x0 → f1 → x1 → … → fn-1 → xn-1 → fn → xn with parameters w1, w2, …, wn-1, wn. Each output xj depends on the previous input xj-1 through a function fj with parameters wj.
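A minimal NumPy sketch (my own illustration, not from the slides) of such a chain: each layer is a function fj with its own parameters wj, and the forward pass simply composes them.

```python
import numpy as np

def layer(x, w):
    """One generic layer f_j: an affine map followed by a non-linearity."""
    return np.tanh(w @ x)

def forward(x0, weights):
    """Compose the chain x_j = f_j(x_{j-1}; w_j) for j = 1..n."""
    x = x0
    for w in weights:
        x = layer(x, w)
    return x

weights = [np.random.randn(8, 4), np.random.randn(3, 8)]   # w_1, w_2
xn = forward(np.random.randn(4), weights)                  # final output x_n
```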

slide-5
SLIDE 5

Feed forward neural network

[Figure] A feed-forward neural network: inputs x0^1, …, x0^d are mapped through weight matrices W1, …, Wn to outputs xn^1, …, xn^c (zooming in on one layer).

slide-6
SLIDE 6

Feed forward neural network

[Figure] The network outputs xn^1, …, xn^c are compared with the one-hot target y = [0, 0, …, 1, …, 0] by a LOSS layer, producing a scalar z.

slide-7
SLIDE 7

Feed forward neural network

[Figure] Weight updates for W1, …, Wn are computed using back-propagation of the gradients of the scalar loss z.

slide-8
SLIDE 8

Convolutional Network

Fully connected layer (200x200x3 input):
  • #Hidden units: 120,000
  • #Params: 12 billion
  • Needs huge training data to prevent overfitting!

Locally connected layer (3x3x3 receptive fields):
  • #Hidden units: 120,000
  • #Params: 1.08 million
  • Useful when the image is highly registered.

slide-9
SLIDE 9

Convolutional Network

Convolutional layer (a shared 3x3x3 filter):

  • #Hidden units: 120,000
  • #Params: 27
  • #Feature maps: 1
  • Exploits the stationarity property of image statistics.

slide-10
SLIDE 10

Convolutional Network

  • Use of multiple feature maps.
  • Sharing parameters
  • Exploits stationarity of statistics.
  • Preserves locality of pixel dependencies.

[Figure] Convolutional layer: a 200x200x3 input convolved with 3x3x3 receptive fields, producing multiple feature maps.

slide-11
SLIDE 11

Convolutional Network

Image size: W1 x H1 x D1 (e.g. 200x200x3)
Receptive field size: F x F
#Feature maps: K

  • Q. Find W2, H2 and D2.
slide-12
SLIDE 12

Convolutional Network

Image size: W1 x H1 x D1 (e.g. 200x200x3); receptive field size: F x F; #feature maps: K; stride: S.

W2 = (W1 − F)/S + 1
H2 = (H1 − F)/S + 1
D2 = K

It is also better to use zero padding to preserve the input size spatially.
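The output-size arithmetic above, as a small sketch (function and parameter names are my own):

```python
def conv_output_size(W1, H1, D1, F, K, S=1, P=0):
    """W2 = (W1 - F + 2P)/S + 1, H2 likewise, D2 = K (the number of feature maps)."""
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    return W2, H2, K

print(conv_output_size(200, 200, 3, F=3, K=64))         # (198, 198, 64)
print(conv_output_size(200, 200, 3, F=3, K=64, P=1))    # zero padding: (200, 200, 64)
```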

slide-13
SLIDE 13

Convolutional Layer

[Figure] Conv. layer: input feature maps x1^(n-1), x2^(n-1), x3^(n-1) are mapped to output feature maps y1^(n), y2^(n):

    yi^(n) = f( Σ_{j=1..F} wij^(n) * xj^(n-1) )

Here "f" is a non-linear activation function, F is the number of input feature maps, n is the layer index, and "*" represents convolution/correlation.

  • Q. Is there a difference between correlation and convolution in a learned network?
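To make the correlation-vs-convolution question concrete, here is a naive 2-D sketch (my own illustration): convolution is correlation with a flipped kernel, so a network that learns its filters can absorb the flip into the learned weights.

```python
import numpy as np

def correlate2d(x, k):
    """Valid-mode 2-D cross-correlation (what most CNN libraries actually compute)."""
    H, W = x.shape
    f = k.shape[0]
    out = np.zeros((H - f + 1, W - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + f, j:j + f] * k)
    return out

def convolve2d(x, k):
    """True convolution = correlation with the kernel flipped along both axes."""
    return correlate2d(x, k[::-1, ::-1])
```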

slide-14
SLIDE 14

Activation Functions

Sigmoid, tanh, ReLU, Leaky ReLU, maxout
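Minimal NumPy definitions of these activations, as a sketch (maxout is omitted, since it needs several linear pieces per unit):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

# tanh is available directly as np.tanh(x).
```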

slide-15
SLIDE 15

A Typical Supervised CNN Architecture

• A typical deep convolutional network: CONV → POOL → NORM → CONV → POOL → NORM → FC → SOFTMAX
• Other layers:
  • Pooling
  • Normalization
  • Fully connected
  • etc.

slide-16
SLIDE 16

Pooling Layer

• Aggregation over space or feature type.
• Gives invariance to image transformations and makes the representation more compact.
• Pooling types: max, average, L2, etc.

Max pooling example (pool size 2x2, stride 2):

    Input:        Output:
    2 8 9 4       8 9
    3 6 5 7       5 7
    3 1 6 4
    2 5 7 3
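A small sketch that reproduces the example above (non-overlapping max pooling over a single 2-D feature map; names are my own):

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Non-overlapping max pooling over a single 2-D feature map."""
    out = np.zeros((x.shape[0] // stride, x.shape[1] // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i * stride:i * stride + size, j * stride:j * stride + size].max()
    return out

x = np.array([[2, 8, 9, 4],
              [3, 6, 5, 7],
              [3, 1, 6, 4],
              [2, 5, 7, 3]])
print(max_pool(x))   # [[8. 9.]
                     #  [5. 7.]]
```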


slide-17
SLIDE 17

Normalization

• Local contrast normalization (Jarrett et al., ICCV'09)
  • Reduces illumination artifacts.
  • Performs local subtractive and divisive normalization.
• Local response normalization (Krizhevsky et al., NIPS'12)
  • A form of lateral inhibition across channels.
• Batch normalization (more later)


slide-18
SLIDE 18

Fully connected

• Multi-layer perceptron.
• Plays the role of a classifier.
• Generally used in the final layers to classify the object, represented in terms of discriminative parts and higher-level semantic entities.


slide-19
SLIDE 19

Case Study: AlexNet

• Winner of ImageNet ILSVRC-2012.
• Trained on 1.2M images using SGD with regularization.
• Deep architecture (60M parameters).
• Optimized GPU implementation (cuda-convnet).

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." NIPS 2012. - Cited by 11915

slide-20
SLIDE 20

Case Study: AlexNet

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." NIPS 2012.

CONV 11x11, 96 maps → LRN → MAX POOL 2x2 → CONV 5x5, 256 maps → LRN → MAX POOL 2x2 → CONV 3x3, 384 maps → CONV 3x3, 384 maps → CONV 3x3, 256 maps → MAX POOL 2x2 → FC 4096 → FC 4096 → SOFTMAX 1000

slide-21
SLIDE 21

Training

• Learning: minimizing the loss function (incl. regularization) w.r.t. the parameters of the network.
• Mini-batch stochastic gradient descent:
  • Sample a batch of data.
  • Forward propagation.
  • Backward propagation.
  • Parameter update (of the filter weights).


slide-22
SLIDE 22

Training

• Back-propagation
  • Consider a layer y = f(x; w) with parameters w. Here z is the scalar loss computed from the loss function h. The derivative of the loss w.r.t. the parameters is given by the chain rule:

    dz/dw = (dz/dy) · (dy/dw)        dz/dx = (dz/dy) · (dy/dx)

  • A recursive equation which is applicable to each layer.
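A toy NumPy sketch (my own illustration) of this recursion for a chain of linear layers: each layer's backward() turns dz/dy into dz/dx, which is passed on to the previous layer, and dz/dw, which is used for the parameter update.

```python
import numpy as np

class Linear:
    """A toy layer y = W x whose backward() implements the chain rule."""
    def __init__(self, w):
        self.w = w
    def forward(self, x):
        self.x = x                        # cache the input for the backward pass
        return self.w @ x
    def backward(self, dz_dy):
        dz_dw = np.outer(dz_dy, self.x)   # dz/dw = dz/dy * dy/dw
        dz_dx = self.w.T @ dz_dy          # dz/dx = dz/dy * dy/dx
        return dz_dx, dz_dw

def backprop(layers, dz_dxn):
    """Apply the recursive equation layer by layer, starting from the loss."""
    grad, grads_w = dz_dxn, []
    for layer in reversed(layers):
        grad, dz_dw = layer.backward(grad)
        grads_w.append(dz_dw)
    return list(reversed(grads_w))        # [dz/dw_1, ..., dz/dw_n]

layers = [Linear(np.random.randn(8, 4)), Linear(np.random.randn(3, 8))]
x = np.random.randn(4)
for layer in layers:                      # forward pass caches the inputs
    x = layer.forward(x)
weight_grads = backprop(layers, dz_dxn=2 * x)   # e.g. for the loss z = ||x_n||^2
```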

slide-23
SLIDE 23

Training

• Parameter update
  • Stochastic gradient descent:

    θ ← θ − η · ∇θ L

  Here η is the learning rate and θ is the set of all parameters.
  • Stochastic gradient descent with momentum (more in coming slides…)
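As a sketch in code (assuming a hypothetical data.sample() that draws a mini-batch and a grads() function that runs the forward and backward passes):

```python
def sgd_train(theta, data, grads, lr=0.01, steps=1000, batch_size=128):
    """Mini-batch SGD: theta <- theta - eta * gradient of the mini-batch loss."""
    for _ in range(steps):
        batch = data.sample(batch_size)   # sample a batch of data (hypothetical API)
        grad = grads(theta, batch)        # forward + backward propagation
        for name in theta:                # parameter update
            theta[name] -= lr * grad[name]
```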

slide-24
SLIDE 24

Training

• Loss functions: measure the compatibility between the prediction and the ground truth.
  • One vs. rest classification
  • Softmax classifier (cross-entropy loss):

    L = −log( e^(x_y) / Σ_j e^(x_j) )

  • Derivative w.r.t. x_i (Proof?):

    ∂L/∂x_i = p_i − 1[i = y],   where p_i = e^(x_i) / Σ_j e^(x_j)
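A NumPy sketch of the softmax cross-entropy loss and its gradient; the derivative w.r.t. the scores comes out as p − y (names are my own):

```python
import numpy as np

def softmax_cross_entropy(x, y):
    """x: class scores, y: one-hot ground truth. Returns (loss, dL/dx)."""
    e = np.exp(x - x.max())          # shift for numerical stability
    p = e / e.sum()                  # softmax probabilities
    loss = -np.sum(y * np.log(p))    # cross-entropy
    grad = p - y                     # dL/dx_i = p_i - y_i
    return loss, grad
```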

slide-25
SLIDE 25

Training

• Loss functions
  • One vs. rest classification
  • Hinge loss (labels y_i ∈ {−1, +1}):

    L = Σ_i max(0, 1 − y_i · x_i)

  Hinge loss is a convex function but is not differentiable; a sub-gradient, however, exists.
  • Sub-gradient w.r.t. x_i:

    ∂L/∂x_i = −y_i  if  y_i · x_i < 1,   0 otherwise
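A sketch of the one-vs-rest hinge loss and its sub-gradient, assuming labels y_i in {−1, +1} (names are my own):

```python
import numpy as np

def hinge_loss(x, y):
    """x: per-class scores, y: labels in {-1, +1}. Returns (loss, sub-gradient)."""
    margins = 1.0 - y * x
    loss = np.sum(np.maximum(0.0, margins))
    grad = np.where(margins > 0, -y, 0.0)    # sub-gradient w.r.t. x_i
    return loss, grad
```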

slide-26
SLIDE 26

Training

• Loss functions
  • Regression
  • Euclidean loss / squared loss:

    L = ½ · Σ_i (x_i − y_i)²

  • Derivative w.r.t. x_i:

    ∂L/∂x_i = x_i − y_i
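And a small sketch of the squared/Euclidean loss with its gradient:

```python
import numpy as np

def euclidean_loss(x, y):
    """x: predictions, y: regression targets. Returns (loss, dL/dx)."""
    diff = x - y
    return 0.5 * np.sum(diff ** 2), diff     # dL/dx_i = x_i - y_i
```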

slide-27
SLIDE 27

Training

• Visualization of the loss function
• [Figure] Loss surface over θ, annotated with the initialization, step direction, step size/learning rate, and momentum.
• The loss is typically viewed as a highly non-convex function, but more recently it is believed to have smoother surfaces, although with many saddle regions!

slide-28
SLIDE 28

Training

• Momentum
  • Better convergence rates.
  • Physical perspective: affects the velocity of the update.
  • Higher velocity in the consistent direction of the gradient.
  • Momentum update (velocity, then position):

    v ← μ · v − η · ∇θ L
    θ ← θ + v
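The velocity/position form of the update as a sketch in code (the values of μ and η here are illustrative):

```python
def momentum_step(theta, velocity, grad, lr=0.01, mu=0.9):
    """Momentum update: the velocity accumulates gradients, the position follows it."""
    for name in theta:
        velocity[name] = mu * velocity[name] - lr * grad[name]   # velocity
        theta[name] += velocity[name]                            # position
```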

slide-29
SLIDE 29

Training

• Learning rate (η)
  • Controls the kinetic energy of the updates.
  • Important to know when to decay η.
  • Common methods (annealing):
    • Step decay
    • Exponential / log-space decay
    • Manual
  • Adaptive learning methods (sketched in code below):
    • Adagrad
    • RMSprop

Figure courtesy: Fei Fei et al. , cs231n
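Sketches of step decay and of an RMSprop-style adaptive update (the decay constants are illustrative, not from the slides):

```python
import numpy as np

def step_decay(lr0, epoch, drop=0.5, every=10):
    """Step decay: multiply the learning rate by `drop` every `every` epochs."""
    return lr0 * (drop ** (epoch // every))

def rmsprop_step(theta, cache, grad, lr=1e-3, decay=0.99, eps=1e-8):
    """RMSprop: scale each step by a running average of squared gradients."""
    for name in theta:
        cache[name] = decay * cache[name] + (1 - decay) * grad[name] ** 2
        theta[name] -= lr * grad[name] / (np.sqrt(cache[name]) + eps)
```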


slide-30
SLIDE 30

Training


• Initialization
  • Never initialize the weights to all zeros or to the same value. (Why?)
  • Popular techniques:
    • Random values sampled from N(0,1).
    • Xavier (Glorot et al., JMLR'10) — see the sketch below:
      • The scale of the initialization depends on the number of input (fan-in) and output (fan-out) neurons.
      • Initial weights are sampled from N(0, var(w)), where var(w) depends on the fan-in and fan-out.
    • Pre-training, e.g. using RBMs (Hinton et al., Science 2006).
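A sketch of Xavier/Glorot initialization for a fully connected layer, using the var(w) = 2 / (fan_in + fan_out) form from Glorot et al. (function names are my own):

```python
import numpy as np

def xavier_init(fan_in, fan_out):
    """Sample W from N(0, var(w)) with var(w) = 2 / (fan_in + fan_out)."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return np.random.randn(fan_out, fan_in) * std

W = xavier_init(fan_in=4096, fan_out=1000)   # e.g. a final FC layer
```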

slide-31
SLIDE 31

Training

• Generalization: how to prevent
  • Underfitting – deeper networks.
  • Overfitting:
    • Stopping at the right time.
    • Weight penalties: L1, L2, max norm.
    • Dropout.
    • Model ensembles, e.g. the same model with different initializations.

[Figure] Top-5 error / accuracy vs. epoch: training accuracy, val-1 accuracy (*), val-2 accuracy (overfitting).

slide-32
SLIDE 32

Generalization

• Dropout
  • Stochastic regularization.
  • The idea is applicable to many other networks.
  • Hidden units are dropped randomly with a fixed probability 'p' (say 0.5), temporarily, while training.
  • While testing, all the units are preserved but scaled with 'p'.
  • Dropout along with a max-norm constraint is found to be useful.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting." JMLR 2014. [Figure] The network before and after dropout.
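A sketch of the train/test behaviour described above; here p is the retention probability, as in Srivastava et al.:

```python
import numpy as np

def dropout_train(x, p=0.5):
    """Training: keep each unit with probability p (drop it with probability 1-p)."""
    mask = np.random.rand(*x.shape) < p
    return x * mask

def dropout_test(x, p=0.5):
    """Testing: keep all units, but scale the activations by p."""
    return x * p
```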

slide-33
SLIDE 33

Generalization

• [Figure] Features learned with a one-hidden-layer autoencoder on the MNIST dataset, without dropout vs. with dropout (note the sparsity).

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting. JMLR 2014

slide-34
SLIDE 34

Generalization

• Batch normalization
  • Covariate shift:
    • Defined as a change in the distribution of a function's domain.
    • Mini-batches (randomized) reduce the effect of covariate shift.
  • Internal covariate shift:
    • The current layer's parameters change the distribution of the input to the successive layers.
    • Slows down training and requires careful initialization.

Image credit: https://gab41.lab41.org/batch-normalization-what-the-hey-d480039a9e3b

slide-35
SLIDE 35

Generalization

• Batch normalization
  • Fixes the distribution of each layer's input as training progresses.
  • Faster convergence.

Ioffe and Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." arXiv 2015.
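A sketch of the batch-normalization forward pass for one mini-batch (training-time statistics only; gamma and beta are the learned scale and shift):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Normalize each feature over the mini-batch, then scale and shift."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```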

slide-36
SLIDE 36

Some results on ImageNet

• [Figure] Top-5 classification accuracy on ImageNet: GoogLeNet, Clarifai, AlexNet. Source: Krizhevsky et al., NIPS'12.

slide-37
SLIDE 37

Visualizing CNNs

• CNNs are cool, but some of the questions below need answers before we move forward:
  • How do I interpret the learned filters?
  • What is it that stimulates/excites a neuron?
  • How do I decide the architecture, or improve existing ones?
• To answer these we need to probe the learned models:
  • Deconvolutional networks [Zeiler et al., ICCV'11, ECCV'14]
  • Synthesizing images [Simonyan et al., ICLR'14; Mahendran et al., CVPR'15]

Visualizing the first conv. layer is possible, but what about the later layers? Zeiler and Fergus, "Visualizing and Understanding Convolutional Networks." ECCV 2014. Source: Krizhevsky et al., NIPS'12.

slide-38
SLIDE 38

Visualizing CNNs

• Deconvnets
  • Non-parametric approach.
  • Projects a feature activation back to the input space.
  • Analyses a trained model and uses validation data to interpret the feature activations.
  • Visualizes a single activation and not the joint activity.
  • Helps in understanding the generalization ability of CNNs.

Zeiler and Fergus, "Visualizing and Understanding Convolutional Networks." ECCV 2014. Source: Zeiler et al., ECCV'14.

slide-39
SLIDE 39

Visualizing CNNs

[Figure] "Grass!" Zeiler and Fergus, "Visualizing and Understanding Convolutional Networks." ECCV 2014. Source: Zeiler et al., ECCV'14.

  • A. How do I interpret the learned filters?
slide-40
SLIDE 40

Visualizing CNNs

Zeiler and Fergus, "Visualizing and Understanding Convolutional Networks." ECCV 2014.

  • A. What is it that stimulates/excites a neuron?

slide-41
SLIDE 41

Visualizing CNNs

Class model visualization vs. image-specific class saliency visualization.

[Figure] Example: "washing machine". Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, "Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps." CoRR 2014.

slide-42
SLIDE 42

Visualizing CNNs

• Class model visualization
  • Find an L2-regularized image I which maximizes the class score Sc(I). Here Sc(I) is the score of class 'c' before the softmax.
  • Initialize with the mean image.
  • Back-propagate to update the input pixels, keeping the weights of the intermediate layers fixed.

Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, "Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps." CoRR 2014. [Figure] Some more results.

slide-43
SLIDE 43

Visualizing CNNs

• Image-specific class saliency visualization
  • Understanding the spatial support of a class in a specific image.
  • The class score is a non-linear mapping of the image, but it is approximated using a first-order Taylor expansion.

Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. CoRR 2014

  • [Figure] Original image → spatial support (saliency map) → object localization mask → GrabCut segmentation.

slide-44
SLIDE 44

Transfer Learning

• A key observation that we noticed in visualization:

Yosinski J, Clune J, BengioY, and Lipson H. How transferable are features in deep neural networks? NIPS ‟14

[Figure] CONV → POOL → NORM → … → FC → SOFTMAX: early layers learn general features (Gabor filters / color blobs), later layers learn specific features (e.g. dog faces).

slide-45
SLIDE 45

Transfer Learning

• A key observation that we noticed in visualization. Further questions:
  • Can we quantify the layer generality/specificity?
  • Where does the transition occur?
  • Is the transition sudden or spread over layers?

Yosinski J, Clune J, Bengio Y, and Lipson H. "How transferable are features in deep neural networks?" NIPS '14.


slide-46
SLIDE 46

Transfer Learning

• Transfer performance experiment
  • Tasks A and B.
  • Types of networks:
    • Selffer (BnB / BnB+)
    • Transfer (AnB / AnB+)
  • Datasets:
    • Random split
    • Dissimilar split
  • Observations:
    • Higher-level neurons are more specialized.
    • There exist co-adapted neurons between layers, which makes optimization difficult.

Yosinski J, Clune J, Bengio Y, and Lipson H. "How transferable are features in deep neural networks?" NIPS '14. [Figure] Results for baseB, BnB, BnB+, AnB, AnB+.

slide-47
SLIDE 47

Transfer Learning

• Take-away message
  • Initializing a network with transferred features almost always gives better generalization.

[Figure] Notes:
• If the dataset is small, retrain only the softmax layer.
• If the dataset is of reasonable size, retrain a larger portion of the network, fine-tuning the initial layers.
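A schematic sketch of the two regimes (layer names, shapes and learning rates are illustrative, not from the slides):

```python
import numpy as np

# Pre-trained parameters and their gradients for one batch (dummy values).
params = {name: np.random.randn(8, 8) for name in ["conv1", "conv2", "fc6", "fc7", "fc8"]}
grads = {name: np.random.randn(8, 8) for name in params}

def sgd_update(params, grads, lr, trainable):
    """Update only the layers listed in `trainable`; transferred layers stay frozen."""
    for name in trainable:
        params[name] -= lr * grads[name]

# Small target dataset: retrain only the final classifier (softmax / last FC layer).
sgd_update(params, grads, lr=1e-3, trainable=["fc8"])

# Reasonably sized dataset: also fine-tune earlier layers, typically with a smaller rate.
sgd_update(params, grads, lr=1e-4, trainable=["conv2", "fc6", "fc7", "fc8"])
```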

slide-48
SLIDE 48

Transfer Learning

• Benchmarks
  • Razavian et al., CVPRW 2014
  • Chatfield et al., BMVC 2014

slide-49
SLIDE 49

Case Study: JAZNet

• Scene text recognition
  • Recognition of the word image holistically.
  • Synthetic generation of training data.
  • State-of-the-art results on most document datasets.
• Model 1: word-encoding model, with a synthetic data engine.
  • Challenge: a huge number of target classes, ~90K (the size of the English vocabulary).
  • Incremental training.
  • Uses 9M training images.

Jaderberg, Max, et al. "Synthetic data and artificial neural networks for natural scene text recognition." arXiv 2014.

slide-50
SLIDE 50

Case Study: JAZNet

Model 2: Sequence of characters.
Model 3: Bag of n-grams.

Jaderberg, Max, et al. "Synthetic data and artificial neural networks for natural scene text recognition." arXiv 2014.

slide-51
SLIDE 51

Practical Aspects

• Data preprocessing
  • (0,1) normalization
  • Whitening
• Data augmentation (see the sketch below)
  • Perturb the image to make the model more resilient to variations:
    • Cropping
    • Flipping
    • Jittering
    • Degradation models specific to the modality (e.g. text)
  • Applicable at both train and test time.
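A sketch of simple preprocessing and augmentation steps (crop size and jitter strength are illustrative):

```python
import numpy as np

def normalize(img):
    """(0,1) normalization of a float image."""
    return (img - img.min()) / (img.max() - img.min() + 1e-8)

def augment(img, crop=224):
    """Random crop, horizontal flip and brightness jitter of an HxWxC image in [0,1]."""
    h, w = img.shape[:2]
    top = np.random.randint(0, h - crop + 1)
    left = np.random.randint(0, w - crop + 1)
    out = img[top:top + crop, left:left + crop]
    if np.random.rand() < 0.5:
        out = out[:, ::-1]                        # horizontal flip
    out = out + np.random.uniform(-0.1, 0.1)      # brightness jitter
    return np.clip(out, 0.0, 1.0)
```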

slide-52
SLIDE 52

Practical Aspects

• Modern CNN libraries
  • Theano
  • Torch
  • Caffe
  • MatConvNet
  • TensorFlow
  • and many more…

Pick your favorite and start building networks!

slide-53
SLIDE 53

Thank You