SLIDE 1

Convolutional Neural Networks

Computer Vision Jia-Bin Huang, Virginia Tech

SLIDE 2

Today’s class

  • Overview
  • Convolutional Neural Network (CNN)
  • Training CNN
  • Understanding and Visualizing CNN
SLIDE 3

SLIDE 4

Image Categorization: Training phase

Training: training images + training labels → image features → classifier training → trained classifier

SLIDE 5

Image Categorization: Testing phase

Training: training images + training labels → image features → classifier training → trained classifier

Testing: test image → image features → trained classifier → prediction (e.g., “Outdoor”)

SLIDE 6

Features are the Keys

SIFT [Lowe IJCV 04], HOG [Dalal and Triggs CVPR 05], SPM [Lazebnik et al. CVPR 06], DPM [Felzenszwalb et al. PAMI 10], Color Descriptor [Van De Sande et al. PAMI 10]

SLIDE 7

Learning a Hierarchy of Feature Extractors

  • Each layer of the hierarchy extracts features from the output of the previous layer
  • All the way from pixels → classifier
  • Layers have (nearly) the same structure

Image/video pixels → Layer 1 → Layer 2 → Layer 3 → simple classifier → image/video labels

SLIDE 8

Biological neuron and Perceptrons

A biological neuron

An artificial neuron (Perceptron)

  • a linear classifier
SLIDE 9

Simple, Complex and Hypercomplex cells

David H. Hubel and Torsten Wiesel David Hubel's Eye, Brain, and Vision

Suggested a hierarchy of feature detectors in the visual cortex, with higher level features responding to patterns of activation in lower level cells, and propagating activation upwards to still higher level cells.

SLIDE 10

Hubel/Wiesel Architecture and Multi-layer Neural Network

Hubel and Wiesel’s architecture

Multi-layer Neural Network

  • A non-linear classifier
SLIDE 11

Multi-layer Neural Network

  • A non-linear classifier
  • Training: find the network weights $\mathbf{w}$ that minimize the error between the true training labels $y_j$ and the estimated labels $f_{\mathbf{w}}(\mathbf{x}_j)$
  • Minimization can be done by gradient descent, provided $f$ is differentiable
  • This training method is called back-propagation

SLIDE 12

Convolutional Neural Networks

  • Also known as CNN, ConvNet, DCN
  • CNN = a multi-layer neural network with
  • 1. Local connectivity
  • 2. Weight sharing
SLIDE 13

CNN: Local Connectivity

  • # input units (neurons): 7
  • # hidden units: 3
  • Number of parameters

– Global connectivity: 3 x 7 = 21
– Local connectivity: 3 x 3 = 9

(Figure: global vs. local connectivity between the input layer and the hidden layer)

SLIDE 14

CNN: Weight Sharing

  • # input units (neurons): 7
  • # hidden units: 3
  • Number of parameters
– Without weight sharing: 3 x 3 = 9 (distinct weights w1 … w9)
– With weight sharing: 3 x 1 = 3 (the same w1, w2, w3 reused at every position)
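A minimal NumPy sketch (not from the slides) of this 7-input, 3-hidden-unit example; the stride of 2 is an assumption made so that a single shared 3-tap filter (w1, w2, w3) yields exactly 3 hidden units:

    import numpy as np

    x = np.arange(7, dtype=float)      # 7 input units, as in the figure
    w = np.array([0.2, 0.5, 0.3])      # one shared 3-tap filter (w1, w2, w3)

    # Local connectivity: each hidden unit sees only 3 neighboring inputs.
    # Weight sharing: the same (w1, w2, w3) is reused at every position,
    # so the layer has 3 parameters instead of 3 x 3 = 9.
    hidden = np.array([w @ x[i:i + 3] for i in (0, 2, 4)])
    print(hidden.shape)                # (3,) -- three hidden units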

SLIDE 15

CNN with multiple input channels

(Figure: hidden layer computed from a single input channel vs. from multiple input channels (Channel 1, Channel 2); with multiple channels, the filter has a separate set of weights for each input channel)

SLIDE 16

CNN with multiple output maps

(Figure: hidden layer with a single output map vs. multiple output maps; each output map has its own filter weights — Filter 1 → Map 1, Filter 2 → Map 2)

SLIDE 17

Putting them together

  • Local connectivity
  • Weight sharing
  • Handling multiple input channels
  • Handling multiple output maps

Image credit: A. Karpathy

(Figure: one convolutional layer combining local connectivity, weight sharing, multiple input channels, and multiple output activation maps)
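As a quick sanity check of how these four ideas combine, here is a sketch using PyTorch (one of the libraries listed at the end of the deck); the channel counts and input size are illustrative, not taken from the slides. The parameter count is #output maps × #input channels × kernel height × kernel width, plus one bias per output map:

    import torch
    import torch.nn as nn

    # 3 input channels, 16 output maps, 3x3 local connectivity, shared filters
    conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

    x = torch.randn(1, 3, 32, 32)      # one 32x32, 3-channel input
    print(conv(x).shape)               # torch.Size([1, 16, 32, 32])

    # 16 * 3 * 3 * 3 weights + 16 biases = 448 parameters
    print(sum(p.numel() for p in conv.parameters()))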

SLIDE 18

Neocognitron [Fukushima, Biological Cybernetics 1980]

Deformation-Resistant Recognition

S-cells: (simple)

  • extract local features

C-cells: (complex)

  • allow for positional errors
SLIDE 19

LeNet [LeCun et al. 1998]

Gradient-based learning applied to document recognition [LeCun, Bottou, Bengio, Haffner 1998]

LeNet-1 from 1993

SLIDE 20

What is a Convolution?

  • Weighted moving sum

(Figure: a filter slides over the input feature map, producing an activation map of weighted sums)

slide credit: S. Lazebnik
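A minimal NumPy sketch of this weighted moving sum (strictly a cross-correlation, which is what CNN libraries actually compute); the image and kernel values are illustrative placeholders:

    import numpy as np

    def conv2d(image, kernel):
        """Slide `kernel` over `image` and take a weighted sum at each position."""
        kh, kw = kernel.shape
        H, W = image.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    image = np.random.rand(5, 5)
    kernel = np.ones((3, 3)) / 9.0     # simple averaging filter
    print(conv2d(image, kernel).shape) # (3, 3) activation map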

SLIDE 21

Convolutional Neural Networks

Input image → Convolution (learned) → Non-linearity → Spatial pooling → Normalization → Feature maps

slide credit: S. Lazebnik

SLIDE 22

Input image → Convolution (learned) → Non-linearity → Spatial pooling → Normalization → Feature maps

(Figure: learned convolution filters applied to the input feature map)

Convolutional Neural Networks

slide credit: S. Lazebnik

SLIDE 23

Input image → Convolution (learned) → Non-linearity → Spatial pooling → Normalization → Feature maps

Convolutional Neural Networks

Rectified Linear Unit (ReLU)

slide credit: S. Lazebnik
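A one-line sketch of the ReLU non-linearity, applied element-wise to a feature map:

    import numpy as np

    def relu(x):
        """Rectified Linear Unit: pass positive values through, zero out the rest."""
        return np.maximum(0.0, x)

    print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))   # [0.  0.  0.  1.5 3. ]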

SLIDE 24

Input image → Convolution (learned) → Non-linearity → Spatial pooling → Normalization → Feature maps

Max pooling

Convolutional Neural Networks

slide credit: S. Lazebnik

Max-pooling: a non-linear down-sampling that provides translation invariance
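A minimal NumPy sketch of non-overlapping max pooling (the 2x2 window is an illustrative choice):

    import numpy as np

    def max_pool2d(x, size=2):
        """Non-linear down-sampling: keep the maximum of each size x size block."""
        H2, W2 = x.shape[0] // size, x.shape[1] // size
        return x[:H2 * size, :W2 * size].reshape(H2, size, W2, size).max(axis=(1, 3))

    x = np.arange(16, dtype=float).reshape(4, 4)
    print(max_pool2d(x))               # [[ 5.  7.]
                                       #  [13. 15.]]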

SLIDE 25

Input image → Convolution (learned) → Non-linearity → Spatial pooling → Normalization → Feature maps

(Figure: feature maps before and after contrast normalization)

Convolutional Neural Networks

slide credit: S. Lazebnik

SLIDE 26

Input image → Convolution (learned) → Non-linearity → Spatial pooling → Normalization → Feature maps

Convolutional Neural Networks

slide credit: S. Lazebnik

SLIDE 27

Engineered vs. learned features

Engineered: Image → feature extraction → pooling → classifier → label

Learned: Image → conv/pool → conv/pool → conv/pool → conv/pool → conv/pool → dense → dense → dense → label

Convolutional filters are trained in a supervised manner by back-propagating classification error

SLIDE 28

Imagenet Classification with Deep Convolutional Neural Networks, Krizhevsky, Sutskever, and Hinton, NIPS 2012 Gradient-Based Learning Applied to Document Recognition, LeCun, Bottou, Bengio and Haffner, Proc. of the IEEE, 1998

Slide Credit: L. Zitnick

SLIDE 29

Imagenet Classification with Deep Convolutional Neural Networks, Krizhevsky, Sutskever, and Hinton, NIPS 2012 Gradient-Based Learning Applied to Document Recognition, LeCun, Bottou, Bengio and Haffner, Proc. of the IEEE, 1998

* Rectified activations and dropout

Slide Credit: L. Zitnick

SLIDE 30

SIFT Descriptor

Image pixels → apply gradient filters → spatial pool (sum) → normalize to unit length → feature vector

Lowe [IJCV 2004]

SLIDE 31

SIFT Descriptor

Image pixels → apply oriented filters → spatial pool (sum) → normalize to unit length → feature vector

Lowe [IJCV 2004]

slide credit: R. Fergus

SLIDE 32

Spatial Pyramid Matching

SIFT features → filter with visual words → multi-scale spatial pool (sum) → max → classifier

Lazebnik, Schmid, Ponce [CVPR 2006]

slide credit: R. Fergus

SLIDE 33

Deformable Part Model

Deformable Part Models are Convolutional Neural Networks [Girshick et al. CVPR 15]

SLIDE 34

AlexNet

  • Similar framework to LeCun’98 but:
  • Bigger model (7 hidden layers, 650,000 units, 60,000,000 params)
  • More data (10^6 vs. 10^3 images)
  • GPU implementation (50x speedup over CPU)
  • Trained on two GPUs for a week
  • A. Krizhevsky, I. Sutskever, and G. Hinton,

ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012

SLIDE 35

Using CNN for Image Classification

AlexNet

Fixed input size: 224x224x3 → AlexNet → fully connected layer fc7 (d = 4096) → averaging → softmax layer → prediction: “Jia-Bin”

SLIDE 36

Progress on ImageNet

(Chart: ImageNet image classification top-5 error, decreasing from 2012 AlexNet through 2013 ZF, 2014 VGG, 2014 GoogLeNet, and 2015 ResNet to 2016 GoogLeNet-v4)

SLIDE 37

SLIDE 38

VGG-Net

  • The deeper, the better
  • Key design choices:
– 3x3 conv. kernels (very small)
– conv. stride 1 (no loss of information)
  • Other details:
– Rectification (ReLU) non-linearity
– 5 max-pool layers (x2 reduction)
– no normalization
– 3 fully-connected (FC) layers

SLIDE 39

VGG-Net

  • Why 3x3 layers?
– Stacked conv. layers have a large receptive field
– two 3x3 layers → 5x5 receptive field
– three 3x3 layers → 7x7 receptive field
  • More non-linearity
  • Fewer parameters to learn (~140M per net)
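A quick check of the arithmetic behind this design choice, assuming stride-1 convolutions and C channels in and out (C = 256 is an illustrative value):

    # Receptive field of n stacked 3x3, stride-1 conv layers: 2n + 1
    for n in (1, 2, 3):
        r = 2 * n + 1
        print(f"{n} x (3x3) -> {r}x{r} receptive field")   # 3x3, 5x5, 7x7

    # Parameters, ignoring biases:
    # three 3x3 layers = 27*C^2  vs.  one 7x7 layer = 49*C^2
    C = 256
    print(3 * (3 * 3 * C * C), "vs", 7 * 7 * C * C)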

SLIDE 40

ResNet

  • Can we just keep increasing the number of layers?
  • How can we train very deep networks?
  • Residual learning
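A minimal PyTorch sketch of residual learning: a block that outputs F(x) + x through an identity shortcut (batch normalization and down-sampling shortcuts are omitted for brevity, and the channel count is illustrative):

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """output = F(x) + x, so the block only has to learn the residual F."""
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
            self.relu = nn.ReLU()

        def forward(self, x):
            f = self.conv2(self.relu(self.conv1(x)))
            return self.relu(f + x)    # the shortcut lets gradients skip the convs

    x = torch.randn(1, 64, 56, 56)
    print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])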
SLIDE 41

DenseNet

  • Shorter connections (like ResNet) help
  • Why not just connect them all?
SLIDE 42

Training Convolutional Neural Networks

  • Backpropagation + stochastic gradient descent with momentum
– Neural Networks: Tricks of the Trade
  • Dropout
  • Data augmentation
  • Batch normalization
  • Initialization
– Transfer learning

SLIDE 43

Training CNN with gradient descent

  • A CNN as a composition of functions:
$f_{\mathbf{w}}(\mathbf{x}) = f_M(\dots f_2(f_1(\mathbf{x}; \mathbf{w}_1); \mathbf{w}_2) \dots; \mathbf{w}_M)$

  • Parameters: $\mathbf{w} = (\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_M)$

  • Empirical loss function:
$L(\mathbf{w}) = \frac{1}{n} \sum_j \ell(y_j, f_{\mathbf{w}}(\mathbf{x}_j))$

  • Gradient descent:
$\mathbf{w}^{t+1} = \mathbf{w}^{t} - \eta_t \, \frac{\partial L}{\partial \mathbf{w}}(\mathbf{w}^{t})$
(new weight = old weight − learning rate × gradient)
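A minimal NumPy sketch of this update rule, including the momentum term mentioned two slides earlier; the quadratic toy loss and its gradient are illustrative placeholders, not from the slides:

    import numpy as np

    def grad(w):
        # Toy gradient: dL/dw for L(w) = 0.5 * ||w||^2 (for illustration only)
        return w

    w = np.array([4.0, -3.0])          # old weights
    v = np.zeros_like(w)               # momentum "velocity"
    lr, momentum = 0.1, 0.9            # learning rate and momentum coefficient
    for _ in range(100):
        v = momentum * v - lr * grad(w)    # accumulate past gradients
        w = w + v                          # new weight = old weight + velocity
    print(w)                           # shrinks toward the minimum at [0, 0]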

SLIDE 44

An Illustrative example

$f(x, y) = xy$,  $\frac{\partial f}{\partial x} = y$,  $\frac{\partial f}{\partial y} = x$

Example: $x = 4$, $y = -3$  ⇒  $f(x, y) = -12$

Example credit: Andrej Karpathy

Partial derivatives: $\frac{\partial f}{\partial x} = -3$,  $\frac{\partial f}{\partial y} = 4$

Gradient: $\nabla f = \left[ \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right]$

SLIDE 45

$f(x, y, z) = (x + y)\,z = qz$, with $q = x + y$

Example credit: Andrej Karpathy

$q = x + y$:  $\frac{\partial q}{\partial x} = 1$,  $\frac{\partial q}{\partial y} = 1$
$f = qz$:  $\frac{\partial f}{\partial q} = z$,  $\frac{\partial f}{\partial z} = q$

Goal: compute the gradient $\nabla f = \left[ \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z} \right]$

SLIDE 46

$f(x, y, z) = (x + y)\,z = qz$

Example credit: Andrej Karpathy

$f = qz$:  $\frac{\partial f}{\partial q} = z$,  $\frac{\partial f}{\partial z} = q$
$q = x + y$:  $\frac{\partial q}{\partial x} = 1$,  $\frac{\partial q}{\partial y} = 1$

Chain rule: $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \, \frac{\partial q}{\partial x}$

SLIDE 47

Backpropagation (recursive chain rule)

A gate with inputs $x_1, x_2, \dots, x_n$ and output $q$ receives the upstream gradient $\frac{\partial f}{\partial q}$ during backprop.

$\frac{\partial f}{\partial x_j} = \frac{\partial q}{\partial x_j} \, \frac{\partial f}{\partial q}$  (local gradient × gate gradient)

The local gradient $\frac{\partial q}{\partial x_j}$ can be computed during the forward pass; the gate gradient $\frac{\partial f}{\partial q}$ is received during backprop.
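The same recipe as runnable code for the $(x + y)\,z$ example above, multiplying each local gradient by the gate gradient flowing in from the output; the input values are the usual illustrative choice and are not given on the slides:

    # Forward pass: compute and cache the local values
    x, y, z = -2.0, 5.0, -4.0          # illustrative inputs
    q = x + y                          # q = 3
    f = q * z                          # f = -12

    # Backward pass: chain rule from the output back to the inputs
    df_dq = z                          # gate gradient arriving at the "+" gate
    df_dz = q
    df_dx = df_dq * 1.0                # local gradient dq/dx = 1
    df_dy = df_dq * 1.0                # local gradient dq/dy = 1
    print(f, [df_dx, df_dy, df_dz])    # -12.0 [-4.0, -4.0, 3.0]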

SLIDE 48

Dropout

Dropout: A simple way to prevent neural networks from overfitting [Srivastava JMLR 2014]

Intuition: successful conspiracies

  • 50 people planning a conspiracy
  • Strategy A: plan a big conspiracy involving 50 people
  • Likely to fail. 50 people need to play their parts correctly.
  • Strategy B: plan 10 conspiracies each involving 5 people
  • Likely to succeed!
SLIDE 49

Dropout

Dropout: A simple way to prevent neural networks from overfitting [Srivastava JMLR 2014]

Main Idea: approximately combining exponentially many different neural network architectures efficiently
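A sketch of one common way to implement this, "inverted" dropout, which rescales the surviving activations at training time so the network can be used unchanged at test time (p = 0.5 is a typical drop probability for hidden units):

    import numpy as np

    def dropout(a, p=0.5, train=True):
        """Randomly zero each unit with probability p during training."""
        if not train:
            return a                   # test time: use the full network
        mask = (np.random.rand(*a.shape) >= p) / (1.0 - p)   # keep and rescale
        return a * mask

    a = np.ones((2, 4))
    print(dropout(a))                  # roughly half of the units zeroed out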

SLIDE 50

Data Augmentation (Jittering)

  • Create virtual training samples

– Horizontal flip
– Random crop
– Color casting
– Geometric distortion

Deep Image [Wu et al. 2015]
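A NumPy sketch of two of these jittering operations, random horizontal flip and random crop; the image and crop sizes are illustrative:

    import numpy as np

    def augment(img, crop=24):
        """Create a virtual training sample from an H x W x 3 image."""
        if np.random.rand() < 0.5:
            img = img[:, ::-1, :]                          # horizontal flip
        H, W, _ = img.shape
        top = np.random.randint(0, H - crop + 1)
        left = np.random.randint(0, W - crop + 1)
        return img[top:top + crop, left:left + crop, :]    # random crop

    img = np.random.rand(32, 32, 3)
    print(augment(img).shape)          # (24, 24, 3)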

SLIDE 51

Parametric Rectified Linear Unit

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification [He et al. 2015]

SLIDE 52

Swish

(Figure: the Swish activation function and its first derivatives)

https://arxiv.org/abs/1710.05941
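Minimal NumPy sketches of the two activations from these slides; the PReLU slope a = 0.25 is an illustrative initial value (in PReLU it is learned), and beta = 1 gives the simplest form of Swish:

    import numpy as np

    def prelu(x, a=0.25):
        """Parametric ReLU: x for x > 0, a * x otherwise (a is learned)."""
        return np.where(x > 0, x, a * x)

    def swish(x, beta=1.0):
        """Swish: x * sigmoid(beta * x)."""
        return x / (1.0 + np.exp(-beta * x))

    x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
    print(prelu(x))
    print(np.round(swish(x), 3))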

SLIDE 53

Batch Normalization

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift [Ioffe and Szegedy 2015]
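A NumPy sketch of the training-time batch-normalization transform for a mini-batch of feature vectors; gamma and beta are the learned scale and shift (initialized to 1 and 0 here), and the running statistics used at test time are omitted:

    import numpy as np

    def batch_norm(x, gamma, beta, eps=1e-5):
        """Normalize each feature over the mini-batch, then scale and shift."""
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mu) / np.sqrt(var + eps)
        return gamma * x_hat + beta

    x = np.random.randn(8, 4) * 3.0 + 2.0         # batch of 8, 4 features each
    out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
    print(np.round(out.mean(axis=0), 3))          # ~[0. 0. 0. 0.]
    print(np.round(out.std(axis=0), 3))           # ~[1. 1. 1. 1.]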

SLIDE 54

Understanding and Visualizing CNN

  • Find images that maximize some class scores
  • Individual neuron activation
  • Breaking CNNs
SLIDE 55

Individual Neuron Activation

RCNN [Girshick et al. CVPR 2014]

SLIDE 56

Individual Neuron Activation

RCNN [Girshick et al. CVPR 2014]

SLIDE 57

Individual Neuron Activation

RCNN [Girshick et al. CVPR 2014]

SLIDE 58

Map activation back to the input pixel space

  • What input pattern originally caused a given activation in the feature maps?

Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, ECCV 2014]

SLIDE 59

Layer 1

Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, ECCV 2014]

SLIDE 60

Layer 2

Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, ECCV 2014]

SLIDE 61

Layer 3

Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, ECCV 2014]

SLIDE 62

Layer 4 and 5

Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, ECCV 2014]

SLIDE 63

Network Dissection

http://netdissect.csail.mit.edu/

SLIDE 64

Deep learning library

  • TensorFlow

– Research + Production

  • PyTorch

– Research

  • Caffe2

– Production

SLIDE 65

Things to remember

  • Convolutional neural networks
– A cascade of conv + ReLU + pool
– Representation learning
– Advanced architectures
– Tricks for training CNN
  • Visualizing CNN
– Activation
– Dissection

SLIDE 66

Resources

  • http://deeplearning.net/

– Hub to many other deep learning resources

  • https://github.com/ChristosChristofidis/awesome-deep-learning
– A resource collection for deep learning
  • https://github.com/kjw0612/awesome-deep-vision
– A resource collection for deep learning in computer vision

  • http://cs231n.stanford.edu/syllabus.html

– Nice course on CNN for visual recognition

SLIDE 67

Things to remember

  • Overview
– Neuroscience, Perceptron, multi-layer neural networks
  • Convolutional neural network (CNN)
– Convolution, non-linearity, max pooling
– CNN for classification and beyond
  • Understanding and visualizing CNN
– Find images that maximize some class scores; visualize individual neuron activations, input patterns and images; breaking CNNs
  • Training CNN
– Dropout; data augmentation; batch normalization; transfer learning

SLIDE 68