SLIDE 1 CNNs for Segmentation, Localization, and Detection
Sharif University of Technology, Fall 2017. Most slides have been adapted from Fei-Fei Li and colleagues' lectures, CS231n, Stanford, 2017, and some from John Canny's lectures, CS294-129, Berkeley, 2016.
SLIDE 2 AlexNet
- ImageNet Classification with Deep Convolutional Neural Networks
[Krizhevsky, Sutskever, Hinton, 2012]
SLIDE 3
Image classification
SLIDE 4
Other Computer Vision Tasks
SLIDE 5 Classification and localization
What & where
SLIDE 6 Classification + Localization
Classification:
- Input: Image
- Output: Class label (one of C classes)
- Evaluation metric: Accuracy
Localization:
- Input: Image
- Output: Box in the image (x, y, w, h)
- Evaluation metric: Intersection over Union (IoU)
Classification + Localization: do both, e.g. CAT + (x, y, w, h)
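The IoU evaluation metric can be sketched in a few lines (a minimal sketch; boxes are assumed to be (x, y, w, h) tuples with (x, y) the top-left corner):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x, y, w, h)."""
    ax1, ay1 = box_a[0], box_a[1]
    ax2, ay2 = box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1 = box_b[0], box_b[1]
    bx2, by2 = box_b[0] + box_b[2], box_b[1] + box_b[3]
    # Overlap rectangle (empty if the boxes do not intersect).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0
```

A perfect prediction gives IoU = 1.0; disjoint boxes give 0.0, and a detection is typically counted correct above a threshold such as 0.5.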
SLIDE 7 Idea #1: Localization as Regression
Input: image
Output: box coordinates (4 numbers), predicted by a neural net
Correct output: ground-truth box coordinates (4 numbers)
Loss: L2 distance between predicted and correct coordinates
Only one object, so this is simpler than detection
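The L2 loss over the four predicted coordinates is just a sum of squared differences; a minimal sketch:

```python
def l2_box_loss(pred, target):
    """L2 (squared-distance) loss over the 4 box coordinates (x, y, w, h)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

# Predicted box is off by 2 pixels in x and 2 in width.
loss = l2_box_loss((10.0, 12.0, 50.0, 40.0), (8.0, 12.0, 48.0, 40.0))
print(loss)  # 8.0
```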
SLIDE 8 Simple Recipe for Classification + Localization
- Step 1: Train (or download) a classification model (e.g., VGG)
[Diagram: Image → Convolution and Pooling → Final conv feature map → Fully-connected layers → Class scores → Softmax loss]
SLIDE 9 Simple Recipe for Classification + Localization
- Step 1: Train (or download) a classification model (e.g., VGG)
- Step 2: Attach new fully-connected “regression head” to the network
[Diagram: Image → Convolution and Pooling → Final conv feature map, feeding two heads: Fully-connected layers → Class scores ("classification head"), and Fully-connected layers → Box coordinates ("regression head")]
SLIDE 10 Simple Recipe for Classification + Localization
- Step 1: Train (or download) a classification model (e.g., VGG)
- Step 2: Attach new fully-connected “regression head” to the network
- Step 3: Train the regression head only with SGD and L2 loss
[Diagram: Image → Convolution and Pooling → Final conv feature map, feeding two heads: Fully-connected layers → Class scores ("classification head"), and Fully-connected layers → Box coordinates → L2 loss ("regression head")]
SLIDE 12 Simple Recipe for Classification + Localization
- Step 1: Train (or download) a classification model (e.g., VGG)
- Step 2: Attach new fully-connected “regression head” to the network
- Step 3: Train the regression head only with SGD and L2 loss
[Diagram: Image → Convolution and Pooling → Final conv feature map, feeding two heads: Fully-connected layers → Class scores → Softmax loss, and Fully-connected layers → Box coordinates → L2 loss]
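The two-headed recipe above can be sketched as a single forward pass in NumPy (a toy sketch, not a real network: the 512-dim feature vector, the weight shapes, and the random initialization are all illustrative assumptions standing in for a pretrained backbone):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a 512-dim feature vector and C = 20 classes.
feat_dim, num_classes = 512, 20
feat = rng.standard_normal(feat_dim)  # stands in for the flattened final conv feature map

# "Classification head": fully-connected layer -> class scores, softmax loss.
W_cls = rng.standard_normal((num_classes, feat_dim)) * 0.01
scores = W_cls @ feat
true_class = 3
cls_loss = -(scores[true_class] - np.log(np.sum(np.exp(scores))))

# "Regression head": fully-connected layer -> 4 box coordinates, L2 loss.
W_box = rng.standard_normal((4, feat_dim)) * 0.01
box = W_box @ feat
true_box = np.array([0.2, 0.3, 0.5, 0.4])
box_loss = np.sum((box - true_box) ** 2)

# Training the regression head minimizes box_loss; a fully joint setup
# would minimize a weighted sum of the two losses.
total_loss = cls_loss + box_loss
```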
SLIDE 13 Classification + Localization
Often pretrained on ImageNet (Transfer learning)
SLIDE 14
Aside: Human Pose Estimation
SLIDE 15
Aside: Human Pose Estimation
SLIDE 16 Where to attach the regression head?
[Diagram: Image → Convolution and Pooling → Final conv feature map → Fully-connected layers → Class scores → Softmax loss]
After the conv layers: Overfeat, VGG
After the last FC layer: DeepPose, R-CNN
SLIDE 17
Object detection
SLIDE 18
Object Detection: Impact of Deep Learning
SLIDE 19 Object Detection as Regression?
Each image needs a different number of outputs!
SLIDE 20
Object Detection as Classification: Sliding Window
SLIDE 21
Object Detection as Classification: Sliding Window
SLIDE 22 Object Detection as Classification: Sliding Window
Problem: Need to apply CNN to huge number of locations and scales, very computationally expensive!
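The cost problem is easy to see by just counting windows (a minimal sketch; the image size, window size, and stride below are illustrative):

```python
def sliding_windows(img_h, img_w, win, stride):
    """Enumerate every square crop a classifier would have to be run on."""
    boxes = []
    for y in range(0, img_h - win + 1, stride):
        for x in range(0, img_w - win + 1, stride):
            boxes.append((x, y, win, win))
    return boxes

# One small image, one window size, one stride already gives hundreds of
# crops, and a real detector must also sweep window sizes and image scales.
print(len(sliding_windows(224, 224, 64, 8)))  # 441 windows
```

Running a full CNN forward pass on each of these crops, at every scale, is what makes the naive sliding-window detector so expensive.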
SLIDE 23
Region Proposals
SLIDE 24
R-CNN
SLIDE 25
R-CNN
SLIDE 26
R-CNN
SLIDE 27
R-CNN
SLIDE 28 [Diagram: Image → Convolution and Pooling → Final conv feature map → Fully-connected layers → Class scores (1000 classes) → Softmax loss]
R-CNN Training
- Step 1: Train (or download) a classification model for ImageNet
(AlexNet)
SLIDE 29 [Diagram: Image → Convolution and Pooling → Final conv feature map → Fully-connected layers → Class scores (21 classes) → Softmax loss. Re-initialize the last layer: was 4096 x 1000, now will be 4096 x 21]
R-CNN Training
- Step 2: Fine-tune model for detection
- Instead of 1000 ImageNet classes, want 20 object classes + background
- Throw away final fully-connected layer, reinitialize from scratch
- Keep training model using positive / negative regions from detection images
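The layer surgery in Step 2 can be sketched as follows (a toy sketch; only the weight shapes matter here, and the random values are placeholders for pretrained weights):

```python
import numpy as np

rng = np.random.default_rng(0)

# The pretrained ImageNet model ends in a 4096 -> 1000 fully-connected layer.
W_imagenet_fc = rng.standard_normal((4096, 1000)) * 0.01

# For PASCAL-style detection we want 20 object classes + 1 background class.
# Throw the old layer away and reinitialize a fresh 4096 x 21 layer from
# scratch; every earlier layer keeps its pretrained weights and is fine-tuned.
num_det_outputs = 20 + 1
W_detection_fc = rng.standard_normal((4096, num_det_outputs)) * 0.01
```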
SLIDE 30 [Diagram: Image → Convolution and Pooling → pool5 features; Region Proposals → Crop + Warp → Forward pass → Save to disk]
R-CNN Training
Step 3: Extract features
- Extract region proposals for all images
- For each region: warp to CNN input size, run forward through CNN, save pool5 features to disk
- You need a big hard drive: the features are ~200 GB for the PASCAL dataset!
SLIDE 31 [Diagram: training image regions and their cached region features, split into positive and negative samples for the cat SVM]
R-CNN Training
- Step 4: Train one binary SVM per class to classify region features
SLIDE 32 Training image regions with cached region features; regression targets (dx, dy, dw, dh) in normalized coordinates:
- (0, 0, 0, 0): proposal is good
- (0.25, 0, 0, 0): proposal is too far to the left
- (0, 0, -0.125, 0): proposal is too wide
R-CNN Training
- Step 5 (bbox regression): For each class, train a linear regression model that maps cached features to offsets relative to the ground-truth boxes, to compensate for "slightly wrong" proposals
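One common way to parameterize these targets is the scheme from the R-CNN paper: center offsets normalized by the proposal size, plus log ratios for the scale change (the slide's examples use plain normalized offsets, which behave the same for small corrections). A minimal sketch, with boxes as (x, y, w, h) and (x, y) the top-left corner:

```python
import math

def regression_targets(proposal, gt):
    """R-CNN-style box regression targets (dx, dy, dw, dh)."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    # Compare box centers, normalized by the proposal width/height.
    dx = ((gx + gw / 2) - (px + pw / 2)) / pw
    dy = ((gy + gh / 2) - (py + ph / 2)) / ph
    # Scale changes as log ratios, so "no change" is exactly 0.
    dw = math.log(gw / pw)
    dh = math.log(gh / ph)
    return dx, dy, dw, dh

# A proposal shifted to the left of the ground truth by a quarter width:
print(regression_targets((0, 0, 100, 100), (25, 0, 100, 100)))  # dx = 0.25
```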
SLIDE 33 R-CNN: Problems
- Ad hoc training objectives
– Fine-tune network with softmax classifier (log loss)
– Train post-hoc linear SVMs (hinge loss)
– Train post-hoc bounding-box regressions (least squares)
- Training is slow (84h), takes a lot of disk space
- Inference (detection) is slow
– 47s / image with VGG16 [Simonyan & Zisserman, ICLR 2015]
– Fixed by SPP-net [He et al., ECCV 2014]
SLIDE 34
Fast R-CNN
SLIDE 35 Fast R-CNN
Share computation of convolutional layers between proposals for an image
SLIDE 36 Fast R-CNN: Region of Interest Pooling
[Diagram: hi-res input image (3 x 800 x 600) with region proposal → Convolution and Pooling → hi-res conv features (C x H x W) with region proposal → Fully-connected layers] Problem: the fully-connected layers expect low-res conv features (C x h x w).
SLIDE 37 Fast R-CNN: Region of Interest Pooling
[Diagram: hi-res input image (3 x 800 x 600) with region proposal → Convolution and Pooling → hi-res conv features (C x H x W) with region proposal → Fully-connected layers] First, project the region proposal onto the conv feature map. (The fully-connected layers expect low-res conv features, C x h x w.)
SLIDE 38 Fast R-CNN: Region of Interest Pooling
[Diagram: hi-res input image (3 x 800 x 600) with region proposal → Convolution and Pooling → hi-res conv features (C x H x W) with region proposal → Fully-connected layers] Next, divide the projected region into an h x w grid. (The fully-connected layers expect low-res conv features, C x h x w.)
SLIDE 39 Fast R-CNN: Region of Interest Pooling
[Diagram: hi-res input image (3 x 800 x 600) with region proposal → Convolution and Pooling → hi-res conv features (C x H x W) with region proposal → Fully-connected layers] Then, max-pool within each grid cell, producing RoI conv features of size C x h x w for the region proposal, exactly what the fully-connected layers expect.
SLIDE 40 Fast R-CNN: Region of Interest Pooling
[Diagram: hi-res input image (3 x 800 x 600) with region proposal → Convolution and Pooling → hi-res conv features (C x H x W) with region proposal → RoI conv features (C x h x w) → Fully-connected layers] We can backpropagate through this, similar to backpropagation through max pooling.
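The RoI pooling operation built up over the last few slides can be sketched in NumPy (a simplified sketch: the RoI is assumed to be already projected onto feature-map coordinates, and each grid cell is assumed non-empty):

```python
import numpy as np

def roi_pool(feat, roi, out_h, out_w):
    """Max-pool an RoI of a conv feature map down to a fixed (out_h, out_w) grid.

    feat: (C, H, W) conv feature map.
    roi:  (x1, y1, x2, y2) already projected into feature-map coordinates.
    """
    C = feat.shape[0]
    x1, y1, x2, y2 = roi
    region = feat[:, y1:y2, x1:x2]
    h, w = region.shape[1], region.shape[2]
    # Split the projected region into an out_h x out_w grid of cells.
    ys = np.linspace(0, h, out_h + 1).astype(int)
    xs = np.linspace(0, w, out_w + 1).astype(int)
    out = np.empty((C, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            cell = region[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            out[:, i, j] = cell.max(axis=(1, 2))  # max-pool within the cell
    return out

feat = np.arange(1 * 8 * 8, dtype=float).reshape(1, 8, 8)
pooled = roi_pool(feat, (0, 0, 8, 8), 2, 2)
print(pooled.shape)  # (1, 2, 2): fixed size, whatever the RoI size was
```

Because every RoI comes out as the same C x h x w block, a single shared conv pass can feed many proposals into the same fully-connected layers.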
SLIDE 41
Fast R-CNN: RoI Pooling
SLIDE 42 Fast R-CNN
Share computation of convolutional layers between proposals for an image
SLIDE 43 R-CNN vs SPP vs Fast R-CNN
Problem: Runtime dominated by region proposals!
SLIDE 44 Faster R-CNN
– Solely CNN-based
– No external modules
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. NIPS 2015
SLIDE 45 Faster R-CNN
Insert a Region Proposal Network (RPN) to predict proposals from the conv features. Train jointly with four losses:
– RPN classifies object / not object
– RPN regresses box coordinates
– Final classification score (object classes)
– Final box coordinates
SLIDE 46
Faster R-CNN: Make CNN do proposals!
SLIDE 47 Object Detection
Source: http://icml.cc/2016/tutorials/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
SLIDE 48
SLIDE 49
Semantic Segmentation
SLIDE 50 Semantic Segmentation Idea: Sliding Window
Problem: Very inefficient! Not reusing shared features between overlapping patches
SLIDE 51
Semantic Segmentation Idea: Fully Convolutional
SLIDE 52
Semantic Segmentation Idea: Fully Convolutional
SLIDE 53
Semantic Segmentation Idea: Fully Convolutional
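The fully convolutional idea, a stack of same-padded convolutions that keeps full spatial resolution and ends in a map of class scores with a per-pixel argmax, can be sketched naively in NumPy (a toy sketch: layer sizes and random weights are illustrative, and real networks add downsampling plus upsampling because full-resolution convolutions are expensive):

```python
import numpy as np

def conv2d_same(x, w):
    """Naive 'same'-padded 2D convolution: x is (Cin, H, W), w is (Cout, Cin, k, k)."""
    cout, cin, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    H, W = x.shape[1], x.shape[2]
    out = np.zeros((cout, H, W))
    for i in range(H):
        for j in range(W):
            patch = xp[:, i:i + k, j:j + k]
            # Contract (cin, k, k) of each filter against the patch.
            out[:, i, j] = np.tensordot(w, patch, axes=3)
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((3, 16, 16))        # input image, C x H x W
w1 = rng.standard_normal((8, 3, 3, 3)) * 0.1  # hidden conv layer, full resolution
w2 = rng.standard_normal((5, 8, 3, 3)) * 0.1  # final layer: 5 class-score maps
scores = conv2d_same(np.maximum(conv2d_same(img, w1), 0), w2)
seg = scores.argmax(axis=0)                    # per-pixel class prediction
print(scores.shape, seg.shape)  # (5, 16, 16) (16, 16)
```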
SLIDE 54
In-Network upsampling: “Unpooling”
SLIDE 55
In-Network upsampling: “Max Unpooling”
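Max unpooling pairs each unpooling layer with an earlier max-pooling layer and reuses the remembered argmax positions; a minimal 2D sketch:

```python
import numpy as np

def max_pool_with_indices(x, s=2):
    """s x s max pool over a 2D array that also remembers where each max came from."""
    H, W = x.shape
    out = np.zeros((H // s, W // s))
    idx = np.zeros((H // s, W // s, 2), dtype=int)
    for i in range(H // s):
        for j in range(W // s):
            block = x[i * s:(i + 1) * s, j * s:(j + 1) * s]
            k = np.unravel_index(block.argmax(), block.shape)
            out[i, j] = block[k]
            idx[i, j] = (i * s + k[0], j * s + k[1])
    return out, idx

def max_unpool(y, idx, shape):
    """Place each value back at the position its max came from; zeros elsewhere."""
    out = np.zeros(shape)
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            r, c = idx[i, j]
            out[r, c] = y[i, j]
    return out

x = np.array([[1., 2., 6., 3.],
              [3., 5., 2., 1.],
              [1., 2., 2., 1.],
              [7., 3., 4., 8.]])
pooled, idx = max_pool_with_indices(x)       # 4x4 -> 2x2, indices saved
restored = max_unpool(pooled, idx, x.shape)  # 2x2 -> 4x4, sparse output
```

The upsampled output is mostly zeros, with each pooled value restored to the exact position of its original maximum.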
SLIDE 56
Learnable Upsampling: Transpose Convolution
SLIDE 57
Learnable Upsampling: Transpose Convolution
SLIDE 58 Learnable Upsampling: Transpose Convolution
Other names:
– Deconvolution (bad)
– Upconvolution
– Fractionally strided convolution
– Backward strided convolution
SLIDE 59
Transpose Convolution: 1D Example
SLIDE 60
Convolution as Matrix Multiplication (1D Example)
SLIDE 61
Convolution as Matrix Multiplication (1D Example)
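The 1D construction from these slides can be written out directly: build the matrix X whose rows are shifted copies of the kernel, so that X @ a is the ordinary strided convolution, and multiplying by X.T performs the transpose convolution (a minimal sketch with an illustrative kernel):

```python
import numpy as np

def conv_matrix(kernel, n, stride=1):
    """Build X such that X @ a equals the 1D strided convolution of a (no padding)."""
    k = len(kernel)
    rows = []
    for start in range(0, n - k + 1, stride):
        row = np.zeros(n)
        row[start:start + k] = kernel  # kernel slid to this position
        rows.append(row)
    return np.array(rows)

a = np.array([1., 2., 3., 4.])
x = np.array([1., 0., -1.])   # illustrative kernel
X = conv_matrix(x, len(a))
print(X @ a)                  # ordinary convolution: 4 inputs -> 2 outputs
print(X.T @ (X @ a))          # transpose convolution: 2 -> 4, upsampling
```

Multiplying by X.T scatters each input value back across a kernel-sized window, which is exactly the "fractionally strided" upsampling used in segmentation networks.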
SLIDE 62
Semantic Segmentation Idea: Fully Convolutional
SLIDE 63
Computer Vision Tasks