CNNs for Segmentation, Localization, and Detection - M. Soleymani - PowerPoint PPT Presentation




SLIDE 1

CNNs for Segmentation, Localization, and Detection

  • M. Soleymani

Sharif University of Technology, Fall 2017. Most slides have been adopted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford, 2017, and some from John Canny's lectures, cs294-129, Berkeley, 2016.

SLIDE 2

AlexNet

  • ImageNet Classification with Deep Convolutional Neural Networks

[Krizhevsky, Sutskever, Hinton, 2012]

SLIDE 3

Image classification

SLIDE 4

Other Computer Vision Tasks

SLIDE 5

Classification and localization

What & where

SLIDE 6

Classification + Localization

Classification (C classes):
  • Input: image
  • Output: class label
  • Evaluation metric: accuracy

Localization:
  • Input: image
  • Output: box in the image (x, y, w, h)
  • Evaluation metric: Intersection over Union (IoU)

Classification + Localization: do both, e.g. “CAT” + (x, y, w, h)
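The IoU metric above can be computed directly from the (x, y, w, h) encoding. A minimal sketch in plain Python, assuming axis-aligned boxes with (x, y) as the top-left corner:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x, y, w, h)."""
    # Corner coordinates of each box.
    ax1, ay1, ax2, ay2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1, bx2, by2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    # Width / height of the intersection rectangle (zero if the boxes are disjoint).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0
```

A predicted box is commonly counted as correct when its IoU with the ground-truth box exceeds a threshold such as 0.5.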

SLIDE 7

Idea #1: Localization as Regression

  • Input: image
  • Output (from a neural net): box coordinates (4 numbers)
  • Correct output: box coordinates (4 numbers)
  • Loss: L2 distance

Only one object, simpler than detection.

SLIDE 8

Simple Recipe for Classification + Localization

  • Step 1: Train (or download) a classification model (e.g., VGG)

Image → Convolution and Pooling → Final conv feature map → Fully-connected layers → Class scores → Softmax loss

SLIDE 9

Simple Recipe for Classification + Localization

  • Step 1: Train (or download) a classification model (e.g., VGG)
  • Step 2: Attach new fully-connected “regression head” to the network

Image → Convolution and Pooling → Final conv feature map, then two heads:

  • “Classification head”: fully-connected layers → class scores
  • “Regression head”: fully-connected layers → box coordinates

SLIDE 10

Simple Recipe for Classification + Localization

  • Step 1: Train (or download) a classification model (e.g., VGG)
  • Step 2: Attach new fully-connected “regression head” to the network
  • Step 3: Train the regression head only with SGD and L2 loss

Image → Convolution and Pooling → Final conv feature map, then two heads:

  • “Classification head”: fully-connected layers → class scores
  • “Regression head”: fully-connected layers → box coordinates → L2 loss


SLIDE 12

Simple Recipe for Classification + Localization

  • Step 1: Train (or download) a classification model (e.g., VGG)
  • Step 2: Attach new fully-connected “regression head” to the network
  • Step 3: Train the regression head only with SGD and L2 loss

Image → Convolution and Pooling → Final conv feature map, then two heads:

  • Fully-connected layers → class scores → softmax loss
  • Fully-connected layers → box coordinates → L2 loss
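The two-head recipe (shared conv features, a softmax classification head, and an L2 regression head) can be sketched in NumPy. The 512-dim feature, 21 classes, example targets, and unit loss weight below are illustrative assumptions, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the flattened final conv feature map (batch of 1, 512-dim assumed).
feat = rng.standard_normal((1, 512))

# Two heads on the shared features.
W_cls = rng.standard_normal((512, 21)) * 0.01   # classification head (21 classes assumed)
W_box = rng.standard_normal((512, 4)) * 0.01    # regression head: (x, y, w, h)

scores = feat @ W_cls
box = feat @ W_box

# Softmax loss on the classification head (numerically stabilized).
target_class = 3                                # arbitrary example label
shifted = scores - scores.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
cls_loss = -np.log(probs[0, target_class])

# L2 loss on the regression head.
target_box = np.array([0.2, 0.3, 0.4, 0.5])     # arbitrary example target
box_loss = np.sum((box[0] - target_box) ** 2)

# Train on a weighted sum of the two losses (the weight 1.0 is arbitrary).
total_loss = cls_loss + 1.0 * box_loss
```

In practice the relative weight of the two losses is a hyperparameter that needs tuning, since the two heads are trained jointly through the shared features.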

SLIDE 13

Classification + Localization

Often pretrained on ImageNet (Transfer learning)

SLIDE 14

Aside: Human Pose Estimation


SLIDE 16

Where to attach the regression head?

Image → Convolution and Pooling → Final conv feature map → Fully-connected layers → Class scores → Softmax loss

  • After the conv layers: Overfeat, VGG
  • After the last FC layer: DeepPose, R-CNN

SLIDE 17

Object detection

SLIDE 18

Object Detection: Impact of Deep Learning

SLIDE 19

Object Detection as Regression?

Each image needs a different number of outputs!

SLIDE 20

Object Detection as Classification: Sliding Window


SLIDE 22

Object Detection as Classification: Sliding Window

Problem: Need to apply the CNN to a huge number of locations and scales, which is very computationally expensive!
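The sliding-window idea, and why it is expensive, can be sketched as a loop over positions and scales. Here `crop.mean()` is a stand-in for a real CNN classifier score, and the crop-based "rescaling" is a crude placeholder for proper image resizing:

```python
import numpy as np

def sliding_window_detect(image, window=64, stride=32, scales=(1.0, 0.5)):
    """Score a classifier at every window position and scale (illustrative only)."""
    detections = []
    for scale in scales:
        h = int(image.shape[0] * scale)
        w = int(image.shape[1] * scale)
        resized = image[:h, :w]  # crude stand-in for real image resizing
        for y in range(0, h - window + 1, stride):
            for x in range(0, w - window + 1, stride):
                crop = resized[y:y + window, x:x + window]
                score = crop.mean()  # stand-in for a CNN classifier score
                detections.append((score, x, y, scale))
    return detections
```

Even this toy setting evaluates the classifier at every grid position of every scale; with a deep CNN as the classifier, a full scale pyramid, and a fine stride, the number of forward passes per image becomes prohibitive.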

SLIDE 23

Region Proposals

SLIDE 24

R-CNN


SLIDE 28

Image → Convolution and Pooling → Final conv feature map → Fully-connected layers → Class scores (1000 classes) → Softmax loss

R-CNN Training

  • Step 1: Train (or download) a classification model for ImageNet

(AlexNet)

SLIDE 29

Image → Convolution and Pooling → Final conv feature map → Fully-connected layers → Class scores (21 classes) → Softmax loss

Re-initialize the final layer: it was 4096 x 1000, now it will be 4096 x 21.

R-CNN Training

  • Step 2: Fine-tune model for detection
  • Instead of 1000 ImageNet classes, want 20 object classes + background
  • Throw away final fully-connected layer, reinitialize from scratch
  • Keep training model using positive / negative regions from detection images
SLIDE 30

Image → Region proposals → Crop + warp → Convolution and Pooling (forward pass) → pool5 features → Save to disk

R-CNN Training

Step 3: Extract features

  • Extract region proposals for all images
  • For each region: warp to CNN input size, run forward through CNN, save pool5 features to disk
  • Have a big hard drive: features are ~200GB for PASCAL dataset!
SLIDE 31

Training image regions → cached region features, used as positive and negative samples for each per-class SVM (e.g., the cat SVM).

R-CNN Training

  • Step 4: Train one binary SVM per class to classify region features
SLIDE 32

Training image regions → cached region features → regression targets (dx, dy, dw, dh), in normalized coordinates:

  • (0, 0, 0, 0): proposal is good
  • (0.25, 0, 0, 0): proposal too far to the left
  • (0, 0, -0.125, 0): proposal too wide

R-CNN Training

  • Step 5 (bbox regression): For each class, train a linear regression model that maps cached features to offsets from the proposal to the ground-truth box, to make up for “slightly wrong” proposals
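A common parameterization of these targets, the one used in the R-CNN paper, normalizes center offsets by the proposal size and uses log ratios for width and height so the targets are scale-invariant. A sketch, with boxes given as (x_center, y_center, w, h):

```python
import numpy as np

def bbox_targets(proposal, gt):
    """R-CNN-style regression targets t = (tx, ty, tw, th).

    Boxes are (x_center, y_center, w, h). Center offsets are normalized by the
    proposal's width/height; size changes are log ratios.
    """
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    tx = (gx - px) / pw
    ty = (gy - py) / ph
    tw = np.log(gw / pw)
    th = np.log(gh / ph)
    return np.array([tx, ty, tw, th])
```

For example, a proposal that matches the ground truth exactly has target (0, 0, 0, 0), and a proposal whose center sits a quarter of its width to the left of the ground-truth center gets tx = 0.25, matching the "too far to the left" case above.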

SLIDE 33

R-CNN: Problems

  • Ad hoc training objectives

– Fine-tune network with softmax classifier (log loss)
– Train post-hoc linear SVMs (hinge loss)
– Train post-hoc bounding-box regressions (least squares)

  • Training is slow (84h), takes a lot of disk space
  • Inference (detection) is slow

– 47s / image with VGG16 [Simonyan & Zisserman, ICLR 2015]
– Fixed by SPP-net [He et al., ECCV 2014]

SLIDE 34

Fast R-CNN

SLIDE 35

Fast R-CNN

Share computation of convolutional layers between proposals for an image

SLIDE 36

Fast R-CNN: Region of Interest Pooling

Hi-res input image (3 x 800 x 600, with region proposal) → Convolution and Pooling → hi-res conv features (C x H x W, with region proposal) → fully-connected layers.

Problem: the fully-connected layers expect low-res conv features (C x h x w).

SLIDE 37

Fast R-CNN: Region of Interest Pooling

Hi-res input image (3 x 800 x 600, with region proposal) → Convolution and Pooling → hi-res conv features (C x H x W). Project the region proposal onto the conv feature map.

Problem: the fully-connected layers expect low-res conv features (C x h x w).

SLIDE 38

Fast R-CNN: Region of Interest Pooling

Hi-res input image (3 x 800 x 600, with region proposal) → Convolution and Pooling → hi-res conv features (C x H x W). Divide the projected region into an h x w grid.

Problem: the fully-connected layers expect low-res conv features (C x h x w).

SLIDE 39

Fast R-CNN: Region of Interest Pooling

Hi-res input image (3 x 800 x 600, with region proposal) → Convolution and Pooling → hi-res conv features (C x H x W). Max-pool within each grid cell to get RoI conv features (C x h x w) for the region proposal, matching what the fully-connected layers expect.

SLIDE 40

Fast R-CNN: Region of Interest Pooling

Hi-res input image (3 x 800 x 600, with region proposal) → Convolution and Pooling → hi-res conv features (C x H x W) → RoI conv features (C x h x w) → fully-connected layers. Gradients back-propagate through RoI pooling just as they do through ordinary max pooling.
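The RoI pooling step described above can be sketched in NumPy. The rounding of grid-cell boundaries here is one simple choice among several used in practice (RoI Align, for instance, interpolates instead of rounding):

```python
import numpy as np

def roi_max_pool(conv_feats, roi, out_h=2, out_w=2):
    """Max-pool a projected RoI on a (C, H, W) feature map into (C, out_h, out_w).

    roi = (x1, y1, x2, y2) in feature-map coordinates. Each output cell
    max-pools over its sub-window of the region; boundaries are rounded,
    and each cell is kept at least one element wide.
    """
    C = conv_feats.shape[0]
    x1, y1, x2, y2 = roi
    out = np.empty((C, out_h, out_w))
    # Grid-cell boundaries over the projected region.
    ys = np.linspace(y1, y2, out_h + 1).round().astype(int)
    xs = np.linspace(x1, x2, out_w + 1).round().astype(int)
    for i in range(out_h):
        for j in range(out_w):
            y_lo, y_hi = ys[i], max(ys[i + 1], ys[i] + 1)
            x_lo, x_hi = xs[j], max(xs[j + 1], xs[j] + 1)
            out[:, i, j] = conv_feats[:, y_lo:y_hi, x_lo:x_hi].max(axis=(1, 2))
    return out
```

Whatever the size of the proposal, the output is always C x out_h x out_w, which is exactly what lets a fixed fully-connected head consume arbitrary regions.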

SLIDE 41

Fast R-CNN: RoI Pooling

SLIDE 42

Fast R-CNN

Share computation of convolutional layers between proposals for an image

SLIDE 43

R-CNN vs SPP vs Fast R-CNN

Problem: Runtime dominated by region proposals!

SLIDE 44

Faster R-CNN

  • Make CNN do proposals!

– Solely based on CNN
– No external modules

  • Each step is end-to-end

Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. NIPS 2015

SLIDE 45

Faster R-CNN

  • Insert a Region Proposal Network (RPN) to predict proposals from the conv features, trained jointly with the rest of the network

  • Train with 4 losses:

– RPN: classify object / not object
– RPN: regress box coordinates
– Final classification score (object classes)
– Final box coordinates

SLIDE 46

Faster R-CNN: Make CNN do proposals!

SLIDE 47

Object Detection

Source: http://icml.cc/2016/tutorials/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf

SLIDE 49

Semantic Segmentation

SLIDE 50

Semantic Segmentation Idea: Sliding Window

Problem: Very inefficient! Does not reuse shared features between overlapping patches.
SLIDE 51

Semantic Segmentation Idea: Fully Convolutional


SLIDE 54

In-Network upsampling: “Unpooling”

SLIDE 55

In-Network upsampling: “Max Unpooling”
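Max unpooling remembers where each maximum came from during pooling and puts the pooled value back at exactly that position on the way up, filling everything else with zeros. A NumPy sketch for a single 2D channel:

```python
import numpy as np

def max_pool_with_indices(x, k=2):
    """k x k max pooling over a 2D array that also records each argmax position."""
    H, W = x.shape
    ph, pw = H // k, W // k
    pooled = np.zeros((ph, pw))
    indices = np.zeros((ph, pw), dtype=int)  # flat positions of the maxima in x
    for i in range(ph):
        for j in range(pw):
            window = x[i * k:(i + 1) * k, j * k:(j + 1) * k]
            r, c = np.unravel_index(window.argmax(), window.shape)
            pooled[i, j] = window[r, c]
            indices[i, j] = (i * k + r) * W + (j * k + c)
    return pooled, indices

def max_unpool(pooled, indices, shape):
    """Place each pooled value back at its remembered position; the rest stay zero."""
    out = np.zeros(shape)
    out.flat[indices.ravel()] = pooled.ravel()
    return out
```

Pairing each unpooling layer with the pooling layer at the matching depth of the downsampling path is what lets the network restore values to the right spatial locations.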

SLIDE 56

Learnable Upsampling: Transpose Convolution


SLIDE 58

Learnable Upsampling: Transpose Convolution

Other names:
  • Deconvolution (a bad name)
  • Upconvolution
  • Fractionally strided convolution
  • Backward strided convolution

SLIDE 59

Transpose Convolution: 1D Example
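The 1D case can be sketched directly: each input element stamps a scaled copy of the kernel into the output, shifted by the stride, and overlapping contributions are summed. The stride of 2 below mirrors the usual cs231n example:

```python
import numpy as np

def transpose_conv1d(x, k, stride=2):
    """1D transpose convolution: each input value stamps a scaled copy of the
    kernel into the output, shifted by `stride`; overlapping stamps are summed."""
    out = np.zeros(stride * (len(x) - 1) + len(k))
    for i, v in enumerate(x):
        out[i * stride:i * stride + len(k)] += v * k
    return out
```

Note the output is longer than the input, which is exactly the upsampling behavior needed on the decoder side of a fully convolutional segmentation network.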

SLIDE 60

Convolution as Matrix Multiplication (1D Example)
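Writing the 1D convolution as multiplication by a banded matrix C makes the name "transpose convolution" literal: upsampling is multiplication by C.T. A NumPy sketch with an illustrative length-3 kernel, stride 1, and no padding (as usual in deep learning, "convolution" here is really cross-correlation):

```python
import numpy as np

# 1D convolution (stride 1, no padding) of a length-6 input with a length-3
# kernel, written as multiplication by a banded 4 x 6 matrix C.
k = np.array([1., 2., 3.])
x = np.arange(6, dtype=float)

C = np.zeros((4, 6))
for i in range(4):
    C[i, i:i + 3] = k  # each row is the kernel, shifted by one position

y = C @ x      # ordinary convolution: maps length 6 down to length 4

# Transpose convolution is literally multiplication by C.T: it maps
# length 4 back up to length 6, with the same connectivity pattern.
up = C.T @ y
```

The same construction with stride 2 skips rows of C, which is why a strided transpose convolution upsamples by the stride factor.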


SLIDE 62

Semantic Segmentation Idea: Fully Convolutional

SLIDE 63

Computer Vision Tasks