Detection and Segmentation. CS60010: Deep Learning. Abir Das, IIT Kharagpur. PowerPoint PPT presentation.



SLIDE 1

Detection and Segmentation

CS60010: Deep Learning Abir Das

IIT Kharagpur

March 04 and 05, 2020

SLIDE 2

Detection RCNN Architectures YOLO Segmentation

Agenda

To get introduced to two important tasks of computer vision, detection and segmentation, along with the application of deep neural networks to these areas in recent years.

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 2 / 106

SLIDE 3

Detection as Regression

§ In detection you don’t know in advance the number of objects present
§ So, it is problematic to address detection as regression
§ How many output neurons should the network have?

SLIDE 4

Object Detection as Classification: Sliding Window

Apply a CNN to many different crops of the image; the CNN classifies each crop as object or background.

Dog? NO Cat? NO Background? YES

Source: CS231n course, Stanford University (Fei-Fei Li, Justin Johnson, Serena Yeung)

SLIDE 5

Detection as Classification

Apply a CNN to many different crops of the image; the CNN classifies each crop as object or background.

Dog? YES Cat? NO Background? NO


SLIDE 8

Detection as Classification

Apply a CNN to many different crops of the image; the CNN classifies each crop as object or background.

Dog? NO Cat? YES Background? NO

Problem: Need to apply the CNN to a huge number of locations, scales, and aspect ratios; very computationally expensive!
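To make that cost concrete, here is a small sketch (the image size, crop sizes and stride are illustrative assumptions, not values from the slides) that enumerates sliding-window crops; under this approach every crop needs its own CNN forward pass:

```python
def sliding_window_crops(img_w, img_h, box_sizes, stride):
    """Enumerate (x, y, w, h) crops over the image; every crop would need
    its own CNN forward pass under the sliding-window approach."""
    crops = []
    for w, h in box_sizes:
        for x in range(0, img_w - w + 1, stride):
            for y in range(0, img_h - h + 1, stride):
                crops.append((x, y, w, h))
    return crops

# Even this modest setting produces hundreds of crops; denser strides,
# more scales and more aspect ratios push the count far higher.
crops = sliding_window_crops(224, 224, [(32, 32), (64, 64), (128, 128)], stride=16)
print(len(crops))  # 339
```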

SLIDE 9

Detection as Classification

§ Need to apply the CNN to a huge number of locations, scales and aspect ratios
§ If the classifier is fast enough, this is feasible; this was the pre-deep-learning approach
§ With deep learning classifiers, first get a tiny subset of possible positions; only these are passed through the deep classifier
§ These possible positions are called ‘candidate proposals’ or ‘region proposals’

SLIDE 10

Detection with Region Proposals

Original Image → Proposal Generation → Region Proposals → Classification → Detections (Cat, Dog)

§ Generate and evaluate a few region proposals (far fewer than exhaustive search)
§ The proposal mechanism can take advantage of low-level cues (e.g., edges or connected components)
§ The classifier can be slower but more powerful

SLIDE 11

Selective Search

J Uijlings, K van de Sande, T Gevers, and A Smeulders, ‘Selective Search for Object Recognition’, IJCV 2013


SLIDE 13

EdgeBoxes

§ EdgeBoxes depends on a fast scoring/evaluation method for bounding boxes
§ First, edges are extracted for the whole image and grouped according to their similarity
§ The scoring idea builds on the fact that edges tend to correspond to object boundaries, so bounding boxes that tightly enclose a set of edges are likely to contain an object
§ Gets 75% recall with 800 boxes (vs 1400 for Selective Search) and is 40 times faster

C Zitnick and P Dollar, ‘Edge Boxes: Locating Object Proposals from Edges’, ECCV 2014

SLIDE 14

Region Proposals: Many Other Choices

J Hosang, R Benenson, P Dollar and B Schiele, ‘What makes for effective detection proposals?’, IEEE TPAMI 2016

SLIDE 15

R-CNN: Region Proposals + CNN Features

§ R Girshick, J Donahue, T Darrell and J Malik, ‘Rich feature hierarchies for accurate object detection and semantic segmentation’ (R-CNN), CVPR 2014

SLIDE 16

R-CNN: Region Proposals + CNN Features

Input image → Regions of Interest (RoI) from a proposal method (~2k)


SLIDE 21

R-CNN: Region Proposals + CNN Features

The parameters learned for this pipeline are: the ConvNet, the SVM classifier and the bounding-box regressors.
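The bounding-box regressors are commonly trained on a scale-invariant parameterization of the offset between a proposal and its ground-truth box (the form used in the R-CNN paper); a minimal sketch, with boxes given as (cx, cy, w, h):

```python
import math

def bbox_regression_targets(proposal, gt):
    """Scale-invariant regression targets from a proposal box to a
    ground-truth box; boxes are (cx, cy, w, h)."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))

def apply_bbox_deltas(proposal, deltas):
    """Inverse transform: apply predicted deltas to a proposal box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = deltas
    return (px + tx * pw, py + ty * ph, pw * math.exp(tw), ph * math.exp(th))

# Round trip: applying a proposal's own targets recovers the true box.
p, g = (10.0, 10.0, 20.0, 20.0), (12.0, 9.0, 30.0, 15.0)
recovered = apply_bbox_deltas(p, bbox_regression_targets(p, g))
```

Predicting (tx, ty, tw, th) rather than raw pixel coordinates keeps the targets comparable across proposals of very different sizes.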

SLIDE 22

R-CNN Results

Big improvement compared to pre-CNN methods

SLIDE 23

R-CNN Results

Bounding box regression helps a bit

SLIDE 24

R-CNN Results

Features from a deeper network help a lot

SLIDE 25

R-CNN: Problems

• Ad hoc training objectives
  • Fine-tune network with softmax classifier (log loss)
  • Train post-hoc linear SVMs (hinge loss)
  • Train post-hoc bounding-box regressions (least squares)
• Training is slow (84h), takes a lot of disk space
• Inference (detection) is slow
  • 47s / image with VGG16 [Simonyan & Zisserman, ICLR 2015]
  • Fixed by SPP-net [He et al., ECCV 2014]

Girshick et al., ‘Rich feature hierarchies for accurate object detection and semantic segmentation’, CVPR 2014. Slide copyright Ross Girshick, 2015; reproduced with permission.

SLIDE 26

R-CNN: Region Proposals + CNN Features

Pipeline components (region proposals, feature extraction, classifier), pre-2012 methods vs RCNN:

RCNN: Region Proposals = Selective Search; Feature Extraction = CNNs; Classifier = Linear

Source: CS7015 course, IIT Madras

SLIDE 27

Fast R-CNN

§ R Girshick, ‘Fast R-CNN’, ICCV 2015


SLIDE 30

Fast R-CNN: RoI Pooling

§ Hi-res input image: 3 x 640 x 480, with a region proposal
§ The CNN produces hi-res conv features: 512 x 20 x 15
§ Project the proposal onto the features; the projected region proposal is e.g. 512 x 18 x 8 (varies per proposal)
§ Divide the projected proposal into a 7x7 grid and max-pool within each cell
§ RoI conv features: 512 x 7 x 7 for the region proposal; the fully-connected layers expect low-res conv features of exactly this size
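The pooling step above can be sketched in NumPy. This is a simplified version for illustration; the bin-edge rounding here is an assumption, and real implementations differ in such details:

```python
import numpy as np

def roi_max_pool(feat, roi, out_size=7):
    """Max-pool the projected RoI of a (C, H, W) feature map down to
    (C, out_size, out_size), in the spirit of Fast R-CNN's RoI pooling."""
    c, H, W = feat.shape
    x0, y0, x1, y1 = roi  # RoI in feature-map coordinates
    out = np.zeros((c, out_size, out_size), dtype=feat.dtype)
    xs = np.linspace(x0, x1, out_size + 1).astype(int)
    ys = np.linspace(y0, y1, out_size + 1).astype(int)
    for i in range(out_size):
        for j in range(out_size):
            ya, yb = ys[i], max(ys[i + 1], ys[i] + 1)  # ensure non-empty bin
            xa, xb = xs[j], max(xs[j + 1], xs[j] + 1)
            out[:, i, j] = feat[:, ya:yb, xa:xb].max(axis=(1, 2))
    return out

feat = np.random.randn(512, 20, 15)          # e.g. conv5 features of a 640x480 image
pooled = roi_max_pool(feat, (2, 3, 10, 18))  # a projected proposal of varying size
print(pooled.shape)                          # (512, 7, 7)
```

Whatever the proposal's projected size, the output is always 512 x 7 x 7, which is what the fully-connected layers expect.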

SLIDE 31

Fast R-CNN

SLIDE 32

Fast R-CNN (Training)


SLIDE 34

R-CNN vs SPP-net vs Fast R-CNN

Girshick et al., ‘Rich feature hierarchies for accurate object detection and semantic segmentation’, CVPR 2014; He et al., ‘Spatial pyramid pooling in deep convolutional networks for visual recognition’, ECCV 2014; Girshick, ‘Fast R-CNN’, ICCV 2015

Problem: Runtime dominated by region proposals!

SLIDE 35

Fast R-CNN

Fast RCNN: Region Proposals = Selective Search; Feature Extraction = CNN; Classifier = CNN


SLIDE 38

Faster R-CNN

§ The bulk of the test time of Fast RCNN is dominated by region proposal generation
§ Fast RCNN saved computation by sharing feature generation across all proposals; can computation be shared for generating region proposals too?
§ The solution is to use the same CNN for region proposal generation as well
§ S Ren, K He, R Girshick and J Sun, ‘Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks’, NIPS 2015
§ The region proposal generation part is termed the Region Proposal Network (RPN)


SLIDE 41

Faster R-CNN

§ The RPN works as follows:
◮ A small 3x3 conv layer is applied on the last layer of the base conv-net
◮ It produces an activation feature map of the same size as the base conv-net’s last-layer feature map (7x7x512 in case of a VGG base)
◮ At each of the feature positions (7x7 = 49 for the VGG base), a set of bounding boxes (with different scales and aspect ratios) is evaluated for the following two questions:
  - Given the 512-d feature at that position, what is the probability that each of the bounding boxes centered at the position contains an object? (Classification)
  - Given the same 512-d feature, can you predict the correct bounding box? (Regression)
◮ These boxes are called ‘anchor boxes’
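Anchor enumeration can be sketched as below; the scales, aspect ratios and the 224-pixel input are illustrative assumptions, not the paper's exact settings:

```python
def generate_anchors(feat_size=7, img_size=224, scales=(64, 128, 256),
                     ratios=(0.5, 1.0, 2.0)):
    """Enumerate anchor boxes (cx, cy, w, h) centered at every feature-map
    position, at several scales and aspect ratios, RPN style."""
    stride = img_size / feat_size  # feature-map cell size in image pixels
    anchors = []
    for i in range(feat_size):
        for j in range(feat_size):
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # width/height chosen so area = s^2 and w/h = r
                    w, h = s * r ** 0.5, s / r ** 0.5
                    anchors.append((cx, cy, w, h))
    return anchors

anchors = generate_anchors()
print(len(anchors))  # 7*7 positions x 3 scales x 3 ratios = 441
```

The RPN then scores and regresses every one of these boxes from the single 512-d feature at its position.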


SLIDE 44

Faster R-CNN

SLIDE 50

Faster R-CNN

§ But how do we get the ground-truth data to train the RPN?

◮ Consider a ground-truth object and its corresponding bounding box
◮ Consider the projection of this image onto the conv5 layer
◮ Consider one such cell in the output; this cell corresponds to a patch in the original image
◮ Consider the center of this patch; we consider anchor boxes of different sizes centered there


SLIDE 52

Faster R-CNN

§ For each of these anchor boxes, we would want the classifier to predict 1 if the anchor box has a reasonable overlap (IoU > 0.7) with the true ground-truth box (classification)
§ Similarly, we would want the regression model to predict the true box (red) from the anchor box (pink) (regression)
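The IoU test above can be sketched directly. Note the 0.3 negative threshold below is an extra assumption (a common choice, used in the Faster R-CNN paper); the slide itself only states the > 0.7 positive rule:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def anchor_label(anchor, gt_box, pos_thresh=0.7, neg_thresh=0.3):
    """1 = positive, 0 = negative, -1 = ignored during RPN training."""
    ov = iou(anchor, gt_box)
    if ov > pos_thresh:
        return 1
    if ov < neg_thresh:
        return 0
    return -1
```

Anchors in the intermediate band contribute nothing to the classification loss, which keeps ambiguous boxes from confusing the training signal.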

SLIDE 53

Faster R-CNN

Jointly train with 4 losses:
1. RPN classify object / not object
2. RPN regress box coordinates
3. Final classification score (object classes)
4. Final box coordinates

SLIDE 54

Faster R-CNN

§ Faster R-CNN based architectures won a lot of challenges including:
◮ ImageNet Detection
◮ ImageNet Localization
◮ COCO Detection
◮ COCO Segmentation

SLIDE 55

Faster R-CNN

Faster RCNN: Region Proposals = CNN; Feature Extraction = CNN; Classifier = CNN

SLIDE 56

YOLO

§ The R-CNN pipelines separate proposal generation and proposal classification into two separate stages
§ Can we have an end-to-end architecture which does both proposal generation and classification simultaneously?
§ The solution gives the YOLO (You Only Look Once) architectures:
◮ J Redmon, S Divvala, R Girshick and A Farhadi, ‘You Only Look Once: Unified, Real-Time Object Detection’, CVPR 2016 (YOLO v1)
◮ J Redmon and A Farhadi, ‘YOLO9000: Better, Faster, Stronger’, CVPR 2017 (YOLO v2)
◮ J Redmon and A Farhadi, ‘YOLOv3: An Incremental Improvement’, arXiv preprint 2018 (YOLO v3)


slide-59
SLIDE 59

Detection RCNN Architectures YOLO Segmentation

YOLO

c w h x y P(cow) P(dog)

· ·

P(truck) S × S grid on input P(cow) P(dog)

· ·

P(truck)

  • Divide an image into S × S grids (S=7) and

consider B (=2) anchor boxes per grid cell

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 51 / 106 CS7015 course, IIT Madras

slide-60
SLIDE 60

Detection RCNN Architectures YOLO Segmentation

YOLO

c w h x y P(cow) P(dog)

· ·

P(truck) S × S grid on input P(cow) P(dog)

· ·

P(truck)

  • Divide an image into S × S grids (S=7) and

consider B (=2) anchor boxes per grid cell

  • For each such anchor box in each cell we are

interested in predicting 5 + C quantities

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 52 / 106 CS7015 course, IIT Madras

slide-61
SLIDE 61

Detection RCNN Architectures YOLO Segmentation

YOLO

c w h x y P(cow) P(dog)

· ·

P(truck) S × S grid on input P(cow) P(dog)

· ·

P(truck)

  • Divide an image into S × S grids (S=7) and

consider B (=2) anchor boxes per grid cell

  • For each such anchor box in each cell we are

interested in predicting 5 + C quantities

  • Probability (confidence) that this anchor box

contains a true object

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 53 / 106 CS7015 course, IIT Madras

slide-62
SLIDE 62

Detection RCNN Architectures YOLO Segmentation

YOLO

c w h x y P(cow) P(dog)

· ·

P(truck) S × S grid on input P(cow) P(dog)

· ·

P(truck)

  • Divide an image into S × S grids (S=7) and

consider B (=2) anchor boxes per grid cell

  • For each such anchor box in each cell we are

interested in predicting 5 + C quantities

  • Probability (confidence) that this anchor box

contains a true object

  • Width of the bounding box containing the true
  • bject

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 54 / 106 CS7015 course, IIT Madras

slide-63
SLIDE 63

Detection RCNN Architectures YOLO Segmentation

YOLO

c w h x y P(cow) P(dog)

· ·

P(truck) S × S grid on input P(cow) P(dog)

· ·

P(truck)

  • Divide an image into S × S grids (S=7) and

consider B (=2) anchor boxes per grid cell

  • For each such anchor box in each cell we are

interested in predicting 5 + C quantities

  • Probability (confidence) that this anchor box

contains a true object

  • Width of the bounding box containing the true
  • bject
  • Height of the bounding box containing the true
  • bject

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 55 / 106 CS7015 course, IIT Madras

slide-64
SLIDE 64

Detection RCNN Architectures YOLO Segmentation

YOLO

c w h x y P(cow) P(dog)

· ·

P(truck) S × S grid on input P(cow) P(dog)

· ·

P(truck)

  • Divide an image into S × S grids (S=7) and

consider B (=2) anchor boxes per grid cell

  • For each such anchor box in each cell we are

interested in predicting 5 + C quantities

  • Probability (confidence) that this anchor box

contains a true object

  • Width of the bounding box containing the true
  • bject
  • Height of the bounding box containing the true
  • bject
  • Center (x,y) of the bounding box

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 56 / 106 CS7015 course, IIT Madras

slide-65
SLIDE 65

Detection RCNN Architectures YOLO Segmentation

YOLO

c w h x y P(cow) P(dog)

· ·

P(truck) S × S grid on input P(cow) P(dog)

· ·

P(truck)

  • Divide an image into S × S grids (S=7) and

consider B (=2) anchor boxes per grid cell

  • For each such anchor box in each cell we are

interested in predicting 5 + C quantities

  • Probability (confidence) that this anchor box

contains a true object

  • Width of the bounding box containing the true
  • bject
  • Height of the bounding box containing the true
  • bject
  • Center (x,y) of the bounding box
  • Probability of the object in the bounding box

belonging to the Kth class (C values)

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 57 / 106 CS7015 course, IIT Madras

slide-66
SLIDE 66

Detection RCNN Architectures YOLO Segmentation

YOLO

c w h x y P(cow) P(dog)

· ·

P(truck) S × S grid on input P(cow) P(dog)

· ·

P(truck)

  • Divide an image into S × S grids (S=7) and

consider B (=2) anchor boxes per grid cell

  • For each such anchor box in each cell we are

interested in predicting 5 + C quantities

  • Probability (confidence) that this anchor box

contains a true object

  • Width of the bounding box containing the true
  • bject
  • Height of the bounding box containing the true
  • bject
  • Center (x,y) of the bounding box
  • Probability of the object in the bounding box

belonging to the Kth class (C values)

  • The output layer should contain SxSxBx(5+C)

elements

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 58 / 106

slide-67
SLIDE 67

Detection RCNN Architectures YOLO Segmentation

YOLO

c w h x y P(cow) P(dog)

· ·

P(truck) S × S grid on input P(cow) P(dog)

· ·

P(truck)

  • Divide an image into S × S grids (S=7) and

consider B (=2) anchor boxes per grid cell

  • For each such anchor box in each cell we are

interested in predicting 5 + C quantities

  • Probability (confidence) that this anchor box

contains a true object

  • Width of the bounding box containing the true
  • bject
  • Height of the bounding box containing the true
  • bject
  • Center (x,y) of the bounding box
  • Probability of the object in the bounding box

belonging to the Kth class (C values)

  • The output layer should contain SxSxBx(5+C)

elements

  • However, each grid cell in YOLO predicts only
  • ne object even if there are B anchor boxes per

cell

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 59 / 106

slide-68
SLIDE 68

Detection RCNN Architectures YOLO Segmentation

YOLO

[Figure: S × S grid on the input image; per-anchor output vector (c, w, h, x, y, P(cow), P(dog), …, P(truck))]

  • Divide an image into S × S grids (S = 7) and consider B (= 2) anchor boxes per grid cell
  • For each such anchor box in each cell we are interested in predicting 5 + C quantities:
  • Probability (confidence) that this anchor box contains a true object
  • Width of the bounding box containing the true object
  • Height of the bounding box containing the true object
  • Center (x, y) of the bounding box
  • Probability of the object in the bounding box belonging to the Kth class (C values)
  • The output layer should contain S × S × B × (5 + C) elements
  • The idea is that each grid cell tries to make two boundary box predictions to locate a single object

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 60 / 106

slide-69
SLIDE 69

Detection RCNN Architectures YOLO Segmentation

YOLO

[Figure: S × S grid on the input image; per-anchor output vector (c, w, h, x, y, P(cow), P(dog), …, P(truck))]

  • Divide an image into S × S grids (S = 7) and consider B (= 2) anchor boxes per grid cell
  • For each such anchor box in each cell we are interested in predicting 5 + C quantities:
  • Probability (confidence) that this anchor box contains a true object
  • Width of the bounding box containing the true object
  • Height of the bounding box containing the true object
  • Center (x, y) of the bounding box
  • Probability of the object in the bounding box belonging to the Kth class (C values)
  • The output layer should contain S × S × B × (5 + C) elements
  • Thus the output layer contains S × S × (B × 5 + C) elements

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 61 / 106
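As a worked check of the arithmetic above (a sketch; C = 20 is an assumption, matching the Pascal VOC setting commonly used with YOLO):

```python
# YOLO output tensor size: S x S cells, each predicting B boxes
# (5 numbers per box: confidence, x, y, w, h) plus C class probabilities.
S, B, C = 7, 2, 20  # C = 20 assumes Pascal VOC's 20 classes

per_box_naive = S * S * B * (5 + C)    # if class probs were predicted per box
per_cell_shared = S * S * (B * 5 + C)  # class probs shared per cell (YOLO)

print(per_box_naive)    # 7 * 7 * 2 * 25 = 2450
print(per_cell_shared)  # 7 * 7 * 30 = 1470
```

The second number, 1470, is the actual size of YOLO's output layer: because each cell predicts only one object, the C class probabilities are shared across the B boxes of that cell.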

slide-70
SLIDE 70

Detection RCNN Architectures YOLO Segmentation

YOLO

§ During inference/test phase, how do we interpret these S × S × (B × 5 + C) outputs? § For each cell we compute the bounding box, its confidence about having any object in it, and the type of the object

S × S grid on input

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 62 / 106 CS7015 course, IIT Madras

slide-80
SLIDE 80

Detection RCNN Architectures YOLO Segmentation

YOLO

§ During inference/test phase, how do we interpret these S × S × (B × 5 + C) outputs? § For each cell we compute the bounding box, its confidence about having any object in it, and the type of the object § Non-maximum suppression (NMS) is then applied to retain the most confident boxes

Final detections

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 72 / 106 CS7015 course, IIT Madras
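The NMS step mentioned above can be sketched in plain Python (a minimal greedy NMS; the IoU threshold of 0.5 is an assumption, not from the slides):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the most confident box, drop any remaining box
    that overlaps it too much, repeat with the survivors."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 and is suppressed
```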

slide-81
SLIDE 81

Detection RCNN Architectures YOLO Segmentation

Training YOLO

§ How do we train this network? § Consider a cell such that a true bounding box corresponds to this cell

S × S grid on input

§ Initially, the network with random weights will produce some values for these (5 + C) quantities § YOLO uses sum-squared error between the predictions and the ground truth to compute the loss. The following losses are computed:

◮ Classification Loss ◮ Localization Loss ◮ Confidence Loss

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 73 / 106

slide-82
SLIDE 82

Detection RCNN Architectures YOLO Segmentation

Training YOLO

Classification Loss:

$$\sum_{i=0}^{S^2} \mathbb{1}_i^{\text{obj}} \sum_{c \in \text{classes}} \big( p_i(c) - \hat{p}_i(c) \big)^2$$

where $\mathbb{1}_i^{\text{obj}} = 1$ if a ground truth object is in cell $i$, otherwise 0; $\hat{p}_i(c)$ is the predicted probability of an object of class $c$ in the $i$th cell; $p_i(c)$ is the ground truth label.

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 74 / 106

slide-83
SLIDE 83

Detection RCNN Architectures YOLO Segmentation

Training YOLO

Localization Loss: It measures the errors in the predicted bounding box location and size. The loss is computed only for the one box that is responsible for detecting the object.

$$\lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \Big[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \Big] + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \Big[ \big(\sqrt{w_i} - \sqrt{\hat{w}_i}\big)^2 + \big(\sqrt{h_i} - \sqrt{\hat{h}_i}\big)^2 \Big]$$

where $\mathbb{1}_{ij}^{\text{obj}} = 1$ if the $j$th bounding box is responsible for detecting the ground truth object in cell $i$, otherwise 0. By square-rooting the box dimensions, some parity is maintained between boxes of different sizes: the same absolute error is not penalized equally for large and small boxes, but weighs more heavily on small ones.

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 75 / 106
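The effect of the square root can be seen with a small numeric example (the box sizes are illustrative assumptions, not from the slides): the same 10-pixel width error gives identical squared error for a large and a small box, but a much larger square-rooted term for the small box.

```python
import math

def sqrt_term(w_true, w_pred):
    # The (sqrt(w) - sqrt(w_hat))^2 term of the localization loss
    return (math.sqrt(w_true) - math.sqrt(w_pred)) ** 2

# Same 10-pixel absolute error on a large and a small box:
big = sqrt_term(500, 510)   # large box: ~0.05
small = sqrt_term(20, 30)   # small box: ~1.01

print((510 - 500) ** 2 == (30 - 20) ** 2)  # True: plain squared error is equal
print(small > big)                         # True: sqrt penalizes the small box more
```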

slide-84
SLIDE 84

Detection RCNN Architectures YOLO Segmentation

Training YOLO

Confidence Loss: For a box responsible for predicting an object

$$\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \big( C_i - \hat{C}_i \big)^2$$

where $\mathbb{1}_{ij}^{\text{obj}} = 1$ if the $j$th bounding box is responsible for detecting the ground truth object in cell $i$, otherwise 0. $\hat{C}_i$ is the predicted probability that there is an object in the $i$th cell; $C_i$ is the ground truth label (of whether an object is there).

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 76 / 106

slide-85
SLIDE 85

Detection RCNN Architectures YOLO Segmentation

Training YOLO

Confidence Loss: For a box that predicts 'no object' inside

$$\lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \big( C_i - \hat{C}_i \big)^2$$

where $\mathbb{1}_{ij}^{\text{noobj}} = 1$ if the $j$th bounding box in cell $i$ is not responsible for detecting any ground truth object, otherwise 0. $\hat{C}_i$ is the predicted probability that there is an object in the $i$th cell; $C_i$ is the ground truth label (of whether an object is there). The total loss is the sum of all the above losses.

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 77 / 106
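Putting the three pieces together, the total loss can be sketched over toy data (a simplified single-image sketch; the weights lambda_coord = 5 and lambda_noobj = 0.5 are taken from the YOLO paper, and the field names are purely illustrative):

```python
import math

LAMBDA_COORD, LAMBDA_NOOBJ = 5.0, 0.5  # weights used in the YOLO paper

def yolo_loss(cells):
    """cells: one dict per responsible predictor (i, j), with keys 'obj' (bool),
    'true'/'pred' boxes (x, y, w, h), confidences 'C_true'/'C_pred',
    and class probability vectors 'p_true'/'p_pred'."""
    loc = conf_obj = conf_noobj = cls = 0.0
    for cell in cells:
        if cell['obj']:
            (x, y, w, h), (xh, yh, wh, hh) = cell['true'], cell['pred']
            loc += (x - xh) ** 2 + (y - yh) ** 2                       # center error
            loc += (math.sqrt(w) - math.sqrt(wh)) ** 2 \
                 + (math.sqrt(h) - math.sqrt(hh)) ** 2                 # size error
            conf_obj += (cell['C_true'] - cell['C_pred']) ** 2         # object confidence
            cls += sum((p - ph) ** 2
                       for p, ph in zip(cell['p_true'], cell['p_pred']))  # class error
        else:
            conf_noobj += (cell['C_true'] - cell['C_pred']) ** 2       # no-object confidence
    return LAMBDA_COORD * loc + conf_obj + LAMBDA_NOOBJ * conf_noobj + cls

cell_with_dog = {'obj': True, 'true': (0.5, 0.5, 0.2, 0.4), 'pred': (0.4, 0.5, 0.2, 0.4),
                 'C_true': 1.0, 'C_pred': 0.8, 'p_true': [0.0, 1.0], 'p_pred': [0.1, 0.7]}
empty_cell = {'obj': False, 'C_true': 0.0, 'C_pred': 0.3, 'p_true': [], 'p_pred': []}
print(yolo_loss([cell_with_dog, empty_cell]))  # 5*0.01 + 0.04 + 0.5*0.09 + 0.10 = 0.235
```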

slide-86
SLIDE 86

Detection RCNN Architectures YOLO Segmentation

Training YOLO

Method        Pascal 2007 mAP   Speed
DPM v5             33.7         0.07 FPS (14 sec/image)
RCNN               66.0         0.05 FPS (20 sec/image)
Fast RCNN          70.0         0.5 FPS (2 sec/image)
Faster RCNN        73.2         7 FPS (140 msec/image)
YOLO               69.0         45 FPS (22 msec/image)

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 78 / 106 CS7015 course, IIT Madras

slide-87
SLIDE 87

Detection RCNN Architectures YOLO Segmentation

Segmentation

Semantic Segmentation: label every pixel, e.g. GRASS, CAT, TREE, SKY. No objects, just pixels.

Fei-Fei Li & Justin Johnson & Serena Yeung

Lecture 11 - May 10, 2018 8

Other Computer Vision Tasks:
  • Classification + Localization: single object (CAT)
  • Semantic Segmentation: no objects, just pixels (GRASS, CAT, TREE, SKY)
  • Object Detection: multiple objects (DOG, DOG, CAT)
  • Instance Segmentation: multiple objects (DOG, DOG, CAT)

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 79 / 106 Source: cs231n course, Stanford University

slide-88
SLIDE 88

Detection RCNN Architectures YOLO Segmentation

Segmentation


Semantic Segmentation Idea: Sliding Window

Full image → extract patch → classify center pixel with CNN (Cow, Cow, Grass)

Problem: Very inefficient! Not reusing shared features between overlapping patches

Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013 Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 80 / 106 Source: cs231n course, Stanford University

slide-89
SLIDE 89

Detection RCNN Architectures YOLO Segmentation

Segmentation


Semantic Segmentation Idea: Fully Convolutional

Input: 3 x H x W → Convolutions: D x H x W → Scores: C x H x W → argmax → Predictions: H x W. Design a network as a bunch of convolutional layers to make predictions for all pixels at once!

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 81 / 106 Source: cs231n course, Stanford University

slide-90
SLIDE 90

Detection RCNN Architectures YOLO Segmentation

Segmentation


Semantic Segmentation Idea: Fully Convolutional

Input: 3 x H x W → Convolutions: D x H x W → Scores: C x H x W → argmax → Predictions: H x W. Design a network as a bunch of convolutional layers to make predictions for all pixels at once! Problem: convolutions at original image resolution will be very expensive ...

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 82 / 106 Source: cs231n course, Stanford University

slide-91
SLIDE 91

Detection RCNN Architectures YOLO Segmentation

Segmentation


Semantic Segmentation Idea: Fully Convolutional

Input: 3 x H x W Predictions: H x W Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network! High-res: D1 x H/2 x W/2 High-res: D1 x H/2 x W/2 Med-res: D2 x H/4 x W/4 Med-res: D2 x H/4 x W/4 Low-res: D3 x H/4 x W/4

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015 Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015

Downsampling: Pooling, strided convolution Upsampling: ???

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 83 / 106 Source: cs231n course, Stanford University

slide-92
SLIDE 92

Detection RCNN Architectures YOLO Segmentation

Segmentation


In-Network upsampling: “Unpooling”

Nearest Neighbor: input 2 x 2 [[1, 2], [3, 4]] → output 4 x 4 [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]. "Bed of Nails": input 2 x 2 [[1, 2], [3, 4]] → output 4 x 4 with each value placed at the top-left of its 2 x 2 block and zeros elsewhere.

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 84 / 106 Source: cs231n course, Stanford University
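Both fixed unpooling schemes above can be written in a few lines of plain Python (a minimal sketch for a 2x factor):

```python
def nearest_neighbor_unpool(x, factor=2):
    """Repeat each value into a factor x factor block."""
    return [[row[c // factor] for c in range(len(row) * factor)]
            for row in x for _ in range(factor)]

def bed_of_nails_unpool(x, factor=2):
    """Place each value at the top-left of its block, zeros elsewhere."""
    h, w = len(x), len(x[0])
    out = [[0] * (w * factor) for _ in range(h * factor)]
    for r in range(h):
        for c in range(w):
            out[r * factor][c * factor] = x[r][c]
    return out

x = [[1, 2], [3, 4]]
print(nearest_neighbor_unpool(x))
# [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
print(bed_of_nails_unpool(x))
# [[1, 0, 2, 0], [0, 0, 0, 0], [3, 0, 4, 0], [0, 0, 0, 0]]
```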

slide-93
SLIDE 93

Detection RCNN Architectures YOLO Segmentation

Segmentation


In-Network upsampling: “Max Unpooling”

Max Pooling: remember which element was max; e.g. pooling the 4 x 4 input [[1, 2, 6, 3], [3, 5, 2, 1], [1, 2, 2, 1], [7, 3, 4, 8]] gives the 2 x 2 output [[5, 6], [7, 8]]. Max Unpooling: after the rest of the network, a 2 x 2 input [[1, 2], [3, 4]] is upsampled to 4 x 4 by placing each value at the position remembered from the corresponding pooling layer, zeros elsewhere. Downsampling and upsampling layers come in corresponding pairs.

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 85 / 106 Source: cs231n course, Stanford University
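The remember-the-argmax mechanism can be sketched directly (a plain-Python sketch using the slide's 4 x 4 example):

```python
def max_pool_with_indices(x, k=2):
    """k x k max pooling that also remembers where each max came from."""
    h, w = len(x) // k, len(x[0]) // k
    pooled = [[0] * w for _ in range(h)]
    idx = {}
    for r in range(h):
        for c in range(w):
            block = [(x[r * k + i][c * k + j], (r * k + i, c * k + j))
                     for i in range(k) for j in range(k)]
            pooled[r][c], idx[(r, c)] = max(block)  # value and its position
    return pooled, idx

def max_unpool(x, idx, out_h, out_w):
    """Place each value at the position remembered by the pooling layer."""
    out = [[0] * out_w for _ in range(out_h)]
    for (r, c), (rr, cc) in idx.items():
        out[rr][cc] = x[r][c]
    return out

inp = [[1, 2, 6, 3], [3, 5, 2, 1], [1, 2, 2, 1], [7, 3, 4, 8]]
pooled, idx = max_pool_with_indices(inp)
print(pooled)  # [[5, 6], [7, 8]]
# Later in the network, a 2 x 2 map is upsampled using the same positions:
print(max_unpool([[1, 2], [3, 4]], idx, 4, 4))
```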

slide-94
SLIDE 94

Detection RCNN Architectures YOLO Segmentation

Segmentation


Learnable Upsampling: Transpose Convolution

Recall: normal 3 x 3 convolution, stride 1, pad 1. Input: 4 x 4, Output: 4 x 4; each output is a dot product between filter and input.

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 86 / 106 Source: cs231n course, Stanford University

slide-97
SLIDE 97

Detection RCNN Architectures YOLO Segmentation

Segmentation


Learnable Upsampling: Transpose Convolution

Recall: normal 3 x 3 convolution, stride 2, pad 1. Input: 4 x 4, Output: 2 x 2; each output is a dot product between filter and input.

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 89 / 106 Source: cs231n course, Stanford University

slide-100
SLIDE 100

Detection RCNN Architectures YOLO Segmentation

Segmentation


Learnable Upsampling: Transpose Convolution

3 x 3 transpose convolution, stride 1, pad 0. Input: 2 x 2, Output: 4 x 4; the input gives the weight for the filter.

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 92 / 106 Source: cs231n course, Stanford University

slide-101
SLIDE 101

Detection RCNN Architectures YOLO Segmentation

Segmentation


Learnable Upsampling: Transpose Convolution

3 x 3 transpose convolution, stride 1, pad 0. Input: 2 x 2, Output: 4 x 4; the input gives the weight for the filter: input element x1 stamps a copy of the filter scaled by x1 into the output (w1x1, w2x1, …, w9x1).

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 93 / 106 Source: cs231n course, Stanford University

slide-102
SLIDE 102

Detection RCNN Architectures YOLO Segmentation

Segmentation


Learnable Upsampling: Transpose Convolution

3 x 3 transpose convolution, stride 1, pad 0. Input: 2 x 2, Output: 4 x 4; the input gives the weight for the filter, and contributions are summed where the stamped copies of the filter overlap. For two horizontally adjacent inputs x1, x2, the first output row is w1x1, w2x1 + w1x2, w3x1 + w2x2, w3x2 (and similarly for the remaining rows).

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 94 / 106 Source: cs231n course, Stanford University
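This stamp-and-sum view of transpose convolution can be implemented directly (a minimal sketch with no padding; with a 3 x 3 filter and stride 1, a 2 x 2 input yields a 4 x 4 output as on the slide):

```python
def transpose_conv2d(x, f, stride=1):
    """Transpose convolution: each input value stamps a copy of the
    filter, scaled by that value, into the output; overlaps are summed."""
    xh, xw = len(x), len(x[0])
    fh, fw = len(f), len(f[0])
    oh, ow = (xh - 1) * stride + fh, (xw - 1) * stride + fw
    out = [[0.0] * ow for _ in range(oh)]
    for r in range(xh):
        for c in range(xw):
            for i in range(fh):
                for j in range(fw):
                    out[r * stride + i][c * stride + j] += x[r][c] * f[i][j]
    return out

x = [[1.0, 2.0], [3.0, 4.0]]        # 2 x 2 input
f = [[1.0] * 3 for _ in range(3)]   # 3 x 3 filter of ones
y = transpose_conv2d(x, f)          # stride 1, no padding -> 4 x 4 output
print(y[0])  # [1.0, 3.0, 3.0, 2.0]: x1, x1 + x2 where the stamps overlap, ...
```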

slide-103
SLIDE 103

Detection RCNN Architectures YOLO Segmentation

Segmentation


Semantic Segmentation Idea: Fully Convolutional

Input: 3 x H x W Predictions: H x W Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network! High-res: D1 x H/2 x W/2 High-res: D1 x H/2 x W/2 Med-res: D2 x H/4 x W/4 Med-res: D2 x H/4 x W/4 Low-res: D3 x H/4 x W/4

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015 Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015

Downsampling: Pooling, strided convolution Upsampling: Unpooling or strided transpose convolution


Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 95 / 106 Source: cs231n course, Stanford University

slide-104
SLIDE 104

Detection RCNN Architectures YOLO Segmentation

Segmentation: Deconvolutional Network

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 96 / 106 H Noh, S Hong and B Han, ‘Learning Deconvolution Network for Semantic Segmentation’, ICCV 2015

slide-105
SLIDE 105

Detection RCNN Architectures YOLO Segmentation

Segmentation: SegNet

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 97 / 106 V Badrinarayanan, A Kendall and R Cipolla, ‘SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation’, PAMI 2017

slide-106
SLIDE 106

Detection RCNN Architectures YOLO Segmentation

Instance Segmentation

[Figure: a crowd labeled Person 1 … Person 5 (instances) vs. simply ‘Person’ (semantic)]

Object Detection ✓   Semantic Segmentation ✓   Instance Segmentation ?

§ Instance segmentation not only wants to detect individual object instances but also wants to have a segmentation mask of each instance § What can be a naive idea?

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 98 / 106 Source: Kaiming He, ICCV 2017

slide-107
SLIDE 107

Detection RCNN Architectures YOLO Segmentation

Instance Segmentation

  • Mask R-CNN = Faster R-CNN with FCN on RoIs

Faster R-CNN FCN on RoI

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 99 / 106 Source: Kaiming He, ICCV 2017

slide-108
SLIDE 108

Detection RCNN Architectures YOLO Segmentation

Instance Segmentation: Broad Strategies

Instance Segmentation Methods:
  • R-CNN driven (via proposals): detect each instance (Person 1 … Person 5), then segment it
  • FCN driven: label the ‘person’ pixels first, then separate them into instances (Person 1 … Person 5)

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 100 / 106 Source: Kaiming He, ICCV 2017

slide-109
SLIDE 109

Detection RCNN Architectures YOLO Segmentation

Instance Segmentation: Mask-RCNN

Parallel Heads

  • Easy, fast to implement and train

  • (slow) R-CNN: Feat. → step 1: cls, step 2: bbox reg
  • Fast/er R-CNN: Feat. → cls and bbox reg in parallel
  • Mask R-CNN: Feat. → cls, bbox reg and mask in parallel

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 101 / 106 Source: Kaiming He, ICCV 2017

slide-110
SLIDE 110

Detection RCNN Architectures YOLO Segmentation

Instance Segmentation: Mask-RCNN

§ Mask R-CNN adopts the same two-stage procedure as Faster R-CNN, with an identical first stage [i.e., RPN] § In the second stage, in addition to class prediction and bounding box regression, Mask R-CNN outputs, in parallel, a binary mask for each RoI § The mask branch has a Km² dimensional output for each RoI [a binary mask of m × m resolution, one for each of the K classes] § RoIPool breaks pixel-to-pixel translation-equivariance

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 102 / 106
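The mask-branch output size can be checked numerically (K = 80 and m = 28 are assumptions taken from the Mask R-CNN paper's COCO setting, not from the slide):

```python
K, m = 80, 28        # COCO classes; 28 x 28 mask resolution (Mask R-CNN paper)
per_roi = K * m * m  # one m x m binary mask per class, per RoI
print(per_roi)       # 62720 outputs per RoI
```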

slide-111
SLIDE 111

Detection RCNN Architectures YOLO Segmentation

Instance Segmentation: Mask-RCNN

  • RoIPool breaks pixel-to-pixel translation-equivariance

RoIPool coordinate quantization: the original RoI is snapped to a quantized RoI aligned with the feature map grid

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 103 / 106 Source: Kaiming He, ICCV 2017

slide-112
SLIDE 112

Detection RCNN Architectures YOLO Segmentation

Instance Segmentation: Mask-RCNN

RoIAlign

A variable-size RoI on the conv feature map is converted to a fixed-dimensional representation; the output values at regularly placed grid points are computed by bilinear interpolation (no quantization).

FAQ: how to sample grid points within a cell?
  • 4 regular points in 2 x 2 sub-cells
  • other implementations could work

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 104 / 106 Source: Kaiming He, ICCV 2017
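The bilinear interpolation at the heart of RoIAlign can be sketched in plain Python (a single-channel sketch; the 2 x 2 feature map values are illustrative):

```python
def bilinear(feat, y, x):
    """Bilinearly interpolate a 2D feature map at fractional (y, x),
    as RoIAlign does for each sampled grid point (no quantization)."""
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, len(feat) - 1), min(x0 + 1, len(feat[0]) - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx  # blend along x, top row
    bot = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx  # blend along x, bottom row
    return top * (1 - dy) + bot * dy                   # blend along y

feat = [[0.0, 10.0], [20.0, 30.0]]
print(bilinear(feat, 0.0, 0.5))  # 5.0: halfway along the top row
print(bilinear(feat, 0.5, 0.5))  # 15.0: center of the four values
```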

slide-113
SLIDE 113

Detection RCNN Architectures YOLO Segmentation

Instance Segmentation: Mask-RCNN

Mask R-CNN results on COCO (disconnected object)

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 105 / 106 Source: Kaiming He, ICCV 2017

slide-114
SLIDE 114

Detection RCNN Architectures YOLO Segmentation

Instance Segmentation: Mask-RCNN

Mask R-CNN results on CityScapes

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 106 / 106 Source: Kaiming He, ICCV 2017

slide-115
SLIDE 115

Detection RCNN Architectures YOLO Segmentation

Instance Segmentation: Mask-RCNN

Mask R-CNN results on COCO

Failure case: recognition

not a kite

Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 107 / 106 Source: Kaiming He, ICCV 2017