Detection and Segmentation
CS60010: Deep Learning Abir Das
IIT Kharagpur
March 04 and 05, 2020

Outline: Detection, RCNN Architectures, YOLO, Segmentation

Agenda
To get introduced to two important tasks of computer vision: detection and segmentation
Abir Das (IIT Kharagpur) CS60010 March 04 and 05, 2020 2 / 106
Fei-Fei Li & Justin Johnson & Serena Yeung
Lecture 11 - May 10, 2018 47
Apply a CNN to many different crops of the image; the CNN classifies each crop as object or background.

Example crops:
Dog? NO    Cat? NO    Background? YES
Dog? YES   Cat? NO    Background? NO
Dog? YES   Cat? NO    Background? NO
Dog? NO    Cat? YES   Background? NO
Dog? NO    Cat? YES   Background? NO

Problem: Need to apply the CNN to a huge number of locations, scales and aspect ratios, which is very computationally expensive!

Source: CS231n course, Stanford University
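The cost of classifying every crop can be made concrete by counting the crops. The sketch below is illustrative only: the image size, window scales, aspect ratios and stride are assumptions chosen for the example, not values from the lecture.

```python
def count_windows(img_h, img_w, win_h, win_w, stride):
    """Number of win_h x win_w crops that fit in the image at the given stride."""
    if win_h > img_h or win_w > img_w:
        return 0
    rows = (img_h - win_h) // stride + 1
    cols = (img_w - win_w) // stride + 1
    return rows * cols

# Hypothetical settings: a 480 x 640 image, three scales, three aspect ratios.
scales = [64, 128, 256]
aspect_ratios = [0.5, 1.0, 2.0]  # width / height

total = 0
for s in scales:
    for ar in aspect_ratios:
        win_w = int(round(s * ar ** 0.5))
        win_h = int(round(s / ar ** 0.5))
        total += count_windows(480, 640, win_h, win_w, stride=8)

print(total)  # every one of these crops would need its own CNN forward pass
```

Even with a coarse stride the count runs into the thousands, which is why the number of crops has to be cut down before the CNN is applied.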
[Figure: Original Image → Proposal Generation → Region Proposals → Classification → Detections (Cat, Dog)]
J Uijlings, K van de Sande, T Gevers, and A Smeulders, ‘Selective Search for Object Recognition’, IJCV 2013
C Zitnick and P Dollar, ‘Edge Boxes: Locating Object Proposals from Edges’, ECCV 2014
J Hosang, R Benenson, P Dollar and B Schiele, ‘What makes for effective detection proposals?’, IEEE TPAMI 2016
Regions of Interest (RoI) from a proposal method (~2k)
Input image
Big improvement compared to pre-CNN methods
Bounding box regression helps a bit
Features from a deeper network help a lot
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014. Slide copyright Ross Girshick, 2015; source. Reproduced with permission.
Pipeline comparison (Pre 2012, RCNN) by stage: region proposals, feature extraction, classifier
RCNN: Region Proposals: Selective Search; Feature Extraction: CNNs; Classifier: Linear
Source: CS7015 course, IIT Madras
Hi-res input image: 3 x 640 x 480, with region proposal
CNN
Hi-res conv features: 512 x 20 x 15
Project proposal: the projected region proposal is e.g. 512 x 18 x 8 (varies per proposal)
Divide projected proposal into 7x7 grid, max-pool within each cell
RoI conv features: 512 x 7 x 7 for region proposal
Fully-connected layers expect low-res conv features: 512 x 7 x 7
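The pooling step described here (divide the projected proposal into a 7x7 grid and max-pool within each cell) can be sketched in NumPy. This is a minimal illustration, not the original Fast R-CNN code: the function name `roi_max_pool`, the end-exclusive RoI convention, and the way uneven bins are split are all assumptions made for the example.

```python
import numpy as np

def roi_max_pool(features, roi, out_size=7):
    """features: (C, H, W); roi: (x0, y0, x1, y1) in feature-map cells, end-exclusive."""
    x0, y0, x1, y1 = roi
    region = features[:, y0:y1, x0:x1]
    C, h, w = region.shape
    # Bin edges; bins may differ in size when h or w is not divisible by out_size.
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    out = np.zeros((C, out_size, out_size), dtype=features.dtype)
    for i in range(out_size):
        for j in range(out_size):
            # Force every bin to contain at least one cell.
            ya, yb = ys[i], max(ys[i + 1], ys[i] + 1)
            xa, xb = xs[j], max(xs[j + 1], xs[j] + 1)
            out[:, i, j] = region[:, ya:yb, xa:xb].max(axis=(1, 2))
    return out

# Sizes from the slide: 512 x 20 x 15 conv features, an 18 x 8 projected proposal.
features = np.random.rand(512, 20, 15).astype(np.float32)
pooled = roi_max_pool(features, roi=(3, 1, 11, 19))
print(pooled.shape)  # (512, 7, 7)
```

Whatever the size of the projected proposal, the output is always 512 x 7 x 7, which is what the fully-connected layers expect.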
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014
He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014
Girshick, “Fast R-CNN”, ICCV 2015
Problem: Runtime dominated by region proposals!
Pipeline comparison (Pre 2012, RCNN, Fast RCNN) by stage: region proposals, feature extraction, classifier
Fast RCNN: Region Proposals: Selective Search; Feature Extraction: CNN; Classifier: CNN
◮ A small 3x3 conv layer is applied on the last layer of the base conv-net.
◮ It produces an activation feature map of the same size as the base conv-net's last-layer feature map (7x7x512 in case of a VGG base).
◮ At each of the feature positions (7x7=49 for a VGG base), a set of bounding boxes (with different scales and aspect ratios) is evaluated for the following two questions:
   1. Given the 512d feature at that position, what is the probability that each of the bounding boxes centered at the position contains an object? (Classification)
   2. Given the same 512d feature, can you predict the correct bounding box? (Regression)
◮ These boxes are called ‘anchor boxes’.
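To make the idea of anchor boxes concrete, the sketch below places a set of boxes with different scales and aspect ratios at the center of every feature position. The stride of 32 pixels, the three scales and the three ratios are assumptions for illustration; the slide only fixes the 7x7 grid.

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride, scales, ratios):
    """Return an (N, 4) array of anchors as (x1, y1, x2, y2) in image pixels."""
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            # Center of the image patch that this feature position corresponds to.
            cx = (j + 0.5) * stride
            cy = (i + 0.5) * stride
            for s in scales:
                for r in ratios:  # r = height / width; the area stays s * s
                    w = s / np.sqrt(r)
                    h = s * np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

anchors = make_anchors(7, 7, stride=32, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0))
print(anchors.shape)  # (441, 4): 49 positions x 9 anchors each
```

The classification and regression questions are then asked once per row of this array.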
[Figure: Input → Conv → Max-pool]

◮ Consider a ground truth object and its corresponding bounding box.
◮ Consider the projection of this image.
◮ Consider one such cell in the output.
◮ This cell corresponds to a patch in the input image.
◮ Consider the center of this patch.
◮ We consider anchor boxes of different sizes.
◮ Classification: for each of these anchor boxes, we would want the classifier to predict 1 if this anchor box has a reasonable overlap (IoU > 0.7) with the ground truth box.
◮ Regression: similarly, we would want the regression model to predict the true box (red) from the anchor box (pink).
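The two targets for an anchor box can be sketched as follows: a 0/1 label from the IoU > 0.7 rule stated above, and a regression target that maps the anchor onto the true box. The offset parameterization (center shifts scaled by the anchor size, log ratios for width and height) is the one commonly used in the R-CNN family; the slide does not spell it out, so treat it as an assumption, along with the example boxes.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def regression_targets(anchor, gt):
    """Offsets (tx, ty, tw, th) that map the anchor onto the ground truth box."""
    ax, ay = (anchor[0] + anchor[2]) / 2, (anchor[1] + anchor[3]) / 2
    aw, ah = anchor[2] - anchor[0], anchor[3] - anchor[1]
    gx, gy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    return np.array([(gx - ax) / aw, (gy - ay) / ah, np.log(gw / aw), np.log(gh / ah)])

# Hypothetical boxes for illustration.
gt = (50.0, 50.0, 150.0, 150.0)
anchor = (55.0, 55.0, 155.0, 155.0)

label = 1 if iou(anchor, gt) > 0.7 else 0   # classification target
targets = regression_targets(anchor, gt)    # regression target
print(label, targets)
```

Only anchors with label 1 contribute to the regression loss, since an anchor with little overlap has no meaningful true box to regress to.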
Jointly train with 4 losses:
1. RPN classify object / not object
2. RPN regress box coordinates
3. Final classification score (object classes)
4. Final box coordinates
◮ ImageNet Detection
◮ ImageNet Localization
◮ COCO Detection
◮ COCO Segmentation
Pipeline comparison (Pre 2012, RCNN, Fast RCNN, Faster RCNN) by stage: region proposals, feature extraction, classifier
Faster RCNN: Region Proposals: CNN; Feature Extraction: CNN; Classifier: CNN
◮ YOLO v1: J Redmon, S Divvala, R Girshick and A Farhadi, ‘You Only Look Once: Unified, Real-Time Object Detection’, CVPR 2016
◮ YOLO v2: J Redmon and A Farhadi, ‘YOLO9000: Better, Faster, Stronger’, CVPR 2017
◮ YOLO v3: J Redmon and A Farhadi, ‘YOLOv3: An Incremental Improvement’, arXiv preprint 2018
[Figure: S × S grid on input; each grid cell predicts the vector (c, w, h, x, y, P(cow), P(dog), ..., P(truck))]

◮ Consider B (=2) anchor boxes per grid cell.
◮ For each box we are interested in predicting 5 + C quantities:
   c: the probability (confidence) that the box contains a true object
   w, h: the width and height of the bounding box
   x, y: the center of the bounding box
   P(class): the probability of the object belonging to the Kth class (C values)
◮ With one box per cell, the output layer contains S × S × (5 + C) elements.
◮ YOLO makes B (=2) boundary box predictions to locate a single object in each grid cell, and the C class probabilities are shared by the boxes of a cell.
◮ The output layer therefore contains S × S × (5B + C) elements.
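The size of the output layer can be checked with a short sketch. It assumes, as in the YOLO v1 paper, that the B boxes of a cell share one set of C class probabilities, so each cell holds B*5 + C numbers. S = 7, C = 20, the memory layout (boxes first, then class probabilities) and the helper name `decode_cell` are assumptions for illustration; the (c, w, h, x, y) order follows the vector shown on the slide.

```python
import numpy as np

S, B, C = 7, 2, 20  # S and C are assumed values; B = 2 is from the slide
output = np.random.rand(S, S, B * 5 + C)  # stand-in for the network output

def decode_cell(cell_vector, B, C):
    """Split one cell's vector into B boxes (c, w, h, x, y) and C class probabilities."""
    boxes = cell_vector[:B * 5].reshape(B, 5)
    class_probs = cell_vector[B * 5:]
    return boxes, class_probs

boxes, class_probs = decode_cell(output[3, 4], B, C)
print(output.shape, boxes.shape, class_probs.shape)  # (7, 7, 30) (2, 5) (20,)
```

A single forward pass therefore produces all boxes and class scores for the whole image at once, with no separate proposal stage.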
[Figure sequence: predictions over the S × S grid on the input, ending with the final detections]
[Figure: S × S grid on input]
◮ Classification Loss
◮ Localization Loss
◮ Confidence Loss
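The three terms, written in the form used in the YOLO v1 paper (Redmon et al, CVPR 2016), with $c$ the confidence and $p_i(k)$ the probability of class $k$ in cell $i$:

Classification loss:

$$\sum_{i=1}^{S^2} \mathbb{1}_{i}^{obj} \sum_{k=1}^{C} \left( p_i(k) - \hat{p}_i(k) \right)^2$$

where $\mathbb{1}_{i}^{obj} = 1$ if an object appears in cell $i$, otherwise 0.

Localization loss:

$$\lambda_{coord} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] + \lambda_{coord} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{obj} \left[ \left( \sqrt{w_i} - \sqrt{\hat{w}_i} \right)^2 + \left( \sqrt{h_i} - \sqrt{\hat{h}_i} \right)^2 \right]$$

where $\mathbb{1}_{ij}^{obj} = 1$ if the jth bounding box in cell $i$ is responsible for detecting the object, otherwise 0.

Confidence loss:

$$\sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{obj} \left( c_i - \hat{c}_i \right)^2 + \lambda_{noobj} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{noobj} \left( c_i - \hat{c}_i \right)^2$$

where $\mathbb{1}_{ij}^{noobj} = 1 - \mathbb{1}_{ij}^{obj}$.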
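The three loss components named above (classification, localization, confidence) can be sketched in NumPy, following the sum-squared-error formulation of the YOLO v1 paper. The array layout, the function name `yolo_loss` and the precomputed responsibility mask `resp` are assumptions made for this example; the weights 5.0 and 0.5 are the values reported in that paper.

```python
import numpy as np

def yolo_loss(pred_boxes, pred_cls, tgt_boxes, tgt_cls, resp,
              lambda_coord=5.0, lambda_noobj=0.5):
    """Boxes: (S, S, B, 5) as (c, w, h, x, y); classes: (S, S, C); resp: (S, S, B) in {0, 1}."""
    resp = resp.astype(float)
    obj_cell = resp.max(axis=2)  # 1 if some box of the cell is responsible for an object

    # Classification loss: squared error on class probabilities, object cells only.
    cls_loss = np.sum(obj_cell[..., None] * (pred_cls - tgt_cls) ** 2)

    # Localization loss: centers, plus square roots of width and height.
    pc, pw, ph, px, py = np.moveaxis(pred_boxes, -1, 0)
    tc, tw, th, tx, ty = np.moveaxis(tgt_boxes, -1, 0)
    loc = ((px - tx) ** 2 + (py - ty) ** 2
           + (np.sqrt(pw) - np.sqrt(tw)) ** 2 + (np.sqrt(ph) - np.sqrt(th)) ** 2)
    loc_loss = lambda_coord * np.sum(resp * loc)

    # Confidence loss: responsible boxes, plus down-weighted non-responsible boxes.
    conf_err = (pc - tc) ** 2
    conf_loss = np.sum(resp * conf_err) + lambda_noobj * np.sum((1 - resp) * conf_err)
    return cls_loss, loc_loss, conf_loss
```

The square roots make a given error in width or height count more for small boxes than for large ones, and the lower weight on non-responsible boxes keeps the many empty cells from dominating the gradient.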
Method        Pascal 2007 mAP   Speed
DPM v5        33.7              0.07 FPS (14 sec/image)
RCNN          66.0              0.05 FPS (20 sec/image)
Fast RCNN     70.0              0.5 FPS (2 sec/image)
Faster RCNN   73.2              7 FPS (140 msec/image)
YOLO          69.0              45 FPS (22 msec/image)
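The two speed figures in each row are the same measurement in different units: time per image is the reciprocal of frames per second. A small sketch using the FPS values from the table (the helper name is illustrative):

```python
def seconds_per_image(fps):
    """Latency per image for a detector that runs at the given frames per second."""
    return 1.0 / fps

speeds_fps = {"DPM v5": 0.07, "RCNN": 0.05, "Fast RCNN": 0.5, "Faster RCNN": 7, "YOLO": 45}
for method, fps in speeds_fps.items():
    print(f"{method}: {seconds_per_image(fps):.3f} sec/image")
```

The table rounds these values, for example 1/7 sec to 140 msec and 1/45 sec to 22 msec.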
Full image → Extract patch → Classify center pixel with CNN
Problem: Very inefficient! Not reusing shared features between overlapping patches.
Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013
Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014
Design a network as a bunch of convolutional layers to make predictions for pixels all at once!
Input: 3 x H x W → Conv → Conv → Conv → Conv (Convolutions: D x H x W) → Scores: C x H x W → argmax → Predictions: H x W
Problem: convolutions at the original image resolution will be very expensive ...
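The last step of this pipeline, from per-class scores to a per-pixel label map, is a single argmax over the class axis. A minimal sketch with made-up sizes (C, H and W below are assumptions, and the random array stands in for the network's scores):

```python
import numpy as np

C, H, W = 3, 4, 5                      # hypothetical number of classes and image size
scores = np.random.rand(C, H, W)       # stand-in for the network output, C x H x W
predictions = scores.argmax(axis=0)    # class index with the highest score at each pixel
print(predictions.shape)               # (4, 5), i.e. H x W
```

Every pixel gets exactly one class label, so the prediction has the same height and width as the input.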
Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!
Input: 3 x H x W → High-res: D1 x H/2 x W/2 → Med-res: D2 x H/4 x W/4 → Low-res: D3 x H/4 x W/4 → Med-res: D2 x H/4 x W/4 → High-res: D1 x H/2 x W/2 → Predictions: H x W
Downsampling: Pooling, strided convolution
Upsampling: ???
Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
Nearest Neighbor
Input: 2 x 2
1 2
3 4
Output: 4 x 4
1 1 2 2
1 1 2 2
3 3 4 4
3 3 4 4

“Bed of Nails”
Input: 2 x 2
1 2
3 4
Output: 4 x 4
1 0 2 0
0 0 0 0
3 0 4 0
0 0 0 0
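Both unpooling schemes are a few lines of NumPy. The sketch below reproduces the 2 x 2 to 4 x 4 example; the function names are illustrative, and the bed-of-nails version assumes each value goes to the top-left corner of its block with zeros elsewhere.

```python
import numpy as np

def nearest_neighbor_upsample(x, factor=2):
    """Copy every input value into a factor x factor block."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def bed_of_nails_upsample(x, factor=2):
    """Put every input value in the top-left corner of its block; the rest is zero."""
    out = np.zeros((x.shape[0] * factor, x.shape[1] * factor), dtype=x.dtype)
    out[::factor, ::factor] = x
    return out

x = np.array([[1, 2], [3, 4]])
print(nearest_neighbor_upsample(x))
print(bed_of_nails_upsample(x))
```

Neither scheme has learnable parameters; both only fix where the input values land in the larger output.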
Max Pooling: remember which element was max!
Input: 4 x 4
1 2 6 3
3 5 2 1
1 2 2 1
7 3 4 8
Output: 2 x 2
5 6
7 8

... Rest of the network ...

Max Unpooling: use positions from pooling layer
Input: 2 x 2
1 2
3 4
Output: 4 x 4
0 0 2 0
0 1 0 0
0 0 0 0
3 0 0 4

Corresponding pairs of downsampling and upsampling layers
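The pairing of a pooling layer with its unpooling layer can be sketched as follows: the pooling step records where each maximum came from, and the unpooling step writes its inputs back to those positions. The function names and the use of flat indices are choices made for this example; the numbers are the ones from the slide.

```python
import numpy as np

def max_pool_2x2(x):
    """Return the pooled values and the flat input index of each maximum."""
    H, W = x.shape
    pooled = np.zeros((H // 2, W // 2), dtype=x.dtype)
    indices = np.zeros((H // 2, W // 2), dtype=int)
    for i in range(H // 2):
        for j in range(W // 2):
            window = x[2 * i:2 * i + 2, 2 * j:2 * j + 2]
            r, c = np.unravel_index(window.argmax(), window.shape)
            pooled[i, j] = window[r, c]
            indices[i, j] = (2 * i + r) * W + (2 * j + c)
    return pooled, indices

def max_unpool_2x2(y, indices, out_shape):
    """Place each value of y at the position remembered by the pooling layer."""
    out = np.zeros(out_shape, dtype=y.dtype)
    out.flat[indices.ravel()] = y.ravel()
    return out

x = np.array([[1, 2, 6, 3], [3, 5, 2, 1], [1, 2, 2, 1], [7, 3, 4, 8]])
pooled, idx = max_pool_2x2(x)
restored = max_unpool_2x2(np.array([[1, 2], [3, 4]]), idx, x.shape)
print(pooled)
print(restored)
```

Unlike nearest-neighbor or bed-of-nails unpooling, the positions here depend on the input image, which helps keep spatial detail that pooling would otherwise discard.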
Recall: Normal 3 x 3 convolution, stride 1 pad 1
Input: 4 x 4, Output: 4 x 4
Dot product between filter and input
Recall: Normal 3 x 3 convolution, stride 2 pad 1
Input: 4 x 4, Output: 2 x 2
Dot product between filter and input
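The input and output sizes quoted for these convolutions follow from the standard size formulas. The sketch below checks them; the function names are illustrative, and the transpose-convolution formula assumes no output padding or dilation.

```python
def conv_output_size(n, kernel, stride, pad):
    """Spatial output size of a normal convolution along one dimension."""
    return (n + 2 * pad - kernel) // stride + 1

def transpose_conv_output_size(n, kernel, stride, pad):
    """Spatial output size of a transpose convolution along one dimension."""
    return (n - 1) * stride - 2 * pad + kernel

print(conv_output_size(4, kernel=3, stride=1, pad=1))            # 4
print(conv_output_size(4, kernel=3, stride=2, pad=1))            # 2
print(transpose_conv_output_size(2, kernel=3, stride=1, pad=0))  # 4
```

A stride larger than 1 shrinks the output of a normal convolution and enlarges the output of a transpose convolution, which is why the two are used as a downsampling and upsampling pair.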
3 x 3 transpose convolution, stride 1 pad 0
Input: 2 x 2, Output: 4 x 4
Input gives weight for filter. Sum where output overlaps.

Contribution of input x1 (filter weights w1 ... w9):
w1x1   w2x1   w3x1
w4x1   w5x1   w6x1
w7x1   w8x1   w9x1

After adding the contribution of input x2 (shifted one column to the right):
w1x1   w2x1 + w1x2   w3x1 + w2x2   w3x2
w4x1   w5x1 + w4x2   w6x1 + w5x2   w6x2
w7x1   w8x1 + w7x2   w9x1 + w8x2   w9x2
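The "input gives weight for filter, sum where output overlaps" rule translates directly into code: each input value scales a copy of the filter, and the copies are accumulated into the output. A minimal single-channel NumPy sketch (the function name and the example values are assumptions):

```python
import numpy as np

def transpose_conv2d(x, w, stride=1):
    """Each input value scales the filter; scaled copies are summed where they overlap."""
    k = w.shape[0]
    out_h = (x.shape[0] - 1) * stride + k
    out_w = (x.shape[1] - 1) * stride + k
    out = np.zeros((out_h, out_w))
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * w
    return out

x = np.array([[1.0, 2.0], [3.0, 4.0]])     # x1, x2 in the first row
w = np.arange(1.0, 10.0).reshape(3, 3)     # w1 ... w9 set to 1 ... 9
out = transpose_conv2d(x, w)
print(out.shape)  # (4, 4)
print(out[0])     # first output row: w1x1, w2x1 + w1x2, w3x1 + w2x2, w3x2
```

With stride 1 the two filter copies of x1 and x2 overlap in two columns, which is where the sums in the expansion above come from.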
Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!
Input: 3 x H x W → High-res: D1 x H/2 x W/2 → Med-res: D2 x H/4 x W/4 → Low-res: D3 x H/4 x W/4 → Med-res: D2 x H/4 x W/4 → High-res: D1 x H/2 x W/2 → Predictions: H x W
Downsampling: Pooling, strided convolution
Upsampling: Unpooling or strided transpose convolution
H Noh, S Hong and B Han, ‘Learning Deconvolution Network for Semantic Segmentation’, ICCV 2015
V Badrinarayanan, A Kendall and R Cipolla, ‘SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation’, IEEE TPAMI 2017
[Figure: Object Detection, Semantic Segmentation and Instance Segmentation compared on one image; labels shown: Person, and Person 1 to Person 5]
Source: Kaiming He, ICCV 2017
[Figure: heads on top of the features (Feat.)]
(slow) R-CNN: Feat. → step 1: cls, step 2: bbox reg
Fast/er R-CNN: Feat. → cls, bbox reg
Mask R-CNN: Feat. → cls, bbox reg, mask
[Figure: RoIPool coordinate quantization; the RoI is replaced by a quantized RoI]
[Figure: RoIAlign on the conv feat. map; a variable size RoI is turned into a fixed dimensional representation using grid points of bilinear interpolation]
FAQs: how to sample grid points within a cell?
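The difference between quantizing a coordinate and sampling it by bilinear interpolation can be shown on a single point. The sketch below is not the Mask R-CNN implementation: the helper name, the toy feature map and the use of floor for quantization are assumptions, and it evaluates one grid point rather than pooling several points per cell.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Value of a 2-D feature map at a real-valued location (y, x)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, feat.shape[0] - 1), min(x0 + 1, feat.shape[1] - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feat[y0, x0] + dx * feat[y0, x1]
    bottom = (1 - dx) * feat[y1, x0] + dx * feat[y1, x1]
    return (1 - dy) * top + dy * bottom

feat = np.arange(16, dtype=float).reshape(4, 4)   # toy feature map, feat[y, x] = 4y + x
y, x = 1.5, 2.25                                  # a grid point that falls between cells

aligned = bilinear_sample(feat, y, x)                  # RoIAlign-style: keep the fraction
quantized = feat[int(np.floor(y)), int(np.floor(x))]   # RoIPool-style: snap to a cell
print(aligned, quantized)  # 8.25 6.0
```

The quantized value ignores the fractional part of the location, and that misalignment is what hurts pixel-accurate mask prediction.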
Mask R-CNN results on CityScapes
Mask R-CNN results on COCO
not a kite