CV3DST | Prof. Leal-Taixé
Instance segmentation
Semantic segmentation: label every pixel, including the background (sky, grass, road). Do not differentiate between pixels coming from instances of the same class.
Instance segmentation: do not label pixels coming from uncountable objects (sky, grass, road). Differentiate between pixels coming from instances of the same class.
[Diagram: two families of instance segmentation methods: proposal-based (first detect instances, then segment within each proposal) vs. FCN-based (first predict a class segmentation, then group pixels into instances).]
A semantic map… We already know how to obtain this!
Long, Shelhamer, Darrell - Fully Convolutional Networks for Semantic Segmentation, CVPR 2015, PAMI 2016
FCN-based approaches:
– Liang et al. "Proposal-free Network for Instance-level Object Segmentation". arXiv 2015
– Kirillov et al. "InstanceCut: From Edges to Instances with MultiCut". CVPR 2017
– Bai and Urtasun "Deep Watershed Transform for Instance Segmentation". CVPR 2017
Bounding boxes… We already know how to obtain those!
Proposal-based approaches:
– Hariharan et al. "Simultaneous Detection and Segmentation". ECCV 2014
– Follow-up work: B. Hariharan et al. "Hypercolumns for Object Segmentation and Fine-grained Localization". CVPR 2015
– Dai et al. "Instance-aware Semantic Segmentation via Multi-task Network Cascades". CVPR 2016
– Previous work: Dai et al. "Convolutional Feature Masking for Joint Object and Stuff Segmentation". CVPR 2015
[Diagram: Faster R-CNN: Image → CNN → Region Proposal Network → classification head + bounding-box regression head.]
[Diagram: Mask R-CNN: the same architecture with an additional instance segmentation head (an FCN).]
Mask R-CNN = Faster R-CNN + an FCN-like mask head.
He et al. "Mask R-CNN". ICCV 2017
Mask loss = per-pixel binary cross-entropy for the K semantic classes: the head predicts one mask per class, and only the mask of the ground-truth class contributes to the loss, so classes do not compete.
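The mask loss can be sketched as follows; this is an illustrative NumPy sketch (the function name and shapes are mine, not from the paper), showing that only the ground-truth class mask is penalized:

```python
import numpy as np

def mask_loss(pred_logits, gt_mask, gt_class):
    """Mask R-CNN-style mask loss (illustrative sketch).

    pred_logits: (K, H, W) raw logits, one mask per class
    gt_mask:     (H, W) binary ground-truth mask
    gt_class:    index of the ground-truth class

    Only the mask of the ground-truth class contributes,
    so the K class masks do not compete with each other.
    """
    logits = pred_logits[gt_class]          # select the GT-class mask only
    p = 1.0 / (1.0 + np.exp(-logits))       # per-pixel sigmoid
    eps = 1e-7
    bce = -(gt_mask * np.log(p + eps) + (1 - gt_mask) * np.log(1 - p + eps))
    return bce.mean()                        # average binary cross-entropy
```

Note that a perfect prediction for the ground-truth class gives a near-zero loss regardless of what the other K-1 masks predict.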
Object recognition head and segmentation head: most of the features are shared.
Classification needs invariant representations.
– Translation invariance: wherever the penguin is in the image, I still want to have "penguin" as my classification output.
Segmentation needs equivariant representations.
– Translated object → translated mask.
– Scaled object → scaled mask.
– For semantic segmentation, small objects are less important (fewer pixels), but for instance segmentation, all instances matter.
Feature extraction = convolutional layers → equivariant.
Segmentation head = a fully convolutional network → equivariant.
Fully connected layers and global pooling layers give invariance!
RoI Pooling example: the CNN maps a 400x400 image to a 65x65xC feature map, and we put a grid (e.g., 4x6) on top of the box region to pool it to a fixed size. A box of height 300 px maps to a feature-map height of 65 * 300 / 400 = 48.75; RoI Pooling rounds this to 48. This quantization effect makes it unsuitable for extracting pixel-wise precise masks.
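The quantization arithmetic from the slide, worked through in a few lines of Python:

```python
# RoI Pool quantization with the slide's numbers: a 400x400 image
# mapped to a 65x65 feature map, and a box of height 300 px.
img_size, feat_size = 400, 65
box_h = 300

feat_h = feat_size * box_h / img_size   # 65 * 300 / 400 = 48.75
quantized_h = int(feat_h)               # RoI Pool snaps to the grid: 48

print(feat_h, quantized_h)  # 48.75 48
```

The 0.75 feature-map cells lost here correspond to several image pixels, which is exactly the precision a mask head needs.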
Solution: exchange RoI Pooling for an equivariant operation = RoI Align.
With RoI Align, the same example keeps the exact value: the box height on the feature map stays 65 * 300 / 400 = 48.75, with no quantization.
Image: Kaiming He
RoI Align: each output unit is sampled at 4 points; the value at each sampling point is obtained by bilinear interpolation from the surrounding feature-map grid points, and max pooling over the 4 positions gives the unit's value.
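A minimal sketch of the bilinear interpolation step used at each sampling point (the helper name is mine; it assumes the sample point lies strictly inside the feature map):

```python
import numpy as np

def bilinear(fmap, y, x):
    """Bilinear interpolation at a fractional position (sketch).

    fmap: 2-D feature map; (y, x): real-valued coordinates.
    RoI Align evaluates several such points per output unit
    and then pools (e.g. max) over them.
    """
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = y0 + 1, x0 + 1
    wy, wx = y - y0, x - x0
    # Weighted sum of the 4 surrounding grid points
    return ((1 - wy) * (1 - wx) * fmap[y0, x0] +
            (1 - wy) * wx       * fmap[y0, x1] +
            wy       * (1 - wx) * fmap[y1, x0] +
            wy       * wx       * fmap[y1, x1])
```

Because the weights vary smoothly with (y, x), the operation is differentiable and keeps sub-cell positions, unlike the rounding in RoI Pool.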
Keypoint detection: model a keypoint's location as a one-hot mask, and adapt Mask R-CNN to predict K masks, one per keypoint type (each mask is in the end only 1 pixel). This demonstrates the flexibility of Mask R-CNN.
Problem: the mask score is simply computed as the confidence score for the bounding box.
Recall: the mask loss only evaluates whether the pixels have the correct semantic class, not the correct instance! Both instances here have the same class (person); the only way the "instance" is evaluated is through the box loss.
Huang et al., “Mask Scoring R-CNN”, CVPR 2019
Idea: measure the intersection over union (IoU) between the predicted mask and the ground-truth mask, and train a head to predict this mask IoU as the mask score.
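The quantity being regressed is the plain IoU between two binary masks; a sketch (the function name is mine):

```python
import numpy as np

def mask_iou(pred, gt):
    """IoU between two binary masks (sketch).

    Mask Scoring R-CNN trains an extra head to *predict*
    this value and uses it to rescore the masks.
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0
```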
Typically, Mask Scoring R-CNN gives lower confidence scores than Mask R-CNN, which corresponds to masks not being perfect (IoU < 1.0). This tiny modification achieves SOTA results.
One-stage vs. two-stage instance segmenters
Detection analogy: Faster R-CNN is slower but has higher performance; YOLO is faster but has lower performance.
Likewise: Mask R-CNN is slower but has higher performance; YOLACT is faster but has lower performance.
“Boxes are stupid anyway though, I’m probably a true believer in masks except I can’t get YOLO to learn them.”
– Joseph Redmon, YOLOv3
YOLACT = You Only Look At CoefficienTs
YOLACT steps:
1) Generate mask prototypes
2) Generate mask coefficients
3) Combine (1) and (2)
Backbone: ResNet-101 with a feature pyramid; features are computed at different scales.
Generate k prototype masks. k is not the number of classes, but is a hyperparameter.
The prototype branch (3x3 convolutions followed by a 1x1 convolution) is similar to the mask branch in Mask R-CNN; however, no loss function is applied at this stage.
Step 2: predict a coefficient for every prototype mask.
Per anchor box, predict one class, the box regression, and k mask coefficients (one per prototype mask). The prediction network is similar to, but shallower than, RetinaNet's.
Step 3: combine. Take a linear combination of the mask prototypes weighted by the mask coefficients and predict the masks as M = σ(P Cᵀ), where P is the (H×W×k) tensor of prototype masks, C is the (n×k) matrix of mask coefficients of the n detections surviving NMS, and σ is a nonlinearity (sigmoid).
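The assembly step M = σ(P Cᵀ) is a single tensor contraction; a NumPy sketch under the shapes given above (the function name is mine):

```python
import numpy as np

def assemble_masks(P, C):
    """YOLACT-style mask assembly (sketch): M = sigma(P C^T).

    P: (H, W, k) prototype masks
    C: (n, k) mask coefficients of the n detections surviving NMS
    returns: (H, W, n) instance masks after a sigmoid nonlinearity
    """
    # Linear combination of prototypes, weighted by the coefficients
    M = np.tensordot(P, C, axes=([2], [1]))   # -> (H, W, n)
    return 1.0 / (1.0 + np.exp(-M))           # sigmoid squashes to (0, 1)
```

Because the combination is a single matrix product over shared prototypes, producing one mask per detection is very cheap, which is where YOLACT's speed comes from.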
Mask loss: cross-entropy between the assembled masks and the ground truth, in addition to the standard losses (regression for the bounding box and classification for the class of the object).
For large objects, the quality of the masks is even better than those of two-stage detectors.
YOLACT also uses Fast NMS to make the procedure faster.
During training, an auxiliary semantic segmentation loss is performed on the final features of the FPN; the module is not used during the inference stage.
Bolya et al. "YOLACT++: Better Real-time Instance Segmentation". arXiv:1912.06218, 2019
Semantic segmentation (FCN-like) + Instance segmentation (Mask R-CNN) = Panoptic segmentation (UPSNet).
Panoptic segmentation gives labels to uncountable "stuff" classes (sky, grass, road, etc.), similar to FCN-like networks, and it differentiates between pixels coming from different instances of countable classes (cars, pedestrians, etc.).
Problem: some pixels might get classified as stuff by the FCN network while at the same time being classified as instances of some class by Mask R-CNN (conflicting results)!
Solution: a parameter-free panoptic head which combines the information from the FCN and Mask R-CNN to give the final predictions.
Xiong et al., “UPSNet: A Unified Panoptic Segmentation Network”. CVPR 2019
[Diagram: UPSNet: shared features, separate semantic and instance heads, and a panoptic head putting it together.]
The semantic head is, like all semantic heads, a fully convolutional network. New: deformable convolutions!
Recall: dilated (atrous) convolutions in 2D.
(a) The dilation parameter is 1, and each element produced by this filter has a receptive field of 3x3. (b) The dilation parameter is 2, and each element produced by it has a receptive field of 7x7. (c) The dilation parameter is 4, and each element produced by it has a receptive field of 15x15.
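The receptive-field growth follows from each stacked stride-1 layer adding (kernel - 1) * dilation; a quick check of the slide's numbers in Python (the function name is mine):

```python
def stacked_dilated_rf(dilations, kernel=3):
    """Receptive field of stacked stride-1 dilated convolutions (sketch).

    Each layer enlarges the receptive field by (kernel - 1) * dilation.
    """
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

# Stacking layers with dilation 1, 2, 4 reproduces the slide:
print(stacked_dilated_rf([1]))        # 3  -> 3x3
print(stacked_dilated_rf([1, 2]))     # 7  -> 7x7
print(stacked_dilated_rf([1, 2, 4]))  # 15 -> 15x15
```

Doubling the dilation per layer thus grows the receptive field exponentially while the parameter count stays constant.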
Deformable convolutions: a generalization of dilated convolutions where the offsets are learned.
A deformable convolution picks the values at different locations for the convolution, conditioned on the input image.
Inputs to the panoptic head: mask logits from the instance head; object ("thing") logits from the semantic head (e.g., car); stuff logits from the semantic head (e.g., sky). The stuff logits can be evaluated directly; the object logits need to be masked by the instance masks.
Perform a softmax over the panoptic logits. If the maximum value for a pixel falls into the first stuff channels, the pixel belongs to one of the stuff classes; otherwise, the index of the maximum value tells us the instance ID the pixel belongs to.
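This per-pixel rule can be sketched as follows; since softmax is monotonic, taking the argmax of the logits directly is equivalent (shapes and names are mine):

```python
import numpy as np

def panoptic_labels(logits, n_stuff):
    """Resolve per-pixel panoptic predictions (sketch).

    logits: (C, H, W) panoptic logits; the first n_stuff channels
    are stuff classes, the remaining channels are per-instance logits.
    Returns (is_stuff, idx): idx is the stuff class where is_stuff
    is True, and the instance ID (0-based) elsewhere.
    """
    idx = logits.argmax(axis=0)      # argmax of logits == argmax of softmax
    is_stuff = idx < n_stuff         # fell into the stuff channels?
    return is_stuff, np.where(is_stuff, idx, idx - n_stuff)
```

Because the decision is a single argmax over one stacked logit tensor, the conflict between the FCN and Mask R-CNN outputs is resolved per pixel without any extra parameters.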
Xiong et al., “UPSNet: A Unified Panoptic Segmentation Network”. CVPR 2019
Read the paper for the details on how the unknown class is used.
Segmentation quality (SQ): measures how close the matched predicted segments are to the ground-truth segments (it does not take bad predictions into account!).
TP = true positive, FN = false negative, FP = false positive.
Recognition quality (RQ): we want to know if we are missing any instances (FN) or predicting too many instances (FP).
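The two aspects combine into the standard panoptic quality (PQ) metric: the average IoU of matched segments (segmentation quality) times an F1-style count over TP/FP/FN (recognition quality). A sketch of the standard definition (the function name is mine):

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Panoptic Quality (sketch of the standard definition).

    matched_ious: IoUs of the matched (TP) prediction/ground-truth pairs
    PQ = (sum of TP IoUs) / (|TP| + 0.5*|FP| + 0.5*|FN|)
       = SQ (average matched IoU) * RQ (detection-style score)
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    return sum(matched_ious) / denom if denom > 0 else 0.0
```

For example, two matches with IoUs 0.8 and 0.6 plus one FP and one FN give PQ = 1.4 / 3; unmatched predictions and misses pull the score down even when the matched masks are good.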
Matching: a ground-truth segment cannot be matched to two predicted segments.
[Figure: predictions vs. ground truth; the IoU determines which predicted segments count as TP and which as FP.]
Before the dominance of DPM and R-CNN: detection-as-voting.
Voting in a dual parametric space: each observation in image space casts votes in a parameter space.
The votes are accumulated in a (discretized) parameter space.
Learn the possible configurations of the observed parts (visual words) relative to the object center.
Leibe et al., Robust Object Detection with Interleaved Categorization and Segmentation, IJCV’08
Leibe et al., Robust Object Detection with Interleaved Categorization and Segmentation, IJCV’08
Pipeline: interest point detection (SIFT, SURF), then center-point voting.
A modern take: pixel consensus voting for panoptic segmentation (CVPR 2020).
The instance voting branch predicts for every pixel whether the pixel is part of an instance mask, and if so, the relative location of the instance mask centroid. The semantic branch predicts the class labels (including "stuff") over a set of grid cells.
The votes are aggregated in an accumulator space via (dilated) transposed convolutions.
Peaks in the accumulator space are backprojected to obtain the instance masks, which are then combined with the output of the semantic segmentation head.
The relative location of the centroid is discretized: it is converted into one of K = 17 indices.
Example: the centroid falls on the red pixel, which corresponds to position 16.
The voting branch output has size [H, W, K+1]. How are the votes aggregated? Example: for the blue pixel, we get a vote for index 16 with probability 0.9 (softmax output). The votes are spread to their target cells with a transposed convolution; the remaining 0.1 probability mass is handled with average pooling.
The transposed convolution distributes each input cell's vote mass to the output cells of the accumulator map: the kernel parameters are fixed to a 1-hot pattern across each channel that marks the target location.
Kernels: [3,3,9] with stride 1 and [3,3,8] with stride 3, fixed to the 1-hot pattern that marks the target location.
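The effect of such a fixed 1-hot transposed-convolution kernel is simply to scatter each pixel's vote mass to the accumulator cell at its target offset; an equivalent scatter-add sketch (shapes, names, and the toy offset table are mine):

```python
import numpy as np

def vote_accumulate(probs, offsets):
    """Scatter pixel votes into an accumulator map (sketch).

    probs:   (H, W, K) per-pixel softmax over K discretized offsets
    offsets: list of K (dy, dx) displacements, one per vote index
    Equivalent to a transposed convolution whose kernels are fixed
    to 1-hot patterns marking each target location.
    """
    H, W, K = probs.shape
    acc = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            for k, (dy, dx) in enumerate(offsets):
                ty, tx = y + dy, x + dx
                if 0 <= ty < H and 0 <= tx < W:
                    acc[ty, tx] += probs[y, x, k]  # vote mass lands at target
    return acc
```

Since the kernels are fixed, the aggregation has no learnable parameters; the network only learns the per-pixel vote distribution.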
Instance centers are detected as peaks in the accumulator space that stand above all others.
Backprojection: assign each pixel to a specific object center. "My center is at pixel 8! The bottom-left pixel should have voted for '8' if I'm the instance center!"
Why panoptic segmentation? Stuff classes inform us about the drivability of surfaces, about boundaries and obstacles; instances identify the agents in the vicinity.