Instance segmentation
CV3DST | Prof. Leal-Taixé


SLIDE 1

1 CV3DST | Prof. Leal-Taixé

Instance segmentation
SLIDE 2

Semantic segmentation

Label every pixel, including the background (sky, grass, road). Do not differentiate between pixels coming from instances of the same class.

SLIDE 3

Instance segmentation

Do not label pixels coming from uncountable objects (sky, grass, road). Differentiate between pixels coming from instances of the same class.
SLIDE 4

Instance segmentation methods

Proposal-based vs. FCN-based:
  • Proposal-based: 1. Generate proposals; 2. Assign a class
  • FCN-based: 1. Semantic segmentation; 2. Find instances
SLIDE 5

Instance segmentation methods
SLIDE 6

FCN-based methods

A semantic map… we already know how to obtain this!

SLIDE 7

Why FCN-based?

  • Fully Convolutional Networks for Semantic Segmentation

Long, Shelhamer, Darrell. "Fully Convolutional Networks for Semantic Segmentation". CVPR 2015, PAMI 2016

SLIDE 8

FCN-based methods

  • X. Liang et al. "Proposal-free Network for Instance-level Object Segmentation". arXiv 2015
  • A. Kirillov et al. "InstanceCut: from Edges to Instances with MultiCut". CVPR 2017
  • M. Bai and R. Urtasun. "Deep Watershed Transform for Instance Segmentation". CVPR 2017

SLIDE 9

Instances through clustering

  • A. Kirillov et al. "InstanceCut: from Edges to Instances with MultiCut". CVPR 2017
SLIDE 10

Instance segmentation methods

Proposal-based vs. FCN-based (recap):
  • Proposal-based: 1. Generate proposals; 2. Assign a class
  • FCN-based: 1. Semantic segmentation; 2. Find instances
SLIDE 11

Proposal-based methods

Bounding boxes… we already know how to obtain those!

SLIDE 12

Proposal-based methods

  • B. Hariharan et al. "Simultaneous Detection and Segmentation". ECCV 2014
    – Follow-up work: B. Hariharan et al. "Hypercolumns for Object Segmentation and Fine-grained Localization". CVPR 2015
  • Dai et al. "Instance-aware Semantic Segmentation via Multi-task Network Cascades". CVPR 2016
    – Previous work: Dai et al. "Convolutional Feature Masking for Joint Object and Stuff Segmentation". CVPR 2015

SLIDE 13

SDS

  • SDS: Simultaneous Detection and Segmentation
  • B. Hariharan et al. "Simultaneous Detection and Segmentation". ECCV 2014
SLIDE 14

MNC

  • MNC: Multi-task Network Cascades

Dai et al. "Instance-aware Semantic Segmentation via Multi-task Network Cascades". CVPR 2016

SLIDE 15

Instance segmentation: the best of both worlds

Proposal-based + FCN-based:
  • Proposal-based: 1. Generate proposals; 2. Assign a class
  • FCN-based: 1. Semantic segmentation; 2. Find instances

SLIDE 16

Mask R-CNN

SLIDE 17

What is Mask R-CNN?

  • Starting from the Faster R-CNN architecture

[Diagram: Image → CNN → Region Proposal Network → classification head + bounding box regression head]

SLIDE 18

What is Mask R-CNN?

  • Faster R-CNN + FCN for segmentation

[Diagram: Image → CNN → Region Proposal Network → classification head + bounding box regression head + instance segmentation head (FCN)]

SLIDE 19

What is Mask R-CNN?

  • Faster R-CNN + FCN for segmentation (Faster R-CNN part + FCN-like mask head)

He et al. "Mask R-CNN". ICCV 2017

Mask loss = binary cross-entropy per pixel for the k semantic classes
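The per-pixel binary cross-entropy described above can be sketched in a few lines (a toy NumPy version, not the actual Mask R-CNN code; the function name is mine). The key point is that only the mask channel of the ground-truth class contributes, so there is no competition between classes as there would be with a softmax loss:

```python
import numpy as np

def mask_loss(mask_logits, gt_class, gt_mask):
    """Per-pixel binary cross-entropy on the channel of the ground-truth class.

    mask_logits: (K, H, W) raw scores, one mask channel per semantic class
    gt_class:    index of the ground-truth class for this RoI
    gt_mask:     (H, W) binary ground-truth mask
    Only channel `gt_class` contributes; the other K-1 channels are ignored.
    """
    p = 1.0 / (1.0 + np.exp(-mask_logits[gt_class]))  # per-pixel sigmoid
    eps = 1e-7                                        # numerical stability
    bce = -(gt_mask * np.log(p + eps) + (1 - gt_mask) * np.log(1 - p + eps))
    return bce.mean()
```

A well-predicted mask (large positive logits inside, large negative outside) drives this loss toward zero; an uninformative all-zero logit map gives roughly log 2 per pixel.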

SLIDE 20

Mask R-CNN

Object recognition head + segmentation head; most of the features are shared.

SLIDE 21

Detection vs. segmentation

  • Detection: for object classification, you require invariant representations

Translation invariance: wherever the penguin is in the image, I still want to have "penguin" as my classification output.

SLIDE 22

Detection vs. segmentation

  • Detection: for object classification, you require invariant representations
  • Segmentation: you require equivariant representations
    – Translated object → translated mask
    – Scaled object → scaled mask
    – For semantic segmentation, small objects are less important (fewer pixels), but for instance segmentation, all objects (no matter the size) are equally important

SLIDE 23

Mask R-CNN: operations

  • What operations are equivariant?
    – Feature extraction = convolutional layers → equivariant
    – Segmentation head is a fully convolutional network → equivariant

SLIDE 24

Mask R-CNN: operations

  • What operations are equivariant?
    – Feature extraction = convolutional layers → equivariant
    – Segmentation head is a fully convolutional network → equivariant
    – Fully connected layers and global pooling layers give invariance!

SLIDE 25

Recall: RoI pooling

  • Region of Interest Pooling: for every proposal

[Diagram: Image → CNN → feature map (H×W×C); zoom in on the proposal and put an h×w grid on top of it]

SLIDE 26

Recall: RoI pooling

  • Let us look at sizes

[Diagram: 400×400 image → CNN → 65×65×C feature map. A 300×150 box maps to height 65·300/400 = 48.75 on the feature map; quantization chooses 48. Putting a 4×6 grid on top adds a second quantization. Not suitable for extracting pixel-wise precise masks.]
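The slide's size example can be reproduced with a toy 1-D helper (the function name and signature are mine, for illustration only) that makes the two quantization steps of RoI pooling explicit:

```python
def roi_pool_sizes(box_len_img, img_len, feat_len, grid_cells):
    """Illustrate the two quantization steps of RoI pooling along one axis.

    box_len_img: box extent in image pixels (e.g. 300)
    img_len:     image extent in pixels (e.g. 400)
    feat_len:    feature-map extent (e.g. 65)
    grid_cells:  number of pooling cells along this axis (e.g. 4)
    """
    box_len_feat = feat_len * box_len_img / img_len  # 65 * 300 / 400 = 48.75
    snapped = int(box_len_feat)                      # first quantization: 48
    cell = snapped // grid_cells                     # second quantization per grid cell
    return box_len_feat, snapped, cell
```

With the slide's numbers, the exact box height of 48.75 feature-map pixels is first snapped to 48 and then chopped into integer-sized cells, which is exactly the precision loss RoIAlign removes.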

SLIDE 27

Mask R-CNN: operations

  • Make all operations equivariant
    – Fully connected layers and global pooling layers give invariance!
    – Exchange RoI pooling for an equivariant operation = RoIAlign

SLIDE 28

RoIAlign

  • Erase quantization effects

[Diagram: same 400×400 example; the box height on the 65×65 feature map is 65·300/400 = 48.75, and RoIAlign keeps 48.75 instead of quantizing to 48.]

SLIDE 29

RoIAlign

Image: Kaiming He

[Figure: grid points for bilinear interpolation on the feature map.] Each output unit is sampled at 4 positions; each sampled value is obtained by bilinear interpolation, and max pooling over the 4 positions gives one output value.
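The bilinear interpolation step above can be sketched as follows (a minimal NumPy version for a single-channel map; the function name is mine). RoIAlign evaluates this at 4 sample points per output cell and then max-pools (or averages) them:

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample a feature map at a real-valued location (RoIAlign-style).

    feat: (H, W) feature map; y, x: continuous, non-negative coordinates.
    The value is a weighted average of the 4 surrounding grid points,
    so no quantization of the sampling position is needed.
    """
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feat.shape[0] - 1)   # clamp to the map border
    x1 = min(x0 + 1, feat.shape[1] - 1)
    wy, wx = y - y0, x - x0               # fractional offsets = blend weights
    return ((1 - wy) * (1 - wx) * feat[y0, x0] + (1 - wy) * wx * feat[y0, x1]
            + wy * (1 - wx) * feat[y1, x0] + wy * wx * feat[y1, x1])
```

Sampling at (0.5, 0.5) of a 2×2 map returns the average of its four values, which is the behavior that makes the operation differentiable and position-accurate.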

SLIDES 30–33

Mask R-CNN: qualitative results (image slides)

SLIDE 34

Mask R-CNN: extended for joints

Model a keypoint's location as a one-hot mask, and adopt Mask R-CNN to predict K masks (which are in the end only 1 pixel), one for each of the K keypoint types (e.g., left shoulder, right elbow).

This demonstrates the flexibility of Mask R-CNN.

SLIDE 35

Improving Mask R-CNN

  • One problem with Mask R-CNN is that the mask quality score is computed as the confidence score for the bounding box

Recall that the mask loss only evaluates whether the pixels have the correct semantic class, not the correct instance! Both instances have the same class = person. The only way the "instance" is evaluated is through the box loss.

SLIDE 36

MaskIoU head

Huang et al. "Mask Scoring R-CNN". CVPR 2019

Measure the intersection over union between the predicted mask and the ground-truth mask.
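The quantity the MaskIoU head regresses is just the IoU between two binary masks; a minimal sketch (the function name is mine, the actual head predicts this value from features):

```python
import numpy as np

def mask_iou(pred_mask, gt_mask):
    """Intersection over union between two binary masks.

    pred_mask, gt_mask: (H, W) arrays interpretable as booleans.
    This is the regression target of the MaskIoU head.
    """
    pred_mask = pred_mask.astype(bool)
    gt_mask = gt_mask.astype(bool)
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union > 0 else 0.0
```

At inference, the predicted mask IoU multiplies the classification score, so a box that is confident but poorly segmented gets a lower final mask score.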

SLIDE 37

Mask confidence score

Typically, Mask Scoring R-CNN gives lower confidence scores than Mask R-CNN, which corresponds to masks not being perfect (IoU < 1.0). This tiny modification achieves SOTA results.

SLIDE 38

Is one-stage vs. two-stage also applicable to masks?

SLIDE 39

One-stage vs. two-stage detectors

Faster R-CNN: slower, but higher performance. YOLO: faster, but lower performance.

SLIDE 40

One-stage vs. two-stage instance segmenters

Mask R-CNN: slower, but higher performance. YOLACT: faster, but lower performance.

SLIDE 41

YOLO with masks?

"Boxes are stupid anyway though, I'm probably a true believer in masks except I can't get YOLO to learn them."
– Joseph Redmon, YOLOv3

SLIDE 42

YOLACT*

*You Only Look At CoefficienTs

SLIDE 43

YOLACT: idea

  • D. Bolya et al. "YOLACT: Real-time Instance Segmentation". ICCV 2019
SLIDE 44

YOLACT: idea

1) Generate mask prototypes

SLIDE 45

YOLACT: idea

1) Generate mask prototypes
2) Generate mask coefficients

SLIDE 46

YOLACT: idea

1) Generate mask prototypes
2) Generate mask coefficients
3) Combine (1) and (2)

SLIDE 47

YOLACT: backbone

ResNet-101; features computed at different scales.

SLIDE 48

YOLACT: protonet

Generate k prototype masks. k is not the number of classes, but a hyperparameter.

SLIDE 49

YOLACT: protonet

  • Fully convolutional network (3×3 convs followed by a 1×1 conv)

Similar to the mask branch in Mask R-CNN. However, no loss function is applied at this stage.

SLIDE 50

YOLACT: mask coefficients

Predict a coefficient for every prototype mask.

SLIDE 51

YOLACT: mask coefficients

Predict k coefficients (one per prototype mask) per anchor. Predict one class per anchor box. Predict the box regression per anchor box. The network is similar to, but shallower than, RetinaNet.

SLIDE 52

YOLACT: mask assembly

1. Do a linear combination of the mask prototypes weighted by the mask coefficients.
2. Predict the masks as M = σ(PCᵀ), where P is an (H×W×k) tensor of prototype masks, C is an (N×k) matrix of mask coefficients surviving NMS, and σ is a nonlinearity (sigmoid).
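The assembly step M = σ(PCᵀ) is a single tensor contraction; a minimal NumPy sketch (the function name is mine):

```python
import numpy as np

def assemble_masks(prototypes, coeffs):
    """YOLACT mask assembly: M = sigmoid(P C^T).

    prototypes: (H, W, k) prototype masks from the protonet
    coeffs:     (N, k) per-instance coefficients surviving NMS
    returns:    (H, W, N) one soft mask per detected instance
    """
    # Contract the shared k axis: each instance's mask is a linear
    # combination of the k prototypes weighted by its coefficients.
    lin = np.tensordot(prototypes, coeffs, axes=([2], [1]))  # (H, W, N)
    return 1.0 / (1.0 + np.exp(-lin))                        # sigmoid nonlinearity
```

Negative coefficients let an instance subtract prototypes, e.g. to carve a neighboring object out of its own mask; the final mask is then cropped by the predicted box and thresholded.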

SLIDE 53

YOLACT: loss function

Cross-entropy between the assembled masks and the ground truth, in addition to the standard losses (regression for the bounding box, and classification for the class of the object/mask).
SLIDE 54

YOLACT: qualitative results

SLIDE 55

YOLACT: qualitative results

For large objects, the quality of the masks is even better than those of two-stage detectors.

SLIDE 56

So, which segmenter to use?

YOLACT

SLIDE 57

YOLACT: improvements

  • A specially designed version of NMS, to make the procedure faster.
  • An auxiliary semantic segmentation loss applied to the final features of the FPN. The module is not used during inference.
  • D. Bolya et al. "YOLACT++: Better Real-time Instance Segmentation". arXiv:1912.06218, 2019

SLIDE 58

Panoptic segmentation
SLIDE 59

Panoptic segmentation

Semantic segmentation + instance segmentation

SLIDE 60

Panoptic segmentation

Semantic segmentation (FCN-like) + instance segmentation (Mask R-CNN)

SLIDE 61

Panoptic segmentation

Semantic segmentation (FCN-like) + instance segmentation (Mask R-CNN) = UPSNet panoptic segmentation

SLIDE 62

Panoptic segmentation

It gives labels to uncountable objects called "stuff" (sky, road, etc.), similar to FCN-like networks. It differentiates between pixels coming from different instances of the same class (countable objects) called "things" (cars, pedestrians, etc.).

SLIDE 63

Panoptic segmentation

Problem: some pixels might get classified as stuff by the FCN network while at the same time being classified as instances of some class by Mask R-CNN (conflicting results)!

SLIDE 64

Panoptic segmentation

Solution: a parameter-free panoptic head which combines the information from the FCN and Mask R-CNN, giving the final predictions.

Xiong et al. "UPSNet: A Unified Panoptic Segmentation Network". CVPR 2019

SLIDE 65

Network architecture

Shared features → separate heads → putting it together

SLIDE 66

Network architecture

SLIDE 67

The semantic head

As in all semantic heads → a fully convolutional network. New: deformable convolutions!

SLIDE 68

Recall: dilated (atrous) convolutions in 2D

(a) The dilation parameter is 1, and each element produced by this filter has a receptive field of 3×3. (b) The dilation parameter is 2, and each element produced by it has a receptive field of 7×7. (c) The dilation parameter is 4, and each element produced by it has a receptive field of 15×15.
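The 3 → 7 → 15 progression above follows from stacking the dilated layers: each 3×3 layer with dilation d grows the receptive field by (3 − 1)·d. A tiny helper (the function name is mine) reproduces the slide's numbers:

```python
def dilated_receptive_fields(kernel=3, dilations=(1, 2, 4)):
    """Receptive field after each layer in a stack of dilated convolutions.

    Each k x k layer with dilation d adds (k - 1) * d pixels of context,
    so dilations 1, 2, 4 with k=3 give receptive fields 3, 7, 15.
    """
    rf = 1                      # a single pixel before any convolution
    fields = []
    for d in dilations:
        rf += (kernel - 1) * d  # context added by this layer
        fields.append(rf)
    return fields
```

Doubling the dilation at every layer makes the receptive field grow exponentially with depth while the parameter count stays constant.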

SLIDE 69

Deformable convolutions

Deformable convolutions: a generalization of dilated convolutions where you learn the offsets.

SLIDE 70

Deformable convolutions

SLIDE 71

Deformable convolutions

The deformable convolution will pick the values at different locations for the convolution, conditioned on the input image or the feature maps.
SLIDE 72

The panoptic head

Mask logits from the instance head; object logits coming from the semantic head (e.g., car); stuff logits coming from the semantic head (e.g., sky).

SLIDE 73

The panoptic head

Mask logits from the instance head; object logits coming from the semantic head (e.g., car); stuff logits coming from the semantic head (e.g., sky). The stuff logits can be evaluated directly; the object logits need to be masked by the instance masks.

SLIDE 74

The panoptic head

Perform softmax over the panoptic logits. If the maximum value falls into the first stuff channels, then the pixel belongs to one of the stuff classes. Otherwise, the index of the maximum value tells us the instance ID the pixel belongs to.

Xiong et al. "UPSNet: A Unified Panoptic Segmentation Network". CVPR 2019

Read the details on how the unknown class is used.
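The per-pixel decoding rule described above can be sketched as follows (a simplified version that ignores the unknown class; the function name is mine). Since softmax is monotonic, the argmax of the logits equals the argmax of the softmax:

```python
import numpy as np

def panoptic_decode(panoptic_logits, num_stuff):
    """Per-pixel decoding of UPSNet-style panoptic logits.

    panoptic_logits: (C, H, W); the first `num_stuff` channels are stuff
    classes, the remaining channels are one per detected instance.
    Returns (label, is_thing): for stuff pixels `label` is the stuff class,
    for thing pixels it is the instance ID.
    """
    argmax = panoptic_logits.argmax(axis=0)   # softmax argmax == logit argmax
    is_thing = argmax >= num_stuff            # fell into an instance channel?
    label = np.where(is_thing, argmax - num_stuff, argmax)
    return label, is_thing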

SLIDE 75

Metrics

SLIDE 76

Panoptic quality

  • SQ: Segmentation Quality = how close the matched predicted segments are to the ground-truth segments (does not take bad predictions into account!)

TP = true positive, FN = false negative, FP = false positive

SLIDE 77

Panoptic quality

  • RQ: Recognition Quality = just like for detection, we want to know whether we are missing any instances (FN) or predicting extra instances (FP).

TP = true positive, FN = false negative, FP = false positive
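Following the standard definition of panoptic quality (Kirillov et al., "Panoptic Segmentation"), PQ factorizes as PQ = SQ · RQ, with SQ the mean IoU over matched pairs and RQ an F1-style detection term. A minimal sketch for one class (the function name is mine):

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ = SQ * RQ for a single class.

    matched_ious: IoU values of matched (TP) segment pairs, each > 0.5
    num_fp:       predicted segments with no match (false positives)
    num_fn:       ground-truth segments with no match (false negatives)
    """
    tp = len(matched_ious)
    if tp == 0:
        return 0.0
    sq = sum(matched_ious) / tp                   # segmentation quality
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)  # recognition quality
    return sq * rq
```

Two perfect matches with no errors give PQ = 1.0, while one match of IoU 0.8 together with one FP and one FN drops PQ to 0.8 · 0.5 = 0.4, showing how SQ and RQ penalize segmentation and detection mistakes separately.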

SLIDE 78

Panoptic quality

  • As in detection, we have to match ground truth and predictions. In this case we have segment matching.
  • A segment is matched if IoU > 0.5. No pixel can belong to two predicted segments.

[Figure: predictions vs. ground truth with IoU measures; matched segments are TPs, unmatched predictions are FPs.]

SLIDES 79–80

Panoptic segmentation: qualitative results (image slides)

SLIDE 81

Object instance segmentation as voting

SLIDE 82

Sliding window approach

  • DPM, R-CNN families
  • Densely enumerate box proposals + classify
  • Tremendously successful, very well engineered paradigm
  • SOTA methods are still based on this paradigm

SLIDE 83

Generalized Hough transform

Before DPM and R-CNN dominance: detection-as-voting

SLIDE 84

Hough voting

  • Detect analytical shapes (e.g., lines) as peaks in the dual parametric space
  • Each pixel casts a vote in this dual space
  • Detect peaks and 'back-project' them to the image space

SLIDE 85

Example: line detection

  • Each edge point in image space casts a vote

SLIDE 86

Example: line detection

  • Each edge point in image space casts a vote
  • The vote is in the form of a line that crosses the point

SLIDE 87

Example: line detection

  • Accumulate votes from different points in the (discretized) parameter space
  • Read out the maxima (peaks) from the accumulator
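The accumulate-and-read-peaks procedure can be sketched for lines in the (ρ, θ) parameterization (a toy implementation; the function name and bin sizes are mine):

```python
import numpy as np

def hough_lines(points, num_theta=180, num_rho=100, rho_max=200.0):
    """Toy Hough transform for line detection.

    Each point (x, y) votes for every line through it: for each angle
    theta, the distance rho = x*cos(theta) + y*sin(theta) is binned into
    a (rho, theta) accumulator. Collinear points pile votes into the
    same cell, so lines appear as peaks in the accumulator.
    """
    thetas = np.linspace(0, np.pi, num_theta, endpoint=False)
    acc = np.zeros((num_rho, num_theta), dtype=int)
    for x, y in points:
        rhos = x * np.cos(thetas) + y * np.sin(thetas)   # one vote per theta
        bins = np.round((rhos + rho_max) / (2 * rho_max) * (num_rho - 1)).astype(int)
        valid = (bins >= 0) & (bins < num_rho)
        acc[bins[valid], np.arange(num_theta)[valid]] += 1
    return acc, thetas
```

Ten points on the horizontal line y = 5 all vote for the same (ρ, θ) cell near θ = π/2, so the accumulator's maximum equals the number of collinear points.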

SLIDE 88

Object detection as voting

  • Idea: objects are detected as consistent configurations of the observed parts (visual words)

Leibe et al. "Robust Object Detection with Interleaved Categorization and Segmentation". IJCV 2008

SLIDE 89

Object detection

  • Training: interest point detection (SIFT, SURF), then center point voting

Leibe et al. "Robust Object Detection with Interleaved Categorization and Segmentation". IJCV 2008

SLIDE 90

Object detection

  • Inference (test time)

SLIDE 91

Back to the future

  • Back to 2020…
  • We can use pixel consensus voting for panoptic segmentation (CVPR 2020)

SLIDE 92

Overview

  • H. Wang et al. "Pixel Consensus Voting for Panoptic Segmentation". CVPR 2020

The instance voting branch predicts, for every pixel, whether the pixel is part of an instance mask and, if so, the relative location of the instance mask centroid.

SLIDE 93

In a nutshell

  • 1. Discretize the region around each pixel.
  • 2. Every pixel votes for a centroid (or no centroid for "stuff") over a set of grid cells.

SLIDE 94

In a nutshell

  • 3. Vote probabilities at each pixel are cast into the accumulator space via (dilated) transposed convolutions.
  • 4. Detect objects as 'peaks' in the accumulator space.

SLIDE 95

In a nutshell

  • 5. Back-project the 'peaks' to the image to get the instance masks.
  • 6. Category information is provided by the parallel semantic segmentation head.

SLIDE 96

Voting lookup table

  • Discretize the region around the pixel: M×M cells converted into K=17 indices.

SLIDE 97

Voting lookup table

  • The vote should be cast to the center, which is the red pixel, corresponding to position 16.

SLIDE 98

Voting

  • At inference, the instance voting branch provides a tensor of size [H, W, K+1]
  • Softly accumulate votes in the voting accumulator. How?

Example: for the blue pixel, we get a vote for index 16 with 0.9 probability (softmax output).
  • Transfer 0.9 to cell 16 -- (dilated) transposed convolution
  • Evenly distribute it among the cell's pixels, each gets 0.1 -- average pooling

SLIDE 99

Transposed convolutions

  • Take a single value in the input
  • Multiply it with a kernel and distribute it in the output map
  • The kernel defines the amount of the input value that is distributed to each of the output cells
  • For the purpose of vote aggregation, however, we fix the kernel parameters to be 1-hot across each channel, marking the target location

SLIDE 100

Voting: implementation

  • Output tensor: [H, W, K+1]
  • Example: 9 inner bins, 8 outer bins, K=17
  • Split the output tensor into two tensors: [H, W, 9] and [H, W, 8]
  • Apply two transposed convolutions, with kernels of size [3, 3, 9], stride=1 and [3, 3, 8], stride=3
  • Pre-fixed kernel parameters: 1-hot across each channel, marking the target location
  • Dilation => spread votes to the outer ring
  • Smooth votes evenly via average pooling
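The effect of those fixed 1-hot transposed convolutions is simply to shift each bin's probability mass by that bin's offset and sum. A simplified scatter-based sketch (not the paper's implementation; the function name and the explicit offset list are mine, and the average-pooling smoothing is omitted):

```python
import numpy as np

def accumulate_votes(vote_probs, offsets):
    """Toy vote aggregation for pixel consensus voting.

    vote_probs: (H, W, K) per-pixel probability over K discretized offsets
    offsets:    list of K (dy, dx) displacements, one per voting bin
    Each pixel adds its probability mass for bin k at (y+dy, x+dx);
    this is what the fixed 1-hot transposed convolutions implement.
    """
    H, W, K = vote_probs.shape
    acc = np.zeros((H, W))
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    for k, (dy, dx) in enumerate(offsets):
        ty, tx = ys + dy, xs + dx                       # vote target positions
        valid = (ty >= 0) & (ty < H) & (tx >= 0) & (tx < W)
        np.add.at(acc, (ty[valid], tx[valid]), vote_probs[..., k][valid])
    return acc
```

Pixels belonging to the same object shift their mass toward a common centroid, so the accumulator develops a peak there while stuff pixels (which vote "no centroid") contribute nothing.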

SLIDE 101

Object detection

  • Peaks in the heatmap -- consensus detections
  • Thresholding + connected components

SLIDE 102

Object localization

  • Vote back-projection
  • For every peak, determine the pixels that favor this region above all others

SLIDE 103

Object localization

  • Idea: determine which pixels could have voted for a specific object center
  • Query filter → examine votes → vote argmax → find 'consensus' → equality test

"My center is at pixel 8!" The bottom-left pixel should have voted for '8' if this is the instance center.

SLIDE 104

Qualitative results

SLIDE 105

Fine-grained scene interpretation

  • Individual objects and surfaces (things and stuff)
  • Mobile robots reason about:
    – the drivability of surfaces
    – the type of objects and obstacles
    – the intent of other agents in the vicinity

SLIDE 106

Instance segmentation