Tomas Jenicek, CMP, CVUT 2

Paper Motivation

  • Fixed geometric structures of CNN models
– “CNNs are inherently limited to model geometric transformations”
  • Higher-level features combine lower-level features at fixed positions as a weighted sum
  • Pooling chooses the dominating features / averages features at fixed positions


Invariance to Geometric Transformations

  • Learned from data augmentation
  • Using transformation-invariant features and algorithms
  • “Unknown or complex geometric transformations not learned or modeled”


Standard Convolution and RoI Pooling

  • Convolution samples the feature map at fixed locations
  • RoI pooling reduces the spatial resolution at a fixed ratio
  • “The higher the layer, the less desired behaviour”


Deformable Convolution

  • Adds 2D offsets to the regular grid sampling locations
  • Free-form deformation of the sampling grid
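The sampling rule can be sketched in NumPy: each position of the regular 3x3 grid is shifted by a learned 2D offset before the feature map is sampled, with bilinear interpolation handling fractional locations. This is a minimal single-channel sketch; the offsets here are placeholders (zero), not values predicted by a network.

```python
# Sketch of one deformable-convolution output value: a 3x3 kernel whose
# sampling grid is shifted per-position by 2D offsets (zeros below).
import numpy as np

def bilinear(feat, y, x):
    """Sample feat at fractional (y, x) with bilinear interpolation."""
    h, w = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    wy, wx = y - y0, x - x0
    val = 0.0
    for yy, wy_ in ((y0, 1 - wy), (y0 + 1, wy)):
        for xx, wx_ in ((x0, 1 - wx), (x0 + 1, wx)):
            if 0 <= yy < h and 0 <= xx < w:
                val += wy_ * wx_ * feat[yy, xx]
    return val

def deform_conv_point(feat, weight, p0, offsets):
    """Weighted sum over the regular 3x3 grid plus per-position offsets."""
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    out = 0.0
    for (dy, dx), w_, (oy, ox) in zip(grid, weight.ravel(), offsets):
        out += w_ * bilinear(feat, p0[0] + dy + oy, p0[1] + dx + ox)
    return out

feat = np.arange(25, dtype=float).reshape(5, 5)
weight = np.ones((3, 3)) / 9.0        # averaging kernel
zero_off = [(0.0, 0.0)] * 9
# With zero offsets this reduces to a standard convolution:
print(deform_conv_point(feat, weight, (2, 2), zero_off))  # ≈ 12.0
```

With all offsets zero the operator degenerates to standard convolution, which is also why the paper can initialise the offset branch to zero without disturbing a pretrained network.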

Deformable Convolution

  • Offsets are learned from the preceding feature maps via additional convolutional layers
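A shape-level sketch of that offset branch, assuming a 3x3 deformable kernel: the extra convolution outputs 2·k·k channels, one (dy, dx) pair per kernel position at every spatial location. All sizes below are hypothetical and the convolution itself is elided; only the channel layout is modeled, with zero initialisation as in the paper.

```python
# Channel layout of the offset branch for a kxk deformable convolution
# (hypothetical sizes; the branch is itself a conv layer, omitted here).
import numpy as np

k = 3                      # deformable kernel size
c_in, h, w = 64, 10, 10    # hypothetical input feature map size
feat = np.random.randn(c_in, h, w)

# The branch maps feat -> 2*k*k offset channels at the same resolution,
# initialised to zero so training starts from a regular grid.
offset_field = np.zeros((2 * k * k, h, w))

# At any single location there are k*k (dy, dx) offset pairs:
dy_dx_at_p = offset_field[:, 4, 4].reshape(k * k, 2)
print(offset_field.shape, dy_dx_at_p.shape)  # (18, 10, 10) (9, 2)
```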


Deformable RoI Pooling

  • Adds a 2D offset to each bin position in the regular bin partition
  • Adaptive part localization for objects with different shapes
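A minimal sketch of the idea: the RoI is split into a regular k×k bin partition and each bin is shifted by its own 2D offset before pooling. This is not the paper's exact operator (which uses bilinear sampling and normalized offsets); integer bin edges and average pooling are used here for brevity.

```python
# Sketch of deformable RoI pooling: per-bin 2D shifts of a regular
# k x k bin partition (integer arithmetic; bilinear sampling omitted).
import numpy as np

def deform_roi_pool(feat, roi, k, bin_offsets):
    """roi = (y0, x0, y1, x1); bin_offsets holds one (dy, dx) per bin."""
    y0, x0, y1, x1 = roi
    bh, bw = (y1 - y0) / k, (x1 - x0) / k
    out = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            dy, dx = bin_offsets[i * k + j]
            ys = int(round(y0 + i * bh + dy)); ye = int(round(y0 + (i + 1) * bh + dy))
            xs = int(round(x0 + j * bw + dx)); xe = int(round(x0 + (j + 1) * bw + dx))
            out[i, j] = feat[max(ys, 0):max(ye, 0), max(xs, 0):max(xe, 0)].mean()
    return out

feat = np.arange(64, dtype=float).reshape(8, 8)
no_shift = [(0, 0)] * 4
print(deform_roi_pool(feat, (0, 0, 8, 8), 2, no_shift))
# -> [[13.5 17.5]
#     [45.5 49.5]]
```

With zero offsets the result is plain RoI average pooling; non-zero offsets let each bin slide toward the object part it should cover.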


Deformable RoI Pooling

  • Offsets are learned from the preceding feature maps via an additional RoI pooling layer and a fully connected layer


Deformable Position-Sensitive RoI Pooling

  • Differs by having a different set of feature maps for each bin position


Deformable Convolution and RoI Pooling Summary

  • Inference: offsets depend on the input features
  • Learning: offsets are learned from data
  • Filters are differentiable

Method Details

  • Offsets are fractional → bilinear interpolation
  • For (PS) RoI pooling, normalized offsets must be used
  • The number of additional parameters:
– Convolution and RoI pooling:
– PS RoI pooling:
  • Learning rate for offsets can be different
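The normalized-offset detail can be sketched numerically. The scale factor gamma = 0.1 is the empirical value the presenter cites from page 3 of the paper; the RoI size and predicted offsets below are hypothetical.

```python
# Converting a predicted normalized RoI offset into pixels: the offset
# is scaled by the RoI size and by an empirical gamma = 0.1, so that
# deformations stay commensurate with the RoI regardless of its size.
gamma = 0.1
roi_w, roi_h = 120.0, 80.0        # hypothetical RoI size in pixels
norm_dy, norm_dx = 0.5, -0.25     # hypothetical predicted normalized offsets
dy = gamma * norm_dy * roi_h      # vertical offset in pixels
dx = gamma * norm_dx * roi_w      # horizontal offset in pixels
print(dy, dx)
```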

PS RoI Offsets Examples

  • One 3x3 deformable PS RoI pooling layer
  • Input: a bounding box with a label

PS RoI Offsets Examples (figure only)


Conv Offsets Examples

  • Three consecutive 3x3 deformable convolutional layers = 9^3 points


Conv Example – Man and a Goat

  • Blue dots – standard convolution sample locations
  • Red dots – deformable convolution sample locations
  • For 1, 2 and 3 consecutive layers


Conv Example – Man and a Goat

  • Center of convolution on a man, sky and grass
  • For 3 consecutive layers

Conv Example – Man and a Goat

  • The magnitude of offsets
  • For 3 consecutive layers – res5a, res5b and res5c


Conv Example – Man and a Goat

  • The anisotropic scale HSV visualization
  • Red – horizontal, Green – vertical
  • For 3 consecutive layers

Conv Example – Man and a Goat

  • Offsets HSV visualization
  • For 3 consecutive layers


Conv Example – Cars

  • The magnitude of offsets
  • For 3 consecutive layers
  • The foreground–background separation can be seen


Affine Transformation Approximation

  • The “unknown and complex” transformation was approximated by an affine transformation
  • Format is MEAN (STD), the first value is the vertical axis
  • Unit is pixels in the feature map
  • Other tested images had similar results

                       Man and a Goat        Cars
  Mean squared error   3.1 (1.5)             2.7 (1.4)
  Scale                3.4, 3.7 (0.8, 1.1)   2.9, 3.6 (1.0, 1.1)
  Translation          0.8, 0.0 (1.3, 0.2)   0.3, 0.0 (1.2, 0.1)
  Rotation             0.1 (0.0)             0.1 (0.0)
  Shear                0.0 (0.0)             0.0 (0.0)


Statistics of Learned Scale - Effective Dilation

  • The mean of the distances between all adjacent pairs of sampling locations in the deformable convolution filter
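The statistic defined above can be computed directly; one reading of "adjacent pairs" (assumed here) is the 4-neighbour pairs of the 3x3 grid, which gives the usual sanity checks: an undeformed grid has effective dilation 1, and uniformly doubling all coordinates gives 2.

```python
# Effective dilation: mean distance between adjacent pairs of the nine
# (possibly deformed) sampling locations of a 3x3 filter.
import numpy as np

def effective_dilation(points):
    """points: nine (y, x) sampling locations in row-major 3x3 order."""
    pts = np.asarray(points, dtype=float).reshape(3, 3, 2)
    dists = []
    for i in range(3):
        for j in range(3):
            if j + 1 < 3:   # horizontal neighbour
                dists.append(np.linalg.norm(pts[i, j + 1] - pts[i, j]))
            if i + 1 < 3:   # vertical neighbour
                dists.append(np.linalg.norm(pts[i + 1, j] - pts[i, j]))
    return float(np.mean(dists))

grid = [(y, x) for y in (0, 1, 2) for x in (0, 1, 2)]
print(effective_dilation(grid))                            # 1.0
print(effective_dilation([(2 * y, 2 * x) for y, x in grid]))  # 2.0
```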


Remarks

  • The shift is a function of the feature maps and is not constrained by any (e.g. affine) transformation
  • Surprisingly, no need for shift regularization

Relation to Deformable Part Models

  • Maximizing the similarity of parts while minimizing the inter-part connection cost
  • Inference can be converted to a CNN; learning is not end-to-end
  • Deformable convolutions: no spatial relations between parts, unlimited in modeling deformations


Relation to Spatial Transform Networks

  • 1. Localization net
– Input: feature map
– Output: affine transformation
  • 2. Grid generator
– Generates a sampling grid according to the transformation
  • 3. Sampler
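The grid-generator step can be sketched as follows; the localization net is omitted, and `theta` is a hypothetical predicted affine transformation (the identity here, so the grid is the undistorted output frame).

```python
# STN grid generator: map every output location through a 2x3 affine
# matrix theta to get the input-space sampling coordinates.
import numpy as np

def affine_grid(theta, h, w):
    """Return an (h, w, 2) grid of (x, y) sampling coordinates in [-1, 1]."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w),
                         indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # homogeneous
    return (theta @ coords).T.reshape(h, w, 2)

identity = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
grid = affine_grid(identity, 4, 4)
print(grid.shape)               # (4, 4, 2)
print(grid[0, 0], grid[3, 3])   # corners map to (-1, -1) and (1, 1)
```

The sampler then reads the feature map at these (typically fractional) coordinates with bilinear interpolation, analogous to the interpolation used in deformable convolution.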

Relation to Spatial Transform Networks

  • Can be inserted between any two layers
  • Deformable convolutions:
– No global parametric transformation
– Easier training


Relation to Atrous / Dilated Convolutions

  • Exponential expansion of the receptive field
  • Deformable convolutions: input-dependent and learnable dilated convolution
  • Both can replace filters with larger receptive field while constraining their connectivity
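The exponential expansion can be verified with a small receptive-field calculation: a stacked 3x3 convolution with dilation d widens the receptive field by 2d per axis, so dilations 1, 2, 4, … give receptive fields 3, 7, 15, … (this is the standard formula, not something specific to the paper).

```python
# Receptive-field growth of stacked kxk dilated convolutions:
# each layer with dilation d adds (k - 1) * d to the receptive field.
def receptive_field(dilations, k=3):
    rf = 1
    for d in dilations:
        rf += (k - 1) * d
    return rf

print(receptive_field([1]))         # 3
print(receptive_field([1, 2]))      # 7
print(receptive_field([1, 2, 4]))   # 15  -- exponential in depth
```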


Relation to Active Convolution

  • Learning the shape of convolution during training
  • Deformable convolutions: input-dependent offsets

Relation to Dynamic Filter Network

  • Weights for convolution are generated from the input feature map
  • Deformable convolutions: the same, but for offsets

Their Task

  • Semantic segmentation
  • Object detection

Their Setup

SoA object detection and semantic segmentation CNNs:

  • 1. Deep network generates feature maps
– Replace last 3 conv layers with deformable ones
  • 2. Shallow task-specific network generates results
– Replace (PS) RoI pooling with deformable

Convolutions and offsets are learned simultaneously


Results

  • Object detection
– VOC 07: 82.3 vs. 79.6 mAP@0.5
– COCO: 56.8 vs. 54.3 mAP@0.5
  • Semantic segmentation
– Cityscapes: 75.2 vs. 70.3 mIoU
– VOC 12: 75.9 vs. 70.7 mIoU
  • Others’ results
– COCO (with Soft-NMS): 62.8 mAP@0.5


Paper Evaluation – Formal Objections

  • Page 2, formula (2) – the notation is misleading, since … depends on …
  • Page 3, paragraph 3 – a scalar gamma further scales the normalized offsets, empirically set to 0.1
  • Page 5, figure 4 – the figure is misleading, the output feature map has depth (C+1)

Paper Evaluation - Subjective Objections

  • Page 3, paragraphs 1 and 2 – the notation … is ambiguous
  • Max pooling application is missing

References

  • Jaderberg, Max, Karen Simonyan, and Andrew Zisserman. "Spatial transformer networks." Advances in Neural Information Processing Systems. 2015.
  • Jeon, Yunho, and Junmo Kim. "Active Convolution: Learning the Shape of Convolution for Image Classification." arXiv preprint arXiv:1703.09076 (2017).
  • Yu, Fisher, and Vladlen Koltun. "Multi-scale context aggregation by dilated convolutions." arXiv preprint arXiv:1511.07122 (2015).
  • Felzenszwalb, Pedro F., et al. "Object detection with discriminatively trained part-based models." IEEE Transactions on Pattern Analysis and Machine Intelligence 32.9 (2010): 1627-1645.
  • De Brabandere, Bert, et al. "Dynamic filter networks." Neural Information Processing Systems (NIPS). 2016.