Tomas Jenicek, CMP, CVUT 2
Paper Motivation
- Fixed geometric structures of CNN models
– “CNNs are inherently limited to model geometric transformations”
- Higher-level features combine lower-level features at fixed positions as a weighted sum
- Pooling chooses the dominating features / averages features at fixed positions
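The "weighted sum at fixed positions" can be sketched in plain Python (a minimal illustration; the names `R` and `conv_at`, and the averaging filter, are mine, not from the paper):

```python
# A 3x3 convolution output at location p0 is a weighted sum of input
# values sampled at p0 + pn for the FIXED grid R -- the grid never adapts.
R = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # fixed 3x3 grid

def conv_at(x, w, p0):
    """Weighted sum of x over the fixed sampling grid R centered at p0."""
    y0, x0 = p0
    return sum(w[i] * x[y0 + dy][x0 + dx] for i, (dy, dx) in enumerate(R))

# Example: an averaging filter on a constant image returns (about) the constant.
img = [[2.0] * 5 for _ in range(5)]
weights = [1.0 / 9] * 9
print(conv_at(img, weights, (2, 2)))
```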
Invariance to Geometric Transformations
- Learned from data augmentation
- Using transformation-invariant features and algorithms
- “Unknown or complex geometric transformations not learned or modeled”
Standard Convolution and RoI Pooling
- Convolution samples the feature map at fixed locations
- RoI pooling reduces the spatial resolution at a fixed ratio
- “The higher the layer, the less desired behaviour”
Deformable Convolution
- Adds a 2D offset to the regular grid sampling locations
- Free-form deformation of the sampling grid
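A hedged sketch of the idea: each grid tap gets its own 2D offset before sampling (integer offsets here for brevity; fractional offsets would need the bilinear interpolation covered later; the function name is mine):

```python
# Deformable convolution: add a learned 2D offset to each fixed grid
# location of the 3x3 kernel before sampling the input feature map.
R = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def deform_conv_at(x, w, p0, offsets):
    y0, x0 = p0
    out = 0.0
    for i, (dy, dx) in enumerate(R):
        oy, ox = offsets[i]               # learned offset for this tap
        out += w[i] * x[y0 + dy + oy][x0 + dx + ox]
    return out

img = [[float(r * 10 + c) for c in range(7)] for r in range(7)]
w = [0.0] * 9; w[4] = 1.0                 # identity filter: keeps only the center tap
zero = [(0, 0)] * 9                       # zero offsets = standard convolution
shifted = [(1, 2)] * 9                    # shift every tap down 1, right 2
print(deform_conv_at(img, w, (3, 3), zero))     # samples (3,3) -> 33.0
print(deform_conv_at(img, w, (3, 3), shifted))  # samples (4,5) -> 45.0
```

With zero offsets the operation reduces exactly to the standard convolution above; the offsets deform the grid freely, per tap.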
Deformable Convolution
- Offsets are learned from the preceding feature maps via additional convolutional layers
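The shape bookkeeping of that offset branch can be sketched as follows (a framework-agnostic assumption on my part, matching the usual 2·k·k-channel layout):

```python
def offset_field_shape(k, h, w):
    """Shape of the offset field a parallel conv layer must output for a
    k x k deformable convolution over an h x w feature map: one (dy, dx)
    pair per grid tap, per output location -> 2*k*k channels."""
    return (2 * k * k, h, w)

print(offset_field_shape(3, 14, 14))  # (18, 14, 14)
```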
Deformable RoI Pooling
- Adds a 2D offset to each bin position in the regular bin partition
- Adaptive part localization for objects with different shapes
Deformable RoI Pooling
- Offsets are learned from the preceding feature maps via an additional RoI pooling layer and a fully connected layer
Deformable Position-Sensitive RoI Pooling
- Differs by having a different set of feature maps for each bin position
Deformable Convolution and RoI Pooling Summary
- Inference: offsets depend on the input features
- Learning: offsets are learned from data
- Filters are differentiable
Method Details
- Offsets are fractional → bilinear interpolation is used
- For (PS) RoI pooling, normalized offsets must be used
- The number of additional parameters:
– Convolution and RoI pooling:
– PS RoI pooling:
- The learning rate for the offsets can be set differently
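The bilinear interpolation used for fractional sampling locations can be sketched directly (function name and test grid are mine):

```python
import math

def bilinear(x, y_f, x_f):
    """Sample feature map x at a fractional location (y_f, x_f) by
    bilinearly weighting the four surrounding integer grid points."""
    y0, x0 = math.floor(y_f), math.floor(x_f)
    ay, ax = y_f - y0, x_f - x0              # fractional parts
    return ((1 - ay) * (1 - ax) * x[y0][x0] +
            (1 - ay) * ax       * x[y0][x0 + 1] +
            ay       * (1 - ax) * x[y0 + 1][x0] +
            ay       * ax       * x[y0 + 1][x0 + 1])

grid = [[0.0, 1.0], [2.0, 3.0]]
print(bilinear(grid, 0.5, 0.5))  # 1.5, the mean of the four corners
```

Because the weights are linear in (y_f, x_f), gradients flow back through the sampling locations, which is what makes the offsets learnable end-to-end.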
PS RoI Offsets Examples
- One 3x3 deformable PS RoI pooling layer
- Input: a bounding box with a label
PS RoI Offsets Examples
Conv Offsets Examples
- Three consecutive 3x3 deformable convolutional layers = 9^3 points
Conv Example – Man and a Goat
- Blue dots – standard convolution sample locations
- Red dots – deformable convolution sample locations
- For 1, 2 and 3 consecutive layers
Conv Example – Man and a Goat
- Center of convolution on a man, the sky and grass
- For 3 consecutive layers
Conv Example – Man and a Goat
- The magnitude of the offsets
- For 3 consecutive layers – res5a, res5b and res5c
Conv Example – Man and a Goat
- The anisotropic scale HSV visualization
- Red – horizontal, Green – vertical
- For 3 consecutive layers
Conv Example – Man and a Goat
- Offsets HSV visualization
- For 3 layers
Conv Example – Cars
- The magnitude of the offsets
- For 3 consecutive layers
- The foreground-background separation can be seen
Affine Transformation Approximation
- The “unknown and complex” transformation was approximated by an affine transformation
- Format is MEAN (STD); the first value is for the vertical axis
- Unit is pixels in the feature map
- Other tested images had similar results
                    Man and a Goat         Cars
Mean squared error  3.1 (1.5)              2.7 (1.4)
Scale               3.4, 3.7 (0.8, 1.1)    2.9, 3.6 (1.0, 1.1)
Translation         0.8, 0.0 (1.3, 0.2)    0.3, 0.0 (1.2, 0.1)
Rotation            -0.1 (0.0)             -0.1 (0.0)
Shear               0.0 (0.0)              0.0 (0.0)
Statistics of Learned Scale - Effective Dilation
- The mean of the distances between all adjacent pairs of sampling locations in the deformable convolution filter
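The statistic can be computed directly from the sampling locations; a minimal sketch (the function name and the dict representation of the filter taps are my assumptions):

```python
# "Effective dilation": mean distance between horizontally/vertically
# adjacent sampling locations of a (possibly deformed) 3x3 filter.
def effective_dilation(points):
    """points: dict mapping (i, j) grid index -> (y, x) sampling location."""
    dists = []
    for (i, j), (y, x) in points.items():
        for ni, nj in ((i + 1, j), (i, j + 1)):   # right and down neighbours
            if (ni, nj) in points:
                ny, nx = points[(ni, nj)]
                dists.append(((ny - y) ** 2 + (nx - x) ** 2) ** 0.5)
    return sum(dists) / len(dists)

# Sanity check: a regular 3x3 grid with spacing 2 has effective dilation 2.
regular = {(i, j): (2.0 * i, 2.0 * j) for i in range(3) for j in range(3)}
print(effective_dilation(regular))  # 2.0
```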
Remarks
- The shift is a function of the feature maps and is not constrained to any (e.g. affine) transformation
- Surprisingly, no regularization of the shifts is needed
Relation to Deformable Part Models
- Maximizing the similarity of parts while minimizing the inter-part connection cost
- Inference can be converted to a CNN; learning is not end-to-end
- Deformable convolutions: no spatial relations between parts, unlimited in modeling deformations
Relation to Spatial Transform Networks
- 1. Localization net
- Input: feature map
- Output: affine transformation
- 2. Grid generator
- Generate a sampling grid according to transformation
- 3. Sampler
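The grid-generator step above can be sketched as follows (a simplified illustration with unnormalized pixel coordinates; STNs actually use normalized coordinates, and the sampler step is omitted):

```python
# STN grid generator: one GLOBAL affine transform maps every output grid
# coordinate to the input coordinate that should be sampled there.
def affine_grid(theta, h, w):
    """theta: 2x3 affine matrix applied to (y, x, 1); returns the input
    (y, x) coordinate sampled for each output location."""
    return [[(theta[0][0] * y + theta[0][1] * x + theta[0][2],
              theta[1][0] * y + theta[1][1] * x + theta[1][2])
             for x in range(w)] for y in range(h)]

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
shift = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]     # translate 2 rows down
print(affine_grid(identity, 2, 2)[1][1])  # (1.0, 1.0)
print(affine_grid(shift, 2, 2)[1][1])     # (3.0, 1.0)
```

Note the contrast with deformable convolution: here a single parametric transform warps the whole grid, rather than a free per-tap offset field.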
Relation to Spatial Transform Networks
- Can be inserted between any two layers
- Deformable convolutions:
– No global parametric transformation
– Easier training
Relation to Atrous / Dilated Convolutions
- Exponential expansion of the receptive field
- Deformable convolutions: an input-dependent, learnable dilated convolution
- Both can replace filters with a larger receptive field while constraining their connectivity
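The relation is easy to see from the sampling grids: a dilated convolution is the special case where every tap is scaled by the same fixed integer factor (a sketch of mine, not from the paper):

```python
# A dilated (atrous) k x k convolution samples on a regular grid scaled by
# the dilation d -- a fixed, input-independent special case of deformation.
def dilated_grid(k, d):
    """Sampling offsets of a k x k convolution with dilation d."""
    r = range(-(k // 2), k // 2 + 1)
    return [(d * dy, d * dx) for dy in r for dx in r]

print(dilated_grid(3, 2))
# taps reach from (-2, -2) to (2, 2): the receptive field grows with d,
# but the grid stays regular, unlike learned deformable offsets
```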
Relation to Active Convolution
- Learning the shape of the convolution during training
- Deformable convolutions: input-dependent offsets
Relation to Dynamic Filter Network
- Weights for the convolution are generated from the input feature map
- Deformable convolutions: the same, but for offsets
Their Task
- Semantic segmentation
- Object detection
Their Setup
State-of-the-art object detection and semantic segmentation CNNs:
- 1. A deep network generates feature maps
– Replace the last 3 conv layers with deformable convolutions
- 2. A shallow task-specific network generates results
– Replace (PS) RoI pooling with its deformable variant
Convolutions and offsets are learned simultaneously
Results
- Object detection
– VOC 07: 82.3 vs. 79.6 mAP@0.5
– COCO: 56.8 vs. 54.3 mAP@0.5
- Semantic segmentation
– Cityscapes: 75.2 vs. 70.3 mIoU
– VOC 12: 75.9 vs. 70.7 mIoU
- Others’ results
– COCO (with Soft-NMS): 62.8 mAP@0.5
Paper Evaluation – Formal Objections
- Page 2, formula (2) – the notation is misleading, since … depends on …
- Page 3, paragraph 3 – a scalar gamma further scales the normalized offsets; it is empirically set to 0.1
- Page 5, figure 4 – the figure is misleading: the output feature map has depth (C+1)
Paper Evaluation - Subjective Objections
- Page 3, paragraphs 1 and 2 – the notation … is ambiguous
- The application of max pooling is missing
References
- Jaderberg, Max, Karen Simonyan, and Andrew Zisserman. "Spatial transformer networks." Advances in Neural Information Processing Systems. 2015.
- Jeon, Yunho, and Junmo Kim. "Active Convolution: Learning the Shape of Convolution for Image Classification." arXiv preprint arXiv:1703.09076 (2017).
- Yu, Fisher, and Vladlen Koltun. "Multi-scale context aggregation by dilated convolutions." arXiv preprint arXiv:1511.07122 (2015).
- Felzenszwalb, Pedro F., et al. "Object detection with discriminatively trained part-based models." IEEE Transactions on Pattern Analysis and Machine Intelligence 32.9 (2010): 1627-1645.
- De Brabandere, Bert, et al. "Dynamic filter networks." Neural Information Processing Systems (NIPS). 2016.