Spatial Transformers in Feed-Forward Networks, Max Jaderberg et al. (PowerPoint PPT presentation)



SLIDE 1

Spatial Transformers in Feed-Forward Networks

Max Jaderberg, Karen Simonyan, Andrew Zisserman and Koray Kavukcuoglu, Google DeepMind and University of Oxford

SLIDE 2

ConvNets

  • Interleaving convolutional layers with max-pooling layers allows translation invariance.
  • Pooling is simplistic:
  • Only small invariances per pooling layer
  • Limited spatial transformations
  • Pools across the entire image

+ Exceptionally effective

  • Can we do better?
SLIDE 3

Motivation 1: transformations of input data

Rotated MNIST (+/- 90°)

SLIDE 4

Motivation 2: attention

SLIDE 5

Conditional Spatial Warping

  • Conditional on the input feature map, spatially warp the image.

+ Transforms data into a space expected by subsequent layers
+ Intelligently selects features of interest (attention)
+ Invariant to more generic warping

[Figure: example spatial transforms applied to input images]

SLIDE 6

Conditional Spatial Warping

[Figure: network input → spatial transformer → output]
SLIDE 7

[Figure: Spatial Transformer module: input U → localisation net → grid generator → sampler → output V]

A differentiable module for spatially transforming data, conditional on the data itself

SLIDE 8

Can parameterise, e.g., an affine transformation (input U → output V)

Sampling Grid

Warp a regular grid by an affine transformation, mapping each target coordinate (x_i^t, y_i^t) to a source coordinate (x_i^s, y_i^s):

$$
\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix}
= \mathcal{T}_\theta(G_i)
= A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}
= \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix}
\begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}
$$
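As an illustrative sketch (not the authors' code; function and variable names are my own), the grid generator of this slide can be transcribed directly: each target pixel of a regular grid, in normalised [-1, 1] coordinates, is mapped through the 2x3 matrix A_θ to a source location.

```python
def make_affine_grid(theta, height, width):
    """Map each target pixel (x_t, y_t) of a regular height x width grid
    through the 2x3 affine matrix theta, giving the source coordinates
    (x_s, y_s). Coordinates are normalised to [-1, 1].
    theta: [[t11, t12, t13], [t21, t22, t23]]
    Returns a list of rows of (x_s, y_s) pairs."""
    grid = []
    for n in range(height):
        # normalised target y coordinate in [-1, 1]
        y_t = -1.0 + 2.0 * n / (height - 1) if height > 1 else 0.0
        row = []
        for m in range(width):
            x_t = -1.0 + 2.0 * m / (width - 1) if width > 1 else 0.0
            x_s = theta[0][0] * x_t + theta[0][1] * y_t + theta[0][2]
            y_s = theta[1][0] * x_t + theta[1][1] * y_t + theta[1][2]
            row.append((x_s, y_s))
        grid.append(row)
    return grid

# With the identity transform the sampling grid coincides with the target grid.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
g = make_affine_grid(identity, 3, 3)
```

The [-1, 1] normalisation means the same θ works regardless of feature-map resolution.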

SLIDE 9

Can parameterise attention (input U → output V)

Sampling Grid

Warp a regular grid by a constrained affine transformation (isotropic scale s, translation (t_x, t_y)):

$$
\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix}
= \mathcal{T}_\theta(G_i)
= A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}
= \begin{bmatrix} s & 0 & t_x \\ 0 & s & t_y \end{bmatrix}
\begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}
$$
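A small sketch of the attention parameterisation (illustrative names, not the authors' code): with A_θ constrained to an isotropic scale s and translation (t_x, t_y), the transformer behaves as a differentiable crop-and-zoom.

```python
def attention_theta(s, t_x, t_y):
    """Constrained 2x3 affine matrix for attention: isotropic scale s
    plus translation (t_x, t_y) -- a differentiable crop-and-zoom."""
    return [[s, 0.0, t_x], [0.0, s, t_y]]

def map_point(theta, x_t, y_t):
    """Apply the 2x3 matrix to a homogeneous target coordinate."""
    x_s = theta[0][0] * x_t + theta[0][1] * y_t + theta[0][2]
    y_s = theta[1][0] * x_t + theta[1][1] * y_t + theta[1][2]
    return (x_s, y_s)

# With s = 0.5 and t = (0.5, 0.5), the whole output grid reads from the
# lower-right quadrant of the input: an attention crop on that region.
theta = attention_theta(0.5, 0.5, 0.5)
corners = [map_point(theta, x, y) for x in (-1.0, 1.0) for y in (-1.0, 1.0)]
```

Because only 3 numbers need regressing instead of 6, the localisation net for attention can be very small.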

SLIDE 10

[Figure: identity transformation vs. a general affine transformation of the sampling grid]

SLIDE 11

Sampler

Sample the input feature map U to produce the output feature map V (i.e. texture mapping), and gradients are defined to allow backprop. E.g. for bilinear interpolation:

$$
V_i^c = \sum_n^H \sum_m^W U_{nm}^c \,
\max(0, 1 - |x_i^s - m|) \, \max(0, 1 - |y_i^s - n|)
$$

$$
\frac{\partial V_i^c}{\partial U_{nm}^c}
= \sum_n^H \sum_m^W
\max(0, 1 - |x_i^s - m|) \, \max(0, 1 - |y_i^s - n|)
$$
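The bilinear sampling formula can be transcribed literally; the following is a pure-Python sketch with illustrative names, not an efficient or official implementation.

```python
def bilinear_sample(U, points):
    """Literal transcription of the slide's bilinear kernel:
    V_i = sum_n sum_m U[n][m] * max(0, 1-|x_s - m|) * max(0, 1-|y_s - n|).
    U: H x W feature map (single channel); points: list of (x_s, y_s)
    source coordinates in pixel units. Out-of-range samples give 0, and
    every term is (sub-)differentiable, so gradients exist w.r.t. both
    U and the sampling coordinates."""
    H, W = len(U), len(U[0])
    V = []
    for (x_s, y_s) in points:
        v = 0.0
        for n in range(H):
            wy = max(0.0, 1.0 - abs(y_s - n))
            if wy == 0.0:
                continue  # the kernel only touches the two nearest rows
            for m in range(W):
                wx = max(0.0, 1.0 - abs(x_s - m))
                v += U[n][m] * wx * wy
        V.append(v)
    return V

U = [[0.0, 1.0],
     [2.0, 3.0]]
# integer coordinates return the pixel itself; (0.5, 0.5) averages all four
out = bilinear_sample(U, [(0.0, 0.0), (1.0, 1.0), (0.5, 0.5)])
```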

SLIDE 12

[Figure (recap): Spatial Transformer module: input U → localisation net → grid generator → sampler → output V]

A differentiable module for spatially transforming data, conditional on the data itself

SLIDE 13

Spatial Transformer Networks

  • The spatial transformer is differentiable, and so can be inserted at any point in a feed-forward network and trained by backpropagation.

Example:

  • digit classification, loss: cross-entropy for 10-way classification

[Figure: input → CNN → "9"; input → ST → CNN → "9"]
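To make the insertion concrete, here is a minimal forward pass of the whole module in pure Python (a sketch, not the paper's implementation: the localisation net is omitted and θ is taken as given; all names are illustrative). With the identity θ the module is a no-op, which is also how the paper initialises the localisation net.

```python
def spatial_transformer(U, theta, out_h, out_w):
    """Forward pass of a spatial transformer: grid generator + bilinear
    sampler. U: H x W input map; theta: 2x3 affine matrix in normalised
    [-1, 1] coordinates; returns the out_h x out_w warped output V."""
    H, W = len(U), len(U[0])
    V = []
    for n_t in range(out_h):
        y_t = -1.0 + 2.0 * n_t / (out_h - 1) if out_h > 1 else 0.0
        row = []
        for m_t in range(out_w):
            x_t = -1.0 + 2.0 * m_t / (out_w - 1) if out_w > 1 else 0.0
            # grid generator: target -> source, in normalised coordinates
            x_s = theta[0][0] * x_t + theta[0][1] * y_t + theta[0][2]
            y_s = theta[1][0] * x_t + theta[1][1] * y_t + theta[1][2]
            # convert to pixel coordinates
            x_p = (x_s + 1.0) * (W - 1) / 2.0
            y_p = (y_s + 1.0) * (H - 1) / 2.0
            # bilinear sampler
            v = 0.0
            for n in range(H):
                wy = max(0.0, 1.0 - abs(y_p - n))
                if wy == 0.0:
                    continue
                for m in range(W):
                    v += U[n][m] * max(0.0, 1.0 - abs(x_p - m)) * wy
            row.append(v)
        V.append(row)
    return V

U = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
V = spatial_transformer(U, identity, 3, 3)
```

In a real network this would be applied per channel, θ would come from a small localisation net, and everything would be differentiated automatically.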

SLIDE 14

MNIST Digit Classification

Training data: 6,000 examples of each digit. Test data: 10k images. Test errors as low as 0.23% are achievable.

SLIDE 15
  • Training and test data randomly rotated by +/- 90°
  • Fully connected network with an affine ST on the input

Task: classify MNIST digits

[Figure: network input → spatial transformer → output]

Performance (% test error):

  • FCN: 2.1
  • CNN: 1.2
  • ST-FCN: 1.2
  • ST-CNN: 0.7
SLIDE 16

Generalizations 1: transformations

  • Affine transformation: 6 parameters
  • Projective transformation: 8 parameters
  • Thin plate spline transformation
  • etc.

Any transformation whose parameters can be regressed.
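For instance, only the point mapping changes in the 8-parameter projective case; a hedged sketch (illustrative names, one possible parameterisation with the homogeneous row fixed to (p7, p8, 1)):

```python
def projective_map(p, x_t, y_t):
    """Map a target point through an 8-parameter projective transform.
    p = (p1..p8); the third homogeneous row is (p7, p8, 1)."""
    p1, p2, p3, p4, p5, p6, p7, p8 = p
    w = p7 * x_t + p8 * y_t + 1.0   # homogeneous divisor
    x_s = (p1 * x_t + p2 * y_t + p3) / w
    y_s = (p4 * x_t + p5 * y_t + p6) / w
    return (x_s, y_s)

# With p7 = p8 = 0 the divisor is 1 and this reduces to an affine map.
affine_like = (1.0, 0.0, 0.25, 0.0, 1.0, -0.25, 0.0, 0.0)
```

The sampler is unchanged; only the grid generator needs swapping, which is what makes the module generic.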

SLIDE 17

Rotated MNIST

[Figure: input, predicted ST warp, and output for ST-FCN Affine vs. ST-FCN Thin Plate Spline; output digits 7 6 2 7 8 7 3 2 9 7]

SLIDE 18

Rotated, Translated & Scaled MNIST

[Figure: input, predicted ST warp, and output for ST-FCN Projective vs. ST-FCN Thin Plate Spline; output digits 6 7 3 1 7 9 5 8 5]

SLIDE 19

Translated Cluttered MNIST

[Figure: input, predicted ST warp, and output for ST-FCN Affine vs. ST-CNN Affine; output digits 5 1 2 4 4 6 3 5 5 6]

SLIDE 20

Performance results

R: rotation; RTS: rotation, translation, scale; P: projective; E: elastic

SLIDE 21
Generalization 2: Multiple spatial transformers

  • Spatial transformers can be inserted before/after conv layers, and before/after max-pooling.
  • Can also have multiple spatial transformers at the same level.

[Figure: conv1 → ST1 → conv2 → ST2a/ST2b → conv3 → ST3 → "9"]

SLIDE 22

Task: Add digits in two images

MNIST digits under rotation, translation and scale

Architecture

SLIDE 23

Task: Add digits in two images

MNIST 2-channel addition: add up the digits, one per channel, under random per-channel rotation, scale and translation.

[Figure: input (2 channels) → SpatialTransformer1 / SpatialTransformer2 → rectified channels]

SpatialTransformer1 automatically specialises to rectify channel 1; SpatialTransformer2 automatically specialises to rectify channel 2.

SLIDE 24

Task: Add digits in two images

Performance (% error)

MNIST digits under rotation, translation and scale

SLIDE 25

Applications and comparisons with the state of the art

SLIDE 26

Street View House Numbers (SVHN)

  • 200k real images of house numbers collected from Street View
  • Between 1 and 5 digits in each number

Architecture:

4 spatial transformer + conv layers, 4 conv layers, 3 fc layers, 5 character output layers (unused positions output null, e.g. "222" → 2, 2, 2, null, null)

[Figure: conv1 → ST1 → conv2 → ST2 → fc3]

SLIDE 27

SVHN 64x64

  • CNN: 4.0% error (single model), Goodfellow et al., 2013
  • Attention: 3.9% error (ensemble with MC averaging), Ba et al., ICLR 2015
  • ST net: 3.6% error (single model)
SLIDE 28

SVHN 128x128

  • CNN: 5.6% error (single model)
  • Attention: 4.5% error (ensemble with MC averaging), Ba et al., ICLR 2015
  • ST net: 3.9% error (single model)
SLIDE 29

Fine Grained Visual Categorization

CUB-200-2011 birds dataset

  • 200 species of birds
  • 6k training images
  • 5.8k test images
SLIDE 30

Spatial Transformer Network

  • Pre-train Inception networks on ImageNet
  • Train the spatial transformer network on fine-grained multi-way classification
SLIDE 31

CUB Performance

SLIDE 32

Summary

  • Spatial transformers allow dynamic, conditional cropping and warping of images/feature maps.
  • Can be constrained and used as a very fast attention mechanism.
  • Spatial transformer networks localise and rectify objects automatically, achieving state-of-the-art results.
  • Can be used as a generic localisation mechanism which can be learnt with backprop.