Spatial Transformers in Feed-Forward Networks

  1. Spatial Transformers in Feed-Forward Networks Max Jaderberg, Karen Simonyan, Andrew Zisserman and Koray Kavukcuoglu Google DeepMind and University of Oxford

  2. ConvNets
• Interleaving convolutional layers with max-pooling layers allows translation invariance.
  − Pooling is simplistic: only small invariances per pooling layer
  − Limited spatial transformations
  − Pools across the entire image
  + Exceptionally effective
• Can we do better?

  3. Motivation 1: transformations of input data Rotated MNIST (+/- 90°)

  4. Motivation 2: attention

  5. Conditional Spatial Warping
• Conditional on the input feature map, spatially warp the image.
  + Transforms data to a space expected by subsequent layers
  + Intelligently selects features of interest (attention)
  + Invariant to more generic warping transforms

  6. Conditional Spatial Warping (diagram: input → spatial transformer → network → output)

  7. Spatial Transformer: a differentiable module for spatially transforming data, conditional on the data itself (diagram: localisation net → grid generator → sampler, mapping input feature map U to output feature map V)

  8. Sampling Grid: warp a regular grid by a transformation. Can parameterise, e.g., an affine transformation:

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \mathcal{T}_\theta(G_i) = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$$
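The grid generator above can be sketched in a few lines of NumPy. Target coordinates are normalised to [-1, 1] as in the paper; the function name and code are illustrative, not the authors' implementation:

```python
import numpy as np

def affine_grid(theta, H, W):
    """Warp a regular H x W target grid by a 2x3 affine matrix `theta`.

    Returns the source coordinates (x_s, y_s) to sample for each
    output pixel, with target coordinates normalised to [-1, 1].
    """
    xs = np.linspace(-1.0, 1.0, W)
    ys = np.linspace(-1.0, 1.0, H)
    x_t, y_t = np.meshgrid(xs, ys)                 # regular target grid G
    ones = np.ones_like(x_t)
    grid = np.stack([x_t, y_t, ones], axis=-1)     # (H, W, 3) homogeneous coords
    return grid @ theta.T                          # (H, W, 2) source coords

# The identity transform leaves the grid unchanged
theta_id = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
src = affine_grid(theta_id, 4, 4)
```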

  9. Sampling Grid: warp a regular grid by a constrained affine transformation. Can parameterise attention:

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \mathcal{T}_\theta(G_i) = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} s & 0 & t_x \\ 0 & s & t_y \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$$
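The attention parameterisation is just the affine matrix with its parameters tied: one isotropic scale and a translation. A minimal sketch (the helper name is mine):

```python
import numpy as np

def attention_theta(s, t_x, t_y):
    """Constrained affine matrix for attention: isotropic scale s
    plus translation (t_x, t_y); only 3 free parameters."""
    return np.array([[s,   0.0, t_x],
                     [0.0, s,   t_y]])

# s < 1 zooms in: the sampled window covers a sub-region of the input
theta = attention_theta(0.5, 0.25, -0.25)
```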

  10. Identity affine transformation

  11. Sampler: sample the input feature map U to produce the output feature map V (i.e. texture mapping). E.g. for bilinear interpolation:

$$V_i^c = \sum_n^H \sum_m^W U_{nm}^c \max(0, 1 - |x_i^s - m|)\, \max(0, 1 - |y_i^s - n|)$$

and gradients are defined to allow backprop, e.g.:

$$\frac{\partial V_i^c}{\partial U_{nm}^c} = \sum_n^H \sum_m^W \max(0, 1 - |x_i^s - m|)\, \max(0, 1 - |y_i^s - n|)$$
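The bilinear kernel above can be written naively in NumPy to make the sum explicit (illustrative only; here `x_s`, `y_s` are in pixel units, and `bilinear_sample` is a name chosen for the sketch):

```python
import numpy as np

def bilinear_sample(U, x_s, y_s):
    """Sample a single-channel feature map U (H, W) at real-valued
    source coordinates, using the bilinear kernel
    sum_{n,m} U[n, m] * max(0, 1-|x_s-m|) * max(0, 1-|y_s-n|)."""
    H, W = U.shape
    m = np.arange(W)
    n = np.arange(H)
    wx = np.maximum(0.0, 1.0 - np.abs(x_s - m))    # (W,) column weights
    wy = np.maximum(0.0, 1.0 - np.abs(y_s - n))    # (H,) row weights
    return float(wy @ U @ wx)                      # double sum over n, m

U = np.arange(16, dtype=float).reshape(4, 4)
# Halfway between pixels (n=1, m=1)=5 and (n=1, m=2)=6 gives 5.5
v = bilinear_sample(U, 1.5, 1.0)
```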

  12. Spatial Transformer (recap): a differentiable module for spatially transforming data, conditional on the data itself (diagram: localisation net → grid generator → sampler, mapping U to V)

  13. Spatial Transformer Networks
• The spatial transformer is differentiable, so it can be inserted at any point in a feed-forward network and trained by backpropagation.
• Example: digit classification; loss: cross-entropy for 10-way classification (diagram: input → ST → CNN → digit class)
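Putting the grid generator and sampler together gives the forward pass of one spatial transformer module. A single-channel NumPy sketch under the paper's [-1, 1] grid convention; in a real network, `theta` would be regressed by a localisation net and gradients handled by autograd:

```python
import numpy as np

def spatial_transform(U, theta, out_H, out_W):
    """Forward pass of a spatial transformer on a single-channel map U.

    theta: 2x3 affine matrix (what a localisation net would regress).
    Generates a warped sampling grid, then bilinearly samples U.
    """
    H, W = U.shape
    xs = np.linspace(-1.0, 1.0, out_W)
    ys = np.linspace(-1.0, 1.0, out_H)
    x_t, y_t = np.meshgrid(xs, ys)
    grid = np.stack([x_t, y_t, np.ones_like(x_t)], -1) @ theta.T  # source coords
    # map normalised coordinates back to pixel indices
    x_s = (grid[..., 0] + 1.0) * (W - 1) / 2.0
    y_s = (grid[..., 1] + 1.0) * (H - 1) / 2.0
    m = np.arange(W)
    n = np.arange(H)
    wx = np.maximum(0.0, 1.0 - np.abs(x_s[..., None] - m))  # (out_H, out_W, W)
    wy = np.maximum(0.0, 1.0 - np.abs(y_s[..., None] - n))  # (out_H, out_W, H)
    return np.einsum('ijn,nm,ijm->ij', wy, U, wx)

U = np.arange(16, dtype=float).reshape(4, 4)
theta_id = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
V = spatial_transform(U, theta_id, 4, 4)
# the identity transform reproduces the input
```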

  14. MNIST Digit Classification. Training data: 6,000 examples of each digit. Test data: 10k images. A test error of 0.23% is achievable.

  15. Task: classify MNIST digits
• Training and test digits randomly rotated by +/- 90°
• Fully connected network with an affine ST on the input (diagram: input → spatial transformer → network → output)
• Performance (% error): FCN 2.1, CNN 1.2, ST-FCN 1.2, ST-CNN 0.7

  16. Generalizations 1: transformations
• Affine transformation: 6 parameters
• Projective transformation: 8 parameters
• Thin plate spline transformation
• Etc. Any transformation whose parameters can be regressed.
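For example, a projective transformation changes only the grid generator: the localisation net regresses 8 parameters instead of 6, and the grid is divided by the homogeneous coordinate. A hypothetical sketch, not from the slides:

```python
import numpy as np

def projective_grid(h, H, W):
    """Warp a regular grid by an 8-parameter projective (homography)
    transform. h: length-8 vector; the 3x3 homography is h reshaped
    with its bottom-right entry fixed to 1."""
    Hm = np.append(h, 1.0).reshape(3, 3)
    xs, ys = np.meshgrid(np.linspace(-1, 1, W), np.linspace(-1, 1, H))
    pts = np.stack([xs, ys, np.ones_like(xs)], -1) @ Hm.T
    return pts[..., :2] / pts[..., 2:3]            # divide by homogeneous w

# parameters for the identity homography
h_id = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
g = projective_grid(h_id, 4, 4)
```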

  17. Rotated MNIST (figure: example inputs and ST outputs for ST-FCN Affine and ST-FCN Thin Plate Spline)

  18. Rotated, Translated & Scaled MNIST (figure: example inputs and ST outputs for ST-FCN Projective and ST-FCN Thin Plate Spline)

  19. Translated Cluttered MNIST (figure: example inputs and ST outputs for ST-FCN Affine and ST-CNN Affine)

  20. Results on performance (table). Distortions: R: rotation; P: projective; RTS: rotation, translation, scale; E: elastic

  21. Generalization 2: multiple spatial transformers
• Spatial transformers can be inserted before/after conv layers, and before/after max-pooling.
• Multiple spatial transformers can also operate in parallel at the same level (diagram: ST1, ST2a, ST2b and ST3 interleaved with conv1, conv2, conv3 before the digit output).

  22. Task: add the digits in two images. MNIST digits under rotation, translation and scale (architecture diagram)

  23. Task: add the digits in two images
• Input: 2-channel MNIST images, one digit per channel, with random per-channel rotation, scale and translation. Add up the digits.
• SpatialTransformer1 automatically specialises to rectify channel 1; SpatialTransformer2 automatically specialises to rectify channel 2 (figure: per-channel ST outputs)

  24. Task: add the digits in two images. MNIST digits under rotation, translation and scale. Performance table (% error)

  25. Applications and comparisons with the state of the art

  26. Street View House Numbers (SVHN)
• 200k real images of house numbers collected from Street View
• Between 1 and 5 digits in each number
• Architecture (diagram): 4 spatial transformer + conv layers, 4 conv layers, 3 fc layers, 5 character output layers

  27. SVHN 64x64 (% error)
• CNN (single model, Goodfellow et al. 2013): 4.0
• Attention (ensemble with MC averaging, Ba et al., ICLR 2015): 3.9
• ST net (single model): 3.6

  28. SVHN 128x128 (% error)
• CNN (single model): 5.6
• Attention (ensemble with MC averaging, Ba et al., ICLR 2015): 4.5
• ST net (single model): 3.9

  29. Fine-Grained Visual Categorization: CUB-200-2011 birds dataset
• 200 species of birds
• 6k training images
• 5.8k test images

  30. Spatial Transformer Network
• Pre-train Inception networks on ImageNet
• Train the spatial transformer network on fine-grained multi-way classification

  31. CUB Performance

  32. Summary
● Spatial transformers allow dynamic, conditional cropping and warping of images/feature maps.
● They can be constrained and used as a very fast attention mechanism.
● Spatial transformer networks localise and rectify objects automatically, and achieve state-of-the-art results.
● They can be used as a generic localisation mechanism that can be learnt with backprop.
