

  1. Capsule Networks Eric Mintun

  2. Motivation • An improvement* to regular Convolutional Neural Networks. • Two goals: • Replace the max-pooling operation with something more intuitive. • Keep more info about an activated feature. • Not new, but recent interest because of state-of-the-art results in image segmentation and 3D object recognition. *Your mileage may vary

  3. CNN Review • CNN architecture bakes in translation invariance. • Convolution looks for the same feature at each pixel. • Max-pooling throws out location information. [Figure: worked example of a convolution filter and max-pooling on a pixel grid, not reproduced.]

  4. CNN Issues • Only involves the position of a feature, not its orientation. • Translation is a linear transform, but a CNN doesn't represent it as one. • The grid representation is inefficient when features are rare. • Intermediate translation invariance is bad: middle layers discard location information that later layers may need. [Figure: convolution example repeated from the previous slide, not reproduced.]

  5. Capsules • Capsules have two steps: • Apply a pose transform between all lower capsules i and upper capsules j: û_ij = W_ij u_i. The transformation matrices W_ij are learned by backpropagation. • Route lower-level capsules to higher-level capsules: v_j = Σ_i c_ij û_ij, with Σ_j c_ij = 1. • The routing weights c_ij are determined dynamically, and activations factor into this step. [Figure: each capsule outputs a pose (x, y, θ, Ω) and an activation probability p; not reproduced.]
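The two capsule steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the papers' implementation: the sizes and random weights are made up, and a real network learns W_ij by backprop and sets c_ij by routing.

```python
import numpy as np

# Hypothetical sizes: 4 lower capsules, 2 upper capsules, 3-dim poses.
n_lower, n_upper, dim = 4, 2, 3
rng = np.random.default_rng(0)

u = rng.normal(size=(n_lower, dim))                # lower-level poses u_i
W = rng.normal(size=(n_lower, n_upper, dim, dim))  # pose transforms W_ij

# Step 1: pose transform -- each lower capsule predicts each upper pose.
u_hat = np.einsum('ijab,ib->ija', W, u)            # û_ij = W_ij u_i

# Step 2: routing -- coupling coefficients sum to 1 over upper capsules j.
b = np.zeros((n_lower, n_upper))                   # uniform logits
c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax over j

v = np.einsum('ij,ija->ja', c, u_hat)              # v_j = Σ_i c_ij û_ij
```

With uniform logits every lower capsule splits its vote evenly between the two upper capsules.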

  6. Pose Transformations • W_ij: given the pose of feature i, what is the predicted pose of the higher-level feature j? • Example, for lower feature i = 1 and upper features j = 1, j = 2: • W_11: rotate 135° CCW, rescale by 1, translate (0, -1). • W_12: rotate 45° CCW, rescale by 2, translate (0, -4).
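The slide's example transforms can be written as 2D homogeneous matrices. In this sketch the helper name `pose_matrix` is my own; it builds W_11 from the stated rotation, scale, and translation.

```python
import numpy as np

def pose_matrix(angle_deg, scale, tx, ty):
    """3x3 homogeneous matrix: rotate CCW and rescale, then translate."""
    t = np.deg2rad(angle_deg)
    return np.array([
        [scale * np.cos(t), -scale * np.sin(t), tx],
        [scale * np.sin(t),  scale * np.cos(t), ty],
        [0.0,                0.0,               1.0],
    ])

# W_11 from the slide: rotate 135° CCW, rescale by 1, translate (0, -1).
W11 = pose_matrix(135, 1.0, 0.0, -1.0)

# Predicted pose of upper feature j=1 from a lower feature at the origin.
p = W11 @ np.array([0.0, 0.0, 1.0])
```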

  7. Routing • c_ij: which feature j does feature i think it is a part of? • Determined via "routing by agreement": if many features predict the same pose for feature j, it is more likely that j is the correct higher-level feature. • [Figure, not reproduced: for feature i = 1, the prediction û_11 agrees with other capsules' predictions while û_12 does not, so increase c_11 and decrease c_12.]

  8. Specific Models • Two separate papers give different explicit models. • Model 1, from "Dynamic Routing Between Capsules", Sabour, Frosst, Hinton, arXiv:1710.09829. • State-of-the-art image segmentation • Few capsule layers • Generic poses with simple routing • Model 2, from "Matrix capsules with EM routing," anonymous authors, openreview.net/pdf?id=HJWLfGWRb. • State-of-the-art 3D object recognition • More capsule layers • Structured poses with more advanced routing

  9. Model 1 • From "Dynamic Routing Between Capsules", Sabour, Frosst, Hinton, arXiv:1710.09829. [Figure: network architecture, not reproduced.] • Get pixels into capsule poses using convolutions and backprop. • ReLU between convolutions. The second convolution has stride 2.

  10. Primary Capsules • Cool visual description of primary capsules in Aurélien Géron's "How to implement CapsNets using TensorFlow" (youtube.com/watch?v=2Kawrd5szHE) • One class detects line beginnings, where the pose is the line direction: [Figure: input* and primary capsule activation, not reproduced.] *Background gradient is not part of the input; it appears because I took a screenshot of a YouTube video.

  11. Routing • No separate activation probability; it is stored in the length of the pose vector. Squash the pose vector into [0, 1): v_j = (||s_j||² / (1 + ||s_j||²)) (s_j / ||s_j||), where s_j = Σ_i c_ij û_ij and û_ij = W_ij u_i. • Assume uniform initial routing priors: b_ij = 0, c_ij = softmax(b_ij), then calculate v_j. • Update the routing coefficients: b_ij ← b_ij + v_j · û_ij. • Iterate 3 times.
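A minimal NumPy sketch of this routing loop, assuming the predictions û_ij are already computed; shapes are illustrative and the epsilon in the squash is my own numerical guard.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squash a pose vector so its length lies in [0, 1)."""
    n2 = np.sum(s * s, axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def dynamic_routing(u_hat, n_iters=3):
    """u_hat: (n_lower, n_upper, dim) predictions û_ij. Returns v_j."""
    n_lower, n_upper, _ = u_hat.shape
    b = np.zeros((n_lower, n_upper))  # uniform routing priors b_ij = 0
    for _ in range(n_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax over j
        s = np.einsum('ij,ijd->jd', c, u_hat)  # s_j = Σ_i c_ij û_ij
        v = squash(s)
        b = b + np.einsum('jd,ijd->ij', v, u_hat)  # agreement: b_ij += v_j · û_ij
    return v
```

Note the softmax is taken over the upper capsules j, so each lower capsule distributes a unit of "vote" among its possible parents.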

  12. Loss • Two forms of loss. Margin loss: L = Σ_j [ T_j max(0, 0.9 − ||v_j||)² + 0.5 (1 − T_j) max(0, ||v_j|| − 0.1)² ], where T_j = 1 if the digit is present and 0 otherwise. • Reconstruction loss: decode the class capsule poses v_j back into an image and penalize the pixel-wise squared error.
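The margin loss above translates directly into NumPy; this sketch (helper name is my own) operates on the capsule lengths ||v_j|| and the one-hot targets T_j.

```python
import numpy as np

def margin_loss(v_norms, targets, m_plus=0.9, m_minus=0.1, lam=0.5):
    """v_norms: (n_classes,) capsule lengths ||v_j||; targets: 0/1 per class."""
    present = targets * np.maximum(0.0, m_plus - v_norms) ** 2
    absent = lam * (1.0 - targets) * np.maximum(0.0, v_norms - m_minus) ** 2
    return np.sum(present + absent)
```

A confident correct prediction (target capsule longer than 0.9, others shorter than 0.1) incurs zero loss.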

  13. Results on MNIST • 0.25% error rate, competitive with CNNs. • Examples of capsule pose parameters: [Figure: reconstructions as individual pose dimensions are perturbed, not reproduced.] • On unseen affine-transformed digits (affNIST), 79% accuracy vs. 66% for a CNN.

  14. Image Segmentation • Trained on two MNIST digits with ~80% overlap, classifies pairs with a 5.2% error rate, compared to a CNN error of 8.1%. [Figure: originals and reconstructions for correctly classified, forced-wrong reconstruction, and incorrectly classified examples; not reproduced.]

  15. Model 2 • From "Matrix capsules with EM routing," anonymous authors (openreview.net/pdf?id=HJWLfGWRb). [Figure: network architecture, not reproduced.] • Organize the pose as a 4x4 matrix plus an activation logit instead of a vector. The transformation weights are a 4x4 matrix. • Primary capsules' poses are a learned linear transform of local features. Activation is a sigmoid of a learned weighted sum of local features. • Convolutional capsules share transformation weights and see poses from a local kernel.

  16. EM Routing • Model the higher layer as a mixture of Gaussians that explains the lower layer's poses. • Start with uniform routing priors c_ij, weighted by the activations a_i of the lower capsules: r_ij = c_ij a_i. • Determine the mean and variance per pose component h: μ_jh = Σ_i r_ij û_ijh / Σ_i r_ij, σ²_jh = Σ_i r_ij (û_ijh − μ_jh)² / Σ_i r_ij. • Activate the upper capsule as: a_j = sigmoid( λ ( β_a − Σ_h Σ_i r_ij (β_v + log σ_jh) ) ), with β_a, β_v learned by backprop and λ on a fixed schedule. • Calculate new routing coefficients: p_ij = exp( −Σ_h (û_ijh − μ_jh)² / (2σ²_jh) ) / sqrt( Π_h 2π σ²_jh ), c_ij = a_j p_ij / Σ_j a_j p_ij. • Iterate 3 times.
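A compact NumPy sketch of one reading of this EM loop, with per-component diagonal Gaussians. The inverse temperature λ is held fixed here rather than scheduled, and the stability epsilon is my own addition; this is an illustration, not the paper's implementation.

```python
import numpy as np

def em_routing(u_hat, a_lower, beta_a, beta_v, lam, n_iters=3):
    """u_hat: (n_lower, n_upper, h) pose predictions û_ijh.
    a_lower: (n_lower,) lower-capsule activations a_i."""
    n_lower, n_upper, h = u_hat.shape
    c = np.full((n_lower, n_upper), 1.0 / n_upper)  # uniform priors
    for _ in range(n_iters):
        # M-step: weight routing by lower activations, fit Gaussians.
        r = c * a_lower[:, None]                    # r_ij = c_ij a_i
        r_sum = r.sum(axis=0)                       # (n_upper,)
        mu = (r[:, :, None] * u_hat).sum(axis=0) / r_sum[:, None]
        var = (r[:, :, None] * (u_hat - mu) ** 2).sum(axis=0) / r_sum[:, None] + 1e-8
        sigma = np.sqrt(var)
        cost = (beta_v + np.log(sigma)) * r_sum[:, None]   # per component h
        a_upper = 1.0 / (1.0 + np.exp(-lam * (beta_a - cost.sum(axis=1))))
        # E-step: responsibilities from Gaussian densities times activations.
        logp = (-0.5 * ((u_hat - mu) ** 2 / var).sum(axis=2)
                - np.log(sigma).sum(axis=1) - 0.5 * h * np.log(2 * np.pi))
        w = a_upper * np.exp(logp)
        c = w / w.sum(axis=1, keepdims=True)        # c_ij ∝ a_j p_ij
    return mu, a_upper, c
```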

  17. Last Layer and Loss • The connection to the class capsules uses a coordinate addition scheme: • Weights shared across locations, like a convolutional layer. • The explicit (x, y) offset of the kernel is added to the first two elements of the pose passed to the class capsules. • Spread loss: L = Σ_{j ≠ t} max(0, m − (a_t − a_j))² for target class t. • The margin m increases linearly from 0.2 to 0.9 during training.
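The spread loss is short enough to write out; a sketch (the helper name `spread_loss` is mine) over the class activations a_j:

```python
import numpy as np

def spread_loss(a, target, m):
    """a: (n_classes,) activations; target: index t; margin m in [0.2, 0.9]."""
    a_t = a[target]
    others = np.delete(a, target)  # all a_j with j != t
    return np.sum(np.maximum(0.0, m - (a_t - others)) ** 2)
```

Early in training (small m) only near-ties with the target are penalized; by the end (m = 0.9) every non-target activation must sit far below the target's.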

  18. Test Dataset • smallNORB dataset: 96x96 greyscale images of 5 classes of toy (airplanes, cars, trucks, humans, animals), with 10 physical instances of each toy, 18 azimuthal angles, 9 elevation angles, and 6 lighting conditions. Instances are split evenly between the training and test sets, giving 24,300 stereo image pairs (48,600 images) in each.

  19. Results • Downscale smallNORB to 48x48, randomly crop to 32x32. [Table: test error rates, not reproduced; footnote 2: loss from model 1.]
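The preprocessing step described above is simple; a sketch of the random crop (my own helper, not code from the papers):

```python
import numpy as np

def random_crop(img, size, rng):
    """Randomly crop a (H, W) image down to (size, size)."""
    h, w = img.shape
    y = rng.integers(0, h - size + 1)  # top-left corner, inclusive range
    x = rng.integers(0, w - size + 1)
    return img[y:y + size, x:x + size]

rng = np.random.default_rng(0)
img = rng.normal(size=(48, 48))   # stands in for a downscaled smallNORB image
crop = random_crop(img, 32, rng)
```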

  20. Novel Viewpoints • Case 1: train on middle 1/3 azimuthal angles, test on remaining 2/3 azimuthal angles. • Case 2: train on lower 1/3 elevation angles, test on higher 2/3 elevation angles.

  21. Adversarial Robustness • FGSM adversarial attack: compute the gradient of the output w.r.t. pixel intensities, then modify each pixel by a small ε in the direction that either (1) maximizes the loss, or (2) maximizes the classification probability of a wrong class. • BIM adversarial attack: the same thing, but over several small steps. [Figure: results for cases (1) and (2), not reproduced.] • No improvement on adversarial images generated against a CNN (transferred attacks).
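The FGSM/BIM update rules described above can be sketched as follows; `grad_fn` stands in for a real backprop gradient through the network, which is assumed given, and the sign convention shown covers case (1).

```python
import numpy as np

def fgsm_step(x, grad_wrt_x, eps, maximize_loss=True):
    """One FGSM step: move each pixel by eps in the sign of the loss gradient.
    Case (2) uses the gradient of the wrong class's probability instead."""
    sign = np.sign(grad_wrt_x)
    return x + eps * sign if maximize_loss else x - eps * sign

def bim(x, grad_fn, eps, n_steps):
    """Basic Iterative Method: several small FGSM steps using a supplied
    (hypothetical) gradient function."""
    x_adv = x.copy()
    for _ in range(n_steps):
        x_adv = fgsm_step(x_adv, grad_fn(x_adv), eps / n_steps)
    return x_adv
```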

  22. Downsides • Capsule networks are really slow. A shallow EM-routed network takes 2 days to train on a laptop; a comparable CNN takes 30 minutes. • Poor performance (~11% error) on CIFAR10; generally bad at complex images. • Can't handle multiple copies of the same object (crowding).

  23. Conclusions • Capsule networks explicitly learn the relative poses of objects. • State-of-the-art performance on image segmentation and 3D object recognition. • Poor performance on complicated images, and very slow. • Little studied; it is unknown whether these issues can be improved upon.

  24. Transforming Auto-encoders • With unlabeled data and the ability to explicitly transform poses, capsules can be learned via an auto-encoder. [Figure: transforming auto-encoder architecture, not reproduced.] • Connecting the capsules to factor analyzers then gives a competitive error rate on MNIST with ~25 labelled examples.
