Capsule Networks
Eric Mintun
Motivation
An improvement* to regular Convolutional Neural Networks. Two goals:
- Replace the max-pooling operation with something more intuitive.
- Keep more info about an activated feature.
Not new, but recent work shows strong results in image segmentation and 3D object recognition.
*Your mileage may vary
- Convolutional layers provide translation invariance.
- Each filter looks for the same feature at each pixel.
- Max-pooling throws away location information.
[Figure: max-pooling applied to a grid of activation values]
- Pooling keeps the presence of a feature, not its orientation.
- Features transform together under a pose transform, but a CNN doesn't represent this.
- The representation is inefficient when features are rare.
- Sometimes invariance itself is bad: we want to know how a feature is posed, not just that it is present.
[Figure: the same max-pooling grid, repeated]
- Each capsule outputs an activation probability p together with pose parameters (x, y, θ, Ω).
- Weights between lower capsules and upper capsules are learned by back propagation.
- Routing coefficients c_ij weight the output of lower-level capsules:
[Figure: a layer of capsules, each holding (p, x, y, θ, Ω)]
v_j = Σ_i c_ij û_ij,   with   Σ_i c_ij = 1   and   û_ij = W_ij u_i

where u_i are the lower capsule outputs and v_j the upper capsule outputs.
Example: W_ij maps the pose of lower capsule i into a prediction for upper capsule j. For i = 1:
W11: rotate 135° CCW, rescale by 1, translate (0, −1).
W12: rotate 45° CCW, rescale by 2, translate (0, −4).
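Transforms like these can be written as homogeneous 2D matrices. A minimal numpy sketch (the helper name and the example pose point are illustrative, not from the paper):

```python
import numpy as np

def pose_transform(theta_deg, scale, t):
    """Homogeneous 2D matrix: rotate CCW by theta_deg, rescale, then translate by t."""
    th = np.radians(theta_deg)
    W = np.eye(3)
    W[:2, :2] = scale * np.array([[np.cos(th), -np.sin(th)],
                                  [np.sin(th),  np.cos(th)]])
    W[:2, 2] = t
    return W

# The two example weight matrices from the slide:
W11 = pose_transform(135, 1.0, (0.0, -1.0))  # rotate 135 deg CCW, translate (0, -1)
W12 = pose_transform(45, 2.0, (0.0, -4.0))   # rotate 45 deg CCW, rescale by 2, translate (0, -4)

u1 = np.array([1.0, 0.0, 1.0])  # a lower capsule's pose point, homogeneous coordinates
u_hat_11 = W11 @ u1             # prediction for upper capsule j = 1
u_hat_12 = W12 @ u1             # prediction for upper capsule j = 2
```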
Routing by agreement: if many lower-level features i predict the same pose for upper-level feature j, it is more likely that j is the correct higher-level feature. E.g. if û_11 agrees with the other predictions for j = 1 while û_12 does not agree for j = 2: increase c11, decrease c12.
References:
- Sabour, Frosst, Hinton, "Dynamic Routing Between Capsules" (arXiv:1710.09829)
- Hinton, Sabour, Frosst, "Matrix Capsules with EM Routing" (openreview.net/pdf?id=HJWLfGWRb)
- "How to implement CapsNets using TensorFlow" (youtube.com/watch?v=2Kawrd5szHE)
[Figure: input image* and primary capsule activations]
*The background gradient is not part of the input; it's there because I took a screenshot of a YouTube video.
Squashing nonlinearity:
v_j = ( ||s_j||² / (1 + ||s_j||²) ) · ( s_j / ||s_j|| ),   where   s_j = Σ_i c_ij û_ij,   û_ij = W_ij u_i

Dynamic routing between u_i and v_j: initialize b_ij = 0, then iterate:
c_ij = softmax(b_ij)
b_ij ← b_ij + v_j · û_ij
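The routing loop above can be sketched in a few lines of numpy (shapes, iteration count, and the toy input are illustrative, not the paper's full layer):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """v = (|s|^2 / (1 + |s|^2)) * s / |s| -- shrinks short vectors toward zero."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iters=3):
    """u_hat: predictions u_hat_ij with shape (num_lower, num_upper, dim).
    Returns upper capsule outputs v with shape (num_upper, dim)."""
    n_lower, n_upper, _ = u_hat.shape
    b = np.zeros((n_lower, n_upper))                          # routing logits, b_ij = 0
    for _ in range(num_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax over upper capsules j
        s = np.einsum('ij,ijd->jd', c, u_hat)                 # s_j = sum_i c_ij u_hat_ij
        v = squash(s)
        b = b + np.einsum('jd,ijd->ij', v, u_hat)             # agreement: b_ij += v_j . u_hat_ij
    return v

# toy usage: 6 lower capsules predicting 3 upper capsules of dimension 8
rng = np.random.default_rng(0)
u_hat = rng.normal(size=(6, 3, 8))
v = dynamic_routing(u_hat)
print(v.shape)  # (3, 8)
```

The squash keeps every output norm below 1, so ||v_j|| can be read as a probability that feature j is present.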
Margin loss on the output capsules v_j, with T_j = 1 if digit j is present and 0 otherwise:

L = Σ_j [ T_j max(0, 0.9 − ||v_j||)² + 0.5 (1 − T_j) max(0, ||v_j|| − 0.1)² ]
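The margin loss is straightforward to sketch directly from the formula (the example capsule norms are made up):

```python
import numpy as np

def margin_loss(v_norms, T, m_plus=0.9, m_minus=0.1, lam=0.5):
    """v_norms: ||v_j|| for each class capsule j; T: one-hot presence indicator T_j."""
    present = T * np.maximum(0.0, m_plus - v_norms) ** 2
    absent = lam * (1 - T) * np.maximum(0.0, v_norms - m_minus) ** 2
    return np.sum(present + absent)

# digit 2 present: its capsule is long, the rest are short -> loss is zero
v_norms = np.array([0.05, 0.08, 0.95, 0.02])
T = np.array([0.0, 0.0, 1.0, 0.0])
loss = margin_loss(v_norms, T)
```

The 0.5 down-weighting on absent classes stops the loss from shrinking all capsule vectors early in training.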
- Generalizes to affNIST after training on translated MNIST: 79% accuracy vs 66% for a CNN.
- Segments overlapping digit pairs (MultiMNIST) with a 5.2% error rate, compared to a CNN error of 8.1%.
[Figure: original vs. reconstruction for correctly classified, forced-wrong, and incorrectly classified examples]
Matrix capsules with EM routing (openreview.net/pdf?id=HJWLfGWRb)
- Poses are a 4×4 matrix.
- Activation is the sigmoid of a learned weighted sum of local features.
EM routing, with pose components indexed by h, lower activations a_i, and assignment coefficients c_ij:

M-step: set r_ij = c_ij a_i and fit a Gaussian to each upper capsule j's votes:

μ_jh = Σ_i r_ij û_ijh / Σ_i r_ij
σ²_jh = Σ_i r_ij (û_ijh − μ_jh)² / Σ_i r_ij

cost_h = (β_v + log σ_jh) Σ_i r_ij   (per pose component)
a_j = sigmoid( λ [ β_a − Σ_h (β_v + log σ_jh) Σ_i r_ij ] )

β_a, β_v are learned by backprop; λ follows a fixed schedule.

E-step:

c_ij = a_j p_ij / Σ_j a_j p_ij,   where
p_ij = 1 / √( Π_h 2π σ²_jh ) · exp( − Σ_h (û_ijh − μ_jh)² / (2 σ²_jh) )

Alternate M and E steps; the resulting pose is passed to the class capsules.
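The alternating steps can be sketched in numpy. This is a simplified illustration: β_a, β_v, λ, and the shapes are made-up constants here, whereas in the paper β_a and β_v are learned and λ follows a schedule.

```python
import numpy as np

def em_routing(u_hat, a_in, iters=3, beta_a=1.0, beta_v=1.0, lam=1.0, eps=1e-9):
    """u_hat: (n_lower, n_upper, H) vote components u_hat_ijh; a_in: (n_lower,) activations a_i."""
    n_lower, n_upper, H = u_hat.shape
    c = np.full((n_lower, n_upper), 1.0 / n_upper)            # uniform initial assignments
    for _ in range(iters):
        # M-step: weighted Gaussian fit per upper capsule j
        r = c * a_in[:, None]                                 # r_ij = c_ij a_i
        r_sum = r.sum(axis=0) + eps
        mu = np.einsum('ij,ijh->jh', r, u_hat) / r_sum[:, None]
        var = np.einsum('ij,ijh->jh', r, (u_hat - mu[None]) ** 2) / r_sum[:, None]
        sigma = np.sqrt(var + eps)
        cost = (beta_v + np.log(sigma)) * r_sum[:, None]      # cost per pose component h
        a_out = 1.0 / (1.0 + np.exp(-lam * (beta_a - cost.sum(axis=1))))
        # E-step: responsibility of each upper capsule for each vote
        log_p = (-0.5 * np.log(2 * np.pi * var + eps)
                 - (u_hat - mu[None]) ** 2 / (2 * var + eps)).sum(axis=2)
        p = np.exp(log_p)
        c = a_out[None] * p / (np.sum(a_out[None] * p, axis=1, keepdims=True) + eps)
    return a_out, mu

# toy usage: 8 lower capsules, 4 upper capsules, 16 pose components
rng = np.random.default_rng(1)
u_hat = rng.normal(size=(8, 4, 16))
a_in = rng.uniform(size=8)
a_out, mu = em_routing(u_hat, a_in)
```

Computing p_ij in log space avoids underflow when the number of pose components H is large.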
Spread loss, with target class t and margin m:

L = Σ_{j≠t} max(0, m − (a_t − a_j))²
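A direct numpy sketch of the spread loss (in the paper m is annealed from 0.2 up to 0.9 during training; this sketch treats it as a fixed parameter, and the activations are made up):

```python
import numpy as np

def spread_loss(a, target, m=0.9):
    """a: class activations a_j; target: index t of the true class."""
    margin = np.maximum(0.0, m - (a[target] - a))
    margin[target] = 0.0          # the sum runs over j != t
    return np.sum(margin ** 2)

a = np.array([0.1, 0.8, 0.2])
loss = spread_loss(a, target=1, m=0.5)  # all gaps exceed the margin, so loss is 0
```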
smallNORB: 5 classes of toy (airplanes, cars, trucks, humans, animals) with 10 physical instances of each toy, 18 azimuthal angles, 9 elevation angles, and 6 lighting conditions per training and test set. Total of 48,600 images each.
Generalizes to novel viewpoints: tested on the higher 2/3 of elevation angles.
Adversarial attacks: compute the gradient with respect to a change in pixel intensity, then modify each pixel by a small ε in the direction that either (1) maximizes the loss, or (2) maximizes the classification probability of a wrong class.
[Figure: adversarial examples for variants (1) and (2)]
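The perturbation step can be sketched as an FGSM-style update (the gradient would come from backprop through the network; here it is a made-up array):

```python
import numpy as np

def fgsm_perturb(x, grad, eps=0.05):
    """Move each pixel by eps in the sign direction of the supplied gradient.
    Pass d(loss)/dx for variant (1), or d(p_wrong)/dx for variant (2)."""
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)  # keep pixels in [0, 1]

# toy usage with a made-up gradient on a 4x4 "image"
x = np.full((4, 4), 0.5)
grad = np.array([[1.0, -1.0] * 2] * 4)
x_adv = fgsm_perturb(x, grad, eps=0.1)
```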
The routed network takes 2 days to train on a laptop; a comparable CNN takes 30 minutes.
Conclusions:
- Still generally bad at complex images.
- Struggles with overlapping instances of the same feature (crowding).
- Strong results in segmentation and 3D object recognition.
- Training is very slow.
- The routing schemes can likely be improved upon.
Recent work: can learn capsules via an auto-encoder, reaching a competitive error rate on MNIST with ~25 labelled examples.