Understanding image representations by measuring their equivariance and equivalence - PowerPoint PPT Presentation



SLIDE 1

Understanding image representations by measuring their equivariance and equivalence

Karel Lenc, Andrea Vedaldi

Visual Geometry Group, Department of Engineering Science

SLIDE 2

Representations for image understanding

[Diagram: image space → feature space → semantic space. A representation 𝜚 maps images 𝒚, 𝒛 to features 𝜚(𝒚), 𝜚(𝒛); a classifier 𝜔 maps features to labels such as "bike" or "dog".]

Ultimate goal of a representation: simplify a task such as image classification.

Many representations:

• Local image descriptors: SIFT [Lowe 04], HOG [Dalal et al. 05], SURF [Bay et al. 06], LBP [Ojala et al. 02], …

• Feature encoders: BoVW [Sivic et al. 02], Fisher Vector [Perronnin et al. 07], VLAD [Jegou et al. 10], sparse coding, …

• Deep convolutional neural networks [Fukushima 1974-1982, LeCun et al. 89, Krizhevsky et al. 12, …]

SLIDE 3

Design of representations

Many designs are empirical; the main theoretical design principle is invariance.

[Diagram: a transformation ℎ maps image 𝒚 to ℎ𝒚 in image space; the representation 𝜚 maps both into feature space.]

Invariant: 𝜚(𝒚) = 𝜚(ℎ𝒚)

SLIDE 4

Design of representations

However, many representations such as HOG are not invariant, even to simple transformations.

[Diagram: the HOG features of 𝒚 and of the transformed image ℎ𝒚 differ.]

Not invariant: 𝜚(𝒚) ≠ 𝜚(ℎ𝒚)

SLIDE 5

Design of representations

But they often transform in a simple and predictable manner.

[Diagram: the HOG features of ℎ𝒚 are obtained from those of 𝒚 by a feature-space map 𝑀ℎ.]

Equivariant: ∀𝒚: 𝜚(ℎ𝒚) = 𝑀ℎ 𝜚(𝒚)
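The equivariance relation above can be made concrete with a minimal numpy sketch (not the paper's code): here the "representation" is a toy mean-pooling over a grid of cells, a stand-in for a HOG-like descriptor grid, and the image transformation ℎ is a horizontal flip. The feature-space map 𝑀ℎ then turns out to be a fixed permutation of the feature grid:

```python
import numpy as np

def rep(img, cell=2):
    """Toy 'representation': mean-pool the image over non-overlapping cells
    (a stand-in for a HOG-like grid of local descriptors)."""
    h, w = img.shape
    return img.reshape(h // cell, cell, w // cell, cell).mean(axis=(1, 3))

rng = np.random.default_rng(0)
y = rng.random((8, 8))          # toy image
hy = y[:, ::-1]                 # transformation h: horizontal flip

# Equivariance: rep(h y) equals a fixed permutation M_h of rep(y) --
# here M_h is simply a left-right flip of the feature grid.
assert np.allclose(rep(hy), rep(y)[:, ::-1])
```

For this toy pair (mean pooling, cell-aligned flip) the relation is exact; for real HOG cells the permutation also reorders orientation bins.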

SLIDE 6

Design of representations

But what happens with more complex transformations, like affine ones?

[Diagram: the HOG features of 𝒚 and of the affinely transformed ℎ𝒚; the feature-space relation is unknown.]

SLIDE 7

Design of representations

What happens with more complex representations like CNNs?

Invariance of CNN representations was studied in [Goodfellow et al. 09] and [Zeiler, Fergus 13].

Contribution: transformations in CNNs.

[Diagram: the CNN features of 𝒚 and of ℎ𝒚; the feature-space relation is unknown.]

SLIDE 8

Representation properties: Equivariance

How does a representation reflect image transformations?

SLIDE 9

When are two representations the same?

When are two representations the same?

Learning representations means that there is an endless number of them: variants obtained by learning on different datasets, or from different local optima.

Equivalence: 𝜚′(𝒚) = 𝐹(𝜚(𝒚))

[Diagram: two representations, 𝜚 from CNN-A and 𝜚′ from CNN-B.]

SLIDE 10

Representation properties: Equivariance

How does a representation reflect image transformations?

Equivalence

Do different representations have different meanings?

SLIDE 11

Finding equivariance empirically – Regularized linear regression

The map 𝑀ℎ is fitted as an affine map from the features of the original image to the features of the transformed image:

𝑀ℎ 𝜚(𝒚) = 𝐵ℎ 𝜚(𝒚) + 𝑏ℎ ≃ 𝜚(ℎ𝒚)   (𝐵ℎ, 𝑏ℎ learned empirically)

SLIDE 12

Finding equivariance empirically – Convolutional structure

𝑀ℎ is structured as a spatial permutation followed by convolution with a 1×1 filter bank:

𝐵ℎ ∗ 𝜚(𝒚) + 𝑏ℎ ≃ 𝜚(ℎ𝒚)   (learned empirically)
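As a sketch of this structure (an illustrative reconstruction, not the paper's code): on a C×H×W feature map, the permutation moves feature vectors between spatial locations, and the 1×1 filter bank is a single C×C channel mixing applied identically at every location, which an einsum expresses directly:

```python
import numpy as np

rng = np.random.default_rng(2)
C, H, W = 4, 6, 6
F = rng.standard_normal((C, H, W))   # feature map rho(y): C channels on an HxW grid

# M_h restricted as in the slide: a spatial permutation P_h of the feature
# grid, then a 1x1 filter bank B_h (CxC channel mixing) plus bias b_h.
B_h = rng.standard_normal((C, C))
b_h = rng.standard_normal(C)

def apply_M(F, B, b):
    P = F[:, :, ::-1]                # P_h: here, flip the grid left-right
    return np.einsum('oc,chw->ohw', B, P) + b[:, None, None]  # 1x1 convolution

out = apply_M(F, B_h, b_h)
# The 1x1 filter bank acts identically at every spatial location:
assert np.allclose(out[:, 0, 0], B_h @ F[:, 0, -1] + b_h)
```

Restricting 𝑀ℎ this way cuts the parameter count from (CHW)² to roughly C², which is what makes the regression tractable on CNN feature maps.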

SLIDE 13

Finding equivariance empirically – HOG features

[Figure: HOG features of images rotated by 45º, compared with the learned maps 𝑀ℎ applied to the original features.]

SLIDE 14

Finding equivariance empirically – HOG features, inverted with MIT HOGgles [Vondrick et al. 13]

[Figure: HOGgles inversions of 𝑀ℎ 𝜚(𝒚) for a 45º rotation, compared with inversions of 𝜚(ℎ𝒚).]

SLIDE 15

Finding equivariance empirically – HOG features, inverted with MIT HOGgles [Vondrick et al. 13]

[Figure: HOGgles inversions of 𝑀ℎ 𝜚(𝒚) for a 1.25x upscaling, compared with inversions of 𝜚(ℎ𝒚).]

SLIDE 16

Equivariance of representations – Findings

• Transformations: scaling, rotation, flipping, translation

• Equivariant representations: HOG

SLIDE 17

Finding equivariance empirically – CNN case

We run the same analysis on a typical CNN architecture:

• AlexNet [Krizhevsky et al. 12]

• 5 convolutional layers + fully-connected layers

• Trained on ImageNet ILSVRC

[Diagram: image 𝒚 passes through convolutional layers 1–5 and the fully-connected layers to predict the label 𝑧 (e.g. "dog"); the representation 𝜚 is followed by the classifier 𝜔.]

SLIDE 18

Learning mappings empirically – CNN case

[Diagram: the features 𝜚(𝒚) produced by the first layers of the network are fed through the remaining layers up to the FC output; the classification loss ℓ against the label 𝑧 supervises the learned mapping.]

SLIDE 19

Learning mappings empirically – CNN case

[Diagram: the transformed image ℎ𝒚 passes through the first layers to give 𝜚(ℎ𝒚); the learned map 𝑀ℎ⁻¹ converts this to 𝑀ℎ⁻¹𝜚(ℎ𝒚), which feeds the remaining layers; the classification loss ℓ against the label 𝑧 supervises 𝑀ℎ⁻¹ (learned empirically).]

SLIDE 20

Results – Vertical Flip

[Bar chart: top-5 error [%] (10–60) when the learned map 𝑀ℎ⁻¹ is inserted after layers 1–5, comparing: original classifier, no TF; original classifier + TF; before training; after training.]

SLIDE 21

Equivariance of representations – Findings

• Transformations: scaling, rotation, flipping, translation

• Equivariant representations: HOG; early convolutional layers in CNNs

• Equivariant to a lesser degree: deeper convolutional layers in CNNs

SLIDE 22

Representation properties: Equivariance

How does a representation reflect image transformations?

Equivalence

Do different representations have different meanings?

SLIDE 23

Equivalence – CNN transplantation crash course

Are 𝜚 and 𝜚′ equivalent features?

[Diagram: CNN-A (layers 1–5 + FC) computes 𝜚; CNN-B (layers 1–5 + FC) computes 𝜚′.]

AlexNet [Krizhevsky et al. 12], same training data, different parametrization.

SLIDE 24

Equivalence – CNN transplantation crash course

[Diagram: the first layers of CNN-A are joined to the remaining layers of CNN-B through a stitching layer 𝐹 (a linear convolution), trained with SGD against the classification loss ℓ and label 𝑧. Same training data, different parametrization.]
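A toy numpy sketch of the stitching idea (an illustration under simplifying assumptions, not the paper's setup): two one-layer "networks" compute features that differ only by a hidden invertible re-parametrization, and a linear stitching map 𝐹, fitted here by least squares rather than SGD, converts one network's features into the other's:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 8, 400
Y = rng.standard_normal((n, d))      # toy "images"

# Two toy one-layer "networks" whose features differ by an unknown invertible
# re-parametrization (stand-ins for CNN-A and CNN-B up to some layer).
Wa = rng.standard_normal((d, d))
T = rng.standard_normal((d, d))      # hidden change of basis
Wb = T @ Wa                          # CNN-B computes the "same" features, re-mixed
rho_a, rho_b = Y @ Wa.T, Y @ Wb.T

# Stitching layer F: least-squares map from CNN-A features to CNN-B features
# (applied per spatial location, i.e. a 1x1 convolution in the full CNN case).
F, *_ = np.linalg.lstsq(rho_a, rho_b, rcond=None)

# After stitching, CNN-A's features can feed CNN-B's remaining layers.
assert np.allclose(rho_a @ F, rho_b)
assert np.allclose(F.T, T)           # F recovers the hidden change of basis
```

In this linear toy the stitch is exact; in the real Franken-networks the fit is only approximate, which is why the stitched error is measured before and after training 𝐹.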

SLIDE 25

Franken-network: stitch CNN-A → CNN-B

Training data is the same, but parametrization is entirely different.

[Bar chart: top-5 error [%] (10–100) for the baseline CNN-B and for the stitched network, with the stitching layer 𝐹 (𝜚 → 𝜚′) inserted after layers 1–5, before and after training 𝐹.]

SLIDE 26

Equivalence of similar architecture

Compare training on the same or different data:

• CNN-IMNET: trained on the ILSVRC12 dataset

• CNN-PLACES: trained on the Places dataset

SLIDE 27

Franken-network: stitch CNN-PLACES → CNN-IMNET

Now even the training sets differ.

[Bar chart: top-5 error [%] (10–100) for the baseline CNN-IMNET and for the stitched CNN-PLACES → CNN-IMNET network, with the stitching layer 𝐹 (𝜚 → 𝜚′) inserted after layers 1–5, before and after training 𝐹.]

SLIDE 28

Example application: structured-output pose detection

Equivariant maps let us transform features instead of images, which allows a significant speedup at test time:

ℎ∗ = argmax_{ℎ ∈ 𝐻} ⟨𝒙, 𝜚(ℎ⁻¹𝒚)⟩ = argmax_{ℎ ∈ 𝐻} ⟨𝒙, 𝑀ℎ⁻¹ 𝜚(𝒚)⟩
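A minimal numpy sketch of this speedup (toy stand-ins, not the paper's pose detector): with the mean-pooling "representation" and a flip/identity candidate set 𝐻, the slow path re-extracts features for every transformed image, while the fast path extracts 𝜚(𝒚) once and applies the feature-space maps 𝑀ℎ⁻¹. Both give identical scores, so the argmax is unchanged:

```python
import numpy as np

def rep(img, cell=2):
    """Toy representation: mean-pool over cells (as in the equivariance toy)."""
    h, w = img.shape
    return img.reshape(h // cell, cell, w // cell, cell).mean(axis=(1, 3))

rng = np.random.default_rng(4)
y = rng.random((8, 8))               # toy image
x = rng.random((4, 4))               # template scored by <x, rho(.)>

# Candidate transformations h and their feature-space counterparts M_h^-1
# (for a flip, h = h^-1 and the feature map is itself a grid flip).
H_img  = {'id': lambda im: im, 'flip': lambda im: im[:, ::-1]}
H_feat = {'id': lambda f: f,  'flip': lambda f: f[:, ::-1]}

# Slow: re-extract features for every transformed image.
slow = {h: np.sum(x * rep(g(y))) for h, g in H_img.items()}
# Fast: extract rho(y) once, then search entirely in feature space.
f = rep(y)
fast = {h: np.sum(x * M(f)) for h, M in H_feat.items()}

assert all(np.isclose(slow[h], fast[h]) for h in H_img)
assert max(fast, key=fast.get) == max(slow, key=slow.get)
```

The saving grows with the size of 𝐻: one feature extraction replaces |𝐻| of them, and each 𝑀ℎ⁻¹ is far cheaper than recomputing 𝜚.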

SLIDE 29

Conclusions

Representing geometry

• Beyond invariance: equivariance

• Transforming the image results in a simple and predictable transformation of HOG and early CNN layers

• Application to accelerated structured output regression

Representation equivalence

• CNNs trained from different random seeds are very different, but only on the surface

• Early CNN layers are interchangeable even between tasks

General idea

• Study mathematical properties of representations empirically

SLIDE 30

References

[Lowe 04] Lowe, David G. "Distinctive image features from scale-invariant keypoints." International Journal of Computer Vision 60.2 (2004): 91-110.

[Dalal et al. 05] Dalal, Navneet, and Bill Triggs. "Histograms of oriented gradients for human detection." Proc. CVPR, 2005.

[Bay et al. 06] Bay, Herbert, Tinne Tuytelaars, and Luc Van Gool. "SURF: Speeded up robust features." Proc. ECCV, 2006: 404-417.

[Ojala et al. 02] Ojala, Timo, Matti Pietikainen, and Topi Maenpaa. "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns." IEEE TPAMI 24.7 (2002): 971-987.

[Sivic et al. 02] Sivic, Josef, and Andrew Zisserman. "Video Google: A text retrieval approach to object matching in videos." Proc. ICCV, 2003.

[Perronnin et al. 07] Perronnin, Florent, and Christopher Dance. "Fisher kernels on visual vocabularies for image categorization." Proc. CVPR, 2007.

[Jegou et al. 10] Jegou, H., M. Douze, C. Schmid, and P. Perez. "Aggregating local descriptors into a compact image representation." Proc. CVPR, 2010.

[LeCun et al. 98] LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278-2324.

[Krizhevsky et al. 12] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Proc. NIPS, 2012.

[Goodfellow et al. 09] Goodfellow, Ian, et al. "Measuring invariances in deep networks." Proc. NIPS, 2009.

[Zeiler, Fergus 13] Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." arXiv preprint arXiv:1311.2901 (2013).

[Vondrick et al. 13] Vondrick, C., A. Khosla, T. Malisiewicz, and A. Torralba. "HOGgles: Visualizing object detection features." Proc. ICCV, 2013.

[Simonyan et al. 14] Simonyan, K., A. Vedaldi, and A. Zisserman. "Deep inside convolutional networks: Visualising image classification models and saliency maps." ICLR Workshop, 2014.