Understanding image representations by measuring their equivariance and equivalence
Karel Lenc, Andrea Vedaldi
Visual Geometry Group, Department of Engineering Science
Representations for image understanding

image → feature → semantic

Ultimate goal of a representation: simplify a task such as image classification.

Many representations:
▶ Local image descriptors
SIFT [Lowe 04], HOG [Dalal et al. 05], SURF [Bay et al. 06], LBP [Ojala et al. 02], …
▶ Feature encoders
BoVW [Sivic et al. 02], Fisher Vector [Perronnin et al. 07], VLAD [Jegou et al. 10], sparse coding, …
▶ Deep convolutional neural networks
[Fukushima 1974-1982, LeCun et al. 98, Krizhevsky et al. 12, …]
Many designs are empirical; the main theoretical design principle is invariance.
However, many representations, such as HOG, are not invariant, even to simple transformations.
But they often transform in a simple and predictable manner.
But what happens with more complex transformations, like affine ones?
What happens with more complex representations, like CNNs?
Invariance of CNN representations has been studied in [Goodfellow et al. 09] and [Zeiler, Fergus 13].
How does a representation reflect image transformations? Given an image x, a transformation g, and a representation φ, how does φ(gx) relate to φ(x)?
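The paper formalises this as equivariance: φ(gx) ≈ M_g φ(x) for a simple map M_g that depends only on g. The cleanest case is translation acting on a convolutional feature map. A toy, self-contained sketch (illustrative only, not the paper's code) where φ is a circular convolution and g a translation, so that M_g is exactly the same translation applied to the features:

```python
# Toy illustration: a convolutional feature map is equivariant to
# translation, phi(g x) = M_g phi(x), where M_g translates the feature map
# the same way g translates the image. Circular convolution makes the
# relation exact. All names and data here are illustrative.

def circ_conv(img, kernel):
    """2-D circular convolution of a list-of-lists image with a kernel."""
    H, W = len(img), len(img[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            s = 0.0
            for i in range(kh):
                for j in range(kw):
                    s += kernel[i][j] * img[(y + i) % H][(x + j) % W]
            out[y][x] = s
    return out

def shift(img, dy, dx):
    """Circular translation: the action of g on images and of M_g on features."""
    H, W = len(img), len(img[0])
    return [[img[(y - dy) % H][(x - dx) % W] for x in range(W)] for y in range(H)]

img = [[float((x * 7 + y * 3) % 5) for x in range(6)] for y in range(6)]
k = [[1.0, -1.0], [0.5, 0.25]]

phi_gx = circ_conv(shift(img, 2, 1), k)   # phi(g x): transform, then represent
Mg_phi = shift(circ_conv(img, k), 2, 1)   # M_g phi(x): represent, then transform
assert phi_gx == Mg_phi                    # equivariance holds exactly here
```

For representations like HOG or deeper CNN layers the relation is only approximate, which is why the paper estimates M_g empirically.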
Learning representations means there is an endless number of them: variants obtained by learning on different datasets, or reaching different local optima.
How does a representation reflect image transformations? And given two representations φ_A and φ_B of the same image, are they related by a simple map?
Do different representations have different meanings?
[Figure: the learned map M_g applied to φ(x), structured as a spatial permutation of feature cells followed by convolution with a 1×1 filter bank.]
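A minimal sketch of this structure (names and data are illustrative, not the paper's code): M_g first permutes the cells of a C×H×W feature map, then mixes channels at every location with a 1×1 filter bank, i.e. a C×C matrix:

```python
# Hypothetical sketch of the structure used for M_g: a spatial permutation
# of the cells of a C x H x W feature map, followed by a 1x1 filter bank
# (a C x C matrix applied to the channel vector at every location).

def apply_Mg(feat, perm, filt):
    """feat: feat[c][y][x]; perm: (y, x) -> source cell (y', x');
    filt: filt[c_out][c_in], the 1x1 filter bank."""
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    # Step 1: spatial permutation (e.g. the cell shuffle induced by a flip).
    permuted = [[[feat[c][perm(y, x)[0]][perm(y, x)[1]] for x in range(W)]
                 for y in range(H)] for c in range(C)]
    # Step 2: 1x1 convolution = per-location linear map across channels.
    return [[[sum(filt[co][ci] * permuted[ci][y][x] for ci in range(C))
              for x in range(W)] for y in range(H)] for co in range(C)]

# Example: horizontal flip as the permutation, identity filter bank.
C, H, W = 2, 3, 4
feat = [[[float(c * 100 + y * 10 + x) for x in range(W)] for y in range(H)]
        for c in range(C)]
flip = lambda y, x: (y, W - 1 - x)
identity = [[1.0 if i == j else 0.0 for j in range(C)] for i in range(C)]
out = apply_Mg(feat, flip, identity)
assert out[0][0] == [3.0, 2.0, 1.0, 0.0]   # row 0 of channel 0, reversed
```

Restricting M_g to this sparse form keeps the number of learned parameters small and matches the spatial nature of the transformations studied.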
Findings (so far)
▶ HOG: approximately equivariant to the studied transformations.
The CNN case: we run the same analysis on a typical CNN architecture.
▶ AlexNet [Krizhevsky et al. 12]
▶ 5 convolutional layers + fully-connected layers
▶ Trained on ImageNet ILSVRC
[Figure: the map M_{g⁻¹} is inserted after one of AlexNet's convolutional layers, splitting the network as 1|2345, 12|345, 123|45, 1234|5 or 12345|FC; the map is learned empirically by minimising the classification loss ℓ.]

[Plot: Top-5 error (%), roughly 10–60. Bars compare the original classifier without and with the transformation applied, and the inserted map M_{g⁻¹} before and after training.]
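Besides training through the classification loss, such a map can also be fitted directly from feature pairs (φ(x), φ(gx)) by regression. A toy pure-Python sketch on synthetic features (all data here is made up for illustration), recovering a known linear map by gradient descent on the squared error:

```python
# Toy sketch (synthetic data, illustrative only): fit a linear map M so
# that M @ phi(x) ~= phi(g x) over a set of feature pairs, by gradient
# descent on the squared error.
import random

random.seed(0)
D = 4                                   # feature dimension
M_true = [[random.gauss(0, 1) for _ in range(D)] for _ in range(D)]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(D)) for i in range(D)]

pairs = []                              # (phi(x), phi(g x)) training pairs
for _ in range(200):
    v = [random.gauss(0, 1) for _ in range(D)]
    pairs.append((v, matvec(M_true, v)))

M = [[0.0] * D for _ in range(D)]       # initialise the learned map at zero
lr = 0.05
for _ in range(300):                    # gradient descent on the squared error
    grad = [[0.0] * D for _ in range(D)]
    for v, target in pairs:
        err = [p - t for p, t in zip(matvec(M, v), target)]
        for i in range(D):
            for j in range(D):
                grad[i][j] += 2 * err[i] * v[j]
    for i in range(D):
        for j in range(D):
            M[i][j] -= lr * grad[i][j] / len(pairs)

# After training, M should be close to the ground-truth map.
max_err = max(abs(M[i][j] - M_true[i][j]) for i in range(D) for j in range(D))
assert max_err < 1e-2
```

In practice the features are high-dimensional and the map is regularised and structured (permutation + 1×1 filter bank), but the fitting principle is the same.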
Findings
▶ Equivariant: HOG, early convolutional layers in CNNs.
▶ To a lesser degree: deeper convolutional layers in CNNs.
Second question: do different representations have different meanings? Given two representations φ_A and φ_B of the same image, are they related by a simple map?
[Figure: "Franken-CNN" — the first k layers of CNN-B are combined with the remaining layers of CNN-A (and its classifier) through a stitching map learned by minimising the classification loss ℓ, for split points 1|2345, 12|345, 123|45, 1234|5 and 12345|FC.]

[Plot: Top-5 error (%), roughly 10–100, of the stitched networks at each split point.]
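The idea behind the stitched network can be pictured with a toy linear example (entirely synthetic, not the paper's code): if two networks compute equivalent representations of the same underlying information, a stitching map lets the deep half of one run on the shallow half of the other:

```python
# Toy picture of a "Franken-CNN": two networks compute equivalent
# representations phi_A = R_A u and phi_B = R_B u of the same underlying
# information u. A stitching layer E = R_A R_B^{-1} lets net A's deep
# layers run on net B's shallow layers. (In the paper, E is learned by
# minimising the classification loss; here it is computed in closed form.)

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    n = len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(len(B[0]))]
            for i in range(len(A))]

def inv2(M):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

R_A = [[0.0, -1.0], [1.0, 0.0]]         # net A's shallow layers (a linear map)
R_B = [[2.0, 1.0], [1.0, 1.0]]          # net B's shallow layers
deep_A = lambda f: f[0] + 3 * f[1]      # net A's deep layers (a toy score)

E = matmul(R_A, inv2(R_B))              # the stitching layer

u = [0.5, -2.0]                         # shared underlying information
phi_A = matvec(R_A, u)
phi_B = matvec(R_B, u)

# Franken pipeline: net B shallow -> stitching layer -> net A deep.
franken = deep_A(matvec(E, phi_B))
assert abs(franken - deep_A(phi_A)) < 1e-9
```

Real CNN features are nonlinear and much higher-dimensional, so E is only approximate there; how well the stitched network classifies is exactly what the Top-5 error plots measure.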
Now even the training sets differ: CNN-IMNET (trained on ImageNet) vs. CNN-PLCS (trained on Places).

[Plot: Top-5 error (%), roughly 10–100, when stitching CNN-IMNET and CNN-PLCS at each split point 1|2345 through 12345|FC.]
Application of equivariant maps: transform features instead of images. This allows a significant speedup at test time.
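The shortcut is: compute the expensive representation once, then obtain the features of transformed images by applying the cheap map M_g to the cached features. A toy sketch (illustrative features, not HOG or the paper's code) where g is a horizontal flip and M_g is just a permutation of feature cells:

```python
# Illustrative shortcut: with an equivariant representation,
# phi(g x) = M_g phi(x), so features of a transformed image come from
# transforming cached features instead of re-running phi. Here phi is a
# toy "HOG-like" cell-averaging feature and g a horizontal flip, so M_g
# is a permutation of the feature cells.

def phi(img, cell=2):
    """Toy feature: mean of each cell x cell block of the image."""
    H, W = len(img), len(img[0])
    return [[sum(img[y * cell + i][x * cell + j] for i in range(cell)
                 for j in range(cell)) / cell**2
             for x in range(W // cell)] for y in range(H // cell)]

def hflip(rows):
    """Horizontal flip, of an image or of a feature-cell grid."""
    return [list(reversed(r)) for r in rows]

img = [[float((3 * y + x) % 7) for x in range(8)] for y in range(6)]

slow = phi(hflip(img))     # recompute the representation on the flipped image
fast = hflip(phi(img))     # M_g: just permute the cached feature cells
assert slow == fast
```

For HOG the permutation also swaps orientation bins, and the paper exploits this to accelerate structured output regression by evaluating transformed detection hypotheses directly in feature space.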
Representing geometry
▶ Beyond invariance: equivariance
▶ Application to accelerated structured output regression
Representation equivalence
▶ Early CNN layers are interchangeable, even between tasks
General idea
▶ Study mathematical properties of representations empirically
[Lowe 04] Lowe, David G. "Distinctive image features from scale-invariant keypoints." International Journal of Computer Vision 60.2 (2004): 91-110.
[Dalal et al. 05] Dalal, Navneet, and Bill Triggs. "Histograms of oriented gradients for human detection." CVPR 2005.
[Bay et al. 06] Bay, Herbert, Tinne Tuytelaars, and Luc Van Gool. "SURF: Speeded up robust features." ECCV 2006: 404-417.
[Ojala et al. 02] Ojala, Timo, Matti Pietikainen, and Topi Maenpaa. "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns." IEEE TPAMI 24.7 (2002): 971-987.
[Sivic et al. 02] Sivic, Josef, and Andrew Zisserman. "Video Google: A text retrieval approach to object matching in videos." ICCV 2003.
[Perronnin et al. 07] Perronnin, Florent, and Christopher Dance. "Fisher kernels on visual vocabularies for image categorization." CVPR 2007.
[Jegou et al. 10] Jegou, H., M. Douze, C. Schmid, and P. Perez. "Aggregating local descriptors into a compact image representation." CVPR 2010.
[LeCun et al. 98] LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278-2324.
[Krizhevsky et al. 12] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." NIPS 2012.
[Goodfellow et al. 09] Goodfellow, Ian, et al. "Measuring invariances in deep networks." NIPS 2009.
[Zeiler, Fergus 13] Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." arXiv preprint arXiv:1311.2901 (2013).
[Vondrick et al. 13] Vondrick, C., A. Khosla, T. Malisiewicz, and A. Torralba. "HOGgles: Visualizing object detection features." ICCV 2013.
[Simonyan et al. 14] Simonyan, K., A. Vedaldi, and A. Zisserman. "Deep inside convolutional networks: Visualising image classification models and saliency maps." ICLR Workshop 2014.