Evaluation of neural code compression techniques for image retrieval
Feature compression for Image Retrieval
Gabriel Nieves-Ponce (nieves1@umbc.edu) University of Maryland Baltimore County CMSC-676 Information Retrieval
○ Handcrafted Descriptors: SIFT, SURF, ORB, etc.
○ Learned Feature Vectors (neural codes): ResNet, VGG, etc.
○ Descriptors are small but inflexible
○ Neural Codes are large but flexible
○ Geometric Verification
Fig 1. SIFT Keypoints mapping
SIFT is a popular algorithm for computing image descriptors. As the name implies, one of the benefits of SIFT is scale invariance. Because of this, we can use both close-up and far-away images of an object and be confident that SIFT will capture keypoint descriptors shared by both images. In the next couple of slides I will provide a quick overview of how SIFT achieves this. First, we will briefly introduce the concept of convolutions. Convolutions are one of the cornerstones of image processing and are widely used in most popular algorithms, including SIFT.
Imagine you have a 100x100 image. Now imagine that you look at subsets of the image with a pixel area of 10x10, scanning the image as seen in Fig. 2. For every patch we perform some computation that returns a real-valued number. Once we have scanned the whole image, we end up with a lower-resolution matrix holding the results of the individual computations performed on the 10x10 “patches” within our image. What I just described is known as a convolution.
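To make the patch-scanning idea concrete, here is a minimal NumPy sketch. The 100x100 random image, the 10x10 averaging kernel, and the stride of 10 are illustrative assumptions, not part of the original slides.

```python
# A minimal sketch of the patch-scanning procedure described above.
import numpy as np

def naive_convolution(image, kernel, stride):
    """Slide `kernel` over `image` and record one value per patch."""
    k = kernel.shape[0]
    out_size = (image.shape[0] - k) // stride + 1
    output = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            patch = image[i * stride:i * stride + k, j * stride:j * stride + k]
            # One real number per patch: element-wise product, then sum.
            output[i, j] = np.sum(patch * kernel)
    return output

image = np.random.rand(100, 100)       # stand-in for a 100x100 grayscale image
kernel = np.ones((10, 10)) / 100.0     # simple averaging kernel (illustrative)
feature_map = naive_convolution(image, kernel, stride=10)
print(feature_map.shape)               # (10, 10) -> lower-resolution result
```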
Remember that 10x10 pixel area we mentioned before? This is called a kernel. The formal definition of kernels is beyond the scope of this presentation, but for the purpose of understanding SIFT we can define a kernel as a matrix of real numbers that, when multiplied (convolved) against the image, augments or diminishes certain features of the image. These kernels are also called filters; you can think of a filter as a collection of kernels. SIFT uses these filters to make edges in an image more pronounced and everything else less pronounced, which makes it easy for the algorithm to find the edges within the image. SIFT uses a Gaussian filter in this preprocessing step before computing its descriptors.
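As a rough illustration of the filtering idea, here is a small sketch assuming SciPy is available and a grayscale image is loaded as a NumPy array. Subtracting two Gaussian-blurred versions of the image (a difference of Gaussians) makes edges and blob-like structures stand out; this is similar in spirit to SIFT's edge emphasis, not its exact pipeline.

```python
# Gaussian filtering and edge emphasis (illustrative, not SIFT's exact steps).
import numpy as np
from scipy import ndimage

image = np.random.rand(100, 100)  # stand-in for a real grayscale image

# Blur the image at two different scales (sigma values).
blur_fine = ndimage.gaussian_filter(image, sigma=1.0)
blur_coarse = ndimage.gaussian_filter(image, sigma=2.0)

# Difference of Gaussians: edges stand out, smooth regions cancel out.
dog = blur_fine - blur_coarse
print(dog.shape)
```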
The next series of steps is quite involved and beyond the scope of this presentation, so I will quickly mention them without diving into the math. SIFT applies this preprocessing step a number of times, each time rescaling the image, to achieve scale invariance. It then looks for areas of interest known as keypoints. Each keypoint is computed as a function of its surrounding pixels: are there a lot of white pixels around these black pixels? Then this might be an edge. After all the keypoints are located, each is assigned an orientation to achieve rotational invariance.
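In practice, a library implementation handles all of these steps. A minimal OpenCV sketch might look like the following; it assumes opencv-python 4.4 or newer (where SIFT ships in the main module) and a placeholder image file named "query.jpg".

```python
# Minimal OpenCV sketch of computing SIFT keypoints and descriptors.
import cv2

image = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder filename

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(image, None)

# Each descriptor is a 128-dimensional vector attached to one keypoint.
print(f"{len(keypoints)} keypoints, descriptor shape: {descriptors.shape}")
```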
SURF is a popular algorithm that was inspired by the SIFT paper [1]. SURF is very similar to SIFT, and both have comparable retrieval performance. The main takeaway is that SURF is up to three times faster than SIFT and provides better rotational invariance while still achieving similar scale invariance.
○ Very large vector
Fig 3. VGG16 Convolutional Activation Map (CAM)
Remember earlier when we mentioned convolutions? A CNN is a neural network architecture that takes this concept to the extreme. See Fig. 4.
A CNN performs a series of convolutions followed by pooling. Each convolution produces a list of matrices called feature maps (FM), similar to the preprocessing step in SIFT. You can further convolve the feature maps to extract more abstract representations at the expense of spatial resolution.
A convolutional layer can be described, very naively, as a series of convolutions and pooling packaged into one phase. There are a number of different CNN architectures, such as ResNet50 and VGG16, and they all have their own unique implementations. The type of feature maps you get will depend not only on the architecture you decide to use, but also on the convolutional layer you choose. Unlike SIFT, the convolution kernels, or filters, in a CNN are updated and improved via backpropagation to achieve maximum class activation.
Finally, the rule of thumb is that the early convolutional layers learn simple edge detection while the deeper layers can detect very complex patterns. Take a look at Fig. 5 [5] to see a visualization of the kernels that were learned by the last VGG16 convolutional layer.
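As a rough illustration of how a neural code might be extracted from a pretrained CNN, here is a hedged PyTorch/torchvision sketch. The choice of VGG16, the global-average-pooling step, and the "query.jpg" filename are assumptions for illustration, not the exact setup evaluated in this work.

```python
# Illustrative sketch: extracting a "neural code" from a pretrained CNN.
# Assumes torch, torchvision, and Pillow are installed.
import torch
from torchvision import models, transforms
from PIL import Image

# Pretrained VGG16; model.features is the convolutional part of the network.
vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("query.jpg").convert("RGB")   # placeholder filename
batch = preprocess(image).unsqueeze(0)           # shape: (1, 3, 224, 224)

with torch.no_grad():
    feature_maps = vgg.features(batch)           # shape: (1, 512, 7, 7)
    # Global average pooling collapses the feature maps into one vector,
    # which can serve as the image's neural code.
    neural_code = feature_maps.mean(dim=(2, 3))  # shape: (1, 512)

print(neural_code.shape)
```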
○ A. Babenko et al. [3] demonstrated some success using compressed neural codes with minimal loss in retrieval performance.
○ Can we improve upon their findings?
Dimensionality reduction: reduce the number of variables under consideration by obtaining a set of principal variables.
○ Principal Component Analysis (PCA) [6]
○ Linear Discriminant Analysis (LDA) [7][8]
○ Product Quantization (PQ) [4]
○ Optimized Product Quantization (OPQ) [5]
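For intuition, here is a hedged sketch of how product quantization could be applied to a set of neural codes using the Faiss library. The 512-dimensional codes, 8 sub-quantizers, 8 bits per sub-quantizer, and the random stand-in data are illustrative assumptions, not the presentation's actual settings.

```python
# Illustrative Faiss sketch of Product Quantization (PQ) over neural codes.
# Assumes faiss-cpu (or faiss-gpu) is installed.
import numpy as np
import faiss

d = 512                                                # dimensionality of each neural code
codes = np.random.rand(10000, d).astype("float32")    # stand-in database codes
queries = np.random.rand(5, d).astype("float32")      # stand-in query codes

m, nbits = 8, 8                 # 8 sub-vectors, 8 bits each -> 8 bytes per code
index = faiss.IndexPQ(d, m, nbits)
index.train(codes)              # learn the sub-quantizer codebooks
index.add(codes)                # store the compressed codes

distances, neighbors = index.search(queries, 5)   # top-5 nearest codes per query
print(neighbors)
```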
The goal is to find the best performing low-resolution descriptors.
○ Both PCA and Discriminative Dimensionality Reduction (DDR) were used to generate low-resolution codes.
○ Unsupervised
○ Optimizes for highest variance
○ Supervised
○ Optimizes for highest separation of classes
* = PCA is applied to codes before DDR to avoid overfitting
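A hedged scikit-learn sketch of this two-stage pipeline (PCA first, then LDA as the discriminative step on the PCA-reduced codes) is shown below. The 512-dimensional codes, 10 classes, and component counts are placeholder values, not the experiment's actual data.

```python
# Two-stage reduction sketch: unsupervised PCA, then supervised LDA.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
neural_codes = rng.normal(size=(1000, 512)).astype("float32")  # stand-in codes
labels = rng.integers(0, 10, size=1000)                        # stand-in class labels

# Stage 1: PCA keeps the directions of highest variance
# (and helps avoid overfitting the discriminative step).
pca = PCA(n_components=64)
codes_pca = pca.fit_transform(neural_codes)

# Stage 2: LDA maximizes separation between classes.
# With C classes, LDA can produce at most C - 1 components.
lda = LinearDiscriminantAnalysis(n_components=9)
codes_lda = lda.fit_transform(codes_pca, labels)

print(codes_pca.shape, codes_lda.shape)   # (1000, 64) (1000, 9)
```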
○ ResNet, DenseNet, InceptionV3
○ Linear Discriminant Analysis (LDA) [7][8]
○ Product Quantization (PQ) [4]
○ Optimized Product Quantization (OPQ) [5]
○ Convolutional Activation Map (CAM) [10]
○ Kernel Deconvolution [11]
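As a rough sketch of how activations from a convolutional layer might be pulled out for visualization, here is a PyTorch forward-hook example. Note that this produces a plain channel-averaged activation map, a simplified stand-in for the full CAM technique [10]; the model, layer index, and dummy input are assumptions for illustration.

```python
# Capturing one convolutional layer's activations with a forward hook,
# then collapsing them into a single 2D map for visualization.
import torch
from torchvision import models

vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).eval()
captured = {}

def save_activation(module, inputs, output):
    captured["feature_maps"] = output.detach()

# Hook the last convolutional layer of VGG16's feature extractor.
vgg.features[28].register_forward_hook(save_activation)

dummy_image = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed image
with torch.no_grad():
    vgg(dummy_image)

# Average over the channel dimension to get one coarse activation map.
activation_map = captured["feature_maps"].mean(dim=1).squeeze(0)  # shape: (14, 14)
print(activation_map.shape)
```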
Fig 6. VGG16 Kernel Deconvolution
[1] Lowe, D. G., "Distinctive Image Features from Scale-Invariant Keypoints," International Journal of Computer Vision 60, 91–110 (2004). https://doi.org/10.1023/B:VISI.0000029664.99615.94
[2] Bay H., Tuytelaars T., Van Gool L., "SURF: Speeded Up Robust Features," in Leonardis A., Bischof H., Pinz A. (eds) Computer Vision – ECCV 2006. ECCV 2006. Lecture Notes in Computer Science, vol 3951. Springer, Berlin, Heidelberg.
[3] Babenko A., Slesarev A., Chigorin A., Lempitsky V., "Neural Codes for Image Retrieval," in Fleet D., Pajdla T., Schiele B., Tuytelaars T. (eds) Computer Vision – ECCV 2014. ECCV 2014. Lecture Notes in Computer Science, vol 8689. Springer, Cham.
[4] H. Jégou, M. Douze and C. Schmid, "Product Quantization for Nearest Neighbor Search," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117-128, Jan. 2011.
[5] T. Ge, K. He, Q. Ke and J. Sun, "Optimized Product Quantization for Approximate Nearest Neighbor Search," 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, 2013, pp. 2946-2953.
[6] L. I. Smith, "A Tutorial on Principal Components Analysis," http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf, February 26, 2002.
1938.
https://gitlab.com/nievespg/umbc/-/tree/10-term-paper/CMSC_676/term_paper