

SLIDE 1

Evaluation of neural code compression techniques for image retrieval

Feature compression for Image Retrieval

Gabriel Nieves-Ponce (nieves1@umbc.edu) University of Maryland Baltimore County CMSC-676 Information Retrieval

SLIDE 2

Intro to image retrieval

  • Image retrieval (IR) is a subset of information retrieval.
  • Aims to retrieve images that are semantically similar to user-defined queries.
  • Commonly used image representation techniques:

○ Handcrafted descriptors: SIFT, SURF, ORB, etc.
○ Learned feature vectors (neural codes): ResNet, VGG, etc.

  • No free lunch:

○ Descriptors are small but inflexible
○ Neural codes are large but flexible

SLIDE 3

Descriptors

  • Small and easy to compute
  • Rotations and translations may skew results
  • Requires additional steps to compute a similarity metric:

○ Geometric verification

Fig 1. SIFT keypoint mapping

SLIDE 4

Scale Invariant Feature Transform (SIFT)

SIFT is a popular algorithm for computing image descriptors. As the name implies, one of the benefits of SIFT is scale invariance: we can use both close-up and far-away images of an object and be confident that SIFT will capture keypoint descriptors shared by both images. The next couple of slides provide a quick overview of how SIFT achieves this, starting with a brief introduction to convolutions. Convolutions are one of the cornerstones of image processing and are widely used in most popular algorithms, including SIFT.

SLIDE 5

Convolutions

Imagine you have a 100x100 image. Now imagine that you look at subsets of the image with a pixel area of 10x10, scanning the image as seen in Fig. 2. For every patch we perform some computation that returns a real number. Once we have scanned the whole image, we end up with a lower-resolution matrix that is the result of all the individual computations performed on the 10x10 "patches" within our image. What I just described is known as a convolution.

Fig. 2 Kernel convolution
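The patch-scanning procedure just described can be sketched in a few lines of NumPy. This is an illustrative toy, not the SIFT implementation: a real convolution usually slides the kernel one pixel at a time with overlap, while here the stride equals the patch size, matching the 100x100 / 10x10 example above.

```python
import numpy as np

def patch_convolve(image, kernel):
    """Slide `kernel` over `image` in non-overlapping patches
    (stride = kernel size) and compute one real number per patch:
    the sum of the elementwise product patch * kernel."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h // kh, w // kw))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i * kh:(i + 1) * kh, j * kw:(j + 1) * kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.ones((100, 100))          # a toy 100x100 "image"
kernel = np.full((10, 10), 1 / 100)  # averaging kernel
result = patch_convolve(image, kernel)
print(result.shape)  # (10, 10) -- a lower-resolution matrix
```

With an averaging kernel this particular computation is just the mean of each patch; swapping in a different kernel changes what property of the patch gets measured.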

SLIDE 6

Scale Invariant Feature Transform (SIFT)

Remember that 10x10 pixel area we mentioned before? This is called a kernel. A formal definition of kernels is beyond the scope of this presentation, but for the purpose of understanding SIFT we can define a kernel as a matrix of real numbers that, when multiplied (convolved) against the image, augments - or diminishes - certain properties. If you have ever used a filter to alter an image on Instagram, you have most definitely used kernels before. These kernels are also called filters; you can think of a filter as a collection of kernels. SIFT uses these filters to make edges in an image more pronounced and everything else less pronounced, which makes it easy for the algorithm to find the edges within the image. SIFT applies a Gaussian filter to the image to produce a Gaussian blur as a preprocessing step before extracting the keypoint descriptors.
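As a concrete sketch, here is how a Gaussian kernel like the one SIFT convolves with the image can be built. The size and sigma values are illustrative choices, not SIFT's actual parameters.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Build a normalized 2D Gaussian kernel: weights fall off with
    distance from the center, and they sum to 1 so that convolving
    with it blurs the image without changing overall brightness."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return k / k.sum()

k = gaussian_kernel(5, 1.0)
print(round(k.sum(), 6))   # 1.0: weights are normalized
print(k[2, 2] > k[0, 0])   # True: the center pixel gets the largest weight
```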

SLIDE 7

Scale Invariant Feature Transform (SIFT)

The next series of steps is quite involved and beyond the scope of this presentation, so I will quickly mention them without diving into the math. SIFT applies this preprocessing step a number of times, each time rescaling the image to achieve scale invariance. It then looks for areas of interest known as keypoints. Each keypoint is scored as a function of its surrounding pixels - are there a lot of white pixels around these black pixels? Then this might be an edge. After all the keypoints are located, each is assigned an orientation to achieve rotational invariance. Finally, keypoint descriptors are computed and stored as 128-dimensional vectors.
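The "function of the surrounding pixels" SIFT uses is a difference of Gaussians: subtracting a heavily blurred copy of the signal from a lightly blurred one leaves large responses only where the intensity changes. A 1D toy sketch of that idea follows (the real SIFT pipeline does this on a 2D scale space, not a 1D signal):

```python
import numpy as np

def gaussian_smooth_1d(signal, sigma):
    """Smooth a 1D signal with a normalized Gaussian kernel
    (illustrative helper, not the actual SIFT implementation)."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    k /= k.sum()
    return np.convolve(signal, k, mode="same")

# A step edge: the difference of Gaussians responds strongly
# only near the point where the intensity jumps.
signal = np.r_[np.zeros(50), np.ones(50)]
dog = gaussian_smooth_1d(signal, 1.0) - gaussian_smooth_1d(signal, 2.0)
print(int(np.argmax(np.abs(dog))))  # near index 50, where the edge sits
```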
SLIDE 8

Speeded up Robust Feature (SURF)

SURF is a popular algorithm that was inspired by the SIFT paper [1]. SURF is very similar to SIFT, and both have comparable retrieval performance. The main takeaway is that SURF is up to three times faster than SIFT and provides better rotational invariance while still achieving similar scale invariance.

SLIDE 9

Neural Codes

  • High dimensionality

○ Very large vectors

  • Expensive to compute - GPUs
  • Invariant to rotations and translations
  • Can use the L2 norm directly to compute similarity

Fig 3. VGG16 Convolutional Activation Map (CAM)
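Because neural codes can be compared with the L2 norm directly, retrieval reduces to sorting the database by distance to the query, with no geometric verification step. A toy sketch follows, with random vectors standing in for real CNN features (the 4096 dimensionality mirrors a VGG fully-connected layer, but any size works):

```python
import numpy as np

def rank_by_l2(query, database):
    """Return database row indices sorted by L2 distance to the query."""
    dists = np.linalg.norm(database - query, axis=1)
    return np.argsort(dists)

rng = np.random.default_rng(0)
db = rng.normal(size=(5, 4096))                # 5 toy "neural codes"
query = db[3] + 0.01 * rng.normal(size=4096)   # near-duplicate of item 3
print(rank_by_l2(query, db)[0])                # 3: the near-duplicate ranks first
```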

SLIDE 10

Convolutional Neural Network (CNN)

Remember earlier when we mentioned convolutions? A CNN is a neural network architecture that takes this concept to the extreme; see Fig. 4.

Fig. 4 CNN architecture
SLIDE 11

Convolutional Neural Network (CNN)

A CNN performs a series of convolutions followed by pooling. The output of each convolution is a list of matrices called feature maps (FMs), similar to the preprocessing step in SIFT. You can further convolve the feature maps to extract higher-level representations at the expense of spatial resolution.

Fig. 5 CNN architecture
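The pooling step that follows each convolution can be sketched as 2x2 max pooling, which keeps only the strongest activation in each patch and halves the feature map's resolution. This is an illustrative toy, one of several pooling variants CNNs use:

```python
import numpy as np

def max_pool(feature_map, size=2):
    """2x2 max pooling: take the maximum of each non-overlapping
    size-by-size patch, halving the feature map's resolution."""
    h, w = feature_map.shape
    fm = feature_map[:h - h % size, :w - w % size]
    return fm.reshape(h // size, size, w // size, size).max(axis=(1, 3))

fm = np.array([[1., 2., 5., 6.],
               [3., 4., 7., 8.],
               [0., 0., 1., 0.],
               [0., 9., 0., 0.]])
print(max_pool(fm))
# [[4. 8.]
#  [9. 1.]]
```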
SLIDE 12

Convolutional Neural Network (CNN)

A convolutional layer can be described, very naively, as a series of convolutions and pooling packaged into one phase. There are a number of different CNN architectures, such as ResNet50 and VGG16, and they all have their own unique implementations. The type of feature maps you get will depend not only on the architecture you decide to use, but also on the convolutional layer you choose. Unlike SIFT, the convolution kernels, or filters, in a CNN are updated and improved via backpropagation to achieve maximum class activation (see Fig. 3 [4]) without any human input: the network learns which filters are best directly from the dataset.

Finally, the rule of thumb is that the early convolutional layers learn simple edge detection while the deeper layers can detect very complex patterns. Take a look at Fig. 5 [5] to see a visualization of the kernels learned by the last VGG16 convolutional layer.

SLIDE 13

Problem statement

  • Neural codes have yielded promising results in retrieval tasks when compared to descriptors [3].
  • Uncompressed state-of-the-art descriptors were able to outperform uncompressed neural codes.
  • With the advent of cloud computing, accelerators such as GPUs have become accessible to most consumers.
  • Can we overcome the feature vector size problem?

○ A. Babenko et al. [3] demonstrated some success using compressed neural codes with minimal loss in performance.
○ Can we improve upon their findings?

SLIDE 14

Compression and Representation

  • Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables.
  • Dimensionality Reduction:

○ Principal Component Analysis (PCA) [6]
○ Linear Discriminant Analysis (LDA) [7][8]

  • Quantization (Compact Coding for ANN Search):

○ Product Quantization (PQ) [4]
○ Optimized Product Quantization (OPQ) [5]
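A minimal sketch of PCA-based compression using NumPy's SVD, on toy data built to have most of its variance in two directions. In a real pipeline the projection would be fit on training codes and reused at query time; names and sizes here are illustrative.

```python
import numpy as np

def pca_compress(codes, n_components):
    """Project feature vectors onto the directions of highest variance
    (principal components), keeping only `n_components` of them."""
    centered = codes - codes.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(1)
# 100 toy 512-d "neural codes" whose variance lives mostly in 2 directions
latent = rng.normal(size=(100, 2)) * [10.0, 5.0]
codes = latent @ rng.normal(size=(2, 512)) + 0.01 * rng.normal(size=(100, 512))
compressed = pca_compress(codes, 2)
print(compressed.shape)  # (100, 2): 512 dimensions reduced to 2
```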

SLIDE 15

Related Work

  • Babenko et al. [3] demonstrated that compressed neural codes outperform even the best performing low-resolution descriptors.

○ Both PCA and Discriminative Dimensionality Reduction (DDR) were used to generate low-resolution codes.

  • PCA:

○ Unsupervised
○ Optimizes for highest variance

  • Discriminative Dimensionality Reduction (*):

○ Supervised
○ Optimizes for highest separation between classes

* = PCA is applied to the codes before DDR to avoid overfitting

SLIDE 16

Goals

  • Compare performance using state-of-the-art deep neural networks (DNNs):

○ ResNet, DenseNet, InceptionV3

  • Extend the compression techniques:

○ Linear Discriminant Analysis (LDA) [7][8]
○ Product Quantization (PQ) [4]
○ Optimized Product Quantization (OPQ) [5]

  • Further research convolutional features and their quality:

○ Convolutional Activation Map (CAM) [10]
○ Kernel Deconvolution [11]
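Product quantization [4] can itself be sketched in a few lines: split each vector into m subvectors, learn a small k-means codebook per subspace, and store each vector as m codebook indices instead of floats. This toy version runs a handful of Lloyd iterations on random data; the sizes are illustrative, and real implementations are far more optimized.

```python
import numpy as np

def train_pq(data, m=4, k=16, iters=10, seed=0):
    """Train a product quantizer: split each vector into m subvectors
    and learn a k-word codebook per subspace via Lloyd (k-means) steps."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    sub = d // m
    codebooks = []
    for j in range(m):
        block = data[:, j * sub:(j + 1) * sub]
        centers = block[rng.choice(n, k, replace=False)]
        for _ in range(iters):
            assign = np.argmin(((block[:, None] - centers) ** 2).sum(-1), axis=1)
            for c in range(k):
                if np.any(assign == c):
                    centers[c] = block[assign == c].mean(axis=0)
        codebooks.append(centers)
    return codebooks

def pq_encode(x, codebooks):
    """Encode one vector as m small codebook indices (m bytes if k<=256)."""
    sub = len(x) // len(codebooks)
    return [int(np.argmin(((cb - x[j * sub:(j + 1) * sub]) ** 2).sum(-1)))
            for j, cb in enumerate(codebooks)]

rng = np.random.default_rng(2)
data = rng.normal(size=(200, 32))     # 200 toy 32-d vectors
books = train_pq(data)
code = pq_encode(data[0], books)
print(len(code))  # 4 indices, one per subspace, replacing 32 floats
```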

SLIDE 17

Fig 6. VGG16 Kernel Deconvolution

SLIDE 18

References

  • [1] Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60, 91–110 (2004). https://doi.org/10.1023/B:VISI.0000029664.99615.94
  • [2] Bay H., Tuytelaars T., Van Gool L. (2006) SURF: Speeded Up Robust Features. In: Leonardis A., Bischof H., Pinz A. (eds) Computer Vision – ECCV 2006. Lecture Notes in Computer Science, vol 3951. Springer, Berlin, Heidelberg.
  • [3] Babenko A., Slesarev A., Chigorin A., Lempitsky V. (2014) Neural Codes for Image Retrieval. In: Fleet D., Pajdla T., Schiele B., Tuytelaars T. (eds) Computer Vision – ECCV 2014. Lecture Notes in Computer Science, vol 8689. Springer, Cham.
  • [4] H. Jégou, M. Douze and C. Schmid, "Product Quantization for Nearest Neighbor Search," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117-128, Jan. 2011.

SLIDE 19

References

  • [5] T. Ge, K. He, Q. Ke and J. Sun, "Optimized Product Quantization for Approximate Nearest Neighbor Search," 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, 2013, pp. 2946-2953.
  • [6] Lindsay I. Smith, "A Tutorial on Principal Component Analysis," http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf, February 26, 2002.
  • [7] R.A. Fisher, "The Statistical Utilization of Multiple Measurements," Annals of Eugenics, vol. 8, pp. 376-386, 1938.
  • [8] K. Fukunaga, Introduction to Statistical Pattern Recognition, second ed. Academic Press, 1990.
  • [9] Nieves-Ponce, Gabriel. 2020, UMBC CMSC-676 Term Paper, master, https://gitlab.com/nievespg/umbc/-/tree/10-term-paper/CMSC_676/term_paper

SLIDE 20

References

  • [10] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative Localization. In CVPR, 2016.
  • [11] M. D. Zeiler and R. Fergus. Visualizing and Understanding Convolutional Networks. In ECCV, 2014.