

slide-1
SLIDE 1

Visual Place Recognition as Image Retrieval with CNN

Giorgos Tolias Visual Recognition Group, CTU in Prague CVPR 2017 tutorial on Large-Scale Visual Place Recognition and Image-Based Localization

Alex Kendall, Torsten Sattler, Giorgos Tolias, Akihiko Torii

slide-2
SLIDE 2

Visual place recognition

slide-3
SLIDE 3

Visual place recognition by image retrieval

Diagram: query image → query descriptor → nearest-neighbor search over descriptors for database images

slide-4
SLIDE 4

http://viral.image.ntua.gr

slide-5
SLIDE 5

CNN as feature extractors

  • CNN pre-trained for image classification
  • Internal layer activations used as features
  • Good generalization properties across tasks:
  • Detection
  • Fine-grained classification
  • Scene classification
  • Semantic segmentation
  • …

Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: DeCAF: A deep convolutional activation feature for generic visual recognition. In: arXiv:1310.1531. (2013). Razavian, A.S., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: An astounding baseline for recognition. In: CVPRW. (2014)

Figure from Razavian et al.

slide-6
SLIDE 6

Image retrieval with pre-trained CNN

slide-7
SLIDE 7

Global image representation – FC layer

  • Features: FC layer activations
  • Resize/crop to fixed image size

Babenko, A., Slesarev, A., Chigorin, A., Lempitsky, V.: Neural codes for image retrieval. In: ECCV. (2014) Razavian, A.S., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: An astounding baseline for recognition. In: CVPRW. (2014)

Figure from Babenko et al.

slide-8
SLIDE 8

Global image representation – Conv layer

  • Features: Conv layer activations
  • Global max or sum pooling
  • Any input image size
  • Better to use the last conv layer (VGG, AlexNet)

Azizpour, H., Razavian, A.S., Sullivan, J., Maki, A., Carlsson, S.: From generic to specific deep representations for visual recognition. In: CVPRW. (2015) Babenko, A., Lempitsky, V.: Aggregating deep convolutional features for image retrieval. In: ICCV. (2015)

Figure from Razavian et al.

slide-9
SLIDE 9

Spatial and channel weighting

  • Channel-wise and spatial-wise weighting
  • Global sum pooling
  • Channel-wise: IDF-like weighting
  • Spatial-wise: saliency mask by L2 norm

Figures from Kalantidis et al.

Kalantidis, Y., Mellina, C., Osindero, S.: Cross-dimensional weighting for aggregated deep convolutional features. In: ECCVW (2016)

slide-10
SLIDE 10

Figure: input image and its conv5 feature maps (filter 1, filter 2, …, filter i, …, filter K)

Maximum Activations of Convolutions - MAC

Tolias, G., Sicre, R., Jegou, H.: Particular object retrieval with integral max pooling of CNN activations. In: ICLR. (2016)

slide-11
SLIDE 11

Figure: input image and its conv5 feature maps (filter 1 … filter K), with the maximum activation of each map highlighted

Maximum Activations of Convolutions - MAC

Tolias, G., Sicre, R., Jegou, H.: Particular object retrieval with integral max pooling of CNN activations. In: ICLR. (2016)

slide-12
SLIDE 12

MAC similarity

  • Similarity: inner product of L2-normalized MAC descriptors
  • Each dimension pairs the maxima of the same feature map in the two images
  • Implicitly forms correspondences (512 for VGG)

Tolias, G., Sicre, R., Jegou, H.: Particular object retrieval with integral max pooling of CNN activations. In: ICLR. (2016)
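A minimal NumPy sketch of MAC extraction and MAC similarity (the random feature maps below are stand-ins for real conv5 activations):

```python
import numpy as np

def mac(fmap):
    # fmap: (K, H, W) conv-layer activations, e.g. K = 512 for VGG conv5
    v = fmap.reshape(fmap.shape[0], -1).max(axis=1)  # per-filter global max pooling
    return v / np.linalg.norm(v)                     # L2 normalization

def mac_similarity(fa, fb):
    # inner product of L2-normalized MAC descriptors; each dimension compares
    # the strongest response of the same filter in the two images
    return float(mac(fa) @ mac(fb))

rng = np.random.default_rng(1)
a = np.abs(rng.normal(size=(512, 30, 40)))  # toy post-ReLU feature map
b = np.abs(rng.normal(size=(512, 30, 40)))
s_self = mac_similarity(a, a)
s_other = mac_similarity(a, b)
```

An image compared with itself gives similarity exactly 1; distinct images give a strictly smaller value.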

slide-13
SLIDE 13

Regional Maximum Activations of Convolutions R-MAC

  • Extract MAC descriptor per region
  • Sum pool regional descriptors
  • Global image representation (same dimensionality as MAC)
  • PCA Whitening

Tolias, G., Sicre, R., Jegou, H.: Particular object retrieval with integral max pooling of CNN activations. In: ICLR. (2016)

Pipeline: regional MAC → L2 norm → whitening → L2 norm
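A simplified R-MAC sketch (here a uniform, non-overlapping grid per scale; the paper samples overlapping square regions at several scales and inserts PCA whitening between the two normalizations):

```python
import numpy as np

def l2n(v):
    return v / (np.linalg.norm(v) + 1e-12)

def rmac(fmap, levels=3):
    # fmap: (K, H, W) conv activations. For each scale l, split the map into
    # an l x l grid, take a MAC per region, L2-normalize, and sum-pool.
    K, H, W = fmap.shape
    agg = np.zeros(K)
    for l in range(1, levels + 1):
        hs = np.linspace(0, H, l + 1, dtype=int)
        ws = np.linspace(0, W, l + 1, dtype=int)
        for i in range(l):
            for j in range(l):
                region = fmap[:, hs[i]:hs[i + 1], ws[j]:ws[j + 1]]
                agg += l2n(region.reshape(K, -1).max(axis=1))  # regional MAC
    return l2n(agg)  # same dimensionality K as a single MAC

rng = np.random.default_rng(2)
f = np.abs(rng.normal(size=(512, 30, 40)))
d = rmac(f)
```

The result is a single global descriptor with the same dimensionality as MAC, regardless of the number of regions.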

slide-14
SLIDE 14

Comparison with local feature based methods

Method                        Oxf5k  Oxf105k  Par6k  Par106k
BoW(16M) + geometry + QE      84.9   79.5     82.4   77.3
Hamming Query Expansion       88.0   84.0     82.8   –
Triangulation Emb. (1024D)    56.0   50.2     –      –
R-MAC (512D)                  66.9   61.6     83.0   75.7

Local-feature methods: 3-4k features per image, memory demanding.
R-MAC: compact representation, one descriptor per image.

slide-15
SLIDE 15

Object localization

Tolias, G., Sicre, R., Jegou, H.: Particular object retrieval with integral max pooling of CNN activations. In: ICLR. (2016)

slide-16
SLIDE 16

Object localization

Tolias, G., Sicre, R., Jegou, H.: Particular object retrieval with integral max pooling of CNN activations. In: ICLR. (2016)

slide-17
SLIDE 17

Object localization with integral max pooling

Tolias, G., Sicre, R., Jegou, H.: Particular object retrieval with integral max pooling of CNN activations. In: ICLR. (2016)

slide-18
SLIDE 18

Object localization with integral max pooling

Tolias, G., Sicre, R., Jegou, H.: Particular object retrieval with integral max pooling of CNN activations. In: ICLR. (2016)

slide-19
SLIDE 19

Object localization with integral max pooling

Tolias, G., Sicre, R., Jegou, H.: Particular object retrieval with integral max pooling of CNN activations. In: ICLR. (2016)

slide-20
SLIDE 20

Object localization with integral max pooling

Initial ranking (IR) → Re-ranking (RR) examples

Tolias, G., Sicre, R., Jegou, H.: Particular object retrieval with integral max pooling of CNN activations. In: ICLR. (2016)

slide-21
SLIDE 21

Object localization with integral max pooling

Initial ranking (IR) → Re-ranking (RR) examples

Tolias, G., Sicre, R., Jegou, H.: Particular object retrieval with integral max pooling of CNN activations. In: ICLR. (2016)

< 3 seconds to re-rank 1000 images using 1 CPU thread

slide-22
SLIDE 22

Comparison with local features, geometry, and query expansion

Method                        Oxf5k  Oxf105k  Par6k  Par106k
BoW(16M) + geometry + QE      84.9   79.5     82.4   77.3
Hamming Query Expansion       88.0   84.0     82.8   –
R-MAC (512D)                  66.9   61.6     83.0   75.7
R-MAC + localization + QE     77.3   73.2     86.5   79.8

slide-23
SLIDE 23

Known encodings applied on CNN local descriptors

  • Bag-of-Words
  E. Mohedano, A. Salvador, K. McGuinness, X. Giró-i-Nieto, N. O'Connor, F. Marqués: Bags of Local Convolutional Features for Scalable Instance Search. In ICMR 2016
  • Fisher vectors
  P. Kulkarni, J. Zepeda, F. Jurie, P. Perez, L. Chevallier: Hybrid multi-layer deep CNN/aggregator feature for image classification. In ICASSP 2015
  • VLAD
  Y. Gong, L. Wang, R. Guo, S. Lazebnik: Multi-scale Orderless Pooling of Deep Convolutional Activation Features. In ECCV 2014

Other approaches

Figure from Mohedano et al. Figure from Gong et al.

slide-24
SLIDE 24

Off-the-shelf CNN

  • Target application: classification
  • Training dataset: ImageNet
  • Architecture: AlexNet, VGG, ResNet
  • Directly applicable to other tasks

Images from ImageNet.org

Fine-grained classification

Images from ImageNet.org

Object detection

Images from PASCAL VOC 2012

Image retrieval

slide-25
SLIDE 25

CNN fine-tuning for image retrieval

slide-26
SLIDE 26

Large Internet photo collection

Convolutional Neural Network (CNN) Image annotations Training

Lots of Training Examples

slide-27
SLIDE 27

Large Internet photo collection

Convolutional Neural Network (CNN) Not accurate Expensive $$

Lots of Training Examples

slide-28
SLIDE 28

Large Internet photo collection

Convolutional Neural Network (CNN) Not accurate Expensive $$

Manual cleaning of the training data done by researchers

Very expensive $$$$

Lots of Training Examples

slide-29
SLIDE 29

Large Internet photo collection

Convolutional Neural Network (CNN) Not accurate Expensive $$

Manual cleaning of the training data done by researchers

Very expensive $$$$

Automated extraction of training data

Accurate Free $

Lots of Training Examples

slide-30
SLIDE 30

Annotations for CNN Image Retrieval

CNN pre-trained for classification task used for retrieval

[Gong et al. ECCV’14, Babenko et al. ICCV’15, Kalantidis et al. ECCVW’16, Tolias et al. ICLR’16]

Building class

slide-31
SLIDE 31

Annotations for CNN Image Retrieval

CNN pre-trained for classification task used for retrieval

[Gong et al. ECCV’14, Babenko et al. ICCV’15, Kalantidis et al. ECCVW’16, Tolias et al. ICLR’16]

Building class Landmark class

Fine-tuned CNN using a dataset with landmark classes

[Babenko et al. ECCV’14]

slide-32
SLIDE 32

Annotations for CNN Image Retrieval

CNN pre-trained for classification task used for retrieval

[Gong et al. ECCV’14, Babenko et al. ICCV’15, Kalantidis et al. ECCVW’16, Tolias et al. ICLR’16]

Building class Landmark class spatially closest ≠ matching

Fine-tuned CNN using a dataset with landmark classes

[Babenko et al. ECCV’14]

NetVLAD: Weakly supervised fine-tuned CNN using GPS tags

[Arandjelovic et al. CVPR’16]

slide-33
SLIDE 33

NetVLAD

W×H×D feature map of the last conv layer

Negatives: geographically far. Positives: geographically close and close in the feature space.

Figures from Arandjelovic et al.

slide-34
SLIDE 34

Annotations for CNN Image Retrieval

CNN pre-trained for classification task used for retrieval

[Gong et al. ECCV’14, Babenko et al. ICCV’15, Kalantidis et al. ECCVW’16, Tolias et al. ICLR’16]

Building class Landmark class spatially closest ≠ matching

Fine-tuned CNN using a dataset with landmark classes

[Babenko et al. ECCV’14]

NetVLAD: Weakly supervised fine-tuned CNN using GPS tags

[Arandjelovic et al. CVPR’16]

slide-35
SLIDE 35

Automatic annotations for CNN training [Radenovic et al. ECCV’16]

Hard positives Hard negatives

Annotations for CNN Image Retrieval

CNN pre-trained for classification task used for retrieval

[Gong et al. ECCV’14, Babenko et al. ICCV’15, Kalantidis et al. ECCVW’16, Tolias et al. ICLR’16]

Building class Landmark class spatially closest ≠ matching

Fine-tuned CNN using a dataset with landmark classes

[Babenko et al. ECCV’14]

NetVLAD: Weakly supervised fine-tuned CNN using GPS tags

[Arandjelovic et al. CVPR’16]

slide-36
SLIDE 36

CNN learns from BoW – Training Data

7.4M images → 713 training 3D models

[Schonberger et al. CVPR’15] [Radenovic et al. CVPR’16]

slide-37
SLIDE 37

CNN learns from BoW – Training Data

Camera Orientation Known Number of Inliers Known

7.4M images → 713 training 3D models

[Schonberger et al. CVPR’15] [Radenovic et al. CVPR’16]

slide-38
SLIDE 38

Hard Negative Examples

anchor
Negative examples: images from different 3D models than the anchor
Hard negatives: closest negative examples to the anchor
Only hard negatives: as good as using all negatives, but faster

  • F. Radenovic, G. Tolias and O. Chum, CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples, In ECCV 2016
slide-39
SLIDE 39

Hard Negative Examples

anchor → the most similar CNN descriptor
Negative examples: images from different 3D models than the anchor
Hard negatives: closest negative examples to the anchor
Only hard negatives: as good as using all negatives, but faster

  • F. Radenovic, G. Tolias and O. Chum, CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples, In ECCV 2016
slide-40
SLIDE 40

Hard Negative Examples

anchor → the most similar CNN descriptor
Naive hard negatives: top k by CNN
Negative examples: images from different 3D models than the anchor
Hard negatives: closest negative examples to the anchor
Only hard negatives: as good as using all negatives, but faster

increasing CNN descriptor distance to the anchor

  • F. Radenovic, G. Tolias and O. Chum, CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples, In ECCV 2016
slide-41
SLIDE 41

Hard Negative Examples

anchor → the most similar CNN descriptor
Naive hard negatives: top k by CNN
Diverse hard negatives: top k, one per 3D model
Negative examples: images from different 3D models than the anchor
Hard negatives: closest negative examples to the anchor
Only hard negatives: as good as using all negatives, but faster

increasing CNN descriptor distance to the anchor

  • F. Radenovic, G. Tolias and O. Chum, CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples, In ECCV 2016
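The diverse hard-negative selection above can be sketched as follows (toy descriptors and 3D-model labels; the function name and data are illustrative):

```python
import numpy as np

def hard_negatives(anchor_desc, descs, model_ids, anchor_model, k=5):
    # descs: (N, D) L2-normalized CNN descriptors; model_ids: 3D model of each image.
    # Negatives are images from 3D models other than the anchor's; take the k
    # closest in descriptor space, but at most one per 3D model (diversity).
    order = np.argsort(-(descs @ anchor_desc))   # most similar first
    chosen, used = [], {anchor_model}
    for i in order:
        if model_ids[i] in used:
            continue                             # skip anchor's model and repeats
        chosen.append(int(i))
        used.add(model_ids[i])
        if len(chosen) == k:
            break
    return chosen

rng = np.random.default_rng(3)
descs = rng.normal(size=(50, 8))
descs /= np.linalg.norm(descs, axis=1, keepdims=True)
models = np.arange(50) % 10                      # 10 toy "3D models"
negs = hard_negatives(descs[0], descs, models, models[0], k=3)
```

Restricting the selection to one negative per 3D model keeps the hard negatives diverse instead of being several near-duplicate views of the same scene.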
slide-42
SLIDE 42

anchor
Positive examples: images that share 3D points with the anchor
Hard positives: positive examples not close enough to the anchor

Hard Positive Examples

  • F. Radenovic, G. Tolias and O. Chum, CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples, In ECCV 2016
slide-43
SLIDE 43

anchor
Top 1 by CNN (used in NetVLAD)
Positive examples: images that share 3D points with the anchor
Hard positives: positive examples not close enough to the anchor

Hard Positive Examples

  • F. Radenovic, G. Tolias and O. Chum, CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples, In ECCV 2016
slide-44
SLIDE 44

anchor
Top 1 by CNN (used in NetVLAD)
Top 1 by BoW (harder positives)
Positive examples: images that share 3D points with the anchor
Hard positives: positive examples not close enough to the anchor

Hard Positive Examples

  • F. Radenovic, G. Tolias and O. Chum, CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples, In ECCV 2016
slide-45
SLIDE 45

anchor
Top 1 by CNN (used in NetVLAD)
Top 1 by BoW (harder positives)
Random from top k by BoW (harder positives)
Positive examples: images that share 3D points with the anchor
Hard positives: positive examples not close enough to the anchor

Hard Positive Examples

  • F. Radenovic, G. Tolias and O. Chum, CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples, In ECCV 2016
slide-46
SLIDE 46

CNN Siamese Learning

Query → Convolutional layers → Pooling (MAC & L2-norm) → D×1 CNN descriptor

  • F. Radenovic, G. Tolias and O. Chum, CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples, In ECCV 2016
slide-47
SLIDE 47

CNN Siamese Learning

Query → Convolutional layers → Pooling (MAC & L2-norm) → D×1 CNN descriptor
Positive → Convolutional layers → Pooling (MAC & L2-norm) → D×1 CNN descriptor

  • F. Radenovic, G. Tolias and O. Chum, CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples, In ECCV 2016
slide-48
SLIDE 48

CNN Siamese Learning

Query → Convolutional layers → Pooling (MAC & L2-norm) → D×1 CNN descriptor
Positive → Convolutional layers → Pooling (MAC & L2-norm) → D×1 CNN descriptor
Contrastive loss; pair label: 1 – positive, 0 – negative

MATCHING PAIR

  • F. Radenovic, G. Tolias and O. Chum, CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples, In ECCV 2016
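The contrastive loss driving the two Siamese branches can be sketched as follows (the margin value here is illustrative):

```python
import numpy as np

def contrastive_loss(da, db, label, margin=0.7):
    # da, db: L2-normalized descriptors from the two Siamese branches
    # label: 1 for a matching pair, 0 for a non-matching pair
    dist = np.linalg.norm(da - db)
    if label == 1:
        return 0.5 * dist ** 2                   # pull matching pairs together
    return 0.5 * max(0.0, margin - dist) ** 2    # push non-matching pairs beyond margin

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
loss_match_same = contrastive_loss(a, a, label=1)    # identical matching pair
loss_nonmatch_far = contrastive_loss(a, b, label=0)  # already farther than margin
loss_match_far = contrastive_loss(a, b, label=1)     # distant matching pair
```

Matching pairs are penalized by their squared distance; non-matching pairs only contribute while they are still closer than the margin.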
slide-49
SLIDE 49

CNN Siamese Learning

Query → Convolutional layers → Pooling (MAC & L2-norm) → D×1 CNN descriptor
Positive → Convolutional layers → Pooling (MAC & L2-norm) → D×1 CNN descriptor
Contrastive loss; pair label: 1 – positive, 0 – negative

MATCHING PAIR

  • F. Radenovic, G. Tolias and O. Chum, CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples, In ECCV 2016
slide-50
SLIDE 50

CNN Siamese Learning

Query → Convolutional layers → Pooling (MAC & L2-norm) → D×1 CNN descriptor
Non-matching image → Convolutional layers → Pooling (MAC & L2-norm) → D×1 CNN descriptor
Contrastive loss; pair label: 1 – positive, 0 – negative

NON-MATCHING PAIR

  • F. Radenovic, G. Tolias and O. Chum, CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples, In ECCV 2016
slide-51
SLIDE 51

Whitening and dimensionality reduction

1. PCAw – PCA of an independent set of descriptors

[Babenko et al. ICCV’15, Tolias et al. ICLR’16]

2. Lw – We propose to learn whitening using labeled training data and linear discriminant projections

[Mikolajczyk & Matas ICCV’07] …

Pipeline: global max pooling & L2-norm → D×1 CNN descriptor → whitening (end-to-end learning or post-processing) → optional dim. reduction

  • F. Radenovic, G. Tolias and O. Chum, CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples, In ECCV 2016
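Option 1 (PCAw) can be sketched as below; Lw additionally uses labeled matching pairs for the discriminative projection and is not shown here:

```python
import numpy as np

def pca_whitening(X, dim=None, eps=1e-9):
    # X: (N, D) independent set of descriptors. Returns mean and projection P
    # such that (x - mean) @ P has decorrelated, unit-variance components.
    mu = X.mean(axis=0)
    cov = np.cov(X - mu, rowvar=False)
    w, V = np.linalg.eigh(cov)              # eigenvalues in ascending order
    order = np.argsort(w)[::-1]             # largest variance first
    w, V = w[order], V[:, order]
    if dim is not None:                     # optional dimensionality reduction
        w, V = w[:dim], V[:, :dim]
    return mu, V / np.sqrt(w + eps)         # scale each axis to unit variance

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 16)) @ rng.normal(size=(16, 16))  # correlated descriptors
mu, P = pca_whitening(X, dim=8)
Y = (X - mu) @ P
```

After the projection the retained components are decorrelated with unit variance; keeping only the top `dim` components performs the dimensionality reduction.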
slide-52
SLIDE 52

1. PCAw – PCA of an independent set of descriptors

[Babenko et al. ICCV’15, Tolias et al. ICLR’16]

2. Lw – We propose to learn whitening using labeled training data and linear discriminant projections

[Mikolajczyk & Matas ICCV’07]

3. End-to-end learning – performs comparably to or worse than Lw, while slowing down convergence

Pipeline: global max pooling & L2-norm → D×1 CNN descriptor → whitening (end-to-end learning or post-processing) → optional dim. reduction

Whitening and dimensionality reduction

  • F. Radenovic, G. Tolias and O. Chum, CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples, In ECCV 2016
slide-53
SLIDE 53

1. PCAw – PCA of an independent set of descriptors

[Babenko et al. ICCV’15, Tolias et al. ICLR’16]

2. Lw – We propose to learn whitening using labeled training data and linear discriminant projections

[Mikolajczyk & Matas ICCV’07]

3. End-to-end learning – performs comparably to or worse than Lw, while slowing down convergence

Pipeline: global max pooling & L2-norm → D×1 CNN descriptor → whitening (end-to-end learning or post-processing) → optional dim. reduction

Whitening and dimensionality reduction

  • F. Radenovic, G. Tolias and O. Chum, CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples, In ECCV 2016
slide-54
SLIDE 54

Experiments – Learning (AlexNet)

  • Careful choice of positive and negative training images makes a difference

slide-55
SLIDE 55

Experiments – Learning (AlexNet)

  • Careful choice of positive and negative training images makes a difference

Method           Oxford 5k  Paris 6k
Off-the-shelf    44.2       51.6

slide-56
SLIDE 56

Experiments – Learning (AlexNet)

  • Careful choice of positive and negative training images makes a difference

Method                   Oxford 5k  Paris 6k
Off-the-shelf            44.2       51.6
top 1 CNN + top k CNN    56.2       63.1

slide-57
SLIDE 57

Experiments – Learning (AlexNet)

  • Careful choice of positive and negative training images makes a difference

Method                           Oxford 5k  Paris 6k
Off-the-shelf                    44.2       51.6
top 1 CNN + top k CNN            56.2       63.1
top 1 CNN + top 1 / model CNN    56.7       63.9

slide-58
SLIDE 58

Experiments – Learning (AlexNet)

  • Careful choice of positive and negative training images makes a difference

Method                           Oxford 5k  Paris 6k
Off-the-shelf                    44.2       51.6
top 1 CNN + top k CNN            56.2       63.1
top 1 CNN + top 1 / model CNN    56.7       63.9
top 1 BoW + top 1 / model CNN    59.7       67.1

slide-59
SLIDE 59

Experiments – Learning (AlexNet)

  • Careful choice of positive and negative training images makes a difference

Method                                   Oxford 5k  Paris 6k
Off-the-shelf                            44.2       51.6
top 1 CNN + top k CNN                    56.2       63.1
top 1 CNN + top 1 / model CNN            56.7       63.9
top 1 BoW + top 1 / model CNN            59.7       67.1
random(top k BoW) + top 1 / model CNN    60.2       67.5

slide-60
SLIDE 60

Experiments – Learning (AlexNet)

  • Careful choice of positive and negative training images makes a difference

Method                                   Oxford 5k  Paris 6k
Off-the-shelf                            44.2       51.6
top 1 CNN + top k CNN                    56.2       63.1
top 1 CNN + top 1 / model CNN            56.7       63.9
top 1 BoW + top 1 / model CNN            59.7       67.1
random(top k BoW) + top 1 / model CNN    60.2       67.5
  + learned whitening                    62.2       68.9

slide-61
SLIDE 61

Teacher vs. Student

Fine-tuned CNN with re-ranking (R) and query expansion (QE) surpasses its teacher

Method                            Oxf5k  Oxf105k  Par6k  Par106k
BoW(16M) + R + QE                 84.9   79.5     82.4   77.3
Fine-tuned MAC (512D)             79.7   73.9     82.4   74.6
Fine-tuned MAC (512D) + R + QE    85.0   81.8     86.5   78.8

slide-62
SLIDE 62

ANN search with CNN descriptors

  • A. Gordo, J. Almazan, J. Revaud, D. Larlus, End-to-end Learning of Deep Visual Representations for Image Retrieval, In arXiv 2017
slide-63
SLIDE 63

Teacher vs. Student for small objects

Figure: query regions; retrieval results by CNN vs. BoW + geometry

slide-64
SLIDE 64

Manifold search on fine-tuned CNN features

slide-65
SLIDE 65

Manifold search

  • Euclidean distance is not enough under severe visual variations
  • Manifold search via graph-based methods, e.g. diffusion
slide-66
SLIDE 66
  • Normalization of the affinity matrix A: S = D^{-1/2} A D^{-1/2}
  • Vector y defines the query (1 at query items, 0 elsewhere)
  • Iterative solution (PageRank-like): f(t+1) = α S f(t) + (1 − α) y
  • Closed-form solution (typically avoided): f* = (1 − α)(I − α S)^{-1} y

Manifold search with diffusion

  • D. Zhou, J. Weston, A. Gretton, O. Bousquet, and B. Scholkopf. Ranking on data manifolds. In NIPS, 2003
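The diffusion scheme of Zhou et al. can be sketched on a toy graph (two clusters joined by one weak edge; α and the iteration count are illustrative):

```python
import numpy as np

def diffusion(A, y, alpha=0.99, iters=200):
    # A: (N, N) symmetric non-negative affinity matrix (zero diagonal)
    # y: (N,) query indicator vector
    d = A.sum(axis=1)
    S = A / np.sqrt(np.outer(d, d))            # normalization: D^-1/2 A D^-1/2
    f = y.astype(float)
    for _ in range(iters):                     # PageRank-like iteration,
        f = alpha * (S @ f) + (1 - alpha) * y  # converges to (1-a)(I - aS)^-1 y
    return f

# toy graph: two triangles joined by one weak edge; the query is node 0
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1.0
A[2, 3] = A[3, 2] = 0.1
y = np.zeros(6)
y[0] = 1.0
f = diffusion(A, y)
```

Diffusion concentrates the ranking scores on the query's cluster of the graph, ranking all of its members above the other cluster even though only node 0 was queried.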

Query

slide-67
SLIDE 67
  • Normalization of the affinity matrix A: S = D^{-1/2} A D^{-1/2}
  • Vector y defines the query (1 at query items, 0 elsewhere)
  • Iterative solution (PageRank-like): f(t+1) = α S f(t) + (1 − α) y
  • Closed-form solution (typically avoided): f* = (1 − α)(I − α S)^{-1} y

Manifold search with diffusion

  • D. Zhou, J. Weston, A. Gretton, O. Bousquet, and B. Scholkopf. Ranking on data manifolds. In NIPS, 2003

Query

2D toy example with query, database points, and iso-contours for diffusion similarity

slide-68
SLIDE 68

Unseen queries

Define vector y with the nearest neighbors of the query, then perform standard diffusion.

  • A. Iscen, G. Tolias, Y. Avrithis, T. Furon, O. Chum, Efficient Diffusion on Region Manifolds: Recovering Small Objects with Compact CNN Representations, In CVPR 2017

2D toy example with query, NN of the query and database points

Query

slide-69
SLIDE 69

Efficient solution

  • Iterative solution (inefficient)
  • Conjugate gradient to solve linear system
  • A. Iscen, G. Tolias, Y. Avrithis, T. Furon, O. Chum, Efficient Diffusion on Region Manifolds: Recovering Small Objects with Compact CNN Representations, In CVPR 2017
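The conjugate-gradient variant can be sketched with SciPy's sparse solver (SciPy is assumed to be available; the toy graph of two triangles joined by a weak edge is illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix, identity
from scipy.sparse.linalg import cg

def diffusion_cg(A, y, alpha=0.99):
    # Solve (I - alpha*S) f = (1 - alpha) y; the system matrix is sparse and
    # positive definite, so conjugate gradient converges in few iterations.
    d = np.asarray(A.sum(axis=1)).ravel()
    inv_sqrt_d = 1.0 / np.sqrt(d)
    S = csr_matrix(A.multiply(np.outer(inv_sqrt_d, inv_sqrt_d)))  # D^-1/2 A D^-1/2
    L = identity(A.shape[0], format="csr") - alpha * S
    f, info = cg(L, (1 - alpha) * y)          # info == 0 on convergence
    return f

# toy graph: two triangles joined by one weak edge; the query is node 0
Ad = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    Ad[i, j] = Ad[j, i] = 1.0
Ad[2, 3] = Ad[3, 2] = 0.1
y = np.zeros(6)
y[0] = 1.0
f = diffusion_cg(csr_matrix(Ad), y)
```

CG only needs sparse matrix-vector products with I − αS, so the dense inverse (I − αS)^{-1} is never formed.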
slide-70
SLIDE 70

Efficient solution

  • Iterative solution (inefficient)
  • Conjugate gradient to solve linear system
  • A. Iscen, G. Tolias, Y. Avrithis, T. Furon, O. Chum, Efficient Diffusion on Region Manifolds: Recovering Small Objects with Compact CNN Representations, In CVPR 2017

(I − α S)^{-1} is not sparse; the system matrix I − α S is sparse

slide-71
SLIDE 71

Efficient solution

  • Iterative solution (inefficient)
  • Conjugate gradient to solve linear system
  • A. Iscen, G. Tolias, Y. Avrithis, T. Furon, O. Chum, Efficient Diffusion on Region Manifolds: Recovering Small Objects with Compact CNN Representations, In CVPR 2017

(I − α S)^{-1} is not sparse; the system matrix I − α S is sparse. The truncated iterative solution is equivalent to a Jacobi solver.

slide-72
SLIDE 72

Regional diffusion

  • Extract one descriptor per region

Razavian, A.S., Sullivan, J., Maki, A., Carlsson, S.: A baseline for visual instance retrieval with deep convolutional networks. In: arXiv:1412.6574. (2014)

  • Graph with regions as nodes
  • Multiple query regions (sum to construct y)
  • Query with all regions jointly (diffuse once)
  • Obtain image similarity: pooling over regions
  • A. Iscen, G. Tolias, Y. Avrithis, T. Furon, O. Chum, Efficient Diffusion on Region Manifolds: Recovering Small Objects with Compact CNN Representations, In CVPR 2017

Query regions

slide-73
SLIDE 73

Regional diffusion

  • A. Iscen, G. Tolias, Y. Avrithis, T. Furon, O. Chum, Efficient Diffusion on Region Manifolds: Recovering Small Objects with Compact CNN Representations, In CVPR 2017

Query regions

2D toy example with query points, NN of the queries and database points

slide-74
SLIDE 74

Small object retrieval

  • A. Iscen, G. Tolias, Y. Avrithis, T. Furon, O. Chum, Efficient Diffusion on Region Manifolds: Recovering Small Objects with Compact CNN Representations, In CVPR 2017

Precision at the retrieved location, with global and regional diffusion

slide-75
SLIDE 75

Small object retrieval

  • A. Iscen, G. Tolias, Y. Avrithis, T. Furon, O. Chum, Efficient Diffusion on Region Manifolds: Recovering Small Objects with Compact CNN Representations, In CVPR 2017
slide-76
SLIDE 76

Embedding for manifold search

Iterative solution; convergence (closed-form) solution; efficient solution by conjugate gradient; the Laplacian is sparse

  • A. Iscen, Y. Avrithis, G. Tolias, T. Furon, O. Chum, Fast Spectral Ranking for Similarity Search, In arXiv 2017
slide-77
SLIDE 77

Embedding for manifold search

  • A. Iscen, Y. Avrithis, G. Tolias, T. Furon, O. Chum, Fast Spectral Ranking for Similarity Search, In arXiv 2017

Convergence solution The Laplacian is sparse

slide-78
SLIDE 78

Embedding for manifold search

Low-rank decomposition Low-rank decomposition of a matrix that is never created

  • A. Iscen, Y. Avrithis, G. Tolias, T. Furon, O. Chum, Fast Spectral Ranking for Similarity Search, In arXiv 2017

Convergence solution The Laplacian is sparse

slide-79
SLIDE 79

Embedding for manifold search

Low-rank decomposition Low-rank decomposition of a matrix that is never created

  • A. Iscen, Y. Avrithis, G. Tolias, T. Furon, O. Chum, Fast Spectral Ranking for Similarity Search, In arXiv 2017

Approximate & efficient methods

Convergence solution The Laplacian is sparse

slide-80
SLIDE 80

Embedding for manifold search

Low-rank decomposition Low-rank decomposition of a matrix that is never created Very efficient diffusion

  • A. Iscen, Y. Avrithis, G. Tolias, T. Furon, O. Chum, Fast Spectral Ranking for Similarity Search, In arXiv 2017

Approximate & efficient methods

Convergence solution The Laplacian is sparse

slide-81
SLIDE 81

Embedding for manifold search

Low-rank decomposition Low-rank decomposition of a matrix that is never created Very efficient diffusion

  • A. Iscen, Y. Avrithis, G. Tolias, T. Furon, O. Chum, Fast Spectral Ranking for Similarity Search, In arXiv 2017

N × r matrix: N = #nodes, r = rank. Approximate & efficient methods.

Convergence solution The Laplacian is sparse

slide-82
SLIDE 82

Embedding for manifold search

  • A. Iscen, Y. Avrithis, G. Tolias, T. Furon, O. Chum, Fast Spectral Ranking for Similarity Search, In arXiv 2017
slide-83
SLIDE 83

Embedding for manifold search

  • A. Iscen, Y. Avrithis, G. Tolias, T. Furon, O. Chum, Fast Spectral Ranking for Similarity Search, In arXiv 2017

Two orders of magnitude faster than CG solution

slide-84
SLIDE 84

Embedding for manifold search

  • A. Iscen, Y. Avrithis, G. Tolias, T. Furon, O. Chum, Fast Spectral Ranking for Similarity Search, In arXiv 2017
slide-85
SLIDE 85

Panorama to panorama matching for location recognition

based on fine-tuned CNN features

slide-86
SLIDE 86

Panorama to panorama matching

Image to image matching: query image → query descriptor → location recognition by NN search over descriptors for database images

Panorama to panorama matching: query image set from the same location (e.g. from self-driving cars) → query descriptor → NN search over descriptors for database image sets, each set from the same location

  • A. Iscen, G. Tolias, Y. Avrithis, T. Furon, O. Chum, Panorama to panorama matching for location recognition, In ICMR 2017
slide-87
SLIDE 87

Explicit panorama construction

  • Auto-stitch images of the same location
  • Use NetVLAD on the stitched (panoramic) image
  • Index one NetVLAD descriptor per location
  • A. Iscen, G. Tolias, Y. Avrithis, T. Furon, O. Chum, Panorama to panorama matching for location recognition, In ICMR 2017
slide-88
SLIDE 88

Implicit panorama construction

  • Use NetVLAD for single image descriptor
  • Joint representation of image set by pooling in the descriptor space
  • Index one vector (joint representation) per location
  • A. Iscen, G. Tolias, Y. Avrithis, T. Furon, O. Chum, Panorama to panorama matching for location recognition, In ICMR 2017
slide-89
SLIDE 89

Implicit panorama construction

  • A. Iscen, G. Tolias, Y. Avrithis, T. Furon, O. Chum, Panorama to panorama matching for location recognition, In ICMR 2017

[Iscen et al. Transactions on Big Data, 2017]

slide-90
SLIDE 90

Sparse panorama

  • Implicit way is directly applicable
  • An additional intermediate solution is proposed
  • A. Iscen, G. Tolias, Y. Avrithis, T. Furon, O. Chum, Panorama to panorama matching for location recognition, In ICMR 2017
slide-91
SLIDE 91

Pan2Pan matching for location recognition

Location recognition on the Pittsburgh dataset with 24 views (full panorama) per location
Location recognition on the Pittsburgh dataset with sparse views (sparse panorama) per query location

slide-92
SLIDE 92

Pan2Pan matching for location recognition

Location recognition on the Pittsburgh dataset with 24 views (full panorama) per location
Location recognition on the Pittsburgh dataset with sparse views (sparse panorama) per query location

standard image-to-image retrieval

  • our contribution
slide-93
SLIDE 93

Summary

  • Good performance with pre-trained features
  • Compact representation
  • Fine-tuning significantly improves performance
  • Ways to automatically collect training data
  • Fine-tuning alone is not enough
  • Manifold search
  • Benefit from a representation adapted to the task
slide-94
SLIDE 94

Collaborators

Ahmet Iscen Filip Radenović Ronan Sicre Yannis Avrithis Hervé Jégou Teddy Furon Ondřej Chum

slide-95
SLIDE 95

Online code and data

  • R-MAC and localization (ICLR 2016)
  • Matlab package http://cmp.felk.cvut.cz/~toliageo/soft.html
  • Siamese training code and training data (ECCV 2016)
  • Matlab package using MatConvNet http://cmp.felk.cvut.cz/cnnimageretrieval/
  • Region manifold search (CVPR 2017)
  • Matlab package https://github.com/ahmetius/diffusion-retrieval
  • Manifold embedding (arXiv 2017)
  • Code coming soon…
slide-96
SLIDE 96

References

  • R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In CVPR, 2016.
  • H. Azizpour, A. Razavian, J. Sullivan, A. Maki, S. Carlsson. From generic to specific deep representations for visual recognition. In: CVPRW, 2015.
  • A. Babenko and V. Lempitsky. Aggregating deep convolutional features for image retrieval. In ICCV, 2015.
  • A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky. Neural codes for image retrieval. In ECCV, 2014.
  • J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, T. Darrell: DeCAF: A deep convolutional activation feature for generic visual recognition. In: arXiv:1310.1531
  • Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In ECCV, 2014
  • A. Gordo, J. Almazan, J. Revaud, and D. Larlus. Deep image retrieval: Learning global representations for image search. In ECCV, 2016
  • A. Gordo, J. Almazan, J. Revaud, D. Larlus, End-to-end Learning of Deep Visual Representations for Image Retrieval, In arXiv 2017
  • A. Iscen, Y. Avrithis, G. Tolias, T. Furon, O. Chum, Fast Spectral Ranking for Similarity Search, In arXiv 2017
  • A. Iscen, G. Tolias, Y. Avrithis, T. Furon, O. Chum, Efficient Diffusion on Region Manifolds: Recovering Small Objects with Compact CNN Representations, In CVPR 2017
  • A. Iscen, G. Tolias, Y. Avrithis, T. Furon, O. Chum, Panorama to panorama matching for location recognition, In ICMR 2017
  • A. Iscen, T. Furon, V. Gripon, M. Rabbat, and H. Jegou, Memory vectors for similarity search in high-dimensional spaces, IEEE Transactions on Big Data, 2017
  • Y. Kalantidis, C. Mellina, and S. Osindero. Cross-dimensional weighting for aggregated deep convolutional features. In arXiv:1512.04065, 2015.
  • P. Kulkarni , J. Zepeda , F. Jurie , P. Perez and L. Chevallier, Hybrid multi-layer deep cnn/aggregator feature for image classification, In ICASSP 2015
  • E. Mohedano, A. Salvador, K. McGuinness , X. Giró-i-Nieto, N. O'Connor, F. Marqués. Bags of Local Convolutional Features for Scalable Instance Search. In ICMR 2016
  • A. Mikulik, M. Perdoch, O. Chum, J. Matas: Learning vocabularies over a fine quantization. IJCV (2013)
  • A. Mousavian and J. Kosecka, Deep Convolutional Features for Image Based Retrieval and Scene Categorization, In arXiv, 2015
  • F. Radenovic, G. Tolias, and O. Chum. CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples. In ECCV, 2016.
  • F. Radenovic, J. Schonberger, D. Ji, J.M. Frahm, O. Chum, J. Matas: From dusk till dawn: Modeling in the dark. In: CVPR, 2016.
  • A. Razavian, J. Sullivan, A. Maki, and S. Carlsson. A baseline for visual instance retrieval with deep convolutional networks. In arXiv:1412.6574, 2014.
  • A. Razavian, H. Azizpour, J. Sullivan, S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In: CVPRW, 2014.
  • J. Schonberger, F. Radenovic, O. Chum, J.M. Frahm. From single image query to detailed 3D reconstruction. In: CVPR. 2015
  • G. Tolias, R. Sicre, and H. Jegou. Particular object retrieval with integral max-pooling of cnn activations. In ICLR, 2016.
  • G. Tolias and H. Jégou, Visual query expansion with or without geometry: refining local descriptors by feature aggregation, In Pattern Recognition, 2014.