Place recognition with instance search
from hand-crafted to learning-based methods
Giorgos Tolias
Tutorial on “Large-Scale Visual Place Recognition and Image-Based Localization” Tolias Sattler Brachmann ICCV 2019, Seoul
Outline
representation of database images
query image query representation nearest neighbor search
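The retrieval pipeline above (query image, query representation, nearest-neighbor search over database representations) can be sketched as cosine similarity over L2-normalized global descriptors. A minimal sketch with toy data; `nearest_neighbors` and the random descriptors are illustrative, not from the tutorial:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Unit L2 norm, so a dot product becomes cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def nearest_neighbors(query_desc, db_descs, k=5):
    # query_desc: (D,) query representation; db_descs: (N, D) database representations.
    sims = l2_normalize(db_descs) @ l2_normalize(query_desc)
    order = np.argsort(-sims)[:k]                     # k most similar database images
    return order, sims[order]

rng = np.random.default_rng(0)
db = rng.standard_normal((100, 128))                  # toy database of global descriptors
query = db[42] + 0.05 * rng.standard_normal(128)      # near-duplicate of database image 42
idx, sims = nearest_neighbors(query, db, k=3)         # the near-duplicate should rank first
```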
GPS-based ground truth
image from Tokyo 24/7
GPS-based ground truth IM2GPS [Hays & Efros, CVPR’08]
images from [Hays & Efros]
[Radenovic et al., CVPR’18] Manually constructed ground truth
http://cmp.felk.cvut.cz/revisitop/
Google Landmarks Dataset Crowd-sourced ground truth
https://github.com/cvdfoundation/google-landmark
4.1m images 200k landmarks
762k images (1⁄3 decrease) 101k landmarks
118k images, about 1% depict landmarks
Embedding function: Image similarity: Local similarity function:
Linear matching of local descriptors
Linear matching of local descriptors global descriptor voting approach
local descriptor set → process & aggregate → global descriptor → post-processing (whitening) → global descriptor
SIFT [ICCV’99] SURF [ECCV’06] MSER [BMVC’03] Hessian-Affine [IJCV’04], …
SIFT [ICCV’99] SURF [ECCV’06], …
[Sivic, ICCV’03] [Csurka, ECCVW’04]
not all pairs contribute
global descriptor voting approach i-th element
Aggregate residual vectors, instead of counts
[Jegou et al., CVPR’10]
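Aggregating residual vectors to the nearest centroid, instead of counts, is the core of VLAD [Jegou et al., CVPR’10]. A minimal numpy sketch, assuming a codebook is given (in practice it is learned with k-means); the toy descriptors and codebook below are illustrative:

```python
import numpy as np

def vlad(descs, centroids):
    # descs: (N, D) local descriptors; centroids: (K, D) visual-word codebook.
    K, D = centroids.shape
    diff = descs[:, None, :] - centroids[None, :, :]
    assign = (diff ** 2).sum(-1).argmin(1)        # nearest visual word per descriptor
    v = np.zeros((K, D))
    for i, a in enumerate(assign):
        v[a] += descs[i] - centroids[a]           # aggregate residual vectors, not counts
    v = v.reshape(-1)
    n = np.linalg.norm(v)
    return v / n if n > 0 else v                  # final L2 normalization

rng = np.random.default_rng(0)
descs = rng.standard_normal((50, 16))             # toy local descriptors
centroids = rng.standard_normal((8, 16))          # toy codebook (k-means in practice)
g = vlad(descs, centroids)                        # 8 * 16 = 128-d global descriptor
```

The global descriptor dimensionality is fixed at K·D regardless of how many local features the image has.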
[Zhao et al., BMVC’13]
Aggregating dense features fixed dimensionality of global descriptor DenseVLAD [Torii et al, CVPR’15] [Tolias et al., ECCV’14]
global descriptor voting approach
1: 2: … i: … K: (one posting list per visual word)
Codebook size in the order of 10^6 [Philbin et al., CVPR’07]
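The posting lists above form an inverted file: each visual word points to the images containing it, so only images sharing words with the query are scored. A toy sketch in Python; `add_image`, `query`, and plain vote counts are illustrative (real systems weight votes, e.g. with tf-idf, and store quantized descriptor information):

```python
from collections import defaultdict

# Toy inverted file: one posting list per visual word (word id -> image ids).
index = defaultdict(list)

def add_image(image_id, word_ids):
    for w in set(word_ids):                   # store each (word, image) pair once
        index[w].append(image_id)

def query(word_ids):
    # Score database images by the number of shared visual words (a voting scheme).
    votes = defaultdict(int)
    for w in set(word_ids):
        for image_id in index[w]:
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: -kv[1])

add_image("a", [1, 5, 7])
add_image("b", [5, 9])
ranked = query([5, 7])                        # image "a" shares two words, "b" one
```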
Non-linear local similarity Normalized residuals
[Tolias et al. IJCV’16]
Normalized aggregated residuals
[Tolias et al. IJCV’16]
(or binarized aggregated residuals for efficiency)
[Tolias et al. IJCV’16] same color = same visual word typical codebook size: 65k
tentative correspondences; inliers with fast spatial verification [Philbin et al., CVPR’07]
embedding & aggregation
fully convolutional network
descriptor:
[Mohedano et al. ICMR’16]
soft-assignment
Codebook centroids are trainable parameters
[Arandjelovic et al., CVPR’16]
discriminative power of CNN activations
[Babenko & Lempitsky, ICCV’15]
α: weight based on L2 norm of local descriptors
β: inverted-document-frequency weight
example of α [Kalantidis et al., ECCV’16]
Input image conv5 filter 1 conv5 filter 2 …. conv5 filter i …. conv5 filter K
[Razavian et al., MTA’16] [Tolias et al., ICLR’16]
maximum activation
regions for top matching components, different color per component (pair 1, pair 2, pair 3) [Razavian et al., MTA’16] [Tolias et al., ICLR’16]
GeM pooling: q → ∞ gives max pool (MAC), q = 1 gives avg pool (SPoC) [Radenovic et al., PAMI’19]
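The pooling family above can be sketched as generalized-mean (GeM) pooling over the spatial positions of each conv filter's activation map; `gem_pool`, the toy activation map, and the q values below are illustrative:

```python
import numpy as np

def gem_pool(fmap, q=3.0, eps=1e-6):
    # fmap: (K, H, W) post-ReLU activations, one channel per conv filter.
    # Generalized mean over spatial positions: q = 1 is average pooling (SPoC),
    # large q approaches max pooling (MAC).
    x = np.clip(fmap, eps, None)              # assumes non-negative activations
    return (x ** q).mean(axis=(1, 2)) ** (1.0 / q)

fmap = np.random.default_rng(0).random((256, 7, 7))   # toy activation map
spoc = gem_pool(fmap, q=1.0)       # equals the spatial average
mac_like = gem_pool(fmap, q=100.0) # close to the spatial maximum
```

In the fine-tuned networks q is a trainable parameter, learned jointly with the conv layers.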
MAC descriptor
[Tolias et al., ICLR’16]
Precision@10 on R-Oxford+1M distractors: SPoC, CroW, MAC, GeM, R-MAC with pre-trained vs fine-tuned ResNet101 (chart)
Fine-tuning improvement for GeM: +26.6%
Descriptor Image similarity [Kim et al., CVPR’17]
image from Kim et al.
NetVLAD top-1 accuracy, San Francisco: learned attention 83.2 vs CroW attention 80.1
NetVLAD top-1 accuracy, Tokyo 24/7: learned attention 75.2 vs no attention 71.8
Descriptor Image similarity [Noh et al., ICCV’17]
image from Noh et al.
[Noh et al., ICCV’17]
Training time:
Test time:
image from Noh et al.
mAP on Oxford5k: global descriptor 77.9, voting + SP 83.8
(not in the paper, thanks to authors for sharing)
Classification approach: global descriptor, FC layer, soft-max, cross-entropy loss. Discrete labels are difficult to obtain at instance level.
anchor negative positive
negative: far enough; positive: as close as possible; margin: large enough
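The anchor/positive/negative constraints above are what a triplet ranking loss encodes: pull the positive as close as possible, push the negative until it is far enough beyond a margin. A minimal sketch; the function, margin value, and toy 2-d points are illustrative:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    # Zero loss once the negative is at least `margin` farther from the
    # anchor than the positive; otherwise penalize the violation.
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

a = np.zeros(2)
p = np.array([0.1, 0.0])                  # close positive
easy_n = np.array([2.0, 0.0])             # already far enough: zero loss, no gradient
hard_n = np.array([0.2, 0.0])             # violates the margin: positive loss
```

Easy negatives give zero loss, which is why hard-example mining (next slides) matters.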
The larger the batch, the better; no need to sample [Revaud et al., ICCV’19]
anchor candidate negatives
anchor candidate positives camera orientation (unknown)
Descriptor distance to resolve: pick the closest [Arandjelovic et al., CVPR’16]
7.4M images 713 training 3D models [Schonberger et al. CVPR’15] [Radenovic et al. CVPR’16]
anchor; the most similar CNN descriptor
naive hard negatives: top k by CNN
diverse hard negatives: top k, one per 3D model
Negative examples: images from different 3D models than the query
Hard negatives: closest negative examples to the query
increasing CNN descriptor distance to the query
[Radenovic et al. PAMI’19]
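The "top k, one per 3D model" rule above can be sketched as a greedy pass over negatives sorted by descriptor distance. A sketch under the assumption that per-image 3D-model ids and descriptors are available; `diverse_hard_negatives` and the toy data are illustrative:

```python
import numpy as np

def diverse_hard_negatives(anchor, negatives, model_ids, k=5):
    # negatives: (N, D) descriptors of images from 3D models other than the anchor's.
    # Take the closest negatives in descriptor space, but at most one per 3D model,
    # so the selected hard negatives are diverse rather than near-duplicates.
    dists = np.linalg.norm(negatives - anchor, axis=1)
    chosen, used = [], set()
    for i in np.argsort(dists):               # increasing distance to the anchor
        if model_ids[i] not in used:
            chosen.append(int(i))
            used.add(model_ids[i])
        if len(chosen) == k:
            break
    return chosen

anchor = np.zeros(2)
negatives = np.array([[0.1, 0], [0.2, 0], [0.3, 0], [0.4, 0]])
model_ids = ["m1", "m1", "m2", "m3"]          # first two come from the same model
picked = diverse_hard_negatives(anchor, negatives, model_ids, k=2)
```

The second-closest negative is skipped because it shares a 3D model with the closest one.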
anchor; top 1 by CNN; top 1 by inliers; random from top k by inliers (harder positives)
Positive examples: images that share 3D points with the query
Hard positives: positive examples not close enough to the query
[Radenovic et al. PAMI’19]
[Radenovic et al. PAMI’19]
                                            Oxford 5k   Paris 6k
Off-the-shelf                                    44.2       51.6
top 1 CNN + top k CNN                            56.2       63.1
top 1 CNN + top 1 / model CNN                    56.7       63.9
top 1 inliers + top 1 / model CNN                59.7       67.1
random(top k inliers) + top 1 / model CNN        60.2       67.5
[Gordo et al. IJCV’18]
Use classical computer vision to collect training data: Bag-of-Words and spatial verification
classification loss vs ranking loss
Adaptive partitioning into k = 26,263 cells; N-way classification training [Weyand et al., ICCV’17]. Very compact model (377 MB)! But is it better than instance search?
[Vo et al., CVPR’17]
Evaluation at different scales, IM2GPS dataset
Fine scale: street (1 km), city (25 km)
Coarse scale: region (250 km), country (750 km), continent (7500 km)
Combining global GeM with local DELF-ASMK
GeM 23.3, ASMK-DELF+SP 21.1, Combined 29.3 (chart)
DELF with ASMK*+SP
GeM fine-tuned with SfM, D = 512
R-Oxford+1M distractors [Radenovic et al., CVPR’18]
fine-tuned R-MAC ResNet101, uncompressed (8192 bytes): 82.8 [Gordo et al. IJCV’18]; compression with Product Quantization (PQ) [Jegou et al., PAMI’10]
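PQ [Jegou et al., PAMI’10] splits a descriptor into m sub-vectors and quantizes each against its own small codebook, storing one byte per sub-vector. A toy sketch with random codebooks (in the actual method each codebook is learned with k-means); all names and sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D, m, ksub = 128, 8, 256                   # 128-d descriptor -> 8 sub-vectors, 256 centroids each
dsub = D // m
codebooks = rng.standard_normal((m, ksub, dsub))  # toy codebooks (k-means in practice)

def pq_encode(x):
    # Encode each sub-vector by the id of its nearest centroid: m bytes per descriptor.
    codes = np.empty(m, dtype=np.uint8)
    for j in range(m):
        sub = x[j * dsub:(j + 1) * dsub]
        codes[j] = ((codebooks[j] - sub) ** 2).sum(1).argmin()
    return codes

def pq_decode(codes):
    # Reconstruct an approximation of the descriptor from its codes.
    return np.concatenate([codebooks[j][codes[j]] for j in range(m)])

x = rng.standard_normal(D)
codes = pq_encode(x)                       # 8 bytes instead of 512 (float32)
x_hat = pq_decode(codes)                   # lossy reconstruction
```

Search over PQ codes uses precomputed sub-distance tables, so distances are approximated without decoding the database.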
local descriptor set → process & aggregate → global descriptor → post-processing (whitening) → global descriptor
pre-trained vs learned end-to-end
PCA whitening of the global descriptor: subtract the mean vector, then project onto the eigen-vectors (arranged as columns)
jointly down-weight co-occurring features
[Perronnin et al., CVPR’10]
[Jegou & Chum, ECCV’12]
[Mukundan et al., IJCV’19]
Supervised whitening [Mikolajczyk & Matas, ICCV’07]: estimate covariance from matching and non-matching pairs
Obtain the whitening from the covariance of matching pairs
Obtain the rotation by PCA on the whitened covariance of non-matching pairs
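For reference, the unsupervised baseline ("PCA whitening" in the table below) can be sketched in a few lines of numpy: center the descriptors, then rotate and scale so the training covariance becomes identity. This is not the supervised variant, which would estimate the covariance from matching pairs; names and data are illustrative:

```python
import numpy as np

def fit_pca_whitening(X, eps=1e-9):
    # X: (N, D) global descriptors. Returns mean mu and matrix P such that
    # (x - mu) @ P has (approximately) identity covariance on the training set.
    mu = X.mean(0)
    cov = np.cov(X - mu, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)      # eigen-vectors as columns
    P = eigvec / np.sqrt(eigval + eps)        # scale each direction by 1/sqrt(variance)
    return mu, P

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 8)) @ rng.standard_normal((8, 8))  # correlated toy data
mu, P = fit_pca_whitening(X)
Xw = (X - mu) @ P                             # whitened descriptors
```

At test time, whitening a query descriptor is a single matrix product, which is why it can also be absorbed into an FC layer and learned end-to-end.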
mAP on R-Oxford
pre-trained ResNet101:
MAC: no whitening 31.6, PCA+power-law 40.1, PCA whitening 41.7, PCA wh.+shrink 43.5, supervised whitening 46.9
SPoC: no whitening 24.5, supervised whitening 42.7
GeM: no whitening 31.6, supervised whitening 50.1
fine-tuned ResNet101:
GeM: no whitening 52.9, supervised whitening 64.1
some pooling operations benefit more than others
additional insight about whitening in [Mukundan et al. IJCV’19]
local descriptor set → process & aggregate → global descriptor → FC (whitening) → global descriptor, learned end-to-end
https://github.com/filipradenovic/cnnimageretrieval-pytorch
Instance search for place recognition: better than classification-based approaches
Classification loss can handle large intra-class variability; ranking loss needs selection of clean & challenging pairs
Global descriptors vs voting approaches: speed/compactness vs better accuracy
Surprising effectiveness of descriptor whitening
Tolias, Radenovic, Chum: Targeted mismatch adversarial attacks. ICCV 2019
Access to place recognition / retrieval systems; sharing of private content
Original queries Concealed queries
image ('plane') + perturbation = no longer classified as 'plane': non-targeted misclassification
image ('plane') + perturbation = 'kangaroo': targeted misclassification
image + perturbation = dissimilar descriptors: non-targeted mismatch
carrier + perturbation = concealed query with descriptors similar to the target (the initial query): targeted mismatch
Known: FCN, Pooling & Normalization (PN), Re-sampling Unknown: Post-processing
Known: FCN, Re-sampling Unknown: Pooling & Normalization (PN), Post-processing
Known: FCN. Unknown: re-sampling, Pooling & Normalization (PN), post-processing. Multi-scale attack; key ingredient: blurring before re-sampling.
Descriptor similarity to the target (target vs carrier: 0.782)
GeM, λ=0: 1.000
tens, λ=0: 0.999
tens, λ=1: 0.997
hist, λ=0: 1.000
multi-scale attack with dense resolution sampling: similarity to target (chart)
Online code: https://github.com/gtolias/tma