Feature-based Place Recognition
Akihiko Torii Tokyo Tech
CVPR 2017 tutorial on Large-Scale Visual Place Recognition and Image-Based Localization Alex Kendall, Torsten Sattler, Giorgos Tolias, Akihiko Torii
https://www.google.co.jp/maps/ @35.6066354,139.6861582,3a,45.2y,256.68h,96.58t
Photo community sites (Flickr, Instagram, …) + Never stop growing
Street-view images (Google Street View, Mapillary, …) + Accurate, covering almost all streets
Generate perspective cutouts [Gronat11, Chen11, Torii13]
Figures from: [Chen-CVPR11]
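A minimal sketch of how a perspective cutout can be generated from a street-view panorama, assuming the panorama is stored in equirectangular projection. The function name `cutout_to_panorama_coords`, the axis conventions (x right, y down, z forward) and the parameter names are assumptions made for illustration, not taken from [Gronat11, Chen11, Torii13].

```python
import numpy as np

def cutout_to_panorama_coords(pano_w, pano_h, out_w, out_h,
                              fov_deg, yaw_deg, pitch_deg):
    """Map each cutout pixel to (u, v) coordinates in an equirectangular panorama."""
    # Focal length in pixels of the virtual pinhole camera.
    f = 0.5 * out_w / np.tan(0.5 * np.radians(fov_deg))
    xs = np.arange(out_w) - 0.5 * (out_w - 1)
    ys = np.arange(out_h) - 0.5 * (out_h - 1)
    x, y = np.meshgrid(xs, ys)
    # Viewing rays in camera coordinates (x right, y down, z forward).
    rays = np.stack([x, y, np.full_like(x, f)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)

    # Rotate the rays by pitch (about x) and then yaw (about y).
    p, t = np.radians(pitch_deg), np.radians(yaw_deg)
    rot_x = np.array([[1, 0, 0],
                      [0, np.cos(p), -np.sin(p)],
                      [0, np.sin(p), np.cos(p)]])
    rot_y = np.array([[np.cos(t), 0, np.sin(t)],
                      [0, 1, 0],
                      [-np.sin(t), 0, np.cos(t)]])
    d = rays @ (rot_y @ rot_x).T

    # Convert ray directions to longitude/latitude, then to panorama pixels.
    lon = np.arctan2(d[..., 0], d[..., 2])
    lat = np.arcsin(np.clip(d[..., 1], -1.0, 1.0))
    u = (lon / (2 * np.pi) + 0.5) * pano_w
    v = (lat / np.pi + 0.5) * pano_h
    return u, v

# A 5x5 cutout looking straight ahead, and one rotated by 90 degrees of yaw.
u0, v0 = cutout_to_panorama_coords(2048, 1024, 5, 5, 90.0, 0.0, 0.0)
u90, v90 = cutout_to_panorama_coords(2048, 1024, 5, 5, 90.0, 90.0, 0.0)
```

The returned coordinates would then be used to sample the panorama (for example with bilinear interpolation) to obtain the cutout image.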
Image representation space
Transfer GPS
Geotagged image database
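The three labels above describe retrieval-based localization: the query is placed in the image representation space, its nearest database images are found, and their GPS tags are transferred. A minimal sketch, assuming cosine similarity on L2-normalized descriptors; `transfer_gps` and the toy values are hypothetical.

```python
import numpy as np

def transfer_gps(query_desc, db_descs, db_gps, top_k=1):
    """Return GPS tags and similarities of the top_k nearest database images."""
    # L2-normalize so that the dot product equals the cosine similarity.
    q = query_desc / np.linalg.norm(query_desc)
    db = db_descs / np.linalg.norm(db_descs, axis=1, keepdims=True)
    sims = db @ q
    order = np.argsort(-sims)[:top_k]
    return db_gps[order], sims[order]

# Toy geotagged database: three 4-D descriptors with made-up (lat, lon) tags.
db_descs = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 1.0]])
db_gps = np.array([[35.66, 139.65], [35.70, 139.60], [35.50, 139.70]])
query = np.array([0.1, 0.9, 0.0, 0.1])

gps, sims = transfer_gps(query, db_descs, db_gps)
```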
Pipeline: Image I → extract local features → aggregate (2 1 1 …) → f(I)
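One common instance of this pipeline is a bag-of-visual-words histogram in the spirit of [Sivic03]. The sketch below assumes the local descriptors are already extracted and a small vocabulary is given; `bovw_histogram` and the toy 2-D descriptors are illustrative assumptions.

```python
import numpy as np

def bovw_histogram(local_descs, vocabulary):
    """Count how many local descriptors fall on each visual word (hard assignment)."""
    # Squared Euclidean distances between N descriptors and K visual words.
    d2 = ((local_descs[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    return np.bincount(words, minlength=len(vocabulary)).astype(float)

vocabulary = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
local_descs = np.array([[0.1, 0.0], [0.0, 0.2], [0.9, 0.1], [0.1, 0.8]])
hist = bovw_histogram(local_descs, vocabulary)  # f(I) for this toy image
```

In practice the histogram is usually reweighted (for example with tf-idf) and normalized before comparison; that step is left out here.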
[Sivic03]
[Jégou10b]
(e.g. 1.6M)
+ Can be a sparse histogram (using a large vocab.)
+ Can provide matches
(e.g. 256K x 128K)
+ Performs well with a small vocab.
+ Can be compressed by PCA with a small loss in performance
+ No extra memory requirement to encode more features
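The second list of properties reads like a description of VLAD [Jégou10b]. As a hedged illustration of why the descriptor size does not grow with the number of features, here is a minimal VLAD aggregation with hard assignment; the function name `vlad` and the toy data are assumptions.

```python
import numpy as np

def vlad(local_descs, centers):
    """Aggregate N local descriptors into a single (K * D)-dimensional vector."""
    K, D = centers.shape
    d2 = ((local_descs[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)
    v = np.zeros((K, D))
    for k in range(K):
        members = local_descs[assign == k]
        if len(members):
            # Sum of residuals between descriptors and their cluster center.
            v[k] = (members - centers[k]).sum(axis=0)
    v = v.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

centers = np.array([[0.0, 0.0], [1.0, 1.0]])
local_descs = np.array([[0.5, 0.0], [0.0, 0.5], [1.0, 2.0]])
v = vlad(local_descs, centers)
```

The output always has K * D entries, whether the image contributes three descriptors or thousands.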
Pipeline: Image I → extract local features (DoG+SIFT) → aggregate → f(I)
Pipeline: Image I → extract local features (DSIFT, PHOW) → aggregate → f(I)
See [Lazebnik06, Bosch07, Iscen15, Torii15]
NetVLAD layer on top of a convolutional neural network:
Image → CNN layers → W x H x D map interpreted as N x D local descriptors x → NetVLAD layer (pooling layer) → (K x D) x 1 VLAD vector V
NetVLAD layer: conv (w, b) 1 x 1 x D x K → soft-max (soft-assignment s) → VLAD core (c) → intra-normalization → L2 normalization
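The stages named in the diagram can be sketched as a plain NumPy forward pass. This is only an illustration of the listed steps under assumed shapes (x is N x D, w is D x K, b has K entries, c is K x D), not the trained layer; `netvlad_forward` is a hypothetical name.

```python
import numpy as np

def netvlad_forward(x, w, b, c):
    """x: (N, D) local descriptors; w: (D, K), b: (K,) 1x1 conv; c: (K, D) centers."""
    # A 1x1 convolution over N locations is a matrix product per descriptor.
    logits = x @ w + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    s = np.exp(logits)
    s /= s.sum(axis=1, keepdims=True)                    # soft-max: soft-assignment (N, K)
    # VLAD core: V[k] = sum_i s[i, k] * (x[i] - c[k]).
    V = s.T @ x - s.sum(axis=0)[:, None] * c
    # Intra-normalization: L2-normalize each cluster's residual separately.
    V = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)
    # Flatten to a (K * D) vector and apply the final L2 normalization.
    v = V.ravel()
    return v / (np.linalg.norm(v) + 1e-12)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))   # N = 6 descriptors, D = 4
w = rng.normal(size=(4, 3))   # K = 3 clusters
b = np.zeros(3)
c = rng.normal(size=(3, 4))
v = netvlad_forward(x, w, b, c)
```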
See also Disloc+geometric burstiness [Sattler16] and selective match kernel [Tolias16]
positive image many negative data points
200m
[Schindler07] – What are informative features?
[Zamir10] – Ratio test with location constraint.
[Knopp10, Gronat13, Cao13, Sattler16, …]
Find the most similar images that are spatially far
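A minimal sketch of this search, assuming unit-normalized global descriptors and image positions in metres. Treating the 200 m shown on the slide as the "spatially far" radius is an assumption, as are the name `confusing_images` and the toy data.

```python
import numpy as np

def confusing_images(query_idx, descs, positions_m, min_dist_m=200.0, top_k=2):
    """Indices of the most similar images located farther than min_dist_m."""
    sims = descs @ descs[query_idx]
    dists = np.linalg.norm(positions_m - positions_m[query_idx], axis=1)
    far = np.flatnonzero(dists > min_dist_m)
    # Rank only the far images by decreasing similarity.
    return far[np.argsort(-sims[far])[:top_k]]

descs = np.array([[1.0, 0.0], [0.8, 0.6], [0.6, 0.8], [0.0, 1.0]])
positions_m = np.array([[0.0, 0.0], [50.0, 0.0], [300.0, 0.0], [1000.0, 0.0]])
confusers = confusing_images(0, descs, positions_m)
```

Image 1 is the most similar to the query but lies only 50 m away, so it is not returned as a confuser.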
Matches with confused images, confusion score, confusing regions
200m
Objective function: where h is squared hinge loss.
Similar to Exemplar SVM by [Malisiewicz11]
See also: [Cao13]
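The objective itself is not reproduced in this text. For orientation only, an exemplar-SVM-style objective in the form used by [Malisiewicz11], with the loss replaced by a squared hinge, can be written as follows. The symbols w, b, x^+ (the positive image), N (the set of negative data points) and the weights C_1, C_2 are assumptions of this sketch and may differ from the slide.

```latex
\Omega(w, b) = \|w\|^2
  + C_1 \, h\!\left(w^\top x^{+} + b\right)
  + C_2 \sum_{x \in \mathcal{N}} h\!\left(-w^\top x - b\right),
\qquad h(t) = \max(0,\, 1 - t)^2
```

The single positive term and the sum over many negatives correspond to the "positive image, many negative data points" setting shown earlier.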
– geographically far from the query
Query
Figures from [Maddern14] and [Neubert15]
Inlier ratios (vp = viewpoint):

                     Large vp change               Small vp change
                     (query vs. street-view)       (query vs. synthesized view)
Sparse SIFT (DoG)    0.05 (53/1149)                0.12 (122/984)
Dense SIFT           0.31 (1135/3708)              0.76 (5410/7138)
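The ratios follow directly from the counts in parentheses, read here as inlier matches over all tentative matches (that reading is an assumption; the slide only gives the numbers). A short check:

```python
# (inliers, tentative matches) as listed on the slide.
counts = {
    ("sparse", "large vp change"): (53, 1149),
    ("dense", "large vp change"): (1135, 3708),
    ("sparse", "small vp change"): (122, 984),
    ("dense", "small vp change"): (5410, 7138),
}

# Inlier ratio = inliers / tentative matches, rounded to two decimals.
ratios = {key: round(inl / total, 2) for key, (inl, total) in counts.items()}
```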
Street-view panorama Associated depth-map Individual scene planes Examples of synthesized views
Virtual view Street-view Query
We seek to find one or more images depicting the same place!
Datasets                      Database images   Database 3D points   #Query   Ground truth
Pittsburgh [Torii13]          254K              -                    24K      GPS
Tokyo [Torii15]               76K (374K)        -                    1,125    GPS
Oxford                        5K (+100K)        -                    55       Label
Paris                         6.4K              -                    55       Label
San Francisco PCI [Chen11]    1.06M             -                    803      Building ID (label)
San Francisco SF-0 [Li12]     610K              30M                  803      Building ID (label)
San Francisco SF-1 [Li12]     790K              75M                  803      Building ID (label)
Datasets                          Database images   Database 3D points   #Query             Ground truth
Arts Quad [Li10]                  6.5K              2M                   348                Differential GPS
Aachen [Sattler12]                3K                1.5M                 369 (10.6K seq.)   Camera pose
Baidu-IBL [Sun17]                 682               67M                  2296               LiDAR reg.
Landmarks [Li12]                  205K              38M                  10K                SfM
Dubrovnik [Li12]                  6K                1.9M                 800                SfM
Cambridge Landmarks [Kendall15]   6.8K              RGBD                 4K                 Camera poses (GPS)
7 scenes [Shotton13]              26K               RGBD                 17K                GPS
Number of query images for which at least one of the top N retrieved database images matches the ground-truth building ID
Number of query images for which at least one of the top N retrieved database images has the ground-truth label
(35.66,139.65) (35.66,139.64) (35.70,139.60) (35.50,139.70)
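This measure is commonly reported as recall at N. A minimal sketch, assuming each query comes with a ranked list of retrieved labels and one ground-truth label; `recall_at_n` and the toy labels are hypothetical.

```python
def recall_at_n(retrieved_labels, gt_labels, n):
    """Fraction of queries whose ground-truth label appears in the top n results."""
    hits = sum(1 for ranked, gt in zip(retrieved_labels, gt_labels)
               if gt in ranked[:n])
    return hits / len(gt_labels)

# Three queries, each with a ranked list of retrieved database labels.
retrieved = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
ground_truth = ["a", "f", "z"]

r1 = recall_at_n(retrieved, ground_truth, 1)
r3 = recall_at_n(retrieved, ground_truth, 3)
```

Multiplying the fraction by the number of queries gives the count described above.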
Number of query images localized within the threshold (x-axis)
Figure from [Zamir10]
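When ground truth is given as GPS coordinates, this count needs a distance between the estimated and the true position. The sketch below assumes the great-circle (haversine) distance on a spherical Earth; the function names and the example coordinates are illustrative assumptions.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6371000.0):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))

def num_localized(pred, gt, threshold_m):
    """Number of queries whose predicted position is within threshold_m of the truth."""
    return sum(haversine_m(*p, *g) <= threshold_m for p, g in zip(pred, gt))

pred = [(35.66, 139.65), (35.66, 139.64)]
gt = [(35.66, 139.65), (35.66, 139.65)]
d = haversine_m(*pred[1], *gt[1])        # roughly 900 m at this latitude
n_100 = num_localized(pred, gt, 100.0)
n_1000 = num_localized(pred, gt, 1000.0)
```

Sweeping the threshold produces the curve with the threshold on the x-axis.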
Large-scale 3D model? Geotagged images? (query image)
Query image
Retrieved geo-tagged images
Local 3D model
Geo-Registration
Local SfM
http://www.ok.sc.e.titech.ac.jp/~torii/project/vlocalization/
Results | Reference Poses | Benchmark Protocol
Query
Most relevant DB image
Manual annotation:
Pose estimation:
Plots: correctly localized queries [%] vs. distance threshold [meters] (5 to 30)
Plot 1: Disloc (NN), Disloc (SR), Disloc (SR-SfM)
Plot 2: DenseVLAD (NN), DenseVLAD (SR), DenseVLAD (SR-SfM), NetVLAD (NN), NetVLAD (SR), NetVLAD (SR-SfM)
Plot 3: DenseVLAD (SR-SfM), Disloc (SR-SfM), Hyperpoints (3D), CPV w/ GPS (3D), CPV w/o GPS (3D)
reference pose, distances measured in UTM coordinates in 2D (height undefined)
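A minimal sketch of this measurement, assuming estimated and reference positions are given as UTM (easting, northing, height) rows in metres and that the height column is simply ignored. The name `percent_localized` and the toy positions are assumptions; the thresholds mirror the 5 to 30 m range of the plots.

```python
import numpy as np

def percent_localized(est_utm, ref_utm, thresholds_m):
    """Percentage of queries whose 2D UTM error is within each threshold."""
    # Keep only easting and northing; height is undefined for the reference poses.
    est = np.asarray(est_utm, dtype=float)[:, :2]
    ref = np.asarray(ref_utm, dtype=float)[:, :2]
    err = np.linalg.norm(est - ref, axis=1)
    return [float(100.0 * np.mean(err <= t)) for t in thresholds_m]

est = [[0.0, 0.0, 10.0], [6.0, 8.0, 0.0], [30.0, 40.0, 5.0]]
ref = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
curve = percent_localized(est, ref, [5, 10, 15, 20, 25, 30])
```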
Many CNN-based representations are built on top of them, e.g. [Tolias16, Arandjelovic16, Radenovic16, Kim17]
Iscen17, Wijmans17]
[Arandjelovic12] R. Arandjelovic and A. Zisserman. Three things everyone should know to improve object retrieval. In Proc. CVPR 2012.
[Arandjelovic14] R. Arandjelovic and A. Zisserman. DisLocation: Scalable descriptor distinctiveness for location …
[Bosch07] A. Bosch, A. Zisserman, and X. Munoz. Image classification using random forests and ferns. In Proc. ICCV 2007.
[Cao13] S. Cao and N. Snavely. Graph-based discriminative learning for location recognition. In Proc. CVPR 2013.
[Chen11] D. M. Chen, G. Baatz, K. Koeser, S. S. Tsai, R. Vedantham, T. Pylvanainen, K. Roimela, X. Chen, J. Bach, … 2011.
[Chum11] O. Chum, A. Mikulik, M. Perďoch, and J. Matas. Total recall II: Query expansion revisited. In Proc. CVPR 2011.
[Cummins08] M. Cummins and P. Newman. FAB-MAP: Probabilistic localization and mapping in the space of …
[Gronat13] P. Gronat, G. Obozinski, J. Sivic, and T. Pajdla. Learning and calibrating per-location classifiers for visual place recognition. In Proc. CVPR 2013.
[Hays08] J. Hays and A. Efros. im2gps: estimating geographic information from a single image. In Proc. CVPR 2008.
[Iscen17] A. Iscen, G. Tolias, Y. S. Avrithis, T. Furon, and O. Chum. Panorama to panorama matching for location …
[Iscen15] A. Iscen, G. Tolias, P. H. Gosselin, and H. Jegou. A comparison of dense region detectors for image search and fine-grained classification. IEEE Transactions on Image Processing 24(8):2369--2381, 2015.
[Jegou09] H. Jegou, M. Douze, and C. Schmid. On the burstiness of visual elements. In Proc. CVPR 2009.
[Jegou10] H. Jegou, M. Douze, and C. Schmid. Improving bag-of-features for large scale image search. IJCV 87(3):316--336, 2010.
[Jegou10b] H. Jegou, M. Douze, C. Schmid, and P. Perez. Aggregating local descriptors into a compact image …
[Kim17] H. Jin Kim, E. Dunn, and J.-M. Frahm. Learned contextual feature reweighting for image geo-localization. In Proc. CVPR 2017.
[Johns14] E. Johns and G.-Z. Yang. Generative methods for long-term place recognition in dynamic scenes. International Journal of Computer Vision 106(3):297--314, 2014.
[Knopp10] J. Knopp, J. Sivic, and T. Pajdla. Avoiding confusing features in place recognition. In Proc. ECCV 2010.
[Lazebnik06] S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In Proc. CVPR 2006.
[Maddern14] W. Maddern, A. Stewart, C. McManus, B. Upcroft, W. Churchill, and P. Newman. Illumination invariant imaging: Applications in robust vision-based localisation, mapping and classification for autonomous vehicles. In Proceedings of the Visual Place Recognition in Changing Environments Workshop, IEEE International Conference …
[Neubert15] P. Neubert, N. Sunderhauf, and P. Protzel. Superpixel-based appearance change prediction for long-term navigation across seasons. Robotics and Autonomous Systems 69:15--27, 2015.
[Philbin08] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving particular …
[Sattler15] … Radenovic, K. Schindler, and M. Pollefeys. Hyperpoints and fine vocabularies for large-scale location recognition. In Proc. ICCV 2015.
[Sattler16] T. Sattler, M. Havlena, K. Schindler, and M. Pollefeys. Large-scale location recognition and the geometric burstiness problem. In Proc. CVPR 2016.
[Sattler17] T. Sattler, A. Torii, J. Sivic, M. Pollefeys, H. Taira, M. Okutomi, and T. Pajdla. Are Large-Scale 3D Models Really Necessary for Accurate Visual Localization? In Proc. CVPR 2017.
[Sivic03] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Proc. ICCV 2003.
[Sunderhauf-RSS15] N. Sunderhauf, S. Shirazi, A. Jacobson, F. Dayoub, E. Pepperell, B. Upcroft, and M. Milford. Place recognition with convnet landmarks: Viewpoint-robust, condition-robust, training-free. In Robotics: Science and Systems 2015.
[Tolias14] G. Tolias and H. Jegou. Visual query expansion with or without geometry: refining local descriptors by feature aggregation. Pattern Recognition 2014.
[Torii15] A. Torii, R. Arandjelovic, J. Sivic, M. Okutomi, and T. Pajdla. 24/7 place recognition by view synthesis. In …
[Torii13] A. Torii, J. Sivic, T. Pajdla, and M. Okutomi. Visual place recognition with repetitive structures. In Proc. CVPR 2013.
[Zamir10] A. Zamir and M. Shah. Accurate image localization based on google maps street view. In Proc. ECCV 2010.
[Zamir16] A. R. Zamir, A. Hakeem, L. Van Gool, M. Shah, and R. Szeliski. Large-scale visual geo-localization. Springer, 2016.
[Zheng16] L. Zheng, Y. Yang, and Q. Tian. SIFT meets CNN: A decade survey of instance retrieval. CoRR abs/1608.01807, 2016.