Overview Overview Local invariant features (C. Schmid) Matching - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Overview

  • Local invariant features (C. Schmid)
  • Matching and recognition with local features (J. Sivic)
  • Efficient visual search (J. Sivic)
  • Very large scale search (C. Schmid)
  • Practical session
slide-2
SLIDE 2

Image search system for large datasets

Large image dataset (one million images or more): query → image search system → ranked image list

  • Issues for very large databases
  • to reduce the query time
  • to reduce the storage requirements
  • with minimal loss in retrieval accuracy
slide-3
SLIDE 3

Large scale object/scene recognition

Image dataset (> 1 million images): query → image search system → ranked image list

  • Each image described by approximately 2000 descriptors
  • 2 × 10^9 descriptors to index for one million images!
  • Database representation in RAM:
  • Size of descriptors: 1 TB, search+memory intractable
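
The 1 TB figure can be checked with a back-of-envelope calculation. The sketch below assumes 128-dimensional SIFT descriptors stored with 4 bytes per component; neither value is stated on the slide, so both are labeled assumptions.

```python
# Back-of-envelope storage estimate for the raw local descriptors.
# Assumptions (not on the slide): 128-dim SIFT, 4 bytes per component.
n_images = 1_000_000
descriptors_per_image = 2000
dim = 128                 # assumed descriptor dimension
bytes_per_component = 4   # assumed storage per component

n_descriptors = n_images * descriptors_per_image
total_bytes = n_descriptors * dim * bytes_per_component

print(f"{n_descriptors:.1e} descriptors to index")
print(f"{total_bytes / 1e12:.2f} TB of raw descriptors")
```

Under these assumptions the total is about 1 TB, which is why the raw descriptors cannot simply be kept in RAM.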

slide-4
SLIDE 4

Bag-of-words [Sivic & Zisserman’03]

[Nister & al 04, Chum & al 07]

Query image → Hessian-Affine regions + SIFT descriptors [Mikolajczyk & Schmid 04] [Lowe 04] → set of SIFT descriptors → bag-of-features processing + tf-idf weighting (centroids = visual words) → sparse frequency vector → querying the inverted file → ranked image short-list → geometric verification → re-ranked list [Lowe 04, Chum & al 2007]

  • Visual words
  • 1 word (index) per local descriptor
  • only image ids in inverted file → 8 GB for a million images, fits in RAM
  • Problem: matching approximation
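
The inverted file idea can be illustrated with a small sketch. It is a simplified illustration, not the system described on the slides: the function names are hypothetical, scores are plain tf-idf dot products without vector normalization, and the visual-word indices are toy values. As a side note, 2 × 10^9 entries at an assumed 4 bytes each would be consistent with the 8 GB quoted above.

```python
import math
from collections import Counter, defaultdict

def build_inverted_file(db_words):
    """db_words: dict image_id -> list of visual-word indices (one per descriptor)."""
    inverted = defaultdict(list)  # visual word -> [(image_id, term frequency)]
    for image_id, words in db_words.items():
        for word, tf in Counter(words).items():
            inverted[word].append((image_id, tf))
    n_images = len(db_words)
    # idf: words that occur in few images receive a larger weight
    idf = {w: math.log(n_images / len(postings)) for w, postings in inverted.items()}
    return inverted, idf

def query(inverted, idf, query_words):
    """Score only the images that share at least one visual word with the query."""
    scores = defaultdict(float)
    for word, q_tf in Counter(query_words).items():
        for image_id, tf in inverted.get(word, []):
            scores[image_id] += (q_tf * idf[word]) * (tf * idf[word])
    return sorted(scores.items(), key=lambda kv: -kv[1])

db = {"a": [1, 2, 2, 7], "b": [3, 4, 5], "c": [1, 3, 9]}
inverted, idf = build_inverted_file(db)
ranked = query(inverted, idf, [2, 7, 1])
print(ranked)
```

Only the posting lists of the query's visual words are visited, which is what keeps the query time low on a large database.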

slide-5
SLIDE 5

Approximate nearest neighbour (ANN) evaluation of bag-of-features

ANN algorithms return a list of potential neighbors

Accuracy: NN recall = probability that the NN is in this list

Ambiguity removal = proportion of vectors in the short-list

In BOF, this trade-off is managed by the number of clusters k

[Figure: NN recall versus rate of points retrieved (log scale, 1e-07 to 0.1) for BOW, one point per vocabulary size k = 100, 200, 500, 1000, 2000, 5000, 10000, 20000, 30000, 50000]

slide-6
SLIDE 6

20K visual words: false matches

slide-7
SLIDE 7

200K visual words: good matches missed

slide-8
SLIDE 8

Problem with bag-of-features

  • The intrinsic matching scheme performed by BOF is weak
  • for a “small” visual dictionary: too many false matches
  • for a “large” visual dictionary: many true matches are missed
  • No good trade-off between “small” and “large”!
  • either the Voronoi cells are too big
  • or these cells can’t absorb the descriptor noise

→ the intrinsic approximate nearest neighbor search of BOF is not sufficient

  • Possible solutions
  • Soft assignment [Philbin et al. CVPR’08]
  • Additional short codes [Jegou et al. ECCV’08]
slide-9
SLIDE 9

Hamming Embedding

  • Representation of a descriptor x
  • Vector-quantized to q(x) as in standard BOF
  • + short binary vector b(x) for an additional localization in the Voronoi cell
  • Two descriptors x and y match iff q(x) = q(y) and h(b(x), b(y)) ≤ ht, where h(a,b) is the Hamming distance and ht is a fixed threshold
  • Nearest neighbors for Hamming distance ≈ the ones for Euclidean distance
  • Efficiency
  • Hamming distance = very few operations
  • Fewer random memory accesses: 3× faster than BOF with same dictionary size!
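
The matching rule can be sketched in a few lines. This is a minimal illustration that assumes binary signatures are stored as Python integers; the function names and the default threshold of 24 are hypothetical choices, not values prescribed by the slides.

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two signatures stored as integers."""
    return bin(a ^ b).count("1")  # XOR, then count the differing bits

def he_match(qx: int, bx: int, qy: int, by: int, ht: int = 24) -> bool:
    """Two descriptors match if they fall in the same Voronoi cell (same visual
    word) and their binary signatures differ in at most ht bits."""
    return qx == qy and hamming(bx, by) <= ht

print(hamming(0b1011, 0b0010))
print(he_match(qx=5, bx=0b0000, qy=5, by=0b0111, ht=3))
```

The test costs one XOR and one bit count per candidate, which is the sense in which the Hamming distance needs "very few operations".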
slide-10
SLIDE 10

Hamming Embedding

  • Off-line (given a quantizer)
  • draw an orthogonal projection matrix P of size db × d
  → this defines db random projection directions
  • for each Voronoi cell and projection direction, compute the median value from a learning set
  • On-line: compute the binary signature b(x) of a given descriptor
  • project x onto the projection directions as z(x) = (z1,…,zdb)
  • bi(x) = 1 if zi(x) is above the learned median value, otherwise 0

[H. Jegou et al., Improving bag of features for large scale image search, IJCV’10]
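
The off-line and on-line steps can be sketched for a single Voronoi cell as follows. This is an illustration under stated assumptions: the orthogonal directions are obtained by Gram-Schmidt on Gaussian vectors (one possible way to draw them, requiring db ≤ d), the learning descriptors are random toy data, and all function names are hypothetical.

```python
import random

def random_orthogonal_rows(db, d, rng):
    """Draw db orthonormal directions in R^d (Gram-Schmidt on Gaussian vectors)."""
    rows = []
    while len(rows) < db:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for r in rows:  # remove the components along the directions already kept
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-9:
            rows.append([a / norm for a in v])
    return rows

def project(P, x):
    """z(x) = P x, one value per projection direction."""
    return [sum(p * xi for p, xi in zip(row, x)) for row in P]

def learn_medians(P, cell_descriptors):
    """Off-line: median of each projected component over one cell's learning set."""
    projected = [project(P, x) for x in cell_descriptors]
    medians = []
    for i in range(len(P)):
        vals = sorted(z[i] for z in projected)
        n = len(vals)
        mid = vals[n // 2] if n % 2 else 0.5 * (vals[n // 2 - 1] + vals[n // 2])
        medians.append(mid)
    return medians

def signature(P, medians, x):
    """On-line: bit i is 1 if the i-th projection is above the learned median."""
    return [1 if zi > mi else 0 for zi, mi in zip(project(P, x), medians)]

rng = random.Random(0)
P = random_orthogonal_rows(db=4, d=8, rng=rng)
learning_set = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(100)]
medians = learn_medians(P, learning_set)
print(signature(P, medians, learning_set[0]))
```

Thresholding at the median means each bit splits the learning descriptors of the cell into two halves, so every bit carries information.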

slide-11
SLIDE 11

Hamming and Euclidean neighborhood

  • trade-off between memory usage and accuracy
  → more bits yield higher accuracy
  • In practice, 64 bits (8 bytes)

[Figure: rate of 5-NN retrieved (recall) versus rate of cell points retrieved, one curve each for 8, 16, 32, 64 and 128 bits]

slide-12
SLIDE 12

ANN evaluation of Hamming Embedding

compared to BOW: at least 10 times fewer points in the short-list for the same level of accuracy

Hamming Embedding provides a much better trade-off between recall and ambiguity removal

[Figure: NN recall versus rate of points retrieved (log scale, 1e-08 to 0.1) for BOW and HE+BOW; vocabulary sizes k = 100 to 50000; Hamming thresholds ht = 16, 18, 20, 22, 24, 28, 32]

slide-13
SLIDE 13

Matching points - 20k word vocabulary

201 matches 240 matches Many matches with the non-corresponding image!

slide-14
SLIDE 14

Matching points - 200k word vocabulary

69 matches 35 matches Still many matches with the non-corresponding one

slide-15
SLIDE 15

Matching points - 20k word vocabulary + HE

83 matches 8 matches 10x more matches with the corresponding image!

slide-16
SLIDE 16

Bag-of-features [Sivic&Zisserman’03]

Query image → Harris-Hessian-Laplace regions + SIFT descriptors → set of SIFT descriptors → bag-of-features processing + tf-idf weighting (centroids = visual words) → sparse frequency vector → querying the inverted file → ranked image short-list → geometric verification → re-ranked list [Chum & al. 2007]

slide-17
SLIDE 17

Geometric verification

Use the position and shape of the underlying features to improve retrieval quality. Both images have many matches: which is correct?

slide-18
SLIDE 18

Geometric verification

We can measure spatial consistency between the query and each result to improve retrieval quality. Many spatially consistent matches: correct result. Few spatially consistent matches: incorrect result.

slide-19
SLIDE 19

Geometric verification

Gives localization of the object

slide-20
SLIDE 20

Re-ranking based on geometric verification

  • works very well
  • but performed on a short-list only (100 - 1000 images)

→ for very large datasets, the number of distracting images is so high that relevant images are not even short-listed!
→ weak geometry

[Figure: rate of relevant images short-listed versus dataset size (1000 to 1000000), for short-list sizes of 20, 100 and 1000 images]

slide-21
SLIDE 21

Weak geometry consistency

  • Weak geometric information used for all images (not only the short-list)
  • Each invariant interest region detection has a scale and rotation angle associated, here characteristic scale and dominant gradient orientation

[Figure: example image pair with scale change 2 and rotation angle of ca. 20 degrees]

  • Each matching pair results in a scale and angle difference
  • For the global image, scale and rotation changes are roughly consistent
slide-22
SLIDE 22

WGC: orientation consistency

Max = rotation angle between images

slide-23
SLIDE 23

WGC: scale consistency

slide-24
SLIDE 24

Weak geometry consistency

  • Integration of the geometric verification into the BOF
  • votes for an image in two quantized subspaces, i.e. for angle & scale
  • these subspaces are shown to be roughly independent
  • final score: filtering for each parameter (angle and scale)
  • Only matches that agree with the main difference of orientation and scale are taken into account in the final score
  • Re-ranking using the full geometric transformation still adds information in a final stage
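
The voting scheme can be sketched as two histograms, one over quantized angle differences and one over quantized log-scale differences. This is a simplified illustration: votes are plain counts rather than weighted match scores, the bin counts are arbitrary, and combining the two histogram peaks with a minimum is an assumption made here to express "filtering for each parameter". The function name is hypothetical.

```python
import math
from collections import Counter

def wgc_score(matches, n_angle_bins=8, n_scale_bins=8):
    """matches: list of (angle_q, scale_q, angle_db, scale_db), one tuple per
    descriptor match between the query and one database image."""
    if not matches:
        return 0
    angle_votes, scale_votes = Counter(), Counter()
    for angle_q, scale_q, angle_db, scale_db in matches:
        # quantized orientation difference in [0, 2*pi)
        d_angle = (angle_db - angle_q) % (2 * math.pi)
        a_bin = int(d_angle / (2 * math.pi) * n_angle_bins) % n_angle_bins
        # quantized log-scale difference, one bin per octave, clamped to the range
        d_logscale = math.log2(scale_db / scale_q)
        s_bin = int(math.floor(d_logscale + n_scale_bins / 2))
        s_bin = max(0, min(n_scale_bins - 1, s_bin))
        angle_votes[a_bin] += 1
        scale_votes[s_bin] += 1
    # keep only the support of the dominant angle and scale differences
    return min(max(angle_votes.values()), max(scale_votes.values()))

consistent = [(0.1, 1.0, 0.45, 2.0), (1.0, 1.5, 1.35, 3.0), (3.0, 2.0, 3.35, 4.0)]
outliers = [(0.0, 1.0, 3.0, 1.0), (0.0, 4.0, 5.0, 1.0)]
print(wgc_score(consistent + outliers))
print(wgc_score(outliers))
```

In the example, the three matches that share a rotation of about 20 degrees and a scale change of 2 dominate both histograms, while the two outliers do not add to the score.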

slide-25
SLIDE 25

Experimental results

  • Evaluation for the INRIA Holidays dataset, 1491 images
  • 500 query images + 991 annotated true positives
  • Most images are holiday photos of friends and family
  • 1 million & 10 million distractor images from Flickr
  • Vocabulary construction on a different Flickr set
  • Almost real-time search speed
  • Evaluation metric: mean average precision (in [0,1], bigger = better)
  • Average over precision/recall curve
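
As an illustration of the metric, the sketch below computes average precision as the mean of the precision values at the rank of each relevant image, then averages over queries. This is one common definition; the official evaluation script for the dataset may integrate the precision/recall curve slightly differently, so treat it as an assumption. The image ids are toy values.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision measured at the rank of each relevant image."""
    relevant = set(relevant_ids)
    hits = 0
    precision_sum = 0.0
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # relevant images that were never returned contribute zero precision
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(results):
    """results: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in results) / len(results)

queries = [(["a", "x", "b"], ["a", "b"]), (["x", "a"], ["a"])]
print(mean_average_precision(queries))
```

A perfect ranking gives 1 and every distractor ranked above a true positive lowers the score, matching "in [0,1], bigger = better".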
slide-26
SLIDE 26

Holiday dataset – example queries

slide-27
SLIDE 27

Dataset : Venice Channel

Query Base 2 Base 1 Base 4 Base 3

slide-28
SLIDE 28

Dataset : San Marco square

Query Base 1 Base 3 Base 2 Base 4 Base 5 Base 7 Base 6 Base 9 Base 8

slide-29
SLIDE 29

Example distractors - Flickr

slide-30
SLIDE 30

Experimental evaluation

  • Evaluation on our Holidays dataset, 500 query images, 1 million distractor images
  • Metric: mean average precision (in [0,1], bigger = better)

[Figure: mAP versus database size (1000 to 1000000) for baseline, WGC, HE, WGC+HE and +re-ranking]

Average query time (4 CPU cores)

Compute descriptors     880 ms
Quantization            600 ms
Search – baseline       620 ms
Search – WGC           2110 ms
Search – HE             200 ms
Search – HE+WGC         650 ms
slide-31
SLIDE 31

Results – Venice Channel

Base 1 Flickr Query Flickr Base 4 Query

slide-32
SLIDE 32

Demo at http://bigimbaz.inrialpes.fr

slide-33
SLIDE 33

Towards large-scale image search

  • BOF+inverted file can handle up to ~10 million images
  • with a limited number of descriptors per image → RAM: 40 GB
  • search: 2 seconds
  • Web-scale = billions of images
  • with 100 M per machine → search: 20 seconds, RAM: 400 GB
  • not tractable
  • Solution: represent each image by one compressed vector
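
The figures for 100 M images per machine follow from the 10 million image figures if both RAM and search time are assumed to grow linearly with the number of images. The sketch below makes that assumption explicit; the function name is hypothetical.

```python
# Linear extrapolation of the BOF + inverted file figures quoted above.
# Assumption: RAM and search time both scale linearly with the database size.
base_images = 10_000_000
ram_gb_at_base = 40.0
search_s_at_base = 2.0

def extrapolate(n_images):
    factor = n_images / base_images
    return ram_gb_at_base * factor, search_s_at_base * factor

ram_gb, search_s = extrapolate(100_000_000)
bytes_per_image = ram_gb_at_base * 1e9 / base_images

print(f"100 M images: {ram_gb:.0f} GB RAM, {search_s:.0f} s per query")
print(f"about {bytes_per_image:.0f} bytes per image in the inverted file")
```

At roughly 4 KB per image the index is far too large for web-scale collections, which motivates a representation of a few tens of bytes per image.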
slide-34
SLIDE 34

Very large scale image search

Query image → Hessian-Affine regions + SIFT descriptors [Mikolajczyk & Schmid 04] [Lowe 04] → set of SIFT descriptors → bag-of-features processing + tf-idf weighting (centroids = visual words) → description vector → vector compression → vector search → ranked image short-list → geometric verification → re-ranked list [Lowe 04, Chum & al 2007]

  • Each image is represented by one vector (Bag-of-features, VLAD, Fisher, GIST)
  • Vector compression to reduce storage requirements and search time

slide-35
SLIDE 35

Related work on very large scale image search

Min-hash and geometrical min-hash [Chum et al. 07-09]

Compressing the BoF representation (miniBof) [Jegou et al. 09] → requires hundreds of bytes to obtain a “reasonable quality”

GIST descriptors with Spectral Hashing [Weiss et al.’08] → very limited invariance to scale/rotation/crop

slide-36
SLIDE 36

Global scene context – GIST descriptor + spectral hashing

The “GIST” of a scene: Oliva & Torralba (2001)

5 frequency bands and 6 orientations for each image location

Tiling of the image (windowing)

~ 900 dimensions

Spectral hashing produces binary codes, similar to spectral clustering

slide-37
SLIDE 37

Related work on very large scale image search

Min-hash and geometrical min-hash [Chum et al. 07-09]

Compressing the BoF representation (miniBof) [Jegou et al. 09] → requires hundreds of bytes to obtain a “reasonable quality”

GIST descriptors with Spectral Hashing [Weiss et al.’08] → very limited invariance to scale/rotation/crop

Efficient object category recognition using classemes [Torresani et al.’10]

Aggregating local descriptors into a compact image representation [Jegou&al.’10,‘12]

slide-38
SLIDE 38

Aggregating local descriptors into a compact image representation

Aim: improving the tradeoff between

search speed

memory usage

search quality

Approach: joint optimization of three stages

local descriptor aggregation

dimension reduction

indexing algorithm

[Diagram: image representation (VLAD) → PCA + PQ codes → (non-)exhaustive search, labeled with the three stages: local desc. aggregation, dimension reduction, indexing algorithm]

slide-39
SLIDE 39

Aggregation of local descriptors

Problem: represent an image by a single fixed-size vector: set of n local descriptors → 1 vector

Most popular idea: BoF representation [Sivic & Zisserman 03]

sparse vector

highly dimensional → significant dimensionality reduction introduces loss

Alternative: VLAD descriptor [VLAD = vector of locally aggregated descriptors]

non-sparse vectors

excellent results with low-dimensional vectors

slide-40
SLIDE 40

VLAD : vector of locally aggregated descriptors

Learning: a vector quantizer (k-means)

  • output: k centroids (visual words): c1,…,ci,…,ck
  • centroid ci has dimension d

For a given image

  • assign each descriptor to closest center ci
  • accumulate (sum) descriptors per cell: vi := vi + (x - ci)

VLAD (dimension D = k x d)

The vector is L2-normalized
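
The aggregation step can be sketched directly from the update rule vi := vi + (x - ci). The centroids below are toy values written by hand (in practice they would come from k-means), and the function name is hypothetical.

```python
def vlad(descriptors, centroids):
    """Aggregate local descriptors into one L2-normalized vector of size k * d."""
    k, d = len(centroids), len(centroids[0])
    v = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        # index of the closest centroid (squared Euclidean distance)
        i = min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centroids[j])))
        # accumulate the residual: vi := vi + (x - ci)
        for t in range(d):
            v[i][t] += x[t] - centroids[i][t]
    flat = [c for block in v for c in block]  # concatenate v1 ... vk
    norm = sum(c * c for c in flat) ** 0.5
    return [c / norm for c in flat] if norm > 0 else flat

centroids = [[0.0, 0.0], [10.0, 10.0]]           # toy codebook, k = 2, d = 2
descriptors = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]]
print(vlad(descriptors, centroids))              # a vector of size k * d = 4
```

Unlike a BoF histogram, which only counts descriptors per cell, each block stores the summed offsets from the centroid, so the vector stays informative with a small k.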

slide-41
SLIDE 41

VLADs for corresponding images

v1 v2 v3 ...

SIFT-like representation per centroid (+ components: blue, - components: red)

slide-42
SLIDE 42

VLAD performance and dimensionality reduction

We compare VLAD descriptors with BoF: INRIA Holidays Dataset (mAP, %)

Dimension is reduced from D to D’ dimensions with PCA

Aggregator   k         D         D’=D (no reduction)   D’=128   D’=64
BoF          1,000     1,000     41.4                  44.4     43.4
BoF          20,000    20,000    44.6                  45.2     44.5
BoF          200,000   200,000   54.9                  43.2     41.6
VLAD         16        2,048     49.6                  49.5     49.4
VLAD         64        8,192     52.6                  51.0     47.7
VLAD         256       32,768    57.5                  50.8     47.6

Observations:

VLAD better than BoF for a given descriptor size

Choose a small D if output dimension D’ is small

slide-43
SLIDE 43

Product quantization for nearest neighbor search

Vector split into m subvectors: y → [y1 | … | ym]

Subvectors are quantized separately by quantizers q(y) = [q1(y1) | … | qm(ym)], where each qi is learned by k-means with a limited number of centroids

Example: y = 128-dim vector split in 8 subvectors of dimension 16

each subvector is quantized with 256 centroids -> 8 bits

very large codebook 256^8 ~ 1.8x10^19

[Diagram: y1 … y8 (16 components each) → quantizers q1 … q8 (256 centroids each) → q1(y1) … q8(y8) (8 bits each)]

⇒ 8 subvectors x 8 bits = 64-bit quantization index
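
Encoding with a product quantizer can be sketched as follows. The example uses a deliberately tiny configuration (m = 2 subvectors, 4 centroids each, written by hand) instead of the 8 x 256 setting of the slide, and the codebooks would normally be learned by k-means; the function names are hypothetical.

```python
def pq_encode(y, codebooks):
    """Quantize each subvector of y separately; return one centroid index per subvector."""
    m = len(codebooks)
    sub_dim = len(y) // m
    code = []
    for j, centroids in enumerate(codebooks):
        sub = y[j * sub_dim:(j + 1) * sub_dim]
        idx = min(range(len(centroids)),
                  key=lambda c: sum((a - b) ** 2 for a, b in zip(sub, centroids[c])))
        code.append(idx)
    return code

def pack_code(code, bits_per_index):
    """Concatenate the indices into a single integer quantization index."""
    packed = 0
    for idx in code:
        packed = (packed << bits_per_index) | idx
    return packed

codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],  # toy quantizer q1
    [[0.0, 0.0], [0.0, 5.0], [5.0, 0.0], [5.0, 5.0]],  # toy quantizer q2
]
code = pq_encode([0.9, 1.2, 4.8, 0.1], codebooks)
print(code, pack_code(code, bits_per_index=2))
print(256 ** 8)  # size of the implicit codebook in the slide's 8 x 8-bit example
```

The implicit codebook is the Cartesian product of the small codebooks, which is how 8 x 256 stored centroids can address 256^8 distinct reproduction values.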

slide-44
SLIDE 44

Optimizing the dimension reduction and quantization together

VLAD vectors undergo two approximations

mean square error from PCA projection

mean square error from quantization

Given k and bytes/image, choose D’ minimizing their sum

Results on Holidays dataset:

  • there exists an optimal D’
  • 16 bytes: best results for k=64
  • 320 bytes: best results for k=256

ADC = asymmetric distance computation
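
Asymmetric distance computation is usually understood as follows: the query is left unquantized, the database vectors are stored as product quantizer codes, and the distance is approximated by summing per-subvector table entries. The sketch below illustrates this with toy hand-written codebooks and hypothetical function names; it is not the authors' implementation.

```python
def adc_tables(query, codebooks):
    """Squared distances from each query subvector to every centroid of its quantizer."""
    m = len(codebooks)
    sub_dim = len(query) // m
    tables = []
    for j, centroids in enumerate(codebooks):
        sub = query[j * sub_dim:(j + 1) * sub_dim]
        tables.append([sum((a - b) ** 2 for a, b in zip(sub, c)) for c in centroids])
    return tables

def adc_distance(tables, code):
    """Approximate squared distance to a database code: one table look-up per subvector."""
    return sum(table[idx] for table, idx in zip(tables, code))

def adc_search(query, codebooks, db_codes, topk=1):
    """Exhaustive search over the codes, ranked by approximate distance."""
    tables = adc_tables(query, codebooks)
    order = sorted(range(len(db_codes)), key=lambda i: adc_distance(tables, db_codes[i]))
    return order[:topk]

codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],  # toy quantizer q1
    [[0.0, 0.0], [0.0, 5.0], [5.0, 0.0], [5.0, 5.0]],  # toy quantizer q2
]
db_codes = [[0, 0], [1, 2], [3, 3]]                    # three encoded database vectors
print(adc_search([1.0, 1.0, 5.0, 0.0], codebooks, db_codes))
```

The tables are computed once per query, so comparing against each database vector costs only m look-ups and additions.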

slide-45
SLIDE 45

Joint optimization of VLAD and dimension reduction-indexing

For VLAD

The larger k, the better the raw search performance

But a large k produces large vectors, which are harder to index

Optimization of the vocabulary size

Fixed output size (in bytes)

D’ computed from k via the joint optimization of reduction/indexing → end-to-end parameter optimization

slide-46
SLIDE 46

Results on the Holidays dataset with various quantization parameters

slide-47
SLIDE 47

Results on standard datasets

Datasets

University of Kentucky benchmark (UKB), score: nb relevant images, max: 4

INRIA Holidays dataset, score: mAP (%)

Method                     bytes   UKB    Holidays
BoF, k=20,000              10K     2.92   44.6
BoF, k=200,000             12K     3.06   54.9
miniBOF                    20      2.07   25.5
miniBOF                    160     2.72   40.3
VLAD k=16, ADC 16 x 8      16      2.88   46.0
VLAD k=64, ADC 32 x 10     40      3.10   49.5

D’=64 for k=16 and D’=96 for k=64

ADC (subvectors) x (bits to encode each subvector)

miniBOF: “Packing Bag-of-Features”, ICCV’09

slide-48
SLIDE 48

Large scale experiments (10 million images)

Exhaustive search of VLADs, D’=64: 4.77s

With the product quantizer

Exhaustive search with ADC: 0.29s

Non-exhaustive search with IVFADC: 0.014s

IVFADC: combination with an inverted file

slide-49
SLIDE 49

Large scale experiments (10 million images)

[Figure: recall@100 versus database size (1000 to 10M; Holidays + images from Flickr) for BOF D=200k; VLAD k=64; VLAD k=64, D'=96; VLAD k=64, ADC 16 bytes; VLAD+Spectral Hashing, 16 bytes]

Timings: 4.768s; ADC: 0.286s; IVFADC: 0.014s; SH ≈ 0.267s

slide-50
SLIDE 50

Conclusion & future work

Excellent search accuracy and speed on 10 million images

Each image is represented by very few bytes (20 – 40 bytes)

Tested on up to 220 million video frames

extrapolation for 1 billion images: 20GB RAM, query time < 1s on 8 cores

Available online: Matlab source code for product quantizer

Alternative: using Fisher vectors instead of VLAD descriptors [Perronnin’10]

Extension to video & more “semantic” search

slide-51
SLIDE 51

Event retrieval in large video collections [Revaud et al. 2013]

Video description

frame t → VLAD descriptor, reduced to 512D with PCA

Comparison of two videos

  • query
  • database video

Fast calculation in the frequency domain + product quantization

slide-52
SLIDE 52