Instance-level recognition
1) Local invariant features 2) Matching and recognition with local features 3) Efficient visual search 4) Very large scale indexing
Visual search
Image search system for large datasets
Diagram: query → image search system → ranked image list, over a large image dataset (one million images or more)
Two strategies
1) Efficient approximate nearest-neighbour search on local feature descriptors
2) Efficient techniques from text retrieval (bag-of-words representation)
Diagram: images → local features → invariant descriptor vectors
1. Compute local features in each image independently
2. Describe each feature by a descriptor vector
3. Find nearest-neighbour vectors between query and database
4. Rank matched images by number of (tentatively) corresponding regions
5. Verify top-ranked images based on spatial consistency
Strategy 1: Efficient approximate NN search
Voting algorithm
– Each local characteristics vector of the query image votes for the model images I_1, …, I_n that contain a similar descriptor
– Accumulate the votes: the image I with the highest count is the corresponding model image
Finding nearest neighbour vectors
Establish correspondences between query image and images in the database by nearest neighbour matching on SIFT vectors
Diagram: 128-D descriptor space, query descriptors matched against a model image and the image database
Solve the following problem for all feature vectors x_j in the query image:
NN(x_j) = argmin_i || x_i - x_j ||
where x_i are the features from all the database images.
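The argmin search above can be sketched in a few lines of NumPy. A ratio test (comparing the first to the second nearest neighbour, a common heuristic due to Lowe) is added to reject ambiguous matches; the function and parameter names are illustrative, not from the original system.

```python
import numpy as np

def match_descriptors(query_desc, db_desc, ratio=0.8):
    """Brute-force nearest-neighbour matching with a ratio test.

    query_desc: (M, D) query descriptors; db_desc: (N, D) database descriptors.
    Returns a list of (query_idx, db_idx) tentative correspondences.
    """
    matches = []
    for j, q in enumerate(query_desc):
        d = np.linalg.norm(db_desc - q, axis=1)   # distances to all database vectors
        order = np.argsort(d)
        nn1, nn2 = order[0], order[1]
        if d[nn1] < ratio * d[nn2]:               # reject ambiguous matches
            matches.append((j, int(nn1)))
    return matches
```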
Quick look at the complexity of the NN-search
– N … images, M … regions per image (~1000), D … dimension of the descriptor (~128)
– Exhaustive linear search: O(M · NM · D)
– Example: nearest-neighbour search for one image pair takes 0.4 s (2 GHz CPU, implementation in C)

# of images | CPU time | Memory req.
N = 1,000 | ~7 min | ~100 MB
N = 10,000 | ~1 h 7 min | ~1 GB
N = 10^7 | ~115 days | ~1 TB
All images on Facebook: N = 10^10 | ~300 years | ~1 PB
Nearest-neighbor matching
Solve the following problem for all feature vectors x_j in the query image:
NN(x_j) = argmin_i || x_i - x_j ||
where x_i are features in database images.
– Nearest-neighbour matching is the major computational bottleneck
– Approximate methods index the database points in d dimensions and trade speed for accuracy, missing some correct matches
K-d tree
– A binary tree over the data points: each internal node splits its associated points into two sub-trees along one coordinate axis
– Splitting at the median of the projected points gives a balanced tree
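A minimal pure-Python k-d tree illustrating the median split and the branch-and-bound NN search; a sketch for toy data (production systems use optimized approximate variants):

```python
import numpy as np

def build_kdtree(points, indices=None, depth=0):
    """Build a balanced k-d tree: split at the median, cycling through axes."""
    if indices is None:
        indices = list(range(len(points)))
    if not indices:
        return None
    axis = depth % points.shape[1]                 # splitting dimension
    indices = sorted(indices, key=lambda i: points[i][axis])
    mid = len(indices) // 2                        # median -> balanced tree
    return {
        "index": indices[mid],
        "axis": axis,
        "left": build_kdtree(points, indices[:mid], depth + 1),
        "right": build_kdtree(points, indices[mid + 1:], depth + 1),
    }

def kdtree_nn(node, points, query, best=None):
    """Exact NN search with backtracking (prune a branch only if the
    splitting plane is farther than the current best distance)."""
    if node is None:
        return best
    i, axis = node["index"], node["axis"]
    d = float(np.linalg.norm(points[i] - query))
    if best is None or d < best[1]:
        best = (i, d)
    diff = query[axis] - points[i][axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = kdtree_nn(near, points, query, best)
    if abs(diff) < best[1]:                        # candidate may lie on the far side
        best = kdtree_nn(far, points, query, best)
    return best
```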
Large scale object/scene recognition
– 10^9 descriptors to index for one million images!
– Size of descriptors: 1 TB; exhaustive search and in-memory storage intractable
Diagram: query → image search system → ranked image list (image dataset: > 1 million images)
Bag-of-features [Sivic&Zisserman’03]
Pipeline: query image → set of SIFT descriptors (Harris-Hessian-Laplace regions + SIFT descriptors) → quantization to centroids (visual words) → sparse frequency vector + tf-idf weighting → querying the inverted file → ranked image short-list
Geometric verification
Re-ranked list
– 1 “word” (index) per local descriptor
– only image ids stored in the inverted file → 8 GB, fits in RAM!
[Chum & al. 2007]
Indexing text with inverted files
Need to map feature descriptors to “visual words”
Inverted file: each term maps to a list of hits (occurrences in documents)
People → [d1: hit hit hit], [d4: hit hit] …
Common → [d1: hit hit], [d3: hit], [d4: hit hit hit] …
Sculpture → [d2: hit], [d3: hit hit hit] …
Document collection: d1 … d4
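A toy inverted file over the slide's example terms, as a sketch (document contents and hit counts are illustrative):

```python
from collections import defaultdict

# Toy document collection mirroring the slide's example.
docs = {
    "d1": ["people"] * 3 + ["common"] * 2,
    "d2": ["sculpture"],
    "d3": ["common"] + ["sculpture"] * 3,
    "d4": ["people"] * 2 + ["common"] * 3,
}

# Build the inverted file: term -> {doc_id: hit count}.
inverted = defaultdict(dict)
for doc_id, terms in docs.items():
    for t in terms:
        inverted[t][doc_id] = inverted[t].get(doc_id, 0) + 1

def search(query_terms):
    """Score documents by summed hits; only posting lists of the
    query terms are touched, never the whole collection."""
    scores = defaultdict(int)
    for t in query_terms:
        for doc_id, hits in inverted.get(t, {}).items():
            scores[doc_id] += hits
    return sorted(scores, key=scores.get, reverse=True)
```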
[Sivic and Zisserman, ICCV 2003] Vector quantize descriptors
Build a visual vocabulary
128D descriptor space
K-means clustering
Minimizing the sum of squared Euclidean distances between points x_i and their nearest cluster centers
Algorithm:
1. Assign each point to the nearest cluster center
2. Recompute each cluster center as the mean of the points assigned to it
3. Iterate until convergence
– Converges to a local minimum; the solution depends on the initialization
– Initialization important: run several times, select the best result
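A minimal sketch of the Lloyd iteration described above. For brevity it initializes from the first k points; as the slide notes, practical systems run several random initializations and keep the best.

```python
import numpy as np

def kmeans(X, k, n_iter=20):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    centers = X[:k].copy()   # simplistic init; in practice: several random restarts
    for _ in range(n_iter):
        # 1) assign each point to its nearest cluster center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # 2) recompute each center as the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels
```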
Visual words
Example: the patches in each group are assigned to the same visual word
Figure from Sivic & Zisserman, ICCV 2003
Samples of visual words (clusters on SIFT descriptors):
Sivic and Zisserman, ICCV 2003
Visual words: quantize descriptor space
Nearest neighbour matching (Sivic and Zisserman, ICCV 2003)
– Match descriptors between Image 1 and Image 2 by nearest neighbours in the 128-D descriptor space
– Expensive: has to be done for all pairs of frames

Visual words: quantize descriptor space
– Vector quantize the descriptors (SIFT): e.g. descriptors falling into cell 5 or cell 42 receive visual word 5 or 42
– Two descriptors match iff they are assigned the same visual word
– A new image is quantized with the same cells: each of its descriptors directly receives a visual word (e.g. 42), with no pairwise matching
– Image = collection of visual words
Representation: bag of (visual) words
Visual words are ‘iconic’ image patches or fragments
Offline: Assign visual words and compute histograms for each image
1. Detect patches
2. Normalize each patch
3. Compute a SIFT descriptor
4. Find the nearest cluster center (visual word, e.g. 5, 42)
5. Represent the image as a sparse histogram of visual word occurrences (e.g. 2 1 1 …)
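Steps 4-5 above (assign each descriptor to its nearest cluster center, then histogram the word ids) can be sketched as follows; the vocabulary is assumed to be a (K, D) array of k-means centroids:

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Quantize each descriptor to its nearest visual word and count occurrences.

    descriptors: (M, D) local descriptors of one image.
    vocabulary: (K, D) cluster centers (visual words).
    Returns a length-K histogram of visual word counts.
    """
    # distance from every descriptor to every visual word
    d = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = d.argmin(axis=1)                      # visual-word id per descriptor
    return np.bincount(words, minlength=len(vocabulary)).astype(float)
```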
Offline: create an index
Image credit: A. Zisserman
Table: word number → posting list of image ids
(i.e. like the “book index”)
At run time
Image credit: A. Zisserman
1. Accumulate all visual words within the query region
2. Use the “book index” to find other images with these words
3. Compute similarity for images sharing at least one word
At run time
– Visual words common to the query and a database image give tentative correspondences (image credit: A. Zisserman)
Another interpretation: Bags of visual words
Summarize the entire image based on its distribution (histogram) of visual word occurrences
Slide: Grauman&Leibe, Image: L. Fei-Fei Hofmann 2001
d = (t_1, …, t_i, …, t_K)
Analogous to bag of words representation commonly used for text documents
For a vocabulary of size K, each image is represented by a K-vector v_d = (t_1, …, t_K), where t_i is the number of occurrences of visual word i. Images are ranked by the normalized scalar product between the query vector v_q and all vectors v_d in the database:
sim(v_q, v_d) = (v_q · v_d) / (||v_q|| ||v_d||)
Another interpretation: the bag-of-visual-words model
Scalar product can be computed efficiently using inverted file
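A sketch of the ranking by normalized scalar product with tf-idf weighting; the exact weighting used here (term frequency times log inverse document frequency) is one common variant, not necessarily the one used in the cited system:

```python
import numpy as np

def tfidf_vectors(histograms):
    """tf-idf weight raw visual-word counts, then L2-normalize each row.

    histograms: (num_images, K) matrix of visual-word counts.
    """
    H = np.asarray(histograms, dtype=float)
    tf = H / np.maximum(H.sum(axis=1, keepdims=True), 1)
    df = (H > 0).sum(axis=0)                       # document frequency per word
    idf = np.log(len(H) / np.maximum(df, 1))       # rare words weigh more
    V = tf * idf
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    return V / np.maximum(norms, 1e-12)

def rank_images(vq, V):
    """Rank database images by normalized scalar product with the query vector."""
    return np.argsort(-(V @ vq))
```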
Bag-of-features [Sivic&Zisserman’03]
Pipeline: query image → set of SIFT descriptors (Harris-Hessian-Laplace regions + SIFT descriptors) → quantization to centroids (visual words) → sparse frequency vector + tf-idf weighting → querying the inverted file → ranked image short-list
Geometric verification
Re-ranked list [Chum & al. 2007]
Results
Geometric verification
Use the position and shape of the underlying features to improve retrieval quality Both images have many matches – which is correct?
Geometric verification
– RANSAC
– Hough transform
Geometric verification
We can measure spatial consistency between the query and each result to improve retrieval quality, re-rank Many spatially consistent matches – correct result Few spatially consistent matches – incorrect result
Geometric verification
Gives localization of the object
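A sketch of RANSAC-style spatial verification: sample a minimal set of tentative correspondences, hypothesize a transformation, and count inliers. For brevity this fits a translation-only model (one correspondence per sample); real systems estimate an affine transformation or homography from the region shapes.

```python
import numpy as np

def ransac_verify(pts_q, pts_db, n_iter=100, tol=2.0, seed=0):
    """Count spatially consistent matches under a translation-only model.

    pts_q, pts_db: (N, 2) locations of tentatively matched keypoints in the
    query and database image. Returns the best inlier count found.
    """
    rng = np.random.default_rng(seed)
    best_inliers = 0
    for _ in range(n_iter):
        i = rng.integers(len(pts_q))       # minimal sample: 1 match fixes a translation
        t = pts_db[i] - pts_q[i]
        residuals = np.linalg.norm(pts_q + t - pts_db, axis=1)
        best_inliers = max(best_inliers, int((residuals < tol).sum()))
    return best_inliers
```

A high inlier count means the matches agree on a common transformation (correct result); a low count indicates spatially inconsistent, accidental word matches.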
Geometric verification – example
Evaluation dataset: Oxford buildings
All Soul's Ashmolean Balliol Bodleian Thom Tower Cornmarket Bridge of Sighs Keble Magdalen University Museum Radcliffe Camera
Ground truth obtained for 11 landmarks Evaluate performance by mean Average Precision
Measuring retrieval performance: Precision - Recall
– Precision: fraction of the returned images that are relevant
– Recall: fraction of the relevant images that are returned
Diagram: all images ⊇ returned images, relevant images; precision plotted against recall
Average Precision
– AP = area under the precision-recall curve
– An ideal ranking achieves both high recall and high precision (AP = 1)
Performance measured by mean Average Precision (mAP)
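AP and mAP can be computed directly from the ranked relevance flags; this is the common "precision at each relevant item" formulation (benchmark implementations may differ in interpolation details):

```python
def average_precision(ranked_relevance, num_relevant):
    """AP = mean of the precision values at the rank of each relevant item.

    ranked_relevance: 0/1 flags down the ranked result list.
    num_relevant: total number of relevant images in the database.
    """
    hits, ap = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            ap += hits / rank          # precision at this recall point
    return ap / num_relevant

def mean_average_precision(queries):
    """mAP: average AP over (ranked_relevance, num_relevant) pairs, one per query."""
    return sum(average_precision(r, n) for r, n in queries) / len(queries)
```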
Obtaining visual words is like a sensor measuring the image; “noise” in the measurement process means that some visual words are missing or incorrect, e.g. due to feature detection and quantization errors
Consequence: a visual word present in the query is missing in a relevant database image, so the match is lost
Why aren’t all objects retrieved?
Pipeline: query image → Hessian-Affine regions + SIFT descriptors [Lowe04, Mikolajczyk07] → clustered and quantized to visual words [Sivic03, Philbin07] → sparse frequency vector
Query Expansion in text
– In text: reissue the query augmented with terms from the top-ranked retrieved documents
– In vision: reissue the query augmented with visual words from spatially verified retrieved images
Automatic query expansion
– Visual word representations of two images of the same object may differ, resulting in missed returns
– Initial returns may be used to add new relevant visual words to the query
– A strong spatial model prevents ‘drift’ by discarding false positives
[Chum, Philbin, Sivic, Isard, Zisserman, ICCV’07; Chum, Mikulik, Perdoch, Matas, CVPR’11]
Visual query expansion - overview
Diagram: query image; originally retrieved image; originally not retrieved images

Query Expansion
A new expanded query is formed from the query image together with the spatially verified retrievals (matching regions overlaid)
Query image Originally retrieved Retrieved only after expansion
Query Expansion
Query image Expanded results (improved) Original results
Quantization errors
Typically, quantization has a significant impact on the final performance of the system [Sivic03,Nister06,Philbin07] Quantization errors split features that should be grouped together and confuse features that should be separated
Voronoi cells
Visual words – approximate NN search
– Quantize via k-means clustering to obtain visual words – Assign descriptors to closest visual words
Bag-of-features matching function, replacing descriptor matching with k-nearest neighbours:
f(x, y) = δ_{q(x), q(y)}
where q(x) is a quantizer, i.e., assignment to a visual word, and δ_{a,b} is the Kronecker operator (δ_{a,b} = 1 iff a = b)
Approximate nearest neighbor search evaluation
– ANN algorithms return a short-list of potential neighbours
– this short-list is supposed to contain the NN with high probability
– exact search may be performed to re-order this short-list
– trade-off measured by: NN recall = probability that the NN is in the short-list, against NN precision = proportion of database vectors in the short-list
ANN evaluation of bag-of-features
– BOF returns a list of potential neighbours: all descriptors assigned to the same visual word
– NN recall = probability that the NN is in this list
– NN precision = proportion of vectors in the short-list
– the trade-off is managed by the number of clusters k
Plot: NN recall (0.1-0.7) vs. rate of points retrieved (10^-7 to 0.1), for k = 100, 200, 500, 1000, 2000, 5000, 10000, 20000, 30000, 50000 (BOW)
20K visual word: false matches
200K visual word: good matches missed
Problem with bag-of-features
– for a “small” visual dictionary: too many false matches
– for a “large” visual dictionary: many true matches are missed
– either the Voronoi cells are too big, or the cells can’t absorb the descriptor noise: the intrinsic approximate nearest-neighbour search of BOF is not sufficient
– possible solutions: soft assignment, Hamming embedding
Beyond bags-of-visual-words
[Philbin et al. 2008, Van Gemert et al. 2008]
Hard assignment: the descriptor is assigned entirely to the nearest word (B: 1.0)
Soft assignment: the descriptor is distributed over several nearby words (A: 0.1, B: 0.5, C: 0.4)
Beyond bag-of-visual-words
Hamming embedding [Jegou et al. 2008]: augment each visual word with a short binary signature
Hamming Embedding
Representation of a descriptor x
– vector-quantized to q(x) as in standard BOF
– plus a short binary vector b(x) for an additional localization in the Voronoi cell
Two descriptors x and y match iff
q(x) = q(y) and h(b(x), b(y)) ≤ h_t
where h(a, b) is the Hamming distance between the binary signatures and h_t a fixed threshold
Hamming Embedding
– the Hamming distance defines a metric in the embedded space and reduces dimensionality-curse effects
– Hamming distance = very few operations
– fewer random memory accesses: 3x faster than BOF with the same dictionary size!
Hamming Embedding
Offline (training):
– draw an orthogonal projection matrix P of size d_b × d; this defines d_b random projection directions
– for each Voronoi cell and projection direction, compute the median value over a training set
Online, for each descriptor x:
– project x onto the projection directions as z(x) = (z_1, …, z_db)
– b_i(x) = 1 if z_i(x) is above the learned median value, otherwise 0
[H. Jegou et al., Improving bag of features for large scale image search, ECCV’08, IJCV’10]
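The offline/online steps above can be sketched in NumPy. The orthogonal projections are drawn once via QR decomposition of a random matrix; function names and the toy dimensions are illustrative, not from the cited implementation.

```python
import numpy as np

def train_he(descriptors_in_cell, d_b=8, seed=0):
    """Offline, per Voronoi cell: random orthogonal projections + per-direction medians."""
    d = descriptors_in_cell.shape[1]
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal basis
    P = Q[:, :d_b]                                     # keep d_b projection directions
    medians = np.median(descriptors_in_cell @ P, axis=0)
    return P, medians

def he_signature(x, P, medians):
    """Online: binarize the projections of x against the learned medians."""
    return (x @ P > medians).astype(np.uint8)

def he_match(bx, by, threshold):
    """Two descriptors in the same cell match iff Hamming distance <= threshold."""
    return int(np.sum(bx != by)) <= threshold
```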
Hamming neighborhood
Plot: rate of NN retrieved (recall) vs. rate of cell points retrieved, for binary signatures of 8, 16, 32, 64 and 128 bits
– Trade-off between memory usage and accuracy
– More bits yield higher accuracy
– In practice, 64 bits (8 bytes) per descriptor
ANN evaluation of Hamming Embedding
Plot: NN recall (0.1-0.7) vs. rate of points retrieved (10^-8 to 0.1); HE+BOW for h_t = 16, 18, 20, 22, 24, 28, 32 compared with BOW for k = 100 to 50,000
– Compared to BOW: at least 10 times fewer points in the short-list for the same level of NN recall
– Hamming embedding provides a much better trade-off between recall and ambiguity removal
Matching points - 20k word vocabulary
201 matches 240 matches Many matches with the non-corresponding image!
Matching points - 200k word vocabulary
69 matches 35 matches Still many matches with the non-corresponding one
Matching points - 20k word vocabulary + HE
83 matches 8 matches 10x more matches with the corresponding image!
INRIA holidays dataset
– 500 query images + 991 annotated true positives – Most images are holiday photos of friends and family
– Evaluation metric: mean average precision (mAP; bigger = better)
– AP: average over the precision/recall curve
Holiday dataset – example queries
Dataset : Venice Channel
Images: query and database images Base 1-4
Dataset : San Marco square
Images: query and database images Base 1-9
Example distractors - Flickr
Experimental evaluation
Average query time (4 CPU cores)
Compute descriptors: 880 ms
Quantization: 600 ms
Search – baseline: 620 ms
Search – WGC: 2110 ms
Search – HE: 200 ms
Search – HE+WGC: 650 ms
Plot: mAP vs. database size (1,000 to 1,000,000 images) for baseline, WGC, HE, WGC+HE, and WGC+HE with re-ranking
Results – Venice Channel
Ranked results for the query: Base 1, Flickr distractor, Flickr distractor, Base 4
Image retrieval - products
– Image-based product search, for example on a smart phone (courtesy Google)
Google image search
Towards large-scale image search
– with a limited number of descriptors per image: RAM 40 GB, search: 2 seconds
– with 100 M per machine: search: 20 seconds, RAM: 400 GB; not tractable
Very large scale image search
Pipeline: query image → set of SIFT descriptors (Hessian-Affine regions + SIFT descriptors [Mikolajczyk & Schmid 04, Lowe 04]) → description vector quantized to centroids (visual words), bag-of-features processing + tf-idf weighting → ranked image short-list → geometric verification [Lowe 04, Chum & al. 2007] → re-ranked list
Idea: vector compression and compressed-domain vector search (applicable to bag-of-features, VLAD, Fisher, GIST) to reduce memory requirements and search time
Aggregating local descriptors
– Fisher vector [Perronnin & Dance ‘07] – VLAD descriptor [Jegou, Douze, Schmid, Perez ‘10] – Supervector [Zhou et al. ‘10] – Sparse coding [Wang et al. ’10, Boureau et al.’10]
Global scene context – GIST descriptor
The “gist” of a scene: Oliva & Torralba (2001)
– filter responses in 5 frequency bands and 6 orientations for each image location
– tiling of the image for the description
– gives a single global representation of the scene
Aggregating local descriptors
Most popular approach: BoF representation [Sivic & Zisserman 03]
► sparse vector
► highly dimensional → significant dimensionality reduction introduces loss
Vector of locally aggregated descriptors (VLAD) [Jegou et al. 10]
► non-sparse vector
► fast to compute
► excellent results with a small vector dimensionality
Fisher vector [Perronnin & Dance 07]
► probabilistic version of VLAD
► initially used for image classification
► comparable performance to VLAD for image retrieval
VLAD : vector of locally aggregated descriptors
Determine a vector quantizer (k-means)
► output: k centroids c_1, …, c_k
► each centroid c_i has dimension d
For a given image
► assign each descriptor x to the closest center c_i
► accumulate (sum) the residuals per cell: v_i := v_i + (x - c_i)
VLAD (dimension D = k × d): concatenation of v_1, …, v_k
The vector is square-root + L2-normalized
Alternative: Fisher vector
[Jegou, Douze, Schmid, Perez, CVPR’10]
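The VLAD construction above, as a short NumPy sketch with the square-root (power) and L2 normalization included:

```python
import numpy as np

def vlad(descriptors, centroids):
    """VLAD: accumulate residuals (x - c_i) per cell, then normalize.

    descriptors: (M, d) local descriptors; centroids: (k, d) k-means centers.
    Returns a D = k*d dimensional L2-normalized vector.
    """
    k, d = centroids.shape
    v = np.zeros((k, d))
    # assign each descriptor to its closest centroid
    assign = np.linalg.norm(
        descriptors[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    for x, i in zip(descriptors, assign):
        v[i] += x - centroids[i]            # accumulate the residual in cell i
    v = v.ravel()                           # dimension D = k * d
    v = np.sign(v) * np.sqrt(np.abs(v))     # square-root (power) normalization
    n = np.linalg.norm(v)
    return v / n if n > 0 else v
```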
VLADs for corresponding images
► SIFT-like representation per centroid (+ components: blue, - components: red)
► good coincidence of energy & orientations between the VLADs v1, v2, v3, … of corresponding images
► translated cluster → large derivative for this component
Fisher vector
– Use a Gaussian mixture model (GMM) as the vocabulary
– Statistical measure of the descriptors of the image w.r.t. the GMM
– Derivative of the likelihood w.r.t. the GMM parameters: weight, mean, variance (diagonal)
[Perronnin & Dance 07]
Fisher vector
For image retrieval in our experiments:
VLAD/Fisher/BOF performance and dimensionality reduction
We compare Fisher, VLAD and BoF on the INRIA Holidays dataset (mAP %)
Dimension is reduced to D’ dimensions with PCA
Observations:
► Fisher and VLAD better than BoF for a given descriptor size
► choose a small D if the output dimension D’ is small
► performance of GIST (D = 960, mAP 36.5) not competitive
[Jegou, Perronnin, Douze, Sanchez, Perez, Schmid, PAMI’12]
Compact image representation
Aim: improving the trade-off between
► search speed
► memory usage
► search quality
Approach: joint optimization of three stages
► local descriptor aggregation (VLAD / Fisher image representation)
► dimension reduction (PCA)
► indexing algorithm (PQ codes, (non-)exhaustive search)
Product quantization for nearest neighbor search
Vector split into m subvectors: y = [y_1 | … | y_m]
Subvectors are quantized separately: q(y) = [q_1(y_1), …, q_m(y_m)], where each quantizer q_j is learned by k-means with a limited number of centroids
Example: y = 128-dim vector split in 8 subvectors of dimension 16
► each subvector is quantized with 256 centroids → 8 bits
► 8 subvectors × 8 bits = 64-bit quantization index
► very large implicit codebook: 256^8 ≈ 1.8 × 10^19 centroids
[Jegou, Douze, Schmid, PAMI’11]
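A sketch of product quantization: one small k-means codebook per subvector, encoding to m small indices, and decoding back to an approximate vector. Parameters are toy-sized and names are illustrative; the cited work additionally performs asymmetric distance computation on the codes.

```python
import numpy as np

def pq_train(X, m=8, ks=256, n_iter=10, seed=0):
    """Learn one small codebook per subvector block (plain Lloyd k-means)."""
    n, d = X.shape
    ds = d // m                                  # dimension of each subvector
    rng = np.random.default_rng(seed)
    codebooks = []
    for j in range(m):
        sub = X[:, j * ds:(j + 1) * ds]
        centers = sub[rng.choice(n, size=min(ks, n), replace=False)].copy()
        for _ in range(n_iter):
            lab = np.linalg.norm(sub[:, None] - centers[None], axis=2).argmin(axis=1)
            for c in range(len(centers)):
                if np.any(lab == c):
                    centers[c] = sub[lab == c].mean(axis=0)
        codebooks.append(centers)
    return codebooks

def pq_encode(x, codebooks):
    """Encode a vector as m small codeword indices (8 bits each if ks=256)."""
    m, ds = len(codebooks), len(x) // len(codebooks)
    return np.array([
        np.linalg.norm(codebooks[j] - x[j * ds:(j + 1) * ds], axis=1).argmin()
        for j in range(m)], dtype=np.int64)

def pq_decode(code, codebooks):
    """Reconstruct an approximation of the vector from its codeword indices."""
    return np.concatenate([codebooks[j][c] for j, c in enumerate(code)])
```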
Deep image retrieval [Gordo et al. 2016]
– Deep network which focuses on retrieval
– Introduces an automatic cleaning procedure based on geometric constraints