SIMILARITY SEARCH The Metric Space Approach
Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, Michal Batko
Table of Contents
Part I: Metric searching in a nutshell
Similarity Search: The Metric Space Approach Part I, Chapter 1 2
Foundations of metric space searching
Survey of existing approaches
Centralized index structures
Approximate similarity search
Parallel and distributed indexes
Approximate similarity search overcomes the problems of exact similarity search with traditional access methods:
Only a moderate improvement in performance with respect to a sequential scan
The dimensionality curse
Exact similarity search returns mathematically precise result sets
Similarity is subjective, so in some cases approximate result sets also satisfy the user
Approximate similarity search processes a query faster at the price of imprecision in the returned result sets
Useful, for instance, in interactive systems, where similarity search is an iterative process and temporary results are used to create a new query
Improvements of up to two orders of magnitude
Approximation strategies
Relaxed pruning conditions
Data regions overlapping the query regions can be discarded depending on the specific strategy
Early termination of the search algorithm
Search algorithm might stop before all regions have been accessed
1. Relative error approximation
Range and k-NN search queries
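Let $o^N$ be the nearest neighbor of $q$. If
$$\frac{d(q, o^A)}{d(q, o^N)} \le 1 + \epsilon$$
then $o^A$ is the $(1+\epsilon)$-approximate nearest neighbor of $q$.
This can be generalized to the k-th nearest neighbor:
$$\frac{d(q, o^A_k)}{d(q, o^N_k)} \le 1 + \epsilon$$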
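The relative error condition can be checked directly once both distances are known. The sketch below is only an illustration: the function name and the sample distances are assumptions, and `eps` stands for the parameter written e in these slides.

```python
def is_eps_approximate(d_approx, d_exact, eps):
    """True if d_approx / d_exact <= 1 + eps.

    The test is written as a product so that d_exact == 0 causes no division.
    """
    return d_approx <= (1.0 + eps) * d_exact


# Hypothetical distances of the exact and the approximate k-th neighbor.
d_exact, d_approx = 2.0, 2.5
accepted = is_eps_approximate(d_approx, d_exact, 0.25)  # 2.5 <= 2.5
rejected = is_eps_approximate(d_approx, d_exact, 0.20)  # 2.5 <= 2.4 fails
```

For a k-NN result the same test would be applied to the k-th approximate and the k-th exact neighbor.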
Exact pruning strategy:
[Figure: a query region with center $q$ and radius $r_q$ next to a data region with center $p$ and radius $r_p$]
Approximate pruning strategy:
[Figure: the query region with center $q$ and the data region with center $p$ and radius $r_p$, with the query radius $r_q$ reduced to $r_q/(1+\epsilon)$]
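The two pruning strategies can be contrasted in a short sketch. It assumes ball-shaped regions and the usual overlap test, which the slides show only as figures: a data region is discarded when d(q,p) > rq + rp, and the approximate variant shrinks the query radius to rq/(1+e). The function names and the numbers are hypothetical.

```python
def can_prune_exact(d_qp, r_q, r_p):
    """Discard the data region only if it cannot intersect the query region."""
    return d_qp > r_q + r_p


def can_prune_approx(d_qp, r_q, r_p, eps):
    """Relaxed test: the query radius is shrunk to r_q / (1 + eps)."""
    return d_qp > r_q / (1.0 + eps) + r_p


# Hypothetical example in which the two regions overlap slightly.
d_qp, r_q, r_p = 9.0, 6.0, 4.0
exact = can_prune_exact(d_qp, r_q, r_p)          # 9 > 10 does not hold
relaxed = can_prune_approx(d_qp, r_q, r_p, 0.5)  # 9 > 4 + 4 holds
```

With `eps = 0` the relaxed test reduces to the exact one, so the parameter directly controls how many overlapping regions are skipped.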
2. Good fraction approximation
k-NN search queries
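The k-NN algorithm determines the final result by progressively improving the current result set.
When the current result set belongs to a specific fraction of the objects closest to the query, the algorithm stops.
Example: stop when the current result set belongs to the 10% of the objects closest to the query.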
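For this strategy we use the distance distribution $F_q(x)$ of the distances from the query object $q$.
The distance distribution $F_q(x)$ specifies the probability that the distance of a random object from $q$ is smaller than $x$.
It is easy to see that $F_q(x)$ gives, in probabilistic terms, the fraction of the data set whose distances from $q$ are smaller than $x$.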
[Figure: plot of $F_q(x)$ against the distance $x$. The value $F_q(d(q, o_k))$ is the fraction of the data set whose distances from $q$ are smaller than $d(q, o_k)$.]
When $F_q(d(q, o_k)) < r$, all objects of the current result set belong to the fraction $r$ of the objects closest to $q$.
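The stop condition can be sketched as follows. The sketch assumes that Fq is approximated empirically from a sample of distances to q; the helper names and the sample are hypothetical and are not taken from the slides.

```python
from bisect import bisect_left


def make_distribution(sample_distances):
    """Return an empirical F_q: the fraction of sampled distances smaller than x."""
    ordered = sorted(sample_distances)

    def F_q(x):
        return bisect_left(ordered, x) / len(ordered)

    return F_q


def should_stop(F_q, kth_distance, fraction):
    """Stop once the current k-th neighbor lies within the closest `fraction`."""
    return F_q(kth_distance) < fraction


# Hypothetical sample: distances 1, 2, ..., 100 from the query object.
F_q = make_distribution(range(1, 101))
early = should_stop(F_q, 50, 0.10)  # about half of the data is closer: continue
late = should_stop(F_q, 5, 0.10)    # only 4% of the data is closer: stop
```

The search would call `should_stop` after each improvement of the result set, passing the distance of its current k-th object.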
$F_q(x)$ is difficult to handle, since we would need to compute it for every possible query $q$.
It was proven that the overall distance distribution $F(x) = \Pr\{d(o_1, o_2) \le x\}$ can be used in practice in place of $F_q(x)$.
$F(x)$ can be easily estimated as a discrete function.
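One way to obtain such a discrete estimate is to sample pairs of objects and accumulate their distances over equal-width bins. This is a minimal sketch under that assumption; the function name, the number of sampled pairs, and the number of bins are illustrative choices, not values from the slides.

```python
import random
from bisect import bisect_right


def estimate_overall_distribution(objects, distance, pairs=1000, bins=10, seed=0):
    """Estimate F(x) = Pr{d(o1, o2) <= x} as a step function over equal-width bins."""
    rng = random.Random(seed)
    # Distances between randomly sampled pairs of distinct positions in `objects`.
    dists = sorted(distance(*rng.sample(objects, 2)) for _ in range(pairs))
    max_d = dists[-1]
    # Upper edge of each bin; the last edge equals the largest sampled distance.
    edges = [max_d * ((i + 1) / bins) for i in range(bins)]
    # cumulative[i] is the fraction of sampled distances <= edges[i].
    cumulative = [bisect_right(dists, e) / pairs for e in edges]
    return edges, cumulative
```

The two returned lists are small enough to keep in memory, and looking up F(x) amounts to finding the bin that contains x.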
3. Small chance improvement approximation
k-NN search queries
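The M-Tree's k-NN algorithm determines the final result by improving a temporary result set.
At each step of the algorithm the temporary result is improved, and the distance of its k-th element decreases.
When the improvement of the temporary result set slows down, the algorithm can stop.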
[Figure: distance $d(q, o^A_k)$ of the current k-th nearest neighbor plotted against the iteration number]
[Figure: distance plotted against the iteration number, together with hyperbolic and logarithmic regression curves fitted to it]
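A regression curve of this kind can be fitted with ordinary least squares. The sketch below assumes the hyperbolic form c1/x + c2 and a stop rule based on the slope of the fitted curve; both the exact form and the threshold test are assumptions made for illustration, since the formulas on the slides did not survive extraction.

```python
def fit_hyperbolic(iterations, distances):
    """Least-squares fit of distance ~ c1 / iteration + c2."""
    u = [1.0 / x for x in iterations]  # the model is linear in u = 1/x
    n = len(u)
    mean_u = sum(u) / n
    mean_d = sum(distances) / n
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_ud = sum((a - mean_u) * (b - mean_d) for a, b in zip(u, distances))
    c1 = s_ud / s_uu
    c2 = mean_d - c1 * mean_u
    return c1, c2


def improvement_is_small(c1, iteration, threshold):
    """The slope of c1/x + c2 is -c1/x**2; stop when its magnitude is tiny."""
    return abs(c1) / iteration ** 2 < threshold


# Hypothetical trace of the k-th neighbor distance over a few iterations.
iterations = [1, 2, 4, 5, 10]
distances = [0.2 + 0.1 / x for x in iterations]
c1, c2 = fit_hyperbolic(iterations, distances)
```

A logarithmic curve could be fitted the same way by replacing `1.0 / x` with `math.log(x)` and adjusting the slope formula accordingly.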
4. Proximity-based approximation
Range and k-NN search queries
Regions whose probability of containing qualifying objects is below a certain threshold are pruned even if they overlap the query region.
Proximity between regions is defined as the probability that a randomly chosen object appears in both regions.
This resulted in an increase in performance of two orders of magnitude.
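The definition of proximity suggests a simple empirical estimate: count how many objects of a sample fall inside both regions. The sketch below assumes ball-shaped regions and uses the proximity directly as the pruning score; the function names, the sample, and the threshold are hypothetical.

```python
def estimate_proximity(sample, distance, center1, radius1, center2, radius2):
    """Fraction of sampled objects that fall inside both ball regions."""
    in_both = sum(
        1
        for o in sample
        if distance(o, center1) <= radius1 and distance(o, center2) <= radius2
    )
    return in_both / len(sample)


def can_prune(proximity, threshold):
    """Discard a data region whose proximity to the query region is too low."""
    return proximity < threshold


# Hypothetical 1-D data: the integers 0..99 with absolute difference as distance.
sample = list(range(100))
dist = lambda a, b: abs(a - b)
# Regions [10, 30] and [27, 37] share the four objects 27, 28, 29 and 30.
prox = estimate_proximity(sample, dist, 20, 10, 32, 5)
```

A region that overlaps the query region only marginally gets a small proximity and is skipped, which is where the savings come from.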
[Figure: a query region $q$ and the data regions $R_1$, $R_2$ and $R_3$]
5. PAC nearest neighbor searching
1-NN search queries
The relaxed branching condition is the same as the one used by the relative error approximation to find a $(1+\epsilon)$-approximate nearest neighbor.
In addition, the search halts prematurely when the probability that we have found the $(1+\epsilon)$-approximate nearest neighbor is above the confidence threshold, which is controlled by the parameter $\delta$.
Actual error of the current approximate nearest neighbor $o^A$ with respect to the nearest neighbor $o^N$:
$$\epsilon_{act} = \frac{d(q, o^A)}{d(q, o^N)} - 1$$
Distribution of the distance of the nearest neighbor, for a data set $X$ of $n$ objects:
$$G_q(x) = \Pr\{\exists\, o \in X : d(q, o) \le x\} = 1 - (1 - F_q(x))^n$$
Given that
$$\Pr\{\epsilon_{act} > \epsilon\} = \Pr\{\exists\, o \in X : d(q, o^A)/d(q, o) - 1 > \epsilon\} = \Pr\{\exists\, o \in X : d(q, o) < d(q, o^A)/(1+\epsilon)\} = G_q(d(q, o^A)/(1+\epsilon))$$
the algorithm halts when
$$G_q(d(q, o^A)/(1+\epsilon)) \le \delta$$
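The halting test can be sketched in a few lines. It assumes the nearest neighbor distance distribution Gq(x) = 1 - (1 - Fq(x))^n and a halt once Gq(d(q,oA)/(1+e)) is at most the threshold d; the direction of that inequality, the function names, and the uniform Fq used in the example are assumptions made for illustration.

```python
def nn_distance_distribution(F_q, n):
    """G_q(x) = 1 - (1 - F_q(x))**n for a data set of n objects."""

    def G_q(x):
        return 1.0 - (1.0 - F_q(x)) ** n

    return G_q


def pac_should_halt(G_q, d_current, eps, delta):
    """Halt when the chance that a sufficiently closer object exists is at most delta."""
    return G_q(d_current / (1.0 + eps)) <= delta


# Hypothetical distribution: distances uniform on [0, 1], so F_q(x) = x there.
F_q = lambda x: min(max(x, 0.0), 1.0)
G_q = nn_distance_distribution(F_q, n=100)
halt_near = pac_should_halt(G_q, 0.002, 1.0, 0.2)  # G_q(0.001) is about 0.095
halt_far = pac_should_halt(G_q, 0.1, 1.0, 0.2)     # G_q(0.05) is about 0.994
```

The closer the current candidate is to the query, the less likely it is that a much closer object exists, so the search can stop earlier.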
Objects are vectors of 45 dimensions
Range queries were tested on the methods:
Relative error
Proximity
Nearest neighbor queries were tested on all methods
[Figure: Relative error, range queries. IE (axis from 1 to 2) against R (0.2 to 1) for r = 1,800, 2,200, 2,600 and 3,000]
[Figure: Proximity, range queries. IE (axis from 1 to 7) against R (0.2 to 1) for r = 1,800, 2,200, 2,600 and 3,000]
[Figure: Relative error, nearest neighbor queries. IE (axis from 1 to 1.6) against EP (0.001 to 0.004) for k = 1, 3, 10 and 50]
[Figure: Good fraction. IE (axis from 100 to 800) against EP (0.01 to 0.03) for k = 1, 3, 10 and 50]
[Figure: Small chance improvement. IE (axis from 20 to 200) against EP (0.02 to 0.1) for k = 1, 3, 10 and 50]
[Figure: Proximity, nearest neighbor queries. IE (axis from 100 to 800) against EP (0.005 to 0.03) for k = 1, 3, 10 and 50]
[Figure: PAC. IE (axis from 50 to 500) against EP (0.001 to 0.005) for eps = 2, 3 and 4]
Vector spaces are a special case of metric spaces.
The best performance was obtained with the good fraction approximation method.
The proximity-based method is slightly worse than good fraction approximation, but it can be used for both range queries and k-NN queries.