SIMILARITY SEARCH The Metric Space Approach Pavel Zezula, Giuseppe - - PowerPoint PPT Presentation



SLIDE 1

SIMILARITY SEARCH The Metric Space Approach

Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, Michal Batko

SLIDE 2

P. Zezula, G. Amato, V. Dohnal, M. Batko: Similarity Search: The Metric Space Approach, Part I, Chapter 1

Table of Contents

Part I: Metric searching in a nutshell

  • Foundations of metric space searching
  • Survey of existing approaches

Part II: Metric searching in large collections

  • Centralized index structures
  • Approximate similarity search
  • Parallel and distributed indexes

SLIDE 3

Approximate similarity search

  • Approximate similarity search overcomes the problems of exact similarity search with traditional access methods
      • Only a moderate improvement in performance with respect to a sequential scan
      • The dimensionality curse
  • Exact similarity search returns mathematically precise result sets
      • Similarity is subjective, so in some cases approximate result sets also satisfy the user
  • Approximate similarity search processes a query faster, at the price of imprecision in the returned result set
      • Useful, for instance, in interactive systems, where similarity search is an iterative process and temporary results are used to create a new query
      • Improvements of up to two orders of magnitude

SLIDE 4

Approximate similarity search

  • Approximation strategies
      • Relaxed pruning conditions
          • Data regions overlapping the query region can be discarded, depending on the specific strategy
      • Early termination of the search algorithm
          • The search algorithm might stop before all regions have been accessed

SLIDE 5

Approximate Similarity Search

1. relative error approximation (pruning condition)
   Range and k-NN search queries
2. good fraction approximation
3. small chance improvement approximation
4. proximity-based approximation
5. PAC nearest neighbor searching
6. performance trials

SLIDE 6

Relative error approximation

  • Let $o^N$ be the nearest neighbour of $q$. If

$$\frac{d(q, o^A)}{d(q, o^N)} \le 1 + \epsilon$$

    then $o^A$ is the $(1+\epsilon)$-approximate nearest neighbour of $q$

  • This can be generalized to the k-th nearest neighbour:

$$\frac{d(q, o^A_k)}{d(q, o^N_k)} \le 1 + \epsilon$$
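
The definition above can be checked directly once both distances are known. The following is a minimal sketch, assuming Euclidean vectors and Python's `math.dist` as the metric; the function names are illustrative and do not come from the slides.

```python
import math

def is_approximate_nn(q, o_approx, o_exact, eps, dist=math.dist):
    """True if o_approx is a (1+eps)-approximate nearest neighbour of q,
    given the true nearest neighbour o_exact."""
    # Multiply instead of dividing so that d(q, o_exact) == 0 is handled.
    return dist(q, o_approx) <= (1.0 + eps) * dist(q, o_exact)

def is_approximate_knn(q, approx_result, exact_result, eps, dist=math.dist):
    """Generalization to k-NN: compare the k-th objects of both results
    (the last elements of the distance-sorted lists)."""
    return is_approximate_nn(q, approx_result[-1], exact_result[-1], eps, dist)

q = (0.0, 0.0)
o_exact = (3.0, 4.0)    # distance 5 from q
o_approx = (6.0, 0.0)   # distance 6 from q
print(is_approximate_nn(q, o_approx, o_exact, eps=0.25))  # 6 <= 6.25
print(is_approximate_nn(q, o_approx, o_exact, eps=0.1))   # 6 <= 5.5 fails
```

In practice the true nearest neighbour is not known during the search, so a check like this is mainly useful for evaluating an approximate algorithm after the fact.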

SLIDE 7

Relative error approximation

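  • Exact pruning strategy: a data region with centre $p$ and radius $r_p$ is discarded only when it does not intersect the query region with centre $q$ and radius $r_q$:

$$d(q, p) > r_q + r_p$$

[Figure: query region (centre $q$, radius $r_q$) and data region (centre $p$, radius $r_p$)]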

SLIDE 8

Relative error approximation

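  • Approximate pruning strategy: the query radius is reduced to $r_q/(1+\epsilon)$, so a data region is discarded when

$$d(q, p) > \frac{r_q}{1 + \epsilon} + r_p$$

[Figure: query region (centre $q$, radii $r_q$ and $r_q/(1+\epsilon)$) and data region (centre $p$, radius $r_p$)]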
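
The two pruning tests differ only in the query radius they use. A minimal sketch, assuming ball-shaped regions described by a centre distance and two radii (the function and parameter names are hypothetical):

```python
def can_prune_exact(d_qp, r_q, r_p):
    """Exact test: discard the data region only if it cannot
    intersect the query region."""
    return d_qp > r_q + r_p

def can_prune_relative_error(d_qp, r_q, r_p, eps):
    """Relaxed test: the query radius is shrunk to r_q / (1 + eps),
    so regions that overlap only the outer shell are discarded too."""
    return d_qp > r_q / (1.0 + eps) + r_p

# A region at distance 10 overlaps a query of radius 8 (8 + 3 > 10),
# but is discarded once the radius is halved (4 + 3 < 10).
print(can_prune_exact(10.0, 8.0, 3.0))
print(can_prune_relative_error(10.0, 8.0, 3.0, eps=1.0))
```

With `eps = 0` the relaxed test reduces to the exact one, which is a quick sanity check for any implementation.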

SLIDE 9

Approximate Similarity Search

1. relative error approximation (pruning condition)
   Range and k-NN search queries
2. good fraction approximation (stop condition)
   k-NN search queries
3. small chance improvement approximation
4. proximity-based approximation
5. PAC nearest neighbor searching
6. performance trials

SLIDE 10

Good fraction approximation

  • The k-NN algorithm determines the final result by progressively reducing the distances of the objects in the current result set
  • When the current result set belongs to a specified fraction of the objects closest to the query, the approximate algorithm stops
      • Example: stop when the current result set belongs to the 10% of the objects closest to the query

SLIDE 11

Good fraction approximation

  • For this strategy we use the distance distribution, defined as

$$F_q(x) = \Pr\{d(q, o) < x\}$$

  • The distance distribution $F_q(x)$ gives the probability that the distance of a random object $o$ from $q$ is smaller than $x$
  • In probabilistic terms, $F_q(x)$ is therefore the fraction of the database made up of the objects whose distance from $q$ is smaller than $x$

SLIDE 12

Good fraction approximation

[Figure: plot of $F_q(x)$ against the distance $x$. The value of $F_q$ at $d(q, o_k)$ is the fraction of the data set whose distances from $q$ are smaller than $d(q, o_k)$.]

SLIDE 13

Good fraction approximation

  • When $F_q(d(o_k, q)) < \rho$, all objects of the current result set belong to the fraction $\rho$ of the data set

[Figure: query object $q$ and the current k-th result object $o_k$]

SLIDE 14

Good fraction approximation

  • $F_q(x)$ is difficult to handle, since it would have to be computed for every possible query
  • It was proven that the overall distance distribution $F(x)$, defined as

$$F(x) = \Pr\{d(o_1, o_2) < x\}$$

    can be used in practice instead of $F_q(x)$, since the two have statistically the same behaviour
  • $F(x)$ can easily be estimated as a discrete function and kept in main memory
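
One way to estimate $F(x)$ as a discrete function is to sample random pairs of objects and keep their sorted distances in memory. The sketch below is an illustration under that assumption; the sampling scheme, the names `estimate_distance_distribution` and `good_fraction_stop`, and the default sample size are not taken from the slides.

```python
import bisect
import math
import random

def estimate_distance_distribution(objects, dist, n_pairs=10000, seed=0):
    """Estimate F(x) = Pr{d(o1, o2) < x} from randomly sampled object pairs.
    `objects` must be a sequence with at least two elements."""
    rng = random.Random(seed)
    samples = sorted(dist(*rng.sample(objects, 2)) for _ in range(n_pairs))

    def F(x):
        # Fraction of sampled distances strictly smaller than x.
        return bisect.bisect_left(samples, x) / len(samples)

    return F

def good_fraction_stop(F, d_k, rho):
    """Stop condition: the current k-th distance d_k already lies
    within the fraction rho of the smallest distances."""
    return F(d_k) < rho

objects = [(float(i),) for i in range(100)]
F = estimate_distance_distribution(objects, math.dist, n_pairs=2000)
print(F(10.0), F(50.0), good_fraction_stop(F, 2.0, rho=0.1))
```

A k-NN search would call `good_fraction_stop` with the distance of its current k-th candidate each time the result set changes.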

SLIDE 15

Approximate Similarity Search

1. relative error approximation (pruning condition)
   Range and k-NN search queries
2. good fraction approximation (stop condition)
   k-NN search queries
3. small chance improvement approximation (stop condition)
   k-NN search queries
4. proximity-based approximation
5. PAC nearest neighbor searching
6. performance trials

SLIDE 16

Small chance improvement approximation

  • The M-Tree's k-NN algorithm determines the final result by improving the current result set
  • At each step of the algorithm, the temporary result is improved and the distance of the k-th element decreases
  • When the improvement of the temporary result set slows down, the algorithm can stop

SLIDE 17

Small chance improvement approximation

[Figure: distance versus iteration for the function $f(x) = d(q, o^A_k)$, the distance of the current k-th result object $o^A_k$ at iteration $x$]

SLIDE 18

Small chance improvement approximation

The function $f(x)$ is not known a priori.

A regression curve $\varphi(x)$, which approximates $f(x)$, is computed with the least-squares method while the algorithm proceeds.

The derivative of $\varphi(x)$ makes it possible to decide when the algorithm has to stop.

SLIDE 19

Small chance improvement approximation

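The regression curve has the form

$$\varphi(x) = c_1 \cdot \varphi_1(x) + c_2$$

where $c_1$ and $c_2$ are such that

$$\sum_{i \le j} \left(c_1 \cdot \varphi_1(i) + c_2 - f(i)\right)^2$$

is minimum, with the sum running over the iterations $i$ up to the current iteration $j$.

We have used both $\varphi_1(x) = \ln(x)$ and $\varphi_1(x) = 1/x$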
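
Because the curve is linear in $c_1$ and $c_2$, the least-squares fit has a closed form once $\varphi_1$ is applied to the iteration numbers. The following is a minimal sketch; the stop rule based on a fixed `threshold` for the slope is an assumption for illustration, since the slides do not state the exact criterion applied to the derivative.

```python
import math

def fit_regression(xs, ys, phi1):
    """Least-squares fit of phi(x) = c1 * phi1(x) + c2 to the points (xs, ys).
    Requires at least two distinct values of phi1(x)."""
    us = [phi1(x) for x in xs]
    n = len(us)
    mean_u = sum(us) / n
    mean_y = sum(ys) / n
    var_u = sum((u - mean_u) ** 2 for u in us)
    cov_uy = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys))
    c1 = cov_uy / var_u
    c2 = mean_y - c1 * mean_u
    return c1, c2

def should_stop(c1, x, dphi1, threshold):
    """Hypothetical stop rule: the slope of phi at iteration x,
    c1 * phi1'(x), has become smaller in magnitude than `threshold`."""
    return abs(c1 * dphi1(x)) < threshold

# Synthetic k-th distances that follow a hyperbolic curve exactly.
xs = list(range(1, 51))
ys = [0.2 + 0.18 / x for x in xs]
c1, c2 = fit_regression(xs, ys, phi1=lambda x: 1.0 / x)
print(c1, c2)
print(should_stop(c1, 50, dphi1=lambda x: -1.0 / x ** 2, threshold=1e-3))
```

For the logarithmic variant, pass `phi1=math.log` and `dphi1=lambda x: 1.0 / x` instead.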

SLIDE 20

Regression curves

[Figure: distance versus iteration, showing the measured distance curve together with its hyperbolic and logarithmic regression curves]

SLIDE 21

Approximate Similarity Search

1. relative error approximation (pruning condition)
   Range and k-NN search queries
2. good fraction approximation (stop condition)
   k-NN search queries
3. small chance improvement approximation (stop condition)
   k-NN search queries
4. proximity-based approximation (pruning condition)
   Range and k-NN search queries
5. PAC nearest neighbor searching
6. performance trials

SLIDE 22

Proximity-based approximation

  • Regions whose probability of containing qualifying objects is below a certain threshold are pruned even if they overlap the query region
      • Proximity between regions is defined as the probability that a randomly chosen object appears in both regions
  • This resulted in a performance increase of two orders of magnitude for both range queries and nearest neighbour queries
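
The slides define proximity but do not say how it is computed. As a rough illustration of the definition only, the sketch below substitutes an empirical estimate: it counts how many objects of a sample fall inside both ball regions. The function names and the sample-based estimate are assumptions, not the authors' method.

```python
import math

def estimate_proximity(sample, dist, center_a, radius_a, center_b, radius_b):
    """Fraction of sampled objects that lie in both ball regions,
    used here as an empirical stand-in for the proximity of the regions."""
    in_both = sum(
        1
        for o in sample
        if dist(o, center_a) <= radius_a and dist(o, center_b) <= radius_b
    )
    return in_both / len(sample)

def can_prune_by_proximity(proximity, threshold):
    """Discard the data region when its proximity to the query region
    is below the chosen threshold, even if the two regions overlap."""
    return proximity < threshold

sample = [(float(i),) for i in range(10)]
# Data region covers 0..4, query region covers 4..6: only object 4 is shared.
p = estimate_proximity(sample, math.dist, (2.0,), 2.0, (5.0,), 1.0)
print(p, can_prune_by_proximity(p, threshold=0.2))
```

The threshold trades accuracy for speed: a higher value prunes more overlapping regions and therefore risks missing more qualifying objects.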

SLIDE 23

Proximity-based approximation

[Figure: a query region $q$ overlapping the data regions $R_1$, $R_2$ and $R_3$]

SLIDE 24

Approximate Similarity Search

1. relative error approximation (pruning condition)
   Range and k-NN search queries
2. good fraction approximation (stop condition)
   k-NN search queries
3. small chance improvement approximation (stop condition)
   k-NN search queries
4. proximity-based approximation (pruning condition)
   Range and k-NN search queries
5. PAC nearest neighbor searching (pruning and stop conditions)
   1-NN search queries
6. performance trials

SLIDE 25

PAC nearest neighbour searching

It uses a relaxed branching condition and a stop condition at the same time.

The relaxed branching condition is the same one used by the relative error approximation to find a $(1+\epsilon)$-approximate nearest neighbour.

In addition, it halts prematurely once the object found so far is a $(1+\epsilon)$-approximate nearest neighbour with sufficiently high probability, as controlled by the threshold $\delta$.

It can only be used for 1-NN search queries.

SLIDE 26

PAC nearest neighbour searching

Let us suppose that the nearest neighbour found so far is $o^A$.

Let $\epsilon_{act}$ be the actual error on the distance of $o^A$:

$$\epsilon_{act} = \frac{d(q, o^A)}{d(q, o^N)} - 1$$

The algorithm stops if

$$\Pr\{\epsilon_{act} > \epsilon\} \le \delta$$

The above probability is obtained by computing the distribution of the distance of the nearest neighbour.

SLIDE 27

PAC nearest neighbour searching

  • Distribution of the distance of the nearest neighbour in $X$ (of cardinality $n$) with respect to $q$:

$$G_q(x) = \Pr\{\exists\, o \in X : d(q, o) < x\} = 1 - (1 - F_q(x))^n$$

  • Given that

$$\Pr\{\epsilon_{act} > \epsilon\} = \Pr\{\exists\, o \in X : d(q, o^A)/d(q, o) - 1 > \epsilon\} = \Pr\{\exists\, o \in X : d(q, o) < d(q, o^A)/(1+\epsilon)\} = G_q\big(d(q, o^A)/(1+\epsilon)\big)$$

  • The algorithm halts when

$$G_q\big(d(q, o^A)/(1+\epsilon)\big) \le \delta$$
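
The halting test follows directly from the two formulas above. A minimal sketch, assuming a distance distribution `F_q` is available as a Python callable (in practice it could be an estimate of the overall distribution $F$); the function names are illustrative only.

```python
def nn_distance_distribution(F_q, n):
    """Build G_q(x) = 1 - (1 - F_q(x))**n, the distribution of the
    nearest-neighbour distance in a data set of n objects."""
    def G(x):
        return 1.0 - (1.0 - F_q(x)) ** n
    return G

def pac_should_halt(G, d_current, eps, delta):
    """Halt when the probability that some object lies closer than
    d_current / (1 + eps) is at most delta."""
    return G(d_current / (1.0 + eps)) <= delta

# Toy distribution: distances uniformly spread over [0, 1].
def F_uniform(x):
    return min(max(x, 0.0), 1.0)

G = nn_distance_distribution(F_uniform, n=10)
print(pac_should_halt(G, d_current=0.02, eps=1.0, delta=0.1))
print(pac_should_halt(G, d_current=0.5, eps=1.0, delta=0.1))
```

A larger `eps` or `delta` makes the condition easier to satisfy, so the search stops earlier at the cost of weaker guarantees.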

SLIDE 28

Approximate Similarity Search

1. relative error approximation (pruning condition)
   Range and k-NN search queries
2. good fraction approximation (stop condition)
   k-NN search queries
3. small chance improvement approximation (stop condition)
   k-NN search queries
4. proximity-based approximation (pruning condition)
   Range and k-NN search queries
5. PAC nearest neighbor searching (pruning and stop conditions)
   1-NN search queries
6. performance trials

SLIDE 29

Comparison tests

Tests were run on a data set of 11,000 objects.

Objects are vectors of 45 dimensions.

We compared the five approximation approaches.

Range queries were tested on two methods: relative error and proximity.

Nearest-neighbour queries were tested on all methods.

SLIDE 30

Comparisons: range queries

[Figure: relative error approximation on range queries; axes R and IE; one curve per query radius r = 1,800, 2,200, 2,600 and 3,000]

SLIDE 31

Comparisons: range queries

[Figure: proximity-based approximation on range queries; axes R and IE; one curve per query radius r = 1,800, 2,200, 2,600 and 3,000]

SLIDE 32

Comparisons: NN queries

[Figure: relative error approximation on NN queries; axes EP and IE; one curve per k = 1, 3, 10 and 50]

SLIDE 33

Comparisons: NN queries

[Figure: good fraction approximation on NN queries; axes EP and IE; one curve per k = 1, 3, 10 and 50]

SLIDE 34

Comparisons: NN queries

[Figure: small chance improvement approximation on NN queries; axes EP and IE; one curve per k = 1, 3, 10 and 50]

SLIDE 35

Comparisons: NN queries

[Figure: proximity-based approximation on NN queries; axes EP and IE; one curve per k = 1, 3, 10 and 50]

SLIDE 36

Comparisons: NN queries

[Figure: PAC nearest neighbour searching on NN queries; axes EP and IE; one curve per eps = 2, 3 and 4]

SLIDE 37

Conclusions: Approximate similarity search in metric spaces

These techniques for approximate similarity search can be applied to generic metric spaces.

Vector spaces are a special case of metric spaces.

High accuracy of the approximate results is generally obtained together with a large improvement in efficiency.

The best performance is obtained with the good fraction approximation method.

The proximity-based method is slightly worse than good fraction approximation, but it can be used for both range queries and k-NN queries.