slide-1
SLIDE 1

Permutation Search Methods are Efficient, Yet Faster Search is Possible

Bileg (Bilegsaikhan) Naidan (1), Leo (Leonid) Boytsov (2), Eric Nyberg (2)

(1) Norwegian University of Science and Technology (NTNU)
(2) Carnegie Mellon University (CMU)

https://github.com/searchivarius/NonMetricSpaceLib

slide-5
SLIDE 5

Nearest-neighbor search (NN-search)

  • Input: A set of n objects and a distance function d(x, y)
  • Query: New object q and k
  • Task: Quickly find the k objects in the database most similar to q

[Figure: a query q and its k = 3 nearest neighbors, labeled 1, 2, 3]
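As a point of reference, the task above can be sketched as an exact brute-force scan. This is a minimal illustration, not code from the presentation: the names `euclidean` and `knn_brute_force` and the toy data are assumptions.

```python
import heapq
import math


def euclidean(x, y):
    """L2 distance between two equal-length sequences of numbers."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_brute_force(data, query, k, dist=euclidean):
    """Return the k (distance, index) pairs closest to the query, nearest first.

    Every object is compared with the query, so this costs n distance computations.
    """
    scored = ((dist(query, obj), i) for i, obj in enumerate(data))
    return heapq.nsmallest(k, scored)


# Toy database (hypothetical values) and a query with k = 3.
data = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.0), (9.0, 1.0)]
neighbors = knn_brute_force(data, query=(0.0, 0.1), k=3)
neighbor_ids = [i for _, i in neighbors]
```

Any distance function d(x, y) can be passed as `dist`; nothing in the scan relies on metric properties.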

1/ 17 4/9/15

slide-6
SLIDE 6

Distance function

Name              d(x, y)                               Symmetry   Triangle ineq.
Euclidean (L2)    $\sqrt{\sum_i (x_i - y_i)^2}$         yes        yes
Cosine distance   $1 - \frac{x \cdot y}{|x||y|}$        yes        no
KL-diverg.        $\sum_i x_i \log \frac{x_i}{y_i}$     no         no
JS-diverg.        symmetrized & smoothed KL-diverg.     yes        no

Distance functions can be metric or non-metric
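The non-metric cases can be checked numerically. The sketch below is illustrative only: the function names and test vectors are assumptions, and the JS-divergence uses the common definition via the midpoint distribution m = (x + y) / 2.

```python
import math


def kl_div(x, y):
    """KL-divergence sum_i x_i log(x_i / y_i); assumes y_i > 0 wherever x_i > 0."""
    return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)


def js_div(x, y):
    """JS-divergence: KL symmetrized and smoothed through the midpoint m."""
    m = [(a + b) / 2 for a, b in zip(x, y)]
    return 0.5 * kl_div(x, m) + 0.5 * kl_div(y, m)


def cosine_dist(x, y):
    """Cosine distance 1 - (x . y) / (|x| |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1 - dot / (norm_x * norm_y)


# Hypothetical probability vectors: KL is not symmetric, JS is.
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
kl_gap = abs(kl_div(p, q) - kl_div(q, p))
js_gap = abs(js_div(p, q) - js_div(q, p))

# Cosine distance violates the triangle inequality for these three vectors.
u, v, w = (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)
triangle_violated = cosine_dist(u, w) > cosine_dist(u, v) + cosine_dist(v, w)
```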

slide-9
SLIDE 9

How to find similar objects?

  • Brute-force
    • Exact search
    • Slow: n distance computations
  • Indexing
    • Exact search is mostly slow in high-dimensions and/or non-metric spaces: O(n) distance computations
    • Approximate search can be fast


slide-10
SLIDE 10

State-of-the-art approximate search methods

  • Locality-Sensitive Hashing (LSH)
  • VP-tree/ball-tree (data-dependent tuning)
  • Proximity graphs (kNN-graphs)
  • Permutation methods

slide-14
SLIDE 14

Why should we care about permutation methods?

  • Promising universal methods for non-metric spaces
  • Mapping data from “hard” spaces to “easy” spaces (the Euclidean space)
  • Database-friendly methods that are easy to implement on top of a database system or Lucene

slide-17
SLIDE 17

Research questions

  • How good are permutation-based projections?
  • How well do permutation methods fare against the state of the art?

slide-22
SLIDE 22

Permutation Methods

  • Filter-and-refine methods using pivot-based projection to the permutation space (L1 or L2)
  • Randomly select a set of reference points called pivots
  • Order pivots by their distances to data points to obtain pivot rankings, which we call permutations
  • Filter by comparing permutations to obtain candidate points
  • Refine by comparing candidate points to the query
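The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the Non-Metric Space Library implementation: the function names, the random 2-dim. data, the 16 pivots, and γ = 100 are all hypothetical choices, and the filter is a plain brute-force scan in the permutation space using the L1 (Footrule) distance.

```python
import math
import random


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def permutation(obj, pivots, dist):
    """Rank (1-based) of each pivot when pivots are ordered by distance to obj."""
    order = sorted(range(len(pivots)), key=lambda j: dist(obj, pivots[j]))
    ranks = [0] * len(pivots)
    for position, j in enumerate(order, start=1):
        ranks[j] = position
    return ranks


def footrule(p, q):
    """L1 distance between two permutations."""
    return sum(abs(a - b) for a, b in zip(p, q))


def permutation_search(data, perms, pivots, query, k, gamma, dist):
    query_perm = permutation(query, pivots, dist)
    # Filter: keep the gamma points whose permutations are closest to the query's.
    by_footrule = sorted(range(len(data)), key=lambda i: footrule(perms[i], query_perm))
    candidates = by_footrule[:gamma]
    # Refine: compare the candidates with the query using the original distance.
    candidates.sort(key=lambda i: dist(query, data[i]))
    return candidates[:k]


rng = random.Random(0)
data = [(rng.random(), rng.random()) for _ in range(1000)]
pivots = rng.sample(data, 16)  # randomly selected reference points
perms = [permutation(x, pivots, euclidean) for x in data]

query = (0.5, 0.5)
approx = permutation_search(data, perms, pivots, query, k=10, gamma=100, dist=euclidean)
exact = sorted(range(len(data)), key=lambda i: euclidean(query, data[i]))[:10]
recall = len(set(approx) & set(exact)) / len(exact)
```

Only γ original-distance computations are made per query in the refine step, plus one per pivot to build the query permutation, which is where the saving comes from when d(x, y) is expensive.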

slide-26
SLIDE 26

Permutation Methods

How do we carry out the filtering step?

  • Brute force searching
  • Indexing of permutations
  • Neighborhood APProximation Index (NAPP) is the best approach


slide-27
SLIDE 27

Experiments: Datasets

Name          Distance function   Number of points   Brute-force (sec.)   Dimens.

Metric Data
CoPhIR        L2                  5 · 10^6           0.6                  282
SIFT          L2                  5 · 10^6           0.3                  128
ImageNet      SQFD                1 · 10^6           4.1                  N/A

Non-Metric Data
Wiki-sparse   Cosine sim.         4 · 10^6           1.9                  10^5
Wiki-8        KL-div/JS-div       2 · 10^6           0.045/0.28           8
Wiki-128      KL-div/JS-div       2 · 10^6           0.22/4               128
DNA           Norm. Leven.        1 · 10^6           3.5                  N/A


slide-28
SLIDE 28

Experiments: Projection Quality

Distance in the original space vs. distance in the projected space. The closer to a monotonic mapping, the better:

[Figure: distance in the original space plotted against distance in the projected space]

Good projection (original distance: L2)


slide-29
SLIDE 29

Experiments: Projection Quality

Distance in the original space vs. distance in the projected space. The closer to a monotonic mapping, the better:

[Figure: distance in the original space plotted against distance in the projected space]

Bad projection (original distance: JS-div.)


slide-30
SLIDE 30

Experiments: Efficiency vs Accuracy

Improvement in efficiency over brute-force search vs. accuracy. Higher and to the right is better:

[Figure: SIFT (L2). Recall vs. improv. in efficiency (log. scale) for VP-tree, MPLSH, kNN-graph (SW), NAPP]


slide-31
SLIDE 31

Experiments: Efficiency vs Accuracy

Improvement in efficiency over brute-force search vs. accuracy. Higher and to the right is better:

[Figure: Norm. Levenshtein. Recall vs. improv. in efficiency (log. scale) for VP-tree, kNN-graph (NN-desc), brute-force filt. bin., NAPP]

slide-33
SLIDE 33

Conclusions

  • Permutation methods beat state-of-the-art methods (VP-trees, kNN-graphs and Multiprobe LSH) for some data sets, in particular, when the distance function is expensive
  • The quality of permutation-based projection can be both good and poor: it appears to be better when the space is metric and/or dimensionality is low


slide-34
SLIDE 34

Poster Session Discussion Points

What makes a good, amenable, non-metric space?


slide-35
SLIDE 35

Thank you for your attention!


slide-36
SLIDE 36

Some technical details

slide-38
SLIDE 38

Permutation Methods

The data points are a, b, c, d in 2-dim. Euclidean space (L2). The Voronoi diagram is produced by 4 pivots πi.

[Figure: Voronoi diagram of the pivots 1, 2, 3, 4 with the data points a, b, c, d]

Point   Pivot Order          Permutation
a       (π1, π2, π3, π4)     (1, 2, 3, 4)
b       (π1, π2, π4, π3)     (1, 2, 4, 3)
c       (π3, π1, π2, π4)     (2, 3, 1, 4)
d       (π4, π2, π1, π3)     (3, 2, 4, 1)

For d, the position of π4 is 1. The permutations of a and b are similar.

Permutation is a fancy word for a pivot ranking!
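The relation between the two table columns can be made explicit: entry i of a permutation is the position of pivot πi in the point's pivot order. A small sketch follows; the function name `pivot_order_to_permutation` is an assumption, and pivot identifiers are 1-based as on the slide.

```python
def pivot_order_to_permutation(pivot_order):
    """Convert a pivot order (pivot ids, closest first) into a permutation.

    Entry i of the result is the 1-based position of pivot i in the order.
    """
    perm = [0] * len(pivot_order)
    for position, pivot_id in enumerate(pivot_order, start=1):
        perm[pivot_id - 1] = position
    return tuple(perm)


# Pivot orders of the four example points.
pivot_orders = {
    "a": (1, 2, 3, 4),
    "b": (1, 2, 4, 3),
    "c": (3, 1, 2, 4),
    "d": (4, 2, 1, 3),
}
permutations = {p: pivot_order_to_permutation(o) for p, o in pivot_orders.items()}
```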

slide-39
SLIDE 39

Permutation Methods

  • Filtering step - compare permutations instead of the original data points to obtain γ candidate points
    • Footrule distance(x, y) = $\sum_i |x_i - y_i|$ (same as L1)
    • Spearman’s rho distance (same as L2)

[Figure: Voronoi diagram of the pivots 1, 2, 3, 4 with the data points a, b, c, d]

Point   Footrule(a, •)
b       |1 − 1| + |2 − 2| + |3 − 4| + |4 − 3| = 2   (candidate point)
c       |1 − 2| + |2 − 3| + |3 − 1| + |4 − 4| = 4   (candidate point)
d       |1 − 3| + |2 − 2| + |3 − 4| + |4 − 1| = 6

  • Refinement step - apply d(q, •) to the candidate points (in our example, γ = 2 and q = a, so d(q, b) and d(q, c) are computed)
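The worked example can be reproduced in a few lines. This is a sketch with assumed function names; Spearman's rho is taken here as the sum of squared position differences, i.e. without a square root.

```python
def footrule(x, y):
    """Sum of absolute position differences (L1 on permutations)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def spearman_rho(x, y):
    """Sum of squared position differences (assumed form, no square root)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


# Permutations of the four example points.
perms = {
    "a": (1, 2, 3, 4),
    "b": (1, 2, 4, 3),
    "c": (2, 3, 1, 4),
    "d": (3, 2, 4, 1),
}

query_perm = perms["a"]  # q = a
gamma = 2                # number of candidate points to keep

footrule_scores = {p: footrule(query_perm, perms[p]) for p in ("b", "c", "d")}
rho_scores = {p: spearman_rho(query_perm, perms[p]) for p in ("b", "c", "d")}

# Filtering step: the gamma points with the smallest Footrule distance.
candidates = sorted(footrule_scores, key=footrule_scores.get)[:gamma]
```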

slide-40
SLIDE 40

Permutation Methods

Filtering step:

  • Naive approach - Brute force searching
    • using a priority queue
    • incremental sorting [Gonzales 2008] (2× faster than the priority queue approach)
    • binarized permutations (select a threshold b and use the Hamming distance)
  • Brute force in the permutation space is efficient if the distance is expensive.
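Binarization can be sketched as follows. The convention used here (a bit is set when the pivot's position is at least the threshold b) and the function names are assumptions for illustration; the slide only states that a threshold is selected and the Hamming distance is used.

```python
def binarize(perm, threshold):
    """Pack a permutation into an integer bit mask.

    Bit i is set when the position of pivot i+1 is >= threshold (assumed convention).
    """
    bits = 0
    for i, position in enumerate(perm):
        if position >= threshold:
            bits |= 1 << i
    return bits


def hamming(x, y):
    """Number of differing bits between two bit masks."""
    return bin(x ^ y).count("1")


perms = {
    "a": (1, 2, 3, 4),
    "b": (1, 2, 4, 3),
    "c": (2, 3, 1, 4),
    "d": (3, 2, 4, 1),
}
b = 3  # threshold, a hypothetical choice for 4 pivots
masks = {p: binarize(perm, b) for p, perm in perms.items()}
hamming_to_a = {p: hamming(masks["a"], masks[p]) for p in ("b", "c", "d")}
```

Each permutation shrinks to one bit per pivot, and the distance becomes an XOR followed by a bit count, at the price of a coarser ordering: here c and d are no longer distinguished from each other.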

slide-41
SLIDE 41

Permutation Methods

To reduce the cost of the filtering stage, three types of indices were proposed:

  • use the existing methods for metric spaces [Figueroa 2009]

  • the Permutation Prefix Index (PP-Index) [Esuli 2009]
  • the Metric Inverted File (MI-file) [Amato et al. 2008]
slide-42
SLIDE 42

Permutation Methods

Permutation Prefix Index (PP-index) [Esuli 2009]

Point   Pivot Order
a       (π1, π2, π3, π4)
b       (π1, π2, π4, π3)
c       (π3, π1, π2, π4)
d       (π4, π2, π1, π3)

[Figure: prefix tree built over the pivot orders, with the points a, b, c, d at the leaves]
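A prefix tree over pivot orders can be sketched with nested dictionaries. This is a simplified illustration of the idea, not the PP-index as published: the prefix length of 2, the back-off rule (climb towards the root until enough candidates are found), and all names are assumptions.

```python
def build_prefix_tree(pivot_orders, prefix_len):
    """Insert every point under the first prefix_len pivots of its pivot order."""
    root = {}
    for point, order in pivot_orders.items():
        node = root
        for pivot_id in order[:prefix_len]:
            node = node.setdefault(pivot_id, {})
        node.setdefault("points", []).append(point)
    return root


def collect(node):
    """All points stored in the subtree rooted at node."""
    found = list(node.get("points", []))
    for key, child in node.items():
        if key != "points":
            found.extend(collect(child))
    return found


def prefix_candidates(root, query_order, prefix_len, min_candidates):
    """Follow the query prefix, then back off to shorter prefixes if needed."""
    path = [root]
    node = root
    for pivot_id in query_order[:prefix_len]:
        if pivot_id not in node:
            break
        node = node[pivot_id]
        path.append(node)
    for node in reversed(path):
        found = collect(node)
        if len(found) >= min_candidates:
            return found
    return collect(root)


pivot_orders = {
    "a": (1, 2, 3, 4),
    "b": (1, 2, 4, 3),
    "c": (3, 1, 2, 4),
    "d": (4, 2, 1, 3),
}
tree = build_prefix_tree(pivot_orders, prefix_len=2)
same_prefix = prefix_candidates(tree, (1, 2, 3, 4), prefix_len=2, min_candidates=2)
backed_off = prefix_candidates(tree, (1, 2, 3, 4), prefix_len=2, min_candidates=3)
```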

slide-43
SLIDE 43

Permutation Methods

Metric Inverted File (MI-file) [Amato et al. 2008]

Point   Pivot Order
a       (π1, π2, π3, π4)
b       (π1, π2, π4, π3)
c       (π3, π1, π2, π4)
d       (π4, π2, π1, π3)

Posting Lists
1 → (a, 1), (b, 1), (c, 2)
2 → (a, 2), (b, 2), (d, 2)
3 → (c, 1)
4 → (d, 1)
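The posting lists in this example can be rebuilt with a short sketch. It assumes that only the 2 closest pivots of each point are indexed, which is inferred from the lists shown rather than stated on the slide; the function and parameter names are hypothetical.

```python
from collections import defaultdict


def build_posting_lists(pivot_orders, num_indexed):
    """Map each pivot id to (point, position) pairs.

    Only the num_indexed closest pivots of every point are indexed.
    """
    postings = defaultdict(list)
    for point, order in pivot_orders.items():
        for position, pivot_id in enumerate(order[:num_indexed], start=1):
            postings[pivot_id].append((point, position))
    return dict(postings)


pivot_orders = {
    "a": (1, 2, 3, 4),
    "b": (1, 2, 4, 3),
    "c": (3, 1, 2, 4),
    "d": (4, 2, 1, 3),
}
postings = build_posting_lists(pivot_orders, num_indexed=2)
```

Because each entry keeps the pivot's position, a Footrule-style distance can still be estimated at query time from the posting lists alone.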

slide-44
SLIDE 44

Permutation Methods

Neighborhood Approximation Index (NAPP) [Tellez et al. 2013]

  • Simplified version of MI-file
  • Main differences:
    • Posting lists contain only object identifiers (no positions of pivots in permutations)
    • Not possible to compute the Footrule distance
    • The number of closest pivots shared with the query is used to sort candidate objects

slide-48
SLIDE 48

Indexing of permutations

Neighborhood APProximation index (NAPP) appears to be the best indexing approach:

  • Neighboring points should share some closest pivots
  • Index k closest pivots using an inverted file
  • Retrieve candidate points that share m ≤ k closest pivots with the query
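A minimal sketch of this scheme follows. It is an illustration under assumptions, not the Non-Metric Space Library code: `num_indexed` plays the role of k, `min_shared` the role of m (read here as "at least m shared pivots"), and candidates are ranked by the number of shared pivots.

```python
from collections import Counter, defaultdict


def build_napp_index(pivot_orders, num_indexed):
    """Inverted file: pivot id -> identifiers of points having it among their closest pivots."""
    index = defaultdict(list)
    for point, order in pivot_orders.items():
        for pivot_id in order[:num_indexed]:
            index[pivot_id].append(point)
    return index


def napp_candidates(index, query_order, num_indexed, min_shared):
    """Points sharing at least min_shared closest pivots with the query, best first."""
    shared = Counter()
    for pivot_id in query_order[:num_indexed]:
        for point in index.get(pivot_id, ()):
            shared[point] += 1
    ranked = sorted(shared.items(), key=lambda item: (-item[1], item[0]))
    return [point for point, count in ranked if count >= min_shared]


pivot_orders = {
    "a": (1, 2, 3, 4),
    "b": (1, 2, 4, 3),
    "c": (3, 1, 2, 4),
    "d": (4, 2, 1, 3),
}
index = build_napp_index(pivot_orders, num_indexed=2)

query_order = (1, 2, 4, 3)  # hypothetical query whose closest pivots are π1 and π2
strict = napp_candidates(index, query_order, num_indexed=2, min_shared=2)
loose = napp_candidates(index, query_order, num_indexed=2, min_shared=1)
```

Raising `min_shared` shrinks the candidate set and speeds up the refine step, usually at some cost in recall.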

slide-49
SLIDE 49

Experimental settings

  • Our program is written in C++ and compiled with GCC 4.8 using the option -Ofast
  • Linux Intel Xeon server (3.60 GHz, 32GB memory) in a single threaded mode using the Non-Metric Space Library
  • Quality measure - Recall
  • Performance measure - $\text{Improvement in Efficiency} = \frac{\text{time needed for brute force search}}{\text{time needed for approximate search}}$
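Both measures are simple ratios. The sketch below assumes the usual definition of recall as the fraction of true nearest neighbors that the approximate search returns; the function names and the numbers are hypothetical.

```python
def recall(approx_ids, exact_ids):
    """Fraction of the true nearest neighbors found by the approximate search."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


def improvement_in_efficiency(brute_force_time, approx_time):
    """Time needed for brute force search divided by time needed for approximate search."""
    return brute_force_time / approx_time


# Hypothetical query: 3 of the 4 true neighbors found, 8 times faster than brute force.
example_recall = recall([1, 2, 3, 9], [1, 2, 3, 4])
example_speedup = improvement_in_efficiency(brute_force_time=4.0, approx_time=0.5)
```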

slide-50
SLIDE 50

Experiments: Indexing time

Indexing time in minutes:

Methods (columns): VP-tree, NAPP, MPLSH, Brute-force filt., kNN graph
(each row lists only its non-empty cells, in column order)

SIFT: 0.4, 5, 18.4, 52.2
ImageNet: 4.4, 33, 32.3, 127.6
Wiki-sparse: 7.9, 231.2
Wiki-128: 1.2, 36.6, 36.1
DNA: 0.9, 15.9, 15.6, 88

slide-51
SLIDE 51

Experiments: Efficiency vs Accuracy

Improvement in efficiency over brute-force search vs. accuracy. Higher and to the right is better:

[Figure: SIFT (L2). Recall vs. improv. in efficiency (log. scale) for VP-tree, MPLSH, kNN-graph (SW), NAPP]

[Figure: ImageNet (SQFD). Recall vs. improv. in efficiency (log. scale) for VP-tree, brute-force filt., kNN-graph (SW), NAPP]

  • NAPP beats MPLSH & VP-tree for SIFT, as well as VP-tree for Wiki-128
  • kNN graph is the best for SIFT, Wiki-128, and ImageNet
slide-52
SLIDE 52

Experiments: Efficiency vs Accuracy

Improvement in efficiency over brute-force search vs. accuracy. Higher and to the right is better:

[Figure: Wiki-sparse (cosine dist.). Recall vs. improv. in efficiency (log. scale) for kNN-graph (SW), NAPP]

[Figure: Norm. Levenshtein. Recall vs. improv. in efficiency (log. scale) for VP-tree, kNN-graph (NN-desc), brute-force filt. bin., NAPP]

  • kNN graph is the best for Wiki-sparse
  • Brute force filtering beats all methods including kNN graphs for Norm. Levenshtein

slide-54
SLIDE 54

Some Applications

NN-search is a core primitive in machine learning, vision and language processing.

  • Query by image content
  • Classification
  • Entity detection
  • Spell-checking