

slide-1
SLIDE 1

PP-Index: Using Permutation Prefixes for Efficient and Scalable Approximate Similarity Search

Andrea Esuli

andrea.esuli@isti.cnr.it Istituto di Scienza e Tecnologie dell’Informazione “A. Faedo” Consiglio Nazionale delle Ricerche Via Giuseppe Moruzzi, 1 — 56124 Pisa, Italy

ISTI:Science seminar, May 12, 2009

Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 1 / 48

slide-2
SLIDE 2

Outline

1. Introduction
2. The PP-Index
3. Experiments
4. Demo
5. Conclusions


slide-4
SLIDE 4

Introduction Similarity search

Outline

1. Introduction
  • Similarity search
  • Permutation based methods
  • Local similarity hashing methods
2. The PP-Index
3. Experiments
4. Demo
5. Conclusions

slide-5
SLIDE 5

Introduction Similarity search

Similarity search

The similarity search model involves:

  • a collection of objects D, belonging to a domain O;
  • a query object q ∈ O;
  • a distance function d : O × O → R+.

The goal is to sort the objects in D by their distance from q and return those closest to q, which are considered the most similar. Typically only the top k ranked objects are returned (k-NN query), or those within a maximum distance r from q (range query). Choosing a meaningful value of r is often difficult. k-NN queries are usually preferred, especially in end-user applications, also because they give direct control over the size of the result set.
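As a point of reference for the two query types, here is a minimal brute-force sketch in Python. The function names, the toy data set, and the choice of the Euclidean distance are assumptions made for illustration; they do not come from the slides.

```python
import heapq
import math


def euclidean(a, b):
    """L2 distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def range_query(D, q, r, d=euclidean):
    """Return (distance, index) pairs of all objects within distance r of q."""
    hits = [(d(q, o), i) for i, o in enumerate(D)]
    return sorted(h for h in hits if h[0] <= r)


def knn_query(D, q, k, d=euclidean):
    """Return the k (distance, index) pairs closest to q, nearest first."""
    return heapq.nsmallest(k, ((d(q, o), i) for i, o in enumerate(D)))


# Hypothetical toy collection in the plane.
D = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
q = (0.0, 0.5)
in_range = range_query(D, q, 1.5)
nearest = knn_query(D, q, 2)
```

The range query returns however many objects fall inside the radius, while the k-NN query always returns k objects, which is the direct control over the result set size mentioned above.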

slide-6
SLIDE 6

Introduction Similarity search

Similarity search

Example (R2, L2):

Figure 1: Range query.

Figure 2: k-NN query (k = 5).

slide-7
SLIDE 7

Introduction Similarity search

Approximate similarity search

Exhaustive search: for every oi ∈ D compute the distance d(q, oi), keeping track of the objects that satisfy the query.

  • It does not scale to large collections.

Exact methods: return the same results as exhaustive search, but use data structures that exploit the properties of the similarity space (e.g., vector spaces, metric spaces) to reduce the number of objects of D that must be compared with the query.

  • Usually efficient, but still not enough for huge collections.

Approximate methods: gain efficiency by accepting that the results may contain errors (e.g., d(q, o1) < d(q, o2), yet o2 is in the results and o1 is not).

  • Approximation is acceptable, e.g., when d is itself an approximation of a complex, human-perceived concept of similarity.
  • It (obviously) scales!
  • Typically derived from “relaxed” exact methods.
  • Natively approximate proposals include the local similarity hashing (LSH) index and the permutation-based index (the PP-Index takes inspiration from both).

slide-8
SLIDE 8

Introduction Similarity search

Approximate similarity search

Approximation quality:

  • What have we missed?
  • What have we included?
  • How much have we saved?

Figure 3: Approximate result for a k-NN query (k = 5).


slide-10
SLIDE 10

Introduction Permutation based methods

Permutation based methods

Independently proposed by Amato and Savino [1] and Chavez et al. [2], using different data structures.

The idea: an object is represented by its view of the surrounding world. Intuitively, if two objects “see” the elements of a set of reference objects R in the same order of (increasing) distance, they are likely to be close to each other.

Example: where am I likely to live if I see the main European cities in the following order? Rome, Milan, Bern, Marseilles, Munich, Luxembourg, Bonn, Vienna, Belgrade, Brussels, Barcelona, Paris, Berlin, Amsterdam, London, Copenhagen, Madrid, Istanbul, Dublin, Athens, Oslo, Stockholm, Lisbon, Helsinki.

[1] G. Amato and P. Savino, Approximate similarity search in metric spaces using inverted files, INFOSCALE 2008, pages 1-10.

[2] E. Chavez, K. Figueroa, and G. Navarro, Effective proximity retrieval by ordering permutations, IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(9):1647-1658, 2008.

slide-11
SLIDE 11

Introduction Permutation based methods

Permutation based methods

The method:

  • A set of reference objects R = {r0, . . . , r|R|−1} ⊂ O is defined (e.g., by randomly selecting |R| objects from D).
  • Every object oi ∈ D is then represented by a permutation Πoi of 0, . . . , |R| − 1, i.e., the list of the identifiers of the reference objects, sorted by the distance of the corresponding reference objects from oi.
  • The search process mainly consists of computing Πq and estimating the true distance d(q, oi) with a permutation-based distance d′(Πq, Πoi), e.g., the Spearman footrule distance.
  • Amato and Savino have shown that using only the length-l prefix Π^l_oi of the permutation Πoi (e.g., l = 100 when |R| = 500) improves both efficiency and effectiveness.

The PP-Index adopts a permutation-based data representation model, using very short prefixes (e.g., l = 6 when |R| = 1000). Unlike previous approaches, it uses the permutation prefixes only to quickly find a small set of candidate objects from D for inclusion in the results, not to estimate their relative order.
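The representation described above can be sketched in a few lines of Python. This is an illustrative sketch, not the author's implementation: the function names, the tie-breaking rule (by identifier), and the 1-D toy data are all assumptions.

```python
def permutation(o, R, d):
    """Identifiers of the reference objects, sorted by increasing distance from o.

    Ties are broken by identifier (an assumption; the slides do not say how).
    """
    return sorted(range(len(R)), key=lambda j: (d(o, R[j]), j))


def prefix(o, R, d, l):
    """Permutation prefix of length l."""
    return tuple(permutation(o, R, d)[:l])


def footrule(pa, pb):
    """Spearman footrule: sum of the position differences of each identifier."""
    pos_b = {ident: rank for rank, ident in enumerate(pb)}
    return sum(abs(rank - pos_b[ident]) for rank, ident in enumerate(pa))


# Hypothetical toy data: four reference points on a line, absolute difference.
R = [0.0, 10.0, 4.0, 7.0]
d = lambda a, b: abs(a - b)

near_a = permutation(3.0, R, d)
near_b = permutation(4.5, R, d)
far = permutation(9.0, R, d)
```

In this toy case the two nearby objects (3.0 and 4.5) have permutations at footrule distance 2, while the distant object (9.0) is at footrule distance 8 from the first one, which matches the intuition that similar views of R suggest nearby objects.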

slide-12
SLIDE 12

Introduction Permutation based methods

Permutation based methods

Figure 4: Regions of the 2-dimensional space identified by 6 randomly selected reference points, using the Euclidean distance, and full-length permutations (left) or permutation prefixes of length 3 (right).


slide-14
SLIDE 14

Introduction Local similarity hashing methods

Local similarity hashing methods

A family H of hash functions h : O → U is called (r, ε, p1, p2)-sensitive, with r, ε > 0 and p1 > p2 > 0, if for any p, q ∈ O:

  • if d(p, q) ≤ r then P[h(p) = h(q)] ≥ p1
  • if d(p, q) > r(1 + ε) then P[h(p) = h(q)] ≤ p2

for any function h randomly selected from H.

Intuitively: two objects collide with a (high) probability x1 ≥ p1 if they are closer than r, and with a (low) probability x2 ≤ p2 if they are farther apart than r(1 + ε).

LSH-Index [3]:

  • j randomly chosen functions hi ∈ H define a hash function g(x) = (h1(x), h2(x), . . . , hj(x)), i.e., the bad collision probability is significantly lowered, to p2^j.
  • t different hash tables are built, based on randomly generated functions g1 . . . gt, in order to increase the good collision probability.

[3] P. Indyk and R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, STOC 1998, pages 604-613.
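The effect of the two LSH-Index parameters can be checked with a little arithmetic. The sketch below assumes the j functions and the t tables are chosen independently, and uses hypothetical values p1 = 0.9, p2 = 0.5, j = 10, t = 20 that do not appear in the slides.

```python
def concat_collision(p, j):
    """Probability that two objects agree on all j concatenated hash values."""
    return p ** j


def any_table_collision(p, j, t):
    """Probability that two objects collide in at least one of t tables."""
    return 1.0 - (1.0 - p ** j) ** t


# Hypothetical parameters, for illustration only.
p1, p2, j, t = 0.9, 0.5, 10, 20
good_one_table = concat_collision(p1, j)
bad_one_table = concat_collision(p2, j)
good_any_table = any_table_collision(p1, j, t)
bad_any_table = any_table_collision(p2, j, t)
```

Concatenation drives the bad collision probability down to p2^j, but it also lowers the good one to p1^j. Repeating the construction over t tables brings the good collision probability back up while the bad one stays small.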

slide-15
SLIDE 15

Introduction Local similarity hashing methods

Local similarity hashing methods

It is hard to tune the LSH-Index (i.e., the length of its hash keys) to obtain good effectiveness, because the appropriate hash length depends on the data distribution.

LSH-Forest [4]:

  • Uses variable-length hash keys.
  • Long hash keys are indexed in a prefix tree (LSH-Tree).
  • At search time the key length is varied in order to retrieve a given number of candidate objects.
  • Candidate objects are retrieved sequentially from a data storage on disk.
  • Multiple LSH-Trees, i.e., a forest, are used to improve effectiveness.

The PP-Index uses a similar data structure.

[4] M. Bawa, T. Condie, and P. Ganesan, LSH-Forest: self-tuning indexes for similarity search, WWW 2005, pages 651-660.


slide-17
SLIDE 17

The PP-Index Data structures

Outline

1. Introduction
2. The PP-Index
  • Data structures
  • Algorithms
3. Experiments
4. Demo
5. Conclusions

slide-18
SLIDE 18

The PP-Index Data structures

PP-Index: data structures

The PP-Index represents each indexed object with a permutation prefix of length l.

Data structures:

  • a prefix tree kept in main memory, indexing the permutation prefixes;
  • a data storage kept on disk, storing the information required to compute the real distance between the objects in D and any object in O.

The prefix tree is used to rapidly identify a set of at least z candidates (z ≥ k), leaving to the original distance function the task of determining the final k-NN result from that set. Candidates are retrieved from the data storage with a few sequential disk accesses.

The PP-Index adopts a bulk data processing model, similar to the one used for text-based inverted list indexes (it assumes that the data is static). Update capabilities (i.e., insert, delete, modify) are nevertheless easy to provide.


slide-20
SLIDE 20

The PP-Index Algorithms

PP-Index: building the index

BuildIndex(D, d, R, l)
  prefixTree ← EmptyPrefixTree()
  dataStorage ← EmptyDataStorage()
  for i ← 0 to |D| − 1
    do oi ← GetObject(D, i)
       dataBlockoi ← GetDataBlock(oi)
       poi ← Append(dataBlockoi, dataStorage)
       woi ← ComputePrefix(oi, R, d, l)
       hoi ← i
       Insert(woi, hoi, poi, prefixTree)
  L ← ListPointersByOrderedVisit(prefixTree)
  P ← CreateInvertedList(L)
  ReorderStorage(dataStorage, P)
  CorrectLeafValues(prefixTree, dataStorage)
  index ← NewIndex(d, R, l, prefixTree, dataStorage)
  return index

Figure 5: The BuildIndex function.

slide-21
SLIDE 21

The PP-Index Algorithms

PP-Index: building the index

Input: dataset D, distance function d, reference objects R, prefix length l.

Indexing process:

  • Main loop: permutation prefixes are inserted into the prefix tree, data blocks are appended to the data storage.
  • Data storage reordering: data blocks are sorted to reflect the order of the prefixes.
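The two phases of the indexing process can be sketched entirely in memory. This is a simplified illustration under several assumptions: a Python dict keyed by prefix stands in for the leaves of the prefix tree, a list stands in for the on-disk data storage, a plain sort replaces the external reordering, and the 1-D data are hypothetical.

```python
def build_index(D, d, R, l):
    """Return (leaves, storage).

    leaves maps each permutation prefix to the (start, end) interval of
    storage positions holding the objects with that prefix.
    """
    def compute_prefix(o):
        order = sorted(range(len(R)), key=lambda j: (d(o, R[j]), j))
        return tuple(order[:l])

    # Main loop: one (prefix, object id, data block) entry per object.
    entries = [(compute_prefix(o), i, o) for i, o in enumerate(D)]
    # Reordering: sort the blocks so that equal prefixes become contiguous.
    entries.sort(key=lambda e: (e[0], e[1]))
    storage = [(i, o) for _, i, o in entries]
    leaves = {}
    for pos, (w, _, _) in enumerate(entries):
        start, _ = leaves.get(w, (pos, pos))
        leaves[w] = (start, pos)
    return leaves, storage


# Hypothetical toy data: points on a line, absolute difference as distance.
R = [0.0, 10.0, 4.0, 7.0]
D = [3.0, 9.0, 3.2, 8.0, 4.5]
leaves, storage = build_index(D, lambda a, b: abs(a - b), R, 2)
```

After the reordering, the objects 3.0 and 3.2, which share the prefix (2, 0), sit next to each other in the storage, so a single leaf can address both with one interval.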

slide-22
SLIDE 22

The PP-Index Algorithms

PP-Index: building the index

Permutation prefixes (index characteristics: |D| = 10, |R| = 6, l = 3):

  o0  w = <1, 3, 2>      o1  w = <2, 3, 0>
  o2  w = <5, 2, 3>      o3  w = <4, 1, 3>
  o4  w = <1, 3, 2>      o5  w = <4, 1, 3>
  o6  w = <1, 3, 4>      o7  w = <5, 2, 3>
  o8  w = <1, 3, 2>      o9  w = <4, 3, 5>

Figure 6: Sample data.

Figure 7: Index data structure after the first phase of object insertion (prefix tree in main memory, data storage in secondary memory).

Main loop: permutation prefixes are inserted into the prefix tree, data blocks are appended to the data storage.

slide-23
SLIDE 23

The PP-Index Algorithms

PP-Index: building the index

Figure 8: Index data structure after the first phase of object insertion.

Figure 9: Index data structure after the data storage reordering (prefix tree leaves with start and end values pointing into the reordered data storage).

Data storage reordering: data blocks are sorted to reflect the order of the prefixes. The leaves of the final prefix tree point to intervals of the data storage. Efficiency alert: the reordering is performed using an m-way merge sort algorithm.
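As a worked example of the reordering step, the sketch below groups ten sample prefixes into storage intervals. It assumes the sample data of Figure 6 read as o0 = <1,3,2>, o1 = <2,3,0>, o2 = <5,2,3>, o3 = <4,1,3>, o4 = <1,3,2>, o5 = <4,1,3>, o6 = <1,3,4>, o7 = <5,2,3>, o8 = <1,3,2>, o9 = <4,3,5>, and it assumes an ascending lexicographic order of the prefixes, which the slides do not state.

```python
from itertools import groupby

# Sample prefixes as read from Figure 6 (an assumption, see the text above).
sample = {
    0: (1, 3, 2), 1: (2, 3, 0), 2: (5, 2, 3), 3: (4, 1, 3), 4: (1, 3, 2),
    5: (4, 1, 3), 6: (1, 3, 4), 7: (5, 2, 3), 8: (1, 3, 2), 9: (4, 3, 5),
}

# Storage order after reordering: object ids sorted by prefix, then by id.
order = sorted(sample, key=lambda i: (sample[i], i))

# Each distinct prefix becomes a leaf pointing to a contiguous interval.
intervals = {}
pos = 0
for w, group in groupby(order, key=lambda i: sample[i]):
    n = len(list(group))
    intervals[w] = (pos, pos + n - 1)
    pos += n
```

Six distinct prefixes yield six leaves. The prefix <1, 3, 2> is shared by three objects and therefore covers an interval of three consecutive data blocks.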

slide-24
SLIDE 24

The PP-Index Algorithms

PP-Index: search function

FindCandidates(q, prefixTree, R, d, l, z)
  wq ← ComputePrefix(q, R, d, l)
  for i ← l to 1
    do w_q^i ← SubPrefix(wq, i)
       node ← SearchPath(w_q^i, prefixTree)
       if node ≠ nil
         then minLeaf ← GetMin(node, prefixTree)
              maxLeaf ← GetMax(node, prefixTree)
              if (maxLeaf.hend − minLeaf.hstart + 1) ≥ z ∨ i = 1
                then return (minLeaf.pstart, maxLeaf.pend)
  return (0, 0)

Figure 10: The FindCandidates function.

Given the prefix representing the query, FindCandidates searches for the smallest subtree of the prefix tree pointing to at least z data blocks; z′ denotes the number of data blocks that subtree actually points to.
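The backoff from the full query prefix to shorter sub-prefixes can be sketched without an explicit tree. The version below is an approximation under stated assumptions: the prefix of every data block is kept in a lexicographically sorted list (one entry per block, in storage order), binary search replaces the tree walk, and None is returned where the pseudocode returns (0, 0). The ten stored prefixes are illustrative.

```python
from bisect import bisect_left, bisect_right


def find_candidates(wq, stored, z):
    """Return the (first, last) storage positions of the candidates.

    stored holds the prefix of every data block, in sorted storage order.
    Returns None when no stored prefix shares even the first element of wq.
    """
    for i in range(len(wq), 0, -1):
        sub = tuple(wq[:i])
        first = bisect_left(stored, sub)
        # Every stored prefix starting with sub sorts before sub + (inf,).
        last = bisect_right(stored, sub + (float("inf"),)) - 1
        if last >= first and (last - first + 1 >= z or i == 1):
            return first, last
    return None


# Illustrative stored prefixes, one per data block.
stored = [
    (1, 3, 2), (1, 3, 2), (1, 3, 2), (1, 3, 4), (2, 3, 0),
    (4, 1, 3), (4, 1, 3), (4, 3, 5), (5, 2, 3), (5, 2, 3),
]
exact_match = find_candidates((4, 1, 3), stored, 2)
widened = find_candidates((4, 1, 3), stored, 3)
```

With z = 2 the full prefix already covers enough blocks. With z = 3 the search backs off to the sub-prefix of length 1 and returns the wider interval that also includes the block with prefix (4, 3, 5).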

slide-25
SLIDE 25

The PP-Index Algorithms

PP-Index: search function

Search(q, k, z, index)
  (pstart, pend) ← FindCandidates(q, index.prefixTree, index.R, index.d, index.l, z)
  resultsHeap ← EmptyHeap()
  cursor ← pstart
  while cursor ≤ pend
    do dataBlock ← Read(cursor, index.dataStorage)
       AdvanceCursor(cursor)
       distance ← index.d(q, dataBlock.data)
       if resultsHeap.size < k
         then Insert(resultsHeap, distance, dataBlock.id)
         else if distance < resultsHeap.top.distance
           then ReplaceTop(resultsHeap, distance, dataBlock.id)
  Sort(resultsHeap)
  return resultsHeap

Figure 11: The Search function.

The z′ candidate data blocks are sequentially read from the data storage. A heap is used to keep track of the best k results.
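The heap-based scan of the candidate interval can be sketched with Python's `heapq`. Since `heapq` is a min-heap, this sketch stores negated distances so that the worst of the current k results is always at the top. The in-memory list standing in for the data storage and the toy values are assumptions.

```python
import heapq


def search(q, k, interval, storage, d):
    """Scan storage[start..end] and return the k best (distance, id) pairs."""
    start, end = interval
    heap = []  # max-heap via negated distances: heap[0] is the worst kept result
    for pos in range(start, end + 1):
        obj_id, data = storage[pos]
        dist = d(q, data)
        if len(heap) < k:
            heapq.heappush(heap, (-dist, obj_id))
        elif dist < -heap[0][0]:
            heapq.heapreplace(heap, (-dist, obj_id))
    return sorted((-neg, obj_id) for neg, obj_id in heap)


# Hypothetical storage of (object id, data) blocks and a 1-D distance.
storage = [(7, 90), (3, 30), (5, 34), (1, 45), (9, 80)]
d = lambda a, b: abs(a - b)
top2 = search(33, 2, (1, 3), storage, d)
```

Only the blocks inside the candidate interval are read, each exactly once and in storage order, which corresponds to the sequential disk access pattern described above.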

slide-26
SLIDE 26

The PP-Index Algorithms

PP-Index: improving the search effectiveness

The basic search strategy is designed for efficiency. Effectiveness can be boosted using various search strategies:

Multiple index: building n PP-Indexes using different R sets, R1 . . . Rn (LSH-Forest style).

  • Projecting different R-induced “grids” on the objects helps to approximate a better (less skewed) partitioning of the space.
  • Can be implemented using data replication (faster, more storage) or using data referencing (slower, less storage).
  • The k-NN results from the various indexes are merged together into the final one.

Multiple query: generating m perturbed versions of wq in order to explore the neighborhood of wq.

  • The perturbed prefixes w_q^i are generated by swapping pairs of elements of wq, selecting first the pairs with the smallest difference in distance from q.
  • All the w_q^i prefixes are used to find candidates on the same index.
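One possible reading of the multiple query strategy is sketched below. The slides do not specify which pairs may be swapped, so this sketch assumes that only adjacent elements of wq are swapped and that each perturbed prefix differs from wq by a single swap; the function name and toy data are also hypothetical.

```python
def perturbed_prefixes(q, R, d, l, m):
    """Return wq followed by up to m - 1 single-swap perturbations of it."""
    order = sorted(range(len(R)), key=lambda j: (d(q, R[j]), j))
    wq = order[:l]
    # Rank adjacent pairs by how similar their distances from q are.
    pairs = sorted(
        range(l - 1),
        key=lambda p: abs(d(q, R[wq[p]]) - d(q, R[wq[p + 1]])),
    )
    variants = [tuple(wq)]
    for p in pairs[:m - 1]:
        w = list(wq)
        w[p], w[p + 1] = w[p + 1], w[p]
        variants.append(tuple(w))
    return variants


# Hypothetical toy data: reference points on a line.
R = [0.0, 10.0, 4.0, 7.0]
d = lambda a, b: abs(a - b)
variants = perturbed_prefixes(3.0, R, d, 3, 3)
```

The pair whose two reference objects are almost equally distant from q is swapped first, since a small change in q would be enough to invert their order.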

slide-27
SLIDE 27

The PP-Index Algorithms

PP-Index: prefix tree optimizations

Figure 12: Pruning of only-child paths to leaves.

Figure 13: Only-child paths compression.

Scalability alert: any subtree pointing to fewer than z data blocks can be reduced to a single leaf.

  • Applicable when z is hardcoded into the search function.
  • Does not affect the quality of search results.
  • Lossy with respect to index update operations.
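The reduction of small subtrees to single leaves can be sketched by counting how many data blocks lie under each sub-prefix. In this illustrative sketch (function name and data are assumptions), each stored prefix is truncated at the first sub-prefix whose subtree holds fewer than z blocks.

```python
from collections import Counter


def compress(prefixes, z):
    """Return the set of leaf keys of the compressed prefix tree."""
    # Number of data blocks under every node (sub-prefix) of the tree.
    counts = Counter(w[:i] for w in prefixes for i in range(1, len(w) + 1))
    leaves = set()
    for w in prefixes:
        for i in range(1, len(w) + 1):
            # Stop at the first node with fewer than z blocks, or at full depth.
            if counts[w[:i]] < z or i == len(w):
                leaves.add(w[:i])
                break
    return leaves


# Illustrative stored prefixes, one per data block.
stored = [
    (1, 3, 2), (1, 3, 2), (1, 3, 2), (1, 3, 4), (2, 3, 0),
    (4, 1, 3), (4, 1, 3), (4, 3, 5), (5, 2, 3), (5, 2, 3),
]
leaves = compress(stored, 3)
```

With z = 3, the single block under the first element 2 and the two blocks under 5 each collapse to a leaf at depth 1, while the well-populated branch under (1, 3) keeps its full depth.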

slide-28
SLIDE 28

The PP-Index Algorithms

PP-Index: merging (and updating) the index

Scalability alert: the index (prefix tree) reaches its maximum memory requirement at the end of the main loop of the indexing process. At that point it may not fit into memory, even though the final index (after optimizations) would.

Strategy: build many smaller indexes, using the same R set, then merge them together.

The merge process is efficient:

  • The source prefix trees are merged into the final prefix tree by performing a parallel ordered visit on them.
  • Data storages are merged into the final data storage while building the final prefix tree.
  • Can be done from disk to disk, with minimal memory occupation.
  • Linear cost with respect to index size (if not done in an m-way style).
  • Uses only sequential reads and writes.

Update operations can be supported by recording them in a small all-in-memory index and performing periodic merge operations.
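Because every partial index keeps its data blocks sorted by prefix, merging reduces to a standard sorted merge. The sketch below uses `heapq.merge` over in-memory lists as a stand-in for the disk-to-disk process; the entry layout (prefix, object id, block) and the sample entries are assumptions.

```python
import heapq


def merge_indexes(*storages):
    """Merge storages whose entries (prefix, obj_id, block) are sorted by prefix.

    Returns (leaves, storage) for the merged index; leaves maps each prefix
    to its (start, end) interval in the merged storage.
    """
    merged = heapq.merge(*storages, key=lambda e: (e[0], e[1]))
    storage, leaves = [], {}
    for pos, (w, obj_id, block) in enumerate(merged):
        storage.append((obj_id, block))
        start, _ = leaves.get(w, (pos, pos))
        leaves[w] = (start, pos)
    return leaves, storage


# Two hypothetical partial indexes built with the same R set.
a = [((1, 3), 0, "a0"), ((2, 0), 2, "a2")]
b = [((1, 3), 5, "b5"), ((3, 1), 4, "b4")]
leaves, storage = merge_indexes(a, b)
```

`heapq.merge` consumes its inputs lazily and in order, so each source is read once from start to end, mirroring the sequential reads and writes listed above.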


slide-30
SLIDE 30

Experiments The CoPhIR collection

Outline

1. Introduction
2. The PP-Index
3. Experiments
  • The CoPhIR collection
  • Evaluation measures
  • Results
4. Demo
5. Conclusions

slide-31
SLIDE 31

Experiments The CoPhIR collection

The CoPhIR collection

The CoPhIR collection [5] consists of a crawl of 106 million images from the Flickr photo sharing website.

  • Textual data + five MPEG-7 visual descriptors (240 GB of XML description data).
  • Visual similarity measure: linear combination of distance functions defined on the MPEG-7 descriptors.

MPEG-7 Visual Descriptor   Distance type   Dimension   Weight
Scalable Color             L1              64          2
Color Structure            L1              64          3
Color Layout               sum of L2       80          2
Edge Histogram             L1              62          4
Homogeneous Texture        L1              12          0.5

Table 1: Details on the five MPEG-7 visual descriptors used in CoPhIR, and the weights used in the linear combination. The “Dimension” column refers to the specific dimension for visual descriptors adopted by the CoPhIR data set.

Experiments were run on 1, 10, and 100 million images, using 100 randomly selected images as queries (excluded from the indexes).

[5] http://cophir.isti.cnr.it/ http://www.saphir.eu/ http://www.flickr.com/
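The linear combination of per-descriptor distances can be sketched generically. The example below is a simplification: it uses only two of the five descriptors, takes their weights (2 and 4) from Table 1, applies L1 to both, and uses tiny made-up vectors instead of real MPEG-7 data. The dictionary keys and function names are hypothetical.

```python
def l1(x, y):
    """L1 (Manhattan) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def combined_distance(a, b, parts):
    """Weighted sum of per-descriptor distances.

    parts is a list of (descriptor name, distance function, weight).
    """
    return sum(weight * dist(a[name], b[name]) for name, dist, weight in parts)


# Two of the five descriptors, with the weights listed in Table 1.
parts = [("scalable_color", l1, 2), ("edge_histogram", l1, 4)]

# Made-up, very short descriptor vectors for two images.
img_a = {"scalable_color": [1, 2], "edge_histogram": [0, 0]}
img_b = {"scalable_color": [2, 4], "edge_histogram": [1, 0]}
dist_ab = combined_distance(img_a, img_b, parts)
```

Passing the distance function per descriptor leaves room for the Color Layout descriptor, which Table 1 lists with a different distance type (sum of L2) than the others.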


slide-33
SLIDE 33

Experiments Evaluation measures

Evaluation measures

Effectiveness measures:

  • Recall (ranking-based [6]):

$$\mathrm{Recall}(k) = \frac{|D_q^k \cap P_q^k|}{k} \qquad (1)$$

  • Relative Distance Error (distance-based):

$$\mathrm{RDE}(k) = \frac{1}{k} \sum_{i=1}^{k} \left( \frac{d(q, P_q^k(i))}{d(q, D_q^k(i))} - 1 \right) \qquad (2)$$

where $D_q^k$ is the list of the k elements of D closest to q, sorted by their distance from q, and $P_q^k$ is the list returned by the algorithm.

Efficiency measures:

  • index time;
  • index size (RAM, disk);
  • number of candidates retrieved from disk (z′);
  • average search time.

[6] M. Patella and P. Ciaccia, The many facets of approximate similarity search, SISAP 2008, pages 10-21.
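The two effectiveness measures translate directly into code. In this sketch the exact and approximate results are passed as rank-ordered lists of objects, and the exact distances are assumed to be non-zero (plausible when the queries are excluded from the index); the toy 1-D values are made up.

```python
def recall(exact, approx):
    """Fraction of the exact k nearest neighbors found in the approximate result."""
    return len(set(exact) & set(approx)) / len(exact)


def rde(q, exact, approx, d):
    """Relative Distance Error: mean relative excess distance, rank by rank.

    Assumes d(q, e) > 0 for every exact result e.
    """
    k = len(exact)
    return sum(d(q, p) / d(q, e) - 1 for p, e in zip(approx, exact)) / k


# Toy 1-D example: the approximate result misses the second true neighbor.
d = lambda a, b: abs(a - b)
q = 0.0
exact = [1.0, 2.0]
approx = [1.0, 4.0]
r = recall(exact, approx)
e = rde(q, exact, approx, d)
```

Recall only counts how many true neighbors were found, while RDE also reflects how far the substitutes are: here the object returned at rank 2 is twice as far as the true one, which gives an error of 0.5 averaged over k = 2.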


slide-35
SLIDE 35

Experiments Results

Results

Figure 14: Indexing time w.r.t. the size of R and the data set size (x axis: |R|; y axis: indexing time in seconds; series: 1M, 10M, 100M).

|D|    indexing time (sec)   prefix tree size (full)   prefix tree size (comp.)   data storage   l′
1M     419                   7.7 MB                    91 kB                      349 MB         2.1
10M    4385                  53.8 MB                   848 kB                     3.4 GB         2.7
100M   45664                 354.5 MB                  6.5 MB                     34 GB          3.5

Table 2: Indexing times (with |R| = 100), resulting index sizes, and average prefix tree depth l′ (after prefix tree compression with z = 1,000), for the various data set sizes.

slide-36
SLIDE 36

Experiments Results

Results

Figure 15: Search time w.r.t. the size of R and the data set size (x axis: |R|; y axis: search time in seconds; series: 1M, 10M, 100M). Search performed with z = 1,000 and k = 100 (single index, single query).

|R|     |D| = 1M   |D| = 10M   |D| = 100M
100     4,075      5,817       7,941
200     3,320      5,571       7,302
500     1,803      5,065       6,853
1,000   1,091      4,748       6,644

Table 3: Average z′ value (z = 1,000), i.e., average number of retrieved candidate objects for a query, with respect to the size of the reference objects set and the data set size.

slide-37
SLIDE 37

Experiments Results

Results

Figure 16: Effectiveness (Recall(k) and RDE(k)) with respect to the size of the R set, on various index sizes (1M, 10M, 100M), using k = 100 and z = 1,000 (single index, single query).

Figure 17: Effectiveness (Recall(k) and RDE(k), for k = 1, 10, 100) with respect to the size of the R set, on the 100M index, using z = 1,000 (single index, single query).

slide-38
SLIDE 38

Experiments Results

Results

Figure 18: Effectiveness (Recall(k) and RDE(k), for k = 1, 10, 100) of the multiple index search strategy with 1, 2, 4, and 8 indexes on the 100M index, using |R| = 1,000 and z = 1,000.

Figure 19: Effectiveness (Recall(k) and RDE(k), for k = 1, 10, 100) of the multiple query search strategy with 1, 2, 4, and 8 queries on the 100M index, using |R| = 1,000 and z = 1,000.

slide-39
SLIDE 39

Experiments Results

Results

Figure 20: Effectiveness (Recall(k) and RDE(k), for k from 1 to 100) of the combined multiple query and multiple index search strategies, using eight queries and eight indexes, on various data set sizes (1M, 10M, 100M), using |R| = 100 and z = 1,000.

Figure 21: Effectiveness (Recall(k) and RDE(k), for k from 1 to 100) of the combined multiple query and multiple index search strategies, using eight queries and eight indexes, on various data set sizes (1M, 10M, 100M), using |R| = 1,000 and z = 1,000.


slide-41
SLIDE 41

Demo

Demo

http://mipai.esuli.it



slide-43
SLIDE 43

Conclusions Summary

Outline

1. Introduction
2. The PP-Index
3. Experiments
4. Demo
5. Conclusions
  • Summary
  • Questions

slide-44
SLIDE 44

Conclusions Summary

Summary

The PP-Index:

  • is a simple but effective data structure for approximate similarity search;
  • scales well, both at indexing time and at search time;
  • can be kept updated with minor additional effort;
  • has good parallelization properties;
  • relates well with other data structures (i.e., inverted lists).

There is still a lot to investigate:

  • policies for reference point selection;
  • studying the relations between l, |R|, z, k, l′, and z′;
  • giving a theoretical foundation to the permutation-based methods;
  • applicability to other domains and similarity space types;
  • policies for data partitioning.


slide-46
SLIDE 46

Conclusions Questions

Questions?


slide-47
SLIDE 47

Conclusions Questions

FAQ

Q: How does the PP-Index differ from the “orthodox” metric approach X?

A: Please help yourself:

  • No “explicit” requirement of metric properties.
  • Use of a predetermined (i.e., fixed) set of reference points.
  • Any reference point has a “global” influence.
  • Different data access model.
  • Different data update model.
  • . . .

slide-48
SLIDE 48

Conclusions Questions

FAQ

Q: What are the key differences between the permutation-based methods and the LSH-based methods?

A:

  • The permutation-based methods are mostly based on geometrical considerations, while LSH-based methods are mostly based on probabilistic considerations.
  • The permutation-based methods are able to take into account how data is distributed in the similarity space (by means of R), while the LSH hash functions are derived only from the distance function.
  • Each element of the hash key generated by an LSH hash function is independent of the others, while the order relation between the elements of a permutation is the crucial information for a permutation-based method.