

SLIDE 1

Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search

Qin (Christine) Lv, Stony Brook University

Joint work with Zhe Wang, William Josephson, Moses Charikar, Kai Li (Princeton University)

SLIDE 2

Motivations

Massive amounts of feature-rich data
  Audio, video, digital photos, sensor data, …
  Fuzzy & high-dimensional

Similarity search in high dimensions
  KNN or ANN in feature-vector space

Important in various areas
  Databases, data mining, search engines, …

SLIDE 3

Ideal Indexing for Similarity Search

Accurate
  Return results that are close to brute-force search

Time efficient
  O(1) or O(log N) query time

Space efficient
  Small space usage for index
  May fit into main memory even for large datasets

High-dimensional
  Work well for datasets with high dimensionality

SLIDE 4

Previous Indexing Methods

K-D tree, R-tree, X-tree, SR-tree, …
  "Curse of dimensionality"
  Linear scan outperforms them when d > 10 [WSB98]

Navigating nets [KL04], cover tree [BKL06]
  Based on "intrinsic dimensionality"
  Do not perform well with high intrinsic dimensionality

Locality sensitive hashing (LSH)

SLIDE 5

Outline

Motivations
Locality sensitive hashing (LSH)
  Basic LSH, entropy-based LSH
Multi-probe LSH indexing
  Step-wise probing, query-directed probing
Evaluations
Conclusions & future work

SLIDE 6

LSH: Locality Sensitive Hashing

(r, cr, p1, p2)-sensitive [IM98]
  If D(q,p) < r, then Pr[h(q) = h(p)] >= p1
  If D(q,p) > cr, then Pr[h(q) = h(p)] <= p2
  i.e. closer objects have a higher collision probability

LSH based on p-stable distributions [DIIM04]
  h_{a,b}(v) = ⌊ (a · v + b) / w ⌋
  w : slot width

[Figure: projection line divided into slots of width w (Slot 1, Slot 2, Slot 3), with q, r, and cr marked]
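A minimal Python sketch of one such p-stable hash function, assuming a Gaussian (2-stable) projection; the class name `PStableHash` and its parameters are my own, while `a`, `b`, and `w` follow the formula above.

```python
import numpy as np

class PStableHash:
    """One LSH hash function h_{a,b}(v) = floor((a . v + b) / w) [DIIM04]."""
    def __init__(self, dim, w, rng=None):
        rng = rng or np.random.default_rng()
        self.a = rng.standard_normal(dim)   # 2-stable (Gaussian) projection vector
        self.b = rng.uniform(0, w)          # random offset within one slot
        self.w = w                          # slot width

    def __call__(self, v):
        return int(np.floor((np.dot(self.a, v) + self.b) / self.w))
```

With this construction, two points whose projections fall in the same slot of width w collide, and the collision probability decreases with their distance.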

SLIDE 7

LSH for Similarity Search

False positives
  Reduced by intersecting multiple hashes (concatenating hash functions within a table)

False negatives
  Reduced by taking the union over multiple hash tables

[Figure: query q hashed by two functions h1 and h2, each over slots of width w]

SLIDE 8

Basic LSH Indexing

[IM98, GIM99, DIIM04]
  M hash functions per table, L hash tables
  G = { g1, …, gL },  gi(v) = ( h_{i,1}(v), …, h_{i,M}(v) )

Issues: large number of tables
  L > 100 in [GIM99]
  L > 500 in [Buhler01]
  Impractical for large datasets

[Figure: query q hashed into one bucket gi(q) in each of the L tables g1 … gL]
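A minimal sketch (not from the slides) of the basic scheme: L tables, each keyed by the concatenation gi(v) = (h_{i,1}(v), …, h_{i,M}(v)); it reuses the hypothetical `PStableHash` class from the previous sketch.

```python
from collections import defaultdict
import numpy as np

class BasicLSHIndex:
    """L hash tables, each keyed by M concatenated p-stable hashes."""
    def __init__(self, dim, L, M, w):
        self.tables = [defaultdict(list) for _ in range(L)]
        self.funcs = [[PStableHash(dim, w) for _ in range(M)] for _ in range(L)]

    def _key(self, i, v):
        # g_i(v) = (h_{i,1}(v), ..., h_{i,M}(v))
        return tuple(h(v) for h in self.funcs[i])

    def insert(self, v):
        for i, table in enumerate(self.tables):
            table[self._key(i, v)].append(v)

    def query(self, q):
        # Union of the L buckets that q falls into; candidates are then
        # ranked by true distance (brute force over the candidate set only).
        candidates = []
        for i, table in enumerate(self.tables):
            candidates.extend(table[self._key(i, q)])
        return sorted(candidates, key=lambda v: np.linalg.norm(np.asarray(v) - q))
```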

SLIDE 9

Entropy-Based LSH Indexing

[Panigrahy, SODA'06]
  Randomly perturb q at distance R
  Check the hash buckets of the perturbed points

Issues:
  Difficult to choose R
  Duplicate buckets
  Inefficient probing

[Figure: perturbed points p1 … p4 at distance R around q, each hashed into buckets g1(p1) … gL(p1)]
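A rough sketch, under my own assumptions, of entropy-based probing layered on the hypothetical `BasicLSHIndex` above: perturb q several times at distance R and look up the buckets of the perturbed points; note how duplicate buckets have to be filtered out.

```python
import numpy as np

def entropy_lsh_query(index, q, R, num_perturbations=20, rng=None):
    """Look up buckets of random points at distance R from q (q is an ndarray)."""
    rng = rng or np.random.default_rng()
    candidates = list(index.query(q))
    seen = set()
    for _ in range(num_perturbations):
        direction = rng.standard_normal(q.shape)
        p = q + R * direction / np.linalg.norm(direction)   # random point at distance R
        for i, table in enumerate(index.tables):
            bucket = index._key(i, p)
            if (i, bucket) in seen:        # duplicate buckets: a known inefficiency
                continue
            seen.add((i, bucket))
            candidates.extend(table[bucket])
    return candidates
```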

SLIDE 10

Outline

Motivations
Locality sensitive hashing (LSH)
  Basic LSH, entropy-based LSH
Multi-probe LSH indexing
  Step-wise probing, query-directed probing
Evaluations
Conclusions & future work

SLIDE 11

Multi-Probe LSH Indexing

Probes multiple hash buckets per table
Perturbs the hash values directly
  Check left and right slots
  Perturbation vector ∆
    g(q) = (2, 5, 3), ∆ = (-1, 1, 0), g(q) + ∆ = (1, 6, 3)

Systematic probing
  (∆1, ∆2, ∆3, ∆4, …)

[Figure: slots of width w around h(q) = 5, with neighboring slots 4 and 6 as candidate probes]
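A tiny illustration (mine) of applying a perturbation vector to a bucket key, matching the g(q) = (2, 5, 3) example above.

```python
def perturb(key, delta):
    """Apply a perturbation vector to a bucket key: g(q) + delta."""
    return tuple(k + d for k, d in zip(key, delta))

# Example from the slide:
assert perturb((2, 5, 3), (-1, 1, 0)) == (1, 6, 3)
```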

SLIDE 12

Multi-Probe LSH Indexing

A carefully derived probing sequence

Advantages
  Fast probing sequence generation
  No duplicate buckets
  More effective in finding similar objects

[Figure: probing sequence (∆1, ∆2, ∆3, ∆4, …) applied across tables, e.g. buckets g1(q)+∆1, gi(q)+∆2, gi(q)+∆4, gL(q)+∆3]
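A minimal sketch (assumed, built on the hypothetical `BasicLSHIndex` and `perturb` above) of a multi-probe query, where each entry of the probing sequence names a table and the perturbation vector to apply to that table's key.

```python
def multiprobe_query(index, q, probing_sequence):
    """probing_sequence: iterable of (table_index, delta) pairs, e.g. produced
    by step-wise or query-directed probing. Unperturbed buckets are always checked."""
    keys = [index._key(i, q) for i in range(len(index.tables))]
    candidates = []
    for i, key in enumerate(keys):
        candidates.extend(index.tables[i][key])          # the home buckets
    for i, delta in probing_sequence:
        candidates.extend(index.tables[i][perturb(keys[i], delta)])
    return candidates
```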

SLIDE 13

Step-Wise Probing

Given q's hash values

Intuitions
  1-step buckets are better than 2-step buckets
  All 1-step buckets are equally good (WRONG! see next slide)

Example: g(q) = (3,2,5)
  1-step buckets: (2,2,5), (4,2,5), (3,2,6), …   e.g. ∆ = (0,0,1)
  2-step buckets: (2,1,5), (2,2,6), (3,3,6), …   e.g. ∆ = (-1,-1,0)
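A small sketch (my own) of generating step-wise perturbation vectors: every n-step vector shifts n distinct coordinates by +1 or -1.

```python
from itertools import combinations, product

def stepwise_deltas(M, n_steps):
    """All n-step perturbation vectors over M hash values:
    choose n distinct coordinates and shift each by +1 or -1."""
    for coords in combinations(range(M), n_steps):
        for signs in product((-1, 1), repeat=n_steps):
            delta = [0] * M
            for c, s in zip(coords, signs):
                delta[c] = s
            yield tuple(delta)

# Example: the six 1-step vectors for M = 3, e.g. (0,0,1) and (-1,0,0).
print(list(stepwise_deltas(3, 1)))
```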

SLIDE 14

Success Probability Estimation

Hashed position within slot matters!
Estimation based on x_i(-1) and x_i(1)
  x_i(δ): distance from q's hashed position to the boundary of its slot in direction δ
  The smaller x_i(δ) is, the more likely a near neighbor landed in that adjacent slot
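A small sketch of how x_i(-1) and x_i(1) could be computed, assuming the hypothetical `PStableHash` from earlier and distances expressed as fractions of the slot width w.

```python
import numpy as np

def slot_distances(h, q):
    """Return (x(-1), x(1)): distance from q's hashed position to the
    left and right boundaries of its slot, in units of the slot width w."""
    f = (np.dot(h.a, q) + h.b) / h.w      # fractional slot position
    frac = f - np.floor(f)                # position within the slot, in [0, 1)
    return frac, 1.0 - frac               # x(-1) = frac, x(1) = 1 - frac
```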

SLIDE 15

Query-Directed Probing

Example: g(q) = (h1(q), h2(q), h3(q)) = (2, 5, 1)
  h1(q) = 2, with x1(-1) = 0.7, x1(1) = 0.3
  h2(q) = 5, with x2(-1) = 0.4, x2(1) = 0.6
  h3(q) = 1, with x3(-1) = 0.2, x3(1) = 0.8

Sorted boundary distances: { 0.2, 0.3, 0.4, 0.6, 0.7, 0.8 }
  = { x3(-1), x1(1), x2(-1), x2(1), x1(-1), x3(1) }

Probing sequence (best score first):
  ∆1 = (0, 0, -1): uses { 0.2 }, bucket (2, 5, 0)
  ∆2 = (1, 0, 0): uses { 0.3 }, bucket (3, 5, 1)
  ∆3 = (1, 0, -1): uses { 0.2, 0.3 }, bucket (3, 5, 0)
  further candidates: { 0.4 }, { 0.2, 0.4 }, { 0.3, 0.4 }, { 0.2, 0.3, 0.4 }, …
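A sketch (my own simplification) of generating the query-directed probing sequence with a min-heap over perturbation sets, scored by the sum of squared boundary distances; the shift/expand construction and the validity check are assumptions about the method, and the example reproduces the sequence on the slide.

```python
import heapq

def query_directed_probes(xs, T):
    """Emit the first T perturbation vectors in increasing score order.

    xs: one (boundary_distance, coordinate, direction) triple per possible
        single-slot shift, i.e. (x_i(-1), i, -1) and (x_i(1), i, +1) for each i.
    Score of a perturbation set = sum of squared boundary distances.
    """
    xs = sorted(xs)                                   # ascending by distance
    n = len(xs)
    start = (0,)                                      # sets are tuples of indices into xs
    heap = [(xs[0][0] ** 2, start)]
    seen = {start}
    emitted = []
    while heap and len(emitted) < T:
        score, idxs = heapq.heappop(heap)
        coords = [xs[j][1] for j in idxs]
        if len(set(coords)) == len(coords):           # never perturb a coordinate twice
            delta = {xs[j][1]: xs[j][2] for j in idxs}
            emitted.append((score, delta))
        last = idxs[-1]
        if last + 1 < n:
            for cand in (idxs[:-1] + (last + 1,),     # "shift" the largest index up
                         idxs + (last + 1,)):         # "expand" with the next index
                if cand not in seen:
                    seen.add(cand)
                    heapq.heappush(heap, (sum(xs[j][0] ** 2 for j in cand), cand))
    return emitted

# Example matching the slide (boundary distance, hash i, direction d):
xs = [(0.7, 1, -1), (0.3, 1, +1), (0.4, 2, -1),
      (0.6, 2, +1), (0.2, 3, -1), (0.8, 3, +1)]
for score, delta in query_directed_probes(xs, 3):
    print(delta)   # {3: -1}, then {1: 1}, then {3: -1, 1: 1}
```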

SLIDE 16

Outline

Motivations
Locality sensitive hashing (LSH)
  Basic LSH, entropy-based LSH
Multi-probe LSH indexing
  Step-wise probing, query-directed probing
Evaluations
Conclusions & future work

SLIDE 17

Evaluations

Multi-probe vs. basic vs. entropy-based LSH
  Tradeoff among space, speed, and quality
  Space reduction

Query-directed vs. step-wise probing
  Tradeoff between search quality and number of probes

SLIDE 18

Evaluation Methodology

Benchmarks
  100 random queries, top K results

Evaluation metrics
  Search quality: recall, error ratio
  Search speed: query latency
  Space usage: number of hash tables

Dataset             #objects     #dimensions
Web images          1.3 million  64
Switchboard audio   2.6 million  192

recall = |I ∩ R| / |I|, where I is the ideal (brute-force) result set and R is the returned result set
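A tiny sketch (mine) of the recall metric defined above, comparing returned result IDs against the ideal (brute-force) top-K IDs.

```python
def recall(ideal_ids, returned_ids):
    """recall = |I ∩ R| / |I|: fraction of the ideal top-K results that were returned."""
    ideal, returned = set(ideal_ids), set(returned_ids)
    return len(ideal & returned) / len(ideal)

# Example: 3 of the 4 ideal neighbors were found.
print(recall([1, 2, 3, 4], [2, 3, 4, 9]))   # 0.75
```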

SLIDE 19

Multi-Probe vs. Basic vs. Entropy

Multi-probe LSH achieves higher recall with fewer hash tables

SLIDE 20

Space Savings of Multi-Probe LSH

14x - 18x fewer tables than basic LSH
5x - 8x fewer tables than entropy-based LSH

[Chart: number of hash tables required by each method]

SLIDE 21

Multi-Probe vs. Entropy-Based

Multi-probe LSH uses far fewer probes

SLIDE 22

Query-Directed vs. Step-Wise Probing

[Chart: number of probes required by each probing method]

Query-directed probing uses 10x fewer probes

SLIDE 23

Conclusions

Multi-probe LSH indexing
  Systematically probes multiple buckets per hash table
  More space-efficient than basic LSH (14x-18x) and entropy-based LSH (5x-8x)
  More time-efficient than entropy-based LSH
  Query-directed probing is far superior to step-wise probing (10x fewer probes)

SLIDE 24

Future Work

Multi-probe LSH on larger datasets
  60 million images, out-of-core, distributed

Self-tuning
  Analytical model, LSH Forest

Compare with other indexing methods

Evaluate on other data types and features
  Genomic data, video data, scientific sensor data, …

SLIDE 25

Thanks!

Princeton CASS Project
  Content-Aware Search Systems
  http://www.cs.princeton.edu/cass/

Qin (Christine) Lv at Stony Brook
  http://www.cs.sunysb.edu/~qlv