

SLIDE 1

Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search

Qin (Christine) Lv, Stony Brook University

Joint work with Zhe Wang, William Josephson, Moses Charikar, Kai Li (Princeton University)

SLIDE 2

Motivations

Massive amounts of feature-rich data
  Audio, video, digital photos, sensor data, …
  Fuzzy & high-dimensional

Similarity search in high dimensions
  KNN or ANN in feature-vector space

Important in various areas
  Databases, data mining, search engines, …

SLIDE 3

Ideal Indexing for Similarity Search

Accurate
  Return results that are close to brute-force search

Time efficient
  O(1) or O(log N) query time

Space efficient
  Small space usage for index
  May fit into main memory even for large datasets

High-dimensional
  Work well for datasets with high dimensionality

SLIDE 4

Previous Indexing Methods

K-D tree, R-tree, X-tree, SR-tree, …
  "Curse of dimensionality"
  Linear scan outperforms them when d > 10 [WSB98]

Navigating nets [KL04], cover tree [BKL06]
  Based on "intrinsic dimensionality"
  Do not perform well with high intrinsic dimensionality

Locality sensitive hashing (LSH)

SLIDE 5

Outline

Motivations
Locality sensitive hashing (LSH)
  Basic LSH, entropy-based LSH
Multi-probe LSH indexing
  Step-wise probing, query-directed probing
Evaluations
Conclusions & future work

SLIDE 6

LSH: Locality Sensitive Hashing

(r, cr, p1, p2)-sensitive [IM98]
  If D(q,p) < r, then Pr[h(q) = h(p)] >= p1
  If D(q,p) > cr, then Pr[h(q) = h(p)] <= p2
  i.e. closer objects have a higher collision probability

LSH based on p-stable distributions [DIIM04]
  h_{a,b}(v) = ⌊ (a · v + b) / w ⌋
  w : slot width

[Figure: projection line divided into slots of width w (Slot 1, Slot 2, Slot 3), with q, r, and cr marked]
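A minimal Python sketch of one such p-stable hash function, assuming a Gaussian (2-stable) projection; the class name `PStableHash` and its parameters are my own, while `a`, `b`, and `w` follow the formula above.

```python
import numpy as np

class PStableHash:
    """One LSH hash function h_{a,b}(v) = floor((a . v + b) / w) [DIIM04]."""
    def __init__(self, dim, w, rng=None):
        rng = rng or np.random.default_rng()
        self.a = rng.standard_normal(dim)   # 2-stable (Gaussian) projection vector
        self.b = rng.uniform(0, w)          # random offset within one slot
        self.w = w                          # slot width

    def __call__(self, v):
        return int(np.floor((np.dot(self.a, v) + self.b) / self.w))
```

With this construction, two points whose projections fall in the same slot of width w collide, and the collision probability decreases with their distance.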

SLIDE 7

LSH for Similarity Search

False positives
  Reduced by intersecting multiple hashes (concatenating hash functions within a table)

False negatives
  Reduced by taking the union over multiple hash tables

[Figure: query q hashed by two functions h1 and h2, each over slots of width w]

SLIDE 8

Basic LSH Indexing

[IM98, GIM99, DIIM04]
  M hash functions per table, L hash tables
  G = { g1, …, gL },  gi(v) = ( h_{i,1}(v), …, h_{i,M}(v) )

Issues: large number of tables
  L > 100 in [GIM99]
  L > 500 in [Buhler01]
  Impractical for large datasets

[Figure: query q hashed into one bucket gi(q) in each of the L tables g1 … gL]
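A minimal sketch (not from the slides) of the basic scheme: L tables, each keyed by the concatenation gi(v) = (h_{i,1}(v), …, h_{i,M}(v)); it reuses the hypothetical `PStableHash` class from the previous sketch.

```python
from collections import defaultdict
import numpy as np

class BasicLSHIndex:
    """L hash tables, each keyed by M concatenated p-stable hashes."""
    def __init__(self, dim, L, M, w):
        self.tables = [defaultdict(list) for _ in range(L)]
        self.funcs = [[PStableHash(dim, w) for _ in range(M)] for _ in range(L)]

    def _key(self, i, v):
        # g_i(v) = (h_{i,1}(v), ..., h_{i,M}(v))
        return tuple(h(v) for h in self.funcs[i])

    def insert(self, v):
        for i, table in enumerate(self.tables):
            table[self._key(i, v)].append(v)

    def query(self, q):
        # Union of the L buckets that q falls into; candidates are then
        # ranked by true distance (brute force over the candidate set only).
        candidates = []
        for i, table in enumerate(self.tables):
            candidates.extend(table[self._key(i, q)])
        return sorted(candidates, key=lambda v: np.linalg.norm(np.asarray(v) - q))
```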

SLIDE 9

Entropy-Based LSH Indexing

[Panigrahy, SODA'06]
  Randomly perturb q at distance R
  Check the hash buckets of the perturbed points

Issues:
  Difficult to choose R
  Duplicate buckets
  Inefficient probing

[Figure: perturbed points p1 … p4 at distance R around q, each hashed into buckets g1(p1) … gL(p1)]
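A rough sketch, under my own assumptions, of entropy-based probing layered on the hypothetical `BasicLSHIndex` above: perturb q several times at distance R and look up the buckets of the perturbed points; note how duplicate buckets have to be filtered out.

```python
import numpy as np

def entropy_lsh_query(index, q, R, num_perturbations=20, rng=None):
    """Look up buckets of random points at distance R from q (q is an ndarray)."""
    rng = rng or np.random.default_rng()
    candidates = list(index.query(q))
    seen = set()
    for _ in range(num_perturbations):
        direction = rng.standard_normal(q.shape)
        p = q + R * direction / np.linalg.norm(direction)   # random point at distance R
        for i, table in enumerate(index.tables):
            bucket = index._key(i, p)
            if (i, bucket) in seen:        # duplicate buckets: a known inefficiency
                continue
            seen.add((i, bucket))
            candidates.extend(table[bucket])
    return candidates
```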

SLIDE 10

Outline

Motivations
Locality sensitive hashing (LSH)
  Basic LSH, entropy-based LSH
Multi-probe LSH indexing
  Step-wise probing, query-directed probing
Evaluations
Conclusions & future work

SLIDE 11

Multi-Probe LSH Indexing

Probes multiple hash buckets per table
Perturbs the hash values directly
  Check left and right slots
  Perturbation vector ∆
    g(q) = (2, 5, 3), ∆ = (-1, 1, 0), g(q) + ∆ = (1, 6, 3)

Systematic probing
  (∆1, ∆2, ∆3, ∆4, …)

[Figure: slots of width w around h(q) = 5, with neighboring slots 4 and 6 as candidate probes]
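A tiny illustration (mine) of applying a perturbation vector to a bucket key, matching the g(q) = (2, 5, 3) example above.

```python
def perturb(key, delta):
    """Apply a perturbation vector to a bucket key: g(q) + delta."""
    return tuple(k + d for k, d in zip(key, delta))

# Example from the slide:
assert perturb((2, 5, 3), (-1, 1, 0)) == (1, 6, 3)
```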

SLIDE 12

Multi-Probe LSH Indexing

A carefully derived probing sequence

Advantages
  Fast probing sequence generation
  No duplicate buckets
  More effective in finding similar objects

[Figure: probing sequence (∆1, ∆2, ∆3, ∆4, …) applied across tables, e.g. buckets g1(q)+∆1, gi(q)+∆2, gi(q)+∆4, gL(q)+∆3]
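A minimal sketch (assumed, built on the hypothetical `BasicLSHIndex` and `perturb` above) of a multi-probe query, where each entry of the probing sequence names a table and the perturbation vector to apply to that table's key.

```python
def multiprobe_query(index, q, probing_sequence):
    """probing_sequence: iterable of (table_index, delta) pairs, e.g. produced
    by step-wise or query-directed probing. Unperturbed buckets are always checked."""
    keys = [index._key(i, q) for i in range(len(index.tables))]
    candidates = []
    for i, key in enumerate(keys):
        candidates.extend(index.tables[i][key])          # the home buckets
    for i, delta in probing_sequence:
        candidates.extend(index.tables[i][perturb(keys[i], delta)])
    return candidates
```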

SLIDE 13

Step-Wise Probing

Given q's hash values

Intuitions
  1-step buckets are better than 2-step buckets
  All 1-step buckets are equally good (WRONG! see next slide)

Example: g(q) = (3,2,5)
  1-step buckets: (2,2,5), (4,2,5), (3,2,6), …   e.g. ∆ = (0,0,1)
  2-step buckets: (2,1,5), (2,2,6), (3,3,6), …   e.g. ∆ = (-1,-1,0)
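A small sketch (my own) of generating step-wise perturbation vectors: every n-step vector shifts n distinct coordinates by +1 or -1.

```python
from itertools import combinations, product

def stepwise_deltas(M, n_steps):
    """All n-step perturbation vectors over M hash values:
    choose n distinct coordinates and shift each by +1 or -1."""
    for coords in combinations(range(M), n_steps):
        for signs in product((-1, 1), repeat=n_steps):
            delta = [0] * M
            for c, s in zip(coords, signs):
                delta[c] = s
            yield tuple(delta)

# Example: the six 1-step vectors for M = 3, e.g. (0,0,1) and (-1,0,0).
print(list(stepwise_deltas(3, 1)))
```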

SLIDE 14

Success Probability Estimation

Hashed position within slot matters!
Estimation based on x_i(-1) and x_i(1)
  x_i(δ): distance from q's hashed position to the boundary of its slot in direction δ
  The smaller x_i(δ) is, the more likely a near neighbor landed in that adjacent slot
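A small sketch of how x_i(-1) and x_i(1) could be computed, assuming the hypothetical `PStableHash` from earlier and distances expressed as fractions of the slot width w.

```python
import numpy as np

def slot_distances(h, q):
    """Return (x(-1), x(1)): distance from q's hashed position to the
    left and right boundaries of its slot, in units of the slot width w."""
    f = (np.dot(h.a, q) + h.b) / h.w      # fractional slot position
    frac = f - np.floor(f)                # position within the slot, in [0, 1)
    return frac, 1.0 - frac               # x(-1) = frac, x(1) = 1 - frac
```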

SLIDE 15

Query-Directed Probing

Example: g(q) = (h1(q), h2(q), h3(q)) = (2, 5, 1)
  h1(q) = 2, with x1(-1) = 0.7, x1(1) = 0.3
  h2(q) = 5, with x2(-1) = 0.4, x2(1) = 0.6
  h3(q) = 1, with x3(-1) = 0.2, x3(1) = 0.8

Sorted boundary distances: { 0.2, 0.3, 0.4, 0.6, 0.7, 0.8 }
  = { x3(-1), x1(1), x2(-1), x2(1), x1(-1), x3(1) }

Probing sequence (best score first):
  ∆1 = (0, 0, -1): uses { 0.2 }, bucket (2, 5, 0)
  ∆2 = (1, 0, 0): uses { 0.3 }, bucket (3, 5, 1)
  ∆3 = (1, 0, -1): uses { 0.2, 0.3 }, bucket (3, 5, 0)
  further candidates: { 0.4 }, { 0.2, 0.4 }, { 0.3, 0.4 }, { 0.2, 0.3, 0.4 }, …
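A sketch (my own simplification) of generating the query-directed probing sequence with a min-heap over perturbation sets, scored by the sum of squared boundary distances; the shift/expand construction and the validity check are assumptions about the method, and the example reproduces the sequence on the slide.

```python
import heapq

def query_directed_probes(xs, T):
    """Emit the first T perturbation vectors in increasing score order.

    xs: one (boundary_distance, coordinate, direction) triple per possible
        single-slot shift, i.e. (x_i(-1), i, -1) and (x_i(1), i, +1) for each i.
    Score of a perturbation set = sum of squared boundary distances.
    """
    xs = sorted(xs)                                   # ascending by distance
    n = len(xs)
    start = (0,)                                      # sets are tuples of indices into xs
    heap = [(xs[0][0] ** 2, start)]
    seen = {start}
    emitted = []
    while heap and len(emitted) < T:
        score, idxs = heapq.heappop(heap)
        coords = [xs[j][1] for j in idxs]
        if len(set(coords)) == len(coords):           # never perturb a coordinate twice
            delta = {xs[j][1]: xs[j][2] for j in idxs}
            emitted.append((score, delta))
        last = idxs[-1]
        if last + 1 < n:
            for cand in (idxs[:-1] + (last + 1,),     # "shift" the largest index up
                         idxs + (last + 1,)):         # "expand" with the next index
                if cand not in seen:
                    seen.add(cand)
                    heapq.heappush(heap, (sum(xs[j][0] ** 2 for j in cand), cand))
    return emitted

# Example matching the slide (boundary distance, hash i, direction d):
xs = [(0.7, 1, -1), (0.3, 1, +1), (0.4, 2, -1),
      (0.6, 2, +1), (0.2, 3, -1), (0.8, 3, +1)]
for score, delta in query_directed_probes(xs, 3):
    print(delta)   # {3: -1}, then {1: 1}, then {3: -1, 1: 1}
```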

SLIDE 16

Outline

Motivations
Locality sensitive hashing (LSH)
  Basic LSH, entropy-based LSH
Multi-probe LSH indexing
  Step-wise probing, query-directed probing
Evaluations
Conclusions & future work

SLIDE 17

Evaluations

Multi-probe vs. basic vs. entropy-based LSH
  Tradeoff among space, speed, and quality
  Space reduction

Query-directed vs. step-wise probing
  Tradeoff between search quality and number of probes

SLIDE 18

Evaluation Methodology

Benchmarks
  100 random queries, top K results

Evaluation metrics
  Search quality: recall, error ratio
  Search speed: query latency
  Space usage: number of hash tables

Dataset             #objects     #dimensions
Web images          1.3 million  64
Switchboard audio   2.6 million  192

recall = |I ∩ R| / |I|, where I is the ideal (brute-force) result set and R is the returned result set
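A tiny sketch (mine) of the recall metric defined above, comparing returned result IDs against the ideal (brute-force) top-K IDs.

```python
def recall(ideal_ids, returned_ids):
    """recall = |I ∩ R| / |I|: fraction of the ideal top-K results that were returned."""
    ideal, returned = set(ideal_ids), set(returned_ids)
    return len(ideal & returned) / len(ideal)

# Example: 3 of the 4 ideal neighbors were found.
print(recall([1, 2, 3, 4], [2, 3, 4, 9]))   # 0.75
```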

SLIDE 19

Multi-Probe vs. Basic vs. Entropy

Multi-probe LSH achieves higher recall with fewer hash tables

SLIDE 20

Space Savings of Multi-Probe LSH

14x - 18x fewer tables than basic LSH
5x - 8x fewer tables than entropy-based LSH

[Chart: number of hash tables required by each method]

SLIDE 21

Multi-Probe vs. Entropy-Based

Multi-probe LSH uses far fewer probes

SLIDE 22

Query-Directed vs. Step-Wise Probing

[Chart: number of probes required by each probing method]

Query-directed probing uses 10x fewer probes

SLIDE 23

Conclusions

Multi-probe LSH indexing
  Systematically probes multiple buckets per hash table
  More space-efficient than basic LSH (14x-18x) and entropy-based LSH (5x-8x)
  More time-efficient than entropy-based LSH
  Query-directed probing is far superior to step-wise probing (10x fewer probes)

SLIDE 24

Future Work

Multi-probe LSH on larger datasets
  60 million images, out-of-core, distributed

Self-tuning
  Analytical model, LSH Forest

Compare with other indexing methods

Evaluate on other data types and features
  Genomic data, video data, scientific sensor data, …

SLIDE 25

Thanks!

Princeton CASS Project
  Content-Aware Search Systems
  http://www.cs.princeton.edu/cass/

Qin (Christine) Lv at Stony Brook
  http://www.cs.sunysb.edu/~qlv