

SLIDE 1

CS 498ABD: Algorithms for Big Data

LSH for ℓ2 distances

Lecture 15

October 15, 2020

Chandra (UIUC) CS498ABD 1 Fall 2020 1 / 21

SLIDE 2

LSH Approach for Approximate NNS

Use locality-sensitive hashing to solve a simplified decision problem.

Definition: A family of hash functions is (r, cr, p1, p2)-LSH with p1 > p2 and c > 1 if h drawn randomly from the family satisfies the following:
Pr[h(x) = h(y)] ≥ p1 when dist(x, y) ≤ r
Pr[h(x) = h(y)] ≤ p2 when dist(x, y) ≥ cr

Key parameter: the gap between p1 and p2, measured as ρ = (log p1)/(log p2), which is usually small.

Two-level hashing scheme:
Amplify the basic locality-sensitive hash family into a better family by repetition
Use several copies of the amplified hash functions
Layer a binary search over r on top of the above scheme
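The two-level scheme can be sketched in Python. This is an illustrative sketch, not the course's reference code: `base_family` is an assumed callable that returns a freshly sampled basic hash function, and k and L are the amplification parameters.

```python
import random

def amplified_hash(base_family, k):
    """AND-construction: concatenate k independently drawn basic hashes.
    A collision now requires agreement on all k coordinates, so a pair
    colliding with probability p under the basic family collides with
    probability p^k here, widening the gap between p1 and p2."""
    hs = [base_family() for _ in range(k)]
    return lambda x: tuple(h(x) for h in hs)

def build_tables(points, base_family, k, L):
    """OR-construction: L independent tables of amplified hashes."""
    tables = []
    for _ in range(L):
        g = amplified_hash(base_family, k)
        buckets = {}
        for p in points:
            buckets.setdefault(g(p), []).append(p)
        tables.append((g, buckets))
    return tables

def query(tables, q):
    """Return all points colliding with q in at least one table."""
    out = []
    for g, buckets in tables:
        out.extend(buckets.get(g(q), []))
    return out
```

For the Hamming cube, for instance, the basic family samples a random coordinate; concatenating k of them gives collision probability (1 − dist(x, y)/d)^k.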


SLIDE 3

LSH Approach for Approximate NNS

Key parameter: the gap between p1 and p2, measured as ρ = (log p1)/(log p2), which is usually small.
L ≃ n^ρ hash tables
Storage: n^(1+ρ) (ignoring log factors)
Query time: k·n^ρ (ignoring log factors), where k = log_{1/p2} n
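As a concrete illustration, the parameters above can be computed directly; the values of n, p1, p2 below are hypothetical, chosen only for this example.

```python
import math

# Hypothetical basic-family parameters, for illustration only
n, p1, p2 = 10**6, 0.8, 0.3

rho = math.log(p1) / math.log(p2)              # rho = (log p1)/(log p2)
k = math.ceil(math.log(n) / math.log(1 / p2))  # k = log_{1/p2} n, so p2^k <= 1/n
L = math.ceil(n ** rho)                        # roughly n^rho hash tables

print(rho, k, L)   # rho ≈ 0.185, k = 12, L = 13
```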



SLIDE 6

LSH for Euclidean Distances

Now x1, x2, ..., xn ∈ R^d and dist(x, y) = ‖x − y‖₂. First do dimensionality reduction (JL) to reduce d (if necessary) to O(log n) (since we are using a c-approximation anyway). What is a good basic locality-sensitive hashing scheme? That is, we want a hashing approach that makes nearby points more likely to collide than far-away points. Answer: projections onto random lines plus bucketing.



SLIDE 8

Random unit vector

Question: How do we generate a random unit vector in R^d (equivalently, a uniform point on the sphere S^(d−1))?

Pick d independent random variables Z1, Z2, ..., Zd where each Zi ∼ N(0, 1) and let g = (Z1, Z2, ..., Zd) (also called a random Gaussian vector). The distribution of g is rotationally symmetric, hence g points in a random direction; to obtain a random unit vector, normalize: g′ = g/‖g‖₂.

When d is large, ‖g‖₂² = Σᵢ Zᵢ² is concentrated around d, and hence ‖g‖₂ = (1 ± ε)√d with high probability. Thus g/√d is a proxy for a random unit vector and is easier to work with in many cases.
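A minimal sketch of this recipe (the dimension d = 400 is an arbitrary choice for illustration):

```python
import math
import random

random.seed(0)
d = 400

# Random Gaussian vector g = (Z_1, ..., Z_d) with Z_i ~ N(0, 1)
g = [random.gauss(0, 1) for _ in range(d)]
norm = math.sqrt(sum(z * z for z in g))

# Exact random unit vector: normalize g
u = [z / norm for z in g]

# Concentration: ||g||_2 is close to sqrt(d), so g / sqrt(d) is a cheap proxy
print(norm / math.sqrt(d))   # close to 1
```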


SLIDE 9

Projection onto a random Gaussian vector

Lemma: Suppose x ∈ R^d and g is a random Gaussian vector. Let Y = x · g. Then Y ∼ N(0, ‖x‖₂²) and hence E[Y²] = ‖x‖₂².
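A quick empirical check of the Lemma; the vector x and the sample count are arbitrary illustrative choices.

```python
import random

random.seed(0)
x = [3.0, 4.0]   # ||x||_2 = 5, so Y = x·g should be N(0, 25)
n = 50000

# Sample Y = x · g for fresh Gaussian vectors g and estimate E[Y^2]
ys = [sum(xi * random.gauss(0, 1) for xi in x) for _ in range(n)]
second_moment = sum(y * y for y in ys) / n

print(second_moment)   # close to ||x||_2^2 = 25
```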


SLIDE 10

Hashing scheme

Pick a random Gaussian vector u.
Pick a random shift a ∈ (0, r].
For a vector x, set h_{u,a}(x) = ⌊(x · u + a)/r⌋.
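A sketch of this hash family, with a small simulation estimating p1 and p2 for a near pair (distance r) and a far pair (distance cr); the dimension, the value c = 2, and the trial count are illustrative choices, not values from the lecture.

```python
import math
import random

random.seed(0)

def make_hash(d, r):
    """h_{u,a}(x) = floor((x·u + a) / r), with u a random Gaussian
    vector and a a uniform random shift in (0, r]."""
    u = [random.gauss(0, 1) for _ in range(d)]
    a = random.uniform(0, r)
    return lambda x: math.floor((sum(xi * ui for xi, ui in zip(x, u)) + a) / r)

d, r, c, trials = 10, 1.0, 2.0, 5000
origin = [0.0] * d
near = [r] + [0.0] * (d - 1)      # distance exactly r from origin
far = [c * r] + [0.0] * (d - 1)   # distance exactly c*r from origin

hits_near = hits_far = 0
for _ in range(trials):
    h = make_hash(d, r)
    hits_near += h(origin) == h(near)
    h = make_hash(d, r)
    hits_far += h(origin) == h(far)

p1_est, p2_est = hits_near / trials, hits_far / trials
print(p1_est, p2_est)   # p1_est > p2_est
```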



SLIDE 14

Analysis

Suppose x, y are such that ‖x − y‖₂ ≤ r. What is p1 = Pr[h_{u,a}(x) = h_{u,a}(y)]?
Suppose x, y are such that ‖x − y‖₂ ≥ cr. What is p2 = Pr[h_{u,a}(x) = h_{u,a}(y)]?

Let q = x − y and let s = ‖q‖₂ be the length of q. From the Lemma, q · u is distributed as s·N(0, 1).

Observations:
h(x) ≠ h(y) if |q · u| ≥ r
If |q · u| < r then h(x) = h(y) with probability 1 − |q · u|/r (over the random shift a)

Thus the collision probability depends only on s.


SLIDE 15

Analysis

As above, let q = x − y and s = ‖q‖₂, so q · u is distributed as s·N(0, 1) and collisions are determined by |q · u|. For a fixed s the collision probability is

p(s) = ∫₀^r f(t) (1 − t/r) dt,

where f is the density function of |s·N(0, 1)|. Rewriting,

p(s) = ∫₀^r (1/s) f(t/s) (1 − t/r) dt,

where f is now the density function of |N(0, 1)|.



SLIDE 17

Analysis

p(s) = ∫₀^r (1/s) f(t/s) (1 − t/r) dt, where f is the density function of |N(0, 1)|. Recall p1 = p(r) and p2 = p(cr), and we are interested in ρ = (log p1)/(log p2). Show ρ < 1/c by plot:

[Plot: ρ and 1/c versus the approximation factor c, for c from 1 to 10; the ρ curve lies below the 1/c curve.]
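The collision probability p(s) can be evaluated numerically. One caveat, stated as an assumption: the sketch below treats the bucket width as a tunable parameter (as in the original scheme of Datar et al.) rather than fixing it to r, and uses width 4r with c = 2; with the width fixed to exactly r the ratio comes out larger.

```python
import math

def collision_prob(s, width, steps=200000):
    """p(s) = integral_0^width f(t) * (1 - t/width) dt, where f is the
    density of |s*N(0,1)|; midpoint-rule numerical integration."""
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * width / steps
        f = (2.0 / (s * math.sqrt(2 * math.pi))) * math.exp(-((t / s) ** 2) / 2)
        total += f * (1 - t / width) * (width / steps)
    return total

r, c = 1.0, 2.0
width = 4 * r                       # assumed tunable bucket width
p1 = collision_prob(r, width)       # near pair, distance r
p2 = collision_prob(c * r, width)   # far pair, distance c*r
rho = math.log(p1) / math.log(p2)

print(rho)   # ≈ 0.45, below 1/c = 0.5
```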


SLIDE 18

NNS for Euclidean distances

For any fixed c > 1, use the above scheme to obtain:
Storage: O(n^(1+1/c) · polylog(n))
Query time: O(d · n^(1/c) · polylog(n))

Can use JL to reduce d to O(log n).


SLIDE 19

Improved LSH Scheme

[Andoni-Indyk'06] The basic LSH scheme projects points onto lines. Better scheme: pick some small constant t and project points into R^t, and use a lattice-based space-partitioning scheme to "bucket" instead of intervals.

[Figures from Piotr Indyk's slides, illustrating the projection into R^t and the lattice-based bucketing; a lower bound ρ ≥ 0.45/c² is noted.]


SLIDE 20

Improved LSH Scheme

This leads to ρ ≃ 1/c² + O(log t/√t), which tends to 1/c² for large t and fixed c. A lower bound for LSH in ℓ2 says ρ ≥ 1/c².



SLIDE 22

Data dependent LSH Scheme

LSH is data oblivious: the hash families are chosen before seeing the data. Can one do better by choosing hash functions based on the given set of points?

Yes. [Andoni-Indyk-Nguyen-Razenshteyn'14, Andoni-Razenshteyn'15]
ρ = 1/(2c² − 1) for ℓ2, improving upon 1/c² for data-oblivious LSH (which is tight in the worst case)
ρ = 1/(2c − 1) for ℓ1/Hamming cube, improving upon 1/c for data-oblivious LSH


SLIDE 23

LSH Summary

A modular, hashing-based scheme for similarity estimation.
Main competitors are space-partitioning data structures such as variants of k-d trees.
LSH provides speedups but uses more memory; neither approach is a clear winner.



SLIDE 26

Digression: p-stable distributions

For F2 estimation, JL, and LSH we used an important "stability" property of the Normal distribution.

Lemma: Let Y1, Y2, ..., Yd be independent random variables with distribution N(0, 1). Then Z = Σᵢ xᵢYᵢ has distribution ‖x‖₂·N(0, 1).

The standard Gaussian is 2-stable.

Definition: A distribution D is p-stable if Z = Σᵢ xᵢYᵢ has distribution ‖x‖ₚ·D when the Yi are independent and each of them is distributed as D.

Question: Do p-stable distributions exist for p ≠ 2?



SLIDE 29

p-stable distributions

Fact: p-stable distributions exist for all p ∈ (0, 2] and do not exist for p > 2.

p = 1 gives the Cauchy distribution, which is the distribution of the ratio of two independent Gaussian random variables. It has the closed-form density 1/(π(1 + x²)); its mean and variance are not finite.

For general p there is no closed-form formula for the density, but one can sample from the distribution.

Streaming, sketching, and LSH ideas for ℓ2 generalize to ℓp for p ∈ (0, 2] via p-stable distributions and additional technical work.
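A sketch illustrating 1-stability of the Cauchy distribution, using the ratio-of-Gaussians characterization above; the vector x and the sample count are arbitrary illustrative choices. Since the Cauchy has no finite mean, the check uses the median: for Z = Σᵢ xᵢYᵢ with standard Cauchy Yᵢ, the median of |Z| should be ≈ ‖x‖₁, because the median of the absolute value of a standard Cauchy is 1.

```python
import random

random.seed(0)

def std_cauchy():
    """Standard Cauchy sample: ratio of two independent N(0,1) variables."""
    return random.gauss(0, 1) / random.gauss(0, 1)

x = [0.5, -1.0, 2.0, 0.25]   # ||x||_1 = 3.75
l1 = sum(abs(v) for v in x)

# Z = sum_i x_i Y_i; 1-stability says Z is distributed as ||x||_1 * Cauchy
samples = sorted(abs(sum(xi * std_cauchy() for xi in x)) for _ in range(20000))
median = samples[len(samples) // 2]

print(median, l1)   # median of |Z| is close to ||x||_1
```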


SLIDE 30

Digression: Doubling dimension

Thesis/assumption: real-world data is high-dimensional in explicit representation but low-dimensional in "content". Several interpretations of what it means for data to be low-dimensional:
Data lies in a low-dimensional manifold
Data can be projected into low dimensions while preserving certain properties (JL, for instance)
Data has a latent low-dimensional description (SVD, PCA, tensor decomposition, etc.)
Data has low doubling dimension
...


SLIDE 31

Intrinsic dimension

Let (V, dist) be a finite metric space:
dist(x, y) = dist(y, x) for all x, y ∈ V (symmetry)
dist(x, x) = 0 for all x ∈ V (reflexivity)
dist(x, y) + dist(y, z) ≥ dist(x, z) for all x, y, z ∈ V (triangle inequality)

Question: Can we quantify whether (V, dist) behaves like a low-dimensional Euclidean space? Does this have any benefits?



SLIDE 34

Doubling dimension

Property of R^d: a ball of radius r can be covered by c^d balls of radius r/2, for some constant c ≤ 4.

Given (V, dist), let B(p, r) be the ball of radius r around p, viewed as a set of points: B(p, r) = {q | dist(p, q) ≤ r}.

Definition: A finite metric space (V, dist) has doubling dimension d if for all p ∈ V and all r > 0, B(p, r) can be covered by 2^d balls of radius at most r/2.
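A sketch estimating the doubling dimension of a finite point set by greedy covering. Greedy covering only upper-bounds the optimal cover size, so this overestimates the true doubling dimension; the example point set (evenly spaced points on a line in R²) is an illustrative choice.

```python
import math

def cover_count(points, center, r):
    """Greedily cover B(center, r) ∩ points with balls of radius r/2
    centered at points; returns the number of balls used."""
    remaining = [p for p in points if math.dist(p, center) <= r]
    count = 0
    while remaining:
        c = remaining[0]
        count += 1
        remaining = [p for p in remaining if math.dist(p, c) > r / 2]
    return count

def doubling_dimension_estimate(points, radii):
    """log2 of the largest greedy cover size over all centers and radii."""
    worst = max(cover_count(points, p, r) for p in points for r in radii)
    return math.log2(worst)

# 50 evenly spaced points on a line embedded in R^2: intrinsically
# low-dimensional even though the ambient dimension is 2
line = [(i / 10, 0.0) for i in range(50)]
print(doubling_dimension_estimate(line, radii=[0.5, 1.0, 2.0]))   # small, ≈ 2
```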


SLIDE 35

Doubling dimension

Definition: A finite metric space (V, dist) has doubling dimension d if for all p ∈ V and all r > 0, B(p, r) can be covered by 2^d balls of radius at most r/2.

Many algorithms and data structures for R^d can be extended to metric spaces with doubling dimension d with comparable running times, including approximate NNS. See [Clarkson, Krauthgamer-Lee, Har-Peled-Mendel].
