

SLIDE 1

CS 498ABD: Algorithms for Big Data

LSH for ℓ2 distances

Lecture 15

October 15, 2020

Chandra (UIUC) CS498ABD 1 Fall 2020 1 / 21

SLIDE 2

LSH Approach for Approximate NNS

Use locality-sensitive hashing to solve a simplified decision problem.

Definition: A family of hash functions is (r, cr, p1, p2)-LSH with p1 > p2 and c > 1 if h drawn randomly from the family satisfies the following:
Pr[h(x) = h(y)] ≥ p1 when dist(x, y) ≤ r
Pr[h(x) = h(y)] ≤ p2 when dist(x, y) ≥ cr

Key parameter: the gap between p1 and p2, measured as ρ = (log p1)/(log p2), which is usually small.

Two-level hashing scheme:
Amplify the basic locality-sensitive hash family into a better family by repetition
Use several copies of the amplified hash functions
Layer a binary search over r on top of the above scheme
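The two-level scheme can be sketched in Python. This is an illustrative sketch, not the course's reference code: `base_family` is an assumed callable that returns a freshly sampled basic hash function, and k and L are the amplification parameters.

```python
import random

def amplified_hash(base_family, k):
    """AND-construction: concatenate k independently drawn basic hashes.
    A collision now requires agreement on all k coordinates, so a pair
    colliding with probability p under the basic family collides with
    probability p^k here, widening the gap between p1 and p2."""
    hs = [base_family() for _ in range(k)]
    return lambda x: tuple(h(x) for h in hs)

def build_tables(points, base_family, k, L):
    """OR-construction: L independent tables of amplified hashes."""
    tables = []
    for _ in range(L):
        g = amplified_hash(base_family, k)
        buckets = {}
        for p in points:
            buckets.setdefault(g(p), []).append(p)
        tables.append((g, buckets))
    return tables

def query(tables, q):
    """Return all points colliding with q in at least one table."""
    out = []
    for g, buckets in tables:
        out.extend(buckets.get(g(q), []))
    return out
```

For the Hamming cube, for instance, the basic family samples a random coordinate; concatenating k of them gives collision probability (1 − dist(x, y)/d)^k.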


SLIDE 3

LSH Approach for Approximate NNS

Key parameter: the gap between p1 and p2, measured as ρ = (log p1)/(log p2), which is usually small.
L ≃ n^ρ hash tables
Storage: n^(1+ρ) (ignoring log factors)
Query time: k·n^ρ (ignoring log factors), where k = log_{1/p2} n
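As a concrete illustration, the parameters above can be computed directly; the values of n, p1, p2 below are hypothetical, chosen only for this example.

```python
import math

# Hypothetical basic-family parameters, for illustration only
n, p1, p2 = 10**6, 0.8, 0.3

rho = math.log(p1) / math.log(p2)              # rho = (log p1)/(log p2)
k = math.ceil(math.log(n) / math.log(1 / p2))  # k = log_{1/p2} n, so p2^k <= 1/n
L = math.ceil(n ** rho)                        # roughly n^rho hash tables

print(rho, k, L)   # rho ≈ 0.185, k = 12, L = 13
```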



SLIDE 6

LSH for Euclidean Distances

Now x1, x2, ..., xn ∈ R^d and dist(x, y) = ‖x − y‖₂. First do dimensionality reduction (JL) to reduce d (if necessary) to O(log n) (since we are using a c-approximation anyway). What is a good basic locality-sensitive hashing scheme? That is, we want a hashing approach that makes nearby points more likely to collide than far-away points. Answer: projections onto random lines plus bucketing.



SLIDE 8

Random unit vector

Question: How do we generate a random unit vector in R^d (equivalently, a uniform point on the sphere S^(d−1))?

Pick d independent random variables Z1, Z2, ..., Zd where each Zi ∼ N(0, 1) and let g = (Z1, Z2, ..., Zd) (also called a random Gaussian vector). The distribution of g is rotationally symmetric, hence g points in a random direction; to obtain a random unit vector, normalize: g′ = g/‖g‖₂.

When d is large, ‖g‖₂² = Σᵢ Zᵢ² is concentrated around d, and hence ‖g‖₂ = (1 ± ε)√d with high probability. Thus g/√d is a proxy for a random unit vector and is easier to work with in many cases.
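A minimal sketch of this recipe (the dimension d = 400 is an arbitrary choice for illustration):

```python
import math
import random

random.seed(0)
d = 400

# Random Gaussian vector g = (Z_1, ..., Z_d) with Z_i ~ N(0, 1)
g = [random.gauss(0, 1) for _ in range(d)]
norm = math.sqrt(sum(z * z for z in g))

# Exact random unit vector: normalize g
u = [z / norm for z in g]

# Concentration: ||g||_2 is close to sqrt(d), so g / sqrt(d) is a cheap proxy
print(norm / math.sqrt(d))   # close to 1
```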


SLIDE 9

Projection onto a random Gaussian vector

Lemma: Suppose x ∈ R^d and g is a random Gaussian vector. Let Y = x · g. Then Y ∼ N(0, ‖x‖₂²) and hence E[Y²] = ‖x‖₂².
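A quick empirical check of the Lemma; the vector x and the sample count are arbitrary illustrative choices.

```python
import random

random.seed(0)
x = [3.0, 4.0]   # ||x||_2 = 5, so Y = x·g should be N(0, 25)
n = 50000

# Sample Y = x · g for fresh Gaussian vectors g and estimate E[Y^2]
ys = [sum(xi * random.gauss(0, 1) for xi in x) for _ in range(n)]
second_moment = sum(y * y for y in ys) / n

print(second_moment)   # close to ||x||_2^2 = 25
```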


SLIDE 10

Hashing scheme

Pick a random Gaussian vector u.
Pick a random shift a ∈ (0, r].
For a vector x, set h_{u,a}(x) = ⌊(x · u + a)/r⌋.
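A sketch of this hash family, with a small simulation estimating p1 and p2 for a near pair (distance r) and a far pair (distance cr); the dimension, the value c = 2, and the trial count are illustrative choices, not values from the lecture.

```python
import math
import random

random.seed(0)

def make_hash(d, r):
    """h_{u,a}(x) = floor((x·u + a) / r), with u a random Gaussian
    vector and a a uniform random shift in (0, r]."""
    u = [random.gauss(0, 1) for _ in range(d)]
    a = random.uniform(0, r)
    return lambda x: math.floor((sum(xi * ui for xi, ui in zip(x, u)) + a) / r)

d, r, c, trials = 10, 1.0, 2.0, 5000
origin = [0.0] * d
near = [r] + [0.0] * (d - 1)      # distance exactly r from origin
far = [c * r] + [0.0] * (d - 1)   # distance exactly c*r from origin

hits_near = hits_far = 0
for _ in range(trials):
    h = make_hash(d, r)
    hits_near += h(origin) == h(near)
    h = make_hash(d, r)
    hits_far += h(origin) == h(far)

p1_est, p2_est = hits_near / trials, hits_far / trials
print(p1_est, p2_est)   # p1_est > p2_est
```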



SLIDE 14

Analysis

Suppose x, y are such that ‖x − y‖₂ ≤ r. What is p1 = Pr[h_{u,a}(x) = h_{u,a}(y)]?
Suppose x, y are such that ‖x − y‖₂ ≥ cr. What is p2 = Pr[h_{u,a}(x) = h_{u,a}(y)]?

Let q = x − y and let s = ‖q‖₂ be the length of q. From the Lemma, q · u is distributed as s·N(0, 1).

Observations:
h(x) ≠ h(y) if |q · u| ≥ r
If |q · u| < r then h(x) = h(y) with probability 1 − |q · u|/r (over the random shift a)

Thus the collision probability depends only on s.


SLIDE 15

Analysis

As above, let q = x − y and s = ‖q‖₂, so q · u is distributed as s·N(0, 1) and collisions are determined by |q · u|. For a fixed s the collision probability is

p(s) = ∫₀^r f(t) (1 − t/r) dt,

where f is the density function of |s·N(0, 1)|. Rewriting,

p(s) = ∫₀^r (1/s) f(t/s) (1 − t/r) dt,

where f is now the density function of |N(0, 1)|.



SLIDE 17

Analysis

p(s) = ∫₀^r (1/s) f(t/s) (1 − t/r) dt, where f is the density function of |N(0, 1)|. Recall p1 = p(r) and p2 = p(cr), and we are interested in ρ = (log p1)/(log p2). Show ρ < 1/c by plot:

[Plot: ρ and 1/c versus the approximation factor c, for c from 1 to 10; the ρ curve lies below the 1/c curve.]
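The collision probability p(s) can be evaluated numerically. One caveat, stated as an assumption: the sketch below treats the bucket width as a tunable parameter (as in the original scheme of Datar et al.) rather than fixing it to r, and uses width 4r with c = 2; with the width fixed to exactly r the ratio comes out larger.

```python
import math

def collision_prob(s, width, steps=200000):
    """p(s) = integral_0^width f(t) * (1 - t/width) dt, where f is the
    density of |s*N(0,1)|; midpoint-rule numerical integration."""
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * width / steps
        f = (2.0 / (s * math.sqrt(2 * math.pi))) * math.exp(-((t / s) ** 2) / 2)
        total += f * (1 - t / width) * (width / steps)
    return total

r, c = 1.0, 2.0
width = 4 * r                       # assumed tunable bucket width
p1 = collision_prob(r, width)       # near pair, distance r
p2 = collision_prob(c * r, width)   # far pair, distance c*r
rho = math.log(p1) / math.log(p2)

print(rho)   # ≈ 0.45, below 1/c = 0.5
```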


SLIDE 18

NNS for Euclidean distances

For any fixed c > 1, use the above scheme to obtain:
Storage: O(n^(1+1/c) · polylog(n))
Query time: O(d · n^(1/c) · polylog(n))

Can use JL to reduce d to O(log n).


SLIDE 19

Improved LSH Scheme

[Andoni-Indyk'06] The basic LSH scheme projects points onto lines. Better scheme: pick some small constant t and project points into R^t, and use a lattice-based space-partitioning scheme to "bucket" instead of intervals.

[Figures from Piotr Indyk's slides, illustrating the projection into R^t and the lattice-based bucketing; a lower bound ρ ≥ 0.45/c² is noted.]


SLIDE 20

Improved LSH Scheme

This leads to ρ ≃ 1/c² + O(log t/√t), which tends to 1/c² for large t and fixed c. A lower bound for LSH in ℓ2 says ρ ≥ 1/c².



SLIDE 22

Data dependent LSH Scheme

LSH is data oblivious: the hash families are chosen before seeing the data. Can one do better by choosing hash functions based on the given set of points?

Yes. [Andoni-Indyk-Nguyen-Razenshteyn'14, Andoni-Razenshteyn'15]
ρ = 1/(2c² − 1) for ℓ2, improving upon 1/c² for data-oblivious LSH (which is tight in the worst case)
ρ = 1/(2c − 1) for ℓ1/Hamming cube, improving upon 1/c for data-oblivious LSH


SLIDE 23

LSH Summary

A modular, hashing-based scheme for similarity estimation.
Main competitors are space-partitioning data structures such as variants of k-d trees.
LSH provides speedups but uses more memory; neither approach is a clear winner.



SLIDE 26

Digression: p-stable distributions

For F2 estimation, JL, and LSH we used an important "stability" property of the Normal distribution.

Lemma: Let Y1, Y2, ..., Yd be independent random variables with distribution N(0, 1). Then Z = Σᵢ xᵢYᵢ has distribution ‖x‖₂·N(0, 1).

The standard Gaussian is 2-stable.

Definition: A distribution D is p-stable if Z = Σᵢ xᵢYᵢ has distribution ‖x‖ₚ·D when the Yi are independent and each of them is distributed as D.

Question: Do p-stable distributions exist for p ≠ 2?



SLIDE 29

p-stable distributions

Fact: p-stable distributions exist for all p ∈ (0, 2] and do not exist for p > 2.

p = 1 gives the Cauchy distribution, which is the distribution of the ratio of two independent Gaussian random variables. It has the closed-form density 1/(π(1 + x²)); its mean and variance are not finite.

For general p there is no closed-form formula for the density, but one can sample from the distribution.

Streaming, sketching, and LSH ideas for ℓ2 generalize to ℓp for p ∈ (0, 2] via p-stable distributions and additional technical work.
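A sketch illustrating 1-stability of the Cauchy distribution, using the ratio-of-Gaussians characterization above; the vector x and the sample count are arbitrary illustrative choices. Since the Cauchy has no finite mean, the check uses the median: for Z = Σᵢ xᵢYᵢ with standard Cauchy Yᵢ, the median of |Z| should be ≈ ‖x‖₁, because the median of the absolute value of a standard Cauchy is 1.

```python
import random

random.seed(0)

def std_cauchy():
    """Standard Cauchy sample: ratio of two independent N(0,1) variables."""
    return random.gauss(0, 1) / random.gauss(0, 1)

x = [0.5, -1.0, 2.0, 0.25]   # ||x||_1 = 3.75
l1 = sum(abs(v) for v in x)

# Z = sum_i x_i Y_i; 1-stability says Z is distributed as ||x||_1 * Cauchy
samples = sorted(abs(sum(xi * std_cauchy() for xi in x)) for _ in range(20000))
median = samples[len(samples) // 2]

print(median, l1)   # median of |Z| is close to ||x||_1
```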


SLIDE 30

Digression: Doubling dimension

Thesis/assumption: real-world data is high-dimensional in explicit representation but low-dimensional in "content". Several interpretations of what it means for data to be low-dimensional:
Data lies in a low-dimensional manifold
Data can be projected into low dimensions while preserving certain properties (JL, for instance)
Data has a latent low-dimensional description (SVD, PCA, tensor decomposition, etc.)
Data has low doubling dimension
...


SLIDE 31

Intrinsic dimension

Let (V, dist) be a finite metric space:
dist(x, y) = dist(y, x) for all x, y ∈ V (symmetry)
dist(x, x) = 0 for all x ∈ V (reflexivity)
dist(x, y) + dist(y, z) ≥ dist(x, z) for all x, y, z ∈ V (triangle inequality)

Question: Can we quantify whether (V, dist) behaves like a low-dimensional Euclidean space? Does this have any benefits?



SLIDE 34

Doubling dimension

Property of R^d: a ball of radius r can be covered by c^d balls of radius r/2, for some constant c ≤ 4.

Given (V, dist), let B(p, r) be the ball of radius r around p, viewed as a set of points: B(p, r) = {q | dist(p, q) ≤ r}.

Definition: A finite metric space (V, dist) has doubling dimension d if for all p ∈ V and all r > 0, B(p, r) can be covered by 2^d balls of radius at most r/2.
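A sketch estimating the doubling dimension of a finite point set by greedy covering. Greedy covering only upper-bounds the optimal cover size, so this overestimates the true doubling dimension; the example point set (evenly spaced points on a line in R²) is an illustrative choice.

```python
import math

def cover_count(points, center, r):
    """Greedily cover B(center, r) ∩ points with balls of radius r/2
    centered at points; returns the number of balls used."""
    remaining = [p for p in points if math.dist(p, center) <= r]
    count = 0
    while remaining:
        c = remaining[0]
        count += 1
        remaining = [p for p in remaining if math.dist(p, c) > r / 2]
    return count

def doubling_dimension_estimate(points, radii):
    """log2 of the largest greedy cover size over all centers and radii."""
    worst = max(cover_count(points, p, r) for p in points for r in radii)
    return math.log2(worst)

# 50 evenly spaced points on a line embedded in R^2: intrinsically
# low-dimensional even though the ambient dimension is 2
line = [(i / 10, 0.0) for i in range(50)]
print(doubling_dimension_estimate(line, radii=[0.5, 1.0, 2.0]))   # small, ≈ 2
```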


SLIDE 35

Doubling dimension

Definition: A finite metric space (V, dist) has doubling dimension d if for all p ∈ V and all r > 0, B(p, r) can be covered by 2^d balls of radius at most r/2.

Many algorithms and data structures for R^d can be extended to metric spaces with doubling dimension d with comparable running times, including approximate NNS. See [Clarkson, Krauthgamer-Lee, Har-Peled-Mendel].
