SLIDE 1

COMP9313: Big Data Management

High Dimensional Similarity Search

SLIDE 2: Similarity Search

  • Problem Definition:
  • Given a query q and a dataset D, find o ∈ D such that o is similar to q
  • Two types of similarity search
  • Range search:
  • dist(o, q) ≀ τ
  • Nearest neighbor (NN) search:
  • dist(o*, q) ≀ dist(o, q), βˆ€o ∈ D
  • Top-k version
  • Distance/similarity function varies
  • Euclidean, Jaccard, inner product, …
  • Classic problem, with mature solutions

(Figure: query q, its nearest neighbor o*, and range Ο„)

SLIDE 3: High Dimensional Similarity Search

  • Applications and relationship to Big Data
  • Almost every object can be (and has been) represented by a high dimensional vector
  • Words, documents
  • Image, audio, video
  • …
  • Similarity search is a fundamental operation in information retrieval
  • E.g., Google search engine, face recognition systems, …
  • High dimensionality makes a huge difference!
  • Traditional solutions are no longer feasible
  • This lecture is about why, and how
  • We focus on high dimensional vectors in Euclidean space

SLIDE 4: Similarity Search in Low Dimensional Space

SLIDE 5: Similarity Search in One Dimensional Space

  • Data are just numbers: use binary search, binary search trees, B+ Trees, …
  • The essential idea behind all of them: objects can be sorted

SLIDE 6: Similarity Search in Two Dimensional Space

  • Why does binary search no longer work?
  • No total order!
  • Voronoi diagram

(Figures: Voronoi diagrams under Euclidean distance and Manhattan distance)

SLIDE 7: Similarity Search in Two Dimensional Space

  • Partition based algorithms
  • Partition data into β€œcells”
  • Nearest neighbors are in the same cell as the query, or in adjacent cells
  • How many β€œcells” to probe in 3-dimensional space?

SLIDE 8: Similarity Search in Metric Space

  • Triangle inequality
  • dist(x, q) ≀ dist(x, y) + dist(y, q)
  • Orchard’s Algorithm
  • For each x ∈ D, create a list of the other points in increasing order of distance to x
  • Given query q, randomly pick a point as the initial candidate (i.e., pivot p), and compute dist(p, q)
  • Walk along the list of p and compute the distances to q. If some y closer to q than p is found, use y as the new pivot (i.e., p ← y)
  • Repeat the procedure, and stop when the next y in the list satisfies dist(p, y) > 2 β‹… dist(p, q)
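
The walk-based search above is easy to express in code. Below is a minimal Python sketch of Orchard's algorithm (not from the slides; the function names are mine and the O(nΒ²) preprocessing is kept naive for clarity):

    import math
    import random

    def dist(x, y):
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    def build_orchard_lists(data):
        # For each point, precompute the other points sorted by distance to it.
        return {
            i: sorted((j for j in range(len(data)) if j != i),
                      key=lambda j: dist(data[i], data[j]))
            for i in range(len(data))
        }

    def orchard_search(data, lists, q):
        p = random.randrange(len(data))       # initial pivot, picked randomly
        d_p = dist(data[p], q)
        improved = True
        while improved:
            improved = False
            for y in lists[p]:
                if dist(data[p], data[y]) > 2 * d_p:
                    break                     # stopping rule from the slide
                d_y = dist(data[y], q)
                if d_y < d_p:                 # found a closer point: new pivot
                    p, d_p = y, d_y
                    improved = True
                    break
        return p, d_p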

SLIDE 9: Similarity Search in Metric Space

  • Orchard’s Algorithm: why can we stop when dist(p, y) > 2 β‹… dist(p, q)?
  • 2 β‹… dist(p, q) < dist(p, y) and dist(p, y) ≀ dist(p, q) + dist(y, q)
    β‡’ 2 β‹… dist(p, q) < dist(p, q) + dist(y, q)
    ⇔ dist(p, q) < dist(y, q)
  • Since the list of p is in increasing order of distance to p, dist(p, y) > 2 β‹… dist(p, q) holds for all the remaining y’s, so none of them can be closer to q than p

SLIDE 10: None of the Above Works in High Dimensional Space!

SLIDE 11: Curse of Dimensionality

  • Refers to various phenomena that arise in high dimensional spaces but do not occur in low dimensional settings
  • Triangle inequality
  • Its pruning power degrades heavily
  • What is the volume of a high dimensional β€œring” (i.e., hyperspherical shell)?
  • For a shell of width w inside a ball of radius r, the volume ratio is 1 βˆ’ ((r βˆ’ w)/r)^d:
  • V_ring(w=1, d=2) / V_ball(r=10, d=2) = 19%
  • V_ring(w=1, d=100) / V_ball(r=10, d=100) = 99.997%
  • In high dimensions, almost all of the volume concentrates near the surface
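
A quick sanity check of these ratios (a sketch; it uses the fact that the volume of a d-dimensional ball scales as r^d, so the shell fraction is 1 βˆ’ ((r βˆ’ w)/r)^d):

    def shell_fraction(r: float, w: float, d: int) -> float:
        """Fraction of a d-ball's volume within distance w of its surface."""
        return 1.0 - ((r - w) / r) ** d

    print(shell_fraction(10, 1, 2))    # 0.19
    print(shell_fraction(10, 1, 100))  # ~0.99997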

SLIDE 12: Approximate Nearest Neighbor Search in High Dimensional Space

  • There is no sub-linear solution to find the exact result of a nearest neighbor query
  • So we relax the condition
  • Approximate nearest neighbor search (ANNS)
  • Allow returned points to be not the NN of the query
  • Success: returns the true NN
  • Use the success rate (e.g., percentage of succeeded queries) to evaluate a method
  • Hard to bound the success rate

SLIDE 13: c-approximate NN Search

  • Success: returns o such that
  • dist(o, q) ≀ c β‹… dist(o*, q)
  • Then we can bound the success probability
  • Usually denoted as 1 βˆ’ δ
  • Solution: Locality Sensitive Hashing (LSH)

(Figure: query q, true NN o* at distance r, and the c-approximate ball of radius cr)

SLIDE 14: Locality Sensitive Hashing

  • Hash function
  • Index: map data/objects to values (e.g., hash keys)
  • Same data β‡’ same hash key (with 100% probability)
  • Different data β‡’ different hash keys (with high probability)
  • Retrieval: easy to retrieve identical objects (as they have the same hash key)
  • Applications: hash map, hash join
  • Low cost
  • Space: O(n)
  • Time: O(1)
  • Why can’t it be used in nearest neighbor search?
  • Even a minor difference leads to totally different hash keys

SLIDE 15: Locality Sensitive Hashing

  • Index: make the hash functions error tolerant
  • Similar data β‡’ same hash key (with high probability)
  • Dissimilar data β‡’ different hash keys (with high probability)
  • Retrieval:
  • Compute the hash key for the query
  • Obtain all the data that has the same key as the query (i.e., the candidates)
  • Find the nearest one to the query
  • Cost:
  • Space: O(n)
  • Time: O(1) + O(|cand|)
  • This is not yet the real Locality Sensitive Hashing!
  • We still have several unsolved issues…

SLIDE 16: LSH Functions

  • Formal definition:
  • Given points o1, o2, distances r1 < r2, and probabilities p1 > p2
  • An LSH function h(β‹…) should satisfy
  • Pr[h(o1) = h(o2)] β‰₯ p1, if dist(o1, o2) ≀ r1
  • Pr[h(o1) = h(o2)] ≀ p2, if dist(o1, o2) > r2
  • What is h(β‹…) for a given distance/similarity function?
  • Jaccard similarity
  • Angular distance
  • Euclidean distance

SLIDE 17: MinHash - LSH Function for Jaccard Similarity

  • Each data object is a set
  • Jaccard(S1, S2) = |S1 ∩ S2| / |S1 βˆͺ S2|
  • Randomly generate a global order for all the elements in C = βˆͺ_{i=1..n} S_i
  • Let h(S) be the minimal member of S with respect to the global order
  • For example, S = {b, c, e, g, i}; if we use inverse alphabetical order, the re-ordered S = {i, g, e, c, b}, hence h(S) = i
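
A small Python sketch of MinHash as defined above (not from the slides; it materializes random permutations as the global order, whereas practical implementations approximate them with cheap hash functions):

    import random

    def make_minhash(universe, seed):
        """One MinHash function: a random global order over the universe."""
        rng = random.Random(seed)
        perm = rng.sample(sorted(universe), len(universe))
        order = {e: rank for rank, e in enumerate(perm)}
        return lambda s: min(s, key=order.__getitem__)

    universe = set("abcdefghij")
    hs = [make_minhash(universe, seed) for seed in range(200)]

    S1, S2 = set("abcde"), set("abcdf")
    est = sum(h(S1) == h(S2) for h in hs) / len(hs)
    print(est)  # ~ |S1 ∩ S2| / |S1 βˆͺ S2| = 4/6 β‰ˆ 0.67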

SLIDE 18: MinHash

  • Now we compute Pr[h(S1) = h(S2)]
  • Every element e ∈ S1 βˆͺ S2 has an equal chance to be the first element among S1 βˆͺ S2 after re-ordering
  • If that first element e ∈ S1 ∩ S2, then h(S1) = h(S2)
  • If e βˆ‰ S1 ∩ S2, then h(S1) β‰  h(S2)
  • Pr[h(S1) = h(S2)] = |S1 ∩ S2| / |S1 βˆͺ S2| = Jaccard(S1, S2)

SLIDE 19: SimHash – LSH Function for Angular Distance

  • Each data object is a d dimensional vector
  • θ(x, y) is the angle between x and y
  • Randomly generate a normal vector a, where each a_i ~ N(0, 1)
  • Let h(x; a) = sgn(aᡀx)
  • sgn(v) = 1 if v β‰₯ 0; βˆ’1 if v < 0
  • i.e., the hash records on which side of a’s corresponding hyperplane x lies
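
A matching Python sketch for SimHash (illustrative only; the collision-rate estimate anticipates the 1 βˆ’ ΞΈ/Ο€ result derived on the next slide):

    import random

    def make_simhash(d, seed):
        """One SimHash function: sign of the projection onto a random normal vector."""
        rng = random.Random(seed)
        a = [rng.gauss(0, 1) for _ in range(d)]
        return lambda x: 1 if sum(ai * xi for ai, xi in zip(a, x)) >= 0 else -1

    hs = [make_simhash(3, seed) for seed in range(1000)]
    x, y = [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]   # angle ΞΈ = Ο€/4
    est = sum(h(x) == h(y) for h in hs) / len(hs)
    print(est)  # β‰ˆ 1 βˆ’ ΞΈ/Ο€ = 0.75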

SLIDE 20: SimHash

  • Now we compute Pr[h(o1) = h(o2)]
  • h(o1) β‰  h(o2) iff o1 and o2 are on different sides of the hyperplane with a as its normal vector
  • A random hyperplane separates them with probability ΞΈ/Ο€, hence Pr[h(o1) = h(o2)] = 1 βˆ’ ΞΈ/Ο€

(Figure: o1 and o2 separated by angle ΞΈ, with a random hyperplane whose normal vector is a)

SLIDE 21: p-stable LSH - LSH Function for Euclidean Distance

  • Each data object is a d dimensional vector
  • dist(x, y) = sqrt(Ξ£_{i=1..d} (x_i βˆ’ y_i)Β²)
  • Randomly generate a normal vector a, where each a_i ~ N(0, 1)
  • The normal distribution is 2-stable, i.e., if a_i ~ N(0, 1), then Ξ£_{i=1..d} a_i β‹… x_i ~ N(0, β€–xβ€–β‚‚Β²)
  • Let h(x; a, b) = ⌊(aᡀx + b) / wβŒ‹, where b ~ U(0, w) and w is a user-specified parameter (the bucket width)
  • Pr[h(o1; a, b) = h(o2; a, b)] = βˆ«β‚€^w (1/r) β‹… f(t/r) β‹… (1 βˆ’ t/w) dt, where r = dist(o1, o2)
  • f(β‹…) is the pdf of the absolute value of a standard normal variable
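
And a sketch of the p-stable hash for Euclidean distance (illustrative; the bucket width w = 4 and the test points are arbitrary choices):

    import math
    import random

    def make_pstable_hash(d, w, seed):
        """One p-stable LSH function h(x) = floor((a.x + b) / w)."""
        rng = random.Random(seed)
        a = [rng.gauss(0, 1) for _ in range(d)]
        b = rng.uniform(0, w)
        return lambda x: math.floor((sum(ai * xi for ai, xi in zip(a, x)) + b) / w)

    hs = [make_pstable_hash(2, 4.0, seed) for seed in range(2000)]
    near = sum(h([0.0, 0.0]) == h([1.0, 0.0]) for h in hs) / len(hs)
    far  = sum(h([0.0, 0.0]) == h([8.0, 0.0]) for h in hs) / len(hs)
    print(near, far)  # nearby points collide far more often than distant ones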

SLIDE 22: p-stable LSH

  • Intuition of p-stable LSH
  • Similar points have a higher chance of being hashed together

SLIDE 23: Pr[h(x) = h(y)] for Different Hash Functions

(Figure: collision probability curves for MinHash, SimHash, and p-stable LSH)

SLIDE 24: Problem of a Single Hash Function

  • Hard to distinguish two pairs whose distances are close to each other
  • Pr[h(o1) = h(o2)] β‰₯ p1, if dist(o1, o2) ≀ r1
  • Pr[h(o1) = h(o2)] ≀ p2, if dist(o1, o2) > r2
  • We also want to control where the drastic change of the collision probability happens…
  • Close to dist(o*, q)
  • Or at a given range

SLIDE 25: AND-OR Composition

  • Recall that for a single hash function, we have
  • Pr[h(o1) = h(o2)] = p(dist(o1, o2)), denoted as p_{o1,o2}
  • Now we consider two scenarios:
  • Combine k hashes together, using an AND operation
  • One must match all the hashes
  • Pr[H^AND(o1) = H^AND(o2)] = (p_{o1,o2})^k
  • Combine l hashes together, using an OR operation
  • One needs to match at least one of the hashes
  • Pr[H^OR(o1) = H^OR(o2)] = 1 βˆ’ (1 βˆ’ p_{o1,o2})^l
  • No match only when none of the hashes match
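
These compositions are one-liners to check numerically. A sketch (with k = l = 5, matching the MinHash example on the following slides):

    def p_and(p, k):
        """Collision probability after AND-combining k independent hashes."""
        return p ** k

    def p_or(p, l):
        """Collision probability after OR-combining l independent hashes."""
        return 1 - (1 - p) ** l

    def p_and_or(p, k, l):
        """l super-hashes, each the AND of k hashes (see Slide 27)."""
        return 1 - (1 - p ** k) ** l

    for p in (0.2, 0.4, 0.6, 0.7, 0.8, 0.9):
        print(p, round(p_and_or(p, 5, 5), 3))  # reproduces the table on Slide 28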

SLIDE 26: AND-OR Composition

  • Example with MinHash, k = 5, l = 5

(Figure: curves of Pr[H^AND(o1) = H^AND(o2)], Pr[H^OR(o1) = H^OR(o2)], and Pr[h(o1) = h(o2)])

SLIDE 27: AND-OR Composition in LSH

  • Let h_{j,i} be LSH functions, where j ∈ {1, 2, …, l}, i ∈ {1, 2, …, k}
  • Let H_j(o) = [h_{j,1}(o), h_{j,2}(o), …, h_{j,k}(o)]
  • The super-hash (AND composition)
  • H_j(o1) = H_j(o2) ⇔ βˆ€i ∈ {1, 2, …, k}: h_{j,i}(o1) = h_{j,i}(o2)
  • Consider query q and any data point o; o is a nearest neighbor candidate of q if
  • βˆƒj ∈ {1, 2, …, l}: H_j(o) = H_j(q)  (OR composition over the l super-hashes)
  • The probability that o is a nearest neighbor candidate of q is
  • 1 βˆ’ (1 βˆ’ (p_{q,o})^k)^l

SLIDE 28: The Effectiveness of LSH

  • How 1 βˆ’ (1 βˆ’ (p_{q,o})^k)^l changes with p_{q,o} (k = 5, l = 5):

    p_{q,o} | 1 βˆ’ (1 βˆ’ (p_{q,o})^k)^l
    0.2     | 0.002
    0.4     | 0.050
    0.6     | 0.333
    0.7     | 0.601
    0.8     | 0.863
    0.9     | 0.988

  • E.g., we are expected to retrieve 98.8% of the data with Jaccard similarity > 0.9

SLIDE 29: False Positives and False Negatives

  • False positive:
  • a returned point o with dist(o, q) > r2
  • False negative:
  • a point o with dist(o, q) < r1 that is not returned
  • They can be controlled by carefully choosing k and l
  • It’s a trade-off between space/time and accuracy

SLIDE 30: The Framework of NNS Using LSH

  • Pre-processing
  • Generate LSH functions
  • MinHash: random permutations
  • SimHash: random normal vectors
  • p-stable: random normal vectors and random uniform values
  • Index
  • Compute H_j(o) for each data object o, j ∈ {1, …, l}
  • Index o using H_j(o) as the key in the j-th hash table
  • Query
  • Compute H_j(q) for query q, j ∈ {1, …, l}
  • Generate the candidate set {o | βˆƒj ∈ {1, …, l}: H_j(q) = H_j(o)}
  • Compute the actual distance for all the candidates and return the nearest one to the query
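
The whole framework fits in a few lines. A self-contained sketch using SimHash as the LSH family (not the lecture's reference code; hash keys are tuples of k signs, and l tables are probed):

    import math
    import random
    from collections import defaultdict

    def make_simhash(d, seed):
        rng = random.Random(seed)
        a = [rng.gauss(0, 1) for _ in range(d)]
        return lambda x: 1 if sum(ai * xi for ai, xi in zip(a, x)) >= 0 else -1

    def build_index(data, d, k, l, seed=0):
        # l hash tables; the key of table j is the super-hash H_j = (h_{j,1}, ..., h_{j,k})
        hs = [[make_simhash(d, seed + j * k + i) for i in range(k)] for j in range(l)]
        tables = [defaultdict(list) for _ in range(l)]
        for idx, x in enumerate(data):
            for j in range(l):
                tables[j][tuple(h(x) for h in hs[j])].append(idx)
        return hs, tables

    def query(q, data, hs, tables):
        cand = set()
        for j, table in enumerate(tables):
            cand.update(table.get(tuple(h(q) for h in hs[j]), []))
        # verification: compute the exact distance for the candidates only
        return min(cand, key=lambda i: math.dist(q, data[i]), default=None)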

SLIDE 31: The Drawback of LSH

  • Concatenating k hashes is too β€œstrong”
  • h_{j,i}(o1) β‰  h_{j,i}(o2) for any single i β‡’ H_j(o1) β‰  H_j(o2)
  • Not adaptive to the distribution of the distances
  • What if there are not enough candidates?
  • Need to tune w (or build indexes with different w’s) to handle different cases

SLIDE 32: Multi-Probe LSH

  • Observation:
  • If q’s nearest neighbor does not fall into q’s hash bucket, then most likely it falls into a bucket adjacent to q’s
  • Why? aᡀx βˆ’ aᡀq = aᡀ(x βˆ’ q) ~ N(0, β€–x βˆ’ qβ€–β‚‚Β²), so the projections of close points are close
  • Idea:
  • Look not only at the hash bucket where q falls, but also at those adjacent to it
  • Problem:
  • How many such buckets? 2k per table
  • And they are not equally important!

SLIDE 33: Multi-Probe LSH

  • Consider the case when k = 2:
  • Note that H_j(o) = (h_{j,1}(o), h_{j,2}(o))
  • The ideal probe order would be, e.g.:
  • (h_{j,1}(q), h_{j,2}(q)): 0.315
  • (h_{j,1}(q), h_{j,2}(q) βˆ’ 1): 0.284
  • (h_{j,1}(q) + 1, h_{j,2}(q)): 0.150
  • (h_{j,1}(q) βˆ’ 1, h_{j,2}(q)): 0.035
  • (h_{j,1}(q), h_{j,2}(q) + 1): 0.019
  • We don’t have to compute the integration: the offset between q’s projection and the bucket boundaries is enough to rank the probes

SLIDE 34: Multi-Probe LSH

  • Pros:
  • Requires a smaller l (fewer hash tables)
  • Because we use the hash tables more efficiently
  • More robust against the unlucky points
  • Cons:
  • Loses the theoretical guarantee on the results
  • Not parallel-friendly

SLIDE 35: Collision Counting LSH (C2LSH)

  • C2LSH (SIGMOD’12 paper)
  • Which one is closer to q?
  • We will omit the theoretical parts, which leads to a slightly different version from the paper
  • But the essential ideas are the same
  • Project 1 is to implement C2LSH using PySpark!

(Figure: hash values of q, o1, and o2 across the hash functions; collisions with q are counted per function)

SLIDE 36: Counting the Collisions

  • Collision: a match on a single hash function
  • Use the number of collisions to determine the candidates
  • The candidate condition changes from β€œmatches one of the super-hashes of q” to β€œcollides with q on at least Ξ±m of the m hash values”
  • Recall that in LSH, the probability that o with dist(o, q) ≀ r1 is a nearest neighbor candidate of q is 1 βˆ’ (1 βˆ’ p1^k)^l
  • Now we compute the corresponding probability with collision counting…

SLIDE 37: The Collision Probability

  • βˆ€o with dist(o, q) ≀ r1, we have
  • Pr[#collision(o) β‰₯ Ξ±m] = Ξ£_{i=Ξ±m..m} C(m, i) β‹… p^i β‹… (1 βˆ’ p)^{mβˆ’i}
  • p = Pr[h_j(o) = h_j(q)] β‰₯ p1
  • We define m Bernoulli random variables X_i, 1 ≀ i ≀ m
  • Let X_i equal 1 if o does not collide with q on the i-th hash function
  • i.e., Pr[X_i = 1] = 1 βˆ’ p
  • Let X_i equal 0 if o collides with q on the i-th hash function
  • i.e., Pr[X_i = 0] = p
  • Hence E[X_i] = 1 βˆ’ p
  • Thus E[XΜ„] = 1 βˆ’ p, where XΜ„ = (Ξ£_{i=1..m} X_i) / m
  • Let t = p βˆ’ Ξ± > 0; we have:
  • Pr[XΜ„ βˆ’ E[XΜ„] β‰₯ t] = Pr[(Ξ£_{i=1..m} X_i)/m βˆ’ (1 βˆ’ p) β‰₯ t] = Pr[Ξ£ X_i β‰₯ (1 βˆ’ Ξ±)m]

SLIDE 38: The Collision Probability

  • From Hoeffding’s Inequality, we have
  • Pr[XΜ„ βˆ’ E[XΜ„] β‰₯ t] = Pr[Ξ£ X_i β‰₯ (1 βˆ’ Ξ±)m] ≀ exp(βˆ’2(p βˆ’ Ξ±)Β²mΒ² / Ξ£_{i=1..m} (1 βˆ’ 0)Β²) = exp(βˆ’2(p βˆ’ Ξ±)Β²m) ≀ exp(βˆ’2(p1 βˆ’ Ξ±)Β²m)
  • Since the event β€œ#collision(o) β‰₯ Ξ±m” is equivalent to the event β€œo misses the collision with q fewer than (1 βˆ’ Ξ±)m times”,
  • Pr[#collision(o) β‰₯ Ξ±m] = Pr[Ξ£ X_i < (1 βˆ’ Ξ±)m] β‰₯ 1 βˆ’ exp(βˆ’2(p1 βˆ’ Ξ±)Β²m)
  • You can now compute the case for o with dist(o, q) β‰₯ r2 in a similar way…
  • Then we can set Ξ± accordingly to control false positives and false negatives
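
A tiny sketch showing how this lower bound behaves as the number of hash functions m grows (the values p1 = 0.9 and Ξ± = 0.8 are illustrative, not from the slides):

    import math

    def collision_lower_bound(p1: float, alpha: float, m: int) -> float:
        """Hoeffding lower bound on Pr[#collision >= alpha*m] for a close point."""
        assert p1 > alpha
        return 1 - math.exp(-2 * (p1 - alpha) ** 2 * m)

    for m in (50, 100, 200):
        print(m, round(collision_lower_bound(0.9, 0.8, m), 4))  # 0.6321, 0.8647, 0.9817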

SLIDE 39: Virtual Rehashing

  • When we are not getting enough candidates…
  • E.g., # of candidates < top-k
  • Observation:
  • A close point o that does not collide with q usually falls into hash buckets adjacent to q’s
  • Why? aᡀx βˆ’ aᡀq ~ N(0, β€–x βˆ’ qβ€–β‚‚Β²)
  • Idea:
  • Include the adjacent hash buckets into consideration
  • So you don’t need to re-hash the data again…

SLIDE 40: Virtual Rehashing

  • At first, count a collision only when h(o) = h(q)
  • If there are not enough candidates, also count h(o) = h(q) Β± 1
  • Then h(o) = h(q) Β± 2, and so on…

(Figure: hash values of q and o1 across the hash functions; the collision count grows as the offset is relaxed step by step)
slide-41
SLIDE 41
  • Pre-processing
  • Generate LSH functions
  • Random normal vectors and random uniform values
  • Index
  • Compute and store β„Žπ‘— 𝑝 for each data object 𝑝,

𝑗 ∈ {1, … , 𝑛}

  • Query
  • Compute β„Žπ‘— π‘Ÿ for query π‘Ÿ, 𝑗 ∈ {1, … , 𝑛}
  • Take those 𝑝 that shares at least 𝛽𝑛 hashes with π‘Ÿ as

candidates

  • Relax the collision condition (e.g., virtual rehashing) and

repeat the above step, until we got enough candidates

41

The Framework of NNS using C2LSH

SLIDE 42: Pseudocode of Candidate Generation in C2LSH

    candGen(data_hashes, query_hashes, αm, βn):
        offset ← 0
        cand ← βˆ…
        while true:
            for each (id, hashes) in data_hashes:
                if count(hashes, query_hashes, offset) β‰₯ Ξ±m:
                    cand ← cand βˆͺ {id}
            if |cand| < Ξ²n:
                offset ← offset + 1
            else:
                break
        return cand

SLIDE 43: Pseudocode of Candidate Generation in C2LSH

    count(hashes_1, hashes_2, offset):
        counter ← 0
        for each pair (hash1, hash2) in (hashes_1, hashes_2):
            if |hash1 βˆ’ hash2| ≀ offset:
                counter ← counter + 1
        return counter
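
A direct Python translation of the two routines (a sketch of the light version above; note that in Project 1 this logic must be expressed with PySpark transformations rather than a driver-side loop):

    def count_collisions(hashes_1, hashes_2, offset):
        # A hash value collides if it is within Β±offset of the query's value.
        return sum(1 for h1, h2 in zip(hashes_1, hashes_2) if abs(h1 - h2) <= offset)

    def cand_gen(data_hashes, query_hashes, alpha_m, beta_n):
        offset = 0
        while True:
            cand = {oid for oid, hashes in data_hashes
                    if count_collisions(hashes, query_hashes, offset) >= alpha_m}
            if len(cand) < beta_n:
                offset += 1   # virtual rehashing: relax the collision condition
            else:
                return cand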

SLIDE 44: Project 1

  • The spec has been released; deadline: 18 Jul, 2020
  • Late penalty: 10% on day 1 and 30% on each subsequent day
  • Implement a light version of C2LSH (i.e., the one we introduced in the lecture)
  • Start working ASAP
  • Evaluation: correctness and efficiency
  • Must use PySpark; some Python modules and PySpark functions are banned
  • E.g., numpy, pandas, collect(), take(), …
  • Use transformations!

SLIDE 45: Project 1

  • There will be a bonus part (max 20 points) to encourage efficient implementations
  • Details in the spec
  • Make sure you have valid output
  • Make your own test cases; a real dataset would be more desirable
  • The toy example in the spec is a real β€œtoy” (e.g., for babies…)
  • We won’t accept excuses like β€œit works on my own computer”
  • Don’t violate the Student Conduct!!!

SLIDE 46: Product Quantization and K-Means Clustering

SLIDE 47: Recall: NNS in High Dimensional Euclidean Space

  • NaΓ―ve (but exact) solution:
  • Linear scan: compute dist(o, q) for all o ∈ D
  • dist(o, q) = sqrt(Ξ£_{i=1..d} (o_i βˆ’ q_i)Β²)
  • O(nd) time
  • n times (d subtractions + d βˆ’ 1 additions + d multiplications)
  • Storage is also costly: O(nd)
  • Could be problematic in DBMSs and distributed systems
  • This motivates the idea of compression
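
For reference, the naΓ―ve linear scan is a one-liner (a sketch using Python's math.dist):

    import math

    def linear_scan_nn(data, q):
        """Exact NN by brute force: O(n*d) time over n vectors of dimension d."""
        return min(data, key=lambda o: math.dist(o, q))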

SLIDE 48: Vector Quantization

  • Idea: compressed representation of vectors
  • Each vector o is represented by a representative
  • Denoted as rep(o)
  • We will discuss how to get the representatives later
  • We control the total number of representatives for the dataset (denoted as k)
  • One representative represents multiple vectors
  • Instead of storing o, we store its representative id
  • d floats => 1 integer
  • Instead of computing dist(o, q), we compute dist(rep(o), q)
  • We only need k distance computations!

SLIDE 49: How to Generate Representatives

  • Assigning representatives is essentially a partition problem
  • Construct a β€œgood” partition of a database of n objects into a set of k clusters
  • How to measure the β€œgoodness” of a given partitioning scheme?
  • Cost of a cluster
  • Cost(C_i) = Ξ£_{o_j ∈ C_i} β€–o_j βˆ’ center(C_i)β€–β‚‚Β²
  • Cost of k clusters: the sum of Cost(C_i)
slide-50
SLIDE 50
  • It’s an optimization problem!
  • Global optimal:
  • NP-hard (for a wide range of cost functions)
  • Requires exhaustively enumerate all π‘œ

𝑙 partitions

  • Stirling numbers of the second kind
  • π‘œ

𝑙 ~ π‘™π‘œ 𝑙! when π‘œ β†’ ∞

  • Heuristic methods:
  • k-means
  • Many variants

50

Partitioning Problem: Basic Concept

SLIDE 51: The k-Means Clustering Method

  • Given k, the k-means algorithm is implemented in four steps:
  • 1. Partition objects into k nonempty subsets (randomly)
  • 2. Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
  • 3. Assign each object to the cluster with the nearest seed point
  • 4. Go back to Step 2; stop when the assignment does not change
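
A plain Python sketch of these four steps, i.e., Lloyd's algorithm (not from the slides; re-seeding empty clusters with a random point is an arbitrary choice of mine):

    import math
    import random

    def kmeans(data, k, seed=0):
        """Plain k-means; returns (centroids, assignment)."""
        rng = random.Random(seed)
        assign = [rng.randrange(k) for _ in data]          # step 1: random partition
        while True:
            # step 2: centroids as the mean point of each cluster
            centers = []
            for j in range(k):
                members = [o for o, a in zip(data, assign) if a == j]
                if not members:                            # keep empty clusters alive
                    members = [rng.choice(data)]
                centers.append([sum(c) / len(members) for c in zip(*members)])
            # step 3: reassign each object to its nearest centroid
            new_assign = [min(range(k), key=lambda j: math.dist(o, centers[j]))
                          for o in data]
            if new_assign == assign:                       # step 4: stop on no change
                return centers, assign
            assign = new_assign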

SLIDE 52: An Example of k-Means Clustering

(Figure: K = 2. The initial data set; arbitrarily partition objects into k groups; update the cluster centroids; reassign objects; update the cluster centroids again; loop if needed)

  β—Ό Partition objects into k nonempty subsets
  β—Ό Repeat
  β—Ό   Compute the centroid (i.e., mean point) of each partition
  β—Ό   Assign each object to the cluster of its nearest centroid
  β—Ό Until no change

SLIDE 53: Vector Quantization

  • Encode the vectors
  • Generate a codebook C = {c_1, …, c_k} via k-means
  • Assign o to its nearest codeword in C
  • i.e., rep(o) = c_i, i ∈ {1 … k}, such that dist(o, c_i) ≀ dist(o, c_j) βˆ€j
  • Represent each vector o by its assigned codeword
  • Assume d = 256, k = 2^16
  • Before: 4 bytes Γ— 256 = 1024 bytes for each vector
  • Now:
  • data: 16 bits = 2 bytes per vector
  • codebook: 4 Γ— 256 Γ— 2^16 bytes (shared by all vectors)

SLIDE 54: Vector Quantization – Query Processing

  • Given query q, how to find a point close to q?
  • Algorithm:
  • Compute rep(q)
  • Candidate set C = all data vectors associated with rep(q)
  • Verification: compute the distance between q and each o_i ∈ C
  • Requires loading the vectors in C
  • Any problem/improvement?
  • Inverted index: a hash table that maps each codeword c_j to the list of o_i associated with c_j

SLIDE 55: Limitations of VQ

  • To achieve better accuracy, a fine-grained quantizer with a large k is needed
  • Large k:
  • Costly to run k-means
  • Computing rep(q) is expensive: O(kd)
  • May need to look beyond rep(q)’s cell
  • Solution:
  • Product Quantization

SLIDE 56: Product Quantization

  • Idea
  • Partition the d dimensions into m partitions
  • Accordingly, a vector => m subvectors
  • Use a separate VQ with k codewords for each chunk
  • Example:
  • An 8-dim vector decomposed into m = 2 subvectors
  • Each codebook has k = 4 codewords (i.e., c_{j,i})
  • Total space in bits:
  • Data: n β‹… m β‹… log(k)
  • Codebook: m β‹… (d/m) β‹… k β‹… 32 = d β‹… k β‹… 32
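
A sketch of PQ training and encoding, reusing the kmeans helper sketched under Slide 51 (assumes d is divisible by m; function names are mine):

    import math

    def split(o, m):
        """Cut a vector into m equal-length subvectors."""
        step = len(o) // m
        return [o[j * step:(j + 1) * step] for j in range(m)]

    def pq_train(data, m, k):
        # One codebook of k codewords per chunk, trained on that chunk only.
        return [kmeans([split(o, m)[j] for o in data], k)[0] for j in range(m)]

    def pq_encode(o, codebooks):
        # The code of o: for each chunk, the id of its nearest codeword.
        m = len(codebooks)
        return [min(range(len(cb)), key=lambda i: math.dist(sub, cb[i]))
                for sub, cb in zip(split(o, m), codebooks)]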

SLIDE 57: Example of PQ

(Figure: example vectors split into two 4-dim subvectors; each codebook holds four codewords c_{1,0}…c_{1,3} and c_{2,0}…c_{2,3}, and each vector is stored as two 2-bit codeword ids, e.g., 00, 01, 11)

SLIDE 58: Distance Estimation

  • Euclidean distance between a query point q and a data point encoded as t
  • Restore the virtual joint center p by looking up each partition of t in the corresponding codebook
  • dΒ²(q, p) = Ξ£_{i=1..d} (q_i βˆ’ p_i)Β²
  • Known as Asymmetric Distance Computation (ADC)
  • dΒ²(q, t) = Ξ£_{j=1..m} β€–q(j) βˆ’ c_{j,t(j)}β€–β‚‚Β², where q(j) is the j-th subvector of q and t(j) is the j-th codeword id of t

(Figure: a query q and a code t = (01, 00); looking up t in the two codebooks restores the virtual center p)
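
A sketch of ADC with per-chunk lookup tables, reusing the split helper above: each query builds the tables once in O(k Β· d), after which estimating one point's distance costs only m table lookups:

    import math

    def adc_tables(q, codebooks):
        """For each chunk j: squared distance from q(j) to every codeword c_{j,i}."""
        m = len(codebooks)
        return [[math.dist(sub, c) ** 2 for c in cb]
                for sub, cb in zip(split(q, m), codebooks)]

    def adc(code, tables):
        # d^2(q, t) = sum over chunks of the precomputed entry for codeword t(j)
        return sum(table[cid] for cid, table in zip(code, tables))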

SLIDE 59: Query Processing

  • Compute the ADC for every point in the database
  • How?
  • Candidates = those with the l smallest ADs
  • [Optional] Re-ranking (if l > 1):
  • Load the candidates’ data vectors and compute the actual Euclidean distances
  • Return the one with the smallest distance

SLIDE 60: Query Processing

(Figure: computing the ADs of two encoded points t1 = (01, 00) and t2 = (11, 10) against q: dΒ²(q, t1) = β€–q(1) βˆ’ c_{1,t1(1)}β€–Β² + β€–q(2) βˆ’ c_{2,t1(2)}β€–Β², and likewise for t2, via codebook lookups)

SLIDE 61: Framework of PQ

  • Pre-processing:
  • Step 1: partition the data vectors
  • Step 2: generate the codebooks (e.g., via k-means)
  • Step 3: encode the data
  • Query:
  • Step 1: compute the distances between q and the codewords
  • Step 2: compute the AD for each point and return the candidates
  • Step 3: re-ranking (optional)