Locality-Sensitive Hashing - CS 395T: Visual Recognition and Search



SLIDE 1

Locality-Sensitive Hashing

CS 395T: Visual Recognition and Search
Marc Alban
Feb 22, 2008

SLIDE 2

Nearest Neighbor

• Given any query point $q$, return the point closest to $q$.
• Useful for finding similar objects in a database.
• Brute-force linear search is not practical for massive databases.
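As a baseline for the rest of the deck, the brute-force linear search mentioned above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and points stored as tuples; the function name `nearest_neighbor` and the toy database are hypothetical and not taken from the slides.

```python
import math


def nearest_neighbor(query, points):
    """Return (index, distance) of the point closest to query.

    Assumes Euclidean distance; scans every point, so the cost is O(N * d).
    """
    best_index, best_dist = -1, math.inf
    for i, p in enumerate(points):
        dist = math.dist(query, p)
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index, best_dist


# Hypothetical toy database of 2-D points.
database = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
index, dist = nearest_neighbor((0.9, 1.2), database)
```

Every query touches all N points, which is the cost that the approximate methods in the following slides try to avoid.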

SLIDE 3

The “Curse of Dimensionality”

• For $d < 10$ to $20$, data structures exist that require sublinear time and near-linear space to perform a NN search.
• Time or space requirements grow exponentially in the dimension.
• The dimensionality of images or documents is usually on the order of several hundred or more.
• At that dimensionality, brute-force linear search is the best we can do for exact NN search.

SLIDE 4

$(r, \epsilon)$-Nearest Neighbor

• An approximate nearest neighbor should suffice in most cases.
• Definition: If for any query point $q$ there exists a point $p$ such that $\|q - p\| \le r$, then w.h.p. return a point $p'$ such that $\|q - p'\| \le (1 + \epsilon)\, r$.
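The definition can be read as a condition to check on a single query. The sketch below does that under the assumption of Euclidean distance; the helper name `satisfies_r_eps_nn` and the sample points are hypothetical.

```python
import math


def satisfies_r_eps_nn(query, returned, points, r, eps):
    """Check one answer against the (r, eps)-NN condition.

    If some point lies within distance r of the query, the returned point
    must lie within (1 + eps) * r. If no point is that close, the
    definition places no requirement on the answer.
    """
    has_near_point = any(math.dist(query, p) <= r for p in points)
    if not has_near_point:
        return True
    return returned is not None and math.dist(query, returned) <= (1 + eps) * r


# Hypothetical data: (0, 0) is within r = 0.5 of the query.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
q = (0.1, 0.0)
ok_exact = satisfies_r_eps_nn(q, (0.0, 0.0), points, r=0.5, eps=0.5)
ok_approx = satisfies_r_eps_nn(q, (1.0, 0.0), points, r=0.5, eps=1.0)
not_ok = satisfies_r_eps_nn(q, (5.0, 5.0), points, r=0.5, eps=1.0)
```

The second call shows the relaxation: (1, 0) is not the true nearest neighbor, but at distance 0.9 it is still within $(1 + \epsilon)\, r = 1.0$.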

SLIDE 5

Locality-Sensitive Hash Families

Definition: An LSH family, $H(c, r, P_1, P_2)$, has the following properties for any $q, p \in S$:

  • 1. If $\|p - q\| \le r$ then $\Pr_H[h(p) = h(q)] \ge P_1$
  • 2. If $\|p - q\| \ge cr$ then $\Pr_H[h(p) = h(q)] \le P_2$

SLIDE 6

Hamming Space

• Definition: Hamming space is the set of all $2^N$ binary strings of length $N$.
• Definition: The Hamming distance between two equal-length binary strings is the number of positions at which the bits differ.

$\|1110101, 1111101\|_H = 1$
$\|1011101, 1001001\|_H = 2$
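The two worked distances above can be reproduced with a few lines of Python. This is a minimal sketch that assumes bit strings are represented as Python `str` values of '0' and '1'; the function name is illustrative.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


d1 = hamming_distance("1110101", "1111101")
d2 = hamming_distance("1011101", "1001001")
```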

SLIDE 7

Hamming Space

• Let a hashing family be defined as $h_i(p) = p_i$, where $p_i$ is the $i$th bit of $p$. This family is locality sensitive: for an index $i$ chosen uniformly at random from the $d$ bit positions,

$\Pr_H[h(p) = h(q)] = 1 - \frac{\|p, q\|_H}{d}$

$\Pr_H[h(p) \ne h(q)] = \frac{\|p, q\|_H}{d}$
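The collision probability can be checked exactly by enumerating every member of the family rather than sampling. The sketch below assumes bit strings stored as `str` and uses `fractions.Fraction` to avoid rounding; the helper names are hypothetical.

```python
from fractions import Fraction


def bit_hash(i):
    """Return the hash function h_i, which reads the i-th bit of its input."""
    return lambda p: p[i]


def collision_probability(p, q):
    """Exact Pr[h(p) = h(q)] when h is drawn uniformly from {h_0, ..., h_{d-1}}."""
    d = len(p)
    collisions = sum(bit_hash(i)(p) == bit_hash(i)(q) for i in range(d))
    return Fraction(collisions, d)


p, q = "1011101", "1001001"
hamming = sum(x != y for x, y in zip(p, q))
prob = collision_probability(p, q)
```

Here the strings differ in 2 of 7 positions, so 5 of the 7 possible hash functions agree on them.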

SLIDE 8

k-bit LSH Functions

• A k-bit locality-sensitive hash function (LSHF) is defined as:

$g(p) = [h_1(p), h_2(p), \ldots, h_k(p)]^T$

• Each $h_i$ is chosen randomly from $H$.
• Each $h_i$ results in a single bit.
• Because the $k$ bits are chosen independently, the per-bit collision probabilities multiply:
  • Pr(similar points collide) $\ge P_1^k$
  • Pr(dissimilar points collide) $\le P_2^k$
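A k-bit function for the Hamming family can be sketched as a closure over k sampled bit positions. This assumes positions are drawn uniformly with replacement and that points are bit strings; `make_lsh_function` is a hypothetical name, not an API from the toolbox used in the talk.

```python
import random


def make_lsh_function(d, k, rng):
    """Sample k bit positions from a d-bit space and return g(p).

    Assumption: positions are drawn independently and uniformly,
    with replacement.
    """
    positions = [rng.randrange(d) for _ in range(k)]

    def g(p):
        # Concatenate the k sampled bits into a hashable key.
        return tuple(p[i] for i in positions)

    return g


rng = random.Random(0)
g = make_lsh_function(d=7, k=3, rng=rng)
key_p = g("1110101")
key_q = g("1110101")
```

Returning a tuple makes the key directly usable as a dictionary index, which is how the hash tables in the next slide can be keyed.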

SLIDE 9

LSH Preprocessing

• Each training example is entered into $l$ hash tables indexed by independently constructed $g_1, \ldots, g_l$.
• Preprocessing space: $O(lN)$

[Figure: hash tables 1, 2, ..., l]
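The preprocessing step can be sketched as building l dictionaries, each keyed by its own k-bit function. The code assumes the bit-sampling family from the Hamming Space slides and stores point indices rather than the points themselves; `build_tables` and the sample data are hypothetical.

```python
import random
from collections import defaultdict


def build_tables(points, d, k, l, seed=0):
    """Enter every point into l hash tables, one per independent k-bit function."""
    rng = random.Random(seed)
    tables = []
    for _ in range(l):
        # Each table gets its own independently sampled bit positions (its g_i).
        positions = [rng.randrange(d) for _ in range(k)]
        buckets = defaultdict(list)
        for index, p in enumerate(points):
            key = tuple(p[i] for i in positions)
            buckets[key].append(index)
        tables.append((positions, buckets))
    return tables


points = ["1110101", "1111101", "1011101", "1001001"]
tables = build_tables(points, d=7, k=3, l=4)
entries_per_table = [sum(len(b) for b in buckets.values()) for _, buckets in tables]
```

Each table holds exactly one entry per point, which is where the $O(lN)$ space bound comes from.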

SLIDE 10

LSH Querying

• For each hash table $i$, $1 \le i \le l$:
  • Return the bin indexed by $g_i(q)$.
• Perform a linear search on the union of the bins.
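The query procedure can be sketched end to end as follows. This is an illustration under stated assumptions: the bit-sampling family over bit strings, no cap B on the search length, and hypothetical helper names (`build_tables`, `query`).

```python
import random
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def build_tables(points, d, k, l, rng):
    """Build l tables, each keyed by k randomly sampled bit positions."""
    tables = []
    for _ in range(l):
        positions = [rng.randrange(d) for _ in range(k)]
        buckets = defaultdict(list)
        for index, p in enumerate(points):
            buckets[tuple(p[i] for i in positions)].append(index)
        tables.append((positions, buckets))
    return tables


def query(q, points, tables):
    """Return (best_index, searched) for query q.

    Collects the bin g_i(q) from every table, then linearly searches
    the union of those bins.
    """
    candidates = set()
    for positions, buckets in tables:
        key = tuple(q[i] for i in positions)
        candidates.update(buckets.get(key, ()))
    if not candidates:
        return None, 0  # no bin matched: the search fails
    best = min(candidates, key=lambda idx: hamming(q, points[idx]))
    return best, len(candidates)


points = ["1110101", "1111101", "1011101", "1001001", "0000000"]
tables = build_tables(points, d=7, k=3, l=4, rng=random.Random(0))
best, searched = query("1011101", points, tables)
```

Because the query string is itself in the database, it collides with its own entry in every table, so the search returns that entry. The `searched` count corresponds to the search length measured in the experiments.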

SLIDE 11

Parameter Selection

• Suppose we want to search at most $B$ examples. Then setting

$k = \log_{1/P_2}\left(\frac{N}{B}\right), \quad l = \left(\frac{N}{B}\right)^{\frac{\log(1/P_1)}{\log(1/P_2)}}$

ensures that it will succeed with high probability.
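The two formulas can be evaluated directly. In the sketch below, rounding up to whole bits and whole tables is an assumption, and the values P1 = 0.9 and P2 = 0.5 are hypothetical placeholders; N = 59500 and B = 250 are borrowed from the experiment slides.

```python
import math


def select_parameters(N, B, P1, P2):
    """Evaluate k = log_{1/P2}(N/B) and l = (N/B)^(log(1/P1) / log(1/P2)).

    Assumption: both values are rounded up to the next integer.
    """
    ratio = N / B
    k_real = math.log(ratio) / math.log(1 / P2)
    rho = math.log(1 / P1) / math.log(1 / P2)
    l_real = ratio ** rho
    return math.ceil(k_real), math.ceil(l_real)


# P1 and P2 are illustrative values, not measurements from the talk.
k, l = select_parameters(N=59500, B=250, P1=0.9, P2=0.5)
```

The exponent $\log(1/P_1) / \log(1/P_2)$ is below 1 whenever $P_1 > P_2$, so the number of tables grows sublinearly in $N/B$.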

SLIDE 12

Experiment 1

• Compare LSH accuracy and performance to exact NN search. Examine the influence of:
  • k, the number of hash bits.
  • l, the number of hash tables.
  • B, the maximum search length.
• Dataset
  • 59,500 20x20 patches taken from motorcycle images.
  • Represented as 400-dimensional column vectors.

SLIDE 13

Hash Function

• Convert the feature vectors into binary strings and use the Hamming hash functions.
• Given a vector $x \in \mathbb{N}^d$ we can create a unary representation for each element $x_i$.
  • $\mathrm{Unary}_C(x_i)$ = $x_i$ 1's followed by $(C - x_i)$ 0's, where $C$ is the max coordinate for all points.
  • $u(x) = \mathrm{Unary}_C(x_1), \ldots, \mathrm{Unary}_C(x_d)$
• Note that for any two points $p, q$: $\|p, q\| = \|u(p), u(q)\|_H$, where $\|p, q\|$ is the $L_1$ distance.
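The unary embedding and the distance identity can be checked on a small example. This sketch assumes non-negative integer coordinates no larger than C; the function names `unary` and `embed` and the sample vectors are illustrative.

```python
def unary(value, C):
    """Unary_C(value): `value` ones followed by (C - value) zeros."""
    return "1" * value + "0" * (C - value)


def embed(x, C):
    """u(x): concatenate the unary codes of every coordinate of x."""
    return "".join(unary(v, C) for v in x)


# Hypothetical 3-dimensional points with max coordinate C = 5.
p, q, C = (3, 0, 5), (1, 2, 5), 5
u_p, u_q = embed(p, C), embed(q, C)
l1_distance = sum(abs(a - b) for a, b in zip(p, q))
hamming_distance = sum(a != b for a, b in zip(u_p, u_q))
```

Two unary codes differ in exactly $|x_i - y_i|$ positions, so summing over coordinates gives the $L_1$ distance. The cost is that each vector grows from d coordinates to dC bits.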

SLIDE 14

Example Query

• $l = 20$, $k = 24$, $B = \infty$
• Examples searched: 7,722 of 59,500
• Query, Result, Actual NNs: [images]

SLIDE 15

Average Search Length

• Let $B = \infty$

[Figure: average search length (x1000, axis from 2 to 24) versus l and k, each ranging from 5 to 30]

SLIDE 16

Average Search Length

• Let $B = \infty$
• More hash bits (k) result in shorter searches.
• More hash tables (l) result in longer searches.

[Figure: average search length (x1000, axis from 2 to 24) versus l and k, each ranging from 5 to 30]

SLIDE 17

Average Approximation Error

• Let $B = \infty$

[Figure: average approximation error (axis from 1.04 to 1.11) versus l and k, each ranging from 5 to 30]

SLIDE 18

Average Approximation Error

• Let $B = \infty$
• Over-hashing can result in too few candidates to return a good approximation.
• Over-hashing can cause the algorithm to fail.

[Figure: average approximation error (axis from 1.04 to 1.11) versus l and k, each ranging from 5 to 30]

SLIDE 19

Average Approximation Error

• Let $B = \infty$
• Over-hashing can result in too few candidates to return a good approximation.
• Over-hashing can cause the algorithm to fail.
• Average search length = 8000

[Figure: average approximation error (axis from 1.04 to 1.11) versus l and k, each ranging from 5 to 30]

SLIDE 20

Average Approximation Error

• Let $B = 5500 \approx N / \ln N$

[Figure: average approximation error (axis from 1.08 to 1.15) versus l and k, each ranging from 5 to 30]

SLIDE 21

Average Approximation Error

• Let $B = 250 \approx \sqrt{N}$

[Figure: average approximation error (axis from 1.25 to 1.6) versus l and k, each ranging from 5 to 30]

SLIDE 22

Experiment 2

• Examine the effect of the approximation on the subjective quality of the results.
• Dataset
  • D. Nistér and H. Stewénius. Scalable recognition with a vocabulary tree.
  • 2550 sets of 4 images represented as a document-term matrix of the visual words.

SLIDE 23

Experiment 2: Issues

• LSH requires a vector representation.
• It is not clear how to easily convert a bag-of-words representation into a vector one.
  • A binary vector where the presence of each word is a bit does not provide a good distance measure.
  • Each image differs from any other image in roughly the same number of words.
  • Boostmap?

SLIDE 24

Conclusions

• Approximate nearest neighbor (aNN) search is necessary for very large, high-dimensional datasets.
• LSH is a simple approach to aNN.
• LSH requires a vector representation.
• There is a clear relationship between search length and approximation error.

SLIDE 25

Tools

• Octave (MATLAB)
  • LSH Matlab Toolbox - http://www.cs.brown.edu/~gregory/code/lsh/
• Python
• Gnuplot

SLIDE 26

References

'Fast Pose Estimation with Parameter-Sensitive Hashing' - Shakhnarovich et al.

'Similarity Search in High Dimensions via Hashing' - Gionis et al.

'Object Recognition Using Locality-Sensitive Hashing of Shape Contexts' - Andrea Frome and Jitendra Malik

'Nearest neighbors in high-dimensional spaces', Handbook of Discrete and Computational Geometry - Piotr Indyk

Algorithms for Nearest Neighbor Search - http://simsearch.yury.name/tutorial.html

LSH Matlab Toolbox - http://www.cs.brown.edu/~gregory/code/lsh/