

  1. Locality-Sensitive Hashing
     CS 395T: Visual Recognition and Search
     Marc Alban, Feb 22, 2008

  2. Nearest Neighbor
     - Given a query point q, return the point p in the database closest to q.
     - Useful for finding similar objects in a database.
     - Brute-force linear search is not practical for massive databases.

  3. The “Curse of Dimensionality”
     - For d < 10 to 20, data structures exist that require sublinear time and near-linear space to perform a NN search.
     - Beyond that, time or space requirements grow exponentially in the dimension.
     - The dimensionality of images or documents is usually on the order of several hundred or more.
     - In high dimensions, brute-force linear search is the best we can do.

  4. (r, ε)-Nearest Neighbor
     - An approximate nearest neighbor should suffice in most cases.
     - Definition: if for a query point q there exists a point p such that ||q − p|| ≤ r, then w.h.p. return a point p′ such that ||q − p′|| ≤ (1 + ε) r.
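The (r, ε) guarantee above can be phrased as a simple predicate. A minimal Python sketch (the helper name `satisfies_eps_nn` is made up for illustration):

```python
import math

def satisfies_eps_nn(q, p_approx, p_exact, eps):
    """True if p_approx is within (1 + eps) times the distance
    from q to the true nearest neighbor p_exact."""
    return math.dist(q, p_approx) <= (1 + eps) * math.dist(q, p_exact)

# The true NN is at distance 1.0; a point at distance 1.05 is an
# acceptable answer for eps = 0.1, but one at distance 1.2 is not.
q = (0.0, 0.0)
print(satisfies_eps_nn(q, (1.05, 0.0), (1.0, 0.0), 0.1))  # True
print(satisfies_eps_nn(q, (1.2, 0.0), (1.0, 0.0), 0.1))   # False
```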

  5. Locality-Sensitive Hash Families
     Definition: an LSH family H(c, r, P1, P2) has the following properties for any q, p ∈ S:
     1. If ||p − q|| ≤ r, then Pr_H[h(p) = h(q)] ≥ P1.
     2. If ||p − q|| ≥ cr, then Pr_H[h(p) = h(q)] ≤ P2.

  6. Hamming Space
     - Definition: Hamming space is the set of all 2^N binary strings of length N.
     - Definition: the Hamming distance between two equal-length binary strings is the number of positions at which the bits differ.
       ||1011101, 1001001||_H = 2
       ||1110101, 1111101||_H = 1
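The two distances above can be checked in a few lines; a minimal Python sketch:

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length binary strings differ."""
    assert len(a) == len(b), "strings must have equal length"
    return sum(x != y for x, y in zip(a, b))

print(hamming("1011101", "1001001"))  # 2
print(hamming("1110101", "1111101"))  # 1
```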

  7. Hamming Space
     - Let a hashing family be defined as h_i(p) = p_i, where p_i is the i-th bit of p. Then
       Pr_H[h(p) ≠ h(q)] = ||p, q||_H / d
       Pr_H[h(p) = h(q)] = 1 − ||p, q||_H / d
     - Clearly, this family is locality sensitive.
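For this bit-sampling family the collision probability can be computed exactly, since picking h_i at random is just picking a random bit position. A small sketch (`collision_prob` is a made-up helper name):

```python
def collision_prob(p: str, q: str) -> float:
    """Pr over a uniformly random bit index i that h_i(p) == h_i(q),
    i.e. 1 - (Hamming distance) / d."""
    d = len(p)
    matches = sum(a == b for a, b in zip(p, q))
    return matches / d

# 1 - 2/7 for the pair of strings from the previous slide:
print(collision_prob("1011101", "1001001"))
```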

  8. k-bit LSH Functions
     - A k-bit locality-sensitive hash function (LSHF) is defined as
       g(p) = [h_1(p), h_2(p), ..., h_k(p)]^T
     - Each h_i is chosen randomly from H.
     - Each h_i results in a single bit.
     - Pr(similar points collide) ≥ P1^k
     - Pr(dissimilar points collide) ≤ P2^k
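Such a g can be built by sampling k bit positions at random and concatenating those bits; a sketch (`make_g` is a hypothetical helper, not from the slides):

```python
import random

def make_g(d: int, k: int, rng: random.Random):
    """Return a k-bit LSH function g that concatenates k randomly
    chosen bits of a length-d binary vector."""
    idx = [rng.randrange(d) for _ in range(k)]
    return lambda p: tuple(p[i] for i in idx)

rng = random.Random(0)
g = make_g(d=8, k=4, rng=rng)
p = (1, 0, 1, 1, 0, 1, 0, 1)
print(g(p))          # a 4-bit key, usable as a hash-table index
print(g(p) == g(p))  # identical points always collide: True
```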

  9. LSH Preprocessing
     - Each training example is entered into l hash tables indexed by independently constructed g_1, ..., g_l.
     - Preprocessing space: O(lN)
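A sketch of the preprocessing step for binary vectors, using bit-sampling g_i as above (`build_tables` is a made-up name; each g_i is stored as its list of sampled bit positions):

```python
import random
from collections import defaultdict

def build_tables(points, k, l, seed=0):
    """Insert every point into l hash tables, each indexed by an
    independently sampled k-bit function g_i."""
    d = len(points[0])
    rng = random.Random(seed)
    gs = [[rng.randrange(d) for _ in range(k)] for _ in range(l)]
    tables = [defaultdict(list) for _ in range(l)]
    for pid, p in enumerate(points):
        for g, table in zip(gs, tables):
            table[tuple(p[i] for i in g)].append(pid)
    return gs, tables

points = [(0, 0, 1, 1), (0, 0, 1, 0), (1, 1, 0, 0)]
gs, tables = build_tables(points, k=2, l=3)
# Every point appears once in each table, so total entries = l * N.
print(sum(len(bin) for t in tables for bin in t.values()))  # 9
```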

  10. LSH Querying
      - For each hash table i, 1 ≤ i ≤ l, retrieve the bin indexed by g_i(q).
      - Perform a linear search on the union of the bins.
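The query step can be sketched as below, repeating the same table construction so the example is self-contained (helper names are made up):

```python
import random
from collections import defaultdict

def build_tables(points, k, l, seed=0):
    """Insert every point into l hash tables of k sampled bits each."""
    d = len(points[0])
    rng = random.Random(seed)
    gs = [[rng.randrange(d) for _ in range(k)] for _ in range(l)]
    tables = [defaultdict(list) for _ in range(l)]
    for pid, p in enumerate(points):
        for g, table in zip(gs, tables):
            table[tuple(p[i] for i in g)].append(pid)
    return gs, tables

def query(q, points, gs, tables):
    """Union the l bins q hashes to, then linear-search that union."""
    candidates = set()
    for g, table in zip(gs, tables):
        candidates.update(table.get(tuple(q[i] for i in g), []))
    hamming = lambda a, b: sum(x != y for x, y in zip(a, b))
    return min(candidates, key=lambda pid: hamming(points[pid], q),
               default=None)

points = [(0, 0, 1, 1), (1, 1, 1, 1), (1, 1, 0, 0)]
gs, tables = build_tables(points, k=2, l=4)
# Querying with a database point: it collides with itself in every
# table, so the search always finds it at Hamming distance 0.
print(query((0, 0, 1, 1), points, gs, tables))  # 0
```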

  11. Parameter Selection
      - Suppose we want to search at most B examples. Then setting
        k = log_{1/P2}(N/B),   l = (N/B)^{log(1/P1) / log(1/P2)}
        ensures that the search succeeds with high probability.
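These formulas can be evaluated directly. A sketch with illustrative values P1 = 0.9 and P2 = 0.5 (these probabilities are assumptions for the example, not values from the slides; results are rounded up to integers):

```python
import math

def lsh_params(N: int, B: int, p1: float, p2: float):
    """k = log_{1/P2}(N/B); l = (N/B)^(log(1/P1) / log(1/P2)),
    each rounded up to an integer."""
    k = math.ceil(math.log(N / B) / math.log(1 / p2))
    rho = math.log(1 / p1) / math.log(1 / p2)
    l = math.ceil((N / B) ** rho)
    return k, l

print(lsh_params(N=59500, B=250, p1=0.9, p2=0.5))  # (8, 3)
```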

  12. Experiment 1
      - Compare LSH accuracy and performance to exact NN search. Examine the influence of:
        - k, the number of hash bits.
        - l, the number of hash tables.
        - B, the maximum search length.
      - Dataset:
        - 59,500 20x20 patches taken from motorcycle images.
        - Represented as 400-dimensional column vectors.

  13. Hash Function
      - Convert the feature vectors into binary strings and use the Hamming hash functions.
      - Given a vector x ∈ N^d, create a unary representation for each element x_i:
        Unary_C(x_i) = x_i 1's followed by (C − x_i) 0's,
        where C is the maximum coordinate value over all points.
      - u(x) = Unary_C(x_1) ... Unary_C(x_d)
      - Note that for any two points p, q:  ||p, q||_1 = ||u(p), u(q)||_H
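The unary embedding and the L1-to-Hamming identity can be checked in a few lines (a sketch assuming small non-negative integer coordinates):

```python
def unary(xi: int, C: int) -> str:
    """x_i ones followed by (C - x_i) zeros."""
    return "1" * xi + "0" * (C - xi)

def u(x, C):
    """Concatenate the unary codes of all coordinates."""
    return "".join(unary(xi, C) for xi in x)

def hamming(a, b):
    return sum(s != t for s, t in zip(a, b))

p, q, C = [3, 1, 4], [1, 2, 4], 5
l1 = sum(abs(a - b) for a, b in zip(p, q))
print(l1, hamming(u(p, C), u(q, C)))  # 3 3  (the two distances agree)
```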

  14. Example Query (l = 20, k = 24, B = ∞)
      - Query = [image]
      - Examples searched: 7,722 of 59,500
      - Result = [image]
      - Actual NNs = [image]

  15. Average Search Length
      - Let B = ∞.
      [Plot: average search length (in thousands, roughly 5–30) as a function of k = 5–30 and l = 2–24.]

  16. Average Search Length
      - Let B = ∞.
      - More hash bits (k) result in shorter searches.
      - More hash tables (l) result in longer searches.
      [Same plot as slide 15.]

  17. Average Approximation Error
      - Let B = ∞.
      [Plot: average approximation error (about 1.04–1.11) as a function of k = 5–30 and l = 5–30.]

  18. Average Approximation Error
      - Let B = ∞.
      - Over-hashing can result in too few candidates to return a good approximation.
      - Over-hashing can cause the algorithm to fail.
      [Same plot as slide 17.]

  19. Average Approximation Error
      - Let B = ∞.
      - Over-hashing can result in too few candidates to return a good approximation.
      - Over-hashing can cause the algorithm to fail.
      [Same plot as slide 17, with the average search length = 8,000 contour marked.]

  20. Average Approximation Error
      - Let B = 5,500 ≈ N / ln N.
      [Plot: average approximation error (about 1.08–1.15) as a function of k = 5–30 and l = 5–30.]

  21. Average Approximation Error
      - Let B = 250 ≈ √N.
      [Plot: average approximation error (about 1.25–1.6) as a function of k = 5–30 and l = 5–30.]

  22. Experiment 2
      - Examine the effect of the approximation on the subjective quality of the results.
      - Dataset:
        - D. Nistér and H. Stewénius, "Scalable Recognition with a Vocabulary Tree."
        - 2,550 sets of 4 images, represented as a document-term matrix of visual words.

  23. Experiment 2: Issues
      - LSH requires a vector representation.
      - It is not clear how to easily convert a bag-of-words representation into a vector one.
      - A binary vector with one presence bit per word does not provide a good distance measure: each image has roughly the same number of differing words from any other image.
      - BoostMap?

  24. Conclusions
      - Approximate nearest neighbor search is necessary for very large, high-dimensional datasets.
      - LSH is a simple approach to ANN.
      - LSH requires a vector representation.
      - There is a clear relationship between search length and approximation error.

  25. Tools
      - Octave (MATLAB)
      - LSH Matlab Toolbox: http://www.cs.brown.edu/~gregory/code/lsh/
      - Python
      - Gnuplot

  26. References
      - 'Fast Pose Estimation with Parameter-Sensitive Hashing', Shakhnarovich et al.
      - 'Similarity Search in High Dimensions via Hashing', Gionis et al.
      - 'Object Recognition Using Locality-Sensitive Hashing of Shape Contexts', Andrea Frome and Jitendra Malik
      - 'Nearest Neighbors in High-Dimensional Spaces', Handbook of Discrete and Computational Geometry, Piotr Indyk
      - Algorithms for Nearest Neighbor Search: http://simsearch.yury.name/tutorial.html
      - LSH Matlab Toolbox: http://www.cs.brown.edu/~gregory/code/lsh/
