
COMP9313: Big Data Management High Dimensional Similarity Search - PowerPoint PPT Presentation



  1. COMP9313: Big Data Management High Dimensional Similarity Search

  2. Similarity Search • Problem Definition: • Given a query q and dataset D, find o ∈ D, where o is similar to q • Two types of similarity search • Range search: dist(o, q) ≤ τ • Nearest neighbor search: dist(o*, q) ≤ dist(o, q), ∀o ∈ D • Top-k version • Distance/similarity function varies • Euclidean, Jaccard, inner product, … • Classic problem, with mature solutions

  3. High Dimensional Similarity Search • Applications and relationship to Big Data • Almost every object can be and has been represented by a high dimensional vector • Words, documents • Image, audio, video • … • Similarity search is a fundamental process in information retrieval • E.g., Google search engine, face recognition systems, … • High dimensionality makes a huge difference! • Traditional solutions are no longer feasible • This lecture is about why and how • We focus on high dimensional vectors in Euclidean space

  4. Similarity Search in Low Dimensional Space

  5. Similarity Search in One Dimensional Space • Just numbers; use binary search, binary search tree, B+ Tree, … • The essential idea behind them: objects can be sorted
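The sortedness idea can be sketched in a few lines (a minimal illustration, not from the slides; the function name `nearest_1d` is invented): binary search locates the insertion point, and the nearest neighbor must be one of the two values straddling it.

```python
import bisect

def nearest_1d(sorted_data, q):
    """Nearest neighbor of q in a sorted list: binary search for the
    insertion point, then compare the two candidates around it."""
    i = bisect.bisect_left(sorted_data, q)
    candidates = sorted_data[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda x: abs(x - q))

data = sorted([7, 2, 19, 4, 11])
print(nearest_1d(data, 10))  # 11
```

The search itself is O(log n) — exactly the kind of sub-linear behavior that is lost once there is no total order.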

  6. Similarity Search in Two Dimensional Space • Why does binary search no longer work? • No order! • Voronoi diagram: Euclidean distance, Manhattan distance

  7. Similarity Search in Two Dimensional Space • Partition based algorithms • Partition data into “cells” • Nearest neighbors are in the same cell as the query or in adjacent cells • How many “cells” to probe in 3-dimensional space?
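A hedged sketch of the cell-partition idea in 2-D (helper names like `build_grid` are invented for illustration): points are bucketed into square cells, and the query probes its own cell plus all adjacent ones — 3^d cells in d dimensions, hence 27 in 3-D. Note that the true nearest neighbor is only guaranteed to be found if it lies within the probed cells.

```python
from collections import defaultdict
from itertools import product
import math

def build_grid(points, cell):
    """Bucket 2-D points into square cells of side length `cell`."""
    grid = defaultdict(list)
    for p in points:
        grid[(int(p[0] // cell), int(p[1] // cell))].append(p)
    return grid

def grid_nn(grid, cell, q):
    """Probe the query's cell and its 8 neighbors (3^d cells for d = 2)."""
    cx, cy = int(q[0] // cell), int(q[1] // cell)
    best, best_d = None, float("inf")
    for dx, dy in product((-1, 0, 1), repeat=2):
        for p in grid.get((cx + dx, cy + dy), ()):
            d = math.dist(p, q)
            if d < best_d:
                best, best_d = p, d
    return best

pts = [(0.5, 0.5), (1.2, 0.9), (3.7, 3.1)]
grid = build_grid(pts, cell=1.0)
print(grid_nn(grid, 1.0, (1.0, 1.0)))  # (1.2, 0.9)
```

The number of cells to probe, 3^d, already hints at the trouble ahead: it grows exponentially with the dimension.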

  8. Similarity Search in Metric Space • Triangle inequality • dist(x, q) ≤ dist(x, y) + dist(y, q) • Orchard’s Algorithm • for each x ∈ D, create a list of points in increasing order of distance to x • given query q, randomly pick a point x as the initial candidate (i.e., pivot p), compute dist(p, q) • walk along the list of p, and compute the distances to q. If some y closer to q than p is found, use y as the new pivot (i.e., p ← y) • repeat the procedure, and stop when • dist(p, y) > 2 · dist(p, q)

  9. Similarity Search in Metric Space • Orchard’s Algorithm, stop when dist(p, y) > 2 · dist(p, q) • 2 · dist(p, q) < dist(p, y) and dist(p, y) ≤ dist(p, q) + dist(y, q) ⇒ 2 · dist(p, q) < dist(p, q) + dist(y, q) ⇔ dist(p, q) < dist(y, q) • Since the list of p is in increasing order of distance to p, dist(p, y) > 2 · dist(p, q) holds for all the remaining y’s.
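Putting the two slides above together, Orchard's Algorithm might be sketched as follows (a simplified version; the slides give no code, and the function names are this example's):

```python
import math, random

def orchard_build(data):
    """For each point x, precompute the other points sorted by distance to x."""
    lists = {}
    for i, x in enumerate(data):
        lists[i] = sorted((j for j in range(len(data)) if j != i),
                          key=lambda j: math.dist(data[j], x))
    return lists

def orchard_nn(data, lists, q):
    p = random.randrange(len(data))          # random initial pivot
    d_p = math.dist(data[p], q)
    improved = True
    while improved:
        improved = False
        for y in lists[p]:
            # stop: every later y in p's list is even farther from p,
            # so by the triangle inequality none can beat p
            if math.dist(data[y], data[p]) > 2 * d_p:
                break
            d_y = math.dist(data[y], q)
            if d_y < d_p:
                p, d_p = y, d_y              # y becomes the new pivot
                improved = True
                break
    return data[p]
```

The O(n²) precomputed lists are the price for fast exact queries — already a hint that this does not scale to big data.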

  10. None of the Above Works in High Dimensional Space!

  11. Curse of Dimensionality • Refers to various phenomena that arise in high dimensional spaces that do not occur in low dimensional settings. • Triangle inequality • The pruning power reduces heavily • What is the volume of a high dimensional “ring” (i.e., hyperspherical shell)? • V_ring(w=1, d=2) / V_ball(r=10, d=2) = 29% • V_ring(w=1, d=100) / V_ball(r=10, d=100) = 99.997%
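Since a d-ball's volume scales as radius^d, the shell fraction under one natural parameterization (outer shell of width w on a ball of radius r — an assumption, as the slide does not show its exact radii) is 1 − ((r − w)/r)^d:

```python
def shell_fraction(r, w, d):
    """Fraction of a d-ball's volume lying in the outer shell of width w.
    Volume scales as radius^d, so the inner ball holds ((r-w)/r)^d of it."""
    return 1 - ((r - w) / r) ** d

for d in (2, 10, 100):
    print(d, shell_fraction(10, 1, d))  # d=100 gives ~0.99997
```

In high dimensions virtually all of the volume sits in a thin shell near the surface, so distances concentrate and triangle-inequality pruning has almost nothing to cut.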

  12. Approximate Nearest Neighbor Search in High Dimensional Space • There is no sub-linear solution to find the exact result of a nearest neighbor query • So we relax the condition • approximate nearest neighbor search (ANNS) • allow returned points to be not the NN of the query • Success: returns the true NN • use success rate (e.g., percentage of successful queries) to evaluate the method • Hard to bound the success rate

  13. c-approximate NN Search • Success: returns o such that • dist(o, q) ≤ c · dist(o*, q) • Then we can bound the success probability • Usually noted as 1 − δ • Solution: Locality Sensitive Hashing (LSH)

  14. Locality Sensitive Hashing • Hash function • Index: map data/objects to values (e.g., hash keys) • Same data ⇒ same hash key (with 100% probability) • Different data ⇒ different hash keys (with high probability) • Retrieval: easy to retrieve identical objects (as they have the same hash key) • Applications: hash map, hash join • Low cost • Space: O(n) • Time: O(1) • Why can’t it be used in nearest neighbor search? • Even a minor difference leads to totally different hash keys

  15. Locality Sensitive Hashing • Index: make the hash functions error tolerant • Similar data ⇒ same hash key (with high probability) • Dissimilar data ⇒ different hash keys (with high probability) • Retrieval: • Compute the hash key for the query • Obtain all the data that has the same key as the query (i.e., candidates) • Find the nearest one to the query • Cost: • Space: O(n) • Time: O(1) + O(|cand|) • It is not the real Locality Sensitive Hashing yet! • We still have several unsolved issues…

  16. LSH Functions • Formal definition: • Given points o1, o2, distances r1, r2, probabilities p1, p2 • An LSH function h(·) should satisfy • Pr[h(o1) = h(o2)] ≥ p1, if dist(o1, o2) ≤ r1 • Pr[h(o1) = h(o2)] ≤ p2, if dist(o1, o2) > r2 • What is h(·) for a given distance/similarity function? • Jaccard similarity • Angular distance • Euclidean distance

  17. MinHash - LSH Function for Jaccard Similarity • Each data object is a set • Jaccard(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2| • Randomly generate a global order for all the n elements in C = ∪i Si • Let h(S) be the minimal member of S with respect to the global order • For example, S = {c, d, f, h, j}; if we use inverse alphabetical order, the re-ordered S = {j, h, f, d, c}, hence h(S) = j.

  18. MinHash • Now we compute Pr[h(S1) = h(S2)] • Every element e ∈ S1 ∪ S2 has an equal chance to be the first element among S1 ∪ S2 after re-ordering • e ∈ S1 ∩ S2 if and only if h(S1) = h(S2) • e ∉ S1 ∩ S2 if and only if h(S1) ≠ h(S2) • Pr[h(S1) = h(S2)] = |S1 ∩ S2| / |S1 ∪ S2| = Jaccard(S1, S2)
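A small MinHash sketch of the above (illustrative only; `jaccard_estimate` and the 500-permutation count are this example's choices, not from the slides): averaging the collision indicator over many random global orders estimates the Jaccard similarity.

```python
import random

def minhash(s, perm):
    """h(S): the element of S that comes first in the random global order."""
    return min(s, key=perm.__getitem__)

def jaccard_estimate(s1, s2, universe, n_hashes=500, seed=42):
    """Estimate Jaccard(s1, s2) as the fraction of random orders
    under which the two MinHash values collide."""
    rng = random.Random(seed)
    universe = list(universe)
    match = 0
    for _ in range(n_hashes):
        order = rng.sample(universe, len(universe))   # random permutation
        perm = {e: i for i, e in enumerate(order)}    # element -> rank
        match += minhash(s1, perm) == minhash(s2, perm)
    return match / n_hashes

s1, s2 = {"a", "b", "c", "d"}, {"c", "d", "e"}
# true Jaccard = |{c, d}| / |{a, b, c, d, e}| = 0.4
print(jaccard_estimate(s1, s2, s1 | s2))  # ≈ 0.4
```

In practice explicit permutations are too expensive for large universes; random hash functions stand in for the global order, but the collision probability argument is the same.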

  19. SimHash – LSH Function for Angular Distance • Each data object is a d dimensional vector • θ(x, y) is the angle between x and y • Randomly generate a normal vector a, where ai ~ N(0, 1) • Let h(x; a) = sgn(aᵀx) • sgn(v) = 1 if v ≥ 0; −1 if v < 0 • i.e., which side of a’s corresponding hyperplane x lies on

  20. SimHash • Now we compute Pr[h(o1) = h(o2)] • h(o1) ≠ h(o2) iff o1 and o2 are on different sides of the hyperplane with a as its normal vector • Pr[h(o1) = h(o2)] = 1 − θ/π
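The collision probability 1 − θ/π can be checked empirically (a sketch with invented helper names; 2000 trials is an arbitrary choice): for x = (1, 0) and y = (1, 1) the angle is π/4, so the rate should approach 1 − 1/4 = 0.75.

```python
import random

def simhash(x, a):
    """h(x; a) = sgn(a · x): which side of a's hyperplane x lies on."""
    return 1 if sum(ai * xi for ai, xi in zip(a, x)) >= 0 else -1

def collision_rate(x, y, n_hashes=2000, seed=1):
    """Empirical Pr[h(x) = h(y)] over random Gaussian normal vectors."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_hashes):
        a = [rng.gauss(0, 1) for _ in range(len(x))]
        hits += simhash(x, a) == simhash(y, a)
    return hits / n_hashes

print(collision_rate((1.0, 0.0), (1.0, 1.0)))  # ≈ 0.75
```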

  21. p-stable LSH - LSH function for Euclidean distance • Each data object is a d dimensional vector • dist(x, y) = √(Σi=1..d (xi − yi)²) • Randomly generate a normal vector a, where ai ~ N(0, 1) • Normal distribution is 2-stable, i.e., if ai ~ N(0, 1), then Σi=1..d ai · xi ~ N(0, ‖x‖₂²) • Let h(x; a, b) = ⌊(aᵀx + b) / w⌋, where b ~ U(0, w) and w is a user-specified parameter • Pr[h(o1; a, b) = h(o2; a, b)] = ∫₀ʷ (1/s) f(t/s) (1 − t/w) dt, where s = dist(o1, o2) • f(·) is the pdf of the absolute value of a normal variable

  22. p-stable LSH • Intuition of p-stable LSH • Similar points have a higher chance to be hashed together
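That intuition — projections of nearby points usually fall into the same bucket of width w — can be sketched as follows (illustrative; the bucket width w = 4.0 and helper names are this example's choices):

```python
import math, random

def pstable_hash(x, a, b, w):
    """h(x; a, b) = floor((a·x + b) / w), with a_i ~ N(0,1), b ~ U(0, w)."""
    return math.floor((sum(ai * xi for ai, xi in zip(a, x)) + b) / w)

def collision_rate(x, y, w=4.0, n_hashes=2000, seed=7):
    """Empirical Pr[h(x) = h(y)] over random (a, b) draws."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_hashes):
        a = [rng.gauss(0, 1) for _ in range(len(x))]
        b = rng.uniform(0, w)
        hits += pstable_hash(x, a, b, w) == pstable_hash(y, a, b, w)
    return hits / n_hashes

q = (0.0, 0.0)
print(collision_rate(q, (0.1, 0.1)))  # close pair: rate near 1
print(collision_rate(q, (3.0, 3.0)))  # distant pair: much lower rate
```

By 2-stability the projected gap a·x − a·y is Gaussian with scale dist(x, y), so the farther apart two points are, the more likely the floor pushes them into different buckets.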

  23. Pr[h(x) = h(y)] for different Hash Functions • MinHash • SimHash • p-stable LSH

  24. Problem of a Single Hash Function • Hard to distinguish two pairs whose distances are close to each other • Pr[h(o1) = h(o2)] ≥ p1, if dist(o1, o2) ≤ r1 • Pr[h(o1) = h(o2)] ≤ p2, if dist(o1, o2) > r2 • We also want to control where the drastic change happens… • Close to dist(o*, q) • Given range

  25. AND-OR Composition • Recall that for a single hash function, we have • Pr[h(o1) = h(o2)] = p(dist(o1, o2)), denoted as p_{o1,o2} • Now we consider two scenarios: • Combine k hashes together, using the AND operation • One must match all the hashes • Pr[H_AND(o1) = H_AND(o2)] = (p_{o1,o2})^k • Combine l hashes together, using the OR operation • One needs to match at least one of the hashes • Pr[H_OR(o1) = H_OR(o2)] = 1 − (1 − p_{o1,o2})^l • No match only when none of the hashes match
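The two compositions are commonly combined, as in classic LSH banding: l OR-repetitions, each the AND of k hashes (the slide introduces the pieces separately; the combination below is the standard one). This turns a gently sloping collision curve into a sharp S-curve:

```python
def and_or_probability(p, k, l):
    """Candidate probability for l OR-bands of k AND-ed hashes:
    a pair survives if at least one band matches on all k hashes."""
    return 1 - (1 - p ** k) ** l

# A close pair (single-hash collision p = 0.8) vs. a distant pair (p = 0.4),
# with k = 5 hashes per band and l = 20 bands:
print(and_or_probability(0.8, 5, 20))  # ≈ 0.9996 — almost surely a candidate
print(and_or_probability(0.4, 5, 20))  # ≈ 0.186  — rarely a candidate
```

Tuning k and l moves the threshold of the S-curve, which is exactly the control over "where the drastic change happens" that a single hash function lacks.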
