SLIDE 1

COMP9313: Big Data Management

High Dimensional Similarity Search

SLIDE 2: Similarity Search

  • Problem Definition:
  • Given a query q and a dataset D, find o ∈ D such that o is similar to q
  • Two types of similarity search
  • Range search:
  • dist(o, q) ≀ τ
  • Nearest neighbor (NN) search:
  • dist(o*, q) ≀ dist(o, q), βˆ€o ∈ D
  • Top-k version
  • Distance/similarity function varies
  • Euclidean, Jaccard, inner product, …
  • Classic problem, with mature solutions

(Figure: query q, its nearest neighbor o*, and range Ο„)

SLIDE 3: High Dimensional Similarity Search

  • Applications and relationship to Big Data
  • Almost every object can be (and has been) represented by a high dimensional vector
  • Words, documents
  • Image, audio, video
  • …
  • Similarity search is a fundamental operation in information retrieval
  • E.g., Google search engine, face recognition systems, …
  • High dimensionality makes a huge difference!
  • Traditional solutions are no longer feasible
  • This lecture is about why, and how
  • We focus on high dimensional vectors in Euclidean space

SLIDE 4: Similarity Search in Low Dimensional Space

SLIDE 5: Similarity Search in One Dimensional Space

  • Data are just numbers: use binary search, binary search trees, B+ Trees, …
  • The essential idea behind all of them: objects can be sorted

SLIDE 6: Similarity Search in Two Dimensional Space

  • Why does binary search no longer work?
  • No total order!
  • Voronoi diagram

(Figures: Voronoi diagrams under Euclidean distance and Manhattan distance)

SLIDE 7: Similarity Search in Two Dimensional Space

  • Partition based algorithms
  • Partition data into β€œcells”
  • Nearest neighbors are in the same cell as the query, or in adjacent cells
  • How many β€œcells” to probe in 3-dimensional space?

SLIDE 8: Similarity Search in Metric Space

  • Triangle inequality
  • dist(x, q) ≀ dist(x, y) + dist(y, q)
  • Orchard’s Algorithm
  • For each x ∈ D, create a list of the other points in increasing order of distance to x
  • Given query q, randomly pick a point as the initial candidate (i.e., pivot p), and compute dist(p, q)
  • Walk along the list of p and compute the distances to q. If some y closer to q than p is found, use y as the new pivot (i.e., p ← y)
  • Repeat the procedure, and stop when the next y in the list satisfies dist(p, y) > 2 β‹… dist(p, q)
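
The walk-based search above is easy to express in code. Below is a minimal Python sketch of Orchard's algorithm (not from the slides; the function names are mine and the O(nΒ²) preprocessing is kept naive for clarity):

    import math
    import random

    def dist(x, y):
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    def build_orchard_lists(data):
        # For each point, precompute the other points sorted by distance to it.
        return {
            i: sorted((j for j in range(len(data)) if j != i),
                      key=lambda j: dist(data[i], data[j]))
            for i in range(len(data))
        }

    def orchard_search(data, lists, q):
        p = random.randrange(len(data))       # initial pivot, picked randomly
        d_p = dist(data[p], q)
        improved = True
        while improved:
            improved = False
            for y in lists[p]:
                if dist(data[p], data[y]) > 2 * d_p:
                    break                     # stopping rule from the slide
                d_y = dist(data[y], q)
                if d_y < d_p:                 # found a closer point: new pivot
                    p, d_p = y, d_y
                    improved = True
                    break
        return p, d_p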

SLIDE 9: Similarity Search in Metric Space

  • Orchard’s Algorithm: why can we stop when dist(p, y) > 2 β‹… dist(p, q)?
  • 2 β‹… dist(p, q) < dist(p, y) and dist(p, y) ≀ dist(p, q) + dist(y, q)
    β‡’ 2 β‹… dist(p, q) < dist(p, q) + dist(y, q)
    ⇔ dist(p, q) < dist(y, q)
  • Since the list of p is in increasing order of distance to p, dist(p, y) > 2 β‹… dist(p, q) holds for all the remaining y’s, so none of them can be closer to q than p

SLIDE 10: None of the Above Works in High Dimensional Space!

SLIDE 11: Curse of Dimensionality

  • Refers to various phenomena that arise in high dimensional spaces but do not occur in low dimensional settings
  • Triangle inequality
  • Its pruning power degrades heavily
  • What is the volume of a high dimensional β€œring” (i.e., hyperspherical shell)?
  • For a shell of width w inside a ball of radius r, the volume ratio is 1 βˆ’ ((r βˆ’ w)/r)^d:
  • V_ring(w=1, d=2) / V_ball(r=10, d=2) = 19%
  • V_ring(w=1, d=100) / V_ball(r=10, d=100) = 99.997%
  • In high dimensions, almost all of the volume concentrates near the surface
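
A quick sanity check of these ratios (a sketch; it uses the fact that the volume of a d-dimensional ball scales as r^d, so the shell fraction is 1 βˆ’ ((r βˆ’ w)/r)^d):

    def shell_fraction(r: float, w: float, d: int) -> float:
        """Fraction of a d-ball's volume within distance w of its surface."""
        return 1.0 - ((r - w) / r) ** d

    print(shell_fraction(10, 1, 2))    # 0.19
    print(shell_fraction(10, 1, 100))  # ~0.99997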

SLIDE 12: Approximate Nearest Neighbor Search in High Dimensional Space

  • There is no sub-linear solution to find the exact result of a nearest neighbor query
  • So we relax the condition
  • Approximate nearest neighbor search (ANNS)
  • Allow returned points to be not the NN of the query
  • Success: returns the true NN
  • Use the success rate (e.g., percentage of succeeded queries) to evaluate a method
  • Hard to bound the success rate

SLIDE 13: c-approximate NN Search

  • Success: returns o such that
  • dist(o, q) ≀ c β‹… dist(o*, q)
  • Then we can bound the success probability
  • Usually denoted as 1 βˆ’ δ
  • Solution: Locality Sensitive Hashing (LSH)

(Figure: query q, true NN o* at distance r, and the c-approximate ball of radius cr)

SLIDE 14: Locality Sensitive Hashing

  • Hash function
  • Index: map data/objects to values (e.g., hash keys)
  • Same data β‡’ same hash key (with 100% probability)
  • Different data β‡’ different hash keys (with high probability)
  • Retrieval: easy to retrieve identical objects (as they have the same hash key)
  • Applications: hash map, hash join
  • Low cost
  • Space: O(n)
  • Time: O(1)
  • Why can’t it be used in nearest neighbor search?
  • Even a minor difference leads to totally different hash keys

SLIDE 15: Locality Sensitive Hashing

  • Index: make the hash functions error tolerant
  • Similar data β‡’ same hash key (with high probability)
  • Dissimilar data β‡’ different hash keys (with high probability)
  • Retrieval:
  • Compute the hash key for the query
  • Obtain all the data that has the same key as the query (i.e., the candidates)
  • Find the nearest one to the query
  • Cost:
  • Space: O(n)
  • Time: O(1) + O(|cand|)
  • This is not yet the real Locality Sensitive Hashing!
  • We still have several unsolved issues…

SLIDE 16: LSH Functions

  • Formal definition:
  • Given points o1, o2, distances r1 < r2, and probabilities p1 > p2
  • An LSH function h(β‹…) should satisfy
  • Pr[h(o1) = h(o2)] β‰₯ p1, if dist(o1, o2) ≀ r1
  • Pr[h(o1) = h(o2)] ≀ p2, if dist(o1, o2) > r2
  • What is h(β‹…) for a given distance/similarity function?
  • Jaccard similarity
  • Angular distance
  • Euclidean distance

SLIDE 17: MinHash - LSH Function for Jaccard Similarity

  • Each data object is a set
  • Jaccard(S1, S2) = |S1 ∩ S2| / |S1 βˆͺ S2|
  • Randomly generate a global order for all the elements in C = βˆͺ_{i=1..n} S_i
  • Let h(S) be the minimal member of S with respect to the global order
  • For example, S = {b, c, e, g, i}; if we use inverse alphabetical order, the re-ordered S = {i, g, e, c, b}, hence h(S) = i
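
A small Python sketch of MinHash as defined above (not from the slides; it materializes random permutations as the global order, whereas practical implementations approximate them with cheap hash functions):

    import random

    def make_minhash(universe, seed):
        """One MinHash function: a random global order over the universe."""
        rng = random.Random(seed)
        perm = rng.sample(sorted(universe), len(universe))
        order = {e: rank for rank, e in enumerate(perm)}
        return lambda s: min(s, key=order.__getitem__)

    universe = set("abcdefghij")
    hs = [make_minhash(universe, seed) for seed in range(200)]

    S1, S2 = set("abcde"), set("abcdf")
    est = sum(h(S1) == h(S2) for h in hs) / len(hs)
    print(est)  # ~ |S1 ∩ S2| / |S1 βˆͺ S2| = 4/6 β‰ˆ 0.67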

SLIDE 18: MinHash

  • Now we compute Pr[h(S1) = h(S2)]
  • Every element e ∈ S1 βˆͺ S2 has an equal chance to be the first element among S1 βˆͺ S2 after re-ordering
  • If that first element e ∈ S1 ∩ S2, then h(S1) = h(S2)
  • If e βˆ‰ S1 ∩ S2, then h(S1) β‰  h(S2)
  • Pr[h(S1) = h(S2)] = |S1 ∩ S2| / |S1 βˆͺ S2| = Jaccard(S1, S2)

SLIDE 19: SimHash – LSH Function for Angular Distance

  • Each data object is a d dimensional vector
  • θ(x, y) is the angle between x and y
  • Randomly generate a normal vector a, where each a_i ~ N(0, 1)
  • Let h(x; a) = sgn(aᡀx)
  • sgn(v) = 1 if v β‰₯ 0; βˆ’1 if v < 0
  • i.e., the hash records on which side of a’s corresponding hyperplane x lies
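
A matching Python sketch for SimHash (illustrative only; the collision-rate estimate anticipates the 1 βˆ’ ΞΈ/Ο€ result derived on the next slide):

    import random

    def make_simhash(d, seed):
        """One SimHash function: sign of the projection onto a random normal vector."""
        rng = random.Random(seed)
        a = [rng.gauss(0, 1) for _ in range(d)]
        return lambda x: 1 if sum(ai * xi for ai, xi in zip(a, x)) >= 0 else -1

    hs = [make_simhash(3, seed) for seed in range(1000)]
    x, y = [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]   # angle ΞΈ = Ο€/4
    est = sum(h(x) == h(y) for h in hs) / len(hs)
    print(est)  # β‰ˆ 1 βˆ’ ΞΈ/Ο€ = 0.75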

SLIDE 20: SimHash

  • Now we compute Pr[h(o1) = h(o2)]
  • h(o1) β‰  h(o2) iff o1 and o2 are on different sides of the hyperplane with a as its normal vector
  • A random hyperplane separates them with probability ΞΈ/Ο€, hence Pr[h(o1) = h(o2)] = 1 βˆ’ ΞΈ/Ο€

(Figure: o1 and o2 separated by angle ΞΈ, with a random hyperplane whose normal vector is a)

SLIDE 21: p-stable LSH - LSH Function for Euclidean Distance

  • Each data object is a d dimensional vector
  • dist(x, y) = sqrt(Ξ£_{i=1..d} (x_i βˆ’ y_i)Β²)
  • Randomly generate a normal vector a, where each a_i ~ N(0, 1)
  • The normal distribution is 2-stable, i.e., if a_i ~ N(0, 1), then Ξ£_{i=1..d} a_i β‹… x_i ~ N(0, β€–xβ€–β‚‚Β²)
  • Let h(x; a, b) = ⌊(aᡀx + b) / wβŒ‹, where b ~ U(0, w) and w is a user-specified parameter (the bucket width)
  • Pr[h(o1; a, b) = h(o2; a, b)] = βˆ«β‚€^w (1/r) β‹… f(t/r) β‹… (1 βˆ’ t/w) dt, where r = dist(o1, o2)
  • f(β‹…) is the pdf of the absolute value of a standard normal variable
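
And a sketch of the p-stable hash for Euclidean distance (illustrative; the bucket width w = 4 and the test points are arbitrary choices):

    import math
    import random

    def make_pstable_hash(d, w, seed):
        """One p-stable LSH function h(x) = floor((a.x + b) / w)."""
        rng = random.Random(seed)
        a = [rng.gauss(0, 1) for _ in range(d)]
        b = rng.uniform(0, w)
        return lambda x: math.floor((sum(ai * xi for ai, xi in zip(a, x)) + b) / w)

    hs = [make_pstable_hash(2, 4.0, seed) for seed in range(2000)]
    near = sum(h([0.0, 0.0]) == h([1.0, 0.0]) for h in hs) / len(hs)
    far  = sum(h([0.0, 0.0]) == h([8.0, 0.0]) for h in hs) / len(hs)
    print(near, far)  # nearby points collide far more often than distant ones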

SLIDE 22: p-stable LSH

  • Intuition of p-stable LSH
  • Similar points have a higher chance of being hashed together

SLIDE 23: Pr[h(x) = h(y)] for Different Hash Functions

(Figure: collision probability curves for MinHash, SimHash, and p-stable LSH)

SLIDE 24: Problem of a Single Hash Function

  • Hard to distinguish two pairs whose distances are close to each other
  • Pr[h(o1) = h(o2)] β‰₯ p1, if dist(o1, o2) ≀ r1
  • Pr[h(o1) = h(o2)] ≀ p2, if dist(o1, o2) > r2
  • We also want to control where the drastic change of the collision probability happens…
  • Close to dist(o*, q)
  • Or at a given range

SLIDE 25: AND-OR Composition

  • Recall that for a single hash function, we have
  • Pr[h(o1) = h(o2)] = p(dist(o1, o2)), denoted as p_{o1,o2}
  • Now we consider two scenarios:
  • Combine k hashes together, using an AND operation
  • One must match all the hashes
  • Pr[H^AND(o1) = H^AND(o2)] = (p_{o1,o2})^k
  • Combine l hashes together, using an OR operation
  • One needs to match at least one of the hashes
  • Pr[H^OR(o1) = H^OR(o2)] = 1 βˆ’ (1 βˆ’ p_{o1,o2})^l
  • No match only when none of the hashes match
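
These compositions are one-liners to check numerically. A sketch (with k = l = 5, matching the MinHash example on the following slides):

    def p_and(p, k):
        """Collision probability after AND-combining k independent hashes."""
        return p ** k

    def p_or(p, l):
        """Collision probability after OR-combining l independent hashes."""
        return 1 - (1 - p) ** l

    def p_and_or(p, k, l):
        """l super-hashes, each the AND of k hashes (see Slide 27)."""
        return 1 - (1 - p ** k) ** l

    for p in (0.2, 0.4, 0.6, 0.7, 0.8, 0.9):
        print(p, round(p_and_or(p, 5, 5), 3))  # reproduces the table on Slide 28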

SLIDE 26: AND-OR Composition

  • Example with MinHash, k = 5, l = 5

(Figure: curves of Pr[H^AND(o1) = H^AND(o2)], Pr[H^OR(o1) = H^OR(o2)], and Pr[h(o1) = h(o2)])

SLIDE 27: AND-OR Composition in LSH

  • Let h_{j,i} be LSH functions, where j ∈ {1, 2, …, l}, i ∈ {1, 2, …, k}
  • Let H_j(o) = [h_{j,1}(o), h_{j,2}(o), …, h_{j,k}(o)]
  • The super-hash (AND composition)
  • H_j(o1) = H_j(o2) ⇔ βˆ€i ∈ {1, 2, …, k}: h_{j,i}(o1) = h_{j,i}(o2)
  • Consider query q and any data point o; o is a nearest neighbor candidate of q if
  • βˆƒj ∈ {1, 2, …, l}: H_j(o) = H_j(q)  (OR composition over the l super-hashes)
  • The probability that o is a nearest neighbor candidate of q is
  • 1 βˆ’ (1 βˆ’ (p_{q,o})^k)^l

SLIDE 28: The Effectiveness of LSH

  • How 1 βˆ’ (1 βˆ’ (p_{q,o})^k)^l changes with p_{q,o} (k = 5, l = 5):

    p_{q,o} | 1 βˆ’ (1 βˆ’ (p_{q,o})^k)^l
    0.2     | 0.002
    0.4     | 0.050
    0.6     | 0.333
    0.7     | 0.601
    0.8     | 0.863
    0.9     | 0.988

  • E.g., we are expected to retrieve 98.8% of the data with Jaccard similarity > 0.9

SLIDE 29: False Positives and False Negatives

  • False positive:
  • a returned point o with dist(o, q) > r2
  • False negative:
  • a point o with dist(o, q) < r1 that is not returned
  • They can be controlled by carefully choosing k and l
  • It’s a trade-off between space/time and accuracy

SLIDE 30: The Framework of NNS Using LSH

  • Pre-processing
  • Generate LSH functions
  • MinHash: random permutations
  • SimHash: random normal vectors
  • p-stable: random normal vectors and random uniform values
  • Index
  • Compute H_j(o) for each data object o, j ∈ {1, …, l}
  • Index o using H_j(o) as the key in the j-th hash table
  • Query
  • Compute H_j(q) for query q, j ∈ {1, …, l}
  • Generate the candidate set {o | βˆƒj ∈ {1, …, l}: H_j(q) = H_j(o)}
  • Compute the actual distance for all the candidates and return the nearest one to the query
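
The whole framework fits in a few lines. A self-contained sketch using SimHash as the LSH family (not the lecture's reference code; hash keys are tuples of k signs, and l tables are probed):

    import math
    import random
    from collections import defaultdict

    def make_simhash(d, seed):
        rng = random.Random(seed)
        a = [rng.gauss(0, 1) for _ in range(d)]
        return lambda x: 1 if sum(ai * xi for ai, xi in zip(a, x)) >= 0 else -1

    def build_index(data, d, k, l, seed=0):
        # l hash tables; the key of table j is the super-hash H_j = (h_{j,1}, ..., h_{j,k})
        hs = [[make_simhash(d, seed + j * k + i) for i in range(k)] for j in range(l)]
        tables = [defaultdict(list) for _ in range(l)]
        for idx, x in enumerate(data):
            for j in range(l):
                tables[j][tuple(h(x) for h in hs[j])].append(idx)
        return hs, tables

    def query(q, data, hs, tables):
        cand = set()
        for j, table in enumerate(tables):
            cand.update(table.get(tuple(h(q) for h in hs[j]), []))
        # verification: compute the exact distance for the candidates only
        return min(cand, key=lambda i: math.dist(q, data[i]), default=None)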

SLIDE 31: The Drawback of LSH

  • Concatenating k hashes is too β€œstrong”
  • h_{j,i}(o1) β‰  h_{j,i}(o2) for any single i β‡’ H_j(o1) β‰  H_j(o2)
  • Not adaptive to the distribution of the distances
  • What if there are not enough candidates?
  • Need to tune w (or build indexes with different w’s) to handle different cases

SLIDE 32: Multi-Probe LSH

  • Observation:
  • If q’s nearest neighbor does not fall into q’s hash bucket, then most likely it falls into a bucket adjacent to q’s
  • Why? aᡀx βˆ’ aᡀq = aᡀ(x βˆ’ q) ~ N(0, β€–x βˆ’ qβ€–β‚‚Β²), so the projections of close points are close
  • Idea:
  • Look not only at the hash bucket where q falls, but also at those adjacent to it
  • Problem:
  • How many such buckets? 2k per table
  • And they are not equally important!

SLIDE 33: Multi-Probe LSH

  • Consider the case when k = 2:
  • Note that H_j(o) = (h_{j,1}(o), h_{j,2}(o))
  • The ideal probe order would be, e.g.:
  • (h_{j,1}(q), h_{j,2}(q)): 0.315
  • (h_{j,1}(q), h_{j,2}(q) βˆ’ 1): 0.284
  • (h_{j,1}(q) + 1, h_{j,2}(q)): 0.150
  • (h_{j,1}(q) βˆ’ 1, h_{j,2}(q)): 0.035
  • (h_{j,1}(q), h_{j,2}(q) + 1): 0.019
  • We don’t have to compute the integration: the offset between q’s projection and the bucket boundaries is enough to rank the probes

SLIDE 34: Multi-Probe LSH

  • Pros:
  • Requires a smaller l (fewer hash tables)
  • Because we use the hash tables more efficiently
  • More robust against the unlucky points
  • Cons:
  • Loses the theoretical guarantee on the results
  • Not parallel-friendly

SLIDE 35: Collision Counting LSH (C2LSH)

  • C2LSH (SIGMOD’12 paper)
  • Which one is closer to q?
  • We will omit the theoretical parts, which leads to a slightly different version from the paper
  • But the essential ideas are the same
  • Project 1 is to implement C2LSH using PySpark!

(Figure: hash values of q, o1, and o2 across the hash functions; collisions with q are counted per function)

SLIDE 36: Counting the Collisions

  • Collision: a match on a single hash function
  • Use the number of collisions to determine the candidates
  • The candidate condition changes from β€œmatches one of the super-hashes of q” to β€œcollides with q on at least Ξ±m of the m hash values”
  • Recall that in LSH, the probability that o with dist(o, q) ≀ r1 is a nearest neighbor candidate of q is 1 βˆ’ (1 βˆ’ p1^k)^l
  • Now we compute the corresponding probability with collision counting…

SLIDE 37: The Collision Probability

  • βˆ€o with dist(o, q) ≀ r1, we have
  • Pr[#collision(o) β‰₯ Ξ±m] = Ξ£_{i=Ξ±m..m} C(m, i) β‹… p^i β‹… (1 βˆ’ p)^{mβˆ’i}
  • p = Pr[h_j(o) = h_j(q)] β‰₯ p1
  • We define m Bernoulli random variables X_i, 1 ≀ i ≀ m
  • Let X_i equal 1 if o does not collide with q on the i-th hash function
  • i.e., Pr[X_i = 1] = 1 βˆ’ p
  • Let X_i equal 0 if o collides with q on the i-th hash function
  • i.e., Pr[X_i = 0] = p
  • Hence E[X_i] = 1 βˆ’ p
  • Thus E[XΜ„] = 1 βˆ’ p, where XΜ„ = (Ξ£_{i=1..m} X_i) / m
  • Let t = p βˆ’ Ξ± > 0; we have:
  • Pr[XΜ„ βˆ’ E[XΜ„] β‰₯ t] = Pr[(Ξ£_{i=1..m} X_i)/m βˆ’ (1 βˆ’ p) β‰₯ t] = Pr[Ξ£ X_i β‰₯ (1 βˆ’ Ξ±)m]

SLIDE 38: The Collision Probability

  • From Hoeffding’s Inequality, we have
  • Pr[XΜ„ βˆ’ E[XΜ„] β‰₯ t] = Pr[Ξ£ X_i β‰₯ (1 βˆ’ Ξ±)m] ≀ exp(βˆ’2(p βˆ’ Ξ±)Β²mΒ² / Ξ£_{i=1..m} (1 βˆ’ 0)Β²) = exp(βˆ’2(p βˆ’ Ξ±)Β²m) ≀ exp(βˆ’2(p1 βˆ’ Ξ±)Β²m)
  • Since the event β€œ#collision(o) β‰₯ Ξ±m” is equivalent to the event β€œo misses the collision with q fewer than (1 βˆ’ Ξ±)m times”,
  • Pr[#collision(o) β‰₯ Ξ±m] = Pr[Ξ£ X_i < (1 βˆ’ Ξ±)m] β‰₯ 1 βˆ’ exp(βˆ’2(p1 βˆ’ Ξ±)Β²m)
  • You can now compute the case for o with dist(o, q) β‰₯ r2 in a similar way…
  • Then we can set Ξ± accordingly to control false positives and false negatives
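
A tiny sketch showing how this lower bound behaves as the number of hash functions m grows (the values p1 = 0.9 and Ξ± = 0.8 are illustrative, not from the slides):

    import math

    def collision_lower_bound(p1: float, alpha: float, m: int) -> float:
        """Hoeffding lower bound on Pr[#collision >= alpha*m] for a close point."""
        assert p1 > alpha
        return 1 - math.exp(-2 * (p1 - alpha) ** 2 * m)

    for m in (50, 100, 200):
        print(m, round(collision_lower_bound(0.9, 0.8, m), 4))  # 0.6321, 0.8647, 0.9817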

SLIDE 39: Virtual Rehashing

  • When we are not getting enough candidates…
  • E.g., # of candidates < top-k
  • Observation:
  • A close point o that does not collide with q usually falls into hash buckets adjacent to q’s
  • Why? aᡀx βˆ’ aᡀq ~ N(0, β€–x βˆ’ qβ€–β‚‚Β²)
  • Idea:
  • Include the adjacent hash buckets into consideration
  • So you don’t need to re-hash the data again…

SLIDE 40: Virtual Rehashing

  • At first, count a collision only when h(o) = h(q)
  • If there are not enough candidates, also count h(o) = h(q) Β± 1
  • Then h(o) = h(q) Β± 2, and so on…

(Figure: hash values of q and o1 across the hash functions; the collision count grows as the offset is relaxed step by step)
slide-41
SLIDE 41
  • Pre-processing
  • Generate LSH functions
  • Random normal vectors and random uniform values
  • Index
  • Compute and store β„Žπ‘— 𝑝 for each data object 𝑝,

𝑗 ∈ {1, … , 𝑛}

  • Query
  • Compute β„Žπ‘— π‘Ÿ for query π‘Ÿ, 𝑗 ∈ {1, … , 𝑛}
  • Take those 𝑝 that shares at least 𝛽𝑛 hashes with π‘Ÿ as

candidates

  • Relax the collision condition (e.g., virtual rehashing) and

repeat the above step, until we got enough candidates

41

The Framework of NNS using C2LSH

SLIDE 42: Pseudocode of Candidate Generation in C2LSH

    candGen(data_hashes, query_hashes, αm, βn):
        offset ← 0
        cand ← βˆ…
        while true:
            for each (id, hashes) in data_hashes:
                if count(hashes, query_hashes, offset) β‰₯ Ξ±m:
                    cand ← cand βˆͺ {id}
            if |cand| < Ξ²n:
                offset ← offset + 1
            else:
                break
        return cand

SLIDE 43: Pseudocode of Candidate Generation in C2LSH

    count(hashes_1, hashes_2, offset):
        counter ← 0
        for each pair (hash1, hash2) in (hashes_1, hashes_2):
            if |hash1 βˆ’ hash2| ≀ offset:
                counter ← counter + 1
        return counter
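
A direct Python translation of the two routines (a sketch of the light version above; note that in Project 1 this logic must be expressed with PySpark transformations rather than a driver-side loop):

    def count_collisions(hashes_1, hashes_2, offset):
        # A hash value collides if it is within Β±offset of the query's value.
        return sum(1 for h1, h2 in zip(hashes_1, hashes_2) if abs(h1 - h2) <= offset)

    def cand_gen(data_hashes, query_hashes, alpha_m, beta_n):
        offset = 0
        while True:
            cand = {oid for oid, hashes in data_hashes
                    if count_collisions(hashes, query_hashes, offset) >= alpha_m}
            if len(cand) < beta_n:
                offset += 1   # virtual rehashing: relax the collision condition
            else:
                return cand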

SLIDE 44: Project 1

  • The spec has been released; deadline: 18 Jul, 2020
  • Late penalty: 10% on day 1 and 30% on each subsequent day
  • Implement a light version of C2LSH (i.e., the one we introduced in the lecture)
  • Start working ASAP
  • Evaluation: correctness and efficiency
  • Must use PySpark; some Python modules and PySpark functions are banned
  • E.g., numpy, pandas, collect(), take(), …
  • Use transformations!

SLIDE 45: Project 1

  • There will be a bonus part (max 20 points) to encourage efficient implementations
  • Details in the spec
  • Make sure you have valid output
  • Make your own test cases; a real dataset would be more desirable
  • The toy example in the spec is a real β€œtoy” (e.g., for babies…)
  • We won’t accept excuses like β€œit works on my own computer”
  • Don’t violate the Student Conduct!!!

SLIDE 46: Product Quantization and K-Means Clustering

SLIDE 47: Recall: NNS in High Dimensional Euclidean Space

  • NaΓ―ve (but exact) solution:
  • Linear scan: compute dist(o, q) for all o ∈ D
  • dist(o, q) = sqrt(Ξ£_{i=1..d} (o_i βˆ’ q_i)Β²)
  • O(nd) time
  • n times (d subtractions + d βˆ’ 1 additions + d multiplications)
  • Storage is also costly: O(nd)
  • Could be problematic in DBMSs and distributed systems
  • This motivates the idea of compression
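
For reference, the naΓ―ve linear scan is a one-liner (a sketch using Python's math.dist):

    import math

    def linear_scan_nn(data, q):
        """Exact NN by brute force: O(n*d) time over n vectors of dimension d."""
        return min(data, key=lambda o: math.dist(o, q))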

SLIDE 48: Vector Quantization

  • Idea: compressed representation of vectors
  • Each vector o is represented by a representative
  • Denoted as rep(o)
  • We will discuss how to get the representatives later
  • We control the total number of representatives for the dataset (denoted as k)
  • One representative represents multiple vectors
  • Instead of storing o, we store its representative id
  • d floats => 1 integer
  • Instead of computing dist(o, q), we compute dist(rep(o), q)
  • We only need k distance computations!

SLIDE 49: How to Generate Representatives

  • Assigning representatives is essentially a partition problem
  • Construct a β€œgood” partition of a database of n objects into a set of k clusters
  • How to measure the β€œgoodness” of a given partitioning scheme?
  • Cost of a cluster
  • Cost(C_i) = Ξ£_{o_j ∈ C_i} β€–o_j βˆ’ center(C_i)β€–β‚‚Β²
  • Cost of k clusters: the sum of Cost(C_i)
slide-50
SLIDE 50
  • It’s an optimization problem!
  • Global optimal:
  • NP-hard (for a wide range of cost functions)
  • Requires exhaustively enumerate all π‘œ

𝑙 partitions

  • Stirling numbers of the second kind
  • π‘œ

𝑙 ~ π‘™π‘œ 𝑙! when π‘œ β†’ ∞

  • Heuristic methods:
  • k-means
  • Many variants

50

Partitioning Problem: Basic Concept

SLIDE 51: The k-Means Clustering Method

  • Given k, the k-means algorithm is implemented in four steps:
  • 1. Partition objects into k nonempty subsets (randomly)
  • 2. Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
  • 3. Assign each object to the cluster with the nearest seed point
  • 4. Go back to Step 2; stop when the assignment does not change
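
A plain Python sketch of these four steps, i.e., Lloyd's algorithm (not from the slides; re-seeding empty clusters with a random point is an arbitrary choice of mine):

    import math
    import random

    def kmeans(data, k, seed=0):
        """Plain k-means; returns (centroids, assignment)."""
        rng = random.Random(seed)
        assign = [rng.randrange(k) for _ in data]          # step 1: random partition
        while True:
            # step 2: centroids as the mean point of each cluster
            centers = []
            for j in range(k):
                members = [o for o, a in zip(data, assign) if a == j]
                if not members:                            # keep empty clusters alive
                    members = [rng.choice(data)]
                centers.append([sum(c) / len(members) for c in zip(*members)])
            # step 3: reassign each object to its nearest centroid
            new_assign = [min(range(k), key=lambda j: math.dist(o, centers[j]))
                          for o in data]
            if new_assign == assign:                       # step 4: stop on no change
                return centers, assign
            assign = new_assign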

SLIDE 52: An Example of k-Means Clustering

(Figure: K = 2. The initial data set; arbitrarily partition objects into k groups; update the cluster centroids; reassign objects; update the cluster centroids again; loop if needed)

  β—Ό Partition objects into k nonempty subsets
  β—Ό Repeat
  β—Ό   Compute the centroid (i.e., mean point) of each partition
  β—Ό   Assign each object to the cluster of its nearest centroid
  β—Ό Until no change

SLIDE 53: Vector Quantization

  • Encode the vectors
  • Generate a codebook C = {c_1, …, c_k} via k-means
  • Assign o to its nearest codeword in C
  • i.e., rep(o) = c_i, i ∈ {1 … k}, such that dist(o, c_i) ≀ dist(o, c_j) βˆ€j
  • Represent each vector o by its assigned codeword
  • Assume d = 256, k = 2^16
  • Before: 4 bytes Γ— 256 = 1024 bytes for each vector
  • Now:
  • data: 16 bits = 2 bytes per vector
  • codebook: 4 Γ— 256 Γ— 2^16 bytes (shared by all vectors)

SLIDE 54: Vector Quantization – Query Processing

  • Given query q, how to find a point close to q?
  • Algorithm:
  • Compute rep(q)
  • Candidate set C = all data vectors associated with rep(q)
  • Verification: compute the distance between q and each o_i ∈ C
  • Requires loading the vectors in C
  • Any problem/improvement?
  • Inverted index: a hash table that maps each codeword c_j to the list of o_i associated with c_j

SLIDE 55: Limitations of VQ

  • To achieve better accuracy, a fine-grained quantizer with a large k is needed
  • Large k:
  • Costly to run k-means
  • Computing rep(q) is expensive: O(kd)
  • May need to look beyond rep(q)’s cell
  • Solution:
  • Product Quantization

SLIDE 56: Product Quantization

  • Idea
  • Partition the d dimensions into m partitions
  • Accordingly, a vector => m subvectors
  • Use a separate VQ with k codewords for each chunk
  • Example:
  • An 8-dim vector decomposed into m = 2 subvectors
  • Each codebook has k = 4 codewords (i.e., c_{j,i})
  • Total space in bits:
  • Data: n β‹… m β‹… log(k)
  • Codebook: m β‹… (d/m) β‹… k β‹… 32 = d β‹… k β‹… 32
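
A sketch of PQ training and encoding, reusing the kmeans helper sketched under Slide 51 (assumes d is divisible by m; function names are mine):

    import math

    def split(o, m):
        """Cut a vector into m equal-length subvectors."""
        step = len(o) // m
        return [o[j * step:(j + 1) * step] for j in range(m)]

    def pq_train(data, m, k):
        # One codebook of k codewords per chunk, trained on that chunk only.
        return [kmeans([split(o, m)[j] for o in data], k)[0] for j in range(m)]

    def pq_encode(o, codebooks):
        # The code of o: for each chunk, the id of its nearest codeword.
        m = len(codebooks)
        return [min(range(len(cb)), key=lambda i: math.dist(sub, cb[i]))
                for sub, cb in zip(split(o, m), codebooks)]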

SLIDE 57: Example of PQ

(Figure: example vectors split into two 4-dim subvectors; each codebook holds four codewords c_{1,0}…c_{1,3} and c_{2,0}…c_{2,3}, and each vector is stored as two 2-bit codeword ids, e.g., 00, 01, 11)

SLIDE 58: Distance Estimation

  • Euclidean distance between a query point q and a data point encoded as t
  • Restore the virtual joint center p by looking up each partition of t in the corresponding codebook
  • dΒ²(q, p) = Ξ£_{i=1..d} (q_i βˆ’ p_i)Β²
  • Known as Asymmetric Distance Computation (ADC)
  • dΒ²(q, t) = Ξ£_{j=1..m} β€–q(j) βˆ’ c_{j,t(j)}β€–β‚‚Β², where q(j) is the j-th subvector of q and t(j) is the j-th codeword id of t

(Figure: a query q and a code t = (01, 00); looking up t in the two codebooks restores the virtual center p)
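
A sketch of ADC with per-chunk lookup tables, reusing the split helper above: each query builds the tables once in O(k Β· d), after which estimating one point's distance costs only m table lookups:

    import math

    def adc_tables(q, codebooks):
        """For each chunk j: squared distance from q(j) to every codeword c_{j,i}."""
        m = len(codebooks)
        return [[math.dist(sub, c) ** 2 for c in cb]
                for sub, cb in zip(split(q, m), codebooks)]

    def adc(code, tables):
        # d^2(q, t) = sum over chunks of the precomputed entry for codeword t(j)
        return sum(table[cid] for cid, table in zip(code, tables))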

SLIDE 59: Query Processing

  • Compute the ADC for every point in the database
  • How?
  • Candidates = those with the l smallest ADs
  • [Optional] Re-ranking (if l > 1):
  • Load the candidates’ data vectors and compute the actual Euclidean distances
  • Return the one with the smallest distance

SLIDE 60: Query Processing

(Figure: computing the ADs of two encoded points t1 = (01, 00) and t2 = (11, 10) against q: dΒ²(q, t1) = β€–q(1) βˆ’ c_{1,t1(1)}β€–Β² + β€–q(2) βˆ’ c_{2,t1(2)}β€–Β², and likewise for t2, via codebook lookups)

SLIDE 61: Framework of PQ

  • Pre-processing:
  • Step 1: partition the data vectors
  • Step 2: generate the codebooks (e.g., via k-means)
  • Step 3: encode the data
  • Query:
  • Step 1: compute the distances between q and the codewords
  • Step 2: compute the AD for each point and return the candidates
  • Step 3: re-ranking (optional)