Diverse Near Neighbor Problem
Sofiane Abbar (QCRI) Sihem Amer-Yahia (CNRS) Piotr Indyk (MIT) Sepideh Mahabadi (MIT) Kasturi R. Varadarajan (UIowa)
Near Neighbor Problem: Definition
• Set of $n$ points $P$ in $d$-dimensional space
• Query point $q$
• Report one neighbor of $q$ if there is any
• Major importance in databases (document, image, video), information retrieval, pattern recognition
Search: How many answers?
• Reporting the $k$ nearest neighbors may not be informative (they could be identical texts)
• Reporting many answers: the time to retrieve them is high
• Small output size which is diverse: results should come from different clusters
• Set of $n$ points $P$ in $d$-dimensional space
• Query point $q$
• Report the $k$ most diverse neighbors of $q$
• Neighbors: points within distance $r$ of the query
• We use Hamming distance
• Diversity: $\mathrm{div}(S) = \min_{p, p' \in S} |p - p'|$
• $Q \subseteq P \cap B(q, r)$
• $|Q| = k$
• $\mathrm{div}(Q)$ is maximized
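To make the definition concrete, here is a minimal brute-force sketch in Python. It assumes points are stored as tuples of bits; the names `hamming`, `div` and `exact_k_diverse` are illustrative and do not come from the slides. It enumerates every $k$-subset of $P \cap B(q, r)$, so it is exponential in $k$ and only useful as a reference for small inputs.

```python
from itertools import combinations

def hamming(p, q):
    """Hamming distance between two equal-length bit tuples."""
    return sum(a != b for a, b in zip(p, q))

def div(S):
    """Diversity of S: minimum pairwise distance (infinite if |S| < 2)."""
    return min((hamming(a, b) for a, b in combinations(S, 2)),
               default=float("inf"))

def exact_k_diverse(P, q, r, k):
    """Brute force over all k-subsets of the points of P within distance r of q.

    Exponential in k; only meant to illustrate the problem statement.
    """
    ball = [p for p in P if hamming(p, q) <= r]
    if len(ball) < k:
        return None  # fewer than k neighbors exist
    return max(combinations(ball, k), key=div)

P = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 0, 0), (0, 0, 1, 1), (1, 1, 1, 1)]
q = (0, 0, 0, 0)
Q = exact_k_diverse(P, q, r=2, k=3)
print(Q, div(Q))
```

The point `(1, 1, 1, 1)` is at distance 4 from the query and is therefore never considered, even though it would increase diversity.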
• $B(q, r) \to B(q, cr)$ for some value of $c > 1$
• Result: query time of $O(d\, n^{1/c})$
• Bi-criterion approximation: distance and diversity
• $(c, \beta)$-Approximate $k$-diverse Near Neighbor
• Let $Q^*$ (green points) be the optimum solution for $B(q, r)$
• Report $Q \subseteq B(q, cr)$ such that $\mathrm{div}(Q) \ge \frac{1}{\beta}\, \mathrm{div}(Q^*)$, with $\beta \ge 1$
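| | Algorithm A | Algorithm B |
|---|---|---|
| Distance Apx. Factor | $c > 2$ | $c > 1$ |
| Diversity Apx. Factor $\alpha$ | 6 | 6 |
| Space | $(n \log k)^{1 + 1/(c-1)} + nd$ | $\log k \cdot n^{1 + 1/c} + nd$ |
| Query Time | $(k^2 + \frac{\log n}{r})\, d \cdot (\log k)^{c/(c-1)}\, n^{1/(c-1)}$ | $(k^2 + \frac{\log n}{r})\, d \cdot \log k \cdot n^{1/c}$ |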
WWW'13]
• Choose an arbitrary point
• Repeat $k - 1$ times
  • Add the point whose minimum distance to the currently chosen points is maximized
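The greedy selection described above (referred to as GMM elsewhere in these slides) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are made up, the "arbitrary" starting point is simply the first element, and Hamming distance on bit tuples is assumed.

```python
def hamming(p, q):
    """Hamming distance between two equal-length bit tuples."""
    return sum(a != b for a, b in zip(p, q))

def gmm(S, k, dist=hamming):
    """Greedy farthest-point selection of up to k points from S."""
    S = list(S)
    if not S or k <= 0:
        return []
    chosen = [0]                           # arbitrary start: the first point
    gap = [dist(p, S[0]) for p in S]       # distance to the nearest chosen point
    gap[0] = -1                            # -1 marks points already chosen
    while len(chosen) < min(k, len(S)):
        i = max(range(len(S)), key=gap.__getitem__)   # farthest remaining point
        chosen.append(i)
        gap = [-1 if g < 0 else min(g, dist(S[j], S[i]))
               for j, g in enumerate(gap)]
        gap[i] = -1
    return [S[i] for i in chosen]

S = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 0, 0), (0, 0, 1, 1)]
picked = gmm(S, 3)
```

Each round costs one pass over $S$, so selecting $k$ points from $m$ candidates takes $O(mk)$ distance computations.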
• Close points have a higher probability of collision than far points
• Hash functions: $g_1, \dots, g_L$
• $h$ is $(P_1, P_2, r, cr)$-sensitive:
  • If $|p - p'| \le r$ then $\Pr[h(p) = h(p')] \ge P_1$
  • If $|p - p'| \ge cr$ then $\Pr[h(p) = h(p')] \le P_2$
• $h(p) = p_i$, i.e., the $i$th bit of $p$
  • Is $(1 - \frac{r}{d},\, 1 - \frac{cr}{d},\, r,\, cr)$-sensitive
• $L$ and $t$ are parameters of LSH
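A minimal sketch of bit-sampling LSH for Hamming space, under the assumption that each of the $L$ hash functions concatenates $t$ randomly sampled bit positions. The helper names `build_tables` and `candidates` are hypothetical, and the parameter values in the example are arbitrary rather than the ones the analysis would prescribe.

```python
import random
from collections import defaultdict

def build_tables(P, d, L, t, seed=0):
    """Build L hash tables; table i keys each point by t sampled bit positions."""
    rng = random.Random(seed)
    coords = [[rng.randrange(d) for _ in range(t)] for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]
    for p in P:
        for g, table in zip(coords, tables):
            table[tuple(p[i] for i in g)].append(p)
    return coords, tables

def candidates(q, coords, tables):
    """Union of the buckets that q hashes to, one bucket per table."""
    found = set()
    for g, table in zip(coords, tables):
        found.update(table.get(tuple(q[i] for i in g), []))
    return found

d = 16
rng = random.Random(1)
P = [tuple(rng.randint(0, 1) for _ in range(d)) for _ in range(50)]
coords, tables = build_tables(P, d, L=4, t=3)
```

A single sampled bit agrees on two points at Hamming distance $x$ with probability $1 - x/d$, which is where the $(1 - r/d,\ 1 - cr/d,\ r,\ cr)$ sensitivity comes from.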
With constant probability
• Any neighbor of $q$ falls into the same bucket as $q$ in at least one hash function
• Total number of outliers is at most $3L$
• Outlier: point farther than $cr$ from the query point
Algorithm
• Retrieve the possible neighbors $S = \bigcup_{i=1}^{L} B_i[g_i(q)]$
• Remove the outliers: $S = S \cap B(q, cr)$
• Report the approximate $k$ most diverse points of $S$, or GMM($S$)
• Should prune the buckets before collecting them
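The three query steps can be sketched as below. The sketch assumes the buckets $B_i[g_i(q)]$ have already been looked up and are passed in as plain lists; `diverse_query` is an illustrative name, and the candidate set is sorted only to make the "arbitrary" GMM starting point deterministic.

```python
def hamming(p, q):
    """Hamming distance between two equal-length bit tuples."""
    return sum(a != b for a, b in zip(p, q))

def gmm(S, k):
    """Greedy farthest-point selection of up to k points from S."""
    S = list(S)
    if not S or k <= 0:
        return []
    chosen, gap = [0], [hamming(p, S[0]) for p in S]
    gap[0] = -1                                   # -1 marks chosen points
    while len(chosen) < min(k, len(S)):
        i = max(range(len(S)), key=gap.__getitem__)
        chosen.append(i)
        gap = [-1 if g < 0 else min(g, hamming(S[j], S[i]))
               for j, g in enumerate(gap)]
        gap[i] = -1
    return [S[i] for i in chosen]

def diverse_query(buckets, q, r, c, k):
    """Union the retrieved buckets, drop outliers, then run GMM."""
    S = set().union(*buckets)                             # step 1: collect
    S = [p for p in sorted(S) if hamming(p, q) <= c * r]  # step 2: filter
    return gmm(S, k)                                      # step 3: diversify

buckets = [[(0, 0, 0, 0), (0, 0, 0, 1)],
           [(0, 0, 0, 1), (1, 1, 0, 0), (1, 1, 1, 1)],
           [(0, 0, 1, 1)]]
answer = diverse_query(buckets, q=(0, 0, 0, 0), r=1, c=2, k=3)
```

This version reads every point in every retrieved bucket, so its running time depends on the bucket sizes; that is the motivation for pruning the buckets beforehand.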
• Coreset: a small subset of a point set that represents it for the optimization problem
• Finding the $k$-diversity of $S$
• Instead we consider finding the $k$-center cost of $S$
  • $KC(S, S') = \max_{p \in S} \min_{p' \in S'} |p - p'|$
  • $KC_k(S) = \min_{S' \subseteq S,\, |S'| = k} KC(S, S')$
• KC cost 2-approximates diversity
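A small brute-force check of the relation between $k$-center cost and diversity. One precise form of the statement is $KC_{k-1}(S) \le \mathrm{div}_k(S) \le 2\, KC_{k-1}(S)$; the index $k - 1$ is not spelled out in the slides and is used here as an assumption. The function names are illustrative, and the enumeration is exponential, so this is only for tiny inputs.

```python
from itertools import combinations

def hamming(p, q):
    """Hamming distance between two equal-length bit tuples."""
    return sum(a != b for a, b in zip(p, q))

def kc(S, centers):
    """k-center cost of S with respect to a given set of centers."""
    return max(min(hamming(p, c) for c in centers) for p in S)

def kc_k(S, k):
    """Optimal k-center cost with centers restricted to S (brute force)."""
    return min(kc(S, C) for C in combinations(S, k))

def div_k(S, k):
    """Optimal k-diversity of S (brute force)."""
    return max(min(hamming(a, b) for a, b in combinations(Q, 2))
               for Q in combinations(S, k))

S = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 0, 0), (0, 0, 1, 1), (1, 1, 1, 1)]
k = 3
lower, best = kc_k(S, k - 1), div_k(S, k)
print(lower, best)
```

The upper bound follows from a pigeonhole argument: among any $k$ points, two must share a nearest center among the $k - 1$ centers, so they are within twice the cost of each other.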
• Any neighbor of $q$ falls into the same bucket as $q$ in at least one hash function
• There is no outlier
• $B'_i[j] = \mathrm{GMM}(B_i[j])$
• Keep a 1/3 coreset of $B_i[j]$
• Retrieve the coresets from buckets: $S = \bigcup_{i=1}^{L} B'_i[g_i(q)]$
• Run GMM($S$)
• Report the result
• Union of 1/3 coresets is a 1/3 coreset for the union
• The last GMM call adds a 2 approximation factor
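A minimal sketch of this scheme, assuming bit-sampling hash functions given as lists of bit positions. At preprocessing time every bucket is replaced by the $k$ points GMM selects from it; at query time the stored coresets are unioned and GMM is run once more. The names `build_coreset_tables` and `query_a` and the fixed `coords` are hypothetical, and no outlier filtering is done because this variant relies on there being no outliers.

```python
from collections import defaultdict

def hamming(p, q):
    """Hamming distance between two equal-length bit tuples."""
    return sum(a != b for a, b in zip(p, q))

def gmm(S, k):
    """Greedy farthest-point selection of up to k points from S."""
    S = list(S)
    if not S or k <= 0:
        return []
    chosen, gap = [0], [hamming(p, S[0]) for p in S]
    gap[0] = -1                                   # -1 marks chosen points
    while len(chosen) < min(k, len(S)):
        i = max(range(len(S)), key=gap.__getitem__)
        chosen.append(i)
        gap = [-1 if g < 0 else min(g, hamming(S[j], S[i]))
               for j, g in enumerate(gap)]
        gap[i] = -1
    return [S[i] for i in chosen]

def build_coreset_tables(P, coords, k):
    """Hash every point, then keep only GMM(bucket) for each bucket."""
    tables = [defaultdict(list) for _ in coords]
    for p in P:
        for g, table in zip(coords, tables):
            table[tuple(p[i] for i in g)].append(p)
    return [{key: gmm(bucket, k) for key, bucket in table.items()}
            for table in tables]

def query_a(q, coords, coresets, k):
    """Union the stored coresets of q's buckets and run GMM on the union."""
    S = set()
    for g, table in zip(coords, coresets):
        S.update(table.get(tuple(q[i] for i in g), []))
    return gmm(sorted(S), k)

coords = [[0, 1], [2, 3]]   # hypothetical sampled bit positions (L = 2, t = 2)
P = [(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 1),
     (0, 0, 1, 0), (1, 1, 0, 0), (1, 1, 1, 1)]
coresets = build_coreset_tables(P, coords, k=2)
answer = query_a((0, 0, 0, 0), coords, coresets, k=2)
```

Because each bucket holds at most $k$ points after preprocessing, the query touches at most $Lk$ points regardless of how crowded the original buckets were.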
• Space: $O(nL) = O((n \log k)^{1 + 1/(c-1)} + nd)$
• Time: $O(L k^2) = O((k^2 + \frac{\log n}{r})\, d \cdot (\log k)^{c/(c-1)}\, n^{1/(c-1)})$
• Only makes sense for $c > 2$
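One way to read the remark about $c > 2$ (an interpretation, not a statement taken from the slides): the query bound carries the factor $n^{1/(c-1)}$, which is sublinear in $n$ only when its exponent is below 1.

```latex
n^{1/(c-1)} < n
\iff \frac{1}{c-1} < 1
\iff c > 2
\qquad (n > 1,\ c > 1)
```

For $c \le 2$ the bound is at least linear in $n$, so it offers no advantage over scanning all points.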
• ANN query time is $O(d\, n^{1/c})$
• So if we want to improve over these we should be able to deal with outliers.
• $S'$ is an $\ell$-robust $\beta$-coreset for $S$ if
  • for any set $O$ of outliers of size at most $\ell$
  • $(S' \setminus O)$ is a $\beta$-coreset for $S \setminus O$
Yu, '06][Varadarajan, Xiao, '12]:
• Repeat $(\ell + 1)$ times
Note: if we order the points in $S'$ as we find them, then the first $(\ell' + 1)k$ points also form an $\ell'$-robust $\beta$-coreset.
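A sketch of the peeling construction, under the assumption that each of the $\ell + 1$ rounds extracts a GMM selection of $k$ points from the remaining points and then removes it (the body of the loop is not preserved in the slides). The names `gmm_indices` and `robust_coreset` are illustrative. Points are returned in the order they are found, which is what makes the prefix property in the note hold.

```python
def hamming(p, q):
    """Hamming distance between two equal-length bit tuples."""
    return sum(a != b for a, b in zip(p, q))

def gmm_indices(S, k):
    """Indices of up to k points of S chosen by greedy farthest-point selection."""
    if not S or k <= 0:
        return []
    chosen, gap = [0], [hamming(p, S[0]) for p in S]
    gap[0] = -1                                   # -1 marks chosen points
    while len(chosen) < min(k, len(S)):
        i = max(range(len(S)), key=gap.__getitem__)
        chosen.append(i)
        gap = [-1 if g < 0 else min(g, hamming(S[j], S[i]))
               for j, g in enumerate(gap)]
        gap[i] = -1
    return chosen

def robust_coreset(S, k, ell):
    """Peel off (ell + 1) GMM layers of k points each, in discovery order."""
    remaining, ordered = list(S), []
    for _ in range(ell + 1):
        if not remaining:
            break
        picked = gmm_indices(remaining, k)
        ordered.extend(remaining[i] for i in picked)
        taken = set(picked)
        remaining = [p for i, p in enumerate(remaining) if i not in taken]
    return ordered

S = [(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 1), (0, 1, 1, 1),
     (1, 1, 1, 1), (1, 1, 1, 0), (1, 1, 0, 0), (1, 0, 0, 0)]
core = robust_coreset(S, k=2, ell=2)
```

Intuitively, removing up to $\ell$ outliers can damage at most $\ell$ of the $\ell + 1$ layers, so at least one layer survives intact.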
(Figure: a 1-robust coreset)
• Any neighbor of $q$ falls into the same bucket as $q$ in at least one hash function
• Total number of outliers is at most $3L$
• For each bucket $B_i[g_i(q)]$
• Remove outliers from $S$
• Return GMM($S$)
• Query time: $O((k^2 + \frac{\log n}{r})\, d \cdot \log k \cdot n^{1/c})$
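A simplified sketch of the outlier-tolerant variant. It assumes each bucket stores a $3L$-robust coreset built by GMM peeling, and that the query simply unions those coresets, discards points farther than $cr$, and runs GMM. The per-bucket processing on the slide is only partly preserved, so this is an approximation of the idea rather than the exact procedure; `query_b` is a hypothetical name, and in practice the coresets would be computed once at preprocessing time instead of inside the query.

```python
def hamming(p, q):
    """Hamming distance between two equal-length bit tuples."""
    return sum(a != b for a, b in zip(p, q))

def gmm(S, k):
    """Greedy farthest-point selection of up to k points from S."""
    S = list(S)
    if not S or k <= 0:
        return []
    chosen, gap = [0], [hamming(p, S[0]) for p in S]
    gap[0] = -1                                   # -1 marks chosen points
    while len(chosen) < min(k, len(S)):
        i = max(range(len(S)), key=gap.__getitem__)
        chosen.append(i)
        gap = [-1 if g < 0 else min(g, hamming(S[j], S[i]))
               for j, g in enumerate(gap)]
        gap[i] = -1
    return [S[i] for i in chosen]

def robust_coreset(S, k, ell):
    """(ell + 1) rounds of GMM peeling, points kept in discovery order."""
    remaining, ordered = list(S), []
    for _ in range(ell + 1):
        layer = gmm(remaining, k)
        ordered.extend(layer)
        for p in layer:
            remaining.remove(p)
    return ordered

def query_b(buckets, q, r, c, k, L):
    """Union robust coresets of q's buckets, drop outliers, run GMM."""
    S = set()
    for bucket in buckets:
        S.update(robust_coreset(bucket, k, 3 * L))  # precomputed in practice
    S = [p for p in sorted(S) if hamming(p, q) <= c * r]
    return gmm(S, k)

buckets = [[(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 1), (1, 1, 1, 1), (1, 1, 1, 0)],
           [(1, 1, 0, 0), (0, 0, 0, 0), (0, 1, 1, 1)]]
answer = query_b(buckets, q=(0, 0, 0, 0), r=1, c=2, k=2, L=1)
```

Because the stored coresets are robust to up to $3L$ outliers, filtering by distance after retrieval still leaves enough points to approximate the most diverse neighbors.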
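| | Algorithm A | Algorithm B | ANN |
|---|---|---|---|
| Distance Apx. Factor | $c > 2$ | $c > 1$ | $c > 1$ |
| Diversity Apx. Factor $\alpha$ | 6 | 6 | - |
| Space | $(n \log k)^{1 + 1/(c-1)} + nd$ | $\log k \cdot n^{1 + 1/c} + nd$ | $n^{1 + 1/c} + nd$ |
| Query Time | $(k^2 + \frac{\log n}{r})\, d \cdot (\log k)^{c/(c-1)}\, n^{1/(c-1)}$ | $(k^2 + \frac{\log n}{r})\, d \cdot \log k \cdot n^{1/c}$ | $d\, n^{1/c}$ |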