Data-dependent Hashing for Nearest Neighbor Search
Alex Andoni (Columbia University)
Based on joint work with: Piotr Indyk, Huy Nguyen, Ilya Razenshteyn
Nearest Neighbor Search (NNS)
• Preprocess: a set P of points
• Query: given a query point q, report a point p* ∈ P with the smallest distance to q
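For concreteness, a minimal brute-force baseline (illustrative Python, not from the talk): the linear scan that every indexing scheme below is trying to beat.

```python
import numpy as np

def nearest_neighbor(P, q):
    """Exact NNS by linear scan: check every point, O(n * d) per query."""
    return P[np.argmin(np.linalg.norm(P - q, axis=1))]

# Example usage (random data, illustrative only):
# P = np.random.rand(1000, 64); q = np.random.rand(64)
# p_star = nearest_neighbor(P, q)
```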
Motivation
• Generic setup:
  • points model objects (e.g., images)
  • distance models a (dis)similarity measure
• Application areas:
  • machine learning: the k-NN rule
  • speech/image/video/music recognition, vector quantization, bioinformatics, etc.
• Distances: Hamming, Euclidean, edit distance, earth-mover distance, etc.
• Core primitives: closest pair, clustering, etc.
[figure: two images encoded as binary feature vectors, e.g. 011100… vs. 000100…, compared under Hamming distance]
Curse of Dimensionality
• All exact algorithms degrade rapidly with the dimension d:

| Algorithm                 | Query time       | Space                           |
| Full indexing             | d^{O(1)} · log n | n^{O(d)} (Voronoi diagram size) |
| No indexing (linear scan) | O(n · d)         | O(n · d)                        |
Approximate NNS
• c-approximate r-near neighbor: given a query point q, report a point p′ ∈ P s.t. ‖p′ − q‖ ≤ cr
  • as long as there is some point within distance r of q
• Practice: use for exact NNS
  • Filtering: gives a set of candidates (hopefully small)
[figure: query q with its near neighbor p* inside the ball of radius r and a reported point p′ inside the ball of radius cr]
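The filtering step is easy to state in code. A minimal sketch (illustrative, assuming some index, such as the LSH tables sketched later, has already produced the candidate set):

```python
import numpy as np

def c_approx_near_neighbor(candidates, q, r, c):
    """Verify candidates: return any point within distance c*r of q.
    Returning None is only allowed when no point lies within r of q."""
    for p in candidates:                     # candidates come from an index
        if np.linalg.norm(np.asarray(p) - q) <= c * r:
            return p
    return None
```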
NNS algorithms
• Exponential dependence on dimension: [Arya-Mount'93], [Clarkson'94], [Arya-Mount-Netanyahu-Silverman-Wu'98], [Kleinberg'97], [Har-Peled'02], [Arya-Fonseca-Mount'11], …
• Linear/poly dependence on dimension: [Kushilevitz-Ostrovsky-Rabani'98], [Indyk-Motwani'98], [Indyk'98, '01], [Gionis-Indyk-Motwani'99], [Charikar'02], [Datar-Immorlica-Indyk-Mirrokni'04], [Chakrabarti-Regev'04], [Panigrahy'06], [Ailon-Chazelle'06], [A.-Indyk'06], [A.-Indyk-Nguyen-Razenshteyn'14], [A.-Razenshteyn'15], [Pagh'16], [Laarhoven'16], …
Locality-Sensitive Hashing [Indyk-Motwani'98]
• Random hash function h on ℝ^d satisfying:
  • for a close pair (when ‖q − p‖ ≤ r): Pr[h(q) = h(p)] is "high" (P1, "not-so-small")
  • for a far pair (when ‖q − p′‖ > cr): Pr[h(q) = h(p′)] is "small" (P2)
• Use several hash tables: n^ρ of them, where
    ρ = log(1/P1) / log(1/P2)
[figure: collision probability Pr[g(q) = g(p)] as a function of ‖q − p‖, equal to P1 at distance r and dropping to P2 at distance cr]
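As a concrete instance, a minimal sketch of the [IM'98] bit-sampling scheme for Hamming space (parameter choices and helper names here are illustrative):

```python
import random
from collections import defaultdict

def build_lsh_tables(points, dim, k, L):
    """Bit-sampling LSH: each of the L tables hashes a point to the
    concatenation g = (h_1, ..., h_k) of k random coordinates."""
    tables = []
    for _ in range(L):
        coords = random.sample(range(dim), k)
        table = defaultdict(list)
        for p in points:
            table[tuple(p[i] for i in coords)].append(tuple(p))
        tables.append((coords, table))
    return tables

def lsh_candidates(tables, q):
    """Union of q's buckets over all tables (the candidate set)."""
    cands = set()
    for coords, table in tables:
        cands.update(table.get(tuple(q[i] for i in coords), []))
    return cands

# Per coordinate, P1 = 1 - r/dim and P2 = 1 - c*r/dim, so
# rho = log(1/P1) / log(1/P2) ~ 1/c; one sets k ~ log(n)/log(1/P2)
# and uses L ~ n**rho tables.
```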
LSH Algorithms
Hamming space:
| Space   | Query time | Exponent | c = 2   | Reference        |
| n^{1+ρ} | n^ρ        | ρ = 1/c  | ρ = 1/2 | [IM'98]          |
|         |            | ρ ≥ 1/c (lower bound)  | | [MNP'06, OWZ'11] |

Euclidean space:
| Space   | Query time | Exponent | c = 2   | Reference        |
| n^{1+ρ} | n^ρ        | ρ = 1/c  | ρ = 1/2 | [IM'98, DIIM'04] |
| n^{1+ρ} | n^ρ        | ρ ≈ 1/c² | ρ = 1/4 | [AI'06]          |
|         |            | ρ ≥ 1/c² (lower bound) | | [MNP'06, OWZ'11] |
LSH is tightβ¦ whatβs next?
But are we really done with basic NNS algorithms?
• Lower bounds (cell probe): [A.-Indyk-Patrascu'06, Panigrahy-Talwar-Wieder'08,'10, Kapralov-Panigrahy'12]
• Space-time trade-offs: [Panigrahy'06, A.-Indyk'06]
• Datasets with additional structure: [Clarkson'99, Karger-Ruhl'02, Krauthgamer-Lee'04, Beygelzimer-Kakade-Langford'06, Indyk-Naor'07, Dasgupta-Sinha'13, Abdullah-A.-Krauthgamer-Kannan'14, …]
Beyond Locality Sensitive Hashing
Hamming space:
| Space   | Query time | Exponent       | c = 2       | Reference        |
| n^{1+ρ} | n^ρ        | ρ = 1/c        | ρ = 1/2     | [IM'98] (LSH)    |
| n^{1+ρ} | n^ρ        | complicated    | ρ = 1/2 − ε | [AINR'14]        |
| n^{1+ρ} | n^ρ        | ρ = 1/(2c − 1) | ρ = 1/3     | [AR'15]          |
|         |            | ρ ≥ 1/c (LSH lower bound)  | | [MNP'06, OWZ'11] |

Euclidean space:
| Space   | Query time | Exponent        | c = 2       | Reference        |
| n^{1+ρ} | n^ρ        | ρ ≈ 1/c²        | ρ = 1/4     | [AI'06] (LSH)    |
| n^{1+ρ} | n^ρ        | complicated     | ρ = 1/4 − ε | [AINR'14]        |
| n^{1+ρ} | n^ρ        | ρ = 1/(2c² − 1) | ρ = 1/7     | [AR'15]          |
|         |            | ρ ≥ 1/c² (LSH lower bound) | | [MNP'06, OWZ'11] |
New approach?
Data-dependent hashing:
• a random hash function, chosen after seeing the given dataset
• efficiently computable
Construction of hash function
• Two components:
  • nice geometric structure
  • reduction to such structure
• Like a (weak) "regularity lemma" for a set of points
[figure: the dataset is decomposed into parts with nice structure, which admit better, data-dependent LSH]
Nice geometric structure: average-case
• Think: a random dataset on a sphere of radius cr/√2
  • vectors essentially perpendicular to each other
  • so that two random points are at distance ≈ cr
• Lemma: ρ = 1/(2c² − 1)
  • via "cap carving"
[figure: sphere of radius cr/√2; two orthogonal points on it are at distance cr]
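One concrete way to realize cap carving on the unit sphere (a sketch in the spirit of the spherical LSH of [AINR'14, AR'15], not the talk's exact scheme; the threshold eta is an illustrative parameter): draw random Gaussian directions and send each point to the first cap that contains it.

```python
import numpy as np

def cap_carving_hash(points, eta=2.0, seed=0):
    """Partition unit vectors into spherical caps: draw Gaussian
    directions g_1, g_2, ... and give each point the index i of the
    first cap {x : <g_i, x> >= eta} that contains it."""
    rng = np.random.default_rng(seed)
    n, d = points.shape
    labels = np.full(n, -1, dtype=int)
    i = 0
    while (labels < 0).any():            # keep carving until all points are covered
        g = rng.standard_normal(d)
        hit = (labels < 0) & (points @ g >= eta)
        labels[hit] = i
        i += 1
    return labels
```

For a unit vector x, ⟨g, x⟩ is a standard Gaussian, so each cap catches any given point with the same small probability, while close pairs land in the same cap noticeably more often than orthogonal (distance ≈ cr) pairs; that gap is what the lemma's ρ = 1/(2c² − 1) quantifies.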
Reduction to nice structure
• Idea: iteratively decrease the radius of the minimum enclosing ball
• Algorithm (sketched in code below):
  • find dense clusters
    • with smaller radius
    • containing a large fraction of points
  • recurse on the dense clusters
  • apply cap carving on the rest
  • recurse on each "cap"
    • e.g., dense clusters might reappear
[figure, *picture not to scale & dimension*: dataset ball of radius 100cr with dense clusters re-enclosed in balls of radius 99cr. Why is the remainder OK? It has no dense clusters, so it is like a "random dataset" of radius 100cr, or even better. Why are the clusters OK? Their radius has decreased.]
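A high-level sketch of this recursion (simplified and illustrative: `find_dense_cluster` is a naive stand-in implementing the dense-cluster definition from the next slide, `cap_carving_hash` is the sketch above, and all constants are placeholders):

```python
import numpy as np

def find_dense_cluster(points, radius, delta, eps):
    """Naive stand-in: find a ball of radius (sqrt(2) - eps) * radius
    around some point holding >= len(points)**(1 - delta) points;
    return (cluster, rest) or (None, points)."""
    for center in points:
        mask = np.linalg.norm(points - center, axis=1) <= (np.sqrt(2) - eps) * radius
        if mask.sum() >= len(points) ** (1 - delta):
            return points[mask], points[~mask]
    return None, points

def build_tree(points, radius, depth=0, max_depth=20, delta=0.1, eps=0.1):
    """Peel off dense clusters (recursing with a smaller radius), then
    cap-carve the pseudo-random remainder and recurse in each cap."""
    if len(points) <= 1 or depth >= max_depth:
        return {'leaf': points}
    node = {'clusters': [], 'caps': {}, 'seed': depth}
    while True:                                  # 1. remove dense clusters
        cluster, points = find_dense_cluster(points, radius, delta, eps)
        if cluster is None:
            break
        node['clusters'].append(build_tree(      # radius shrinks by a 1 - Omega(eps^2) factor
            cluster, (1 - eps**2) * radius, depth + 1, max_depth, delta, eps))
    labels = cap_carving_hash(points, eta=2.0, seed=depth)   # 2. cap-carve the rest
    for cap in np.unique(labels):                # 3. recurse in each cap
        node['caps'][int(cap)] = build_tree(
            points[labels == cap], radius, depth + 1, max_depth, delta, eps)
    return node
```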
Hash function
• Described by a tree (like a hash table)
[figure, *picture not to scale & dimension*: the recursion tree over the dataset ball of radius 100cr]

Dense clusters
• Current dataset: radius R
• A dense cluster:
  • contains n^{1−δ} points
  • has smaller radius: (1 − Ω(ε²))R
• After we remove all clusters:
  • for any point on the surface, at most n^{1−δ} points lie within distance (√2 − ε)R
  • the other points are essentially orthogonal!
• When applying cap carving with parameters (P1, P2):
  • the empirical number of far points colliding with the query is nP2 + n^{1−δ}
  • as long as nP2 ≫ n^{1−δ}, the "impurity" doesn't matter!
(both δ and ε are set via a trade-off)
Tree recap
• During a query (see the sketch below):
  • recurse into all clusters
  • but into just one bucket at a CapCarving node
• Will look in more than one leaf! How much branching?
• Claim: at most (n^δ + 1)^{O(1/ε²)}
  • each time we branch: at most n^δ clusters (+1)
  • a cluster reduces the radius by Ω(ε²)
  • so the cluster-depth is at most 100/Ω(ε²)
• Progress in two ways:
  • clusters reduce the radius
  • CapCarving nodes reduce the number of far points (the empirical P2)
• A tree succeeds with probability ≥ n^{−1/(2c²−1) − o(1)}
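And the matching query walk (same node layout as the construction sketch above; `cap_of` replays the node's Gaussian directions from its stored seed to locate the query's bucket):

```python
import numpy as np

def cap_of(node, q, eta=2.0):
    """Replay the Gaussian directions used at this node (same seed as
    during construction) and return the first cap index containing q."""
    rng = np.random.default_rng(node['seed'])
    i = 0
    while True:
        if q @ rng.standard_normal(len(q)) >= eta:
            return i
        i += 1

def query_tree(node, q, r, results):
    """Recurse into EVERY dense-cluster child, but into only the single
    cap that contains q; report r-near points found at the leaves."""
    if 'leaf' in node:
        results += [p for p in node['leaf'] if np.linalg.norm(p - q) <= r]
        return
    for child in node['clusters']:
        query_tree(child, q, r, results)
    cap = cap_of(node, q)
    if cap in node['caps']:
        query_tree(node['caps'][cap], q, r, results)
```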
Beyond βBeyond LSHβ
• Practice: often optimize the partition for your dataset
  • PCA-tree, spectral hashing, etc. [S91, McN01, VKD09, WTF08, …]
  • no guarantees (on performance or correctness)
• Theory: assume special structure in the dataset
  • low intrinsic dimension [KR'02, KL'04, BKL'06, IN'07, DS'13, …]
  • structure + noise [Abdullah-A.-Krauthgamer-Kannan'14]
• Data-dependent hashing helps even when there is no a priori structure!
Data-dependent hashing wrap-up
• Dynamicity?
  • dynamization techniques [Overmars-van Leeuwen'81]
• Better bounds?
  • for dimension d = O(log n), one can get a better ρ! [Laarhoven'16]
  • for d > log^{1+δ} n: our ρ is optimal even for data-dependent hashing! [A.-Razenshteyn'??]
    • in the right formalization (to rule out the Voronoi diagram): the description complexity of the hash function is n^{1−Ω(1)}
• Practical variant [A.-Indyk-Laarhoven-Razenshteyn-Schmidt'15]
• NNS for ℓ∞
  • [Indyk'98] gets approximation O(log log d) (polynomial space, sublinear query time)
  • cf. ℓ∞ admits no non-trivial sketch!
  • some matching lower bounds in the relevant model [ACP'08, KP'12]
  • can be thought of as data-dependent hashing
• NNS for any norm (e.g., matrix norms, earth-mover distance)?