SLIDE 1 New directions in approximate nearest neighbors for the angular distance
Thijs Laarhoven
mail@thijs.com http://www.thijs.com/
Proximity Workshop, College Park (MD), USA
(January 13, 2016)
SLIDE 2 Nearest neighbor searching
SLIDE 3 Nearest neighbor searching
Data set
SLIDE 4 Nearest neighbor searching
Target
SLIDE 5 Nearest neighbor searching
Nearest neighbor
SLIDE 6 Nearest neighbor searching
Nearest neighbor (ℓ1-norm)
SLIDE 7 Nearest neighbor searching
Nearest neighbor (angular distance)
SLIDE 8 Nearest neighbor searching
Nearest neighbor (ℓ2-norm)
SLIDE 9 Nearest neighbor searching
Distance guarantee (radius r)
SLIDE 10 Nearest neighbor searching
Approximate nearest neighbor
SLIDE 11 Nearest neighbor searching
Approximation factor c > 1 (radii r and c · r)
SLIDES 12–13 Nearest neighbor searching
Example: Precompute Voronoi cells
SLIDE 14 Nearest neighbor searching
Given a target...
SLIDE 15 Nearest neighbor searching
...quickly find the right cell
SLIDE 16 Nearest neighbor searching
Works well in low dimensions
SLIDE 17 Nearest neighbor searching
Problem setting
SLIDES 18–21 Nearest neighbor searching
Problem setting
- High dimensions d
- Large data set of size n = 2^Ω(d / log d)
◮ Smaller n? ⇒ Use the Johnson–Lindenstrauss transform (JLT) to reduce d
- Assumption: Data set lies on the sphere
◮ Angular NNS in R^d is equivalent to Euclidean NNS on the sphere
◮ Reduction from Euclidean NNS in R^d to Euclidean NNS on the sphere [AR’15]
◮ Random unit vectors are usually approximately orthogonal (pairwise distances ≈ √2)
- Goal: Query time O(n^ρ) with ρ < 1
SLIDES 22–24 Nearest neighbor searching
“Random” instances
SLIDES 25–27 Nearest neighbor searching
“Random” instances: distant points at c · r = √2, nearest neighbor at radius r
SLIDE 28 Locality-sensitive hashing
Overview
SLIDES 29–31 Locality-sensitive hashing
Overview
- Idea: Use nice partitions of the space
◮ Nearby vectors are often in the same region
◮ Distant vectors are unlikely to be in the same region
- Precomputation: Store hash tables of vectors per region
◮ For each region, store the contained vectors from the data set
◮ Rerandomization: Use many partitions to increase the success probability
- Query: Check hash tables for collisions
◮ Compute the target’s region for each hash table
◮ Check the corresponding buckets for potential nearest neighbors
◮ Reduces the search space before doing a linear search
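The precompute/query pipeline above can be sketched in a few lines; this is an illustrative sketch, not the talk's implementation, and the names (`LSHIndex`, `hash_family`, `num_tables`) are mine.

```python
import numpy as np

class LSHIndex:
    """Generic LSH pipeline: one hash table per random partition."""

    def __init__(self, hash_family, num_tables, rng):
        # Rerandomization: an independent hash function per table.
        self.hashes = [hash_family(rng) for _ in range(num_tables)]
        self.tables = [{} for _ in range(num_tables)]

    def insert(self, idx, vec):
        # Precomputation: store the vector's index in its region's bucket.
        for h, table in zip(self.hashes, self.tables):
            table.setdefault(h(vec), []).append(idx)

    def query(self, vec, data):
        # Collect colliding candidates from every table, then do a
        # linear search over this (hopefully small) candidate set.
        cand = set()
        for h, table in zip(self.hashes, self.tables):
            cand.update(table.get(h(vec), []))
        if not cand:
            return None
        return max(cand, key=lambda i: data[i] @ vec)  # max cosine similarity
```

Any locality-sensitive hash family (hyperplane, cross-polytope, ...) can be plugged in as `hash_family`.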
SLIDE 32 Hyperplane LSH
[Charikar, STOC’02]
SLIDE 33 Hyperplane LSH
Random point
SLIDE 34 Hyperplane LSH
Opposite point
SLIDE 35 Hyperplane LSH
Two Voronoi cells
SLIDE 36 Hyperplane LSH
Another pair of points
SLIDE 37 Hyperplane LSH
Another hyperplane
SLIDE 38 Hyperplane LSH
Defines partition
SLIDE 39 Hyperplane LSH
Preprocessing
SLIDE 40 Hyperplane LSH
Query
SLIDE 41 Hyperplane LSH
Collisions
SLIDE 42 Hyperplane LSH
Failure
SLIDE 43 Hyperplane LSH
Rerandomization
SLIDE 44 Hyperplane LSH
Collisions
SLIDE 45 Hyperplane LSH
Success
SLIDE 46 Hyperplane LSH
Overview
SLIDES 47–49 Hyperplane LSH
Overview
- 2 regions induced by each hyperplane
- Simple: one hyperplane corresponds to one inner product
- Fast: k hyperplanes give you 2^k regions
- For “random” settings, query time O(n^ρ) with ρ = √2 / (π ln 2) · 1/c
- Efficient, but suboptimal: ρ ∝ 1/c^2 is achievable
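The hyperplane hash itself is one line of linear algebra. A minimal sketch (parameter names are illustrative):

```python
import numpy as np

def hyperplane_hash(k, d, rng):
    """Hyperplane LSH [Charikar, STOC'02]: k random hyperplanes through
    the origin map a vector to one of 2^k regions via the sign pattern
    of k inner products."""
    A = rng.normal(size=(k, d))        # k random hyperplane normals
    return lambda v: tuple(A @ v > 0)  # sign pattern = region label

# Two vectors at angle θ disagree on one hyperplane's sign with
# probability θ/π, so Pr[same region] = (1 - θ/π)^k.
```

Note the hash depends only on the direction of v, as it should for the angular distance.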
SLIDE 50 Cross-Polytope LSH
[Terasawa–Tanaka, WADS’07] [Andoni et al., NIPS’15]
SLIDE 51 Cross-Polytope LSH
Vertices of cross-polytope (simplex)
SLIDE 52 Cross-Polytope LSH
Random rotation
SLIDE 53 Cross-Polytope LSH
Voronoi regions
SLIDE 54 Cross-Polytope LSH
Defines partition
SLIDE 55 Cross-Polytope LSH
Overview
SLIDES 56–57 Cross-Polytope LSH
Overview
- 2d regions in d dimensions
- Advantage: regions have the same size and are more symmetric
- For “random” settings, query time O(n^ρ) with ρ = 1/(2c^2 − 1) · (1 + o_d(1))
- Essentially optimal for large c and n = 2^o(d) [Dub’10, AR’15]
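In code, the cross-polytope hash is: randomly rotate, then snap to the nearest of the 2d vertices ±e_i, i.e. the signed index of the largest-magnitude coordinate. A minimal sketch (practical implementations such as those in [Andoni et al., NIPS'15] use fast pseudo-random rotations; the dense orthogonal matrix here is just for brevity):

```python
import numpy as np

def cross_polytope_hash(d, rng):
    """Map a vector to the Voronoi region of the nearest cross-polytope
    vertex ±e_i after a random rotation."""
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random rotation
    def h(v):
        w = Q @ v
        i = int(np.argmax(np.abs(w)))
        return (i, 1 if w[i] > 0 else -1)         # region of +e_i or -e_i
    return h
```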
SLIDE 58 Spherical/Voronoi LSH
[Andoni et al., SODA’14] [Andoni–Razenshteyn, STOC’15]
SLIDE 59 Spherical/Voronoi LSH
Random points
SLIDE 60 Spherical/Voronoi LSH
Voronoi cells
SLIDE 61 Spherical/Voronoi LSH
Defines partition
SLIDE 62 Spherical/Voronoi LSH
Overview
SLIDES 63–65 Spherical/Voronoi LSH
Overview
- 2^O(√d) points in d dimensions
- More points improves performance
- More points makes decoding slower
- For “random” settings, query time O(n^ρ) with ρ = 1/(2c^2 − 1)
- Essentially optimal for large c and n = 2^o(d)
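Spherical/Voronoi LSH is conceptually the simplest of the three; a sketch (names illustrative) makes the decoding bottleneck visible, since the brute-force argmax over all t code points is exactly what becomes slow once t = 2^Θ(√d):

```python
import numpy as np

def voronoi_hash(t, d, rng):
    """Hash = index of the nearest of t random points on the sphere.
    Decoding costs O(t * d) per vector, the bottleneck for large t."""
    C = rng.normal(size=(t, d))
    C /= np.linalg.norm(C, axis=1, keepdims=True)  # t random unit vectors
    return lambda v: int(np.argmax(C @ v))         # nearest code point
```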
SLIDES 66–69 LSH overview
- Hyperplane LSH: 2 Voronoi cells
◮ Efficient decoding
◮ Suboptimal for large d, c
- Cross-Polytope LSH: 2d Voronoi cells
◮ Reasonably efficient decoding
◮ Optimal for large c and n = 2^o(d)
- Spherical/Voronoi LSH: 2^O(√d) Voronoi cells
◮ Slow decoding
◮ Optimal for large c and n = 2^o(d)
- 1. Can we use even more Voronoi cells?
- 2. Can decoding be made faster?
- 3. What about n = 2^Ω(d)?
SLIDE 70 Structured filters
Overview
SLIDE 71 Structured filters
Partition dimensions into blocks
SLIDE 72 Structured filters
Random subcodes
SLIDES 73–74 Structured filters
Construct concatenated code
SLIDES 75–77 Structured filters
Normalize (only for example)
SLIDE 78 Structured filters
Construct Voronoi cells
SLIDE 79 Structured filters
Defines partition
SLIDE 80 Structured filters
...with efficient decoding
SLIDES 81–83 Structured filters
Techniques
- Idea 1: Increase the number of regions to 2^Θ(d)
◮ Number of hash tables increases to 2^Θ(d) – ok for n = 2^Θ(d)
◮ Decoding cost potentially too large
- Idea 2: Use structured codes for random regions
◮ Spherical/Voronoi LSH with dependent random points
◮ Concatenated code of log d low-dimensional spherical codes
◮ Allows for efficient list-decoding
- Idea 3: Replace partitions with filters
◮ Relaxation: filters need not partition the space
◮ Simplified analysis
◮ Might not be needed to achieve the improvement
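A toy version of ideas 1–3 might look as follows. This is an illustrative simplification, not the construction from the talk: the code C = C_1 × ... × C_B over B blocks has m^B codewords but is never stored explicitly, and "filter membership" is taken to be an inner-product threshold α; list-decoding enumerates all codewords above the threshold, pruning whole blocks whose best possible completion cannot reach α. All names and parameters are mine.

```python
import numpy as np

def make_subcodes(B, block_dim, m, rng):
    """B random subcodes of m points each; scaling by 1/sqrt(B) makes
    every concatenated codeword a unit vector."""
    codes = rng.normal(size=(B, m, block_dim))
    codes /= np.linalg.norm(codes, axis=2, keepdims=True) * np.sqrt(B)
    return codes

def list_decode(codes, v, alpha):
    """Return all codewords c (as index tuples) with <c, v> >= alpha,
    without enumerating the full product code."""
    B, m, bd = codes.shape
    blocks = v.reshape(B, bd)
    scores = np.einsum('bmd,bd->bm', codes, blocks)          # per-block scores
    best_suffix = np.cumsum(scores.max(axis=1)[::-1])[::-1]  # upper bounds
    out = []
    def rec(b, prefix, acc):
        # Prune: even the best completion of blocks b..B-1 falls short.
        if acc + (best_suffix[b] if b < B else 0.0) < alpha:
            return
        if b == B:
            out.append(tuple(prefix))
            return
        for j in range(m):
            rec(b + 1, prefix + [j], acc + scores[b, j])
    rec(0, [], 0.0)
    return out  # all filters containing v
```

The pruning is sound because `best_suffix[b]` upper-bounds the score any completion can add, so no qualifying codeword is missed.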
SLIDES 84–86 Structured filters
Results
- For random sparse settings (n = 2^o(d)), query time O(n^ρ) with ρ = 1/(2c^2 − 1)
- For random dense settings (n = 2^(κd) with small κ), we obtain ρ = (1 − κ)/(2c^2 − 1)
- For random dense settings (n = 2^(κd) with large κ), we obtain ρ = −1/(2κ log(1/(2c^2 − 1))) · (1 + o_d(1))
SLIDES 87–88 Asymmetric nearest neighbors
Previous results: symmetric NNS
- Query time: O(n^ρ)
- Update time: O(n^ρ)
- Preprocessing time: O(n^(1+ρ))
- Space complexity: O(n^(1+ρ))
Can we get a tradeoff between these costs?
SLIDE 89 Asymmetric nearest neighbors
Voronoi regions
SLIDE 90 Asymmetric nearest neighbors
Spherical cap
SLIDE 91 Asymmetric nearest neighbors
Cap height α
SLIDE 92 Asymmetric nearest neighbors
Smaller α ⇒ larger caps, more work
SLIDE 93 Asymmetric nearest neighbors
Larger α ⇒ smaller caps, less work
SLIDE 94 Asymmetric nearest neighbors
αq > αu ⇒ faster queries, slower updates
SLIDE 95 Asymmetric nearest neighbors
αq < αu ⇒ slower queries, faster updates
SLIDE 96 Asymmetric nearest neighbors
Results (general expressions)
- Minimize space (αq/αu = cos θ): ρq = (2c^2 − 1)/c^4, ρu = 0
- Balance costs (αq/αu = 1): ρq = 1/(2c^2 − 1), ρu = 1/(2c^2 − 1)
- Minimize time (αq/αu = 1/cos θ): ρq = 0, ρu = (2c^2 − 1)/(c^2 − 1)^2
Query time O(n^ρq), update time O(n^ρu), preprocessing time O(n^(1+ρu)), space complexity O(n^(1+ρu))
SLIDE 97 Asymmetric nearest neighbors
Results (small c = 1 + ε)
- Minimize space: ρq = 1 − 4ε^2 + O(ε^3), ρu = 0
- Balance costs: ρq = ρu = 1 − 4ε + O(ε^2)
- Minimize time: ρq = 0, ρu = 1/(4ε^2) + O(1/ε)
SLIDE 98 Asymmetric nearest neighbors
Results (large c → ∞)
- Minimize space: ρq = 2/c^2 + O(1/c^4), ρu = 0
- Balance costs: ρq = ρu = 1/(2c^2) + O(1/c^4)
- Minimize time: ρq = 0, ρu = 2/c^2 + O(1/c^4)
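The three corners of the tradeoff can be checked numerically from the general expressions above; this small helper (illustrative, not from the slides) evaluates (ρq, ρu) for a given approximation factor c.

```python
def tradeoff_corners(c):
    """Evaluate (rho_q, rho_u) at the three corners of the
    query/update tradeoff, using the stated general expressions."""
    return {
        "min_space": ((2 * c**2 - 1) / c**4, 0.0),
        "balanced":  (1 / (2 * c**2 - 1), 1 / (2 * c**2 - 1)),
        "min_time":  (0.0, (2 * c**2 - 1) / (c**2 - 1)**2),
    }

corners = tradeoff_corners(2.0)
# e.g. for c = 2: balanced gives rho_q = rho_u = 1/7,
# min-space gives rho_q = 7/16, min-time gives rho_u = 7/9
```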
SLIDE 99 Asymmetric nearest neighbors
Tradeoffs
SLIDES 100–102 Conclusions
Main result: Allow using more regions with list-decodable codes
- For n = 2^o(d), a non-asymptotic improvement
- For n = 2^Θ(d), an asymptotic improvement
- Corollary: Lower bounds for n = 2^o(d) do not hold for n = 2^Θ(d)
- Improved tradeoffs between query and update complexities
Open problems
- Is the tradeoff for n = 2^o(d) optimal?
- Lower bounds for n = 2^Θ(d)?
- Apply similar ideas to other norms?
- Practicality?
Questions?