1
Continuous Nearest Neighbor Search
Yufei Tao, Dimitris Papadias, Qiongmao Shen
Hong Kong University of Science and Technology
Clear Water Bay, Hong Kong
2
Point Nearest Neighbor (NN) Queries [Roussopoulos et al SIGMOD95, Hjaltason and Samet TODS 99]
- Branch-and-bound algorithms use mindist between the query point q and an R-tree entry E to prune the search space:
  – mindist(E, q) = the minimum distance between E and q
[Figure: query point q and R-tree entries E1 and E2 plotted on x and y axes, with mindist(q,E1) and mindist(q,E2) marked]
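The mindist between a point and an MBR can be computed per axis. The following is a minimal sketch, assuming 2D points and axis-aligned rectangles stored as (xmin, ymin, xmax, ymax); the function name and the coordinates are illustrative and are not taken from the slide's figure.

```python
import math

def mindist_point_mbr(q, mbr):
    """Minimum Euclidean distance between point q=(x, y) and an axis-aligned
    rectangle mbr=(xmin, ymin, xmax, ymax). Returns 0 if q is inside."""
    x, y = q
    xmin, ymin, xmax, ymax = mbr
    # Per-axis gap between q and the rectangle (0 when q is within the extent).
    dx = max(xmin - x, 0.0, x - xmax)
    dy = max(ymin - y, 0.0, y - ymax)
    return math.hypot(dx, dy)

# Hypothetical coordinates (assumption, not read off the figure).
q = (4.0, 3.0)
E1 = (1.0, 1.0, 3.0, 4.0)   # q lies directly to the right of E1
E2 = (5.0, 5.0, 8.0, 7.0)   # q lies below and to the left of E2
print(mindist_point_mbr(q, E1))
print(mindist_point_mbr(q, E2))
```

Because mindist never overestimates the distance to any point inside the MBR, an entry whose mindist exceeds the best distance found so far can be skipped safely.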
3
Nearest Neighbor Search (NN) with R-Trees
- Depth-first (DF) and best-first (BF) algorithms:

[Figure: data points a to i grouped into R-tree entries E1 to E9, plotted on x and y axes with the query point and the search region marked]

mindist of entries: E1 1, E2 2, E3 8, E4 5, E5 5, E6 9, E7 13, E8 2, E9 17
distance of points: a 5, b 13, c 18, d 13, e 13, f 10, g 13, h 2, i 10

Action                                  Heap                                         Result
Visit Root                              E1 1, E2 2, E3 8                             {empty}
follow E1                               E2 2, E4 5, E5 5, E3 8, E6 9                 {empty}
follow E2                               E8 2, E4 5, E5 5, E3 8, E6 9, E7 13, E9 17   {empty}
follow E8 (report h and terminate)      E4 5, E5 5, E3 8, E6 9, E7 13, E9 17         {(h, 2)}
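A heap-driven best-first search can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the tree is a hypothetical nested structure (dicts for nodes, tuples for points) rather than a disk-based R-tree, and all coordinates are made up.

```python
import heapq
import math

def mindist(q, mbr):
    """Distance from point q to rectangle mbr=(xmin, ymin, xmax, ymax)."""
    dx = max(mbr[0] - q[0], 0.0, q[0] - mbr[2])
    dy = max(mbr[1] - q[1], 0.0, q[1] - mbr[3])
    return math.hypot(dx, dy)

def node(children):
    """Build a node (dict) over child nodes or raw (x, y) points."""
    boxes = [c["mbr"] if isinstance(c, dict) else (c[0], c[1], c[0], c[1])
             for c in children]
    mbr = (min(b[0] for b in boxes), min(b[1] for b in boxes),
           max(b[2] for b in boxes), max(b[3] for b in boxes))
    return {"mbr": mbr, "children": children}

def best_first_nn(root, q):
    """Return (distance, point) of the NN of q, or None for an empty tree.
    Entries are expanded in increasing order of their distance to q."""
    counter = 0                      # tie-breaker so dicts are never compared
    heap = [(0.0, counter, root)]
    while heap:
        d, _, item = heapq.heappop(heap)
        if not isinstance(item, dict):
            return d, item           # a point at the top of the heap is the NN
        for child in item["children"]:
            counter += 1
            key = (mindist(q, child["mbr"]) if isinstance(child, dict)
                   else math.dist(q, child))
            heapq.heappush(heap, (key, counter, child))
    return None

# Hypothetical two-level tree (assumption, unrelated to the slide's dataset).
leaf1 = node([(1, 1), (2, 3)])
leaf2 = node([(6, 5), (7, 8)])
leaf3 = node([(9, 1), (8, 2)])
root = node([node([leaf1, leaf2]), node([leaf3])])
q = (7, 3)
print(best_first_nn(root, q))
```

The search stops as soon as a data point reaches the top of the heap, because every remaining key is a lower bound on the distance of anything still unexplored.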
4
Problem: Continuous Nearest Neighbor
Data: A set of points
Query: A line segment q=[s, e]
Result: The nearest neighbor (NN) of every point on q.
Result representation: {s(.NN=a), s1(.NN=c), s2(.NN=f), s3(.NN=h), e}
For simplicity we present continuous 1-NN; the solution generalizes to k-NN and to trajectories consisting of multiple line segments (see paper).
5
Previous Approach – Time Parameterized Queries (Tao and Papadias, SIGMOD 02)
Step 1: Find the NN of the start point s, i.e., point a.
Step 2: Use the TP technique to find the first point on the line segment (s1) where the NN changes, together with the point that becomes the next NN (i.e., point c).
6
TP NN (cont)
From Step 2 we know that the next NN change occurs at s1, where point c becomes the NN.
Step 3: Perform another TP NN query to find how far we need to travel from s1 before the current NN (i.e., c) changes.
Repeat this until the entire segment is covered.
Problem: # of TP queries = # of NN changes (i.e., the cost is output sensitive)
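The repeat-until-done structure of this approach can be illustrated with a brute-force sketch. It is an assumption-laden stand-in: each "TP query" here is a linear scan over all points instead of an R-tree traversal, positions on q are expressed by a parameter t in [0, 1], and the function name and sample points are hypothetical.

```python
def cnn_by_repeated_tp(points, s, e):
    """Return [(t, nn), ...]: nn is the NN of s + t*(e - s) from parameter t
    until the t of the next entry (or until t = 1 for the last entry)."""
    dx, dy = e[0] - s[0], e[1] - s[1]

    def sqdist_from_s(p):
        return (p[0] - s[0]) ** 2 + (p[1] - s[1]) ** 2

    current = min(points, key=sqdist_from_s)      # Step 1: NN of s
    t_cur = 0.0
    result = [(t_cur, current)]
    while True:
        # One simulated TP query: earliest t where another point overtakes.
        best_t, best_p = None, None
        for p in points:
            if p == current:
                continue
            # dist^2(q(t), p) - dist^2(q(t), current) = c0 + c1 * t
            c0 = sqdist_from_s(p) - sqdist_from_s(current)
            c1 = 2.0 * ((current[0] - p[0]) * dx + (current[1] - p[1]) * dy)
            if c1 >= 0.0:
                continue                          # p never gains on current
            t = max(t_cur, -c0 / c1)
            if t < 1.0 and (best_t is None or t < best_t):
                best_t, best_p = t, p
        if best_p is None:
            return result                         # no further change before e
        t_cur, current = best_t, best_p
        result.append((t_cur, current))

# Hypothetical data: the segment runs from (0, 0) to (8, 0).
a, c, f = (1.0, 1.0), (4.0, -1.0), (7.0, 1.0)
print(cnn_by_repeated_tp([a, c, f], (0.0, 0.0), (8.0, 0.0)))
```

The loop body runs once per NN change plus one final pass that finds no change, which is the output-sensitive behavior noted above. The sketch assumes points in general position (no three-way ties at a change).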
7
Our Goal
Find all split points s1, s2, s3 (as well as the corresponding NN for each partition) with a single traversal of the dataset.
Term 1: The set of split points (including s and e) constitutes the split list.
Term 2: The circle centered at split point si with radius dist(si, si.NN) is the vicinity circle of si.
Term 3: A data point u covers a point s if u=s.NN. E.g., points a, c, f, h cover segments [s, s1], [s1, s2], [s2, s3], [s3, e], respectively.
8
Lemma 1
Given a split list SL = {s0, s1, …, s|SL|−1} and a new data point p, p covers some point on the query segment q if and only if p covers a split point.
[Figure: the split list after processing a (left) and after processing c (right)]
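Lemma 1 reduces the question "does p affect the result anywhere on q?" to a test against the split points only. A minimal sketch of that test follows, assuming the split list is stored as (split point, current NN) pairs and that "covers" means lying strictly inside the vicinity circle; the representation, the strict inequality, and the sample values are assumptions.

```python
import math

def covers_some_split_point(p, split_list):
    """True if p is strictly inside the vicinity circle of at least one
    split point. split_list holds (split_point, nn) pairs of (x, y) tuples."""
    return any(math.dist(si, p) < math.dist(si, nn) for si, nn in split_list)

# Hypothetical split list for the segment from (0, 0) to (8, 0).
a, c = (1.0, 1.0), (4.0, -1.0)
SL = [((0.0, 0.0), a), ((2.5, 0.0), c), ((8.0, 0.0), c)]

print(covers_some_split_point((7.0, 1.0), SL))   # close to the end point e
print(covers_some_split_point((4.0, 10.0), SL))  # far from every split point
```

By the lemma, a point for which this test fails can be discarded without examining the interior of any interval of q.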
9
Lemma 2 (Covering Continuity)
The split points covered by a point p are continuous. Namely, if p covers split point si but not si−1 (or si+1), then p cannot cover si−j (or si+j) for any value of j>1.
[Figure: query segment q with split points …, si-1, si, si+1, si+2, si+3, …, data points a, b, c, d, f, g, and a new point p]
SL={si-1 (.NN=a), si (.NN=b), si+1 (.NN=c), si+2 (.NN=d), si+3 (.NN=f)}
10
Algorithm with R-trees Overview
- Use branch-and-bound techniques to prune the search space.
- When a leaf entry (i.e., a data point) p is encountered, SL is updated if p covers any split point (i.e., p is a qualifying entry).
  – By Lemma 1.
- For an intermediate entry, we visit its subtree only if it may contain a qualifying data point.
  – Use heuristics.
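The leaf-entry step (updating SL with a new point p) can be sketched in brute-force form. This is not the authors' procedure: it scans every interval of the split list instead of using Lemmas 1 and 2 to locate the affected split points, it omits the R-tree traversal, and it stores the split list as (t, nn) pairs where t in [0, 1] is the position on q at which nn becomes the NN. All names and sample values are assumptions.

```python
def update_split_list(sl, p, s, e):
    """Return the split list after taking data point p into account."""
    dx, dy = e[0] - s[0], e[1] - s[1]

    def sqdist_from_s(a):
        return (a[0] - s[0]) ** 2 + (a[1] - s[1]) ** 2

    ends = [t for t, _ in sl[1:]] + [1.0]         # right end of each interval
    pieces = []
    for (t0, nn), t1 in zip(sl, ends):
        # g(t) = dist^2(q(t), p) - dist^2(q(t), nn) = c0 + c1 * t (linear in t)
        c0 = sqdist_from_s(p) - sqdist_from_s(nn)
        c1 = 2.0 * ((nn[0] - p[0]) * dx + (nn[1] - p[1]) * dy)
        g0, g1 = c0 + c1 * t0, c0 + c1 * t1
        if g0 >= 0.0 and g1 >= 0.0:
            pieces.append((t0, nn))               # nn stays the NN throughout
        elif g0 <= 0.0 and g1 <= 0.0:
            pieces.append((t0, p))                # p covers the whole interval
        else:
            tc = min(max(-c0 / c1, t0), t1)       # new split point inside
            first, second = (p, nn) if g0 < 0.0 else (nn, p)
            pieces += [(t0, first), (tc, second)]
    merged = []                                   # drop redundant split points
    for t, nn in pieces:
        if not merged or merged[-1][1] != nn:
            merged.append((t, nn))
    return merged

def cnn_incremental(points, s, e):
    """Process the points one by one, in the given order."""
    sl = [(0.0, points[0])]
    for p in points[1:]:
        sl = update_split_list(sl, p, s, e)
    return sl

# Hypothetical data; the processing order does not affect the final list.
a, c, f = (1.0, 1.0), (4.0, -1.0), (7.0, 1.0)
print(cnn_incremental([f, a, c], (0.0, 0.0), (8.0, 0.0)))
```

Each point is processed exactly once, which mirrors the single-traversal goal; the pruning heuristics then serve to avoid reading points that cannot change the list.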
11
Heuristic 1
Given an intermediate entry E and query segment q, the subtree of E may contain qualifying points only if mindist(E,q) < SLMAXD, where
- mindist(E,q) denotes the minimum distance between the MBR of E and q;
- SLMAXD is the maximum distance between a split point and its NN.
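One way to evaluate Heuristic 1 is sketched below. The segment-to-rectangle distance is computed with a standard clipping test plus vertex distances; this is a generic geometric routine chosen for illustration, not necessarily how the authors compute mindist(E,q). Function names, MBRs, and the SLMAXD value are hypothetical.

```python
import math

def point_rect_dist(p, r):
    dx = max(r[0] - p[0], 0.0, p[0] - r[2])
    dy = max(r[1] - p[1], 0.0, p[1] - r[3])
    return math.hypot(dx, dy)

def point_segment_dist(p, s, e):
    dx, dy = e[0] - s[0], e[1] - s[1]
    len2 = dx * dx + dy * dy
    t = 0.0 if len2 == 0 else max(0.0, min(1.0,
        ((p[0] - s[0]) * dx + (p[1] - s[1]) * dy) / len2))
    return math.hypot(p[0] - (s[0] + t * dx), p[1] - (s[1] + t * dy))

def segment_intersects_rect(s, e, r):
    """Liang-Barsky style test: does segment [s, e] touch rectangle r?"""
    t0, t1 = 0.0, 1.0
    for d, lo, hi, o in ((e[0] - s[0], r[0], r[2], s[0]),
                         (e[1] - s[1], r[1], r[3], s[1])):
        if d == 0:
            if o < lo or o > hi:
                return False
        else:
            ta, tb = sorted(((lo - o) / d, (hi - o) / d))
            t0, t1 = max(t0, ta), min(t1, tb)
            if t0 > t1:
                return False
    return True

def mindist_segment_mbr(s, e, r):
    """Minimum distance between segment [s, e] and r=(xmin, ymin, xmax, ymax)."""
    if segment_intersects_rect(s, e, r):
        return 0.0
    corners = [(r[0], r[1]), (r[0], r[3]), (r[2], r[1]), (r[2], r[3])]
    # For disjoint convex shapes the minimum is attained at a vertex of one.
    return min([point_rect_dist(s, r), point_rect_dist(e, r)] +
               [point_segment_dist(c, s, e) for c in corners])

def passes_heuristic_1(s, e, mbr, slmaxd):
    return mindist_segment_mbr(s, e, mbr) < slmaxd

s, e = (0.0, 0.0), (8.0, 0.0)
SLMAXD = 2.0                                      # hypothetical value
print(passes_heuristic_1(s, e, (2.0, 3.0, 4.0, 5.0), SLMAXD))   # mindist is 3
print(passes_heuristic_1(s, e, (9.0, 1.0, 10.0, 2.0), SLMAXD))  # near e
```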
12
Heuristic 2
Given an intermediate entry E and query segment q, the subtree of E must be searched if and only if there exists a split point si∈SL such that dist(si,si.NN) > mindist(si, E). Heuristic 2 requires mindist computation between E and all split points. Hence it is applied only if E passes heuristic 1, which requires only one computation.
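The per-split-point test of Heuristic 2 can be written directly from its definition. In this sketch the split list is a list of (split point, NN) pairs and the MBR is (xmin, ymin, xmax, ymax); both representations and all coordinates are hypothetical.

```python
import math

def mindist_point_mbr(p, r):
    dx = max(r[0] - p[0], 0.0, p[0] - r[2])
    dy = max(r[1] - p[1], 0.0, p[1] - r[3])
    return math.hypot(dx, dy)

def passes_heuristic_2(split_list, mbr):
    """True if the MBR reaches into the vicinity circle of some split point,
    i.e. dist(si, si.NN) > mindist(si, E) for at least one si."""
    return any(math.dist(si, nn) > mindist_point_mbr(si, mbr)
               for si, nn in split_list)

# Hypothetical split list for the segment from (0, 0) to (8, 0).
a, c, f = (1.0, 1.0), (4.0, -1.0), (7.0, 1.0)
SL = [((0.0, 0.0), a), ((2.5, 0.0), c), ((5.5, 0.0), f), ((8.0, 0.0), f)]

near = (6.0, 0.5, 7.5, 2.0)    # overlaps the vicinity circle of (5.5, 0)
far = (9.2, 1.0, 10.0, 2.0)    # outside every vicinity circle
print(passes_heuristic_2(SL, near), passes_heuristic_2(SL, far))
```

In this example the MBR `far` lies about 1.56 from the segment, which is below the largest vicinity radius (about 1.80), so it would survive Heuristic 1 and is only eliminated by the finer test.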
13
Heuristic 3 (Access Order)
Entries (satisfying heuristics 1 and 2) are accessed in increasing order of their minimum distances to the query segment q.
[Figure: state before processing E1 (left) and after processing E1 (right)]
14
An Example (Depth First Approach)
[Figure sequence: data points a to m stored in nodes E1 to E6, with query segment [s, e]; each frame shows the split list at one stage of the depth-first traversal]
SL={s(.NN=f), e(.NN=f)}
SL={s(.NN=f), s1(.NN=g), e(.NN=g)}
SL={s(.NN=b), s1(.NN=f), s2(.NN=g), e(.NN=g)}
SL={s(.NN=k), s1(.NN=f), e(.NN=g)}
15
Cost Model for Uniform Data (real data are handled with histograms)
[Figure: data points a to f, nodes E1 and E2, and query segment [s, e] with split point s1. Left: actual search region. Right: approximated search region, extending d_NN around the segment.]
$$d_{NN} \approx \frac{1}{\sqrt{\pi N}}$$
An optimal algorithm on R-trees must access only those nodes whose MBRs intersect the actual search region (i.e., E1 but not E2). To facilitate the analysis we focus on a more regular (approximated) region.
16
Node Access Probability
$$P_{ACCESS}(E, q) = area(E_{EXT}) = \pi d_{NN}^2 + E.l_1 \cdot E.l_2 + 2 d_{NN} (E.l_1 + E.l_2 + q.l) + q.l \cdot (E.l_1 |\cos\theta| + E.l_2 |\sin\theta|)$$
17
Cost Model
$$NA(CNN) = \sum_{i=0}^{h-1} N_i \cdot P_{ACCESS}(E_i.l, q) = \sum_{i=0}^{h-1} N_i \cdot \left( \pi d_{NN}^2 + E_i.l^2 + 2 d_{NN} (2 E_i.l + q.l) + q.l \cdot E_i.l \, (|\cos\theta| + |\sin\theta|) \right)$$
Various models have been proposed for E.l and Ni in the context of R-tree analysis.
Our algorithm is I/O-bound. Hence the above model (which produces the number of node accesses) reflects its performance.
The performance on non-uniform data can be captured with histograms.
$$n_{NN} = N \cdot area(R_{SEARCH}) = N \left( \pi d_{NN}^2 + 2 d_{NN} \cdot q.l \right)$$
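The cost model can be evaluated numerically once values for Ni and the node extents are available. The sketch below assumes a unit data space, square node MBRs of side E_i.l at level i, and the access probability pi*d_NN^2 + E_i.l^2 + 2*d_NN*(2*E_i.l + q.l) + q.l*E_i.l*(|cos theta| + |sin theta|). The per-level values are made-up placeholders rather than the output of any R-tree model, and capping the probability at 1 is an addition of this sketch.

```python
import math

def d_nn(n_points):
    """Expected NN distance for n_points uniform points in a unit square."""
    return 1.0 / math.sqrt(math.pi * n_points)

def n_nn(n_points, q_len):
    """n_NN = N * area(R_SEARCH) for a query segment of length q_len."""
    d = d_nn(n_points)
    return n_points * (math.pi * d * d + 2.0 * d * q_len)

def p_access(node_len, q_len, theta, d):
    """Access probability of a square node with side node_len."""
    area = (math.pi * d * d + node_len ** 2
            + 2.0 * d * (2.0 * node_len + q_len)
            + q_len * node_len * (abs(math.cos(theta)) + abs(math.sin(theta))))
    return min(1.0, area)        # cap added here: a probability cannot exceed 1

def node_accesses(levels, q_len, theta, n_points):
    """levels: list of (N_i, E_i.l) pairs, one per tree level."""
    d = d_nn(n_points)
    return sum(n_i * p_access(l_i, q_len, theta, d) for n_i, l_i in levels)

# Placeholder inputs (assumptions): 100,000 points and a three-level tree.
N = 100_000
levels = [(725, 0.04), (6, 0.45), (1, 1.0)]
print(d_nn(N), n_nn(N, 0.05))
print(node_accesses(levels, q_len=0.05, theta=math.pi / 6, n_points=N))
```

Since N * pi * d_NN^2 equals 1 by construction, n_NN simplifies to 1 + 2 * N * d_NN * q.l, so it grows linearly with the query length.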
18
Experimental Settings
Datasets:
- Uniform
- Real: CA (130K points), ST (2M points)
Queries (each a segment):
- Location and orientation are randomly generated
- Length is set as a parameter
- Performance is measured as the average over 200 queries
Machine: 1 GHz CPU, 256 MB memory
Page size = 4 KB (R-tree node capacity = 200)
Compare CNN and TP (the only existing solution)
19
Exp 1: Cost Model Evaluation
[Figure: node accesses of Depth-first, Best-first, and the estimation for the optimal algorithm. Panels for Uniform and CA: node accesses vs. query length (1% to 25%, k=5) and node accesses vs. k (1 to 9, query length=12.5%).]
20
Exp 2: Performance vs Query Length
[Figure: CNN vs. TP on CA and ST as a function of query length (1% to 25%), logarithmic y axes. Panels per dataset: node accesses, CPU cost (sec), total cost (sec). The total-cost panels are annotated with CPU percentages.]
CPU percentage labels (CA): 78%, 77%, 76%, 74%, 68%, 41%; 10%, 8%, 6%, 4%, 2%, 1%
CPU percentage labels (ST): 91%, 91%, 90%, 84%, 80%, 75%; 3%, 7%, 14%, 25%, 38%, 42%
21
Exp 3: Performance vs k (number of neighbors to be retrieved for each point)
[Figure: CNN vs. TP on CA and ST as a function of k (1 to 9), logarithmic y axes. Panels per dataset: node accesses, CPU cost (sec), total cost (sec). The total-cost panels are annotated with CPU percentages.]
CPU percentage labels (CA): 88%, 81%, 71%, 52%, 17%; 1%, 3%, 5%, 8%, 12%
CPU percentage labels (ST): 94%, 91%, 84%, 71%, 51%; 42%, 30%, 20%, 8%, 3%
22