Continuous Nearest Neighbor Search Yufei Tao, Dimitris Papadias, - - PowerPoint PPT Presentation



slide-1
SLIDE 1

1

Continuous Nearest Neighbor Search

Yufei Tao, Dimitris Papadias, Qiongmao Shen
Hong Kong University of Science and Technology
Clear Water Bay, Hong Kong

slide-2
SLIDE 2

2

Point Nearest Neighbor (NN) Queries [Roussopoulos et al SIGMOD95, Hjaltason and Samet TODS 99]

  • Branch and bound algorithms use mindist between the query point q and an R-tree entry E to prune the search space:
      – mindist(E, q) = the minimum distance between E and q

[Figure: query point q and R-tree entries E1 and E2 plotted on x and y axes, with mindist(q, E1) and mindist(q, E2) marked.]
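As a minimal sketch of the mindist computation for a point query, assuming 2D points and MBRs stored as `(xmin, ymin, xmax, ymax)` tuples (the function name and the tuple layout are illustrative choices, not taken from the slides):

```python
import math


def mindist_point_rect(q, rect):
    """Minimum Euclidean distance between point q = (x, y) and an
    axis-aligned rectangle rect = (xmin, ymin, xmax, ymax).
    Returns 0.0 when q lies inside or on the border of rect."""
    x, y = q
    xmin, ymin, xmax, ymax = rect
    # Per-axis gap between q and the rectangle; 0 if q is within the extent.
    dx = max(xmin - x, 0.0, x - xmax)
    dy = max(ymin - y, 0.0, y - ymax)
    return math.hypot(dx, dy)
```

Because no point inside E can be closer to q than mindist(E, q), an entry whose mindist exceeds the distance of the best neighbor found so far can be skipped.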

slide-3
SLIDE 3

3

Nearest Neighbor Search (NN) with R-Trees

  • Depth-first (DF) and best-first (BF) algorithms:

[Figure: data points a to i grouped into R-tree entries E1 to E9, with the query point and the search region marked on x and y axes.]

R-tree contents, with the mindist of each entry to the query point in parentheses:
  Root: E1 (1), E2 (2), E3 (8)
  E1:   E4 (5), E5 (5), E6 (9)
  E2:   E7 (13), E8 (2), E9 (17)
  E4:   a (5), b (13), c (18)
  E5:   d (13), e (13), f (10)
  E8:   h (2), g (13), i (10)

Best-first trace:
  Action        Heap                                           Result
  Visit Root    E1 1, E2 2, E3 8                               {empty}
  follow E1     E2 2, E4 5, E5 5, E3 8, E6 9                   {empty}
  follow E2     E8 2, E4 5, E5 5, E3 8, E6 9, E7 13, E9 17     {empty}
  follow E8     E4 5, E5 5, E3 8, E6 9, E7 13, E9 17           {(h, 2)}
  Report h and terminate
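The best-first strategy can be sketched with a priority queue keyed on mindist. This is an illustrative in-memory version, not the authors' implementation: the dictionary-based node layout (`mbr`, `children`, `points`) and the function names are assumptions made for the example, and data points are assumed to be tuples.

```python
import heapq
import math


def mindist(q, rect):
    """Minimum distance from point q to rect = (xmin, ymin, xmax, ymax)."""
    dx = max(rect[0] - q[0], 0.0, q[0] - rect[2])
    dy = max(rect[1] - q[1], 0.0, q[1] - rect[3])
    return math.hypot(dx, dy)


def best_first_nn(root, q):
    """Return (nearest_point, distance) for query point q.

    Hypothetical node layout: internal nodes are dicts with keys
    'mbr' and 'children'; leaf nodes are dicts with 'mbr' and 'points'.
    """
    counter = 0                       # tie-breaker so dicts are never compared
    heap = [(0.0, counter, root)]
    while heap:
        dist, _, item = heapq.heappop(heap)
        if isinstance(item, tuple):   # a data point reached the heap top
            return item, dist         # nothing left in the heap can be closer
        if "points" in item:          # leaf node: enqueue its data points
            for p in item["points"]:
                counter += 1
                heapq.heappush(heap, (math.dist(p, q), counter, p))
        else:                         # internal node: enqueue child entries
            for child in item["children"]:
                counter += 1
                heapq.heappush(heap, (mindist(q, child["mbr"]), counter, child))
    return None, math.inf
```

The search stops as soon as a data point is popped, which mirrors the "Report h and terminate" step in the trace: every entry still in the heap has a mindist at least as large as the reported distance.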
slide-4
SLIDE 4

4

Problem: Continuous Nearest Neighbor

Data: A set of points
Query: A line segment q=[s, e]
Result: The nearest neighbor (NN) of every point on q.
Result representation: {s(.NN=a), s1(.NN=c), s2(.NN=f), s3(.NN=h), e}
For the sake of simplicity we present continuous 1-NN; the solution generalizes to k-NN and to trajectories of multiple line segments (see paper).

slide-5
SLIDE 5

5

Previous Approach – Time Parameterized Queries (Tao and Papadias, SIGMOD 02)

Step 1: Find the NN of the start point s, i.e., point a.
Step 2: Use the TP technique to find the first point on the line segment (sc ) where the NN changes; the point that causes the change (i.e., point c) becomes the next NN.

slide-6
SLIDE 6

6

TP NN (cont)

From Step 2, the next NN change is to point c, at s1.
Step 3: Perform another TP NN query to find how far we need to travel from s1 for the current NN (i.e., c) to change.
Repeat this until the entire segment has been processed.
Problem: # of TP queries = # of NN changes (i.e., the cost is output sensitive).
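To make the output-sensitive behavior concrete, the sketch below mimics the step-by-step procedure with a linear scan standing in for each TP query (the actual method answers each TP query on an R-tree). The function name and the parameterization q(t) = s + t(e − s), t in [0, 1], are assumptions for illustration. Each pass of the `while` loop plays the role of one TP query.

```python
import math


def cnn_by_repeated_scans(points, s, e):
    """Return [(t, nn), ...]: nn is the nearest neighbor of q(t) from
    parameter t up to the next listed t (or up to 1.0 for the last entry)."""
    dx, dy = e[0] - s[0], e[1] - s[1]
    cur = min(points, key=lambda p: math.dist(p, s))   # Step 1: NN of s
    t0 = 0.0
    result = [(t0, cur)]
    while True:
        best_t, best_p = None, None
        for p in points:
            # f(t) = |q(t)-cur|^2 - |q(t)-p|^2 is linear in t with this slope.
            slope = dx * (p[0] - cur[0]) + dy * (p[1] - cur[1])
            if slope <= 0:
                continue              # p never overtakes cur further along q
            rhs = (p[0] ** 2 + p[1] ** 2) - (cur[0] ** 2 + cur[1] ** 2) \
                - 2 * (s[0] * (p[0] - cur[0]) + s[1] * (p[1] - cur[1]))
            t = max(rhs / (2 * slope), t0)   # where p and cur are equidistant
            if t < 1.0 and (best_t is None or t < best_t):
                best_t, best_p = t, p
        if best_p is None:
            return result             # cur stays the NN until e
        if best_t == result[-1][0]:
            result[-1] = (best_t, best_p)    # zero-length interval: overwrite
        else:
            result.append((best_t, best_p))
        cur, t0 = best_p, best_t
```

For example, with points (0, 1), (4, 1), (8, 1) and the segment from (0, 0) to (8, 0), the loop body runs once per NN change plus one final pass that finds no further change.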

slide-7
SLIDE 7

7

Our Goal

Find all split points s1, s2, s3 (as well as the corresponding NN for each partition) with a single traversal of the dataset.
Term 1: The set of split points (including s and e) constitutes the split list.
Term 2: The circle centered at split point si with radius dist(si, si.NN) is the vicinity circle of si.
Term 3: A data point u covers a point s if u=s.NN. E.g., points a, c, f, h cover segments [s, s1], [s1, s2], [s2, s3], [s3, e].

slide-8
SLIDE 8

8

Lemma 1

Given a split list SL = {s0, s1, …, s|SL|−1} and a new data point p: p covers some point on the query segment q if and only if p covers a split point.

[Figure: two panels, "After processing a" and "After processing c".]
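Lemma 1 suggests a simple incremental update: test the new point only at the split points, and touch the split list only if that test succeeds. The sketch below is a brute-force illustration under stated assumptions, not the paper's procedure: the split list is stored as `(t, nn)` pairs over the parameterization q(t) = s + t(e − s), the end point e is implicit at t = 1.0, and all function names are hypothetical.

```python
def gain(p, u, s, e, t):
    """|q(t)-u|^2 - |q(t)-p|^2; positive when p is closer to q(t) than u."""
    qx, qy = s[0] + t * (e[0] - s[0]), s[1] + t * (e[1] - s[1])
    return ((qx - u[0]) ** 2 + (qy - u[1]) ** 2) \
        - ((qx - p[0]) ** 2 + (qy - p[1]) ** 2)


def covers_a_split_point(sl, p, s, e):
    """Lemma 1 test: is p closer to some split point than its current NN?"""
    checks = list(sl) + [(1.0, sl[-1][1])]       # add the end point e
    return any(gain(p, u, s, e, t) > 0 for t, u in checks)


def update_split_list(sl, p, s, e):
    """Return the split list after inserting data point p."""
    pieces = []
    for i, (t_lo, u) in enumerate(sl):
        t_hi = sl[i + 1][0] if i + 1 < len(sl) else 1.0
        f_lo, f_hi = gain(p, u, s, e, t_lo), gain(p, u, s, e, t_hi)
        if f_lo <= 0 and f_hi <= 0:
            pieces.append((t_lo, u))             # u keeps the whole interval
        elif f_lo >= 0 and f_hi >= 0:
            pieces.append((t_lo, p))             # p takes the whole interval
        else:
            # gain is linear in t, so the crossing is found by interpolation.
            t_x = t_lo + (t_hi - t_lo) * f_lo / (f_lo - f_hi)
            first, second = (u, p) if f_lo < 0 else (p, u)
            pieces.append((t_lo, first))
            pieces.append((t_x, second))
    merged = []
    for t, nn in pieces:
        if merged and merged[-1][0] == t:
            merged.pop()                         # drop zero-length piece
        if not merged or merged[-1][1] != nn:
            merged.append((t, nn))               # merge equal neighbors
    return merged
```

The reason the endpoint test suffices is visible in the code: on each interval the gain is linear in t, so if it is non-positive at both ends it is non-positive everywhere in between.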

slide-9
SLIDE 9

9

Lemma 2 (Covering Continuity)

The split points covered by a point p are continuous. Namely, if p covers split point si but not si−1 (or si+1), then p cannot cover si−j (or si+j) for any value of j>1.

[Figure: query segment q with split points si−1, si, si+1, si+2, si+3, data points a, b, c, d, f, g, and a new point p.]
SL={si-1 (.NN=a), si (.NN=b), si+1 (.NN=c), si+2 (.NN=d), si+3 (.NN=f)}

slide-10
SLIDE 10

10

Algorithm with R-trees Overview

Use branch-and-bound techniques to prune the search space.
  • When a leaf entry (i.e., a data point) p is encountered, SL is updated if p covers any split point (i.e., p is a qualifying entry), by Lemma 1.
  • For an intermediate entry, we visit its subtree only if it may contain a qualifying data point; this is decided with heuristics.

slide-11
SLIDE 11

11

Heuristic 1

Given an intermediate entry E and query segment q, the subtree of E may contain qualifying points only if mindist(E,q) < SLMAXD, where:
  • mindist(E,q) denotes the minimum distance between the MBR of E and q;
  • SLMAXD is the maximum distance between a split point and its NN.

slide-12
SLIDE 12

12

Heuristic 2

Given an intermediate entry E and query segment q, the subtree of E must be searched if and only if there exists a split point si∈SL such that dist(si,si.NN) > mindist(si, E). Heuristic 2 requires mindist computation between E and all split points. Hence it is applied only if E passes heuristic 1, which requires only one computation.
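The two pruning tests can be sketched as follows. This is an illustrative version under assumptions that are not in the slides: MBRs are `(xmin, ymin, xmax, ymax)` tuples, each split point is passed together with the distance to its current NN, and the segment-to-rectangle distance is computed with a slab-clipping intersection test plus vertex distances. All function names are hypothetical.

```python
import math


def mindist_point_rect(p, rect):
    dx = max(rect[0] - p[0], 0.0, p[0] - rect[2])
    dy = max(rect[1] - p[1], 0.0, p[1] - rect[3])
    return math.hypot(dx, dy)


def dist_point_segment(p, s, e):
    dx, dy = e[0] - s[0], e[1] - s[1]
    len2 = dx * dx + dy * dy
    if len2 == 0:
        return math.dist(p, s)
    t = ((p[0] - s[0]) * dx + (p[1] - s[1]) * dy) / len2
    t = max(0.0, min(1.0, t))                    # clamp onto the segment
    return math.hypot(p[0] - (s[0] + t * dx), p[1] - (s[1] + t * dy))


def segment_hits_rect(s, e, rect):
    """Slab clipping: does segment [s, e] intersect (or lie in) rect?"""
    t0, t1 = 0.0, 1.0
    for p0, d, lo, hi in ((s[0], e[0] - s[0], rect[0], rect[2]),
                          (s[1], e[1] - s[1], rect[1], rect[3])):
        if d == 0:
            if p0 < lo or p0 > hi:
                return False
        else:
            ta, tb = (lo - p0) / d, (hi - p0) / d
            if ta > tb:
                ta, tb = tb, ta
            t0, t1 = max(t0, ta), min(t1, tb)
            if t0 > t1:
                return False
    return True


def mindist_segment_rect(s, e, rect):
    if segment_hits_rect(s, e, rect):
        return 0.0
    corners = [(rect[0], rect[1]), (rect[0], rect[3]),
               (rect[2], rect[1]), (rect[2], rect[3])]
    # For disjoint convex shapes the minimum is attained at a vertex of one.
    return min(mindist_point_rect(s, rect), mindist_point_rect(e, rect),
               min(dist_point_segment(c, s, e) for c in corners))


def passes_heuristic_1(rect, s, e, sl_maxd):
    """One distance computation: mindist(E, q) < SLMAXD."""
    return mindist_segment_rect(s, e, rect) < sl_maxd


def passes_heuristic_2(rect, split_points):
    """split_points: [(si, dist(si, si.NN)), ...]; one mindist per split point."""
    return any(mindist_point_rect(si, rect) < d for si, d in split_points)
```

Heuristic 1 is the cheap filter and Heuristic 2 the exact one, so a caller would typically evaluate `passes_heuristic_1` first and only then `passes_heuristic_2`.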

slide-13
SLIDE 13

13

Heuristic 3 (Access Order)

Entries (satisfying heuristics 1 and 2) are accessed in increasing order of their minimum distances to the query segment q.

[Figure: two panels, "Before processing E1" and "After processing E1".]

slide-14
SLIDE 14

14

An Example (Depth First Approach)

[Figure: four snapshots of the depth-first traversal over an R-tree with entries E1 to E6 and data points a, b, c, d, f, g, h, i, j, k, l, m, for a query segment from s to e. The split list shown with each snapshot is listed below.]

Snapshot 1: SL={s(.NN=f), e(.NN=f)}
Snapshot 2: SL={s(.NN=f), s1(.NN=g), e(.NN=g)}
Snapshot 3: SL={s(.NN=b), s1(.NN=f), s2(.NN=g), e(.NN=g)}
Snapshot 4: SL={s(.NN=k), s1(.NN=f), e(.NN=g)}

slide-15
SLIDE 15

15

Cost Model for Uniform Data (real data are handled with histograms)

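[Figure: data points a to f, R-tree entries E1 and E2, and a query segment from s to e with split point s1. Two panels: the actual search region and the approximated search region, the latter drawn with extent d_NN around the segment.]

$$ d_{NN} \approx \frac{1}{\sqrt{\pi N}} $$

An optimal algorithm on R-trees must access only those nodes whose MBRs intersect the actual search region (i.e., E1 but not E2). To facilitate the analysis, we focus on a more regular (approximated) region.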

slide-16
SLIDE 16

16

Node Access Probability

$$ P_{ACCESS}(E, q) = area(E_{EXT}) = \pi d_{NN}^2 + E.l_1 \cdot E.l_2 + 2 d_{NN} \left( E.l_1 + E.l_2 + q.l \right) + q.l \cdot \left( E.l_1 |\cos\theta| + E.l_2 \cdot |\sin\theta| \right) $$

slide-17
SLIDE 17

17

Cost Model

$$ NA(CNN) = \sum_{i=0}^{h-1} N_i \cdot P_{ACCESS}(E.l_i, q) = \sum_{i=0}^{h-1} N_i \cdot \left[ \pi d_{NN}^2 + E.l_i^2 + 2 d_{NN} \left( 2 E.l_i + q.l \right) + q.l \cdot E.l_i \left( |\cos\theta| + |\sin\theta| \right) \right] $$

Various models have been proposed for E.l and Ni in the context of R-tree analysis. Our algorithm is I/O-bound; hence the above model (which estimates the number of node accesses) reflects its performance. Non-uniform data can be handled with histograms.

$$ n_{NN} = N \cdot area(R_{SEARCH}) = N \left( \pi d_{NN}^2 + 2 d_{NN} \cdot q.l \right) $$
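A small numeric sketch of the cost model may help when plugging in values. It assumes square node MBRs with side E.l_i at level i, a unit data space, and levels supplied by the caller as `(N_i, E.l_i)` pairs; the function names and the input format are illustrative, and the per-level values would come from an R-tree analysis model that is not shown here.

```python
import math


def expected_nn_distance(n_points):
    """d_NN ~ 1 / sqrt(pi * N) for N uniform points in a unit space."""
    return 1.0 / math.sqrt(math.pi * n_points)


def access_probability(side, q_len, theta, d_nn):
    """Area of a square node of the given side, extended by the approximated
    search region of a query segment with length q_len and angle theta.
    The value is returned as computed and is not capped at 1."""
    return (math.pi * d_nn ** 2
            + side ** 2
            + 2 * d_nn * (2 * side + q_len)
            + q_len * side * (abs(math.cos(theta)) + abs(math.sin(theta))))


def estimated_node_accesses(levels, q_len, theta, d_nn):
    """Sum of N_i * P_ACCESS(E.l_i, q) over the levels.

    levels: list of (N_i, E.l_i) pairs, one per tree level (assumed input)."""
    return sum(n_i * access_probability(side, q_len, theta, d_nn)
               for n_i, side in levels)
```

With q_len = 0 and d_nn = 0 the access probability reduces to the node area `side ** 2`, which is a quick sanity check on the formula.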

slide-18
SLIDE 18

18

Experimental Settings

Datasets:
  • Uniform
  • Real: CA (130K points), ST (2M points)
Queries (each a segment):
  • Location and orientation are randomly generated
  • Length is set as a parameter
  • Performance is measured as the average over 200 queries
Machine: 1 GHz CPU, 256 MB memory
Page size = 4 KB (R-tree node capacity = 200)
Compare CNN and TP (the only existing solution)

slide-19
SLIDE 19

19

Exp 1: Cost Model Evaluation

[Figure: four plots of node accesses comparing Depth-First, Best-first, and the estimation for the optimal algorithm. For each of the Uniform and CA datasets: node accesses vs. query length (1% to 25%, k=5) and node accesses vs. k (1 to 9, query length=12.5%).]

slide-20
SLIDE 20

20

Exp 2: Performance vs Query Length

[Figure: CNN vs. TP as a function of query length (1% to 25%), on logarithmic axes, in two panel groups labeled CA and ST. Each group has three plots: node accesses, CPU cost (sec), and total cost (sec). The total-cost plots are annotated with CPU percentages. First group: 78%, 77%, 76%, 74%, 68%, 41% and 10%, 8%, 6%, 4%, 2%, 1%. Second group: 91%, 91%, 90%, 84%, 80%, 75% and 3%, 7%, 14%, 25%, 38%, 42%.]

slide-21
SLIDE 21

21

Exp 3: Performance vs k (number of neighbors to be retrieved for each point)

[Figure: CNN vs. TP as a function of k (1 to 9), on logarithmic axes, in two panel groups labeled CA and ST. Each group has three plots: node accesses, CPU cost (sec), and total cost (sec). The total-cost plots are annotated with CPU percentages. First group: 88%, 81%, 71%, 52%, 17% and 1%, 3%, 5%, 8%, 12%. Second group: 94%, 91%, 84%, 71%, 51% and 42%, 30%, 20%, 8%, 3%.]

slide-22
SLIDE 22

22

Conclusion

A fast algorithm for C-kNN queries.
Future work:
  • Rectangle data
  • Moving data points
  • Application to road networks (i.e., travel distance instead of Euclidean distance)