9 23 2009



  1. 9/23/2009

     CONFERENCES
     Short name: full name
     - SIGMOD: Special Interest Group on Management Of Data
     - VLDB: Very Large Data Base
     - ICDE: International Conference on Data Engineering

     CONTINUOUS NEAREST NEIGHBOR SEARCH
     Yufei Tao, Dimitris Papadias, Qiongmao Shen
     Hong Kong University of Science and Technology
     Presented by: Penny Bei Pan

     OVERVIEW
     - Introduction
     - Preliminary & Related Work
     - Continuous k-Nearest Neighbor Query (CkNN)
       - Definition
       - Problem Characteristics
       - R-tree algorithm
       - Query analysis
       - Complex CNN extension
     - Experiments
     - Discussion and Conclusion

     INTRODUCTION
     [Figure: data objects and a query point]
     - Continuous Nearest Neighbor
     - Why is it called "continuous"?
       - It finds the nearest neighbor of every point on the trajectory

     PRELIMINARY - POINT NN QUERIES
     - Branch-and-bound algorithms use mindist between the query point q and an R-tree entry E to prune the search space:
       - mindist(E, q) = the minimum distance between E and q

     PRELIMINARY - POINT NN QUERIES
     - Depth-first (DF) and Best-first (BF) algorithms
       - E: R-tree entry
       - q: query point
     - DF: choose the entry with the minimum mindist
     - BF: choose the minimum among all entries visited so far (kept in a heap)
     [Figure: R-tree with entries E1 to E6 and data points f, k, l, m]
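The mindist pruning used by the point NN algorithms above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the slides: it assumes 2D points, axis-aligned rectangles given as `(xmin, ymin, xmax, ymax)`, and a flat list of leaves instead of a full R-tree. The names `mindist` and `best_first_nn` are hypothetical.

```python
import heapq
import math

def mindist(rect, q):
    """Minimum Euclidean distance from point q to an axis-aligned rectangle.

    rect is (xmin, ymin, xmax, ymax); the result is 0 when q lies inside it.
    """
    xmin, ymin, xmax, ymax = rect
    dx = max(xmin - q[0], 0.0, q[0] - xmax)
    dy = max(ymin - q[1], 0.0, q[1] - ymax)
    return math.hypot(dx, dy)

def best_first_nn(leaves, q):
    """Best-first NN over a one-level 'tree' (assumption: leaves only).

    leaves is a list of (rect, points). Leaves are visited in increasing
    mindist order, and the search stops once no leaf can beat the best point.
    """
    heap = [(mindist(rect, q), i) for i, (rect, _) in enumerate(leaves)]
    heapq.heapify(heap)
    best, best_d = None, math.inf
    while heap:
        d, i = heapq.heappop(heap)
        if d >= best_d:
            break  # every remaining leaf is at least this far away
        for p in leaves[i][1]:
            dp = math.dist(p, q)
            if dp < best_d:
                best, best_d = p, dp
    return best, best_d
```

With two leaves, one near the query and one far away, the far leaf is never opened because its mindist already exceeds the best distance found.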

  2. PRELIMINARY - CONTINUOUS NEAREST NEIGHBOR
     [Figure: data points and query segment [s, e]]
     - Data: a set of points (P = {a, b, c, d, f, g, h})
     - Query: a line segment q = [s, e]
     - Result: the nearest neighbor (NN) of every point on q
     - Result representation: {<a, [s, s1]>, <c, [s1, s2]>, <f, [s2, s3]>, <h, [s3, e]>}

     RELATED WORK - SAMPLING
     - Try to convert the continuous NN query to point NN queries
       - Every point on the line -> an unlimited number of points
     - Sampling
     - Drawback:
       - Sample rate low -> incorrect results
       - Sample rate high -> overhead (and accuracy still cannot be guaranteed)
     - Time Parameterized (TP) queries
       - Output (R, T, C): result, time period, changing point
       - Tao, Y., Papadias, D. Time Parameterized Queries in Spatio-Temporal Databases. ACM SIGMOD, 2002.

     RELATED WORK - TIME PARAMETERIZED NN
     - Step 1: Find the NN of the start point s, i.e., point a.
     - Step 2: Use the TP technique to find the first point on the line segment (s1) where the NN changes; the point causing the change (i.e., point c) becomes the next NN.

     RELATED WORK - TP NN (CONT.)
     - Step 3: Perform another TP NN query to find how far we need to travel from s1 for the current NN (i.e., c) to change to f.
     - Repeat this until the entire segment is covered.

     RELATED WORK - TP NN (CONT.)
     [Figure: data points d, f, g, h and split points along the segment]
     - Intuition: intersect the perpendicular bisectors with the segment [s, e]
     - Supports not only NN but also k-NN
     - Still has overhead: the query is repeated n times
     (Yufei Tao, Dimitris Papadias)

     CkNN - DEFINITION
     - Goal: find all split points (as well as the corresponding NN for each partition) with a single traversal.
     - Split list: the set of split points (including s and e).
     - Vicinity circle: the circle centered at split point s_i with radius dist(s_i, s_i.NN).
     - We say a data point u covers a point s if u = s.NN. E.g., points a and c cover segments [s, s1] and [s1, s2].
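To make the result representation and the bisector intuition concrete, here is a minimal brute-force sketch in Python. It is not the single-traversal algorithm of the paper and not the TP implementation: it simply repeats a "find the next change" step over all points, in the spirit of the TP NN steps. Positions on the segment are expressed by a parameter t in [0, 1], with s at t = 0 and e at t = 1. The function names are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between 2D points p and q."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def cnn_by_bisectors(points, s, e):
    """Return [(nn, t_start, t_end), ...] covering the segment from s to e.

    The squared-distance difference between two data points is linear in t,
    so the bisector of the current NN and any other point b crosses the
    segment at a single parameter value t_star.
    """
    d = (e[0] - s[0], e[1] - s[1])
    t = 0.0
    cur = min(points, key=lambda p: sq_dist(p, s))  # step 1: NN of s
    result = []
    while True:
        next_t, next_p = 1.0, None
        for b in points:
            if b == cur:
                continue
            # b can only take over if it lies further along the direction d.
            slope = 2.0 * ((b[0] - cur[0]) * d[0] + (b[1] - cur[1]) * d[1])
            if slope <= 0:
                continue
            t_star = max(t, (sq_dist(s, b) - sq_dist(s, cur)) / slope)
            if t_star < next_t:
                next_t, next_p = t_star, b
        if next_t > t:  # skip zero-length pieces caused by ties
            result.append((cur, t, next_t))
        if next_p is None:
            return result
        t, cur = next_t, next_p
```

For two points a = (0, 1) and b = (4, 1) and the segment from (0, 0) to (4, 0), the sketch returns a for t in [0, 0.5] and b for t in [0.5, 1], so the only interior split point is (2, 0).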

  3. CkNN - PROBLEM CHARACTERISTICS
     - Lemma 1: Given a split list SL = {s_0, s_1, ..., s_{|SL|-1}} and a new data point p, p covers some point on query segment q if and only if p covers a split point.

     CkNN - PROBLEM CHARACTERISTICS
     - Lemma 2 (Covering Continuity):
       - The split points covered by a point p are continuous.
       - Namely, if p covers split point s_i but not s_{i-1} (or s_{i+1}), then p cannot cover s_{i-j} (or s_{i+j}) for any value of j > 1.

     CkNN - PROBLEM CHARACTERISTICS
     - How about k-NN?
       - Lemma 1: fits || Lemma 2: cannot fit
     - E.g., k = 3

     CkNN - R-TREE ALGORITHM
     - General key notes:
       - Use branch-and-bound techniques to prune the search space.
     - R-tree traversal principle:
       - When a leaf entry (i.e., a data point) p is encountered, SL is updated if p covers any split point (i.e., p is a qualifying entry), by Lemma 1.
       - For an intermediate entry, we visit its subtree only if it may contain any qualifying data point, using heuristics.
       - Avoid accessing non-qualifying nodes.

     R-TREE ALGORITHM - HEURISTIC 1
     - Given an intermediate entry E and query segment q, the subtree of E may contain qualifying points only if mindist(E, q) < SL_MAXD, where SL_MAXD is the maximum distance between a split point and its NN.
     - Compute mindist(E, q)

     R-TREE ALGORITHM - HEURISTIC 2 (AFTER 1)
     - Given an intermediate entry E and query segment q, the subtree of E must be searched if and only if there exists a split point s_i ∈ SL such that dist(s_i, s_i.NN) > mindist(s_i, E).
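The Lemma 1 test for a data point and the Heuristic 2 test for an intermediate entry can both be written as a scan over the split list. The Python sketch below is an illustration under assumptions, not the authors' code: the split list is represented as pairs `(split_point, nn_dist)`, where `nn_dist` is dist(s_i, s_i.NN), the radius of the vicinity circle, and an entry E is an axis-aligned rectangle `(xmin, ymin, xmax, ymax)`. All function names are hypothetical.

```python
import math

def mindist_point_rect(p, rect):
    """Minimum distance from point p to rectangle (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = rect
    dx = max(xmin - p[0], 0.0, p[0] - xmax)
    dy = max(ymin - p[1], 0.0, p[1] - ymax)
    return math.hypot(dx, dy)

def covers_any_split_point(p, split_list):
    """Lemma 1 check: p qualifies iff it falls inside some vicinity circle."""
    return any(math.dist(p, s) < r for s, r in split_list)

def must_visit(rect, split_list):
    """Heuristic 2 check: search E iff E intersects some vicinity circle."""
    return any(r > mindist_point_rect(s, rect) for s, r in split_list)
```

Heuristic 1 would act as a cheaper pre-filter before this scan; it needs a segment-to-rectangle mindist, which is omitted here to keep the sketch short.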

  4. R-TREE ALGORITHM - HEURISTIC 3 (ORDER)
     - Entries (satisfying heuristics 1 and 2) are accessed in increasing order of their minimum distances to the query segment q.

     R-TREE ALGORITHM - LEAF ENTRY
     - Input: new entry p, SL = {s_1, ..., s_10}
       - 1) Retrieve the split points covered by p
       - 2) Update SL
     - Binary search: start at s_5, then s_2, ...
     - Use the bisector to judge the direction

     CkNN - R-TREE ALGORITHM (EXAMPLE)
     - Depth first

     ANALYSIS - COST MODEL FOR UNIFORM DATA
     [Figure: actual search region and approximate search region]
     - An optimal algorithm on R-trees must access only those nodes whose MBRs intersect the actual search region (i.e., E1 but not E2).
     - To facilitate the analysis we focus on a more regular (approximated) region.

     ANALYSIS - NODE ACCESS PROBABILITY
     - P_ACCESS is the probability that the MBR E of a node intersects the search region.

     ANALYSIS - COST MODEL (NODE ACCESS)
     - Dataset cardinality N
     - R-tree structure (height: h)
     - The query length: q.l
     - The orientation angle
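The leaf-entry step (find what a new point p covers, then update SL) can be sketched as follows. This Python sketch makes two simplifying assumptions that are not in the slides: the current result is stored as pieces `(nn, t_start, t_end)` with t in [0, 1] along the segment from s to e, and the pieces are scanned linearly rather than located by binary search. The bisector of p and the current owner of a piece decides where p takes over. The name `insert_point` is hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between 2D points p and q."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def insert_point(result, p, s, e):
    """Update the piecewise NN result of segment [s, e] with a new point p."""
    d = (e[0] - s[0], e[1] - s[1])
    pieces = []
    for a, lo, hi in result:
        # f(t) = dist^2(x(t), p) - dist^2(x(t), a) = c0 + c1 * t (linear in t);
        # p is closer than a exactly where f(t) < 0.
        c0 = sq_dist(s, p) - sq_dist(s, a)
        c1 = 2.0 * ((a[0] - p[0]) * d[0] + (a[1] - p[1]) * d[1])
        if c1 == 0:
            cut = [(p if c0 < 0 else a, lo, hi)]
        else:
            t = min(max(-c0 / c1, lo), hi)  # bisector crossing, clamped
            first, second = (p, a) if c1 > 0 else (a, p)
            cut = [(first, lo, t), (second, t, hi)]
        for owner, x, y in cut:
            if y <= x:
                continue  # drop empty pieces
            if pieces and pieces[-1][0] == owner:
                pieces[-1] = (owner, pieces[-1][1], y)  # merge neighbors
            else:
                pieces.append((owner, x, y))
    return pieces
```

Starting from a single piece owned by a = (0, 1) on the segment from (0, 0) to (4, 0), inserting (4, 1) splits it at t = 0.5; inserting (2, 0.5) afterwards carves out one contiguous middle piece, which is consistent with the covering continuity of Lemma 2.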

  5. ANALYSIS - COST MODEL (CONT.)
     - The number of distinct neighbors in the final result
     - CPU overhead comparison
       - TP: increases with n_NN
       - This paper: increases with dataset size N, query length l, ...

     OTHER CNN QUERY
     - kCNN query (k = 2)
       - Updating the vicinity circle
     - Trajectory NN query (TNN)
       - q1 = [s, u]
       - q2 = [u, v]
       - q3 = [v, e]
       - Each segment has its own SL
       - Segments are treated one by one

     EXPERIMENTS
     - Datasets:
       - Uniform
       - Real street segments: CA (130K points), ST (2M points)
     - Queries (each a segment):
       - Location and orientation randomly generated
       - Length is set as a parameter
     - Performance is measured as the average of running 200 queries.
     - Machine:
       - 1 GHz CPU, 256M memory
       - Page size = 4K (R-tree node capacity = 200)
     - Compare CNN and TP (the only existing solution)

     EXP 1: COST MODEL EVALUATION
     [Figure]

     EXP 2: PERFORMANCE VS QUERY LENGTH
     [Figure]

     EXP 3: PERFORMANCE VS K
     [Figure]

  6. EXPERIMENTS - KEY NOTES
     - In general, CNN outperforms TP significantly
       - Single traversal
     - Cost model:
       - BF is better than DF (consistent with previous work)
       - The cost model is accurate
     - Performance & query length
       - As the length increases, the number of split points increases
       - CPU for TP: it keeps retrieving the same objects repeatedly
     - Performance & k
       - For CNN: k has little influence on NA, but k does influence CPU cost
       - CPU: higher number of split points

     DISCUSSION AND CONCLUSION
     - A fast algorithm for the CkNN query.
     - Future work:
       - Rectangle data
       - Moving data points
       - Application to road networks (i.e., travel distance instead of Euclidean distance)
     - Thank you!
