
"Similarity Query Processing using Disk Arrays" - PowerPoint PPT Presentation

"Similarity Query Processing using Disk Arrays", Apostolos N. Papadopoulos & Yannis Manolopoulos, Department of Informatics, Aristotle University, Thessaloniki, Greece. ACM SIGMOD Conference,


  1. "Similarity Query Processing using Disk Arrays"
     Apostolos N. Papadopoulos & Yannis Manolopoulos
     Department of Informatics, Aristotle University, Thessaloniki, Greece
     ACM SIGMOD Conference, Seattle, June 1998

  2. Outline
     • Introduction (Disk Arrays & Similarity Queries)
     • Assumptions & Problem Definition
     • Similarity Search Algorithms
     • Performance Evaluation
     • Concluding Remarks & Future Work

  3. Introduction - Disk Arrays
     • Powerful storage media of increasing importance.
     • Can handle a large number of requests.
     • Exploit I/O parallelism (both interquery & intraquery).
     • Fault tolerant (e.g. RAID levels 1, 3, 5).

  4. Introduction - Disk Arrays (cont.)
     [Figure: disk array architecture. Labels: CPU, DMA controller, memory, SCSI bus, and four disk controllers.]

  5. Introduction - Similarity Queries
     We focus on the vector model, where an object is represented by a set of attributes, composing a vector in a multi-dimensional space. There are two fundamental types of similarity queries that can be applied in the vector model:
     • Range Query, where the user specifies a shape and asks for all the objects falling inside the corresponding region,
     • Nearest-Neighbor Query, where the user gives an object and requests the $k$ nearest objects.

  6. Assumptions
     • The set of $n$-d points is organized in an R*-tree.
     • The R*-tree is partitioned in the disk array by means of the Proximity Index heuristic.
     • The partitioning is performed node-wise, i.e. after a split, the new page is assigned to a disk.

  7. Problem Definition
     Given: a set of objects ($n$-d points), a query object $P_q$, and an integer number $k$.
     Determine: an efficient plan to access the parallel R*-tree, in order to report the $k$ nearest neighbors of $P_q$.
     Trying to: (i) maximize parallelism, (ii) access as few nodes as possible, and (iii) reduce query response time.

  8. Similarity Search Algorithms - Distances
     [Figure: a query point $P_q$ and two MBRs $R_1$ and $R_2$, with the distances $D_{min}$, $D_{mm}$, and $D_{max}$ marked.]
     • $D_{min}(P_q, R_x)$: the min distance between a point and an MBR.
     • $D_{mm}(P_q, R_x)$: ensures the existence of at least one point.
     • $D_{max}(P_q, R_x)$: the max distance between a point and an MBR.
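The three distances on this slide can be sketched in a few lines. This is a minimal illustration, assuming an MBR is given by its lower and upper corners (`lo`, `hi`) and that the "mm" distance follows the usual MINMAXDIST construction from the R-tree nearest-neighbor literature; the function names are hypothetical.

```python
import math

def d_min(p, lo, hi):
    """Minimum distance from point p to the MBR [lo, hi]; 0 if p lies inside."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(p, lo, hi)))

def d_max(p, lo, hi):
    """Distance from p to the farthest corner of the MBR."""
    return math.sqrt(sum(max(abs(x - l), abs(x - h)) ** 2 for x, l, h in zip(p, lo, hi)))

def d_mm(p, lo, hi):
    """MINMAXDIST-style bound (assumed definition): for each dimension, take the
    nearer face in that dimension and the farther face in all others, then keep
    the smallest such distance. At least one point lies within this distance."""
    far = [max(abs(x - l), abs(x - h)) ** 2 for x, l, h in zip(p, lo, hi)]
    near = [min(abs(x - l), abs(x - h)) ** 2 for x, l, h in zip(p, lo, hi)]
    total_far = sum(far)
    return math.sqrt(min(total_far - f + n for f, n in zip(far, near)))

# Example with made-up coordinates: d_min <= d_mm <= d_max always holds.
p, lo, hi = (0.0, 0.0), (1.0, 1.0), (3.0, 2.0)
print(d_min(p, lo, hi), d_mm(p, lo, hi), d_max(p, lo, hi))
```

The ordering `d_min <= d_mm <= d_max` is what makes the optimistic and pessimistic bounds in the following slides usable for pruning.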

  9. Similarity Search Algorithms - BBSS
     • Proposed by Roussopoulos et al., for answering NN queries in R-trees.
     • It is a branch-and-bound algorithm, and in each step, a new node is accessed according to the distance between the query point and the node MBR.
     • The distance of the query point to an MBR can be either $D_{min}$ (optimistic) or $D_{mm}$ (pessimistic). Experiments have demonstrated that using $D_{min}$ is more efficient.
     Limitation: intraquery parallelism cannot be exploited, since each time a single node is accessed.
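The branch-and-bound idea can be illustrated with a small depth-first sketch that orders children by D_min and prunes any MBR farther than the current k-th best distance. The `Node` class and the `bbss` function are hypothetical stand-ins, not the authors' implementation; note that nodes are visited strictly one at a time, which is the limitation the slide points out.

```python
import heapq
import math

def d_min(q, lo, hi):
    """Minimum distance from point q to the MBR with corners lo and hi."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))

class Node:
    """Hypothetical R-tree node: a leaf holds points, an inner node holds children."""
    def __init__(self, lo, hi, children=(), points=()):
        self.lo, self.hi = lo, hi
        self.children, self.points = list(children), list(points)

def bbss(root, q, k):
    """Return (sorted list of (distance, point), number of nodes accessed)."""
    best = []      # max-heap via negated distances: (-dist, point)
    accessed = []  # nodes fetched, one per step

    def visit(node):
        accessed.append(node)
        for pt in node.points:
            d = math.dist(q, pt)
            if len(best) < k:
                heapq.heappush(best, (-d, pt))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, pt))
        for child in sorted(node.children, key=lambda c: d_min(q, c.lo, c.hi)):
            # Prune: this MBR cannot hold anything closer than the current k-th best.
            if len(best) == k and d_min(q, child.lo, child.hi) > -best[0][0]:
                continue
            visit(child)

    visit(root)
    return sorted((-nd, pt) for nd, pt in best), len(accessed)

# Made-up two-leaf tree: the far leaf is pruned for k = 1.
a = Node((0, 0), (1, 1), points=[(0, 0), (1, 1)])
b = Node((5, 5), (6, 6), points=[(5, 5), (6, 6)])
root = Node((0, 0), (6, 6), children=[a, b])
print(bbss(root, (0.2, 0.2), 1))
```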

  10. Similarity Search Algorithms - FPSS
      • It operates in a greedy philosophy, trying to access in parallel as many nodes as possible.
      • If a node MBR is intersected by the current query hypersphere, then the node is accessed, otherwise it is rejected.
      • The algorithm first determines a threshold distance $D_{thres}$ and then descends the R*-tree, fetching the nodes from the corresponding disks.
      Limitation: a large number of nodes is accessed, leading to performance degradation.

  11. Similarity Search Algorithms - FPSS (cont.)
      [Figure: a query point $P$ and three MBRs $R_1$, $R_2$, $R_3$, annotated with the values 10, 5, and 10.]
      Let $k = 5$. The circle determined by $P$ and $D_{max}(P, R_2)$ guarantees the existence of $\geq 5$ points. FPSS fetches ALL pages that intersect the circle (i.e. $R_1$, $R_2$, and $R_3$). The process is applied to all R*-tree levels.
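One way to read the FPSS example is as a two-step rule: pick a threshold radius that is guaranteed to enclose at least k objects, then fetch every page whose MBR intersects that radius. The sketch below assumes each entry carries a precomputed D_min, D_max, and object count, and that the threshold is found by accumulating counts in increasing D_max order; that accumulation rule and all numeric values are assumptions for illustration, not taken from the slides.

```python
def fpss_threshold(entries, k):
    """Smallest D_max such that the MBRs lying entirely within it hold >= k objects."""
    covered = 0
    for e in sorted(entries, key=lambda e: e["d_max"]):
        covered += e["count"]
        if covered >= k:
            return e["d_max"]
    return None  # fewer than k objects in total

def fpss_select(entries, d_thres):
    """Names of all MBRs intersected by the query sphere of radius d_thres."""
    return [e["name"] for e in entries if e["d_min"] <= d_thres]

# Hypothetical distances; only the counts echo the slide's figure.
entries = [
    {"name": "R1", "d_min": 2.0, "d_max": 9.0, "count": 10},
    {"name": "R2", "d_min": 1.0, "d_max": 4.0, "count": 5},
    {"name": "R3", "d_min": 3.5, "d_max": 12.0, "count": 10},
]
d_thres = fpss_threshold(entries, k=5)
print(d_thres, fpss_select(entries, d_thres))
```

With these made-up distances all three MBRs intersect the circle, which mirrors the slide's point that FPSS can fetch more pages than the answer needs.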

  12. Similarity Search Algorithms - CRSS
      Candidate Reduction Criterion: Given a query point $P_q$, a threshold distance $D_{th}$ and a set of MBRs $R = \{R_1, \ldots, R_m\}$, then for an $R_x \in R$:
      • if $D_{th} < D_{min}(P_q, R_x)$, then $R_x$ is rejected.
      • if $D_{th} \geq D_{mm}(P_q, R_x)$, then $R_x$ is set active.
      • if $D_{th} \geq D_{min}(P_q, R_x)$ and $D_{th} < D_{mm}(P_q, R_x)$, then $R_x$ is saved for possible future reference.

  13. Similarity Search Algorithms - CRSS (cont.)
      [Figure: a query point $P$ and three MBRs $R_1$, $R_2$, $R_3$, annotated with the values 10, 5, and 10.]
      MBRs $R_1$ and $R_2$ will become active, and the corresponding pages will be fetched, whereas MBR $R_3$ will be saved as a candidate for future reference, since $D_{th} > D_{min}(P, R_3)$ and $D_{th} < D_{mm}(P, R_3)$.
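The Candidate Reduction Criterion is a three-way classification on two precomputed distances. A minimal sketch, assuming the caller already has D_min and D_mm for the MBR (the function name and labels are hypothetical):

```python
def classify(d_th, d_min_x, d_mm_x):
    """Apply the candidate reduction criterion to one MBR R_x."""
    if d_th < d_min_x:
        return "rejected"   # the query sphere does not reach the MBR
    if d_th >= d_mm_x:
        return "active"     # fetch the page now
    return "candidate"      # d_min_x <= d_th < d_mm_x: keep for possible later use

# Made-up distances for three MBRs under a threshold of 7.
for name, d_min_x, d_mm_x in [("R1", 2.0, 6.0), ("R2", 1.0, 4.0), ("R3", 3.5, 8.0)]:
    print(name, classify(7.0, d_min_x, d_mm_x))
```

Because D_min never exceeds D_mm, the three cases are exhaustive and mutually exclusive.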

  14. Similarity Search Algorithms - CRSS (cont.)
      The CRSS algorithm operates in four modes:
      1. The algorithm operates in ADAPTIVE mode until the leaf-level is reached for the first time. Distance $D_{th}$ is adapted.
      2. Every time the leaf-level is reached, the algorithm passes to UPDATE mode. The $k$ best distances are (possibly) updated.
      3. The NORMAL mode refers to cases where the algorithm operates in an intermediate tree-level, but after the ADAPTIVE mode.
      4. The TERMINATE mode signals that there are no candidate nodes left, and the $k$ NNs have been determined.

  15. Similarity Search Algorithms - CRSS (cont.)
      Important Optimizations:
      • Let ND denote the number of disks, and AN the number of active nodes. If AN > ND, then only ND pages will be fetched. Thanks to the efficiency of the Proximity Index scheme, we anticipate that these nodes are assigned to different disks. The remaining AN − ND nodes are saved as candidates.
      • During the ADAPTIVE mode it is important that the active nodes contain $\geq k$ objects. This guarantees that when the leaf-level is reached for the first time, $\geq k$ distances are available. (In each node, a special field gives the number of objects located under the corresponding subtree.)
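The first optimization caps each fetch round at ND pages. The slide does not say which ND active nodes are chosen, so the sketch below assumes the ones with the smallest D_min go first; that ordering, the tuple layout, and the function name are all assumptions.

```python
def limit_active(active, num_disks):
    """Split active nodes into (fetch now, save as candidates).

    active: list of (name, d_min) pairs; at most num_disks are fetched per round.
    Assumption: nodes closest to the query point (smallest d_min) are fetched first.
    """
    ordered = sorted(active, key=lambda node: node[1])
    return ordered[:num_disks], ordered[num_disks:]

fetch, saved = limit_active([("A", 3.0), ("B", 1.0), ("C", 2.0)], num_disks=2)
print(fetch, saved)
```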

  16. Similarity Search Algorithms - OPTIMAL
      Definition:
      1. A similarity search algorithm is called strict optimal if exactly $k$ objects are inspected when answering a $k$-NN query.
      2. A similarity search algorithm is called weak optimal if the minimum number of pages is retrieved when answering a $k$-NN query.
      Observation: Algorithms BBSS, FPSS and CRSS are neither strict optimal nor weak optimal.

  17. Similarity Search Algorithms - OPTIMAL (cont.)
      We assume a hypothetical weak optimal algorithm (WOPTSS). Let the distance $D_k$ from the query point $P_q$ to its $k$-th nearest neighbor be known in advance. Then, WOPTSS will retrieve the pages that intersect the hypersphere with center $P_q$ and radius $D_k$.
      The number of pages retrieved by this algorithm serves as a lower bound for any similarity search algorithm.
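The WOPTSS lower bound can be computed offline once the true k-th neighbor distance is known. The sketch below is a simplified illustration over a flat set of page MBRs (a real tree would apply the same test at every level); the data layout and function names are hypothetical.

```python
import math

def d_min(q, lo, hi):
    """Minimum distance from point q to the MBR with corners lo and hi."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))

def woptss_pages(q, k, points, mbrs):
    """Pages whose MBR intersects the sphere centered at q with radius D_k.

    points: all data points (used only to obtain the exact D_k in advance).
    mbrs:   dict mapping a page name to its (lo, hi) corners.
    """
    d_k = sorted(math.dist(q, p) for p in points)[k - 1]
    return [name for name, (lo, hi) in mbrs.items() if d_min(q, lo, hi) <= d_k]

# Made-up data: two pages, four points.
points = [(0, 0), (1, 1), (5, 5), (6, 6)]
mbrs = {"A": ((0, 0), (1, 1)), "B": ((5, 5), (6, 6))}
print(woptss_pages((0, 0), 2, points, mbrs))
```

Dividing another algorithm's page count by `len(woptss_pages(...))` gives the kind of normalized figure used in the evaluation plots.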

  18. Performance Evaluation
      The simulation model is depicted below.
      [Figure: simulation model. Labels: new queries, CPU, RAM, DMA, pending disk requests, pending bus requests, I/O bus.]

  19. Performance Evaluation (cont.)
      Set: Gaussian, Population: 80000, Disks: 10, Dimensions: 10
      [Plot: number of accessed nodes (normalized to WOPTSS) vs. nearest neighbors requested (1 - 700), for BBSS, CRSS, and WOPTSS.]

  20. Performance Evaluation (cont.)
      Set: California, Population: 62173, Disks: 10, NNs: 100, Dimensions: 2
      [Plot: mean response time (sec) vs. queries per second (0.1 - 20), for BBSS, FPSS, CRSS, and WOPTSS.]

  21. Performance Evaluation (cont.)
      Set: Gaussian, Population: 50000, Dimensions: 5, NNs: 10
      [Plot: normalized mean response time vs. number of disks (1 - 30), for BBSS, CRSS, and WOPTSS.]
