

Optimal Multi-Step k-Nearest Neighbor Search. Thomas Seidl and Hans-Peter Kriegel, University of Munich, Germany. ACM SIGMOD '98, Seattle.


  1. Optimal Multi-Step k-Nearest Neighbor Search
     Thomas Seidl and Hans-Peter Kriegel, University of Munich, Germany
     ACM SIGMOD '98, Seattle

  2. Survey
     • Similarity search for complex similarity models
     • Analysis of the previous solution for k-nn search
     • An optimality criterion for k-nn search
     • Optimal algorithm for k-nn search
     • Performance analysis
     (c) 1998 Thomas Seidl, SIGMOD '98

  3. Distance-based Similarity Search
     Principle: small distances ↔ strong similarity
     • RangeQuery(q, ε): $\{\, o \in DB : d(o, q) \le \varepsilon \,\}$. Depending on ε, a range query may return no answer or too many answers.
     • k-NearestNeighborQuery(q, k): a monotonous ranking $\{1, \dots, k\} \to DB$ of the k nearest neighbors of q (1st, 2nd, 3rd, 4th, …).

  4. Complex Similarity Models
     • Quadratic Form Distance Functions: $d_A^2(p, q) = (p - q) \cdot A \cdot (p - q)^T$
       – Color histograms for image databases (QBIC): 256-D histograms (Niblack et al. 93) (Hafner et al. 95)
       – Shape similarity for 2D and 3D: up to 4,096-D vectors (Thesis Seidl 97)
       – …
     • Max-Morphological Distance
       – 2D images: tumor shapes (Korn et al. 96)
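The quadratic form distance on slide 4 can be illustrated with a minimal sketch. The 3-bin histograms and both similarity matrices below are hypothetical values chosen for illustration, and `quadratic_form_distance` is a name introduced here, not one from the slides. The matrix A is assumed to be symmetric and positive semi-definite so that the square root is defined.

```python
import math

def quadratic_form_distance(p, q, A):
    """Return d_A(p, q), the square root of (p - q) . A . (p - q)^T."""
    diff = [pi - qi for pi, qi in zip(p, q)]
    total = 0.0
    # The double loop visits every entry of A, so one evaluation costs O(n^2).
    for i, di in enumerate(diff):
        for j, dj in enumerate(diff):
            total += di * A[i][j] * dj
    return math.sqrt(total)

# Hypothetical 3-bin histograms and similarity matrices (illustration only).
p = [1.0, 0.0, 0.0]
q = [0.0, 1.0, 0.0]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
similar_bins = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]

# With the identity matrix the quadratic form is the Euclidean distance.
euclidean = quadratic_form_distance(p, q, identity)
# Treating bins 0 and 1 as similar makes the two histograms closer.
cross_bin = quadratic_form_distance(p, q, similar_bins)
print(euclidean, cross_bin)
```

The quadratic number of multiplications per evaluation is what makes a single distance computation expensive in high dimensions.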

  5. Cost of Single Evaluations
     • Quadratic Form Distance Functions, evaluation time [msec] by dimension:
       21-D: 0.23; 64-D: 0.4; 112-D: 1.1; 256-D: 6.2; 1,024-D: 102; 4,096-D: 1,656
     • Max-Morphological Distance (Korn et al. 96): 12.69 seconds (avg) per distance evaluation

  6. Multi-Step Query Processing
     • Multi-Step Similarity Search
       – Range queries (Faloutsos et al. 94)
       – k-nearest neighbor queries (Korn et al. 96)
     • Pipeline: filter step (index-based) → candidates → refinement step (exact evaluation) → results
     • No false drops? Lower-bounding property: $d_f(p, q) \le d_o(p, q)$, where $d_f$ is the filter distance and $d_o$ is the object distance
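A minimal sketch of the filter-and-refine pipeline on slide 6, shown for a range query. It assumes the Euclidean distance as the object distance and the Euclidean distance on the first two coordinates as the filter distance. This is one simple lower-bounding choice made here for illustration, not the filter used in the talk, and a linear scan stands in for the index. All names and data are hypothetical.

```python
import math

def object_distance(p, q):
    """Exact (expensive) distance d_o; Euclidean distance is assumed here."""
    return math.dist(p, q)

def filter_distance(p, q, m=2):
    """Cheap distance d_f on the first m coordinates; it never exceeds d_o."""
    return math.dist(p[:m], q[:m])

def multi_step_range_query(db, q, eps):
    # Filter step: a linear scan stands in for the index lookup.
    candidates = [o for o in db if filter_distance(o, q) <= eps]
    # Refinement step: exact evaluation only for the candidates.
    results = [o for o in candidates if object_distance(o, q) <= eps]
    return candidates, results

# Hypothetical 4-D objects (illustration only).
db = [
    (0.0, 0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0, 0.0),
    (0.0, 0.0, 3.0, 0.0),  # passes the filter, rejected by refinement
    (2.0, 2.0, 0.0, 0.0),  # rejected by the filter
    (0.5, 0.5, 0.5, 0.5),
]
q = (0.0, 0.0, 0.0, 0.0)
candidates, results = multi_step_range_query(db, q, eps=1.5)
print(len(candidates), len(results))
```

Because the filter distance never exceeds the object distance, every object within ε of the query survives the filter step, so the refinement step sees all true answers.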

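  7. Previous k-nn Algorithm (Korn et al. 96)
     Input: query (q, k)
     • First phase: a k-nn query on the index ($d_f$) yields the primary candidates; $d_{max}$ is the maximum object distance ($d_o$) among these k candidates
     • Second phase: a range query on the index ($d_f$) with radius $d_{max}$ yields ≫ k objects, from which the final k-nn are determined by $d_o$
     • Drawback: $d_{max}$ is fixed in the second phase, so more candidates are generated than necessary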
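The two phases of slide 7 can be sketched as follows. This is an illustrative reading of the slide, not the code of Korn et al.: linear scans replace the index, the Euclidean distance is assumed as $d_o$, a two-coordinate projection is assumed as the lower-bounding $d_f$, and the random data set is hypothetical.

```python
import heapq
import math
import random

def d_o(p, q):
    """Object distance (assumed Euclidean for this sketch)."""
    return math.dist(p, q)

def d_f(p, q, m=2):
    """Filter distance on the first m coordinates; lower-bounds d_o."""
    return math.dist(p[:m], q[:m])

def two_phase_knn(db, q, k):
    # Phase 1: k-nn query under the filter distance gives the primary candidates.
    primary = heapq.nsmallest(k, db, key=lambda o: d_f(o, q))
    # d_max is the largest object distance among the primary candidates.
    d_max = max(d_o(o, q) for o in primary)
    # Phase 2: range query under the filter distance with the fixed radius d_max.
    candidates = [o for o in db if d_f(o, q) <= d_max]
    # Refinement: exact distances decide the final k nearest neighbors.
    result = heapq.nsmallest(k, candidates, key=lambda o: d_o(o, q))
    return result, len(candidates)

# Hypothetical data: 500 random 8-D points, query at the center of the unit cube.
rng = random.Random(0)
db = [tuple(rng.random() for _ in range(8)) for _ in range(500)]
q = tuple(0.5 for _ in range(8))
result, n_candidates = two_phase_knn(db, q, k=5)
print(n_candidates)
```

Since $d_{max}$ is computed once from the primary candidates and never tightened, the second phase usually retrieves many more than k objects for exact evaluation.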

  8. Number of Candidates
     [Figure: object distance and filter distance plotted against the rank according to filter distance (ranks 1 to 49), with the k-th object distance and $d_{max}$ marked.]

  9. Optimality of k-NN Algorithms
     Lemma
     – Let $d_f$ be a lower-bounding filter of $d_o$: $d_f(p, q) \le d_o(p, q)$
     – For a multi-step k-nn algorithm based on $d_o$ and $d_f$, the optimal set of candidates is: $\{\, o \in DB \mid d_f(o, q) \le \varepsilon_k \,\}$
     – where $\varepsilon_k$ is the k-th object similarity distance: $\varepsilon_k = \max \{\, d_o(o, q) \mid o \in NN_q(k) \,\}$

  10. Optimal k-nn Algorithm (new)
      Input: query (q, k)
      1. Initialize the ranking on the index ($d_f$)
      2. while $d_f(o, q) \le d_{max}$ do: get the next o from the index and adjust $d_{max}$ ($d_o$)
      Result: final k-nn with $d_o(o, q) \le d_{max}$
      • $d_{max}$ is adjusted step by step
      • THEOREM: no false drops, no unnecessary candidates
      • Required: incremental ranking on the index (Hjaltason & Samet 95)
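A minimal sketch of the loop on slide 10, under stated assumptions: a binary heap ordered by filter distance stands in for the incremental ranking on an index (Hjaltason & Samet 95), the Euclidean distance is assumed as $d_o$, and a two-coordinate projection is assumed as the lower-bounding $d_f$. The function names and the random data are hypothetical, and the data set is assumed to hold at least k objects.

```python
import heapq
import math
import random

def d_o(p, q):
    """Object distance (assumed Euclidean for this sketch)."""
    return math.dist(p, q)

def d_f(p, q, m=2):
    """Filter distance on the first m coordinates; lower-bounds d_o."""
    return math.dist(p[:m], q[:m])

def optimal_knn(db, q, k):
    # Stand-in for incremental index ranking: a heap ordered by filter distance.
    ranking = [(d_f(o, q), o) for o in db]
    heapq.heapify(ranking)
    best = []          # max-heap (negated distances) of the k best objects so far
    d_max = math.inf   # stays infinite until k objects have been evaluated
    n_candidates = 0
    while ranking:
        filter_dist, o = heapq.heappop(ranking)
        if filter_dist > d_max:
            break      # no remaining object can enter the result
        n_candidates += 1
        dist = d_o(o, q)                  # exact evaluation of the candidate
        if len(best) < k:
            heapq.heappush(best, (-dist, o))
        elif dist < -best[0][0]:
            heapq.heapreplace(best, (-dist, o))
        if len(best) == k:
            d_max = -best[0][0]           # adjust d_max step by step
    result = sorted((-neg, o) for neg, o in best)
    return result, n_candidates

# Hypothetical data: 500 random 8-D points, query at the center of the unit cube.
rng = random.Random(0)
db = [tuple(rng.random() for _ in range(8)) for _ in range(500)]
q = tuple(0.5 for _ in range(8))
result, n_candidates = optimal_knn(db, q, k=5)
print(n_candidates)
```

In this sketch the number of exact evaluations equals the number of objects whose filter distance is at most the k-th object distance $\varepsilon_k$, which is the optimal candidate set stated in the lemma on slide 9.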

  11. Minimal Set of Candidates
      [Figure: object distance, filter distance, primary $d_{max}$ and optimal $d_{max}$ plotted against the rank according to filter distance (ranks 1 to 49).]
      The higher the filter distance, the better the filter selectivity.

  12. Uniformly Distributed Data (20-D)
      Experimental Setting
      • 100,000 objects, 20-D
      • Matrices: sim-id, 1-0, 2-2
      • Queries: k = 10 (0.01%)
      • Index: 15-D
      Number of candidates (previous algorithm / optimal algorithm)
      • sim-id: 26,546 / 370
      • sim-1-0: 42,891 / 358
      • sim-2-2: 71,610 / 1,118
      Overall runtime [sec] (previous algorithm / optimal algorithm)
      • sim-id: 419 / 16
      • sim-1-0: 664 / 14
      • sim-2-2: 1,117 / 48
      Avg. Improvement Factors
      • Candidates: 72, 120, 64
      • Overall Time: 26, 48, 23
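The improvement factors on slide 12 can be checked against the bar values with a short calculation. The pairing of each bar value with an algorithm and a similarity matrix is inferred here from the stated factors, and the bar labels are rounded, so the recomputed ratios only approximate the slide's figures (for example, 664 / 14 gives about 47.4 against the stated 48).

```python
# Bar values read from slide 12 as (previous algorithm, optimal algorithm).
# The pairing is an inference from the stated improvement factors.
candidates = {
    "sim-id": (26546, 370),
    "sim-1-0": (42891, 358),
    "sim-2-2": (71610, 1118),
}
runtime_sec = {
    "sim-id": (419, 16),
    "sim-1-0": (664, 14),
    "sim-2-2": (1117, 48),
}

def improvement_factors(measurements):
    """Return previous / optimal for every similarity matrix."""
    return {name: prev / opt for name, (prev, opt) in measurements.items()}

candidate_factors = improvement_factors(candidates)
runtime_factors = improvement_factors(runtime_sec)
for name in candidates:
    print(f"{name}: candidates x{candidate_factors[name]:.1f}, "
          f"runtime x{runtime_factors[name]:.1f}")
```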

  13. 2-D Shape Similarity (1,024-D)
      Experimental Setting
      • 10,000 images, 32x32 pixels
      • 'Neighborhood Area': 9-1
      • Queries: k = 5 (0.05%)
      • Index (KLT): 16-D, …, 64-D
      Avg. Improvement Factors
      • Candidates: 2.3
      • Overall Time: 1.6 to 2.3
      [Figure: number of candidates and overall runtime [sec] of the previous and the optimal algorithm, by dimension of index (16-D, 32-D, 48-D, 64-D).]

  14. Color Histograms (112-D)
      Experimental Setting
      • 112,700 histograms (112-D)
      • Quadratic Form Distance
      • Queries: k = 2, …, 12 (0.01%)
      • Index (KLT): 12-D
      Avg. Improvement Factors
      • Candidates: 17
      • Overall Time: 8.5
      [Figure: number of candidates and overall runtime [sec] of the previous and the optimal algorithm, by query parameter k (2 to 12).]

  15. Conclusions
      • Complex Similarity Search: expensive similarity evaluations
      • Multi-Step Approach: lower-bounding filter distance function
      • Optimal Algorithm: minimum number of exact evaluations
      • Average Improvement Factors:
        – up to 120 (number of candidates)
        – up to 48 (overall runtime)
      • Future Work: new applications; integration with data mining
