BeyOND Unleashing BOND Thomas Bernecker, Franz Graf, Hans-Peter - PowerPoint PPT Presentation


  1. BeyOND – Unleashing BOND
     DBRank 2011, August 29, 2011, Seattle, WA
     Thomas Bernecker, Franz Graf, Hans-Peter Kriegel, Christian Moennig and Arthur Zimek
     Ludwig-Maximilians-Universität München (LMU), Munich, Germany
     Department Institute for Informatics, Database Systems Group
     http://www.dbs.ifi.lmu.de
     {bernecker, graf, kriegel, zimek}@dbs.ifi.lmu.de
     moennig@cip.ifi.lmu.de

  2. Outline
     1. Background
        – Motivation: k-nearest neighbor search in high-dimensional databases
        – BOND revisited
     2. Introducing BeyOND
        – Filtering objects via distance approximations
        – Sub Cubes, MBRs
     3. Experimental Evaluation
     4. Conclusions

  3. Motivation
     • Similarity search in high-dimensional space is
       ☺ important in cases of images, e-commerce, etc.
       ☹ slow
     • The suitability of index-based solutions depends on the data distribution
     • Open question: relevant vs. irrelevant attributes
     • Similarity search in subspaces:
       – Fix query attributes beforehand
       – Use multiple pivot points to derive upper and lower bounds
       – Process data vertically to reduce the high-dimensional space

  4. BOND Revisited (1)
     • BOND [1]: k-nearest neighbor search on high-dimensional data
       – Resolves feature vectors (FVs) column-wise
       – Ranking of columns w.r.t. relevance
       – Pruning of columns using a branch-and-bound approach
       – Resolved part is known exactly
       – Unresolved part has to be approximated
       – Resolving stops when the approximation is "good enough"
       – Support of subspace queries
       – Distance metrics:
         • Histogram intersection (uncorrelated dimensions)
         • Euclidean distance
     [1] de Vries, Mamoulis, Nes, Kersten: Efficient k-NN Search On Vertically Decomposed Data (SIGMOD'02)

  5. BOND Revisited (2)
     • Restrictions of BOND:
       1. The approach works only on Zipfian distributed data.
       2. The feature values are normalized to [0,1] in each dimension.
       3. The proposed bounds are loose. The validity of stricter bounds (Bond advanced) depends on a certain resolve order of the columns.

  6. BOND Revisited (3)
     • Notation:
       – query vector $q$, database vector $v$
       – Splitting of $v$: resolved part $v^-$, unresolved part $v^+$ ⇒ $v = v^- \cup v^+$
     • Approximated distance:
       $$S_{approx}(q, v) = S_1(q^-, v^-) + S_2(q^+, v^+)$$
       – Resolved part:
         $$S_1(q^-, v^-) = \sum_i (q_i^- - v_i^-)^2$$
       – Unresolved part:
         $$S_2(q^+, v^+) = \sum_i \max\{q_i^+,\, 1 - q_i^+\}^2 \ge S_1(q^+, v^+)$$
     • Distance bounds:
       $$S_{upper}(q, v) = S_1(q^-, v^-) + S_2(q^+, v^+) \ge S_1(q, v)$$
       $$S_{lower}(q, v) = S_1(q^-, v^-) + 0 \le S_1(q, v)$$
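The bounds on this slide can be tried out in a few lines of Python. The following is only a minimal sketch, not the authors' implementation: it assumes squared Euclidean distance, data normalized to [0,1], and a plain left-to-right resolve order instead of BOND's relevance ranking of columns. The function names `bond_bounds` and `knn_column_wise` are hypothetical.

```python
import numpy as np

def bond_bounds(q, v, resolved):
    """Return (S_lower, S_upper) for the squared Euclidean distance.

    q, v     : 1-D arrays with values in [0, 1]
    resolved : boolean mask, True for dimensions already read (v^-)
    """
    q = np.asarray(q, dtype=float)
    v = np.asarray(v, dtype=float)
    resolved = np.asarray(resolved, dtype=bool)
    s1 = np.sum((q[resolved] - v[resolved]) ** 2)       # exact part S_1(q^-, v^-)
    q_plus = q[~resolved]
    s2 = np.sum(np.maximum(q_plus, 1.0 - q_plus) ** 2)  # worst case S_2(q^+, v^+)
    return s1, s1 + s2

def knn_column_wise(q, data, k):
    """k-NN that resolves one column at a time and prunes with the bounds.

    Assumption: columns are resolved left to right (no relevance ranking).
    """
    q = np.asarray(q, dtype=float)
    data = np.asarray(data, dtype=float)
    n, d = data.shape
    s1 = np.zeros(n)                      # resolved part per vector
    alive = np.ones(n, dtype=bool)        # candidates not yet pruned
    worst = np.maximum(q, 1.0 - q) ** 2   # per-dimension worst case
    for j in range(d):
        s1[alive] += (q[j] - data[alive, j]) ** 2
        upper = s1 + worst[j + 1:].sum()  # S_upper; S_lower is s1 itself
        # A vector whose lower bound exceeds the k-th smallest upper bound
        # cannot be among the k nearest neighbors.
        kth_upper = np.partition(upper[alive], k - 1)[k - 1]
        alive &= s1 <= kth_upper
    idx = np.flatnonzero(alive)
    return idx[np.argsort(s1[idx])[:k]]
```

Once all columns are resolved, the unresolved part is empty, both bounds coincide with the exact distance, and the surviving candidates are simply sorted.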

  7. Beyond BOND
     • Benefits of BeyOND:
       1. Independence of the data distribution. ☺
       2. No restriction to a normalized data space. ☺
       3. No specific resolve order of the dimensions is needed. ☺
       ⇒ Price: The distance approximations are no longer suitable! ☹
     • Solution: Combining the idea of BOND with well-known techniques:
       – VA-file (data space partitioning)
       – MBR (Minimum Bounding Rectangle) approximation (data organizing)
       ⇒ Remaining restriction: minimum/maximum values for each dimension need to be known ☹

  8. Sub Cubes (1)
     • First extension: VA-file [2] with one split
       ⇒ $2^d$ sub cubes
       ⇒ Addressing via Z-IDs
       ⇒ Improved bounds based on the close / far sub cube borders $c^{lower}_{v_i}$ and $c^{upper}_{v_i}$
     • Memory-efficient representation (8 bytes → 1 bit)
       – Sub cube need not be kept in main memory
     • Split positions stored in one separate array per dimension
     • Dependence on split level:
       – FV: 8 bytes per dimension
       – s splits: s / 8 bytes (s bits) per dimension
     [2] Weber, Schek, Blott: A Quantitative Analysis and Performance Study for Similarity Search Methods in High-Dimensional Spaces (VLDB'98)
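As a rough illustration of the one-split case, the sketch below derives one bit per dimension, packs the bits into an integer Z-ID, and recovers the sub cube borders from the bits. The slide does not specify the bit layout, so the packing order, the `>=` convention at the split position, and the names `sub_cube_bits`, `pack_z_id` and `sub_cube_borders` are all assumptions made for illustration.

```python
import numpy as np

def sub_cube_bits(v, split):
    """One bit per dimension: 1 if v_i lies at or above the split position."""
    return (np.asarray(v, dtype=float) >= np.asarray(split, dtype=float)).astype(np.uint8)

def pack_z_id(bits):
    """Pack the per-dimension bits into one integer (first dimension = highest bit)."""
    z = 0
    for b in bits:
        z = (z << 1) | int(b)
    return z

def sub_cube_borders(bits, lo, split, hi):
    """Recover c_lower / c_upper per dimension from the bits.

    lo, hi : known minimum / maximum value of each dimension
    split  : split position of each dimension
    """
    bits = np.asarray(bits)
    c_lower = np.where(bits == 1, split, lo)
    c_upper = np.where(bits == 1, hi, split)
    return c_lower, c_upper
```

For example, with `lo = 0`, `hi = 1` and a split at 0.5 in every dimension, the vector `[0.2, 0.7, 0.5]` maps to the bits `0, 1, 1`, so each 8-byte coordinate is replaced by a single bit.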

  9. Sub Cubes (2)
     • Old distance bounds:
       $$S_{upper}(q, v) = S_1(q^-, v^-) + \sum_i \max\{q_i^+,\, 1 - q_i^+\}^2$$
       $$S_{lower}(q, v) = S_1(q^-, v^-) + 0$$
     • Approximations of unresolved dimensions:
       $$S_2'(q^+, v^+) = \sum_i \max\{(q_i^+ - c^{lower}_{v_i})^2,\, (q_i^+ - c^{upper}_{v_i})^2\}$$
       $$S_2''(q^+, v^+) = \sum_i \begin{cases} 0 & \text{if } q_i^+ \in [c^{lower}_{v_i}, c^{upper}_{v_i}] \\ \min\{(q_i^+ - c^{lower}_{v_i})^2,\, (q_i^+ - c^{upper}_{v_i})^2\} & \text{else} \end{cases}$$
     • New distance bounds:
       $$S_{upper}(q, v) = S_1(q^-, v^-) + S_2'(q^+, v^+) \ge S_1(q, v)$$
       $$S_{lower}(q, v) = S_1(q^-, v^-) + S_2''(q^+, v^+) \le S_1(q, v)$$
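The new bounds can be checked numerically with a short sketch. It assumes squared Euclidean distance and that the sub cube borders of the vector are already known per dimension; the function name `beyond_bounds` and its parameters are hypothetical.

```python
import numpy as np

def beyond_bounds(q, v, resolved, c_lower, c_upper):
    """Return (S_lower, S_upper) using sub cube borders for unresolved dimensions."""
    q = np.asarray(q, dtype=float)
    v = np.asarray(v, dtype=float)
    resolved = np.asarray(resolved, dtype=bool)
    s1 = np.sum((q[resolved] - v[resolved]) ** 2)   # exact part S_1(q^-, v^-)

    qp = q[~resolved]
    d_lo = (qp - np.asarray(c_lower, dtype=float)[~resolved]) ** 2
    d_hi = (qp - np.asarray(c_upper, dtype=float)[~resolved]) ** 2
    inside = (np.asarray(c_lower)[~resolved] <= qp) & (qp <= np.asarray(c_upper)[~resolved])

    far = np.maximum(d_lo, d_hi)                             # terms of S_2'
    near = np.where(inside, 0.0, np.minimum(d_lo, d_hi))     # terms of S_2''
    return s1 + near.sum(), s1 + far.sum()

# Worked example: first dimension resolved, the other two approximated.
q = [0.9, 0.1, 0.3]
v = [0.8, 0.4, 0.6]
lower, upper = beyond_bounds(q, v, [True, False, False],
                             c_lower=[0.5, 0.0, 0.5], c_upper=[1.0, 0.5, 1.0])
# lower is about 0.05, upper about 0.66; the exact squared distance is 0.19.
```

Unlike the old lower bound, the new one is non-zero as soon as the query lies outside the vector's sub cube in some unresolved dimension.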

  10. MBR Caching (1)
     • Most sub cubes are (very) sparse, i.e. occupied by at most one FV
     • Dense sub cubes allow a tighter approximation via MBRs
       – Restrict the number of MBRs in order to avoid a memory overhead
       – Ranking function for MBRs:
         $$f(MBR) = card(MBR) \cdot \frac{V_{sub\,cube}}{V_{MBR}}$$
       – 8 byte coordinates: memory increase is limited to $\frac{d \cdot 16}{card(MBR)}$ bytes per feature vector (+ pointer to Z-ID)
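A small sketch of the ranking function and the memory estimate, assuming the ranking reads f(MBR) = card(MBR) · V_sub cube / V_MBR and that an MBR stores two corners with 8-byte coordinates. The names `mbr_rank` and `mbr_bytes_per_vector` are hypothetical, and returning infinity for a zero-volume MBR is a choice made here, not something the slide states.

```python
import numpy as np

def mbr_rank(points, cube_lower, cube_upper):
    """Rank an MBR by cardinality times (sub cube volume / MBR volume)."""
    points = np.asarray(points, dtype=float)
    mbr_lower, mbr_upper = points.min(axis=0), points.max(axis=0)
    v_cube = np.prod(np.asarray(cube_upper, dtype=float) - np.asarray(cube_lower, dtype=float))
    v_mbr = np.prod(mbr_upper - mbr_lower)
    if v_mbr == 0.0:
        # Assumption: a degenerate MBR is treated as maximally tight.
        return float("inf")
    return len(points) * v_cube / v_mbr

def mbr_bytes_per_vector(d, card):
    """Two corners x d dimensions x 8 bytes, shared by card feature vectors."""
    return 16 * d / card
```

Two points at (0.1, 0.1) and (0.3, 0.2) inside the sub cube [0, 0.5] × [0, 0.5] give a rank of 2 · 0.25 / 0.02 = 25, and an MBR over four 27-dimensional vectors costs 108 additional bytes per vector.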

  11. MBR Caching (2)
     • Limit the number of MBRs to 1% of the database size
     • Threshold as a trade-off between pruning power and additional memory consumption
     • Requirements:
       – Either all MBRs can be kept in memory,
       – or the time for loading the MBRs is less than the time for resolving the respective FVs.
     • Adaptation of the equations for lower and upper bounds
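The 1% limit amounts to keeping only the best-ranked MBRs. The sketch below assumes each candidate MBR is given as a pair of a sub cube identifier and a rank value; the function name `select_mbrs` and the `fraction` parameter are hypothetical.

```python
def select_mbrs(ranked, db_size, fraction=0.01):
    """Keep the highest-ranked MBRs, at most fraction * db_size of them.

    ranked : list of (z_id, rank) pairs, one per candidate MBR
    """
    budget = int(db_size * fraction)
    best = sorted(ranked, key=lambda item: item[1], reverse=True)
    return [z_id for z_id, _ in best[:budget]]

# With 200 feature vectors, the budget is 2 MBRs.
kept = select_mbrs([(5, 3.0), (9, 25.0), (2, 10.0)], db_size=200)
```

Raising `fraction` trades additional memory for pruning power, which is the threshold trade-off named on the slide.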

  12. Experimental Evaluation (1)
     • Evaluated approaches:
       1. BondAdvanced (stricter bounds, but resolve order dependent)
       2. Bond (original bounds)*
       3. Sequential*
       4. Beyond-1 (1 split)
       5. BeyondMBR-1 (1 split + MBRs)
       6. Beyond-2
       7. BeyondMBR-2
       8. Beyond-3*
       9. BeyondMBR-3*

  13. Experimental Evaluation (2)
     • Data set descriptions:

       Data Set    Dims   Size      Type
       ALOI        27     110,250   Color Histograms, Zipfian
       CLUSTERED   20     500,000   Synthetic, 50 Clusters, Gaussian
       PHOG [3]    110    10,715    CT Histograms, PCA'ed
       SIFT [4]    133    335,583   Image Features

     [3] Graf, Kriegel, Schubert, Poelsterl, Cavallaro: 2D Image Registration in CT Images Using Radial Image Descriptors (MICCAI'11)
     [4] Lowe: Distinctive Image Features from Scale-Invariant Keypoints (Int. Journal of Computer Vision, 2004)

  14. Experimental Evaluation (3)
     • Experimental settings:
       – 50 k-nearest neighbor queries
       – k = 10
       – Averaged cumulative number of pruned FVs after resolving a column
       – AUC: data not resolved
       – AOC: data resolved for refinement

  15. Experimental Evaluation (4)
     [Plot: ALOI (27 dims, 110,250 FVs; Color Histograms, Zipfian). Curves: BondAdvanced, Bond, Beyond-2, Beyond-1, BeyondMBR-1]

  16. Experimental Evaluation (5)
     [Plot: CLUSTERED (20 dims, 500,000 FVs; Synthetic, 50 Clusters, Gaussian). Curves: BondAdvanced, Bond, Beyond-2, Beyond-1, BeyondMBR-1]

  17. Experimental Evaluation (6)
     [Plot: PHOG (110 dims, 10,715 FVs; CT Histograms, PCA'ed). Curves: BondAdvanced, Bond, Beyond-2, BeyondMBR-1, Beyond-1]
