SIMILARITY SEARCH
The Metric Space Approach
Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, Michal Batko

Table of Contents
Part I: Metric searching in a nutshell
• Foundations of metric space searching
• Survey of existing approaches
Part II: Metric searching in large collections
• Centralized index structures
• Approximate similarity search
• Parallel and distributed indexes
P. Zezula, G. Amato, V. Dohnal, M. Batko: Similarity Search: The Metric Space Approach, Part I, Chapter 2

Survey of existing approaches
1. ball partitioning methods
2. generalized hyper-plane partitioning approaches
3. exploiting pre-computed distances
4. hybrid indexing approaches
5. approximated techniques

Survey of existing approaches
1. ball partitioning methods
   1. Burkhard-Keller Tree
   2. Fixed Queries Tree
   3. Fixed Queries Array
   4. Vantage Point Tree
      1. Multi-Way Vantage Point Tree
   5. Excluded Middle Vantage Point Forest
2. generalized hyper-plane partitioning approaches
3. exploiting pre-computed distances
4. hybrid indexing approaches
5. approximated techniques

Burkhard-Keller Tree (BKT) [BK73]
• Applicable to discrete distance functions only
• Recursively divides a given dataset $X$
• Choose an arbitrary point $p_j \in X$ and form the subsets $X_i = \{ o \in X,\ d(o, p_j) = i \}$ for each distance $i \ge 0$.
• For each $X_i$, create a sub-tree of $p_j$
  • empty subsets are ignored
[Figure: pivot $p_j$ with the subsets $X_2$, $X_3$, $X_4$.]

BKT: Range Query
Given a query $R(q, r)$:
• traverse the tree starting from the root
• in each internal node $p_j$, do:
  • if $d(q, p_j) \le r$, report $p_j$ on output
  • for each $i$ with $\max\{ d(q, p_j) - r,\ 0 \} \le i \le d(q, p_j) + r$, enter the child $i$
[Figure: example range query on a BKT with pivots $p_1$, $p_2$, $p_3$.]
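The construction and range query described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `BKNode`, `bkt_build`, `bkt_range` and `abs_diff` are hypothetical, the "arbitrary" pivot is simply the first object of each subset, and the absolute difference of integers stands in for a discrete metric.

```python
from typing import Callable, Dict, List, Optional


class BKNode:
    """One BKT node: a pivot plus one child per observed distance value."""

    def __init__(self, pivot):
        self.pivot = pivot
        self.children: Dict[int, "BKNode"] = {}  # i -> sub-tree built from X_i


def bkt_build(objects: List, dist: Callable) -> Optional[BKNode]:
    """Recursively build a BKT; the first object stands in for the arbitrary pivot."""
    if not objects:
        return None
    node = BKNode(objects[0])
    subsets: Dict[int, List] = {}
    for o in objects[1:]:
        subsets.setdefault(dist(o, node.pivot), []).append(o)
    for i, subset in subsets.items():  # only non-empty subsets are ever created
        node.children[i] = bkt_build(subset, dist)
    return node


def bkt_range(node: Optional[BKNode], q, r, dist: Callable, out=None) -> List:
    """Collect all objects o with dist(q, o) <= r."""
    if out is None:
        out = []
    if node is None:
        return out
    d = dist(q, node.pivot)
    if d <= r:
        out.append(node.pivot)
    for i, child in node.children.items():
        # Enter child i only if max(d - r, 0) <= i <= d + r.
        if max(d - r, 0) <= i <= d + r:
            bkt_range(child, q, r, dist, out)
    return out


def abs_diff(a: int, b: int) -> int:
    """A simple discrete metric on integers, used only for illustration."""
    return abs(a - b)
```

For example, `bkt_range(bkt_build([10, 3, 7, 15, 12, 8, 1, 20], abs_diff), 9, 2, abs_diff)` visits only the children whose distance label falls inside the allowed interval and returns the objects 10, 7 and 8 in some order.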

Fixed Queries Tree (FQT)
• modification of BKT
• each level has a single pivot
• all objects are stored in leaves
• distance computations are saved during search
  • usually more branches are accessed, but all nodes of a level share the same pivot, so each level costs only one distance computation
[Figure: example range query on an FQT with pivots $p_1$, $p_2$.]
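A rough sketch of the idea, under stated assumptions: the pivots are supplied as a list with one pivot per level, internal nodes are plain dictionaries keyed by distance, leaves are lists, and the `bucket` parameter (leaf capacity) is hypothetical. The function names are illustrative only.

```python
from typing import Callable, Dict, List, Union

Node = Union[Dict[int, "Node"], List]


def fqt_build(objects: List, pivots: List, dist: Callable,
              level: int = 0, bucket: int = 1) -> Node:
    """Build an FQT: level `level` always partitions by pivots[level]."""
    if len(objects) <= bucket or level == len(pivots):
        return list(objects)  # leaf: objects are stored only here
    groups: Dict[int, List] = {}
    for o in objects:
        groups.setdefault(dist(o, pivots[level]), []).append(o)
    return {i: fqt_build(g, pivots, dist, level + 1, bucket)
            for i, g in groups.items()}


def fqt_range(tree: Node, pivots: List, q, r, dist: Callable) -> List:
    """Range query; d(q, pivot) is computed at most once per level."""
    q_dists: Dict[int, int] = {}  # level -> cached d(q, pivots[level])
    out: List = []

    def pivot_distance(level: int):
        if level not in q_dists:
            q_dists[level] = dist(q, pivots[level])
        return q_dists[level]

    def visit(node: Node, level: int) -> None:
        if isinstance(node, list):
            # Leaf: candidates are verified against the query directly.
            out.extend(o for o in node if dist(q, o) <= r)
            return
        d = pivot_distance(level)
        for i, child in node.items():
            if max(d - r, 0) <= i <= d + r:
                visit(child, level + 1)

    visit(tree, 0)
    return out
```

However many branches are entered on a level, the cached value in `q_dists` is reused, which is the saving the slide refers to.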

Fixed-Height FQT (FHFQT)
• extension of FQT
• all leaf nodes are at the same level
• increased filtering using more routing objects
• extended tree depth does not typically introduce further computations
[Figure: the same data indexed by an FQT and by an FHFQT.]

Fixed Queries Array (FQA)
• based on FHFQT
• an $h$-level tree is transformed into an array of paths
  • every leaf node is represented by its path from the root node
  • each path is encoded as $h$ distance values
• the search algorithm turns into a binary search in array intervals
[Figure: an FHFQT and the corresponding array of paths.]
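One way to picture the array form is the sketch below. It assumes integer-valued distances, stores each object next to its tuple of $h$ pivot distances, and sorts the rows lexicographically so that every tree node corresponds to a contiguous interval. The names `fqa_build` and `fqa_range` are hypothetical, and the bit-level encoding of a real FQA is not modelled.

```python
import math
from bisect import bisect_left
from typing import Callable, List, Tuple


def fqa_build(objects: List, pivots: List, dist: Callable) -> List[Tuple]:
    """Encode every object as its path of h distances and sort the paths."""
    rows = [(tuple(dist(o, p) for p in pivots), o) for o in objects]
    rows.sort(key=lambda row: row[0])
    return rows


def fqa_range(rows: List[Tuple], pivots: List, q, r, dist: Callable) -> List:
    """Range query by repeated binary search inside array intervals."""
    keys = [path for path, _ in rows]
    q_dists = [dist(q, p) for p in pivots]
    out: List = []

    def visit(lo: int, hi: int, level: int) -> None:
        if lo >= hi:
            return
        if level == len(pivots):
            # All h levels passed: verify the remaining candidates directly.
            out.extend(o for _, o in rows[lo:hi] if dist(q, o) <= r)
            return
        prefix = keys[lo][:level]  # shared by every row in [lo, hi)
        d = q_dists[level]
        # Narrow to rows whose value at this level lies in [d - r, d + r].
        start = bisect_left(keys, prefix + (max(d - r, 0),), lo, hi)
        end = bisect_left(keys, prefix + (math.floor(d + r) + 1,), lo, hi)
        pos = start
        while pos < end:
            value = keys[pos][level]
            nxt = bisect_left(keys, prefix + (value + 1,), pos, end)
            visit(pos, nxt, level + 1)  # one interval per distinct value
            pos = nxt

    visit(0, len(rows), 0)
    return out
```

Each call to `visit` plays the role of one FHFQT node: the children it would enter are found with `bisect_left` rather than by following pointers.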

Vantage Point Tree (VPT)
• uses ball partitioning
• recursively divides a given data set $X$
• choose a vantage point $p \in X$ and compute the median $m$ of the distances from $p$ to the other objects
  • $S_1 = \{ x \in X - \{p\} \mid d(x, p) \le m \}$
  • $S_2 = \{ x \in X - \{p\} \mid d(x, p) \ge m \}$
  • the equality sign in both definitions ensures balancing
[Figure: ball partitioning around $p_1$ with median $m_1$, then around $p_2$ with median $m_2$ into $S_{1,1}$ and $S_{1,2}$.]

VPT (cont.)
• One or more objects can be accommodated in leaves.
• The VP tree is a balanced binary tree.
• Static structure
• Pivots $p_1$, $p_2$ and $p_3$ belong to the database!
• In the following, we assume just one object in a leaf.
[Figure: a VP tree with internal nodes $(p_1, m_1)$, $(p_2, m_2)$, $(p_3, m_3)$ and objects $o_1$ to $o_{12}$.]

VPT: Range Search
Given a query $R(q, r)$:
• traverse the tree starting from its root
• in each internal node $(p_i, m_i)$, do:
  • if $d(q, p_i) \le r$, report $p_i$ on output
  • if $d(q, p_i) - r \le m_i$, search the left sub-tree (cases a, b)
  • if $d(q, p_i) + r \ge m_i$, search the right sub-tree (case b)
[Figure: (a) the query region lies inside the ball of radius $m_i$ around $p_i$; (b) the query region crosses its boundary.]
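The build and range-search rules above can be sketched as follows. Assumptions for illustration: a node is a tuple `(pivot, m, left, right)`, the vantage point is simply the first object, each node holds exactly one object, and Euclidean distance via `math.dist` serves as the metric. The function names are hypothetical.

```python
import math
from typing import Callable, List, Optional, Tuple

VPNode = Optional[Tuple]  # (pivot, m, left, right); m is None for a leaf


def vpt_build(objects: List, dist: Callable) -> VPNode:
    """Balanced VP tree: the closer half goes left, the farther half right."""
    if not objects:
        return None
    pivot = objects[0]
    ranked = sorted(((dist(o, pivot), o) for o in objects[1:]),
                    key=lambda t: t[0])
    if not ranked:
        return (pivot, None, None, None)
    half = (len(ranked) + 1) // 2
    m = ranked[half - 1][0]  # left holds d <= m, right holds d >= m
    left = [o for _, o in ranked[:half]]
    right = [o for _, o in ranked[half:]]
    return (pivot, m, vpt_build(left, dist), vpt_build(right, dist))


def vpt_range(node: VPNode, q, r, dist: Callable, out=None) -> List:
    """Collect all objects o with dist(q, o) <= r."""
    if out is None:
        out = []
    if node is None:
        return out
    pivot, m, left, right = node
    d = dist(q, pivot)
    if d <= r:
        out.append(pivot)
    if m is not None:
        if d - r <= m:   # query ball reaches inside the median ball
            vpt_range(left, q, r, dist, out)
        if d + r >= m:   # query ball reaches outside the median ball
            vpt_range(right, q, r, dist, out)
    return out
```

Splitting the sorted distances in half is what makes ties at distance $m$ harmless: an object at exactly distance $m$ may land on either side, and both search conditions still cover it.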

VPT: k-NN Search
Given a query $NN(q)$:
• initialization: $d_{NN} = d_{max}$, $NN = nil$
• traverse the tree starting from its root
• in each internal node $(p_i, m_i)$, do:
  • if $d(q, p_i) \le d_{NN}$, set $d_{NN} = d(q, p_i)$, $NN = p_i$
  • if $d(q, p_i) - d_{NN} \le m_i$, search the left sub-tree
  • if $d(q, p_i) + d_{NN} \ge m_i$, search the right sub-tree
• k-NN search only requires the arrays $d_{NN}[k]$ and $NN[k]$
  • The arrays are kept ordered with respect to the distance to $q$.
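A possible sketch of the k-NN variant, assuming the same tuple-based tree layout `(pivot, m, left, right)` with one object per node. The ordered arrays $d_{NN}[k]$ and $NN[k]$ are merged into a single sorted list `best`, and all names are illustrative rather than taken from the source.

```python
import math
from bisect import insort
from typing import Callable, List, Optional, Tuple

VPNode = Optional[Tuple]  # (pivot, m, left, right); m is None for a leaf


def vpt_build(objects: List, dist: Callable) -> VPNode:
    """Balanced VP tree with the first object as vantage point."""
    if not objects:
        return None
    pivot = objects[0]
    ranked = sorted(((dist(o, pivot), o) for o in objects[1:]),
                    key=lambda t: t[0])
    if not ranked:
        return (pivot, None, None, None)
    half = (len(ranked) + 1) // 2
    m = ranked[half - 1][0]
    left = [o for _, o in ranked[:half]]
    right = [o for _, o in ranked[half:]]
    return (pivot, m, vpt_build(left, dist), vpt_build(right, dist))


def vpt_knn(root: VPNode, q, k: int, dist: Callable) -> List[Tuple]:
    """Return up to k (distance, object) pairs, nearest first."""
    best: List[Tuple] = []  # sorted (distance, insertion order, object)
    counter = 0             # tie-breaker so objects are never compared

    def d_nn() -> float:
        # Current search radius: distance of the k-th candidate, else d_max.
        return best[-1][0] if len(best) == k else math.inf

    def visit(node: VPNode) -> None:
        nonlocal counter
        if node is None:
            return
        pivot, m, left, right = node
        d = dist(q, pivot)
        if d <= d_nn():
            insort(best, (d, counter, pivot))
            counter += 1
            del best[k:]  # keep only the k closest candidates
        if m is None:
            return
        if d - d_nn() <= m:
            visit(left)
        if d + d_nn() >= m:
            visit(right)

    visit(root)
    return [(d, o) for d, _, o in best]
```

The radius `d_nn()` shrinks as better candidates are found, so sub-trees visited later are pruned more aggressively than those visited first.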

Multi-Way Vantage Point Tree
• inherits all principles from VPT
  • but partitioning is modified
• $m$-ary balanced tree
• applies multi-way ball partitioning
[Figure: pivot $p_1$ with boundaries $m_1$, $m_2$, $m_3$ and partitions $S_{1,1}$ to $S_{1,4}$.]
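Multi-way ball partitioning for a single node might look like the sketch below. It assumes the pivot has already been removed from `objects` and that at least $m$ objects remain; the function name and the choice of equal-sized groups taken from the sorted distances are illustrative assumptions.

```python
from typing import Callable, List, Tuple


def multiway_ball_partition(objects: List, pivot, m: int,
                            dist: Callable) -> Tuple[List, List[List]]:
    """Split `objects` into m balanced shells around `pivot`.

    Returns (cutoffs, partitions): m - 1 boundary distances and m groups
    ordered from the innermost to the outermost shell.
    """
    if len(objects) < m:
        raise ValueError("need at least m objects to fill m partitions")
    ranked = sorted(((dist(o, pivot), o) for o in objects),
                    key=lambda t: t[0])
    n = len(ranked)
    bounds = [i * n // m for i in range(m + 1)]  # balanced slice borders
    partitions = [[o for _, o in ranked[bounds[i]:bounds[i + 1]]]
                  for i in range(m)]
    # The cutoff of shell i is the largest distance that still falls in it.
    cutoffs = [ranked[bounds[i + 1] - 1][0] for i in range(m - 1)]
    return cutoffs, partitions
```

With $m = 2$ this reduces to the binary split of the VPT, where the single cutoff plays the role of the median.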

Vantage Point Forest (VPF)
• a forest of binary trees
• uses excluded middle partitioning
• the middle area is excluded from the process of tree building
[Figure: excluded middle area of width $2\rho$ around the median $m_i$ of pivot $p_i$.]

VPF (cont.)
• the given data set $X$ is recursively divided and a binary tree is built
• the excluded middle areas are used for building another binary tree
[Figure: the first tree over $X$ with pivots $p_1$, $p_2$, $p_3$ and excluded sets $M_1$, $M_2$, $M_3$; the second tree over $M_1 + M_2 + M_3$ with pivots $p'_1$, $p'_2$, $p'_3$ and excluded sets $M'_1$, $M'_2$, $M'_3$.]

VPF: Range Search
Given a query $R(q, r)$, where $2\rho$ is the width of the excluded middle area:
• start with the first tree
• traverse the tree starting from its root
• in each internal node $(p_i, m_i)$, do:
  • if $d(q, p_i) \le r$, report $p_i$
  • if $d(q, p_i) - r \le m_i - \rho$, search the left sub-tree
    • if also $d(q, p_i) + r \ge m_i - \rho$, search the next tree!
  • if $d(q, p_i) + r \ge m_i + \rho$, search the right sub-tree
    • if also $d(q, p_i) - r \le m_i + \rho$, search the next tree!
  • if $d(q, p_i) - r \ge m_i - \rho$ and $d(q, p_i) + r \le m_i + \rho$, search only the next tree!

VPF: Range Search (cont.)
• Query intersects all partitions
  • Search both sub-trees
  • Search the next tree
• Query collides only with the exclusion
  • Search just the next tree
[Figure: the two cases for a query $q$ with radius $r$, pivot $p_i$, median $m_i$ and an excluded area of width $2\rho$.]
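A compact sketch of the forest and its range search, under these assumptions: `rho` is half the width of the excluded middle area, a node is a tuple `(pivot, m, left, right)`, objects at distance at most `m - rho` go left, objects farther than `m + rho` go right, and everything in between is deferred to the next tree. The three "search the next tree" cases collapse into one test of whether the query ball touches the excluded area. All names are hypothetical.

```python
import math
import statistics
from typing import Callable, List, Optional, Tuple

Node = Optional[Tuple]  # (pivot, m, left, right); m is None for a leaf


def _build_tree(objects: List, rho: float, dist: Callable,
                excluded: List) -> Node:
    """Build one tree; objects falling into the middle go to `excluded`."""
    if not objects:
        return None
    pivot = objects[0]
    scored = [(dist(o, pivot), o) for o in objects[1:]]
    if not scored:
        return (pivot, None, None, None)
    m = statistics.median(d for d, _ in scored)
    left = [o for d, o in scored if d <= m - rho]
    right = [o for d, o in scored if d > m + rho]
    excluded.extend(o for d, o in scored if m - rho < d <= m + rho)
    return (pivot, m, _build_tree(left, rho, dist, excluded),
            _build_tree(right, rho, dist, excluded))


def vpf_build(objects: List, rho: float, dist: Callable) -> List[Node]:
    """Build trees until no excluded objects remain."""
    forest, pending = [], list(objects)
    while pending:
        excluded: List = []
        forest.append(_build_tree(pending, rho, dist, excluded))
        pending = excluded  # strictly smaller: the root pivot was consumed
    return forest


def vpf_range(forest: List[Node], q, r, rho: float, dist: Callable) -> List:
    """Range query; later trees are searched only when the middle is touched."""
    out: List = []
    for tree in forest:
        touched_middle = False
        stack = [tree]
        while stack:
            node = stack.pop()
            if node is None:
                continue
            pivot, m, left, right = node
            d = dist(q, pivot)
            if d <= r:
                out.append(pivot)
            if m is None:
                continue
            if d - r <= m - rho:
                stack.append(left)
            if d + r >= m + rho:
                stack.append(right)
            if d + r >= m - rho and d - r <= m + rho:
                touched_middle = True  # query ball meets the exclusion zone
        if not touched_middle:
            break
    return out
```

When the query radius is small relative to `rho`, a query ball cannot reach both sub-trees of a node, which is the situation the excluded middle is designed to create.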

Survey of existing approaches
1. ball partitioning methods
2. generalized hyper-plane partitioning approaches
   1. Bisector Tree
   2. Generalized Hyper-plane Tree
3. exploiting pre-computed distances
4. hybrid indexing approaches
5. approximated techniques

Bisector Tree (BT)
• Applies generalized hyper-plane partitioning
• Recursively divides a given dataset $X$
• Choose two arbitrary points $p_1, p_2 \in X$
• Form subsets from the remaining objects:
  $S_1 = \{ o \in X,\ d(o, p_1) \le d(o, p_2) \}$
  $S_2 = \{ o \in X,\ d(o, p_1) > d(o, p_2) \}$
• Covering radii $r_1^c$ and $r_2^c$ are established:
  • The balls can intersect!
[Figure: pivots $p_1$, $p_2$ with their covering balls of radii $r_1^c$, $r_2^c$.]

BT: Range Query
Given a query $R(q, r)$:
• traverse the tree starting from its root
• in each internal node $\langle p_i, p_j \rangle$, do for each pivot $p_x$, $x \in \{i, j\}$:
  • if $d(q, p_x) \le r$, report $p_x$ on output
  • if $d(q, p_x) - r \le r_x^c$, enter the child of $p_x$
[Figure: a query $q$ with radius $r$ against the covering balls of $p_i$ and $p_j$.]
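The bisector tree and its range query can be sketched as follows. Assumptions: a node is a list of one or two entries `(pivot, covering_radius, child)`, the two "arbitrary" pivots are the first two objects, and `math.dist` serves as the metric in the usage note. The names are illustrative.

```python
import math
from typing import Callable, List, Optional, Tuple

BTNode = Optional[List[Tuple]]  # entries of (pivot, covering radius, child)


def bt_build(objects: List, dist: Callable) -> BTNode:
    """Assign every remaining object to the closer of two pivots."""
    if not objects:
        return None
    pivots = objects[:2]
    groups: List[List] = [[] for _ in pivots]
    radii = [0.0 for _ in pivots]
    for o in objects[2:]:  # non-empty only when two pivots exist
        d1, d2 = dist(o, pivots[0]), dist(o, pivots[1])
        idx, d = (0, d1) if d1 <= d2 else (1, d2)
        groups[idx].append(o)
        radii[idx] = max(radii[idx], d)  # covering radius of that pivot
    return [(p, rad, bt_build(g, dist))
            for p, rad, g in zip(pivots, radii, groups)]


def bt_range(node: BTNode, q, r, dist: Callable, out=None) -> List:
    """Collect all objects o with dist(q, o) <= r."""
    if out is None:
        out = []
    if node is None:
        return out
    for pivot, radius, child in node:
        d = dist(q, pivot)
        if d <= r:
            out.append(pivot)
        if d - r <= radius:  # query ball intersects the covering ball
            bt_range(child, q, r, dist, out)
    return out
```

For instance, `bt_range(bt_build(points, math.dist), (2, 1), 1.5, math.dist)` enters a child only when the query ball overlaps that pivot's covering ball; because the two balls can intersect, both children may be entered.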

Monotonous Bisector Tree (MBT)
• A variant of Bisector Tree
• Child nodes inherit one pivot from the parent.
• For convenience, no covering radii are shown.
[Figure: the same space partitioned by a Bisector Tree (pivots $p_1$ to $p_6$) and by a Monotonous Bisector Tree (pivots $p_1$ to $p_4$).]

MBT (cont.)
• Fewer pivots are used, so query processing needs fewer distance evaluations and leaves hold more objects.
[Figure: a Bisector Tree with root $\langle p_1, p_2 \rangle$ and children $\langle p_3, p_4 \rangle$, $\langle p_5, p_6 \rangle$; a Monotonous Bisector Tree with root $\langle p_1, p_2 \rangle$ and children $\langle p_1, p_3 \rangle$, $\langle p_2, p_4 \rangle$.]

Voronoi Tree
• Extension of Bisector Tree
• Uses more pivots in each internal node
  • Usually three pivots
[Figure: three pivots $p_1$, $p_2$, $p_3$ with covering radii $r_1^c$, $r_2^c$, $r_3^c$.]
