Prof. Paolo Ciaccia - - PDF document




SLIDE 1

  • Prof. Paolo Ciaccia

http://www-db.deis.unibo.it/courses/SI-M/

  • In the following we will go through 3 distinct topics, all of them being related by the common objective to provide efficient support to the execution of MM similarity queries


SLIDE 2

  • Remind: recursive bottom-up aggregation of objects based on MBR’s
  • Regions can overlap
  • Each node can contain up to C entries, but not less than c ≤ 0.5*C (the root makes an exception)

[Figure: an R-tree aggregating objects D…P under internal nodes A, B, C]

  • We start from the root and move down the tree one step at a time, trying to find a “nice place” where to accommodate the new object p

  • At each step we have the same question to answer: which child node is the most suitable to accommodate p?

[Figure: p shown against the MBR’s of children A, B, C at two successive levels: “Which child node is the most suitable to accommodate p?” … “And here?”]

SLIDE 3

  • The recursive algorithm that descends the tree to insert a new object p, together with its TID, is called ChooseSubtree

    ChooseSubtree(Ep=(p,TID), ptr(N))
    1. Read(N);
    2. if N is a leaf then: return N                   // we are done
    3. else: { choose among the entries Ec in N the one, Ec*, for which Penalty(Ep,Ec*) is minimum;
    4.         return ChooseSubtree(Ep, Ec*.ptr) }     // recursive call
    5. end.

  • We invoke the method on the index root
  • The specific criterion used to decide “how bad” an entry is, should we choose it to insert p, is encapsulated in the Penalty method

  • This insertion algorithm is the one used by most multi-dimensional and metric trees

  • If point p is inside the region of an entry Ec, then the penalty is 0
  • Otherwise, Penalty can be computed as the increment of volume (area) of the MBR
  • [BKS+90] introduces the R*-tree, the most common variant of R-tree
  • Both criteria aim to obtain trees with better performance

[Figure: point p and two candidate MBR’s A and B; B is better than A]
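The area-increment Penalty above can be sketched in a few lines. This is a minimal 2-D illustration, assuming MBR’s are given as ((xlo, ylo), (xhi, yhi)) pairs; the function names are illustrative, not taken from any real R-tree implementation.

```python
def mbr_area(mbr):
    # Area of a 2-D MBR ((xlo, ylo), (xhi, yhi))
    (xlo, ylo), (xhi, yhi) = mbr
    return (xhi - xlo) * (yhi - ylo)

def enlarge(mbr, p):
    """Smallest MBR containing both mbr and point p."""
    (xlo, ylo), (xhi, yhi) = mbr
    return ((min(xlo, p[0]), min(ylo, p[1])),
            (max(xhi, p[0]), max(yhi, p[1])))

def penalty(p, mbr):
    """Area increment of mbr needed to include p (0 if p is inside)."""
    return mbr_area(enlarge(mbr, p)) - mbr_area(mbr)

def choose_child(p, children):
    """Pick the child MBR that needs the minimum enlargement."""
    return min(children, key=lambda mbr: penalty(p, mbr))
```

With A = ((0,0),(2,2)) and B = ((3,0),(5,1)), a point such as (4, 1.5) enlarges B less than A, so B is chosen, as in the figure.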

SLIDE 4

  • When p has to be inserted into a leaf node that already contains C entries, an overflow occurs, and N has to be split
  • For leaf nodes whose entries are points, the solution aims to split the set of C+1 points into 2 subsets, each with at least c and at most C points

  • Among the several possibilities, one could consider the choice that leads to have a minimum overall area

[Figure: node N with C = 16 and c = 6 overflows on inserting p; two candidate splits into N1 and N2 are compared]

  • As in B+-trees, splits propagate upward and can recursively trigger splits at higher levels of the tree
  • The problem to be faced now is how to split a set of C+1 (hyper-)rectangles
  • The original proposal just aims to minimize the sum of resulting areas
  • The R*-tree implements a more sophisticated criterion, which takes into account the areas, overlap, and perimeters of the resulting regions

[Figure: an internal node N with C = 7 and c = 3 is split into N1 and N2]
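The minimum-overall-area split can be illustrated by exhaustive enumeration. This is only a sketch: it tries every admissible two-way partition of the C+1 points, which is exponential in C, which is precisely why real R-trees use heuristic split algorithms instead. Point format and names are assumptions.

```python
from itertools import combinations

def mbr_of(points):
    # Area of the MBR enclosing a set of 2-D points
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))

def best_split(points, c):
    """Exhaustively split the C+1 points into two subsets, each with at
    least c points, minimizing the sum of the areas of the two MBR's."""
    best = None
    n = len(points)
    for k in range(c, n - c + 1):          # admissible sizes for group 1
        for group1 in combinations(points, k):
            group2 = [p for p in points if p not in group1]
            cost = mbr_of(group1) + mbr_of(group2)
            if best is None or cost < best[0]:
                best = (cost, list(group1), group2)
    return best[1], best[2]
```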

SLIDE 5

  • It’s a matter of fact that vector spaces, equipped with some (weighted) Lp-norm, are not general enough to deal with the whole variety of feature types and distance functions needed for MM data
  • Example: the Hausdorff distance between two sets of points s1 and s2:
    1. ∀ (red) point of s1, find the closest (blue) point in s2; let h(s1,s2) be the maximum of such distances
    2. ∀ (blue) point in s2, find the closest (red) point in s1; let h(s2,s1) be the maximum of such distances
    3. Let dHaus(s1,s2) = max{ h(s1,s2), h(s2,s1) }
    Used for matching shapes

  • We have logs of WWW accesses, where each log entry has a format like:
    www-db.deis.unibo.it pciaccia - [11/Jan/1999:10:41:37 +0100] “GET /~mpatella/ HTTP/1.0” 200 1573
  • Log entries are grouped into sessions (= sets of visited pages):
    s = <ip_address, user_id, [url1,…,urlk]>
    and we want to compare “similar sessions” (i.e., similar sets), using:

    dsetdiff(s1, s2) = (|s1 − s2| + |s2 − s1|) / (|s1| + |s2|)
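Both distances are straightforward to code. A sketch, assuming points are tuples and d is the Euclidean distance; d_setdiff uses a common normalized form of the set-difference distance (an assumption, as the slide’s exact formula may differ).

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def h(s1, s2):
    """Directed Hausdorff distance: for each point of s1 take the distance
    to its closest point in s2, then keep the maximum of these minima."""
    return max(min(dist(p, q) for q in s2) for p in s1)

def d_haus(s1, s2):
    """Symmetric Hausdorff distance: max of the two directed distances."""
    return max(h(s1, s2), h(s2, s1))

def d_setdiff(s1, s2):
    """Normalized set-difference distance between two sets of pages."""
    s1, s2 = set(s1), set(s2)
    return (len(s1 - s2) + len(s2 - s1)) / (len(s1) + len(s2))
```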

SLIDE 6

  • A common distance measure for strings is the so-called edit distance, defined as the minimum number of characters that have to be inserted, deleted, or substituted so as to transform a string s1 into another string s2

  • Examples:
    dedit(‘ball’,‘bull’) = 1
    dedit(‘balls’,‘bell’) = 2
    dedit(‘rather’,‘alter’) = 3
    dedit(‘gatctggtgg’,‘agcaaatcag’) = 7
  • The edit distance is also commonly used in genomic DB’s to compare DNA sequences. Each DNA sequence is a string over the 4-letter alphabet of bases: a: adenine, c: cytosine, g: guanine, t: thymine

[Figure: an alignment of ‘gatctggtgg’ and ‘agcaaatcag’ with the 7 edit operations numbered and matches marked ‘=’]

  • The edit distance can be computed using a dynamic programming procedure, similar to the one seen for the DTW

  • The cost matrix is used to incrementally build the new matrix dedit, whose elements are recursively defined as:

    dedit(i,j) = min{ dedit(i−1,j) + 1, dedit(i,j−1) + 1, dedit(i−1,j−1) + cost(i,j) }

    where cost(i,j) = 0 if s1[i] = s2[j], and 1 otherwise

  • For s1 = ‘rather’ and s2 = ‘alter’ the dedit matrix is:

            a   l   t   e   r
        0   1   2   3   4   5
    r   1   1   2   3   4   4
    a   2   1   2   3   4   5
    t   3   2   2   2   3   4
    h   4   3   3   3   3   4
    e   5   4   4   4   3   4
    r   6   5   5   5   4   3

    and dedit(‘rather’,‘alter’) = 3 is found in the bottom-right cell
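The recurrence translates directly into code. A self-contained sketch with unit costs for insertion, deletion, and substitution:

```python
def edit_distance(s1, s2):
    """Dynamic-programming edit distance, following the recurrence above."""
    m, n = len(s1), len(s2)
    # d[i][j] = edit distance between the prefixes s1[:i] and s2[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                 # delete all i characters of s1[:i]
    for j in range(n + 1):
        d[0][j] = j                 # insert all j characters of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    return d[m][n]
```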

SLIDE 7

  • A metric space M = (U,d) is a pair, where U is a domain (“universe”) of values, and d is a distance function that, ∀ x,y,z ∈ U, satisfies the metric axioms:
    d(x,y) ≥ 0, d(x,y) = 0 ⇔ x = y    (positivity)
    d(x,y) = d(y,x)                   (symmetry)
    d(x,y) ≤ d(x,z) + d(z,y)          (triangle inequality)

  • Metric indexes only use the metric axioms to organize objects, and exploit the triangle inequality to prune the search space

  • Given a “metric dataset” P ⊆ U, one of the two following principles can be applied to partition it into two subsets
  • Ball decomposition: take a point v (“vantage point”), compute the distances of all other points p w.r.t. v, d(p,v), and define
    P1 = {p : d(p,v) ≤ rv}    P2 = {p : d(p,v) > rv}
    If rv is chosen so that |P1| ≈ |P2| ≈ |P|/2 we obtain a balanced partition
  • Consider a range query {p : d(p,q) ≤ r}. If d(q,v) > rv + r we can conclude that no point in P1 belongs to the result
    Proof: we show that d(p,q) > r holds ∀ p ∈ P1.
    d(p,q) ≥ d(q,v) − d(p,v)    (triangle ineq.)
           > rv + r − d(p,v)    (by hyp.)
           ≥ rv + r − rv = r    (by def. of P1)
  • Similar arguments can be applied to P2
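Ball decomposition and its pruning rule can be sketched for any metric d passed in as a function (names are illustrative):

```python
def ball_partition(points, v, rv, d):
    """Split points into P1 (inside the ball around vantage point v,
    radius rv) and P2 (outside)."""
    p1 = [p for p in points if d(p, v) <= rv]
    p2 = [p for p in points if d(p, v) > rv]
    return p1, p2

def can_prune_p1(q, r, v, rv, d):
    """Range query (q, r): if d(q,v) > rv + r, no point of P1 can match."""
    return d(q, v) > rv + r
```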

SLIDE 8

  • Generalized Hyperplane: take two points v1 and v2, compute the distances of all other points p w.r.t. v1 and v2, and define
    P1 = {p : d(p,v1) ≤ d(p,v2)}    P2 = {p : d(p,v2) < d(p,v1)}
  • Consider a range query {p : d(p,q) ≤ r}. If d(q,v1) − d(q,v2) > 2*r we can conclude that no point in P1 belongs to the result
    Proof: we show that d(p,q) > r holds ∀ p ∈ P1.
    d(q,v1) − d(p,q) ≤ d(p,v1)        (triangle ineq.)
    d(p,v1) ≤ d(p,v2)                 (def. of P1)
    d(p,v2) ≤ d(p,q) + d(q,v2)        (triangle ineq.)
    Then: d(q,v1) − d(p,q) ≤ d(p,q) + d(q,v2), i.e., d(p,q) ≥ (d(q,v1) − d(q,v2))/2 > r (by hyp.)
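The Generalized Hyperplane partition and its pruning rule admit an equally small sketch (names are illustrative, and d is any metric):

```python
def gh_partition(points, v1, v2, d):
    """P1: points at least as close to v1 as to v2; P2: the rest."""
    p1 = [p for p in points if d(p, v1) <= d(p, v2)]
    p2 = [p for p in points if d(p, v2) < d(p, v1)]
    return p1, p2

def can_prune_p1(q, r, v1, v2, d):
    """Range query (q, r): if d(q,v1) - d(q,v2) > 2*r, skip P1 entirely."""
    return d(q, v1) - d(q, v2) > 2 * r
```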
  • The M-tree has been the first dynamic, paged, and balanced metric index
  • Intuitively, it generalizes “R-tree principles” to arbitrary metric spaces
  • Since 1997 [CPZ97], the M-tree has been used by several research groups
  • C++ source code freely available at http://www-db.deis.unibo.it/Mtree/
  • Remind: at a first sight, the M-tree “looks like” an R-tree. However, remember that the M-tree only “knows” about distance values, thus it ignores coordinate values and does not rely on any “geometric” (coordinate-based) reasoning
SLIDE 9

  • Recursive bottom-up aggregation of objects based on regions
  • Regions can overlap
  • Each node can contain up to C entries, but not less than c ≤ 0.5*C (the root makes an exception)
  • Depending on the metric, the “shape” of index regions changes

[Figure: ball regions for d≡L2, and region shapes under L1, L∞, weighted Euclidean, and quadratic distance]

  • Each node N of the tree has an associated region, Reg(N), defined as
    Reg(N) = {p : p ∈ U, d(p,vN) ≤ rN}
    where vN is the routing object and rN the covering radius of node N
  • The set of indexed points p that are reachable from node N are guaranteed to have d(p,vN) ≤ rN
  • This immediately makes it possible to apply the pruning principle: if d(q,vN) > rN + r then prune node N

SLIDE 10

  • Each node N stores a variable number of entries
  • Leaf node: an entry E has the form E=(ObjFeatures,distP,TID), where ObjFeatures are the feature values of the indexed object, distP is the pre-computed distance of the object from its parent routing object, and TID is the identifier of the object
  • Internal node: E=(RoutingObjFeatures,CoveringRadius,distP,PID), where RoutingObjFeatures are the feature values of the routing object, CoveringRadius is the radius of the region, distP is the distance from the parent routing object, and PID is the pointer to the child node

[Example: leaf entry ((2,3), 2, p1) in node N7; internal entries ((2,5), 2.5, √5, •) and ((4,6), 5, _, •) in node N3, plotted on an 8×8 grid with points p1, v7, v3]

SLIDE 11

  • Pre-computed distances distP are exploited during query execution to save distance computations
  • Let vP be the parent (routing) object of vN
  • When we come to consider the entry of vN, we have already computed d(q,vP)
  • From the triangle inequality it is: d(q,vN) ≥ |d(q,vP) − d(vP,vN)|
    Thus we can prune node N without computing d(q,vN) if |d(q,vP) − d(vP,vN)| > rN + r
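The fast-pruning test needs no new distance computation at all; a sketch (argument names are illustrative):

```python
def fast_prune(d_q_vp, dist_p, r_n, r):
    """Try to discard node N using only the already known d(q,vP) and the
    pre-computed distP = d(vP,vN): prune if |d(q,vP) - d(vP,vN)| > rN + r."""
    return abs(d_q_vp - dist_p) > r_n + r
```

With the numbers of the edit-distance example below: |1 − 4| = 3 > 1 + 1 prunes, while |1 − 2| = 1 ≤ 3 + 1 does not.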

  • Fast pruning with the edit distance, for the query q = “spire”, r = 1:
    d(“spire”, “shakespeare”) = 7 > 5 + 1 → prune
    d(“spire”, “spare”) = 1 ≤ 5 + 1 → cannot prune, descend
    |d(“spire”, “spare”) − d(“pier”, “spare”)| = |1 − 4| = 3 > 1 + 1 → prune without computing d(“spire”, “pier”)
    |d(“spire”, “spare”) − d(“parse”, “spare”)| = |1 − 2| = 1 ≤ 3 + 1 → cannot prune
    d(“spire”, “parse”) = 3 ≤ 3 + 1 → descend
    |d(“spire”, “parse”) − d(“parse”, “parse”)| = |3 − 0| = 3 > 1 → prune

SLIDE 12

  • Synthetic datasets (10 Gaussian clusters)
  • Up to 40% cost reduction with fast pruning

[Figure: computed distances vs. dimensionality (10 to 50) for M-tree with fast pruning, M-tree without fast pruning, and R*-tree]

  • The procedure to insert a new object is based on the ChooseSubtree method
  • The Penalty method considers the increase of the covering radius needed to accommodate the new object
  • For managing a split, there are several alternatives, among which [CPZ97]: mM_RAD and M_LB_DIST
  • Experiments demonstrate that mM_RAD is the best

[Figure: I/Os and distance computations (%) vs. dimensionality (5 to 50) for M_LB_DIST and mM_RAD]

SLIDE 13

  • 68,000 color images; 32-dim (color histograms), L2
  • 161,212 text rows; edit distance
  • The logic of search algorithms is the one already seen for range and k-NN queries with the R-tree

[Figure: search time (secs) vs. k (k-NN queries) and vs. query radius r (range queries), for M-tree and sequential scan]

  • The geometry of high-dimensional spaces is intriguing, since our common-sense intuitions fail, as the following examples show

1st example: “is the center in the sphere?”

  • Consider the unit hypercube [0,1]^D and its center c = (0.5, …, 0.5)
  • The L2 distance of c from any vertex of the hypercube is
    L2(c, vertex) = sqrt( Σ_{i=1..D} 0.5² ) = 0.5 × √D
    which grows with D: already for D = 4 the center is at distance 1 from every vertex

SLIDE 14

2nd example: “where are the points?”

  • Consider points uniformly distributed in the unit hypercube [0,1]^D, and the inner box B obtained by removing a slice of width ε from each side
  • The volume of B is Vol(B) = (1 − 2 × ε)^D, which vanishes as D grows: almost all points lie close to the boundary of the hypercube

    ε \ D     2       50         100        500        1000
    0.1       0.64    1.43E-05   2.04E-10   3.51E-49   1.23E-97
    0.05      0.81    0.01       2.66E-05   1.32E-23   1.75E-46
    0.01      0.96    0.36       0.13       4.10E-05   1.68E-09
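The table can be reproduced with one line of arithmetic; a sketch:

```python
def inner_volume(eps, D):
    """Volume of the box left after cutting a slice of width eps from
    each side of the unit hypercube [0,1]^D: (1 - 2*eps)**D."""
    return (1 - 2 * eps) ** D
```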

3rd example: “how big a sphere is?”

  • Consider the sphere SD of radius 0.5 inscribed in the unit hypercube [0,1]^D, with points uniformly distributed in the hypercube
  • For even D, the volume of SD is
    Vol(SD) = 0.5^D × π^(D/2) / (D/2)!
  • The number of points N needed to have, on the average, at least 1 point in SD is just 1/Vol(SD)

    D      Vol(SD)     N
    2      0.785       1.27
    4      0.308       3.24
    10     0.002       401.50
    20     2.46E-08    40631627
    40     3.28E-21    3.05E+20
    100    1.87E-70    5.35E+69
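A sketch of the computation; math.gamma(D/2 + 1) replaces (D/2)!, so that odd D works as well (an extension of the even-D formula above):

```python
import math

def sphere_volume(D, radius=0.5):
    """Volume of a D-ball: r**D * pi**(D/2) / gamma(D/2 + 1);
    for even D the gamma term equals (D/2)!."""
    return radius ** D * math.pi ** (D / 2) / math.gamma(D / 2 + 1)

def points_needed(D):
    """Expected dataset size needed to find, on average, one point in the
    sphere inscribed in the unit hypercube: 1 / Vol(SD)."""
    return 1 / sphere_volume(D)
```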

SLIDE 15

4th example: “how far is the nearest neighbor?”

  • For points uniformly distributed in the unit hypercube [0,1]^D, the expected distance of a query point from its nearest neighbor grows with D
5th example: “how far are the other points?”

  • As D grows, the distribution of pairwise distances concentrates: all points tend to be at approximately the same distance from each other

[Figure: histograms of pairwise distances d for D = 2, 5, 10, 20, 40]

SLIDE 16

  • The analysis in [WSB98] demonstrates that, no matter how smart you are in designing a new index structure, there always exists a value of D such that the index performance will deteriorate, and sequential scan will become the best alternative!
  • However, the analysis applies to uniformly distributed datasets and Euclidean distance…
  • If data are not uniformly distributed (as it always happens!), then the authors argue that their analysis still applies, provided one considers the “intrinsic dimensionality” of the dataset
  • The concept of “intrinsic dimensionality” is not precisely definable; intuitively it is the “true dimensionality” of our data
  • Some attempts to characterize the intrinsic dimensionality of a dataset have been based on the concept of fractals (e.g., see [FK94])

  • From a more pragmatical point of view, experimental results obtained with both spatial and metric indexes confirm that high-dimensional datasets are often a nightmare!
  • This is the so-called “dimensionality curse”!
  • For the structures we have seen (R-tree and M-tree), what is observed is an incredible amount of overlap between the regions of index nodes

[Figure: percentage of overlap (25%, 50%) between index node regions as D grows (100, 200)]

SLIDE 17

  • If we partition the [0,1]^D space into non-overlapping regions, similar problems arise
  • For instance, consider a uniform distribution of N points, and assume we split a dimension in the mid-point 0.5 (thus, each time we double the number of regions). We can split at most D’ = log2 N dimensions
  • Consider the region: Reg = [0,0.5] × … × [0,0.5] × [0,1] × … × [0,1], whose farthest point is q = (1,…,1)
  • The Euclidean distance of q from Reg is
    L2(Reg, q) = sqrt( Σ_{i=1..D’} (1 − 0.5)² ) = 0.5 × √D’ = 0.5 × √(log2 N)
  • With N = 10^6 we have D’ = 20 and L2(Reg,q) = 2.236
  • Since this is independent of D, whereas the expected NN distance grows with D, for values of D large enough (D ≥ 80) Reg will be accessed, and this holds for any other region!
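A sketch of the computation; the number of split dimensions D’ = log2 N is passed in directly (≈ 20 for N = 10^6):

```python
import math

def region_corner_distance(d_split):
    """L2 distance from q = (1,...,1) to Reg after splitting d_split
    dimensions at 0.5: each split dimension contributes (1 - 0.5)**2."""
    return 0.5 * math.sqrt(d_split)
```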

  • The X-tree is an evolution of the R-tree, aiming to deal with the “overlap problem”
  • When a node has to be split, if an overlap-free split is possible then it is performed as usual, otherwise a new, larger, super-node is allocated
  • The price to be paid is that searching within a super-node is more costly than searching within nodes
SLIDE 18

  • Although the X-tree performs better than the R-tree for medium values of D, when the dimensionality increases the index degenerates to a sequential organization!

[Figure: X-tree node structure for D = 4, D = 8, D = 16]

  • The basic idea of the VA-file [WSB98] is to speed-up the sequential scan by exploiting a “Vector Approximation”
  • Each dimension of the data space is partitioned into 2^bi intervals using bi bits
  • Thus, each coordinate of a point (vector) requires now bi bits instead of 32
  • The VA-file stores, for each point of the dataset, its approximation, which is a vector of Σi=1..D bi bits

    Feature values        VA-file (bi = 2)
    p1   0.1   0.6        p1   00 10
    p2   0.7   0.4        p2   10 01
    p3   0.9   0.3        p3   11 01
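The approximation step can be sketched as follows; the coordinate format (values in [0,1)) and the function name are assumptions, and bi may differ per dimension as described above:

```python
def approximate(vector, bits_per_dim):
    """VA-file approximation of a vector: each coordinate is replaced by
    the index (bi bits) of the interval containing it, among 2**bi
    equal-width intervals of [0,1]."""
    approx = []
    for x, bi in zip(vector, bits_per_dim):
        cells = 2 ** bi
        idx = min(int(x * cells), cells - 1)      # clamp x = 1.0 to top cell
        approx.append(format(idx, '0{}b'.format(bi)))
    return approx
```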

SLIDE 19

  • Query processing with the VA-file is based on a filter & refine approach
  • For simplicity, consider a range query:
    Filter: the VA-file is accessed and only the points in the regions that intersect the query region are kept
    Refine: the feature vectors are retrieved and an exact check is made

[Figure: a range query (q, r) over the VA grid, showing actual results, false drops, and excluded points]

  • The issue of efficiently indexing complex datasets is far from having been solved
  • Starting from the end of the 90’s, many solutions have been proposed, and new ideas have emerged
  • Unfortunately, the absence of a well-defined and accepted benchmark makes it almost impossible to compare all such solutions
  • The basic lesson to be learned is that, no matter how cleverly a structure has been designed, ultimately it has to be contrasted with the sequential scan!
  • Thus, be skeptical if someone claims to have designed an index showing “superior performance” w.r.t. the others: always check whether sequential scan has been taken as a competitor!