Content-based Similarity Queries on Complex Data: Challenges and Real Applications
Agma J. Machado Traina
agma@icmc.usp.br
31st Brazilian Symposium on Databases 04-07, October 2016 Salvador - Bahia
Complex data are everywhere:
Time Series
Data stream sliding window
There is a great need for developing automated systems that could help users to retrieve complex data from the databases, employing their inherent content.
Thus, for each type of data it is important to:
Extract the relevant features that best describe the data
Get dependable data
Index the data for fast retrieval/processing (scalability)
Process queries on their content (similarity queries)
Learning
(Decision Making)
Access Methods
(Structured)
Features
(Content and Context)
Content and context can be combined to improve image mining results…
GBdI, ICMC-USP
Content
Context
…but there are challenges in combining content and context: discarding non-useful information, finding correlations between content and context, and relying on dependable information.
Let's focus on images, one of the main types of complex data. Nowadays, digital images are produced in ever-increasing quantities and stored in large databases.
Large dataset
Retrieve Similar Images
There is a great need for developing automated systems that could help users to retrieve images from these databases.
A common approach to support image searches relies on the use of CBIR (content-based image retrieval) systems designed for retrieving the images most similar to a given query image.
Large dataset
Retrieve Similar Images
stored CBIR
Images + features
There is a large number of feature extraction methods for images.
Features
(Content and Context)
Challenge: to bridge the “semantic gap”
Comply with the user's expectations when querying and retrieving the data
[Figure: images → Feature Vector (1 2 ... n) → ranking]
The success of a retrieval system depends heavily on the way the data is represented.
Two main approaches to represent the image content:
Global representation
Global features have the ability to generalize an entire image with a single Feature Vector (1 2 ... n)
Ex: Color Histograms, LBP, Fourier, Moment Invariants, Haralick Descriptors...
Local representation
Local features are computed at multiple points in the image
[Figure: several Feature Vectors (1 2 ... n) extracted from one image]
to encode local image features;
region around the point, generating a 128-D feature vector for each keypoint.
[Figure: several 128-D Feature Vectors (1 2 ... 128) extracted from one image]
from each image
dictionary;
the image
Bag-of-Visual-Words
[Figure: local features extraction → Visual Dictionary (visual words A, B, C, D) → representation as a histogram of the visual words]
Visual Dictionary
Local features detection on an image database subset (a set of training images from the database)
Clustering: feature space → clustered feature space
A visual word is the centroid of a cluster
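The dictionary construction and the resulting histogram can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes plain k-means with a naive initialization, toy 2-D descriptors instead of real local features, and the hypothetical helper names `build_dictionary` and `bag_of_visual_words`.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def build_dictionary(descriptors, k, iterations=10):
    """Plain k-means; the k final centroids play the role of visual words."""
    centroids = [list(d) for d in descriptors[:k]]  # naive init: first k descriptors
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            clusters[nearest(d, centroids)].append(d)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(c) / len(members) for c in zip(*members)]
    return centroids

def bag_of_visual_words(image_descriptors, dictionary):
    """Histogram counting how many local descriptors map to each visual word."""
    hist = [0] * len(dictionary)
    for d in image_descriptors:
        hist[nearest(d, dictionary)] += 1
    return hist

# Toy training descriptors forming two obvious clusters (assumed data).
training = [(0.0, 0.0), (10.0, 10.0), (0.2, 0.1), (9.8, 10.1), (0.1, 0.3), (10.2, 9.9)]
dictionary = build_dictionary(training, k=2)

# Descriptors of one image, quantized against the dictionary.
image = [(0.1, 0.1), (9.9, 10.0), (0.0, 0.2)]
histogram = bag_of_visual_words(image, dictionary)
```

The histogram has one bin per visual word, so its length is fixed by the dictionary size regardless of how many keypoints an image has.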
This final representation loses the spatial relationship between the visual words in the image
* Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories
The idea is to divide the image space into sub-regions and to compute a bag in each sub-region. Not invariant to rotation.
The idea is to divide the image space into quadrants using each keypoint as the
quadrant.
** Penatti, O. A.; Silva, F. B.; Valle, E.; Gouet-Brunet, V.; Torres, R.
[Figure: arrangements of the visual words A, B and C leading to different representations]
to define the quadrant.
Gradient always points to the direction of higher values
Proposed Method: Global Spatial Arrangement
To reduce the feature vector dimensionality, we group each quarter
The proposed final representation is a 2-tuple considering the relation between top-down and left-right, computed as follows
Representing the spatial information of visual words of image (a) and image (b) using: (c) the WSA (Words Spatial Arrangement) , (d) the SP (Spatial Pyramid) and (e) the GSA (Global Spatial Arrangement).
smallest one
[Results figure: Corel1000 dataset; Texture dataset]
[Figure: Bag-of-Visual-Words: image → local features extraction → dictionary of words (A to F) → representation]
The main idea of using visual dictionaries is to consider that the visual patterns present in images are similar to the words present in textual documents. Therefore, an image is composed of visual words as a textual document is composed of textual words.
Problem: The Bag-of-Visual-Words ignores the spatial information of visual words in the final representation.
Solution: We used 2-grams (bigrams) for generating visual phrases. In the textual domain, a 2-gram (bigram) is a sequence of two words, such as: {data bases, computer vision, medical systems, artificial systems}.
We have an exponential combination of phrases! Which phrases are the representative ones to encode image information?
[Figure: the same local features represented as a Bag-of-Visual-Words (dictionary of words A to F) and as a Bag-of-Visual-Phrases (dictionary of phrases such as AB, CDF, ABEF, CA)]
The 2-grams can be generated by placing a region over each keypoint
All pairs of words formed with the center point are considered 2-grams.
[Figure: example 2-grams extracted around a center point: BC, BA, BA and BD, BA, BA]
We divided the area into two zones to extract orientation
[Figure: 2-grams extraction (AB, BC, AC, AB, CA, BB, CC, CA, CA, BA, AA ...) → Dictionary of 2-grams (AB, AC, AD, BC, BD, CD) → one Bag-of-2-grams histogram per zone (Bag 1, Bag 2)]
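The 2-gram extraction around each keypoint can be sketched as below. This is only an illustration under stated assumptions: keypoints are given as `(x, y, word)` triples, the region is a circle of radius `radius`, and the two zones are taken to be "at or above the center" and "below the center". The zone definition and the function name `bags_of_2grams` are hypothetical; the slides do not specify them.

```python
import math
from collections import Counter

def bags_of_2grams(keypoints, radius):
    """keypoints: list of (x, y, word). Returns two Counters, one per zone."""
    bag_upper, bag_lower = Counter(), Counter()
    for i, (cx, cy, center_word) in enumerate(keypoints):
        for j, (x, y, word) in enumerate(keypoints):
            # Skip the center itself and any keypoint outside the region.
            if i == j or math.hypot(x - cx, y - cy) > radius:
                continue
            gram = center_word + word  # 2-gram: center word followed by neighbor word
            # Assumed zone split: neighbors at or above the center vs. below it.
            if y >= cy:
                bag_upper[gram] += 1
            else:
                bag_lower[gram] += 1
    return bag_upper, bag_lower

# Toy keypoints (assumed data): B in the middle, three close neighbors, one far away.
keypoints = [(0, 0, "B"), (1, 1, "A"), (-1, 1, "C"), (1, -1, "D"), (5, 5, "A")]
upper, lower = bags_of_2grams(keypoints, radius=2.5)
```

Keeping one bag per zone is what lets the representation retain some orientation information that a single bag of 2-grams would discard.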
Dataset Evaluated: ImageCLEF 2012 Medical Task, composed of 5,042 bio-medical images classified in 32 categories and 3 levels
Comparative results using 80-20 classification test
CLD = Color Layout Descriptor
EHD = Edge Histogram Descriptor
CEDD = Color and Edge Directivity Descriptor
BoVW = Bag-of-Visual-Words
It is a Bag-of-Words model to retrieve shapes by similarity using salient points
An image is usually represented by color, texture and/or shape
within an image
However, the development of a shape descriptor is a challenging task in the computer vision area.
The main reason is that the same object may present a rich variability of shapes, and different objects may present shapes with a high visual similarity.
Shapes of different objects with high visual similarity
Shapes of the same objects with different visual similarity
Contour-based approaches Statistical-based approaches Analyses only the salient points
Region-based approach
Salient (corner) points of some shapes
How to represent each salient point? (e.g., one shape with 4 saliences and another with 10 saliences)
How to measure the dissimilarity?
The idea is to model the representation as a Bag-of-Words approach
Visual Dictionary
We assume the shape was previously segmented
We assume the salient points were detected
Curvature equation:
The curvature equation describes how sharply the curve bends at a point along a portion of the curve
representation
, where s = arc length of the curve portion
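For reference, one standard arc-length form of curvature is sketched below. The symbols \kappa and \theta are assumptions introduced here for illustration (\theta denotes the tangent angle of the contour); the notation used on the original slide may differ.

```latex
\kappa(s) = \left| \frac{d\theta}{ds} \right|
```

Under this form, a straight contour portion has \kappa = 0 and a sharp corner, where the tangent angle changes quickly over a short arc length s, has a large \kappa.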
We count how many times each word appears in the shape (e.g., A B B D D C B A).
Problem: Two different shapes can have the same global histogram.
Solution: Encode the spatial relationship between the visual words in the image!
1) We divide the shape space in equally separated zones according to the distance from the shape centroid. In this example, we used 3 zones.
2) We compute a histogram in each zone.
Two different shapes can have the same global histogram, but usually not the same distribution of visual words
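The two pooling steps can be sketched as follows. This is a minimal sketch, assuming the shape centroid is already known, that salient points are given as `(x, y, word_index)` triples, and that zones are equal-width rings up to the farthest salient point. The function name `zoned_histograms` and the normalization by the maximum distance are assumptions, not details taken from the slides.

```python
import math

def zoned_histograms(points, centroid, num_words, num_zones=3):
    """One histogram of visual words per concentric zone around the centroid."""
    cx, cy = centroid
    dists = [math.hypot(x - cx, y - cy) for x, y, _ in points]
    max_dist = max(dists) or 1.0  # avoid division by zero for a degenerate shape
    hists = [[0] * num_words for _ in range(num_zones)]
    for (_, _, word), dist in zip(points, dists):
        # Equal-width rings; the farthest point falls into the last zone.
        zone = min(int(num_zones * dist / max_dist), num_zones - 1)
        hists[zone][word] += 1
    return hists

def global_histogram(points, num_words):
    hist = [0] * num_words
    for _, _, word in points:
        hist[word] += 1
    return hist

# Two toy shapes (assumed data) with the same words at different distances.
shape_a = [(0.5, 0.0, 0), (0.0, 1.5, 1), (3.0, 0.0, 1), (0.0, -2.5, 0)]
shape_b = [(0.5, 0.0, 1), (0.0, 1.5, 0), (3.0, 0.0, 1), (0.0, -2.5, 0)]

zones_a = zoned_histograms(shape_a, (0.0, 0.0), num_words=2)
zones_b = zoned_histograms(shape_b, (0.0, 0.0), num_words=2)
```

Both toy shapes share the global histogram `[2, 2]`, yet their per-zone histograms differ, which is the situation the pooling is meant to capture.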
FINAL REPRESENTATION
Parameters:
Datasets: MPEG-7 CE-Shape-1 (70 different classes) and Kimia-216 (18 different classes)
Some sample shapes of each dataset
histogram
[Results figure: MPEG-7 dataset; Kimia-216 dataset]
Feature vector dimensionality comparison
smallest vector
Problem: it needs a specific distance function to compute the dissimilarity
Average computational time to compute the dissimilarity: the proposed descriptor (BoSP) is the second-fastest descriptor
[Results figure: MPEG-7 dataset; Kimia-216 dataset]
The proposed descriptor achieved performance similar to the SSD descriptor, while being 53% faster when computing the dissimilarity
Bag-of-Visual-Phrases -> Bag-of-Salience-Points (BoSP): new feature extraction methods for dealing with shape-based images using salience point features
Three interesting points:
a multi-scale method to efficiently represent the salience points of a shape;
a Dictionary of Curvatures to encode the final shape representation into a single feature vector;
a spatial pooling approach to encode the distance distribution of the visual words in the shape space.
Experimental results show that the proposed descriptor achieved the best retrieval performance while requiring a low computational cost to measure the dissimilarity.
Access Methods
(Structured)
Features
(Content)
Advanced database applications must deal with:
Large number of data elements (i.e., cardinality)
High dimensionality (i.e., number of attributes)
Complexity of the features that describe the attributes
Non-dimensional data (e.g. DNA sequences)
Non-dimensional and high-dimensional datasets may consist of thousands of attributes and may be subject to missing values.
Preventable errors or mistakes (e.g. failing to appear for a medical exam, etc.).
Problems outside of control (e.g. failure of the equipment, low battery, etc.).
Privacy or security reasons.
Legitimate (e.g. a survey question that does not apply to the respondent).
Missing data impacts similarity search for the following main reasons:
Distance function: How to measure the distance among elements when part
Access methods collapse
Mechanisms of missingness (Rubin, 1976)
Missing Completely At Random (MCAR): the probability that data are missing is independent of both observed and missing data
$\Pr(I \mid y_{obs}, y_{miss}) = \Pr(I)$
Missing At Random (MAR, Ignorable Missingness): the probability that data are missing is independent of the missing data, but may be a function of the observed data
$\Pr(I \mid y_{obs}, y_{miss}) = \Pr(I \mid y_{obs})$
Missing Not At Random (MNAR, Non-ignorable Missingness)
$\Pr(I \mid y_{obs}, y_{miss}) = \Pr(I \mid y_{miss})$
Y: a set of variables
y_obs: fully observed variables
y_miss: variables with missing values
I: indicator variable
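The three mechanisms can be contrasted with a small simulation. This is a sketch under stated assumptions: each row has one always-observed attribute `x` and one attribute `y` that may go missing, and the thresholds (medians) and the doubling of `rate` are arbitrary illustrative choices. The function name `make_missing` is hypothetical.

```python
import random

def make_missing(rows, mechanism, rate=0.3, seed=0):
    """rows: list of (x, y). Returns copies where y may be None; x stays observed."""
    rng = random.Random(seed)
    xs = sorted(r[0] for r in rows)
    ys = sorted(r[1] for r in rows)
    x_med, y_med = xs[len(xs) // 2], ys[len(ys) // 2]
    out = []
    for x, y in rows:
        if mechanism == "MCAR":
            p = rate                                # independent of x and y
        elif mechanism == "MAR":
            p = 2 * rate if x > x_med else 0.0      # depends only on the observed x
        elif mechanism == "MNAR":
            p = 2 * rate if y > y_med else 0.0      # depends on the value that goes missing
        else:
            raise ValueError("unknown mechanism: " + mechanism)
        out.append((x, None if rng.random() < p else y))
    return out

rows = [(i, (i * 7) % 20) for i in range(20)]  # toy data (assumed)
mcar = make_missing(rows, "MCAR")
mar = make_missing(rows, "MAR")
mnar = make_missing(rows, "MNAR")
```

In the MAR output every missing `y` can be explained by the observed `x`, whereas in the MNAR output the missing entries are exactly the ones whose hidden value was large, which is why MNAR cannot be diagnosed from the observed data alone.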
[Figure: Normalized NDVI over time. Panels: original signal; signal reconstructed with DWT; signal with 10% missing data; signal with missing data reconstructed with DWT]
The distances between objects with missing values are undefined because the differences between the attributes with missing values are unknown.
Obj A: A1 A2 Null … An-2 Null An
Obj B: B1 B2 B3 … Bn-2 Bn-1 Bn
Obs: Given any two feature vectors X and Y, the Lp family of distance functions is defined as:
$L_p(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
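A minimal sketch of the Lp family makes the problem concrete: as soon as one attribute is missing, the corresponding difference cannot be computed. Representing a Null as `None` and returning `None` for an undefined distance are choices made for this illustration only.

```python
import math

def lp_distance(x, y, p=2):
    """L_p distance between two equal-length vectors.

    Returns None (undefined) when any compared attribute is missing.
    """
    if any(a is None or b is None for a, b in zip(x, y)):
        return None
    if p == math.inf:  # L_infinity (Chebyshev) as the limiting case
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

obj_a = (0.0, 0.0)
obj_b = (3.0, 4.0)
obj_null = (3.0, None)

d_euclidean = lp_distance(obj_a, obj_b, p=2)   # L2
d_manhattan = lp_distance(obj_a, obj_b, p=1)   # L1
d_undefined = lp_distance(obj_a, obj_null)     # a Null makes the distance undefined
```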
Fractal: self-similarity property (an object that presents roughly the same characteristics over a large range of scales).
Line - resolution 1:1
data distribution /behavior and attributes correlation
[Traina_JIDM2011] Well-suited to complex data, scalable
Bureau of the Census - Tiger/Line Precensus Files: 1990 technical documentation.
Fractal Dimension
Given a set of n objects in a dataset with a distance function d, the number of pairs of objects within distance r follows a power law:
$PC(r) = K_p \times r^{D}$
[Figure: log(# pairs within distance r) versus log(r); fractal dimension D of the Sierpinski dataset]
The distance exponent D is invariant to random sampling, i.e., the power law holds for subsets of the dataset.
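The exponent D can be estimated as the slope of the log-log plot of pair counts against the radius. The sketch below is a brute-force illustration, assuming Euclidean distance, a hand-picked list of radii, and an ordinary least-squares fit. The function names are hypothetical, and a scalable implementation would not compare all pairs.

```python
import math
from itertools import combinations

def pair_count(points, r):
    """PC(r): number of pairs of points within distance r (brute force)."""
    return sum(1 for a, b in combinations(points, 2) if math.dist(a, b) <= r)

def fractal_dimension(points, radii):
    """Least-squares slope of log(PC(r)) versus log(r)."""
    xs = [math.log(r) for r in radii]
    ys = [math.log(pair_count(points, r)) for r in radii]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

# Toy dataset (assumed): 200 points on a straight line embedded in 2-D.
line = [(float(i), 0.0) for i in range(200)]
d_line = fractal_dimension(line, radii=[2, 4, 8, 16])
```

Although the points live in a 2-D space, the estimated exponent is close to 1, reflecting the intrinsic dimensionality of the line rather than the embedding dimensionality.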
[Figure: Range query on a metric tree: query object sq with limiting radius rq; the query response is a list of entries Oid1, d(S1, Srep) … Oidn, d(Sn, Srep)]
[Figure: k-NNq query on a metric tree: the dynamic radius starts as the diameter of the space and is reduced as the query response Oid1, d(S1, Srep) … Oidk, d(Sk, Srep) is updated, until the final query response is reached]
Pairwise/Listwise Deletion
Imputation methods (e.g. Mean Substitution, Multiple Imputation)
Biased results when predicting MNAR data
High cost for more sophisticated techniques
Special treatment is necessary to allow the applications to
Metric access methods Employ an index structure to
hierarchical tree structure, called Metric Tree, based on a distance function. The space is divided into regions using a set of chosen
called representatives, and their distances to the rest of the
[Figure: Data Objects grouped around representatives (Rep) in a Metric Tree]
S: data domain; d: metric distance. For all x, y, z in S:
(1) Reflexivity: d(x, x) = 0
(2) Non-negativity: d(x, y) > 0 for x ≠ y
(3) Symmetry: d(x, y) = d(y, x)
(4) Triangle inequality: d(x, y) + d(y, z) ≥ d(x, z)
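The four properties can be checked empirically on a sample of objects. The sketch below is illustrative only: the function name `is_metric` and the tolerance are assumptions, and passing the check on a finite sample does not prove that a function is a metric, while failing it does show that it is not.

```python
import math
from itertools import product

def is_metric(points, d, tol=1e-9):
    """Check properties (1) to (4) for every triple drawn from points."""
    for x, y, z in product(points, repeat=3):
        if abs(d(x, x)) > tol:                   # (1) reflexivity
            return False
        if x != y and d(x, y) <= 0:              # (2) non-negativity
            return False
        if abs(d(x, y) - d(y, x)) > tol:         # (3) symmetry
            return False
        if d(x, y) + d(y, z) < d(x, z) - tol:    # (4) triangle inequality
            return False
    return True

def squared_euclidean(a, b):
    return math.dist(a, b) ** 2

sample = [(0.0,), (1.0,), (2.0,)]
euclidean_ok = is_metric(sample, math.dist)
squared_ok = is_metric(sample, squared_euclidean)
```

The squared Euclidean distance fails on this sample because 1 + 1 < 4 violates the triangle inequality, the property that metric trees rely on to prune regions.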
Missing data can underestimate or overestimate the distances:
When data are MAR => distortion of the index structure
When data are MNAR => skew (distance concentration) of the index structure
[Figure: representative (Rep) with its covering radius, complete objects and objects with Null values; distance concentration]
[Figure: a representative (Rep) with covering radius r under three conditions: Complete Data; Missing At Random (sparser data, distortion); Missing Not At Random (distance concentration, skew). Legend: object with Null values, complete object, representative]
Ignore missing attribute values and index the data with missing values:
When data are MAR => distortion in the index structure
When data are MNAR => skew in the index structure
This fact can cause inconsistency in the data structure, leading to inaccurate query responses.
Investigate the key issues involved when indexing and searching datasets with missing attribute values in metric spaces,
Identify the effects of each mechanism of missingness on the metric access methods when applied to incomplete datasets,
Formalize the problem of missing data in metric spaces and propose a "Model of Missingness",
Develop new techniques to support similarity search over large and complex datasets with missing values.
A Metric Access Method to support similarity search over large and complex datasets with missing attribute values:
missing values.
The Hollow-tree metric access method
Built over the Slim-tree platform.
A technique that allows indexing objects with missing values.
Similarity queries based on Fractal Dimension and the local density around the query objects to achieve an accurate query response when missingness is ignorable.
Overcome the limitations of the metric access methods when applied
[Figure: Slim-tree built by loading the complete objects: representatives rep1, rep2, rep3 with covering radii r1, r2, r3, and leaf nodes. Legend: object with Null values, complete object]
There are two types of similarity queries:
Range query Rq(sq, r)
k-Nearest Neighbor query k-NNq(sq, k)
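Both query types can be sketched with a sequential scan, which shows their semantics without any index. This is not the metric-tree algorithm: it assumes Euclidean distance over complete objects, and the function names are hypothetical. The k-NN version keeps a dynamic radius that starts unbounded and shrinks once k candidates are known, mirroring the idea of reducing the dynamic radius during the search.

```python
import heapq
import math

def range_query(dataset, sq, r, d=math.dist):
    """Rq(sq, r): every object within distance r of the query center sq."""
    return [s for s in dataset if d(s, sq) <= r]

def knn_query(dataset, sq, k, d=math.dist):
    """k-NNq(sq, k): the k objects closest to sq, nearest first."""
    heap = []            # max-heap on distance, simulated with negated distances
    radius = math.inf    # dynamic radius: distance to the current k-th neighbor
    for idx, s in enumerate(dataset):
        dist = d(s, sq)
        if dist >= radius:
            continue     # cannot improve the current answer
        heapq.heappush(heap, (-dist, idx, s))
        if len(heap) > k:
            heapq.heappop(heap)      # drop the farthest candidate
        if len(heap) == k:
            radius = -heap[0][0]     # shrink the dynamic radius
    return [s for _, _, s in sorted(heap, key=lambda t: -t[0])]

dataset = [(0, 0), (1, 0), (2, 0), (5, 5), (0, 3)]  # toy objects (assumed)
in_range = range_query(dataset, (0, 0), r=2)
nearest_two = knn_query(dataset, (0, 0), k=2)
```

In a metric tree the same dynamic radius is combined with the triangle inequality to skip whole subtrees instead of single objects.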
[Figure: Hollow-tree construction: the complete objects are loaded first into the Slim-tree (representatives rep1, rep2, rep3 with covering radii r1, r2, r3); the objects with Null values are then loaded into the leaf nodes, flagged by a missing indicator (TRUE/FALSE)]
This strategy prevents data with missing values from being promoted as representatives and thus avoids introducing substantial distortion in the internal structure of the index.
The queries return two separate lists:
List of complete objects: Oid1, d(S1, Srep) … Oidn, d(Sn, Srep)
List of objects with Null values: Oid1, d(S1, Srep) … Oidn, d(Sn, Srep)
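The two-list response can be illustrated with a sequential range query. This sketch is not the Hollow-tree algorithm: it assumes objects are stored in a dictionary keyed by object id, that a Null is represented by `None`, and that the distance to an incomplete object is computed over its observed attributes only. That last choice is a hypothetical stand-in that tends to underestimate the true distance.

```python
import math

def partial_distance(obj, sq):
    """Euclidean distance over the attributes observed in obj (assumed stand-in)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(obj, sq) if a is not None))

def range_query_two_lists(dataset, sq, r):
    """Returns (complete objects, objects with Null values), each sorted by distance."""
    complete, with_nulls = [], []
    for oid, obj in dataset.items():
        dist = partial_distance(obj, sq)
        if dist <= r:
            target = with_nulls if any(a is None for a in obj) else complete
            target.append((oid, dist))
    by_distance = lambda entry: entry[1]
    return sorted(complete, key=by_distance), sorted(with_nulls, key=by_distance)

# Toy dataset (assumed): two complete objects and two with one Null each.
dataset = {"o1": (1.0, 1.0), "o2": (None, 0.5), "o3": (4.0, 4.0), "o4": (0.0, None)}
complete_list, null_list = range_query_two_lists(dataset, (0.0, 0.0), r=2.0)
```

Keeping the lists separate lets the application decide how much to trust answers whose distances were computed from partial information.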
The k-NNq(sq, k) query is sensitive to distance concentration around the query center sq.
[Figure: k-NNq(sq, k) response around sq: list of complete objects and list of objects with Null values, each with entries Oid1, d(S1, Srep) … Oidk-1, d(Sk-1, Srep)]
[Figure: experimental pipeline: Complete Time Series (Original Data) and Incomplete Time Series Datasets (MAR/MNAR) → Feature Extraction (Discrete Wavelet Transform, 20 Coefficients) → Indexing (Slim Tree, Euclidean Distance) → Querying (Query by Similarity: k-NNq, Range; 500 query objects)]
Complete datasets
Dataset  | Description                            | Type      | Nº attributes | Nº objects
NDVI     | Normalized Difference Vegetation Index | Real      | 108           | 500000
WeathFor | Weather Forecast                       | Synthetic | 128           | 10000

Incomplete datasets
MAR data
% missing data           | 2 | 5  | 10 | 15 | 20 | 25
NDVI (Nº attributes)     | 7 | 17 | 33 | 49 | 65 | 82
WeathFor (Nº attributes) | 8 | 20 | 39 | 58 | 78 | 97

MNAR data
Dataset  | % missing data
NDVI     | 20
WeathFor | 2, 5, 10, 12, 16, 18
Precision and Recall for RMq queries: Weather dataset
[Figure: Precision and Recall versus % Missing Data (2, 5, 10, 15, 20, 25) for RMq on the Hollow-tree and RMq on the Slim-tree]
Efficiency parameters: NDVI (MAR & MNAR)
[Figure: three panels, including Total Time [sec], versus % Missing Data (MAR: 2, 5, 10, 15, 20, 25; MNAR: 20) for k-NNq and Rq queries]
Efficiency parameters: WeathFor (MAR)
[Figure: three panels, including Total Time [sec], versus % Missing Data (2, 5, 10, 15, 20, 25) for k-NNq and Rq queries]
Efficiency parameters: WeathFor (MNAR)
[Figure: three panels, including Total Time [sec], versus % Missing Data (2.55, 5, 10.23, 12, 16, 18) for k-NNq and Rq queries]
Complex data bring new interesting challenges:
extractors, which are closer to the users' needs
metric spaces (only the data elements and distances between them are provided)
Fractals)
the knowledge needed and desired.
ICMC-USP/São Carlos
Special thanks to: Glauco Vitor Pedrosa, Safia Brinis, Alceu Ferraz Costa