
SLIDE 1

Content-based Similarity Queries on Complex Data: Challenges and Real Applications


Agma J. Machado Traina

agma@icmc.usp.br

31st Brazilian Symposium on Databases 04-07, October 2016 Salvador - Bahia

SLIDE 2

Outline

Introduction and Motivation
Complex data representation for indexing and retrieval
Feature extraction techniques
From Bags-of-Visual-Words to Bags-of-Visual-Phrases
Missing data
Indexing approaches for missing data
The Hollow-tree
Final considerations

SLIDE 3

Complex data are everywhere:

Current Scenario...

Introduction

[Figure: time series (value versus time) with a data stream sliding window]

SLIDE 4

The volume and diversity of data increase continuously.

“Data is growing faster than ever before and by the year 2020, about 1.7 megabytes of new information will be created every second for every human being on the planet”. “By 2020, our accumulated digital universe of data will grow from 4.4 zettabytes today to around 44 zettabytes, or 44 trillion gigabytes”.

Forbes Sept 2015

Current Scenario...

Introduction

SLIDE 5

There is a great need for developing automated systems that could help users to retrieve complex data from the databases, employing their inherent content. Thus, for each type of data it is important to:

Extract the relevant features that best describe the data
Get dependable data
Index the data for fast retrieval/processing (scalability)
Process queries on their content (similarity queries)

Current Scenario...

Introduction

Learning

Knowledge

(Decision Making)

Access Methods

Information

(Structured)

Features

Data

(Content and Context)

SLIDE 6

6

In social media services, images are composed of content and context data

Content and context can be combined to improve image mining results…

GBdI, ICMC-USP

Content

Example: Images in social media services

Context

…but there are challenges in combining content and context: discarding non-useful information, finding correlations between content and context, relying on dependable information …

SLIDE 7

Let's focus on images, one of the main types of complex data. Nowadays, digital images are produced in ever-increasing quantities and stored in large databases.

Large dataset

Retrieve Similar Images

Current Scenario...

SLIDE 8

There is a great need for developing automated systems that could help users to retrieve images from these databases. A common approach to support image searches relies on the use of CBIR (content-based image retrieval) systems, designed for retrieving the images most similar to a given query image.

Large dataset

Retrieve Similar Images

stored CBIR

Current Scenario...

Images + features

Two main issues: features and distances

SLIDE 9

There is a large number of feature extraction methods for images.

Color distribution
Texture
Shape

Focusing on the Data Features

Features

Data

(Content and Context)

Challenge: to bridge the “semantic gap”

Comply with the user's expectations when querying and retrieving the data

SLIDE 10

The first step is to extract the features of the images
Then, a distance function is applied to measure the dissimilarity, generating a ranking
The most similar images to the given query image appear at the top positions of this ranking
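The ranking step can be illustrated with a minimal sketch. It assumes features are already extracted into fixed-length vectors and uses the Euclidean distance; the image identifiers, the vectors, and the function names are hypothetical and chosen only for illustration.

```python
import math

def euclidean(x, y):
    """L2 distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def rank_by_similarity(query_features, database, distance=euclidean):
    """Return (image_id, distance) pairs sorted from most to least similar."""
    scored = [(image_id, distance(query_features, feats))
              for image_id, feats in database.items()]
    return sorted(scored, key=lambda pair: pair[1])

# Hypothetical 3-bin feature vectors (e.g. a tiny color histogram).
database = {
    "img_a": [0.9, 0.1, 0.0],
    "img_b": [0.2, 0.5, 0.3],
    "img_c": [0.8, 0.2, 0.0],
}
ranking = rank_by_similarity([1.0, 0.0, 0.0], database)
print([image_id for image_id, _ in ranking])
```

A real CBIR system would replace the linear scan with an index structure; the sketch only shows how features plus a distance function produce a ranking.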

General scheme of a CBIR system:

SLIDE 11

Image representation

1 2 ... n

Feature Vector

The success of a retrieval system depends heavily on the way the data is represented.

11

Two main approaches to represent the image content:

global features
local features

SLIDE 12

Image representation

1 2 ... n

Feature Vector

Global representation

Global features have the ability to generalize an entire image with a single vector

12

Ex: Color Histograms, LBP, Fourier, Moment Invariants, Haralick Descriptors...

SLIDE 13

Image representation

1 2 ... n

Feature Vector

1 2 ... n

Feature Vector

1 2 ... n

Feature Vector

...

Local representation

Local features are computed at multiple points in the image

well-suited for object recognition
more robust to occlusion and cluttering

SLIDE 14

Local Features Detection

Scale-invariant feature transform (SIFT) is the most used descriptor to encode local image features;

It detects keypoints using Difference-of-Gaussians (DoG);
It represents each keypoint using orientation and magnitude over a 16x16 region around the point, generating a 128-D feature vector for each keypoint.

[Figure: one 128-D feature vector (1 2 ... 128) per detected keypoint]

Typically, a few thousand keypoints are extracted from each image

SLIDE 15

The Bag-of-Visual-Words approach

The goal is to quantize the local features using a visual dictionary
Each local feature is assigned to a visual word according to a visual dictionary;
The final representation is given by the frequency of each visual word in the image
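A minimal sketch of this quantization step follows. It assumes the visual dictionary is a list of centroid vectors and that each local feature is assigned to its nearest centroid; the 2-D toy features stand in for real 128-D descriptors, and all names are hypothetical.

```python
def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def nearest_word(feature, dictionary):
    """Index of the visual word (centroid) closest to a local feature."""
    return min(range(len(dictionary)),
               key=lambda i: squared_distance(feature, dictionary[i]))

def bag_of_visual_words(local_features, dictionary):
    """Histogram with one bin per visual word."""
    histogram = [0] * len(dictionary)
    for feature in local_features:
        histogram[nearest_word(feature, dictionary)] += 1
    return histogram

# Toy dictionary with four words (A, B, C, D) and seven local features.
dictionary = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
local_features = [[0.1, 0.0], [0.9, 0.1], [0.0, 0.2], [0.1, 0.9],
                  [1.0, 0.9], [0.2, 0.1], [0.8, 0.0]]
bovw = bag_of_visual_words(local_features, dictionary)
print(bovw)  # word frequencies for A, B, C, D
```

Whatever the number of keypoints in the image, the resulting vector always has one entry per visual word.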

15

A B C A A A D D A B C D Visual Dictionary

Local features extraction Representation

Bag-of-Visual-Words

A C B 4 3 2 1 D

SLIDE 16

Visual Dictionary Construction

[Figure: local features detection on an image database subset; clustering turns the feature space into a clustered feature space, which yields the visual dictionary]

Clustering is a common method for learning a visual dictionary
A dictionary can be built by clustering the local features detected in a set of training images from the database

A visual word is the centroid of a cluster
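As an illustration of dictionary construction, here is a sketch using plain Lloyd's k-means, chosen as an assumption because the slides do not name a specific clustering algorithm. The naive initialisation, the iteration count, and the toy 2-D features are all hypothetical.

```python
def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def build_dictionary(features, k, iterations=20):
    """Plain Lloyd's k-means; returns k centroids (the visual words)."""
    # Naive deterministic initialisation: k evenly spaced samples.
    step = len(features) // k
    centroids = [list(features[i * step]) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for f in features:
            nearest = min(range(k),
                          key=lambda i: squared_distance(f, centroids[i]))
            clusters[nearest].append(f)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

training_features = [[0.0, 0.1], [0.1, 0.0], [0.1, 0.1],
                     [5.0, 5.1], [5.1, 5.0], [5.1, 5.1]]
visual_words = build_dictionary(training_features, k=2)
print(visual_words)
```

In practice the training set holds a very large number of local features, so a production system would likely use a library implementation with better initialisation.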

SLIDE 17

17

This final representation loses the spatial relationship between the visual words in the image

A B C A A A D D A B C D Visual Dictionary

Local features extraction Representation

Bag-of-Visual-Words

A C B 4 3 2 1 D

The Bag-of-Visual-Words (BoVW) approach

SLIDE 18

Some Proposed Approaches

Spatial Pyramid (SP)

The idea is to divide the image space into sub-regions and to compute a bag in each sub-region. Not invariant to rotation.

* S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. CVPR 2006.

SLIDE 19

Words Spatial Arrangement (WSA)

The idea is to divide the image space into quadrants using each keypoint as the origin of the quadrants and counting the number of words that appear in each quadrant.

** Penatti, O. A.; Silva, F. B.; Valle, E.; Gouet-Brunet, V.; Torres, R. S. Visual word spatial arrangement for image retrieval and classification. Pattern Recognition, v. 47, n. 2, p. 705-720, 2014.

SLIDE 20

Drawback of previous BoVW works:

A C C A A B B B A C C A A B B B

not invariant to rotation!!!

Different representations

Invariant features?

SLIDE 21

New Methods for Encoding Spatial Information

1) Global Spatial Arrangement (GSA): encodes global spatial distribution
2) Bag-of-2-grams: encodes co-occurrence of words.

SLIDE 22

22

A C C A A B B B

Uses the information about the gradient direction of each visual word to define the quadrant.

Gradient always points to the direction of higher values

Proposed Method: Global Spatial Arrangement (GSA)

SLIDE 23

23

To reduce the feature vector dimensionality, we group each quarter of the quadrant in relation to its position: top, down, left or right

The proposed final representation is a 2-tuple considering the relation between top-down and left-right, computed as follows

Global Spatial Arrangement

SLIDE 24

24

Representing the spatial information of visual words of image (a) and image (b) using: (c) the WSA (Words Spatial Arrangement) , (d) the SP (Spatial Pyramid) and (e) the GSA (Global Spatial Arrangement).

Experimental Results

smallest one

SLIDE 25

25

Corel1000 dataset Texture dataset

Retrieval Problem

Experimental Results

SLIDE 26

From Bags-of-Visual-Words to Bags-of-Visual-Phrases

26

SLIDE 27

The Bag-of-Visual-Words approach The Bag-of-Visual-Words approach

C F D A E

Bag-of-Visual-Words Dictionary of words

Representation

B C F D A E B A A A A A B B C D C C F F A C F A B D A B A A C C F

Image

Local features extraction

27

The main idea of using visual dictionaries is to consider that the visual patterns present in images are similar to textual words present in textual documents. Therefore, an image is composed of visual words as a textual document is composed of textual words

SLIDE 28

Approach

In the textual domain, a 2-gram (bigram) is a sequence of two adjacent words, such as: {data bases, computer vision, medical systems, artificial systems}

Problem: We have an exponential combination of phrases! Which phrases are the representative ones to encode image information?

Solution: We used 2-grams (bigrams) for generating visual phrases

SLIDE 29

The BoVW drawback

The Bag-of-Visual-Words ignores the spatial information of visual words in the final representation

C F D A E

Bag-of-Visual-Words Dictionary of words

Representation

B C F D A E B A A A A A B B C D C C F F A C F A B D A B A A C C F

Image

Local features extraction

29

SLIDE 30

Bags-of-Visual-Phrases

C F D A E

Bag-of-Visual-Words Dictionary of words

Representation

B C F D A E B A A A A A B B C D C C F F A C F A B D A B A A C C F

Image

Local features extraction

C D F A B C A A B E F

Bag-of-Visual-Phrases

C D F A B C A A B E F A B A B C D F C A C A Representation

Dictionary of Phrases

The goal is to encode spatial information of visual words
A more powerful description can be obtained by grouping words

30 AB CD F ABEF CA AB

ABEF

AB AB

CDF CDF

CA CA CA

Bag-of-Visual-Phrases

SLIDE 31

The proposed approach

A B A C A A D D

BD BA BD BC BA

The 2-grams can be generated by placing a region over each keypoint

Each pair of words formed with the center point is considered a 2-gram.
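A minimal sketch of this extraction, assuming each keypoint carries a position and an already assigned visual word, and that the region is a circle of fixed radius. The radius, the keypoints, and the choice to write each 2-gram as center word followed by neighbor word are illustrative assumptions.

```python
import math
from collections import Counter

def extract_2grams(keypoints, radius):
    """keypoints: list of (x, y, word). Returns center+neighbor word pairs."""
    grams = []
    for i, (cx, cy, center_word) in enumerate(keypoints):
        for j, (x, y, word) in enumerate(keypoints):
            # Pair the center with every other keypoint inside its region.
            if i != j and math.hypot(x - cx, y - cy) <= radius:
                grams.append(center_word + word)
    return grams

keypoints = [(0, 0, "B"), (1, 0, "D"), (0, 1, "A"), (5, 5, "C")]
bag_of_2grams = Counter(extract_2grams(keypoints, radius=1.0))
print(bag_of_2grams)
```

Counting the extracted pairs against a dictionary of 2-grams gives a histogram in the same way word counts do for the Bag-of-Visual-Words.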


SLIDE 32

A B A C A A D A

The proposed approach

BC, BA, BA BD, BA, BA

2-grams extracted

We divided the area into two zones to extract orientation

32

SLIDE 33

2-grams extraction Dictionary of 2-grams

AB, BC, AC, AB, CA, BB, CC, CA, CA, BA, AA ...

AB AC AD BC BD CD

AB AD AC 4 3 2 1 BC BD CD

Bag-of-2-grams

BB, CC, CA, CA, BA, AA AB, BC, AC, AB, CA, ...

Bag-of-2-grams

AB AD AC 4 3 2 1 BC BD CD

Bag 1 Bag 2

A B C A A A D D

33

The proposed approach

SLIDE 34

Experimental Results

34

Dataset evaluated: ImageCLEF 2012 Medical Task, composed of 5,042 biomedical images classified in 32 categories and 3 levels

Comparative results using an 80-20 classification test

CLD = Color Layout Descriptor
EHD = Edge Histogram Descriptor
CEDD = Color and Edge Directivity Descriptor
BoVW = Bag-of-Visual-Words

SLIDE 35

Going further on BoVW: Bag-of-Salience-Points

It is a Bag-of-Words model to retrieve shapes by similarity using salient points

SLIDE 36

An image is usually represented by color, texture and/or shape

color texture shape

Shape is usually more effective in characterizing the object within an image

It represents the silhouette of the object in the image

Image Description

SLIDE 37

However, the development of a shape descriptor is a challenging task in the computer vision area. The main reason is that the same object may present a rich variability of shapes, and different objects may present shapes with a high visual similarity.

Shapes of different objects with high visual similarity
Shapes of the same object with low visual similarity

Shape Description

SLIDE 38

Different shape descriptors may target different aspects of the shape. There are many shape descriptors in the literature:

Fourier Descriptors
Curvature Scale Space
Fractal Dimension
Moment Invariant
Zernike Moments
Shape Salience Descriptor

38

Contour-based approaches Statistical-based approaches Analyses only the salient points

Shape Description

Region-based approach

SLIDE 39

Saliences are the points of highest curvature along the contour. Motivation: salient (corner) points encode the most important parts of a shape in a compact way and are invariant to geometric transformations.

Salient (corner) points of some shapes

39

Motivation

SLIDE 40

How to represent each salient point? 4 saliences 10 saliences

How to measure the dissimilarity?

40

Problems of using salient points as features...

1. Dealing with variable number of features
2. The need of building a distance function
3. Dealing with a large number of features

SLIDE 41

The idea is to model the representation as a Bag-of-Words approach, using a Visual Dictionary

Advantage: final feature vector with a fixed dimensionality

41

Bag-of-Salience-Points (BoSP)

Bag-of-Salience-Points (BoSP): Method

SLIDE 42

42

Bag-of-Salience-Points (BoSP): Method Bag-of-Salience-Points (BoSP): Method

SLIDE 43

43

We assume the shape was previously segmented

Bag-of-Salience-Points (BoSP): Method Bag-of-Salience-Points (BoSP): Method

SLIDE 44

44

We assume the salient points were detected

Bag-of-Salience-Points (BoSP) Method Bag-of-Salience-Points (BoSP) Method

SLIDE 45

45

Bag-of-Salience-Points (BoSP) Method Bag-of-Salience-Points (BoSP) Method

SLIDE 46

Curvature equation: [equation not preserved in the transcript], where s = arc length of the curve portion

Basically, the curvature equation describes how much the curve bends at a point

Varying the value of s we can obtain a multi-scale representation

Bag-of-Salience-Points (BoSP) Method

SLIDE 47

We count how many times each word appears in the shape

A B B D D C B A

Problem: Two different shapes can have the same global histogram

Solution: Encode the spatial relationship between the visual words in the image!!

BoSP Method

SLIDE 48

48

Bag-of-Salience-Points (BoSP) Method Bag-of-Salience-Points (BoSP) Method

SLIDE 49

49

Bag-of-Salience-Points (BoSP) Method Bag-of-Salience-Points (BoSP) Method

SLIDE 50

50

Bag-of-Salience-Points (BoSP) Method Bag-of-Salience-Points (BoSP) Method

SLIDE 51

1) We divide the shape space into equally spaced zones according to the distance from the shape centroid. In this example, we used 3 zones
2) We compute a histogram in each zone.
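The two steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the centroid is approximated by the mean of the salient points, zones are concentric rings of equal width up to the farthest point, and the final vector is taken to be the concatenation of the per-zone histograms.

```python
import math

def zone_histograms(points, num_words, num_zones):
    """points: list of (x, y, word_index) salient points of one shape.

    Returns one word histogram per concentric zone around the centroid.
    """
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    distances = [math.hypot(x - cx, y - cy) for x, y, _ in points]
    max_distance = max(distances) or 1.0  # avoid division by zero
    histograms = [[0] * num_words for _ in range(num_zones)]
    for (_, _, word), dist in zip(points, distances):
        # Map the distance to a zone index; the farthest point goes to the last zone.
        zone = min(int(dist / max_distance * num_zones), num_zones - 1)
        histograms[zone][word] += 1
    return histograms

points = [(0, 0, 0), (3, 0, 1), (-3, 0, 1), (0, 6, 2), (0, -6, 2)]
histograms = zone_histograms(points, num_words=3, num_zones=3)
final_vector = [count for hist in histograms for count in hist]
print(final_vector)
```

The vector length is the number of words times the number of zones, so it stays fixed regardless of how many salient points a shape has.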

Encoding the spatial arrangement of visual words…

51

Two different shapes can have the same global histogram, but usually not the same distribution of visual words

BoSP Method

SLIDE 52

52

FINAL REPRESENTATION

BoSP Method BoSP Method

SLIDE 53

Parameters:

How many words?
How many zones?

53

BoSP Method BoSP Method

SLIDE 54

To investigate the best values of these two parameters, we explored different values using two different databases:

MPEG-7 CE-Shape-1 (70 different classes)
Kimia-216 (18 different classes)

Some sample shapes of each dataset

Experimental Evaluation

SLIDE 55

These graphs show the mAP values obtained by varying:
the dictionary size
the quantity of zones. For Z = 1 we consider only the global histogram

Dictionary sizes higher than 20 did not achieve an overall improvement
Quantities of zones higher than 4 did not improve the results

MPEG-7 dataset Kimia-216 dataset

Experimental Evaluation

SLIDE 56

Performance comparison of BoSP with 4 descriptors:

MI (Moment Invariants)
Fourier (Fourier Descriptor),
MS Fractal (Multi-scale Fractal Dimension)
SSD (Shape Salience Descriptor)

56

Feature vector dimensionality comparison

smallest vector

Problem: it needs a specific distance function to compute the dissimilarity

Experimental Evaluation Experimental Evaluation

SLIDE 57

Average computational time to compute the dissimilarity: the proposed descriptor (BoSP) is the second fastest descriptor

57

Experimental Evaluation Experimental Evaluation

SLIDE 58

Retrieval Performance (Curve of Precision x Recall)

MPEG-7 dataset Kimia-216 dataset

The proposed descriptor achieved similar performance to the SSD descriptor, while being 53% faster when computing the dissimilarity

58

Experimental Evaluation Experimental Evaluation

SLIDE 59

Bag-of-Visual-Phrases -> Bag-of-Salience-Points (BoSP): new feature extraction methods for dealing with shape-based images using salience point features

Three interesting points:
a multi-scale method to efficiently represent the salience points of a shape;
a Dictionary of Curvatures to encode the final shape representation into a single feature vector;
a spatial pooling approach to encode the distance distribution of the visual words in the shape space.

Experimental results show that the proposed descriptor achieved the best retrieval performance while requiring a low computational cost to measure the dissimilarity.

59

Considerations on BoVW approaches

1. Dealing with variable number of features
2. The need of building a distance function
3. Dealing with a large number of features

SLIDE 60

From Features to Data Structures

Access Methods

Information

(Structured)

Features

Data

(Content)

SLIDE 61

Advanced database applications must deal with:

Large number of data elements (i.e., cardinality)
High dimensionality (i.e., number of attributes)
Complexity of the features that describe the attributes
Non-dimensional data (e.g. DNA sequences)

Complex Data

Non-dimensional and high-dimensional datasets may consist of thousands of attributes and may be subject to missing values.

SLIDE 62

Missing data can occur due to:

Preventable errors or mistakes (e.g. failing to appear for a medical exam, etc.)
Problems outside of control (e.g. failure of the equipment, low battery, etc.)
Privacy or security reasons
Legitimate reasons (e.g. a survey question that does not apply to the respondent)

Motivation

SLIDE 63

Missing data impacts similarity search for two main reasons:

Distance function: how to measure the distance among elements when part of the information is missing?
Access methods collapse

Dealing with missing data

SLIDE 64

Mechanisms of missingness (Rubin, 1976)

Missing Completely At Random (MCAR): the probability that data are missing is independent of both observed and missing data
$\Pr(I \mid y_{obs}, y_{miss}) = \Pr(I)$

Missing At Random (MAR, Ignorable Missingness): the probability that data are missing is independent of the missing data, but may be a function of the observed data
$\Pr(I \mid y_{obs}, y_{miss}) = \Pr(I \mid y_{obs})$

Missing Not At Random (MNAR, Non-ignorable Missingness): occurs when data are missing as a function of the missing values
$\Pr(I \mid y_{obs}, y_{miss}) = \Pr(I \mid y_{miss})$

$Y$: a set of variables
$y_{obs}$: fully observed variables
$y_{miss}$: variables with missing values
$I$: indicator variable
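The three mechanisms can be made concrete with a small simulation sketch. The helper names, the thresholds, and the toy rows are hypothetical; each function returns the indicator I (True means "missing") for one target attribute.

```python
import random

def mcar_mask(rows, prob, seed=0):
    """MCAR: every value is dropped with the same probability."""
    rng = random.Random(seed)
    return [rng.random() < prob for _ in rows]

def mar_mask(rows, observed_column, threshold):
    """MAR: missingness depends only on another, fully observed attribute."""
    return [row[observed_column] > threshold for row in rows]

def mnar_mask(rows, column, threshold):
    """MNAR: missingness depends on the value that goes missing."""
    return [row[column] > threshold for row in rows]

def apply_mask(rows, column, mask):
    """Replace the target attribute by None wherever the indicator is True."""
    return [[None if (m and i == column) else v for i, v in enumerate(row)]
            for row, m in zip(rows, mask)]

# Toy rows: (fully observed attribute, attribute that may go missing).
rows = [[10, 0.2], [20, 0.9], [30, 0.4], [40, 0.7]]
mar = mar_mask(rows, observed_column=0, threshold=25)
mnar = mnar_mask(rows, column=1, threshold=0.5)
incomplete = apply_mask(rows, 1, mnar)
print(mar, mnar, incomplete)
```

Under MNAR the largest values of the attribute are exactly the ones that disappear, which is why the remaining data give a biased picture of it.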

Taxonomy

SLIDE 65

Missing data can occur in the raw data:

[Figure: normalized NDVI time series in four panels: original signal; signal reconstructed with DWT; signal with 10% missing data; signal reconstructed with DWT]

SLIDE 66

The distances between objects with missing values are undefined because the differences between the attributes with missing values are unknown.

A1 A2 Null …. An-2 Null An

Obj A

B1 B2 B3 …. Bn-2 Bn-1 Bn

Obj B

? ?

How to compare the data elements?

66

Obs: Given any two feature vectors X and Y, the Lp family of distance functions is defined as: $L_p(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
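A short sketch of the Lp (Minkowski) family makes the problem visible: as soon as one attribute is Null, the sum has a term that cannot be evaluated. Representing Null as Python `None` and raising an error are choices made here for illustration only.

```python
def lp_distance(x, y, p=2):
    """Minkowski (Lp) distance; p=1 is Manhattan, p=2 is Euclidean."""
    if any(a is None or b is None for a, b in zip(x, y)):
        raise ValueError("distance is undefined when an attribute is missing")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

print(lp_distance([0, 0], [3, 4]))        # Euclidean
print(lp_distance([0, 0], [3, 4], p=1))   # Manhattan

obj_a = [1.0, 2.0, None, 4.0]   # object with a Null attribute
obj_b = [1.0, 5.0, 3.0, 8.0]    # complete object
try:
    lp_distance(obj_a, obj_b)
except ValueError as error:
    print(error)
```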

SLIDE 67

The distances between objects with missing values are undefined because the differences between the attributes with missing values are unknown.

A1 A2 Null …. An-2 Null An

Obj A

B1 B2 B3 …. Bn-2 Bn-1 Bn

Obj B

? ?

How to compare the data elements?

67

SLIDE 68

68 68

Fractal Concepts

68

SLIDE 69

Fractal Concepts

Fractal: self-similarity property (an object that presents roughly the same characteristics over a large range of scales).

Line - resolution 1:1
Line - resolution 1:100
Line - resolution 1:1000000000000

SLIDE 70

Fractals and Intrinsic Dimension - Intuition

70

SLIDE 71

Fractals and Intrinsic Dimension - Intuition

71

Data distribution/behavior and attribute correlation

Box-counting approach

Fast method (linear cost) [Traina_SBBD2000], [Traina_JIDM2011]
Well-suited to complex data
Scalable

SLIDE 72

Fractals – Examples: Sierpinski triangle

72

Box-counting approach

SLIDE 73

Fractals – Examples

73

Bureau of the Census - Tiger/Line Precensus Files: 1990 technical documentation.

SLIDE 74

74 74

Fractals – Examples

74

Bureau of the Census - Tiger/Line Precensus Files: 1990 technical documentation.

SLIDE 75

Given a set of n objects in a dataset with a distance function d: $PC(r) = K_p \times r^D$

[Figure: log(# pairs within distance r) versus log(r); fractal dimension D of the Sierpinski dataset]

Fractal Dimension for Similarity Search

SLIDE 76

Fractal Dimension

Given a set of n objects in a dataset with a distance function d: $PC(r) = K_p \times r^D$

[Figure: log(# pairs within distance r) versus log(r); fractal dimension D of the Sierpinski dataset]

The distance exponent is invariant to random sampling, i.e., the power law holds for subsets of the dataset.
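The exponent D can be estimated as the slope of log(PC(r)) against log(r). The sketch below uses a quadratic-cost pair count and a least-squares fit purely for illustration; it is not the linear-cost box-counting method cited in the talk. The test data (points on a straight line embedded in 3-D) and the radii are hypothetical, and the radii must be large enough for every pair count to be positive.

```python
import math

def pair_count(points, r):
    """Number of pairs of points within distance r of each other."""
    count = 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= r:
                count += 1
    return count

def distance_exponent(points, radii):
    """Least-squares slope of log(PC(r)) versus log(r)."""
    xs = [math.log(r) for r in radii]
    ys = [math.log(pair_count(points, r)) for r in radii]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# 400 points on a line segment embedded in 3-D space.
line_points = [(i / 400, i / 400, i / 400) for i in range(400)]
estimated_d = distance_exponent(line_points, [0.05, 0.1, 0.2, 0.4])
print(round(estimated_d, 2))
```

Although the points have three attributes, the estimate stays close to 1, reflecting the intrinsic dimension of the data rather than the number of attributes.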

Fractal Dimension for Similarity Search

SLIDE 77

77 77 Dynamic radius = Diameter of space

Query Objects

Metric tree k-NNq query

sq

Query Objects

Metric tree Range query

Limiting radius rq sq

Oid1, d(S1, Srep) … Oidk, d(Sk, Srep) Oid1, d(S1, Srep) … Oidn, d(Sn, Srep) Oid1, d(S1, Srep) … Oidk, d(Sk, Srep)

Update the query response

Fractal Dimension for Similarity Search Fractal Dimension for Similarity Search

SLIDE 78

Limiting radius rq

Oid1, d(S1, Srep) … Oidn, d(Sn, Srep)

78 78

Query Objects

Metric tree Range query

sq

Query Objects

Metric tree k-NNq query

Oid1, d(S1, Srep) … Oidk, d(Sk, Srep)

Reduce the dynamic radius
Final query response

Fractal Dimension for Similarity Search

SLIDE 79

Fractal Concepts

79

SLIDE 80

Pairwise/Listwise Deletion
Imputation methods (e.g. Mean Substitution, Multiple Imputation)

Biased results when predicting MNAR data
High cost for more sophisticated techniques

Special treatment is necessary to allow the applications to operate on the available data properly.

Missing Data Treatment at the Data Level

SLIDE 81

Up to now, no metric access method supports similarity search over incomplete datasets.

Missing Data Treatment at the Data Level

SLIDE 82

Metric access methods

Employ an index structure to organize the objects in a hierarchical tree structure, called Metric Tree, based on a distance function.
The space is divided into regions using a set of chosen objects, called representatives, and their distances to the rest of the objects in the space.

[Figure: data objects grouped around representatives (Rep) in a Metric Tree]

Metric Access Methods

SLIDE 83

S: Data domain
d: Metric distance

(1) Reflexivity: d(x,x) = 0
(2) Non-negativity: d(x,y) > 0 for x ≠ y
(3) Symmetry: d(x,y) = d(y,x)
(4) Triangle inequality: d(x,y) + d(y,z) ≥ d(x,z)
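The four properties can be checked by brute force on a finite sample, which is enough to expose a function that is not a metric (it cannot prove that one is). The sketch below is illustrative; the sample points and the squared-Euclidean counterexample are assumptions added here.

```python
import itertools
import math

def is_metric(objects, d, tol=1e-12):
    """Brute-force check of the four metric properties on a finite sample."""
    for x in objects:
        if abs(d(x, x)) > tol:                      # (1) reflexivity
            return False
    for x, y in itertools.permutations(objects, 2):
        if x != y and d(x, y) <= 0:                 # (2) non-negativity
            return False
        if abs(d(x, y) - d(y, x)) > tol:            # (3) symmetry
            return False
    for x, y, z in itertools.permutations(objects, 3):
        if d(x, y) + d(y, z) < d(x, z) - tol:       # (4) triangle inequality
            return False
    return True

def squared_euclidean(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

sample = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 1.0)]
print(is_metric(sample, math.dist))          # Euclidean distance
print(is_metric(sample, squared_euclidean))  # breaks the triangle inequality
```

Metric trees rely on the triangle inequality to prune regions, so a function that fails property (4) cannot be used safely with them.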

Metric Space – Metric distance

SLIDE 84

Slim-tree Metric Access Method

SLIDE 85

Missing data can underestimate or overestimate the distances and:

When data are MAR => Distortion of the index structure
When data are MNAR => Skew (distance concentration) of the index structure

Problem definition

[Figure: representatives (Rep) with their covering radius, showing distance concentration; legend: object with Null values, complete object, representative]

SLIDE 86

Missing data considerations

Missing data can underestimate or overestimate the distances

[Figure: a representative (Rep) with covering radius r in three cases: complete data; Missing At Random (sparser data, distortion); Missing Not At Random (distance concentration, skew); legend: object with Null values, complete object, representative]

SLIDE 87

Ignore missing attribute values and index the data with missing values:

When data are MAR => Distortion in the index structure
When data are MNAR => Skew in the index structure

[Figure: representatives (Rep) with their covering radius, showing distance concentration; legend: object with Null values, complete object, representative]

This fact can cause inconsistency in the data structure, leading to inaccurate query responses.

Problem definition

SLIDE 88

Investigate the key issues involved when indexing and searching datasets with missing attribute values in metric spaces,
Identify the effects of each mechanism of missingness on the metric access methods when applied on incomplete datasets,
Formalize the problem of missing data in metric spaces and propose a “Model of Missingness”,
Develop new techniques to support similarity search over large and complex datasets with missing values.

A Metric Access Method to support similarity search over large and complex datasets with missing attribute values:

Able to index data with missing attribute values.
Performs the similarity queries on the available data.
Searches for complete data as well as data with missing values.

Hollow-tree

SLIDE 89

The Hollow-tree Metric Access Method

Built over the Slim-tree platform.
Technique that allows indexing objects with missing values.
Similarity queries based on Fractal Dimension and the local density around the query objects to achieve an accurate query response, when missingness is ignorable.
Overcomes the limitations of the metric access methods when applied on incomplete datasets.

SLIDE 90

90 90

Building the Hollow-tree Building the Hollow-tree

Object with Null Values Complete Object

r3

Data Objects

rep3

r1

rep1 rep2

r2 v

Slim-tree

v

Load Complete Objects Leaf Nodes

SLIDE 91

There are two types:
Range query Rq(sq, r)
k-Nearest Neighbor query k-NNq(sq, k)

[Figure: Rq(sq, r) with query center sq and radius r; k-NNq(sq, k) with query center sq]

Similarity Queries
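The two query types can be written down as a linear-scan sketch, which defines what an index such as a metric tree must return, only faster. The function names and the toy dataset are hypothetical, and the Euclidean distance is an assumed default.

```python
import heapq
import math

def range_query(dataset, sq, r, d=math.dist):
    """Rq(sq, r): every object whose distance to sq is at most r."""
    return [s for s in dataset if d(s, sq) <= r]

def knn_query(dataset, sq, k, d=math.dist):
    """k-NNq(sq, k): the k objects closest to sq."""
    return heapq.nsmallest(k, dataset, key=lambda s: d(s, sq))

dataset = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (5.0, 5.0)]
in_range = range_query(dataset, (0.0, 0.0), 1.5)
nearest = knn_query(dataset, (0.0, 0.0), 3)
print(in_range, nearest)
```

A range query has a fixed radius and an unknown result size, while a k-NN query has a fixed result size and an unknown radius.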

SLIDE 92

92 92 Load objects with Null values

Object with Null Values Complete Object

r3

Data Objects

rep3

r2 r1

rep1 rep2

v

Slim-tree

v

Load complete objects

Leaf Nodes

Building the Hollow-tree Building the Hollow-tree

SLIDE 93

Indicator missing: FALSE / TRUE

Load objects with Null values

Object with Null Values Complete Object

r3

Data Objects

rep3

r2 r1

rep1 rep2

v

Slim-tree

v

Load complete objects

Leaf Nodes

This strategy prevents data with missing values from being promoted as representatives, thus avoiding substantial distortion in the internal structure of the index.

Building the Hollow-tree

SLIDE 94

The queries return two separate lists

sq

r

Oid1, d(S1, Srep) … Oidn, d(Sn, Srep) Oid1, d(S1, Srep) … Oidn, d(Sn, Srep) Oid1, d(S1, Srep) … Oidn, d(Sn, Srep) Oid1, d(S1, Srep) … Oidn, d(Sn, Srep)

List of complete objects List of objects with Null values
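The idea of answering a query with two separate lists can be illustrated by a linear-scan sketch. This is not the Hollow-tree algorithm: it assumes Null is represented as `None`, that the distance is a Euclidean distance restricted to the attributes available in both objects, and that the object identifiers and values are hypothetical.

```python
import math

def partial_distance(x, y):
    """Euclidean distance restricted to attributes present in both vectors."""
    diffs = [(a - b) ** 2 for a, b in zip(x, y)
             if a is not None and b is not None]
    return math.sqrt(sum(diffs))

def range_query_two_lists(dataset, sq, r):
    """Linear-scan range query returning (complete, with_nulls) result lists."""
    complete, with_nulls = [], []
    for oid, obj in dataset.items():
        dist = partial_distance(obj, sq)
        if dist <= r:
            target = with_nulls if None in obj else complete
            target.append((oid, dist))
    by_distance = lambda entry: entry[1]
    return sorted(complete, key=by_distance), sorted(with_nulls, key=by_distance)

dataset = {
    1: [0.0, 0.0, 0.0],
    2: [1.0, None, 0.0],
    3: [3.0, 3.0, 3.0],
    4: [None, 0.5, 0.5],
}
complete, with_nulls = range_query_two_lists(dataset, [0.0, 0.0, 0.0], 1.5)
print(complete, with_nulls)
```

Because the partial distance skips unknown terms, it can only underestimate the full distance, so keeping objects with Null values in their own list lets the user judge those answers separately.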

Similarity queries on the Hollow-tree

SLIDE 95

k-NNq(sq, k) query is sensitive to distance concentration around the query center sq.

sq

r

Oid1, d(S1, Srep) … Oid1, d(S1, Srep) … Oid1, d(S1, Srep) … Oidk-1, d(Sk-1, Srep) Oid1, d(S1, Srep) … Oidk-1, d(Sk-1, Srep)

List of complete objects

List of objects with Null values

k-NNq Query for Data with Missing Values: k-NNFMq

SLIDE 96

[Figure: experimental pipeline. Feature extraction: complete time series (original data) and incomplete time series datasets (MAR/MNAR) are reduced to 20 coefficients with the Discrete Wavelet Transform. Indexing: Slim-tree with Euclidean distance. Querying: 500 query objects, queries by similarity (k-NNq, Range)]

Experimental Evaluation

SLIDE 97

Complete datasets

Dataset    Description                               Type       Nº attributes  Nº objects
NDVI       Normalized Difference Vegetation Index    Real       108            500000
WeathFor   Weather Forecast                          Synthetic  128            10000

Incomplete datasets

MAR data

Dataset    Nº attributes  % missing data
NDVI       7              2
NDVI       17             5
NDVI       33             10
NDVI       49             15
NDVI       65             20
NDVI       82             25
WeathFor   8              2
WeathFor   20             5
WeathFor   39             10
WeathFor   58             15
WeathFor   78             20
WeathFor   97             25

MNAR data

Dataset    % missing data
NDVI       20
WeathFor   2, 5, 10, 12, 16, 18

Experimental Evaluation

SLIDE 98

Experimental Results

Precision and Recall for RMq queries, Weather dataset

[Figure: precision and recall (0 to 1) versus % missing data (2, 5, 10, 15, 20, 25) for RMq on the Hollow-tree and RMq on the Slim-tree]

SLIDE 99

Experimental Results

Efficiency parameters: NDVI (MAR & MNAR)

[Figure: total time [sec], average disk accesses and average distance calculations versus % missing data (MAR: 2, 5, 10, 15, 20, 25; MNAR: 20) for k-NNq and Rq queries]

SLIDE 100

Experimental Results

Efficiency parameters: WeathFor (MAR)

[Figure: average disk accesses, average distance calculations and total time [sec] versus % missing data (2, 5, 10, 15, 20, 25) for k-NNq and Rq queries]

SLIDE 101

Experimental Results

Efficiency parameters: WeathFor (MNAR)

[Figure: average disk accesses, average distance calculations and total time [sec] versus % missing data (2.55, 5, 10.23, 12, 16, 18) for k-NNq and Rq queries]

SLIDE 102

Complex data bring new interesting challenges:

Development of more robust (invariant to transformations) feature extractors, which are closer to the users' needs
Strategies to deal with high-dimensional and adimensional data in metric spaces (only the data elements and distances between them are provided)
Scalable approaches to deal with missing data
Get the real intuition about your data distribution (intrinsic dimension: Fractals)
New mechanisms for organizing the data
A closer relationship with related fields to gather and convey to the users the knowledge needed and desired.

Conclusions

Many opportunities to research!

SLIDE 103

Results presented herein have been applied to systems under development at

EMBRAPA: Agrodatamine and Agrocomputing.net
Clinical Hospital of Ribeirão Preto – USP.

Conclusions

SLIDE 104

ACKNOWLEDGMENT

To all the members of the Databases and Images Group (GBdI), ICMC-USP/São Carlos
Especially to: Glauco Vitor Pedrosa, Safia Brinis, Alceu Ferraz Costa

To SBBD 2016 Steering and Organizing Committees
To you all for attending SBBD 2016

Thanks!

SLIDE 105

Prof. Agma Juci Machado Traina