SLIDE 1

Quality of Similarity Rankings in Time Series

  • T. Bernecker,
  • M. E. Houle,
  • H.-P. Kriegel,
  • P. Kröger,
  • M. Renz,
  • E. Schubert,
  • A. Zimek

Outline:

  • Motivation: Interpreting Distance Functions
  • Distance Functions: Curse of Dimensionality, SNN Distance
  • Experiments: SNN performance, Histograms, Effects of noise
  • Conclusions

12th International Symposium on Spatial and Temporal Databases (SSTD 2011)

Thomas Bernecker¹, Michael E. Houle², Hans-Peter Kriegel¹, Peer Kröger¹, Matthias Renz¹, Erich Schubert¹, Arthur Zimek¹

¹ Ludwig-Maximilians-Universität München, Munich, Germany
² National Institute of Informatics, Tokyo, Japan

2011-08-26, Minneapolis, MN

SLIDE 2

Time Series Distances

Time series research ... has plenty of:

◮ New distance functions
◮ Dimensionality reduction
◮ Approximations

... but:

◮ How big is a distance of 0.432?
◮ How big is a difference of 0.123?

What is the meaning of these values?

SLIDE 3

Interpreting distance functions

Distance functions used to have a physical meaning:

◮ “As the crow flies”
◮ “Taxicab metric”

This worked well for the three-dimensional world. But this is not so in time series:

◮ “Curse of dimensionality”: loss of contrast in high-dimensional data
◮ Dimension-alignment as done by time warping
◮ Edit distances treat big and small edits the same

But: the distance functions work!

SLIDE 4

The “Curse of Dimensionality”

Commonly described as

◮ Distances become “indiscernible”
◮ Distances “lose their usefulness”
◮ Hypercube becomes “vastly” bigger than hypersphere
◮ Nearest and farthest neighbor become similar
◮ Mathematical:

$$\lim_{\mathit{dim} \to \infty} \frac{\mathit{dist}_{\max} - \mathit{dist}_{\min}}{\mathit{dist}_{\min}} \to 0$$

So they should not work. But: they do!
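
The loss of contrast can be observed numerically. The following is a minimal sketch, not taken from the slides: it assumes i.i.d. uniform data, Euclidean distance, and a single random query point, and the function name `relative_contrast` is chosen here for illustration only.

```python
import numpy as np

def relative_contrast(dim, n_points=500, seed=0):
    """(dist_max - dist_min) / dist_min for one random query point.

    Assumption: data and query are drawn i.i.d. uniformly from [0, 1]^dim.
    """
    rng = np.random.default_rng(seed)
    data = rng.random((n_points, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(data - query, axis=1)  # Euclidean distances to the query
    return (dists.max() - dists.min()) / dists.min()

# Relative contrast for increasing dimensionality.
contrasts = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, c in contrasts.items():
    print(f"dim={dim:5d}  relative contrast={c:.3f}")
```

On such i.i.d. data the printed value should shrink as the dimensionality grows. This says nothing yet about data with relevant attributes.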

SLIDE 5

How bad is the “Curse of Dimensionality”?

Some facts on the “Curse of Dimensionality” (from Houle et al. 2010):

◮ Mathematics proven for i.i.d. data only
◮ Relevant dimensions make the problem easier
◮ Irrelevant dimensions make the problem harder
◮ ⇒ mostly a matter of “signal to noise ratio”
◮ Numerical contrast goes away, but ranking still remains meaningful

Goal: Restore contrast and intuition using the ranking information of the existing distance functions!

SLIDE 6

Shared Nearest Neighbor Similarity

Idea: Similar objects have similar neighbors.

$$\mathrm{SNN}_s(x, y) = |\mathrm{NN}_s(x) \cap \mathrm{NN}_s(y)|$$

$$\mathrm{simcos}_s(x, y) = \frac{\mathrm{SNN}_s(x, y)}{s}$$

Properties:

◮ Intuitive value range from “None” to “All”
◮ Intuitive interpretation (“social”)
◮ Good contrast, good performance
◮ Needs an “okay” existing ranking
◮ Extra parameter s to choose
◮ More expensive to use (second order distance)
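
A minimal sketch of the SNN similarity, assuming the primary distances are available as a full pairwise matrix. The helper names `knn_sets` and `simcos` are hypothetical, and counting each object among its own nearest neighbors is an assumption made here, not something the slides specify.

```python
import numpy as np

def knn_sets(dist_matrix, s):
    """For every object, the index set of its s nearest neighbors.

    Assumption: an object counts as its own neighbor (distance 0 ranks first).
    """
    order = np.argsort(dist_matrix, axis=1, kind="stable")
    return [set(row[:s].tolist()) for row in order]

def simcos(nn, x, y, s):
    """Shared-neighbor count SNN_s(x, y), normalized by s."""
    return len(nn[x] & nn[y]) / s

# Toy data: two well-separated groups on a line, primary distance |a - b|.
points = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
dist_matrix = np.abs(points[:, None] - points[None, :])
nn = knn_sets(dist_matrix, s=3)

same_group = simcos(nn, 0, 1, s=3)   # objects 0 and 1 share all neighbors
other_group = simcos(nn, 0, 3, s=3)  # objects 0 and 3 share none
print(same_group, other_group)
```

The value runs from 0 (“None”) to 1 (“All”), which is the intuitive range named in the properties. The neighbor sets are the expensive part, since they require a ranking for every object.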

SLIDE 7

Shared Nearest Neighbor Distance

The similarity function needs to be transformed to a (non-metrical) distance function:

$$\mathrm{dinv}_s(x, y) = 1 - \mathrm{simcos}_s(x, y)$$

$$\mathrm{dacos}_s(x, y) = \arccos\left(\mathrm{simcos}_s(x, y)\right)$$

$$\mathrm{dln}_s(x, y) = -\ln \mathrm{simcos}_s(x, y)$$

Just like cosine distance. Interpretable as “cosine distance” in “neighbor space”.

Similar: Jaccard distance (metrical)

$$J(x, y) := 1 - \frac{|\mathrm{NN}_s(x) \cap \mathrm{NN}_s(y)|}{|\mathrm{NN}_s(x) \cup \mathrm{NN}_s(y)|}$$
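
The transformations can be sketched directly from the formulas. The function names below are illustrative, and returning infinity from `d_ln` when no neighbors are shared is an assumption made here, since the logarithm of 0 is undefined.

```python
import math

def d_inv(sim):
    """dinv = 1 - simcos."""
    return 1.0 - sim

def d_acos(sim):
    """dacos = arccos(simcos)."""
    return math.acos(sim)

def d_ln(sim):
    """dln = -ln(simcos); assumption: infinity when nothing is shared."""
    return -math.log(sim) if sim > 0 else math.inf

def jaccard(nn_x, nn_y):
    """Jaccard distance between two neighbor sets."""
    union = nn_x | nn_y
    return 1.0 - len(nn_x & nn_y) / len(union) if union else 0.0

# Two neighbor sets of size s = 4 that share two objects.
s = 4
nn_x, nn_y = {0, 1, 2, 3}, {2, 3, 4, 5}
sim = len(nn_x & nn_y) / s  # simcos = 2 / 4 = 0.5

print(d_inv(sim), d_acos(sim), d_ln(sim), jaccard(nn_x, nn_y))
```

For this example the shared fraction is 0.5, so the three transformed distances are 0.5, π/3, and ln 2, while the Jaccard distance is 1 − 2/6.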

SLIDE 8

Experiments

Experimental results

SLIDE 9

Data sets used

Four very different data sets:

◮ Cylinder-Bell-Funnel (CBF): artificial
◮ Synthetic control: artificial
◮ Leaf dataset: outlines of tree leaves
◮ Lightning-7: lightning strike emissions

Each modified in different ways:

◮ Original data set
◮ Extended with noise (irrelevant attributes)
◮ Extended with “signal” (relevant attributes)
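
One way to picture the noise extension is appending irrelevant attributes to every series. The slides do not say how the extra attributes were generated, so the sketch below rests on assumptions: uniform noise over the value range of the data, and a hypothetical helper `extend_with_noise` whose `factor` is the multiple of the original length.

```python
import numpy as np

def extend_with_noise(series, factor, seed=0):
    """Append noise attributes so each series has factor * original length.

    Assumption: noise is uniform over the observed value range of the data;
    the generator actually used in the experiments is not given in the slides.
    """
    rng = np.random.default_rng(seed)
    n, length = series.shape
    lo, hi = series.min(), series.max()
    noise = rng.uniform(lo, hi, size=(n, length * (factor - 1)))
    return np.hstack([series, noise])

# Three toy series of length 4, extended to 4 times their length.
data = np.arange(12.0).reshape(3, 4)
extended = extend_with_noise(data, factor=4)
print(extended.shape)
```

The original attributes stay in place as the leading columns, so only the share of irrelevant attributes changes.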

SLIDE 10

Unmodified data sets

Results on unmodified data sets

Benefits of using SNN, shown by example on the Cylinder-Bell-Funnel (artificial) data set

SLIDE 11

Contrast gain using SNN

Visual improvement (unmodified CBF data set):

[Figure, six panels: Euclidean; DTW 20%; LCSS 20%; DTW, s = 70; DTW, s = 100; LCSS, s = 100]

SLIDE 12

Distance Histograms

Numerical contrast improved (unmodified CBF data set):

[Figure, six panels (Euclidean, Manhattan, DTW 20%, ERP 20%, EDR 20%, LCSS 20%): each panel shows the PDF of the primary distance and the PDF of the SNN 100 distance.]

SLIDE 13

Effect of neighborhood size s:

Effect of variation of SNN size parameter s (CBF):

[Figure, six panels (Euclidean, Manhattan, DTW 20%, ERP 20%, EDR 20%, LCSS 20%): each panel plots mean ROC AUC (axis from 0.6 to 1) against SNN size (50 to 200).]

SLIDE 14

Modified data sets

Results on modified data sets

Adding noise to the data set; changing the signal-to-noise ratio

SLIDE 15

Adding noise

Adding noise to the data (Leaf data set)

[Figure, six panels (Euclidean, Manhattan, DTW 20%, ERP 20%, EDR 20%, LCSS 20%): each panel plots mean ROC AUC (axis from 0.5 to 1) against the data set size multiplier (1, 2, 4, 8, 16), with one curve for the primary distance and one for its SNN 60 variant.]

SLIDE 16

Changing signal-to-noise ratio

Changing the signal-to-noise ratio (Leaf data set)

[Figure, six panels (Euclidean, Manhattan, DTW 20%, ERP 20%, EDR 20%, LCSS 20%): each panel plots mean ROC AUC (axis from 0.5 to 1) against the relative amount of noise (0% to 100%), with curves for 4-fold; 4-fold, SNN 60; 16-fold; and 16-fold, SNN 60.]

SLIDE 17

Conclusions

Conclusions

Second order “shared nearest neighbor” distances offer:

◮ Improved performance
◮ Better numerical contrast
◮ Parameter s is not difficult to choose
◮ Less sensitive to noise
◮ ... but computationally more expensive

SLIDE 18

Thank you for your attention!
