Query by Content for Time Series Data in RDBMS



SLIDE 1

Ines F. Vega-Lopez

Query by Content for Time Series Data in RDBMS

12/6/13

University of Houston, Computer Science Seminar

SLIDE 2

Roadmap

- Querying non-text data
- Time series data
- ECG data
- ECG sequence classification
- Extending RDBMS

SLIDE 3

Non-text data

— Music — Speech — Biosignals — Images — Video

SLIDE 4

Querying non-text data

- By describing content
  - Query by associated text
  - Labels, HTML, etc.
- By content
  - Similarity search
  - A similarity or distance function is required
  - Provided by a domain expert

SLIDE 5

Roadmap

- Querying non-text data
- Time series data
- ECG data
- ECG sequence classification
- Extending RDBMS

SLIDE 6

Time series data

- A sequence of pairs (t[i], v[i])
  - A timestamp and a value.
  - Delta t is usually constant.
- Sometimes, the absolute time value is not important.
- Then, the time series is just a sequence of values.
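A minimal sketch of this definition, assuming made-up sample data; the names `series`, `sampling_interval`, and `values` are illustrative and do not come from the talk. It holds a time series as (timestamp, value) pairs and reduces it to its values when absolute time does not matter:

```python
# Hypothetical example data: a time series as (timestamp, value) pairs,
# sampled every 0.5 time units (constant delta t).
series = [(0.0, 1.2), (0.5, 1.4), (1.0, 1.1), (1.5, 0.9)]

# Differences between consecutive timestamps; a constant delta t
# yields exactly one distinct difference.
deltas = {round(t2 - t1, 9) for (t1, _), (t2, _) in zip(series, series[1:])}
sampling_interval = deltas.pop() if len(deltas) == 1 else None

# When absolute time is not important, keep only the sequence of values.
values = [v for _, v in series]
```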

SLIDE 7

Time series data

- Querying has been well studied for the past 20 years
- Two types of queries
  - Whole sequence match
  - Subsequence match

SLIDE 8

Similarity Search on Time Series Data

- Whole Sequence Match
  - Given a query pattern q of length n, and a DB, B, of sequences of length n
  - Find all b ∈ B such that Dist(q, b) ≤ ε
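The whole sequence match can be sketched as a linear scan. This is only an illustration: the function names, the sample database, and the choice of Euclidean distance as the default `dist` are assumptions, and no index is used.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def whole_sequence_match(q, database, eps, dist=euclidean):
    """Return every sequence b in `database` with dist(q, b) <= eps."""
    return [b for b in database if dist(q, b) <= eps]

# Hypothetical database B of sequences, all of length n = 3.
B = [[1.0, 2.0, 3.0], [1.0, 2.0, 4.0], [9.0, 9.0, 9.0]]
matches = whole_sequence_match([1.0, 2.0, 3.0], B, eps=1.0)
```

Here the first two sequences qualify (distances 0 and 1), while the third is far outside the threshold.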

SLIDE 9

Similarity Search on Time Series Data

- Subsequence Match
  - Given a query pattern q of length n, and a DB, B, of sequences of arbitrary length (each one longer than q)
  - Find all pairs (b, i), b ∈ B, such that Dist(q, b[i : i + n]) ≤ ε
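A brute-force sketch of the subsequence match slides a window of length n over every sequence. The names and sample data are assumptions; the sketch reports each match as (index of b in the database, offset i) rather than returning b itself.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def subsequence_match(q, database, eps, dist=euclidean):
    """Return all (sequence index, offset i) with dist(q, b[i:i+n]) <= eps."""
    n = len(q)
    hits = []
    for b_id, b in enumerate(database):
        # Every window of length n inside b is a candidate.
        for i in range(len(b) - n + 1):
            if dist(q, b[i:i + n]) <= eps:
                hits.append((b_id, i))
    return hits

# Hypothetical database of sequences of different lengths.
B = [[0.0, 1.0, 5.0, 1.0, 0.0, 0.0], [3.0, 3.0, 1.0, 5.0, 1.0]]
hits = subsequence_match([1.0, 5.0, 1.0], B, eps=0.1)
```

The number of candidate windows grows with the total length of all sequences, which is why the scan becomes expensive on long recordings.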

SLIDE 10

How can we do this efficiently?

- For conventional data, we build an index and use it to prune the search space.
  - A linear order exists among the objects in the DB.
- For time series, we do not have a linear ordering.
- We can treat a (sub)sequence as a point in n-space.
  - n is too large
  - Curse of dimensionality

SLIDE 11

Searching for (sub) sequences

- Generic Multimedia Indexing: GEMINI
  - Map database objects into a feature space.
  - Index the transformed objects using a spatial access method (SAM).
  - Transform query objects to the feature space.
  - Search in this feature space.
  - Filter out false positives.
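The five GEMINI steps can be sketched as a filter-and-refine range query. This is an illustration under stated assumptions, not the talk's implementation: the features are the first k coefficients of an orthonormal DFT, a plain list stands in for the spatial access method, and all function names are hypothetical. Because the truncated DFT distance never exceeds the true Euclidean distance (Parseval's theorem), the filter step cannot discard a true match.

```python
import cmath
import math

def dft_features(x, k):
    """First k coefficients of the orthonormal DFT of x (feature mapping)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
        / math.sqrt(n)
        for f in range(k)
    ]

def feature_dist(fa, fb):
    """Euclidean distance in feature space; lower-bounds the true distance."""
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(fa, fb)))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def gemini_range_query(q, database, eps, k=2):
    # Steps 1-2: map objects into feature space and "index" them.
    # Assumption: a plain list replaces the spatial access method.
    index = [(dft_features(b, k), b) for b in database]
    # Step 3: transform the query into the same feature space.
    fq = dft_features(q, k)
    # Step 4: search in feature space; the small tolerance guards against
    # floating-point rounding, and candidates may include false positives.
    candidates = [b for fb, b in index if feature_dist(fq, fb) <= eps + 1e-9]
    # Step 5: filter out false positives using the true distance.
    return [b for b in candidates if euclidean(q, b) <= eps]

B = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0], [8.0, 8.0, 8.0, 8.0]]
result = gemini_range_query([1.0, 2.0, 3.0, 4.0], B, eps=1.0)
```

In a real system the candidate search in step 4 would be answered by the index, so most of the database is never touched.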

SLIDE 12

Mapping into a Feature Space

— DFT — DWT — PAA — APCA — SAX — Etc.

SLIDE 13

Roadmap

- Querying non-text data
- Time series data
- ECG data
- ECG sequence classification
- Extending RDBMS

SLIDE 14

ECG Data

- We want to do KDD on time series.
  - Let us concentrate on a particular domain.
  - Medicine has a high social impact.
  - ECG data has some very interesting challenges.
- Can we build upon existing models?
  - Can we use tried and tested RDBMSs?

SLIDE 15

Issues and Challenges with ECG data

- An ECG contains more than one signal
  - Usually 2 or 12 leads
- Different ECGs might have different lengths
  - A few minutes to a couple of days
- Different ECGs might have different sampling rates
  - 128 Hz to 1 or 2 kHz
- Values' bit depth might also vary among ECGs
  - 8 to 20 bits per value

SLIDE 16

What about database systems?

- All these characteristics can be captured by the ER model just fine.
- In turn, this model can be transformed into relations.
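One way such a relational design might look is sketched below using Python's built-in `sqlite3` module. The schema is entirely hypothetical: the table names, column names, and inserted values are assumptions for illustration and are not the schema used in the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ecg_record (
    record_id   INTEGER PRIMARY KEY,
    duration_s  REAL NOT NULL            -- recordings differ in length
);
CREATE TABLE ecg_signal (
    signal_id        INTEGER PRIMARY KEY,
    record_id        INTEGER NOT NULL REFERENCES ecg_record(record_id),
    lead_name        TEXT NOT NULL,      -- one row per lead (signal)
    sampling_rate_hz INTEGER NOT NULL,   -- varies between recordings
    bit_depth        INTEGER NOT NULL    -- varies between recordings
);
""")

# Made-up sample rows: one recording with two leads.
conn.execute("INSERT INTO ecg_record VALUES (1, 1800.0)")
conn.executemany(
    "INSERT INTO ecg_signal VALUES (?, ?, ?, ?, ?)",
    [(1, 1, "lead A", 360, 11), (2, 1, "lead B", 360, 11)],
)
leads = conn.execute(
    "SELECT COUNT(*) FROM ecg_signal WHERE record_id = 1"
).fetchone()[0]
```

The per-signal attributes (number of leads, length, sampling rate, bit depth) fit ordinary columns; only the sample content itself needs special treatment.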

SLIDE 17

An instance of an ECG DB

SLIDE 18

What needs to be done?

- The content of an ECG signal is not a conventional data type.
- We need to define operators on this type
  - What operators?
    - Similarity search
    - Define a formal model

SLIDE 19

Roadmap

- Querying non-text data
- Time series data
- ECG data
- ECG sequence classification
- Extending RDBMS

SLIDE 20

Normal Heartbeat

SLIDE 21

Premature Ventricular Contraction

SLIDE 22

Similarity Search

- k-NN search
  - This gives us signals and the position of a matching subsequence
- Subsequence retrieval
  - This gives us the content of the matching signal

SLIDE 23

K-NN Search

SELECT NN(D.signal, query_pattern, n)
FROM ECG_DATA D
WHERE <condition>;
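What a nearest-neighbor operator of this kind might compute can be sketched in Python. The exact semantics of `NN` are not given in the slides, so this is an assumption: the sketch scans every window of every signal and returns the k closest as (distance, signal id, position) triples, using Euclidean distance. All names here are hypothetical.

```python
import heapq
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_subsequences(signals, query, k, dist=euclidean):
    """Return the k closest (distance, signal_id, position) triples.

    `signals` maps a signal id to its list of values; every window of
    len(query) values in every signal is a candidate (full scan).
    """
    n = len(query)
    candidates = (
        (dist(query, values[i:i + n]), signal_id, i)
        for signal_id, values in signals.items()
        for i in range(len(values) - n + 1)
    )
    return heapq.nsmallest(k, candidates)

# Made-up signals keyed by a hypothetical signal id.
signals = {"s1": [0.0, 1.0, 5.0, 1.0, 0.0], "s2": [1.0, 4.0, 1.0, 0.0]}
nearest = knn_subsequences(signals, [1.0, 5.0, 1.0], k=2)
```

Each result carries the signal and the position of the matching subsequence, which is the information a later subsequence fetch needs.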

SLIDE 24

Sub-sequence Fetch

SELECT subsequence(D.signal, position, n)
FROM ECG_DATA D
WHERE D.signal = signal_id;
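A user-defined function with this shape can be tried out in SQLite through Python's `sqlite3` module. This is a sketch under assumptions, not the talk's implementation: signals are stored as BLOBs of little-endian 64-bit floats, `position` is zero-based, and the table has a separate `signal_id` key column (the slide compares `D.signal` with the id directly).

```python
import sqlite3
import struct

def subsequence(blob, position, n):
    """UDF: return n float64 samples starting at `position`, as a BLOB."""
    width = struct.calcsize("<d")
    return blob[position * width:(position + n) * width]

conn = sqlite3.connect(":memory:")
# Register the Python function as a 3-argument SQL function.
conn.create_function("subsequence", 3, subsequence)
conn.execute(
    "CREATE TABLE ECG_DATA (signal_id INTEGER PRIMARY KEY, signal BLOB)"
)

# Made-up samples packed into a BLOB.
samples = [0.0, 0.1, 0.9, 0.2, 0.0]
conn.execute(
    "INSERT INTO ECG_DATA VALUES (?, ?)",
    (1, struct.pack(f"<{len(samples)}d", *samples)),
)

row = conn.execute(
    "SELECT subsequence(D.signal, ?, ?) FROM ECG_DATA D WHERE D.signal_id = ?",
    (1, 3, 1),
).fetchone()
fetched = list(struct.unpack("<3d", row[0]))
```

The query returns the three samples starting at position 1, so the caller receives only the matching content rather than the whole signal.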

SLIDE 25

What about the Distance Function?

- For querying time series, the DB community has been using L_p norms.
  - Most often Euclidean
- Cardiologists use cross correlation
  - This is not an L_p norm
  - SAMs cannot be used.

SLIDE 26

Euclidean Distance

\[
\mathrm{Dist}(X, Y) = \sqrt{\sum_{i} \left( x[i] - y[i] \right)^{2}}
\]

SLIDE 27

Cross Correlation Distance

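\[
\mathrm{Dist}(X, Y) = \frac{\sum_{i} \left( x[i] - \hat{x} \right) \left( y[i] - \hat{y} \right)}{\sqrt{\sum_{i} \left( x[i] - \hat{x} \right)^{2} \, \sum_{i} \left( y[i] - \hat{y} \right)^{2}}}
\]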
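A small sketch of the cross-correlation measure, assuming that the hatted symbols denote the mean of each sequence (the slide does not define them) and that both sequences have the same length. The function name is illustrative.

```python
import math

def cross_correlation(x, y):
    """Normalized cross-correlation of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    mx = sum(x) / len(x)  # assumed meaning of x-hat: the mean of x
    my = sum(y) / len(y)  # assumed meaning of y-hat: the mean of y
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(
        sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)
    )
    if den == 0:
        raise ValueError("undefined for a constant sequence")
    return num / den

same_shape = cross_correlation([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
inverted = cross_correlation([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Under this reading the value lies between -1 and 1 and is unaffected by scaling or shifting either sequence, so two beats with the same shape but different amplitude score 1. A larger value means more similar, the opposite of a distance such as the Euclidean one.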

SLIDE 28

Roadmap

- Querying non-text data
- Time series data
- ECG data
- ECG sequence classification
- Extending RDBMS

SLIDE 29

Similarity Searching with UDF

SLIDE 30

Sub-sequence Fetch

SLIDE 31

PVC: A Match in the DB

SLIDE 32

Which distance function is better?

- Using the MIT-BIH Arrhythmia DB
- For healthy vs. non-healthy classification
  - 98.35% for Euclidean.
  - 98.59% for cross correlation.
- For pathology classification (15 classes)
  - 97.70% for Euclidean.
  - 98.14% for cross correlation.
- Too close to call

SLIDE 33

Are UDFs Efficient?

- We stored ECG signals as BLOBs and as references to files.
- We developed an ad-hoc standalone search application.
  - This uses a file repository.
- Using BLOBs has significant overhead both in storage (5X) and in total elapsed time (10X).
- UDFs on files are as efficient as ad-hoc queries.

SLIDE 34

Conclusions

- Similarity search is complex because all data must be scanned.
  - It can be efficiently implemented to extend an RDBMS.
  - Compared to an ad-hoc query.
- It is worth exploring GEMINI.
  - Now that we know that Euclidean distance can be used.
- Data encoding should be considered.
  - We might not be getting much I/O savings

SLIDE 35

Questions

SLIDE 36

Thanks!
