

SLIDE 1

Continuous Imputation of Missing Values in Streams of Pattern-Determining Time Series

Kevin Wellenzohn (1), Michael H. Böhlen (1), Anton Dignös (2), Johann Gamper (2), Hannes Mitterer (2)

(1) Department of Computer Science, University of Zurich
(2) Faculty of Computer Science, Free University of Bolzano

March 24, 2017

SLIDE 2

South Tyrol

SLIDE 3

Overview

  • Problem. Streaming time series often have missing values, e.g. due to sensor failures or transmission delays!

  • Goal. Accurately impute (i.e. recover) the latest measurement by exploiting the correlation among streams.

  • Challenge. Streaming time series are often non-linearly correlated, e.g. due to phase shifts.

SLIDE 4

Example

[Figure: streaming time series s, temperature in °C (axis 21 to 23) over time from 13:25 to 14:20; several points are marked "?".]

  • The latest value at time 14:20 is missing and needs to be imputed (i.e. recovered).

SLIDE 5

Approach

SLIDE 9

Top-k Case Matching (TKCM)

  • Intuition. Impute a missing value in time series s with past values from s when a set of correlated reference time series exhibited similar patterns.

Imputation Steps:

  1. Draw query pattern over most recent values
  2. Find k most similar non-overlapping patterns
  3. Impute missing value using the k most similar patterns
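
The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `tkcm_impute`, the L2 distance between patterns, and the greedy selection of non-overlapping patterns are all choices made here for the example. Greedy selection is a simplification and need not yield the set of k patterns with the smallest total dissimilarity.

```python
import numpy as np

def tkcm_impute(s, refs, l, k):
    """Impute the latest value of s from the k most similar past patterns.

    s    : 1-D array of length n; s[-1] is the missing value (e.g. NaN).
    refs : 2-D array of shape (d, n) holding the d reference time series.
    l    : pattern length (number of time points per pattern).
    k    : number of non-overlapping patterns to average over.
    Assumes n >= 2 * l and that the past values of s are present.
    """
    s = np.asarray(s, dtype=float)
    refs = np.asarray(refs, dtype=float)
    n = s.shape[0]

    # Step 1: query pattern over the l most recent reference values.
    query = refs[:, n - l:]

    # Dissimilarity of every past pattern, identified by its last index t.
    # Only patterns that do not overlap the query pattern are considered.
    anchors = np.arange(l - 1, n - l)
    dist = np.array([np.linalg.norm(refs[:, t - l + 1:t + 1] - query)
                     for t in anchors])

    # Step 2: greedily pick the k most similar non-overlapping patterns
    # (assumed selection strategy, see the note above).
    chosen = []
    for i in np.argsort(dist, kind="stable"):
        t = int(anchors[i])
        if all(abs(t - c) >= l for c in chosen):
            chosen.append(t)
        if len(chosen) == k:
            break

    # Step 3: average the values of s at the chosen time points.
    return float(s[chosen].mean()), chosen


# Usage with synthetic, phase-shifted sine waves (hypothetical data).
t = np.arange(0, 841)
s = np.sin(np.deg2rad(t))
s[-1] = np.nan                                   # latest value is missing
refs = np.sin(np.deg2rad(t - 90))[None, :]       # one reference series
value, chosen = tkcm_impute(s, refs, l=100, k=2)
```

In this synthetic setup the reference series repeats every 360 time points, so the selected patterns end exactly one and two periods before the query.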
SLIDE 10

Applying TKCM

[Figure: time series s (axis 21 to 23) and reference time series r1 (axis 15 to 17) and r2 (axis 18 to 20), temperature in °C over time from 13:25 to 14:20.]

SLIDE 11

Applying TKCM

[Figure: s, r1, and r2, temperature in °C over time from 13:25 to 14:20.]

  1. Define query pattern P(14:20) over d = 2 reference time series {r1, r2} in a time frame of l = 10 minutes

SLIDE 12

Applying TKCM

[Figure: s, r1, and r2, temperature in °C over time from 13:25 to 14:20.]

  2. The k = 2 most similar non-overlapping patterns are P(14:00) and P(13:35)

SLIDE 13

Applying TKCM

[Figure: s, r1, and r2, temperature in °C over time from 13:25 to 14:20.]

  3. Missing value is imputed as

$\hat{s}(14{:}20) = \frac{1}{2}\,\bigl(s(14{:}00) + s(13{:}35)\bigr) = 21.85\,^{\circ}\mathrm{C}$
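
Written for general k, the same average might be stated as follows. The symbols $t_n$ (time of the missing value) and $t_1, \dots, t_k$ (time points of the k selected patterns) are notation introduced here for illustration and do not appear on the slides.

```latex
\hat{s}(t_n) = \frac{1}{k} \sum_{i=1}^{k} s(t_i)
```

With k = 2, $t_1$ = 14:00 and $t_2$ = 13:35, this reduces to the expression above.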

SLIDE 14

Query Pattern

          14:10   14:15   14:20
    r1    16.3    17.1    17.5
    r2    20.2    19.9    18.2

Pattern length l = 3, number of reference time series d = 2

  • With l > 1, TKCM takes the temporal context into account and captures how time series change over time
  • Pattern length l is important to deal with non-linear correlations
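
A query pattern can be thought of as a d × l matrix holding the l most recent values of each reference time series. The sketch below builds that matrix from the numbers on this slide. The assignment of the values to r1 and r2 is assumed from the order in which they appear, and the helper names `query_pattern` and `dissimilarity`, as well as the L2 distance, are illustrative choices rather than definitions taken from the talk.

```python
import numpy as np

# Reference values at 14:10, 14:15, 14:20 (row assignment assumed).
r1 = [16.3, 17.1, 17.5]
r2 = [20.2, 19.9, 18.2]

def query_pattern(refs, l):
    """Return the d x l matrix of the l most recent values per reference series."""
    refs = np.asarray(refs, dtype=float)
    return refs[:, -l:]

def dissimilarity(p, q):
    """L2 distance between two patterns of equal shape (assumed measure)."""
    return float(np.linalg.norm(np.asarray(p) - np.asarray(q)))

P3 = query_pattern([r1, r2], l=3)   # full temporal context, shape (2, 3)
P1 = query_pattern([r1, r2], l=1)   # only the values at 14:20, shape (2, 1)
```

With l = 1 the pattern keeps only the current reference values, so two situations with the same values but opposite trends are indistinguishable; with l = 3 the preceding values are part of the comparison.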

SLIDE 15

Related Work

  1. Centroid Decomposition (CD)
      • M. Khayati, M. H. Böhlen, and J. Gamper. Memory-efficient centroid decomposition for long time series. ICDE 2014
      • Singular Value Decomposition (SVD) that expects linear correlations

  2. SPIRIT
      • S. Papadimitriou, J. Sun, and C. Faloutsos. Streaming pattern discovery in multiple time-series. VLDB 2005
      • Principal Component Analysis (PCA) that expects linear correlations

  3. MUSCLES
      • B. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online data mining for co-evolving time sequences. ICDE 2000
      • Multi-variate linear regression that expects linear correlations

SLIDE 16

Linear vs. Non-Linear Correlations

SLIDE 19

Linear Correlations

s(t) = sind(t)
r(t) = 1.5 × sind(t) + 1

[Figure: s and r plotted over time t (ticks at 180, 540, 840), and a scatter plot of r(t) against s(t); the values s = 0.86 and r = 2.3 are marked.]

  • Time series s and r have different amplitude and offset
  • They are linearly correlated and their Pearson Correlation Coefficient is 1!
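
The claim can be checked numerically. The sketch below assumes that `sind` denotes the sine of an angle given in degrees and samples both series at integer time points over two full periods; the sampling range is a choice made for this example.

```python
import numpy as np

t = np.arange(0, 720)                    # two full periods, in degrees
s = np.sin(np.deg2rad(t))                # s(t) = sind(t)
r = 1.5 * np.sin(np.deg2rad(t)) + 1      # r(t) = 1.5 * sind(t) + 1

# Pearson correlation coefficient between s and r.
pcc_linear = np.corrcoef(s, r)[0, 1]
```

Because r is an affine function of s with a positive slope, `pcc_linear` equals 1 up to floating-point rounding, regardless of amplitude and offset.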

SLIDE 21

Non-Linear Correlations

s(t) = sind(t)
r(t) = sind(t − 90)

[Figure: s and r plotted over time t (ticks at 180, 540, 840), and a scatter plot of r(t) against s(t); the values s = 0.86, s = −0.86, and r = 0.5 are marked.]

  • Time series s and r are phase-shifted by 90 degrees
  • They are non-linearly correlated and their Pearson Correlation Coefficient is 0!
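
A short numerical check, again assuming `sind` is the sine in degrees and sampling at integer time points over two full periods (both choices made for this example). It also shows that undoing the phase shift restores a perfect linear correlation, which is why the two series still determine each other.

```python
import numpy as np

t = np.arange(0, 720)                    # two full periods, in degrees
s = np.sin(np.deg2rad(t))                # s(t) = sind(t)
r = np.sin(np.deg2rad(t - 90))           # r(t) = sind(t - 90)

# Pearson correlation of the phase-shifted series.
pcc_shifted = np.corrcoef(s, r)[0, 1]

# Shift r back by 90 time points; the wrap-around is harmless here
# because the sample covers whole periods.
pcc_aligned = np.corrcoef(s, np.roll(r, -90))[0, 1]
```

Here `pcc_shifted` is 0 and `pcc_aligned` is 1, both up to floating-point rounding.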

SLIDE 23

Pattern Length l and Non-Linear Correlations

[Figure: two panels, pattern length l = 1 and pattern length l = 100, each showing s, r, and the pattern dissimilarity over time t (ticks at 180, 540, 840).]

  • With l > 1 there are fewer patterns with pattern dissimilarity 0
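
The effect can be reproduced on the phase-shifted sine wave from the earlier slide. The sketch assumes a single reference series r(t) = sind(t − 90) sampled at integer time points 0 to 840, an L2 distance as the pattern dissimilarity, and a small tolerance in place of exact equality; the function name `zero_dissimilarity_anchors` is made up for this example.

```python
import numpy as np

t = np.arange(0, 841)                    # time points 0..840, in degrees
r = np.sin(np.deg2rad(t - 90))           # reference series r(t) = sind(t - 90)

def zero_dissimilarity_anchors(r, l, tol=1e-9):
    """Past time points whose length-l pattern matches the most recent one."""
    n = len(r)
    query = r[n - l:]                    # pattern ending at the latest time point
    anchors = range(l - 1, n - l)        # patterns that do not overlap the query
    return [a for a in anchors
            if np.linalg.norm(r[a - l + 1:a + 1] - query) < tol]

matches_l1 = zero_dissimilarity_anchors(r, l=1)
matches_l100 = zero_dissimilarity_anchors(r, l=100)
```

With l = 1 every past time point where r takes the same value as at t = 840 matches, on both the rising and the falling part of the wave. With l = 100 only the time points at the same phase remain, so `matches_l100` is a strict subset of `matches_l1`.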

SLIDE 24

Chlorine Dataset

  • Chlorine dataset is phase-shifted and hence non-linearly correlated

[Figure: chlorine level of time series s and r over time t (axis 0.1 to 0.2), and a scatter plot of r(t) against s(t).]

SLIDE 25

Importance of Pattern Length l

[Figure: two panels, pattern length l = 1 and pattern length l = 72, each showing the chlorine level (axis 0.1 to 0.2) of s and of s imputed by TKCM over time t.]

  • A larger pattern length decreases the oscillation in the imputed time series

SLIDE 26

Experiments

SLIDE 27

Datasets

We use 4 datasets:

  1. SBR
      • 130 meteorological time series from South Tyrol
      • linearly correlated

  2. SBR-1d
      • SBR dataset shifted up to 1 day
      • non-linearly correlated

  3. Flights
      • 8 time series
      • non-linearly correlated

  4. Chlorine
      • 166 time series
      • non-linearly correlated

SLIDE 28

Pattern Length l

[Figure: RMSE as a function of pattern length l (ticks at 1, 36, 72, 108, 144) in four panels: SBR (linearly correlated), SBR-1d (non-linearly correlated), Flights (non-linearly correlated), and Chlorine (non-linearly correlated).]

SLIDE 29

Comparison

RMSE per dataset (bar charts; values listed in legend order TKCM, SPIRIT, MUSCLES, CD):

    Dataset     Correlation    TKCM     SPIRIT    MUSCLES    CD
    SBR         linear         1.07     0.88      0.89       1.32
    SBR-1d      non-linear     1.82     2.57      4.34       2.12
    Flights     non-linear     3.57     14.67     8.35       20.7
    Chlorine    non-linear     0.014    0.049     0.036      0.054

  • TKCM is more accurate on all non-linearly correlated datasets (SBR-1d, Flights, and Chlorine).
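
For reference, the error measure can be computed as below, assuming RMSE denotes the usual root-mean-square error between the true and the imputed values. The function name `rmse` and the sample numbers are hypothetical and are not taken from the experiments.

```python
import numpy as np

def rmse(truth, imputed):
    """Root-mean-square error between true and imputed values."""
    truth = np.asarray(truth, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    return float(np.sqrt(np.mean((truth - imputed) ** 2)))

# Hypothetical temperatures in °C, for illustration only.
error = rmse([21.0, 22.5, 23.0], [21.5, 22.0, 23.5])
```

A lower RMSE means the imputed values are closer to the true ones, and the measure carries the unit of the data, so values are only comparable within one dataset.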

SLIDE 30

Conclusion & Future Work

Conclusion

  • TKCM imputes the current missing value in a stream using reference time series
  • TKCM exploits linear and non-linear correlations among time series

Future work

  • Automatically choose reference time series
  • Improve efficiency of TKCM by pruning candidate patterns

SLIDE 31

Thanks!