SLIDE 1

PATTERN DISCOVERY IN TIME SERIES: A SURVEY

DENIS KHRYASHCHEV, GRADUATE CENTER, CUNY, OCTOBER 2018

SLIDE 2

MOTIVATION

Often datasets represent processes that take place over long periods of time. Their outputs are measured at regular time intervals, creating discrete time series. For example, consider CitiBike demand and Fisher River temperature data.

Data sources:
1. https://s3.amazonaws.com/tripdata/index.html
2. https://datamarket.com/data/set/235d/mean-daily-temperature-fisher-river-near-dallas-jan-01-1988-to-dec-31-1991

(Figure: CitiBike demand and Fisher River temperature series.)

SLIDE 3

MOTIVATION. COMPLEXITY

Complexity quantifies the internal structure of the underlying process. EEG data can be classified [1] into interictal, preictal and seizure using their complexity.

[1] Petrosian, Arthur. "Kolmogorov complexity of finite sequences and recognition of different preictal EEG patterns." Proceedings of the Eighth IEEE Symposium on Computer-Based Medical Systems. IEEE, 1995.

(Figure: EEG voltage (µV) traces for interictal, preictal, and seizure states.)

SLIDE 4

MOTIVATION. PERIODICITY

Natural phenomena such as solar activity and the Earth's rotation and revolution drive periodic human activity at large scale. For example, New York City's human mobility is highly periodic, with clear ridership peaks from 6 AM to 10 AM and from 3 PM to 7 PM.

Image source: http://web.mta.info/mta/news/books/docs/Ridership_Trends_FINAL_Jul2018.pdf

SLIDE 5

MOTIVATION. PREDICTABILITY

Predictability estimates the expected accuracy of forecasting a given time series. Often there is a trade-off between the desired accuracy and the computation time [2].

[2] Zhao, Kai, et al. "Predicting taxi demand at high spatial resolution: approaching the limit of predictability." 2016 IEEE International Conference on Big Data (Big Data). IEEE, 2016.

SLIDE 6

MOTIVATION. CLUSTERING

Image source: Denis Khryashchev’s summer internship at Simulmedia (Jun – Aug 2018).

Often the task of grouping time series that are similar in some respect arises in the domains of transportation, finance, medicine, and others. Time-sensitive modifications of standard techniques are applied, e.g. k-means over autocorrelation functions.

(Figure: autocorrelation functions of time series clustered together.)

SLIDE 7

MOTIVATION. FORECASTING

Perhaps the most well-known and widely applied task related to time series is forecasting. Understanding time series periodicity, complexity, and predictability helps in selecting better predictors and optimizing their parameters. E.g., knowing the periodicity P = 5 of a series, one can forecast by averaging values at lag 5.

Video source: Denis Khryashchev’s summer internship at Simulmedia (Jun – Aug 2018).

SLIDE 8

NOTATION

Throughout the presentation we will consider time series of real values and will use the following notation:

$X = \{X_1, \dots, X_N\} = \{X_t\}_{t=1}^{N}, \quad X_t \in \mathbb{R}$

Not to be confused with set notation, $\{\cdot\}$ is used to denote sequences. A subsequence of the series $X$ that starts at period $i$ and ends at period $j$ is written as

$X_i^j = \{X_t\}_{t=i}^{j} = \{X_i, \dots, X_j\}, \quad i \le j$

SLIDE 9

ORGANIZATION OF THE PRESENTATION

SLIDE 10

1. KOLMOGOROV COMPLEXITY

For a time series $X$ we define the Kolmogorov complexity as the length of the shortest description of its sequence of values, ordered in time, in some fixed universal description language:

$K(X) = |d(X)|$

where $K$ is the Kolmogorov complexity and $d(X)$ is the shortest description of the time series $X$. Smaller values of $K(X)$ indicate lower complexity.

SLIDE 11

1. KOLMOGOROV COMPLEXITY. EXAMPLE

Given two time series $X = \{0, 1, 0, 1, 0, 1, 0, 1, 0, 1\}$ and $Y = \{1, 0, 0, 1, 1, 1, 0, 0, 1, 0\}$, and selecting Python as our description language, we have the shortest descriptions $d_P(X) = $ [0,1]*5 and $d_P(Y) = $ [1,0,0,1,1,1,0,0,1,0], quantifying a smaller "Pythonic" complexity for $X$ compared to $Y$:

$K_P(X) = |d_P(X)| = 7, \quad K_P(Y) = |d_P(Y)| = 21$

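Since $K$ is uncomputable in general (see the next slide), a compressed length is often used as a practical stand-in. A minimal sketch, assuming we accept zlib's encoding as the description language; the helper name is mine:

```python
import random
import zlib

def compression_complexity(series):
    """Approximate K(X) by the byte length of a zlib-compressed encoding."""
    raw = ",".join(str(v) for v in series).encode()
    return len(zlib.compress(raw, level=9))

random.seed(0)
regular = [0, 1] * 500                               # highly regular series
noisy = [random.randint(0, 1) for _ in range(1000)]  # random series
print(compression_complexity(regular))  # small: repetition compresses well
print(compression_complexity(noisy))    # larger: randomness resists compression
```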
SLIDE 12

1. KOLMOGOROV COMPLEXITY. LIMITATIONS

However, as proven by Kolmogorov in [3] and by Chaitin and Arslanov in [4], the complexity $K$ is not a computable function in general.

[3] Kolmogorov, Andrei N. "On tables of random numbers." Sankhyā: The Indian Journal of Statistics, Series A (1963): 369-376.
[4] Chaitin, G. J., A. Arslanov, and C. Calude. "Program-size complexity computes the halting problem." Department of Computer Science, The University of Auckland, New Zealand, Tech. Rep., 1995.

SLIDE 13

1. LEMPEL-ZIV COMPLEXITY

Lempel and Ziv [5] proposed a combinatorial approximation of the complexity of finite sequences based on their production history. For a time series $X$ a production history is a parsing into consecutive blocks

$H(X) = \{X_1^{h_1}, X_{h_1+1}^{h_2}, \dots, X_{h_{n-1}+1}^{N}\}$

For the series $X = \{0,0,0,1,1,0,1,0,0,1,0,0,0,1,0,1\}$ one of the production histories is

$H(X) = \{0\} \cup \{0,0,1\} \cup \{1,0\} \cup \{1,0,0\} \cup \{1,0,0,0\} \cup \{1,0,1\}$

The overall complexity is the size of the shortest possible production history:

$c(X) = \min_{H(X)} |H(X)|$

Disadvantage: the actual values $X_t$ are treated as symbols, e.g. $c(\{1, 2, 1, 5, 1, 2\}) = c(\{8, 0.5, 8, 0.1, 8, 0.5\})$.

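A rough sketch of the idea: an LZ78-style left-to-right parsing counts the distinct phrases needed to reproduce the sequence, a common computable proxy for the minimal production history above (it parses greedily rather than minimizing):

```python
def lz78_phrase_count(seq):
    """Number of distinct phrases in a greedy LZ78-style parsing of seq."""
    phrases, current = set(), ()
    for symbol in seq:
        current += (symbol,)
        if current not in phrases:   # new phrase: record it, start the next one
            phrases.add(current)
            current = ()
    # an unfinished (already seen) phrase at the end still counts once
    return len(phrases) + (1 if current else 0)

X = [0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1]
print(lz78_phrase_count(X))
# Values act as symbols, exactly as the disadvantage above notes:
print(lz78_phrase_count([1, 2, 1, 5, 1, 2]) == lz78_phrase_count([8, 0.5, 8, 0.1, 8, 0.5]))  # True
```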
[5] Lempel, Abraham, and Jacob Ziv. "On the complexity of finite sequences." IEEE Transactions on Information Theory 22.1 (1976): 75-81.

SLIDE 14

2. ENTROPY

Shannon and Weaver introduced entropy [6] as a measure of the information transmitted by a signal in a communication channel:

$H(X) = -\mathbb{E}[\log_2 P(X)]$

Rényi [7] generalized the definition for an ordinary discrete finite distribution of $X$, $\mathcal{P} = \{p_1, \dots, p_n\}$, $\sum_k p_k = 1$, to the entropy of order $\alpha$ ($\alpha \to 1$ recovers Shannon entropy):

$H_\alpha(X) = H_\alpha(\mathcal{P}) = \frac{1}{1-\alpha} \log_2 \sum_k p_k^\alpha$

Disadvantage: both definitions do not take the order of the values $X_t$ into account, e.g. $H(\{1, 2, 3, 1, 2, 3\}) = H(\{1, 3, 2, 2, 3, 1\})$.

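A minimal sketch of plug-in estimates of both quantities from empirical value frequencies (function names are mine):

```python
import math
from collections import Counter

def shannon_entropy(series):
    """Plug-in Shannon entropy (bits) of the empirical value distribution."""
    n = len(series)
    return -sum((c / n) * math.log2(c / n) for c in Counter(series).values())

def renyi_entropy(series, alpha):
    """Plug-in Renyi entropy of order alpha (alpha != 1)."""
    n = len(series)
    return math.log2(sum((c / n) ** alpha for c in Counter(series).values())) / (1 - alpha)

# Order-blindness: permuting the same values leaves the entropy unchanged.
print(shannon_entropy([1, 2, 3, 1, 2, 3]) == shannon_entropy([1, 3, 2, 2, 3, 1]))  # True
print(round(renyi_entropy([1, 2, 3, 1, 2, 3], alpha=2), 3))
```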
[6] Cover, Thomas M., and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
[7] Rényi, Alfréd. "On measures of entropy and information." Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, 1961.

SLIDE 15

2. KOLMOGOROV ENTROPY

Entropy is often used as an approximation of complexity. Among the most well-known approximations [8] of complexity is the Kolmogorov entropy, defined as

$K = -\lim_{\tau \to 0}\, \lim_{\ell \to 0}\, \lim_{d \to \infty}\, \frac{1}{d\tau} \sum_{i_1, \dots, i_d} p(i_1, \dots, i_d)\, \ln p(i_1, \dots, i_d)$

It describes the complexity of a dynamical system with $F$ degrees of freedom. The $F$-dimensional phase space is partitioned into boxes of size $\ell^F$, $\tau$ stands for the sampling time interval, and $p(i_1, \dots, i_d)$ is the joint probability that the $F$-dimensional point representing the values $X_{t = k\tau}$ is found inside box $i_1$ at time $\tau$, box $i_2$ at time $2\tau$, and so on.

Disadvantage: the approximation is computable for known, analytically defined models; however, it is hard to calculate given only the resulting series.

[8] Grassberger, Peter, and Itamar Procaccia. "Estimation of the Kolmogorov entropy from a chaotic signal." Physical Review A 28.4 (1983): 2591.

SLIDE 16

2. ENTROPY WITH TEMPORAL COMPONENT

Another definition [6] of entropy takes into account the temporal order of the values $X_t$:

$H_t(X) = -\sum_{i=1}^{N} \sum_{j=i}^{N} P(X_i^j) \log_2 P(X_i^j)$

where $P(X_i^j)$ is the probability of observing the subsequence $X_i^j$. Computing $H_t(X)$ is $O(2^N)$ complex.

The Lempel-Ziv estimator [9] approximates $H_t(X)$ and converges rapidly:

$H_{LZ}(X) = \left( \frac{1}{N} \sum_t \Lambda_t \right)^{-1} \ln N$

where $\Lambda_t$ is the length of the shortest subsequence starting at period $t$ that is observed for the first time.

Disadvantage: the values $X_t$ are treated as symbols, e.g. $H_{LZ}(\{1, 2, 1, 5\}) = H_{LZ}(\{2, 9, 2, 3\})$.

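A sketch of the estimator on symbolized data; the Λ_t search below is a naive scan that is fine for short series (names are mine):

```python
import math

def _seen_before(history, pattern):
    """True if pattern occurs as a contiguous subsequence of history."""
    m = len(pattern)
    return any(history[i:i + m] == pattern for i in range(len(history) - m + 1))

def lz_entropy(series):
    """Lempel-Ziv entropy estimate: (mean Lambda_t)^-1 * ln N."""
    x = list(series)
    n = len(x)
    lambdas = []
    for t in range(n):
        k = 1                                 # Lambda_t: shortest new subsequence at t
        while t + k <= n and _seen_before(x[:t], x[t:t + k]):
            k += 1
        lambdas.append(k)
    return math.log(n) / (sum(lambdas) / n)

# Values act as symbols: both series yield the same estimate.
print(lz_entropy([1, 2, 1, 5]) == lz_entropy([2, 9, 2, 3]))  # True
```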
[6] Cover, Thomas M., and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
[9] Kontoyiannis, Ioannis, et al. "Nonparametric entropy estimation for stationary processes and random fields, with applications to English text." IEEE Transactions on Information Theory 44.3 (1998): 1319-1327.

SLIDE 17

2. PERMUTATION ENTROPY

Bandt and Pompe [10] proposed the permutation entropy of order $n$:

$H(n) = -\sum_{\pi} p(\pi) \log p(\pi), \quad p(\pi) = \frac{\#\{ t \mid 0 \le t \le T - n,\ \mathrm{type}(X_{t+1}, \dots, X_{t+n}) = \pi \}}{T - n + 1}$

where $p(\pi)$ is the frequency of permutations of type $\pi$. E.g., for $X = \{4, 7, 9, 10, 6, 11, 3\}$ and $n = 3$ we have $\pi(4, 7, 9) = \pi(7, 9, 10) = \pi_{012}$ ($X_t < X_{t+1} < X_{t+2}$), $\pi(9, 10, 6) = \pi(6, 11, 3) = \pi_{210}$ ($X_{t+2} < X_t < X_{t+1}$), and $\pi(10, 6, 11) = \pi_{102}$ ($X_{t+1} < X_t < X_{t+2}$). The entropy becomes

$H(3) = -2 \cdot \tfrac{2}{5} \log_2 \tfrac{2}{5} - \tfrac{1}{5} \log_2 \tfrac{1}{5} \approx 1.52$

Disadvantage: the definition requires $X_t \ne X_{t+1}$ (no ties) and has a complexity of $O(n!)$.

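A compact sketch; each window's pattern is taken as the argsort of its values, which matches the $\pi$ labels above up to relabeling:

```python
import math
from collections import Counter

def permutation_entropy(x, order=3):
    """Permutation entropy (base-2) of a series with no tied neighbors."""
    counts = Counter()
    for t in range(len(x) - order + 1):
        window = x[t:t + order]
        pattern = tuple(sorted(range(order), key=window.__getitem__))
        counts[pattern] += 1
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(round(permutation_entropy([4, 7, 9, 10, 6, 11, 3], order=3), 2))  # 1.52
```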
[10] Bandt, Christoph, and Bernd Pompe. "Permutation entropy: a natural complexity measure for time series." Physical Review Letters 88.17 (2002): 174102.

SLIDE 18

3. PREDICTABILITY

Following the Fano inequality [11], the predictability of a series $X$ satisfies

$\Pi(X) \le \Pi_{max}(H(X), N)$

where $N$ is the number of unique values $X_t$ and $0 \le \Pi_{max} \le 1$ is the maximum predictability of $X$, with 0 standing for a completely unpredictable, chaotic series. In the work of Song et al. [12] the maximum predictability is shown to be the solution of

$H(X) = -\Pi_{max} \log_2 \Pi_{max} - (1 - \Pi_{max}) \log_2 (1 - \Pi_{max}) + (1 - \Pi_{max}) \log_2 (N - 1)$

Often the Lempel-Ziv estimator $H_{LZ}(X)$ is used to calculate the entropy of the series.

Disadvantage: depends on the selected measure of $H(X)$; the equation has no closed-form solution.

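Because there is no closed form, $\Pi_{max}$ is found numerically. A sketch solving the equation above by bisection; the right-hand side decreases in $\Pi_{max}$ on $(1/N, 1)$, so the root is unique (function names are mine):

```python
import math

def max_predictability(H, N, tol=1e-12):
    """Solve H = -p*log2(p) - (1-p)*log2(1-p) + (1-p)*log2(N-1) for p."""
    def rhs(p):
        return (-p * math.log2(p) - (1 - p) * math.log2(1 - p)
                + (1 - p) * math.log2(N - 1))
    lo, hi = 1.0 / N, 1.0 - 1e-15     # rhs falls from log2(N) to 0 on this interval
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if rhs(mid) > H:               # rhs still too large: move right
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(max_predictability(H=1.0, N=10), 3))  # entropy of 1 bit over 10 values
```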
[11] Fano, Robert M. The Transmission of Information. Cambridge, MA: Massachusetts Institute of Technology, Research Laboratory of Electronics, 1949.
[12] Song, C., Qu, Z., Blumm, N., & Barabási, A. L. "Limits of predictability in human mobility." Science 327.5968 (2010): 1018-1021.

SLIDE 19

3. PREDICTABILITY. SHUFFLING

An alternative approach to measuring predictability was proposed by Kaboudan [13], who defined it as a ratio of forecasting errors before and after shuffling the series:

$\eta(X) = 1 - \frac{(X - f(X))^T (X - f(X))}{(X_s - f(X_s))^T (X_s - f(X_s))}$

where $X_s$ is a shuffled copy of the series and $f$ is the selected predictor.

Disadvantage: the shuffling is done only once and can lead to inconsistent results when measured multiple times for the same time series $X$. A significant improvement of the approach would be to calculate $\eta(X)$ numerous times (e.g. 1000), computing a $p$-value and performing hypothesis testing.

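A sketch of the repeated-shuffle variant suggested above, with a naive persistence forecaster (each value predicted by its predecessor) standing in for $f$; any predictor with the same interface would do:

```python
import numpy as np

def persistence(a):
    """Naive predictor f: forecast each value by the previous one."""
    return np.concatenate(([a[0]], a[:-1]))

def kaboudan_eta(x, predict=persistence, n_shuffles=1000, seed=0):
    """Mean and spread of 1 - SSE(original) / SSE(shuffled) over many shuffles."""
    rng = np.random.default_rng(seed)
    sse = lambda a: float(np.sum((a - predict(a)) ** 2))
    base = sse(np.asarray(x, dtype=float))
    etas = [1.0 - base / sse(rng.permutation(x)) for _ in range(n_shuffles)]
    return float(np.mean(etas)), float(np.std(etas))

t = np.linspace(0, 20 * np.pi, 500)
x = np.sin(t) + 0.1 * np.random.default_rng(1).normal(size=500)
print(kaboudan_eta(x))  # mean near 1: shuffling destroys the smooth structure
```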
[13] Kaboudan, M. A. "A measure of time series' predictability using genetic programming applied to stock returns." Journal of Forecasting 18.5 (1999): 345-357.

SLIDE 20

3. PREDICTABILITY. REGRESSION MODELS

Often in econometrics predictability is quantified with linear regression models [14],

$X_{t+1} = \alpha + \beta X_t + \varepsilon_{t+1}$

that are used to calculate

$R^2 = 1 - \frac{\mathrm{Var}(\alpha + \beta X_t)}{\mathrm{Var}(X_{t+1})}$

ranging from 0 to 1, with the latter representing the most unpredictable series.

Disadvantage: captures only linear relationships unless non-linear regression models are used.

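A quick sketch with a lag-1 OLS fit via numpy's polyfit, computing the variance ratio exactly as defined above:

```python
import numpy as np

def regression_predictability(x):
    """1 - Var(alpha + beta * X_t) / Var(X_{t+1}); 1 = most unpredictable."""
    x = np.asarray(x, dtype=float)
    beta, alpha = np.polyfit(x[:-1], x[1:], deg=1)
    fitted = alpha + beta * x[:-1]
    return 1.0 - float(np.var(fitted) / np.var(x[1:]))

rng = np.random.default_rng(0)
print(round(regression_predictability(np.cumsum(rng.normal(size=1000))), 2))  # random walk: near 0
print(round(regression_predictability(rng.normal(size=1000)), 2))             # white noise: near 1
```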
[14] Campbell, John Y., and Motohiro Yogo. "Efficient tests of stock return predictability." Journal of Financial Economics 81.1 (2006): 27-60.

SLIDE 21

4. PERIODICITY. FOURIER

Numerous approaches to quantifying time series periodicity rely on the Fourier transform

$M_k = \sum_{t=0}^{N-1} X_t\, e^{-i 2\pi k t / N}$

where $M_k$ is the "magnitude" of the $k$-th frequency, quantifying the "relative chance" of a value $X_t$ repeating after the corresponding period of time. To decrease the amount of spurious artifacts, usually a windowed or short-time Fourier transform is applied [15]:

$M_k = \sum_{t=0}^{N-1} X_t\, w(t - \tau)\, e^{-i 2\pi k t / N}$

where $w$ is a window function of effective size $2\tau$. Often Blackman, Hamming, and Bartlett windows are used.

Disadvantage: not linear in period.

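A sketch of windowed period detection with numpy: taper with a Hamming window, take the FFT, and read the dominant period as $N$ divided by the strongest frequency bin (assumes a single dominant period and even sampling):

```python
import numpy as np

def dominant_period(x):
    """Period (in samples) of the strongest non-DC frequency bin."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    windowed = (x - x.mean()) * np.hamming(n)   # remove DC, taper the edges
    magnitudes = np.abs(np.fft.rfft(windowed))
    k = 1 + int(np.argmax(magnitudes[1:]))      # skip the k = 0 (DC) bin
    return n / k

t = np.arange(1000)
x = np.sin(2 * np.pi * t / 24) + 0.3 * np.random.default_rng(0).normal(size=1000)
print(round(dominant_period(x), 1))  # close to 24
```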
[15] Allen, Jont B., and Lawrence R. Rabiner. "A unified approach to short-time Fourier analysis and synthesis." Proceedings of the IEEE 65.11 (1977): 1558-1564.

SLIDE 22

4. PERIODICITY. FOURIER AND AUTOCORRELATION

The linear autocorrelation function can be used to evaluate periodicity due to the Wiener-Khinchin theorem [16][17], which states that

$S_{xx}(k) = \sum_{t=0}^{N-1} r_{xx}(t)\, e^{-i 2\pi k t / N}$

where $S_{xx}(k)$ is the power spectrum of $X$ and $r_{xx}$ is its autocorrelation function. In other words, larger values of the autocorrelation function of $X$ at lag $\tau$ can signify stronger periodicity of the series at lag $\tau$.

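The theorem also gives a fast way to compute the autocorrelation itself: inverse-transform the power spectrum. A sketch with numpy (zero-padding avoids circular wrap-around):

```python
import numpy as np

def autocorr_fft(x):
    """Autocorrelation via Wiener-Khinchin: inverse FFT of the power spectrum."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    spectrum = np.fft.rfft(x, n=2 * n)               # zero-pad against circularity
    acov = np.fft.irfft(np.abs(spectrum) ** 2)[:n] / n
    return acov / acov[0]                            # normalize: lag 0 equals 1

x = np.sin(2 * np.pi * np.arange(480) / 24)          # period-24 signal
r = autocorr_fft(x)
print(round(float(r[24]), 2))  # about 0.95: strong correlation at lag 24
```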
[16] Wiener, Norbert. "Generalized harmonic analysis." Acta Mathematica 55.1 (1930): 117-258.
[17] Cohen, Leon. "The generalization of the Wiener-Khinchin theorem." Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing. Vol. 3. IEEE, 1998.

SLIDE 23

4. PERIODICITY. FISHER TEST

A similar test, based on the notion of periodograms, was proposed by Fisher [18]:

$I(f) = \frac{2}{N} \left( \sum_{t=0}^{N-1} X_t \cos 2\pi f t \right)^2 + \frac{2}{N} \left( \sum_{t=0}^{N-1} X_t \sin 2\pi f t \right)^2, \quad T = \max_f I(f)$

where the frequency $-\tfrac{1}{2} \le f \le \tfrac{1}{2}$. The test assumes that $X = s + a$, with $s$ being the real periodic signal and $a \sim \mathcal{N}(0, \sigma^2)$ the unobserved noise. $T$ is measured to reject $H_0\!:\ s = 0$.

Disadvantage: similar to the original Fourier transform, spurious artifacts; can be improved with window functions.

[18] Fisher, Ronald Aylmer. "Tests of significance in harmonic analysis." Proceedings of the Royal Society of London A 125.796 (1929): 54-59.

SLIDE 24

4. PERIODICITY TRANSFORM

Sethares and Staley [19] proposed a linear-in-period transformation of time series. They defined $X$ to be $p$-periodic if $X_t = X_{t+p}$, and $\mathcal{P}_p$ to be the set of all $p$-periodic sequences. They introduced non-orthogonal basis sequences that are $p$-periodic:

$\delta_p^s(k) = \begin{cases} 1 & \text{if } (k - s) \bmod p = 0 \\ 0 & \text{otherwise} \end{cases}$

where $k$ is the time index of the series $X$ and $s$ is the time shift. The measure of periodicity is the projection

$\pi(X, \mathcal{P}_p) = \sum_{s=0}^{p-1} \alpha_s\, \delta_p^s, \quad \alpha_s = \frac{p}{N} \sum_{t=0}^{N/p - 1} X_{s + tp}$

Disadvantage: the sampling interval of $X$, or its entire length, must correspond to an integer multiple of $p$.

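A sketch of the projection onto $\mathcal{P}_p$: each $\alpha_s$ is the mean of the samples whose index is congruent to $s \bmod p$, and the norm of the projected sequence measures how much $p$-periodic energy $X$ contains (assumes the length is a multiple of $p$; the toy series is mine):

```python
import numpy as np

def project_periodic(x, p):
    """Project x onto the set of p-periodic sequences (requires len(x) % p == 0)."""
    x = np.asarray(x, dtype=float)
    alphas = x.reshape(-1, p).mean(axis=0)  # alpha_s: mean over residue class s mod p
    return np.tile(alphas, len(x) // p)

x = np.tile([5.0, 1.0, 0.0, 1.0], 25) + 0.2 * np.random.default_rng(0).normal(size=100)
x = x - x.mean()
for p in (2, 4, 5):
    print(p, round(float(np.linalg.norm(project_periodic(x, p))), 2))  # p = 4 dominates
```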
[19] Sethares, William A., and Thomas W. Staley. "Periodicity transforms." IEEE Transactions on Signal Processing 47.11 (1999): 2953-2964.

SLIDE 25

5. SIMILARITY AND DEPENDENCE. LINEAR CORRELATION

Linear correlation is a standard and well-known dependence measure. For two time series $X$ and $Y$ it is defined as

$\rho_{X,Y} = \rho(X, Y) = \frac{\mathbb{E}\left[ (X - \mathbb{E}[X])(Y - \mathbb{E}[Y]) \right]}{\sigma_X \sigma_Y}$

Often it is more convenient to standardize the series, $X' = \frac{X - \mathbb{E}[X]}{\sigma_X}$ and $Y' = \frac{Y - \mathbb{E}[Y]}{\sigma_Y}$, so that

$\rho(X, Y) = \mathbb{E}[X' Y']$

Ranging as $-1 \le \rho(X, Y) \le 1$, it captures linear dependence between $X$ and $Y$.

Limitation: does not capture non-linear relationships between $X$ and $Y$.

SLIDE 26

5. SIMILARITY AND DEPENDENCE. RENYI POSTULATES

Rényi [20] considered a general measure of dependence $\delta(X, Y)$ and postulated:

  • $\delta(X, Y)$ is defined for any pair of random variables $X, Y$, neither of them constant with probability 1
  • It is symmetric: $\delta(X, Y) = \delta(Y, X)$
  • $\delta(X, Y) = 0$ if and only if $X$ and $Y$ are independent
  • $0 \le \delta(X, Y) \le 1$
  • $\delta(X, Y) = 1$ only if there is a strict dependence between $X$ and $Y$
  • $\delta(X, Y) = \delta(f(X), g(Y))$ for one-to-one Borel-measurable functions $f$ and $g$
  • If the joint distribution of $X$ and $Y$ is normal, then $\delta(X, Y) = |\rho(X, Y)|$
[20] Rényi, Alfréd. "On measures of dependence." Acta Mathematica Hungarica 10.3-4 (1959): 441-451.

SLIDE 27

5. SIMILARITY AND DEPENDENCE. MAXIMAL CORRELATION COEFFICIENT

The maximal correlation coefficient proposed by Gebelein [21] satisfies all of Rényi's postulates and is defined as

$\rho_{max}(X, Y) = \max_{f, g} \rho(f(X), g(Y))$

where $f: \mathbb{R} \to \mathbb{R}$ and $g: \mathbb{R} \to \mathbb{R}$ are Borel-measurable functions with finite, non-degenerate variances, $0 < \mathrm{Var}\, f(X) < \infty$ and $0 < \mathrm{Var}\, g(Y) < \infty$. $\rho_{max}(X, Y) = 0$ if $X$ and $Y$ are independent; $\rho_{max}(X, Y) = 1$ if either $X = f(Y)$ or $Y = g(X)$.

Limitation: the functions $f^*, g^*$ that maximize $\rho(f(X), g(Y))$ are not guaranteed to have inverses $(f^*)^{-1}, (g^*)^{-1}$.

[21] Gebelein, Hans. "Das statistische Problem der Korrelation als Variations- und Eigenwertproblem und sein Zusammenhang mit der Ausgleichsrechnung." ZAMM - Journal of Applied Mathematics and Mechanics 21.6 (1941): 364-379.

SLIDE 28

5. MAXIMAL VS LINEAR AUTOCORRELATION

CitiBike pickup data has a daily periodicity depicted by the linear autocorrelation function. However, the maximal autocorrelation function demonstrates the presence of non-linear dependencies, since $\rho_{max}(X_1^{N-\tau}, X_\tau^N) > \rho(X_1^{N-\tau}, X_\tau^N)$ for every lag $\tau$.

SLIDE 29

5. SOME PROPERTIES OF MAXIMAL CORRELATION

Dembo et al. [22] demonstrated that for partial sums of i.i.d. random variables the maximal correlation coefficient equals

$\rho_{max}(S_m, S_n) = \sqrt{m/n}, \quad m \le n$

where $S_k = \sum_{i=1}^{k} Y_i$ and $Y_1, \dots, Y_N$ are independent identically distributed random variables with $\mathrm{Var}\, Y_i < \infty$.

Lancaster [23] has shown that if $(X, Y)$ is a bivariate Gaussian vector, the maximal correlation coefficient equals

$\rho_{max}(X, Y) = |\rho(X, Y)|$

[22] Dembo, Amir, Abram Kagan, and Lawrence A. Shepp. "Remarks on the maximum correlation coefficient." Bernoulli 7.2 (2001): 343-350.
[23] Lancaster, Henry Oliver. "Some properties of the bivariate normal distribution considered in the form of a contingency table." Biometrika 44.1/2 (1957): 289-292.

SLIDE 30

5. SOME PROPERTIES OF MAXIMAL CORRELATION

Witsenhausen [24] proposed a way to calculate the maximal correlation coefficient for discrete $X$ and $Y$:

  • First, the ordered sets $\alpha$ and $\beta$ that contain the unique values of $X$ and $Y$ are built.
  • Second, the probabilities $p_{ij} = P(X = \alpha_i, Y = \beta_j)$ are estimated from the contingency table of $X$ and $Y$.
  • Third, the normalized joint-probability matrix $R = (r_{ij})$ is computed,

$r_{ij} = \frac{p_{ij}}{\sqrt{p_{i\cdot}\, p_{\cdot j}}}$

Then the maximal correlation coefficient equals $\rho_{max}(X, Y) = \lambda_2$, where $\lambda_2$ is the second singular value of $R$.

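A sketch of this recipe with numpy: estimate the joint distribution from counts, normalize by the square roots of the marginals, and take the second singular value (the first one is always 1). Function and variable names are mine:

```python
import numpy as np

def maximal_correlation(x, y):
    """SVD estimate of the maximal correlation of two discrete samples."""
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    P = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(P, (xi, yi), 1.0)
    P /= P.sum()                              # joint probabilities p_ij
    px, py = P.sum(axis=1), P.sum(axis=0)     # marginals
    R = P / np.sqrt(np.outer(px, py))
    return float(np.linalg.svd(R, compute_uv=False)[1])

rng = np.random.default_rng(0)
x = rng.integers(-3, 4, size=5000)
print(round(maximal_correlation(x, x ** 2), 2))  # near 1: Y = f(X), though non-linear
```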
[24] Witsenhausen, Hans S. "On sequences of pairs of dependent random variables." SIAM Journal on Applied Mathematics 28.1 (1975): 100-113.

SLIDE 31

5. MONOTONE CORRELATION

The monotone correlation coefficient was proposed by Kimeldorf et al. [25] and is defined as follows: let $\mathcal{F} = \{ f: \mathbb{R} \to \mathbb{R} \mid f \text{ is monotone} \}$; then the monotone correlation is

$\rho_{mon}(X, Y) = \max_{f, g \in \mathcal{F}} \rho(f(X), g(Y))$

Overall, we have the following relationship between the linear, monotone, and maximal correlation coefficients:

$0 \le |\rho(X, Y)| \le \rho_{mon}(X, Y) \le \rho_{max}(X, Y) \le 1$

Limitation: there is no known formula that computes the value of the monotone correlation coefficient; it is a maximization problem.

[25] Kimeldorf, George, and Allan R. Sampson. "Monotone dependence." The Annals of Statistics (1978): 895-903.

SLIDE 32

6. CLUSTERING

We denote by $\mathbf{X}$ a collection of time series, $\mathbf{X} = \{X_1, \dots, X_N\}$, $X_i = \{X_{i,1}, X_{i,2}, \dots, X_{i,M}\}$, $X_{i,j} \in \mathbb{R}$. We define clustering as an unsupervised partitioning of $\mathbf{X}$ into $k$ groups, assigning one label from the set of labels $\{C_1, \dots, C_k\}$ to every time series $X_i$ such that each label $C_j$ is assigned at least once.

SLIDE 33

6. CLUSTERING. NAÏVE APPROACH

Clustering the discrete time series $X_1, X_2, X_3$ into two groups by naïvely applying k-means results in the clusters $\{X_2, X_3\}$ and $\{X_1\}$. However, it seems more reasonable to group $X_1$ and $X_2$ together. Using the autocorrelation functions $R = \{r_{xx}(X_1), r_{xx}(X_2), r_{xx}(X_3)\}$ instead of the actual values groups $X_1$ and $X_2$ in the same cluster, where

$r_{xx}(X) = \{ \rho(X, L^\tau X) \mid 2 \le \tau \le N - 1 \}, \quad L^\tau X_t = X_{t - \tau}$

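A sketch of this idea with numpy and scikit-learn: build an ACF feature vector per series, then run ordinary k-means on those features. The three toy series are mine; two share a period and one differs:

```python
import numpy as np
from sklearn.cluster import KMeans

def acf_features(x, max_lag=30):
    """Autocorrelation of x at lags 1..max_lag as a feature vector."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = float(np.dot(x, x))
    return np.array([np.dot(x[:-t], x[t:]) / denom for t in range(1, max_lag + 1)])

rng = np.random.default_rng(0)
t = np.arange(300)
X1 = np.sin(2 * np.pi * t / 12) + 0.3 * rng.normal(size=300)
X2 = 5 + 3 * np.sin(2 * np.pi * t / 12 + 1.0) + 0.3 * rng.normal(size=300)  # same period, new scale
X3 = np.sin(2 * np.pi * t / 50) + 0.3 * rng.normal(size=300)                # different period

features = np.stack([acf_features(x) for x in (X1, X2, X3)])
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features))
# X1 and X2 share a label; k-means on raw values would instead split on scale
```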
SLIDE 34

6. TYPES OF CLUSTERING APPROACHES

  • Raw data. Standard clustering techniques applied to raw data with modified distance or dissimilarity measures.
  • Feature generation. New features are generated from raw data and clustered with standard techniques.
  • Model assumption. Clustering based on model parameters and hypothesis testing.
SLIDE 35

6. RAW DATA CLUSTERING

Košmelj and Batagelj [26] modified relocation clustering, introducing a new dissimilarity measure with time-varying weights:

$D(X, Y) = \sum_t \alpha_t\, d_t(X, Y), \quad \alpha_t \ge 0, \quad \sum_t \alpha_t = 1$

Golay et al. [27] proposed cross-correlation distances

$d_{cc}^1(X, Y) = \left( \frac{1 - \rho(X, Y)}{1 + \rho(X, Y)} \right)^{\beta}, \quad d_{cc}^2(X, Y) = \sqrt{2\,(1 - \rho(X, Y))}$

Limitations: most of these approaches do not take sequences of values $X_{t-k}, X_{t-k+1}, \dots, X_t$ into account, but instead compare neighboring values.

[26] Košmelj, Katarina, and Vladimir Batagelj. "Cross-sectional approach for clustering time varying data." Journal of Classification 7.1 (1990): 99-109.
[27] Golay, Xavier, et al. "A new correlation-based fuzzy logic clustering algorithm for fMRI." Magnetic Resonance in Medicine 40.2 (1998): 249-260.

SLIDE 36

6. RAW DATA CLUSTERING

Möller-Levet et al. [28] described the short time series distance

$d_{STS}^2(X, Y) = \sum_k \left( \frac{\delta X_k}{\delta t_k} - \frac{\delta Y_k}{\delta t_k} \right)^2, \quad \delta X_k = X_k - X_{k-1}$

Batista et al. [29] showed that existing distance measures are not efficient for complex time series and proposed a complexity-invariant distance measure (CID),

$CID(X, Y) = \| X - Y \| \cdot \frac{\max(CE(X), CE(Y))}{\min(CE(X), CE(Y))}$

where $CE(X) = \| X_2^N - X_1^{N-1} \|$ estimates the distance between the time series and its lagged counterpart.

Limitations: these measures do not take subsequences into account but compare the series pairwise.

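A sketch of CID with numpy; the complexity estimate CE is the norm of the first differences, and the correction factor penalizes comparing series of very different complexity (assumes neither series is constant, so CE > 0):

```python
import numpy as np

def complexity_estimate(x):
    """CE(X): norm of the first differences of the series."""
    return float(np.linalg.norm(np.diff(x)))

def cid(x, y):
    """Complexity-invariant distance: Euclidean distance times a CE ratio."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    ce_x, ce_y = complexity_estimate(x), complexity_estimate(y)
    return float(np.linalg.norm(x - y)) * max(ce_x, ce_y) / min(ce_x, ce_y)

t = np.linspace(0, 2 * np.pi, 100)
smooth = np.sin(t)
jagged = np.sin(t) + 0.4 * np.random.default_rng(0).normal(size=100)
euclidean = float(np.linalg.norm(smooth - jagged))
print(round(cid(smooth, jagged), 2), ">=", round(euclidean, 2))  # the CE ratio inflates the distance
```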
[28] Möller-Levet, Carla S., et al. "Fuzzy clustering of short time-series and unevenly distributed sampling points." International Symposium on Intelligent Data Analysis. Springer, Berlin, Heidelberg, 2003.
[29] Batista, Gustavo E. A. P. A., Xiaoyue Wang, and Eamonn J. Keogh. "A complexity-invariant distance measure for time series." Proceedings of the 2011 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2011.

SLIDE 37

6. FEATURE CLUSTERING

The approaches include clustering of autocorrelation functions, Fourier and wavelet transformations, dimensionality reduction, and other transformations of the raw time series. Vlachos et al. [30] proposed running the k-means clustering algorithm on level $j$ of the Haar wavelet representation of the data, projecting the final centers obtained for level $j$ from a space of dimension $2^j$ to one of dimension $2^{j+1}$ for level $j + 1$, and repeating the previous steps if any time series were swapped between clusters. Fu et al. [31] proposed time series smoothing and dimensionality reduction prior to clustering with self-organizing maps: only the Perceptually Important Points (PIP) that match predefined query points are kept, with the match measured by the mean squared difference between them.

Limitations: the approaches are very specific and require data engineering.

[30] Vlachos, Michail, et al. "A wavelet-based anytime algorithm for k-means clustering of time series." Proc. Workshop on Clustering High Dimensionality Data and Its Applications, 2003.
[31] Fu, Tak-chung, et al. "Pattern discovery from stock time series using self-organizing maps." Workshop Notes of KDD2001 Workshop on Temporal Data Mining, 2001.

SLIDE 38

6. MODEL-BASED CLUSTERING

The main assumption of these approaches is that there exists an underlying generating process that can be modeled with certain models (e.g. ARMA and its modifications). Maharaj [32] proposed a clustering approach based on a $\chi^2$ test statistic. They assumed that time series $X$ and $Y$ are generated by autoregressive processes of order $k$, AR($k$), with parameters $\theta^X = \{\theta_1^X, \dots, \theta_k^X\}$ and $\theta^Y = \{\theta_1^Y, \dots, \theta_k^Y\}$. Setting the null hypothesis $H_0\!:\ \theta^X = \theta^Y$, they cluster $X$ and $Y$ together if the $p$-value is greater than a predefined threshold.

Disadvantage: the main weakness of the approach is the simplicity and linearity of the AR($k$) model. In general it might be hard to select the appropriate model.

[32] Maharaj, Elizabeth Ann. "Cluster of time series." Journal of Classification 17.2 (2000): 297-314.

SLIDE 39

7. FORECASTING. LEMPEL-ZIV AND MARKOV CHAINS

The LZW algorithm partitions a time series $X$ into a collection of subsequences $S_0, S_1, \dots, S_M$, where $S_k$ starts at time $k$ and represents the shortest previously unobserved subsequence. The prediction is made as

$P(X_{t+1} = \beta \mid X_1^t) = \frac{C(S_k \beta \mid X_1^t)}{C(S_k \mid X_1^t)}$

where $C(S_k \beta \mid X_1^t)$ stands for the number of times $S_k$ is followed by $\beta$.

A Markov chain predictor assumes stationarity,

$P(X_{t+1} = \beta \mid X_1^t) = P(X_{t+k+1} = \beta \mid X_{k+1}^{t+k})$

Denoting the repeating subsequence (the context) $X_{t-k+1}^{t} = \{S_1, \dots, S_k\} = S$, we predict

$P(X_{t+1} = \beta \mid X_1^t) = \frac{C(S \beta \mid X_1^t)}{C(S \mid X_1^t)}$

Limitations: LZW and Markov chain-based predictors treat numeric values as symbols and cannot capture the actual magnitudes of the values $X_t$ of the time series.

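A sketch of the order-$k$ Markov predictor: count, over the history, what follows each occurrence of the length-$k$ context, then predict the most frequent successor (symbols only, exactly as the limitation notes):

```python
from collections import Counter, defaultdict

def markov_predict(seq, k=2):
    """Most likely next symbol given the last k symbols, with its probability."""
    followers = defaultdict(Counter)
    for t in range(len(seq) - k):
        followers[tuple(seq[t:t + k])][seq[t + k]] += 1
    counts = followers.get(tuple(seq[-k:]))
    if not counts:                        # context never seen: no prediction
        return None, 0.0
    symbol, c = counts.most_common(1)[0]
    return symbol, c / sum(counts.values())

seq = [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2]
print(markov_predict(seq, k=2))           # (3, 1.0): context (1, 2) is always followed by 3
```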
SLIDE 40

7. FORECASTING. MACHINE LEARNING. SVM AND NN

We partition a time series $X$ into overlapping subseries of size $p$, creating feature vectors $\mathbf{x} \in \mathbb{R}^{(p-1) \times (N-p+1)}$ and a label vector $\mathbf{y} \in \mathbb{R}^{N-p+1}$:

$\mathbf{x} = \{X_1^{p-1}, X_2^{p}, \dots, X_{N-p+1}^{N-1}\}, \quad \mathbf{y} = X_p^N$

Standard models are fit on $\mathbf{x}$ and $\mathbf{y}$. Support Vector Machines (SVM) [33] minimize

$\frac{1}{N-p+1} \sum_{i=1}^{N-p+1} \max\left(0,\ 1 - y_i (w^T x_i - b)\right) + \lambda \|w\|^2$

Basic neural networks (NN) fit the model [34]

$f(x_i) = w_2^T\, \sigma(w_1^T x_i + b_1) + b_2$

where $\sigma$ is the activation function.

Limitations: this machine learning approach does not take into account the order within each vector $x_i$.

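A sketch of the windowing step with a scikit-learn support vector regressor standing in for the SVM above (SVR is the regression variant, which suits forecasting real values; the toy series is mine):

```python
import numpy as np
from sklearn.svm import SVR

def make_windows(x, p):
    """Overlapping subseries: features are p-1 consecutive values, label is the p-th."""
    X = np.array([x[i:i + p - 1] for i in range(len(x) - p + 1)])
    y = np.array([x[i + p - 1] for i in range(len(x) - p + 1)])
    return X, y

series = np.sin(2 * np.pi * np.arange(500) / 25)
X, y = make_windows(series, p=10)
model = SVR(kernel="rbf").fit(X[:-1], y[:-1])     # hold out the last window
print(round(float(model.predict(X[-1:])[0]), 2), "vs actual", round(float(y[-1]), 2))
```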
[33] Wu, Qiang, and Ding-Xuan Zhou. "SVM soft margin classifiers: linear programming versus quadratic programming." Neural Computation 17.5 (2005): 1160-1187.

SLIDE 41

7. FORECASTING. ARIMA MODELS

ARIMA models [34] seem to be the most natural predictors for time series: they combine autoregression and an integrated moving average. ARIMA is defined as

$\left(1 - \sum_{i=1}^{p} \phi_i L^i\right) (1 - L)^d X_t = \delta + \left(1 + \sum_{i=1}^{q} \theta_i L^i\right) \varepsilon_t$

where $\theta$ and $\phi$ are the moving average and autoregression parameters, $p$ and $q$ define the lags of the autoregression and the moving average, $d$ defines the integration, $\varepsilon_t$ stands for the normally distributed error, $L^k X_t = X_{t-k}$, and $\delta$ regulates the drift of the model. The parameters are often selected with the Akaike Information Criterion (AIC) [35]:

$\mathrm{AIC}(p, d, q) = -2 \log \mathcal{L} + 2(p + q + k + 1)$

where $\mathcal{L}$ is the maximized likelihood and $k = 1$ if there is a constant term.

Limitations: does not capture non-linear relations within the time series.

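A sketch with statsmodels, whose ARIMA class implements this family; order=(p, d, q) and the trend="c" constant correspond to the parameters named above:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
x = np.zeros(500)
for t in range(1, 500):                 # simulate an AR(1): X_t = 0.7 X_{t-1} + eps
    x[t] = 0.7 * x[t - 1] + rng.normal()

fit = ARIMA(x, order=(1, 0, 0), trend="c").fit()
print(fit.params)                       # constant and an AR coefficient near 0.7
print(fit.forecast(steps=5))            # 5-step-ahead forecast
print(round(fit.aic, 1))                # the AIC used for order selection
```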
[34] Box, George E. P., et al. Time Series Analysis: Forecasting and Control. John Wiley & Sons, 2015.
[35] Bozdogan, Hamparsum. "Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions." Psychometrika 52.3 (1987): 345-370.

SLIDE 42

7. FORECASTING. RNN

Recurrent Neural Network (RNN) models [36] are among the first to capture temporal dependencies within the subsequences of a time series. A fully interconnected model:

$\hat{X}_t = \sum_{i=1}^{I} w_i\, h_i(t), \quad h_i(t) = f\left( \sum_{j=1}^{\max(p,q)} \tilde{w}_{ij} X_{t-j} + \sum_{l=1}^{q} \sum_{m=1}^{I} \tilde{w}_{iml}\, h_m(t - l) + \theta_i \right)$

Limitations: the models are too complex, and it is impractical to store long subsequences.

[36] Connor, Jerome T., R. Douglas Martin, and Les E. Atlas. "Recurrent neural networks and robust time series prediction." IEEE Transactions on Neural Networks 5.2 (1994): 240-254.

SLIDE 43

7. FORECASTING. LSTM

Long Short-Term Memory (LSTM) networks [37] introduced neurons with internal structure. An LSTM cell contains three kinds of gates: input, output, and forget. Two input gates:

$i_1 = \tanh\left( b_{i_1} + X_t W_{i_1}^1 + h_{t-1} W_{i_1}^2 \right), \quad i_2 = \sigma\left( b_{i_2} + X_t W_{i_2}^1 + h_{t-1} W_{i_2}^2 \right)$

One forget gate:

$f = \sigma\left( b_f + X_t W_f^1 + h_{t-1} W_f^2 \right)$

And one output gate:

$o = \sigma\left( b_o + X_t W_o^1 + h_{t-1} W_o^2 \right)$

Overall, the model is the combination

$h_t = \tanh\left( f + i_1 \otimes i_2 \right) \otimes o$

[37] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9.8 (1997): 1735-1780.

SLIDE 44

SUMMARY

1. Combinatorial estimators of complexity and entropy, as well as predictors, including the Lempel-Ziv approximator, permutation entropy, Markov chains, and Lempel-Ziv predictors, treat a time series as a sequence of symbols and do not take the actual values into account.
2. Methods quantifying predictability rely on entropy or assumed models.
3. Dependency measures efficient at capturing non-linear patterns use transformations that are not guaranteed to have inverses.
4. The majority of the distance measures proposed for time series clustering compare values pairwise and do not take subsequences into account.
5. Machine learning approaches to clustering and time series forecasting do not take the order within features into account.
SLIDE 45

FUTURE DIRECTIONS

1. Creating new non-linear dependency measures with transformations that are guaranteed to have inverses (developing approximations for the maximal and monotone correlations).
2. Generalizing combinatorial methods for complexity and entropy estimation so that they take the magnitudes of time series into account.
3. Combining dependency measures with neural networks. Is there a link, and can one improve the other?
SLIDE 46

THANK YOU! QUESTIONS?