IRDM ‘15/16
Jilles Vreeken
Chapter 7-1: Sequential Data
24 Nov 2015 Revision 1, November 26th
Definition of smoothing clarified
IRDM Chapter 7, overview

Time Series
1. Basic Ideas
2. Prediction
3. Motif Discovery

Discrete Sequences
4. Basic Ideas
5. Pattern Discovery
6. Hidden Markov Models

You'll find this covered in Aggarwal Ch. 3.4, 14, 15
Time Series: Basic Ideas (Aggarwal Ch. 14.1-14.2)
Temp (°C): 28.2, 25.4, 30.5, 15.7, 33.4, 29.4, 28.6, 16.1, 28.5, 27.9, 15.5, 31.4
Time     Temp (°C)
June-15  28.2
June-16  25.4
June-17  30.5
June-18  15.7
June-19  33.4
June-20  29.4
June-22  28.6
June-23  16.1
June-24  28.5
June-25  27.9
June-26  15.5
June-27  31.4
[Figure: Daily Temperature plot]
Time     Temp (°C)
June-15  28.2
June-16  25.4
June-17  30.5
Sept-18  15.7
June-19  33.4
June-20  29.4
June-22  28.6
Sept-23  16.1
Sept-24  28.5
June-25  27.9
Sept-26  15.5
June-27  31.4
[Figure: Daily Temperature plot]
Stock analysis, weather forecasting, health monitoring, social network analysis
A time series of length n consists of n tuples (t_1, X_1), (t_2, X_2), …, (t_n, X_n) where for a tuple (t_i, X_i), t_i is the timestamp and X_i is the data at time t_i, and we have a total order over the timestamps.
Length
may either be finite or infinite
Time stamps
may be continuous, in practice integers are easier
Data
when talking about time series, usually numeric, continuous real-valued
may be univariate (one attribute) or multivariate (multiple attributes)
Consider data X_i at time t_i as a random variable
the actual data we observe at t_i is a realization of X_i
Some probabilistic properties can be stable over time
e.g. the mean μ_i of X_i does not change (much)
the covariance between pairs (X_i, X_{i+h}) is (almost) the same as for (X_1, X_{1+h}), i.e., the autocovariance of X_i does not change (much)
A time series is stationary if the process behind it does not change over time:
μ_s = μ_t = μ for all s, t, and γ_XX(s, t) = γ_XX(t − s) = γ_XX(τ), where τ = |t − s| is the amount of time by which the signal is shifted
Stationary time series are easy to model and predict
most real-world time series, however, are anything but stationary
(recall, if X_i has mean μ_i = E[X_i], then γ_XX(s, t) = cov(X_s, X_t) = E[X_s X_t] − μ_s μ_t)
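As a quick illustration of these definitions, here is a minimal Python/NumPy sketch (not from the slides; the helper name and the toy white-noise series are ours) that estimates the sample mean and the sample autocovariance at a given lag; for a weakly stationary series these estimates should look similar on different parts of the data.

import numpy as np

def autocovariance(x, lag):
    # sample autocovariance gamma(lag) = E[(X_t - mu)(X_{t+lag} - mu)]
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    return np.mean((x[:len(x) - lag] - mu) * (x[lag:] - mu))

rng = np.random.default_rng(0)
x = rng.normal(size=1000)                 # white noise: stationary by construction
for half in (x[:500], x[500:]):
    print(half.mean(), autocovariance(half, lag=1))
# both halves give a mean near 0 and gamma(1) near 0, as expected for a stationary series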
[Figures: Daily Temperature and Monthly Temperature plots]
[Figure: Monthly Temperature, 2011-2013]
Classically, we assume a time series X is composed of X_i = seasonality_i + trend_i + noise_i, where noise_i is stationary. To make X stationary, we simply have to remove seasonality and trend.
Seasonality is essentially periodicity
seasonality is a periodic function of time with period d: seasonality_i = seasonality_{i−d}
How to find the seasonality function?
1. by fitting a sine or cosine
difficult – the signal may also be sine'ish
2. by differencing
X_i = seasonality_i + trend_i + noise_i
X_{i−d} = seasonality_{i−d} + trend_{i−d} + noise_{i−d}
subtracting the two cancels the seasonal component, since seasonality_i = seasonality_{i−d}:
X_i' = X_i − X_{i−d}
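A minimal sketch of seasonal differencing in Python/NumPy (not from the slides; the toy monthly series and the period d = 12 are our own choices, matching the example that follows):

import numpy as np

def seasonal_difference(x, d=12):
    # X'_i = X_i - X_{i-d}: cancels a seasonal component with period d
    x = np.asarray(x, dtype=float)
    return x[d:] - x[:-d]

t = np.arange(120)
x = 10 * np.sin(2 * np.pi * t / 12) + 0.05 * t + np.random.randn(120)  # yearly cycle + trend + noise
x_deseason = seasonal_difference(x, d=12)   # the seasonal cycle is (mostly) gone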
[Figure: Monthly Temperature, 2011-2013, after seasonal differencing X_i' = X_i − X_{i−d} with d = 12]
This is the time series we obtained by removing seasonality
Trend is a polynomial function
How to find the trend function?
1. by fitting functions
difficult to do, up to what order, when to stop?
2. by differencing
X_i' = X_i − X_{i−1}
X_i'' = X_i' − X_{i−1}'
usually 2 times is enough
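First-order differencing, applied once or twice, is equally short in Python/NumPy; np.diff computes exactly X_i' = X_i − X_{i−1} (the toy series below is our own illustration, not the slides' data):

import numpy as np

t = np.arange(108)
x = 0.05 * t + np.random.randn(108)      # leftover linear trend + noise
x_diff1 = np.diff(x, n=1)                # X'_i  = X_i  - X_{i-1}
x_diff2 = np.diff(x, n=2)                # X''_i = X'_i - X'_{i-1}; twice is usually enough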
[Figure: Monthly Temperature after differencing, X_i' = X_i − X_{i−1}]
This is the time series we obtained by removing seasonality and trend
The left-over fluctuations are either noise or non-trivial patterns
We can infer missing values by interpolation:
X_l = X_i + (t_l − t_i) / (t_k − t_i) × (X_k − X_i)
where t_i < t_l < t_k
i  Time     Temp (°C)
1  June-19  33.4
2  June-20  29.4
4  June-22  ?
5  June-23  16.1
Temperature on June-22:
X_4 = X_2 + (t_4 − t_2) / (t_5 − t_2) × (X_5 − X_2) = 29.4 + (4 − 2) / (5 − 2) × (16.1 − 29.4) ≈ 20.5
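The same computation as a tiny Python sketch (the function name is ours, just for illustration):

def interpolate(t_i, x_i, t_k, x_k, t_l):
    # linear interpolation of the missing value at time t_l, with t_i < t_l < t_k
    return x_i + (t_l - t_i) / (t_k - t_i) * (x_k - x_i)

# temperature on June-22 (t = 4), interpolated from June-20 (t = 2) and June-23 (t = 5)
print(interpolate(2, 29.4, 5, 16.1, 4))   # ~20.5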
We can remove noise by smoothing. Standard options include averaging
X_i' = avg(X_{i−w}, …, X_i)
where window length w is a user-specified parameter. We can give more weight to recent values by exponential smoothing
X_i' = (1 − β)^i · X_0' + β · Σ_{k=1}^{i} (1 − β)^{i−k} · X_k
where the user chooses decay factor β.
(updated on Nov 26th: we now average explicitly over past values)
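A sketch of both smoothers in Python/NumPy, following the formulas above (the helper names are ours; w and β are the user-chosen parameters):

import numpy as np

def moving_average(x, w):
    # X'_i = avg(X_{i-w}, ..., X_i); the first few values use a shorter window
    x = np.asarray(x, dtype=float)
    return np.array([x[max(0, i - w):i + 1].mean() for i in range(len(x))])

def exponential_smoothing(x, beta):
    # recursive form of the sum above: X'_i = (1 - beta) * X'_{i-1} + beta * X_i, with X'_0 = X_0
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    out[0] = x[0]
    for i in range(1, len(x)):
        out[i] = (1 - beta) * out[i - 1] + beta * x[i]
    return out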
Time Series: Prediction (Aggarwal Ch. 14.3)
If we wish to make predictions, then clearly we must assume that something is stable over time.
Future values depend on past values + random noise
assumption: the time series depends on autocorrelation
Which past values?
the w immediately previous values
What relation between past and future?
linear combination
What kind of noise?
Gaussian
Future value is a linear combination of past values + white noise:
X_t = Σ_{i=1}^{w} a_i · X_{t−i} + c + ε_t
where ε_t ~ 𝒩(0, σ²)
(a linear combination of past values, plus noise with shifted mean)
ε_t = X_t − (a_1 · X_{t−1} + a_2 · X_{t−2} + ⋯ + a_w · X_{t−w} + c)
i.e., the difference between the actual value and the predicted value.
Given data D of N training instances, we want to find a_1, …, a_w and c that minimise the mean squared error
(1 / (N − w)) Σ_{t=w+1}^{N} ε_t²
the prediction error is simply the Gaussian noise in the AR model, the smaller we can get this value, the better!
Find a_1, …, a_w and c that minimize (1 / (N − w)) Σ_{t=w+1}^{N} ε_t²
There are different solving strategies available
ordinary least squares, assumes ε_t and X_t are uncorrelated
generalized least squares, assumes correlation exists but is known
iteratively reweighted least squares, assumes correlation is unknown
Many standard tools available to do AR
MATLAB: the ar function
R: the arima function
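Ordinary least squares for AR(w) is also easy to write directly; a minimal NumPy sketch (our own illustration, assuming a 1-d array x and a chosen window length w):

import numpy as np

def fit_ar(x, w):
    # fit X_t = a_1*X_{t-1} + ... + a_w*X_{t-w} + c by ordinary least squares
    x = np.asarray(x, dtype=float)
    # one row per target X_t; columns are X_{t-1}, ..., X_{t-w} and a constant 1
    A = np.column_stack([x[w - i:len(x) - i] for i in range(1, w + 1)] + [np.ones(len(x) - w)])
    y = x[w:]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]            # (a_1, ..., a_w) and c

def predict_next(x, a, c):
    # one-step-ahead prediction from the last w observed values
    w = len(a)
    return float(np.dot(a, x[-1:-w - 1:-1]) + c)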
Monthly temperature measured above the ground in a province of Vietnam from 1971 to 2001
[Figures: the original data (Jan-71 to Oct-01), the series with the season removed, and the series after differencing once]
[Figures: mean squared error vs. w for the original data, the season-removed series, and the once-differenced series]
These plots show how the MSE behaves with respect to w, i.e., they help us choose w.
Future values depend on a deterministic factor + noise
assumption: the time series depends on historical shocks
What deterministic factor?
the mean of the time series
Noise over what past values?
the current value and the q immediately previous values
What kind of noise?
Gaussian
The MA(q) model is defined as
X_t = μ + ε_t + Σ_{i=1}^{q} b_i · ε_{t−i}
where ε_i ~ 𝒩(0, σ²)
(the terms are the mean, the current noise, and past noise)
Recall, for the AR(w) model we had
X_t = c + ε_t + Σ_{i=1}^{w} a_i · X_{t−i}
Find those b_1, …, b_q that minimize the error
Unlike for AR, this problem is not linear
to identify the noise terms, we need to know b_1, …, b_q
to identify b_1, …, b_q, we need to know the noise terms
typically we use an iterative non-linear fitting approach, instead of linear least-squares
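In practice this iteration is rarely written by hand; a hedged sketch using statsmodels (our own example, assuming a reasonably recent statsmodels is installed; an MA(q) model is an ARIMA model with order (0, 0, q)):

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
x = rng.normal(size=500)                 # replace with your (stationary) series

ma = ARIMA(x, order=(0, 0, 2)).fit()     # MA(2): mean + current noise + two past noise terms
print(ma.params)                         # estimated mean, b_1, b_2, and the noise variance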
ARMA combines the AR model with the MA model
Future values depend on past values + historical noise
the time series depends on both autocorrelation and historical shocks
The ARMA model has two parameters, w and q
window length w for autocorrelation
history length q for noise
What kind of noise?
Gaussian
ARMA combines the AR model with the MA model
Autoregressive model, AR(w): X_t = c + ε_t + Σ_{i=1}^{w} a_i · X_{t−i}
Moving Average model, MA(q): X_t = μ + ε_t + Σ_{i=1}^{q} b_i · ε_{t−i}
Autoregressive Moving Average model, ARMA(w, q): X_t = c + ε_t + Σ_{i=1}^{w} a_i · X_{t−i} + Σ_{i=1}^{q} b_i · ε_{t−i}
Find those a_i and b_i and c that minimize the error
We need non-linear least-squares regression
many standard tools to do this: MATLAB and R implement ARMA as 'arma' resp. 'arima'
How to set w and q?
as small as possible, so that the model still fits the data well
aka, good luck
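A hedged sketch of fitting ARMA(w, q) with statsmodels (our own example; ARMA(w, q) is ARIMA with order (w, 0, q)). One common heuristic, not from the slides, is to try small orders and keep the pair with the lowest AIC:

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)
x = rng.normal(size=500)                      # replace with your (stationary) series

best = None
for w in range(4):                            # keep w and q small
    for q in range(4):
        res = ARIMA(x, order=(w, 0, q)).fit()
        if best is None or res.aic < best[0]:
            best = (res.aic, w, q, res)

aic, w, q, res = best
print(f"ARMA({w},{q}) with AIC {aic:.1f}")
print(res.forecast(steps=3))                  # predict the next three values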
Time Series: Motif Discovery (Aggarwal Ch. 14.4, 3.4)
A motif is a shape that frequently repeats in a time series
shape can also be called 'pattern'
Many variations of motif discovery exist
contiguous versus non-contiguous shapes
low versus high granularities
single time series versus databases of time series
When does a motif belong to a time series?
there are two main methods for deciding
1. distance-based support: a segment X[a, b] of a sequence X is said to support a motif Z when the distance d(X[a, b], Z) between the segment and the motif is below some threshold ε.
2. discrete-matching based support: first we discretise the time series X into a discrete sequence S. A motif is now a (frequent) subsequence of S.
A motif, a sequence S = (S_1, …, S_w) of real values, is said to approximately match a contiguous subsequence of length w in time series X if the distance between (S_1, …, S_w) and (X_i, …, X_{i+w−1}) is at most ε.
commonly, Euclidean distance or Dynamic Time Warping
The frequency of a motif is its number of occurrences
the number of matches of a motif S = (S_1, …, S_w) to the time series (X_1, …, X_n) at threshold ε is equal to the number of windows of length w in X for which the distance is at most ε
Nobody wants all motifs
many ε-similar matches for even a single true occurrence; instead, we aim for the top-k best motifs
As with frequent itemset mining, redundancy is an issue
we need to keep the top-k diverse: distances between any pair of motifs must be at least 2 · ε
begin
  for i = 1 to n − w + 1 do begin
    Candidate = (X_i, …, X_{i+w−1})
    for j = 1 to n − w + 1 do begin
      CompareTo = (X_j, …, X_{j+w−1})
      d = distance(Candidate, CompareTo)
      if d < ε and (non-trivial match) then increment support count of Candidate
    endfor
    if Candidate has the highest count found so far then update BestCandidate
  endfor
  return BestCandidate
end
(trivially expanded to top-k)
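The same brute-force search as a small Python/NumPy sketch (our own illustration, using Euclidean distance and a simple 'non-trivial match' rule that ignores windows overlapping the candidate):

import numpy as np

def best_motif(x, w, eps):
    # returns (start index, support) of the length-w window with the most eps-matches
    x = np.asarray(x, dtype=float)
    windows = np.array([x[i:i + w] for i in range(len(x) - w + 1)])
    best = (None, -1)
    for i, cand in enumerate(windows):
        dists = np.sqrt(((windows - cand) ** 2).sum(axis=1))          # Euclidean distances
        nontrivial = np.abs(np.arange(len(windows)) - i) >= w         # skip overlapping matches
        support = int(np.sum((dists < eps) & nontrivial))
        if support > best[1]:
            best = (i, support)
    return best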
Finding the best motif takes O(n²) distance computations
Practical complexity largely depends on the distance function
Euclidean distance is fast
Dynamic Time Warping is often better, but much slower
Lower bounds are our friend
if the lower bound on the distance between a motif and a window is greater than ε, the window will never support the motif
piecewise-aggregate approximations (PAA) allow fast computation
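A minimal sketch of a piecewise-aggregate approximation (our own illustration): the window is cut into m equal-length segments, each replaced by its mean; the appropriately scaled distance between the PAA vectors lower-bounds the Euclidean distance between the raw windows, so it can be used to prune comparisons.

import numpy as np

def paa(x, m):
    # piecewise-aggregate approximation: mean of each of m equal segments (len(x) divisible by m)
    x = np.asarray(x, dtype=float)
    return x.reshape(m, len(x) // m).mean(axis=1)

def paa_lower_bound(x, y, m):
    # sqrt(w/m) * ||paa(x) - paa(y)|| is a cheap lower bound on the Euclidean distance ||x - y||
    w = len(x)
    return np.sqrt(w / m) * np.linalg.norm(paa(x, m) - paa(y, m))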
Prediction over time is one of the most important and most used data analysis problems – predictive analytics
There exist two main types of sequential data
continuous real-valued time series and discrete event sequences
for both, specialised algorithms exist
In practice, despite many assumptions, ARMA is powerful
often used in industry, learn how to use it, learn when to use it
Patterns in time series are called motifs
by choosing a distance function they can be mined directly from time series