Changepoint detection for time series prediction, Allen B. Downey



SLIDE 1

Changepoint detection for time series prediction

Allen B. Downey Olin College of Engineering

SLIDE 2

My background:

Predoc at San Diego Supercomputer Center. Dissertation on workload modeling, queue time prediction, and malleable job allocation for parallel machines.

Recent: Network measurement and modeling. Current: History-based prediction.

SLIDE 3

Connection?

Resource allocation based on prediction. Prediction based on history. Historical data characterized by changepoints (nonstationarity).

SLIDE 4

Three ways to characterize variability:

Noise around a stationary level. Noise around an underlying trend. Abrupt changes in level: changepoints.

Important difference:

Data prior to a changepoint is irrelevant to performance after it.

SLIDE 5

Example: wide area networks

Some trends (accumulating queue). Many abrupt changepoints.

  • Beginning and end of transfers.
  • Routing changes.
  • Hardware failure, replacement.

SLIDE 6

Example: parallel batch queues

Some trends (daily cycles). Some abrupt changepoints.

  • Start/completion of wide jobs.
  • Queue policy changes.
  • Hardware failure, replacement.

SLIDE 7

My claim:

Many systems are characterized by changepoints where data before a changepoint is irrelevant to performance after it.

In these systems, good predictions depend on changepoint detection, because old data is wrong.

Discussion?

SLIDE 8

Two kinds of prediction:

Single-value prediction. Predictive distribution:

  • Summary stats.
  • Intervals.
  • P(error > thresh)
  • E[cost(error)]

SLIDE 9

If you assume stationarity, life is good:

Accumulate data indefinitely. Predictive distribution = observed distribution.

But this is often not a good assumption.

SLIDE 10

If the system is nonstationary:

Fixed window? Exponential decay? Look back too far: obsolete data. Not far enough: loss of useful information.

SLIDE 11

If you know where the changepoints are:

Use data back to the latest changepoint. Less information immediately after.

SLIDE 12

If you don’t know, you have to guess.

P(i) = probability of a changepoint at time i.

Example: 150 data points, P(50) = 0.7, P(100) = 0.5.

How do you generate a predictive distribution?

SLIDE 13

Two steps:

Derive P(i+): the probability that i is the latest changepoint. Then compute a weighted mixture of predictive distributions, one per candidate i.

Example: with P(50) = 0.7 and P(100) = 0.5 (treated as independent):
P(100+) = 0.5, P(50+) = 0.7 · (1 − 0.5) = 0.35, P(⊘) = (1 − 0.7)(1 − 0.5) = 0.15.
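The example weights can be derived mechanically. A minimal sketch, assuming the per-time changepoint probabilities P(i) are independent (the function name is mine, not from the talk):

```python
def latest_changepoint_weights(p):
    """Given independent changepoint probabilities p[i] keyed by time,
    return P(i+) for each candidate i (probability that i is the LATEST
    changepoint), plus P(no changepoint at all)."""
    weights = {}
    p_no_later = 1.0                    # probability of no changepoint after i
    for i in sorted(p, reverse=True):   # scan from latest to earliest
        weights[i] = p[i] * p_no_later  # changepoint at i, none later
        p_no_later *= (1.0 - p[i])
    return weights, p_no_later          # p_no_later is now P(no changepoint)

w, p_none = latest_changepoint_weights({50: 0.7, 100: 0.5})
# w[100] = 0.5, w[50] = 0.35, p_none = 0.15, matching the slide
```

The weights and P(⊘) are exclusive and exhaustive, so they sum to 1.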

SLIDE 14

Predictive distribution = 0.50 · edf(100, 150) + 0.35 · edf(50, 150) + 0.15 · edf(0, 150), where edf(a, b) is the empirical distribution of the data from time a to time b.
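That mixture can be evaluated directly as a weighted combination of empirical distribution functions. A sketch with placeholder data; `edf` and `mixture_cdf` are illustrative names, not from the talk:

```python
import bisect

def edf(data, x):
    """Empirical distribution function of `data` evaluated at x."""
    return bisect.bisect_right(sorted(data), x) / len(data)

def mixture_cdf(series, weights, x):
    """Predictive CDF at x: a weighted mix of EDFs, each using the data
    from one candidate latest-changepoint onward. `weights` maps a start
    index to P(that start is the latest changepoint)."""
    return sum(w * edf(series[start:], x) for start, w in weights.items())

# Illustrative: the weights from the slide; start index 0 = "no changepoint".
series = list(range(150))                  # placeholder data
weights = {100: 0.50, 50: 0.35, 0: 0.15}
p = mixture_cdf(series, weights, x=120)
```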

SLIDE 15

So how do you generate the probabilities P(i+)? Three steps:

Bayes’ theorem. Simple case: you know there is 1 changepoint. General case: unknown # of changepoints.

SLIDE 16

Bayes’ theorem (diachronic interpretation):

P(H|E) = P(E|H) P(H) / P(E)

H is a hypothesis, E is a body of evidence. P(H|E) is the posterior; P(H) is the prior. P(E|H) is usually easy to compute. P(E) is often not.

SLIDE 17

Unless you have a suite of exclusive hypotheses:

P(Hi|E) = P(E|Hi) P(Hi) / P(E), where P(E) = Σ_{Hj ∈ S} P(E|Hj) P(Hj)

In that case life is good.
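With exclusive, exhaustive hypotheses the normalizing constant P(E) is just the sum of the joint terms. A minimal sketch (hypothesis labels and numbers are illustrative):

```python
def posterior(priors, likelihoods):
    """Bayes' theorem over a suite of exclusive, exhaustive hypotheses:
    P(Hi|E) = P(E|Hi) P(Hi) / sum_j P(E|Hj) P(Hj)."""
    joint = {h: likelihoods[h] * priors[h] for h in priors}
    p_e = sum(joint.values())            # P(E), the normalizing constant
    return {h: j / p_e for h, j in joint.items()}

post = posterior({'A': 0.5, 'B': 0.5}, {'A': 0.8, 'B': 0.2})
# post['A'] = 0.8, post['B'] = 0.2
```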

SLIDE 18

If you know there is exactly one changepoint in an interval...

...then the P(i) are exclusive hypotheses, and all you need is P(E|i).

Which is pretty much a solved problem.
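One common way to get P(E|i) for the single-changepoint case, sketched here as an assumption since the slides don’t fix a likelihood model: Gaussian noise with known σ, segment means profiled out using the sample means, and a uniform prior over split points.

```python
import math

def gaussian_loglik(seg, mu, sigma):
    """Log-likelihood of a segment under N(mu, sigma^2)."""
    return sum(-0.5 * ((x - mu) / sigma) ** 2
               - math.log(sigma * math.sqrt(2 * math.pi)) for x in seg)

def changepoint_posterior(data, sigma):
    """Posterior over the location of a single changepoint: compute
    P(E|i) for each split i, then normalize -- the splits are exclusive
    hypotheses, so Bayes' theorem applies directly."""
    n = len(data)
    logliks = []
    for i in range(1, n):                 # split: data[:i] | data[i:]
        left, right = data[:i], data[i:]
        mu_l = sum(left) / len(left)
        mu_r = sum(right) / len(right)
        logliks.append(gaussian_loglik(left, mu_l, sigma)
                       + gaussian_loglik(right, mu_r, sigma))
    m = max(logliks)                      # subtract max for numerical safety
    w = [math.exp(ll - m) for ll in logliks]
    z = sum(w)
    return [x / z for x in w]             # P(i|E) for i = 1..n-1

post = changepoint_posterior([0, 0, 0, 0, 5, 5, 5, 5], sigma=1.0)
# the posterior should peak at the true split, i = 4
```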

SLIDE 19

What if the # of changepoints is unknown?

P(i) are no longer exclusive. But the P(i+) are. And you can write a system of equations for P(i+).

SLIDE 20

P(i+) = P(i+|⊘) P(⊘) + Σ_{j<i} P(i+|j++) P(j++)

P(j++) is the probability that the second-to-last changepoint is at j.

P(i+|j++) reduces to the simple problem. P(⊘) is the probability that we have not seen two changepoints. P(i+|⊘) reduces to the simple problem (plus).

Great, so what’s P(j++)?

SLIDE 21

P(j++) = Σ_{k>j} P(j++|k+) P(k+)

P(j++|k+) is just P(j+) computed at time k. So you can solve for P(+) in terms of P(++), and P(++) in terms of P(+). And at every iteration you have a pretty good estimate.

Paging Dr. Jacobi!
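The circular dependence (P(+) written in terms of P(++), and P(++) in terms of P(+)) is exactly the setting for Jacobi-style fixed-point iteration. A toy illustration of the scheme; the linear update maps are stand-ins, not the talk’s actual conditional-probability equations:

```python
def jacobi_fixed_point(update_u, update_v, u, v, tol=1e-12):
    """Solve the coupled system u = update_u(v), v = update_v(u) by
    Jacobi-style iteration: recompute both vectors from the previous
    iterate until the values stop moving."""
    while True:
        u_new, v_new = update_u(v), update_v(u)
        if max(abs(a - b) for a, b in zip(u_new + v_new, u + v)) < tol:
            return u_new, v_new
        u, v = u_new, v_new

# Toy contractive updates standing in for the P(+) / P(++) equations.
u, v = jacobi_fixed_point(
    lambda v: tuple(0.5 * x + 0.25 for x in v),
    lambda u: tuple(0.5 * x + 0.25 for x in u),
    (0.0, 1.0), (1.0, 0.0))
# both converge to the fixed point (0.5, 0.5)
```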

SLIDE 22

Implementation:

Need to keep n²/2 previous values, and n²/2 summary statistics. It takes n² work to do an update. But you only have to go back two changepoints, so you can keep n small.

SLIDE 23
[Figure: synthetic series x[i] (top) and cumulative probability curves P(i+), P(i++) (bottom) vs. time.]

Synthetic series with two changepoints. µ = −0.5, 0.5, 0.0; σ = 1.0; P(⊘) = 0.04.

SLIDE 24

[Figure: annual flow (10^9 m^3) vs. year, 1880–1960 (top), and cumulative probability curves P33(i+), P66(i+), P99(i+) (bottom).]

The ubiquitous Nile dataset. Change in 1898. Estimated probabilities can be mercurial.

SLIDE 25
[Figure: data vs. index (top) and cumulative probability curves P(i+), P(i++) (bottom).]

Can also detect a change in variance. µ = 1, 0, 0; σ = 1, 1, 0.5. Estimated P(i+) is good; estimated P(i++) is less certain.

SLIDE 26

Qualitative behavior seems good. Quantitative tests:

  • Compare to GLR for online alarm problem.
  • Test predictive distribution with synthetic data.
  • Test predictive distribution with real data.

SLIDE 27

Changepoint problems:

Detection: online alarm problem. Location: offline partitioning. Tracking: online prediction.

Proposed method does all three. Starting simple...

SLIDE 28

Online alarm problem:

Observe process in real time. µ0 and σ known. τ and µ1 unknown. Raise alarm ASAP after changepoint. Minimize delay. Minimize false alarm rate.

SLIDE 29

GLR = generalized likelihood ratio.

Compute a decision function g_k. E[g_k] = 0 before the changepoint; it increases after. Alarm when g_k > h. GLR is optimal when µ1 is known.
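One standard form of the GLR statistic for a mean shift in Gaussian noise (known µ0 and σ, unknown µ1) is sketched below; this is the textbook construction, not necessarily the exact variant used in the talk’s experiments:

```python
def glr_stat(x, mu0, sigma):
    """GLR decision statistic g_k for a mean shift in Gaussian noise with
    known mu0, sigma and unknown post-change mean mu1:
    g_k = max over candidate change times j of
          (sum_{i=j..k} (x_i - mu0))^2 / (2 sigma^2 (k - j + 1))."""
    k = len(x)
    best = 0.0
    s = 0.0
    for j in range(k, 0, -1):   # grow the post-change suffix x[j-1:]
        s += x[j - 1] - mu0
        n = k - j + 1
        best = max(best, s * s / (2 * sigma ** 2 * n))
    return best

# Before a change the statistic stays small; after a jump it grows.
quiet = glr_stat([0.1, -0.2, 0.0, 0.1], mu0=0.0, sigma=1.0)
loud = glr_stat([0.1, -0.2, 3.0, 2.8], mu0=0.0, sigma=1.0)  # alarm if > h
```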

SLIDE 30

CPP = changepoint probability.

P(changepoint) = Σ_{i=0}^{n} P(i+)

Alarm when P(changepoint) > thresh.

SLIDE 31

[Figure: mean delay vs. false alarm probability for GLR and CPP.]

µ = 0, 1; σ = 1; τ ∼ Exp(0.01). Goodness = lower mean delay for the same false alarm rate.

SLIDE 32

[Figure: mean delay vs. σ for GLR and CPP, both at a 5% false alarm rate.]

Fix the false alarm rate at 5% and vary σ. CPP does well with small S/N.

SLIDE 33

So it works on a simple problem. Future work:

Other changepoint problems (location, tracking). Other data distributions (lognormal). Testing robustness (real data, trends).

SLIDE 34

Related problem:

How much categorical data to use? Example: predict queue time based on size, queue, etc.

Possible answer: the narrowest category that yields two changepoints.

SLIDE 35

Good news:

Very general framework. Seems to work. Many possible applications.

SLIDE 36

Bad news:

Need to apply and test in a real application. n² space and time may limit scope.

SLIDE 37

More at

allendowney.com/research/changepoint

Or email downey@allendowney.com
