

  1. Changepoint detection for time series prediction
     Allen B. Downey, Olin College of Engineering

  2. My background:
     • Predoc at San Diego Supercomputer Center.
     • Dissertation on workload modeling, queue time prediction and malleable job allocation for parallel machines.
     • Recent: Network measurement and modeling.
     • Current: History-based prediction.

  3. Connection?
     • Resource allocation based on prediction.
     • Prediction based on history.
     • Historical data characterized by changepoints (nonstationarity).

  4. Three ways to characterize variability:
     • Noise around a stationary level.
     • Noise around an underlying trend.
     • Abrupt changes in level: changepoints.
     Important difference:
     • Data prior to a changepoint is irrelevant to performance after.

  5. Example: wide area networks
     • Some trends (accumulating queue).
     • Many abrupt changepoints:
        • Beginning and end of transfers.
        • Routing changes.
        • Hardware failure, replacement.

  6. Example: parallel batch queues
     • Some trends (daily cycles).
     • Some abrupt changepoints:
        • Start/completion of wide jobs.
        • Queue policy changes.
        • Hardware failure, replacement.

  7. My claim:
     • Many systems are characterized by changepoints where data before a changepoint is irrelevant to performance after.
     • In these systems, good predictions depend on changepoint detection, because old data is wrong.
     Discussion?

  8. Two kinds of prediction:
     • Single value prediction.
     • Predictive distribution:
        • Summary stats.
        • Intervals.
        • P(error > thresh).
        • E[cost(error)].
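To make the predictive-distribution outputs concrete, here is a minimal Python sketch that computes each of them from a predictive distribution represented as samples. The helper name, the sample-based representation, and the choice of a 90% interval are illustrative assumptions, not anything specified in the talk.

```python
import numpy as np

def summarize_predictive(samples, point_prediction, thresh, cost):
    """Summaries of a predictive distribution given as samples.
    (Hypothetical helper; the talk does not specify a representation.)"""
    samples = np.asarray(samples)
    errors = samples - point_prediction   # error distribution for a point prediction
    return {
        "mean": np.mean(samples),
        "90% interval": (np.percentile(samples, 5), np.percentile(samples, 95)),
        "P(error > thresh)": np.mean(np.abs(errors) > thresh),
        "E[cost(error)]": np.mean(cost(errors)),
    }

# e.g. summarize_predictive(samples, samples.mean(), thresh=2.0, cost=np.square)
```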

  9. If you assume stationarity, life is good:
     • Accumulate data indefinitely.
     • Predictive distribution = observed distribution.
     But this is often not a good assumption.

  10. If the system is nonstationary:
      • Fixed window? Exponential decay?
      • Too far: obsolete data.
      • Not far enough: loss of useful info.
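The two options named in the first bullet might look like the sketch below; the window size and decay rate are illustrative defaults, and both knobs embody the tradeoff in the last two bullets.

```python
import numpy as np

def windowed_mean(data, window=50):
    """Fixed window: use only the most recent `window` points."""
    return np.mean(data[-window:])

def ewma(data, alpha=0.1):
    """Exponential decay: the weight of a point falls off as (1 - alpha)^age.
    A small alpha reaches far back (risking obsolete data); a large alpha
    forgets quickly (losing useful info)."""
    est = data[0]
    for x in data[1:]:
        est = alpha * x + (1 - alpha) * est
    return est
```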

  11. If you know where the changepoints are:
      • Use data back to the latest changepoint.
      • Less information immediately after.

  12. If you don’t know, you have to guess.
      P(i) = prob of a changepoint at time i.
      Example:
      • 150 data points.
      • P(50) = 0.7
      • P(100) = 0.5
      How do you generate a predictive distribution?

  13. Two steps:
      • Derive P(i+): prob that i is the latest changepoint.
      • Compute a weighted mix going back to each i.
      Example: P(50) = 0.7, P(100) = 0.5
      • P(⊘) = (1 − 0.7)(1 − 0.5) = 0.15 (no changepoint)
      • P(50+) = 0.7 · (1 − 0.5) = 0.35
      • P(100+) = 0.5

  14. Predictive distribution = 0.50 · edf(100, 150) ⊕ 0.35 · edf(50, 150) ⊕ 0.15 · edf(0, 150)
      where edf(i, n) is the empirical distribution function of the data from i to n.
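A sketch of this mixture in Python, evaluating the combined CDF pointwise; the weights are the P(i+) values from slide 13. The function and argument names are mine, not the talk's.

```python
import numpy as np

def predictive_cdf(data, weights, x):
    """Evaluate the mixture of empirical distribution functions at x.
    `weights` maps a start index i to P(i+); the edf term for i is the
    empirical CDF of data[i:], i.e. edf(i, n) with n = len(data)."""
    data = np.asarray(data)
    return sum(w * np.mean(data[start:] <= x)
               for start, w in weights.items())

# With the numbers from slides 13-14 (150 data points):
# predictive_cdf(data, {100: 0.50, 50: 0.35, 0: 0.15}, x)
```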

  15. So how do you generate the probabilities P(i+)? Three steps:
      • Bayes’ theorem.
      • Simple case: you know there is 1 changepoint.
      • General case: unknown # of changepoints.

  16. Bayes’ theorem (diachronic interpretation)
      P(H|E) = P(E|H) P(H) / P(E)
      • H is a hypothesis, E is a body of evidence.
      • P(H|E): posterior.
      • P(H): prior.
      • P(E|H) is usually easy to compute.
      • P(E) is often not.

  17. Unless you have a suite of exclusive hypotheses:
      P(H_i|E) = P(E|H_i) P(H_i) / P(E)
      P(E) = Σ_{H_j ∈ S} P(E|H_j) P(H_j)
      In that case life is good.
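In code, the exclusive-hypotheses case is just a normalization; this small sketch (names assumed) mirrors the two formulas above.

```python
def posterior(hypotheses, likelihood, prior):
    """P(H_i|E) for a suite S of exclusive hypotheses.
    likelihood(h) = P(E|h); prior(h) = P(h)."""
    unnorm = {h: likelihood(h) * prior(h) for h in hypotheses}
    p_e = sum(unnorm.values())      # P(E) = sum over S of P(E|H_j) P(H_j)
    return {h: u / p_e for h, u in unnorm.items()}
```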

  18. • If you know there is exactly one changepoint in an interval...
      • ...then the P(i) are exclusive hypotheses,
      • and all you need is P(E|i).
      Which is pretty much a solved problem.
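As an illustration of the "solved problem", here is one common way to get P(E|i) for a single shift in the mean of Gaussian data, plugging in the segment means; the talk does not spell out its likelihood model, so treat the model choice as an assumption.

```python
import numpy as np
from scipy.stats import norm

def single_changepoint_posterior(data, sigma=1.0):
    """P(i|E) for exactly one changepoint in the mean: Gaussian noise with
    known sigma, segment means as plug-in estimates, uniform prior over i.
    (An illustrative model choice, not necessarily the talk's.)"""
    data = np.asarray(data)
    n = len(data)
    log_like = np.full(n, -np.inf)
    for i in range(1, n):
        left, right = data[:i], data[i:]
        # P(E|i): each segment modeled around its own mean
        log_like[i] = (norm.logpdf(left, left.mean(), sigma).sum() +
                       norm.logpdf(right, right.mean(), sigma).sum())
    post = np.exp(log_like - log_like.max())   # normalize as on slide 17
    return post / post.sum()
```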

  19. What if the # of changepoints is unknown?
      • P(i) are no longer exclusive.
      • But the P(i+) are.
      • And you can write a system of equations for P(i+).

  20. P(i+) = P(i+|⊘) P(⊘) + Σ_{j<i} P(i+|j++) P(j++)
      • P(j++) is the prob that the second-to-last changepoint is at j.
      • P(i+|j++) reduces to the simple problem.
      • P(⊘) is the prob that we have not seen two changepoints.
      • P(i+|⊘) reduces to the simple problem (plus).
      Great, so what’s P(j++)?

  21. P(j++) = Σ_{k>j} P(j++|k+) P(k+)
      • P(j++|k+) is just P(j+) computed at time k.
      • So you can solve for P(+) in terms of P(++),
      • and P(++) in terms of P(+).
      • And at every iteration you have a pretty good estimate.
      Paging Dr. Jacobi!
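The "Dr. Jacobi" joke points at a fixed-point iteration between the two coupled systems. The sketch below shows only that structure: the conditional terms, which the slides say reduce to the single-changepoint problem, are passed in as callbacks, and everything here (names, initialization, iteration count) is an assumption about shape, not the talk's implementation.

```python
def solve_cpp(n, cond_plus, cond_plus_none, cond_pplus, p_none, iters=10):
    """Jacobi-style iteration for slides 20-21.
    cond_plus(i, j)   ~ P(i+ | j++)
    cond_plus_none(i) ~ P(i+ | ⊘)
    cond_pplus(j, k)  ~ P(j++ | k+), i.e. P(j+) computed at time k
    p_none            ~ P(⊘)
    """
    p_plus = [1.0 / n] * n      # initial guess for P(i+)
    p_pplus = [1.0 / n] * n     # initial guess for P(j++)
    for _ in range(iters):
        # P(i+) = P(i+|⊘) P(⊘) + Σ_{j<i} P(i+|j++) P(j++)
        p_plus = [cond_plus_none(i) * p_none +
                  sum(cond_plus(i, j) * p_pplus[j] for j in range(i))
                  for i in range(n)]
        # P(j++) = Σ_{k>j} P(j++|k+) P(k+)
        p_pplus = [sum(cond_pplus(j, k) * p_plus[k] for k in range(j + 1, n))
                   for j in range(n)]
    return p_plus, p_pplus
```

Each sweep is O(n²), which matches the update cost claimed on slide 22.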

  22. Implementation:
      • Need to keep n²/2 previous values,
      • and n²/2 summary statistics.
      • And it takes n² work to do an update.
      • But you only have to go back two changepoints,
      • ...so you can keep n small.

  23. Synthetic series data with two changepoints.
      • µ = −0.5, 0.5, 0.0
      • σ = 1.0
      • P(⊘) = 0.04
      [Figure: x[i] vs. time (0 to 150), with cumulative probability curves for P(i+) and P(i++).]

  24. The ubiquitous Nile dataset.
      • Change in 1898.
      • Estimated probs can be mercurial.
      [Figure: annual flow (10⁹ m³), 1880 to 1960, with cumulative probability curves P33(i+), P66(i+), and P99(i+).]

  25. Can also detect change in variance.
      • µ = 1, 0, 0
      • σ = 1, 1, 0.5
      • Estimated P(i+) is good.
      • Estimated P(i++) less certain.
      [Figure: data vs. index (0 to 100), with cumulative probability curves for P(i+) and P(i++).]

  26. • Qualitative behavior seems good.
      • Quantitative tests:
         • Compare to GLR for online alarm problem.
         • Test predictive distribution with synthetic data.
         • Test predictive distribution with real data.

  27. Changepoint problems:
      • Detection: online alarm problem.
      • Location: offline partitioning.
      • Tracking: online prediction.
      Proposed method does all three. Starting simple...

  28. Online alarm problem:
      • Observe process in real time.
      • µ0 and σ known.
      • τ and µ1 unknown.
      • Raise alarm ASAP after changepoint.
      • Minimize delay.
      • Minimize false alarm rate.

  29. GLR = generalized likelihood ratio.
      • Compute decision function g_k.
      • E[g_k] = 0 before the changepoint,
      • ...increases after.
      • Alarm when g_k > h.
      • GLR is optimal when µ1 is known.
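For reference, the standard GLR statistic for a mean shift in Gaussian data with known σ and unknown µ1 looks like this; a sketch of the textbook form, not the implementation used in the talk's experiments.

```python
import numpy as np

def glr_alarm(data, mu0, sigma, h):
    """Raise an alarm at the first k where g_k > h.
    g_k maximizes the log-likelihood ratio over the candidate changepoint j,
    with mu1 replaced by its maximum-likelihood estimate."""
    for k in range(1, len(data) + 1):
        x = np.asarray(data[:k]) - mu0
        g_k = max(np.sum(x[j:]) ** 2 / (2 * sigma ** 2 * (k - j))
                  for j in range(k))
        if g_k > h:
            return k        # alarm time (1-based)
    return None             # no alarm
```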

  30. CPP = change point probability.
      P(changepoint) = Σ_{i=0}^{n} P(i+)
      Alarm when P(changepoint) > thresh.
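The CPP alarm rule itself is one line once the P(i+) values are in hand (for example, from a solver like the sketch after slide 21):

```python
def cpp_alarm(p_plus, thresh):
    """Slide 30: P(changepoint) = Σ_{i=0}^{n} P(i+); alarm when it exceeds
    thresh. p_plus holds the current P(i+) estimates, excluding P(⊘)."""
    return sum(p_plus) > thresh
```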

  31. • µ = 0, 1
      • σ = 1
      • τ ~ Exp(0.01)
      • Goodness = lower mean delay for the same false alarm rate.
      [Figure: mean delay (0 to 15) vs. false alarm probability (0.0 to 0.2) for GLR and CPP.]

  32. • Fix false alarm rate = 5%.
      • Vary σ.
      • CPP does well with small S/N.
      [Figure: mean delay (0 to 25) vs. sigma (0.0 to 1.5) for GLR and CPP, each at a 5% false alarm rate.]

  33. So it works on a simple problem. Future work:
      • Other changepoint problems (location, tracking).
      • Other data distributions (lognormal).
      • Testing robustness (real data, trends).

  34. Related problem:
      • How much categorical data to use?
      • Example: predict queue time based on size, queue, etc.
      • Possible answer: narrowest category that yields two changepoints.

  35. Good news:
      • Very general framework.
      • Seems to work.
      • Many possible applications.

  36. Bad news:
      • Need to apply and test in a real application.
      • n² space and time may limit scope.

  37. • More at allendowney.com/research/changepoint
      • Or email downey@allendowney.com
