

SLIDE 1

Modern MDL Meets Data Mining Insight, Theory, and Practice

- Part IV - Dynamic Setting

Kenji Yamanishi

Graduate School of Information Science and Technology, the University of Tokyo

August 4th 2019 KDD Tutorial

SLIDE 2

Part IV. Dynamic Setting

4.1. Change Detection with MDL Change Statistics
  4.1.1. Change Detection
  4.1.2. MDL Change Statistics
  4.1.3. Sequential Gradual Change Detection
  4.1.4. Adaptive Windowing
4.2. Model Change Detection with MDL Principle
  4.2.1. MDL Model Change Statistics
  4.2.2. Dynamic Model Selection
  4.2.3. Clustering Change Detection
  4.2.4. Model Change Sign Detection

SLIDE 3

4.1. Change Detection with MDL Change Statistics.

SLIDE 4

4.1.1 Change Detection

What's Change Detection? Detecting the emergence of bursts and anomalies in time series.

SLIDE 5

Definition of Change Point

t = a is a change point if the dissimilarity between the data distributions before and after t = a is large.
Dissimilarity measure = Kullback-Leibler divergence
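When both segments are modeled as Gaussians, the Kullback-Leibler divergence has a closed form, so the dissimilarity around a candidate change point t = a can be computed directly from the estimated pre- and post-change parameters. A minimal sketch (the function name and parameters are illustrative, not from the tutorial):

```python
import math

def kl_gaussian(mu0, var0, mu1, var1):
    """KL( N(mu0, var0) || N(mu1, var1) ) in nats, via the closed form."""
    return 0.5 * (math.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

# Identical pre/post-change distributions: zero dissimilarity.
print(kl_gaussian(0.0, 1.0, 0.0, 1.0))  # -> 0.0
# A mean shift of 3 standard deviations: large dissimilarity,
# so t = a would be declared a change point.
print(kl_gaussian(0.0, 1.0, 3.0, 1.0))  # -> 4.5
```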

SLIDE 6

Application to Malware Detection

Detecting SQL Injection via change point detection

[Figure: change score over time (22 hours); score spikes correspond to the malware attack and sign-scanning phases]

SLIDE 7

Why Change Detection?

Time Series → Event behind change:
  • Access log → Malware
  • Computer usage log → Fraud
  • Syslog → Failure
  • Sensor data → Accident
  • Tweet → Topic emergence
  • Real estate transaction → Economic crisis
  • Usage transaction → Market trend
  • Visual field loss → Glaucoma

SLIDE 8

Previous Work

■ Abrupt change detection:

[Hinkley 1970] [Hsu 1977] [Basseville, Nikiforov 1993] (CUSUM) [Guralnik, Srivastava 1998] [Fearnhead, Liu 2007]

■ On-line abrupt change detection:

[Yamanishi, Takeuchi 2002] [Kiefer et al. 2004] [Takeuchi, Yamanishi 2006] [Adams, MacKay 2007]

■ Incremental change detection (concept drift):

[Zliobaite 2009] [Gama et al. 2013]

■ Continuous change detection:

[Miyaguchi, Yamanishi 2015] [Yamanishi, Miyaguchi 2016]

There had been no unified approach to detecting gradual changes as well as abrupt ones.

SLIDE 9

New Directions of Change Detection with MDL

■ Unifying gradual and abrupt change detection
[Yamanishi, Miyaguchi BigData2016] [Miyaguchi, Yamanishi JDSA2018] [Kaneko, Miyaguchi, Yamanishi BigData2016]

■ Model change detection
[Yamanishi, Fukushima IEEE IT 2018] [Hirai, Yamanishi KDD2012] [Hayashi, Yamanishi DAMI 2014]

■ Model change sign detection
[Hirai, Yamanishi BigData 2018]

SLIDE 10

4.1.2 MDL Change Statistics

Hypothesis Testing Framework

Consider a parametric class of probability densities {p(x; θ)}.
Hypotheses: H0: t is not a change point vs. H1: t is a change point.
The likelihood ratio test cannot be applied directly, since the post-change parameters are unknown.

SLIDE 11

MDL Change Statistics

Basic Idea: if the data can be compressed significantly more by changing the distribution at time t, then t may be regarded as a change point.

C.f. [Yamanishi, Miyaguchi BigData2016] [Vreeken, van Leeuwen DAMI2014] [Hooi et al. CIKM2018] [Guralnik, Srivastava KDD1999]

SLIDE 12

NML Codelength

Parametric model: M = {p(x; θ)}.

NML (Normalized Maximum Likelihood) codelength:

L_NML(x^n) = -log p(x^n; θ̂(x^n)) + log C_n

where the parametric complexity is

C_n = Σ_{y^n} p(y^n; θ̂(y^n))

and asymptotically

log C_n = (k/2) log(n / 2π) + log ∫ √|I(θ)| dθ + o(1)

(k: # parameters, I(θ): Fisher information matrix)
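For small discrete models the parametric complexity C_n can be summed exactly and checked against the asymptotic expansion above. A sketch for the Bernoulli model (k = 1, and the integral of the square-root Fisher information over [0, 1] equals π); the function name is illustrative:

```python
import math

def bernoulli_complexity(n):
    """Exact parametric complexity C_n = sum over sequences y^n of
    p(y^n; theta_hat(y^n)), grouping sequences by their count k of ones."""
    total = 0.0
    for k in range(n + 1):
        p = k / n
        total += math.comb(n, k) * p ** k * (1 - p) ** (n - k)  # 0**0 == 1
    return total

n = 1000
exact = math.log(bernoulli_complexity(n))
# k = 1 parameter; the Fisher-information integral for Bernoulli equals pi
asymptotic = 0.5 * math.log(n / (2 * math.pi)) + math.log(math.pi)
print(exact, asymptotic)  # the gap vanishes as n grows
```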

SLIDE 13

MDL Change Statistics

Formal definition [Yamanishi, Miyaguchi BigData2016]:

Φ_t(x^n) = L_NML(x_1^n) - { L_NML(x_1^t) + L_NML(x_{t+1}^n) }

i.e., (NML codelength for no change) - (NML codelength for a change at time t). Time t is declared a change point when Φ_t(x^n) exceeds a threshold.

SLIDE 14

Performance Evaluation Metrics

Performance measures for hypothesis testing:
Type I error probability = the probability that H0 is true but H1 is accepted (false alarm rate).
Type II error probability = the probability that H1 is true but H0 is accepted (overlooking rate).

SLIDE 15

Theoretical Performance of MDL-Test

Theorem 4.1.1 (Error probabilities of the MDL test) [Yamanishi, Miyaguchi BigData2016]
Under the NML distribution, both the false alarm rate and the overlooking rate converge to zero exponentially in the sample size, with exponents determined by the parametric complexity.

SLIDE 16

4.1.3. Sequential Gradual Change Detection

Goal: detecting change symptoms from a data stream. Challenge: real-time detection of signs of changes.

Abrupt change ⇒ conventional target. Gradual change ⇒ our new target: detect the change symptom before the change point.

SLIDE 17

Sequential MDL Change Detection (S-MDL) [Yamanishi, Miyaguchi BigData2016]

Sequentially compute the MDL change statistics over a fixed window sliding along the stream; the resulting score curve peaks at change points.

SLIDE 18

Sequential MDL Change Detection

Sequential variant: with window size 2h, the statistic at each time step is computed in time linear in the window size.

SLIDE 19


Example 4.1.1. (Gaussian distributions)

MDL change statistics at time t:
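The transcript drops the explicit Gaussian formula, but the shape of the computation can still be illustrated. The sketch below scores every split point of a window using the maximum log-likelihood plus a (k/2) log n penalty as a stand-in for the exact parametric complexity log C_n (an approximation assumed here for brevity; the tutorial uses exact NML codelengths):

```python
import math

def gauss_nll(xs):
    """Negative maximized log-likelihood of a Gaussian (MLE plug-in)."""
    n = len(xs)
    mu = sum(xs) / n
    var = max(sum((x - mu) ** 2 for x in xs) / n, 1e-12)  # guard zero variance
    return 0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def mdl_change_score(window, t, k=2):
    """Two-part approximation of the MDL change statistic at split point t:
    codelength(no change) - codelength(change at t).
    (k/2) log n stands in for the parametric complexity log C_n."""
    n = len(window)
    pen = lambda m: 0.5 * k * math.log(m)
    no_change = gauss_nll(window) + pen(n)
    change = gauss_nll(window[:t]) + pen(t) + gauss_nll(window[t:]) + pen(n - t)
    return no_change - change

# A window whose mean jumps at its midpoint yields a large score there.
data = [0.0, 0.1, -0.1, 0.05, -0.05, 5.0, 5.1, 4.9, 5.05, 4.95]
scores = {t: mdl_change_score(data, t) for t in range(2, len(data) - 1)}
print(max(scores, key=scores.get))  # -> 5 (the true change point)
```

The score peaks at the true split because encoding the two regimes separately shrinks both estimated variances.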

SLIDE 20


Example 4.1.2. (Poisson distributions)

MDL change statistics at time t:

SLIDE 21


Example 4.1.3. (Linear Regression)

MDL change statistics at time t:

SLIDE 22

Experiments: Synthetic Data

Evaluation metrics:
■ Total benefit (how early): the benefit of an alarm decays from 1 at the true change point t* to 0 at t* + T
■ # false alarms (how reliably)
■ Performance measure: AUC (area under the benefit vs. false-alarm curve obtained by sweeping the threshold β)

SLIDE 23

Experiments: Synthetic Data - Jumping Means

Abrupt change: the mean jumps according to the Heaviside step function H(x), which takes 1 if x ≥ 0 and 0 otherwise.
Gradual change: generated by replacing the step function H(·) with a slope function S(·) that rises gradually from 0 to 1.
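A hypothetical data generator consistent with this description: width = 0 uses the Heaviside step H(·) (abrupt jump), while width > 0 routes the mean through a slope function S(·) that ramps linearly from 0 to 1 (gradual drift). The linear ramp and all names are assumptions for illustration:

```python
import random

def heaviside(x):
    """H(x) = 1 if x >= 0, else 0 (abrupt change)."""
    return 1.0 if x >= 0 else 0.0

def slope(x, width):
    """S(x): linear ramp from 0 to 1 over `width` steps (gradual change)."""
    return min(max(x / width, 0.0), 1.0)

def jumping_means(n, t_star, delta, width=0, seed=0):
    """Gaussian stream whose mean jumps (width=0) or drifts (width>0) at t_star."""
    rng = random.Random(seed)
    step = heaviside if width == 0 else (lambda x: slope(x, width))
    return [rng.gauss(delta * step(t - t_star), 1.0) for t in range(n)]

abrupt = jumping_means(200, t_star=100, delta=5.0)             # step change
gradual = jumping_means(200, t_star=100, delta=5.0, width=50)  # ramp over 50 steps
```

The jumping-variances setting is analogous, with the same H(·)/S(·) schedule applied to the variance instead of the mean.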

SLIDE 24

Experiments: Synthetic Data - Jumping Variances

Abrupt change: the variance jumps according to the Heaviside step function H(·).
Gradual change: generated by replacing H(·) with a slope function S(·).

SLIDE 25

Experiments: Synthetic Data - AUC Results [Yamanishi, Miyaguchi BigData2016]

Methods compared on jumping means and jumping variances (AUC):
IRL: Inverse Run Length [Adams and MacKay 2007]
CF: ChangeFinder [Takeuchi and Yamanishi 2006]
MDL1: proposed method with independent Gaussians
MDL2: proposed method with linear regression

SLIDE 26


Experiments: Real Data(Security)

SQL injection symptom detection

■ A time series of IP-URL counts, where each datum is the maximum total count of records sent from an identical IP address to an identical URL within 15 minutes. ■ Total records = 8632. ■ MDL1 and MDL2 employ Poisson distributions.

Data provided by LAC Corporation [Yamanishi, Miyaguchi BigData2016]

SLIDE 27

Experiments: Real Data - SQL Injection Symptom Detection

The method detected a symptom caused by a gradual increase of IP-URL counts preceding the SQL injection attack; security analysts confirmed the symptom as real.

SLIDE 28

How do you choose window size?

SLIDE 29

4.1.4. Adaptive Windowing

SCAW: Sequentially Compute MDL change statistics with Adaptive Windowing, built on ADWIN [Bifet & Gavaldà SDM07] [Kaneko, Miyaguchi, Yamanishi BigData2017]

  • Compute the statistics for all division points in the window; if a statistic exceeds the threshold, shrink the window
  • Cost-saving version (based on ADWIN2): narrows the number of division points from O(W) to O(log W)
  • → no need to choose the window size heuristically
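A rough sketch of the adaptive-windowing idea (not the actual SCAW/ADWIN2 bucket implementation): exponentially spaced cut points give O(log W) candidate divisions, and the window is shrunk past the best cut whenever the (here crudely approximated) MDL change statistic exceeds the threshold. All names and the Gaussian approximation are illustrative:

```python
import math
import random

def gauss_cl(xs):
    """Approximate NML codelength of xs under a Gaussian:
    maximized log-likelihood plus a (k/2) log n penalty standing in for log C_n."""
    n = len(xs)
    mu = sum(xs) / n
    var = max(sum((x - mu) ** 2 for x in xs) / n, 1e-12)
    return 0.5 * n * (math.log(2 * math.pi * var) + 1.0) + math.log(n)

def scaw_step(window, x, threshold):
    """Append x, test exponentially spaced cut points (O(log W) of them, in the
    spirit of ADWIN2), and shrink the window past the best cut when the MDL
    change statistic exceeds the threshold. Returns the cut position or None."""
    window.append(x)
    n = len(window)
    cuts, c = [], 4
    while c <= n - 4:          # keep at least 4 points on each side
        cuts.append(c)
        c *= 2
    best_t, best_s = None, threshold
    for t in cuts:
        s = gauss_cl(window) - (gauss_cl(window[:t]) + gauss_cl(window[t:]))
        if s > best_s:
            best_t, best_s = t, s
    if best_t is not None:     # change detected: keep only the newer regime
        del window[:best_t]
    return best_t

# Feed a stream whose mean shifts at t = 150; alarms appear shortly after.
rng = random.Random(1)
window, detections = [], []
for t in range(300):
    mean = 0.0 if t < 150 else 8.0
    if scaw_step(window, rng.gauss(mean, 1.0), threshold=10.0) is not None:
        detections.append(t)
print(detections)
```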

SLIDE 30

Asymptotic Reliability

[Kaneko, Miyaguchi, Yamanishi BigData2017]

  • Theorem 4.1.2: with a suitable choice of the threshold hyperparameter, SCAW is asymptotically reliable, i.e., the number of false alarms stays finite as the data size grows when the target process does not contain any changes.

SLIDE 31

Experimental Result: Synthetic Data [Kaneko, Miyaguchi, Yamanishi BigData2017]

・Precision-recall plots: SCAW achieves the highest performance among:
PHT: Page-Hinkley Test [Hinkley 70]
ADWIN [Bifet & Gavaldà 07]
CF: ChangeFinder [Takeuchi & Yamanishi 06]
BOCPD: Bayesian online changepoint detection [Adams & MacKay 07]

SLIDE 32
Experimental Results: Real Data - Failure Sign Detection [Kaneko, Miyaguchi, Yamanishi BigData2017]

  • Time series data: 217 × 325,440 (data provided and evaluated by Toray Corp.)
  • SCAW2 (adaptive window) vs. S-MDL (fixed window): both applied to an industrial boiler system
  • Detected signs of real failures: an increase in the amount of an ingredient from early April 2015, and a temporary stop of the boiler system on March 15th, 2015
  • SCAW is the better choice for stream change detection

[Figure: change score and adaptive vs. fixed window size over time, with real failures and their signs marked]

SLIDE 33

4.2. Model Change Detection with MDL Principle

SLIDE 34

Related Work

・Tracking Piecewise Stationary Sources

[Shamir Merhav IEEE IT1999] [Killick, Fearnhead, Eckley JASA2012] [Davis, Yau EJS2013]

・Switching Distribution

[van Erven, Grünwald, de Rooij J. Royal Stat. Soc. B 2012]

・Tracking Best Experts / Derandomization

[Herbster, Warmuth JML 1998] [Vovk ML99]

・Dynamic Model Selection

[Yamanishi, Maruyama KDD2005, IEEE IT2007] [Davis, Lee, Rodriguez JASA 2006] [Hirai, Yamanishi KDD2012] [Yamanishi, Fukushima IEEE IT2018]

・Concept Drift

[Gama, Zliobaite, Bifet, Pechenizkiy, Bouchachia ACM Computing Surveys 2013]

SLIDE 35

4.2.1. MDL Model Change Statistics

MDL Model Change Statistics [Yamanishi, Fukushima IEEE Inform Theory 2018]

Φ_t = (NML codelength for no model change) - (NML codelength for a model change at time t)

where the candidate models (M0*, M1*, M2*, ...) are encoded together with their model parameters, and each NML codelength includes the corresponding parametric complexity.

SLIDE 36

Theoretical Result on the MDL Test

Theorem 4.1.3 [Yamanishi, Fukushima IEEE Inform Theory 2018]
For the MDL test based on the MDL model change statistics, the Type I (false alarm) and Type II (overlooking) error probabilities converge exponentially to zero, with exponents depending on the parametric complexities.

SLIDE 37

4.2.2. Dynamic Model Selection (DMS)
- Multiple model change detection -

Given a model class, find a model sequence that minimizes the total description length:

DMS (Dynamic Model Selection) criterion = (predictive codelength for the data sequence) + (predictive codelength for the model sequence)
[Yamanishi and Maruyama KDD2005, IEEE IT 2007]

Computable via dynamic programming.

SLIDE 38

Probabilistic Setting of DMS

■ Predictive distribution for the data sequence: maximum likelihood prediction, Bayes prediction, or SNML (sequentially normalized maximum likelihood) prediction
■ Model transition probability for the model sequence

SLIDE 39

DMS Algorithm

1) Model sequence selection using dynamic programming
2) Estimating the model transition probability via the Krichevsky-Trofimov (KT) estimator, based on the number of change points needed for the model to be M at time t
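Step 1 can be sketched as a Viterbi-style dynamic program over the model class. For simplicity this sketch fixes the switching probability instead of estimating it sequentially with the KT estimator as in step 2, and the two-model Gaussian class is a toy illustration; all names are assumptions:

```python
import math

def dms(xs, models, switch_prob=0.05):
    """Dynamic Model Selection via dynamic programming: pick the model
    sequence minimizing total codelength = data codelength + model-sequence
    codelength. `models` maps a model name to a per-point codelength -log p(x; M)."""
    names = list(models)
    stay = -math.log(1.0 - switch_prob)   # codelength of 'no switch'
    switch = -math.log(switch_prob)       # codelength of 'switch'
    cost = {m: models[m](xs[0]) for m in names}
    back = []
    for x in xs[1:]:
        new_cost, ptr = {}, {}
        for m in names:
            # cheapest way to be in model m at this step
            prev = min(names, key=lambda p: cost[p] + (stay if p == m else switch))
            new_cost[m] = cost[prev] + (stay if prev == m else switch) + models[m](x)
            ptr[m] = prev
        back.append(ptr)
        cost = new_cost
    m = min(cost, key=cost.get)           # trace back the optimal sequence
    seq = [m]
    for ptr in reversed(back):
        m = ptr[m]
        seq.append(m)
    return seq[::-1]

# Two candidate unit-variance Gaussian models (illustrative model class).
def gauss_cl(mu):
    return lambda x: 0.5 * (math.log(2 * math.pi) + (x - mu) ** 2)

xs = [0.1, -0.2, 0.0, 0.15, 5.1, 4.8, 5.0, 5.2]
seq = dms(xs, {"M0": gauss_cl(0.0), "M1": gauss_cl(5.0)})
print(seq)  # -> ['M0', 'M0', 'M0', 'M0', 'M1', 'M1', 'M1', 'M1']
```

The switch codelength -log(switch_prob) is what keeps the selected sequence from flipping models on every noisy point.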

SLIDE 40

Application to Failure Detection from Syslog [Yamanishi, Maruyama KDD2005]

■ What's Syslog?
  • Event sequences collected with the BSD syslog protocol
  • Warning messages about devices

Goal: detect failures early and identify their patterns.

[Figure: anomaly score over time]

SLIDE 41

Syslog Modeling with HMM Mixtures

The j-th session of syslog, y_j = (y_1, ..., y_{T_j}), is modeled with a mixture of hidden Markov models:

P(y_j) = Σ_{k=1}^{K} π_k P(y_j | θ_k)

where, with latent state variables x_1, ..., x_{T_j},

P(y_j | θ_k) = Σ_{x_1,...,x_{T_j}} γ_k(x_1) Π_{t=1}^{T_j - 1} a_k(x_t, x_{t+1}) Π_{t=1}^{T_j} b_k(y_t | x_t)

(K: # syslog behavior patterns; T_j: session length; γ_k: initial state distribution; a_k: state transition probabilities; b_k: emission probabilities; x_1, ..., x_{T_j}: latent variables)
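The inner sum over latent state sequences x_1, ..., x_T need not be enumerated: the forward algorithm evaluates P(y_j | θ_k) in O(T · S²) time for S states. A self-contained sketch with a toy two-state HMM (all probabilities are made-up illustrations):

```python
def hmm_likelihood(obs, gamma, a, b):
    """P(y | theta) for one HMM via the forward algorithm, replacing the
    exponential sum over state sequences x_1..x_T with an O(T * S^2) recursion."""
    S = len(gamma)
    alpha = [gamma[s] * b[s][obs[0]] for s in range(S)]
    for y in obs[1:]:
        alpha = [sum(alpha[r] * a[r][s] for r in range(S)) * b[s][y]
                 for s in range(S)]
    return sum(alpha)

def mixture_likelihood(obs, pis, thetas):
    """P(y_j) = sum_k pi_k * P(y_j | theta_k) for the HMM mixture."""
    return sum(pi * hmm_likelihood(obs, *th) for pi, th in zip(pis, thetas))

# Toy 2-state HMM over a binary alphabet (all numbers illustrative).
gamma = [0.5, 0.5]
a = [[0.9, 0.1], [0.1, 0.9]]   # a[r][s] = P(x_{t+1} = s | x_t = r)
b = [[0.8, 0.2], [0.2, 0.8]]   # b[s][y] = P(y_t = y | x_t = s)
p = mixture_likelihood([0, 0, 1], [1.0], [(gamma, a, b)])
print(p)  # -> 0.0962
```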

SLIDE 42

Experiments: Failure Detection

Detected events: system lock-up (2001/11/13), memory exhaustion (2001/11/11, 2001/11/20), bridge error (2002/1/10), system down / system lock-up (2002/1/15).

Example warning messages:
33025: Jan 15 15:03:59 WARN:swsig:sw_SigGetMem: alloc failed(256)
33026: Jan 15 15:03:59 WARN:swsig:sw_SigGetMem: alloc failed(256)
42253: Jan 19 22:26:33 ERR :bridge:!brdgursrv: queue is full. discarding a message. …

The number of syslog patterns changed two days before the system went down. (See http://fbi-award.jp/sentan/jusyou/2005/nec.pdf)

SLIDE 43

4.2.3. Clustering Change Detection

Detecting changes in the number of clusters and the cluster assignments over time.

SLIDE 44

DMS for the Complete Variable Model [Hirai, Yamanishi KDD2012]

Incremental application of DMS to the complete variable model P(X, Z), where Z is the latent variable giving the cluster index of X.

SLIDE 45

Incremental DMS Criterion [Hirai, Yamanishi KDD2012]

Total codelength = (NML codelength for the clustered data sequence) + (codelength for cluster changes).
Slice the total codelength time-wise, then select the number of clusters and the cluster assignment at each time.

See also [Sun et al. KDD2007] [Sato, Yamanishi ICDM2013]

SLIDE 46

Application to Gaussian Mixture Model

Use the complete variable model of the Gaussian mixture model, together with an upper bound on the NML codelength for the GMM. [Hirai and Yamanishi IEEE IT 2019]

SLIDE 47

Experimental Results: Real Data - Market Structure Change Detection [Hirai, Yamanishi KDD2012]

Tracking changes of customer structures from beer purchase transaction data (QPR), provided by M-Cube.
Period: Nov. 2011 - Jan. 2012. # customers: 3,185. Datum for each customer at time t = consumption volume of 14 beer brands during the 14 days up to t (a 3185 × 14-dimensional snapshot per time step).

SLIDE 48

A change in the number of clusters was detected at the time when year-end demand increased sharply (consumption from Dec. 19th to Jan. 1st vs. from Jan. 9th to Jan. 22nd).

SLIDE 49

Clustering Structure Change

[Table: average consumption (ml) per beer brand for each cluster, before (3 clusters: 598, 376, and 311 customers) and after (5 clusters) the change; rows cover standard beers, premium beers, third-category beers, happoshu, and "off" brands, with total purchase volume and cluster sizes in the bottom rows]

  • Year-end demand for Beer A and third-category Beer C rapidly increased, leading to the formation of new additional clusters.

SLIDE 50

4.2.4. Model Change Sign Detection

[Figure: the number of clusters evolves k=3 → k=4 → k=?; model uncertainty increases ahead of the change]

SLIDE 51

Problem Setting


SLIDE 52

Structural Entropy

Structural entropy measures the uncertainty of model selection [Hirai, Yamanishi BigData 2018]:

H_t = - Σ_M p(M | x^t) log p(M | x^t)

where p(M | x^t) is the posterior over candidate models induced by their codelengths (or, for a complete variable model, by the NML codelength of the clustered data).
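A minimal sketch of the computation, assuming the model posterior is induced from codelengths via p(M | x) ∝ 2^(-L(x; M)), the usual MDL correspondence: entropy near zero means model selection is certain, while entropy approaching log2(# models) signals the uncertainty used as a change sign. The function name is illustrative:

```python
import math

def structural_entropy(codelengths):
    """Shannon entropy (bits) of the posterior p(M | x) proportional to
    2^{-L(x; M)}, given each candidate model's codelength in bits."""
    m = min(codelengths.values())
    w = {k: 2.0 ** (-(v - m)) for k, v in codelengths.items()}  # shift for stability
    z = sum(w.values())
    return -sum((wi / z) * math.log2(wi / z) for wi in w.values())

# One clearly best model -> near-zero entropy (model selection is certain).
print(structural_entropy({"k=3": 100.0, "k=4": 120.0}))  # ≈ 0.0
# Competing models -> entropy rises toward log2(#models): a change sign.
print(structural_entropy({"k=3": 100.0, "k=4": 100.0}))  # -> 1.0
```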

SLIDE 53

Model Change Sign Detection via Structural Entropy

[Figure: model dimension and structural uncertainty over time; the rise in structural uncertainty appears before the model change as a change sign]

[Hirai Yamanishi BigData 2018] See also [Ohsawa RevSNS 2018]

SLIDE 54

Experimental Results: Synthetic Data

A change sign can be detected from the rise of the structural entropy. [Hirai, Yamanishi BigData2018]

SLIDE 55

Experimental Results: Real Data

Signs of changes in the customer clustering structure can be detected from the rise of the structural entropy.

SLIDE 56

Summary

  • The MDL change statistics are a theoretically justified methodology for measuring change scores, for parameter changes as well as model changes.
  • For gradual change detection, apply the sequential MDL statistics with adaptive or non-adaptive windowing to conduct real-time event detection.
  • For multiple model change detection, conduct Dynamic Model Selection (DMS) to obtain optimal model sequences.
  • For clustering structure change detection, apply DMS to latent variable models sequentially to track latent structure changes.
  • Signs of model changes may be detected via the structural entropy, which measures model uncertainty.

SLIDE 57

References

■ 4.1. MDL change statistics

・J. Vreeken, M. van Leeuwen, and A. Siebes: "Krimp: mining itemsets that compress," Data Mining and Knowledge Discovery, Vol. 23, 1, pp. 169-214, 2011.
・K. Yamanishi and K. Miyaguchi: "Detecting gradual changes from data stream using MDL-change statistics," Proceedings of the 2016 IEEE International Conference on Big Data (BigData2016), pp. 156-163, 2016.
・R. Kaneko, K. Miyaguchi, and K. Yamanishi: "Detecting changes in streaming data with information-theoretic windowing," Proceedings of the 2017 IEEE International Conference on Big Data (BigData2017), pp. 646-655, 2017.
・B. Hooi, L. Akoglu, D. Eswaran, A. Pandey, A. Jereminov, L. Pileggi, and C. Faloutsos: "ChangeDAR: Online localized change detection for sensor data on a graph," Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM2018), pp. 507-516, 2018.
C.f. adaptive window algorithm:
・A. Bifet and R. Gavaldà: "Learning from time-changing data with adaptive windowing," Proceedings of the 2007 SIAM International Conference on Data Mining (SDM2007), 2007.
C.f. predictive change statistics:
・V. Guralnik and J. Srivastava: "Event detection from time series data," Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD1999), pp. 33-42, 1999.

SLIDE 58

References

■ 4.2. Dynamic Model Selection

・K. Yamanishi and Y. Maruyama: "Dynamic syslog mining for network failure monitoring," Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD2005), pp. 499-508, 2005.
・K. Yamanishi and Y. Maruyama: "Dynamic model selection with its applications to novelty detection," IEEE Transactions on Information Theory, Vol. 53, No. 6, pp. 2180-2189, 2007.
・K. Yamanishi and S. Fukushima: "Model change detection with the MDL principle," IEEE Transactions on Information Theory, 64(9), pp. 6115-6126, 2018.

■ 4.2. Topics Related to Dynamic Model Selection
・M. Herbster and M. Warmuth: "Tracking the best expert," Machine Learning, 32, pp. 151-178, 1998.
・V. Vovk: "Derandomizing stochastic prediction strategies," Machine Learning, Vol. 35, No. 3, pp. 247-282, 1999.
・J. Kleinberg: "Bursty and hierarchical structure in streams," Data Mining and Knowledge Discovery, 7, pp. 373-397, 2003.

SLIDE 59

References

■ 4.2. Topics Related to Dynamic Model Selection (cont.)
・R.A. Davis, T.C.M. Lee, and G.A. Rodriguez-Yam: "Structural break estimation for nonstationary time series models," Journal of the American Statistical Association, 101, pp. 223-239, 2006.
・X. Xuan and K. Murphy: "Modeling changing dependency structure in multivariate time series," Proceedings of the 24th International Conference on Machine Learning (ICML2007), pp. 1055-1062, 2007.
・T. van Erven, P. Grünwald, and S. de Rooij: "Catching up faster by switching sooner: a predictive approach to adaptive estimation with an application to the AIC-BIC dilemma," Journal of the Royal Statistical Society, Series B, Vol. 74, Issue 3, pp. 361-417, 2012.
・R. Killick, P. Fearnhead, and I.A. Eckley: "Optimal detection of changepoints with a linear computational cost," Journal of the American Statistical Association, 107:500, pp. 1590-1598, 2012.
・J. Gama, I. Zliobaite, A. Bifet, M. Pechenizkiy, and A. Bouchachia: "A survey on concept drift adaptation," ACM Computing Surveys, 2013.
・Y. Hayashi and K. Yamanishi: "Sequential network change detection with its applications to ad impact relation analysis," Data Mining and Knowledge Discovery, Vol. 29, Issue 1, pp. 137-167, 2015.
SLIDE 60

References

■ 4.2.3. Clustering Change Detection

・M. Song and H. Wang: "Highly efficient incremental estimation of Gaussian mixture models for online data stream clustering," Intelligent Computing, 2005.
・J. Sun, C. Faloutsos, S. Papadimitriou, and P.S. Yu: "GraphScope: parameter-free mining of large time-evolving graphs," Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD2007), pp. 687-696, 2007.
・S. Hirai and K. Yamanishi: "Detecting changes of clustering structures using normalized maximum likelihood coding," Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD2012), pp. 343-351, 2012.
・S. Sato and K. Yamanishi: "Graph partitioning change detection using tree-based clustering," Proceedings of the IEEE International Conference on Data Mining (ICDM2013), pp. 1169-1174, 2013.

■ 4.2.4. Model Change Sign Detection
・S. Hirai and K. Yamanishi: "Detecting latent structure uncertainty with structural entropy," Proceedings of the IEEE International Conference on Big Data (BigData2018), Dec. 2018.
・Y. Ohsawa: "Graph-based entropy for detecting explanatory signs of changes in market," The Review of Socionetwork Strategies, Vol. 12, 2, pp. 183-203, 2018.