2019 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Modern MDL meets Data Mining: Insight, Theory, and Practice
Kenji Yamanishi
The University of Tokyo
Jilles Vreeken
CISPA Helmholtz Center for Information Security
Approximately 3.5 hours long. An extensive, but incomplete, introduction to MDL in data mining.
“The simplest description …”
L_V(y) = min_z { m(z) : V(z) halts and V(z) = y }
The Kolmogorov complexity of a binary string y is the length of the shortest program z* for a universal Turing machine V that generates y and halts, where m(z) denotes the length, in bits, of program z.
(Solomonoff 1960, Kolmogorov 1965, Chaitin 1969)
Kolmogorov complexity L(y), or rather, the Kolmogorov-optimal program y*, is not computable. We can approximate it from above, but this is not very practical (simply not enough students to enumerate all Turing machines).
We can approximate it through off-the-shelf lossless compressors, yet this has serious drawbacks (big-O constants, what structure does a compressor reward, etc).
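To make the compressor route concrete, here is a minimal sketch in Python. It upper-bounds L(y) by the compressed size under zlib; the helper name K_upper_bound and the test strings are our own choices, and zlib merely stands in for any lossless compressor.

```python
import os
import zlib

def K_upper_bound(y: bytes) -> int:
    """Upper-bound the complexity of y, in bits, by the length
    of its zlib-compressed form (a crude stand-in for L(y))."""
    return 8 * len(zlib.compress(y, 9))

structured = b"01" * 500       # highly regular: compresses very well
random_ish = os.urandom(1000)  # incompressible with high probability

print(K_upper_bound(structured))  # a few hundred bits, far below 8000
print(K_upper_bound(random_ish))  # close to (or slightly above) 8000
```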
A more viable alternative is the Minimum Description Length principle:
“the best model is the model that gives the best lossless compression”
There are two ways to motivate MDL.
The Minimum Description Length (MDL) principle: given a set of hypotheses ℋ, the best hypothesis I ∈ ℋ for given data E is the I that minimises
M(I) + M(E | I)
in which M(I) is the length, in bits, of the description of I, and M(E | I) is the length, in bits, of the description of the data when encoded using I.
(see, e.g., Rissanen 1978, 1983; Grünwald 2007)
Bayes tells us that
Pr(I | E) = Pr(E | I) × Pr(I) / Pr(E)
This means we want the I that maximises Pr(I | E). Since Pr(E) is the same for all models, we have to maximise Pr(E | I) × Pr(I), or, equivalently, minimise
−log(Pr(I)) − log(Pr(E | I))
So, Bayesian learning means minimising −log(Pr(I)) − log(Pr(E | I)). Shannon tells us that the −log transform takes us from probabilities to optimal prefix-code lengths. This means we are actually minimizing M(I) + M(E | I) for some encoding M for I resp. E | I corresponding to distribution Pr.
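A tiny numeric sketch of this equivalence; the priors and likelihoods below are made-up numbers, purely for illustration.

```python
import math

# Hypothetical models with a prior Pr(I) and a likelihood Pr(E|I)
prior      = {"I1": 0.5, "I2": 0.3, "I3": 0.2}
likelihood = {"I1": 0.001, "I2": 0.010, "I3": 0.004}

def codelength(I):
    # M(I) + M(E|I): Shannon code lengths -log2 Pr(I) and -log2 Pr(E|I)
    return -math.log2(prior[I]) - math.log2(likelihood[I])

map_model = max(prior, key=lambda I: prior[I] * likelihood[I])
mdl_model = min(prior, key=codelength)
assert map_model == mdl_model  # maximising Pr = minimising bits
print(mdl_model, f"{codelength(mdl_model):.2f} bits")
```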
If we want to do MDL this way – i.e., being a Bayesian – we need to specify
 a prior Pr(I) on the models, and
 a distribution Pr(E | I) on data given a model.
What are reasonable choices?
For the data, this is ‘easy’: a maximum likelihood model.
For the models, this is ‘harder’; we could, e.g., use … These are not easy to compute or query, and are ad hoc.
In MDL we say: if we are going to be ad hoc, let us do so openly, and use explicit universal encodings.
MDL might make you think of either Akaike’s Information Criterion (AIC)
AIC(I) = k − ln(Pr(E | I))
or the Bayesian Information Criterion (BIC)
BIC(I) = (k/2) ln n − ln(Pr(E | I))
where k is the number of parameters of I, and n the number of samples. As −ln(Pr(E | I)) is just a code length, both have the form penalty(I) + M(E | I): AIC penalises a model by k, BIC by (k/2) ln n. We, however, do not consider all models with the same number of parameters created equal; we take their complexity into account.
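For reference, a short sketch of these two scores in the form stated above; the example fits (parameter counts and log-likelihoods) are invented for illustration.

```python
import math

def aic(k: int, log_lik: float) -> float:
    """AIC in the slide's form: k - ln Pr(E|I)."""
    return k - log_lik

def bic(k: int, n: int, log_lik: float) -> float:
    """BIC in the slide's form: (k/2) ln n - ln Pr(E|I)."""
    return (k / 2) * math.log(n) - log_lik

n = 100  # hypothetical sample size
print(aic(2, -120.0), bic(2, n, -120.0))    # simple 2-parameter model
print(aic(10, -115.0), bic(10, n, -115.0))  # richer 10-parameter model
# BIC penalises the larger model more: (10/2) ln 100 ~ 23 vs 10 for AIC
```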
Both Kolmogorov complexity and MDL are based on compression. Is there a relationship between the two? Yes: we can derive two-part MDL from Kolmogorov complexity. We’ll sketch here how.
(see, e.g., Li & Vitanyi 1996, Vereshchagin & Vitanyi 2004 for details)
Recall that in Algorithmic Information Theory we are looking for (optimal) descriptions of objects. One way to describe an object is via a set that contains it: describe the set, then point out which of its members it is. In fact, we do this all the time.
We have
 a set T, the model,
 and an object y ∈ T,
 and the complexity of y given T, i.e. L(y | T).
Obviously, L(y) ≤ L(T) + L(y | T).
Algorithmic Information Theory states that such a two-part description captures exactly the information in y. If y is a data set, i.e. a random sample, we expect it has
 epistemic structure, the “true” structure, captured by T, and
 aleatoric structure, the “accidental” structure, captured by y | T.
We are hence interested in that model T that minimizes L(T) + L(y | T), which is surprisingly akin to two-part MDL.
For L(T), we consider the shortest program that generates T and halts; i.e., a generative model of y.
For L(y | T), when y is a typical element of T there is no more efficient way to find y in T than by an index, i.e., L(y | T) ≈ log(|T|).
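As a concrete instance of L(y | T) ≈ log(|T|), consider the 20-bit string used in the example later in this tutorial, with T the set of all 20-bit strings containing exactly k ones; this model family is our own illustrative choice.

```python
import math

y = "01011100001101010011"
n, k = len(y), y.count("1")

# T: all length-n binary strings with exactly k ones; y is a member.
size_T = math.comb(n, k)
bits_index = math.log2(size_T)  # cost of an index of y within T
print(f"|T| = {size_T}, L(y|T) ~ {bits_index:.1f} bits "
      f"(vs {n} bits to spell y out literally)")
# -> |T| = 184756, L(y|T) ~ 17.5 bits (vs 20 bits literally)
```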
This suggests a way to discover the best model. Kolmogorov’s structure function is defined as
h_y(j) = min_T { log |T| : y ∈ T, L(T) ≤ j }
That is, we start with very simple – in terms of complexity – models and gradually work our way up. This defines the MDL function
μ_y(j) = min_T { L(T) + log |T| : y ∈ T, L(T) ≤ j }
We try to find the minimum by considering increasingly complex models.
(see, e.g., Li & Vitanyi 1996, Vereshchagin & Vitanyi 2004)
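A toy rendering of this search: we score a few hand-picked candidate models T containing y by L(T) + log |T|, from simple to complex. The candidate set and the L(T) costs are our own assumptions, purely for illustration.

```python
import math

y = "01011100001101010011"
n, k = len(y), y.count("1")

# Candidates as (name, L(T) in bits, log2 |T| in bits), simple to complex.
candidates = [
    ("all 20-bit strings",  0.0, float(n)),                 # trivial model
    ("strings with k ones", math.log2(n + 1),               # encode k
                            math.log2(math.comb(n, k))),    # index in T
    ("singleton {y}",       float(n), 0.0),                 # spell out y
]

for name, L_T, log_T in candidates:
    print(f"{name:20s} L(T)={L_T:5.2f} log|T|={log_T:5.2f} "
          f"total={L_T + log_T:5.2f}")
# For this balanced string (10 ones in 20 bits) the trivial model
# already wins: y has no structure the k-ones family can exploit.
# For a string with, say, 18 ones, the k-ones model drops to ~12 bits.
```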
Recall the MDL principle: given a set of hypotheses ℋ, the best hypothesis I ∈ ℋ for given data E is the I that minimises M(I) + M(E | I).
(see, e.g., Rissanen 1978, 1983; Grünwald 2007)
Say we have a string y = 01011100001101010011, with ten ones and ten zeros. Suppose ℋ consists of three Bernoulli models, q1 = 0.1, q2 = 0.2, q3 = 0.5:
M(y | q1) = −10 log q1 − 10 log(1 − q1) = 34.7 bits
M(y | q2) = −10 log q2 − 10 log(1 − q2) = 26.4 bits
M(y | q3) = −10 log q3 − 10 log(1 − q3) = 20.0 bits
Without prior preference over I ∈ ℋ, we use M(I) = log |ℋ| = log 3 ≈ 1.6 bits:
M(q1) + M(y | q1) = 36.3 bits
M(q2) + M(y | q2) = 28.0 bits
M(q3) + M(y | q3) = 21.6 bits
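These numbers are easy to reproduce; a minimal sketch:

```python
import math

y = "01011100001101010011"
ones, zeros = y.count("1"), y.count("0")
H = [0.1, 0.2, 0.5]  # the three Bernoulli models

M_model = math.log2(len(H))  # uniform code over H: log |H| bits

for q in H:
    M_data = -ones * math.log2(q) - zeros * math.log2(1 - q)
    print(f"q={q}: M(q) + M(y|q) = {M_model + M_data:.1f} bits")
# -> 36.3, 28.0, 21.6 bits: two-part MDL picks q3 = 0.5
```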
However, when you receive M(q1), you know that q2 and q3 were disregarded by the sender, as these did not lead to a minimal description.
Models I ∈ ℋ will only be used for data where they are optimal within the model class! Two-part MDL ignores this: it wastes bits!
Back to the MDL principle: the best hypothesis I ∈ ℋ for data E is the I that minimises M(I) + M(E | I), with M(I) and M(E | I) the lengths, in bits, of describing I resp. the data encoded using I.
(see, e.g., Rissanen 1978, 1983; Grünwald 2007)
The main intuition, coming from crude MDL: M(I) is ad hoc, so we want to get rid of it, but keeping only M(E | I) is going to give us a bad time, as maximising likelihood leads to overfitting. Instead, refined MDL encodes the data with the model class as a whole:
M(E | ℋ) = M(E | I*) + COMP(ℋ)
aka the stochastic complexity of E given ℋ, where I* is the maximum-likelihood hypothesis for E and COMP(ℋ) is the complexity of the model class.
Easy! Ehm… what universal codes do we know? Standard choices include normalized maximum likelihood, Bayesian mixture codes, and prequential plug-in codes. Each of these has quite a different nature, hence a different coding scheme, but all lead to very similar M(E | ℋ).
Normalized Maximum Likelihood (Shtarkov, 1987):
M(E | ℋ) = −log [ Pr(E | I*(E)) / Σ_{E′} Pr(E′ | I*(E′)) ]
where I*(E) ∈ ℋ is the maximum-likelihood hypothesis for E, and the sum ranges over all possible data E′.
Interpretation: the more special E is with respect to ℋ, the shorter its code. One nasty detail, the normalization: enumerating every possible E′ requires many PhD students; calculating the maximum likelihood I* for every E′, even more so.
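For a tiny model class the normalization is still feasible. A sketch computing exact NML code lengths for the Bernoulli class on binary sequences of length 10; the length and the test sequences are our choices.

```python
import math
from itertools import product

n = 10  # short enough that summing over all 2^n sequences is feasible

def max_lik(seq):
    """Pr(seq | q*) with q* = (#ones)/n, the ML Bernoulli parameter."""
    k = sum(seq)
    q = k / n
    return (q ** k) * ((1 - q) ** (n - k))  # note: 0**0 == 1 in Python

# The normalizer: sum of maximum likelihoods over every possible E'
norm = sum(max_lik(seq) for seq in product((0, 1), repeat=n))

def nml_bits(seq):
    return -math.log2(max_lik(seq) / norm)

print(f"COMP = {math.log2(norm):.2f} bits")
print(f"{nml_bits((1,) * n):.2f}")               # all ones: very short code
print(f"{nml_bits((0,1,0,1,1,0,0,1,1,0)):.2f}")  # typical: ~10 bits longer
```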
Refined MDL is only defined for a small set of cases, and computing stochastic complexity is possible for even fewer. Hence, in practice, as much as we may dislike it in theory, we often have to resort to crude MDL. However, as long as we are aware of the biases of the encoding, that is not a bad thing. In fact, in two-part MDL we can steer our encoding towards models we (intuitively) like better, and hence for data mining purposes two-part MDL is very often a good friend indeed.
MDL is not magic: it is a practical principle for doing inductive inference.
The main adage: fewer bits is better.
It applies universally, that is, without external input, only considering the data at hand.
With a universal code, your encoding is never much worse than the best.
Try to avoid, as much as possible, ad hoc biases, and be explicit about those that exist.