SLIDE 1

2019 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Modern MDL meets Data Mining Insight, Theory, and Practice

Kenji Yamanishi

The University of Tokyo

Jilles Vreeken

CISPA Helmholtz Center for Information Security

SLIDE 2

About the presenters


Kenji Yamanishi Jilles Vreeken

SLIDE 3

About this tutorial

Approximately 3.5 hours long. An extensive, but incomplete, introduction to

  • MDL theory
  • MDL practice in data mining
  • naturally a bit biased

SLIDE 4

Schedule

8:00am   Opening
8:10am   Introduction to MDL
8:50am   MDL in Action
9:30am   –––––– break ––––––
10:00am  Stochastic Complexity
11:00am  MDL in Dynamic Settings

SLIDE 7

Jilles Vreeken

Part 1: Introduction to MDL

SLIDE 8

Induction by Simplicity

“The simplest description of an object is the best”

SLIDE 9

Kolmogorov Complexity

The Kolmogorov complexity of a binary string 𝑦 is the length of the shortest program 𝑧∗ for a universal Turing Machine 𝑉 that generates 𝑦 and halts:

𝐿_𝑉(𝑦) = min_𝑧 { 𝑚(𝑧) ∣ 𝑉(𝑧) halts and 𝑉(𝑧) = 𝑦 }

(Solomonoff 1960, Kolmogorov 1965, Chaitin 1969)

SLIDE 11

Ultimately Impractical

Kolmogorov complexity 𝐿(𝑦), or rather, the Kolmogorov-optimal program 𝑧∗, is not computable. We can approximate it from above, but this is not very practical.

(simply not enough students to enumerate all Turing machines)

We can approximate it through off-the-shelf compressors, yet this has serious drawbacks.

(big-O, what structure does a compressor reward, etc)
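To make these drawbacks concrete, here is a minimal sketch, assuming Python with zlib and two illustrative strings (none of this is from the tutorial), of upper-bounding Kolmogorov complexity with an off-the-shelf compressor.

```python
# Hedged sketch: the zlib-compressed size is an upper bound on K(y), since the
# decompressor plus the compressed string is a program that outputs y and halts.
import random
import zlib

def approx_K(y: bytes) -> int:
    """Length in bytes of the zlib-compressed string: an upper bound on K(y)."""
    return len(zlib.compress(y, 9))

structured = b"01" * 500                                        # highly repetitive string
random_ish = bytes(random.getrandbits(8) for _ in range(1000))  # incompressible-looking string

print(approx_K(structured))   # small: zlib rewards exact repeats
print(approx_K(random_ish))   # close to 1000: no structure zlib can exploit
```

The drawback the slide points at is visible here: zlib only rewards the kind of structure (literal repeats within a limited window) it was designed for, so data with deep but non-repetitive structure still looks ‘complex’ to it.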

SLIDE 12

A practical variant

A more viable alternative is the Minimum Description Length principle

“the best model is the model that gives the best lossless compression”

There are two ways to motivate MDL

  • we’ll discuss both at a high level
  • then go into more details on what MDL is and can do


SLIDE 13

Two-Part MDL

The Minimum Description Length (MDL) principle: given a set of hypotheses ℋ, the best hypothesis 𝐼 ∈ ℋ for given data 𝐸 is the 𝐼 that minimises

𝑀(𝐼) + 𝑀(𝐸 ∣ 𝐼)

in which

  • 𝑀(𝐼) is the length, in bits, of the description of 𝐼
  • 𝑀(𝐸 ∣ 𝐼) is the length, in bits, of the description of the data when encoded using 𝐼

(see, e.g., Rissanen 1978, 1983, Grünwald 2007)

SLIDE 14

Bayesian Learning

Bayes tells us that

Pr(𝐼 ∣ 𝐸) = Pr(𝐸 ∣ 𝐼) × Pr(𝐼) / Pr(𝐸)

This means we want the 𝐼 that maximises Pr(𝐼 ∣ 𝐸). Since Pr(𝐸) is the same for all models, we have to maximise

Pr(𝐸 ∣ 𝐼) × Pr(𝐼)

or, equivalently, minimise

− log(Pr(𝐼)) − log(Pr(𝐸 ∣ 𝐼))

SLIDE 15

From Bayes to MDL

So, Bayesian Learning means minimising

− log(Pr(𝐼)) − log(Pr(𝐸 ∣ 𝐼))

Shannon tells us that the −log transform takes us from probabilities to optimal prefix-code lengths. This means we are actually minimising

𝑀(𝐼) + 𝑀(𝐸 ∣ 𝐼)

for some encoding 𝑀 for 𝐼 resp. 𝐸 ∣ 𝐼 corresponding to distribution Pr.
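A small numeric illustration (the prior and likelihood values below are made up for the example): the hypothesis with the largest posterior is exactly the one with the shortest total code length, because −log2 is monotonically decreasing.

```python
# Hedged sketch: maximising Pr(I) * Pr(E | I) is the same as minimising
# -log2 Pr(I) - log2 Pr(E | I), i.e. M(I) + M(E | I) in bits.
from math import log2

models = {            # hypothetical (prior, likelihood) pairs, not from the slides
    "I1": (0.5, 1e-6),
    "I2": (0.3, 1e-4),
    "I3": (0.2, 1e-5),
}

for name, (prior, lik) in models.items():
    bits = -log2(prior) - log2(lik)
    print(f"{name}: posterior ∝ {prior * lik:.1e}, code length = {bits:.1f} bits")
# I2 has the largest posterior and, equivalently, the shortest description (≈15.0 bits).
```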

SLIDE 16

Bayesian MDL

If we want to do MDL this way – i.e., being a Bayesian – we need to specify

  • a prior probability Pr(𝐼) on the models, and
  • a conditional probability Pr(𝐸 ∣ 𝐼) on data given a model

What are reasonable choices?
SLIDE 17

What Distribution to Use?

For the data, this is ‘easy’: a maximum likelihood model or a maximum entropy model for Pr(𝐸 ∣ 𝐼) makes most sense.

For the models, this is ‘harder’; we could, e.g., use

  • ‘whatever the expert says is a good distribution’, or
  • an uninformative prior on 𝐼, or
  • (a derivative of) the universal prior from algorithmic statistics

These are not easy to compute or query, and they are ad hoc.

In MDL we say: if we are going to be ad hoc, let us do so openly and use explicit universal encodings.

SLIDE 18

Information Criteria

MDL might make you think of either Akaike’s Information Criterion (AIC)

𝑘 − ln(Pr(𝐸 ∣ 𝐼))

or the Bayesian Information Criterion (BIC)

(𝑘/2) ln 𝑛 − ln(Pr(𝐸 ∣ 𝐼))

with 𝑘 the number of parameters of 𝐼 and 𝑛 the number of samples in 𝐸.

SLIDE 19

Information Criteria

MDL might make you think of either Akaike’s Information Criterion (AIC)

𝑘 + 𝑀(𝐸 ∣ 𝐼)

or the Bayesian Information Criterion (BIC)

(𝑘/2) ln 𝑛 + 𝑀(𝐸 ∣ 𝐼)

where the negative log-likelihood − ln(Pr(𝐸 ∣ 𝐼)) is now read as a code length 𝑀(𝐸 ∣ 𝐼).

SLIDE 20

Information Criteria

MDL might make you think of either Akaike’s Information Criterion (AIC)

𝑀_AIC(𝐼) = 𝑘

or the Bayesian Information Criterion (BIC)

𝑀_BIC(𝐼) = (𝑘/2) ln 𝑛

We, however, do not assume that all parameters are created equal; we take their complexity into account.
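As a concrete illustration of these fixed per-parameter penalties, the sketch below scores equal-width histogram density models with 𝑘 bins by adding the slide’s 𝑀_AIC(𝐼) = 𝑘 and 𝑀_BIC(𝐼) = (𝑘/2) ln 𝑛 terms to the negative log-likelihood; the data and the histogram model class are our own assumptions, chosen only to make the formulas runnable.

```python
# Hedged sketch: AIC- and BIC-style scores for k-bin histogram models.
import numpy as np

rng = np.random.default_rng(0)
E = rng.beta(2, 5, size=200)            # toy data on [0, 1)
n = len(E)

def neg_log_lik(E, k):
    """-ln Pr(E | I) for a k-bin histogram with maximum-likelihood bin weights."""
    counts, _ = np.histogram(E, bins=k, range=(0.0, 1.0))
    heights = (counts / n) * k          # ML density inside each equal-width bin
    idx = np.minimum((E * k).astype(int), k - 1)
    return -np.sum(np.log(heights[idx]))

for k in (2, 5, 10, 20, 50):
    nll = neg_log_lik(E, k)
    params = k - 1                      # free parameters of the histogram
    aic = params + nll                  # M_AIC(I) + M(E | I)
    bic = (params / 2) * np.log(n) + nll   # M_BIC(I) + M(E | I)
    print(f"k={k:3d}  -lnPr={nll:8.1f}  AIC={aic:8.1f}  BIC={bic:8.1f}")
```

Both criteria charge every parameter the same fixed amount; the MDL view developed below charges parameters by the number of bits they actually need.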

SLIDE 21

From Kolmogorov to MDL

Both Kolmogorov complexity and MDL are based on compression. Is there a relationship between the two? Yes: we can derive two-part MDL from Kolmogorov complexity. We’ll sketch here how.

(see, e.g., Li & Vitanyi 1996, Vereshchagin & Vitanyi 2004 for details)

SLIDE 22

Objects and Sets

Recall that in Algorithmic Information Theory we are looking for (optimal) descriptions of objects. One way to describe an object is to

  • describe a set of which it is a member
  • point out which of these members it is

In fact, we do this all the time

  • the beach (i.e., the set of all beaches)
  • over there (pointing out a specific one)

SLIDE 23

Algorithmic Statistics

We have a set 𝑇

  • which we call a model
  • which has complexity 𝐿(𝑇)

and an object 𝑦 ∈ 𝑇

  • 𝑇 is a model of 𝑦
  • the complexity of pointing out 𝑦 in 𝑇 is the complexity of 𝑦 given 𝑇, i.e. 𝐿(𝑦 ∣ 𝑇)

Obviously, 𝐿(𝑦) ≤ 𝐿(𝑇) + 𝐿(𝑦 ∣ 𝑇)

SLIDE 24

So?

Algorithmic Information Theory states that

  • every program that outputs 𝑦 and halts encodes the information in 𝑦
  • the smallest such program encodes only the information in 𝑦

If 𝑦 is a data set, i.e. a random sample, we expect it has

  • epistemic structure, “true” structure; captured by 𝑇
  • aleatoric structure, “accidental” structure; captured by 𝑦 ∣ 𝑇

We are hence interested in the model 𝑇 that minimises

𝐿(𝑇) + 𝐿(𝑦 ∣ 𝑇)

which is surprisingly akin to two-part MDL.

SLIDE 25

More detail

For 𝐿(𝑇)

  • this is simply the length of the shortest program that outputs 𝑇 and halts; i.e., a generative model of 𝑦

For 𝐿(𝑦 ∣ 𝑇)

  • if 𝑦 is a typical element of 𝑇, there is no more efficient way to find 𝑦 in 𝑇 than by an index, i.e., 𝐿(𝑦 ∣ 𝑇) ≈ log(|𝑇|)

SLIDE 26

Kolmogorov’s Structure Function

This suggests a way to discover the best model. Kolmogorov’s structure function is defined as

ℎ_𝑦(𝑗) = min_𝑇 { log |𝑇| ∣ 𝑦 ∈ 𝑇, 𝐿(𝑇) ≤ 𝑗 }

That is, we start with very simple – in terms of complexity – models and gradually work our way up.

(see, e.g., Li & Vitanyi 1996, Vereshchagin & Vitanyi 2004)

SLIDE 27

The MDL function

This suggests a way to discover the best model. Kolmogorov’s structure function

ℎ_𝑦(𝑗) = min_𝑇 { log |𝑇| ∣ 𝑦 ∈ 𝑇, 𝐿(𝑇) ≤ 𝑗 }

defines the MDL function as

𝜇_𝑦(𝑗) = min_𝑇 { 𝐿(𝑇) + log |𝑇| ∣ 𝑦 ∈ 𝑇, 𝐿(𝑇) ≤ 𝑗 }

We try to find the minimum by considering increasingly complex models.

(see Vereshchagin & Vitanyi 2004)
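Kolmogorov complexity itself is not computable, but the two-part score 𝐿(𝑇) + log |𝑇| can be mimicked with an explicit, hand-picked model class. In the sketch below (our own toy setup, not part of the tutorial) the models are the sets 𝑇_𝑘 of binary strings of length 𝑛 with exactly 𝑘 ones: naming a model costs log2(𝑛 + 1) bits, and pointing out 𝑦 inside it costs log2 |𝑇_𝑘| bits.

```python
# Hedged toy version of the two-part score L(T) + log |T| over a tiny model class.
from math import comb, log2

def two_part_score(y: str):
    """Return (k, L(T_k) + log2 |T_k|) for the set T_k that contains y."""
    n, k = len(y), y.count("1")
    L_T = log2(n + 1)                  # uniform code over the n+1 candidate models
    L_y_given_T = log2(comb(n, k))     # index of y inside T_k
    return k, L_T + L_y_given_T

print(two_part_score("01011100001101010011"))   # the tutorial's example string
print(two_part_score("00000000000000000001"))   # far more structure, shorter total
```

A highly structured string lands in a much smaller set and therefore gets a much shorter total description, which is exactly the trade-off the MDL function formalises.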

SLIDE 29

Two-Part MDL

The Minimum Description Length (MDL) principle: given a set of hypotheses ℋ, the best hypothesis 𝐼 ∈ ℋ for given data 𝐸 is the 𝐼 that minimises

𝑀(𝐼) + 𝑀(𝐸 ∣ 𝐼)

in which

  • 𝑀(𝐼) is the length, in bits, of the description of 𝐼
  • 𝑀(𝐸 ∣ 𝐼) is the length, in bits, of the description of the data when encoded using 𝐼

(see, e.g., Rissanen 1978, 1983, Grünwald 2007)

SLIDE 30

Example Binomial

Say we have a string 𝑦 = 01011100001101010011 of 10 zeroes and 10 ones.

Suppose ℋ consists of these binomials, e.g. 𝑞1 = 0.1, 𝑞2 = 0.2, 𝑞3 = 0.5

𝑀(𝑦 ∣ 𝑞1) = −10 log 𝑞1 − 10 log(1 − 𝑞1) = 34.7 bits
𝑀(𝑦 ∣ 𝑞2) = −10 log 𝑞2 − 10 log(1 − 𝑞2) = 26.4 bits
𝑀(𝑦 ∣ 𝑞3) = −10 log 𝑞3 − 10 log(1 − 𝑞3) = 20.0 bits

SLIDE 31

Example Binomial

Suppose 𝑦 = 01011100001101010011, and ℋ = {𝑞1 = 0.1, 𝑞2 = 0.2, 𝑞3 = 0.5}

Without prior preference over 𝐼 ∈ ℋ, 𝑀(𝐼) = log |ℋ|

𝑀(𝑞1) + 𝑀(𝑦 ∣ 𝑞1) = 36.3 bits
𝑀(𝑞2) + 𝑀(𝑦 ∣ 𝑞2) = 28.0 bits
𝑀(𝑞3) + 𝑀(𝑦 ∣ 𝑞3) = 21.6 bits
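These numbers are easy to reproduce; a small sketch (the code is ours, the numbers are the slide’s):

```python
# Hedged sketch: two-part code lengths, in bits, for the three candidate
# Bernoulli parameters and a uniform code over the model class.
from math import log2

ones, zeros = 10, 10                       # the string has 10 ones and 10 zeroes
H = {"q1": 0.1, "q2": 0.2, "q3": 0.5}

M_I = log2(len(H))                         # log2 3 ≈ 1.6 bits to name the model
for name, q in H.items():
    M_E_given_I = -ones * log2(q) - zeros * log2(1 - q)
    print(f"{name}: M(y|q) = {M_E_given_I:.1f} bits, "
          f"M(q) + M(y|q) = {M_I + M_E_given_I:.1f} bits")
# q1: 34.7 and 36.3 bits, q2: 26.4 and 28.0 bits, q3: 20.0 and 21.6 bits
```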

SLIDE 32

Example Binomial

Suppose 𝑦 = 01011100001101010011, and ℋ = {𝑞1 = 0.1, 𝑞2 = 0.2, 𝑞3 = 0.5}

𝑀(𝑞1) + 𝑀(𝑦 ∣ 𝑞1) = 36.3 bits
𝑀(𝑞2) + 𝑀(𝑦 ∣ 𝑞2) = 28.0 bits
𝑀(𝑞3) + 𝑀(𝑦 ∣ 𝑞3) = 21.6 bits

However, when you receive the description of 𝑞1, you know that 𝑞2 and 𝑞3 were disregarded by the sender, as these did not lead to a minimal description.

SLIDE 33

Example Binomial

Suppose 𝑦 = 01011100001101010011, and ℋ = {𝑞1 = 0.1, 𝑞2 = 0.2, 𝑞3 = 0.5}

𝑀(𝑞1) + 𝑀(𝑦 ∣ 𝑞1) = 36.3 bits
𝑀(𝑞2) + 𝑀(𝑦 ∣ 𝑞2) = 28.0 bits
𝑀(𝑞3) + 𝑀(𝑦 ∣ 𝑞3) = 21.6 bits

Models 𝐼 ∈ ℋ will only be used for data where they are optimal within the model class! Two-part MDL ignores this; it wastes bits!

SLIDE 34

Crude MDL

The Minimum Description Length (MDL) principle: given a set of hypotheses ℋ, the best hypothesis 𝐼 ∈ ℋ for given data 𝐸 is the 𝐼 that minimises

𝑀(𝐼) + 𝑀(𝐸 ∣ 𝐼)

in which

  • 𝑀(𝐼) is the length, in bits, of the description of 𝐼
  • 𝑀(𝐸 ∣ 𝐼) is the length, in bits, of the description of the data when encoded using 𝐼

(see, e.g., Rissanen 1978, 1983, Grünwald 2007)

SLIDE 35

Refined MDL

The main intuition, coming from crude MDL: 𝑀(𝐼) is ad hoc, so we want to get rid of it, but keeping only 𝑀(𝐸 ∣ 𝐼) is going to give us a bad time, as maximising likelihood leads to overfitting.

𝑀(𝐸 ∣ ℋ) = 𝑀(𝐸 ∣ 𝐼∗) + COMP(ℋ)

aka the stochastic complexity of 𝐸 given ℋ

Easy! Ehm…

SLIDE 36

Universal Codes

What universal codes do we know?

  • the two-part code (iff minimax guarantees, or large sample)
  • prequential plug-in codes (a sketch follows below)
  • Bayesian mixture codes (Jeffreys’ prior)
  • Normalised Maximum Likelihood (NML)

Each of these has quite a different nature, hence a different coding scheme, but all lead to very similar 𝑀(𝐸 ∣ ℋ).
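As promised above, here is a minimal sketch of a prequential plug-in code for the Bernoulli model class, using a Krichevsky-Trofimov style smoothed estimate; applying it to the example string from the binomial slides is our own choice, not something the tutorial does.

```python
# Hedged sketch: prequential plug-in code for a Bernoulli source.  Each symbol
# is encoded with a prediction based only on the symbols seen so far.
from math import log2

def prequential_code_length(y: str) -> float:
    """Total code length of y in bits under the KT plug-in predictor."""
    ones = zeros = 0
    bits = 0.0
    for symbol in y:
        p_one = (ones + 0.5) / (ones + zeros + 1.0)   # KT estimate from the past
        p = p_one if symbol == "1" else 1.0 - p_one
        bits += -log2(p)                              # pay -log2 of the prediction
        if symbol == "1":
            ones += 1
        else:
            zeros += 1
    return bits

print(prequential_code_length("01011100001101010011"))  # ≈ 22.5 bits
```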

SLIDE 37

NML

Normalised Maximum Likelihood (Shtarkov, 1987)

𝑀(𝐸 ∣ ℋ) = − log [ Pr(𝐸 ∣ 𝐼∗(𝐸)) / Σ_{𝐸′ ∈ 𝒠} Pr(𝐸′ ∣ 𝐼∗(𝐸′)) ]

where 𝐼∗(𝐸) is the maximum-likelihood hypothesis in ℋ for 𝐸, and 𝒠 is the set of all possible data.

Interpretation: the more special 𝐸 is with respect to ℋ, the shorter its code. One nasty detail, the normalisation: enumerating every possible 𝐸′ requires many PhD students; calculating the maximum likelihood 𝐼∗(𝐸′) for every 𝐸′, even more so.
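For the Bernoulli model class with small 𝑛 the normalisation is still tractable. Below is a sketch (our own example, with 𝑛 = 20 matching the earlier binomial string) that computes the parametric complexity COMP by brute force and, from it, the NML code length.

```python
# Hedged sketch: NML / stochastic complexity for the Bernoulli model class, n = 20.
from math import comb, log2

n = 20

def ml_prob(k: int) -> float:
    """Maximum-likelihood probability of any particular sequence with k ones."""
    if k in (0, n):
        return 1.0
    q = k / n
    return q**k * (1 - q)**(n - k)

# COMP: the maximum-likelihood probability summed over every possible sequence
comp = sum(comb(n, k) * ml_prob(k) for k in range(n + 1))

k_obs = 10                                  # the example string has 10 ones
nml_bits = -log2(ml_prob(k_obs)) + log2(comp)
print(f"COMP = {log2(comp):.2f} bits, NML code length = {nml_bits:.2f} bits")
# roughly 2.65 bits of parametric complexity, 22.65 bits in total
```

Note how close this is to the prequential plug-in code length from the previous sketch, consistent with the slide’s remark that the different universal codes lead to very similar 𝑀(𝐸 ∣ ℋ).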

SLIDE 38

Crude in Practice

Refined MDL is only defined for a small set of cases. Computing stochastic complexity is possible for even fewer. Hence, in practice, as much as we may dislike it in theory, we often have to resort to crude MDL. However, as long as we’re aware of the biases of the encoding, that’s not a bad thing. In fact, in two-part MDL we can steer our encoding towards models we (intuitively) like better, and hence for data mining purposes two-part MDL is very often a good friend indeed.


SLIDE 39

MDL is a principle

MDL is not a single method

  • it’s a general principle for doing inductive inference

The main adage: fewer bits is better

  • encode the data universally; that is, without external input, only consider the data at hand
  • ideally, uphold minimax optimality properties; try to make sure your encoding is never much worse than the best

Try to avoid, as much as possible, ad hoc biases

  • be explicit about those that exist