2019 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Modern MDL meets Data Mining: Insight, Theory, and Practice
Kenji Yamanishi
The University of Tokyo
Jilles Vreeken
CISPA Helmholtz Center for Information Security
Approximately 3.5 hours long. An extensive, but incomplete, introduction to MDL in data mining.
“The simplest description …”
L_V(y) = min_z { m(z) : V(z) halts and V(z) = y }
The Kolmogorov complexity of a binary string y is the length of the shortest program z* for a universal Turing machine V that generates y and halts, where m(z) denotes the length, in bits, of program z.
(Solomonoff 1960, Kolmogorov 1965, Chaitin 1969)
Kolmogorov complexity L(y), or rather, the Kolmogorov-optimal program y*, is not computable. We can approximate it from above, but this is not very practical (simply not enough students to enumerate all Turing machines).
We can approximate it through off-the-shelf lossless compressors, yet this has serious drawbacks (big-O constants, what structure does a compressor reward, etc).
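To make the compressor route concrete, here is a minimal sketch in Python. It upper-bounds L(y) by the compressed size under zlib; the helper name K_upper_bound and the test strings are our own choices, and zlib merely stands in for any lossless compressor.

```python
import os
import zlib

def K_upper_bound(y: bytes) -> int:
    """Upper-bound the complexity of y, in bits, by the length
    of its zlib-compressed form (a crude stand-in for L(y))."""
    return 8 * len(zlib.compress(y, 9))

structured = b"01" * 500       # highly regular: compresses very well
random_ish = os.urandom(1000)  # incompressible with high probability

print(K_upper_bound(structured))  # a few hundred bits, far below 8000
print(K_upper_bound(random_ish))  # close to (or slightly above) 8000
```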
A more viable alternative is the Minimum Description Length principle:
“the best model is the model that gives the best lossless compression”
There are two ways to motivate MDL.
The Minimum Description Length (MDL) principle: given a set of hypotheses ℋ, the best hypothesis I ∈ ℋ for given data E is the I that minimises
M(I) + M(E | I)
in which M(I) is the length, in bits, of the description of I, and M(E | I) is the length, in bits, of the description of the data when encoded using I.
(see, e.g., Rissanen 1978, 1983; Grünwald 2007)
Bayes tells us that
Pr(I | E) = Pr(E | I) × Pr(I) / Pr(E)
This means we want the I that maximises Pr(I | E). Since Pr(E) is the same for all models, we have to maximise Pr(E | I) × Pr(I), or, equivalently, minimise
−log(Pr(I)) − log(Pr(E | I))
So, Bayesian learning means minimising −log(Pr(I)) − log(Pr(E | I)). Shannon tells us that the −log transform takes us from probabilities to optimal prefix-code lengths. This means we are actually minimizing M(I) + M(E | I) for some encoding M for I resp. E | I corresponding to distribution Pr.
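A tiny numeric sketch of this equivalence; the priors and likelihoods below are made-up numbers, purely for illustration.

```python
import math

# Hypothetical models with a prior Pr(I) and a likelihood Pr(E|I)
prior      = {"I1": 0.5, "I2": 0.3, "I3": 0.2}
likelihood = {"I1": 0.001, "I2": 0.010, "I3": 0.004}

def codelength(I):
    # M(I) + M(E|I): Shannon code lengths -log2 Pr(I) and -log2 Pr(E|I)
    return -math.log2(prior[I]) - math.log2(likelihood[I])

map_model = max(prior, key=lambda I: prior[I] * likelihood[I])
mdl_model = min(prior, key=codelength)
assert map_model == mdl_model  # maximising Pr = minimising bits
print(mdl_model, f"{codelength(mdl_model):.2f} bits")
```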
If we want to do MDL this way – i.e., being a Bayesian – we need to specify
 a prior Pr(I) on the models, and
 a distribution Pr(E | I) on data given a model.
What are reasonable choices?
For the data, this is ‘easy’: a maximum likelihood model.
For the models, this is ‘harder’; we could, e.g., use … These are not easy to compute or query, and are ad hoc.
In MDL we say: if we are going to be ad hoc, let us do so openly, and use explicit universal encodings.
MDL might make you think of either Akaike’s Information Criterion (AIC)
AIC(I) = k − ln(Pr(E | I))
or the Bayesian Information Criterion (BIC)
BIC(I) = (k/2) ln n − ln(Pr(E | I))
where k is the number of parameters of I, and n the number of samples. As −ln(Pr(E | I)) is just a code length, both have the form penalty(I) + M(E | I): AIC penalises a model by k, BIC by (k/2) ln n. We, however, do not consider all models with the same number of parameters created equal; we take their complexity into account.
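For reference, a short sketch of these two scores in the form stated above; the example fits (parameter counts and log-likelihoods) are invented for illustration.

```python
import math

def aic(k: int, log_lik: float) -> float:
    """AIC in the slide's form: k - ln Pr(E|I)."""
    return k - log_lik

def bic(k: int, n: int, log_lik: float) -> float:
    """BIC in the slide's form: (k/2) ln n - ln Pr(E|I)."""
    return (k / 2) * math.log(n) - log_lik

n = 100  # hypothetical sample size
print(aic(2, -120.0), bic(2, n, -120.0))    # simple 2-parameter model
print(aic(10, -115.0), bic(10, n, -115.0))  # richer 10-parameter model
# BIC penalises the larger model more: (10/2) ln 100 ~ 23 vs 10 for AIC
```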
Both Kolmogorov complexity and MDL are based on compression. Is there a relationship between the two? Yes: we can derive two-part MDL from Kolmogorov complexity. We’ll sketch here how.
(see, e.g., Li & Vitanyi 1996, Vereshchagin & Vitanyi 2004 for details)
Recall that in Algorithmic Information Theory we are looking for (optimal) descriptions of objects. One way to describe an object is via a set that contains it: describe the set, then point out which of its members it is. In fact, we do this all the time.
We have
 a set T, the model,
 and an object y ∈ T,
 and the complexity of y given T, i.e. L(y | T).
Obviously, L(y) ≤ L(T) + L(y | T).
Algorithmic Information Theory states that such a two-part description captures exactly the information in y. If y is a data set, i.e. a random sample, we expect it has
 epistemic structure, the “true” structure, captured by T, and
 aleatoric structure, the “accidental” structure, captured by y | T.
We are hence interested in that model T that minimizes L(T) + L(y | T), which is surprisingly akin to two-part MDL.
For L(T), we consider the shortest program that generates T and halts; i.e., a generative model of y.
For L(y | T), when y is a typical element of T there is no more efficient way to find y in T than by an index, i.e., L(y | T) ≈ log(|T|).
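As a concrete instance of L(y | T) ≈ log(|T|), consider the 20-bit string used in the example later in this tutorial, with T the set of all 20-bit strings containing exactly k ones; this model family is our own illustrative choice.

```python
import math

y = "01011100001101010011"
n, k = len(y), y.count("1")

# T: all length-n binary strings with exactly k ones; y is a member.
size_T = math.comb(n, k)
bits_index = math.log2(size_T)  # cost of an index of y within T
print(f"|T| = {size_T}, L(y|T) ~ {bits_index:.1f} bits "
      f"(vs {n} bits to spell y out literally)")
# -> |T| = 184756, L(y|T) ~ 17.5 bits (vs 20 bits literally)
```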
This suggests a way to discover the best model. Kolmogorov’s structure function is defined as
h_y(j) = min_T { log |T| : y ∈ T, L(T) ≤ j }
That is, we start with very simple – in terms of complexity – models and gradually work our way up. This defines the MDL function
μ_y(j) = min_T { L(T) + log |T| : y ∈ T, L(T) ≤ j }
We try to find the minimum by considering increasingly complex models.
(see, e.g., Li & Vitanyi 1996, Vereshchagin & Vitanyi 2004)
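A toy rendering of this search: we score a few hand-picked candidate models T containing y by L(T) + log |T|, from simple to complex. The candidate set and the L(T) costs are our own assumptions, purely for illustration.

```python
import math

y = "01011100001101010011"
n, k = len(y), y.count("1")

# Candidates as (name, L(T) in bits, log2 |T| in bits), simple to complex.
candidates = [
    ("all 20-bit strings",  0.0, float(n)),                 # trivial model
    ("strings with k ones", math.log2(n + 1),               # encode k
                            math.log2(math.comb(n, k))),    # index in T
    ("singleton {y}",       float(n), 0.0),                 # spell out y
]

for name, L_T, log_T in candidates:
    print(f"{name:20s} L(T)={L_T:5.2f} log|T|={log_T:5.2f} "
          f"total={L_T + log_T:5.2f}")
# For this balanced string (10 ones in 20 bits) the trivial model
# already wins: y has no structure the k-ones family can exploit.
# For a string with, say, 18 ones, the k-ones model drops to ~12 bits.
```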
Recall the MDL principle: given a set of hypotheses ℋ, the best hypothesis I ∈ ℋ for given data E is the I that minimises M(I) + M(E | I).
(see, e.g., Rissanen 1978, 1983; Grünwald 2007)
Say we have a string y = 01011100001101010011, with ten ones and ten zeros. Suppose ℋ consists of three Bernoulli models, q1 = 0.1, q2 = 0.2, q3 = 0.5:
M(y | q1) = −10 log q1 − 10 log(1 − q1) = 34.7 bits
M(y | q2) = −10 log q2 − 10 log(1 − q2) = 26.4 bits
M(y | q3) = −10 log q3 − 10 log(1 − q3) = 20.0 bits
Without prior preference over I ∈ ℋ, we use M(I) = log |ℋ| = log 3 ≈ 1.6 bits:
M(q1) + M(y | q1) = 36.3 bits
M(q2) + M(y | q2) = 28.0 bits
M(q3) + M(y | q3) = 21.6 bits
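These numbers are easy to reproduce; a minimal sketch:

```python
import math

y = "01011100001101010011"
ones, zeros = y.count("1"), y.count("0")
H = [0.1, 0.2, 0.5]  # the three Bernoulli models

M_model = math.log2(len(H))  # uniform code over H: log |H| bits

for q in H:
    M_data = -ones * math.log2(q) - zeros * math.log2(1 - q)
    print(f"q={q}: M(q) + M(y|q) = {M_model + M_data:.1f} bits")
# -> 36.3, 28.0, 21.6 bits: two-part MDL picks q3 = 0.5
```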
However, when you receive M(q1), you know that q2 and q3 were disregarded by the sender, as these did not lead to a minimal description.
Models I ∈ ℋ will only be used for data where they are optimal within the model class! Two-part MDL ignores this: it wastes bits!
Back to the MDL principle: the best hypothesis I ∈ ℋ for data E is the I that minimises M(I) + M(E | I), with M(I) and M(E | I) the lengths, in bits, of describing I resp. the data encoded using I.
(see, e.g., Rissanen 1978, 1983; Grünwald 2007)
The main intuition, coming from crude MDL: M(I) is ad hoc, so we want to get rid of it, but keeping only M(E | I) is going to give us a bad time, as maximising likelihood leads to overfitting. Instead, refined MDL encodes the data with the model class as a whole:
M(E | ℋ) = M(E | I*) + COMP(ℋ)
aka the stochastic complexity of E given ℋ, where I* is the maximum-likelihood hypothesis for E and COMP(ℋ) is the complexity of the model class.
Easy! Ehm… what universal codes do we know? Standard choices include normalized maximum likelihood, Bayesian mixture codes, and prequential plug-in codes. Each of these has quite a different nature, hence a different coding scheme, but all lead to very similar M(E | ℋ).
Normalized Maximum Likelihood (Shtarkov, 1987):
M(E | ℋ) = −log [ Pr(E | I*(E)) / Σ_{E′} Pr(E′ | I*(E′)) ]
where I*(E) ∈ ℋ is the maximum-likelihood hypothesis for E, and the sum ranges over all possible data E′.
Interpretation: the more special E is with respect to ℋ, the shorter its code. One nasty detail, the normalization: enumerating every possible E′ requires many PhD students; calculating the maximum likelihood I* for every E′, even more so.
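For a tiny model class the normalization is still feasible. A sketch computing exact NML code lengths for the Bernoulli class on binary sequences of length 10; the length and the test sequences are our choices.

```python
import math
from itertools import product

n = 10  # short enough that summing over all 2^n sequences is feasible

def max_lik(seq):
    """Pr(seq | q*) with q* = (#ones)/n, the ML Bernoulli parameter."""
    k = sum(seq)
    q = k / n
    return (q ** k) * ((1 - q) ** (n - k))  # note: 0**0 == 1 in Python

# The normalizer: sum of maximum likelihoods over every possible E'
norm = sum(max_lik(seq) for seq in product((0, 1), repeat=n))

def nml_bits(seq):
    return -math.log2(max_lik(seq) / norm)

print(f"COMP = {math.log2(norm):.2f} bits")
print(f"{nml_bits((1,) * n):.2f}")               # all ones: very short code
print(f"{nml_bits((0,1,0,1,1,0,0,1,1,0)):.2f}")  # typical: ~10 bits longer
```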
Refined MDL is only defined for a small set of cases, and computing stochastic complexity is possible for even fewer. Hence, in practice, as much as we may dislike it in theory, we often have to resort to crude MDL. However, as long as we are aware of the biases of the encoding, that is not a bad thing. In fact, in two-part MDL we can steer our encoding towards models we (intuitively) like better, and hence for data mining purposes two-part MDL is very often a good friend indeed.
MDL is not magic: it is a practical principle for doing inductive inference.
The main adage: fewer bits is better.
It applies universally, that is, without external input, only considering the data at hand.
With a universal code, your encoding is never much worse than the best.
Try to avoid, as much as possible, ad hoc biases, and be explicit about those that exist.