Part 2: MDL in Action
Jilles Vreeken
Explicit Coding
Ad hoc sounds bad, but is it really? Bayesian learning, for instance, is inherently subjective, plus biasing the search is a time-honoured tradition.
Using an explicit encoding allows us to steer towards the type of structure we want to discover. We also mitigate one of the practical weak spots of AIT: would the structure you found not depend on the order of the data?
The rank of a matrix 𝑩 is the smallest 𝑘 such that
𝑩 = 𝒂₁ ∘ 𝒃₁ + 𝒂₂ ∘ 𝒃₂ + … + 𝒂ₖ ∘ 𝒃ₖ
i.e. the smallest number of rank-1 matrices 𝒂ᵢ ∘ 𝒃ᵢ that sum to 𝑩 (the Schein rank).
The rank of a Boolean matrix 𝑩 is the smallest 𝑘 such that
𝑩 = (𝒂₁ ∘ 𝒃₁) ∨ (𝒂₂ ∘ 𝒃₂) ∨ … ∨ (𝒂ₖ ∘ 𝒃ₖ)
i.e. the smallest number of rank-1 Boolean matrices whose disjunction, under Boolean algebra where 1 + 1 = 1, equals 𝑩 (the Boolean Schein rank).
(Miettinen et al. 2006, 2008)
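To make the Boolean product concrete, here is a minimal numpy sketch (my illustration, not from the slides): the Boolean product is the ordinary matrix product evaluated under 1 + 1 = 1, and the Boolean rank of a matrix can be strictly smaller than its real rank.

```python
import numpy as np

# Boolean matrix product B = C ∘ D: entry (i, j) is 1 iff some k
# has C[i, k] = D[k, j] = 1, i.e. the real product thresholded at 1.
def boolean_product(C, D):
    return (C @ D > 0).astype(int)

C = np.array([[1, 0],
              [1, 1],
              [0, 1]])
D = np.array([[1, 1, 0],
              [0, 1, 1]])
B = boolean_product(C, D)

print(B)                           # covered exactly by k = 2 Boolean factors
print(np.linalg.matrix_rank(B))    # yet the real rank of B is 3
```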
Noise quickly inflates the rank to min(𝑛, 𝑚): with noisy data we can only hope for 𝑩 ≈ 𝑪 ∘ 𝑫.
(Miettinen & Vreeken 2012, 2014)
Separating structure and noise
𝑩 = (𝑪 ∘ 𝑫) ⊕ 𝑵
(Miettinen & Vreeken 2012, 2014)
Encoding the structure
L(𝑪) = log 𝑛 + log 𝑘 + Σ_{𝒄 ∈ 𝑪} log( 𝑛 choose |𝒄| )
(Miettinen & Vreeken 2012, 2014)
Encoding the structure
π π« = log π + log π + log π π
πβπ«
(Miettinen & Vreeken 2012, 2014) 9
β
π©
πͺ β π« π
=
Encoding the noise
L(𝑵) = log 𝑛𝑚 + log( 𝑛𝑚 choose |𝑵| )
(Miettinen & Vreeken 2012, 2014)
MDL for BMF
L(𝑩, 𝑴) = L(𝑪) + L(𝑫) + L(𝑵), where 𝑩 = (𝑪 ∘ 𝑫) ⊕ 𝑵
(Miettinen & Vreeken 2012, 2014)
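As a sketch, the three terms above can be computed directly; the exact encoding in the MDL4BMF papers is more refined, so treat the helpers below as an illustration only.

```python
from math import comb, log2

def L_factor(F, dim):
    # log(dim) + log(k) for the dimensions, plus, per factor f,
    # log(dim choose |f|) bits to identify which of its entries are 1
    k = len(F)
    return log2(dim) + log2(k) + sum(log2(comb(dim, sum(f))) for f in F)

def L_noise(num_noise_ones, n, m):
    # log(nm) bits for |N|, plus log(nm choose |N|) bits for its positions
    return log2(n * m) + log2(comb(n * m, num_noise_ones))

def L_bmf(C, D, num_noise_ones, n, m):
    # L(B, M) = L(C) + L(D) + L(N)
    return L_factor(C, n) + L_factor(D, m) + L_noise(num_noise_ones, n, m)

# factors given as 0/1 columns of C (length n) and rows of D (length m)
print(L_bmf(C=[[1, 1, 0], [0, 1, 1]],
            D=[[1, 1, 0], [0, 1, 1]],
            num_noise_ones=1, n=3, m=3))
```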
The ideal outcome of pattern mining is a small set of patterns that together describe the data well. Frequent pattern mining does not deliver this; MDL allows us to effectively pursue the ideal: a pattern set mining approach.
(Tatti & Vreeken 2012, Bertens et al. 2016, Bhattacharyya & Vreeken 2017; for transaction data, Vreeken et al. 2011; for graphs, Koutra et al. 2014)
Pattern mining over event sequences: the data 𝐷 is a bag of sequences over an alphabet Ω, e.g.
Ω = { a, b, c, d, … }
𝐷 = { a a a b d c d b a b c , a a b d c d b , a a a b d c d b a , … }
As patterns we use serial episodes: "subsequences allowing gaps", occurring across multiple sequences.
(Tatti & Vreeken 2012, Bertens et al. 2016, Bhattacharyya & Vreeken 2017)
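A serial episode thus matches any in-order occurrence of its symbols. A tiny sketch of this matching (my illustration, not the mining algorithm itself):

```python
def occurs(pattern, sequence):
    # a serial episode occurs iff its symbols appear in order,
    # with arbitrary gaps allowed in between
    events = iter(sequence)
    return all(symbol in events for symbol in pattern)

print(occurs("ab", "aadb"))   # True: a ... b, with a gap
print(occurs("ba", "aadb"))   # False: order matters
```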
As models we use code tables, assigning optimal prefix codes to patterns and singletons.
[figure: a code table mapping patterns and singletons to optimal prefix codes, including gap (?) and no-gap (!) codes]
The length of the code for pattern 𝑋 is
L(code_p(𝑋)) = −log P(𝑋) = −log( usage(𝑋) / Σ_{𝑌 ∈ CT} usage(𝑌) )
The length of the pattern stream is
L(C_p) = Σ_{𝑋 ∈ C_p} L(code_p(𝑋))
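In code, with toy usage counts (the pattern names and numbers are made up):

```python
from math import log2

usage = {"ab": 5, "dc": 3, "a": 10, "b": 6, "c": 2, "d": 4}
total = sum(usage.values())

# L(code_p(X)) = -log( usage(X) / sum of usages over the code table )
code_len = {X: -log2(u / total) for X, u in usage.items()}

# the pattern stream then costs the sum of the codes it contains
def L_Cp(stream):
    return sum(code_len[X] for X in stream)

print(code_len["ab"], L_Cp(["ab", "a", "dc", "ab"]))
```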
Encoding 1: using only singletons
Data 𝐷: a a a b d c d b a b c a b d c
C_p: one singleton code per event in 𝐷
Encoding 2: using patterns
Data 𝐷: a a a b d c d b a b c a b d c
Alignment: occurrences of a pattern may contain gaps, so besides the pattern stream C_p we also emit a gap stream C_g of gap (?) and no-gap (!) codes.
The length of a gap code for pattern 𝑋 is
L(code_g(𝑋)) = −log( gaps(𝑋) / (gaps(𝑋) + fills(𝑋)) )
and analogously for non-gap codes.
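A sketch of both code lengths, using the counts gaps(X) and fills(X) (my naming):

```python
from math import log2

def gap_bits(gaps, fills):
    # L(code_g(X)) = -log( gaps(X) / (gaps(X) + fills(X)) )
    return -log2(gaps / (gaps + fills))

def fill_bits(gaps, fills):
    # the analogous non-gap code
    return -log2(fills / (gaps + fills))

print(gap_bits(2, 6), fill_bits(2, 6))   # 2.0 bits vs ~0.42 bits
```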
By which, the encoded size of 𝐷 given C_p, C_g, and the code table is
L(𝐷 | CT) = L(C_p | CT) + L(C_g | CT)
which leaves us to define L(CT | C, 𝐷).
L(CT | C, 𝐷) consists of

1) base singleton counts in 𝐷
   L_N(|Ω|) + L_N(||𝐷||) + log( (||𝐷|| − 1) choose (|Ω| − 1) )

2) number of patterns, total, and per pattern usage
   L_N(|𝒫| + 1) + L_N(usage(𝒫) + 1) + log( (usage(𝒫) − 1) choose (|𝒫| − 1) )

3) per pattern 𝑋: its length, elements, and number of gaps
   L_N(|𝑋|) − Σ_{𝑥 ∈ 𝑋} log P(𝑥 | 𝐷) + L_N(gaps(𝑋) + 1)

where L_N is the MDL-optimal universal code for integers.
(Rissanen 1983)
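The universal integer code L_N can be computed directly. A sketch, together with part 1) of the code table cost (the binomial encodes the singleton counts as a weak composition):

```python
from math import comb, log2

C0 = 2.865064   # normalising constant of Rissanen's universal prior

def L_N(n):
    # L_N(n) = log2(c0) + log2(n) + log2(log2(n)) + ... (positive terms only)
    assert n >= 1
    bits, term = log2(C0), log2(n)
    while term > 0:
        bits += term
        term = log2(term)
    return bits

def L_singleton_counts(alphabet_size, total_events):
    # part 1): L_N(|Ω|) + L_N(||D||) + log( (||D||-1) choose (|Ω|-1) )
    return (L_N(alphabet_size) + L_N(total_events)
            + log2(comb(total_events - 1, alphabet_size - 1)))

print(L_N(1000), L_singleton_counts(4, 20))
```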
(Tatti & Vreeken 2012; Bertens et al. 2016; Bhattacharyya & Vreeken 2017; for transaction data, Vreeken et al. 2011, Budhathoki & Vreeken 2015; for graphs, Koutra et al. 2014)
By which we have a lossless encoding; in other words, an objective function. By MDL, our goal is now to minimise the total encoded size L(CT, 𝐷). For how to do so, please see the papers.
On synthetic data: random data → no structure found; HMM-generated data → structure recovered. On real data: text, for interpretation.
(implementation available at http://eda.mmci.uni-saarland/sqs)
Dataset      |Ω|     |𝐷|    ||𝐷||    |𝒫| SQS-CANDS   |𝒫| SQS-SEARCH   Δ𝐿
Addresses    5 295    56    15 506   138             155              5k
JMLR         3 846   788    40 879   563             580              30k
Moby Dick   10 277     1    22 559   215             231              10k

(Tatti & Vreeken 2012; Bhattacharyya & Vreeken 2017; Grosse & Vreeken 2017)
Example patterns:
JMLR: [empirical | structural] risk minimization, [independent | principal] component analysis, [Mahalanobis | edit | Euclidean | pairwise] distance
Addresses: unit[ed] state[s], take oath, army navy, under circumst.
LOTR: he Verb Conj he [he said that he], Conj _ the Noun of [and even the end of], the Adj Noun and [the young Hobbits and]
[figure: the pattern languages — serial episodes, choice-episodes, and ontological episodes]
The best clustering is the one that costs the least bits: partition your data such that
L(𝒫) + Σ_{(𝐷ᵢ, 𝑀ᵢ) ∈ 𝒫} L(𝐷ᵢ, 𝑀ᵢ)
is minimal (similar to mixture modelling, but descriptive instead of predictive).
(for itemsets, see Van Leeuwen et al. 2009)
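A toy sketch of the objective: exhaustively score two-way splits of a small dataset. The per-cluster cost L_block is a stand-in (an assumed 8-bit model cost plus a spread term); the real instantiations use, e.g., a code table per cluster.

```python
from itertools import combinations
from math import log2

# stand-in for L(D_i, M_i): assumed 8-bit model cost plus a spread term
def L_block(points):
    mu = sum(points) / len(points)
    sse = sum((p - mu) ** 2 for p in points) + 1e-9
    return 8 + 0.5 * len(points) * log2(sse / len(points) + 1)

def best_split(data):
    items = range(len(data))
    best = (L_block(data), (tuple(items), ()))        # k = 1 baseline
    for r in range(1, len(data) // 2 + 1):
        for left in combinations(items, r):
            right = tuple(i for i in items if i not in left)
            cost = (L_block([data[i] for i in left])
                    + L_block([data[i] for i in right]))
            best = min(best, (cost, (left, right)))
    return best

print(best_split([1.0, 1.2, 0.9, 8.0, 8.3, 7.9]))     # recovers the two groups
```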
Example: mammal occurrence data; the MDL-"optimal" clustering has k = 6.
(for itemsets, see Van Leeuwen et al. 2009)
Split your data per class, and run the algorithm to induce a model 𝑀ᵢ per class 𝑖. Then, for unseen instances, assign the class label of the model that encodes them shortest:
L(𝑥 | 𝑀₁) < L(𝑥 | 𝑀₂) ⟺ P(𝑥 | 𝑀₁) > P(𝑥 | 𝑀₂)
Shortest code wins!
(for itemsets, see Van Leeuwen et al. 2006, ECML PKDD)
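A hedged sketch of classification by compression: the per-class "models" here are just symbol code lengths from class-wise frequencies (the papers use code tables), and unseen symbols get an assumed 20-bit escape code.

```python
from math import log2

def train(records):
    # per-class model: Shannon code lengths from symbol frequencies
    counts = {}
    for r in records:
        for s in r:
            counts[s] = counts.get(s, 0) + 1
    total = sum(counts.values())
    return {s: -log2(c / total) for s, c in counts.items()}

def L(record, model, escape=20.0):
    # encoded size of a record under a model; escape code for unseen symbols
    return sum(model.get(s, escape) for s in record)

M1 = train(["abab", "aab", "abba"])
M2 = train(["cdcd", "dcc", "cddc"])

x = "abad"
print("class", 1 if L(x, M1) < L(x, M2) else 2)   # shortest code wins
```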
One-Class Classification (aka anomaly detection)
The normal situation: insufficient data for the target class. Compression models the norm: anomalous instances get a high description length L(𝑥 | 𝑀_normal).
Very nice properties:
• performance: high accuracy
• versatile: no distance measure needed
• characterisation: "this part of 𝑥 is incompressible"
(Smets & Vreeken 2011, Akoglu et al. 2012)
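The same machinery gives one-class classification: calibrate a cutoff on the code lengths of normal data and flag records that exceed it. A self-contained toy sketch; the escape-code and three-sigma cutoff are my assumptions.

```python
from math import log2

def train(records):
    counts = {}
    for r in records:
        for s in r:
            counts[s] = counts.get(s, 0) + 1
    total = sum(counts.values())
    return {s: -log2(c / total) for s, c in counts.items()}

def L(record, model, escape=20.0):
    return sum(model.get(s, escape) for s in record)

normal = ["abab", "aab", "abba", "bab", "aabb"]
M = train(normal)

lens = [L(r, M) for r in normal]
mu = sum(lens) / len(lens)
sd = (sum((v - mu) ** 2 for v in lens) / len(lens)) ** 0.5
cutoff = mu + 3 * sd                    # flag what the norm cannot compress

for x in ["abba", "cdcd"]:
    print(x, "anomalous:", L(x, M) > cutoff)
```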
[figure: anomaly characterisation examples — Catholic church, Vatican; Washington Memorial, D.C.; Thames river, Buckingham Palace, plain fields, London]
Akoglu et al. (2012)
We can find the causal skeleton using conditional independence tests, but only few edge directions.
[figure: causal skeleton over X, Y, Q, S, R, Z, V, W; the direction of the X–Y edge remains unknown]
(Spirtes et al. 2000; Marx & Vreeken, Conditional Independence Testing by Stochastic Complexity, 2019)
If X → Y, we have, up to an additive constant,
K( P(X) ) + K( P(Y | X) ) ≤ K( P(Y) ) + K( P(X | Y) )
That is, we can do causal inference by identifying the factorization of the joint distribution with the lowest Kolmogorov complexity.
(Janzing & Schölkopf, IEEE TIT 2012)
Kolmogorov complexity is not computable, but we can instantiate this score in practice via MDL. (Grünwald 2007)
We model 𝑌 as 𝑌 = 𝑓(𝑋) + 𝒩. As 𝑓 we consider linear, quadratic, cubic, exponential, and reciprocal functions, and model the noise using a 0-mean Gaussian. We choose the 𝑓 that minimizes
L(𝑌 | 𝑋) = L(𝑓) + L(𝒩)
Marx & Vreeken (2017)
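A sketch of this model selection, restricted to the polynomial classes: fit each candidate, then score parameter cost plus a Gaussian code for the residuals. The 32-bit-per-coefficient parameter cost is an assumption; SLOPE's actual encoding differs.

```python
import numpy as np

def gaussian_bits(res):
    # -sum log2 N(r; 0, sigma^2) with the ML plug-in variance
    var = np.mean(res ** 2) + 1e-12
    return 0.5 * len(res) * np.log2(2 * np.pi * np.e * var)

def score(x, y):
    best = None
    for name, deg in [("linear", 1), ("quadratic", 2), ("cubic", 3)]:
        coef = np.polyfit(x, y, deg)
        res = y - np.polyval(coef, x)
        bits = 32 * (deg + 1) + gaussian_bits(res)   # L(f) + L(noise)
        best = min(best or (bits, name), (bits, name))
    return best

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + 0.1 * rng.normal(size=200)
print(score(x, y))   # the linear fit should win
```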
How certain are we?
Δ = L(X → Y) − L(Y → X), with L(X → Y) = L(X) + L(Y | X) and L(Y → X) = L(Y) + L(X | Y)
• the higher, the more certain
Marx & Vreeken (2017)
How certain are we? Is a given inference significant?
With 𝑘 = |L(X → Y) − L(Y → X)| / 2, the no-hypercompression inequality gives
P( L₀(𝐷) − L(𝐷) ≥ 𝑘 ) ≤ 2^(−𝑘)
(Grünwald 2007, Marx & Vreeken 2017)
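In code, the resulting test is essentially one line (the helper name is mine):

```python
# significant at level alpha iff 2^{-k} <= alpha,
# with k = |L(X -> Y) - L(Y -> X)| / 2
def significant(L_xy, L_yx, alpha=0.001):
    k = abs(L_xy - L_yx) / 2
    return 2 ** (-k) <= alpha

print(significant(1000.0, 1021.0))   # k = 10.5, 2^-10.5 ≈ 0.0007 -> True
```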
Normalised for sample size:
Δ = ( L(X) + L(Y | X) ) / ( L(X) + L(Y) ) − ( L(Y) + L(X | Y) ) / ( L(X) + L(Y) )
• the higher, the more certain
• robust w.r.t. sample size
[figure: accuracy vs. decision rate on the Tübingen benchmark of 97 univariate numeric cause-effect pairs, weighted]
Marx & Vreeken (2017)
Inferences of state-of-the-art algorithms, ordered by confidence values. SLOPE is 85% accurate at significance level α = 0.001.
Model selection in deep learning is hard
How about an MDL approach?
Suppose neural network 𝑀 ∈ ℳ predicts target 𝑦 given 𝑥, ŷ = 𝑀(𝑥). How do we encode the data given the model?
L(𝐷 | 𝑀) = −Σ_{𝑦ᵢ ∈ 𝐷} log p(𝑦ᵢ | 𝑥ᵢ)
• e.g. if 𝑦 is binary, we can encode the error vector 𝑬 = 𝑦 ⊕ ŷ, with L(𝑬 | 𝑀) = log 𝑛 + log( 𝑛 choose |𝑬| )
• e.g. if 𝑦 is continuous, we can encode the residuals using a zero-mean Gaussian
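A minimal sketch of the data term for binary targets, from predicted probabilities:

```python
import numpy as np

# L(D | M) = -sum_i log2 p(y_i | x_i), here for Bernoulli predictions
def data_bits(y_true, p_pred, eps=1e-12):
    p = np.clip(p_pred, eps, 1 - eps)   # guard against log(0)
    return float(-np.sum(y_true * np.log2(p) + (1 - y_true) * np.log2(1 - p)))

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.8, 0.6])
print(data_bits(y, p))   # ≈ 1.5 bits
```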
Suppose neural network 𝑀 ∈ ℳ predicts target 𝑦 given 𝑥, ŷ = 𝑀(𝑥). How do we encode the model? Prequential coding!
Dawid (1984), Barron et al. (1998), Grünwald (2007), Blier & Ollivier (2018)
Simple, elegant idea: "Update your model after every message." That is, we re-train our network after "every" new label:
L(𝐷 | ℳ) = Σᵢ L(𝐷ᵢ | 𝑀ᵢ₋₁)
where 𝑀ᵢ₋₁ is the model trained on the first 𝑖 − 1 blocks. Best of all, this is not a crude, but a refined MDL code!
Dawid (1984), Barron et al. (1998), Grünwald (2007), Blier & Ollivier (2018)
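A hedged sketch of prequential coding: encode each block with the model trained on everything before it, then (re)train. The "model" here is a toy running class-frequency estimator, standing in for a neural network.

```python
import numpy as np

def prequential_bits(labels, n_classes, blocks=4):
    counts = np.ones(n_classes)            # Laplace-smoothed starting model
    bits = 0.0
    for block in np.array_split(np.asarray(labels), blocks):
        p = counts / counts.sum()
        bits += float(-np.log2(p[block]).sum())   # encode block with M_{i-1}
        for y in block:                            # then "re-train" on it
            counts[y] += 1
    return bits

print(prequential_bits([0, 0, 1, 0, 1, 1, 1, 1], n_classes=2))
```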