Part 2: MDL in Action (Jilles Vreeken)

SLIDE 1

Part 2: MDL in Action

Jilles Vreeken

SLIDE 2

Explicit Coding

Ad hoc sounds bad, but is it really?

  • Bayesian learning, for instance, is inherently subjective, plus
  • biasing search is a time-honoured tradition in data analysis

Using an explicit encoding allows us to steer towards the type of structure we want to discover. We also mitigate one of the practical weak spots of AIT (algorithmic information theory)

  • all data is a string, but wouldn’t it be nice if the structure you found did not depend on the order of the data?

SLIDE 3

Matrix Factorization

The rank of a matrix 𝑩 is

  • the minimum number of rank-1 matrices that, when summed, form 𝑩 (Schein rank)

[Figure: 𝑩 = π’„πŸ ∘ π’…πŸ + π’„πŸ ∘ π’…πŸ + π’„πŸ‘ ∘ π’…πŸ‘ + …, a sum of rank-1 outer products]

SLIDE 4

Boolean Matrix Factorization

The rank of a Boolean matrix 𝑩 is

  • the minimum number of rank-1 matrices that, when summed, form 𝑩 (Boolean Schein rank)

[Figure: 𝑩 = π’„πŸ ∘ π’…πŸ + π’„πŸ ∘ π’…πŸ + π’„πŸ‘ ∘ π’…πŸ‘, a Boolean sum of rank-1 outer products]

(Miettinen et al 2006, 2008)

SLIDE 5

Boolean Matrix Factorization

The rank of a Boolean matrix 𝑩 is

  • the minimum number of rank-1 matrices that, when summed, form 𝑩 (Boolean Schein rank)
  • noise quickly inflates the β€˜true’ latent rank to min(𝑛, π‘š)

[Figure: a noisy 𝑩 requires many rank-1 matrices π’„π’Š ∘ π’…π’Š to be reconstructed exactly]

(Miettinen et al 2006, 2008)

SLIDE 6

Boolean Matrix Factorization

Noise quickly inflates the rank to min(𝑛, π‘š)

  • how can we determine the β€˜true’ latent rank?

[Figure: 𝑩 β‰ˆ π‘ͺ ∘ 𝑫]

(Miettinen & Vreeken 2012, 2014)

SLIDE 7

Boolean Matrix Factorization

Separating structure and noise

  • matrices π‘ͺ and 𝑫 contain the structure, matrix 𝑭 contains the noise

[Figure: 𝑩 = (π‘ͺ ∘ 𝑫) βŠ• 𝑭]

(Miettinen & Vreeken 2012, 2014)

SLIDE 8

Boolean Matrix Factorization

Encoding the structure

𝐿(π‘ͺ) = log 𝑛 + βˆ‘_{𝒄 ∈ π‘ͺ} ( log 𝑛 + log binom(𝑛, |𝒄|) )

i.e. per column of π‘ͺ we encode how many 1s it contains, and then which of the 𝑛 rows they fall in

[Figure: 𝑩 = (π‘ͺ ∘ 𝑫) βŠ• 𝑭]

(Miettinen & Vreeken 2012, 2014)

SLIDE 9

Boolean Matrix Factorization

Encoding the structure

𝐿(𝑫) = log π‘š + βˆ‘_{𝒅 ∈ 𝑫} ( log π‘š + log binom(π‘š, |𝒅|) )

[Figure: 𝑩 = (π‘ͺ ∘ 𝑫) βŠ• 𝑭]

(Miettinen & Vreeken 2012, 2014)

SLIDE 10

Boolean Matrix Factorization

Encoding the noise

𝐿(𝑭) = log(π‘›π‘š) + log binom(π‘›π‘š, |𝑭|)

[Figure: 𝑩 = (π‘ͺ ∘ 𝑫) βŠ• 𝑭]

(Miettinen & Vreeken 2012, 2014)

SLIDE 11

Boolean Matrix Factorization

MDL for BMF

𝐿(𝑩, 𝐻) = 𝐿(π‘ͺ) + 𝐿(𝑫) + 𝐿(𝑭)

where the model 𝐻 consists of the factor matrices π‘ͺ and 𝑫

[Figure: 𝑩 = (π‘ͺ ∘ 𝑫) βŠ• 𝑭]

(Miettinen & Vreeken 2012, 2014)
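
To make the encoding concrete, here is a minimal Python sketch that scores a candidate factorization in bits, assuming the formulas as reconstructed above; the function names are illustrative, not the authors' implementation.

```python
import numpy as np
from scipy.special import gammaln

LOG2E = np.log2(np.e)

def log2_binom(n, k):
    # log2 of the binomial coefficient, via log-gamma for numerical stability
    return (gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)) * LOG2E

def bmf_description_length(B, C, D):
    """L(B, H) = L(C) + L(D) + L(F) in bits, for 0/1 numpy arrays
    B (n x m), C (n x k), and D (k x m)."""
    n, m = B.shape
    # noise matrix F: the cells where the Boolean product C ∘ D differs from B
    F = B ^ ((C @ D) > 0).astype(B.dtype)
    # L(C): per column, the number of 1s and then their positions among n rows
    L_C = np.log2(n) + sum(np.log2(n) + log2_binom(n, c.sum()) for c in C.T)
    # L(D): analogously, per row of D over the m columns
    L_D = np.log2(m) + sum(np.log2(m) + log2_binom(m, d.sum()) for d in D)
    # L(F): the number of flipped cells, then their positions among all n*m
    L_F = np.log2(n * m) + log2_binom(n * m, F.sum())
    return L_C + L_D + L_F
```

A search for the MDL-optimal factorization would then vary the rank k and the contents of C and D, keeping whatever minimises this score.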

SLIDE 12

Pattern Mining

The ideal outcome of pattern mining

  • patterns that show the structure of the data
  • preferably a small set, without redundancy or noise

Frequent pattern mining does not achieve this

  • pattern explosion β†’ overly many, overly redundant results

MDL allows us to effectively pursue the ideal

  • we want a group of patterns that summarise the data well
  • we take a pattern set mining approach

(Tatti & Vreeken 2012, Bertens et al. 2016, Bhattacharyya & Vreeken 2017; for transaction data, Vreeken et al. 2011; for graphs, Koutra et al. 2014)

SLIDE 13

Event sequences

  • one, or multiple sequences

[Figure: alphabet Ξ© = { a, b, c, d, … }; data 𝑫, one or more event sequences such as a a a b d c d b a b c]

(Tatti & Vreeken 2012, Bertens et al. 2016, Bhattacharyya & Vreeken 2017)

SLIDE 14

Event sequences

  • one, or multiple sequences

Patterns: serial episodes

  • β€˜subsequences allowing gaps’

[Figure: occurrences of the serial episode β€œa b” marked in the data 𝑫]

(Tatti & Vreeken 2012, Bertens et al. 2016, Bhattacharyya & Vreeken 2017)

SLIDE 15

Event sequences

  • one, or multiple sequences

Patterns: serial episodes

  • β€˜subsequences allowing gaps’

[Figure: occurrences of the serial episodes β€œa b”, β€œd c”, and β€œb b” marked in the data 𝑫]

(Tatti & Vreeken 2012, Bertens et al. 2016, Bhattacharyya & Vreeken 2017)

SLIDE 16

Models

As models we use code tables

  • a dictionary of patterns & codes
  • always contains all singletons

We use optimal prefix codes

  • easy to compute,
  • behave predictably,
  • good results,
  • more details follow

[Figure: a code table mapping the singletons a, b, c, d and the patterns β€œabc”, β€œda” to codes (p, q), with gap (?) and fill (!) codes]

SLIDE 17

Encoding Event Sequences

The length of the code for pattern 𝑋

𝐿(code_p(𝑋)) = βˆ’log 𝑝 = βˆ’log( usg(𝑋) / βˆ‘_{π‘Œ ∈ CT} usg(π‘Œ) )

The length of the code stream

𝐿(Cp) = βˆ‘_{𝑋 ∈ CT} usg(𝑋) Β· 𝐿(code_p(𝑋))

(see the sketch below)

[Figure: Encoding 1, using only singletons: the data 𝑫 = a a a b d c d b a b c … is covered singleton by singleton; code table CT₁, code stream Cp]

SLIDE 18

Encoding Event Sequences

[Figure: Encoding 2, using patterns: the data 𝑫 is covered with the patterns β€œabc” and β€œda” (codes p, q) plus singleton codes; the alignment uses gap (?) and fill (!) codes where a pattern occurs with gaps; code table CTβ‚‚, streams Cp and Cg]

SLIDE 19

Encoding Event Sequences

Data 𝑫: a a a b d c d b a b c

The length of a gap code for pattern 𝑋

𝐿(code_g(𝑋)) = βˆ’log( gaps(𝑋) / (gaps(𝑋) + fills(𝑋)) )

and analogously for non-gap (fill) codes

[Figure: Encoding 2, with the gap (?) and fill (!) codes of pattern β€œabc” highlighted]

SLIDE 20

Encoding Event Sequences

By this, the encoded size of the data 𝑫 given code table CT is

𝐿(𝑫 ∣ CT) = 𝐿(Cp ∣ CT) + 𝐿(Cg ∣ CT)

which leaves us to define 𝐿(CT ∣ 𝑫)

SLIDE 21

Encoding a Code Table

[Figure: a code table over patterns X, Y, … and singletons a, …, z, each with its pattern code and its gap (?) and fill (!) codes]

𝐿(CT ∣ 𝑫) consists of

SLIDE 22

Encoding a Code Table

𝐿(CT ∣ 𝑫) consists of

1) the base singleton counts in 𝑫

𝐿ℕ(|Ξ©|) + 𝐿ℕ(||𝑫||) + log binom(||𝑫|| βˆ’ 1, |Ξ©| βˆ’ 1)

(Rissanen 1983)
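
Here 𝐿ℕ is the MDL universal code for the integers (Rissanen 1983). A minimal sketch, assuming the standard definition 𝐿ℕ(𝑛) = logβˆ—(𝑛) + logβ‚‚(𝑐₀) with 𝑐₀ β‰ˆ 2.865064:

```python
import math

def L_N(n):
    """Rissanen's (1983) universal code length for an integer n >= 1, in bits:
    log2(c0) + log2(n) + log2(log2(n)) + ... over the positive terms, with
    c0 ~ 2.865064 chosen so the code lengths satisfy Kraft's inequality."""
    assert n >= 1
    bits = math.log2(2.865064)
    x = float(n)
    while True:
        x = math.log2(x)
        if x <= 0:
            break
        bits += x
    return bits
```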


SLIDE 24

Encoding a Code Table

𝐿(CT ∣ 𝑫) consists of

1) the base singleton counts in 𝑫

𝐿ℕ(|Ξ©|) + 𝐿ℕ(||𝑫||) + log binom(||𝑫|| βˆ’ 1, |Ξ©| βˆ’ 1)

2) the number of patterns, their total usage, and the per-pattern usages

𝐿ℕ(|𝒫| + 1) + 𝐿ℕ(usg(𝒫) + 1) + log binom(usg(𝒫) βˆ’ 1, |𝒫| βˆ’ 1)


SLIDE 26

Encoding a Code Table

𝐿(CT ∣ 𝑫) consists of

1) the base singleton counts in 𝑫

𝐿ℕ(|Ξ©|) + 𝐿ℕ(||𝑫||) + log binom(||𝑫|| βˆ’ 1, |Ξ©| βˆ’ 1)

2) the number of patterns, their total usage, and the per-pattern usages

𝐿ℕ(|𝒫| + 1) + 𝐿ℕ(usg(𝒫) + 1) + log binom(usg(𝒫) βˆ’ 1, |𝒫| βˆ’ 1)

3) per pattern 𝑋: its length, its elements, and its number of gaps

𝐿ℕ(|𝑋|) βˆ’ βˆ‘_{π‘₯ ∈ 𝑋} log 𝑝(π‘₯ ∣ 𝑫) + 𝐿ℕ(gaps(𝑋) + 1)

(see the sketch below)

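
Putting the three parts together, a sketch of 𝐿(CT ∣ 𝑫), reusing L_N and log2_binom from the earlier sketches; all argument names are illustrative, and it assumes at least one non-singleton pattern.

```python
import math

def code_table_length(singleton_counts, usages, patterns, gaps):
    """L(CT | D) as the sum of the three parts above.

    singleton_counts: dict singleton -> count in D
    usages: dict pattern -> usage; patterns: dict pattern -> its elements
    gaps: dict pattern -> number of gaps (assumes >= 1 non-singleton pattern)
    """
    total = sum(singleton_counts.values())
    # 1) base singleton counts in D
    bits = L_N(len(singleton_counts)) + L_N(total) \
         + log2_binom(total - 1, len(singleton_counts) - 1)
    # 2) number of patterns, their total usage, and the per-pattern usages
    P, usage_P = len(patterns), sum(usages[X] for X in patterns)
    bits += L_N(P + 1) + L_N(usage_P + 1) + log2_binom(usage_P - 1, P - 1)
    # 3) per pattern: its length, its elements, and its number of gaps
    for X, elems in patterns.items():
        bits += L_N(len(elems)) + L_N(gaps[X] + 1)
        bits -= sum(math.log2(singleton_counts[x] / total) for x in elems)
    return bits
```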

SLIDE 28

Encoding Event Sequences

By which we have a lossless encoding. In other words, an objective function. By MDL, our goal is now to minimise

𝐿(CT, 𝑫) = 𝐿(CT ∣ 𝑫) + 𝐿(𝑫 ∣ CT)

for how to do so, please see the papers

(Tatti & Vreeken 2012; Bertens et al. 2016; Bhattacharyya & Vreeken 2017; for transaction data, Vreeken et al. 2011, Budhathoki & Vreeken 2015; for graphs, Koutra et al. 2014)

SLIDE 29

Experiments

  • synthetic data: random βœ“ no structure found; HMM βœ“ structure recovered
  • real data: text data, for interpretation

[Table: SQS-CANDS vs. SQS-SEARCH on the Addresses, JMLR, and Moby Dick corpora, listing alphabet size |Ξ©|, data length, and the number of patterns selected; the gain Δ𝐿 is roughly 5k bits (Addresses), 30k bits (JMLR), and 10k bits (Moby Dick)]

(implementation available at http://eda.mmci.uni-saarland/sqs)

SLIDE 30

Selected Results

JMLR

  • {empirical, structural} risk minimization
  • {independent, principal} component analysis
  • {Mahalanobis, edit, Euclidean, pairwise} distance

PRES. ADDRESSES

  • unit[ed] state[s]
  • take oath
  • army navy
  • under circumst.
  • econ. public expenditur.
  • exec. branch. governm.

LOTR

  • he Verb Conj he  [he said that he]
  • Conj _ the Noun of  [and even the end of]
  • the Adj Noun and  [the young Hobbits and]

[Figure: examples of serial episodes, choice-episodes, and ontological episodes]

(Tatti & Vreeken 2012; Bhattacharyya & Vreeken 2017; Grosse & Vreeken 2017)

SLIDE 31

Clustering

The best clustering is the one that costs the least bits

  • similar structure (patterns) within clusters
  • different structure (patterns) between clusters

Partition your data 𝐷 such that

𝐿(partition) + βˆ‘_{(𝐷ⱼ, 𝑀ⱼ)} 𝐿(𝐷ⱼ, 𝑀ⱼ)

is minimal, where each part 𝐷ⱼ gets its own model 𝑀ⱼ (see the sketch below)

(similar to mixture modelling, but descriptive instead of predictive)

for itemsets, see Van Leeuwen et al (2009)
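
A toy sketch of this objective: greedily reassign rows to whichever cluster encodes them cheapest. The per-cluster frequency code below is only a stand-in for a real 𝐿(𝐷ⱼ, 𝑀ⱼ), such as a per-cluster code table.

```python
import math
from collections import Counter

def cluster_cost(rows):
    # stand-in for L(D_j, M_j): encode items by their within-cluster frequency
    counts = Counter(x for row in rows for x in row)
    total = sum(counts.values())
    return -sum(c * math.log2(c / total) for c in counts.values())

def mdl_partition(rows, k, iters=10):
    clusters = [rows[i::k] for i in range(k)]  # arbitrary initial partition
    for _ in range(iters):
        new = [[] for _ in range(k)]
        for row in rows:
            # place the row where it increases the encoded size the least
            j = min(range(k), key=lambda j: cluster_cost(clusters[j] + [row])
                                            - cluster_cost(clusters[j]))
            new[j].append(row)
        clusters = new
    return clusters
```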

SLIDE 32

Clustering

Mammal occurrences

  • 2221 areas in Europe
  • 50Γ—50 km each
  • 123 mammals
  • no location information

[Figure: map of Europe, clustered into k=6 groups, the MDL-β€˜optimal’ choice]

for itemsets, see Van Leeuwen et al (2009)

SLIDE 33

Classification

Split your data per class

  • induce a model per class

Then, for unseen instances

  • assign the class label of the model that encodes them shortest

𝐿(π‘₯ ∣ 𝑀₁) < 𝐿(π‘₯ ∣ 𝑀₂) β†’ Pr(π‘₯ ∣ 𝑀₁) > Pr(π‘₯ ∣ 𝑀₂)

(for itemsets, see ECML PKDD’06)

SLIDE 34

Classification by MDL

𝐿(π‘₯ ∣ 𝑀₁) < 𝐿(π‘₯ ∣ 𝑀₂) β†’ Pr(π‘₯ ∣ 𝑀₁) > Pr(π‘₯ ∣ 𝑀₂)

[Figure: database with 𝑛 classes β†’ split per class β†’ run algorithm β†’ model 𝑀꜀* per class 𝑐 β†’ encode unseen record β†’ shortest code wins!]

Van Leeuwen et al (2006)
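
A minimal sketch of this recipe, with a simple Laplace-smoothed frequency code per class standing in for the induced models (not the actual code-table classifier of the paper):

```python
import math
from collections import Counter

class MDLClassifier:
    """One frequency-based code per class; an unseen record is assigned
    the class whose code compresses it best."""
    def fit(self, records, labels):
        self.models = {}
        for rec, lab in zip(records, labels):
            self.models.setdefault(lab, Counter()).update(rec)
        return self

    def codelength(self, record, label):
        counts = self.models[label]
        total = sum(counts.values())
        # vocabulary size over all classes, for Laplace smoothing
        V = len(set().union(*[set(m) for m in self.models.values()]))
        # -log2 p(item | class), summed over the record's items
        return sum(-math.log2((counts[x] + 1) / (total + V)) for x in record)

    def predict(self, record):
        return min(self.models, key=lambda lab: self.codelength(record, lab))
```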

SLIDE 35

Outlier Detection

One-Class Classification (aka anomaly detection)

  • lots of data for the normal situation, insufficient data for the target

Compression models the norm

  • anomalies will have a high description length 𝐿(π‘₯ ∣ 𝑀*_norm)

Very nice properties

  • performance: high accuracy
  • versatile: no distance measure needed
  • characterisation: β€˜this part of π‘₯ is incompressible’

Smets & Vreeken (2011), Akoglu et al (2012)
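
The same machinery gives a one-class anomaly score: train only on normal data and rank unseen records by their description length. A short sketch, reusing the MDLClassifier from the previous example:

```python
def rank_by_anomaly(unseen, normal_records):
    # longer code = harder to compress under the model of normal behaviour
    m = MDLClassifier().fit(normal_records, ["normal"] * len(normal_records))
    return sorted(unseen, key=lambda r: m.codelength(r, "normal"), reverse=True)
```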

SLIDE 36

CompreX on Images

[Figure: Catholic church, Vatican; Washington Memorial, D.C.; Thames river, Buckingham Palace, plain fields, London]

Akoglu et al (2012)

SLIDE 37

Causal Discovery

[Figure: a network over the variables X, Y, Q, S, R, Z, V, W]

SLIDE 38

Causal Discovery

We can find the causal skeleton using conditional independence tests, but only few edge directions

[Figure: the skeleton over X, Y, Q, S, R, Z, V, W]

Spirtes et al (2000); Marx & Vreeken, Conditional Independence Testing by Stochastic Complexity (2019)

SLIDE 39

Causal Inference

We can find the causal skeleton using conditional independence tests, but only few edge directions

[Figure: the skeleton, with the edge between X and Y still undirected (?)]

Spirtes et al (2000); Marx & Vreeken, Conditional Independence Testing by Stochastic Complexity (2019)

SLIDE 40

Algorithmic Markov Condition

If 𝑋 β†’ π‘Œ, we have, up to an additive constant,

𝐾(𝑃(𝑋)) + 𝐾(𝑃(π‘Œ ∣ 𝑋)) ≀ 𝐾(𝑃(π‘Œ)) + 𝐾(𝑃(𝑋 ∣ π‘Œ))

That is, we can do causal inference by identifying the factorization of the joint with the lowest Kolmogorov complexity

(Janzing & SchΓΆlkopf, IEEE TIT 2012)

SLIDE 41

MDL and Regression

𝐿(𝑀) + 𝐿(𝐷 ∣ 𝑀)

[Figure: fitting the same points with a line a₁x + aβ‚€ versus a degree-10 polynomial a₁₀x¹⁰ + a₉x⁹ + … + aβ‚€, trading model cost against the cost of encoding the errors]

(GrΓΌnwald 2007)

SLIDE 42

Modelling the Data

We model π‘Œ as π‘Œ = 𝑓(𝑋) + 𝑁

As 𝑓 we consider linear, quadratic, cubic, exponential, and reciprocal functions, and model the noise 𝑁 using a 0-mean Gaussian. We choose the 𝑓 that minimizes

𝐿(π‘Œ ∣ 𝑋) = 𝐿(𝑓) + 𝐿(𝑁)

(see the sketch below)

Marx & Vreeken (2017)
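
A sketch of this selection step: fit each function family by least squares, encode the residuals with a zero-mean Gaussian, and charge a crude fixed cost per parameter for 𝐿(𝑓). The paper's model cost is more refined; all names here are illustrative.

```python
import numpy as np

def gaussian_residual_bits(res):
    # bits to encode residuals under a 0-mean Gaussian of fitted variance
    n, var = len(res), np.var(res) + 1e-12
    return 0.5 * n * np.log2(2 * np.pi * np.e * var)

def best_function(x, y, bits_per_param=32):
    """Pick the f minimizing L(Y|X) = L(f) + L(N) over the five families."""
    families = {
        "linear":      lambda x: np.column_stack([x, np.ones_like(x)]),
        "quadratic":   lambda x: np.column_stack([x**2, x, np.ones_like(x)]),
        "cubic":       lambda x: np.column_stack([x**3, x**2, x, np.ones_like(x)]),
        "reciprocal":  lambda x: np.column_stack([1.0 / x, np.ones_like(x)]),
        "exponential": lambda x: np.column_stack([np.exp(x), np.ones_like(x)]),
    }
    best = None
    for name, design in families.items():
        A = design(np.asarray(x, float))
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        L = bits_per_param * A.shape[1] + gaussian_residual_bits(y - A @ coef)
        if best is None or L < best[1]:
            best = (name, L)
    return best
```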

SLIDE 43

Confidence and Significance

How certain are we?

β„‚ = ( 𝐿(𝑋) + 𝐿(π‘Œ ∣ 𝑋) ) βˆ’ ( 𝐿(π‘Œ) + 𝐿(𝑋 ∣ π‘Œ) )

i.e. the difference between 𝐿(π‘‹β†’π‘Œ) and 𝐿(π‘Œβ†’π‘‹)

  β†’ the higher, the more certain

Marx & Vreeken (2017)

SLIDE 44

Confidence and Significance

How certain are we? Is a given inference significant?

β„‚ = ( 𝐿(𝑋) + 𝐿(π‘Œ ∣ 𝑋) ) / ( 𝐿(𝑋) + 𝐿(π‘Œ) ) βˆ’ ( 𝐿(π‘Œ) + 𝐿(𝑋 ∣ π‘Œ) ) / ( 𝐿(𝑋) + 𝐿(π‘Œ) )

  β†’ the higher, the more certain
  β†’ robust w.r.t. sample size

  • our null hypothesis is that 𝑿 and 𝒀 are only correlated, i.e. that both directions compress equally well; the gain over it is β„“ = |𝐿(π‘‹β†’π‘Œ) βˆ’ 𝐿(π‘Œβ†’π‘‹)| / 2
  • we can use the no-hypercompression inequality to test significance (see the sketch after this slide)

Pr( 𝐿₀(𝐷) βˆ’ 𝐿(𝐷) β‰₯ β„“ ) ≀ 2^(βˆ’β„“)

GrΓΌnwald (2007), Marx & Vreeken (2017)
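
Given the four codelengths, the confidence and the significance test look as follows in a sketch; how the codelengths themselves are obtained is the subject of the previous slides, and the function name is illustrative.

```python
def infer_direction(L_X, L_Y_given_X, L_Y, L_X_given_Y):
    L_xy = L_X + L_Y_given_X          # total cost of the direction X -> Y
    L_yx = L_Y + L_X_given_Y          # total cost of the direction Y -> X
    direction = "X->Y" if L_xy < L_yx else "Y->X"
    conf = abs(L_xy - L_yx) / (L_X + L_Y)   # magnitude of the normalized C
    # no-hypercompression: gain over the 'only correlated' null hypothesis
    gain = abs(L_xy - L_yx) / 2
    p_value = 2.0 ** (-gain)
    return direction, conf, p_value
```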

SLIDE 45

Performance on Benchmark Data

(TΓΌbingen benchmark: 97 univariate numeric cause-effect pairs, weighted)

Marx & Vreeken (2017)

SLIDE 46

Performance on Benchmark Data

(TΓΌbingen benchmark: 97 univariate numeric cause-effect pairs, weighted)

Inferences of state-of-the-art algorithms, ordered by confidence value. SLOPE is 85% accurate with 𝛽 = 0.001

Marx & Vreeken (2017)

SLIDE 47

Deep Learning

Model selection in deep learning is hard

  • way too many β€˜free’ parameters for standard regularizers,
  • no meaningful prior over networks, and
  • a uniform prior will lead to overfitting

How about an MDL approach?

  • what is the description length of a neural network?

SLIDE 48

MDL for Neural Networks

Suppose neural network 𝐻 ∈ β„‹ predicts target 𝑦 given π‘₯, i.e. 𝑦̂ = 𝐻(π‘₯)

How do we encode the data given the model?

  • if 𝐻(π‘₯) is probabilistic, we have 𝐿(π’š ∣ 𝐻(𝒙)) = βˆ’ βˆ‘α΅’ log 𝑝(𝑦ᡒ ∣ π‘₯α΅’)
  • else we can simply encode the residual error,
      β—¦ e.g. if π’š is binary, we have 𝒆 = π’š βŠ• π’šΜ‚, and 𝐿(π’š ∣ 𝐻(𝒙)) = log 𝑛 + log binom(𝑛, |𝒆|)
      β—¦ e.g. if π’š is continuous, we can encode using a zero-mean Gaussian
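
A sketch of these three options for 𝐿(π’š ∣ 𝐻(𝒙)), in bits; the helper names are illustrative, and log2_binom is defined as in the BMF sketch.

```python
import numpy as np
from scipy.special import gammaln

LOG2E = np.log2(np.e)

def log2_binom(n, k):
    return (gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)) * LOG2E

def bits_probabilistic(probs_of_true):
    # -sum_i log2 p(y_i | x_i): the network's log-loss, in bits
    return -np.sum(np.log2(probs_of_true))

def bits_binary_residual(y, y_hat):
    # encode where the predictions are wrong: their count, then positions
    e = np.asarray(y) ^ np.asarray(y_hat)
    n = e.size
    return np.log2(n) + log2_binom(n, e.sum())

def bits_gaussian_residual(y, y_hat):
    # encode continuous residuals with a zero-mean Gaussian of fitted variance
    r = np.asarray(y, float) - np.asarray(y_hat, float)
    var = np.mean(r**2) + 1e-12
    return 0.5 * r.size * np.log2(2 * np.pi * np.e * var)
```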

SLIDE 49

MDL for Neural Networks

Suppose neural network 𝐻 ∈ β„‹ predicts target 𝑦 given π‘₯, i.e. 𝑦̂ = 𝐻(π‘₯)

How do we encode the model?

  • we could encode all of the parameters, but that’s highly ad hoc
  • instead, we can use the notion of prequential coding

Dawid (1984), Barron et al (1998), GrΓΌnwald (2007), Blier & Ollivier (2018)

SLIDE 50

Prequential Coding

Simple, elegant idea: β€œUpdate your model after every message”

That is, we re-train our network after β€˜every’ new label

  • we initialize topology 𝐻 ∈ β„‹ with fixed weights, obtaining 𝐻₀
  • we transmit the first π‘˜ labels using 𝐻₀
  • we then train 𝐻 on this first batch of π‘˜ labelled points, obtaining 𝐻₁
  • we transmit the second π‘˜ labels using 𝐻₁
  • we then train 𝐻 on the first two batches, obtaining 𝐻₂
  • …

Dawid (1984), Barron et al (1998), GrΓΌnwald (2007), Blier & Ollivier (2018)

SLIDE 51

Prequential Coding

Simple, elegant idea: β€œUpdate your model after every message”

𝐿(𝐷 ∣ β„‹) = βˆ‘β±Ό 𝐿(𝐷ⱼ ∣ 𝐻ⱼ₋₁)

Best of all, this is not a crude, but a refined MDL code!

  • depends fully on how 𝐻 behaves on the data
  • no arbitrary choices on how to encode 𝐻
  • within a constant of 𝐿(𝐷 ∣ 𝐻*), and this constant only depends on β„‹

Dawid (1984), Barron et al (1998), GrΓΌnwald (2007)
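
A sketch of the resulting codelength computation; train and bits are placeholders for your learner and for a per-batch codelength 𝐿(𝐷ⱼ ∣ 𝐻ⱼ₋₁), e.g. one of the encodings from slide 48.

```python
def prequential_codelength(batches, init_model, train, bits):
    """Prequential coding: transmit batch j with the model trained on
    batches 1..j-1, then re-train on everything transmitted so far."""
    model, total, seen = init_model, 0.0, []
    for batch in batches:
        total += bits(batch, model)   # encode the next batch with H_{j-1}
        seen.append(batch)
        model = train(seen)           # obtain H_j from all batches so far
    return total
```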

SLIDE 52

Schedule

8:00am   Opening
8:10am   Introduction to MDL
8:50am   MDL in Action
9:30am   ––– break –––
10:00am  Stochastic Complexity
11:00am  MDL in Dynamic Settings