

SLIDE 1

Maximum Entropy & Subjective Interestingness

Jilles Vreeken

26 June 2015

SLIDE 2

Questions of the day

How can we find things that are interesting with regard to what we already know? How can we measure subjective interestingness?

SLIDE 3

What is interesting?

something that increases our knowledge about the data

SLIDE 4

What is a good result?

something that reduces our uncertainty about the data

(i.e., increases the likelihood of the data)

SLIDE 5

What is really good?

something that, in simple terms, strongly reduces our uncertainty about the data

(maximise likelihood, but avoid overfitting)

SLIDE 6

Let’s make this visual

[diagram: our dataset D within the universe of possible datasets]
SLIDE 7

Given what we know

[diagram: our dataset D among the possible datasets given current knowledge: dimensions, margins]

SLIDE 8

More knowledge...

[diagram: our dataset D among the possible datasets given dimensions, margins, pattern P1]

SLIDE 9

Fewer possibilities...

[diagram: our dataset D among the possible datasets given dimensions, margins, patterns P1 and P2]

SLIDE 10

Less uncertainty.

[diagram: our dataset D within all possible datasets given dimensions, margins, the key structure]

SLIDE 11

Maximising certainty

[diagram: our dataset D among all possible datasets given dimensions, margins, patterns P1 and P2; the region marks the knowledge added by P2]

SLIDE 12

How can we define ‘uncertainty’ and ‘simplicity’?

interpretability and informativeness are intrinsically subjective

SLIDE 13

Measuring Uncertainty

We need access to the likelihood of data 𝐷 given background knowledge 𝐵

𝑞(𝐷 ∣ 𝐵)

such that we can calculate the gain for a result 𝑋

𝑞(𝐷 ∣ 𝐵 ∪ 𝑋) − 𝑞(𝐷 ∣ 𝐵)

…which distribution should we use?

SLIDE 14

Measuring Surprise

We need access to the likelihood of a result 𝑋 given background knowledge 𝐵

𝑞(𝑋 ∣ 𝐵)

such that we can mine the data for 𝑋 that have a low likelihood, that are surprising

…which distribution should we use?

SLIDE 15

Approach 2: Maximum Entropy

‘the best distribution 𝑞∗ satisfies the background knowledge, but makes no further assumptions’

(Jaynes 1957; De Bie 2009)

SLIDE 16

Approach 2: Maximum Entropy

‘the best distribution 𝑞∗ satisfies the background knowledge, but makes no further assumptions’

in other words, 𝑞∗ assigns the correct probability mass to the background knowledge instances: 𝑞∗ is a maximum likelihood estimator

(Jaynes 1957; De Bie 2009)

SLIDE 17

Approach 2: Maximum Entropy

‘the best distribution 𝑞∗ satisfies the background knowledge, but makes no further assumptions’

in other words, 𝑞∗ spreads probability mass around as evenly as possible: 𝑞∗ does not have any specific bias

(Jaynes 1957; De Bie 2009)

SLIDE 18

Approach 2: Maximum Entropy

‘the best distribution 𝑞∗ satisfies the background knowledge, but makes no further assumptions’

very useful for data mining: unbiased measurement of subjective interestingness

(Jaynes 1957; De Bie 2009)

SLIDE 19

Constraints and Distributions

Let 𝐶 be our set of constraints, 𝐶 = {𝑓₁, …, 𝑓ₙ}

Let 𝒫 be the set of admissible distributions, 𝒫 = { 𝑞 ∈ 𝐐 ∣ E_𝑞[𝑓ⱼ] = 𝛽ⱼ for all 𝑓ⱼ ∈ 𝐶 }

We need the most uniformly distributed 𝑞 ∈ 𝒫
SLIDE 20

Uniformity and Entropy

Uniformity ↔ Entropy

𝐻(𝑞) = − ∑_{𝑦∈𝐘} 𝑞(𝑌 = 𝑦) log 𝑞(𝑌 = 𝑦)

tells us the entropy of a (discrete) distribution 𝑞
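As a quick aside (not part of the deck): a minimal Python sketch of this entropy measure, using log base 2 so the answer is in bits. The uniform distribution attains the maximum; any peak lowers it.

```python
import numpy as np

def entropy(q):
    """Shannon entropy H(q) = -sum_y q(y) log q(y) of a discrete distribution."""
    q = np.asarray(q, dtype=float)
    q = q[q > 0]                      # by convention 0 log 0 = 0
    return -np.sum(q * np.log2(q))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: uniform, maximal
print(entropy([0.70, 0.10, 0.10, 0.10]))  # ~1.36 bits: peaked, lower
```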
SLIDE 21

Maximum Entropy

We want access to the distribution 𝑞∗ with maximum entropy

𝑞∗_𝐶 = argmax_{𝑞 ∈ 𝒫} 𝐻(𝑞)

better known as the maximum entropy model for constraint set 𝐶

SLIDE 22

Maximum Entropy

We want access to the distribution 𝑞∗ with maximum entropy

𝑞∗_𝐶 = argmax_{𝑞 ∈ 𝒫} 𝐻(𝑞)

better known as the maximum entropy model for constraint set 𝐶

It can be shown that 𝑞∗ is well defined: there always exists a unique 𝑞∗ with maximum entropy for any constraint set 𝐶

(that’s not completely true, some esoteric exceptions exist)

SLIDE 23

Does this make sense?

Any distribution with less-than-maximal entropy must have a reason for this. Less entropy means not-as-uniform-as-possible, that is, undue peaks of probability mass. That is, reduced entropy = latent assumptions, exactly what we want to avoid!

SLIDE 24

Optimal worst-case

Recall that through Kraft’s inequality

probability distribution ↔ encoding

The MaxEnt distribution for 𝐶 gives the minimum worst-case expected encoded length over any distribution that satisfies this background knowledge.
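In symbols (a sketch of the standard coding-theoretic reading, not verbatim from the deck): Kraft’s inequality ties code lengths to probabilities, and the MaxEnt model plays the role of the robust, worst-case optimal code against all distributions satisfying the background knowledge.

```latex
% Kraft: a prefix-free code with code-word lengths L(y) exists iff
\sum_{y \in \mathcal{Y}} 2^{-L(y)} \le 1
% so every distribution q induces a code with lengths
L_q(y) = -\log_2 q(y)
% and the MaxEnt model is the minimax (worst-case optimal) code:
q^*_C = \arg\min_{q} \; \max_{p \in \mathcal{P}} \; \mathbb{E}_{y \sim p}\!\left[-\log_2 q(y)\right]
```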

SLIDE 25

Some examples

 interval? → uniform
 mean and variance? → Gaussian
 mean and positive? → exponential
 mean and discrete? → geometric
 …

But… what about distributions for, like, data, patterns, and stuff?
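For reference, the pairings in the list above are the standard maximum-entropy families; each follows from the exponential form derived on the next slides (support plus constraint determines the distribution):

```latex
\begin{aligned}
\text{support } [a,b],\ \text{no further constraints} &\;\Rightarrow\; \text{uniform on } [a,b]\\
\text{support } \mathbb{R},\ \text{fixed mean and variance} &\;\Rightarrow\; \text{Gaussian}\\
\text{support } [0,\infty),\ \text{fixed mean} &\;\Rightarrow\; \text{exponential}\\
\text{support } \{0,1,2,\dots\},\ \text{fixed mean} &\;\Rightarrow\; \text{geometric}
\end{aligned}
```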

SLIDE 26

MaxEnt Theory

To use MaxEnt, we need theory for modelling data given background knowledge

Real-valued Data
 margins (Kontonasios et al. ’11)
 sets of cells (Kontonasios et al. ’13)

Patterns
 itemset frequencies (Tatti ’06, Mampaey et al. ’11)

Binary Data
 margins (De Bie ’09)
 tiles (Tatti & Vreeken ’12)
SLIDE 27

Finding the MaxEnt distribution

You can find the MaxEnt distribution by solving the following optimisation problem under linear constraints*

max_{𝑞(𝑦)} − ∑_𝑦 𝑞(𝑦) log 𝑞(𝑦)

s.t. ∑_𝑦 𝑞(𝑦) 𝑓ⱼ(𝑦) = 𝛽ⱼ for all 𝑗

∑_𝑦 𝑞(𝑦) = 1

* for discrete data
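A minimal sketch of this primal problem in Python, assuming a small finite domain so we can optimise over 𝑞 directly (scipy’s SLSQP handles the linear equality constraints; the constraint matrix F and targets beta are toy inputs, not from the deck):

```python
import numpy as np
from scipy.optimize import minimize

def maxent_primal(F, beta):
    """Solve  max_q -sum_y q(y) log q(y)  s.t.  F @ q = beta, sum(q) = 1, q >= 0.

    F    : (k, m) matrix with F[j, y] = f_j(y) over a finite domain of m outcomes
    beta : (k,) target expectations beta_j
    """
    k, m = F.shape
    def neg_entropy(q):
        return np.sum(q * np.log(np.clip(q, 1e-12, None)))
    cons = [{"type": "eq", "fun": lambda q: F @ q - beta},
            {"type": "eq", "fun": lambda q: q.sum() - 1.0}]
    res = minimize(neg_entropy, x0=np.full(m, 1.0 / m),
                   bounds=[(0.0, 1.0)] * m, constraints=cons, method="SLSQP")
    return res.x

# Jaynes' die example: six outcomes, expected value constrained to 4.5
F = np.arange(1, 7, dtype=float).reshape(1, 6)
q_star = maxent_primal(F, np.array([4.5]))
```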

SLIDE 28

Exponential Form

Let 𝑞 be a probability density satisfying the constraints

∫ 𝑞(𝑦) 𝑓ⱼ(𝑦) d𝑦 = 𝛽ⱼ  for 1 ≤ 𝑗 ≤ 𝑛

then we can write the MaxEnt distribution as

𝑞∗(𝑦) = 𝑞_𝜆(𝑦) ∝ exp( 𝜆₀ + ∑_{𝑓ⱼ ∈ 𝐶} 𝜆ⱼ 𝑓ⱼ(𝑦) )  for 𝑦 ∉ 𝒵,  and 𝑞∗(𝑦) = 0 for 𝑦 ∈ 𝒵

where we choose the lambdas, the Lagrange multipliers, to satisfy the constraints, and where 𝒵 is the collection of databases 𝐷 with 𝑞(𝐷) = 0 for all 𝑞 ∈ 𝒫

(Csiszár 1975)

SLIDE 29

Solving the MaxEnt

The Lagrangian is

𝐿(𝑞(𝑦), 𝜇, 𝜆) = − ∑_𝑦 𝑞(𝑦) log 𝑞(𝑦) + ∑ⱼ 𝜆ⱼ ( ∑_𝑦 𝑞(𝑦) 𝑓ⱼ(𝑦) − 𝛽ⱼ ) + 𝜇 ( ∑_𝑦 𝑞(𝑦) − 1 )

We set the derivative w.r.t. 𝑞(𝑦) to 0 and get

𝑞(𝑦) = 1/𝑍(𝜆) · exp( ∑ⱼ 𝜆ⱼ 𝑓ⱼ(𝑦) )

where 𝑍(𝜆) = ∑_𝑦 exp( ∑ⱼ 𝜆ⱼ 𝑓ⱼ(𝑦) ) is called the partition function

SLIDE 30

En Garde!

We may substitute 𝑞(𝑦) in the Lagrangian to obtain the dual objective

𝐿(𝜆) = log 𝑍(𝜆) − ∑ⱼ 𝜆ⱼ 𝛽ⱼ

Minimizing the dual gives the maximal solution to the original problem. Moreover, it is convex.

SLIDE 31

Inferring the Model

The problem is convex, which means we can use any convex optimization strategy. Standard approaches include iterative scaling, gradient descent, conjugate gradient descent, Newton’s method, etc.
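As an illustration (my sketch, not the deck’s): plain gradient descent on the dual of slide 30. The gradient of log 𝑍(𝜆) − ∑ⱼ 𝜆ⱼ 𝛽ⱼ is E_{𝑞_𝜆}[𝑓] − 𝛽, so each step nudges the multipliers until the model’s expectations match the constraints:

```python
import numpy as np

def maxent_dual(F, beta, lr=0.1, iters=5000):
    """Minimise the convex dual L(lam) = log Z(lam) - lam . beta by gradient descent."""
    lam = np.zeros(F.shape[0])
    for _ in range(iters):
        u = lam @ F                    # lam . f(y) for every outcome y
        u -= u.max()                   # stabilise the exponentials
        q = np.exp(u); q /= q.sum()    # q_lam(y) = exp(lam . f(y)) / Z(lam)
        lam -= lr * (F @ q - beta)     # gradient: E_q[f] - beta
    return q

# same die example as above: expected value 4.5 tilts mass to the high faces
F = np.arange(1, 7, dtype=float).reshape(1, 6)
q_star = maxent_dual(F, np.array([4.5]))
```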

SLIDE 32

Inferring the Model

Optimization requires calculating probabilities under the model: for datasets and tiles this is easy; for itemsets and frequencies, however, this is PP-hard.

SLIDE 33

MaxEnt for Binary Databases

Constraints: the expected row and column margins

∑_{𝐷 ∈ {0,1}^{𝑛×𝑚}} 𝑞(𝐷) ∑_{𝑗=1..𝑚} 𝐷ᵢⱼ = 𝑟ᵢ   (expected row margins, 𝑖 = 1..𝑛)

∑_{𝐷 ∈ {0,1}^{𝑛×𝑚}} 𝑞(𝐷) ∑_{𝑖=1..𝑛} 𝐷ᵢⱼ = 𝑐ⱼ   (expected column margins, 𝑗 = 1..𝑚)

(De Bie 2010)

SLIDE 34

MaxEnt for Binary Databases

Using the Lagrangian, we can solve for 𝑞(𝐷):

𝑞(𝐷) = 1/𝑍(𝜆ʳ, 𝜆ᶜ) · exp( ∑_{𝑖,𝑗} 𝐷ᵢⱼ (𝜆ʳᵢ + 𝜆ᶜⱼ) )

where

𝑍(𝜆ʳ, 𝜆ᶜ) = ∏_{𝑖,𝑗} ∑_{𝐷ᵢⱼ ∈ {0,1}} exp( 𝐷ᵢⱼ (𝜆ʳᵢ + 𝜆ᶜⱼ) )

SLIDE 35

MaxEnt for Binary Databases

Using the Lagrangian, we can solve for 𝑞(𝐷):

𝑞(𝐷) = 1/𝑍(𝜆ʳ, 𝜆ᶜ) · exp( ∑_{𝑖,𝑗} 𝐷ᵢⱼ (𝜆ʳᵢ + 𝜆ᶜⱼ) )

Hey! 𝑞(𝐷) is a product of independent elements! That’s handy! We did not enforce this property, it’s a consequence of MaxEnt. Hence, every element is Bernoulli distributed, with a success probability of

exp(𝜆ʳᵢ + 𝜆ᶜⱼ) / (1 + exp(𝜆ʳᵢ + 𝜆ᶜⱼ))
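A hedged sketch of fitting this margin model in Python (gradient steps on the dual; the learning rate, iteration count, and toy margins are my choices, not De Bie’s algorithm). Every cell comes out Bernoulli with success probability sigmoid(𝜆ʳᵢ + 𝜆ᶜⱼ):

```python
import numpy as np

def maxent_margins(r, c, iters=2000, lr=0.5):
    """Fit lam_r, lam_c so that the expected row/column sums of the cell
    probabilities  P[i, j] = sigmoid(lam_r[i] + lam_c[j])  match r and c."""
    n, m = len(r), len(c)
    lam_r, lam_c = np.zeros(n), np.zeros(m)
    for _ in range(iters):
        P = 1.0 / (1.0 + np.exp(-(lam_r[:, None] + lam_c[None, :])))
        lam_r -= lr / m * (P.sum(axis=1) - r)   # match expected row margins
        lam_c -= lr / n * (P.sum(axis=0) - c)   # match expected column margins
    return P

# toy 3 x 4 example; after convergence P.sum(axis=1) ~ r and P.sum(axis=0) ~ c
P = maxent_margins(r=np.array([3., 2., 1.]), c=np.array([2., 2., 1., 1.]))
```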

SLIDE 36

What have you done for me lately?

Okay, say we have this 𝑞∗, what is it useful for? Given 𝑞∗ we can

 sample data from 𝑞∗, and compute empirical p-values (just like with swap randomization; sketched below)
 compute the likelihood of the observed data, and
 compute how surprising our findings are given 𝑞∗, and compute exact p-values
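For the first use, a small sketch (reusing the cell-probability matrix P from the margin model above; the tile statistic is a made-up example):

```python
import numpy as np
rng = np.random.default_rng(0)

def empirical_p_value(P, observed, stat, n_samples=1000):
    """Sample datasets cell-wise from q* (independent Bernoulli cells) and
    count how often the statistic is at least as extreme as observed."""
    hits = sum(stat(rng.random(P.shape) < P) >= observed
               for _ in range(n_samples))
    return (hits + 1) / (n_samples + 1)   # add-one so p is never exactly 0

# e.g. how surprising is the number of 1s inside a fixed 2 x 2 tile?
stat = lambda D: D[:2, :2].sum()
# p = empirical_p_value(P, stat(D_observed), stat)
```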

SLIDE 37

Expected vs. Actual

Swap randomization and MaxEnt can both maintain margins. MaxEnt constrains the expected margins. Swap randomization constrains the actual margins. Does this matter?

SLIDE 38

MaxEnt Theory

To use MaxEnt, we need theory for modelling data given background knowledge

Real-valued Data
 margins (Kontonasios et al. ’11)
 arbitrary sets of cells (now)

these allow for iterative mining

Binary Data
 margins (De Bie ’09)
 tiles (Tatti & Vreeken ’12)

SLIDE 39

MaxEnt for Real-Valued Data

Current state of the art can incorporate means, variances, and higher-order moments, as well as histogram information, over arbitrary sets of cells

(Kontonasios et al. 2013)

SLIDE 40

MaxEnt for Real-Valued Data

.9 .8 .7 .4 .5 .5 .5
.7 .8 .9 .3 .5 .3 .5
.8 .8 .8 .6 .3 .4 .2
.7 .9 .7 .7 .3 .2 .5
.2 .8 .7 .8 .4 .4 .1
.3 .6 .9 .8 .3 .8 .3
.2 .1 .3 .4 .5 .3 .2

SLIDE 41

MaxEnt for Real-Valued Data

.9 .8 .7 .4 .5 .5 .5
.7 .8 .9 .3 .5 .3 .5
.8 .8 .8 .6 .3 .4 .2
.7 .9 .7 .7 .3 .2 .5
.2 .8 .7 .8 .4 .4 .1
.3 .6 .9 .8 .3 .8 .3
.2 .1 .3 .4 .5 .3 .2

Pattern 1: {1−3} × {1−4}, mean 0.8
Pattern 2: {2−3} × {3−5}, mean 0.8
Pattern 3: {5−7} × {3−5}, mean 0.3

SLIDE 42

MaxEnt for Real-Valued Data

.9 .8 .7 .4 .5 .5 .5 .6
.7 .8 .9 .3 .5 .3 .5 .6
.8 .8 .8 .6 .3 .4 .2 .6
.7 .9 .7 .7 .3 .2 .5 .6
.2 .8 .7 .8 .4 .4 .1 .5
.3 .6 .9 .8 .3 .8 .3 .6
.2 .1 .3 .4 .5 .3 .2 .3
.5 .7 .7 .6 .4 .4 .3 .5

Pattern 1: {1−3} × {1−4}, mean 0.8
Pattern 2: {2−3} × {3−5}, mean 0.8
Pattern 3: {5−7} × {3−5}, mean 0.3

SLIDE 43

MaxEnt for Real-Valued Data

Pattern 1: {1−3} × {1−4}, mean 0.8
Pattern 2: {2−3} × {3−5}, mean 0.8
Pattern 3: {5−7} × {3−5}, mean 0.3

[model estimate: every cell .5]

SLIDE 44

MaxEnt for Real-Valued Data

.6 .7 .7 .7 .5 .6 .4 .6
.7 .6 .6 .6 .4 .4 .6 .6
.6 .7 .7 .6 .5 .5 .3 .6
.6 .6 .7 .6 .5 .4 .5 .6
.5 .7 .6 .6 .5 .4 .3 .5
.5 .7 .7 .6 .5 .6 .3 .6
.3 .6 .6 .3 .2 .2 .2 .3
.5 .7 .7 .6 .4 .4 .4 .5

Pattern 1: {1−3} × {1−4}, mean 0.8
Pattern 2: {2−3} × {3−5}, mean 0.8
Pattern 3: {5−7} × {3−5}, mean 0.3

(Kontonasios et al., 2011)

SLIDE 45

MaxEnt for Real-Valued Data

.8 .8 .8 .6 .4 .4 .4 .6
.8 .8 .8 .6 .4 .4 .4 .6
.8 .8 .8 .6 .4 .4 .4 .6
.8 .8 .8 .6 .4 .4 .4 .6
.2 .6 .6 .6 .4 .5 .4 .5
.3 .6 .6 .6 .6 .6 .6 .6
.1 .3 .3 .3 .4 .4 .3 .3
.5 .7 .7 .6 .4 .4 .4 .5

Pattern 1: {2−4} × {2−5}, mean 1.9
Pattern 2: {2−3} × {3−5}, mean 0.8
Pattern 3: {5−7} × {3−5}, mean 0.3

(Kontonasios et al. 2013)
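To make the constraint concrete: each pattern fixes the mean of the cells its tile covers, nothing more. A tiny helper (mine, not from the paper; note that numpy is 0-based while the slide ranges are 1-based):

```python
import numpy as np

def tile_mean(D, rows, cols):
    """Mean of the cells covered by a tile (row range x column range)."""
    return D[np.ix_(rows, cols)].mean()

# Pattern 2 of the running example: rows {2-3} x columns {3-5}, 1-based
D = np.random.rand(8, 8)                       # stand-in for the slide's matrix
print(tile_mean(D, rows=range(1, 3), cols=range(2, 5)))
```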

SLIDE 46

Simplicity?

Likelihood alone is insufficient: it does not take size or complexity into account.

As a practical example of our model: the Information Ratio, for tiles in real-valued data.

SLIDE 47

Information Ratio

SLIDE 48

Results

Rank  It 1  It 2  It 3  It 4  It 5  Final
 1.   A2    B3    A3    B2    C3    A2
 2.   A4    B4    B2    C3    C4    B3
 3.   A3    B2    C3    C4    C2    A3
 4.   B3    A3    C4    C2    D2    B2
 5.   B4    C3    C2    B4    D4    C3
 6.   B2    C4    B4    D2    D3    C2
 7.   C3    C2    D2    D4    D1    D2
 8.   C4    D2    D4    D3    A5    D3
 9.   C2    D4    D3    D1    21    A5
10.   D2    D3    B1    A5    B5    B5

Synthetic Data
 random Gaussian
 4 ‘complexes’ (ABCD) of 5 overlapping tiles (x2 + x3 big with low overlap)

Patterns
 real + random tiles

Task
 rank on InfRatio, add best to model, iterate

SLIDE 49

Results

Real Data
 gene expression

Patterns
 bi-clusters from external study

Legend:
 solid line: histograms
 dashed line: means/var

SLIDE 50

Conclusions

Randomization
 simple yet powerful
– difficult to extend
– empirical p-values

Maximum Entropy modelling
 complex yet powerful
 inferring can be NP-hard
 analytical model can calculate exact probabilities
 can be defined for anything …if you can derive the model…

Iterative Data Mining
 mine the most informative thingy, update the model, repeat (see the sketch below)
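The iterative loop, as a sketch (the model.score / model.update interface is hypothetical, standing in for surprise under the current MaxEnt model and for folding a pattern into the background knowledge):

```python
def iterative_mining(D, candidates, model, k=10):
    """Repeatedly report the most informative pattern and update the model."""
    chosen = []
    for _ in range(k):
        best = max(candidates, key=lambda X: model.score(X, D))  # most surprising
        model.update(best)            # add X to the background knowledge
        candidates.remove(best)
        chosen.append(best)
    return chosen
```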

SLIDE 51

Thank you!

Randomization
 simple yet powerful
– difficult to extend
– empirical p-values

Maximum Entropy modelling
 complex yet powerful
 inferring can be NP-hard
 analytical model can calculate exact probabilities
 can be defined for anything …if you can derive the model…

Iterative Data Mining
 mine the most informative thingy, update the model, repeat.