Maximum Entropy & Subjective Interestingness
Jilles Vreeken
26 June 2015
Questions of the day
How can we find things that are interesting with regard to what we already know?
How can we measure subjective interestingness?
(i.e., increases the likelihood of the data)
(maximise likelihood, but avoid overfitting)
[Figure: the universe of all possible datasets shrinks as background knowledge grows: knowing the dimensions and margins admits fewer datasets; adding pattern P1, and then pattern P2, narrows the set further. The difference between consecutive sets is the knowledge added by each pattern; together, dimensions, margins, and patterns capture the key structure.]
q(E ∣ C) — the likelihood of data E under background knowledge C
q(E ∣ C ∪ Y) − q(E ∣ C) — the knowledge added by pattern Y
…which distribution should we use?
We need access to the likelihood
𝑞(𝑌 ∣ 𝐶)
such that we can mine the data for X that have a low likelihood, that are surprising
…which distribution should we use?
‘the best distribution 𝑞∗ satisfies the background knowledge, but makes no further assumptions’
(Jaynes 1957; De Bie 2009)
in other words, q∗ assigns the correct probability mass to the background knowledge instances: q∗ is a maximum likelihood estimator
in other words, 𝑞∗ spreads probability mass around as evenly as possible: 𝑞∗ does not have any specific bias
Let C be our set of constraints, C = {g_1, …, g_o}
Let D be the set of admissible distributions, D = { q ∈ Q ∣ E_q[g_j] = β_j for all g_j ∈ C }
We need the most uniform distribution in D
Uniformity ↔ Entropy
I(q) = − ∑_{y ∈ Y} q(Y = y) log q(Y = y)
tells us the entropy of distribution q
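To make this concrete, here is a minimal sketch (our own toy numbers, not from the talk) computing I(q) for small discrete distributions; the uniform distribution attains the maximum entropy.

```python
import numpy as np

# Entropy I(q) = -sum_y q(y) log q(y), here in bits. The example
# distributions are made-up toy values.
def entropy(q):
    q = np.asarray(q, dtype=float)
    q = q[q > 0]                      # by convention, 0 * log 0 = 0
    return -np.sum(q * np.log2(q))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: uniform, maximal
print(entropy([0.70, 0.10, 0.10, 0.10]))  # ~1.36 bits: peaked, less uniform
```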
We want access to the distribution q∗ with maximum entropy
q∗_C = argmax_{q ∈ D} I(q)
better known as the maximum entropy model for constraint set C
It can be shown that q∗ is well defined: there always exists a unique q∗ with maximum entropy for any constraint set D
(that’s not completely true; some esoteric exceptions exist)
Any distribution with less-than-maximal entropy must have a reason for this. Less entropy means not-as-uniform-as-possible, that is, undue peaks of probability mass. In other words, reduced entropy = latent assumptions, exactly what we want to avoid!
Recall that through Kraft’s inequality, probability distribution ↔ encoding: we can assign any outcome y a code of length −log q(y). The MaxEnt distribution for C hence gives the minimum worst-case expected encoded length of any distribution that satisfies this background knowledge.
The MaxEnt distribution given a mean and…
interval? → uniform
variance? → Gaussian
positive support? → exponential
discrete support? → geometric
…
MaxEnt models exist for many forms of background knowledge:
binary data: margins (De Bie ’09), itemset frequencies (Tatti ’06; Mampaey et al. ’11), tiles (Tatti & Vreeken ’12)
real-valued data: margins (Kontonasios et al. ’11), sets of cells (Kontonasios et al. ’13)
You can find the MaxEnt distribution by solving the following optimisation problem with linear constraints
max_{q(y)}  − ∑_y q(y) log q(y)
s.t.  ∑_y q(y) g_j(y) = β_j   for all j
      ∑_y q(y) = 1
(* for discrete data)
Let q be a probability density satisfying the constraints
∫ q(y) g_j(y) dy = β_j   for 1 ≤ j ≤ n
then we can write the MaxEnt distribution as
q∗(y) = q_μ(y) ∝ exp( μ_0 + ∑_{g_j ∈ C} μ_j g_j(y) )   for y ∉ 𝒶, and q∗(y) = 0 for y ∈ 𝒶
where we choose the Lagrange multipliers μ_j to satisfy the constraints, and where 𝒶 is a collection of databases s.t. q(E) = 0 for all q ∈ 𝒬
(Csiszár 1975)
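As a sanity check of this exponential form, here is a short worked instance of our own (matching the ‘discrete → geometric’ row from the table earlier): constraining the mean to β on support {0, 1, 2, …} recovers the geometric distribution.

```latex
q^*(y) \propto \exp(\mu y) = p^{y}, \qquad p = e^{\mu} < 1,\; y \in \{0,1,2,\dots\}
\;\Rightarrow\; q^*(y) = (1-p)\,p^{y}, \qquad
\mathbb{E}_{q^*}[y] = \frac{p}{1-p} \stackrel{!}{=} \beta
\;\Rightarrow\; p = \frac{\beta}{1+\beta}
```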
The Lagrangian is
M(q(y), ν, μ) = − ∑_y q(y) log q(y) + ∑_j μ_j ( ∑_y q(y) g_j(y) − β_j ) + ν ( ∑_y q(y) − 1 )
We set the derivative w.r.t. q(y) to 0 and get
q(y) = (1 / a(μ)) exp( ∑_j μ_j g_j(y) )
where
a(μ) = ∑_y exp( ∑_j μ_j g_j(y) )
is called the partition function
We may substitute q(y) in the Lagrangian to obtain the dual objective
M(μ) = log a(μ) − ∑_j μ_j β_j
Minimizing the dual yields the solution to the primal maximization problem. Better still, the dual is convex.
That the problem is convex means we can use any convex optimization strategy. Standard approaches include iterative scaling, gradient descent, conjugate gradient descent, Newton’s method, etc., as sketched below.
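To illustrate, a minimal sketch (our own toy problem, not code from the talk) that fits a MaxEnt model over a small discrete domain by plain gradient descent on the dual; the domain Y, the constraint functions g_j, the targets β_j, the learning rate, and the iteration count are all made-up choices.

```python
import numpy as np

# Minimal sketch: minimise the convex dual M(mu) = log a(mu) - sum_j mu_j beta_j
# by gradient descent. Domain, constraints, and targets are toy assumptions.

Y = np.arange(6)                        # toy domain: y in {0,...,5}
G = np.stack([Y, (Y % 2 == 0)])         # g_1(y) = y, g_2(y) = [y is even]
beta = np.array([2.0, 0.5])             # target expectations E_q[g_j] = beta_j

mu = np.zeros(len(beta))                # Lagrange multipliers
for _ in range(5000):
    scores = mu @ G                     # sum_j mu_j g_j(y), for every y
    q = np.exp(scores - scores.max())   # shift for numerical stability
    q /= q.sum()                        # normalise: q(y) = exp(...) / a(mu)
    grad = G @ q - beta                 # dM/dmu_j = E_q[g_j] - beta_j
    mu -= 0.1 * grad                    # gradient step on the dual

print("q*:", np.round(q, 3))
print("E_q[g]:", np.round(G @ q, 3), "targets:", beta)
```

Because the dual is convex, any of the listed strategies converges to the same q∗; gradient descent is simply the shortest to show.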
For datasets and tiles this is easy; for itemsets and frequencies, however, this is PP-hard.
Constraints: the expected row and column margins
∑_{j=1}^n E_q[e_ij] = r_i   for each row i,   E ∈ {0,1}^{m×n}
∑_{i=1}^m E_q[e_ij] = c_j   for each column j,   E ∈ {0,1}^{m×n}
(De Bie 2010)
Using the Lagrangian, we can solve q(E):
q(E) = (1 / a(μ^r, μ^c)) exp( ∑_{i,j} e_ij (μ_i^r + μ_j^c) )
where
a(μ^r, μ^c) = ∏_{i,j} ∑_{e_ij ∈ {0,1}} exp( e_ij (μ_i^r + μ_j^c) )
Hey! q(E) is a product of independent elements! That’s handy! We did not enforce this property; it is a consequence of MaxEnt. It follows that every element e_ij is Bernoulli distributed, with success probability
p_ij = exp(μ_i^r + μ_j^c) / (1 + exp(μ_i^r + μ_j^c))
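As a concrete instance, a minimal sketch (toy data of our own, not from the talk) that fits the row and column multipliers by gradient descent until the expected margins match the observed ones:

```python
import numpy as np

# Margins MaxEnt model for binary data: every cell is an independent
# Bernoulli with p_ij = sigmoid(mu_r[i] + mu_c[j]). We fit the multipliers
# so the expected margins match the observed ones. D is a made-up toy matrix.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

D = np.array([[1, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 1, 1]], dtype=float)
r, c = D.sum(axis=1), D.sum(axis=0)            # observed row/column margins

mu_r, mu_c = np.zeros(3), np.zeros(4)          # one multiplier per row/column
for _ in range(5000):
    P = sigmoid(mu_r[:, None] + mu_c[None, :]) # expected value of every cell
    mu_r -= 0.1 * (P.sum(axis=1) - r)          # gradient step on row margins
    mu_c -= 0.1 * (P.sum(axis=0) - c)          # gradient step on column margins

print(np.round(P, 2))                          # fitted cell probabilities
```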
Okay, say we have this 𝑞∗, what is it useful for? Given 𝑞∗ we can
sample data from 𝑞∗, and compute empirical p-values
(just like with swap randomization)
compute the likelihood of the observed data, and compute how surprising our findings are given 𝑞∗,
and compute exact p-values
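For the first of these uses, a small self-contained sketch (again our own toy example; the test statistic is an arbitrary made-up choice, and P stands in for the fitted cell probabilities): because each cell of the margins model is an independent Bernoulli, sampling whole datasets from q∗ is trivial.

```python
import numpy as np

# Empirical p-values under q*: sample datasets cell-by-cell from the
# Bernoulli probabilities P, then compare a statistic on the samples to
# its observed value. D is the toy dataset from the previous sketch; P is
# a placeholder (the overall density) rather than a properly fitted model.

rng = np.random.default_rng(1)
D = np.array([[1, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 1, 1]], dtype=float)
P = np.full(D.shape, D.mean())

def statistic(E):
    # made-up example statistic: rows in which columns 0 and 1 are both 1
    return np.sum(E[:, 0] * E[:, 1])

obs = statistic(D)
samples = rng.random((10000, *D.shape)) < P        # 10,000 datasets from q*
stats = np.array([statistic(E) for E in samples])
p = (1 + np.sum(stats >= obs)) / (1 + len(stats))  # smoothed empirical p-value
print("empirical p-value:", p)
```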
Swap randomization and MaxEnt can both maintain margins. MaxEnt constrains the expected margins. Swap randomization constrains the actual margins. Does this matter?
These models allow for iterative mining:
binary data: margins (De Bie ’09), tiles (Tatti & Vreeken ’12)
real-valued data: margins (Kontonasios et al. ’11), arbitrary sets of cells (now: Kontonasios et al. 2013)
[Figure: an example real-valued dataset, shown as a matrix of values]
Pattern 1: {1−3} × {1−4}, mean 0.8
Pattern 2: {2−3} × {3−5}, mean 0.8
Pattern 3: {5−7} × {3−5}, mean 0.3
[Figure: the example dataset again, with Patterns 1−3 marked]
[Figure: the MaxEnt model given no background knowledge: every cell is estimated at .5]
[Figure: the MaxEnt model after incorporating the means of Patterns 1−3: the cell estimates now follow the pattern structure]
(Kontonasios et al., 2011)
[Figure: the MaxEnt model given Pattern 1: {2−4} × {2−5}, mean 1.9, together with Pattern 2: {2−3} × {3−5}, mean 0.8, and Pattern 3: {5−7} × {3−5}, mean 0.3]
(Kontonasios et al. 2013)
does not take size or complexity into account
for tiles in real-valued data
Top-ranked patterns per iteration:
      It 1  It 2  It 3  It 4  It 5  Final
  1.  A2    B3    A3    B2    C3    A2
  2.  A4    B4    B2    C3    C4    B3
  3.  A3    B2    C3    C4    C2    A3
  4.  B3    A3    C4    C2    D2    B2
  5.  B4    C3    C2    B4    D4    C3
  6.  B2    C4    B4    D2    D3    C2
  7.  C3    C2    D2    D4    D1    D2
  8.  C4    D2    D4    D3    A5    D3
  9.  C2    D4    D3    D1    21    A5
 10.  D2    D3    B1    A5    B5    B5
Synthetic Data: random Gaussian, 4 ‘complexes’ (ABCD) of 5 overlapping tiles (×2 + ×3 big with low overlap)
Patterns: real + random tiles
Task: rank on InfRatio, add the best to the model, iterate
Real Data: gene expression
Patterns: bi-clusters from an external study
Legend: solid lines are histograms, dashed lines are means/variances
Swap randomization: simple yet powerful, but difficult to extend; empirical p-values only.
MaxEnt modelling: complex yet powerful; inference can be NP-hard; an analytical model that can calculate exact probabilities; can be defined for anything… if you can derive the model…
Iterative mining: mine the most informative thingy, update the model, repeat.