Determining Significance
Jilles Vreeken
19 June 2015
Question of the day
How can we find things that are interesting with regard to what we already know? What is interesting? Something that increases our knowledge about the data
(i.e., increases the likelihood of the data)
(maximise likelihood, but avoid overfitting)
…which distribution should we use?
We need access to the likelihood, such that we can mine the data for X that have a low likelihood, that are surprising.
This is called the p-value
The p-value corresponds to the frequentist probability of the result being generated by the null hypothesis.
We do not want to have to choose a distribution. We want to be able to test significance against what we already know. That is, our null hypothesis is 'the results are explained by what we know about the data'. But what do we know about the data? And how do we test against this?
[Figure: randomization testing. The original data and random data #1, #2, …, #N are each scored with score(X | D), and the scores are compared.]
The fraction of better 'randoms' is the empirical p-value.
Let 𝐸 be our data and 𝐶 our background knowledge. Let 𝑉(𝐶) be the space of all data that satisfies 𝐶, and let 𝑇 ⊆ 𝑉(𝐶) be a uniform random sample of 𝑉(𝐶). Let 𝑆(𝐸) be a single number summarising the result of our data mining method
(e.g., the frequency of an itemset, the number of frequent itemsets at a chosen minsup, the average value …)
The empirical 𝒒-value of 𝑆(𝐸) being 'big' then is

𝑞 = (|{𝐸′ ∈ 𝑇 : 𝑆(𝐸′) ≥ 𝑆(𝐸)}| + 1) / (|𝑇| + 1)
We have the +1s to avoid zeros. If 𝑇 = 𝑉(𝐶) this is an exact test, and then the +1s are not needed. Clearly, the bigger the sample 𝑇, the better: it controls the maximum accuracy, the resolution, of the empirical p-value. If you want to measure significance at 𝑞 = 0.05 you need at least 20 samples (and rather, many, many more).
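To make this concrete, here is a minimal sketch in Python; the names are hypothetical, and `score` and the random datasets stand for whatever your mining method and sampler provide:

```python
def empirical_q_value(data, random_datasets, score):
    """Empirical q-value of score(data) being 'big', given a uniform
    random sample T of datasets satisfying the background knowledge.
    The +1s avoid a q-value of exactly 0."""
    s_orig = score(data)
    better = sum(1 for d in random_datasets if score(d) >= s_orig)
    return (better + 1) / (len(random_datasets) + 1)
```

With |𝑇| = 19 the smallest attainable value is (0 + 1)/(19 + 1) = 0.05, which is exactly the resolution argument above.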
So, now we just need lots of data sets that maintain our background knowledge, but are completely random otherwise. How can we get our hands on such data? And how do we sample it uniformly at random? This depends on the type of data, and on the type(s) of background knowledge we want to maintain. For now, let us simply consider binary data.
Let there be data
(swap randomization, Gionis et al. 2005)

[Example: a binary data matrix with 27 ones.]

Say we only know the overall density. How do we sample random data? Didactically, let us instead consider a very simple Markov chain Monte Carlo scheme.

Margins are easily understandable for binary data. How can we sample data with the same margins?

[The example matrix, now annotated with its row and column margins.]

By MCMC!
[Diagram: the swap graph around 𝑬, with states 𝑬’𝟐, 𝑬’𝟑, 𝑬’𝟒, 𝑬’𝟓 connected to it by single swaps.]
The neighbours 𝐸′ⱼ ∈ 𝑉(𝐶) of 𝐸 are all reachable with one swap from 𝐸.
For unbiased testing, we need to sample uniformly from 𝑉(𝐶). Are all datasets in 𝑉(𝐶) reachable from 𝐸 by swapping? Can the 'swap graph' of 𝑉(𝐶) be disconnected?
Theorem [Ryser '57]. If 𝐴, 𝐵 ∈ 𝑁(𝑠, 𝑑), i.e. both are binary matrices with row margins 𝑠 and column margins 𝑑, then 𝐴 is reachable from 𝐵 with a finite number of swaps.
A path through this graph is called a chain.
Subsequent states in a Markov chain are dependent, which means subsequent samples are dependent. This is not a problem if we let the chain converge between drawing samples, but estimating the mixing time is hard. And if we simply take the original data as the starting point to sample random data, all samples will be biased.
The Besag-Clifford correction helps to avoid biased MCMC sampling. In a nutshell, we first run the chain 𝑙 steps backward, and then 𝑚 times 𝑙 steps forward. Original and random data are now interchangeable. For time-reversible chains, forward = backward.
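A minimal sketch of this correction, assuming a time-reversible chain given as a generic `step` function (a hypothetical name; it takes a state and returns the next one):

```python
def besag_clifford_samples(data, step, l, m):
    """Draw m random datasets exchangeable with the original. For a
    time-reversible chain, running l steps 'backward' is the same as
    running l steps forward: walk l steps from the original data to a
    root state, then grow m independent l-step chains from that root."""
    root = data
    for _ in range(l):              # the 'backward' run (= forward here)
        root = step(root)
    samples = []
    for _ in range(m):              # m independent forward runs
        state = root
        for _ in range(l):
            state = step(state)
        samples.append(state)
    return samples
```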
For unbiased testing, we need to sample uniformly from 𝑉(𝐶). However, not every state 𝑌 has the same degree. That is, we cannot always reach the same number of datasets 𝑍 ∈ 𝑉(𝐶) in one step (swap). The resulting Markov chain hence does not have a uniform stationary distribution. How can we fix this?
1) Add self-loops to ensure each state has the same degree.
2) Use the Metropolis-Hastings algorithm.
In every state 𝑌 we select, uniformly at random, two cells (𝑗, 𝑘) and (𝑙, 𝑚) such that 𝑗 ≠ 𝑙 and 𝑘 ≠ 𝑚, with 𝑌𝑗𝑘 = 𝑌𝑙𝑚 = 1. If 𝑌𝑗𝑚 = 𝑌𝑙𝑘 = 0 they form a swap box, and we swap. Otherwise, we self-loop, count the step, and continue. This chain now does have a uniform stationary distribution, but also a very long burn-in time.
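A minimal sketch of one such step, assuming the data is a 0/1 numpy array; the margins are preserved by construction:

```python
import numpy as np

def swap_step(Y, rng):
    """One step of the self-loop swap chain on a binary matrix."""
    Y = Y.copy()
    ones = np.argwhere(Y == 1)          # coordinates of all 1-cells
    (j, k), (l, m) = ones[rng.choice(len(ones), size=2, replace=False)]
    if j != l and k != m and Y[j, m] == 0 and Y[l, k] == 0:
        Y[j, k] = Y[l, m] = 0           # the cells form a swap box: swap
        Y[j, m] = Y[l, k] = 1
    # otherwise: self-loop, i.e. count the step but leave Y unchanged
    return Y
```

Since `swap_step` returns a new state, it can be plugged directly into the `besag_clifford_samples` sketch above, e.g. with `step=lambda Y: swap_step(Y, rng)` for `rng = np.random.default_rng()`.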
Let 𝑂(𝑌) be the number of neighbours of state 𝑌. For Metropolis-Hastings, we select a neighbour 𝑍 of 𝑌 uniformly at random and make the transition with probability min(𝑂(𝑌)/𝑂(𝑍), 1), selecting 𝑍 by rejection sampling. That is, we try random pairs (𝑗, 𝑘), (𝑙, 𝑚) for 𝑌, and if they define a swap box, applying that swap gives us 𝑍. MH probably converges faster than self-loops. However, skipping all detail, computing 𝑂(𝑍) takes 𝑂(min(𝑚, 𝑛)) time, which is considerably more per step than with self-loops.
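For illustration, a naive sketch of this variant, reusing `swap_step` from above; neighbours are counted here by brute force over all pairs of 1-cells, which only shows the acceptance rule, not the faster bookkeeping the authors use:

```python
from itertools import combinations
import numpy as np

def count_neighbours(Y):
    """Brute-force count of the states reachable from Y in one swap
    (quadratic in the number of 1s; for illustration only)."""
    ones = [tuple(c) for c in np.argwhere(Y == 1)]
    return sum(1 for (j, k), (l, m) in combinations(ones, 2)
               if j != l and k != m and Y[j, m] == 0 and Y[l, k] == 0)

def mh_step(Y, rng):
    """Propose a neighbour Z by rejection sampling, then accept it
    with probability min(O(Y) / O(Z), 1)."""
    while True:                      # retry until we hit a swap box
        Z = swap_step(Y, rng)
        if not np.array_equal(Z, Y):
            break
    if rng.random() < min(count_neighbours(Y) / count_neighbours(Z), 1):
        return Z
    return Y
```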
Clearly, row and column margins are a rather simple type of background knowledge. Can we incorporate more advanced knowledge? The answer is: yes, and no. Hanhijärvi et al. (2009) showed that, beyond margins, we can include
– cluster margins: easy, just only consider rows inside a cluster
– exact itemset margins: but this is intractable in general
– approximate itemset margins: but with a very high mixing time
Swap randomization per se works only for binary data, but the main idea can be extended to other data types. Swap rotation (Ojala et al. 2009) randomizes real-valued data while keeping approximately the same
– mean and variance on rows and columns, or
– value distribution (histogram) on rows and columns
One-element changes:
– resample: redraw the value of a cell (𝑗, 𝑘) from [0, 1] or from 𝐸
– add: add a value 𝛽 ∈ [−𝑡, 𝑡] to a cell (𝑗, 𝑘)
Four-element changes:
– swap-rotate: rotate the values 𝑏, 𝑐 and 𝑐′, 𝑏′ among the cells (𝑗1, 𝑘1), (𝑗1, 𝑘2), (𝑗2, 𝑘1), (𝑗2, 𝑘2); if 𝑏 = 𝑏′ and 𝑐 = 𝑐′ this equals a swap
– mask: add +𝛽 and −𝛽 in a 2×2 pattern, which preserves row and column sums
– swap-discretised: discretise the data into 𝑙 bins, then apply swap randomization
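As a concrete example, a minimal sketch of the mask change on a real-valued numpy matrix; since +𝛽 and −𝛽 cancel in every affected row and column, all row and column sums are preserved exactly:

```python
import numpy as np

def mask_step(Y, t, rng):
    """Four-element 'mask' change: add +b / -b in a 2x2 pattern."""
    Y = Y.copy()
    j1, j2 = rng.choice(Y.shape[0], size=2, replace=False)
    k1, k2 = rng.choice(Y.shape[1], size=2, replace=False)
    b = rng.uniform(-t, t)
    Y[j1, k1] += b; Y[j1, k2] -= b      # row j1 sum unchanged
    Y[j2, k1] -= b; Y[j2, k2] += b      # row j2 and both column sums too
    return Y
```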
The Metropolis algorithm performs a local change and accepts the result with a certain probability. If 𝑌 is the original data and 𝑍 is the result, we accept it with probability 𝑑 × exp{−𝑥 𝐹(𝑌, 𝑍)}, where 𝑑 is a normalisation constant, 𝑥 is a weight parameter, and 𝐹(𝑌, 𝑍) is a difference measure between 𝑌 and 𝑍. How to measure the difference between distributions? For example, by the 𝑀1 norm between unnormalized cdfs, or by comparing histograms.
(swap rotation, Ojala et al. 2009)
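A sketch of this acceptance rule, with 𝐹 taken here as the L1 distance between value histograms (one of the options above); the normalisation constant 𝑑 is dropped, since exp(−𝑥𝐹) is already a valid probability for 𝐹 ≥ 0:

```python
import numpy as np

def accept(Y, Z, x, rng, bins=20):
    """Accept the local change Y -> Z with probability exp(-x * F(Y, Z)),
    where F compares the value histograms of the two matrices."""
    lo = min(Y.min(), Z.min())
    hi = max(Y.max(), Z.max())
    hY, _ = np.histogram(Y, bins=bins, range=(lo, hi))
    hZ, _ = np.histogram(Z, bins=bins, range=(lo, hi))
    F = np.abs(hY - hZ).sum()
    return rng.random() < np.exp(-x * F)
```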
Many ways to test a static null hypothesis: assuming a distribution, swap randomization, MaxEnt.
Ranking based on static significance: mining the top-k most significant patterns, but not suited for iterative mining.
Variations exist for many data types (Ojala '09, Henelius et al. '13), and the approach can be pushed beyond margins (see Hanhijärvi et al. 2009), but it has key disadvantages:
– results that follow from what you know are boring, but choosing a good model (and test) is difficult
– it is simple yet powerful and available for many data types, but slow, difficult to extend to maintain higher-order statistics, and it 'only' gives empirical p-values