 
Determining Significance
Jilles Vreeken
19 June 2015

Question of the day
How can we find things that are interesting with regard to what we already know?

What is interesting?
Something that increases our knowledge about the data.

What is a good result?
Something that reduces our uncertainty about the data (i.e., increases the likelihood of the data).

What is really good?
Something that, in simple terms, strongly reduces our uncertainty about the data (maximise likelihood, but avoid overfitting).

Measuring Uncertainty
We need access to the likelihood of data D given background knowledge B, such that we can calculate the gain for X.
…which distribution should we use?

Measuring Surprise
We need access to the likelihood of result X given background knowledge B, such that we can mine the data for X that have a low likelihood, that are surprising.
This is called the p-value of result X. The p-value corresponds to the frequentist probability of a result at least as extreme as X being generated under the null hypothesis.
…which distribution should we use?

Background Knowledge
We do not want to have to choose a distribution.
We want to be able to test significance against what we already know. That is, our null hypothesis is
    ‘The results are explained by what we know about the data’
But what do we know about the data? And how do we test against this?

Approach 1: Randomization
1. Mine original data
2. Mine random data
3. Determine probability
[Figure: score(X | D) on the original data, compared with the scores on random data #1, #2, ..., #N]
The fraction of better ‘randoms’ is the empirical p-value of result X.
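
The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `randomization_test`, the toy score `cooccurrence`, the toy randomizer `shuffle_second_column`, and the two-item data are all hypothetical and not part of the slides. The randomizer keeps the column totals as the only background knowledge.

```python
import random

def randomization_test(data, mine, randomize, n_random=999, seed=0):
    """Return (score on the original data, fraction of random datasets scoring at least as high)."""
    rng = random.Random(seed)
    original = mine(data)                          # 1. mine original data
    randoms = [mine(randomize(data, rng))          # 2. mine random data
               for _ in range(n_random)]
    better = sum(score >= original for score in randoms)
    return original, better / n_random             # 3. determine probability

# Hypothetical toy setting: every row is a transaction over two items.
def cooccurrence(rows):
    """Score: the number of rows that contain both items."""
    return sum(1 for a, b in rows if a and b)

def shuffle_second_column(rows, rng):
    """Keep both column totals, but break any association between the columns."""
    second = [b for _, b in rows]
    rng.shuffle(second)
    return [(a, b) for (a, _), b in zip(rows, second)]

rows = [(1, 1)] * 20 + [(0, 0)] * 20               # the two items always occur together
score, fraction = randomization_test(rows, cooccurrence, shuffle_second_column)
print(score, fraction)
```

Passing `mine` and `randomize` as functions keeps the skeleton independent of the mining method and of the background knowledge that the randomizer preserves.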

Empirical p-values
Let $E$ be our data and $C$ our background knowledge.
Let $V(C)$ be the space of all data that satisfies $C$.
Let $T \subseteq V(C)$ be a uniform random sample of $V(C)$.
Let $S(E)$ be a single number that our data mining method returns
(e.g., the frequency of an itemset, the number of frequent itemsets at a chosen minsup, the average value over some area, the clustering error, the compressed size of the data, the accuracy, etc.).
The empirical p-value of $S(E)$ being ‘big’ then is
$$p = \frac{|\{E' \in T : S(E') \geq S(E)\}| + 1}{|T| + 1}$$

More on empirical p-values
We have the +1s to avoid 0s. If $T = V(C)$ this is an exact test, and then the +1s are not needed.
Clearly, the bigger the sample $T$ the better. It controls the maximum accuracy, the resolution of the empirical p-value.
If you want to measure significance at $p = 0.05$ you need at least 20 samples (and rather, many, many more).
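
As a minimal sketch, the empirical p-value with its +1 correction, and the resolution that the sample size allows, could be computed as follows. The function names `empirical_p_value` and `smallest_possible_p` and the example scores are made up for illustration.

```python
def empirical_p_value(original_score, random_scores):
    """(number of random scores >= the original score, plus 1) / (number of samples, plus 1)."""
    at_least_as_big = sum(1 for s in random_scores if s >= original_score)
    return (at_least_as_big + 1) / (len(random_scores) + 1)

def smallest_possible_p(n_samples):
    """Best achievable value: no random dataset scores as high as the original."""
    return 1 / (n_samples + 1)

# 2 of the 5 random scores are >= 10, so p = (2 + 1) / (5 + 1) = 0.5
print(empirical_p_value(10, [3, 12, 7, 10, 5]))

# The resolution depends on the sample size only.
print(smallest_possible_p(10))    # above 0.05: 10 samples can never show significance at 0.05
print(smallest_possible_p(20))    # below 0.05
print(smallest_possible_p(999))   # much finer resolution
```

With only a handful of samples the smallest reachable p-value stays above the significance level, which is why many more samples than the bare minimum are preferable.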

Random Data
So, now we just need lots of data sets that
• maintain our background knowledge,
• are completely random otherwise.
How can we get our hands on such data? And how do we sample it uniformly at random?
This depends on the type of data, and the type(s) of background knowledge we want to maintain.

Example: Binary Data
For now, let us simply consider binary data. Let there be data

1 1 1 0 1 1 1
0 1 1 0 1 0 1
1 1 1 1 0 0 0
1 1 1 1 0 0 1
0 1 1 1 0 0 0
0 1 1 1 0 1 0
0 0 0 0 1 0 0

(swap randomization, Gionis et al. 2005)

Say we only know the overall density (here, 27 ones in 49 cells). How to sample random data?
Didactically, let us instead consider a Markov chain Monte Carlo (MCMC) approach. A very simple scheme:
1. select two cells at random,
2. swap values,
3. repeat until convergence.
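
The simple scheme above can be sketched as follows. The function name `randomize_keep_density` and the fixed number of swaps are assumptions for illustration: the slides say "repeat until convergence" and do not state how many swaps that takes.

```python
import random

def randomize_keep_density(matrix, n_swaps, seed=0):
    """Exchange the values of randomly chosen cell pairs; the number of ones never changes."""
    rng = random.Random(seed)
    data = [row[:] for row in matrix]              # work on a copy
    n_rows, n_cols = len(data), len(data[0])
    for _ in range(n_swaps):                       # 3. n_swaps stands in for "until convergence"
        # 1. select two cells at random
        r1, c1 = rng.randrange(n_rows), rng.randrange(n_cols)
        r2, c2 = rng.randrange(n_rows), rng.randrange(n_cols)
        # 2. swap values
        data[r1][c1], data[r2][c2] = data[r2][c2], data[r1][c1]
    return data

example = [
    [1, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 0, 1, 0, 1],
    [1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 1],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 1, 0, 0],
]
random_data = randomize_keep_density(example, n_swaps=1000)
print(sum(map(sum, example)), sum(map(sum, random_data)))
```

Only the overall density is preserved here; row and column totals are free to change.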

Swap Randomization
Margins are easily understandable for binary data. How can we sample data with the same margins?

1 1 1 0 1 1 1 | 6
0 1 1 0 1 0 1 | 4
1 1 1 1 0 0 0 | 4
1 1 1 1 0 0 1 | 5
0 1 1 1 0 0 0 | 3
0 1 1 1 0 1 0 | 4
0 0 0 0 1 0 0 | 1
--------------+---
3 6 6 4 3 2 3 | 27

By MCMC!
1. randomly find a 2-by-2 submatrix with 1s on one diagonal and 0s on the other,
2. swap values (this leaves all row and column margins unchanged),
3. repeat until convergence.
(swap randomization, Gionis et al. 2005)
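
A minimal sketch of these steps, assuming the usual reading of a swap: pick two rows and two columns, and if the four cells have 1s on one diagonal and 0s on the other, flip them. The names `swap_randomize` and `margins` and the fixed number of attempts are illustrative assumptions, not taken from the slides or from the cited paper.

```python
import random

def margins(matrix):
    """Row sums and column sums of a binary matrix."""
    return [sum(row) for row in matrix], [sum(col) for col in zip(*matrix)]

def swap_randomize(matrix, n_attempts, seed=0):
    """Attempt n_attempts swaps; every successful swap preserves all margins."""
    rng = random.Random(seed)
    data = [row[:] for row in matrix]              # work on a copy
    n_rows, n_cols = len(data), len(data[0])
    for _ in range(n_attempts):
        # 1. randomly pick a 2-by-2 submatrix (two distinct rows, two distinct columns)
        r1, r2 = rng.sample(range(n_rows), 2)
        c1, c2 = rng.sample(range(n_cols), 2)
        # 2. swap values if the submatrix has 1s on one diagonal and 0s on the other;
        #    otherwise the attempt is rejected and the data stays as it is
        if data[r1][c1] == data[r2][c2] == 1 and data[r1][c2] == data[r2][c1] == 0:
            data[r1][c1] = data[r2][c2] = 0
            data[r1][c2] = data[r2][c1] = 1
    return data                                    # 3. n_attempts stands in for "until convergence"

example = [
    [1, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 0, 1, 0, 1],
    [1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 1],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 1, 0, 0],
]
randomized = swap_randomize(example, n_attempts=5000)
print(margins(example))
print(margins(randomized))
```

Because the two rows and the two columns are drawn in random order, checking one diagonal orientation is enough to cover both.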

Hopping through Sample Space
[Figure: the data $E$ in the centre, linked by one swap each to its neighbouring datasets $E'_j$]
The neighbours $E'_j \in V(C)$ of $E$ are all reachable with 1 swap from $E$.

Subtle issue
For unbiased testing, we need to sample uniformly from $V(C)$.
Are all datasets in $V(C)$ reachable from $E$ by swapping? Can the ‘swap-graph’ of $V(C)$ be disconnected?
Theorem [Ryser ‘57]. If $A, B \in N(s, d)$, i.e. $A$ and $B$ are binary matrices with the same row and column margins, then $A$ is reachable from $B$ with a finite number of swaps.

Hopping through Sample Space
A path through this graph is called a chain.
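
For a very small case, the connectivity of the swap graph can be checked by brute force. The sketch below is only an illustration: the function names and the 3-by-3 margins are chosen arbitrarily, and exhaustive enumeration is feasible only for tiny matrices.

```python
from itertools import product

def all_with_margins(row_sums, col_sums):
    """Enumerate every binary matrix with the given row and column sums."""
    n_rows, n_cols = len(row_sums), len(col_sums)
    found = []
    for bits in product((0, 1), repeat=n_rows * n_cols):
        rows = tuple(bits[i * n_cols:(i + 1) * n_cols] for i in range(n_rows))
        if ([sum(r) for r in rows] == list(row_sums)
                and [sum(c) for c in zip(*rows)] == list(col_sums)):
            found.append(rows)
    return found

def neighbours(matrix):
    """All matrices that are exactly one swap away."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    for r1, r2 in product(range(n_rows), repeat=2):
        for c1, c2 in product(range(n_cols), repeat=2):
            if (matrix[r1][c1] == matrix[r2][c2] == 1
                    and matrix[r1][c2] == matrix[r2][c1] == 0):
                new = [list(row) for row in matrix]
                new[r1][c1] = new[r2][c2] = 0
                new[r1][c2] = new[r2][c1] = 1
                yield tuple(tuple(row) for row in new)

def reachable_from(start):
    """Graph search over the swap graph, starting at one matrix."""
    seen, frontier = {start}, [start]
    while frontier:
        current = frontier.pop()
        for nxt in neighbours(current):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

space = all_with_margins((2, 1, 1), (2, 1, 1))
reached = reachable_from(space[0])
print(len(space), len(reached))   # equal counts mean the swap graph is connected
```

Every matrix with these margins is reached from the first one, as the theorem states.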

Beware!
Subsequent states in Markov chains are dependent, which means that subsequent samples are dependent.
This is not a problem if we let the chain converge between drawing samples, but estimating mixing time is hard.
If we simply take the original data as the starting point to sample random data, all samples will be biased.
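
One common way to act on this warning is to run the chain for a number of steps before every draw, including the first. The sketch below assumes margin-preserving swaps as the chain's step. The names `attempt_swap` and `draw_samples` and the value of `steps_between` are illustrative guesses, since a sufficient number of steps depends on the mixing time, which is hard to estimate.

```python
import random

def attempt_swap(data, rng):
    """One step of the chain: try a margin-preserving swap, stay put if it is not possible."""
    r1, r2 = rng.sample(range(len(data)), 2)
    c1, c2 = rng.sample(range(len(data[0])), 2)
    if data[r1][c1] == data[r2][c2] == 1 and data[r1][c2] == data[r2][c1] == 0:
        data[r1][c1] = data[r2][c2] = 0
        data[r1][c2] = data[r2][c1] = 1

def draw_samples(matrix, n_samples, steps_between, seed=0):
    """Draw n_samples datasets, running the chain for steps_between steps before each draw."""
    rng = random.Random(seed)
    state = [row[:] for row in matrix]             # the chain starts at the original data
    samples = []
    for _ in range(n_samples):
        for _ in range(steps_between):             # move away from the previous state first
            attempt_swap(state, rng)
        samples.append([row[:] for row in state])  # store a copy of the current state
    return samples

original = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
]
samples = draw_samples(original, n_samples=5, steps_between=200)
print(len(samples))
```

Too few steps between draws leaves the samples close to each other and to the original data, so the value used here should not be read as a recommendation.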