SLIDE 1

Determining Significance

Jilles Vreeken

19 June 2015

SLIDE 2

Question of the day

How can we find things that are interesting with regard to what we already know?

SLIDE 3

What is interesting?

something that increases our knowledge about the data

SLIDE 4

What is a good result?

something that reduces our uncertainty about the data

(i.e., increases the likelihood of the data)

SLIDE 5

What is really good?

something that, in simple terms, strongly reduces our uncertainty about the data

(maximise likelihood, but avoid overfitting)

SLIDE 6

Measuring Uncertainty

We need access to the likelihood of data D given background knowledge B, such that we can calculate the gain for X

…which distribution should we use?

SLIDES 7–10

Measuring Surprise

We need access to the likelihood of result X given background knowledge B, such that we can mine the data for X that have a low likelihood, that are surprising.

…which distribution should we use?

This is called the p-value of result X.

The p-value corresponds to the Frequentist probability of the result being generated by the null hypothesis.

SLIDE 11

We do not want to have to choose a distribution. We want to be able to test significance against what we already know. That is, our null hypothesis is 'The results are explained by what we know about the data.' But, what do we know about the data? And, how do we test against this?

Background Knowledge

SLIDES 12–13

Approach 1: Randomization

1. Mine original data
2. Mine random data
3. Determine probability

[figure: the original data and random data #1 … #N are each mined, and every result X is scored, score(X | D)]

The fraction of better 'randoms' is the empirical p-value of result X.
SLIDES 14–15

Empirical p-values

Let 𝐸 be our data and 𝐶 our background knowledge. Let 𝑉(𝐶) be the space of all data that satisfies 𝐶. Let 𝑇 ⊆ 𝑉(𝐶) be a uniform random sample of 𝑉(𝐶). Let 𝑆(𝐸) be the single number our data mining method returns.

(e.g., the frequency of an itemset, the number of frequent itemsets at a chosen minsup, the average value over some area, the clustering error, the compressed size of the data, the accuracy, etc.)

The empirical 𝑝-value of 𝑆(𝐸) being 'big' then is

$$\frac{\left|\{\, E' \in T : S(E') \ge S(E) \,\}\right| + 1}{\left|T\right| + 1}$$

SLIDES 16–17

More on empirical p-values

The empirical 𝑝-value of 𝑆(𝐸) being 'big' is

$$\frac{\left|\{\, E' \in T : S(E') \ge S(E) \,\}\right| + 1}{\left|T\right| + 1}$$

We have the +1's to avoid 0s. If 𝑇 = 𝑉(𝐶) this is an exact test, and then the +1's are not needed. Clearly, the bigger the sample 𝑇 the better: it controls the maximum accuracy, the resolution, of the empirical p-value. If you want to measure significance at 𝑝 = 0.05 you need at least 20 samples (and rather, many, many more).
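As a concrete illustration, here is a minimal sketch of this computation in Python; the `score` function and the sampler `sample_random` are hypothetical placeholders for whichever statistic and randomization scheme you use:

```python
import numpy as np

def empirical_p_value(data, score, sample_random, n_samples=999, seed=0):
    """Empirical p-value of score(data) being 'big'.

    `score` maps a dataset to a single number S(E); `sample_random(rng)`
    draws one dataset uniformly from V(C), the space of data satisfying
    the background knowledge. Both are user-supplied placeholders.
    """
    rng = np.random.default_rng(seed)
    s_orig = score(data)
    # |{E' in T : S(E') >= S(E)}|
    better = sum(score(sample_random(rng)) >= s_orig for _ in range(n_samples))
    # The +1's avoid a p-value of exactly 0; drop them for an exact test.
    return (better + 1) / (n_samples + 1)
```

Note the resolution argument in code: with n_samples = 999 the smallest attainable value is 1/1000, while with only 19 samples it is already (0 + 1)/(19 + 1) = 0.05.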

SLIDES 18–19

Approach 1: Randomization

(recap of SLIDES 12–13: the fraction of better 'randoms' is the empirical p-value of result X)
SLIDE 20

So, now we just need lots of data sets that

 maintain our background knowledge,
 are completely random otherwise

How can we get our hands on such data? And, how do we sample it uniformly at random? This depends on the type of data, and the type(s) of background knowledge we want to maintain.

Random Data

SLIDE 21

For now, let us simply consider binary data.

Example: Binary Data

SLIDE 22

Let there be data

Example: Binary Data

(swap randomization, Gionis et al. 2005)

[figure: a small binary data matrix with 27 ones]

SLIDE 23

Say we only know overall density. How to sample random data?

Example: Binary Data

(swap randomization, Gionis et al. 2005)

[figure: the binary data matrix; overall density: 27 ones]

SLIDE 24

Didactically, let us instead consider Markov chain Monte Carlo (MCMC).

Very simple scheme (a sketch follows below):

  • 1. select two cells at random,
  • 2. swap values,
  • 3. repeat until convergence.

Example: Binary Data

(swap randomization, Gionis et al. 2005)

[figure: the binary data matrix; overall density: 27 ones]
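A minimal sketch of this scheme in Python, assuming the data is a NumPy 0/1 matrix; the function name and step count are illustrative, not from the slides:

```python
import numpy as np

def randomize_density(X, n_steps=100_000, seed=0):
    """Randomize a 0/1 matrix, preserving only the overall density.

    Each step selects two cells uniformly at random and swaps their
    values; the total number of ones (27 in the slides' example)
    never changes.
    """
    rng = np.random.default_rng(seed)
    Y = X.copy()
    n, m = Y.shape
    for _ in range(n_steps):
        i1, j1 = rng.integers(n), rng.integers(m)
        i2, j2 = rng.integers(n), rng.integers(m)
        Y[i1, j1], Y[i2, j2] = Y[i2, j2], Y[i1, j1]
    return Y
```

For this very simple constraint one could of course sample directly (place 27 ones uniformly at random); the chain is shown for didactic parity with the margin-preserving case that follows.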

SLIDE 25

Margins are easily understandable for binary data; how can we sample data with the same margins?

Swap Randomization

(swap randomization, Gionis et al. 2005)

[figure: the binary data matrix with its row and column margins; 27 ones in total]

SLIDES 26–28

Swap Randomization

By MCMC! (a sketch follows below)

  • 1. randomly find submatrix
  • 2. swap values
  • 3. repeat until convergence

(swap randomization, Gionis et al. 2005)

[figure: the binary data matrix with margins; a 2×2 submatrix with pattern (1 0 / 0 1) is swapped to (0 1 / 1 0), leaving all row and column margins unchanged]
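A sketch of the corresponding swap step in Python, again assuming a NumPy 0/1 matrix; names and the number of swaps are illustrative:

```python
import numpy as np

def swap_randomize(X, n_swaps=10_000, seed=0):
    """Swap-randomize a 0/1 matrix, preserving all row and column margins.

    Each successful step picks 1-cells (j, k) and (l, m) with j != l and
    k != m; if cells (j, m) and (l, k) are 0, the four cells form a
    'swap box' and we rotate it. Failed proposals are simply retried
    here; see the self-loop discussion below for why counting them
    matters for uniformity.
    """
    rng = np.random.default_rng(seed)
    Y = X.copy()
    done = 0
    while done < n_swaps:
        ones = np.argwhere(Y == 1)
        (j, k), (l, m) = ones[rng.choice(len(ones), size=2, replace=False)]
        if j != l and k != m and Y[j, m] == 0 and Y[l, k] == 0:
            Y[j, k] = Y[l, m] = 0   # rotate the swap box:
            Y[j, m] = Y[l, k] = 1   # margins stay exactly the same
            done += 1
    return Y
```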

SLIDE 29

Hopping through Sample Space

[figure: the dataset 𝐸 and its neighbours 𝐸′₂, 𝐸′₃, 𝐸′₄, 𝐸′₅ in the swap graph, connected by single swaps]

The neighbours 𝐸′𝑗 ∈ 𝑉(𝐶) of 𝐸 are all reachable with one swap from 𝐸.

SLIDES 30–31

Subtle issue

For unbiased testing, we need to sample uniformly from 𝑉(𝐶). Are all datasets in 𝑉(𝐶) reachable from 𝐸 by swapping? Can the 'swap-graph' of 𝑉(𝐶) be disconnected?

Theorem [Ryser '57]. If 𝐴 and 𝐵 are binary matrices with the same row and column margins, then 𝐴 is reachable from 𝐵 with a finite number of swaps.

SLIDE 32

Hopping through Sample Space

A path through this graph is called a chain.

SLIDE 33

Subsequent states in Markov chains are dependent. Which means, subsequent samples are dependent. This is not a problem if we let the chain converge between drawing samples, but estimating mixing time is hard. If we simply take the original data as the starting point to sample random data, all samples will be biased.

Beware!

SLIDE 34

The Besag-Clifford correction helps avoid biased MCMC sampling. In a nutshell: we first run the chain 𝑙 steps backward, and then 𝑚 times 𝑙 steps forward. Original and random data are now exchangeable. For time-reversible chains, forward = backward.

Besag-Clifford Correction
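A minimal sketch of this correction, assuming a time-reversible chain (so 'backward' can be run with the same kernel) and a hypothetical `step` function that performs one transition:

```python
import numpy as np

def besag_clifford_samples(E, step, l=10_000, m=100, seed=0):
    """Draw m randomized datasets that are exchangeable with the original E.

    `step(Y, rng)` performs one transition of a time-reversible chain
    (e.g., one self-loop-counted swap attempt); it is a placeholder here.
    We run l steps 'backward' from E (same kernel, by reversibility) to a
    common start Y0, then run l steps forward m times independently.
    """
    rng = np.random.default_rng(seed)
    Y0 = E.copy()
    for _ in range(l):              # backward run: E -> Y0
        Y0 = step(Y0, rng)
    samples = []
    for _ in range(m):              # m independent forward runs: Y0 -> E'
        Y = Y0.copy()
        for _ in range(l):
            Y = step(Y, rng)
        samples.append(Y)
    return samples
```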

SLIDE 35

More Subtle Issues

For unbiased testing, we need to sample uniformly from 𝑉(𝐶). However, not every state 𝑌 has the same degree. That is, we cannot always reach the same number of datasets 𝑍 ∈ 𝑉(𝐶) in one step (swap). The resulting Markov chain hence does not have a uniform stationary distribution. How can we fix this?

1) add self-loops to ensure each state has the same degree
2) use the Metropolis-Hastings algorithm

SLIDES 36–42

Self-Loops

In every state 𝑌 we select, uniformly at random, two cells (𝑗, 𝑘) and (𝑙, 𝑚) such that 𝑗 ≠ 𝑙 and 𝑘 ≠ 𝑚 with 𝑌𝑗𝑘 = 𝑌𝑙𝑚 = 1. If 𝑌𝑗𝑚 = 𝑌𝑙𝑘 = 0 they form a swap box, and we swap. Otherwise, we self-loop, count the step, and continue.

This chain now does have a uniform stationary distribution, but also a very long burn-in time.

[figure: the binary data matrix with row and column margins; candidate cell pairs and swap boxes highlighted]
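A sketch of the self-loop variant; in contrast to the rejection-style sketch above, every proposal counts as a step:

```python
import numpy as np

def self_loop_chain(X, n_steps=100_000, seed=0):
    """Swap chain with self-loops: uniform stationary distribution over V(C).

    Every proposal counts as one step; failed proposals leave the state
    unchanged (a self-loop) instead of being retried, which equalizes
    the effective degree of all states.
    """
    rng = np.random.default_rng(seed)
    Y = X.copy()
    for _ in range(n_steps):
        ones = np.argwhere(Y == 1)
        (j, k), (l, m) = ones[rng.choice(len(ones), size=2, replace=False)]
        if j != l and k != m and Y[j, m] == 0 and Y[l, k] == 0:
            Y[j, k] = Y[l, m] = 0   # swap box found: rotate it
            Y[j, m] = Y[l, k] = 1
        # otherwise: self-loop; the step is still counted
    return Y
```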

SLIDE 43

Swapping with Metropolis-Hastings

Let 𝑂(𝑌) be the number of neighbours of state 𝑌. For Metropolis-Hastings, we select 𝑍 ∈ 𝑂(𝑌) u.a.r. and make the transition with probability min(𝑂(𝑌)/𝑂(𝑍), 1), selecting 𝑍 using rejection sampling. That is, we try random pairs (𝑗, 𝑘), (𝑙, 𝑚) for 𝑌, and if they define a swap box, applying that swap gives us 𝑍.

MH probably converges faster than self-loops. However, skipping all detail, computing 𝑂(𝑍) takes on the order of min(𝑜, 𝑛) time, which is considerably more time per step than with self-loops :-/
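A didactic sketch of the MH step; the brute-force neighbour count is an illustrative assumption that is only feasible for small matrices, not the slides' (cleverer) bookkeeping:

```python
import numpy as np
from itertools import combinations

def num_neighbours(Y):
    """Count swap boxes of Y by brute force: O(n^2 m^2), small matrices only."""
    count = 0
    for r1, r2 in combinations(range(Y.shape[0]), 2):
        for c1, c2 in combinations(range(Y.shape[1]), 2):
            a, b, c, d = Y[r1, c1], Y[r1, c2], Y[r2, c1], Y[r2, c2]
            if (a, b, c, d) in {(1, 0, 0, 1), (0, 1, 1, 0)}:
                count += 1
    return count

def mh_step(Y, rng):
    """One Metropolis-Hastings step over the swap graph (Y must have a swap box)."""
    ones = np.argwhere(Y == 1)
    while True:                     # rejection-sample a neighbour Z of Y
        (j, k), (l, m) = ones[rng.choice(len(ones), size=2, replace=False)]
        if j != l and k != m and Y[j, m] == 0 and Y[l, k] == 0:
            break
    Z = Y.copy()
    Z[j, k] = Z[l, m] = 0
    Z[j, m] = Z[l, k] = 1
    # Accept with probability min(O(Y)/O(Z), 1), which corrects for the
    # unequal degrees so that the uniform distribution is stationary.
    if rng.random() < min(num_neighbours(Y) / num_neighbours(Z), 1.0):
        return Z
    return Y
```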

SLIDE 44

Time till convergence

[figure: number of swaps until convergence for six datasets, ranging from 1,978 up to 10,449,902]

SLIDE 45

What if I know more about my data?

Clearly, row and column margins are a rather simple type of background knowledge. Can we incorporate more advanced knowledge? The answer is: yes, and no. Hanhijärvi et al. showed that, beyond margins, we can include

 cluster margins – easy, just only consider rows inside a cluster
 exact itemset margins – but this is intractable in general
 approximate itemset margins – with very high mixing time

(Hanhijärvi et al. 2009)

SLIDE 46

Beyond Binary

While swap randomization per se works only for binary data, the main idea can be extended to other data types. Swap rotation, for example, allows us to sample real-valued data with approximately the same

 mean and variance on rows and columns
 value distribution (histogram) on rows and columns

(swap rotation, Ojala et al. 2009)

SLIDE 47

Transitions in Real-Valued Data

One-element changes

 resample: draw a new value from [0,1] or from 𝐸
 add: add a value 𝛽 ∈ [−𝑡, 𝑡]

Four-element changes

 swap-rotate: exchange the values of a 2×2 submatrix; if 𝑏 = 𝑏′ and 𝑐 = 𝑐′ this equals a swap
 mask: add +𝛽/−𝛽 in a 2×2 pattern; preserves row and column sums (sketched below)
 swap-discretised: discretise the data into 𝑙 bins, apply swap randomization

[figure: 2×2 submatrix diagrams of the resample, add, swap-rotate, and mask transitions]
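A minimal sketch of the mask transition; keeping values inside [0,1] by clamping 𝛽 is an assumption of this sketch:

```python
import numpy as np

def mask_transition(X, t=0.1, rng=None):
    """One 'mask' move: add +b/-b in a 2x2 pattern, preserving row/col sums.

    Adds +b to (j1, k1) and (j2, k2) and -b to (j1, k2) and (j2, k1);
    every row and column sum is unchanged. Clamping b so all values stay
    in [0,1] is an assumption made for this sketch.
    """
    if rng is None:
        rng = np.random.default_rng()
    n, m = X.shape
    j1, j2 = rng.choice(n, size=2, replace=False)
    k1, k2 = rng.choice(m, size=2, replace=False)
    b = rng.uniform(-t, t)
    # Largest |b| that keeps all four cells inside [0,1].
    lo = max(-X[j1, k1], -X[j2, k2], X[j1, k2] - 1, X[j2, k1] - 1)
    hi = min(1 - X[j1, k1], 1 - X[j2, k2], X[j1, k2], X[j2, k1])
    b = np.clip(b, lo, hi)
    Y = X.copy()
    Y[j1, k1] += b; Y[j2, k2] += b
    Y[j1, k2] -= b; Y[j2, k1] -= b
    return Y
```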

SLIDES 48–49

Accepting a Change

The Metropolis algorithm performs a local change and accepts the result with a certain probability. If 𝑌 is the original data, and 𝑍 is the result, we accept it with probability 𝑑 × exp{−𝑥𝐹(𝑌, 𝑍)}, where 𝑑 is a normalisation constant, 𝑥 is a weight parameter, and 𝐹(𝑌, 𝑍) is a difference measure between 𝑌 and 𝑍. How to measure the difference between distributions?

 𝐿1 norm between unnormalized cdf's, or
 comparing histograms

(swap rotation, Ojala et al. 2009)
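Following the slide's simplified acceptance rule, a hedged sketch; the constants 𝑑 and 𝑥, and the choice of 𝐹 as an 𝐿1 distance between unnormalized cdf's of the value distributions, are placeholder choices, not prescribed by the source:

```python
import numpy as np

def l1_cdf_difference(Y, Z):
    """L1 distance between the unnormalized cdf's of two equal-size matrices.

    For equal-size samples this equals the sum of absolute differences
    of the sorted values.
    """
    return np.abs(np.sort(Y.ravel()) - np.sort(Z.ravel())).sum()

def accept(Y, Z, x=1.0, d=1.0, rng=None):
    """Accept the proposed Z with probability d * exp(-x * F(Y, Z))."""
    if rng is None:
        rng = np.random.default_rng()
    p = min(d * np.exp(-x * l1_cdf_difference(Y, Z)), 1.0)
    return rng.random() < p
```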

SLIDE 50

Swapping in Action

(swap rotation, Ojala et al. 2009)

SLIDE 51

Many ways to test a static null hypothesis

assuming a distribution, swap-randomization, MaxEnt

What can we use this for?

ranking based on static significance: mining the top-k most significant patterns, but not suited for iterative mining

Static Models

SLIDE 52

For iterative data mining, we need models that can maintain the type of information (e.g. patterns) that we mine.

Randomization is powerful

 variations exist for many data types (Ojala '09, Henelius et al. '13)
 can be pushed beyond margins (see Hanhijärvi et al. 2009)
 but… has key disadvantages

Dynamic Models

SLIDE 53

Conclusions

Significance testing is important

 results that follow from what you know are boring
 but, choosing a good model (and test) is difficult

Randomization

 simple yet powerful, available for many data types,
 slow, difficult to extend to maintain higher order statistics,
 and 'only' gives empirical p-values

SLIDE 54

Thank you!