Discovering Correlation, Jilles Vreeken, 5 June 2015 - PowerPoint PPT Presentation



SLIDE 1

Discovering Correlation

Jilles Vreeken

5 June 2015

SLIDE 2

Questions of the day

What is correlation, how can we measure it, and how can we discover it?

SLIDE 3

Correlation

'the relationship between things that happen or change together'

(Merriam-Webster)

SLIDE 4

ρ = 0.947

SLIDE 5

Correlation

'a relation existing between phenomena or things or between mathematical or statistical variables which tend to vary, be associated, or occur together in a way not expected on the basis of chance alone'

(Merriam-Webster)

SLIDE 6

Correlation

'a relation existing between phenomena or things or between mathematical or statistical variables which tend to vary, be associated, or occur together in a way not expected on the basis of chance alone'

(Merriam-Webster)

SLIDE 7

Good Ol' Pearson

Pearson product-moment correlation coefficient

 one of the most well-known measures for correlation

ρ_{X,Y} = corr(X, Y) = E[(X − μ_X)(Y − μ_Y)] / (σ_X σ_Y)

That is, covariance divided by the standard deviations. Pearson detects only linear correlations.

SLIDE 8

Pearson in action

(Wikipedia, yes really)

SLIDE 9

ρ = 0.998

SLIDE 10

Chance alone…

Last week, we discussed Shannon entropy and mutual information. Can we use these to measure correlation? Yes, we can! Shannon entropy works very well for discrete data, e.g. low-entropy sets. For continuous-valued data: …

SLIDE 11

Shannon entropy for continuous

As discussed last week, to compute

h(X) = −∫_X f(x) log f(x) dx

We need to estimate the probability density function, choose a step-size, and then hope for the best. If we don’t know the distribution, we can use kernel density estimation – which requires choosing a kernel and a bandwidth. KDE is well-behaved for univariate, but estimating multivariate densities is very difficult, especially for high dimensionalities.
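To make "choose a step-size and hope for the best" concrete, here is a hedged sketch of the univariate recipe: fit a Gaussian KDE with SciPy, evaluate the density on a grid, and approximate the integral with a Riemann sum. The helper name, grid width, and step-size choice are illustrative assumptions, not the lecture's prescription.

```python
import numpy as np
from scipy.stats import gaussian_kde

def differential_entropy_kde(samples, grid_size=1000):
    """Estimate h(X) = -∫ f(x) log f(x) dx via KDE plus a Riemann sum."""
    kde = gaussian_kde(samples)                  # bandwidth: Scott's rule
    lo = samples.min() - 3 * samples.std()
    hi = samples.max() + 3 * samples.std()
    xs = np.linspace(lo, hi, grid_size)          # the "step-size" choice
    f = kde(xs)
    return float(-(f * np.log(f)).sum() * (xs[1] - xs[0]))

rng = np.random.default_rng(0)
z = rng.normal(size=2000)
print(differential_entropy_kde(z))  # N(0,1): true value 0.5*log(2*pi*e) ≈ 1.42
```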

SLIDE 12

MIC: Maximal Information Coefficient

A few years back, there was a big stir about MIC, a measure for non-linear correlations between pairs of variables.

The main idea in a nutshell:

If we want to measure the correlation of real-valued X and Y, why not discretize the data and compute mutual information!? That is, just find those discretized versions X_G and Y_G such that I(X_G; Y_G) is maximal, and treat that value as the correlation measure.

(Reshef et al, 2011)

SLIDE 13

MIC in a pic

Given D ⊂ ℝ² and integers x and y,

I*(D, x, y) = max I(D|G)

with G ranging over all grids of x columns and y rows. Normalise this score by independence:

M(D)_{x,y} = I*(D, x, y) / log min(x, y)

And return the maximum:

MIC(D) = max_{x·y < B(n)} M(D)_{x,y}

SLIDE 14

Mining with MIC

MIC is strictly defined for pairs of variables, which means… 'Mining' is easy! We measure MIC for every pair of attributes in our data, and then order the pairs by their MIC score.

SLIDE 15

BAD MIC BAD

MIC is a nice idea, but… it is strictly for pairs, uses heuristic optimization, doesn't like linear, and doesn't like noise at all. And those are just a few of its drawbacks… Can we salvage the nice part?

(Simon and Tibshirani, 2011)

SLIDE 16

Cumulative Distributions

𝐺(𝑦) = 𝑄(π‘Œ ≀ 𝑦) cdf df can be computed directly from data no no assumptions necessary

SLIDE 17

Identifying Interacting Subspaces

SLIDE 18

Entropy has been defined for cumulative distribution functions!

h_CE(X) = −∫_{dom(X)} P(X ≤ x) log P(X ≤ x) dx

As 0 ≤ P(X ≤ x) ≤ 1, we obtain h_CE(X) ≥ 0 (!)

Cumulative Entropy

(Rao et al, 2004, 2005)

SLIDE 19

How do we compute h_CE(X) in practice? Easy. Let X_1 ≤ ⋯ ≤ X_n be i.i.d. random samples of continuous random variable X. Then

h_CE(X) = −Σ_{i=1}^{n−1} (X_{i+1} − X_i) (i/n) log(i/n)

Cumulative Entropy

(Rao et al, 2004, 2005, Crescenzo & Longobardi 2009)
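The estimator translates almost one-to-one into code. A minimal sketch, with names of my choosing; as a sanity check, the cumulative entropy of Uniform(0, 1) is −∫ x log x dx = 1/4.

```python
import numpy as np

def cumulative_entropy(x):
    """h_CE(X) = -Σ_{i=1}^{n-1} (X_{i+1} - X_i) (i/n) log(i/n),
    computed from the sorted sample."""
    x = np.sort(np.asarray(x, float))
    n = len(x)
    gaps = np.diff(x)                  # X_{i+1} - X_i
    p = np.arange(1, n) / n            # i/n for i = 1 .. n-1
    return float(-(gaps * p * np.log(p)).sum())

rng = np.random.default_rng(0)
print(cumulative_entropy(rng.uniform(size=10000)))  # ≈ 0.25 for Uniform(0, 1)
```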

SLIDE 20

First things first. We need

h_CE(X | Y) = ∫ h_CE(X | y) p(y) dy

which, in practice, means

h_CE(X | Y) = Σ_{y∈Y} h_CE(X | y) p(y)

with y a discrete bin of data points over Y, and p(y) = |y| / n.

How do we bin Y into y? We can simply cluster Y.

Multivariate Cumulative Entropy (1)

(Nguyen et al, 2013)
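A hedged sketch of the practical version, substituting k equal-frequency bins over Y for the clustering step (that substitution is my simplification); it reuses `cumulative_entropy` from the sketch above.

```python
import numpy as np

# assumes cumulative_entropy() from the earlier sketch

def conditional_ce(x, y, k=5):
    """h_CE(X | Y) ≈ Σ_y h_CE(X | y) p(y), with Y split into k
    equal-frequency bins and p(y) = |y| / n."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(y)
    edges = np.quantile(y, np.linspace(0, 1, k + 1)[1:-1])
    labels = np.searchsorted(edges, y)           # bin index per point
    total = 0.0
    for b in range(k):
        xb = x[labels == b]
        if len(xb) > 1:
            total += cumulative_entropy(xb) * len(xb) / n
    return total
```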

SLIDE 21

First things first. We need

h_CE(X | Y) = ∫ h_CE(X | y) p(y) dy

which, in practice, means

h_CE(X | Y) = Σ_{y∈Y} h_CE(X | y) p(y)

with y a discrete bin of data points over Y, and p(y) = |y| / n.

How do we bin Y into y? Find the discretisation of Y such that h_CE(X | Y) is minimal.

Multivariate Cumulative Entropy (2)

(Nguyen et al, 2014)

SLIDE 22

We cannot (realistically) calculate h_CE(X_1, …, X_d) in one go. Yet… entropy has a factorization property, so what we can do is

Σ_{i=2}^d h_CE(X_i) − Σ_{i=2}^d h_CE(X_i | X_1, …, X_{i−1})

Cumulative Mutual Information

(Nguyen et al, 2013)
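Chaining the two pieces gives a CMI-style score for a whole attribute set. The product-of-equal-frequency-bins conditioning below is my stand-in for the paper's optimized discretization; `cumulative_entropy` is again reused from the earlier sketch.

```python
import numpy as np

# assumes cumulative_entropy() from the earlier sketch

def cmi_score(D, k=3):
    """Σ_{i=2..d} h_CE(X_i) - Σ_{i=2..d} h_CE(X_i | X_1..X_{i-1}),
    conditioning on the product of k equal-frequency bins per attribute."""
    n, d = D.shape
    score = 0.0
    for i in range(1, d):
        label = np.zeros(n, dtype=int)          # joint bin id over X_1..X_i
        for j in range(i):
            edges = np.quantile(D[:, j], np.linspace(0, 1, k + 1)[1:-1])
            label = label * k + np.searchsorted(edges, D[:, j])
        h_cond = 0.0
        for b in np.unique(label):
            xb = D[label == b, i]
            if len(xb) > 1:
                h_cond += cumulative_entropy(xb) * len(xb) / n
        score += cumulative_entropy(D[:, i]) - h_cond
    return score
```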

SLIDE 23

Mining for Interaction

super simple: a priori-style

SLIDE 24

CMI: use the apriori principle, mine all attribute sets with h_CE ≤ τ

Mining interacting attributes

(Nguyen et al, 2013ab)
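A sketch of what that levelwise search looks like, with a hypothetical `score(D, attrs)` callback standing in for h_CE over the attribute set. Note the pruning step silently assumes the score is (anti-)monotonic, which a later slide will question.

```python
from itertools import combinations

def mine_levelwise(D, score, tau):
    """Apriori-style levelwise mining: keep attribute sets whose score
    is at most tau; extend only candidates all of whose subsets survived."""
    d = D.shape[1]
    current = [frozenset([i]) for i in range(d)]
    result = []
    while current:
        survivors = [s for s in current if score(D, s) <= tau]
        result.extend(survivors)
        alive = set(survivors)
        size = len(current[0]) + 1
        # candidate generation: unions of two survivors, one attribute larger
        candidates = {a | b for a in alive for b in alive if len(a | b) == size}
        # apriori pruning: every (size-1)-subset must itself have survived
        current = [c for c in candidates
                   if all(frozenset(s) in alive
                          for s in combinations(c, size - 1))]
    return result
```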

SLIDE 25

Measuring Multivariate Correlations

MIC is exclusively defined for pairs

 score and approach do not scale up to higher dimensions

Enter MAC

 Multivariate Maximal Correlation Analysis

(Nguyen et al, 2014)

SLIDE 26

Maximal Correlation Analysis

The maximal correlation of a set of real-valued random variables {X_i}_{i=1}^d is defined as

ρ*(X_1, …, X_d) = max_{f_1, …, f_d} ρ(f_1(X_1), …, f_d(X_d))

where ρ is a correlation measure, each f_i : dom(X_i) → A_i is drawn from a pre-specified class of functions, and A_i ⊆ ℝ.
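For intuition, a brute-force pairwise sketch: maximize the correlation over a tiny hand-picked function class. The class itself and the use of absolute Pearson correlation as the inner measure ρ are my illustrative choices; `pearson` is reused from the Pearson sketch above.

```python
import numpy as np
from scipy.stats import rankdata

# assumes pearson() from the earlier Pearson sketch

TRANSFORMS = [lambda v: v, np.square, np.abs,
              lambda v: rankdata(v),
              lambda v: np.log(v - v.min() + 1.0)]  # a tiny, hand-picked class

def maximal_correlation(x, y):
    """rho*(X, Y): best |pearson(f(X), g(Y))| over the function class."""
    best = 0.0
    for f in TRANSFORMS:
        for g in TRANSFORMS:
            fx, gy = f(x), g(y)
            if fx.std() > 0 and gy.std() > 0:    # skip degenerate transforms
                best = max(best, abs(pearson(fx, gy)))
    return best

# e.g. y = x**2 on symmetric x: raw Pearson is near 0, but f = np.square
# or np.abs on x lifts the maximal correlation close to 1.
```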
SLIDE 27

Total Correlation

Finds the chain of pairwise grids that minimizes the entropy, that maximizes correlation. The total correlation of a dataset D is

I(D) = Σ_{i=1}^d H(X_i) − H(X_1, …, X_d)

(Nguyen et al, 2014)

SLIDE 28

Maximal Discretized Correlation

Let's say our data is real-valued, but that we have a discretization grid G = {g_1, …, g_d}; then we have

I(D_G) = Σ_{i=1}^d H(X_i^{g_i}) − H(X_1^{g_1}, …, X_d^{g_d})

To find the maximal correlation, we hence need to find that grid G for D such that I(D_G) is maximized.

(Nguyen et al, 2014)

SLIDE 29

Normalizing the Score

However, I(D_G) strongly depends on the number of bins n_i for attribute i. So, we should normalize by an upper bound:

I(D_G) ≤ Σ_{i=1}^d log n_i − max_i {log n_i}

(Nguyen et al, 2014)

SLIDE 30

Normalizing the Score

However, I(D_G) depends on the number of bins n_i for attribute i. So, we should normalize. We know

I(D_G) ≤ Σ_{i=1}^d log n_i − max_i {log n_i}

by which we define

I_n(D_G) = I(D_G) / (Σ_{i=1}^d log n_i − max_i {log n_i})

as the normalized total correlation.

(Nguyen et al, 2014)
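A small sketch of the score: given per-attribute bin labels (one grid dimension each), compute the total correlation from Shannon entropies and divide by the upper bound. Helper names are mine.

```python
import numpy as np

def entropy_discrete(labels):
    """Shannon entropy (in nats) of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def normalized_total_correlation(bins):
    """I_n(D_G): total correlation of the bin-label arrays in `bins`,
    divided by its upper bound Σ_i log n_i - max_i log n_i."""
    joint = np.zeros(len(bins[0]), dtype=int)
    for b in bins:
        joint = joint * (int(b.max()) + 1) + b   # unique joint grid-cell id
    tc = sum(entropy_discrete(b) for b in bins) - entropy_discrete(joint)
    logs = [np.log(len(np.unique(b))) for b in bins]
    denom = sum(logs) - max(logs)
    return tc / denom if denom > 0 else 0.0
```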

SLIDE 31

MAC

After all that, we can now finally introduce MAC:

MAC(D) = max_{G = {g_1, …, g_d}, ∀i≠k: n_i · n_k < N^{1−ϵ}} I_n(D_G)

How do we compute MAC? How do we choose G? Through cumulative entropy!

(Nguyen et al, 2014)
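And a correspondingly crude end-to-end sketch: instead of choosing G via cumulative entropy as the paper does, it exhaustively tries equal-frequency bin counts per attribute under the n_i · n_k < N^{1−ϵ} constraint, reusing `normalized_total_correlation` from the previous sketch. Exponential in the number of attributes, so toy data only.

```python
import numpy as np
from itertools import product

# assumes normalized_total_correlation() from the previous sketch

def equal_freq_bins(v, k):
    """Labels 0..k-1 from k (roughly) equal-frequency bins."""
    edges = np.quantile(v, np.linspace(0, 1, k + 1)[1:-1])
    return np.searchsorted(edges, v)

def mac_approx(D, eps=0.4, max_k=8):
    """Brute-force MAC stand-in over equal-frequency grids."""
    N, d = D.shape
    limit = N ** (1 - eps)
    best = 0.0
    for ks in product(range(2, max_k + 1), repeat=d):
        # the MAC grid constraint: n_i * n_k < N**(1 - eps) for all pairs
        if any(ks[i] * ks[j] >= limit
               for i in range(d) for j in range(i + 1, d)):
            continue
        bins = [equal_freq_bins(D[:, j], k) for j, k in enumerate(ks)]
        best = max(best, normalized_total_correlation(bins))
    return best
```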

SLIDE 32

GOOD MAC GOOD

[scatterplot panels: linear, circle]

SLIDE 33

NICE MAC NICE

[scatterplot panels: 20% noise, 80% noise]

SLIDE 34

Mining with MAC

super simple: a priori-style

SLIDE 35

PRETTY MAC PRETTY

[scatterplot panels: 20% noise, 80% noise]

SLIDE 36

Comparability of Scores

So, we use apriori… but… are CMI, MIC, MAC, etc. (anti-)monotonic? Is any meaningful correlation score monotonic?

SLIDE 37

ρ = 0.985

Spurious Correlations

SLIDE 38

Correlation does not imply…

Correlation means a co-relation is observed, which does not imply a causal relation.

SLIDE 39

Correlation does not imply…

If π‘Œ and 𝑍 are strongly correlated, this may have many reasons. Besides spurious, it may be that π‘Œ and 𝑍 are the result of an unobserved process π‘Ž. Next week we’ll investigate whether we can somehow tell if π‘Œ causes 𝑍 or vice versa.

SLIDE 40

Correlation does not imply…

SLIDE 41

Conclusions

Correlation is almost anything deviating from chance. Measuring multivariate correlation is difficult

 especially if you want to be non-parametric
 even more so if you want to measure non-linear interactions

Entropy and Mutual Information are powerful tools

 Shannon entropy for nominal data
 cumulative entropy for ordinal data
 discretise smartly for multivariate CE

SLIDE 42

ρ = 0.870

Thank you!