Discovering Correlation
Jilles Vreeken
5 June 2015

Questions of the day: what is correlation, how can we measure it, and how can we discover it?

Correlation: "the relationship between things that happen or change together"
(Merriam-Webster)

(example scatterplot: ρ = 0.947)
Pearson product-moment correlation coefficient

ρ_{X,Y} = corr(X, Y) = E[(X − μ_X)(Y − μ_Y)] / (σ_X σ_Y)

That is, covariance divided by the standard deviations. Pearson detects only linear correlations.
(Wikipedia, yes really)

(example scatterplot: ρ = 0.998)
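Pearson's blindness to non-linear dependence is easy to demonstrate (a quick sketch, not from the slides; assumes NumPy):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, 1000)

    # Linear dependence: Pearson finds it.
    y_lin = 2 * x + rng.normal(0, 0.1, 1000)
    print(np.corrcoef(x, y_lin)[0, 1])    # ~0.999

    # Quadratic dependence: fully deterministic, yet Pearson is ~0.
    y_quad = x ** 2
    print(np.corrcoef(x, y_quad)[0, 1])   # ~0.0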
Last week, we discussed Shannon entropy and mutual information. Can we use these to measure correlation? Yes, we can! Shannon entropy works very well for discrete data, e.g. for mining low-entropy sets. For continuous-valued data, however…
As discussed last week, to compute the differential entropy

h(X) = −∫ p(x) log p(x) dx
we need to estimate the probability density function, choose a step size, and then hope for the best. If we don't know the distribution, we can use kernel density estimation (KDE), which requires choosing a kernel and a bandwidth. KDE is well-behaved in the univariate case, but estimating multivariate densities is very difficult, especially in high dimensions.
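A minimal univariate sketch of this pipeline, assuming SciPy's gaussian_kde (which picks the kernel and a Scott's-rule bandwidth for us); the multivariate case is where this approach falls apart:

    import numpy as np
    from scipy.stats import gaussian_kde

    rng = np.random.default_rng(0)
    sample = rng.normal(0, 1, 2000)

    kde = gaussian_kde(sample)              # estimate the density
    grid = np.linspace(-6, 6, 10_000)       # choose a "step size"
    p = np.clip(kde(grid), 1e-300, None)    # avoid log(0)

    h = -np.trapz(p * np.log(p), grid)      # h(X) = -∫ p log p dx
    print(h)   # for N(0,1) the true value is 0.5*log(2πe) ≈ 1.419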
A few years back, there was a big stir about MIC, a measure for non-linear correlations between pairs of variables.
The main idea in a nutshell:
If we want to measure the correlation of real-valued X and Y, why not discretize the data and compute mutual information!? That is, just find discretizations X′ and Y′ such that I(X′; Y′) is maximal, and treat that value as the correlation measure.
(Reshef et al, 2011)
Given D ⊆ ℝ² and integers x and y,

I*(D, x, y) = max I(D|G)

with G ranging over all grids of x columns and y rows. Normalise this score by its maximal possible value,

M(D)_{x,y} = I*(D, x, y) / log min(x, y)

and return the maximum over all admissible grid sizes:

MIC(D) = max_{xy < B(n)} { M(D)_{x,y} }
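A naive brute-force sketch of this idea (names are mine; equal-frequency grids only, whereas the real MIC of Reshef et al. searches grids far more cleverly, via dynamic programming):

    import numpy as np

    def mutual_information(cx, cy):
        # MI (in nats) between two discrete labelings, via the contingency table.
        joint = np.zeros((cx.max() + 1, cy.max() + 1))
        np.add.at(joint, (cx, cy), 1)
        pxy = joint / joint.sum()
        px = pxy.sum(axis=1, keepdims=True)
        py = pxy.sum(axis=0, keepdims=True)
        outer = px * py
        nz = pxy > 0
        return float((pxy[nz] * np.log(pxy[nz] / outer[nz])).sum())

    def mic_naive(x, y, B):
        # Try all k-by-l equal-frequency grids with k*l < B,
        # normalize MI by log min(k, l), keep the maximum.
        best = 0.0
        for k in range(2, int(B)):
            for l in range(2, int(B)):
                if k * l >= B:
                    break
                cx = np.searchsorted(np.quantile(x, np.arange(1, k) / k), x)
                cy = np.searchsorted(np.quantile(y, np.arange(1, l) / l), y)
                best = max(best, mutual_information(cx, cy) / np.log(min(k, l)))
        return best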
MIC is a nice idea, but… it is strictly for pairs, relies on heuristic optimization, doesn't like linear relationships, and doesn't like noise at all. And those are just a few of its drawbacks… Can we salvage the nice part?
(Simon and Tibshirani, 2011)
F(x) = P(X ≤ x): the cdf can be computed directly from data, no assumptions necessary
Entropy has also been defined for cumulative distribution functions!

h_CE(X) = −∫_{dom X} P(X ≤ x) log P(X ≤ x) dx

As 0 ≤ P(X ≤ x) ≤ 1, every term −P log P is non-negative, so we obtain h_CE(X) ≥ 0 (!)
(Rao et al, 2004, 2005)
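A quick sanity check, not on the slides: for X uniform on [0, 1] we have P(X ≤ x) = x, so with natural logarithms

    h_CE(X) = −∫₀¹ x ln x dx = −[ (x²/2) ln x − x²/4 ]₀¹ = 1/4

a value the empirical estimator below should reproduce.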
How do we compute h_CE(X) in practice? Easy. Let X_1 ≤ ⋯ ≤ X_n be i.i.d. samples of a continuous random variable X. Then

h_CE(X) = −Σ_{i=1}^{n−1} (X_{i+1} − X_i) · (i/n) log(i/n)

(Rao et al, 2004, 2005; Di Crescenzo & Longobardi, 2009)
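That estimator is a few lines of NumPy (a sketch; the function name is mine):

    import numpy as np

    def cumulative_entropy(x):
        # h_CE(X) = -sum_{i=1}^{n-1} (x[i+1] - x[i]) * (i/n) * log(i/n),
        # with x sorted ascending; i/n is the empirical P(X <= x_i).
        x = np.sort(np.asarray(x, dtype=float))
        n = len(x)
        p = np.arange(1, n) / n
        return float(-np.sum(np.diff(x) * p * np.log(p)))

    # Sanity check against the Uniform(0,1) value of 1/4 derived above:
    rng = np.random.default_rng(0)
    print(cumulative_entropy(rng.uniform(0, 1, 100_000)))   # ≈ 0.25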
First things first. We need

h_CE(X | Y) = ∫ h_CE(X | y) p(y) dy

which, in practice, means

h_CE(X | Y) = Σ_{y ∈ Y} h_CE(X | y) · P(y)

with y a discrete bin of data points over Y, and P(y) = |y| / n.

How do we bin Y into y? We can simply cluster Y.
(Nguyen et al, 2013)
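In code, once Y is split into bins, the conditional score is just a weighted sum. This sketch reuses cumulative_entropy from above, with equal-frequency binning as a simple stand-in for clustering:

    import numpy as np

    def conditional_ce(x, y_bins):
        # h_CE(X | Y) = sum over bins y of P(y) * h_CE(X | y), P(y) = |y| / n.
        n = len(x)
        score = 0.0
        for b in np.unique(y_bins):
            xs = x[y_bins == b]
            if len(xs) > 1:
                score += (len(xs) / n) * cumulative_entropy(xs)
        return score

    def freq_bins(y, k):
        # Stand-in for clustering: k equal-frequency bins of y.
        return np.searchsorted(np.quantile(y, np.arange(1, k) / k), y)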
How do we bin Y into y? Alternatively: find the discretisation of Y such that h_CE(X | Y) is minimal.
(Nguyen et al, 2014)
We cannot (realistically) calculate h_CE(X_1, …, X_d) in one go. Yet… entropy has a factorization property, so what we can do is

h_CE(X_1, …, X_d) = h_CE(X_1) + Σ_{i=2}^{d} h_CE(X_i | X_1, …, X_{i−1})
(Nguyen et al, 2013)
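A sketch of this chaining, reusing the helpers above (names are mine); conditioning on several attributes means conditioning on the cartesian product of their bins:

    import numpy as np

    def multivariate_ce(columns, k=5):
        # h_CE(X_1) + sum_{i=2}^{d} h_CE(X_i | X_1, ..., X_{i-1}),
        # built from cumulative_entropy, conditional_ce and freq_bins above.
        score = cumulative_entropy(columns[0])
        joint = np.zeros(len(columns[0]), dtype=int)   # joint bin id so far
        for prev, cur in zip(columns, columns[1:]):
            b = freq_bins(prev, k)
            joint = joint * k + b                      # refine by the new attribute
            score += conditional_ce(cur, joint)
        return score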
super simple: a priori-style
CMI: use the Apriori principle, mine all attribute sets with h_CE ≤ σ
(Nguyen et al, 2013ab)
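A levelwise sketch of that search (simplified prefix extension rather than full Apriori candidate generation). The pruning is licensed because the chain score above only adds non-negative conditional terms, so it can only grow as attributes are added:

    def mine_low_ce_sets(columns, score, sigma):
        # Keep every attribute set whose score is <= sigma; since supersets
        # can only score higher, a failing set is never extended.
        d = len(columns)
        level = [(i,) for i in range(d) if score([columns[i]]) <= sigma]
        result = list(level)
        while level:
            nxt = []
            for s in level:
                for j in range(s[-1] + 1, d):
                    c = s + (j,)
                    if score([columns[i] for i in c]) <= sigma:
                        nxt.append(c)
            result += nxt
            level = nxt
        return result

    # e.g. mine_low_ce_sets(columns, score=multivariate_ce, sigma=0.1)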
MIC is exclusively defined for pairs
- its score and approach do not scale up to higher dimensions
Entrez, MAC
- Multivariate Maximal Correlation Analysis
(Nguyen et al, 2014)
The maximal correlation of a set of real-valued random variables {X_i}_{i=1}^d is defined as

corr*(X_1, …, X_d) = max_{f_1, …, f_d} corr(f_1(X_1), …, f_d(X_d))

where corr is a correlation measure, and each f_i : dom(X_i) → ℝ is drawn from a pre-specified class of functions.
MAC finds the chain of pairwise grids that minimizes the entropy, i.e. that maximizes correlation. The total correlation of a dataset D is

I(D) = Σ_{i=1}^{d} H(X_i) − H(X_1, …, X_d)
(Nguyen et al, 2014)
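On already-discretized columns, total correlation is directly computable; a sketch (names are mine), with the joint entropy taken over one tuple per row:

    from collections import Counter
    import numpy as np

    def entropy(labels):
        # Shannon entropy (nats) of any hashable labeling.
        counts = np.array(list(Counter(labels).values()), dtype=float)
        p = counts / counts.sum()
        return float(-(p * np.log(p)).sum())

    def total_correlation(binned_columns):
        # I(D_G) = sum_i H(X_i) - H(X_1, ..., X_d)
        joint = list(zip(*binned_columns))
        return sum(entropy(c) for c in binned_columns) - entropy(joint)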
Let's say our data is real-valued, but that we have a discretization grid G. Then we have

I(D_G) = Σ_{i=1}^{d} H(X_i^{g_i}) − H(X_1^{g_1}, …, X_d^{g_d})

To find the maximal correlation, we hence need to find that grid G for D such that I(D_G) is maximized.
(Nguyen et al, 2014)
However, I(D_G) strongly depends on the number of bins g_i per attribute, so we should normalize by an upper bound. We know

I(D_G) ≤ Σ_{i=1}^{d} log g_i − max({log g_i}_{i=1}^{d})

by which we define

I_n(D_G) = I(D_G) / ( Σ_{i=1}^{d} log g_i − max({log g_i}_{i=1}^{d}) )

as the normalized total correlation.
(Nguyen et al, 2014)
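The normalization is then a one-liner on top of total_correlation from above (a sketch):

    import numpy as np

    def normalized_total_correlation(binned_columns):
        # I_n(D_G) = I(D_G) / (sum_i log g_i - max_i log g_i),
        # where g_i is the number of bins used for column i.
        logs = [np.log(len(set(c))) for c in binned_columns]
        denom = sum(logs) - max(logs)
        return total_correlation(binned_columns) / denom if denom > 0 else 0.0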
After all that, we can now finally introduce MAC:

MAC(D) = max_{G = {g_1, …, g_d}, ∀i≠j: g_i × g_j < n^{1−ε}} I_n(D_G)

How do we compute MAC? How do we choose G? Through cumulative entropy!
(Nguyen et al, 2014)
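Exhaustively maximizing over all grids is infeasible. As a toy stand-in for the paper's cumulative-entropy-guided search (not the actual MAC algorithm), one can scan a small family of equal-frequency grids, reusing freq_bins and normalized_total_correlation from above:

    def mac_naive(columns, max_bins=5):
        # Toy search: the same number of equal-frequency bins per attribute.
        # The real MAC chooses per-attribute grids guided by cumulative entropy.
        best = 0.0
        for k in range(2, max_bins + 1):
            binned = [freq_bins(c, k) for c in columns]
            best = max(best, normalized_total_correlation(binned))
        return best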
(Experiment plots: linear and circle patterns, at 20% and 80% noise.)
Correlation is almost anything deviating from chance. Measuring multivariate correlation is difficult
- especially if you want to be non-parametric
- even more so if you want to measure non-linear interactions
Entropy and Mutual Information are powerful tools
- Shannon entropy for nominal data
- cumulative entropy for ordinal data
- discretise smartly for multivariate CE