Discovering Causation
Jilles Vreeken
12 June 2015
Questions of the day: what is causation, how can we measure it, and how can we discover it?

Causality: 'the relationship between something that happens or exists and the thing that causes it' (Merriam-Webster)
Correlation does not tell us anything about causality. Instead, we should talk about dependence.
A Godzillian different definitions of 'cause', and equally many inference frameworks: not all with solid foundations; many highly specific; most require strong assumptions.
Is P(cause) P(effect ∣ cause) simpler than P(effect) P(cause ∣ effect)?
(rough bastardization of Markov condition)
[Figure: a causal DAG highlighting the parents of X, the descendants of X, and the non-descendants of X]
Reichenbach's common cause principle links causality and probability: if X and Y are statistically dependent, then either X causes Y, Y causes X, or a common cause Z causes both. Z screens X and Y off from each other: given Z, X and Y become independent.
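To see screening off in action, here is a minimal simulation sketch (Python with numpy; the linear-Gaussian setup and all numbers are ours, purely for illustration):

```python
# A hidden common cause Z makes X and Y dependent; within a thin
# slice of Z the dependence (approximately) vanishes.
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=100_000)        # common cause Z
x = z + rng.normal(size=z.size)     # X <- Z + noise
y = z + rng.normal(size=z.size)     # Y <- Z + noise

print("corr(X, Y):", np.corrcoef(x, y)[0, 1])                 # approx. 0.5
sel = np.abs(z - 1.0) < 0.05                                  # condition on Z ~ 1
print("corr(X, Y | Z):", np.corrcoef(x[sel], y[sel])[0, 1])   # approx. 0
```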
Any distribution generated by a Markovian model M can be factorised as

P(X₁, X₂, …, Xₙ) = ∏ⱼ P(Xⱼ ∣ paⱼ)

where X₁, X₂, …, Xₙ are the endogenous variables in M, and paⱼ are (the values of) the endogenous parents of Xⱼ.
(Spirtes, Glymour & Scheines 1993; Pearl 2009)
Endogenous variable: A factor in a causal model or causal system whose value is determined by the states of other variables in the system; contrasted with an exogenous variable
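A toy sketch of the factorisation above (the DAG X ← Z → Y and all probability values are hypothetical):

```python
# A toy instance of P(X1,...,Xn) = prod_j P(Xj | pa_j) for X <- Z -> Y;
# all numbers are made up for illustration.
import itertools

P_Z = {0: 0.6, 1: 0.4}                                    # P(Z)
P_X = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}          # P(X | Z)
P_Y = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}          # P(Y | Z)

def joint(z, x, y):
    # one factor per variable, conditioned on its parents (here: Z)
    return P_Z[z] * P_X[z][x] * P_Y[z][y]

total = sum(joint(*v) for v in itertools.product((0, 1), repeat=3))
print(total)  # 1.0: the factors define a valid joint distribution
```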
For all distinct variables X and Y in the variable set V: if X does not cause Y, then Pr(X ∣ Y, pa_X) = Pr(X ∣ pa_X). That is, we can weed out edges from a causal graph: we can identify DAGs up to their Markov equivalence class. We are unable to choose among these.
[Figure: three Markov-equivalent DAGs over the same variables]
Traditional causal inference methods rely on conditional independence tests, and hence require at least three observed variables. That is, they cannot distinguish between X → Y and Y → X, as p(x) p(y ∣ x) = p(y) p(x ∣ y) are just two factorisations of p(x, y). But, but, that's exactly what we want to know!
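A quick numerical check of this point (toy joint, made-up numbers):

```python
# The two factorisations of a toy joint p(x, y) are numerically
# identical, so pairwise probabilities alone cannot tell X -> Y
# from Y -> X.
joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}

p_x = {0: 0.4, 1: 0.6}   # marginals of the joint above
p_y = {0: 0.5, 1: 0.5}

for (x, y), p in joint.items():
    p_y_given_x = p / p_x[x]
    p_x_given_y = p / p_y[y]
    # p(x) p(y|x) and p(y) p(x|y) both reproduce p(x, y) exactly
    assert abs(p_x[x] * p_y_given_x - p) < 1e-12
    assert abs(p_y[y] * p_x_given_y - p) < 1e-12
print("indistinguishable from pairwise probabilities alone")
```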
Let's take another look at the definition of causality: 'the relationship between something that happens or exists and the thing that causes it'. So, essentially, if X causes Y, we can wiggle Y by wiggling X, while we cannot wiggle X by wiggling Y. But… when we only have observational data, we cannot do any wiggling ourselves…
Whenever the joint distribution p(X, Y) admits a model in one direction, e.g. Y = f(X) + N with N ⊥ X, but does not admit the reversed model, X = g(Y) + N′ with N′ ⊥ Y, we can infer X → Y.
(Peters et al. 2010)
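A rough sketch of such additive-noise-model inference (real implementations pair nonparametric regression with an HSIC independence test; the polynomial fit and the crude dependence proxy of correlating |residuals| with the input are simplifications of ours):

```python
# Fit both directions; accept the one whose residuals look
# independent of the regression input (in the spirit of Peters et al. 2010).
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 5000)
y = x ** 3 + rng.normal(size=x.size)       # ground truth: X -> Y

def residual_dependence(a, b, deg=5):
    """Regress b on a, score how much |residual| still depends on a."""
    resid = np.abs(b - np.polyval(np.polyfit(a, b, deg), a))
    return abs(np.corrcoef(a, resid)[0, 1]) + abs(np.corrcoef(a ** 2, resid)[0, 1])

print("X -> Y:", residual_dependence(x, y))   # near 0: model admitted
print("Y -> X:", residual_dependence(y, x))   # clearly larger: rejected
```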
(Janzing et al. 2012) [Figure: a function y = f(x), its input density p(x), and the output density p(y); flat regions of f induce peaks in p(y)]

'If the structure of the density p(x) is not correlated with the slope of f, then the flat regions of f induce peaks in p(y). The causal hypothesis Y → X is thus implausible, because the causal mechanism f⁻¹ appears to be adjusted to the "input" distribution p(y).' (Janzing et al. 2012)
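The slope-based estimator of Janzing et al. 2012 is short enough to sketch (as we understand it: normalise both variables to [0, 1] and average log-slopes over x-sorted pairs; the tanh mechanism and all data below are our toy choices):

```python
# IGCI sketch: C(X -> Y) is estimated as the mean log |dy/dx| over
# x-sorted pairs after normalising to [0, 1]; the smaller score wins.
import numpy as np

def igci(x, y, eps=1e-10):
    x = (x - x.min()) / (x.max() - x.min())
    y = (y - y.min()) / (y.max() - y.min())
    o = np.argsort(x)
    dx, dy = np.diff(x[o]), np.diff(y[o])
    ok = (np.abs(dx) > eps) & (np.abs(dy) > eps)   # skip ties
    return np.mean(np.log(np.abs(dy[ok] / dx[ok])))

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 2000)
y = np.tanh(6 * (x - 0.5))          # deterministic mechanism, X -> Y
print("C(X->Y):", igci(x, y))       # negative
print("C(Y->X):", igci(y, x))       # roughly the mirror image, positive
```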
If p(cause) p(effect ∣ cause) is simpler than p(effect) p(cause ∣ effect), then cause → effect.
But how do we measure 'simpler'? What if we do not have p, but only an estimate p̂? Is model complexity alone enough? Is data complexity alone enough? And what if there is no distribution at all?
(Sun et al. 2006, Janzing et al. 2012)
Given two objects X and Y of your favorite types
e.g. bags of observations, or two objects of arbitrary type
Say whether
X and Y are independent, X causes Y (or vice versa), or X and Y are correlated
By descriptive complexity: without parameters, without assuming distributions.
(assuming, for the time being, no hidden confounders)
The Kolmogorov complexity K(s) of a binary string s is the length of the shortest program p*(s) for a universal Turing machine U that generates s and halts.
(Kolmogorov, 1963)
p*(s) is the simplest way to generate s algorithmically. If s represents data that contains causal dependencies, there will be evidence of this in p*(s). How can we get this evidence out? By using conditional, two-part complexity.

The conditional Kolmogorov complexity K(s ∣ u) is the length of the shortest program p*(s ∣ u) for a universal Turing machine U that, given string u as input, generates s and halts.
Serialising X and Y into a string, we have K(X, Y) for the length of the shortest program p*(X, Y) that generates X and Y. Intuitively, this will factor out differently depending on the causal direction…
right?
(a similar idea is explored by Janzing & Schölkopf, 2008, 2010)
Information, however, is symmetric: K(X, Y) ≜ K(X) + K(Y ∣ X) ≜ K(Y) + K(X ∣ Y)
(equality up to a constant, Zvonkin & Levin, 1970)
Instead of factorizing K(X, Y), we have to look at the effect of conditioning. That is, if knowing X makes the algorithmic description of Y easier, X is likely an (algorithmic) cause of Y. We have to identify the strongest direction of information between X and Y.
So, how about we just compare K(X ∣ Y) and K(Y ∣ X)? Close, but no cigar: if K(X) is much larger than K(Y), directly comparing K(X ∣ Y) and K(Y ∣ X) will be biased towards the simplest 'cause'.
What we should therefore do is normalise; then we can determine the strongest direction of information between X and Y by comparing

Δ_{X→Y} = K(Y ∣ X) / K(Y)   and   Δ_{Y→X} = K(X ∣ Y) / K(X)
(Vreeken, SDM 2015)
If X and Y are algorithmically independent, we will see that

Δ_{X→Y} = K(Y ∣ X) / K(Y)   and   Δ_{Y→X} = K(X ∣ Y) / K(X)

both approach 1, as neither dataset gives much information about the other.
Last, when X and Y are only algorithmically correlated, we will see

Δ_{X→Y} = K(Y ∣ X) / K(Y) ≈ Δ_{Y→X} = K(X ∣ Y) / K(X)

as both carry approximately equal amounts of information about each other.
However, if X algorithmically causes Y, we will see

Δ_{X→Y} = K(Y ∣ X) / K(Y) approach 0, and

Δ_{Y→X} = K(X ∣ Y) / K(X) approach 1,

as for a large part X explains Y.
X gives more information about Y than vice versa: there is an algorithmic causal connection! The distances to 0, to 1, and to each other tell us the strength. This works for objects and collections alike.
Catches any algorithmic dependency.
…but, can we actually implement this?
We can approximate K by lossless compression. Here we need compressors that incorporate conditional descriptions of models and data. Do we have such compressors? Not really: there has never been any explicit use for them. But… we can define them!
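As a crude sanity check, one can already press an off-the-shelf compressor into service, approximating K(y ∣ x) by C(xy) − C(x). This is a stand-in of ours: zlib has no real notion of conditional compression, so it only recovers the two extremes, but it shows the normalised scores at work:

```python
# zlib as a (weak) proxy for Kolmogorov complexity: C(s) is the
# compressed length; C(y | x) is approximated by C(x + y) - C(x).
import os
import zlib

def C(s: bytes) -> int:
    return len(zlib.compress(s, 9))

def delta(src: bytes, dst: bytes) -> float:
    """Delta_{src -> dst} ~= C(dst | src) / C(dst)."""
    return (C(src + dst) - C(src)) / C(dst)

a, b = os.urandom(10_000), os.urandom(10_000)   # independent strings
print(delta(a, b), delta(b, a))                 # both near 1

c = bytes(a)                                    # fully dependent copy of a
print(delta(a, c), delta(c, a))                 # both near 0
```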
For sets of observations X and Y of high-dimensional continuous data: fast, non-parametric, noise tolerant, with very good results.

It estimates directed information by entropy: the average number of bits to describe (x, y), using normalised cumulative resp. Shannon entropy, and handles (non-)linear, functional, (non-)deterministic causation.
Cumulative entropy, however, is only defined for univariate variables. We estimate the multivariate cumulative entropy of X as

h(X) = h(X₁) + h(X₂ ∣ X₁) + h(X₃ ∣ X₁, X₂) + ⋯ + h(Xₙ ∣ X₁, …, Xₙ₋₁)

Other factorisations are possible, and may, in fact, give better approximations…
(Vreeken 2015; Nguyen et al. 2014)
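For the univariate base case there is a simple plug-in estimator (we believe this is the empirical estimator of Di Crescenzo & Longobardi used in this line of work; conditional terms are then estimated over slices of the conditioning variables):

```python
# Empirical cumulative entropy of a univariate sample:
# h(X) = -sum_i (x_(i+1) - x_(i)) * (i/n) * log(i/n),
# where (i/n) is the empirical CDF at the i-th order statistic.
import numpy as np

def cumulative_entropy(x: np.ndarray) -> float:
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    p = np.arange(1, n) / n                    # empirical CDF at x_(i)
    return float(-np.sum(np.diff(x) * p * np.log(p)))

rng = np.random.default_rng(3)
print(cumulative_entropy(rng.uniform(0, 1, 200_000)))  # -> 1/4 for U(0,1)
```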
Δ_{X→Y} = ½ ( h(Y ∣ X) / h(Y) + I(X′, Y′) / (I(X′) + I(Y′)) )

and accordingly for Δ_{Y→X}. The first term is the cost of the data (cumulative entropy); the second is the cost of the model (Shannon entropy).
Benchmark data: Age → Marital status; Education level → Income; Gender → Income.

Real data: # of Roman Catholic family members → # of married couples in the family; # of family members with high education → # of family members with high status; average income of the whole family → # of home owners.
Who caused it? Uninfected neighbors exonerate a node from being a culprit: the easier it is to reach the footprint, the better.
(Prakash, Vreeken & Faloutsos, ICDM’12)
(Pearl) [Figure: a causal diagram, after Pearl, illustrating token causation]
Causal inference: important, difficult, and in rapid development.

Causal inference by algorithmic complexity: solid foundations, clear interpretation, non-parametric; works for any pair of objects of any sort, and for both type and token causation; the instantiation for multivariate real-valued data works very well.

Ongoing: how deep does the rabbit hole go?