Discovering Causation, Jilles Vreeken, 12 June 2015 - PowerPoint PPT Presentation





SLIDE 1

Discovering Causation

Jilles Vreeken

12 June 2015

SLIDE 2

Questions of the day

What is causation, how can we measure it, and how can we discover it?

SLIDE 3

Causality

‘the relationship between something that happens or exists and the thing that causes it’

(Merriam-Webster)

SLIDE 4

Correlation vs. Causation

Correlation does not tell us anything about causality. Instead, we should talk about dependence.

SLIDE 5

Dependence vs. Causation

SLIDE 6

Causal Inference

SLIDE 7

What is causal inference?

‘reasoning to the conclusion that something is, or is likely to be, the cause of something else’

A godzillian different definitions of ‘cause’

 equally many inference frameworks
 not all with solid foundations; many highly specific, most require strong assumptions

SLIDE 8

Naïve approach

If

𝑄(cause) 𝑄(effect ∣ cause) > 𝑄(effect) 𝑄(cause ∣ effect)

then cause → effect

SLIDE 9

Naïve approach

If

𝑄(cause) 𝑄(effect ∣ cause) > 𝑄(effect) 𝑄(cause ∣ effect)

then cause → effect

(rough bastardization of Markov condition)

SLIDE 10

Causal Graphs

[figure: a causal DAG over 𝐵, 𝐶, 𝑋, 𝑎, 𝑍, 𝑌, 𝐿, 𝑀, 𝑁, highlighting the parents of 𝑋, the descendants of 𝑋, and the non-descendants of 𝑋]
SLIDE 11

Choices…

[figure: four candidate DAGs over 𝑌, 𝑍, and 𝑎]

SLIDE 12

Statistical Causality

Reichenbach’s common cause principle links causality and probability: if 𝑌 and 𝑍 are statistically dependent, then either 𝑌 causes 𝑍, 𝑍 causes 𝑌, or a common cause 𝑎 causes both. When 𝑎 screens 𝑌 and 𝑍 off from each other, then given 𝑎, 𝑌 and 𝑍 become independent.

[figure: the three cases 𝑌 → 𝑍, 𝑌 ← 𝑎 → 𝑍, and 𝑌 ← 𝑍]

SLIDE 13

Causal Markov Condition

Any distribution generated by a Markovian model 𝑁 can be factorized as

𝑄(𝑌1, 𝑌2, …, 𝑌𝑛) = ∏𝑗 𝑄(𝑌𝑗 ∣ 𝑝𝑎𝑗)

where 𝑌1, 𝑌2, …, 𝑌𝑛 are the endogenous variables in 𝑁, and 𝑝𝑎𝑗 are (the values of) the endogenous “parents” of 𝑌𝑗 in the causal diagram associated with 𝑁

(Spirtes, Glymour, Scheines 1982; Pearl 2009)

Endogenous variable: A factor in a causal model or causal system whose value is determined by the states of other variables in the system; contrasted with an exogenous variable
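As a small illustration of this factorization, here is a sketch with made-up conditional probability tables (the chain 𝑌1 → 𝑌2 → 𝑌3 and all numbers are my own example, not from the slides): the joint over the endogenous variables is just the product of each variable's distribution given its parents.

```python
import numpy as np

# hypothetical CPTs for a Markovian model with causal diagram Y1 -> Y2 -> Y3
q_y1 = np.array([0.6, 0.4])                # Q(Y1)
q_y2_y1 = np.array([[0.7, 0.3],            # Q(Y2 | Y1=0)
                    [0.2, 0.8]])           # Q(Y2 | Y1=1)
q_y3_y2 = np.array([[0.9, 0.1],            # Q(Y3 | Y2=0)
                    [0.5, 0.5]])           # Q(Y3 | Y2=1)

# causal Markov condition: the joint factorizes over the parents
joint = np.einsum('a,ab,bc->abc', q_y1, q_y2_y1, q_y3_y2)

assert np.isclose(joint.sum(), 1.0)
# e.g. Q(Y1=1, Y2=0, Y3=1) = Q(Y1=1) Q(Y2=0 | Y1=1) Q(Y3=1 | Y2=0)
assert np.isclose(joint[1, 0, 1], 0.4 * 0.2 * 0.1)
```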

SLIDE 14

In other words…

For all distinct variables 𝑌 and 𝑍 in the variable set 𝑊, if 𝑌 does not cause 𝑍, then Pr(𝑌 ∣ 𝑍, 𝑝𝑎𝑌) = Pr(𝑌 ∣ 𝑝𝑎𝑌). That is, we can weed out edges from a causal graph – we can identify DAGs up to their Markov equivalence class. We are unable to choose among these.

[figure: three Markov-equivalent DAGs over 𝑋, 𝑌, 𝑍, and 𝑎]

SLIDE 15

Three is a crowd

Traditional causal inference methods rely on conditional independence tests, and hence require at least three observed variables. That is, they cannot distinguish between 𝑌 → 𝑍 and 𝑍 → 𝑌, as 𝑞(𝑦) 𝑞(𝑧 ∣ 𝑦) = 𝑞(𝑧) 𝑞(𝑦 ∣ 𝑧) are just factorisations of 𝑞(𝑦, 𝑧). But, but, that’s exactly what we want to know!
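This symmetry is easy to check numerically; a minimal sketch with an arbitrary, made-up joint table over two binary variables:

```python
import numpy as np

# an arbitrary joint distribution q[y, z] over two binary variables
q = np.array([[0.10, 0.30],
              [0.40, 0.20]])

q_y = q.sum(axis=1, keepdims=True)   # marginal q(y), as a column vector
q_z = q.sum(axis=0, keepdims=True)   # marginal q(z), as a row vector

f_forward = q_y * (q / q_y)    # q(y) q(z | y)
f_backward = q_z * (q / q_z)   # q(z) q(y | z)

# both factorisations reproduce the joint exactly: with only two
# variables, there is no asymmetry for an independence test to exploit
assert np.allclose(f_forward, q)
assert np.allclose(f_backward, q)
```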

SLIDE 16

Wiggle Wiggle

Let’s take another look at the definition of causality: ‘the relationship between something that happens or exists and the thing that causes it’. So, essentially, if 𝑌 causes 𝑍, we can wiggle 𝑍 by wiggling 𝑌, while we cannot wiggle 𝑌 by wiggling 𝑍. But… when we only have observational data, we cannot do any wiggling ourselves…

SLIDE 17

Additive Noise Models

Whenever the joint distribution 𝑞(𝑌, 𝑍) admits a model in one direction, e.g. 𝑍 = 𝑔(𝑌) + 𝑂 with 𝑂 ⊥ 𝑌, but does not admit the reversed model, 𝑌 = ℎ(𝑍) + 𝑂 with 𝑂 ⊥ 𝑍, we can infer 𝑌 → 𝑍

(Peters et al. 2010)
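A minimal sketch of the additive-noise idea (an illustration only: the polynomial regressor and the crude residual-dependence score below are my own stand-ins, not the actual test of Peters et al.): fit a model in each direction, and prefer the direction in which the residuals look independent of the input.

```python
import numpy as np

def residual_dependence(x, y, deg=3, bins=10):
    """Fit y = g(x) + noise with a polynomial, then score how strongly
    the residual spread varies across quantile bins of x (0 ~ independent)."""
    resid = y - np.polyval(np.polyfit(x, y, deg), x)
    edges = np.quantile(x, np.linspace(0, 1, bins + 1))
    idx = np.clip(np.searchsorted(edges, x, side='right') - 1, 0, bins - 1)
    v = np.array([resid[idx == b].var() for b in range(bins)])
    return v.std() / v.mean()

rng = np.random.default_rng(0)
y = rng.uniform(-1, 1, 5000)
z = y**3 + 0.1 * rng.normal(size=5000)   # ground truth: Y -> Z

# residuals look (approximately) independent only in the causal direction
assert residual_dependence(y, z) < residual_dependence(z, y)
```

The backward fit has to approximate the cube root, whose slope blows up near zero, so its residuals depend visibly on the input; the forward residuals are just the noise.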

SLIDE 18

May the Noise be With you

(Janzing et al. 2012)

[figure: a scatter plot of 𝑦 against 𝑧, highlighting a 𝑧-value with large 𝐼(𝑌 ∣ 𝑧) and large density 𝑄(𝑧)]

SLIDE 19

May the Noise be With you

(Janzing et al. 2012)

[figure: 𝑔(𝑦) plotted together with the densities 𝑞(𝑦) and 𝑞(𝑧)]

“If the structure of the density 𝑞(𝑦) is not correlated with the slope of 𝑔, then the flat regions of 𝑔 induce peaks in 𝑞(𝑧). The causal hypothesis 𝑍 → 𝑌 is thus implausible, because the causal mechanism 𝑔−1 appears to be adjusted to the “input” distribution 𝑞(𝑧).”

SLIDE 20

Plausible Markov Kernels

If 𝑞(cause) 𝑞(effect ∣ cause) is simpler than 𝑞(effect) 𝑞(cause ∣ effect), then cause → effect

but, how to measure ‘simpler’?
what about having not 𝑞 but 𝑞̂?
is model complexity alone enough? is data complexity alone enough?
and, what if there is no distribution?

(Sun et al. 2006, Janzing et al. 2012)

SLIDE 21

My approach

Given two objects X and Y of your favorite types

e.g. bags of observations, or two objects of arbitrary type

Say whether

 X and Y are independent,
 X causes Y (or vice versa), or
 X and Y are correlated

on the basis of descriptive complexity. Without parameters, without assuming distributions.

(assuming, for the time being, no hidden confounders)

SLIDE 22

Kolmogorov Complexity

The Kolmogorov complexity of a binary string 𝑡 is the length of the shortest program 𝑙(𝑡) for a universal Turing Machine 𝑉 that generates 𝑡 and halts.

(Kolmogorov, 1963)
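Kolmogorov complexity itself is uncomputable, but any lossless compressor gives a computable upper bound; a quick sketch of this standard approximation (zlib is just one convenient stand-in for a compressor):

```python
import random
import zlib

def L(t: bytes) -> int:
    """Compressed length: a computable upper bound on the
    Kolmogorov complexity of t (plus compressor overhead)."""
    return len(zlib.compress(t, 9))

structured = b"ab" * 500                   # trivial to generate algorithmically
noisy = random.Random(0).randbytes(1000)   # pseudo-random, barely compressible

# the structured string has a far shorter description than the noisy one
assert L(structured) < 100 < L(noisy)
```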

SLIDE 23

Elementary, my dear Watson

𝑙(𝑡) is the simplest way to generate 𝑡 algorithmically. If 𝑡 represents data that contains causal dependencies, there will be evidence in 𝑙(𝑡). How can we get this evidence out? By using conditional two-part complexity.

SLIDE 24

Conditional Complexity

The conditional Kolmogorov complexity of a string 𝑡 is the length of the shortest program 𝑙(𝑡) for a universal Turing Machine 𝑉 that, given string 𝒖 as input, generates 𝑡 and halts.

SLIDE 25

Kolmo-causal

Serialising 𝐘 and 𝐙 into a string 𝑡, we have 𝐿(𝐘, 𝐙) for the length of the shortest program 𝑙(𝐘, 𝐙) that generates 𝐘 and 𝐙. Intuitively, this will factor out differently depending on how 𝐘 and 𝐙 are related, right?

(a similar idea is explored by Janzing & Schölkopf, 2008, 2010)

SLIDE 26

Wrench in the works

Information, however, is symmetric: 𝐿(𝐘, 𝐙) ≜ 𝐿(𝐘) + 𝐿(𝐙 ∣ 𝐘) ≜ 𝐿(𝐙) + 𝐿(𝐘 ∣ 𝐙)

(equality up to a constant, Zvonkin & Levin, 1970)

SLIDE 27

Direction of information

Instead of factorizing 𝐿(𝐘, 𝐙), we have to look at the effect of conditioning. That is, if knowing 𝐘 makes the algorithmic description of 𝐙 easier, then 𝐘 is likely an (algorithmic) cause of 𝐙. We have to identify the strongest direction of information between 𝐘 and 𝐙.

SLIDE 28

Conditional Complexity

So, how about we just regard 𝐿(𝐘 ∣ 𝐙) and 𝐿(𝐙 ∣ 𝐘)? Close, but no cigar. If 𝐿(𝐘) is much larger than 𝐿(𝐙), directly comparing 𝐿(𝐘 ∣ 𝐙) and 𝐿(𝐙 ∣ 𝐘) will be biased to the simplest ‘cause’.

SLIDE 29

Normalised

What we should therefore do is normalise, as then we can determine the strongest direction of information between 𝐘 and 𝐙 by comparing

Δ𝐘→𝐙 = 𝐿(𝐙 ∣ 𝐘) / 𝐿(𝐙)

and

Δ𝐙→𝐘 = 𝐿(𝐘 ∣ 𝐙) / 𝐿(𝐘)

(Vreeken, SDM 2015)

SLIDE 30

Characterising independence

If 𝐘 and 𝐙 are algorithmically independent, we will see that

Δ𝐘→𝐙 = 𝐿(𝐙 ∣ 𝐘) / 𝐿(𝐙)

and

Δ𝐙→𝐘 = 𝐿(𝐘 ∣ 𝐙) / 𝐿(𝐘)

both approach 1, as neither dataset gives much information about the other

SLIDE 31

Characterising correlation

Last, when 𝐘 and 𝐙 are only algorithmically correlated, we will see

Δ𝐘→𝐙 = 𝐿(𝐙 ∣ 𝐘) / 𝐿(𝐙) ≈ Δ𝐙→𝐘 = 𝐿(𝐘 ∣ 𝐙) / 𝐿(𝐘)

as both carry approximately equal amounts of information about each other
SLIDE 32

Characterising causation

However, if 𝐘 algorithmically causes 𝐙, we will see

Δ𝐘→𝐙 = 𝐿(𝐙 ∣ 𝐘) / 𝐿(𝐙)

approach 0, and

Δ𝐙→𝐘 = 𝐿(𝐘 ∣ 𝐙) / 𝐿(𝐘)

approach 1, as for a large part 𝐘 explains 𝐙

SLIDE 33

Inference by Complexity

𝐘 gives more information about 𝐙 than vice versa: there is an algorithmic causal connection! The distances to 0, to 1, and to each other tell us the strength. For objects and collections.

  • Descriptive. No parameters. No priors.
  • Catches any algorithmic dependency.

…but, can we actually implement this?

SLIDE 34

Implementing our Rule

We can approximate 𝐿 by lossless compression; here we need compressors that incorporate conditional description of models and data.

Do we have such compressors?

not really – there has never been any explicit use for them
but… we can define them!
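A toy sketch of what such a conditional compressor could look like (an illustration only, not the entropy-based instantiation the talk actually uses): approximate 𝐿(𝐙 ∣ 𝐘) by how much appending 𝐙 to 𝐘 adds to the compressed size, then normalise by 𝐿(𝐙).

```python
import random
import zlib

def L(t: bytes) -> int:
    """Compressed length as a stand-in for descriptive complexity."""
    return len(zlib.compress(t, 9))

def delta(y: bytes, z: bytes) -> float:
    """Delta_{Y->Z}: approximate L(Z | Y) by L(YZ) - L(Y), normalised by L(Z)."""
    return (L(y + z) - L(y)) / L(z)

y = random.Random(0).randbytes(2000)
z_derived = y[:1000]                          # z fully determined by y
z_indep = random.Random(1).randbytes(1000)    # z unrelated to y

# knowing y makes z_derived nearly free to describe, but not z_indep
assert delta(y, z_derived) < 0.5 < delta(y, z_indep)
```

The compressor finds z_derived as a back-reference into y, so the score drops toward 0; the independent bytes add their full length, so the score stays near 1.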

SLIDE 35

ERGO

For sets of observations X and Y

 high-dimensional continuous data
 fast, non-parametric, noise tolerant, very good results

Estimates directed information by entropy

 average number of bits to describe (x, y)
 using normalised cumulative resp. Shannon entropy
 (non-)linear functional (non-)deterministic causation

SLIDE 36

Multivariate Cumulative Entropy

Cumulative entropy, however, is

  • nly defi

defined ed for univariate variables. We estimate the multivariate cumulative entropy of 𝐘 as ℎ 𝐘 = ℎ 𝐘1 + ℎ 𝐘2 𝐘1 + ℎ 𝐘3 𝐘1, 𝐘2 + ⋯ ℎ(𝐘𝑛|𝐘1, … , 𝐘𝑛−1) Other factorisations are possible, and may, in fact, give better approximations…

(Vreeken 2015; Nguyen et al. 2014)
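For a single variable, cumulative entropy can be estimated directly from the order statistics; a small sketch using an empirical-CDF plug-in (my illustration, not necessarily the exact estimator used in the paper):

```python
import numpy as np

def cumulative_entropy(x):
    """Empirical cumulative entropy:
    h(X) = -sum_i (x_(i+1) - x_(i)) * F_i * log F_i,
    with F_i = i/n the empirical CDF between consecutive order statistics."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    F = np.arange(1, n) / n
    return float(-np.sum(np.diff(x) * F * np.log(F)))

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 100_000)

# for Uniform(0,1): h(X) = -integral of F(x) log F(x) = -∫ x ln x dx = 1/4
assert abs(cumulative_entropy(x) - 0.25) < 0.01
```

The multivariate estimate in the slide then chains such terms, conditioning each 𝐘𝑖 on the previous ones.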

SLIDE 37

Entropy-based Direction of Information

Δ𝐘→𝐙 = ½ ( ℎ(𝐙 ∣ 𝐘) / ℎ(𝐙) + 𝐼(𝐘′, 𝐙′) / (𝐼(𝐘′) + 𝐼(𝐙′)) ), and accordingly for Δ𝐙→𝐘

SLIDE 38

Entropy-based Direction of Information

Δ𝐘→𝐙 = ½ ( ℎ(𝐙 ∣ 𝐘) / ℎ(𝐙) + 𝐼(𝐘′, 𝐙′) / (𝐼(𝐘′) + 𝐼(𝐙′)) ), and accordingly for Δ𝐙→𝐘

The first term is the cost of the data (cumulative entropy); the second is the cost of the model (Shannon entropy).
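To make the data-cost term concrete, here is a simplified sketch comparing only the cumulative-entropy ratios ℎ(𝐙 ∣ 𝐘)/ℎ(𝐙) versus ℎ(𝐘 ∣ 𝐙)/ℎ(𝐘), with the conditional term estimated by quantile-binning the conditioning variable (the binning scheme and the omission of the model term are my simplifications, not the actual method):

```python
import numpy as np

def cumulative_entropy(x):
    """Empirical cumulative entropy via the empirical-CDF plug-in."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    F = np.arange(1, n) / n
    return float(-np.sum(np.diff(x) * F * np.log(F)))

def cond_cumulative_entropy(target, cond, bins=10):
    """h(target | cond): weighted cumulative entropy of target
    within quantile bins of cond (a crude conditioning scheme)."""
    edges = np.quantile(cond, np.linspace(0, 1, bins + 1))
    idx = np.clip(np.searchsorted(edges, cond, side='right') - 1, 0, bins - 1)
    n = len(target)
    return sum(len(target[idx == b]) / n * cumulative_entropy(target[idx == b])
               for b in range(bins))

rng = np.random.default_rng(0)
y = rng.uniform(-1, 1, 20_000)
z = y**2 + 0.05 * rng.normal(size=20_000)   # ground truth: Y -> Z

d_yz = cond_cumulative_entropy(z, y) / cumulative_entropy(z)
d_zy = cond_cumulative_entropy(y, z) / cumulative_entropy(y)

# conditioning helps much more in the causal direction
assert d_yz < d_zy
```

Given 𝑦, the value of 𝑧 is pinned down up to noise; given 𝑧, the value of 𝑦 still has two branches (±√𝑧), so the backward conditional entropy stays high.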

SLIDE 39

Results

SLIDE 40

Results

SLIDE 41

Results

SLIDE 42

Results

SLIDE 43

More results

Benchmark data

 Age → Marital Status
 Education level → Income
 Gender → Income

Real data

 # of Roman Catholic family members → # of married couples in the family
 # of family members with high education → # of family members with high status
 average income of the whole family → # of home owners

SLIDE 44

Example: Who are the Culprits?

Suppose a graph in which an epidemic spreads

 who caused it?

Main ideas

 uninfected neighbors exonerate you from being a culprit
 the easier the footprint is to reach, the better

(Prakash, Vreeken & Faloutsos, ICDM’12)

SLIDE 45

The Golden Rule

“No causal claim can be established by a purely statistical method, be it propensity scores, regression, stratification, or any other distribution-based design”

(Pearl)

SLIDE 46

Blocking Back Doors

[figure: a causal graph over 𝑎1, 𝑎2, 𝑎3, 𝑋1, 𝑋2, 𝑋3, 𝑍, and 𝑌, illustrating back-door paths]

SLIDE 47

Conclusions

Causal inference

 important, difficult, in rapid development

Causal inference by algorithmic complexity

 solid foundations, clear interpretation, non-parametric
 for any pair of objects of any sort, for type and token causation
 instantiation for multivariate real-valued data works very well

Ongoing

 how deep does the rabbit hole go?

SLIDE 48


Thank you!