SLIDE 1
What does 'strong causal influence' mean?
Dominik Janzing
Max Planck Institute for Intelligent Systems, Tübingen, Germany
Joint work with David Balduzzi, Moritz Grosse-Wentrup, and Bernhard Schölkopf
SLIDE 2
[Figure: DAG over X1, X2, X3, X4]
Quantify the strength of Xi → Xj.
Quantifying the strength of an arrow
Given:
- a causally sufficient set of variables X1, . . . , Xn
- the causal DAG G
- all causal conditionals P(xj | paj), even for values paj with probability zero
  (this is more than just knowing P(X1, . . . , Xn))
SLIDE 3
Motivation:
[Figure: two DAGs over Z, X, Y, W]
Maybe the true causal DAG is always complete if we also account for weak interactions. Which ones are so weak that we can neglect them?
SLIDE 4
Strength of a set of arrows
Idea:
- the strength of an arrow measures its relevance for understanding the behavior of the system under interventions
- the strength of a set of arrows measures their relevance for understanding the behavior of the system under interventions
- even if each arrow in S is irrelevant, S could still be relevant
SLIDE 5
[Figure: two DAGs over Z, X, Y, W]
Note:
this picture is misleading because, for a set S of arrows,
- each element may have negligible strength
- but jointly they need not be negligible
- our causal strength will not be subadditive over the edges!
SLIDE 6
Information theoretic approach
We do not consider approaches that involve expectations, variances, etc. (ANOVA, ACE, ...).
Advantages of information theory:
- variables may have different domains
- quantities are invariant under rescaling
- related to thermodynamics
- better for non-statistical generalizations
SLIDE 7
Some related work
- Avin, Shpitser, Pearl: Identifiability of path-specific effects, 2005.
- Pearl: Direct and indirect effects, 2001.
- Robins, Greenland: Identifiability and exchangeability of direct and indirect effects, 1992.
- Holland: Causal inference, path analysis, and recursive structural equation models, 1988.
These do not achieve our goal because:
- they measure the impact of switching X from x to x′ on Y, for one particular pair (x, x′), when other paths are blocked
- we want an overall score for the strength of X → Y without referring to particular pairs
SLIDE 8
Axiomatic approach: Let S be a set of arrows.
- Let CS denote its strength.
- Postulate desired properties of CS.
SLIDE 9
Postulate 0 Causal Markov condition:
[Figure: DAG G over Z, X, Y, and DAG G_S obtained by removing the arrows in S]
If C_S = 0, then P is also Markov w.r.t. G_S (the DAG after removing all arrows in S).
SLIDE 10
Postulate 1 Mutual information:
[Figure: DAG X → Y]
For this simple DAG we postulate C_{X→Y} = I(X; Y)
(all the dependences are due to the influence of X on Y , hence the strength of dependences can be a measure of the strength of the influence)
SLIDE 11
[Figure: DAG X → Y]
Alternative option:
C_{X→Y} := the capacity of the information channel P(Y | do(X)) = P(Y | X), defined by maximizing I(X; Y) over all possible input distributions Q(X)
- requires knowing P(Y |x) also for x-values that never/seldom occur
- quantifies the potential influence rather than the actual one
- nevertheless an interesting option
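As a rough illustration (not part of the talk), the capacity of a discrete channel P(Y | X) can be computed with the standard Blahut-Arimoto algorithm. A minimal Python sketch, with the channel matrix and iteration count chosen for illustration:

```python
import numpy as np

def channel_capacity(W, n_iter=500):
    """Capacity in bits of the channel W[x, y] = P(y|x), maximized over Q(X)."""
    Q = np.full(W.shape[0], 1.0 / W.shape[0])   # start from the uniform input

    def divergences(Q):
        # D(W(.|x) || p_y) in bits for each input symbol x, with p_y = Q @ W
        p_y = Q @ W
        with np.errstate(divide="ignore", invalid="ignore"):
            log_ratio = np.where(W > 0, np.log2(W / p_y), 0.0)
        return (W * log_ratio).sum(axis=1)

    for _ in range(n_iter):
        Q = Q * np.exp2(divergences(Q))         # Blahut-Arimoto update
        Q /= Q.sum()
    return float(Q @ divergences(Q))            # I(X;Y) at the (near-)optimal Q

# Binary symmetric channel with flip probability 0.1: capacity = 1 - H(0.1).
bsc = np.array([[0.9, 0.1], [0.1, 0.9]])
print(channel_capacity(bsc))  # ≈ 0.531
```

Note that the maximizing Q(X) generally differs from the observed P(X), which is exactly why this is a "potential" rather than an "actual" strength.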
SLIDE 12
Potential strength vs actual strength
Assume a medical study shows that
- changing cholesterol within the range of values occurring in humans has no impact on life expectancy
- increasing it to 10 times the highest observed value had a strong impact
Which statement would you prefer:
- "cholesterol has a strong impact on life expectancy"
- "cholesterol would have a strong impact on life expectancy if it were much higher than it is"
SLIDE 13
Postulate 2
Locality:
[Figure: two DAGs over Y, X, Z]
Z is irrelevant in both cases: the strength of X → Y is determined by P(Y | PA_Y) and P(PA_Y).
SLIDE 14
Postulate 3 (quantitative causal Markov condition): C_{X→Y} ≥ I(X; Y | PA_Y^X)
where PA_Y^X denotes the parents of Y without X.
[Figure: DAG with X → Y and further parents PA_Y^X of Y]
No other arrow can generate a non-zero dependence I(X; Y | PA_Y^X).
Idea: removing X → Y would imply I(X; Y | PA_Y^X) = 0.
SLIDE 15
Postulate 4 (heredity): subsets of irrelevant sets of arrows are irrelevant: if T ⊃ S, then C_T = 0 ⇒ C_S = 0.
SLIDE 16
Apart from the postulates...
Consider a simple communication scenario for which we might agree on how C should read...
SLIDE 17
Toy model with partial copy operations:
- each variable Xj consists of kj bits
- some of the bits are set uniformly at random
- the remaining ones are copied from parents
That is, a structural equation model Xj = fj(PAj, Uj), where
- every Xj and Uj is a vector of bits
- every fj is a restriction map
SLIDE 18
Example with X → Y:
- 1. X sets all its bits randomly
- 2. Y copies some of them
- 3. Y sets the remaining ones randomly
[Figure: bit strings of X and Y; some bits of Y are copies of bits of X]
SLIDE 19
Do we agree that. . .
. . . CX→Y should be the number of bits that Y takes from X?
(for the simple DAG X → Y this number equals I(X; Y ))
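This can be checked numerically. A small sketch (the parameters k and m are my own illustration): if Y copies the first m of X's k uniform bits and sets its remaining bits at random, then I(X; Y) = m bits.

```python
import numpy as np
from itertools import product

def copy_model_mi(k, m):
    """I(X;Y) when Y copies the first m of X's k uniform bits, rest random."""
    xs = list(product([0, 1], repeat=k))
    p = np.zeros((len(xs), len(xs)))            # joint P(x, y)
    for i, x in enumerate(xs):
        for j, y in enumerate(xs):
            if y[:m] == x[:m]:                  # copied bits must agree
                p[i, j] = 0.5 ** k * 0.5 ** (k - m)
    p_x = p.sum(axis=1, keepdims=True)
    p_y = p.sum(axis=0, keepdims=True)
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / (p_x @ p_y)[mask])).sum())

print(copy_model_mi(k=3, m=2))  # 2.0
```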
SLIDE 20
Why I(X; Y) is an inappropriate measure for general DAGs:
[Figure: a) DAG where Z confounds X and Y; b) DAG where X influences Y indirectly via Z]
I(X; Y) doesn't account for the fact that part of the dependences are due to a) the confounder Z, or b) the indirect influence via Z.
SLIDE 21
[Figure: a) confounding by Z; b) indirect influence via Z]
First guess: I(X; Y | Z)
- qualitatively, it behaves correctly: it screens off the paths involving Z
- quantitatively, it is wrong because...
SLIDE 22
Fails even for a simple copy scenario
[Figure: four-step copy scenario over Z, Y, X, in which single bits are copied along the arrows]
- I(X; Y |Z) = 0 because X and Y are constants when conditioned on Z
- we would like to have CX→Y = 1
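To make the failure concrete, here is one possible reading of the copy scenario (the chain structure and the single uniform bit are assumptions for illustration): with Z → X → Y, where each arrow is a perfect copy, I(X; Y | Z) = 0 even though Y literally copies its bit from X.

```python
import numpy as np

def cond_mutual_information(p_xyz):
    """I(X;Y|Z) in bits for a joint distribution given as an array p[x, y, z]."""
    cmi = 0.0
    for z in range(p_xyz.shape[2]):
        p_z = p_xyz[:, :, z].sum()
        if p_z == 0:
            continue
        p_xy = p_xyz[:, :, z] / p_z             # P(x, y | z)
        p_x = p_xy.sum(axis=1, keepdims=True)
        p_y = p_xy.sum(axis=0, keepdims=True)
        mask = p_xy > 0
        cmi += p_z * (p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])).sum()
    return float(cmi)

# X = Y = Z with Z a uniform bit: Y's bit is copied from X, yet I(X;Y|Z) = 0.
p = np.zeros((2, 2, 2))
p[0, 0, 0] = p[1, 1, 1] = 0.5
print(cond_mutual_information(p))  # 0.0
```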
SLIDE 23
Why I(X; Y | Z) is inappropriate:
[Figure: a) DAG with edge Z → Y; b) the same DAG without Z → Y]
Weakening Z → Y converts a) into b), where C_{X→Y} = I(X; Y).
SLIDE 24
Idea: measure the strength of X on Y by the impact of interventions on X (while adjusting other variables)
- formalized by Ay & Polani (2006) in terms of Pearl's do-calculus
- they defined a family of information-theoretic quantities called "Information Flow"
SLIDE 25
does not solve our problem
- Ay and Polani’s Information Flow measures an interesting quantity
(something related to causality)
- we don’t consider it a good measure for the strength of an arrow
- arguments follow
SLIDE 26
First attempt:
[Figure: DAG over X, Z, Y]
The strength of X → Y is the mutual information I(X; Y) in a scenario where
- X is subjected to a randomized intervention
SLIDE 27
Fails because...
[Figure: DAG over X, Z, Y]
- X, Y, Z binary
- P(Z) uniform
- Y = X ⊕ Z
X and Y are independent both with respect to the
- observed distribution
- distribution obtained by randomizing X
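A small sketch checking this by enumeration (assuming, for illustration, that X is independent of Z in the observed distribution; the conclusion is unchanged for any randomizing distribution of X):

```python
import numpy as np

def joint_xor(p_x):
    """Joint P(x, y) for Y = X ^ Z, with Z uniform and independent of X ~ p_x."""
    p = np.zeros((2, 2))
    for x in range(2):
        for z in range(2):
            p[x, x ^ z] += p_x[x] * 0.5
    return p

def mutual_information(p_xy):
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float((p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])).sum())

print(mutual_information(joint_xor([0.5, 0.5])))  # observed: 0.0
print(mutual_information(joint_xor([0.3, 0.7])))  # after randomizing X: 0.0
```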
SLIDE 28
Second attempt:
[Figure: DAG over X, Z, Y]
The strength of X → Y is the conditional mutual information I(X; Y | Z) in a scenario where
- X is subjected to a randomized intervention
Question: X is randomized according to which distribution?
SLIDE 29
Second attempt, Version I:
[Figure: DAG over X, Z, Y]
The strength of X → Y is the conditional mutual information I(X; Y | Z) in a scenario where
- X is subjected to a randomized intervention
- X is distributed according to P(X | Z)
SLIDE 30
Fails because. . .
[Figure: DAG over X, Z, Y, with X a copy of Z]
If X is a copy of Z,
- given Z, X is a constant
- I(X; Y |Z) = 0 also for the post-interventional distribution
SLIDE 31
Second attempt, Version II:
[Figure: DAG over X, Z, Y]
The strength of X → Y is the conditional mutual information I(X; Y | Z) in a scenario where
- X is subjected to a randomized intervention
- X is distributed according to P(X)
SLIDE 32
[Figure: DAG over X, Z, Y]
This violates Postulate 3: there is a contrived example where the strength of X → Y would be smaller than I(X; Y | Z).
SLIDE 33
Violates Postulate 3:
[Figure: DAG over Z, Y, X]
- Z: a random bit
- X: k bits, randomized for Z = 1, set to zero for Z = 0
- Y: k bits, copied from X for Z = 1, set to 1 for Z = 0
I(X; Y | Z) = k/2 because k bits are copied in half of the cases; for X and Z independent, copying occurs only in 1/4 of the cases.
SLIDE 34
Hence...
- defining the strength of an arrow by interventions on nodes seems difficult
- we now define the strength by interventions on edges
SLIDE 35
Our approach: measure the impact of 'deleting arrows'.
[Figure: DAG over X, Z, Y; the edges in S are cut and fed with P(X) and P(Z)]
To define the strength of S, cut every edge in S and feed the open end with an independent copy.
This defines the new distribution
P_S(x, y, z) := P(x, z) Σ_{x′,z′} P(y | x′, z′) P(x′) P(z′)
C_S := D(P ‖ P_S)
SLIDE 36
Idea of 'edge deletion':
[Figure: edges into Y cut and fed with P(X) and P(Z)]
- edges are electrical wires
- attacker cuts some wires
- feeds the open ends with random input
- the distribution of the input is chosen to match the observed marginal distribution
- this is the only distribution that is locally accessible
SLIDE 37
Why the product distribution?
[Figure: cut wires fed with P(X)P(Z) versus with the joint P(X, Z)]
Feeding with the joint P(X, Z) is the 'source exclusion' of Ay & Krakauer (2006), but
- it is not accessible to a local attacker
- Postulate 4 fails
SLIDE 38
Applying our measure to our toy model:
[Figure: cutting the edges in S corrupts some of the copied bits]
D(P ‖ P_S) = number of corrupted bits (in agreement with what we expect).
SLIDE 39
[Figure: DAG over Age, vaccinated, infected]
Quantifying the impact of a vaccine
PS corresponds to an experiment where
- vaccine is randomly redistributed regardless of Age
(keeping the fraction of treated subjects)
- the random variable vaccinated is reinterpreted as
‘intention to get vaccinated’
SLIDE 40
XOR example:
[Figure: DAG over X, Z, Y]
P(Z) uniform, X = Z, Y = X ⊕ Z
- Y is always 0
- Y is uniformly distributed after deleting X → Y
- Y remains independent of X
- I(X; Y ) = 0 and I(X; Y |Z) = 0
- CX→Y = 1
- Ay and Krakauer’s definition yields zero strength
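These numbers can be verified by direct enumeration, implementing the edge-deletion distribution P_S for the single edge X → Y (a small sketch; all three variables are binary):

```python
import numpy as np

# Observed joint p[x, z, y]: Z uniform, X = Z, Y = X XOR Z (so Y = 0 always).
p = np.zeros((2, 2, 2))
for z in range(2):
    p[z, z, 0] = 0.5                       # x = z, y = z ^ z = 0

p_x = p.sum(axis=(1, 2))                   # marginal P(x)
p_s = np.zeros_like(p)
for x in range(2):
    for z in range(2):
        for y in range(2):
            # P_S(x, z, y) = P(x, z) * sum_{x'} P(y | x', z) P(x'),
            # with P(y | x', z) the XOR mechanism (defined for all x', z)
            p_s[x, z, y] = p[x, z].sum() * sum(
                p_x[xp] for xp in range(2) if (xp ^ z) == y
            )

mask = p > 0
c = float((p[mask] * np.log2(p[mask] / p_s[mask])).sum())
print(c)  # C_{X->Y} = 1.0 bit
```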
SLIDE 41
Failure of subadditivity
Redundancy code: bit E is copied to all B1, . . . , B_{2k+1}; D = majority of the Bj.
- removing fewer than half of the arrows Bj → D has no impact
- each arrow has strength zero
- all arrows together have strength 1
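A sketch verifying this by enumeration for k = 1 (three copies), using the edge-deletion definition of C_S; the helper names are my own:

```python
import numpy as np
from itertools import product

def majority(bits):
    return int(sum(bits) > len(bits) / 2)

def strength(cut):
    """C_S in bits for S = {B_j -> D : j in cut}, in the 3-copy model."""
    kl = 0.0
    for e in range(2):                       # observed: B = (e, e, e), D = e
        b = [e, e, e]
        # P_S(d = e | b): resample the cut inputs from P(B_j) = uniform
        p_s_d = 0.0
        for fills in product([0, 1], repeat=len(cut)):
            b_mod = list(b)
            for idx, val in zip(cut, fills):
                b_mod[idx] = val
            if majority(b_mod) == e:
                p_s_d += 0.5 ** len(cut)
        # observed P(d = e | b) = 1 and P(e, b) = 1/2
        kl += 0.5 * np.log2(1.0 / p_s_d)
    return float(kl)

print(strength([0]))        # cutting one arrow: 0.0
print(strength([0, 1, 2]))  # cutting all three: 1.0
```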
SLIDE 42
Application to time series
Why time series also require a new measure of causal strength
SLIDE 43
Granger causality
[Figure: time series DAG over . . . , X_{t−1}, X_t, X_{t+1}, . . . and . . . , Y_{t−1}, Y_t, Y_{t+1}, . . .]
It measures the relevance of the past X_{t−1}, X_{t−2}, . . . for predicting Y_t when the past Y_{t−1}, Y_{t−2}, . . . is known.
SLIDE 44
[Figure: time series DAG over X and Y]
Transfer Entropy: the information-theoretic version,
I(Y_t; X_{t−1}, X_{t−2}, . . . | Y_{t−1}, Y_{t−2}, . . .)
SLIDE 45
Criticizing Granger causality and Transfer Entropy
(Ay & Polani 2006)
[Figure: time series DAG over X and Y]
Assume perfect copying:
- past of Y allows for predicting Y without X
- Granger causality and TE are zero
- interventions on X clearly change Y
SLIDE 46
Applying our measure to time series:
[Figure: time series DAG; S contains all arrows from X-variables into Y_{t+1}]
C_S quantifies the effect of all of X on Y_{t+1}.
(applying this to the example of Ay & Polani yields a reasonable result)
SLIDE 47
Conclusions:
- none of the existing measures appeared to be conceptually right for measuring the strength of sets of edges
- our measure satisfies our postulates
- it relies on interventions on edges
- it has a clear operational meaning (it does not refer to counterfactuals)
- definitions that rely on interventions on nodes failed, although they seem more straightforward
- replacing Transfer Entropy (Granger causality) with our measure seems reasonable
SLIDE 48
Thank you for listening!
Reference:
- D. Janzing, D. Balduzzi, M. Grosse-Wentrup, B. Schölkopf: Quantifying causal influences, to appear in Annals of Statistics.