SLIDE 1
What does 'strong causal influence' mean?
Dominik Janzing
Max Planck Institute for Intelligent Systems, Tübingen, Germany
Joint work with David Balduzzi, Moritz Grosse-Wentrup, and Bernhard Schölkopf
SLIDE 2
[Figure: DAG over X1, X2, X3, X4]
Quantify the strength of Xi → Xj.
Quantifying the strength of an arrow
Given:
- a causally sufficient set of variables X1, . . . , Xn
- the causal DAG G
- all causal conditionals P(xj | paj), even for values paj with probability zero
  (this is more than just knowing P(X1, . . . , Xn))
SLIDE 3
Motivation:
[Figure: two DAGs over Z, X, Y, W]
Maybe the true causal DAG is always complete if we also account for weak interactions. Which ones are so weak that we can neglect them?
SLIDE 4
Strength of a set of arrows
Idea:
- the strength of an arrow measures its relevance for understanding the behavior of the system under interventions
- the strength of a set of arrows measures their relevance for understanding the behavior of the system under interventions
- even if each arrow in S is irrelevant, S could still be relevant
SLIDE 5
[Figure: two DAGs over Z, X, Y, W]
Note:
this picture is misleading because, for a set S of arrows,
- each element may have negligible strength
- but jointly they need not be negligible
- our causal strength will not be subadditive over the edges!
SLIDE 6
Information theoretic approach
We do not consider approaches that involve expectations, variances, etc. (ANOVA, ACE, ...).
Advantages of information theory:
- variables may have different domains
- quantities are invariant under rescaling
- related to thermodynamics
- better for non-statistical generalizations
SLIDE 7
Some related work
- Avin, Shpitser, Pearl: Identifiability of path-specific effects, 2005.
- Pearl: Direct and indirect effects, 2001.
- Robins, Greenland: Identifiability and exchangeability of direct and indirect effects, 1992.
- Holland: Causal inference, path analysis, and recursive structural equation models, 1988.
These do not achieve our goal because:
- they measure the impact of switching X from x to x′ on Y, for one particular pair (x, x′), when other paths are blocked
- we want an overall score for the strength of X → Y without referring to particular pairs
SLIDE 8
Axiomatic approach: Let S be a set of arrows.
- Let CS denote its strength.
- Postulate desired properties of CS.
SLIDE 9
Postulate 0 Causal Markov condition:
[Figure: DAG G over Z, X, Y, and DAG G_S obtained by removing the arrows in S]
If C_S = 0, then P is also Markov w.r.t. G_S (the DAG after removing all arrows in S).
SLIDE 10
Postulate 1 Mutual information:
[Figure: DAG X → Y]
For this simple DAG we postulate C_{X→Y} = I(X; Y)
(all the dependences are due to the influence of X on Y , hence the strength of dependences can be a measure of the strength of the influence)
SLIDE 11
[Figure: DAG X → Y]
Alternative option:
C_{X→Y} := the capacity of the information channel P(Y | do(X)) = P(Y | X), defined by maximizing I(X; Y) over all possible input distributions Q(X)
- requires knowing P(Y |x) also for x-values that never/seldom occur
- quantifies the potential influence rather than the actual one
- nevertheless an interesting option
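As a rough illustration (not part of the talk), the capacity of a discrete channel P(Y | X) can be computed with the standard Blahut-Arimoto algorithm. A minimal Python sketch, with the channel matrix and iteration count chosen for illustration:

```python
import numpy as np

def channel_capacity(W, n_iter=500):
    """Capacity in bits of the channel W[x, y] = P(y|x), maximized over Q(X)."""
    Q = np.full(W.shape[0], 1.0 / W.shape[0])   # start from the uniform input

    def divergences(Q):
        # D(W(.|x) || p_y) in bits for each input symbol x, with p_y = Q @ W
        p_y = Q @ W
        with np.errstate(divide="ignore", invalid="ignore"):
            log_ratio = np.where(W > 0, np.log2(W / p_y), 0.0)
        return (W * log_ratio).sum(axis=1)

    for _ in range(n_iter):
        Q = Q * np.exp2(divergences(Q))         # Blahut-Arimoto update
        Q /= Q.sum()
    return float(Q @ divergences(Q))            # I(X;Y) at the (near-)optimal Q

# Binary symmetric channel with flip probability 0.1: capacity = 1 - H(0.1).
bsc = np.array([[0.9, 0.1], [0.1, 0.9]])
print(channel_capacity(bsc))  # ≈ 0.531
```

Note that the maximizing Q(X) generally differs from the observed P(X), which is exactly why this is a "potential" rather than an "actual" strength.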
SLIDE 12
Potential strength vs actual strength
Assume a medical study shows that
- changing cholesterol within the range of values occurring in humans has no impact on life expectancy
- increasing it to 10 times the highest observed value had a strong impact
Which statement would you prefer:
- "cholesterol has a strong impact on life expectancy"
- "cholesterol would have a strong impact on life expectancy if it were much higher than it is"
SLIDE 13
Postulate 2
Locality:
[Figure: two DAGs over Y, X, Z]
Z is irrelevant in both cases: the strength of X → Y is determined by P(Y | PA_Y) and P(PA_Y).
SLIDE 14
Postulate 3 (quantitative causal Markov condition): C_{X→Y} ≥ I(X; Y | PA_Y^X)
where PA_Y^X denotes the parents of Y without X.
[Figure: DAG with X → Y and further parents PA_Y^X of Y]
No other arrow can generate a non-zero dependence I(X; Y | PA_Y^X).
Idea: removing X → Y would imply I(X; Y | PA_Y^X) = 0.
SLIDE 15
Postulate 4 (heredity): subsets of irrelevant sets of arrows are irrelevant: if T ⊃ S, then C_T = 0 ⇒ C_S = 0.
SLIDE 16
Apart from the postulates...
Consider a simple communication scenario for which we might agree on how C should read...
SLIDE 17
Toy model with partial copy operations:
- each variable Xj consists of kj bits
- some of the bits are set uniformly at random
- the remaining ones are copied from parents
That is, a structural equation model Xj = fj(PAj, Uj), where
- every Xj and Uj is a vector of bits
- every fj is a restriction map
SLIDE 18
Example with X → Y:
- 1. X sets all its bits randomly
- 2. Y copies some of them
- 3. Y sets the remaining ones randomly
[Figure: bit strings of X and Y; some bits of Y are copies of bits of X]
SLIDE 19
Do we agree that. . .
. . . CX→Y should be the number of bits that Y takes from X?
(for the simple DAG X → Y this number equals I(X; Y ))
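This can be checked numerically. A small sketch (the parameters k and m are my own illustration): if Y copies the first m of X's k uniform bits and sets its remaining bits at random, then I(X; Y) = m bits.

```python
import numpy as np
from itertools import product

def copy_model_mi(k, m):
    """I(X;Y) when Y copies the first m of X's k uniform bits, rest random."""
    xs = list(product([0, 1], repeat=k))
    p = np.zeros((len(xs), len(xs)))            # joint P(x, y)
    for i, x in enumerate(xs):
        for j, y in enumerate(xs):
            if y[:m] == x[:m]:                  # copied bits must agree
                p[i, j] = 0.5 ** k * 0.5 ** (k - m)
    p_x = p.sum(axis=1, keepdims=True)
    p_y = p.sum(axis=0, keepdims=True)
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / (p_x @ p_y)[mask])).sum())

print(copy_model_mi(k=3, m=2))  # 2.0
```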
SLIDE 20
Why I(X; Y) is an inappropriate measure for general DAGs:
[Figure: a) DAG where Z confounds X and Y; b) DAG where X influences Y indirectly via Z]
I(X; Y) doesn't account for the fact that part of the dependences are due to a) the confounder Z, or b) the indirect influence via Z.
SLIDE 21
[Figure: a) confounding by Z; b) indirect influence via Z]
First guess: I(X; Y | Z)
- qualitatively, it behaves correctly: it screens off the paths involving Z
- quantitatively, it is wrong because...
SLIDE 22
Fails even for a simple copy scenario
[Figure: four-step copy scenario over Z, Y, X, in which single bits are copied along the arrows]
- I(X; Y |Z) = 0 because X and Y are constants when conditioned on Z
- we would like to have CX→Y = 1
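To make the failure concrete, here is one possible reading of the copy scenario (the chain structure and the single uniform bit are assumptions for illustration): with Z → X → Y, where each arrow is a perfect copy, I(X; Y | Z) = 0 even though Y literally copies its bit from X.

```python
import numpy as np

def cond_mutual_information(p_xyz):
    """I(X;Y|Z) in bits for a joint distribution given as an array p[x, y, z]."""
    cmi = 0.0
    for z in range(p_xyz.shape[2]):
        p_z = p_xyz[:, :, z].sum()
        if p_z == 0:
            continue
        p_xy = p_xyz[:, :, z] / p_z             # P(x, y | z)
        p_x = p_xy.sum(axis=1, keepdims=True)
        p_y = p_xy.sum(axis=0, keepdims=True)
        mask = p_xy > 0
        cmi += p_z * (p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])).sum()
    return float(cmi)

# X = Y = Z with Z a uniform bit: Y's bit is copied from X, yet I(X;Y|Z) = 0.
p = np.zeros((2, 2, 2))
p[0, 0, 0] = p[1, 1, 1] = 0.5
print(cond_mutual_information(p))  # 0.0
```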
SLIDE 23
Why I(X; Y | Z) is inappropriate:
[Figure: a) DAG with edge Z → Y; b) the same DAG without Z → Y]
Weakening Z → Y converts a) into b), where C_{X→Y} = I(X; Y).
SLIDE 24
Idea: measure the strength of X on Y by the impact of interventions on X (while adjusting other variables)
- formalized by Ay & Polani (2006) in terms of Pearl's do-calculus
- they defined a family of information-theoretic quantities called "Information Flow"
SLIDE 25
does not solve our problem
- Ay and Polani’s Information Flow measures an interesting quantity
(something related to causality)
- we don’t consider it a good measure for the strength of an arrow
- arguments follow
SLIDE 26
First attempt:
[Figure: DAG over X, Z, Y]
The strength of X → Y is the mutual information I(X; Y) in a scenario where
- X is subjected to a randomized intervention
SLIDE 27
Fails because...
[Figure: DAG over X, Z, Y]
- X, Y, Z binary
- P(Z) uniform
- Y = X ⊕ Z
X and Y are independent both with respect to the
- observed distribution
- distribution obtained by randomizing X
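A small sketch checking this by enumeration (assuming, for illustration, that X is independent of Z in the observed distribution; the conclusion is unchanged for any randomizing distribution of X):

```python
import numpy as np

def joint_xor(p_x):
    """Joint P(x, y) for Y = X ^ Z, with Z uniform and independent of X ~ p_x."""
    p = np.zeros((2, 2))
    for x in range(2):
        for z in range(2):
            p[x, x ^ z] += p_x[x] * 0.5
    return p

def mutual_information(p_xy):
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float((p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])).sum())

print(mutual_information(joint_xor([0.5, 0.5])))  # observed: 0.0
print(mutual_information(joint_xor([0.3, 0.7])))  # after randomizing X: 0.0
```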
SLIDE 28
Second attempt:
[Figure: DAG over X, Z, Y]
The strength of X → Y is the conditional mutual information I(X; Y | Z) in a scenario where
- X is subjected to a randomized intervention
Question: X is randomized according to which distribution?
SLIDE 29
Second attempt, Version I:
[Figure: DAG over X, Z, Y]
The strength of X → Y is the conditional mutual information I(X; Y | Z) in a scenario where
- X is subjected to a randomized intervention
- X is distributed according to P(X | Z)
SLIDE 30
Fails because. . .
[Figure: DAG over X, Z, Y, with X a copy of Z]
If X is a copy of Z,
- given Z, X is a constant
- I(X; Y |Z) = 0 also for the post-interventional distribution
SLIDE 31
Second attempt, Version II:
[Figure: DAG over X, Z, Y]
The strength of X → Y is the conditional mutual information I(X; Y | Z) in a scenario where
- X is subjected to a randomized intervention
- X is distributed according to P(X)
SLIDE 32
[Figure: DAG over X, Z, Y]
This violates Postulate 3: there is a contrived example where the strength of X → Y would be smaller than I(X; Y | Z).
SLIDE 33
Violates Postulate 3:
[Figure: DAG over Z, Y, X]
- Z: a random bit
- X: k bits, randomized for Z = 1, set to zero for Z = 0
- Y: k bits, copied from X for Z = 1, set to 1 for Z = 0
I(X; Y | Z) = k/2 because k bits are copied in half of the cases; for X and Z independent, copying occurs only in 1/4 of the cases.
SLIDE 34
Hence...
- defining the strength of an arrow by interventions on nodes seems difficult
- we now define the strength by interventions on edges
SLIDE 35
Our approach: measure the impact of 'deleting arrows'.
[Figure: DAG over X, Z, Y; the edges in S are cut and fed with P(X) and P(Z)]
To define the strength of S, cut every edge in S and feed the open end with an independent copy.
This defines the new distribution
P_S(x, y, z) := P(x, z) Σ_{x′,z′} P(y | x′, z′) P(x′) P(z′)
C_S := D(P ‖ P_S)
SLIDE 36
Idea of 'edge deletion':
[Figure: edges into Y cut and fed with P(X) and P(Z)]
- edges are electrical wires
- attacker cuts some wires
- feeds the open ends with random input
- the distribution of the input is chosen to match the observed marginal distribution
- this is the only distribution that is locally accessible
SLIDE 37
Why the product distribution?
[Figure: cut wires fed with P(X)P(Z) versus with the joint P(X, Z)]
Feeding with the joint P(X, Z) is the 'source exclusion' of Ay & Krakauer (2006), but
- it is not accessible to a local attacker
- Postulate 4 fails
SLIDE 38
Applying our measure to our toy model:
[Figure: cutting the edges in S corrupts some of the copied bits]
D(P ‖ P_S) = number of corrupted bits (in agreement with what we expect).
SLIDE 39
[Figure: DAG over Age, vaccinated, infected]
Quantifying the impact of a vaccine
PS corresponds to an experiment where
- vaccine is randomly redistributed regardless of Age
(keeping the fraction of treated subjects)
- the random variable vaccinated is reinterpreted as
‘intention to get vaccinated’
SLIDE 40
XOR example:
[Figure: DAG over X, Z, Y]
P(Z) uniform, X = Z, Y = X ⊕ Z
- Y is always 0
- Y is uniformly distributed after deleting X → Y
- Y remains independent of X
- I(X; Y ) = 0 and I(X; Y |Z) = 0
- CX→Y = 1
- Ay and Krakauer’s definition yields zero strength
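These numbers can be verified by direct enumeration, implementing the edge-deletion distribution P_S for the single edge X → Y (a small sketch; all three variables are binary):

```python
import numpy as np

# Observed joint p[x, z, y]: Z uniform, X = Z, Y = X XOR Z (so Y = 0 always).
p = np.zeros((2, 2, 2))
for z in range(2):
    p[z, z, 0] = 0.5                       # x = z, y = z ^ z = 0

p_x = p.sum(axis=(1, 2))                   # marginal P(x)
p_s = np.zeros_like(p)
for x in range(2):
    for z in range(2):
        for y in range(2):
            # P_S(x, z, y) = P(x, z) * sum_{x'} P(y | x', z) P(x'),
            # with P(y | x', z) the XOR mechanism (defined for all x', z)
            p_s[x, z, y] = p[x, z].sum() * sum(
                p_x[xp] for xp in range(2) if (xp ^ z) == y
            )

mask = p > 0
c = float((p[mask] * np.log2(p[mask] / p_s[mask])).sum())
print(c)  # C_{X->Y} = 1.0 bit
```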
SLIDE 41
Failure of subadditivity
Redundancy code: bit E is copied to all B1, . . . , B_{2k+1}; D = majority of the Bj.
- removing fewer than half of the arrows Bj → D has no impact
- each arrow has strength zero
- all arrows together have strength 1
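A sketch verifying this by enumeration for k = 1 (three copies), using the edge-deletion definition of C_S; the helper names are my own:

```python
import numpy as np
from itertools import product

def majority(bits):
    return int(sum(bits) > len(bits) / 2)

def strength(cut):
    """C_S in bits for S = {B_j -> D : j in cut}, in the 3-copy model."""
    kl = 0.0
    for e in range(2):                       # observed: B = (e, e, e), D = e
        b = [e, e, e]
        # P_S(d = e | b): resample the cut inputs from P(B_j) = uniform
        p_s_d = 0.0
        for fills in product([0, 1], repeat=len(cut)):
            b_mod = list(b)
            for idx, val in zip(cut, fills):
                b_mod[idx] = val
            if majority(b_mod) == e:
                p_s_d += 0.5 ** len(cut)
        # observed P(d = e | b) = 1 and P(e, b) = 1/2
        kl += 0.5 * np.log2(1.0 / p_s_d)
    return float(kl)

print(strength([0]))        # cutting one arrow: 0.0
print(strength([0, 1, 2]))  # cutting all three: 1.0
```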
SLIDE 42
Application to time series
Why time series also require a new measure of causal strength
SLIDE 43
Granger causality
[Figure: time series DAG over . . . , X_{t−1}, X_t, X_{t+1}, . . . and . . . , Y_{t−1}, Y_t, Y_{t+1}, . . .]
It measures the relevance of the past X_{t−1}, X_{t−2}, . . . for predicting Y_t when the past Y_{t−1}, Y_{t−2}, . . . is known.
SLIDE 44
[Figure: time series DAG over X and Y]
Transfer Entropy: the information-theoretic version,
I(Y_t; X_{t−1}, X_{t−2}, . . . | Y_{t−1}, Y_{t−2}, . . .)
SLIDE 45
Criticizing Granger causality and Transfer Entropy
(Ay & Polani 2006)
[Figure: time series DAG over X and Y]
Assume perfect copying:
- past of Y allows for predicting Y without X
- Granger causality and TE are zero
- interventions on X clearly change Y
SLIDE 46
Applying our measure to time series:
[Figure: time series DAG; S contains all arrows from X-variables into Y_{t+1}]
C_S quantifies the effect of all of X on Y_{t+1}.
(applying this to the example of Ay & Polani yields a reasonable result)
SLIDE 47
Conclusions:
- none of the existing measures appeared to be conceptually right for measuring the strength of sets of edges
- our measure satisfies our postulates
- it relies on interventions on edges
- it has a clear operational meaning (it does not refer to counterfactuals)
- definitions that rely on interventions on nodes failed, although they seem more straightforward
- replacing Transfer Entropy (Granger causality) with our measure seems reasonable
SLIDE 48
Thank you for listening!
Reference:
- D. Janzing, D. Balduzzi, M. Grosse-Wentrup, B. Schölkopf: Quantifying causal influences, to appear in Annals of Statistics.